Incidents & Operations with Dan Slimmon
1 hour 1 minute
Podcast
Podcaster
Adam Hawkins presents the theory and practices behind software delivery excellence.
Description
1 year ago
In this episode, Adam welcomes Dan Slimmon, an experienced Site
Reliability Engineer (SRE), to discuss incident response and
troubleshooting in software engineering. Dan
explains his methodology for clinical troubleshooting, the
importance of maintaining a common mental model, and techniques
for leading effective incident response efforts. They also delve
into the value of continuous ops reviews and ongoing mental model
updates to prevent issues, emphasizing the need for structured
processes and effective communication.
Want more?
New listener? Start with the introduction.
Enter the FREE giveaway for a copy of "Release It!"
Get the Small Batches Way guide to software delivery excellence
Software Kaizen: My One-on-One System for Engineering Leadership
Dan's course on leading incidents (code SMALLBATCHES24 for 24% off!)
Chapters
(00:00) - Incidents & Operations
(01:14) - Guest Welcome
(01:40) - Dan's Career Journey
(02:33) - Evolution of Tech Stacks
(04:59) - Clinical Troubleshooting Explained
(11:53) - Incident Response Fundamentals
(17:41) - Effective Communication in Incidents
(26:09) - Training for Incident Response
(33:22) - The Essence of Incident Response
(33:53) - Balancing Short-Term and Long-Term Fixes
(35:01) - The Firefighting Analogy in Software Incidents
(37:11) - Postmortems: Learning from Incidents
(42:14) - Building a Shared Mental Model
(42:41) - Looking for Trouble: Proactive System Monitoring
(47:59) - Ops Reviews: Continuous Improvement
(54:37) - The Importance of Closing the Feedback Loop
(59:40) - Final Thoughts and Resources
Support this podcast on Patreon