Incident Postmortem

Basic Information
Write a brief description of what happened, why it occurred, its severity, and how long it lasted.
Example: On Monday, April 11th, 2016, a bug was found in the staging environment of website X during a pre-deploy meeting. Deployment was postponed until the following day.

What Happened (Incident Overview)
Describe the sequence of events that led up to the incident.

Contributing Factors (Root Cause Exploration)
Start from the observed failure and walk through the contributing causes using the "5 Whys" method. Explain why each step occurred and how it compounded into the final failure. If any mitigation attempts worsened the issue, include them as well.

Root Cause (Final Diagnosis)
Identify the primary factor that must change to prevent recurrence.

Impact Assessment
Explain who was affected, how severely, and for how long. Include exact user numbers, support cases, or financial impact if applicable.

Detection & Alerting
Document how the issue was initially detected, whether alerts fired appropriately, and if there are opportunities to reduce detection time in the future.

Response & Mitigation
Summarize who responded, when they responded, what actions they took, and whether any delays or blockers affected the response process.

Resolution (Technical Fixes)
Describe the steps that ultimately resolved the issue, including both short-term mitigations and long-term fixes.

Incident Timeline
(Use exact time zones. Include all notable events before, during, and after impact.)

Time (UTC)    Event Description
14:51         News article triggers traffic spike
14:53         Reddit post drives 88x spike
14:55         API latency spikes to 5s
14:58         On-call alerted via PagerDuty
15:10         Autoscaling fails
15:20         Manual scale-out initiated
15:33         Latency back to normal

Backlog & Known Issues Check
Was there any prior known work in backlog that could have prevented or reduced this incident?

Lessons Learned

Action Items (Preventive Work) Table

Action Item                                  Owner     Due Date    Tracking Ticket    Priority
Example: Implement load testing in CI/CD     DevOps    June 15     JIRA-12345         High
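
The Detection & Alerting and Impact Assessment sections both ask for durations, which can be derived directly from the Incident Timeline. The following is a minimal Python sketch using the example timeline rows from this template. It assumes that impact begins at the latency spike, detection is the PagerDuty page, and recovery is the "back to normal" entry. The helper names `minutes_between` and `find_time` are hypothetical and exist only for illustration.

```python
from datetime import datetime

# Example timeline rows copied from the template (all times in UTC).
TIMELINE = [
    ("14:51", "News article triggers traffic spike"),
    ("14:53", "Reddit post drives 88x spike"),
    ("14:55", "API latency spikes to 5s"),
    ("14:58", "On-call alerted via PagerDuty"),
    ("15:10", "Autoscaling fails"),
    ("15:20", "Manual scale-out initiated"),
    ("15:33", "Latency back to normal"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def find_time(keyword: str) -> str:
    """Return the time of the first event whose description contains keyword."""
    for time, description in TIMELINE:
        if keyword.lower() in description.lower():
            return time
    raise ValueError(f"no timeline event matches {keyword!r}")


# Assumed milestones (a judgment call, not defined by the template):
impact_start = find_time("latency spikes")  # first user-visible symptom
detected = find_time("alerted")             # on-call was paged
recovered = find_time("back to normal")     # symptom cleared

time_to_detect = minutes_between(impact_start, detected)
time_to_recover = minutes_between(impact_start, recovered)

print(f"Time to detect:  {time_to_detect} min")
print(f"Time to recover: {time_to_recover} min")
```

For the example timeline this gives 3 minutes to detect and 38 minutes of impact. Which events count as the start and end of impact is a choice each team should state explicitly in the postmortem.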
