Post-mortem template

Summary

A deployment of our authentication service left some customers unable to log in to our application for about 20 minutes. Customer Support reported the issue 4 minutes before our alerting system detected it. We rolled back the problematic deployment, which immediately restored service. Approximately 20 users were affected, and we contacted each of them directly to apologize.

🔗 Important Links

Type | Link
💬 Channel | [#incident-auth]
🌐 Incident Page | [View Homepage]
📄 Postmortem Doc | [Link if exists]

ℹ️ Roles & Responsibilities

Role | Name
Incident Lead | @Name
Reporter | @Name
Active Participants | @Name @Name

⏱️ Timestamps

Event | Time
Reported | June 3, 2025 04:35 AM
Impact Start | June 3, 2025 04:15 AM
Resolution | June 3, 2025 04:39 AM
Incident Closed | June 3, 2025 04:45 AM

Incident Timeline

Time (UTC) | Event Description
04:15 | Deployment triggered
04:18 | Users begin experiencing login failures
04:35 | Support reports issue
04:39 | Rollback completed, service restored
04:45 | Incident officially closed

What Happened

A bug in the newly deployed authentication service caused login failures. The deployment passed automated tests but failed under real load because of a configuration mismatch.

Root Cause

An environment variable was misconfigured during deployment, which broke session handling under live traffic.
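
When filling in the Summary, it can help to derive the durations from the timeline rather than estimating them. The sketch below is only an illustration: the dictionary keys and the `minutes_between` helper are names invented here, and the times are the sample values from the Incident Timeline.

```python
from datetime import datetime

# Sample values copied from the Incident Timeline (all times UTC, same day).
# The key names are illustrative, not part of the template.
TIMELINE = {
    "deployment_triggered": "04:15",
    "login_failures_begin": "04:18",
    "support_reports_issue": "04:35",
    "rollback_completed": "04:39",
    "incident_closed": "04:45",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# How long users saw failures before anyone reported them.
time_to_report = minutes_between(
    TIMELINE["login_failures_begin"], TIMELINE["support_reports_issue"]
)
# How long users were affected in total.
user_impact = minutes_between(
    TIMELINE["login_failures_begin"], TIMELINE["rollback_completed"]
)
# How long the faulty deployment was live.
deploy_to_rollback = minutes_between(
    TIMELINE["deployment_triggered"], TIMELINE["rollback_completed"]
)

print(f"Time to first report: {time_to_report} min")
print(f"User-facing impact: {user_impact} min")
print(f"Deploy to rollback: {deploy_to_rollback} min")
```

With the sample timeline, the user-facing impact works out to 21 minutes and the faulty deployment was live for 24 minutes, so the Summary's figure is a rounded one.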

Mitigators

Summary | Details
Deployment transparency | Deployment logs made it easy to confirm which version caused the issue
Rollback tooling | Rollback process was smooth and quick

Contributing Factors

Summary | Details
On-call response delay | Engineer was in subway; SLA is 15 mins, triggered at 20 mins
No pre-deploy staging traffic test | Would have caught the issue if run against live-like load

Learnings

Type | Insight
What went well | Fast rollback; low user impact; proactive comms to affected users
What went wrong | Delayed alert; staging didn’t catch config issue
Near miss | Only a minor update; no data loss or outage
Process insight | On-call override request too complex; needs simplification

Follow-up Actions

Action Item | Owner | Due Date | Priority
Simplify on-call override request process | @DevOps | June 10, 2025 | High
Add live traffic validation to staging tests | @QA | June 12, 2025 | Medium
Improve alert rule to detect login anomalies | @Infra Team | June 14, 2025 | High
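
The action item on detecting login anomalies could take many forms. One possible shape is a sliding-window failure-rate check, sketched below. Everything here is an assumption made for illustration: the `LoginFailureAlert` class, its parameters, and the default thresholds are hypothetical and do not describe any existing alerting rule.

```python
from collections import deque
from typing import Deque, Tuple


class LoginFailureAlert:
    """Sliding-window login failure-rate check (hypothetical sketch)."""

    def __init__(
        self,
        window_seconds: float = 300.0,   # assumed look-back window
        min_attempts: int = 20,          # assumed floor to avoid noise at low volume
        max_failure_rate: float = 0.2,   # assumed threshold; tune to real traffic
    ) -> None:
        self.window_seconds = window_seconds
        self.min_attempts = min_attempts
        self.max_failure_rate = max_failure_rate
        # Each entry is (timestamp in seconds, whether the login succeeded).
        self.events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, success: bool) -> bool:
        """Record one login attempt; return True if the alert should fire."""
        self.events.append((timestamp, success))

        # Drop attempts that have fallen out of the look-back window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

        attempts = len(self.events)
        if attempts < self.min_attempts:
            return False  # too few attempts to judge

        failures = sum(1 for _, ok in self.events if not ok)
        return failures / attempts > self.max_failure_rate
```

A caller would feed each login attempt to `record` in timestamp order and page the on-call engineer the first time it returns `True`. The thresholds are placeholders and would need tuning against real login volume.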