Post-mortem template
Summary
A deployment of our authentication service left some customers unable to log in to our application for about 20 minutes (04:18 to 04:39 UTC). Customer Support reported the issue 4 minutes before our alerting system detected it. Rolling back the problematic deployment restored service immediately. Approximately 20 users were affected, and we contacted each of them directly to apologize.
🔗 Important Links
Type | Link
💬 Channel | [#incident-auth]
🌐 Incident Page | [View Homepage]
📄 Postmortem Doc | [Link if exists]
ℹ️ Roles & Responsibilities
Role | Name
Incident Lead | @Name
Reporter | @Name
Active Participants | @Name @Name
⏱️ Timestamps
Event | Time
Reported | June 3, 2025 04:35 AM
Impact Start | June 3, 2025 04:15 AM
Resolution | June 3, 2025 04:39 AM
Incident Closed | June 3, 2025 04:45 AM
Incident Timeline
Time (UTC) | Event Description
04:15 | Deployment triggered
04:18 | Users begin experiencing login failures
04:35 | Support reports issue
04:39 | Rollback completed, service restored
04:45 | Incident officially closed
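The intervals between timeline events can be derived directly from the times listed above. The following is a minimal sketch, assuming every time is UTC on June 3, 2025; the `minutes_between` helper and the event keys are illustrative names, not part of any existing tooling.

```python
from datetime import datetime, timezone

# Timeline events copied from the table above (UTC, June 3, 2025).
TIMELINE = {
    "deployment_triggered": "04:15",
    "login_failures_begin": "04:18",
    "support_reports_issue": "04:35",
    "rollback_completed": "04:39",
    "incident_closed": "04:45",
}


def minutes_between(start_event: str, end_event: str) -> int:
    """Return the whole minutes elapsed between two timeline events."""

    def parse(event: str) -> datetime:
        hours, minutes = map(int, TIMELINE[event].split(":"))
        return datetime(2025, 6, 3, hours, minutes, tzinfo=timezone.utc)

    delta = parse(end_event) - parse(start_event)
    return int(delta.total_seconds() // 60)


# Time from first failures to the Support report.
print(minutes_between("login_failures_begin", "support_reports_issue"))  # 17
# Time from first failures to restored service.
print(minutes_between("login_failures_begin", "rollback_completed"))  # 21
# Time from deployment to restored service.
print(minutes_between("deployment_triggered", "rollback_completed"))  # 24
```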
What Happened
The newly deployed version of the authentication service caused login failures. The deployment passed automated tests but failed under production traffic because of a configuration mismatch.
Root Cause
An environment variable was misconfigured during deployment, which broke session handling under live traffic.
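A pre-deploy check that validates required environment variables is one way to catch this class of error before traffic reaches the service. The sketch below is illustrative only: this post-mortem does not name the affected variable, so `SESSION_STORE_URL`, `SESSION_TTL_SECONDS`, and the `find_config_problems` helper are hypothetical.

```python
# Hypothetical variable names; substitute the ones the service actually reads.
REQUIRED_VARS = ("SESSION_STORE_URL", "SESSION_TTL_SECONDS")


def find_config_problems(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []

    # Every required variable must be present and non-empty.
    for name in REQUIRED_VARS:
        if not env.get(name, "").strip():
            problems.append(f"{name} is missing or empty")

    # The TTL (hypothetical) must parse as a positive integer when it is set.
    ttl = env.get("SESSION_TTL_SECONDS", "").strip()
    if ttl:
        try:
            valid = int(ttl) > 0
        except ValueError:
            valid = False
        if not valid:
            problems.append("SESSION_TTL_SECONDS must be a positive integer")

    return problems


# Example with a malformed TTL value.
example_env = {
    "SESSION_STORE_URL": "redis://sessions.internal:6379",
    "SESSION_TTL_SECONDS": "30m",
}
print(find_config_problems(example_env))
```

In a deployment pipeline, the same function could be called with `os.environ` and the deploy aborted whenever the returned list is non-empty.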
Mitigators
Summary | Details
Deployment transparency | Deployment logs made it easy to confirm which version caused the issue
Rollback tooling | Rollback process was smooth and quick
Contributing Factors
Summary | Details
On-call response delay | On-call engineer was on the subway; SLA is 15 minutes, triggered at 20 minutes
No pre-deploy staging traffic test | A test against live-like load would have caught the issue
Learnings
Type | Insight
What went well | Fast rollback; low user impact; proactive comms to affected users
What went wrong | Delayed alert; staging didn’t catch config issue
Near miss | Only a minor update; no data loss or outage
Process insight | On-call override request too complex; needs simplification
Follow-up actions
Action Item | Owner | Due Date | Priority
Simplify on-call override request process | @DevOps | June 10, 2025 | High
Add live traffic validation to staging tests | @QA | June 12, 2025 | Medium
Improve alert rule to detect login anomalies | @Infra Team | June 14, 2025 | High
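As a starting point for the login anomaly alert rule, a failure-rate check over a sliding time window is one possible approach. The sketch below is a hedged illustration rather than the team's actual alerting configuration: the `LoginFailureMonitor` class and its three thresholds are hypothetical placeholders, and it assumes attempts are recorded in non-decreasing timestamp order.

```python
from collections import deque


class LoginFailureMonitor:
    """Track recent login attempts and flag an abnormal failure rate."""

    def __init__(self, window_seconds=300, min_attempts=10, max_failure_rate=0.2):
        # All three thresholds are hypothetical placeholders, not tuned values.
        self.window_seconds = window_seconds
        self.min_attempts = min_attempts
        self.max_failure_rate = max_failure_rate
        self._events = deque()  # (timestamp_seconds, succeeded) pairs

    def record(self, timestamp, succeeded):
        """Record one attempt and drop attempts older than the window.

        Assumes timestamps arrive in non-decreasing order.
        """
        self._events.append((timestamp, succeeded))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def should_alert(self):
        """Return True when enough attempts were seen and too many failed."""
        total = len(self._events)
        if total < self.min_attempts:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total > self.max_failure_rate


# Ten attempts in ten seconds, half of them failing.
monitor = LoginFailureMonitor()
for second in range(10):
    monitor.record(second, succeeded=(second % 2 == 0))
print(monitor.should_alert())  # True
```

The `min_attempts` guard keeps a handful of failed logins during quiet hours from paging anyone. Because this incident affected only about 20 users, that value would need to be set low enough for a similar failure to still trigger the alert.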