
Post-Mortem Template

A blameless post-mortem template for product incidents and failed launches. It documents what happened, why it happened, what was learned, and what changes will prevent recurrence. Free to copy and use.

Template
# Post-Mortem
**Incident / Event:** [Brief title]
**Date of incident:** [Date]
**Date of post-mortem:** [Date]
**Severity:** [ ] P0 — all users affected  [ ] P1 — significant subset  [ ] P2 — minor / limited impact
**Facilitator:** [Name]
**Attendees:** [Names]
**Status:** [ ] Draft  [ ] Final  [ ] Action items in progress

---

## 1. Summary

**What happened** (2–3 sentences, no jargon):
[Plain-language description of what users experienced and for how long]

**Impact:**
- Users affected: [number or %, or "unknown"]
- Duration: [start time] → [end time] = [total duration]
- Revenue / business impact: [if known]
- Data loss: [ ] Yes  [ ] No

---

## 2. Timeline

| Time (UTC) | Event |
|---|---|
| [HH:MM] | [First sign of issue / alert triggered] |
| [HH:MM] | [Team notified / on-call paged] |
| [HH:MM] | [Initial investigation started] |
| [HH:MM] | [Hypothesis formed: suspected cause] |
| [HH:MM] | [Mitigation applied / rollback initiated] |
| [HH:MM] | [Service restored / incident resolved] |
| [HH:MM] | [Post-mortem scheduled] |

---

## 3. Root cause analysis

### Immediate cause
[The direct technical or process failure that caused the incident]
*Example: "A deploy at 14:32 introduced a null pointer exception in the payment flow."*

### Contributing factors
[The conditions that allowed the immediate cause to happen — these are the real leverage points]

1. **[Factor 1]:** [Explanation]
2. **[Factor 2]:** [Explanation]
3. **[Factor 3]:** [Explanation]

### 5 Whys

| Why | Answer |
|---|---|
| Why did [incident] happen? | [Answer] |
| Why did [answer 1] happen? | [Answer] |
| Why did [answer 2] happen? | [Answer] |
| Why did [answer 3] happen? | [Answer] |
| Why did [answer 4] happen? | [Root cause] |

### Root cause (single sentence):
> [The fundamental systemic reason this incident occurred]

---

## 4. Detection & response

| Question | Answer |
|---|---|
| How was the incident detected? | [Alert / customer report / manual discovery] |
| How long until detection after onset? | [Duration] |
| How long from detection to resolution? | [Duration] |
| Was the runbook followed? | [ ] Yes  [ ] No  [ ] No runbook existed |
| Was escalation appropriate and timely? | [ ] Yes  [ ] No — explain: |

**What slowed the response?**
[List any information gaps, missing tooling, or process failures that extended the incident]

---

## 5. What went well

[List things that worked as intended — detection that fired correctly, clear communication, fast rollback, good runbook. This is not spin — genuine positives should be reinforced.]

- [Item 1]
- [Item 2]
- [Item 3]

---

## 6. What went wrong

[List failures honestly — missing monitoring, slow escalation, insufficient testing, lack of feature flags, etc. This is blameless: focus on systems and processes, not individuals.]

- [Item 1]
- [Item 2]
- [Item 3]

---

## 7. Action items

| Action | Owner | Type | Due date | Status |
|---|---|---|---|---|
| [Action 1] | [Name] | Prevention / Detection / Process | [Date] | Open |
| [Action 2] | [Name] | Prevention / Detection / Process | [Date] | Open |
| [Action 3] | [Name] | Prevention / Detection / Process | [Date] | Open |

**Action types:**
- **Prevention** — stops this class of incident from occurring
- **Detection** — catches this class of incident faster when it does occur
- **Process** — improves response, communication, or escalation

---

## 8. Lessons learned

**What should every engineer on the team know from this incident?**
[1–3 sentences capturing the key lesson. This is the part that gets linked from future PRDs and architecture discussions.]

**Should this change our deployment / review / testing process?**
[ ] Yes — describe change: ________________________________
[ ] No

**Does this incident reveal a systemic risk elsewhere in the product?**
[ ] Yes — describe: ________________________________
[ ] No

---

## 9. Follow-up

- [ ] Action items added to Jira/Linear
- [ ] Runbook updated (or created)
- [ ] Monitoring / alerting improved
- [ ] Post-mortem shared with relevant stakeholders
- [ ] Reviewed in next sprint retrospective

How to use this Post-Mortem template

1. Run the post-mortem within 72 hours

Context decays fast. The engineer who fixed the incident at 2am may remember the exact sequence of events for about 48 hours, but only a rough outline after two weeks. The timeline and root cause sections are most accurate when written while the incident is fresh. Schedule the post-mortem as soon as the incident is resolved; don't wait until people have moved on.
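Once the timeline is written down, the duration fields in sections 1 and 4 of the template follow by subtraction. Below is a minimal Python sketch of that arithmetic. The timestamps and the `incident_durations` helper are hypothetical and only illustrate the calculation. Full dates are used instead of bare HH:MM values so that incidents crossing midnight still subtract correctly.

```python
from datetime import datetime, timedelta

# Hypothetical timeline entries in UTC; replace with the times from section 2.
FMT = "%Y-%m-%d %H:%M"
onset = datetime.strptime("2024-01-15 14:32", FMT)
detected = datetime.strptime("2024-01-15 14:47", FMT)
resolved = datetime.strptime("2024-01-15 16:05", FMT)


def incident_durations(onset, detected, resolved):
    """Return (time to detect, time to resolve, total duration) as timedeltas."""
    if not onset <= detected <= resolved:
        raise ValueError("timeline must be ordered: onset <= detected <= resolved")
    return detected - onset, resolved - detected, resolved - onset


to_detect, to_resolve, total = incident_durations(onset, detected, resolved)
print(f"Time to detect:  {to_detect}")
print(f"Time to resolve: {to_resolve}")
print(f"Total duration:  {total}")
```

The ordering check is there because a timeline reconstructed from memory often has entries out of sequence, and a negative duration is an easy way to spot that.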

2. Keep it blameless by focusing on systems, not people

'The deploy pipeline didn't require a second reviewer' is a system failure. 'John didn't get a second review' is blame. Both describe the same gap. The blameless framing is not just culturally kinder; it is also more accurate, because the root cause is always a system that allowed one person's error to propagate unchecked.

3. The 5 Whys stops when you reach something you can actually fix

The 5 Whys exercise is complete when you reach a root cause that has a concrete, ownable action. If the last 'why' is 'because software is complex' or 'because mistakes happen', go back one level — that's not a root cause, it's a shrug. A good root cause has a specific action item attached to it.

4. Action items without owners and dates are decorations

The most common post-mortem failure mode: a well-written document with three action items, none assigned to a specific person, all marked 'TBD'. In the next incident retrospective, those same three items appear again. Every action item needs a name and a date before the post-mortem is closed.
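The "name and a date" rule is easy to check mechanically before a post-mortem is marked Final. The Python sketch below shows one way to do that. The `ActionItem` structure, the placeholder list, and the sample items are all assumptions made for illustration; they are not part of the template or of any tracker's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Owner values treated as "nobody"; this list is an assumption, extend as needed.
PLACEHOLDERS = {"", "tbd", "n/a", "unassigned"}


@dataclass
class ActionItem:
    action: str
    owner: Optional[str]       # a named person, not "TBD"
    due: Optional[date]
    kind: str = "Prevention"   # Prevention / Detection / Process


def unready_items(items):
    """Return the action items that lack a real owner or a due date."""
    blocked = []
    for item in items:
        owner_missing = item.owner is None or item.owner.strip().lower() in PLACEHOLDERS
        if owner_missing or item.due is None:
            blocked.append(item)
    return blocked


# Hypothetical action items copied from section 7 of a post-mortem.
items = [
    ActionItem("Require second reviewer on deploys", "Priya", date(2024, 2, 1)),
    ActionItem("Add alert on payment error rate", "TBD", None, "Detection"),
    ActionItem("Update rollback runbook", "Sam", None, "Process"),
]

blocked = unready_items(items)
for item in blocked:
    print(f"Not ready to close: {item.action}")
```

If `blocked` is non-empty, the post-mortem stays in "Action items in progress" rather than "Final".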


Frequently asked questions

Should every bug have a post-mortem?

No — post-mortems are for significant incidents: P0/P1 outages, data loss, failed launches with material business impact, or any event the team wants to learn from systematically. Minor bugs fixed in a patch don't need a post-mortem. A rough threshold: if it was escalated to leadership or affected more than 5% of users, write a post-mortem.
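The rough threshold above can be written as a small decision rule. This Python sketch is only an illustration: the function name and parameters are hypothetical, and the 5% cut-off is the rule of thumb from this answer, not a standard. Failed launches and events the team simply wants to learn from remain judgment calls that the function does not try to encode.

```python
def needs_post_mortem(severity, users_affected_pct=0.0,
                      escalated_to_leadership=False, data_loss=False):
    """Apply the rough threshold: P0/P1, data loss, escalation, or >5% of users."""
    if severity in ("P0", "P1") or data_loss:
        return True
    return escalated_to_leadership or users_affected_pct > 5.0


# Hypothetical incidents
print(needs_post_mortem("P2", users_affected_pct=1.0))
print(needs_post_mortem("P2", users_affected_pct=7.5))
print(needs_post_mortem("P2", escalated_to_leadership=True))
```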

How do we make post-mortems actually change behaviour?

Two things: (1) track action items in the same system as regular engineering work (Jira or Linear, not a Google Doc), and (2) review open post-mortem action items in the next sprint retrospective. Post-mortems that live only in a Google Drive folder get read once and forgotten.

How long should a post-mortem document be?

Long enough to capture the timeline, root cause, and action items clearly — typically 1–3 pages. Longer is not better. The goal is a document that a new engineer can read in 15 minutes and understand what happened, why, and what changed as a result.

Should customers or stakeholders see the post-mortem?

A summarised, non-technical version should be shared with affected customers and relevant stakeholders within 48–72 hours of resolution. The internal blameless post-mortem stays internal. The external communication should cover: what happened, how long, what was affected, what's been fixed, and what will prevent recurrence.