Post-mortem checklist for indie SaaS (status page + customer comms)
Your status page says "resolved" but customers still ask what broke. A post-mortem template for SaaS turns Slack threads into a blameless record for the status page and subscriber email. You will learn when a full post-mortem beats a short incident note, how to build a UTC timeline from StillOnline history, and how to run a 30-point checklist without an enterprise SRE stack.
Quick answer
Ship short incident notes while the outage is active; publish a blameless post-mortem within 24–48 hours after resolution. Pull timestamps from StillOnline status page history and owner alerts. Public post-mortems need impact, timeline, root cause in plain language, and SMART action items. StillOnline marks checks DOWN after two failed probes (~10 minutes on Free five-minute intervals) but does not auto-write the post-mortem.
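The "~10 minutes" figure is simple arithmetic on the probe interval. A minimal sketch, assuming probes are evenly spaced and that no immediate retry follows the first failure (both are assumptions for illustration, not documented StillOnline behavior); the function names are hypothetical:

```python
from datetime import timedelta


def worst_case_detection(interval_minutes: int, failures_required: int = 2) -> timedelta:
    """Outage begins just after a passing probe, so each required failure costs a full interval."""
    return timedelta(minutes=interval_minutes * failures_required)


def best_case_detection(interval_minutes: int, failures_required: int = 2) -> timedelta:
    """Outage begins just before a probe, so the first failure is seen almost at once."""
    return timedelta(minutes=interval_minutes * (failures_required - 1))


# Assumed values: five-minute interval, two failed probes before DOWN.
print(worst_case_detection(5))  # 0:10:00
print(best_case_detection(5))   # 0:05:00
```

When you write the timeline, treat the DOWN alert as a late marker: under these assumptions the outage may have started up to ten minutes earlier.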
A post-mortem is a structured write-up after an outage. Blameless means asking why the system allowed the failure, not who clicked wrong. For live incident copy, use our incident post template. For update timing, see incident update cadence.
1. Choose when to publish a post-mortem vs a short incident note
During an outage, customers need short updates — not a twelve-page root cause essay. After stability, they expect a deeper post-mortem.
| Situation | Now | Within 24–48h |
|---|---|---|
| Login broken for paying users | Incident note | Full post-mortem |
| Vendor API slow with workaround | Degraded note | Light retro |
| False alert, no impact | Quiet resolve | Internal only |
Do: keep incident notes under 150 words with impact, state, and next update time. Do not: speculate on root cause until confirmed.
- Tag severity internally (P1 = revenue stop).
- Post first StillOnline incident update within ~5 minutes of acknowledging impact.
- Update every 15–30 minutes until stable.
- Resolve on the status page only after probes stay green or you have verified recovery manually.
- Schedule blameless review within 48 hours.
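The note rules above (impact, state, next update time, under 150 words) can be checked before you post. This is a sketch only: `IncidentNote` and its fields are hypothetical names, not part of any StillOnline API, and the sample text is invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_WORDS = 150  # incident notes should stay under this


@dataclass
class IncidentNote:
    impact: str                       # what customers cannot do
    state: str                        # e.g. "Investigating", "Monitoring"
    posted_at: datetime               # must be timezone-aware
    update_interval_minutes: int = 30

    def __post_init__(self) -> None:
        if self.posted_at.tzinfo is None:
            raise ValueError("posted_at needs a timezone so it can be shown in UTC")

    def next_update(self) -> datetime:
        """Time by which the next update is promised, in UTC."""
        due = self.posted_at + timedelta(minutes=self.update_interval_minutes)
        return due.astimezone(timezone.utc)

    def render(self) -> str:
        text = f"{self.state}: {self.impact} Next update by {self.next_update():%H:%M} UTC."
        if len(text.split()) >= MAX_WORDS:
            raise ValueError("incident note must stay under 150 words")
        return text


note = IncidentNote(
    impact="Login is failing for some users.",
    state="Investigating",
    posted_at=datetime(2026, 6, 17, 14, 0, tzinfo=timezone.utc),
    update_interval_minutes=15,
)
print(note.render())
```

Putting the next update time in the note itself makes the 15–30 minute cadence a public promise rather than an internal intention.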
2. Build timeline, impact, root cause, and corrective actions
Use UTC timestamps. Pull evidence from StillOnline owner alerts and incident posts.
Workflow: owner alert DOWN → incident declared → status updates → fix → probes green → post-mortem draft.
Impact: "EU users could not export CSV for 51 minutes" beats "degraded performance." Root cause: apply the Five Whys until you hit a systemic gap. Actions: each row needs an owner, a due date, and a priority. "Add StillOnline check on /ready when DB pool exhausted; Alex; 2026-06-24; P1" ships; "improve monitoring" does not.
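One way to enforce the "owner, due date, priority" rule is to reject action rows that lack any of the three. A rough sketch with hypothetical names (`ActionItem`, `missing_fields`); it only checks that the fields exist, not that the action is actually well chosen:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: str = ""
    due: Optional[date] = None
    priority: str = ""


def missing_fields(item: ActionItem) -> List[str]:
    """Return the names of required fields that are still empty."""
    missing = []
    if not item.owner.strip():
        missing.append("owner")
    if item.due is None:
        missing.append("due date")
    if not item.priority.strip():
        missing.append("priority")
    return missing


good = ActionItem(
    "Add StillOnline check on /ready when DB pool exhausted",
    owner="Alex",
    due=date(2026, 6, 24),
    priority="P1",
)
vague = ActionItem("improve monitoring")

print(missing_fields(good))   # []
print(missing_fields(vague))  # ['owner', 'due date', 'priority']
```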
Do: copy times from StillOnline alerts and status posts. Do not: rebuild the timeline from memory next week.
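Once the times are copied out, a few lines of code can normalize them to UTC, sort them, and compute the outage duration. This is a sketch under assumptions: the timestamps and labels below are invented sample data, entered by hand as ISO 8601 strings with explicit offsets, and nothing here calls a StillOnline API.

```python
from datetime import datetime, timezone
from typing import List, Tuple

Event = Tuple[datetime, str]


def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {stamp}")
    return parsed.astimezone(timezone.utc)


def build_timeline(raw: List[Tuple[str, str]]) -> List[Event]:
    """Convert every event to UTC and order the events chronologically."""
    return sorted(((to_utc(stamp), label) for stamp, label in raw), key=lambda e: e[0])


def minutes_between(timeline: List[Event], start_label: str, end_label: str) -> int:
    """Whole minutes between two labeled events."""
    times = {label: when for when, label in timeline}
    return int((times[end_label] - times[start_label]).total_seconds() // 60)


# Hypothetical sample data, deliberately out of order and with one local-time entry.
raw_events = [
    ("2026-06-17T14:53:00+00:00", "probes green"),
    ("2026-06-17T14:02:00+00:00", "owner alert DOWN"),
    ("2026-06-17T16:09:00+02:00", "incident declared"),  # 14:09 UTC
    ("2026-06-17T14:40:00+00:00", "fix deployed"),
]

timeline = build_timeline(raw_events)
for when, label in timeline:
    print(f"{when:%Y-%m-%d %H:%M} UTC  {label}")
print(minutes_between(timeline, "owner alert DOWN", "probes green"), "minutes of impact")
```

Rejecting timestamps without an offset is deliberate: a bare local time copied from a chat thread is the usual source of timeline errors.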
3. Decide what belongs on the public status page
| Public | Internal only |
|---|---|
| Customer impact and duration | On-call names |
| Root cause in plain language | Secrets, exact configs |
| Actions customers care about | HR discussions |
StillOnline incident posts are customer-facing. Probe history shows when checks flipped DOWN/UP. StillOnline does not auto-generate post-mortems — paste your summary into a final incident update. See status page habits after your first outage.
Do: publish post-mortem as final incident update. Do not: leave "monitoring" forever without a closing narrative.
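If you keep the full post-mortem as structured data, the public version can be derived with an allow-list so internal fields never leak by accident. A minimal sketch; the field names and sample values are hypothetical and simply mirror the table above:

```python
from typing import Any, Dict, List

# Only these keys may appear on the public status page (assumed field names).
PUBLIC_FIELDS = ("impact", "duration_minutes", "root_cause_plain", "customer_actions")


def public_view(postmortem: Dict[str, Any]) -> Dict[str, Any]:
    """Keep allow-listed fields only; anything unknown is treated as internal."""
    return {key: postmortem[key] for key in PUBLIC_FIELDS if key in postmortem}


def withheld_fields(postmortem: Dict[str, Any]) -> List[str]:
    """List the keys that were dropped, so a reviewer can confirm nothing is missing."""
    return sorted(set(postmortem) - set(PUBLIC_FIELDS))


draft = {
    "impact": "EU users could not export CSV",
    "duration_minutes": 51,
    "root_cause_plain": "The database connection pool ran out of connections.",
    "customer_actions": "No action needed; exports can be retried.",
    "on_call": "internal only",
    "config_details": "internal only",
}

print(public_view(draft))
print(withheld_fields(draft))  # ['config_details', 'on_call']
```

An allow-list is the safer default here: a new internal field added later stays private unless someone explicitly marks it public.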
4. Write customer email and subscriber updates
Owner alerts (Telegram, Slack, email) ping you when a StillOnline probe fails. Subscribers on a public page get an email for each incident post. These are two different lists.
- During outage: align subscriber email with each status page update.
- After resolution: one summary email or final incident post with post-mortem link.
- Prepare enterprise talking points if needed.
- Never paste internal blame into subscriber copy.
- Add support path if data was affected.
Do: match status page and email wording. Do not: email on every internal deploy.
5. Run the 30-point blameless checklist
- Incident resolved on status page first.
- Review within 48 hours.
- Blameless rule stated in doc header.
- Executive summary for non-technical readers.
- Impact quantified (users, duration).
- UTC timeline with detect, ack, fix, resolve.
- StillOnline DOWN alert time recorded.
- First public update time recorded.
- Customer reports vs monitor compared.
- Root cause names failure mode, not a person.
- Five Whys applied once.
- Contributing factors listed.
- Two+ "what went well" bullets.
- Two+ "what went poorly" bullets.
- Each action specific and measurable.
- Each action has owner.
- Each action has due date.
- Actions tracked in issue board.
- Public version stripped of secrets.
- Post-mortem linked from final incident update.
- Subscriber notified.
- Support FAQ if tickets spiked.
- Vendor linked if third-party.
- Runbook updated.
- Health URL reviewed if detection late.
- False positive noted for tuning.
- Doc stored searchable.
- Themes tagged for quarterly review.
- One structural P1 fix before cosmetic tweaks.
- StillOnline check covers customer-critical URL.
Do: treat unchecked items as open work. Add a StillOnline check on the URL that means "customers cannot work." Free: one project, one URL, five-minute checks.
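Treating unchecked items as open work is easy to automate if the checklist lives in a file or script. A small sketch, assuming you track each item as a true/false flag; only a few of the 30 items are shown, and the helper name is hypothetical:

```python
from typing import Dict, List


def open_items(checklist: Dict[str, bool]) -> List[str]:
    """Return every checklist item that is not yet done, in checklist order."""
    return [item for item, done in checklist.items() if not done]


# Sample state after a review; extend with the remaining checklist items.
checklist = {
    "Incident resolved on status page first": True,
    "UTC timeline with detect, ack, fix, resolve": True,
    "Each action has owner": False,
    "Runbook updated": False,
    "StillOnline check covers customer-critical URL": True,
}

remaining = open_items(checklist)
print(f"{len(remaining)} open item(s)")
for item in remaining:
    print(f"- {item}")
```

Each open item can then be copied into the issue board as its own action with an owner and a due date.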
What's next
After your next resolved incident, draft the post-mortem within 48 hours using StillOnline timestamps. Publish it as the final incident update and notify subscribers.
Open the StillOnline dashboard and confirm your health URL is the one customers depend on.
Related guides
- Incident post template for SaaS
- Incident update cadence: five-minute rule
- Status page habits after your first outage
- Customer email during outage template
FAQ
How long after resolution should I publish a post-mortem with StillOnline data?
Within 24–48 hours. Announce the target in your final incident update. Export StillOnline alert timestamps into your template.
Does StillOnline auto-generate a post-mortem?
No. StillOnline records checks, alerts, and incident posts you write. Export timestamps into your blameless template.
Incident post vs post-mortem on StillOnline — what is the difference?
Incident posts are short updates during an outage. A post-mortem is the blameless summary after resolution; publish it as a final StillOnline incident update.
Should StillOnline post-mortems name engineers publicly?
Use "engineering team" on the public page. Blameless means systems, not public shaming.
Do StillOnline subscribers get the post-mortem?
They get email when you publish a new incident update on a public page. Publish the post-mortem as that update.
How do I handle third-party root cause in a StillOnline post-mortem?
State it plainly, link vendor status, and note your detection and comms speed.
Solo founder — can I skip the StillOnline checklist?
You can trim it to a 30-minute review, but never skip the timeline, the root cause, and one owned action, or you will repeat the outage.