What should you change on your status page after the first real outage?
The first time StillOnline pages you at 2 a.m., adrenaline carries you through. A week later you forget which component failed, whether Telegram fired before email, and what you promised in the incident thread. That is normal for a solo founder — and fixable without hiring an SRE team.
A lightweight post-incident retrospective for indie SaaS is not a 20-page blameless postmortem. It is three habits: components that match how customers experience failure, owner alerts you actually read, and a runbook link your future self trusts. StillOnline gives the timeline and probes; you supply the honesty.
Quick answer
Within 48 hours of resolving your first production outage, update three things: (1) status page components, so labels match customer-visible surfaces; (2) owner alerts, confirming that the StillOnline bot on Telegram (or your chosen channel) fired on both DOWN and recovery; and (3) a one-page runbook with the health URL, deploy rollback steps, and incident copy starters. StillOnline marks a check DOWN after two consecutive failed probes, which is about 10 minutes on the Free plan’s five-minute interval (see pricing). Use Google’s postmortem culture guide for blameless tone and Atlassian’s incident communication handbook for customer-facing cadence.
The 48-hour retrospective (solo founder edition)
Block 30–45 minutes. No committee required.
| Step | Output |
|---|---|
| 1 — Timeline | 5 bullets: alert time, customer symptom, root cause, fix, resolve time (UTC) |
| 2 — Components | List what broke vs what customers blamed |
| 3 — Alerts | Did owner channel beat your first support DM? |
| 4 — Comms | Did status updates match probe state? |
| 5 — One change | Pick one structural fix for next time |
Store the doc next to your incident template — not in a chat thread that scrolls away.
Habit 1 — Fix status page components
Components are the labels customers scan during stress. After the first outage, mismatched names cause “API says up but login is broken” distrust.
| Before first incident | After first incident |
|---|---|
| One generic “API” | Split Web app, API, Webhooks if failures differ |
| Marketing site on same row as API | Separate checks on Pro when URLs split |
| Third-party vendor unnamed | Add a Stripe / Auth0 row with a link to the vendor’s status page (see third-party status) |
For design principles, see status page components. StillOnline reports UP/DOWN for each URL you register, so components should map 1:1 to those checks, or to honest manual incidents when a dependency fails.
The Free plan covers one URL: either fold subsystems into a single /health JSON response, or upgrade to Pro ($9/mo) for up to 10 URLs when surfaces diverge (see which plan).
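Folding subsystems into one /health response can be sketched as below. This is a minimal illustration, not StillOnline's API: the function name, the JSON shape, and the subsystem names are all hypothetical, and it assumes the monitor treats any non-200 status as a failed probe.

```python
import json
from typing import Callable, Dict, Tuple


def build_health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each subsystem check and fold the results into one status code and JSON body."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception:
            # A check that crashes counts as a failure, never as green.
            results[name] = "fail"
    healthy = all(state == "ok" for state in results.values())
    # Assumption: the monitor reads 200 as UP and anything else as a failed probe.
    status_code = 200 if healthy else 503
    body = json.dumps({"status": "ok" if healthy else "degraded", "checks": results})
    return status_code, body


if __name__ == "__main__":
    # Hypothetical subsystem checks; replace with real database or queue pings.
    code, body = build_health_response({"database": lambda: True, "webhooks": lambda: False})
    print(code, body)
```

Returning the per-subsystem map in the body keeps the single Free check honest while still telling you which surface failed when you read the response during an incident.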
Habit 2 — Tune owner alerts (not more noise)
First outages expose false positives and alert fatigue faster than dashboards.
| Check | Action |
|---|---|
| Telegram via StillOnline bot | Re-run Connect Telegram in settings if messages were delayed |
| Email only on Free | Consider Telegram for mobile push — still one channel on Free |
| Alert before you noticed users | Good — keep interval; fix health URL if probe was wrong |
| Users complained before alert | Point check at authoritative /health; read false positives |
StillOnline Free fixes the probe interval at 300 s, and debounce requires two consecutive failed probes before a check goes DOWN. Pro allows 60–300 s per check when you need a faster signal without duplicating vendors.
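The debounce rule and its effect on detection time can be modeled in a few lines. This is a sketch of the rule as described here (two consecutive failures before DOWN), not StillOnline's actual implementation; the function names are hypothetical, recovery on the first passing probe is an assumption, and probe timeouts are ignored.

```python
from typing import Iterable, List


def debounce_states(probe_ok: Iterable[bool], fails_required: int = 2) -> List[str]:
    """Return the check state after each probe; DOWN only after N consecutive failures."""
    states = []
    consecutive_fails = 0
    for ok in probe_ok:
        consecutive_fails = 0 if ok else consecutive_fails + 1
        # Assumption: a single passing probe returns the check to UP.
        states.append("DOWN" if consecutive_fails >= fails_required else "UP")
    return states


def worst_case_detection_seconds(interval_s: int, fails_required: int = 2) -> int:
    """Upper bound from outage start to DOWN: the outage begins just after a passing probe."""
    return interval_s * fails_required


if __name__ == "__main__":
    print(debounce_states([True, False, True, False, False, True]))
    print(worst_case_detection_seconds(300))  # Free plan interval
    print(worst_case_detection_seconds(60))   # fastest Pro interval
```

Under these assumptions a 300 s interval gives a worst case of 600 s, which matches the roughly 10 minutes quoted for Free, while a 60 s interval brings it down to 120 s. A single failed probe between two passing ones never flips the state, which is the point of the debounce.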
Habit 3 — Runbook your future self will open
A runbook is not bureaucracy — it is the file you open when cognition is offline.
Minimum sections:
- Health URL + curl one-liner (quickstart)
- StillOnline project link + public /s/... for support macros
- Rollback command or platform “redeploy previous” steps
- Incident phases copy-paste (incident template)
- Update cadence, e.g. every 30 min until stable (five-minute rule)
Add the runbook link to your phone home screen or pin it in Telegram — not buried in Notion page 47.
Customer comms habits (what the first outage taught you)
| Lesson | Next time |
|---|---|
| You posted late | Pre-draft Investigating text in the template |
| Updates were vague | Name the component and next update time UTC |
| Support repeated “is it down?” | Paste status URL in macro header |
| Resolved post was missing duration | Add Resolved with minutes down + one-line cause |
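Two of the rows above involve small time calculations that are easy to get wrong at 2 a.m.: the next update time in UTC and the minutes down in the Resolved post. The sketch below shows one way to generate both lines. The function names and message wording are hypothetical, the 30-minute default is only the example cadence used in this guide, and all inputs are assumed to be timezone-aware datetimes.

```python
import math
from datetime import datetime, timedelta, timezone


def minutes_down(started: datetime, resolved: datetime) -> int:
    """Outage duration in whole minutes, rounded up so 61 seconds reads as 2."""
    seconds = (resolved - started).total_seconds()
    if seconds < 0:
        raise ValueError("resolved must not be earlier than started")
    return math.ceil(seconds / 60)


def next_update_line(now: datetime, cadence_minutes: int = 30) -> str:
    """Sentence naming the next update time in UTC."""
    next_update = (now + timedelta(minutes=cadence_minutes)).astimezone(timezone.utc)
    return f"Next update by {next_update:%H:%M} UTC."


def resolved_post(component: str, started: datetime, resolved: datetime, cause: str) -> str:
    """One-line Resolved message with the duration and a one-line cause."""
    start_utc = started.astimezone(timezone.utc)
    end_utc = resolved.astimezone(timezone.utc)
    return (
        f"Resolved: {component} was down for {minutes_down(started, resolved)} minutes "
        f"({start_utc:%H:%M} to {end_utc:%H:%M} UTC). Cause: {cause}"
    )


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
    end = datetime(2024, 1, 1, 2, 42, tzinfo=timezone.utc)
    print(next_update_line(start))
    print(resolved_post("API", start, end, "expired TLS certificate."))
```

Rounding up rather than down is a deliberate choice here: understating downtime in a Resolved post costs more trust than overstating it by a minute.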
StillOnline subscribers on a public page get email when you publish incident updates — separate from owner Slack/Telegram/email (subscribers vs owner).
What not to do after the first outage
- Buy enterprise incident tooling before a second failure repeats the same gap.
- Add five new monitors without fixing the one health URL that lied green.
- Promise “24/7” on the status page before owner alerts are reliable.
- Skip the retrospective because “it was a one-off” — most second outages rhyme with the first.
StillOnline checklist (copy after incident)
- Confirm check targets post-fix: GET /health → 200 from external curl.
- Update components to match what broke.
- Test owner alert with a staging fail — not production during peak.
- Publish a short Resolved follow-up if customers watched the thread.
- Schedule Pro review if you needed a second URL or private staging page.
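The first checklist item, confirming that the health URL answers 200 from outside your own network, can also be scripted if you prefer that to a curl one-liner. This is a minimal sketch using only the Python standard library; the function name and the 10-second timeout are assumptions, and it does not reproduce how StillOnline itself probes.

```python
import http.client
import urllib.request


def health_is_up(url: str, timeout_s: float = 10.0) -> bool:
    """Return True only if GET url answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # urlopen raises HTTPError/URLError (both OSError subclasses) for 4xx/5xx
        # responses, DNS failures, refused connections, and timeouts.
        return False
```

Call it with your own endpoint, for example `health_is_up("https://example.com/health")` where the URL is a placeholder, and run it from a machine outside your hosting provider so the result reflects what an external probe would see. Note that urlopen follows redirects, so a redirect that ends in a 200 counts as up.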
Related guides
- Incident post template for indie SaaS
- Status page components: customer-facing design
- Incident update cadence: five-minute rule
- False positive uptime alerts: tuning
FAQ
Does StillOnline auto-generate a post-mortem after an outage?
No. StillOnline records probe history, check state, and the incident posts you publish. Use this habits list plus the incident template for customer copy, and store internal timeline notes in your runbook.
How soon should I update StillOnline components after the first downtime?
Within 48 hours, while memory is fresh. Components should mirror checks or honest manual incidents (see components design). Misaligned labels erode trust faster than a second outage.
My StillOnline Telegram alert arrived late — what should I change?
Verify Connect Telegram and the official StillOnline bot in settings. If alerts lagged behind user reports, aim the check at a stable /health URL and review the debounce window (two consecutive failed probes, about 10 minutes at the Free five-minute interval). See false positive tuning.
Should I add more URL checks on StillOnline after one incident?
If web and API failed independently, Pro ($9/mo) supports up to 10 URLs. If one /health summary lied, fix the handler first (see health endpoint design) before multiplying monitors.
How often should I post incident updates on StillOnline during the next outage?
Follow a fixed cadence: many indie teams post every 30 minutes, or sooner when state changes (see incident update cadence). Owner alerts tell you something broke; incident posts tell customers what you know.