
What should you change on your status page after the first real outage?

The first time StillOnline pages you at 2 a.m., adrenaline carries you through. A week later you forget which component failed, whether Telegram fired before email, and what you promised in the incident thread. That is normal for a solo founder — and fixable without hiring an SRE team.

A lightweight post-incident retrospective for indie SaaS is not a 20-page blameless postmortem. It is three habits: components that match how customers experience failure, owner alerts you actually read, and a runbook link your future self trusts. StillOnline gives the timeline and probes; you supply the honesty.

Quick answer

Within 48 hours of resolving your first production outage, update three things: (1) status page components, so labels match customer-visible surfaces; (2) owner alerts, confirming that the StillOnline bot on Telegram (or your chosen channel) fired on both DOWN and recovery; and (3) a one-page runbook with the health URL, deploy rollback steps, and incident copy starters. StillOnline marks a check DOWN after two consecutive failed probes, which takes about 10 minutes at the Free plan's five-minute interval (see: pricing). Use Google’s postmortem culture guide for blameless tone and Atlassian’s incident communication handbook for customer-facing cadence.

The 48-hour retrospective (solo founder edition)

Block 30–45 minutes. No committee required.

Step | Output
1. Timeline | 5 bullets: alert time, customer symptom, root cause, fix, resolve time (UTC)
2. Components | List what broke vs what customers blamed
3. Alerts | Did owner channel beat your first support DM?
4. Comms | Did status updates match probe state?
5. One change | Pick one structural fix for next time

Store the doc next to your incident template — not in a chat thread that scrolls away.
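The five timeline bullets and the step 3 question fit in one small record. The following is a minimal sketch, assuming you note the timestamps by hand; the class name, field names, and sample values are hypothetical and not part of StillOnline.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Retro:
    """Hand-recorded facts from one outage. All datetimes are UTC."""
    alert_at: datetime             # when the owner alert arrived
    first_support_dm_at: datetime  # when the first customer message arrived
    resolved_at: datetime
    customer_symptom: str
    root_cause: str
    fix: str

    def alert_beat_support(self) -> bool:
        # Step 3: did the owner channel fire before a customer wrote in?
        return self.alert_at < self.first_support_dm_at

    def alert_lead_minutes(self) -> int:
        # Positive: the alert arrived first. Negative: customers told you first.
        delta = self.first_support_dm_at - self.alert_at
        return int(delta.total_seconds() // 60)

    def timeline(self) -> List[str]:
        # Step 1: the five bullets, ready to paste into the retrospective doc.
        fmt = "%Y-%m-%d %H:%M UTC"
        return [
            f"Alert: {self.alert_at.strftime(fmt)}",
            f"Customer symptom: {self.customer_symptom}",
            f"Root cause: {self.root_cause}",
            f"Fix: {self.fix}",
            f"Resolved: {self.resolved_at.strftime(fmt)}",
        ]


# Illustrative values only, not taken from a real incident.
retro = Retro(
    alert_at=datetime(2024, 1, 1, 2, 10, tzinfo=timezone.utc),
    first_support_dm_at=datetime(2024, 1, 1, 2, 25, tzinfo=timezone.utc),
    resolved_at=datetime(2024, 1, 1, 2, 57, tzinfo=timezone.utc),
    customer_symptom="Login page returned 502",
    root_cause="Database connection pool exhausted",
    fix="Raised pool size and restarted the API",
)
print("\n".join(retro.timeline()))
```

A negative value from alert_lead_minutes() is the signal to revisit Habit 2.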

Habit 1 — Fix status page components

Components are the labels customers scan during stress. After the first outage, mismatched names cause “API says up but login is broken” distrust.

Before first incident | After first incident
One generic “API” | Split Web app, API, Webhooks if failures differ
Marketing site on same row as API | Separate checks on Pro when URLs split
Third-party vendor unnamed | Add Stripe / Auth0 row with link to vendor status (see: third-party status)

Design principles: status page components. StillOnline reflects check UP/DOWN per URL you register — components should map 1:1 to those checks or honest manual incidents when a dependency fails.

Free covers one URL, so fold subsystems into a single /health JSON response, or upgrade to Pro ($9/mo) for up to 10 URLs when surfaces diverge (see: which plan).
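Folding subsystems into one /health response could look like the sketch below. It assumes the monitor treats any non-200 response as a failed probe, which the "GET /health → 200" item in the checklist suggests. The JSON shape, the check names, and the placeholder check functions are all hypothetical; swap in cheap real checks for your own stack.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple


def database_ok() -> bool:
    # Placeholder: replace with a cheap real query such as SELECT 1.
    return True


def queue_ok() -> bool:
    # Placeholder: replace with a real check against your job queue.
    return True


CHECKS: Dict[str, Callable[[], bool]] = {
    "database": database_ok,
    "queue": queue_ok,
}


def aggregate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every subsystem check; return (HTTP status, JSON-ready body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception:
            # A check that raises counts as a failure instead of crashing /health.
            results[name] = "fail"
    healthy = all(state == "ok" for state in results.values())
    body = {"status": "ok" if healthy else "fail", "checks": results}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = aggregate_health(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, pass HealthHandler to http.server.HTTPServer and call serve_forever(). The point of the design is that one failing subsystem turns the whole response into a 503, so a single Free check cannot stay green while a dependency is down.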

Habit 2 — Tune owner alerts (not more noise)

First outages expose false positives and alert fatigue faster than dashboards.

Check | Action
Telegram via StillOnline bot | Re-run Connect Telegram in settings if messages were delayed
Email only on Free | Consider Telegram for mobile push; Free still allows only one channel
Alert fired before users noticed | Good. Keep the interval; fix the health URL if the probe was wrong
Users complained before alert | Point the check at an authoritative /health (see: false positives)

StillOnline Free fixes interval at 300 s; debounce requires two fails before DOWN. Pro allows 60–300 s per check when you need faster signal without duplicating vendors.
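The debounce rule can be modelled in a few lines to see how the interval drives time-to-DOWN. This is a simplified model of the behaviour described above, not StillOnline's actual implementation; the function name and the list-of-booleans input are assumptions for illustration.

```python
from typing import Optional, Sequence


def first_down_at(
    probe_ok: Sequence[bool], interval_s: int, fails_required: int = 2
) -> Optional[int]:
    """Return seconds from the first probe until DOWN is declared, or None.

    probe_ok[i] is the result of the probe sent at i * interval_s seconds.
    """
    consecutive = 0
    for i, ok in enumerate(probe_ok):
        # Any success resets the streak, so a single blip never triggers DOWN.
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= fails_required:
            return i * interval_s
    return None


# Free (300 s): last good probe at t=0, then two failures.
print(first_down_at([True, False, False], 300))        # 600
# One failed probe between two good ones is ignored.
print(first_down_at([True, False, True, False], 300))  # None
# Pro at the 60 s minimum: the same pattern is confirmed much sooner.
print(first_down_at([True, False, False], 60))         # 120
```

In this model, an outage that starts right after a good probe is confirmed 600 seconds later on Free, which matches the roughly 10 minutes quoted earlier. If the outage starts just before a probe, confirmation comes about one interval later instead.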

Habit 3 — Runbook your future self will open

A runbook is not bureaucracy — it is the file you open when cognition is offline.

Minimum sections:

  1. Health URL + curl one-liner (quickstart)
  2. StillOnline project link + public /s/... for support macros
  3. Rollback command or platform “redeploy previous” steps
  4. Incident phases to copy and paste (see: incident template)
  5. Update cadence, e.g. every 30 min until stable (see: five-minute rule)
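For item 1, a scripted equivalent of the curl one-liner is handy when you want the result in a form you can paste into the incident thread. The sketch below uses only the Python standard library; the function name is hypothetical, and "healthy" is taken to mean exactly HTTP 200, matching the checklist later in this post.

```python
import urllib.error
import urllib.request
from typing import Tuple


def check_health(url: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """GET the URL once; return (healthy, detail). Healthy means HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses raise HTTPError; report the code instead of crashing.
        return False, f"HTTP {err.code}"
    except OSError as err:
        # Covers URLError: DNS failure, refused connection, TLS error, timeout.
        return False, f"unreachable: {err}"
```

Run it from a machine outside your hosting network, for example check_health("https://example.com/health") with your own URL in place of the placeholder, so the result reflects what an external probe would see.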

Add the runbook link to your phone home screen or pin it in Telegram — not buried in Notion page 47.

Customer comms habits (what the first outage taught you)

Lesson | Next time
You posted late | Pre-draft Investigating text in the template
Updates were vague | Name the component and next update time (UTC)
Support repeated “is it down?” | Paste status URL in macro header
Resolved post was missing duration | Add Resolved with minutes down + one-line cause
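Two of these lessons, naming the component with a next update time and stating minutes down, are easy to pre-draft as small formatters. This is a sketch with hypothetical function names and wording; it assumes timezone-aware datetimes and a 30-minute default cadence, and it only builds text for you to paste into an incident update.

```python
from datetime import datetime, timedelta, timezone


def investigating_post(component: str, now: datetime, cadence_min: int = 30) -> str:
    """Draft an Investigating update naming the component and the next update time."""
    # Convert to UTC so the promised time is unambiguous for every reader.
    next_update = (now + timedelta(minutes=cadence_min)).astimezone(timezone.utc)
    return (
        f"Investigating: we are seeing problems with {component}. "
        f"Next update by {next_update.strftime('%H:%M')} UTC."
    )


def resolved_post(component: str, started: datetime, resolved: datetime, cause: str) -> str:
    """Draft a Resolved update with the duration in minutes and a one-line cause."""
    minutes_down = round((resolved - started).total_seconds() / 60)
    return f"Resolved: {component} was down for {minutes_down} minutes. Cause: {cause}."


# Illustrative values only.
start = datetime(2024, 1, 1, 2, 5, tzinfo=timezone.utc)
print(investigating_post("API", start))
print(resolved_post("API", start, start + timedelta(minutes=47), "expired TLS certificate"))
```

Keeping these next to the incident template means the first post goes out with a concrete component and time instead of a vague "we are looking into it".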

StillOnline subscribers on a public page get email when you publish incident updates — separate from owner Slack/Telegram/email (subscribers vs owner).

What not to do after the first outage

  • Buy enterprise incident tooling before a second failure repeats the same gap.
  • Add five new monitors without fixing the one health URL that lied green.
  • Promise “24/7” on the status page before owner alerts are reliable.
  • Skip the retrospective because “it was a one-off” — most second outages rhyme with the first.

StillOnline checklist (copy after incident)

  1. Confirm the check targets the post-fix health URL: GET /health returns 200 from an external curl.
  2. Update components to match what broke.
  3. Test owner alert with a staging fail — not production during peak.
  4. Publish a short Resolved follow-up if customers watched the thread.
  5. Schedule Pro review if you needed a second URL or private staging page.


FAQ

Does StillOnline auto-generate a post-mortem after an outage?

No. StillOnline records probe history, check state, and the incident posts you publish. Use this habits list plus the incident template for customer copy; store internal timeline notes in your runbook.

How soon should I update StillOnline components after the first downtime?

Within 48 hours, while memory is fresh. Components should mirror checks or honest manual incidents (see: components design). Misaligned labels erode trust faster than a second outage.

My StillOnline Telegram alert arrived late — what should I change?

Verify Connect Telegram and the official StillOnline bot in settings. If alerts lagged behind user reports, aim the check at a stable /health URL and review the debounce: on Free, two failed probes five minutes apart (see: false positive tuning).

Should I add more URL checks on StillOnline after one incident?

If web and API failed independently, Pro ($9/mo) supports up to 10 URLs. If one /health summary lied, fix the handler first (see: health endpoint design) before multiplying monitors.

How often should I post incident updates on StillOnline during the next outage?

Follow a fixed cadence: many indie teams post every 30 minutes, or sooner when state changes (see: incident update cadence). Owner alerts tell you something broke; incident posts tell customers what you know.