False positive uptime alerts: practical tuning for indie SaaS
False positives train you to ignore on-call: a blip during deploy, a cold start, or a 200 from a login page while the API is fine. The fix is not “buy Datadog” — it is health URL design and understanding how StillOnline marks checks DOWN.
This guide complements the uptime probes and antibot guide with interval choices, debounce behavior, and advice on when to split checks on Pro.
Quick answer
StillOnline marks a check DOWN after two consecutive failed probes. On Free, the fixed five-minute interval means up to roughly 10 minutes between the service first failing and the alert. Reduce noise by pointing the check at a stable GET /api/health (not a homepage behind a WAF), returning 200 in under two seconds, and exempting the health path from aggressive bot rules. Free cannot change the interval (300 s only); Pro allows 60–300 s per check and up to 10 URL checks per project. Inspect redirect chains with curl -L before blaming the monitor.
Knobs you actually have
| Knob | Free | Pro / Ultimate |
|---|---|---|
| Probe interval | 300 s (5 min) fixed | 60–300 s per check |
| Fail threshold → DOWN | 2 consecutive fails | Same |
| Owner alert repeat while DOWN | Email throttled 15 min; Telegram per transition | Same |
| Number of URL checks | 1 per project | 10 / 25 |
The debounce is intentional: requiring two consecutive failures keeps a single lost packet from paging you.
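The two-failure rule can be sketched as a small state machine. This is an illustration only, not StillOnline's actual implementation: the names `CheckState`, `record_probe`, `statuses`, and `worst_case_detection_seconds` are hypothetical, and the recovery rule (one passing probe restores UP) is an assumption the table above does not state.

```python
from dataclasses import dataclass

FAIL_THRESHOLD = 2  # consecutive failed probes before DOWN (from the table above)


@dataclass
class CheckState:
    consecutive_failures: int = 0
    status: str = "UP"


def record_probe(state: CheckState, ok: bool) -> str:
    """Apply one probe result and return the resulting status."""
    if ok:
        # Assumption: a single passing probe clears the streak and restores UP.
        state.consecutive_failures = 0
        state.status = "UP"
    else:
        state.consecutive_failures += 1
        if state.consecutive_failures >= FAIL_THRESHOLD:
            state.status = "DOWN"
    return state.status


def statuses(results: list[bool]) -> list[str]:
    """Replay a sequence of probe results (True = passed) from a fresh UP state."""
    state = CheckState()
    return [record_probe(state, ok) for ok in results]


def worst_case_detection_seconds(interval_seconds: int) -> int:
    """Longest gap between an outage starting and the probe that marks DOWN.

    Worst case: the outage starts right after a passing probe, so the first
    failed probe arrives one interval later and the second one interval after that.
    """
    return FAIL_THRESHOLD * interval_seconds
```

Replaying `[True, False, True, False, False, True]` through `statuses` shows the isolated failure being absorbed and only the back-to-back pair producing DOWN. On the Free interval, `worst_case_detection_seconds(300)` gives 600 seconds, which matches the ten-minute figure in the quick answer.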
Tuning workflow
1 — Verify the URL like a probe
curl -sS -o /dev/null -w "%{http_code} time:%{time_total}s final:%{url_effective}\n" -L --max-redirs 5 "https://api.yourproduct.com/health"
Expect a 200 status, a total time under 2 s, and a final: URL that stays the same across runs.
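If you run that command often, the three expectations can be checked mechanically. The sketch below parses one output line in the exact format produced by the `-w` string above. The function name `check_probe_line` and the default 2-second budget are assumptions for illustration.

```python
import re

# Matches the line printed by the curl -w format string shown above, e.g.
# "200 time:0.412345s final:https://api.yourproduct.com/health"
PROBE_LINE = re.compile(
    r"^(?P<code>\d{3}) time:(?P<seconds>\d+(?:\.\d+)?)s final:(?P<url>\S+)$"
)


def check_probe_line(line: str, expected_url: str, max_seconds: float = 2.0) -> list[str]:
    """Return a list of problems; an empty list means the line looks healthy."""
    match = PROBE_LINE.match(line.strip())
    if match is None:
        return ["unrecognized output: " + line.strip()]

    problems = []
    if match["code"] != "200":
        problems.append("status is " + match["code"] + ", expected 200")
    if float(match["seconds"]) >= max_seconds:
        problems.append("took " + match["seconds"] + "s, budget is " + str(max_seconds) + "s")
    if match["url"] != expected_url:
        # A different final URL means a redirect landed somewhere unexpected.
        problems.append("final URL is " + match["url"] + ", expected " + expected_url)
    return problems
```

A typical use is to pipe the curl output into a script that calls `check_probe_line(sys.stdin.readline(), "https://api.yourproduct.com/health")` and exits non-zero when the list is not empty.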
2 — Separate liveness from heavy /ready
Cold starts (serverless, Edge Functions) may exceed the probe timeout once. Use a lightweight /health that does not touch the database on the cold path; see the health design guide.
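As a rough sketch of that split, the WSGI app below answers /health without touching any dependency and keeps the heavier check on /ready. It uses only the Python standard library. `database_is_reachable` is a hypothetical placeholder for your own dependency check, and the JSON payload shape is an assumption, not something StillOnline requires.

```python
import json


def database_is_reachable() -> bool:
    """Hypothetical stand-in for a real dependency check (for example SELECT 1)."""
    return True


def app(environ, start_response):
    """Minimal WSGI app: cheap liveness on /health, dependency check on /ready."""
    path = environ.get("PATH_INFO", "")
    if path == "/health":
        # Liveness only: no database call, so a cold start stays fast.
        status, payload = "200 OK", {"status": "ok"}
    elif path == "/ready":
        ok = database_is_reachable()
        status = "200 OK" if ok else "503 Service Unavailable"
        payload = {"status": "ok" if ok else "degraded"}
    else:
        status, payload = "404 Not Found", {"status": "not found"}

    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        ("Cache-Control", "no-store"),  # keep CDNs from serving a stale green
    ]
    start_response(status, headers)
    return [body]
```

To try it locally, serve it with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` and point the curl command from step 1 at `http://localhost:8000/health`.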
3 — Antibot and redirects
A homepage that returns 200 with challenge HTML is a false green, and so is a redirect to a login page: the probe sees success while the product may be down. The full guide is antibot probes. PROBE_LIMITED (yellow) means antibot rules blocked the probe; it does not open an incident.
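A local pre-check can flag both kinds of false green before you save the URL in the monitor. The sketch below is a heuristic under stated assumptions: it expects the health endpoint to return JSON, and the marker strings and login path hints are hypothetical examples that you should replace with what your own WAF and auth flow actually serve. The function name `false_green_reasons` is also illustrative.

```python
from urllib.parse import urlsplit

# Hypothetical markers; replace with strings your own WAF or CDN really serves.
CHALLENGE_MARKERS = ("captcha", "challenge", "checking your browser")
# Hypothetical path prefixes that indicate the probe was bounced to auth.
LOGIN_PATH_HINTS = ("/login", "/signin", "/auth")


def false_green_reasons(status: int, content_type: str, body: str, final_url: str) -> list[str]:
    """Explain why a 200 response may not prove the product is up."""
    if status != 200:
        return []  # not green in the first place

    reasons = []
    if "json" not in content_type.lower():
        # Assumption: the health endpoint is expected to answer with JSON.
        reasons.append("expected JSON from a health endpoint, got " + content_type)

    lowered = body.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        reasons.append("body looks like an antibot challenge page")

    path = urlsplit(final_url).path.lower()
    if any(path.startswith(hint) for hint in LOGIN_PATH_HINTS):
        reasons.append("final URL is a login page: " + final_url)
    return reasons
```

An empty list does not prove the endpoint is healthy. It only means none of these specific warning signs were found.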
4 — Split checks on Pro
- Check A: marketing site (optional)
- Check B: api.yourproduct.com/health (authoritative)
On Free, with one check per project, you must either combine signals behind one URL or accept the tradeoff.
5 — Deploy windows
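A brief 503 during a deploy may trigger DOWN if two consecutive probes land inside it. If you want the check to stay green during rolling deploys, probe a liveness-only /health. Otherwise pause alerts manually; there is no snooze button in v1, so plan deploys for a time when brief red is acceptable.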
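Whether a deploy window can cross the two-failure threshold depends on how its length compares to the probe interval. The sketch below works through that arithmetic under a simplified model: probes fire exactly on the interval, each probe fails only if it lands inside the window, and there are no retries or jitter. Real probes have timeouts and scheduling drift, so treat the result as an estimate. The function names are hypothetical.

```python
FAIL_THRESHOLD = 2  # consecutive failed probes before DOWN


def failed_probe_range(outage_seconds: int, interval_seconds: int) -> tuple[int, int]:
    """Return (fewest, most) probes that can land inside an outage window.

    The exact count depends on where the window falls relative to the probe
    schedule, which you do not control.
    """
    fewest = outage_seconds // interval_seconds
    most = -(-outage_seconds // interval_seconds)  # ceiling division
    return fewest, most


def down_risk(outage_seconds: int, interval_seconds: int) -> str:
    """Classify a deploy window as 'none', 'possible', or 'certain' to trigger DOWN."""
    fewest, most = failed_probe_range(outage_seconds, interval_seconds)
    if fewest >= FAIL_THRESHOLD:
        return "certain"
    if most >= FAIL_THRESHOLD:
        return "possible"
    return "none"
```

Under this model a 30-second blip at a 300 s interval gives `down_risk(30, 300) == "none"`, while a 90-second blip at 60 s gives `"possible"`. Shorter intervals therefore make the same deploy more likely to go red.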
When not to tune the monitor
| Situation | Fix infrastructure, not threshold |
|---|---|
| Real 500s after deploy | Roll back |
| DB pool exhausted | Fix pool or return 503 on /ready — DB pool health |
| Cert expires tomorrow | SSL monitoring |
Related guides
- Uptime probes, redirects, antibot
- Health endpoint design
- Multi-region strategy
- Telegram owner alerts
FAQ
Can I set StillOnline to alert on one failure only?
No. Production uses two consecutive failures before DOWN to reduce noise.
Why did I get DOWN for 30 seconds during deploy?
Two failed probes in a row crossed the threshold. Use a lighter /health, or deploy when you can accept brief red. The check turns green again only after the next successful probe, so the interval sets how quickly recovery becomes visible.
Does shortening interval to 60 s on Pro increase false positives?
It can. Faster detection means more sensitivity to blips. Start at 300 s and shorten the interval only once the health URL is stable.
StillOnline shows green but users cannot log in?
The probe likely hits the wrong URL; see auth flow limits and the antibot guide.