Troubleshooting reference
Mental models that turn panic into a method.
Confirm symptom & scope
Problem · An alert fired — but what's actually broken, and for whom?
# Reproduce it · who/what is affected? · one user, one host, one site, or global?
Scope tells you blast radius and where to look. 'Everyone' and 'one user' need very different first moves.
Don't start changing things before you know what 'broken' means and how widespread it is.
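To make "reproduce it" concrete, here is a minimal sketch that runs a failing check several times and counts the failures, which helps separate a hard-down symptom from an intermittent one. The `reproduce` helper and the example URL are hypothetical, not part of any standard tool.

```shell
#!/bin/sh
# Run a check N times and report how often it fails.
# $1 = number of attempts, remaining args = the command that reproduces the symptom.
reproduce() {
    n=$1; shift
    fails=0; i=0
    while [ "$i" -lt "$n" ]; do
        # Any non-zero exit status counts as one failed attempt.
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails/$n attempts failed"
}

# Example (hypothetical URL):
#   reproduce 10 curl -fsS --max-time 5 https://app.example.com/health
```

Running the same check from more than one vantage point (your own machine, a host in the affected site) gives a first read on whether the scope is one user or wider.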
Narrow the layer
Problem · A web app is down — where in the stack?
# user → DNS → network → firewall → LB → web → app → cache → DB → storage
dig / curl -I / ss -tulpn / systemctl status / df -h
Isolate the failing layer before fixing anything; each layer has a quick check.
Symptoms surface high (app); causes often live low (DNS, disk, identity).
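The layer walk can be sketched as a script that stops at the first failing check. This is a minimal sketch covering only three layers (DNS, HTTP, disk), assuming `dig`, `curl`, and `df` are installed. The function names, the 90% disk threshold, and the host and URL arguments are illustrative assumptions.

```shell
#!/bin/sh
# Walk a few layers in order and stop at the first failing check.
# All function names below are hypothetical helpers, not standard tools.

# DNS: succeeds only if the name resolves to at least one record.
dns_ok()  { [ -n "$(dig +short "$1" 2>/dev/null)" ]; }

# HTTP: HEAD request; -f makes curl fail on HTTP 4xx/5xx responses.
http_ok() { curl -fsS -o /dev/null -I --max-time 5 "$1"; }

# Disk: fails when the filesystem holding $1 is at or above 90% (assumed threshold).
disk_ok() { df -P "$1" | awk 'NR==2 { gsub("%", "", $5); exit ($5 + 0 >= 90) }'; }

# $1 = layer name, remaining args = check to run; prints PASS or FAIL.
run_check() {
    layer=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS $layer"
    else
        echo "FAIL $layer"
        return 1
    fi
}

# $1 = hostname, $2 = URL to probe.
narrow_layer() {
    host=$1; url=$2
    run_check dns  dns_ok  "$host" || return 1
    run_check http http_ok "$url"  || return 1
    run_check disk disk_ok /       || return 1
    echo "all checked layers passed"
}

# Example (hypothetical names):
#   narrow_layer app.example.com https://app.example.com/
```

The first `FAIL` line names the layer to look at next; a clean pass only clears the layers that were actually checked.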
Mitigate before root cause
Problem · Prod is down and the cause isn't known yet.
# Stop the bleeding: roll back the recent change, fail over, shed load — restore service first
Restore the user experience first; do the full root-cause analysis once stable.
Blind restarts can clear evidence. Capture state (logs, lsof, ps) before you reset.
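Capturing state before a restart can be sketched as a small script that dumps each tool's output into a timestamped directory. This is a minimal sketch under assumptions: the `snap` and `capture_state` helpers, the directory layout, and the one-hour journal window are all illustrative, and tools that are not installed are skipped rather than treated as errors.

```shell
#!/bin/sh
# Save evidence before a restart. Helper names and layout are hypothetical.

# Save one command's output to $OUT/<name>.txt; skip tools that are not installed.
snap() {
    name=$1; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$OUT/$name.txt" 2>&1 || true
    else
        echo "skipped: $1 not installed" >"$OUT/$name.txt"
    fi
}

# $1 = systemd unit to pull logs for, $2 = base directory (defaults to /tmp).
capture_state() {
    unit=$1
    OUT="${2:-/tmp}/evidence-$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$OUT" || return 1
    snap ps      ps -ef                     # running processes
    snap sockets ss -tulpn                  # listening sockets and owners
    snap lsof    lsof -n -P                 # open files, no name lookups
    snap disk    df -h                      # filesystem usage
    snap journal journalctl -u "$unit" --since "1 hour ago" --no-pager
    echo "$OUT"                             # print where the evidence went
}

# Example (hypothetical unit name):
#   capture_state myapp.service /var/tmp && systemctl restart myapp.service
```

Chaining the restart after the capture means the reset only happens once the evidence directory exists.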
What changed?
Problem · It was fine yesterday.
# deploys? configs? certs? DNS edits? cron? patches? expiries?
journalctl --since yesterday | grep -i change
git log / change tickets
Change is the most common cause of incidents. Correlate the incident's start time with a change event.
'No one changed anything' usually means an automated or expiry-driven change.
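Two quiet kinds of change, edited files and expiring certificates, can be checked with a short script. This is a minimal sketch, assuming `find`, `touch`, and `openssl` are available. The helper names `changed_since` and `cert_expires_within`, and the paths in the examples, are hypothetical.

```shell
#!/bin/sh
# Look for changes nobody announced. Helper names are hypothetical.

# List files under a directory modified after a given time.
# $1 = directory, $2 = timestamp in `touch -t` format (CCYYMMDDhhmm).
changed_since() {
    dir=$1; stamp=$2
    ref=$(mktemp) || return 1
    touch -t "$stamp" "$ref"            # reference file carrying the cutoff time
    find "$dir" -type f -newer "$ref" 2>/dev/null | sort
    rm -f "$ref"
}

# Succeeds if the PEM certificate in $1 expires within $2 days.
# Relies on the message printed by `openssl x509 -checkend`; if openssl
# cannot read the file, nothing matches and the check reports "not expiring".
cert_expires_within() {
    openssl x509 -noout -checkend "$(( $2 * 86400 ))" -in "$1" 2>/dev/null \
        | grep -q "will expire"
}

# Examples (placeholder time and paths):
#   changed_since /etc 202405010200
#   cert_expires_within /etc/ssl/certs/app.pem 14 && echo "renew soon"
```

Set the timestamp slightly before the incident's start time so that the change which triggered it falls inside the window.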
Blast radius
Problem · How bad is this, really?
# one user? one service? one site? the whole platform?
Blast radius sets severity and how aggressive your fix should be.
Underestimating scope delays escalation; overestimating causes needless panic.