Troubleshooting reference

Mental models that turn panic into a method.

Confirm symptom & scope

Problem · An alert fired — but what's actually broken, and for whom?

# Reproduce it · who/what is affected? · one user, one host, one site, or global?

What the output means

Scope tells you blast radius and where to look. 'Everyone' and 'one user' need very different first moves.

Common traps

Don't start changing things before you know what 'broken' means and how widespread it is.
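One way to put a number on scope before touching anything is to probe several hosts and count the failures. The sketch below is illustrative only: the `probe` and `scope_summary` helpers, the host names, and the `/healthz` path are hypothetical and not part of this reference; substitute whatever health check your service actually exposes.

```shell
#!/bin/sh
# Sketch only: host names and the /healthz path are hypothetical placeholders.

# probe HOST: succeeds if the host answers the health URL without an HTTP error.
probe() {
  curl --silent --fail --max-time 5 --output /dev/null "https://$1/healthz"
}

# scope_summary HOST...: probes every host, logs failures to stderr,
# and prints none/partial/all together with failed/total.
scope_summary() {
  total=0
  failed=0
  for host in "$@"; do
    total=$((total + 1))
    if ! probe "$host"; then
      failed=$((failed + 1))
      echo "FAIL $host" >&2
    fi
  done
  if [ "$failed" -eq 0 ]; then
    echo "none ($failed/$total)"
  elif [ "$failed" -eq "$total" ]; then
    echo "all ($failed/$total)"
  else
    echo "partial ($failed/$total)"
  fi
}

# Example: scope_summary web1.example.internal web2.example.internal
```

A "partial" result points at something host-specific; "all" suggests a shared dependency or a global change.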

Problem · A web app is down — where in the stack?

# user → DNS → network → firewall → LB → web → app → cache → DB → storage
dig (DNS) / curl -I (web) / ss -tulpn (listening sockets) / systemctl status (service) / df -h (disk)

What the output means

Isolate the failing layer before fixing anything; each layer has a quick check.

Common traps

Symptoms surface high (app); causes often live low (DNS, disk, identity).
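The per-layer checks can be chained into a single pass that reports every layer instead of stopping at the first failure, so a low-level cause is not hidden behind a high-level symptom. This is a rough sketch under assumptions: the helper names, the 95% disk threshold, and all arguments are illustrative. The DNS and HTTP checks make sense from a client; the port, unit, and disk checks are meant to run on the server itself.

```shell
#!/bin/sh
# Sketch only: name, URL, port, unit, and mount point are supplied by you;
# the 95% disk threshold is an arbitrary assumption.

resolves()    { answer=$(dig +short "$1") && [ -n "$answer" ]; }
http_ok()     { curl --silent --head --fail --max-time 5 --output /dev/null "$1"; }
listening()   { ss -tln | grep -q ":$1 "; }
unit_active() { systemctl is-active --quiet "$1"; }
disk_ok()     { df -P "$1" | awk 'NR == 2 { exit ($5 + 0 >= 95) }'; }

# check LABEL CMD...: runs CMD quietly and prints "ok LABEL" or "FAIL LABEL".
check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# walk_stack NAME URL PORT UNIT MOUNT: runs every layer check and
# returns non-zero if any of them failed.
walk_stack() {
  rc=0
  check "DNS resolves $1"      resolves "$1"    || rc=1
  check "HTTP answers $2"      http_ok "$2"     || rc=1
  check "port $3 is listening" listening "$3"   || rc=1
  check "unit $4 is active"    unit_active "$4" || rc=1
  check "disk $5 below 95%"    disk_ok "$5"     || rc=1
  return "$rc"
}

# Example: walk_stack app.example.internal https://app.example.internal/ 443 nginx.service /var
```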

Problem · Prod is down and the cause isn't known yet.

# Stop the bleeding: roll back the recent change, fail over, shed load — restore service first

What the output means

Restore the user experience first; do the full root-cause analysis once stable.

Common traps

Blind restarts can clear evidence. Capture state (logs, lsof, ps) before you reset.
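Capturing state before a restart can be as small as the sketch below, which writes a timestamped snapshot directory. The `capture_state` name, the `incident-` directory prefix, and the choice to add `ss` and `df` alongside `ps`, `lsof`, and the journal are assumptions for illustration; missing tools are tolerated so the capture never blocks the restore.

```shell
#!/bin/sh
# Sketch only: capture_state and the snapshot layout are hypothetical.

# capture_state [BASE_DIR] [UNIT]: saves process, open-file, socket, and disk
# state (plus the last hour of UNIT's journal, if given) and prints the
# snapshot directory.
capture_state() {
  snap="${1:-/tmp}/incident-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$snap" || return 1
  ps aux     > "$snap/ps.txt"   2>&1 || true
  lsof -n -P > "$snap/lsof.txt" 2>&1 || true
  ss -tanp   > "$snap/ss.txt"   2>&1 || true
  df -h      > "$snap/df.txt"   2>&1 || true
  if [ -n "${2:-}" ]; then
    journalctl -u "$2" --since "1 hour ago" --no-pager \
      > "$snap/journal.txt" 2>&1 || true
  fi
  echo "$snap"
}

# Example: capture_state /var/tmp myapp.service   (then restart)
```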

Problem · It was fine yesterday.

# deploys? configs? certs? DNS edits? cron? patches? expiries?
journalctl --since yesterday | grep -i change
git log --since=yesterday    # plus change tickets

What the output means

Recent change is one of the most common causes of incidents. Correlate the incident's start time with a change event.

Common traps

'No one changed anything' usually means an automated or expiry-driven change.
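A few of the change sources listed above (configs, deploys, certificate expiries) can be checked mechanically. The helpers below are a sketch with hypothetical names; the paths and time windows are whatever you pass in, and none of this replaces reading the change tickets.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; paths and windows are yours.

# changed_files DIR DAYS: files under DIR modified within the last DAYS days.
changed_files() {
  find "$1" -type f -mtime "-$2" 2>/dev/null | sort
}

# commits_since REPO WHEN: one-line commits in REPO newer than WHEN
# (for example "yesterday").
commits_since() {
  git -C "$1" log --since="$2" --oneline
}

# cert_expires_within FILE SECONDS: succeeds if openssl cannot confirm the
# certificate stays valid for SECONDS more seconds (expired, expiring soon,
# or unreadable file).
cert_expires_within() {
  ! openssl x509 -checkend "$2" -noout -in "$1" >/dev/null 2>&1
}

# Examples:
#   changed_files /etc 1
#   commits_since /srv/app yesterday
#   cert_expires_within /etc/ssl/certs/app.pem 86400 && echo "cert needs attention"
```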

Problem · How bad is this, really?

# one user? one service? one site? the whole platform?

What the output means

Blast radius sets severity and how aggressive your fix should be.

Common traps

Underestimating scope delays escalation; overestimating causes needless panic.