How to Troubleshoot DNS Like an Infrastructure Engineer
DNS is the first domino. If names don't resolve, users don't care that your servers are healthy — to them, the app is down. The good news: DNS failures follow a small number of patterns, and a calm method beats frantic cache-flushing every time.
The mental model
Your device asks a resolver (recursive) for a name. If the resolver doesn't already know the answer, it walks the hierarchy — root → TLD → authoritative — and caches what it learns for the record's TTL. The client asks once; the resolver does the legwork.
Three facts solve most incidents:
- The client only knows what its configured resolver tells it.
- /etc/hosts (via nsswitch.conf) is usually checked before DNS.
- A record can be perfectly healthy and still be invisible to the wrong resolver (split-horizon).
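Rather than assume the lookup order, you can read it off the box. The sketch below assumes a glibc-style /etc/nsswitch.conf; the function name hosts_order is made up for illustration.

```shell
# hosts_order (hypothetical helper): print the lookup sources for host names,
# in the order the system consults them. Takes an optional file path so it
# can be pointed at a copy of the config.
hosts_order() {
    file="${1:-/etc/nsswitch.conf}"
    # Find the "hosts:" line, drop the key, print the remaining fields.
    awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print; exit }' "$file"
}

# Usage:
#   hosts_order    # "files" listed before "dns" means /etc/hosts is checked first
```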
The method
1. Confirm and scope it. Does it fail for everyone, or just one network? Per-network symptoms point at per-network config, not the authoritative data.
dig app.internal # what does THIS box's configured resolver return? (dig skips /etc/hosts)
2. Find out which resolver you're using.
cat /etc/resolv.conf # nameserver lines + search domains
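If you want just the resolver addresses, something like this works as a rough sketch. The helper name list_nameservers is an assumption, not a standard tool.

```shell
# list_nameservers (hypothetical helper): print the resolver addresses from a
# resolv.conf-style file, one per line, in the order they are listed.
list_nameservers() {
    file="${1:-/etc/resolv.conf}"
    awk '$1 == "nameserver" { print $2 }' "$file"
}

# Usage:
#   list_nameservers
```

On hosts running systemd-resolved, this file often lists only the local stub (127.0.0.53); resolvectl status shows the upstream servers in that case.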
3. Compare resolvers. Ask the internal resolver and a public one for the same name:
dig @10.0.0.53 app.internal # internal
dig @1.1.1.1 app.internal # public
If the internal resolver answers and the public one returns NXDOMAIN, the zone is fine — your clients are simply pointed at a server that can't see it. That's a resolver/DHCP problem, not a dead record.
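The comparison in step 3 can be turned into a small decision helper. This is a sketch only: classify_answers is a hypothetical name, and it treats empty dig +short output as "no record", which does not distinguish NXDOMAIN from an empty answer or a failure.

```shell
# classify_answers (hypothetical helper): interpret the answers two resolvers
# gave for the same name. Each argument is the output of `dig +short`, so an
# empty string means "no record returned".
classify_answers() {
    internal="$1"
    public="$2"
    if [ -n "$internal" ] && [ -z "$public" ]; then
        echo "internal-only: zone is fine, check which resolver clients use"
    elif [ -z "$internal" ] && [ -n "$public" ]; then
        echo "public-only: internal resolver cannot see this name"
    elif [ -z "$internal" ] && [ -z "$public" ]; then
        echo "no answer from either: check the authoritative data"
    elif [ "$internal" = "$public" ]; then
        echo "same answer from both"
    else
        echo "different answers: possible split-horizon or stale cache"
    fi
}

# Usage (10.0.0.53 and 1.1.1.1 are the example resolvers from the text):
#   classify_answers "$(dig +short @10.0.0.53 app.internal)" \
#                    "$(dig +short @1.1.1.1 app.internal)"
```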
4. Walk to the source of truth.
dig +trace app.example.com
This goes root → TLD → authoritative, bypassing caches. The last server is where the truth lives.
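Trace output is long, and the line you care about is the last "Received ... from" line. A minimal sketch for pulling it out, assuming the usual dig +trace output format; last_trace_server is a made-up name.

```shell
# last_trace_server (hypothetical helper): read `dig +trace` output on stdin
# and print the server that sent the final response.
last_trace_server() {
    # Each hop ends with a line like:
    #   ;; Received 60 bytes from 192.0.2.53#53(ns1.example.com) in 12 ms
    # Field 6 is the address#port(name) token; keep the last one seen.
    awk '$1 == ";;" && $2 == "Received" { server = $6 }
         END { if (server != "") print server }'
}

# Usage:
#   dig +trace app.example.com | last_trace_server
```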
5. Suspect caching for "I changed it but it's still wrong". Read the TTL in the answer; the old value sticks until it expires. Lowering TTL helps future changes — it can't un-cache one already out there.
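To read the TTL without squinting at full dig output, a sketch like this may help. It assumes the column layout of dig +noall +answer (name, TTL in seconds, class, type, data); answer_ttls is a hypothetical helper name.

```shell
# answer_ttls (hypothetical helper): read `dig +noall +answer` output on stdin
# and print "name ttl" for each record, skipping comment lines.
answer_ttls() {
    awk 'NF >= 5 && $1 !~ /^;/ { print $1, $2 }'
}

# Usage: run it twice a few seconds apart against a caching resolver.
#   dig +noall +answer app.example.com | answer_ttls
```

A TTL that counts down between runs is a cached answer; one that stays at the full value is typically coming straight from the authoritative server.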
The traps that waste hours
- /etc/hosts wins. A stale incident-time entry makes one box resolve differently from everyone else. getent hosts <name> honours the real order.
- Flushing caches that aren't stale. If the answer is fresh but wrong, you're asking the wrong server; flushing changes nothing.
- Blaming the DNS team. Query the authoritative server directly before you escalate. If it returns the right record, the problem is downstream of them.
- NXDOMAIN vs SERVFAIL. NXDOMAIN = "this name doesn't exist (here)". SERVFAIL = "the resolver couldn't get an answer" — often a broken forwarder or DNSSEC issue.
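The status lives in the header of full dig output, so you can pull it out and branch on it. A rough sketch under that assumption; dig_status and explain_status are hypothetical names, and the suggested next steps simply restate the advice above.

```shell
# dig_status (hypothetical helper): read full `dig` output on stdin and print
# the response code from the header line, e.g. NOERROR, NXDOMAIN, SERVFAIL.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# explain_status (hypothetical helper): map a response code to a next step.
explain_status() {
    case "$1" in
        NOERROR)  echo "answered: check the data and the TTL" ;;
        NXDOMAIN) echo "name does not exist on this resolver: compare resolvers" ;;
        SERVFAIL) echo "resolver could not get an answer: check forwarders and DNSSEC" ;;
        *)        echo "unhandled status: $1" ;;
    esac
}

# Usage:
#   explain_status "$(dig app.internal | dig_status)"
```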
The 30-second checklist
- Reproduce; check scope.
- cat /etc/resolv.conf: which resolver?
- dig @internal vs dig @public: who has the record?
- dig +trace: confirm the authoritative answer.
- Check /etc/hosts and TTLs before touching anything.
Master this and "it works here but not there" stops being scary and becomes a two-minute diagnosis.
Want to practise on a live scenario? Try the puzzle DNS Works Here, Not There.