
How to Troubleshoot DNS Like an Infrastructure Engineer

2 February 2026 · 7 min

DNS is the first domino. If names don't resolve, users don't care that your servers are healthy — to them, the app is down. The good news: DNS failures follow a small number of patterns, and a calm method beats frantic cache-flushing every time.

The mental model

Your device asks a resolver (recursive) for a name. If the resolver doesn't already know the answer, it walks the hierarchy — root → TLD → authoritative — and caches what it learns for the record's TTL. The client asks once; the resolver does the legwork.

Three facts solve most incidents:

The client only knows what its configured resolver tells it.
/etc/hosts is usually checked before DNS; the order comes from the hosts: line in /etc/nsswitch.conf.
A record can be perfectly healthy and still be invisible to the wrong resolver (split-horizon).
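
To see the second fact on a given box, it helps to print the lookup order rather than assume it. A minimal sketch, assuming a glibc-style /etc/nsswitch.conf; the function name hosts_lookup_order is made up for illustration:

```shell
# Print the sources on the "hosts:" line of nsswitch.conf, in lookup order.
# Bracketed action items such as [NOTFOUND=return] are skipped.
# The optional first argument lets you point it at a copy of the file.
hosts_lookup_order() {
  awk '$1 == "hosts:" {
         for (i = 2; i <= NF; i++)
           if ($i !~ /^\[/) print $i
       }' "${1:-/etc/nsswitch.conf}"
}
```

If "files" is printed before "dns", /etc/hosts is consulted first on that machine.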

The method

1. Confirm and scope it. Does it fail for everyone, or just one network? Per-network symptoms point at per-network config, not the authoritative data.

dig app.internal           # what does THIS box's resolver return?

2. Find out which resolver you're using.

cat /etc/resolv.conf       # nameserver lines + search domains
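
If you only want the resolver addresses, a small filter does it. This is a sketch; list_nameservers is a hypothetical helper name, and the optional argument exists only so you can test it against a copy of the file:

```shell
# Print the address from every "nameserver" line in resolv.conf.
# Comment lines and search/options lines are ignored.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

On hosts running systemd-resolved this will often print only 127.0.0.53, the local stub; in that case resolvectl status should show the upstream servers actually in use.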

3. Compare resolvers. Ask the internal resolver and a public one for the same name:

dig @10.0.0.53 app.internal     # internal
dig @1.1.1.1   app.internal     # public

If the internal resolver answers and the public one returns NXDOMAIN, the zone is fine: the failing clients are simply pointed at a resolver that can't see it. That's a resolver or DHCP configuration problem, not a dead record.
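
The comparison can be made mechanical by reading the status out of dig's header line. A rough sketch under stated assumptions: dig_status and split_horizon_verdict are hypothetical helper names, the resolver addresses are the example ones used above, and a NOERROR reply with an empty answer section is not treated specially.

```shell
# Read full dig output on stdin and print the response status
# (NOERROR, NXDOMAIN, SERVFAIL, ...) from the ->>HEADER<<- line.
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Turn two statuses (internal resolver first, public second) into a verdict.
split_horizon_verdict() {
  internal=$1
  public=$2
  if [ "$internal" = "NOERROR" ] && [ "$public" = "NXDOMAIN" ]; then
    echo "internal-only record: check which resolver the failing clients use"
  elif [ "$internal" = "$public" ]; then
    echo "resolvers agree ($internal): look elsewhere"
  else
    echo "resolvers disagree (internal=$internal public=$public)"
  fi
}

# Usage (needs network access):
#   split_horizon_verdict \
#     "$(dig @10.0.0.53 app.internal | dig_status)" \
#     "$(dig @1.1.1.1 app.internal | dig_status)"
```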

4. Walk to the source of truth.

dig +trace app.example.com

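Here dig does the iteration itself, root → TLD → authoritative, instead of relying on your resolver's cache. The last server to answer is where the truth lives. Names that exist only in a private zone can't be reached this way, which is why this example uses a public name.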
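
Trace output is long, and usually the only thing you need from it is who sent the final response. A small sketch, assuming dig's usual ";; Received N bytes from ADDR#PORT(NAME) in T ms" lines; last_trace_server is a hypothetical name:

```shell
# Read `dig +trace` output on stdin and print the server that sent
# the last response, in dig's own ADDR#PORT(NAME) form.
last_trace_server() {
  awk '/^;; Received/ { last = $6 } END { if (last != "") print last }'
}

# Usage (needs network access):
#   dig +trace app.example.com | last_trace_server
```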

5. Suspect caching for "I changed it but it's still wrong". Read the TTL in the answer; the old value sticks until it expires. Lowering TTL helps future changes — it can't un-cache one already out there.
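
The TTL is the second column of each answer line. A minimal sketch for pulling it out; answer_ttls is a made-up helper, and ns1.example.com in the usage lines is a placeholder for your real authoritative server:

```shell
# Read `dig +noall +answer` output on stdin and print
# "name type ttl" for each record, skipping comment lines.
answer_ttls() {
  awk '!/^;/ && NF >= 5 { print $1, $4, $2 }'
}

# Usage (needs network access):
#   dig +noall +answer app.example.com | answer_ttls                    # via your resolver
#   dig +noall +answer @ns1.example.com app.example.com | answer_ttls   # authoritative
```

Asked through a caching resolver, the TTL counts down between runs; asked of the authoritative server, it shows the full configured value. A countdown from a large number tells you roughly how long the old answer will linger.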

The traps that waste hours

/etc/hosts wins. A stale incident-time entry makes one box resolve differently from everyone else. getent hosts <name> honours the real order.
Flushing caches that aren't stale. If the answer is fresh but wrong, you're asking the wrong server — flushing changes nothing.
Blaming the DNS team. Query the authoritative server directly before you escalate. If it returns the right record, the problem is downstream of them.
NXDOMAIN vs SERVFAIL. NXDOMAIN = "this name doesn't exist (here)". SERVFAIL = "the resolver couldn't get an answer" — often a broken forwarder or DNSSEC issue.
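
For the first trap, a quick way to spot an override is to search /etc/hosts for the exact name rather than eyeballing the file. A sketch only; hosts_entries is a hypothetical helper and the optional second argument exists for testing against a copy:

```shell
# Print the address of every non-comment hosts-file entry that lists
# the given name as a hostname or alias (case-insensitive match).
hosts_entries() {
  awk -v name="$1" '
    { sub(/#.*/, "") }   # strip comments; awk re-splits the fields
    { for (i = 2; i <= NF; i++)
        if (tolower($i) == tolower(name)) { print $1; break } }
  ' "${2:-/etc/hosts}"
}

# Usage:
#   hosts_entries app.internal
```

Any output means that box has a local override for the name, so compare it with what getent hosts and dig return.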

The 30-second checklist

Reproduce; check scope.
cat /etc/resolv.conf — which resolver?
dig @internal vs dig @public — who has the record?
dig +trace — confirm the authoritative answer.
Check /etc/hosts and TTLs before touching anything.

Master this and "it works here but not there" stops being scary and becomes a two-minute diagnosis.

Want to practise on a live scenario? Try the puzzle DNS Works Here, Not There.
