
How to Troubleshoot DNS Like an Infrastructure Engineer

2 February 2026 · 7 min

DNS is the first domino. If names don't resolve, users don't care that your servers are healthy — to them, the app is down. The good news: DNS failures follow a small number of patterns, and a calm method beats frantic cache-flushing every time.

The mental model

Your device asks a resolver (recursive) for a name. If the resolver doesn't already know the answer, it walks the hierarchy — root → TLD → authoritative — and caches what it learns for the record's TTL. The client asks once; the resolver does the legwork.

Three facts solve most incidents:

  • The client only knows what its configured resolver tells it.
  • /etc/hosts (via nsswitch.conf) is usually checked before DNS.
  • A record can be perfectly healthy and still be invisible to the wrong resolver (split-horizon).
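
To check the second fact on a given machine, a small helper can print the lookup order for host names. This is a minimal sketch: hosts_lookup_order is a hypothetical helper name, and it assumes a glibc-style nsswitch.conf with a single hosts: line.

```shell
# Print the sources consulted for host names, in the order they are tried.
# The file path is a parameter (default /etc/nsswitch.conf) so the function
# can also be pointed at a copy of the file.
hosts_lookup_order() {
    awk '
        { sub(/#.*/, "") }                 # drop comments
        $1 == "hosts:" {                   # e.g. "hosts: files dns"
            for (i = 2; i <= NF; i++) print $i
            exit
        }
    ' "${1:-/etc/nsswitch.conf}"
}
```

If files is printed before dns, an /etc/hosts entry wins over whatever the resolver says. Action tokens such as [NOTFOUND=return] are printed as they appear.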

The method

1. Confirm and scope it. Does it fail for everyone, or just one network? Per-network symptoms point at per-network config, not the authoritative data.

dig app.internal           # what does this box's configured resolver return?

2. Find out which resolver you're using.

cat /etc/resolv.conf       # nameserver lines + search domains
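
When all you need is the list of resolver addresses, for example to feed them to dig one by one, the nameserver lines can be pulled out on their own. A minimal sketch, with configured_resolvers as a hypothetical helper name:

```shell
# Print each configured resolver address, one per line.
# The file path is a parameter (default /etc/resolv.conf).
configured_resolvers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

On hosts running systemd-resolved the only entry is often the local stub 127.0.0.53. In that case resolvectl status shows the upstream servers actually in use.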

3. Compare resolvers. Ask the internal resolver and a public one for the same name:

dig @10.0.0.53 app.internal     # internal
dig @1.1.1.1   app.internal     # public

If the internal resolver answers and the public one returns NXDOMAIN, the zone is fine: the failing clients are simply pointed at a resolver that can't see it. That's a resolver or DHCP configuration problem, not a dead record.
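
The same comparison can be scripted so that every resolver is summarised on one line. This is a sketch, not a finished tool: rcode_from and compare_resolvers are hypothetical names, and the status is read from the header line that dig prints (the one containing "status: ...").

```shell
# Print the response status (NOERROR, NXDOMAIN, SERVFAIL, ...) that one
# resolver returns for one name.
rcode_from() {
    # $1 = resolver address, $2 = name to look up
    dig "@$1" "$2" +time=2 +tries=1 2>/dev/null |
        sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Ask several resolvers for the same name; print "resolver<TAB>status".
# The body runs in a subshell so its variables do not leak to the caller.
compare_resolvers() (
    name=$1
    shift
    for resolver in "$@"; do
        status=$(rcode_from "$resolver" "$name")
        # No status line at all means no usable reply came back.
        printf '%s\t%s\n' "$resolver" "${status:-NO_RESPONSE}"
    done
)
```

Usage: compare_resolvers app.internal 10.0.0.53 1.1.1.1. In the split-horizon case described above you would expect NOERROR from the internal resolver and NXDOMAIN from the public one.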

4. Walk to the source of truth.

dig +trace app.example.com

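With +trace, dig walks root → TLD → authoritative itself instead of asking the recursive resolver for the final answer, so the resolver's cache is bypassed. The last server in the output is authoritative for the name, and its answer is the source of truth.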
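
Once you know which zone the name lives in, you can also put the question to each authoritative server directly and see whether they agree. A minimal sketch, assuming you pass the zone yourself; ask_authoritative is a hypothetical helper name.

```shell
# Ask every nameserver delegated for a zone for one record, with recursion
# disabled, so each answer comes from that server's own data.
# The body runs in a subshell so its variables do not leak to the caller.
ask_authoritative() (
    # $1 = zone (e.g. example.com), $2 = name, $3 = record type (default A)
    zone=$1 name=$2 type=${3:-A}
    for ns in $(dig +short "$zone" NS); do
        answer=$(dig +short +norecurse "@$ns" "$name" "$type" | paste -sd ' ' -)
        printf '%s\t%s\n' "$ns" "$answer"
    done
)
```

Usage: ask_authoritative example.com app.example.com. If the servers disagree with one another, suspect a zone that has not finished propagating between them.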

5. Suspect caching for "I changed it but it's still wrong". Read the TTL in the answer; the old value sticks until it expires. Lowering TTL helps future changes — it can't un-cache one already out there.
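
To read the TTL without the rest of dig's output, the answer section can be reduced to the fields that matter. A minimal sketch; answer_ttls is a hypothetical helper name, and the optional second argument is the resolver to ask.

```shell
# Print "TTL<TAB>type<TAB>data" for each record in the answer section.
# When the answer comes from a resolver's cache, the TTL is the time left
# before that resolver discards the record, so it counts down between queries.
answer_ttls() {
    # $1 = name, $2 = optional resolver address
    dig +noall +answer ${2:+"@$2"} "$1" |
        awk '{
            # Answer lines look like: name TTL class type data...
            data = $5
            for (i = 6; i <= NF; i++) data = data " " $i
            printf "%s\t%s\t%s\n", $2, $4, data
        }'
}
```

Running answer_ttls app.example.com twice a few seconds apart and seeing the TTL drop suggests the answer is being served from a cache.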

The traps that waste hours

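  • /etc/hosts wins. A stale incident-time entry makes one box resolve differently from everyone else. getent hosts <name> follows the real lookup order from nsswitch.conf; dig does not, because it queries DNS directly and never reads /etc/hosts.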
  • Flushing caches that aren't stale. If the answer is fresh but wrong, you're asking the wrong server — flushing changes nothing.
  • Blaming the DNS team. Query the authoritative server directly before you escalate. If it returns the right record, the problem is downstream of them.
  • NXDOMAIN vs SERVFAIL. NXDOMAIN = "this name doesn't exist (here)". SERVFAIL = "the resolver couldn't get an answer", often because of a broken forwarder or a DNSSEC validation failure.
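
For the first trap, a quick way to spot a leftover override is to ask the hosts file directly. A minimal sketch; hosts_file_address is a hypothetical helper name, and the file path is a parameter so a copy can be checked as well.

```shell
# Print the address of every hosts-file line that pins the given name.
hosts_file_address() {
    # $1 = hostname, $2 = hosts file (default /etc/hosts)
    awk -v name="$1" '
        { sub(/#.*/, "") }                 # drop comments
        {
            # Field 1 is the address; the remaining fields are names.
            for (i = 2; i <= NF; i++)
                if (tolower($i) == tolower(name)) { print $1; next }
        }
    ' "${2:-/etc/hosts}"
}
```

Any output for the failing name means /etc/hosts is overriding DNS on that box. Comparing getent hosts app.internal with dig +short app.internal shows the same disagreement from the outside.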

The 30-second checklist

  1. Reproduce; check scope.
  2. cat /etc/resolv.conf — which resolver?
  3. dig @internal vs dig @public — who has the record?
  4. dig +trace — confirm the authoritative answer.
  5. Check /etc/hosts and TTLs before touching anything.

Master this and "it works here but not there" stops being scary and becomes a two-minute diagnosis.

