Quoting Michael Paoli (Michael.Paoli@cal.berkeley.edu):
[lots of interesting and diligent attempting to get the full story and learn possible lessons]
> May or may not be worth doing anything explicitly about - reasonably attentive operations, it's not at all an impossible error to repeat, but it's also fairly low probability, and results aren't too nastily horrific - though it of course would be better to not have such boo-boos.
Absolutely agree.
> Also suggested, thing(s) to (more) carefully watch logs, and catch/notify on issues ... but in the above case and with presumed cause of issue, such on the VM itself wouldn't have detected any issues ... but monitoring external to the VM would have caught the issue - e.g. on the nominal physical host - as that's up *most* of the time (whereas the alternate physical spends most of its time down).
Yeah, that's the bit that I'd already spent time gnawing over. On the DNS master, as you say, following a VM image shuffle boo-boo, the DNS master sees absolutely nothing out of kilter, so there's no obvious way for it to detect being, um, the wrong clone at the wrong time, sort of like Sam Rockwell's character in the brilliant 2009 Duncan Jones movie 'Moon' (winner of that year's Hugo Award for Best Dramatic Presentation - Long Form, by the way). Except, say, my primitive weekly domain-checking scripts, about which more below.
On any DNS slave, such as in this case ns1.linuxmafia.com, the glitch becomes obvious from daemon.log scrutiny (e.g., using Logcheck).
> Also, operator(s) doing the migrations when more attentive, less rushed/tired/sleepy would also reduce probability of error(s).
Yeah, gosh, rather like being an airline pilot. [pause for irony] https://riceo.me/posts/software-engineering-lessons-from-aviation/
Your point about the value of _external_ checks is of course profound and valuable. Those of us running our own servers typically don't even use process supervisor software (the kind that respawns dead network daemons), let alone do external service-availability checks. Back at $FIRM, we paid mucho dinero to companies whose business was running periodic scripted checks of the customer's (e.g., $FIRM's) essential public network services. And, gosh, I'm not going to pay for that for ns1.linuxmafia.com. Maybe in the fullness of time, I might get around to setting up _some_ service-availability checking from a host of mine devoted to that function -- but definitely not for a while.
Anyway, here's perfunctory script /etc/cron.weekly/mydomains, which as you'll see among other things collects S/Ns from each of the authoritative nameservers for my two domains. If something like this were run, you would presumably notice the S/N discrepancy.
Much smarter implementations are, I'm certain, possible.
#!/bin/sh
# mydomains  Cron script to sanity-check my domains' SOA records at
#            all of their authoritative nameservers, as a quick and
#            dirty way of making sure (1) they're all online and
#            (2) they're all serving up the same data (or at least
#            data with the same zonefile serial number).
#
#            The script queries all nameservers for their current
#            SOA value, and then uses awk to parse out of that
#            verbose record just the S/N field, which is field #3.
#            The point is that you can visually spot offline or
#            aberrant nameservers by their S/Ns being (respectively)
#            missing or an out-of-step value.
#
# Written by Rick Moen (rick@linuxmafia.com)
# $Id: cron.weekly,v 1.04 2018/10/03 00:58:00 rick
# Copyright (C) Rick Moen, 2011-2018. Do anything you want with this work.
set -o errexit   #aka "set -e": exit if any line returns non-true value
set -o nounset   #aka "set -u": exit upon finding an uninitialised variable
test -x /usr/bin/mail || exit 0
test -x /usr/bin/whois || exit 0
test -x /usr/bin/awk || exit 0
test -x /bin/grep || exit 0
test -x /usr/bin/dig || exit 0
{
echo "As of 2018-10-03, linuxmafia.com should show five authoritative nameservers:"
echo ""
echo "ns.primate.net. 198.144.194.12, (Aaron T. Porter)"
echo "ns.tx.primate.net. 72.249.38.88 (Aaron T. Porter)"
echo "ns3.linuxmafia.com. 63.193.123.122, aka ns.catwhisker.org (David Wolfskill)"
echo "ns6.linuxmafia.com. 209.205.200.166, aka ns1.thecoop.net (Drew Bertola)"
echo "ns1.linuxmafia.com. 198.144.195.186 (Rick Moen)"
echo ""
echo "As of 2018-10-03, unixmercenary.net should show five authoritative nameservers:"
echo ""
echo "ns.primate.net. 198.144.194.12, (Aaron T. Porter)"
echo "ns.tx.primate.net. 72.249.38.88 (Aaron T. Porter)"
echo "ns3.linuxmafia.com. 63.193.123.122, aka ns.catwhisker.org (David Wolfskill)"
echo "ns6.linuxmafia.com. 209.205.200.166, ns1.thecoop.net (Drew Bertola)"
echo "ns1.linuxmafia.com. 198.144.195.186 (Rick Moen)"
echo ""
echo "If any is missing from reports below, or produces odd data, something is wrong."
echo ""
echo "Zonefile S/Ns, linuxmafia.com:"
echo ""
dig -t soa linuxmafia.com. @ns.primate.net. +short | awk '{ print $3 " on ns.primate.net." }'
dig -t soa linuxmafia.com. @ns.tx.primate.net. +short | awk '{ print $3 " on ns.tx.primate.net." }'
dig -t soa linuxmafia.com. @ns3.linuxmafia.com. +short | awk '{ print $3 " on ns3.linuxmafia.com." }'
dig -t soa linuxmafia.com. @ns6.linuxmafia.com. +short | awk '{ print $3 " on ns6.linuxmafia.com."}'
dig -t soa linuxmafia.com. @ns1.linuxmafia.com. +short | awk '{ print $3 " on ns1.linuxmafia.com."}'
echo ""
echo "Zonefile S/Ns, unixmercenary.net:"
echo ""
dig -t soa unixmercenary.net. @ns.primate.net. +short | awk '{ print $3 " on ns.primate.net." }'
dig -t soa unixmercenary.net. @ns.tx.primate.net. +short | awk '{ print $3 " on ns.tx.primate.net." }'
dig -t soa unixmercenary.net. @ns3.linuxmafia.com. +short | awk '{ print $3 " on ns3.linuxmafia.com." }'
dig -t soa unixmercenary.net. @ns6.linuxmafia.com. +short | awk '{ print $3 " on ns6.linuxmafia.com."}'
dig -t soa unixmercenary.net. @ns1.linuxmafia.com. +short | awk '{ print $3 " on ns1.linuxmafia.com."}'
echo ""
echo "Authoritative nameservers from whois, linuxmafia.com:"
echo ""
whois linuxmafia.com | grep 'Name Server' | awk -F: '{ print $2 }' | head -n 7
echo ""
echo "Authoritative nameservers from whois, unixmercenary.net:"
echo ""
whois unixmercenary.net | grep 'Name Server' | awk -F: '{ print $2 }' | head -n 7
echo ""
echo "Parent-zone NS records and matching A records (glue), linuxmafia.com:"
echo ""
dig -t ns linuxmafia.com. @$(dig -t ns com. +short | head -n 1) +nocmd +noquestion +nostats +nocomments
echo ""
echo "Parent-zone NS records and matching A records (glue), unixmercenary.net:"
echo ""
dig -t ns unixmercenary.net. @$(dig -t ns net. +short | head -n 1) +nocmd +noquestion +nostats +nocomments
echo ""
echo "In-domain NS records and matching A records, linuxmafia.com:"
echo ""
dig -t ns linuxmafia.com. @$(dig -t ns linuxmafia.com. +short | head -n 1) +nocmd +noquestion +nostats +nocomments
echo ""
echo "In-domain NS records and matching A records, unixmercenary.net:"
echo ""
dig -t ns unixmercenary.net. @$(dig -t ns unixmercenary.net. +short | head -n 1) +nocmd +noquestion +nostats +nocomments
} | mail -s "Domains linuxmafia.com and unixmercenary.net SOA check" rick@linuxmafia.com
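
As one possible step towards a smarter implementation, here is a minimal sketch that compares the serials itself rather than relying on someone eyeballing the mail. It is only an illustration under assumptions: the function names soa_serial and compare_serials, the NONE placeholder for a nameserver that gives no usable answer, and the 5-second timeout are all hypothetical choices of mine, not part of the script above.

```shell
#!/bin/sh
# Sketch only: automatic SOA serial comparison.
# All names here (soa_serial, compare_serials, NONE) are illustrative.

# soa_serial ZONE NAMESERVER
# Prints "<serial> <nameserver>", or "NONE <nameserver>" when the
# server gives no usable SOA answer (offline, timeout, refusal).
soa_serial() {
    # Accept only a numeric third field, so that dig's error text
    # (e.g. a timeout message) is never mistaken for a serial.
    _serial=$(dig -t soa "$1" "@$2" +short +time=5 +tries=1 2>/dev/null |
        awk '$3 ~ /^[0-9]+$/ { print $3; exit }')
    echo "${_serial:-NONE} $2"
}

# compare_serials
# Reads "<serial-or-NONE> <nameserver>" lines on stdin, prints a
# verdict, and returns 0 only if every server answered and all
# serials are identical.
compare_serials() {
    awk '
        BEGIN         { status = 0 }
        $1 == "NONE"  { missing = missing " " $2; next }
        !($1 in seen) { seen[$1] = 1; distinct++ }
        END {
            if (NR == 0)       { print "NO DATA"; status = 1 }
            if (missing != "") { print "NO ANSWER FROM:" missing; status = 1 }
            if (distinct > 1)  { print "SERIAL MISMATCH"; status = 1 }
            if (status == 0)   { print "OK" }
            exit status
        }
    '
}

# Possible usage (left commented out, since it needs network access):
#
# for ns in ns.primate.net. ns1.linuxmafia.com.; do
#     soa_serial linuxmafia.com. "$ns"
# done | compare_serials || echo "linuxmafia.com needs a look"
```

The non-zero return from compare_serials means a cron job could send mail only when something is actually wrong, instead of sending a weekly report that has to be read carefully every time.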