[BALUG-Admin] Root Cause Analysis (RCA)*: Re: S/N bobble -- waiting for Michael P. on this 8-O ... "fixed"[1]

Rick Moen rick@linuxmafia.com
Sat Jun 1 05:45:13 UTC 2019


Quoting Michael Paoli (Michael.Paoli@cal.berkeley.edu):

[lots of interesting and diligent attempting to get the full story and
learn possible lessons]

> May or may not be worth doing anything explicitly about - reasonably
> attentive operations, it's not at all an impossible error to
> repeat, but it's also fairly low probability, and results aren't
> too nastily horrific - though it of course would be better
> to not have such boo-boos.

Absolutely agree.

> Also suggested, thing(s) to (more) carefully watch logs, and
> catch/notify on issues ... but in the above case and with
> presumed cause of issue, such on the VM itself wouldn't have
> detected any issues ... but monitoring external to the VM
> would have caught the issue - e.g. on the nominal physical
> host - as that's up *most* of the time (whereas the
> alternate physical spends most of its time down).

Yeah, that's the bit that I'd already spent time gnawing over.
On the DNS master, as you say, following a VM image shuffle boo-boo, the
DNS master sees absolutely nothing out of kilter, so there's no obvious
way for it to detect being, um, the wrong clone at the wrong time, sort
of like Sam Rockwell's character in the brilliant 2009 Duncan Jones
movie 'Moon' (winner of that year's Hugo Award for Best Dramatic
Presentation - Long Form, by the way).  Except, say, my primitive weekly
domain-checking scripts, about which more below.

On any DNS slave, such as in this case ns1.linuxmafia.com, the glitch 
becomes obvious from daemon.log scrutiny (e.g., using Logcheck).
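
A minimal sketch of that kind of daemon.log scrutiny, for anyone without Logcheck to hand. It assumes BIND 9 on the slave, and it assumes BIND's usual complaint wording when a master offers a serial lower than the slave's own copy ("serial number (N) received from master ... < ours (M)"; newer releases say "primary" instead of "master"). The function name and the log path in the usage comment are illustrative only.

```shell
#!/bin/sh
# Sketch: list log lines where a slave was offered a zone serial LOWER
# than the one it already holds, i.e. the master appears to have gone
# back in time.
# Assumptions: BIND 9 message wording; log file path is passed as $1.
# find_serial_regressions is a hypothetical helper name.

find_serial_regressions() {
    logfile=$1
    # grep exits 1 when nothing matches, which here means "no regressions".
    grep -E 'serial number \([0-9]+\) received from (master|primary) .* < ours \([0-9]+\)' "$logfile"
}

# Example use (path is an assumption, adjust to the local syslog setup):
#   find_serial_regressions /var/log/daemon.log
```

Any output at all is worth a look, since a healthy master never hands out a serial older than the one its slaves already have.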


> Also, operator(s) doing the migrations when more attentive,
> less rushed/tired/sleepy would also reduce probability of
> error(s). 

Yeah, gosh, rather like being an airline pilot.  [pause for irony]
https://riceo.me/posts/software-engineering-lessons-from-aviation/


Your point about the value of _external_ checks is of course profound
and valuable.  Those of us running our own servers typically don't even
use process supervisor software (that respawns dead network daemons),
let alone doing external service-availability checks.  Back at $FIRM, 
we paid mucho dinero to companies whose business service was running
periodic scripted checks of the customer's (e.g., $FIRM's) essential
public network services.  And, gosh, I'm not going to pay for that for
ns1.linuxmafia.com.  Maybe in the fullness of time, I might get around
to setting up _some_ service-availability checking from a host of mine
devoted to that function -- but definitely not for a while.

Anyway, here's perfunctory script /etc/cron.weekly/mydomains, which as
you'll see among other things collects S/Ns from each of the
authoritative nameservers for my two domains.  If something like this
were run, you would presumably notice the S/N discrepancy.

Much smarter implementations are, I'm certain, possible.  


#!/bin/sh

# mydomains     Cron script to sanity-check my domains' SOA records at
#               all of their authoritative nameservers, as a quick and 
#               dirty way of making sure (1) they're all online and
#               (2) they're all serving up the same data (or at least
#               data with the same zonefile serial number).
#  
#               The script queries all nameservers for their current
#               SOA value, and then uses awk to parse out of that 
#               verbose record just the S/N field, which is field #3.  
#               The point is that you can visually spot offline or 
#               aberrant nameservers by their S/Ns being (respectively) 
#               missing or an out-of-step value.
#
#		Written by Rick Moen (rick@linuxmafia.com)
#		$Id: cron.weekly,v 1.04 2018/10/03 00:58:00 rick
# Copyright (C) Rick Moen, 2011-2018.  Do anything you want with this work.

set -o errexit  #aka "set -e": exit if any line returns non-true value
set -o nounset  #aka "set -u": exit upon finding an uninitialised variable

test -x /usr/bin/mail || exit 0
test -x /usr/bin/whois || exit 0
test -x /usr/bin/awk || exit 0
test -x /bin/grep || exit 0
test -x /usr/bin/dig || exit 0


{
echo "As of 2018-10-03, linuxmafia.com should show five authoritative nameservers:"
echo ""
echo "ns.primate.net. 198.144.194.12 (Aaron T. Porter)"
echo "ns.tx.primate.net. 72.249.38.88 (Aaron T. Porter)"
echo "ns3.linuxmafia.com. 63.193.123.122, aka ns.catwhisker.org (David Wolfskill)"
echo "ns6.linuxmafia.com. 209.205.200.166, aka ns1.thecoop.net (Drew Bertola)"
echo "ns1.linuxmafia.com. 198.144.195.186 (Rick Moen)"
echo ""
echo "As of 2018-10-03, unixmercenary.net should show five authoritative nameservers:"
echo ""
echo "ns.primate.net. 198.144.194.12 (Aaron T. Porter)"
echo "ns.tx.primate.net. 72.249.38.88 (Aaron T. Porter)"
echo "ns3.linuxmafia.com. 63.193.123.122, aka ns.catwhisker.org (David Wolfskill)"
echo "ns6.linuxmafia.com. 209.205.200.166, aka ns1.thecoop.net (Drew Bertola)"
echo "ns1.linuxmafia.com. 198.144.195.186 (Rick Moen)"
echo ""
echo "If any is missing from reports below, or produces odd data, something is wrong."
echo ""
echo "Zonefile S/Ns, linuxmafia.com:"
echo ""
dig -t soa linuxmafia.com. @ns.primate.net. +short | awk '{ print $3 " on ns.primate.net." }'
dig -t soa linuxmafia.com. @ns.tx.primate.net. +short | awk '{ print $3 " on ns.tx.primate.net." }'
dig -t soa linuxmafia.com. @ns3.linuxmafia.com. +short | awk '{ print $3 " on ns3.linuxmafia.com." }'
dig -t soa linuxmafia.com. @ns6.linuxmafia.com. +short | awk '{ print $3 " on ns6.linuxmafia.com."}'
dig -t soa linuxmafia.com. @ns1.linuxmafia.com. +short | awk '{ print $3 " on ns1.linuxmafia.com."}'
echo ""
echo "Zonefile S/Ns, unixmercenary.net:"
echo ""
dig -t soa unixmercenary.net. @ns.primate.net. +short | awk '{ print $3 " on ns.primate.net." }'
dig -t soa unixmercenary.net. @ns.tx.primate.net. +short | awk '{ print $3 " on ns.tx.primate.net." }'  
dig -t soa unixmercenary.net. @ns3.linuxmafia.com. +short | awk '{ print $3 " on ns3.linuxmafia.com." }'
dig -t soa unixmercenary.net. @ns6.linuxmafia.com. +short | awk '{ print $3 " on ns6.linuxmafia.com."}'
dig -t soa unixmercenary.net. @ns1.linuxmafia.com. +short | awk '{ print $3 " on ns1.linuxmafia.com."}' 
echo ""
echo "Authoritative nameservers from whois, linuxmafia.com:"
echo ""
whois linuxmafia.com | grep 'Name Server' | awk -F: '{ print $2 }' | head -n 7
echo ""
echo "Authoritative nameservers from whois, unixmercenary.net:"
echo ""
whois unixmercenary.net | grep 'Name Server' | awk -F: '{ print $2 }' | head -n 7
echo ""
echo "Parent-zone NS records and matching A records (glue), linuxmafia.com:"
echo ""
dig -t ns linuxmafia.com. @$(dig -t ns com. +short | head -n 1) +nocmd +noquestion +nostats +nocomments
echo ""
echo "Parent-zone NS records and matching A records (glue), unixmercenary.net:"
echo ""
dig -t ns unixmercenary.net. @$(dig -t ns net. +short | head -n 1) +nocmd +noquestion +nostats +nocomments
echo ""
echo "In-domain NS records and matching A records, linuxmafia.com:"
echo ""
dig -t ns linuxmafia.com. @$(dig -t ns linuxmafia.com. +short | head -n 1) +nocmd +noquestion +nostats +nocomments
echo ""
echo "In-domain NS records and matching A records, unixmercenary.net:"
echo ""
dig -t ns unixmercenary.net. @$(dig -t ns unixmercenary.net. +short | head -n 1) +nocmd +noquestion +nostats +nocomments

} |
mail -s "Domains linuxmafia.com and unixmercenary.net SOA check" rick@linuxmafia.com
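
One possible step toward a smarter implementation is to have the script compare the serials itself, rather than relying on someone eyeballing the mailed report. The following is only a sketch: the function names collect_serials and check_serials are hypothetical and not part of mydomains, and the 5-second timeout and 2 tries are arbitrary choices. It treats any non-numeric answer (such as dig's timeout message) as a missing nameserver.

```shell
#!/bin/sh
# Sketch: flag nameservers whose SOA serial is missing or out of step.
# Assumptions: helper names are illustrative; timeout values are arbitrary.

# Print "<serial> <server>" per nameserver, or "MISSING <server>" when
# no numeric serial comes back (server down, timeout, refused, etc.).
collect_serials() {
    zone=$1
    shift
    for ns in "$@"; do
        serial=$(dig -t soa "$zone" "@$ns" +short +time=5 +tries=2 |
                 awk 'NR == 1 && $3 ~ /^[0-9]+$/ { print $3 }')
        echo "${serial:-MISSING} $ns"
    done
}

# Read "<serial> <server>" lines on stdin. Return 0 silently if every
# server reported the same serial; otherwise print all lines, return 1.
check_serials() {
    awk '
        { count[$1]++; line[NR] = $0 }
        END {
            distinct = 0
            for (s in count) distinct++
            if (distinct == 1 && !("MISSING" in count)) exit 0
            for (i = 1; i <= NR; i++) print line[i]
            exit 1
        }'
}

# Example use:
#   collect_serials linuxmafia.com. ns.primate.net. ns1.linuxmafia.com. |
#       check_serials || echo "S/N mismatch for linuxmafia.com"
```

Run from a host other than the DNS master, this would be a crude form of the external check discussed above. Note that a brief mismatch is normal right after a zone update, before the slaves have picked up the new serial.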



