[BALUG-Admin] Root Cause Analysis (RCA)*: Re: S/N bobble -- waiting for Michael P. on this 8-O ... "fixed"[1]

Michael Paoli Michael.Paoli@cal.berkeley.edu
Sat Jun 1 04:30:56 UTC 2019


Root Cause Analysis (RCA)*
*or at least a reasonable attempt thereof, as feasible.
Analysis/data/etc. below (at least mostly as relevant),
then excerpts of at least much of the earlier emails closely
related to the event.

"Root Cause Analysis" (RCA) of DNS serial number issues
noted & reported 2019-05-30 US/Pacific

Executive summary:
Likely a boo-boo in the copying of the VM disk image (a copy missed, or
done the wrong way 'round) from (physical) host to host when doing a
(cold) migration of the VM from one host to the other.
"Prevention" or reducing probability of same/similar issues again?
See the last several paragraphs or so under line matching the above.

Analysis, conclusions (as feasible), and possible "corrective"/preventive
information follows:

relevant history excerpts from the shell session typically used for disk
image copies of the VM between physical hosts for migrations:
534      dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
538      dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
I believe the first was repeated, as I'd unintentionally started the copy
from source to target before stopping the source - so that would be an
invalid copy; the copy was then restarted after stopping the source
545      ssh -ax -l root 192.168.55.2 'dd bs=4194304 if=/var/local/balug/balug-sda' | time dd bs=4194304 of=/var/local/balug/balug-sda
552      dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
556      ssh -ax -l root 192.168.55.2 'dd bs=4194304 if=/var/local/balug/balug-sda' | time dd bs=4194304 of=/var/local/balug/balug-sda

at the time of examination, the VM's host is the nominal physical host

Checking the file://linuxmafia.com/var/log/daemon.log* files,
it looks like the domains:
balug.org
sf-lug.com
sf-lug.org
sflug.org
were impacted,
all okay with serial numbers now matching between master(s) & slaves,
after manual correction on balug.org and sf-lug.org,
and it looks like sf-lug.com and sflug.org were corrected earlier (probably
as those zones were routinely updated/maintained).
Oldest issue detection still in linuxmafia.com log files examined:
May 25 09:23:51
last reported in those logs:
May 30 13:54:57
linuxmafia.com timestamps from those are in timezone America/Los_Angeles
excluding timestamps, looking at unique S/N complaints, we have:
balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
sf-lug.com/IN: serial number (1558125318) received from master 198.144.194.238#53 < ours (1558799275)
sf-lug.org/IN: serial number (1558622463) received from master 198.144.194.238#53 < ours (1558799278)
sflug.org/IN: serial number (1557434452) received from master 198.144.194.238#53 < ours (1557834269)
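As a rough sketch, the unique complaints above can be pulled from those
logs with something along these lines (exact log paths/rotation per that
host; not necessarily the precise commands used):
# extract the "serial number ... < ours" complaints (zgrep also handles
# the rotated/compressed logs), drop the timestamp/host/pid prefix,
# and reduce to unique complaints:
zgrep -h 'serial number (.*) received from master .* < ours' \
    /var/log/daemon.log* |
sed 's/^.*named\[[0-9]*\]: zone //' |
sort -u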
What times do those correspond to?  Well, we're using seconds from
epoch when we update our S/N, so ...
for t in 1557434452 1557834269 1558125318 1558622463 1558725628 \
         1558799275 1558799278 1558799284
do
         echo "$t" $(TZ=GMT0 date -Iseconds -d @"$t"
         TZ=US/Pacific date -Iseconds -d @"$t")
done
1557434452 2019-05-09T20:40:52+00:00 2019-05-09T13:40:52-07:00 sflug.org
1557834269 2019-05-14T11:44:29+00:00 2019-05-14T04:44:29-07:00 sflug.org *
1558125318 2019-05-17T20:35:18+00:00 2019-05-17T13:35:18-07:00 sf-lug.com
1558622463 2019-05-23T14:41:03+00:00 2019-05-23T07:41:03-07:00 sf-lug.org
1558725628 2019-05-24T19:20:28+00:00 2019-05-24T12:20:28-07:00 balug.org
1558799275 2019-05-25T15:47:55+00:00 2019-05-25T08:47:55-07:00 sf-lug.com *
1558799278 2019-05-25T15:47:58+00:00 2019-05-25T08:47:58-07:00 sf-lug.org *
1558799284 2019-05-25T15:48:04+00:00 2019-05-25T08:48:04-07:00 balug.org *
I added the domains to the above, and marked with * the later S/N on each
       May 2019
Su Mo Tu We Th Fr Sa
           1  2  3  4
  5  6  7  8  9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31
Not all changes are necessarily checked into version control (e.g. RCS),
however, many are.  Ones that aren't are often short-term interim
changes, e.g. temporary changes for letsencrypt wildcard cert
validation via DNS.
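(Such a temporary change is typically just a short-lived TXT record,
roughly like the following - the token value here is a made-up
placeholder:)
_acme-challenge.balug.org.  300  IN  TXT  "made-up-example-validation-token"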
Let's see what we have in the currently running VM ... but note that what
the nameserver serves up might not match, as DNSSEC is added on top of that
and may increase the S/N; notably, in our bind9 configuration for these
master zones, we have:
         inline-signing yes;
         auto-dnssec maintain;
         serial-update-method unixtime;
With that there, and since our "hand"-updated SNs (those in the master
zone files themselves) are (or at least should be) generated via:
date +%s
we expect the SNs in the zone master files, and those served by
the nameserver, to reflect when they were last updated, in
seconds since the epoch (or at least quite close to that - there
may be a modest delay between the data being altered and the newer
data being (re)loaded or served up).
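A rough sketch of how such a semi-manual bump typically goes (zone name
and file path here are hypothetical, not necessarily the exact
commands/paths used):
# get the new SN value:
date +%s
# edit the zone master file, putting that value on the serial line, e.g.:
#     1559249859; SERIAL; date +%s
# then check the zone and reload it (inline-signing then re-signs and
# serves the updated, signed zone):
named-checkzone balug.org /etc/bind/master/balug.org &&
rndc reload balug.org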

domains, RCS versions and S/N line from zone master files over relevant
time/SN ranges:

balug.org
1.82 1559249859; SERIAL; date +%s
1.81 1557834269; SERIAL; date +%s
1.80 1550844926; SERIAL; date +%s

sf-lug.org
1.49 1559249874 ; SERIAL ; date +%s
1.48 1557834269 ; SERIAL ; date +%s
1.47 1550845043 ; SERIAL ; date +%s

sf-lug.com
1.49 1557834269 ; SERIAL ; date +%s
1.48 1550845063 ; SERIAL ; date +%s

sflug.org
1.4 1557834269 ; SERIAL ; date +%s
1.3 1557434452 ; SERIAL ; date +%s
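
A rough sketch of how such a listing can be pulled from RCS (the zone
master file location here is hypothetical; adjust to where the files
actually live):
# for each relevant RCS revision of a zone master file,
# print the revision and its SERIAL line:
cd /etc/bind/master    # hypothetical location
for r in 1.80 1.81 1.82
do
    printf '%s %s\n' "$r" "$(co -q -p -r"$r" balug.org | grep SERIAL)"
done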

So, the VM - on the two hosts.  Nominally it is only run on one at any given
time (same IP address, and other stateful stuff).  However, it's
okay to run a 2nd copy disconnected (virtually or otherwise) from the network,
or in single-user mode (or equivalent) - so that may also sometimes
be done (e.g. to examine slightly older data, or to test something,
etc.)

Log bits from both physical hosts below:
from file://tigger/var/log/libvirt/qemu/balug.log we have:
2019-04-29 06:04:16.775+0000: starting up
2019-05-12 15:01:56.595+0000: shutting down, reason=shutdown
2019-05-13 00:32:26.019+0000: starting up
2019-05-25 07:38:01.482+0000: starting up
2019-05-25 16:42:59.970+0000: shutting down, reason=shutdown
2019-05-26 10:50:34.956+0000: starting up
2019-05-26 16:26:07.416+0000: shutting down, reason=shutdown
2019-05-26 23:19:50.635+0000: starting up

from file://vicki/var/log/libvirt/qemu/balug.log we have:
2019-04-29 05:57:14.081+0000: shutting down, reason=shutdown
2019-05-12 15:07:50.838+0000: starting up
2019-05-13 00:24:13.725+0000: shutting down, reason=shutdown
2019-05-14 02:51:10.547+0000: starting up
2019-05-14 03:32:32.323+0000: shutting down, reason=destroyed
2019-05-25 15:45:36.856+0000: starting up
2019-05-25 16:28:53.723+0000: shutting down, reason=shutdown
2019-05-25 16:34:45.054+0000: starting up
2019-05-25 16:49:33.468+0000: shutting down, reason=destroyed
2019-05-25 16:56:14.352+0000: starting up
2019-05-26 08:28:08.171+0000: shutting down, reason=shutdown
2019-05-26 16:33:39.196+0000: starting up
2019-05-26 23:12:43.082+0000: shutting down, reason=shutdown
2019-05-31 03:39:45.011+0000: starting up

taking the above, prefixing with a letter to distinguish the physical
hosts, and sorting by time, I mark with * where there's overlap (a sketch
of that merge/sort follows the list):
v 2019-04-29 05:57:14.081+0000: shutting down, reason=shutdown
t 2019-04-29 06:04:16.775+0000: starting up
t 2019-05-12 15:01:56.595+0000: shutting down, reason=shutdown
v 2019-05-12 15:07:50.838+0000: starting up
v 2019-05-13 00:24:13.725+0000: shutting down, reason=shutdown
t 2019-05-13 00:32:26.019+0000: starting up
v 2019-05-14 02:51:10.547+0000: starting up *
v 2019-05-14 03:32:32.323+0000: shutting down, reason=destroyed
t 2019-05-25 07:38:01.482+0000: starting up
v 2019-05-25 15:45:36.856+0000: starting up *
v 2019-05-25 16:28:53.723+0000: shutting down, reason=shutdown
v 2019-05-25 16:34:45.054+0000: starting up *
t 2019-05-25 16:42:59.970+0000: shutting down, reason=shutdown
v 2019-05-25 16:49:33.468+0000: shutting down, reason=destroyed
v 2019-05-25 16:56:14.352+0000: starting up
v 2019-05-26 08:28:08.171+0000: shutting down, reason=shutdown
t 2019-05-26 10:50:34.956+0000: starting up
t 2019-05-26 16:26:07.416+0000: shutting down, reason=shutdown
v 2019-05-26 16:33:39.196+0000: starting up
v 2019-05-26 23:12:43.082+0000: shutting down, reason=shutdown
t 2019-05-26 23:19:50.635+0000: starting up
v 2019-05-31 03:39:45.011+0000: starting up * (current examination,
                                                in single user mode
                                                or equivalent)
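
A rough sketch of how that merged listing can be produced (assumes local
copies of the two hosts' qemu logs under hypothetical names):
# prefix each host's log lines with a distinguishing letter, merge,
# and sort by the leading timestamp (field 2 onward):
{
  sed 's/^/t /' tigger_balug.log    # copy of tigger:/var/log/libvirt/qemu/balug.log
  sed 's/^/v /' vicki_balug.log     # copy of vicki:/var/log/libvirt/qemu/balug.log
} | sort -k 2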

Kind'a doubt I directly goofed the zone SNs; thinking it's more likely a
VM image shuffle boo-boo, e.g. didn't copy the latest image before
bringing it up on the other physical host, or maybe even possibly copied
the wrong way 'round (less likely); in either case, some data would
be a bit older than it should have been, and the VM would effectively
take a step back in time regarding its data (but not its system time).

There's some sf-lug stuff that's backed up nominally overnightly,
and that has version control (RCS) on it.  If we had an image anomaly
where there was a goof - not copying over the most current image,
or copying older atop newer -
the sf-lug RCS may show a jump/gap - e.g. in its
tracking of the mbox changes.

$ sf-lug_mbox_stats
YYYY-MM-DD (in UTC) and # of lines in mbox file
2019-05-28 1734690
2019-05-26 1732849
2019-05-25 1732221
2019-05-25 1732150
2019-05-24 1731781
2019-05-23 1729078
2019-05-22 1727972
2019-05-21 1727109
2019-05-19 1723626
2019-05-17 1722133
2019-05-17 1721723
2019-05-15 1721100
2019-05-15 1719952
2019-05-13 1719589
2019-05-10 1719495
2019-05-08 1719346
2019-05-06 1719019
2019-05-04 1718393
2019-05-02 1718311
2019-05-01 1716716
2019-04-30 1716055
Nothing obvious there.
Some days there will be nothing, if the raw mbox didn't have any
data changes.  Not sure about the multiple entries on the same day - I'd
expect perhaps one of those where the timezone of the host was changed to
UTC, but not sure why multiples are seen for some other days too - that/those
may have been a side-effect of a data copy error (either failed to update
the system disk image, or copied the wrong way 'round - so state files
would be reverted, and an additional copy may occur).
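
For reference, a hypothetical sketch of roughly how such per-revision
stats can be pulled from RCS (this is not the actual sf-lug_mbox_stats
program; the mbox file name/location is assumed):
# list each RCS revision's date (rlog reports dates in UTC by default)
# and the line count of the mbox as of that revision:
rlog mbox |
awk '/^revision /{r=$2} /^date: /{print r, $2}' |
while read -r rev day
do
    printf '%s %s\n' "$day" "$(co -q -p -r"$rev" mbox | wc -l)"
done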

Well, nothing obvious found.

Likely was a VM copy boo-boo (missed doing copy or copied wrong way
'round).

Checking a bit further across domains, I see some residual issues on one
other domain too:
berkeleylug.org. IN SOA ns0.berkeleylug.org.  
Michael\.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800 3600  
1209600 86400 @198.144.194.238 (ns0.berkeleylug.org.)
berkeleylug.org. IN SOA ns0.berkeleylug.org.  
Michael\.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800 3600  
1209600 86400 @2001:470:1f05:19e::4 (ns0.berkeleylug.org.)
berkeleylug.org. IN SOA ns0.berkeleylug.org.  
Michael\.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800 3600  
1209600 86400 @198.144.195.186 (ns1.linuxmafia.com.)
berkeleylug.org. IN SOA ns0.berkeleylug.org.  
michael\.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600  
1209600 86400 @64.62.190.98 (ns1.svlug.org.)
berkeleylug.org. IN SOA ns0.berkeleylug.org.  
michael\.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600  
1209600 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
berkeleylug.org. IN SOA ns0.berkeleylug.org.  
michael\.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600  
1209600 86400 @204.42.254.5 (puck.nether.net.)
berkeleylug.org. IN SOA ns0.berkeleylug.org.  
michael\.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600  
1209600 86400 @2001:418:3f4::5 (puck.nether.net.)
And likewise, the when of those epoch-based timestamps / SNs:
1558682705 2019-05-24T07:25:05+00:00 2019-05-24T00:25:05-07:00
1558799274 2019-05-25T15:47:54+00:00 2019-05-25T08:47:54-07:00
Can we spot anything different in the data besides the SN?
... compared, only differences found were in RRSIG and SOA records
... will proceed to "bump" the S/N to correct that one ... done
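(That comparison can be made roughly as follows - assuming zone transfers
(AXFR) are permitted to the querying host; sorting because record order
may differ:)
dig @198.144.194.238 berkeleylug.org. AXFR | grep -v '^;' | sort > /tmp/soa.master
dig @204.42.254.5    berkeleylug.org. AXFR | grep -v '^;' | sort > /tmp/soa.slave
diff /tmp/soa.master /tmp/soa.slave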

So, ... cause and prevention of issue?
Don't have "smoking gun" - insufficient data available to positively
isolate.

I also examined the data in the /var/log/daemon.log* files.
All the SN data within looks consistent - no errors, no SNs going
"backwards".
More specifically, all the SNs for the impacted zones in those logs
always go forwards, never backwards - specifically, both the unsigned
(within the zone master files themselves) and the signed (+DNSSEC) each
only go forwards, and the signed is always >= the unsigned
(it's only the signed that's seen externally anyway, and that's what we
mostly care about, but in the interests of careful examination),
and each time the unsigned is "bumped" (we semi-manually increase
it), the signed effectively immediately jumps up to that value
(our new unsigned is never <= the old signed - we always
increase it beyond that - as both our semi-manual bump and the +DNSSEC
bump always advance to seconds from epoch).  So, *within the
surviving system image* (and presumably the most recent older image
too, and each step along the way), the image is self-consistent
with *itself*.  It's just that if a copy was missed or done the wrong way
'round, it would be inconsistent with external reality
(it may have served newer data to slaves, then the
image was unintentionally reverted to earlier).

Anyway, that's what we'd expect under the hypothesis that an
image copy was missed or was done the wrong way 'round - the "surviving"
(currently running) image would be self-consistent with itself, etc.,
just not (quite) consistent with external reality, as some changes
on the VM image would've been lost.

So, most likely a VM disk image boo-boo
(e.g. migrated VM but failed to migrate/refresh disk image, or copied
wrong way 'round).

"Prevention" or reducing probablity of same/simlar issues again?

Could possibly add something to check the mtime of the image to prevent
copying the wrong way 'round - but that does nothing to prevent failing to
copy/refresh the image, which could still result in the same issue.
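
E.g., a rough sketch of such a check (hypothetical, not an existing
script) - refuse to copy if the target's image looks newer than the
source's:
# compare mtimes (seconds since epoch) of source and target images
# before overwriting the target; bail out if the target looks newer:
s=$(stat -c %Y /var/local/balug/balug-sda)
t=$(ssh -ax -l root 192.168.55.2 stat -c %Y /var/local/balug/balug-sda)
if [ "$t" -gt "$s" ]; then
    echo 'target image is newer than source - not copying' 1>&2
    exit 1
fi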

Could do a higher-level program to manage those particular migrations;
that would make them more goof-resistant.

# virsh migrate --live --copy-storage-all
would reduce the probability of error, but current Debian stable has a bug
that keeps that from working (it worked in oldstable, and will likely work
in the forthcoming stable).  Might be worth checking if there's a fix for
that in backports, as that would also have the benefit of reducing VM
downtime.
... checking bug data and such, it looks like it's likely fixed in
backports, the forthcoming stable, testing, and unstable.
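
When that's working, the migration would go roughly along these lines
(domain name as in the libvirt log name above; destination URI is an
assumption, per the hosts' private addressing):
# live-migrate the VM, copying its (non-shared) disk storage as part of
# the migration, making it persistent on the target and undefining it
# on the source:
virsh migrate --live --copy-storage-all --persistent --undefinesource \
    balug qemu+ssh://root@192.168.55.2/system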

May or may not be worth doing anything explicitly about - with reasonably
attentive operations it's not at all an impossible error to
repeat, but it's also fairly low probability, and the results aren't
too nastily horrific - though it would of course be better
not to have such boo-boos.  Level-of-effort vs. risk ...
and after examining this DNS data boo-boo, and how it probably happened,
operations will also likely be more attentive, and even less
likely to have the same (type of) boo-boo again (at least in the
nearish future).  Also, longer term, the forthcoming release will become
stable, both physical hosts will get upgraded, and then with
# virsh migrate --live --copy-storage-all
the probability of such an (operator!) error again goes way down.

Other bits - some more slaves (even non-advertised (no NS records)
ones) would also help in logging/analyzing the data (e.g. another
"external reality" point of reference/logging relative to
the VM and its view of "reality" - and the VM might possibly
suffer from, uh, "amnesia", in some (boo-boo - or significant
hardware failure) cases).
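
E.g., on an additional (even unadvertised) slave, the bind9 configuration
for a zone would be roughly like this (slave file path hypothetical):
// slave (secondary) the zone from the master
zone "balug.org" {
        type slave;
        masters { 198.144.194.238; };
        file "slave/balug.org";
};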

Also suggested: thing(s) to (more) carefully watch the logs, and
catch/notify on issues ... but in the above case, with the
presumed cause of the issue, such monitoring on the VM itself wouldn't have
detected any issues ... but monitoring external to the VM
would have caught the issue - e.g. on the nominal physical
host - as that's up *most* of the time (whereas the
alternate physical host spends most of its time down).  Checks
pre- and post-migration could also help thwart such boo-boos.
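
A rough sketch of such an external check (e.g. run from cron on the
nominal physical host; the zone/nameserver lists would be filled out
as appropriate):
# compare the SOA serial served by the master against each slave;
# complain if any slave is ahead of the master (i.e. master went backwards):
zone=balug.org
master=198.144.194.238
m=$(dig +short @"$master" "$zone" SOA | awk '{print $3}')
for slave in 198.144.195.186 64.62.190.98 204.42.254.5
do
    s=$(dig +short @"$slave" "$zone" SOA | awk '{print $3}')
    [ "$s" -gt "$m" ] && echo "WARNING: $zone: slave $slave SN $s > master SN $m"
done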

Also, operator(s) doing the migrations when more attentive,
less rushed/tired/sleepy would also reduce the probability of
error(s).  Having and (strictly) following a well-tested and
debugged checklist would also help (what exists presently is
a pretty good "outline" plus some relevant script bits,
but it's not exactly in a full checklist format that
also takes steps to thwart at least the more probable
boo-boos).

> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
> Subject: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on  
> this 8-O ... "fixed"[1]
> Date: Thu, 30 May 2019 14:28:19 -0700

> Okay, fixed[1].
> I'll dig[2] into it later to investigate how issue did or likely
> occurred.
>
> Thanks Rick for catching that!
>
> footnotes/references/excerpts:
> 1. For certain definitions of "fixed" - serial numbers corrected, didn't
>    check/validate anything else in particular - just "bumped" (updated)
>    them and "pushed" (notified) 'em out, and rechecked 'till they all
>    appeared out there okay on master(s) & slaves.
> $ DNS_SOA_CK balug.org sf-lug.org
> FQDN=balug.org. authority:
> balug.org.              86400   IN      NS      ns1.balug.org.
> balug.org.              86400   IN      NS      ns1.linuxmafia.com.
> balug.org.              86400   IN      NS      ns1.svlug.org.
> balug.org.              86400   IN      NS      puck.nether.net.
> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
> 9000 1800 1814400 86400 @198.144.194.238 (ns1.balug.org.)
> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
> 9000 1800 1814400 86400 @2001:470:1f04:19e::2 (ns1.balug.org.)
> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
> 9000 1800 1814400 86400 @198.144.195.186 (ns1.linuxmafia.com.)
> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
> 9000 1800 1814400 86400 @64.62.190.98 (ns1.svlug.org.)
> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
> 9000 1800 1814400 86400 @2600:3c01::f03c:91ff:fe96:e78e  
> (ns1.svlug.org.)
> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
> 9000 1800 1814400 86400 @204.42.254.5 (puck.nether.net.)
> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
> 9000 1800 1814400 86400 @2001:418:3f4::5 (puck.nether.net.)
> FQDN=sf-lug.org. authority:
> sf-lug.org.             86400   IN      NS      ns.primate.net.
> sf-lug.org.             86400   IN      NS      ns1.linuxmafia.com.
> sf-lug.org.             86400   IN      NS      ns1.sf-lug.org.
> sf-lug.org.             86400   IN      NS      ns1.svlug.org.
> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
> 3600 1209600 86400 @198.144.194.12 (ns.primate.net.)
> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
> 3600 1209600 86400 @2001:470:1f04:51a::2 (ns.primate.net.)
> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
> 3600 1209600 86400 @198.144.195.186 (ns1.linuxmafia.com.)
> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
> 3600 1209600 86400 @198.144.194.238 (ns1.sf-lug.org.)
> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
> 3600 1209600 86400 @2001:470:1f04:19e::2 (ns1.sf-lug.org.)
> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
> 3600 1209600 86400 @64.62.190.98 (ns1.svlug.org.)
> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
> 3600 1209600 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
> $
> 2. pun or the like retroactively intended.  ;-)
> 3. unreferenced footnote.  My DNS_SOA_CK program essentially grabs the
>    "upstream" delegating NS records (at least one instance of 'em),
>    gets the A and AAAA records of all the delegated nameservers,
>    then gets the SOA records from each of those, and displays that
>    data, for the domain(s) specified - or a default set if none specified.
>    (I wrote it semi-recently - I got tired of doing it semi-manually
>    on a semi-frequent basis; very handy for, among other things,
>    also checking that master(s) and slaves are caught up when
>    going through letsencrypt.org wildcard cert validation
>    request via DNS verification; also very handy to see that the
>    delegated nameservers are responding and with the expected
>    data (or at least expected zone S/N)).
>
>> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
>> Subject: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this 8-O
>> Date: Thu, 30 May 2019 13:50:37 -0700
>
>>> From: "Rick Moen" <rick@linuxmafia.com>
>>> Subject: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this
>>> Date: Thu, 30 May 2019 11:20:26 -0700
>>
>>> Magic 8-ball (i.e., logcheck on ns1.linuxmafia.com) says:
>>>
>>> System Events
>>> =-=-=-=-=-=-=
>>> May 30 10:05:36 linuxmafia named[11750]: zone balug.org/IN: serial  
>>> number (1558725628) received from master 198.144.194.238#53 < ours  
>>> (1558799284)
>>> May 30 10:30:43 linuxmafia named[11750]: zone sf-lug.org/IN:  
>>> serial number (1558622463) received from master 198.144.194.238#53  
>>> < ours (1558799278)
>>> May 30 10:32:42 linuxmafia named[11750]: zone balug.org/IN: serial  
>>> number (1558725628) received from master 198.144.194.238#53 < ours  
>>> (1558799284)
>>> May 30 10:57:43 linuxmafia named[11750]: zone balug.org/IN: serial  
>>> number (1558725628) received from master 198.144.194.238#53 < ours  
>>> (1558799284)
>>>
>>> Er?
>>>
>>> Michael, O Great Oracle of the DNS master, before I go expunging the
>>> local cached zone on ns1.linuxmafia.com so as to converge in the master,
>>> any thoughts or desire to act on your end?  Normally, I would expect the
>>> current situation to be _strenuously avoided_ by never taking S/Ns in a
>>> retrograde direction on a zone's DNS master, so I infer that
>>> investigation may be in order (or at least brief discussion).
>>
>> 8-O
>> Opps, ... that should'a never happened.
>> I'll investigate & correct.  Shouldn't require any explicit
>> slave action.
>>
>> I wonder if maybe VM came up that shouldn't have, or ???
>> Anyway, will check into it and correct (might be busy mostly
>> with other stuff 'till about this evening or so, but expect
>> I'll have it rectified by/around then ... might then take
>> wee bit for slaves to follow along & get themselves
>> corrected - but likely pretty fast on that and without
>> explicit slave action needed).



