[BALUG-Admin] Improved uptime/availability of balug VM & improved goof-resistance: Re: Root Cause Analysis (RCA)*: Re: S/N bobble -- waiting for Michael P. on this 8-O ... "fixed"[1]

Michael Paoli Michael.Paoli@cal.berkeley.edu
Thu Jun 6 08:34:54 UTC 2019


So ... managed to accomplish several things which will:
o generally increase the uptime/availability of the balug VM
o reduce the probability of a balug (& other(s)) VM disk image boo-boo,
   notably failing to update the image, or copying it the wrong way
   'round and unintentionally reverting to an older image

Namely, did:
o got live migrations (with --copy-storage-all) working again much more
   fully on Debian GNU/Linux 9[.x] (stretch), approximately (and mostly
   effectively) as they had been working on
   Debian GNU/Linux 8[.x] (jessie) ... and in fact, apparently even
   more reliably than was earlier the case on
   Debian GNU/Linux 8[.x] (jessie)
o Wrote some higher-level live migration (with --copy-storage-all)
   programs, to make the live migration process more goof-resistant.
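
For illustration only - not the actual program(s) - a rough sketch of what
such a goof-resistant wrapper might look like, assuming root ssh trust
between the physical hosts and the qemu+ssh libvirt URI scheme (the default
VM name and the checks shown are hypothetical):

#!/bin/sh
# live-migrate.sh (illustrative sketch): live migrate a VM, disk image and
# all, refusing to proceed if anything looks off.
set -e
VM=${1:-balug}                       # domain name (hypothetical default)
TARGET=${2:?usage: live-migrate.sh VM target-host}
# Only migrate a VM that's actually running on this host.
virsh domstate "$VM" | grep -qx running || {
        echo "$VM is not running on this host; refusing to migrate" >&2
        exit 1
}
# Refuse if the VM already appears to be running on the target.
! ssh -ax -l root "$TARGET" virsh domstate "$VM" 2>/dev/null \
        | grep -qx running || {
        echo "$VM already running on $TARGET; refusing to migrate" >&2
        exit 1
}
# Live migration, copying the backing storage too, using the
# --p2p --tunnelled work-around noted under "Key bits" below.
virsh migrate --verbose --live --p2p --tunnelled --copy-storage-all \
        --persistent --undefinesource "$VM" "qemu+ssh://$TARGET/system"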

So, ... now the balug VM should not need to go down at all to migrate
between physical hosts ... which among other things means no longer
having approximately 6 downtimes per month (when I typically take the
primary physical host off-site ... and back, which means the balug VM
migrating off that physical host ... and back).  Since the live
migrations hadn't been working, each such migration had meant a
shutdown, a cold copy of the disk image, and a reboot.
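
For contrast, the old per-migration cold procedure was roughly the
following (a rough reconstruction - the dd-over-ssh copy matches the shell
history shown in the RCA below; 192.168.55.2 is the other physical host):

# old cold migration (roughly), run on the source physical host:
virsh shutdown balug                 # and wait for it to actually stop
dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 \
        'dd bs=4194304 of=/var/local/balug/balug-sda'
ssh -ax -l root 192.168.55.2 virsh start balug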

Key bits on "solving" that:
o Information on relevant Debian bugs was quite useful/informative,
   most notably:
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658112
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=796122
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=873012
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878299
   The bug information suggested newer versions (e.g. presumably
   available from backports, or by running testing) would correct the
   issue, but it also provided information on effective
   work-around(s), most notably including the options:
   --p2p --tunnelled
   which were "good enough" in this case (they require a trust
   relationship between the physical hosts; see the wrapper sketch
   above for an example invocation).
   I may also test further (on other test VM(s)) to see if that's
   actually needed or not.
o At least after using the above information, was able to successfully
   do live --copy-storage-all migrations of a test VM - and that worked
   fine both ways between the physicals.  But alas, when that was done
   with the balug VM, it worked in one direction, but "failed" in the
   other (the migration itself appeared to go fine, but the VM ended up
   wedged/hung on the target physical - even though its status showed as
   "running").  But at that point, since it worked with the other VM both
   ways, it was down to a divide-and-conquer troubleshooting isolation
   issue.
o Eventually worked it out to be the virtual hardware configuration.  By
   changing that, was able to successfully migrate both ways.
   Did at least 6 round-trip migrations of a balug.test VM that
   was mostly a clone of the balug VM (with relevant services disabled
   and networking changed, but otherwise, at least initially, essentially
   identical).  Once that was worked out, applied the same to the balug
   VM, and tested live migration - both ways - and it worked fine.
o Virtual hardware ... mostly just saved the config, blew away the config,
   recreated the virtual machine using the "import" capability to build it
   around the existing disk image, but keeping the Ethernet MAC and
   networking connectivity the same.  Then adjusted the virtual CPU -
   disabling capabilities not supported in the VM environment of the other
   physical due to its different physical CPU(s).  After that, all works
   fine with the migrations of the balug VM both ways.  Note also -
   (virtual) CPU - that's apparently (mostly) not the hang issue - the VM
   software would report the incompatibilities, and refuse to migrate in
   those cases ... it never got as far as a "hang" in those cases.  (By
   default the software creates a virtual CPU that's close in capabilities
   to the physical host CPU(s).)
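
Roughly - and only as an illustrative sketch, not the exact steps/XML
used - that rebuild-around-the-disk-image plus CPU adjustment might look
something like the following (the memory/vCPU sizes, bridge name, MAC
placeholder, OS variant, and the particular CPU model/feature shown are
hypothetical; the VM name and disk path are those from the history below):

# with the VM shut down: save the old definition, then remove it
# (keeping the disk image!):
virsh dumpxml balug > /var/local/balug/balug.xml.save
virsh undefine balug
# re-create ("import") a definition around the existing disk image,
# keeping the VM's existing MAC / network connectivity (placeholder shown):
virt-install --import --name balug \
        --memory 2048 --vcpus 2 \
        --disk path=/var/local/balug/balug-sda \
        --network bridge=br0,mac=52:54:00:xx:xx:xx \
        --os-variant debian9 --noautoconsole
# then virsh edit balug, and constrain the virtual CPU to a model/feature
# set both physical hosts can provide, e.g. something along the lines of:
#   <cpu mode='custom' match='exact'>
#     <model fallback='forbid'>Westmere</model>
#     <feature policy='disable' name='xsave'/>
#   </cpu>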

References:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658112
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=796122
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=873012
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878299
https://lists.balug.org/pipermail/balug-admin/2019-May/000984.html
https://lists.balug.org/pipermail/balug-admin/2019-May/000985.html
https://lists.balug.org/pipermail/balug-admin/2019-May/000986.html
https://lists.balug.org/pipermail/balug-admin/2019-June/000988.html

> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
> Subject: Root Cause Analysis (RCA)*: Re: [BALUG-Admin] S/N bobble --  
> waiting for Michael P. on this 8-O ... "fixed"[1]
> Date: Fri, 31 May 2019 21:30:56 -0700

> Root Cause Analysis (RCA)*
> *or at least as reasonable an attempt thereof as feasible.
> Analysis/data/etc. below (at least mostly as relevant),
> then excerpts of at least much of the earlier emails closely related
> to the event.
>
> "Root Cause Analysis" (RCA) of DNS serial number issues
> noted & reported 2019-05-30 US/Pacific
>
> Executive summary:
> Likely a boo-boo in (lack of, or wrong way 'round) copying of VM disk
> images from (physical) host to host when doing a (cold) migration of the
> VM from one host to the other.
> "Prevention" or reducing probability of same/similar issues again?
> See the last several paragraphs or so under the line matching the above.
>
> Analysis, conclusions (as feasible), and possible "corrective"/preventive
> information follows:
>
> relevant history excerpts of shell session typically used for disk image
> copies of VM between physical hosts for migrations:
> 534      dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
> 538      dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
> I believe the first was repeated, as I'd unintentionally started copy
> from source to target before stopping source - so that would be an
> invalid copy; then restarted copy after stopping source
> 545      ssh -ax -l root 192.168.55.2 'dd bs=4194304 if=/var/local/balug/balug-sda' | time dd bs=4194304 of=/var/local/balug/balug-sda
> 552      dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
> 556      ssh -ax -l root 192.168.55.2 'dd bs=4194304 if=/var/local/balug/balug-sda' | time dd bs=4194304 of=/var/local/balug/balug-sda
>
> at the time of examination, the host is the nominal physical host
>
> Checking the file://linuxmafia.com/var/log/daemon.log* files,
> it looks like the domains:
> balug.org
> sf-lug.com
> sf-lug.org
> sflug.org
> were impacted,
> all okay with serial numbers now matching between master(s) & slaves,
> after manual correction on balug.org and sf-lug.org,
> and looks like sf-lug.com and sflug.org were corrected earlier (probably
> as zones were routinely updated/maintained).
> Oldest issue detection still in the linuxmafia.com log files examined:
> May 25 09:23:51
> last reported in those logs:
> May 30 13:54:57
> linuxmafia.com timestamps from those are in timezone America/Los_Angeles
> excluding timestamps, looking at the unique S/N complaints, we have:
> balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
> sf-lug.com/IN: serial number (1558125318) received from master 198.144.194.238#53 < ours (1558799275)
> sf-lug.org/IN: serial number (1558622463) received from master 198.144.194.238#53 < ours (1558799278)
> sflug.org/IN: serial number (1557434452) received from master 198.144.194.238#53 < ours (1557834269)
> What times do those correspond to?  Well, we're using seconds from
> epoch when we update our S/N, so ...
> for t in 1557434452 1557834269 1558125318 1558622463 1558725628 \
>         1558799275 1558799278 1558799284
> do
>         echo "$t" $(TZ=GMT0 date -Iseconds -d @"$t") \
>                 $(TZ=US/Pacific date -Iseconds -d @"$t")
> done
> 1557434452 2019-05-09T20:40:52+00:00 2019-05-09T13:40:52-07:00 sflug.org
> 1557834269 2019-05-14T11:44:29+00:00 2019-05-14T04:44:29-07:00 sflug.org *
> 1558125318 2019-05-17T20:35:18+00:00 2019-05-17T13:35:18-07:00 sf-lug.com
> 1558622463 2019-05-23T14:41:03+00:00 2019-05-23T07:41:03-07:00 sf-lug.org
> 1558725628 2019-05-24T19:20:28+00:00 2019-05-24T12:20:28-07:00 balug.org
> 1558799275 2019-05-25T15:47:55+00:00 2019-05-25T08:47:55-07:00 sf-lug.com *
> 1558799278 2019-05-25T15:47:58+00:00 2019-05-25T08:47:58-07:00 sf-lug.org *
> 1558799284 2019-05-25T15:48:04+00:00 2019-05-25T08:48:04-07:00 balug.org *
> I added the domains to the above, and marked with * the later S/N on each
>       May 2019
> Su Mo Tu We Th Fr Sa
>           1  2  3  4
>  5  6  7  8  9 10 11
> 12 13 14 15 16 17 18
> 19 20 21 22 23 24 25
> 26 27 28 29 30 31
> Not all changes are necessarily checked into version control (e.g. RCS),
> however, many are.  Ones that aren't are often short-term interim
> changes, e.g. temporary changes for letsencrypt wildcard certs
> validation via DNS.
> Let's see what we have in the currently running VM ... but note that what
> the nameserver serves up might not match, as DNSSEC signing is added atop
> that and may increase the S/N; notably, in our bind9 configuration for
> these master zones, we have:
>         inline-signing yes;
>         auto-dnssec maintain;
>         serial-update-method unixtime;
> With that there, and since our "hand" updated SNs (those in master
> zone files themselves) are (or at least should be) generated via:
> date +%s
> we expect the SNs in the zone master files, and those served by
> the nameserver, to reflect when they were last updated in
> seconds since the epoch (or at least quite close to that - there
> may be modest delay between the data being altered and the newer
> data being (re)loaded or served up).
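
One quick way to spot-check that expectation (illustrative only, using the
master's address seen in the logs below; the S/N is field 3 of the SOA):

$ dig +short SOA balug.org @198.144.194.238 | awk '{print $3}'   # S/N as served
$ TZ=GMT0 date -Iseconds -d @"$(dig +short SOA balug.org @198.144.194.238 | awk '{print $3}')"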
>
> domains, RCS versions and S/N line from zone master files over relevant
> time/SN ranges:
>
> balug.org
> 1.82 1559249859; SERIAL; date +%s
> 1.81 1557834269; SERIAL; date +%s
> 1.80 1550844926; SERIAL; date +%s
>
> sf-lug.org
> 1.49 1559249874 ; SERIAL ; date +%s
> 1.48 1557834269 ; SERIAL ; date +%s
> 1.47 1550845043 ; SERIAL ; date +%s
>
> sf-lug.com
> 1.49 1557834269 ; SERIAL ; date +%s
> 1.48 1550845063 ; SERIAL ; date +%s
>
> sflug.org
> 1.4 1557834269 ; SERIAL ; date +%s
> 1.3 1557434452 ; SERIAL ; date +%s
>
> So, the VM - on the two hosts.  Nominally it is only run on one at any
> given time (same IP address, and other stateful stuff).  However, it's
> okay to run a 2nd copy disconnected (virtually or otherwise) from the
> network, or in single-user mode (or equivalent) - so that may also
> sometimes be done (e.g. to examine slightly older data, or to test
> something, etc.)
>
> Log bits from both physical hosts below:
> from file://tigger/var/log/libvirt/qemu/balug.log we have:
> 2019-04-29 06:04:16.775+0000: starting up
> 2019-05-12 15:01:56.595+0000: shutting down, reason=shutdown
> 2019-05-13 00:32:26.019+0000: starting up
> 2019-05-25 07:38:01.482+0000: starting up
> 2019-05-25 16:42:59.970+0000: shutting down, reason=shutdown
> 2019-05-26 10:50:34.956+0000: starting up
> 2019-05-26 16:26:07.416+0000: shutting down, reason=shutdown
> 2019-05-26 23:19:50.635+0000: starting up
>
> from file://vicki/var/log/libvirt/qemu/balug.log we have:
> 2019-04-29 05:57:14.081+0000: shutting down, reason=shutdown
> 2019-05-12 15:07:50.838+0000: starting up
> 2019-05-13 00:24:13.725+0000: shutting down, reason=shutdown
> 2019-05-14 02:51:10.547+0000: starting up
> 2019-05-14 03:32:32.323+0000: shutting down, reason=destroyed
> 2019-05-25 15:45:36.856+0000: starting up
> 2019-05-25 16:28:53.723+0000: shutting down, reason=shutdown
> 2019-05-25 16:34:45.054+0000: starting up
> 2019-05-25 16:49:33.468+0000: shutting down, reason=destroyed
> 2019-05-25 16:56:14.352+0000: starting up
> 2019-05-26 08:28:08.171+0000: shutting down, reason=shutdown
> 2019-05-26 16:33:39.196+0000: starting up
> 2019-05-26 23:12:43.082+0000: shutting down, reason=shutdown
> 2019-05-31 03:39:45.011+0000: starting up
>
> taking the above, prefixing with a letter to distinguish the physical
> hosts, and sorting by time; I mark with * where there's overlap:
> v 2019-04-29 05:57:14.081+0000: shutting down, reason=shutdown
> t 2019-04-29 06:04:16.775+0000: starting up
> t 2019-05-12 15:01:56.595+0000: shutting down, reason=shutdown
> v 2019-05-12 15:07:50.838+0000: starting up
> v 2019-05-13 00:24:13.725+0000: shutting down, reason=shutdown
> t 2019-05-13 00:32:26.019+0000: starting up
> v 2019-05-14 02:51:10.547+0000: starting up *
> v 2019-05-14 03:32:32.323+0000: shutting down, reason=destroyed
> t 2019-05-25 07:38:01.482+0000: starting up
> v 2019-05-25 15:45:36.856+0000: starting up *
> v 2019-05-25 16:28:53.723+0000: shutting down, reason=shutdown
> v 2019-05-25 16:34:45.054+0000: starting up *
> t 2019-05-25 16:42:59.970+0000: shutting down, reason=shutdown
> v 2019-05-25 16:49:33.468+0000: shutting down, reason=destroyed
> v 2019-05-25 16:56:14.352+0000: starting up
> v 2019-05-26 08:28:08.171+0000: shutting down, reason=shutdown
> t 2019-05-26 10:50:34.956+0000: starting up
> t 2019-05-26 16:26:07.416+0000: shutting down, reason=shutdown
> v 2019-05-26 16:33:39.196+0000: starting up
> v 2019-05-26 23:12:43.082+0000: shutting down, reason=shutdown
> t 2019-05-26 23:19:50.635+0000: starting up
> v 2019-05-31 03:39:45.011+0000: starting up * (current examination,
>                                                in single user mode
>                                                or equivalent)
>
> Kind'a doubt I directly goofed the zone SNs; thinking more likely a
> VM image shuffle boo-boo, e.g. didn't copy the latest image before
> bringing it up on the other physical, or maybe even possibly copied
> the wrong way 'round (less likely); in either case, some data would
> be a bit older than it should have been, and the VM host would effectively
> take a step back in time regarding its data (but not its system time).
>
> There's some sf-lug stuff that's backed up nominally overnightly,
> that has version control (RCS) on it.  If we had an image anomaly
> where there was a goof of not copying over the most current,
> or copying older atop newer,
> the sf-lug RCS may show a jump/gap - e.g. in its
> tracking of the mbox changes.
>
> $ sf-lug_mbox_stats
> YYYY-MM-DD (in UTC) and # of lines in mbox file
> 2019-05-28 1734690
> 2019-05-26 1732849
> 2019-05-25 1732221
> 2019-05-25 1732150
> 2019-05-24 1731781
> 2019-05-23 1729078
> 2019-05-22 1727972
> 2019-05-21 1727109
> 2019-05-19 1723626
> 2019-05-17 1722133
> 2019-05-17 1721723
> 2019-05-15 1721100
> 2019-05-15 1719952
> 2019-05-13 1719589
> 2019-05-10 1719495
> 2019-05-08 1719346
> 2019-05-06 1719019
> 2019-05-04 1718393
> 2019-05-02 1718311
> 2019-05-01 1716716
> 2019-04-30 1716055
> Nothing obvious there.
> Some days there will be nothing, if the raw mbox didn't have any
> data changes.  Not sure about the multiple entries on the same day - I'd
> expect perhaps one of those where the timezone of the host was changed to
> UTC, but not sure why multiples of those are seen for some other days too
> - that/those may have been a side-effect of a data copy error (either
> failed to update the system disk image, or copied the wrong way 'round -
> so state files would be reverted, and an additional copy may occur).
>
> Well, nothing obvious found.
>
> Likely was a VM copy boo-boo (missed doing copy or copied wrong way
> 'round).
>
> Checking a bit further across domains, see some residual issues on one
> other domain too:
> berkeleylug.org. IN SOA ns0.berkeleylug.org.  
> Michael\.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800  
> 3600 1209600 86400 @198.144.194.238 (ns0.berkeleylug.org.)
> berkeleylug.org. IN SOA ns0.berkeleylug.org.  
> Michael\.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800  
> 3600 1209600 86400 @2001:470:1f05:19e::4 (ns0.berkeleylug.org.)
> berkeleylug.org. IN SOA ns0.berkeleylug.org.  
> Michael\.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800  
> 3600 1209600 86400 @198.144.195.186 (ns1.linuxmafia.com.)
> berkeleylug.org. IN SOA ns0.berkeleylug.org.  
> michael\.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800  
> 3600 1209600 86400 @64.62.190.98 (ns1.svlug.org.)
> berkeleylug.org. IN SOA ns0.berkeleylug.org.  
> michael\.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800  
> 3600 1209600 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
> berkeleylug.org. IN SOA ns0.berkeleylug.org.  
> michael\.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800  
> 3600 1209600 86400 @204.42.254.5 (puck.nether.net.)
> berkeleylug.org. IN SOA ns0.berkeleylug.org.  
> michael\.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800  
> 3600 1209600 86400 @2001:418:3f4::5 (puck.nether.net.)
> And likewise the when of those epoch-based timestamps / SNs:
> 1558682705 2019-05-24T07:25:05+00:00 2019-05-24T00:25:05-07:00
> 1558799274 2019-05-25T15:47:54+00:00 2019-05-25T08:47:54-07:00
> Can we spot anything different in the data besides the SN?
> ... compared, only differences found were in RRSIG and SOA records
> ... will proceed to "bump" the S/N to correct that one ... done
>
> So, ... cause and prevention of issue?
> Don't have "smoking gun" - insufficient data available to positively
> isolate.
>
> I also examined the /var/log/daemon.log* files data.
> All the SN data within looks consistent - no errors, no SNs going
> "backwards".
> More specifically, all the SNs for the impacted zones in those logs
> always go forwards, never backwards, specifically both the unsigned
> (within zone master files themselves) and signed (+DNSSEC) each
> only go forwards, and the signed is always >= the unsigned,
> (it's only the signed that's seen externally anyway and that we
> mostly care about, but in the interests of careful examination),
> and each time the unsigned is "bumped" (we semi-manually increase
> it), the signed effectively immediately jumps up to that value
> (our new unsigned is never <= the old signed, but we always
> increase it beyond that - as both our semi-manual and +DNSSEC
> bump always advance to seconds from epoch).  So, *within the
> surviving system image* (and presuming the most recent older
> too, and each step along the way), the image is self-consistent
> with *itself*.  It's just if copy was missed or done wrong way
> 'round, it would be inconsistent with external reality
> (it may have served newer to slaves, then unintentionally
> image was reverted to earlier).
>
> Anyway, that's what we'd expect under hypothesis that an
> image copy was missed or was done wrong way 'round - the "surviving"
> (current running) image would be self-consistent with itself, etc.,
> just not (quite) consistent with external reality, as some changes
> on the VM image would've been lost.
>
> So, most likely a VM disk image boo-boo
> (e.g. migrated VM but failed to migrate/refresh disk image, or copied
> wrong way 'round).
>
> "Prevention" or reducing probability of same/similar issues again?
>
> Could possibly add something to check the mtime of the image to prevent
> copying the wrong way 'round - but that does nothing to prevent failing
> to copy/refresh the image, which could still result in the same issue.
>
> Could do a higher-level program to manage those particular migrations,
> which would make the process more goof-resistant.
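
That's essentially what the wrapper sketched near the top of this message
now does; for the cold-copy case, an mtime guard might have looked
something like this (illustrative only - path and peer address as in the
shell history above):

# refuse to overwrite a newer image with an older one (illustrative sketch):
LOCAL=/var/local/balug/balug-sda
REMOTE_MTIME=$(ssh -ax -l root 192.168.55.2 stat -c %Y "$LOCAL")
LOCAL_MTIME=$(stat -c %Y "$LOCAL")
if [ "$LOCAL_MTIME" -le "$REMOTE_MTIME" ]; then
        echo "local image is not newer than remote - refusing to copy" >&2
        exit 1
fi
dd bs=4194304 if="$LOCAL" | ssh -ax -l root 192.168.55.2 "dd bs=4194304 of=$LOCAL"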
>
> # virsh migrate --live --copy-storage-all
> would reduce probability of error, but current Debian stable has a bug
> that keeps that from working (worked in oldstable, likely will work in the
> forthcoming stable).  Might be worth checking if there's a fix for that
> in backports, as that would also have the benefit of reducing VM downtime.
> ... checking bug data and such, looks like it's likely fixed in
> backports, the forthcoming stable, testing, and unstable.
>
> May or may not be worth doing anything explicitly about - with reasonably
> attentive operations, it's not at all an impossible error to
> repeat, but it's also fairly low probability, and the results aren't
> too nastily horrific - though it would of course be better
> to not have such boo-boos.  Level-of-effort vs. risk ...
> and after examining this DNS data boo-boo, and how it probably happened,
> operations will also likely be more attentive, and even less
> likely to have the same (type of) boo-boo again (at least in the
> nearish future).  Also, longer term, the forthcoming release will become
> stable, both physical hosts will get upgraded, and then with
> # virsh migrate --live --copy-storage-all
> the probability of such an (operator!) error again goes way down.
>
> Other bits - some more slaves (even non-advertised - no NS records - for
> same) would also help in logging/analyzing data (e.g. another
> "external reality" point of reference/logging relative to
> the VM and its view of "reality" - which might possibly
> suffer from, uh, "amnesia", in some (boo-boo - or significant
> hardware failure) cases).
>
> Also suggested: thing(s) to (more) carefully watch logs, and
> catch/notify on issues ... but in the above case, and with the
> presumed cause of the issue, such on the VM itself wouldn't have
> detected any issues ... but monitoring external to the VM
> would have caught the issue - e.g. on the nominal physical
> host - as that's up *most* of the time (whereas the
> alternate physical spends most of its time down).  Checks
> pre- and post-migration could also help thwart such boo-boos.
>
> Also, operator(s) doing the migrations when more attentive,
> less rushed/tired/sleepy would also reduce the probability of
> error(s).  Having and (strictly) following a well-tested and
> debugged checklist would also help (what exists presently is
> a pretty good "outline" plus some relevant script bits,
> but it's not exactly in a full checklist format that would
> also take steps to thwart at least the more probable
> boo-boos).
>
>> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
>> Subject: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on  
>> this 8-O ... "fixed"[1]
>> Date: Thu, 30 May 2019 14:28:19 -0700
>
>> Okay, fixed[1].
>> I'll dig[2] into it later to investigate how the issue occurred (or
>> likely occurred).
>>
>> Thanks Rick for catching that!
>>
>> footnotes/references/excerpts:
>> 1. For certain definitions of "fixed" - serial numbers corrected, didn't
>>   check/validate anything else in particular - just "bumped" (updated)
>>   them and "pushed" (notified) 'em out, and rechecked 'till they all
>>   appeared out there okay on master(s) & slaves.
>> $ DNS_SOA_CK balug.org sf-lug.org
>> FQDN=balug.org. authority:
>> balug.org.              86400   IN      NS      ns1.balug.org.
>> balug.org.              86400   IN      NS      ns1.linuxmafia.com.
>> balug.org.              86400   IN      NS      ns1.svlug.org.
>> balug.org.              86400   IN      NS      puck.nether.net.
>> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
>> 9000 1800 1814400 86400 @198.144.194.238 (ns1.balug.org.)
>> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
>> 9000 1800 1814400 86400 @2001:470:1f04:19e::2 (ns1.balug.org.)
>> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
>> 9000 1800 1814400 86400 @198.144.195.186 (ns1.linuxmafia.com.)
>> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
>> 9000 1800 1814400 86400 @64.62.190.98 (ns1.svlug.org.)
>> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
>> 9000 1800 1814400 86400 @2600:3c01::f03c:91ff:fe96:e78e  
>> (ns1.svlug.org.)
>> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
>> 9000 1800 1814400 86400 @204.42.254.5 (puck.nether.net.)
>> balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859  
>> 9000 1800 1814400 86400 @2001:418:3f4::5 (puck.nether.net.)
>> FQDN=sf-lug.org. authority:
>> sf-lug.org.             86400   IN      NS      ns.primate.net.
>> sf-lug.org.             86400   IN      NS      ns1.linuxmafia.com.
>> sf-lug.org.             86400   IN      NS      ns1.sf-lug.org.
>> sf-lug.org.             86400   IN      NS      ns1.svlug.org.
>> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
>> 3600 1209600 86400 @198.144.194.12 (ns.primate.net.)
>> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
>> 3600 1209600 86400 @2001:470:1f04:51a::2 (ns.primate.net.)
>> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
>> 3600 1209600 86400 @198.144.195.186 (ns1.linuxmafia.com.)
>> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
>> 3600 1209600 86400 @198.144.194.238 (ns1.sf-lug.org.)
>> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
>> 3600 1209600 86400 @2001:470:1f04:19e::2 (ns1.sf-lug.org.)
>> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
>> 3600 1209600 86400 @64.62.190.98 (ns1.svlug.org.)
>> sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800  
>> 3600 1209600 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
>> $
>> 2. pun or the like retroactively intended.  ;-)
>> 3. unreferenced footnote.  My DNS_SOA_CK program essentially grabs the
>>   "upstream" delegating NS records (at least one instance of 'em),
>>   gets the A and AAAA records of all the delegated nameservers,
>>   then gets the SOA records from each of those, and displays that
>>   data, for the domain(s) specified - or a default set if none specified.
>>   (I wrote it semi-recently - I got tired of doing it semi-manually
>>   on a semi-frequent basis; very handy for, among other things,
>>   also checking that master(s) and slaves are caught up when
>>   going through letsencrypt.org wildcard cert validation
>>   request via DNS verification; also very handy to see that the
>>   delegated nameservers are responding and with the expected
>>   data (or at least expected zone S/N)).
>>
>>> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
>>> Subject: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this 8-O
>>> Date: Thu, 30 May 2019 13:50:37 -0700
>>
>>>> From: "Rick Moen" <rick@linuxmafia.com>
>>>> Subject: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this
>>>> Date: Thu, 30 May 2019 11:20:26 -0700
>>>
>>>> Magic 8-ball (i.e., logcheck on ns1.linuxmafia.com) says:
>>>>
>>>> System Events
>>>> =-=-=-=-=-=-=
>>>> May 30 10:05:36 linuxmafia named[11750]: zone balug.org/IN:  
>>>> serial number (1558725628) received from master  
>>>> 198.144.194.238#53 < ours (1558799284)
>>>> May 30 10:30:43 linuxmafia named[11750]: zone sf-lug.org/IN:  
>>>> serial number (1558622463) received from master  
>>>> 198.144.194.238#53 < ours (1558799278)
>>>> May 30 10:32:42 linuxmafia named[11750]: zone balug.org/IN:  
>>>> serial number (1558725628) received from master  
>>>> 198.144.194.238#53 < ours (1558799284)
>>>> May 30 10:57:43 linuxmafia named[11750]: zone balug.org/IN:  
>>>> serial number (1558725628) received from master  
>>>> 198.144.194.238#53 < ours (1558799284)
>>>>
>>>> Er?
>>>>
>>>> Michael, O Great Oracle of the DNS master, before I go expunging the
>>>> local cached zone on ns1.linuxmafia.com so as to converge in the master,
>>>> any thoughts or desire to act on your end?  Normally, I would expect the
>>>> current situation to be _strenuously avoided_ by never taking S/Ns in a
>>>> retrograde direction on a zone's DNS master, so I infer that
>>>> investigation may be in order (or at least brief discussion).
>>>
>>> 8-O
>>> Oops, ... that should'a never happened.
>>> I'll investigate & correct.  Shouldn't require any explicit
>>> slave action.
>>>
>>> I wonder if maybe a VM came up that shouldn't have, or ???
>>> Anyway, will check into it and correct (might be busy mostly
>>> with other stuff 'till about this evening or so, but expect
>>> I'll have it rectified by/around then ... might then take a
>>> wee bit for slaves to follow along & get themselves
>>> corrected - but likely pretty fast on that and without
>>> explicit slave action needed).



