So ... managed to accomplish several things which will:
o generally increase the uptime/availability of the balug VM
o reduce probability of balug (& other(s)) VM disk image boo-boo, notably
  failing to update or copying the wrong way 'round and unintentionally
  reverting to older image
Namely, did:
o got live migrations working again much more fully on Debian GNU/Linux 9[.x]
  (stretch), approximately (and mostly effectively) as they had been working
  on Debian GNU/Linux 8[.x] (jessie), and with --copy-storage-all ... and in
  fact, apparently even more reliably than was the earlier case on Debian
  GNU/Linux 8[.x] (jessie)
o wrote some higher-level live migration (with --copy-storage-all) programs,
  to make the live migration process more goof-resistant.
So, ... now the balug VM should not need to go down at all to migrate between
physical hosts ... which among other things means no longer having
approximately 6 downtimes per month (when I typically take the primary
physical host off-site ... and back, which means the balug VM doing a
migration off that physical host ... and back ... since the live migrations
hadn't been working, it was a shutdown, cold copy of the disk image, and
reboot, for each migration).
Key bits on "solving" that:
o Information on relevant Debian bugs was quite useful/informative, most
  notably:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658112
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=796122
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=873012
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878299
  The bug information suggested newer versions (e.g. presumably available from
  backports, or by running testing) would correct the issue, but it also
  provided information on effective work-around(s), most notably including the
  options:
    --p2p --tunnelled
  which were "good enough" in this case (requires a trust relationship between
  the physical hosts).  I may also test further (on other test VM(s)) to see
  if that's actually needed or not.  (A hedged example invocation is sketched
  just below, after this list.)
o At least after using the above information, was able to successfully do
  live --copy-storage-all migrations of a test VM - and it worked fine both
  ways between the physicals.  But alas, when that was done with the balug VM,
  it worked in one direction, but "failed" in the other (the migration itself
  appeared to go fine, but the VM ended up wedged/hung on the target physical
  - even though its status showed as "running").  But at that point, since it
  worked with the other VM both ways, it was down to a divide-and-conquer
  troubleshooting isolation issue.
o Eventually worked it out to be the virtual hardware configuration.  By
  changing that, was able to successfully migrate both ways.  Did at least 6
  round-trip migrations of a balug.test VM that was mostly a clone of the
  balug VM (with relevant services disabled and networking changed, otherwise
  at least initially highly identical).  Once that was worked out, applied it
  to the balug VM, and tested live migration - both ways - and it worked fine.
o Virtual hardware ... mostly just saved the config, blew away the config,
  recreated the virtual machine, using the "import" capability to build it
  around the disk image, but keeping the Ethernet MAC and networking
  connectivity the same.  Then adjusted the virtual CPU - disabling
  capabilities not supported in the VM environment of the other physical due
  to different physical CPU(s).  After that, all works fine with the
  migrations both ways of the balug VM.  Note also - (virtual) CPU - that's
  apparently (mostly) not the hang issue - the VM software would report the
  incompatibilities, and refuse to migrate in those cases ... it never got as
  far as a "hang" in those cases.  (By default the software creates a virtual
  CPU that's close in capabilities to the physical host CPU(s).)
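For concreteness, a hedged sketch of roughly what such an invocation looks
like - the domain name and destination URI here are illustrative, and the
exact option set actually used may have differed:

  # hedged sketch - live migration with storage copy, using the
  # --p2p --tunnelled work-around noted in the bug reports;
  # "balug" and the destination URI are illustrative
  virsh migrate --live --copy-storage-all --p2p --tunnelled \
      balug qemu+ssh://root@192.168.55.2/system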
References:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658112
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=796122
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=873012
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878299
https://lists.balug.org/pipermail/balug-admin/2019-May/000984.html
https://lists.balug.org/pipermail/balug-admin/2019-May/000985.html
https://lists.balug.org/pipermail/balug-admin/2019-May/000986.html
https://lists.balug.org/pipermail/balug-admin/2019-June/000988.html
From: "Michael Paoli" Michael.Paoli@cal.berkeley.edu Subject: Root Cause Analysis (RCA)*: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this 8-O ... "fixed"[1] Date: Fri, 31 May 2019 21:30:56 -0700
Root Cause Analysis (RCA)*
*or at least as reasonable an attempt (or attempts) thereof as feasible.
Analysis/data/etc. below (at least mostly as relevant), then excerpts of at
least much of the earlier emails closely related to the event.
"Root Cause Analysis" (RCA) of DNS serial number issues noted & reported 2019-05-30 US/Pacific
Executive summary: Likely a boo-boo in (lack of, or wrong-way-'round) copying
of VM disk images from (physical) host to host when doing a (cold) migration
of the VM from one host to the other.  "Prevention" or reducing the
probability of the same/similar issues again?  See the last several paragraphs
or so under the line matching the above.
Analysis, conclusions (as feasible), and possible "corrective"/preventive information follows:
relevant history excerpts of shell session typically used for disk image
copies of VM between physical hosts for migrations:
  534  dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
  538  dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
I believe the first was repeated, as I'd unintentionally started copy from
source to target before stopping source - so that would be an invalid copy;
then restarted copy after stopping source
  545  ssh -ax -l root 192.168.55.2 'dd bs=4194304 if=/var/local/balug/balug-sda' | time dd bs=4194304 of=/var/local/balug/balug-sda
  552  dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
  556  ssh -ax -l root 192.168.55.2 'dd bs=4194304 if=/var/local/balug/balug-sda' | time dd bs=4194304 of=/var/local/balug/balug-sda
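As a goof-resistance aside (not part of the history above - just a hedged
sketch): after a cold dd-over-ssh copy like those, one could confirm the two
images actually match before starting the VM on the target; paths and target
host here are taken from the history excerpts and are otherwise illustrative:

  # hedged sketch - compare checksums of the image on both hosts;
  # run with the VM shut down on both hosts
  local_sum=$(sha256sum /var/local/balug/balug-sda | awk '{print $1}')
  remote_sum=$(ssh -ax -l root 192.168.55.2 \
      'sha256sum /var/local/balug/balug-sda' | awk '{print $1}')
  if [ "$local_sum" = "$remote_sum" ]; then
      echo 'disk images match'
  else
      echo "MISMATCH: $local_sum != $remote_sum" 1>&2
  fi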
at time of examination, the host is the nominal physical host
Checking on file://linuxmafia.com/var/log/daemon.log* files,
looks like domains:
  balug.org
  sf-lug.com
  sf-lug.org
  sflug.org
were impacted; all okay with serial numbers now matching between master(s) &
slaves, after manual correction on balug.org and sf-lug.org, and it looks like
sf-lug.com and sflug.org were corrected earlier (probably as zones were
routinely updated/maintained).
Oldest issue detection still in linuxmafia.com log files examined:
  May 25 09:23:51
last reported in those logs:
  May 30 13:54:57
linuxmafia.com timestamps from those are timezone America/Los_Angeles
excluding timestamps, looking at unique S/N complaints, we have:
  balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
  sf-lug.com/IN: serial number (1558125318) received from master 198.144.194.238#53 < ours (1558799275)
  sf-lug.org/IN: serial number (1558622463) received from master 198.144.194.238#53 < ours (1558799278)
  sflug.org/IN: serial number (1557434452) received from master 198.144.194.238#53 < ours (1557834269)
What times do those correspond to?  Well, we're using seconds from the epoch
when we update our S/N, so ...
  for t in 1557434452 1557834269 1558125318 1558622463 1558725628 \
    1558799275 1558799278 1558799284
  do
    echo "$t" $(TZ=GMT0 date -Iseconds -d @"$t") $(TZ=US/Pacific date -Iseconds -d @"$t")
  done
  1557434452 2019-05-09T20:40:52+00:00 2019-05-09T13:40:52-07:00 sflug.org
  1557834269 2019-05-14T11:44:29+00:00 2019-05-14T04:44:29-07:00 sflug.org *
  1558125318 2019-05-17T20:35:18+00:00 2019-05-17T13:35:18-07:00 sf-lug.com
  1558622463 2019-05-23T14:41:03+00:00 2019-05-23T07:41:03-07:00 sf-lug.org
  1558725628 2019-05-24T19:20:28+00:00 2019-05-24T12:20:28-07:00 balug.org
  1558799275 2019-05-25T15:47:55+00:00 2019-05-25T08:47:55-07:00 sf-lug.com *
  1558799278 2019-05-25T15:47:58+00:00 2019-05-25T08:47:58-07:00 sf-lug.org *
  1558799284 2019-05-25T15:48:04+00:00 2019-05-25T08:48:04-07:00 balug.org *
I added the domains to the above, and marked with * the later S/N on each.
       May 2019
  Su Mo Tu We Th Fr Sa
            1  2  3  4
   5  6  7  8  9 10 11
  12 13 14 15 16 17 18
  19 20 21 22 23 24 25
  26 27 28 29 30 31
Not all changes are necessarily checked into version control (e.g. RCS),
however, many are.  Ones that aren't are often short-term interim changes,
e.g. temporary changes for letsencrypt wildcard certs validation via DNS.
Let's see what we have in the current running VM ... but note that what the
nameserver serves up might not match, as DNSSEC is later added to that and may
increase the S/N; notably, in our bind9 configuration for these master zones,
we have:
  inline-signing yes;
  auto-dnssec maintain;
  serial-update-method unixtime;
With that there, and since our "hand" updated SNs (those in the master zone
files themselves) are (or at least should be) generated via:
  date +%s
we expect the SNs in the zone master files, and those served by the
nameserver, to reflect when they were last updated in seconds since the epoch
(or at least quite close to that - there may be modest delay between the data
being altered and the newer data being (re)loaded or served up).
domains, RCS versions and S/N line from zone master files over relevant time/SN ranges:
balug.org
  1.82  1559249859; SERIAL; date +%s
  1.81  1557834269; SERIAL; date +%s
  1.80  1550844926; SERIAL; date +%s

sf-lug.org
  1.49  1559249874 ; SERIAL ; date +%s
  1.48  1557834269 ; SERIAL ; date +%s
  1.47  1550845043 ; SERIAL ; date +%s

sf-lug.com
  1.49  1557834269 ; SERIAL ; date +%s
  1.48  1550845063 ; SERIAL ; date +%s

sflug.org
  1.4  1557834269 ; SERIAL ; date +%s
  1.3  1557434452 ; SERIAL ; date +%s
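(For reference, a hedged sketch of how such a listing can be pulled from RCS -
the zone file name and revision numbers here are illustrative only; the real
files/revisions are those above:)

  # hedged sketch - check out each RCS revision of a zone master file
  # and show its serial line; "db.balug.org" and the revisions are
  # illustrative only
  for rev in 1.80 1.81 1.82
  do
      printf '%s  ' "$rev"
      co -p -r"$rev" db.balug.org 2>/dev/null | grep -i 'SERIAL'
  done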
So, the VM - on the two hosts.  Nominally it is only run on one at any given
time (same IP address, and other stateful stuff).  However, it's okay to run a
2nd copy disconnected (virtually or otherwise) from the network, or in
single-user mode (or equivalent) - so that may also sometimes be done (e.g. to
examine slightly older data, or to test something, etc.)
Log bits from both of the physical hosts below:
from file://tigger/var/log/libvirt/qemu/balug.log we have:
  2019-04-29 06:04:16.775+0000: starting up
  2019-05-12 15:01:56.595+0000: shutting down, reason=shutdown
  2019-05-13 00:32:26.019+0000: starting up
  2019-05-25 07:38:01.482+0000: starting up
  2019-05-25 16:42:59.970+0000: shutting down, reason=shutdown
  2019-05-26 10:50:34.956+0000: starting up
  2019-05-26 16:26:07.416+0000: shutting down, reason=shutdown
  2019-05-26 23:19:50.635+0000: starting up
from file://vicki/var/log/libvirt/qemu/balug.log we have:
  2019-04-29 05:57:14.081+0000: shutting down, reason=shutdown
  2019-05-12 15:07:50.838+0000: starting up
  2019-05-13 00:24:13.725+0000: shutting down, reason=shutdown
  2019-05-14 02:51:10.547+0000: starting up
  2019-05-14 03:32:32.323+0000: shutting down, reason=destroyed
  2019-05-25 15:45:36.856+0000: starting up
  2019-05-25 16:28:53.723+0000: shutting down, reason=shutdown
  2019-05-25 16:34:45.054+0000: starting up
  2019-05-25 16:49:33.468+0000: shutting down, reason=destroyed
  2019-05-25 16:56:14.352+0000: starting up
  2019-05-26 08:28:08.171+0000: shutting down, reason=shutdown
  2019-05-26 16:33:39.196+0000: starting up
  2019-05-26 23:12:43.082+0000: shutting down, reason=shutdown
  2019-05-31 03:39:45.011+0000: starting up
taking the above, prefixing with a letter to distinguish the physical hosts,
sorting by time, and I mark with * where there's overlap:
  v 2019-04-29 05:57:14.081+0000: shutting down, reason=shutdown
  t 2019-04-29 06:04:16.775+0000: starting up
  t 2019-05-12 15:01:56.595+0000: shutting down, reason=shutdown
  v 2019-05-12 15:07:50.838+0000: starting up
  v 2019-05-13 00:24:13.725+0000: shutting down, reason=shutdown
  t 2019-05-13 00:32:26.019+0000: starting up
  v 2019-05-14 02:51:10.547+0000: starting up *
  v 2019-05-14 03:32:32.323+0000: shutting down, reason=destroyed
  t 2019-05-25 07:38:01.482+0000: starting up
  v 2019-05-25 15:45:36.856+0000: starting up *
  v 2019-05-25 16:28:53.723+0000: shutting down, reason=shutdown
  v 2019-05-25 16:34:45.054+0000: starting up *
  t 2019-05-25 16:42:59.970+0000: shutting down, reason=shutdown
  v 2019-05-25 16:49:33.468+0000: shutting down, reason=destroyed
  v 2019-05-25 16:56:14.352+0000: starting up
  v 2019-05-26 08:28:08.171+0000: shutting down, reason=shutdown
  t 2019-05-26 10:50:34.956+0000: starting up
  t 2019-05-26 16:26:07.416+0000: shutting down, reason=shutdown
  v 2019-05-26 16:33:39.196+0000: starting up
  v 2019-05-26 23:12:43.082+0000: shutting down, reason=shutdown
  t 2019-05-26 23:19:50.635+0000: starting up
  v 2019-05-31 03:39:45.011+0000: starting up *
    (current examination, in single user mode or equivalent)
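(That prefix-and-merge can be done mechanically; a hedged sketch - the tigger
log path is as in the excerpt above, and the vicki log is presumed copied
locally to an illustrative file name:)

  # hedged sketch - prefix each host's libvirt log entries and merge by
  # timestamp (assumes vicki's log was copied locally as ./vicki-balug.log)
  {
      sed 's/^/t /' /var/log/libvirt/qemu/balug.log
      sed 's/^/v /' ./vicki-balug.log
  } | grep -E 'starting up|shutting down' | sort -k 2,3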
Kind'a doubt I directly goofed the zone SNs; thinking it more likely a VM
image shuffle boo-boo, e.g. didn't copy the latest image before bringing it up
on the other physical, or maybe even possibly copied the wrong way 'round
(less likely); then in either case, some data would be a bit older than it
should have been, and the VM host would effectively take a step back in time
regarding its data (but not its system time).
There's some sf-lug stuff that's backed up nominally overnightly, that has
version control (RCS) on it.  If we had an image anomaly where there was a
goof of not copying over the most current, or copying older atop newer, the
sf-lug RCS may show a jump/gap - e.g. in its tracking of the mbox changes.
$ sf-lug_mbox_stats
YYYY-MM-DD (in UTC) and # of lines in mbox file
2019-05-28 1734690
2019-05-26 1732849
2019-05-25 1732221
2019-05-25 1732150
2019-05-24 1731781
2019-05-23 1729078
2019-05-22 1727972
2019-05-21 1727109
2019-05-19 1723626
2019-05-17 1722133
2019-05-17 1721723
2019-05-15 1721100
2019-05-15 1719952
2019-05-13 1719589
2019-05-10 1719495
2019-05-08 1719346
2019-05-06 1719019
2019-05-04 1718393
2019-05-02 1718311
2019-05-01 1716716
2019-04-30 1716055
Nothing obvious there.  Some days there will be nothing, if the raw mbox
didn't have any data changes.  Not sure about the multiples on the same day -
I'd expect perhaps one of those where the timezone of the host was changed to
UTC, but not sure why multiples of those are seen for some other days too -
that/those may have been a side-effect of a data copy error (either failed to
update the system disk image, or copied the wrong way 'round - so state files
would be reverted, and an additional copy may occur).
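(For context, a hedged sketch of roughly what an sf-lug_mbox_stats-like report
could do, assuming the mbox is tracked in RCS - the actual program may well
differ, and the file name "mbox" here is illustrative:)

  # hedged sketch - for each RCS revision of the mbox, print its check-in
  # date (UTC, as rlog reports by default) and that revision's line count
  for rev in $(rlog mbox | awk '/^revision /{print $2}')
  do
      d=$(rlog -r"$rev" mbox | awk '/^date: /{print $2; exit}' | tr / -)
      n=$(co -p -r"$rev" mbox 2>/dev/null | wc -l)
      echo "$d $n"
  done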
Well, nothing obvious found.
Likely was a VM copy boo-boo (missed doing copy or copied wrong way 'round).
Checking a bit further across domains, I see some residual issues on one other
domain too:
  berkeleylug.org. IN SOA ns0.berkeleylug.org. Michael.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800 3600 1209600 86400 @198.144.194.238 (ns0.berkeleylug.org.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. Michael.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800 3600 1209600 86400 @2001:470:1f05:19e::4 (ns0.berkeleylug.org.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. Michael.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800 3600 1209600 86400 @198.144.195.186 (ns1.linuxmafia.com.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. michael.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600 1209600 86400 @64.62.190.98 (ns1.svlug.org.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. michael.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600 1209600 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. michael.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600 1209600 86400 @204.42.254.5 (puck.nether.net.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. michael.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600 1209600 86400 @2001:418:3f4::5 (puck.nether.net.)
And likewise the when of those epoch-based timestamps / SNs:
  1558682705 2019-05-24T07:25:05+00:00 2019-05-24T00:25:05-07:00
  1558799274 2019-05-25T15:47:54+00:00 2019-05-25T08:47:54-07:00
Can we spot anything different in the data besides the SN?
... compared, the only differences found were in RRSIG and SOA records ...
will proceed to "bump" the S/N to correct that one ... done
So, ... cause and prevention of issue? Don't have "smoking gun" - insufficient data available to positively isolate.
I also examined the /var/log/daemon.log* files data.  All the SN data within
looks consistent - no errors, no SNs going "backwards".  More specifically,
all the SNs for the impacted zones in those logs always go forwards, never
backwards - both the unsigned (within the zone master files themselves) and
the signed (+DNSSEC) each only go forwards, and the signed is always >= the
unsigned (it's only the signed that's seen externally anyway, and that's what
we mostly care about, but in the interests of careful examination ...), and
each time the unsigned is "bumped" (we semi-manually increase it), the signed
effectively immediately jumps up to that value (our new unsigned is never <=
the old signed - we always increase it beyond that - as both our semi-manual
bump and the +DNSSEC bump always advance to seconds from the epoch).  So,
*within the surviving system image* (and presumably the most recent older one
too, and each step along the way), the image is self-consistent with *itself*.
It's just that if a copy was missed or done the wrong way 'round, it would be
inconsistent with external reality (it may have served newer to slaves, then
the image was unintentionally reverted to earlier).
Anyway, that's what we'd expect under the hypothesis that an image copy was
missed or was done the wrong way 'round - the "surviving" (currently running)
image would be self-consistent with itself, etc., just not (quite) consistent
with external reality, as some changes on the VM image would've been lost.
So, most likely a VM disk image boo-boo (e.g. migrated VM but failed to migrate/refresh disk image, or copied wrong way 'round).
"Prevention" or reducing probablity of same/simlar issues again?
Could possibly add something to check the mtime of the image to prevent
copying the wrong way 'round - but that does nothing to prevent failing to
copy/refresh the image, which could still result in the same issue.
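E.g., a hedged sketch of such an mtime sanity check - path and host as in the
copy history earlier; this is illustrative, not an existing script:

  # hedged sketch - refuse to copy if the local (source) image is older than
  # the remote (target) copy, i.e. guard against copying the wrong way 'round
  img=/var/local/balug/balug-sda
  local_mtime=$(stat -c %Y "$img")
  remote_mtime=$(ssh -ax -l root 192.168.55.2 "stat -c %Y $img")
  if [ "$local_mtime" -lt "$remote_mtime" ]; then
      echo 'local image is OLDER than remote - aborting copy' 1>&2
      exit 1
  fi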
Could do a higher-level program to manage those particular migrations, which
would make them more goof-resistant.
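E.g., a hedged sketch of pre-flight checks such a wrapper might do before a
live migration - the domain name, destination URI, and particular checks are
illustrative assumptions, not the actual program:

  # hedged sketch - only migrate if the domain is running here and not
  # already running on the target; names/URIs are illustrative
  dom=balug
  target=qemu+ssh://root@192.168.55.2/system
  if [ "$(virsh domstate "$dom")" != running ]; then
      echo "$dom is not running locally - aborting" 1>&2
      exit 1
  fi
  if virsh --connect "$target" domstate "$dom" 2>/dev/null |
      grep -q '^running$'
  then
      echo "$dom already running on target - aborting" 1>&2
      exit 1
  fi
  virsh migrate --live --persistent --copy-storage-all --p2p --tunnelled \
      "$dom" "$target"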
# virsh migrate --live --copy-storage-all
would reduce the probability of error, but current Debian stable has a bug
that keeps that from working (it worked in oldstable, and will likely work in
the forthcoming stable).  Might be worth checking if there's a fix for that in
backports, as that would also have the benefit of reducing VM downtime.
... checking the bug data and such, looks like it's likely fixed in backports,
the forthcoming stable, testing, and unstable.
May or may not be worth doing anything explicitly about - with reasonably
attentive operations, it's not at all an impossible error to repeat, but it's
also fairly low probability, and the results aren't too nastily horrific -
though it of course would be better to not have such boo-boos.
Level-of-effort vs. risk ... and after examining this DNS data boo-boo, and
how it probably happened, operations will also likely be more attentive, and
even less likely to have the same (type of) boo-boo again (at least in the
nearish future).  Also, longer term, the upcoming stable will become stable,
both physical hosts will get upgraded, and then with
# virsh migrate --live --copy-storage-all
the probability of such an (operator!) error again goes way down.
Other bits - some more slaves (even non-advertised (no NS records) ones for
the same) would also help in logging/analyzing data (e.g. another "external
reality" point of reference/logging relative to the VM and its view of
"reality" - which might possibly suffer from, uh, "amnesia", in some (boo-boo
- or significant hardware failure) cases).
Also suggested: thing(s) to (more) carefully watch the logs, and catch/notify
on issues ... but in the above case and with the presumed cause of the issue,
such on the VM itself wouldn't have detected any issues ... but monitoring
external to the VM would have caught the issue - e.g. on the nominal physical
host - as that's up *most* of the time (whereas the alternate physical spends
most of its time down).  Checks pre- and post-migration could also help thwart
such boo-boos.
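E.g., a hedged sketch of a small external check that could run (say from cron
on the nominal physical host) to flag a retrograde serial - the zone, master
address, and state-file path are illustrative:

  # hedged sketch - compare the master's current SOA serial against the last
  # one seen, and complain if it ever goes backwards; names/paths illustrative
  zone=balug.org
  master=198.144.194.238
  state=/var/local/soa-serial.$zone
  new=$(dig +short @"$master" "$zone" SOA | awk '{print $3}')
  old=$(cat "$state" 2>/dev/null || echo 0)
  if [ -n "$new" ] && [ "$new" -lt "$old" ]; then
      echo "WARNING: $zone SOA serial went backwards: $old -> $new" 1>&2
  elif [ -n "$new" ]; then
      echo "$new" > "$state"
  fi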
Also, operator(s) doing the migrations when more attentive, less
rushed/tired/sleepy would also reduce the probability of error(s).  Having and
(strictly) following a well-tested and debugged checklist would also help
(what exists presently is a pretty good "outline" plus some relevant script
bits, but it's not exactly in a full checklist format to also take steps to
thwart at least the more probable boo-boos).
From: "Michael Paoli" Michael.Paoli@cal.berkeley.edu Subject: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this 8-O ... "fixed"[1] Date: Thu, 30 May 2019 14:28:19 -0700
Okay, fixed[1]. I'll dig[2] into it later to investigate how issue did or likely occurred.
Thanks Rick for catching that!
footnotes/references/excerpts:
1. For certain definitions of "fixed" - serial numbers corrected, didn't
   check/validate anything else in particular - just "bumped" (updated) them
   and "pushed" (notified) 'em out, and rechecked 'till they all appeared out
   there okay on master(s) & slaves.
   $ DNS_SOA_CK balug.org sf-lug.org
   FQDN=balug.org.
   authority:
   balug.org. 86400 IN NS ns1.balug.org.
   balug.org. 86400 IN NS ns1.linuxmafia.com.
   balug.org. 86400 IN NS ns1.svlug.org.
   balug.org. 86400 IN NS puck.nether.net.
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @198.144.194.238 (ns1.balug.org.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @2001:470:1f04:19e::2 (ns1.balug.org.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @198.144.195.186 (ns1.linuxmafia.com.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @64.62.190.98 (ns1.svlug.org.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @204.42.254.5 (puck.nether.net.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @2001:418:3f4::5 (puck.nether.net.)
   FQDN=sf-lug.org.
   authority:
   sf-lug.org. 86400 IN NS ns.primate.net.
   sf-lug.org. 86400 IN NS ns1.linuxmafia.com.
   sf-lug.org. 86400 IN NS ns1.sf-lug.org.
   sf-lug.org. 86400 IN NS ns1.svlug.org.
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @198.144.194.12 (ns.primate.net.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @2001:470:1f04:51a::2 (ns.primate.net.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @198.144.195.186 (ns1.linuxmafia.com.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @198.144.194.238 (ns1.sf-lug.org.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @2001:470:1f04:19e::2 (ns1.sf-lug.org.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @64.62.190.98 (ns1.svlug.org.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
   $
2. pun or the like retroactively intended.  ;-)
3. unreferenced footnote.  My DNS_SOA_CK program essentially grabs the
   "upstream" delegating NS records (at least one instance of 'em), gets the A
   and AAAA records of all the delegated nameservers, then gets the SOA
   records from each of those, and displays that data, for the domain(s)
   specified - or a default set if none specified.  (I wrote it semi-recently
   - I got tired of doing it semi-manually on a semi-frequent basis; very
   handy for, among other things, also checking that master(s) and slaves are
   caught up when going through letsencrypt.org wildcard cert validation
   request via DNS verification; also very handy to see that the delegated
   nameservers are responding and with the expected data (or at least the
   expected zone S/N).)
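(A hedged sketch of roughly that kind of check follows - simplified, e.g. it
takes the NS set from an ordinary resolver query rather than strictly from the
delegating parent, so it is not the actual DNS_SOA_CK:)

  #!/bin/sh
  # hedged sketch of a DNS_SOA_CK-like check - for each domain given as an
  # argument, list its NS records, resolve each nameserver's A/AAAA
  # addresses, and query each address for the zone's SOA
  for domain
  do
      echo "FQDN=$domain."
      for ns in $(dig +short "$domain" NS)
      do
          for addr in $(dig +short "$ns" A) $(dig +short "$ns" AAAA)
          do
              printf '%s IN SOA %s @%s (%s)\n' "$domain." \
                  "$(dig +short @"$addr" "$domain" SOA)" "$addr" "$ns"
          done
      done
  done

Usage, e.g.: sh dns_soa_ck.sh balug.org sf-lug.org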
From: "Michael Paoli" Michael.Paoli@cal.berkeley.edu Subject: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this 8-O Date: Thu, 30 May 2019 13:50:37 -0700
From: "Rick Moen" rick@linuxmafia.com Subject: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this Date: Thu, 30 May 2019 11:20:26 -0700
Magic 8-ball (i.e., logcheck on ns1.linuxmafia.com) says:
System Events
May 30 10:05:36 linuxmafia named[11750]: zone balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
May 30 10:30:43 linuxmafia named[11750]: zone sf-lug.org/IN: serial number (1558622463) received from master 198.144.194.238#53 < ours (1558799278)
May 30 10:32:42 linuxmafia named[11750]: zone balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
May 30 10:57:43 linuxmafia named[11750]: zone balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
Er?
Michael, O Great Oracle of the DNS master, before I go expunging the local
cached zone on ns1.linuxmafia.com so as to converge on the master, any
thoughts or desire to act on your end?  Normally, I would expect the current
situation to be _strenuously avoided_ by never taking S/Ns in a retrograde
direction on a zone's DNS master, so I infer that investigation may be in
order (or at least brief discussion).
8-O Oops, ... that should'a never happened.  I'll investigate & correct.
Shouldn't require any explicit slave action.
I wonder if maybe a VM came up that shouldn't have, or ???  Anyway, will check
into it and correct (might be busy mostly with other stuff 'till about this
evening or so, but expect I'll have it rectified by/around then ... might then
take a wee bit for the slaves to follow along & get themselves corrected - but
likely pretty fast on that and without explicit slave action needed).