So ... managed to accomplish several things which will:
o generally increase the uptime/availability of the balug VM
o reduce probability of balug (& other(s)) VM disk image boo-boo, notably
  failing to update or copying the wrong way 'round and unintentionally
  reverting to older image
Namely, did:
o got live migrations working again much more fully on Debian GNU/Linux 9[.x]
  (stretch), approximately (and mostly effectively) as they had been working
  on Debian GNU/Linux 8[.x] (jessie), and with --copy-storage-all ... and in
  fact, apparently even more reliably than was the earlier case on Debian
  GNU/Linux 8[.x] (jessie)
o wrote some higher-level live migration (with --copy-storage-all) programs,
  to make the live migration process more goof-resistant.
So, ... now the balug VM should not need to go down at all to migrate between
physical hosts ... which among other things means no longer having
approximately 6 downtimes per month (when I typically take the primary
physical host off-site ... and back, which means the balug VM doing a
migration off that physical host ... and back ... since the live migrations
hadn't been working, it was a shutdown, cold copy of the disk image, and
reboot, for each migration).
Key bits on "solving" that:
o Information on relevant Debian bugs was quite useful/informative, most
  notably:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658112
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=796122
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=873012
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878299
  The bug information suggested newer versions (e.g. presumably available from
  backports, or by running testing) would correct the issue, but it also
  provided information on effective work-around(s), most notably including the
  options:
    --p2p --tunnelled
  which were "good enough" in this case (requires a trust relationship between
  the physical hosts).  I may also test further (on other test VM(s)) to see
  if that's actually needed or not.  (A hedged example invocation is sketched
  just below, after this list.)
o At least after using the above information, was able to successfully do
  live --copy-storage-all migrations of a test VM - and it worked fine both
  ways between the physicals.  But alas, when that was done with the balug VM,
  it worked in one direction, but "failed" in the other (the migration itself
  appeared to go fine, but the VM ended up wedged/hung on the target physical
  - even though its status showed as "running").  But at that point, since it
  worked with the other VM both ways, it was down to a divide-and-conquer
  troubleshooting isolation issue.
o Eventually worked it out to be the virtual hardware configuration.  By
  changing that, was able to successfully migrate both ways.  Did at least 6
  round-trip migrations of a balug.test VM that was mostly a clone of the
  balug VM (with relevant services disabled and networking changed, otherwise
  at least initially highly identical).  Once that was worked out, applied it
  to the balug VM, and tested live migration - both ways - and it worked fine.
o Virtual hardware ... mostly just saved the config, blew away the config,
  recreated the virtual machine, using the "import" capability to build it
  around the disk image, but keeping the Ethernet MAC and networking
  connectivity the same.  Then adjusted the virtual CPU - disabling
  capabilities not supported in the VM environment of the other physical due
  to different physical CPU(s).  After that, all works fine with the
  migrations both ways of the balug VM.  Note also - (virtual) CPU - that's
  apparently (mostly) not the hang issue - the VM software would report the
  incompatibilities, and refuse to migrate in those cases ... it never got as
  far as a "hang" in those cases.  (By default the software creates a virtual
  CPU that's close in capabilities to the physical host CPU(s).)
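For concreteness, a hedged sketch of roughly what such an invocation looks
like - the domain name and destination URI here are illustrative, and the
exact option set actually used may have differed:

  # hedged sketch - live migration with storage copy, using the
  # --p2p --tunnelled work-around noted in the bug reports;
  # "balug" and the destination URI are illustrative
  virsh migrate --live --copy-storage-all --p2p --tunnelled \
      balug qemu+ssh://root@192.168.55.2/system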
References:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658112
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=796122
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=873012
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878299
https://lists.balug.org/pipermail/balug-admin/2019-May/000984.html
https://lists.balug.org/pipermail/balug-admin/2019-May/000985.html
https://lists.balug.org/pipermail/balug-admin/2019-May/000986.html
https://lists.balug.org/pipermail/balug-admin/2019-June/000988.html
From: "Michael Paoli" Michael.Paoli@cal.berkeley.edu Subject: Root Cause Analysis (RCA)*: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this 8-O ... "fixed"[1] Date: Fri, 31 May 2019 21:30:56 -0700
Root Cause Analysis (RCA)*
*or at least as reasonable an attempt (or attempts) thereof as feasible.
Analysis/data/etc. below (at least mostly as relevant), then excerpts of at
least much of the earlier emails closely related to the event.
"Root Cause Analysis" (RCA) of DNS serial number issues noted & reported 2019-05-30 US/Pacific
Executive summary: Likely a boo-boo in (lack of, or wrong-way-'round) copying
of VM disk images from (physical) host to host when doing a (cold) migration
of the VM from one host to the other.  "Prevention" or reducing the
probability of the same/similar issues again?  See the last several paragraphs
or so under the line matching the above.
Analysis, conclusions (as feasible), and possible "corrective"/preventive information follows:
relevant history excerpts of shell session typically used for disk image
copies of VM between physical hosts for migrations:
  534  dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
  538  dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
I believe the first was repeated, as I'd unintentionally started copy from
source to target before stopping source - so that would be an invalid copy;
then restarted copy after stopping source
  545  ssh -ax -l root 192.168.55.2 'dd bs=4194304 if=/var/local/balug/balug-sda' | time dd bs=4194304 of=/var/local/balug/balug-sda
  552  dd bs=4194304 if=/var/local/balug/balug-sda | ssh -ax -l root 192.168.55.2 'dd bs=4194304 of=/var/local/balug/balug-sda'
  556  ssh -ax -l root 192.168.55.2 'dd bs=4194304 if=/var/local/balug/balug-sda' | time dd bs=4194304 of=/var/local/balug/balug-sda
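As a goof-resistance aside (not part of the history above - just a hedged
sketch): after a cold dd-over-ssh copy like those, one could confirm the two
images actually match before starting the VM on the target; paths and target
host here are taken from the history excerpts and are otherwise illustrative:

  # hedged sketch - compare checksums of the image on both hosts;
  # run with the VM shut down on both hosts
  local_sum=$(sha256sum /var/local/balug/balug-sda | awk '{print $1}')
  remote_sum=$(ssh -ax -l root 192.168.55.2 \
      'sha256sum /var/local/balug/balug-sda' | awk '{print $1}')
  if [ "$local_sum" = "$remote_sum" ]; then
      echo 'disk images match'
  else
      echo "MISMATCH: $local_sum != $remote_sum" 1>&2
  fi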
at time of examination, the host is the nominal physical host
Checking on file://linuxmafia.com/var/log/daemon.log* files,
looks like domains:
  balug.org
  sf-lug.com
  sf-lug.org
  sflug.org
were impacted; all okay with serial numbers now matching between master(s) &
slaves, after manual correction on balug.org and sf-lug.org, and it looks like
sf-lug.com and sflug.org were corrected earlier (probably as zones were
routinely updated/maintained).
Oldest issue detection still in linuxmafia.com log files examined:
  May 25 09:23:51
last reported in those logs:
  May 30 13:54:57
linuxmafia.com timestamps from those are timezone America/Los_Angeles
excluding timestamps, looking at unique S/N complaints, we have:
  balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
  sf-lug.com/IN: serial number (1558125318) received from master 198.144.194.238#53 < ours (1558799275)
  sf-lug.org/IN: serial number (1558622463) received from master 198.144.194.238#53 < ours (1558799278)
  sflug.org/IN: serial number (1557434452) received from master 198.144.194.238#53 < ours (1557834269)
What times do those correspond to?  Well, we're using seconds from the epoch
when we update our S/N, so ...
  for t in 1557434452 1557834269 1558125318 1558622463 1558725628 \
    1558799275 1558799278 1558799284
  do
    echo "$t" $(TZ=GMT0 date -Iseconds -d @"$t") $(TZ=US/Pacific date -Iseconds -d @"$t")
  done
  1557434452 2019-05-09T20:40:52+00:00 2019-05-09T13:40:52-07:00 sflug.org
  1557834269 2019-05-14T11:44:29+00:00 2019-05-14T04:44:29-07:00 sflug.org *
  1558125318 2019-05-17T20:35:18+00:00 2019-05-17T13:35:18-07:00 sf-lug.com
  1558622463 2019-05-23T14:41:03+00:00 2019-05-23T07:41:03-07:00 sf-lug.org
  1558725628 2019-05-24T19:20:28+00:00 2019-05-24T12:20:28-07:00 balug.org
  1558799275 2019-05-25T15:47:55+00:00 2019-05-25T08:47:55-07:00 sf-lug.com *
  1558799278 2019-05-25T15:47:58+00:00 2019-05-25T08:47:58-07:00 sf-lug.org *
  1558799284 2019-05-25T15:48:04+00:00 2019-05-25T08:48:04-07:00 balug.org *
I added the domains to the above, and marked with * the later S/N on each.
       May 2019
  Su Mo Tu We Th Fr Sa
            1  2  3  4
   5  6  7  8  9 10 11
  12 13 14 15 16 17 18
  19 20 21 22 23 24 25
  26 27 28 29 30 31
Not all changes are necessarily checked into version control (e.g. RCS),
however, many are.  Ones that aren't are often short-term interim changes,
e.g. temporary changes for letsencrypt wildcard certs validation via DNS.
Let's see what we have in the current running VM ... but note that what the
nameserver serves up might not match, as DNSSEC is later added to that and may
increase the S/N; notably, in our bind9 configuration for these master zones,
we have:
  inline-signing yes;
  auto-dnssec maintain;
  serial-update-method unixtime;
With that there, and since our "hand" updated SNs (those in the master zone
files themselves) are (or at least should be) generated via:
  date +%s
we expect the SNs in the zone master files, and those served by the
nameserver, to reflect when they were last updated in seconds since the epoch
(or at least quite close to that - there may be modest delay between the data
being altered and the newer data being (re)loaded or served up).
domains, RCS versions and S/N line from zone master files over relevant time/SN ranges:
balug.org
  1.82  1559249859; SERIAL; date +%s
  1.81  1557834269; SERIAL; date +%s
  1.80  1550844926; SERIAL; date +%s

sf-lug.org
  1.49  1559249874 ; SERIAL ; date +%s
  1.48  1557834269 ; SERIAL ; date +%s
  1.47  1550845043 ; SERIAL ; date +%s

sf-lug.com
  1.49  1557834269 ; SERIAL ; date +%s
  1.48  1550845063 ; SERIAL ; date +%s

sflug.org
  1.4  1557834269 ; SERIAL ; date +%s
  1.3  1557434452 ; SERIAL ; date +%s
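(For reference, a hedged sketch of how such a listing can be pulled from RCS -
the zone file name and revision numbers here are illustrative only; the real
files/revisions are those above:)

  # hedged sketch - check out each RCS revision of a zone master file
  # and show its serial line; "db.balug.org" and the revisions are
  # illustrative only
  for rev in 1.80 1.81 1.82
  do
      printf '%s  ' "$rev"
      co -p -r"$rev" db.balug.org 2>/dev/null | grep -i 'SERIAL'
  done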
So, the VM - on the two hosts.  Nominally it is only run on one at any given
time (same IP address, and other stateful stuff).  However, it's okay to run a
2nd copy disconnected (virtually or otherwise) from the network, or in
single-user mode (or equivalent) - so that may also sometimes be done (e.g. to
examine slightly older data, or to test something, etc.)
Log bits from both of the physical hosts below:
from file://tigger/var/log/libvirt/qemu/balug.log we have:
  2019-04-29 06:04:16.775+0000: starting up
  2019-05-12 15:01:56.595+0000: shutting down, reason=shutdown
  2019-05-13 00:32:26.019+0000: starting up
  2019-05-25 07:38:01.482+0000: starting up
  2019-05-25 16:42:59.970+0000: shutting down, reason=shutdown
  2019-05-26 10:50:34.956+0000: starting up
  2019-05-26 16:26:07.416+0000: shutting down, reason=shutdown
  2019-05-26 23:19:50.635+0000: starting up
from file://vicki/var/log/libvirt/qemu/balug.log we have:
  2019-04-29 05:57:14.081+0000: shutting down, reason=shutdown
  2019-05-12 15:07:50.838+0000: starting up
  2019-05-13 00:24:13.725+0000: shutting down, reason=shutdown
  2019-05-14 02:51:10.547+0000: starting up
  2019-05-14 03:32:32.323+0000: shutting down, reason=destroyed
  2019-05-25 15:45:36.856+0000: starting up
  2019-05-25 16:28:53.723+0000: shutting down, reason=shutdown
  2019-05-25 16:34:45.054+0000: starting up
  2019-05-25 16:49:33.468+0000: shutting down, reason=destroyed
  2019-05-25 16:56:14.352+0000: starting up
  2019-05-26 08:28:08.171+0000: shutting down, reason=shutdown
  2019-05-26 16:33:39.196+0000: starting up
  2019-05-26 23:12:43.082+0000: shutting down, reason=shutdown
  2019-05-31 03:39:45.011+0000: starting up
taking the above, prefixing with a letter to distinguish the physical hosts,
sorting by time, and I mark with * where there's overlap:
  v 2019-04-29 05:57:14.081+0000: shutting down, reason=shutdown
  t 2019-04-29 06:04:16.775+0000: starting up
  t 2019-05-12 15:01:56.595+0000: shutting down, reason=shutdown
  v 2019-05-12 15:07:50.838+0000: starting up
  v 2019-05-13 00:24:13.725+0000: shutting down, reason=shutdown
  t 2019-05-13 00:32:26.019+0000: starting up
  v 2019-05-14 02:51:10.547+0000: starting up *
  v 2019-05-14 03:32:32.323+0000: shutting down, reason=destroyed
  t 2019-05-25 07:38:01.482+0000: starting up
  v 2019-05-25 15:45:36.856+0000: starting up *
  v 2019-05-25 16:28:53.723+0000: shutting down, reason=shutdown
  v 2019-05-25 16:34:45.054+0000: starting up *
  t 2019-05-25 16:42:59.970+0000: shutting down, reason=shutdown
  v 2019-05-25 16:49:33.468+0000: shutting down, reason=destroyed
  v 2019-05-25 16:56:14.352+0000: starting up
  v 2019-05-26 08:28:08.171+0000: shutting down, reason=shutdown
  t 2019-05-26 10:50:34.956+0000: starting up
  t 2019-05-26 16:26:07.416+0000: shutting down, reason=shutdown
  v 2019-05-26 16:33:39.196+0000: starting up
  v 2019-05-26 23:12:43.082+0000: shutting down, reason=shutdown
  t 2019-05-26 23:19:50.635+0000: starting up
  v 2019-05-31 03:39:45.011+0000: starting up *
    (current examination, in single user mode or equivalent)
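(That prefix-and-merge can be done mechanically; a hedged sketch - the tigger
log path is as in the excerpt above, and the vicki log is presumed copied
locally to an illustrative file name:)

  # hedged sketch - prefix each host's libvirt log entries and merge by
  # timestamp (assumes vicki's log was copied locally as ./vicki-balug.log)
  {
      sed 's/^/t /' /var/log/libvirt/qemu/balug.log
      sed 's/^/v /' ./vicki-balug.log
  } | grep -E 'starting up|shutting down' | sort -k 2,3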
Kind'a doubt I directly goofed the zone SNs; thinking it more likely a VM
image shuffle boo-boo, e.g. didn't copy the latest image before bringing it up
on the other physical, or maybe even possibly copied the wrong way 'round
(less likely); then in either case, some data would be a bit older than it
should have been, and the VM host would effectively take a step back in time
regarding its data (but not its system time).
There's some sf-lug stuff that's backed up nominally overnightly, that has
version control (RCS) on it.  If we had an image anomaly where there was a
goof of not copying over the most current, or copying older atop newer, the
sf-lug RCS may show a jump/gap - e.g. in its tracking of the mbox changes.
$ sf-lug_mbox_stats
YYYY-MM-DD (in UTC) and # of lines in mbox file
2019-05-28 1734690
2019-05-26 1732849
2019-05-25 1732221
2019-05-25 1732150
2019-05-24 1731781
2019-05-23 1729078
2019-05-22 1727972
2019-05-21 1727109
2019-05-19 1723626
2019-05-17 1722133
2019-05-17 1721723
2019-05-15 1721100
2019-05-15 1719952
2019-05-13 1719589
2019-05-10 1719495
2019-05-08 1719346
2019-05-06 1719019
2019-05-04 1718393
2019-05-02 1718311
2019-05-01 1716716
2019-04-30 1716055
Nothing obvious there.  Some days there will be nothing, if the raw mbox
didn't have any data changes.  Not sure about the multiples on the same day -
I'd expect perhaps one of those where the timezone of the host was changed to
UTC, but not sure why multiples of those are seen for some other days too -
that/those may have been a side-effect of a data copy error (either failed to
update the system disk image, or copied the wrong way 'round - so state files
would be reverted, and an additional copy may occur).
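(For context, a hedged sketch of roughly what an sf-lug_mbox_stats-like report
could do, assuming the mbox is tracked in RCS - the actual program may well
differ, and the file name "mbox" here is illustrative:)

  # hedged sketch - for each RCS revision of the mbox, print its check-in
  # date (UTC, as rlog reports by default) and that revision's line count
  for rev in $(rlog mbox | awk '/^revision /{print $2}')
  do
      d=$(rlog -r"$rev" mbox | awk '/^date: /{print $2; exit}' | tr / -)
      n=$(co -p -r"$rev" mbox 2>/dev/null | wc -l)
      echo "$d $n"
  done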
Well, nothing obvious found.
Likely was a VM copy boo-boo (missed doing copy or copied wrong way 'round).
Checking a bit further across domains, I see some residual issues on one other
domain too:
  berkeleylug.org. IN SOA ns0.berkeleylug.org. Michael.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800 3600 1209600 86400 @198.144.194.238 (ns0.berkeleylug.org.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. Michael.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800 3600 1209600 86400 @2001:470:1f05:19e::4 (ns0.berkeleylug.org.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. Michael.Paoli.cal.berkeley.edu.berkeleylug.org. 1558682705 10800 3600 1209600 86400 @198.144.195.186 (ns1.linuxmafia.com.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. michael.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600 1209600 86400 @64.62.190.98 (ns1.svlug.org.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. michael.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600 1209600 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. michael.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600 1209600 86400 @204.42.254.5 (puck.nether.net.)
  berkeleylug.org. IN SOA ns0.berkeleylug.org. michael.paoli.cal.berkeley.edu.berkeleylug.org. 1558799274 10800 3600 1209600 86400 @2001:418:3f4::5 (puck.nether.net.)
And likewise the when of those epoch-based timestamps / SNs:
  1558682705 2019-05-24T07:25:05+00:00 2019-05-24T00:25:05-07:00
  1558799274 2019-05-25T15:47:54+00:00 2019-05-25T08:47:54-07:00
Can we spot anything different in the data besides the SN?
... compared, the only differences found were in RRSIG and SOA records ...
will proceed to "bump" the S/N to correct that one ... done
So, ... cause and prevention of issue? Don't have "smoking gun" - insufficient data available to positively isolate.
I also examined the /var/log/daemon.log* files data.  All the SN data within
looks consistent - no errors, no SNs going "backwards".  More specifically,
all the SNs for the impacted zones in those logs always go forwards, never
backwards - both the unsigned (within the zone master files themselves) and
the signed (+DNSSEC) each only go forwards, and the signed is always >= the
unsigned (it's only the signed that's seen externally anyway, and that's what
we mostly care about, but in the interests of careful examination ...), and
each time the unsigned is "bumped" (we semi-manually increase it), the signed
effectively immediately jumps up to that value (our new unsigned is never <=
the old signed - we always increase it beyond that - as both our semi-manual
bump and the +DNSSEC bump always advance to seconds from the epoch).  So,
*within the surviving system image* (and presumably the most recent older one
too, and each step along the way), the image is self-consistent with *itself*.
It's just that if a copy was missed or done the wrong way 'round, it would be
inconsistent with external reality (it may have served newer to slaves, then
the image was unintentionally reverted to earlier).
Anyway, that's what we'd expect under the hypothesis that an image copy was
missed or was done the wrong way 'round - the "surviving" (currently running)
image would be self-consistent with itself, etc., just not (quite) consistent
with external reality, as some changes on the VM image would've been lost.
So, most likely a VM disk image boo-boo (e.g. migrated VM but failed to migrate/refresh disk image, or copied wrong way 'round).
"Prevention" or reducing probablity of same/simlar issues again?
Could possibly add something to check the mtime of the image to prevent
copying the wrong way 'round - but that does nothing to prevent failing to
copy/refresh the image, which could still result in the same issue.
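E.g., a hedged sketch of such an mtime sanity check - path and host as in the
copy history earlier; this is illustrative, not an existing script:

  # hedged sketch - refuse to copy if the local (source) image is older than
  # the remote (target) copy, i.e. guard against copying the wrong way 'round
  img=/var/local/balug/balug-sda
  local_mtime=$(stat -c %Y "$img")
  remote_mtime=$(ssh -ax -l root 192.168.55.2 "stat -c %Y $img")
  if [ "$local_mtime" -lt "$remote_mtime" ]; then
      echo 'local image is OLDER than remote - aborting copy' 1>&2
      exit 1
  fi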
Could do a higher-level program to manage those particular migrations, which
would make them more goof-resistant.
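E.g., a hedged sketch of pre-flight checks such a wrapper might do before a
live migration - the domain name, destination URI, and particular checks are
illustrative assumptions, not the actual program:

  # hedged sketch - only migrate if the domain is running here and not
  # already running on the target; names/URIs are illustrative
  dom=balug
  target=qemu+ssh://root@192.168.55.2/system
  if [ "$(virsh domstate "$dom")" != running ]; then
      echo "$dom is not running locally - aborting" 1>&2
      exit 1
  fi
  if virsh --connect "$target" domstate "$dom" 2>/dev/null |
      grep -q '^running$'
  then
      echo "$dom already running on target - aborting" 1>&2
      exit 1
  fi
  virsh migrate --live --persistent --copy-storage-all --p2p --tunnelled \
      "$dom" "$target"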
# virsh migrate --live --copy-storage-all
would reduce the probability of error, but current Debian stable has a bug
that keeps that from working (it worked in oldstable, and will likely work in
the forthcoming stable).  Might be worth checking if there's a fix for that in
backports, as that would also have the benefit of reducing VM downtime.
... checking the bug data and such, looks like it's likely fixed in backports,
the forthcoming stable, testing, and unstable.
May or may not be worth doing anything explicitly about - with reasonably
attentive operations, it's not at all an impossible error to repeat, but it's
also fairly low probability, and the results aren't too nastily horrific -
though it of course would be better to not have such boo-boos.
Level-of-effort vs. risk ... and after examining this DNS data boo-boo, and
how it probably happened, operations will also likely be more attentive, and
even less likely to have the same (type of) boo-boo again (at least in the
nearish future).  Also, longer term, the upcoming stable will become stable,
both physical hosts will get upgraded, and then with
# virsh migrate --live --copy-storage-all
the probability of such an (operator!) error again goes way down.
Other bits - some more slaves (even non-advertised (no NS records) ones for
the same) would also help in logging/analyzing data (e.g. another "external
reality" point of reference/logging relative to the VM and its view of
"reality" - which might possibly suffer from, uh, "amnesia", in some (boo-boo
- or significant hardware failure) cases).
Also suggested: thing(s) to (more) carefully watch the logs, and catch/notify
on issues ... but in the above case and with the presumed cause of the issue,
such on the VM itself wouldn't have detected any issues ... but monitoring
external to the VM would have caught the issue - e.g. on the nominal physical
host - as that's up *most* of the time (whereas the alternate physical spends
most of its time down).  Checks pre- and post-migration could also help thwart
such boo-boos.
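E.g., a hedged sketch of a small external check that could run (say from cron
on the nominal physical host) to flag a retrograde serial - the zone, master
address, and state-file path are illustrative:

  # hedged sketch - compare the master's current SOA serial against the last
  # one seen, and complain if it ever goes backwards; names/paths illustrative
  zone=balug.org
  master=198.144.194.238
  state=/var/local/soa-serial.$zone
  new=$(dig +short @"$master" "$zone" SOA | awk '{print $3}')
  old=$(cat "$state" 2>/dev/null || echo 0)
  if [ -n "$new" ] && [ "$new" -lt "$old" ]; then
      echo "WARNING: $zone SOA serial went backwards: $old -> $new" 1>&2
  elif [ -n "$new" ]; then
      echo "$new" > "$state"
  fi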
Also, operator(s) doing the migrations when more attentive, less
rushed/tired/sleepy would also reduce the probability of error(s).  Having and
(strictly) following a well-tested and debugged checklist would also help
(what exists presently is a pretty good "outline" plus some relevant script
bits, but it's not exactly in a full checklist format to also take steps to
thwart at least the more probable boo-boos).
From: "Michael Paoli" Michael.Paoli@cal.berkeley.edu Subject: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this 8-O ... "fixed"[1] Date: Thu, 30 May 2019 14:28:19 -0700
Okay, fixed[1]. I'll dig[2] into it later to investigate how issue did or likely occurred.
Thanks Rick for catching that!
footnotes/references/excerpts:
1. For certain definitions of "fixed" - serial numbers corrected, didn't
   check/validate anything else in particular - just "bumped" (updated) them
   and "pushed" (notified) 'em out, and rechecked 'till they all appeared out
   there okay on master(s) & slaves.
   $ DNS_SOA_CK balug.org sf-lug.org
   FQDN=balug.org.
   authority:
   balug.org. 86400 IN NS ns1.balug.org.
   balug.org. 86400 IN NS ns1.linuxmafia.com.
   balug.org. 86400 IN NS ns1.svlug.org.
   balug.org. 86400 IN NS puck.nether.net.
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @198.144.194.238 (ns1.balug.org.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @2001:470:1f04:19e::2 (ns1.balug.org.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @198.144.195.186 (ns1.linuxmafia.com.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @64.62.190.98 (ns1.svlug.org.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @204.42.254.5 (puck.nether.net.)
   balug.org. IN SOA ns1.balug.org. hostmaster.balug.org. 1559249859 9000 1800 1814400 86400 @2001:418:3f4::5 (puck.nether.net.)
   FQDN=sf-lug.org.
   authority:
   sf-lug.org. 86400 IN NS ns.primate.net.
   sf-lug.org. 86400 IN NS ns1.linuxmafia.com.
   sf-lug.org. 86400 IN NS ns1.sf-lug.org.
   sf-lug.org. 86400 IN NS ns1.svlug.org.
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @198.144.194.12 (ns.primate.net.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @2001:470:1f04:51a::2 (ns.primate.net.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @198.144.195.186 (ns1.linuxmafia.com.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @198.144.194.238 (ns1.sf-lug.org.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @2001:470:1f04:19e::2 (ns1.sf-lug.org.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @64.62.190.98 (ns1.svlug.org.)
   sf-lug.org. IN SOA ns1.sf-lug.org. jim.well.com. 1559249874 10800 3600 1209600 86400 @2600:3c01::f03c:91ff:fe96:e78e (ns1.svlug.org.)
   $
2. pun or the like retroactively intended.  ;-)
3. unreferenced footnote.  My DNS_SOA_CK program essentially grabs the
   "upstream" delegating NS records (at least one instance of 'em), gets the A
   and AAAA records of all the delegated nameservers, then gets the SOA
   records from each of those, and displays that data, for the domain(s)
   specified - or a default set if none specified.  (I wrote it semi-recently
   - I got tired of doing it semi-manually on a semi-frequent basis; very
   handy for, among other things, also checking that master(s) and slaves are
   caught up when going through letsencrypt.org wildcard cert validation
   request via DNS verification; also very handy to see that the delegated
   nameservers are responding and with the expected data (or at least the
   expected zone S/N).)
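(A hedged sketch of roughly that kind of check follows - simplified, e.g. it
takes the NS set from an ordinary resolver query rather than strictly from the
delegating parent, so it is not the actual DNS_SOA_CK:)

  #!/bin/sh
  # hedged sketch of a DNS_SOA_CK-like check - for each domain given as an
  # argument, list its NS records, resolve each nameserver's A/AAAA
  # addresses, and query each address for the zone's SOA
  for domain
  do
      echo "FQDN=$domain."
      for ns in $(dig +short "$domain" NS)
      do
          for addr in $(dig +short "$ns" A) $(dig +short "$ns" AAAA)
          do
              printf '%s IN SOA %s @%s (%s)\n' "$domain." \
                  "$(dig +short @"$addr" "$domain" SOA)" "$addr" "$ns"
          done
      done
  done

Usage, e.g.: sh dns_soa_ck.sh balug.org sf-lug.org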
From: "Michael Paoli" Michael.Paoli@cal.berkeley.edu Subject: Re: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this 8-O Date: Thu, 30 May 2019 13:50:37 -0700
From: "Rick Moen" rick@linuxmafia.com Subject: [BALUG-Admin] S/N bobble -- waiting for Michael P. on this Date: Thu, 30 May 2019 11:20:26 -0700
Magic 8-ball (i.e., logcheck on ns1.linuxmafia.com) says:
System Events
May 30 10:05:36 linuxmafia named[11750]: zone balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
May 30 10:30:43 linuxmafia named[11750]: zone sf-lug.org/IN: serial number (1558622463) received from master 198.144.194.238#53 < ours (1558799278)
May 30 10:32:42 linuxmafia named[11750]: zone balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
May 30 10:57:43 linuxmafia named[11750]: zone balug.org/IN: serial number (1558725628) received from master 198.144.194.238#53 < ours (1558799284)
Er?
Michael, O Great Oracle of the DNS master, before I go expunging the local
cached zone on ns1.linuxmafia.com so as to converge on the master, any
thoughts or desire to act on your end?  Normally, I would expect the current
situation to be _strenuously avoided_ by never taking S/Ns in a retrograde
direction on a zone's DNS master, so I infer that investigation may be in
order (or at least brief discussion).
8-O Oops, ... that should'a never happened.  I'll investigate & correct.
Shouldn't require any explicit slave action.
I wonder if maybe a VM came up that shouldn't have, or ???  Anyway, will check
into it and correct (might be busy mostly with other stuff 'till about this
evening or so, but expect I'll have it rectified by/around then ... might then
take a wee bit for the slaves to follow along & get themselves corrected - but
likely pretty fast on that and without explicit slave action needed).