From: "Rick Moen" <rick@linuxmafia.com>
Subject: Re: [BALUG-Admin] Root Cause Analysis (RCA)*: Re: S/N bobble -- waiting for Michael P. on this 8-O ... "fixed"[1]
Date: Fri, 31 May 2019 22:45:13 -0700
Quoting Michael Paoli (Michael.Paoli@cal.berkeley.edu):
Also suggested: thing(s) to (more) carefully watch logs, and catch/notify on issues ... but in the above case, with the presumed cause of the issue, such monitoring on the VM itself wouldn't have detected anything. Monitoring external to the VM, however, would have caught the issue - e.g. on the nominal physical host, as that's up *most* of the time (whereas the alternate physical host spends most of its time down).
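A check along those lines could run from cron on any box external to the VM. Here's a minimal POSIX-sh sketch of comparing a zone's SOA serial as seen by two servers; the helper names, the placeholder zone, and the example.com server names are illustrative assumptions, not the actual BALUG hosts. A live run would feed it `dig +short SOA` output as shown in the comments.

```shell
# Extract the serial (3rd RDATA field) from a short-form SOA answer,
# i.e. the output of something like:
#   dig +short SOA balug.org @ns1.example.com
soa_serial() {
    echo "$1" | awk '{print $3}'
}

# Return 0 if two SOA answers carry the same serial, non-zero otherwise.
serials_match() {
    [ "$(soa_serial "$1")" = "$(soa_serial "$2")" ]
}

# Against live servers this would look like (names are placeholders):
#   m=$(dig +short SOA balug.org @ns1.example.com)
#   s=$(dig +short SOA balug.org @ns2.example.com)
#   serials_match "$m" "$s" || echo "serial mismatch on balug.org" >&2

# Demonstration with canned SOA records:
a='ns1.example.com. hostmaster.example.com. 2019053101 3600 900 604800 300'
b='ns1.example.com. hostmaster.example.com. 2019053100 3600 900 604800 300'
serials_match "$a" "$a" && echo "a vs a: match"
serials_match "$a" "$b" || echo "a vs b: MISMATCH"
```

Run from cron, a non-zero exit (or the MISMATCH line on stderr) is enough to generate notification mail.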
Yeah, that's the bit that I'd already spent time gnawing over. As you say, following a VM image shuffle boo-boo, the DNS master sees absolutely nothing out of kilter, so there's no obvious way for it to detect being, um, the wrong clone at the wrong time, sort of.
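One external symptom a wrong-clone-at-the-wrong-time would show is the zone serial going *backwards* between observations. A sketch of tracking that, again POSIX sh with illustrative names; note real DNS serials use serial-number arithmetic (RFC 1982), so the plain numeric comparison here is a simplification:

```shell
# Record the zone serial seen on each check and flag a regression --
# a serial lower than the last one seen, which is what a stale VM
# clone coming back up would look like from outside.
check_serial() {
    # $1 = newly observed serial, $2 = state file holding last seen serial
    new="$1"; state="$2"
    old=$(cat "$state" 2>/dev/null)
    [ -n "$old" ] || old=0
    if [ "$new" -lt "$old" ]; then
        echo "REGRESSION: serial went from $old back to $new" >&2
        return 1
    fi
    echo "$new" > "$state"
    return 0
}

# Demonstration with a throwaway state file:
state=$(mktemp)
check_serial 2019053101 "$state" && echo "first check ok"
check_serial 2019053102 "$state" && echo "normal increase ok"
check_serial 2019053099 "$state" || echo "stale clone detected"
rm -f "$state"
```

In real use the observed serial would come from `dig +short SOA` against the master, and a cron wrapper would mail on the non-zero exit.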
Yup ... one (or a few) step(s) closer to being able to better monitor such a potential boo-boo.
On the nominal physical host of the VM - which is up most of the time - I added it as a (stealth, non-NS-delegated) slave of most of the relevant zones/domains. I'll recheck that I have all the relevant ones (may have missed BALUG.org; covered the SFLUG and BerkeleyLUG domains).

Also just looked over the also-notify settings - many already had the relevant IPs in there (doesn't particularly hurt even if there's no slave there - I generally cover my whole teensy local subnet of IPv4 Internet-addressable IPs, "just in case") ... but I also noticed some don't have those IPs, so there are still some to add. That will help ensure, at least in the nominal case, that the slaves are typically quickly made aware of (presumably) newer DNS data to pull (transfer).

Anyway, that alone doesn't fully address the matter, but it does at least externally log some key relevant data - along with some other minor advantages - so it's at least a place where such issue(s) may also potentially be detected.
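For concreteness, a stealth slave plus also-notify arrangement in BIND would look roughly like the sketch below. The file paths, the RFC 5737 documentation IPs, and the single zone shown are placeholders, not the actual configuration; newer BIND also accepts `primaries` in place of `masters`.

```conf
// On the master: NOTIFY the stealth slave's IP and allow it to transfer
// (192.0.2.10 stands in for the nominal physical host's address).
zone "balug.org" {
    type master;
    file "/etc/bind/db.balug.org";
    also-notify { 192.0.2.10; };
    allow-transfer { 192.0.2.10; };
};

// On the nominal physical host: stealth (non-NS-delegated) slave.
// It transfers and logs the zone but is never advertised in NS records.
zone "balug.org" {
    type slave;
    file "/var/cache/bind/db.balug.org";
    masters { 198.51.100.5; };
};
```

The slave's transfer log (serial numbers and timestamps) is the externally retained data that makes a later wrong-clone boo-boo detectable.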
And why stealth, non-NS-delegated slaves? Not enough value added to be delegated, advertised NS slaves - far too many single points of failure in common (e.g. location, power, gateway, DSL, and, >>~90% of the time, a whole lot o' hardware in common, etc.). Much more useful to have other NS slaves that are much less likely to fail or be unreachable at the same time.
Oh, and another thing I failed to mention in the Root Cause Analysis (RCA). For more boo-boo resistance, and reasonable space permitting, when doing a host-to-host migration of the VM, rather than simply overwriting the disk image, copy/rotate it out elsewhere on that physical host and keep it some moderate while. While that doesn't itself *prevent* the problem/issue, it would allow for better analysis/recovery/"fixing" of any data "damage"/loss that might have occurred. (Yet another bit of "backup" - it wouldn't change the frequency of the disk image copies, but it would, perhaps significantly, increase the retention - from a mere active copy plus the last active one on the other host, to additional copies on each.)

"Of course" those aren't the only backups, but the others aren't as frequent (fulls nominally monthly, rotated off-site, and quite a number of those preserved going back quite a ways in time). Might also be good to have some of the more critical data backed up a bit more frequently, e.g. DNSSEC KSKs, TLS/"SSL" keys & certs ... maybe some other bits too. Ah well - more resources, and some regular rsync + snapshots, would also cover that well (and could be more comprehensive too).
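The copy/rotate-before-overwrite step could be sketched like this in POSIX sh - paths, the function name, and the retention depth are all illustrative, not the actual layout; the real flow would rotate, then receive the incoming image over the old path:

```shell
# Before overwriting a VM disk image on the receiving physical host,
# rotate the existing image aside and keep the last few copies.
rotate_image() {
    # $1 = path to live disk image, $2 = how many old copies to keep
    img="$1"; keep="$2"
    [ -e "$img" ] || return 0          # nothing to rotate on first migration
    i=$keep
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        [ -e "$img.$prev" ] && mv "$img.$prev" "$img.$i"
        i=$prev
    done
    cp -p "$img" "$img.1"              # keep the about-to-be-replaced image
}

# Demonstration with a throwaway file standing in for the disk image:
dir=$(mktemp -d)
echo "generation A" > "$dir/vm.img"
rotate_image "$dir/vm.img" 3           # then the incoming image overwrites vm.img
echo "generation B" > "$dir/vm.img"
rotate_image "$dir/vm.img" 3
echo "generation C" > "$dir/vm.img"
ls "$dir"                              # vm.img plus the two rotated generations
rm -rf "$dir"
```

With retention depth 3, each physical host ends up holding the active image plus up to three prior generations, rather than just the active and last-active copies.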