[BALUG-Admin] Root Cause Analysis (RCA)*: Re: S/N bobble -- waiting for Michael P. on this 8-O ... "fixed"[1]

Michael Paoli Michael.Paoli@cal.berkeley.edu
Sat Jun 1 06:49:36 UTC 2019


> From: "Rick Moen" <rick@linuxmafia.com>
> Subject: Re: [BALUG-Admin] Root Cause Analysis (RCA)*: Re: S/N  
> bobble -- waiting for Michael P. on this 8-O ... "fixed"[1]
> Date: Fri, 31 May 2019 22:45:13 -0700

> Quoting Michael Paoli (Michael.Paoli@cal.berkeley.edu):
>
>> Also suggested, thing(s) to (more) carefully watch logs, and
>> catch/notify on issues ... but in the above case and with
>> presumed cause of issue, such on the VM itself wouldn't have
>> detected any issues ... but monitoring external to the VM
>> would have caught the issue - e.g. on the nominal physical
>> host - as that's up *most* of the time (whereas the
>> alternate physical spends most of its time down).
>
> Yeah, that's the bit that I'd already spent time gnawing over.
> On the DNS master, as you say, following a VM image shuffle boo-boo, the
> DNS master sees absolutely nothing out of kilter, so there's no obvious
> way for it to detect being, um, the wrong clone at the wrong time, sort

Yup ... one (or a few) step(s) closer to being able to better monitor
such a potential boo-boo.

On the nominal physical host of the VM - which is up most of
the time - I added it as a (stealth, non-NS-delegated) slave
for most of the relevant zones/domains.  I'll recheck to see
that I have all the relevant ones (may have missed BALUG.org;
covered the SFLUG and BerkeleyLUG domains).  I also just looked
over the also-notify settings - many already had the relevant IPs
in there (that doesn't particularly hurt even if there's no
slave there - I generally cover my whole teensy local
subnet of IPv4 Internet-addressable IPs, "just in case") ...
but I also noticed some don't have those IPs in there, so
there are still some to add - that will help ensure, at least in the
nominal case, that the slaves are at least typically quickly
aware of (presumably) newer DNS data to pull (transfer).
Anyway, that alone doesn't fully address the matter, but it does at
least externally log some key relevant data - along with some
other minor advantages -
so it's at least potentially a useful place where such issue(s)
may also be detected.
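That external check can be sketched as a small shell snippet that compares SOA serials as seen by the master (the VM) vs. the stealth slave on the physical host.  The hostnames and comparison are illustrative only - the host names below are made-up placeholders, and the comparison follows RFC 1982 serial-number arithmetic:

```shell
#!/bin/sh
# Hedged sketch: from outside the VM, compare a zone's SOA serial on
# the DNS master (the VM) vs. the stealth slave on the physical host.
# Host names below are hypothetical placeholders.

# RFC 1982 serial arithmetic: true if serial $1 is newer than serial $2.
serial_newer() {
    d=$(( ($1 - $2) & 0xFFFFFFFF ))        # difference mod 2^32
    [ "$d" -gt 0 ] && [ "$d" -lt 2147483648 ]
}

# Fetch just the SOA serial of zone $1 from server $2 (needs dig + network).
soa_serial() {
    dig +short SOA "$1" "@$2" | awk '{print $3}'
}

# Example usage (uncomment; server names are made up):
# m=$(soa_serial balug.org. master.example.com)
# s=$(soa_serial balug.org. stealth-slave.example.com)
# # If an older VM image were booted, the master's serial would fall
# # *behind* the slave's - that's the case worth alerting on:
# if serial_newer "$s" "$m"; then
#     echo "WARNING: master serial $m is behind slave serial $s" >&2
# fi
```

Note the direction of the alert: a stale slave is routine (it just hasn't transferred yet), whereas a master *behind* its own slave is exactly the wrong-clone-at-the-wrong-time symptom discussed above.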

And why stealth, non-NS-delegated slaves?
There's not enough added value in making them delegated, advertised
NS slaves - they share far too many single points of failure (e.g.
location, power, gateway, DSL, and >>~90% of the time a whole lot of
hardware in common, etc.).  It's much more useful to have
other NS slaves which are much less likely to fail or
be unreachable at the same time.

Oh, and another thing I failed to mention in the
Root Cause Analysis (RCA).
For more boo-boo resistance, and reasonable space permitting,
when doing a host-to-host migration of the VM, rather than just
overwriting the disk image, copy/rotate it out elsewhere on that
physical host and keep it for some moderate while.  While that
doesn't itself *prevent* the problem/issue, it would allow for
better analysis/recovery/"fixing" of any data "damage"/loss that
might have occurred.  (Yet another bit of "backup" - it wouldn't
change the frequency of what happens in that regard for the
disk image copies ... but it would, perhaps significantly, increase the
retention - from merely the active copy plus the last active one on the
other host, to additional copies on each.)  "Of course" those aren't the
only backups, but the others aren't as frequent (fulls nominally
monthly, rotated off-site, with quite a number of those
preserved going back quite a ways in time).
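The copy/rotate step could look something like the following - a minimal sketch, with the image path and retention period as purely hypothetical placeholders:

```shell
#!/bin/sh
# Hedged sketch: before overwriting a VM disk image during a
# host-to-host migration, rotate the existing image aside with a
# UTC timestamp suffix so the prior copy is retained for a while.

rotate_image() {
    # Move $1 aside with a timestamp suffix, if it exists.
    if [ -e "$1" ]; then
        mv "$1" "$1.$(date -u +%Y%m%dT%H%M%SZ)"
    fi
}

# Example usage (hypothetical path):
# rotate_image /var/lib/libvirt/images/balug-vm.img
# ... then copy the incoming image into place, e.g.:
# rsync otherhost:/var/lib/libvirt/images/balug-vm.img \
#       /var/lib/libvirt/images/balug-vm.img
# and prune rotated copies older than, say, ~30 days:
# find /var/lib/libvirt/images -name 'balug-vm.img.2*' -mtime +30 -delete
```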
It might also be good to have some of the more critical
data backed up a bit more frequently, e.g. DNSSEC KSKs,
TLS/"SSL" keys & certs ... maybe some other bits
too.  Ah well, a bit more in resources, and some more regular
rsync + snapshots would also cover that well (and
could be more comprehensive too).
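The rsync + snapshots idea can be sketched with rsync's --link-dest option, which hard-links unchanged files to the previous snapshot so frequent runs cost little space.  All paths here are hypothetical placeholders:

```shell
#!/bin/sh
# Hedged sketch: frequent rsync into dated hard-link snapshots for the
# more critical bits (e.g. DNSSEC KSKs, TLS keys & certs).  With
# --link-dest, files unchanged since the previous snapshot are
# hard-linked rather than copied.  Paths are hypothetical.

snapshot_backup() {
    # $1: source directory; $2: snapshot root directory
    snap="$2/$(date -u +%Y%m%dT%H%M%SZ)"
    mkdir -p "$2"
    # On the first run "latest" doesn't exist yet; rsync warns and
    # does a full copy, which is the desired behavior.
    rsync -a --link-dest="$2/latest/" "$1"/ "$snap"/
    ln -sfn "$snap" "$2/latest"      # re-point "latest" at new snapshot
}

# Example usage (hypothetical paths):
# snapshot_backup /etc/bind/keys   /backup/dnssec-keys
# snapshot_backup /etc/ssl/private /backup/tls-keys
```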




More information about the BALUG-Admin mailing list