BALUG-Admin
October 2017 (2 participants, 4 discussions)
Hey, Michael P.! Remember when we noticed this one, I filed a bug
report on July 29th, and upstream Mailman maintainer Mark Sapiro
immediately furnished a patch? Here it is, closing the loop: the fix
status has been changed from Committed to Released.
----- Forwarded message from Mark Sapiro <mark(a)msapiro.net> -----
Date: Thu, 26 Oct 2017 21:22:50 -0000
From: Mark Sapiro <mark(a)msapiro.net>
To: rick(a)linuxmafia.com
Subject: [Bug 1707447] Re: Roster should not lowercase addresses
Reply-To: Bug 1707447 <1707447(a)bugs.launchpad.net>
** Changed in: mailman
Status: Fix Committed => Fix Released
--
You received this bug notification because you are subscribed to the bug
report.
https://bugs.launchpad.net/bugs/1707447
Title:
Roster should not lowercase addresses
To manage notifications about this bug go to:
https://bugs.launchpad.net/mailman/+bug/1707447/+subscriptions
----- End forwarded message -----
Thought you'd be amused at this cautionary tale, Michael P.
----- Forwarded message from Rick Moen <rick(a)linuxmafia.com> -----
Date: Mon, 16 Oct 2017 10:05:29 -0700
From: Rick Moen <rick(a)linuxmafia.com>
To: Duncan MacKinnon <duncan1(a)gmail.com>
Subject: Rest of that story about secondary DNS
Organization: If you lived here, you'd be $HOME already.
I _think_ I started to tell you a story about doing secondary DNS for
people, and something I learned. Of course, the standard model is
supposed to be: You do auth DNS for my domains; I do it for yours.
Years ago, I started to see the flaw in that optimistic,
we-help-each-other mental model. There was a user group in Santa Cruz,
SMAUG, which owned domain 'scruz.net'. (Terrible naming and choice of
domain; not my doing, not my call.) I ended up being primary/master DNS, and
we were doing really well because we signed up five more individuals
with auth nameservers to help out with secondary/slave DNS, for six
auth nameservers total, widely dispersed geographically. That's
fabulous redundancy. What could possibly go wrong? </deadpan>
I relaxed about quality of service, because obviously we were way ahead
of the game. (SMAUG had a mailing list on the SVLUG mailing list
server. The mailing list still exists, derelict, the group having now
fallen apart.) Roll forward to one day when my household uplink through
Raw Bandwidth Communications was offline for about an hour because
SBC/AT&T shot the company in the foot.
When my aDSL came back, I found postings to SMAUG's mailing list
bitching about the scruz.net nameservice having been totally offline.
I noticed that some of the complaining came from the five individuals
who were allegedly doing secondary/slave nameservice. Hmm?
So, I checked on the five secondaries. Certainly, my aDSL being offline
for an hour should not have taken all DNS offline. And what I found
was: Over about a two-year period, some of the five had moved their
nameservers to new IPs and failed to notify me as master nameserver
admin. Some had ceased doing auth DNS entirely, and failed to notify me
as master nameserver admin. Some still had the same nameserver running
at the same IP as always, but had quietly ceased doing auth namservice
for scruz.net, and failed to notify me as master nameserver admin.
All of the nameserver IPs they'd provided me for their secondary
nameservice were still listed in the whois (and as NS lines for the
domain in the parent .net zone). But exactly one nameserver still
existed and was actually _doing_ auth DNS for scruz.net -- mine.
All five of the others had silently flaked out. Which made it extra
galling that some of these guys complained about -my- nameservice being
unreliable, since theirs was 100% unreliable, their having broken it
in various ways, whereas mine worked great except once in a blue moon
when my uplink went down.
I thought: OK, obviously it turns out to be a mistake to just trust
that secondaries will continue to exist and that their operators will
do due-diligence communication with the primary when something important
changes. They _should_, but it turns out they don't.
So, I wrote a weekly cron script to check on all the secondaries for my
two domains, linuxmafia.com and unixmercenary.net: It queries and
reports the parent-zone NS "glue" records, queries and reports the
nameservers declared authoritative in whois, and reports each auth
nameserver's zonefile S/N so I can make sure they all respond and give
the same value. This means I can detect and act on flaky secondaries.
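(As a rough sketch of the idea -- not the actual script; the domains,
tools, and output format here are just assumptions -- such a weekly
check might look something like this:)

  #!/bin/sh
  # Sketch of a weekly secondary-DNS sanity check.
  # For each domain: show the NS delegation per the parent zone, the
  # nameservers listed in whois, and the SOA serial each delegated
  # nameserver actually returns, so dead or out-of-sync secondaries
  # stand out.
  for domain in linuxmafia.com unixmercenary.net
  do
      echo "== $domain =="
      parent_zone=$(echo "$domain" | cut -d. -f2-)
      parent_ns=$(dig +short NS "$parent_zone." | head -1)
      echo "-- delegation per parent zone ($parent_ns):"
      dig +norecurse +noall +answer +authority NS "$domain" @"$parent_ns"
      echo "-- nameservers per whois:"
      whois "$domain" | grep -i 'name server'
      for ns in $(dig +short NS "$domain")
      do
          serial=$(dig +short +norecurse SOA "$domain" @"$ns" | awk '{print $3}')
          echo "-- $ns serial: ${serial:-NO ANSWER}"
      done
  done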
What I did _not_ do was bother writing a script to check on other
people's master nameservers, on domains for which *I* do secondary
nameservice. Failures in this case are almost entirely the domain
owner's problem, not mine. As long as I keep _my_ word for quality of
secondary DNS, I'm OK.
Well, almost. I do secondary for five or six domains Ruben Safir owns,
and recently double-checked those. For most of them, I advised Ruben
_again_ that having only two auth nameservers isn't enough and is
dangerously thin. I urged him to find a couple more, somewhere.
For one of them, nylxs.com, I noticed and advised Ruben that _neither_
his nor my auth nameservers were authoritative any more. Instead of
WWW2.MRBRKLYN.COM
NS1.LINUXMAFIA.COM
the records now listed the auth nameservers like so:
$ whois nylxs.com | grep 'Name Server'
Name Server: NS69.DOMAINCONTROL.COM
Name Server: NS70.DOMAINCONTROL.COM
$
I wrote Ruben: 'I find that you have _ceased_ using my secondary
(slave) nameservice, but neglected to inform me. That's rude, Ruben.
You need to friggin' tell your secondaries if/when you move auth
nameservice somewhere else. Grr. Learn to do it right, already!'
Turns out, that's not what happened, exactly.
Ruben had failed to pay his domain renewal, so his registrar (GoDaddy)
had repointed its DNS to 'parked domain' nameservice from its own
domaincontrol.com nameservers, making those authoritative in place of
Ruben's and mine.
Luckily, because I (in effect) warned Ruben of his expired domain in
time, he was able to renew it.
So, lesson: When you find yourself annoyingly still doing futile
secondary DNS for a domain whose owner _seems_ to have moved auth
nameservice elsewhere without telling you, the explanation isn't always
owner lack of diligence concerning communicating with secondaries:
Sometimes, it's merely owner lack of diligence in paying the bill.
----- End forwarded message -----
Michael P.:
I periodically check my mail server IP on http://multirbl.valli.org/
and/or http://www.dnsbl.info/ , to make sure it's not on any blocklist.
(It was recently on one for cryptic reasons. When I politely inquired
and stressed that I'd be glad to fix any problem but didn't understand
the existing one, the listing was removed without comment.)
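(Those web forms just automate DNS queries; a single blocklist can also
be checked directly by reversing the IPv4 octets and looking them up
under the list's zone. A sketch, using zen.spamhaus.org purely as an
example of such a zone and 198.144.194.238 -- the lists.balug.org
address checked below -- as the example address:)

  $ host 238.194.144.198.zen.spamhaus.org

An NXDOMAIN answer means "not listed" in that particular zone; an
answer in 127.0.0.0/8 means it is listed.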
$ host lists.balug.org
lists.balug.org has address 198.144.194.238
lists.balug.org has IPv6 address 2001:470:1f04:19e::2
lists.balug.org mail is handled by 100 mx.lists.balug.org.
$
http://multirbl.valli.org/lookup/198.144.194.238.html shows that one
cluster of blocklists, the rfc-clueless.org one, doesn't like your IP.
This is the successor to Derek Balling's rfc-ignorant.org DNSBL, which
Derek eventually shut down, and has the same mission. The listing
policy is here: http://rfc-clueless.org/pages/listing_policy
Looks like they believe that 198.144.194.238 isn't accepting mail to
postmaster, an RFC requirement for any FQDN that deals in SMTP
(http://rfc-clueless.org/pages/listing_policy-postmaster).
When I did a quick check, telnetting to 25/tcp, it seemed to me that the
system _was_ going to accept my manually composed mail to
postmaster(a)lists.balug.org -- so I'm unclear on why that listing's
there.
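(For reference, that sort of manual check amounts to roughly the
following -- commands only, sender address purely illustrative. A 2xx
reply to the RCPT line means mail to postmaster is being accepted; a
5xx reply would explain the listing.)

  $ telnet lists.balug.org 25
  HELO probe.example.com
  MAIL FROM:<probe@example.com>
  RCPT TO:<postmaster@lists.balug.org>
  QUIT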
Seems like they also have no-postmaster listings for FQDNs
balug-sf-lug-v2.balug.org, balug.org, and temp.balug.org. Maybe you
should actually cease accepting mail for those FQDNs.
balug.org(/sf-lug.{org, com}) host OOM oops Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
by Michael Paoli, 04 Oct '17
To on-list, because, well, ... why not? :-)
And ... BALUG-Admin, (and/or) SF-LUG?
Well, it's more so balug.org host also running some SF-LUG
services, rather than vice versa, so ...
"She's dead, Jim." - Rick Moen (Thanks Rick!) noticed some issues,
checked a bit and ... named was no longer running! 8-O
Host was still up and otherwise seemed (relatively) healthy,
but ... what happened to named? Did some digging ... not dig(1)
particularly, but ... checking logs.
So ... restarted named ... seemed healthy and okay and such.
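(On this Debian/systemd host, the restart and a quick sanity check
amount to roughly the following -- a sketch, with the zone picked just
as an example:)

  # systemctl restart bind9
  # systemctl status bind9
  # dig +norecurse +short SOA balug.org @127.0.0.1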
Started checking logs ...
/var/log/daemon.log*
Sep 29 02:58:37 balug-sf-lug-v2 systemd[1]: Unit bind9.service entered
failed state.
That seemed about the last peep about it ... but there were indications
of other problems ... notably things apparently being killed off. :-/
/var/log/messages*
Uh oh, ... the dreaded OOM PID killer! ...
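(A quick way to turn those up -- a sketch; zgrep so rotated logs are
covered too:)

  # zgrep -h -e 'invoked oom-killer' -e 'Out of memory' /var/log/messages*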
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.816128]
fail2ban-server invoked oom-killer: gfp_mask=0x201da, order=0,
oom_score_adj=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.830751]
fail2ban-server cpuset=/ mems_allowed=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.832896] CPU: 0 PID:
914 Comm: fail2ban-server Not tainted 3.16.0-4-amd64 #1 Debian
3.16.43-2+deb8u5
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.834810] Hardware name:
Bochs Bochs, BIOS Bochs 01/01/2011
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.835826]
0000000000000000 ffffffff81514291 ffff88003ca049e0 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
ffffffff81511e69 0000000000000000 ffffffff810d6f6f 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
ffffffff81518c4e 0000000000000200 ffffffff81068a53 ffffffff810c44e4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] Call Trace:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81514291>] ? dump_stack+0x5d/0x78
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81511e69>] ? dump_header+0x76/0x1e8
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810d6f6f>] ? smp_call_function_single+0x5f/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81518c4e>] ? mutex_lock+0xe/0x2a
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81068a53>] ? put_online_cpus+0x23/0x80
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810c44e4>] ? rcu_oom_notify+0xc4/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8115431c>] ? do_try_to_free_pages+0x4ac/0x520
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81142ddd>] ? oom_kill_process+0x21d/0x370
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8114299d>] ? find_lock_task_mm+0x3d/0x90
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81143543>] ? out_of_memory+0x473/0x4b0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8114940f>] ? __alloc_pages_nodemask+0x9ef/0xb50
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8118894d>] ? alloc_pages_current+0x9d/0x150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81141b40>] ? filemap_fault+0x1a0/0x420
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81167d0a>] ? __do_fault+0x3a/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8116a8ce>] ? do_read_fault.isra.54+0x4e/0x300
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8116c0fc>] ? handle_mm_fault+0x63c/0x1150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810582c7>] ? __do_page_fault+0x177/0x4f0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8109ea37>] ? put_prev_entity+0x57/0x350
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8105299b>] ? kvm_clock_get_cycles+0x1b/0x20
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810c9b32>] ? ktime_get_ts+0x42/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff811be157>] ? poll_select_copy_remaining+0xe7/0x140
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8151c4d8>] ? async_page_fault+0x28/0x30
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.875236] Mem-Info:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.876102] Node 0 DMA per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.877119] CPU 0: hi:
0, btch: 1 usd: 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.878426] Node 0 DMA32 per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.879490] CPU 0: hi:
186, btch: 31 usd: 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]
active_anon:113183 inactive_anon:113245 isolated_anon:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]
active_file:46 inactive_file:58 isolated_file:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] unevictable:0
dirty:0 writeback:0 unstable:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] free:12241
slab_reclaimable:3169 slab_unreclaimable:5274
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] mapped:2458
shmem:2849 pagetables:5275 bounce:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] free_cma:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.888200] Node 0 DMA
free:4636kB min:700kB low:872kB high:1048kB active_anon:5292kB
inactive_anon:5328kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB
mlocked:0kB dirty:0kB writeback:0kB mapped:280kB shmem:280kB
slab_reclaimable:104kB slab_unreclaimable:244kB kernel_stack:48kB
pagetables:124kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.897611]
lowmem_reserve[]: 0 982 982 982
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.899344] Node 0 DMA32
free:44328kB min:44352kB low:55440kB high:66528kB active_anon:447440kB
inactive_anon:447652kB active_file:184kB inactive_file:232kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1032184kB managed:1008516kB mlocked:0kB dirty:0kB
writeback:0kB mapped:9552kB shmem:11116kB slab_reclaimable:12572kB
slab_unreclaimable:20852kB kernel_stack:3008kB pagetables:20976kB
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:727 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.910174]
lowmem_reserve[]: 0 0 0 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.911889] Node 0 DMA:
5*4kB (M) 1*8kB (U) 10*16kB (U) 7*32kB (UM) 10*64kB (UM) 2*128kB (UM)
3*256kB (UM) 1*512kB (M) 0*1024kB 1*2048kB (R) 0*4096kB = 4636kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.917906] Node 0 DMA32:
738*4kB (UE) 390*8kB (EM) 189*16kB (UE) 117*32kB (UEM) 120*64kB (EM)
60*128kB (UEM) 23*256kB (UEM) 8*512kB (UEM) 2*1024kB (U) 0*2048kB
1*4096kB (R) = 44328kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.925589] Node 0
hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.927978] 12189 total
pagecache pages
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.928995] 9236 pages in
swap cache
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.929978] Swap cache
stats: add 1598908, delete 1589672, find 1953234/2271848
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.931903] Free swap = 0kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.932854] Total swap = 1048544kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.933794] 262044 pages RAM
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.934655] 0 pages
HighMem/MovableOnly
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.935688] 5917 pages reserved
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.936598] 0 pages hwpoisoned
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.937500] [ pid ] uid
tgid total_vm rss nr_ptes swapents oom_score_adj name
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.940324] [ 172] 0
172 7217 269 20 43 0 systemd-journal
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.942449] [ 184] 0
184 10356 2 22 217 -1000 systemd-udevd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.944605] [ 597] 0
597 3015 16 12 768 0 haveged
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.946531] [ 598] 1
598 1571 14 9 13 0 uptimed
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.948669] [ 599] 101
599 42559 4837 67 6402 0 named
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.950662] [ 600] 0
600 64667 70 29 197 0 rsyslogd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.952633] [ 601] 0
601 4756 5 15 40 0 atd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.954470] [ 602] 0
602 6476 20 18 47 0 cron
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.957340] [ 603] 0
603 39348 46 43 2237 0 lwresd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.960249] [ 607] 0
607 13796 29 31 139 -1000 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.962284] [ 609] 107
609 10563 48 24 69 -900 dbus-daemon
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.964515] [ 617] 0
617 7088 44 19 38 0 systemd-logind
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.966680] [ 671] 105
671 305156 30716 486 181302 0 clamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.968711] [ 687] 0
687 15925 289 35 2831 0 spfd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.970791] [ 699] 103
699 8346 49 22 109 0 ntpd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.973677] [ 701] 0
701 3724 3 12 36 0 agetty
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.976699] [ 702] 0
702 17950 3 40 132 0 login
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.978631] [ 715] 0
715 5062 3 15 113 0 mysqld_safe
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.980675] [ 858] 17041
858 150162 3794 90 11586 0 mysqld
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.982763] [ 859] 0
859 5536 3 16 51 0 logger
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.984811] [ 878] 38
878 15029 249 32 2049 0 mailmanctl
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.987040] [ 879] 38
879 15637 1826 34 1126 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.989403] [ 880] 38
880 15700 1885 35 1145 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.992585] [ 881] 38
881 14998 283 32 2005 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.994953] [ 882] 38
882 15700 548 34 2470 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.997219] [ 883] 38
883 15002 286 33 2019 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.999303] [ 884] 38
884 16807 521 36 2524 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.001351] [ 885] 38
885 15690 968 33 1995 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.003496] [ 886] 38
886 15024 250 33 2036 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.005724] [ 912] 0
912 42169 599 87 17903 0 /usr/sbin/spamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.008183] [ 914] 0
914 49613 670 30 1160 0 fail2ban-server
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.010262] [ 997] 0
997 78222 269 114 3705 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.011990] [ 1027] 0
1027 43875 10022 91 10191 0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.013870] [ 1029] 0
1029 42735 3586 87 15469 0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.015655] [ 1039] 113
1039 25706 45 48 473 0 exim4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.017462] [ 1048] 1607
1048 9051 141 22 119 0 systemd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.020096] [ 1049] 1607
1049 12900 14 26 740 0 (sd-pam)
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.022931] [ 1051] 0
1051 13451 3 33 115 0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.025523] [ 1083] 0
1083 13311 1 32 106 0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.027987] [ 1084] 0
1084 5085 3 15 142 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.030570] [ 2215] 0
2215 23309 3 49 234 0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.033272] [ 2217] 1607
2217 23309 44 47 201 0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.035822] [ 2218] 1607
2218 5081 3 15 138 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.038453] [ 2253] 1607
2253 5992 7 18 55 0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.041654] [ 2258] 1607
2258 6066 69 18 106 0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.044778] [ 2273] 1607
2273 5083 82 15 63 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.047324] [ 2306] 0
2306 13451 3 32 118 0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.050021] [ 2339] 0
2339 13311 1 31 105 0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.051936] [ 2340] 0
2340 5103 73 16 91 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.053694] [15410] 33
15410 88963 13397 135 3271 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.055645] [16284] 33
16284 80083 4647 118 2974 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.058999] [19656] 33
19656 80456 4573 117 3142 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.062614] [20288] 33
20288 79754 3974 117 3037 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.065805] [21280] 33
21280 79333 3244 115 3168 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.068966] [21283] 33
21283 85967 10413 129 3078 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.072292] [21284] 33
21284 86848 10748 130 3174 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.075728] [21306] 33
21306 85528 9358 127 3217 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.078872] [21420] 33
21420 86187 9993 128 3206 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.082082] [21832] 33
21832 85304 9835 128 3046 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.085599] [21937] 33
21937 79546 3702 116 3210 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.090313] [21940] 33
21940 85942 9674 128 3325 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.093417] [21950] 33
21950 88632 12360 133 3305 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.096711] [21951] 33
21951 87235 11223 131 3214 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.099971] [21953] 33
21953 87544 10917 131 3330 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.103143] [21976] 33
21976 87350 10978 131 3313 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.105194] [21984] 33
21984 87404 11163 131 3246 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.107390] [21985] 33
21985 86796 10190 129 3419 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.110497] [22036] 33
22036 87183 10215 129 3421 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.112329] [22039] 33
22039 85816 9314 128 3330 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.114268] [22040] 33
22040 78803 2445 113 3337 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.115973] [22042] 33
22042 79744 2694 116 3336 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.118243] [22043] 33
22043 84175 7485 123 3421 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.120124] [22044] 33
22044 83023 6360 121 3416 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.122435] [22052] 33
22052 78230 288 100 3689 0 apache2
clamd is a fair bit of a memory hog, but I knew that, added more RAM to
the host earlier, and also, it's only one PID.
The big RAM suck is Apache ... not per PID, but in total: 26 Apache
PIDs, and that quickly adds up.
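(To see that total at any given moment, a ps/awk one-liner such as the
following does the arithmetic -- a sketch:)

  # ps -C apache2 -o rss= | awk '{n++; s+=$1} END {if (n) printf "%d PIDs, %d MiB total, %d MiB avg\n", n, s/1024, s/n/1024}'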
/var/log/syslog*
And looks like it was around here that named got whacked:
Sep 29 02:52:04 balug-sf-lug-v2 systemd[1]: bind9.service: main
process exited, code=killed, status=9/KILL
reviewed the Apache logs (/var/log/apache2/*)
... looks mostly like some overzealous web crawlers, plus some bad bots,
were both simultaneously pounding away quite hard on Apache. Looks like
I'd not tuned Apache sufficiently to withstand such a load, so
Apache (mpm_prefork) forked lots of PIDs to handle the requests and
attempt to keep up, but alas, beyond the resources reasonably available
on the host ... and that's when things then went South rather quickly.
So ... would be better to limit how much resource Apache would suck up,
rather than allow Apache to consume excessive resources relative to
those available. That may result in some web service failures/errors
... but that's better than Apache otherwise negatively impacting services
on the host.
So ... Apache configuration ...
cd /etc/apache2 && ./.unroll < apache2.conf
...
# ./.unroll START: IncludeOptional mods-enabled/*.load
LoadModule mpm_prefork_module /usr/lib/apache2/modules/mod_mpm_prefork.so
...
# ./.unroll START: IncludeOptional mods-enabled/*.conf
...
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxRequestWorkers 150
MaxConnectionsPerChild 0
</IfModule>
From the Apache documentation:
"Most important is that MaxRequestWorkers be big enough to handle as
many simultaneous requests as you expect to receive, but small enough
to assure that there is enough physical RAM for all processes."
Oops!
So, looks like things were going badly with 26 Apache PIDs (likely 1
"master" (parent) and the rest spawned children). So, likely
MaxRequestWorkers should be something below 25 - and even 25
would be a bit too high. That's also a lot of load/work -
simultaneously handling up to 25 requests. If something(s) are
requesting that much ... some requests can wait - better that than quite
negatively impacting the host overall. So ... I'm thinking 20.
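Back-of-envelope, using round numbers from the OOM report above (4 KiB
pages, figures approximate):

  262044 pages RAM                      ~= 1024 MiB total
  busy apache2 child:  ~10000 pages rss ~=   40 MiB
  quiet apache2 child:  ~3000 pages rss ~=   12 MiB
  20 children at that mix               ~=  240 to 800 MiB

which fits within 1 GiB far better than ~26 children did, while still
leaving room for clamd (~120 MiB), named, mysqld, Mailman, et al.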
How does it look at present?
2>&1 fuser /usr/sbin/apache2
/usr/sbin/apache2: 443e 969e 997e 3211e 7433e 13566e 13604e
13605e 13623e 31738e 31749e
11 PIDs, so 20 ought be pretty reasonable for MaxRequestWorkers. That
would be about 21 Apache PIDs total, and simultaneously handling up to
about 20 requests.
# ls -ld /etc/apache2/mods-enabled/mpm_prefork.conf
lrwxrwxrwx 1 root root 34 Oct 30 2015
/etc/apache2/mods-enabled/mpm_prefork.conf ->
../mods-available/mpm_prefork.conf
# cd /etc/apache2/mods-available
# ex mpm_prefork.conf
mpm_prefork.conf: unmodified: line 16
:/150
MaxRequestWorkers 150
:s/150/20/p
MaxRequestWorkers 20
:w
mpm_prefork.conf: 16 lines, 570 characters
:q
# (cd / && umask 022 && apachectl graceful)
# ci -d -u -M -m'adjust MaxRequestWorkers to avoid excess RAM
consumption and dreaded kernel OOM PID killer' mpm_prefork.conf
RCS/mpm_prefork.conf,v <-- mpm_prefork.conf
new revision: 1.2; previous revision: 1.1
done
#
Could also possibly consider going to a threaded model (if that plays
well/safely with the other Apache bits and related software installed).
... and then a reboot for good measure (notably in case other PIDs got
whacked by kernel OOM PID killer that ought be restarted).
And then hopefully all is well with the universe again ... at least for
now.
> From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
> To: "Rick Moen" <rick(a)linuxmafia.com>
> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
> Date: Sat, 30 Sep 2017 20:41:23 -0700
> So, ... looks like, at least to a 1st order
> approximation/guestimation based upon various log data ...
> a combination of some overzealous web crawlers and bad bots
> hit hard with the frequency and simultaneity of their requests,
> Apache - probably still at or near to default on its configuration
> for handling such as for workers/threads, tried to be very highly
> accommodating to the excessive requests, starting up lots of
> processes/threads ... and ... probably not properly tuned,
> beyond reasonable for the available resources ... and
> then fairly soon into that things started going seriously South.
> So ... probably first bit, ... tweak some Apache settings so it's
> not overly generous in providing resources, and excessive in
> consumption of resources ... better Apache return a "server busy"
> or similar type of error, than consume excessive resources
> to the point where it quite negatively impacts the host
> (dang Linux memory (mis) management ... if it didn't overcommit,
> it could simply tell Apache, "sorry no more memory for you - enough
> already" ... and generally nothing else would suffer, but ... alas,
> if only ...). So, ... anyway, tweak some Apache settings on that,
> reboot (to revive any other innocent PIDs that may have been
> unceremoniously slaughtered), and ... keep an eye on it and see
> how things go. Have bumped into relatively similar before ...
> but it's been several years. The issue from many years ago was
> bad bots massively excessively registering accounts on the
> wiki ... the bots were too stupid to manage to do anything with
> the wiki once they registered all those accounts ... but the
> registration was so massively parallel it was a DoS attack
> on the wiki/host, and it ballooned resource consumption so high and
> fast, host would lock up solid without leaving much of a trace
> as to what happened ... took a bit of sleuthing and adding
> some wee bits 'o extra data collection to track down
> and nail that one. The work-around was then to change
> the wiki so no web-based registrations were allowed
> anymore ... rare enough folks are added on the wiki
> that such can be handled manually ... that was a sufficient
> change to work around the issue several years ago
> when that was going on. Anyway, MTA/Mailman/anti-spam etc.
> have upped the resource requirements/consumption some fairish
> bit (did also give the virtual host more virtual RAM at the time
> too) ... but ... probably again time for some more resource
> allocation/tuning ... and looks like at present Apache is the
> first logical place to adjust that.
>
>> From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
>> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>> Date: Sat, 30 Sep 2017 20:07:48 -0700
>
>> The bloody OOM PID killer kicked in, and at some point
>> in its infinite wisdom (stupidity) thought that SIGKILL
>> to named on a nameserver was a "good" <cough, cough> idea.
>> I'll have to see if I can isolate what was sucking up too
>> much resource. I never did much care for Linux's (or at least
>> some/many distributions leaving it enabled in kernel by default)
>> overcommitting (notably as crude work-around for folks that
>> write crappy programs that request lots of memory which they often
>> never need or use) ... which then, when something actually needs
>> the memory the kernel "gave" (promised) it, when it cheated
>> and overcommitted ... yeah, ... that's when things get very ugly
>> very fast - a.k.a. OOM PID killer, ... ugh!
>>
>> Anyway, more log sleuthing, to see what ate up so much
>> resource ... and ... probably due for reboot after the OOM
>> kicked in anyway, ... dear knows what else got whacked that
>> ought not have been whacked.
>>
>>
>>> From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
>>> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>>> Date: Sat, 30 Sep 2017 19:36:01 -0700
>>
>>> Thanks, ... hopefully it's "all better now".
>>> Somehow named wasn't listening 8-O ... and I think that's also
>>> why the (still listening) MTA was then, uh, "upset."
>>> I'll see what I can find as to why named was down and when it
>>> went down ... maybe operator error, maybe ... who knows what.
>>> Anyway, I'll see what I can find (and will recheck its general
>>> health).
>>>
>>> Anyway, thanks for bringing it to my attention - I'd not
>>> seen that quite yet.
>>>
>>>> From: "Rick Moen" <rick(a)linuxmafia.com>
>>>> Subject: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>>>> Date: Sat, 30 Sep 2017 12:00:28 -0700
>>>
>>>> Just checking to make sure you're aware of ongoing nameserver downtime.
>>>> Also, MTA downtime.
>>>>
>>>> $ telnet mx.balug.org 25
>>>> Trying 198.144.194.238...
>>>> Connected to mx.balug.org.
>>>> Escape character is '^]'.
>>>> 451 Temporary local problem - please try later
>>>> Connection closed by foreign host.
>>>> $
>>>>
>>>> ----- Forwarded message from logcheck system account
>>>> <logcheck(a)linuxmafia.com> -----
>>>>
>>>> Date: Sat, 30 Sep 2017 11:02:01 -0700
>>>> From: logcheck system account <logcheck(a)linuxmafia.com>
>>>> To: root(a)linuxmafia.com
>>>> Subject: linuxmafia.com 2017-09-30 11:02 System Events
>>>>
>>>> System Events
>>>> =-=-=-=-=-=-=
>>>> Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>> Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN:
>>>> Transfer started.
>>>> Sep 30 10:20:52 linuxmafia named[32734]: transfer of
>>>> 'sf-lug.org/IN' from 198.144.194.238#53: failed to connect:
>>>> connection refused
>>>> Sep 30 10:20:52 linuxmafia named[32734]: transfer of
>>>> 'sf-lug.org/IN' from 198.144.194.238#53: Transfer completed: 0
>>>> messages, 0 records, 0 bytes, 0.062 secs (0 bytes/sec)
>>>> Sep 30 10:24:59 linuxmafia named[32734]: zone balug.org/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>> Sep 30 10:30:20 linuxmafia named[32734]: zone sf-lug.com/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>> Sep 30 10:45:15 linuxmafia named[32734]: zone
>>>> e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: refresh: retry limit
>>>> for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
>>>> Sep 30 10:45:15 linuxmafia named[32734]: zone
>>>> e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: Transfer started.
>>>> Sep 30 10:45:15 linuxmafia named[32734]: transfer of
>>>> 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from
>>>> 198.144.194.238#53: failed to connect: connection refused
>>>> Sep 30 10:45:15 linuxmafia named[32734]: transfer of
>>>> 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from
>>>> 198.144.194.238#53: Transfer completed: 0 messages, 0 records, 0
>>>> bytes, 0.065 secs (0 bytes/sec)
>>>> Sep 30 10:51:37 linuxmafia named[32734]: zone balug.org/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>>
>>>>
>>>> ----- End forwarded message -----