BALUG-Admin
October 2017 (2 participants, 4 discussions)
Hey, Michael P.! Remember when we noticed this one, I filed a bug
report on July 29th, and upstream Mailman maintainer Mark Sapiro
immediately furnished a patch? Here it is, closing the loop: the fix
status has been changed from Committed to Released.
----- Forwarded message from Mark Sapiro <mark(a)msapiro.net> -----
Date: Thu, 26 Oct 2017 21:22:50 -0000
From: Mark Sapiro <mark(a)msapiro.net>
To: rick(a)linuxmafia.com
Subject: [Bug 1707447] Re: Roster should not lowercase addresses
Reply-To: Bug 1707447 <1707447(a)bugs.launchpad.net>
** Changed in: mailman
Status: Fix Committed => Fix Released
--
You received this bug notification because you are subscribed to the bug
report.
https://bugs.launchpad.net/bugs/1707447
Title:
Roster should not lowercase addresses
To manage notifications about this bug go to:
https://bugs.launchpad.net/mailman/+bug/1707447/+subscriptions
----- End forwarded message -----
Thought you'd be amused at this cautionary tale, Michael P.
----- Forwarded message from Rick Moen <rick(a)linuxmafia.com> -----
Date: Mon, 16 Oct 2017 10:05:29 -0700
From: Rick Moen <rick(a)linuxmafia.com>
To: Duncan MacKinnon <duncan1(a)gmail.com>
Subject: Rest of that story about secondary DNS
Organization: If you lived here, you'd be $HOME already.
I _think_ I started to tell you a story about doing secondary DNS for
people, and something I learned. Of course, the standard model is
supposed to be: You do auth DNS for my domains; I do it for yours.
Years ago, I started to see the flaw in that optimistic,
we-help-each-other mental model. There was a user group in Santa Cruz,
SMAUG, which owned domain 'scruz.net'. (Terrible naming and choice of
domain; not my doing, not my call.) I ended up being primary/master DNS, and
we were doing really well because we signed up five more individuals
with auth nameservers to help out with secondary/slave DNS, for six
auth nameservers total, widely dispersed geographically. That's
fabulous redundancy. What could possibly go wrong? </deadpan>
I relaxed about quality of service, because obviously we were way ahead
of the game. (SMAUG had a mailing list on the SVLUG mailing list
server. The mailing list still exists, derelict, the group having now
fallen apart.) Roll forward to one day when my household uplink through
Raw Bandwidth Communications was offline for about an hour because
SBC/AT&T shot the company in the foot.
When my aDSL came back, I found postings to SMAUG's mailing list
bitching about the scruz.net nameservice having been totally offline.
I noticed that some of the complaining came from the five individuals
who were allegedly doing secondary/slave nameservice. Hmm?
So, I checked on the five secondaries. Certainly, my aDSL being offline
for an hour should not have taken all DNS offline. And what I found
was: Over about a two-year period, some of the five had moved their
nameservers to new IPs and failed to notify me as master nameserver
admin. Some had ceased doing auth DNS entirely, and failed to notify me
as master nameserver admin. Some still had the same nameserver running
at the same IP as always, but had quietly ceased doing auth namservice
for scruz.net, and failed to notify me as master nameserver admin.
All of the nameserver IPs they'd provided me for their secondary
nameservice were still listed in the whois (and as NS lines for the
domain in the parent .net zone). But exactly one nameserver still
existed and was actually _doing_ auth DNS for scruz.net -- mine.
All five of the others had silently flaked out. Which made it extra
galling that some of these guys complained about -my- nameservice being
unreliable, since theirs was 100% unreliable, their having broken it
in various ways, whereas mine worked great except once in a blue moon
when my uplink went down.
I thought: OK, obviously it turns out to be a mistake to just trust
that secondaries will continue to exist and that their operators will
do due-diligence communication with the primary when something important
changes. They _should_, but it turns out they don't.
So, I wrote a weekly cron script to check on all the secondaries for my
two domains, linuxmafia.com and unixmercenary.net: It queries and
reports the parent-zone NS "glue" records, queries and reports the
nameservers declared authoritative in whois, and reports each auth
nameserver's zonefile S/N so I can make sure they all respond and give
the same value. This means I can detect and act on flaky secondaries.
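(As a rough sketch of the idea -- not the actual script; the domains,
tools, and output format here are just assumptions -- such a weekly
check might look something like this:)

  #!/bin/sh
  # Sketch of a weekly secondary-DNS sanity check.
  # For each domain: show the NS delegation per the parent zone, the
  # nameservers listed in whois, and the SOA serial each delegated
  # nameserver actually returns, so dead or out-of-sync secondaries
  # stand out.
  for domain in linuxmafia.com unixmercenary.net
  do
      echo "== $domain =="
      parent_zone=$(echo "$domain" | cut -d. -f2-)
      parent_ns=$(dig +short NS "$parent_zone." | head -1)
      echo "-- delegation per parent zone ($parent_ns):"
      dig +norecurse +noall +answer +authority NS "$domain" @"$parent_ns"
      echo "-- nameservers per whois:"
      whois "$domain" | grep -i 'name server'
      for ns in $(dig +short NS "$domain")
      do
          serial=$(dig +short +norecurse SOA "$domain" @"$ns" | awk '{print $3}')
          echo "-- $ns serial: ${serial:-NO ANSWER}"
      done
  done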
What I did _not_ do was bother writing a script to check on other
people's master nameservers, on domains for which *I* do secondary
nameservice. Failures in this case are almost entirely the domain
owner's problem, not mine. As long as I keep _my_ word for quality of
secondary DNS, I'm OK.
Well, almost. I do secondary for five or six domains Ruben Safir owns,
and recently double-checked those. For most of them, I advised Ruben
_again_ that having only two auth nameservers isn't enough and is
dangerously thin. I urged him to find a couple more, somewhere.
For one of them, nylxs.com, I noticed and advised Ruben that _neither_
his nor my auth nameservers were authoritative any more. Instead of
WWW2.MRBRKLYN.COM
NS1.LINUXMAFIA.COM
the records now listed the auth nameservers like so:
$ whois nylxs.com | grep 'Name Server'
Name Server: NS69.DOMAINCONTROL.COM
Name Server: NS70.DOMAINCONTROL.COM
$
I wrote Ruben: 'I find that you have _ceased_ using my secondary
(slave) nameservice, but neglected to inform me. That's rude, Ruben.
You need to friggin' tell your secondaries if/when you move auth
nameservice somewhere else. Grr. Learn to do it right, already!'
Turns out, that's not what happened, exactly.
Ruben had failed to pay his domain renewal, so his registrar (GoDaddy)
had repointed its DNS to 'parked domain' nameservice from its own
domaincontrol.com nameservers, making those authoritative in place of
Ruben's and mine.
Luckily, because I (in effect) warned Ruben of his expired domain in
time, he was able to renew it.
So, lesson: When you find yourself annoyingly still doing futile
secondary DNS for a domain whose owner _seems_ to have moved auth
nameservice elsewhere without telling you, the explanation isn't always
owner lack of diligence concerning communicating with secondaries:
Sometimes, it's merely owner lack of diligence in paying the bill.
----- End forwarded message -----
Michael P.:
I periodically check my mail server IP on http://multirbl.valli.org/
and/or http://www.dnsbl.info/ , to make sure it's not on any blocklist.
(It was recently on one for cryptic reasons. When I politely inquired
and stressed that I'd be glad to fix any problem but didn't understand
the existing one, the listing was removed without comment.)
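(Those web forms just automate DNS queries; a single blocklist can also
be checked directly by reversing the IPv4 octets and looking them up
under the list's zone. A sketch, using zen.spamhaus.org purely as an
example of such a zone and 198.144.194.238 -- the lists.balug.org
address checked below -- as the example address:)

  $ host 238.194.144.198.zen.spamhaus.org

An NXDOMAIN answer means "not listed" in that particular zone; an
answer in 127.0.0.0/8 means it is listed.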
$ host lists.balug.org
lists.balug.org has address 198.144.194.238
lists.balug.org has IPv6 address 2001:470:1f04:19e::2
lists.balug.org mail is handled by 100 mx.lists.balug.org.
$
http://multirbl.valli.org/lookup/198.144.194.238.html shows that one
cluster of blocklists, the rfc-clueless.org one, doesn't like your IP.
This is the successor to Derek Balling's rfc-ignorant.org DNSBL, which
Derek eventually shut down, and has the same mission. The listing
policy is here: http://rfc-clueless.org/pages/listing_policy
Looks like they believe that 198.144.194.238 isn't accepting mail to
postmaster, an RFC requirement for any FQDN that deals in SMTP
(http://rfc-clueless.org/pages/listing_policy-postmaster).
When I did a quick check, telnetting to 25/tcp, it seemed to me that the
system _was_ going to accept my manually composed mail to
postmaster(a)lists.balug.org -- so I'm unclear on why that listing's
there.
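(For reference, that sort of manual check amounts to roughly the
following -- commands only, sender address purely illustrative. A 2xx
reply to the RCPT line means mail to postmaster is being accepted; a
5xx reply would explain the listing.)

  $ telnet lists.balug.org 25
  HELO probe.example.com
  MAIL FROM:<probe@example.com>
  RCPT TO:<postmaster@lists.balug.org>
  QUIT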
Seems like they also have no-postmaster listings for FQDNs
balug-sf-lug-v2.balug.org, balug.org, and temp.balug.org. Maybe you
should actually cease accepting mail for those FQDNs.
balug.org(/sf-lug.{org, com}) host OOM oops Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
by Michael Paoli, 04 Oct '17
To on-list, because, well, ... why not? :-)
And ... BALUG-Admin, (and/or) SF-LUG?
Well, it's more so balug.org host also running some SF-LUG
services, rather than vice versa, so ...
"She's dead, Jim." - Rick Moen (Thanks Rick!) noticed some issues,
checked a bit and ... named was no longer running! 8-O
Host was still up and otherwise seemed (relatively) healthy,
but ... what happened to named? Did some digging ... not dig(1)
particularly, but ... checking logs.
So ... restarted named ... seemed healthy and okay and such.
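(On this Debian/systemd host, the restart and a quick sanity check
amount to roughly the following -- a sketch, with the zone picked just
as an example:)

  # systemctl restart bind9
  # systemctl status bind9
  # dig +norecurse +short SOA balug.org @127.0.0.1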
Started checking logs ...
/var/log/daemon.log*
Sep 29 02:58:37 balug-sf-lug-v2 systemd[1]: Unit bind9.service entered
failed state.
That seemed about the last peep about it ... but there were indications
of other problems ... notably things apparently being killed off. :-/
/var/log/messages*
Uh oh, ... the dreaded OOM PID killer! ...
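(A quick way to turn those up -- a sketch; zgrep so rotated logs are
covered too:)

  # zgrep -h -e 'invoked oom-killer' -e 'Out of memory' /var/log/messages*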
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.816128]
fail2ban-server invoked oom-killer: gfp_mask=0x201da, order=0,
oom_score_adj=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.830751]
fail2ban-server cpuset=/ mems_allowed=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.832896] CPU: 0 PID:
914 Comm: fail2ban-server Not tainted 3.16.0-4-amd64 #1 Debian
3.16.43-2+deb8u5
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.834810] Hardware name:
Bochs Bochs, BIOS Bochs 01/01/2011
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.835826]
0000000000000000 ffffffff81514291 ffff88003ca049e0 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
ffffffff81511e69 0000000000000000 ffffffff810d6f6f 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
ffffffff81518c4e 0000000000000200 ffffffff81068a53 ffffffff810c44e4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] Call Trace:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81514291>] ? dump_stack+0x5d/0x78
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81511e69>] ? dump_header+0x76/0x1e8
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810d6f6f>] ? smp_call_function_single+0x5f/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81518c4e>] ? mutex_lock+0xe/0x2a
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81068a53>] ? put_online_cpus+0x23/0x80
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810c44e4>] ? rcu_oom_notify+0xc4/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8115431c>] ? do_try_to_free_pages+0x4ac/0x520
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81142ddd>] ? oom_kill_process+0x21d/0x370
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8114299d>] ? find_lock_task_mm+0x3d/0x90
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81143543>] ? out_of_memory+0x473/0x4b0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8114940f>] ? __alloc_pages_nodemask+0x9ef/0xb50
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8118894d>] ? alloc_pages_current+0x9d/0x150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81141b40>] ? filemap_fault+0x1a0/0x420
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81167d0a>] ? __do_fault+0x3a/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8116a8ce>] ? do_read_fault.isra.54+0x4e/0x300
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8116c0fc>] ? handle_mm_fault+0x63c/0x1150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810582c7>] ? __do_page_fault+0x177/0x4f0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8109ea37>] ? put_prev_entity+0x57/0x350
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8105299b>] ? kvm_clock_get_cycles+0x1b/0x20
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810c9b32>] ? ktime_get_ts+0x42/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff811be157>] ? poll_select_copy_remaining+0xe7/0x140
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8151c4d8>] ? async_page_fault+0x28/0x30
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.875236] Mem-Info:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.876102] Node 0 DMA per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.877119] CPU 0: hi:
0, btch: 1 usd: 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.878426] Node 0 DMA32 per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.879490] CPU 0: hi:
186, btch: 31 usd: 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]
active_anon:113183 inactive_anon:113245 isolated_anon:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]
active_file:46 inactive_file:58 isolated_file:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] unevictable:0
dirty:0 writeback:0 unstable:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] free:12241
slab_reclaimable:3169 slab_unreclaimable:5274
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] mapped:2458
shmem:2849 pagetables:5275 bounce:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] free_cma:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.888200] Node 0 DMA
free:4636kB min:700kB low:872kB high:1048kB active_anon:5292kB
inactive_anon:5328kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB
mlocked:0kB dirty:0kB writeback:0kB mapped:280kB shmem:280kB
slab_reclaimable:104kB slab_unreclaimable:244kB kernel_stack:48kB
pagetables:124kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.897611]
lowmem_reserve[]: 0 982 982 982
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.899344] Node 0 DMA32
free:44328kB min:44352kB low:55440kB high:66528kB active_anon:447440kB
inactive_anon:447652kB active_file:184kB inactive_file:232kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1032184kB managed:1008516kB mlocked:0kB dirty:0kB
writeback:0kB mapped:9552kB shmem:11116kB slab_reclaimable:12572kB
slab_unreclaimable:20852kB kernel_stack:3008kB pagetables:20976kB
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:727 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.910174]
lowmem_reserve[]: 0 0 0 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.911889] Node 0 DMA:
5*4kB (M) 1*8kB (U) 10*16kB (U) 7*32kB (UM) 10*64kB (UM) 2*128kB (UM)
3*256kB (UM) 1*512kB (M) 0*1024kB 1*2048kB (R) 0*4096kB = 4636kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.917906] Node 0 DMA32:
738*4kB (UE) 390*8kB (EM) 189*16kB (UE) 117*32kB (UEM) 120*64kB (EM)
60*128kB (UEM) 23*256kB (UEM) 8*512kB (UEM) 2*1024kB (U) 0*2048kB
1*4096kB (R) = 44328kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.925589] Node 0
hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.927978] 12189 total
pagecache pages
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.928995] 9236 pages in
swap cache
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.929978] Swap cache
stats: add 1598908, delete 1589672, find 1953234/2271848
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.931903] Free swap = 0kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.932854] Total swap = 1048544kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.933794] 262044 pages RAM
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.934655] 0 pages
HighMem/MovableOnly
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.935688] 5917 pages reserved
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.936598] 0 pages hwpoisoned
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.937500] [ pid ] uid
tgid total_vm rss nr_ptes swapents oom_score_adj name
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.940324] [ 172] 0
172 7217 269 20 43 0 systemd-journal
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.942449] [ 184] 0
184 10356 2 22 217 -1000 systemd-udevd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.944605] [ 597] 0
597 3015 16 12 768 0 haveged
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.946531] [ 598] 1
598 1571 14 9 13 0 uptimed
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.948669] [ 599] 101
599 42559 4837 67 6402 0 named
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.950662] [ 600] 0
600 64667 70 29 197 0 rsyslogd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.952633] [ 601] 0
601 4756 5 15 40 0 atd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.954470] [ 602] 0
602 6476 20 18 47 0 cron
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.957340] [ 603] 0
603 39348 46 43 2237 0 lwresd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.960249] [ 607] 0
607 13796 29 31 139 -1000 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.962284] [ 609] 107
609 10563 48 24 69 -900 dbus-daemon
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.964515] [ 617] 0
617 7088 44 19 38 0 systemd-logind
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.966680] [ 671] 105
671 305156 30716 486 181302 0 clamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.968711] [ 687] 0
687 15925 289 35 2831 0 spfd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.970791] [ 699] 103
699 8346 49 22 109 0 ntpd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.973677] [ 701] 0
701 3724 3 12 36 0 agetty
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.976699] [ 702] 0
702 17950 3 40 132 0 login
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.978631] [ 715] 0
715 5062 3 15 113 0 mysqld_safe
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.980675] [ 858] 17041
858 150162 3794 90 11586 0 mysqld
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.982763] [ 859] 0
859 5536 3 16 51 0 logger
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.984811] [ 878] 38
878 15029 249 32 2049 0 mailmanctl
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.987040] [ 879] 38
879 15637 1826 34 1126 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.989403] [ 880] 38
880 15700 1885 35 1145 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.992585] [ 881] 38
881 14998 283 32 2005 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.994953] [ 882] 38
882 15700 548 34 2470 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.997219] [ 883] 38
883 15002 286 33 2019 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.999303] [ 884] 38
884 16807 521 36 2524 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.001351] [ 885] 38
885 15690 968 33 1995 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.003496] [ 886] 38
886 15024 250 33 2036 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.005724] [ 912] 0
912 42169 599 87 17903 0 /usr/sbin/spamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.008183] [ 914] 0
914 49613 670 30 1160 0 fail2ban-server
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.010262] [ 997] 0
997 78222 269 114 3705 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.011990] [ 1027] 0
1027 43875 10022 91 10191 0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.013870] [ 1029] 0
1029 42735 3586 87 15469 0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.015655] [ 1039] 113
1039 25706 45 48 473 0 exim4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.017462] [ 1048] 1607
1048 9051 141 22 119 0 systemd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.020096] [ 1049] 1607
1049 12900 14 26 740 0 (sd-pam)
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.022931] [ 1051] 0
1051 13451 3 33 115 0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.025523] [ 1083] 0
1083 13311 1 32 106 0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.027987] [ 1084] 0
1084 5085 3 15 142 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.030570] [ 2215] 0
2215 23309 3 49 234 0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.033272] [ 2217] 1607
2217 23309 44 47 201 0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.035822] [ 2218] 1607
2218 5081 3 15 138 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.038453] [ 2253] 1607
2253 5992 7 18 55 0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.041654] [ 2258] 1607
2258 6066 69 18 106 0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.044778] [ 2273] 1607
2273 5083 82 15 63 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.047324] [ 2306] 0
2306 13451 3 32 118 0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.050021] [ 2339] 0
2339 13311 1 31 105 0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.051936] [ 2340] 0
2340 5103 73 16 91 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.053694] [15410] 33
15410 88963 13397 135 3271 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.055645] [16284] 33
16284 80083 4647 118 2974 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.058999] [19656] 33
19656 80456 4573 117 3142 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.062614] [20288] 33
20288 79754 3974 117 3037 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.065805] [21280] 33
21280 79333 3244 115 3168 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.068966] [21283] 33
21283 85967 10413 129 3078 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.072292] [21284] 33
21284 86848 10748 130 3174 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.075728] [21306] 33
21306 85528 9358 127 3217 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.078872] [21420] 33
21420 86187 9993 128 3206 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.082082] [21832] 33
21832 85304 9835 128 3046 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.085599] [21937] 33
21937 79546 3702 116 3210 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.090313] [21940] 33
21940 85942 9674 128 3325 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.093417] [21950] 33
21950 88632 12360 133 3305 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.096711] [21951] 33
21951 87235 11223 131 3214 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.099971] [21953] 33
21953 87544 10917 131 3330 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.103143] [21976] 33
21976 87350 10978 131 3313 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.105194] [21984] 33
21984 87404 11163 131 3246 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.107390] [21985] 33
21985 86796 10190 129 3419 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.110497] [22036] 33
22036 87183 10215 129 3421 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.112329] [22039] 33
22039 85816 9314 128 3330 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.114268] [22040] 33
22040 78803 2445 113 3337 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.115973] [22042] 33
22042 79744 2694 116 3336 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.118243] [22043] 33
22043 84175 7485 123 3421 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.120124] [22044] 33
22044 83023 6360 121 3416 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.122435] [22052] 33
22052 78230 288 100 3689 0 apache2
clamd is a fair bit of a memory hog, but I knew that, added more RAM to
the host earlier, and also, it's only one PID.
The big RAM suck is Apache ... not per PID, but in total: 26 Apache
PIDs, and that quickly adds up.
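(To see that total at any given moment, a ps/awk one-liner such as the
following does the arithmetic -- a sketch:)

  # ps -C apache2 -o rss= | awk '{n++; s+=$1} END {if (n) printf "%d PIDs, %d MiB total, %d MiB avg\n", n, s/1024, s/n/1024}'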
/var/log/syslog*
And looks like it was around here that named got whacked:
Sep 29 02:52:04 balug-sf-lug-v2 systemd[1]: bind9.service: main
process exited, code=killed, status=9/KILL
reviewed the Apache logs (/var/log/apache2/*)
... looks mostly like some overzealous web crawlers, plus some bad bots,
were both simultaneously pounding away quite hard on Apache. Looks like
I'd not tuned Apache sufficiently to withstand such a load, so
Apache (mpm_prefork) forked lots of PIDs to handle the requests and
attempt to keep up, but alas, beyond the resources reasonably available
on the host ... and that's when things then went South rather quickly.
So ... would be better to limit how much resource Apache would suck up,
rather than allow Apache to consume excessive resources relative to
those available. That may result in some web service failures/errors
... but that's better than Apache otherwise negatively impacting services
on the host.
So ... Apache configuration ...
cd /etc/apache2 && ./.unroll < apache2.conf
...
# ./.unroll START: IncludeOptional mods-enabled/*.load
LoadModule mpm_prefork_module /usr/lib/apache2/modules/mod_mpm_prefork.so
...
# ./.unroll START: IncludeOptional mods-enabled/*.conf
...
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxRequestWorkers 150
MaxConnectionsPerChild 0
</IfModule>
From the Apache documentation:
"Most important is that MaxRequestWorkers be big enough to handle as
many simultaneous requests as you expect to receive, but small enough
to assure that there is enough physical RAM for all processes."
Oops!
So, looks like things were going badly with 26 Apache PIDs (likely 1
"master" (parent) and the rest spawned children). So, likely
MaxRequestWorkers should be something below 25 - and even 25
would be a bit too high. That's also a lot of load/work -
simultaneously handling up to 25 requests. If something(s) are
requesting that much ... some requests can wait - better that than quite
negatively impacting the host overall. So ... I'm thinking 20.
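Back-of-envelope, using round numbers from the OOM report above (4 KiB
pages, figures approximate):

  262044 pages RAM                      ~= 1024 MiB total
  busy apache2 child:  ~10000 pages rss ~=   40 MiB
  quiet apache2 child:  ~3000 pages rss ~=   12 MiB
  20 children at that mix               ~=  240 to 800 MiB

which fits within 1 GiB far better than ~26 children did, while still
leaving room for clamd (~120 MiB), named, mysqld, Mailman, et al.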
How does it look at present?
2>&1 fuser /usr/sbin/apache2
/usr/sbin/apache2: 443e 969e 997e 3211e 7433e 13566e 13604e
13605e 13623e 31738e 31749e
11 PIDs, so 20 ought be pretty reasonable for MaxRequestWorkers. That
would be about 21 Apache PIDs total, and simultaneously handling up to
about 20 requests.
# ls -ld /etc/apache2/mods-enabled/mpm_prefork.conf
lrwxrwxrwx 1 root root 34 Oct 30 2015
/etc/apache2/mods-enabled/mpm_prefork.conf ->
../mods-available/mpm_prefork.conf
# cd /etc/apache2/mods-available
# ex mpm_prefork.conf
mpm_prefork.conf: unmodified: line 16
:/150
MaxRequestWorkers 150
:s/150/20/p
MaxRequestWorkers 20
:w
mpm_prefork.conf: 16 lines, 570 characters
:q
# (cd / && umask 022 && apachectl graceful)
# ci -d -u -M -m'adjust MaxRequestWorkers to avoid excess RAM
consumption and dreaded kernel OOM PID killer' mpm_prefork.conf
RCS/mpm_prefork.conf,v <-- mpm_prefork.conf
new revision: 1.2; previous revision: 1.1
done
#
Could also possibly consider going to a threaded model (if that plays
well/safely with the other Apache bits and related software installed).
... and then a reboot for good measure (notably in case other PIDs got
whacked by kernel OOM PID killer that ought be restarted).
And then hopefully all is well with the universe again ... at least for
now.
> From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
> To: "Rick Moen" <rick(a)linuxmafia.com>
> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
> Date: Sat, 30 Sep 2017 20:41:23 -0700
> So, ... looks like, at least to a 1st order
> approximation/guestimation based upon various log data ...
> a combination of some overzealous web crawlers and bad bots
> hit hard with the frequency and simultaneity of their requests,
> Apache - probably still at or near to default on its configuration
> for handling such as for workers/threads, tried to be very highly
> accommodating to the excessive requests, starting up lots of
> processes/threads ... and ... probably not properly tuned,
> beyond reasonable for the available resources ... and
> then fairly soon into that things started going seriously South.
> So ... probably first bit, ... tweak some Apache settings so it's
> not overly generous in providing resources, and excessive in
> consumption of resources ... better Apache return a "server busy"
> or similar type of error, than consume excessive resources
> to the point where it quite negatively impacts the host
> (dang Linux memory (mis) management ... if it didn't overcommit,
> it could simply tell Apache, "sorry no more memory for you - enough
> already" ... and generally nothing else would suffer, but ... alas,
> if only ...). So, ... anyway, tweak some Apache settings on that,
> reboot (to revive any other innocent PIDs that may have been
> unceremoniously slaughtered), and ... keep an eye on it and see
> how things go. Have bumped into relatively similar before ...
> but it's been several years. The issue from many years ago was
> bad bots massively excessively registering accounts on the
> wiki ... the bots were too stupid to manage to do anything with
> the wiki once they registered all those accounts ... but the
> registration was so massively parallel it was a DoS attack
> on the wiki/host, and it ballooned resource consumption so high and
> fast, host would lock up solid without leaving much of a trace
> as to what happened ... took a bit of sleuthing and adding
> some wee bits 'o extra data collection to track down
> and nail that one. The work-around was then to change
> the wiki so no web-based registrations were allowed
> anymore ... rare enough folks are added on the wiki
> that such can be handled manually ... that was a sufficient
> change to work around the issue several years ago
> when that was going on. Anyway, MTA/Mailman/anti-spam etc.
> have upped the resource requirements/consumption some fairish
> bit (did also give the virtual host more virtual RAM at the time
> too) ... but ... probably again time for some more resource
> allocation/tuning ... and looks like at present Apache is the
> first logical place to adjust that.
>
>> From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
>> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>> Date: Sat, 30 Sep 2017 20:07:48 -0700
>
>> The bloody OOM PID killer kicked in, and at some point
>> in its infinite wisdom (stupidity) thought that SIGKILL
>> to named on a nameserver was a "good" <cough, cough> idea.
>> I'll have to see if I can isolate what was sucking up too
>> much resource. I never did much care for Linux's (or at least
>> some/many distributions leaving it enabled in kernel by default)
>> overcommitting (notably as crude work-around for folks that
>> write crappy programs that request lots of memory which they often
>> never need or use) ... which then, when something actually needs
>> the memory the kernel "gave" (promised) it, when it cheated
>> and overcommitted ... yeah, ... that's when things get very ugly
>> very fast - a.k.a. OOM PID killer, ... ugh!
>>
>> Anyway, more log sleuthing, to see what ate up so much
>> resource ... and ... probably due for reboot after the OOM
>> kicked in anyway, ... dear knows what else got whacked that
>> ought not have been whacked.
>>
>>
>>> From: "Michael Paoli" <Michael.Paoli(a)cal.berkeley.edu>
>>> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>>> Date: Sat, 30 Sep 2017 19:36:01 -0700
>>
>>> Thanks, ... hopefully it's "all better now".
>>> Somehow named wasn't listening 8-O ... and I think that's also
>>> why the (still listening) MTA was then, uh, "upset."
>>> I'll see what I can find as to why named was down and when it
>>> went down ... maybe operator error, maybe ... who knows what.
>>> Anyway, I'll see what I can find (and will recheck its general
>>> health).
>>>
>>> Anyway, thanks for bringing it to my attention - I'd not
>>> seen that quite yet.
>>>
>>>> From: "Rick Moen" <rick(a)linuxmafia.com>
>>>> Subject: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>>>> Date: Sat, 30 Sep 2017 12:00:28 -0700
>>>
>>>> Just checking to make sure you're aware of ongoing nameserver downtime.
>>>> Also, MTA downtime.
>>>>
>>>> $ telnet mx.balug.org 25
>>>> Trying 198.144.194.238...
>>>> Connected to mx.balug.org.
>>>> Escape character is '^]'.
>>>> 451 Temporary local problem - please try later
>>>> Connection closed by foreign host.
>>>> $
>>>>
>>>> ----- Forwarded message from logcheck system account
>>>> <logcheck(a)linuxmafia.com> -----
>>>>
>>>> Date: Sat, 30 Sep 2017 11:02:01 -0700
>>>> From: logcheck system account <logcheck(a)linuxmafia.com>
>>>> To: root(a)linuxmafia.com
>>>> Subject: linuxmafia.com 2017-09-30 11:02 System Events
>>>>
>>>> System Events
>>>> =-=-=-=-=-=-=
>>>> Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>> Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN:
>>>> Transfer started.
>>>> Sep 30 10:20:52 linuxmafia named[32734]: transfer of
>>>> 'sf-lug.org/IN' from 198.144.194.238#53: failed to connect:
>>>> connection refused
>>>> Sep 30 10:20:52 linuxmafia named[32734]: transfer of
>>>> 'sf-lug.org/IN' from 198.144.194.238#53: Transfer completed: 0
>>>> messages, 0 records, 0 bytes, 0.062 secs (0 bytes/sec)
>>>> Sep 30 10:24:59 linuxmafia named[32734]: zone balug.org/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>> Sep 30 10:30:20 linuxmafia named[32734]: zone sf-lug.com/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>> Sep 30 10:45:15 linuxmafia named[32734]: zone
>>>> e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: refresh: retry limit
>>>> for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
>>>> Sep 30 10:45:15 linuxmafia named[32734]: zone
>>>> e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: Transfer started.
>>>> Sep 30 10:45:15 linuxmafia named[32734]: transfer of
>>>> 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from
>>>> 198.144.194.238#53: failed to connect: connection refused
>>>> Sep 30 10:45:15 linuxmafia named[32734]: transfer of
>>>> 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from
>>>> 198.144.194.238#53: Transfer completed: 0 messages, 0 records, 0
>>>> bytes, 0.065 secs (0 bytes/sec)
>>>> Sep 30 10:51:37 linuxmafia named[32734]: zone balug.org/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>>
>>>>
>>>> ----- End forwarded message -----