To on-list, because, well ... why not? :-) And why BALUG-Admin (and/or SF-LUG)? Well, it's more that the balug.org host also runs some SF-LUG services, rather than vice versa, so ...
"She's dead, Jim." - Rick Moen (Thanks Rick!) noticed some issues, checked a bit and ... named was no longer running! 8-O Host was still up and otherwise seemed (relatively) healthy, but ... what happened to named? Did some digging ... not dig(1) particularly, but ... checking logs.
So ... restarted named ... seemed healthy and okay and such. Started checking logs ...
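(Roughly the sort of restart/recheck, for reference - standard systemd and BIND commands on this Debian host, with bind9.service being the unit name the logs below use; not a transcript of the actual session:)
# systemctl restart bind9
# systemctl status bind9
# rndc status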
/var/log/daemon.log*
Sep 29 02:58:37 balug-sf-lug-v2 systemd[1]: Unit bind9.service entered failed state.
That seemed about the last peep about it ... but there were indications of other problems ... notably things apparently being killed off. :-/
/var/log/messages*
Uh oh ... the dreaded OOM PID killer! ...
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.816128] fail2ban-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.830751] fail2ban-server cpuset=/ mems_allowed=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.832896] CPU: 0 PID: 914 Comm: fail2ban-server Not tainted 3.16.0-4-amd64 #1 Debian 3.16.43-2+deb8u5
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.834810] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.835826] 0000000000000000 ffffffff81514291 ffff88003ca049e0 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] ffffffff81511e69 0000000000000000 ffffffff810d6f6f 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] ffffffff81518c4e 0000000000000200 ffffffff81068a53 ffffffff810c44e4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] Call Trace:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff81514291>] ? dump_stack+0x5d/0x78
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff81511e69>] ? dump_header+0x76/0x1e8
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff810d6f6f>] ? smp_call_function_single+0x5f/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff81518c4e>] ? mutex_lock+0xe/0x2a
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff81068a53>] ? put_online_cpus+0x23/0x80
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff810c44e4>] ? rcu_oom_notify+0xc4/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff8115431c>] ? do_try_to_free_pages+0x4ac/0x520
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff81142ddd>] ? oom_kill_process+0x21d/0x370
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff8114299d>] ? find_lock_task_mm+0x3d/0x90
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff81143543>] ? out_of_memory+0x473/0x4b0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff8114940f>] ? __alloc_pages_nodemask+0x9ef/0xb50
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff8118894d>] ? alloc_pages_current+0x9d/0x150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff81141b40>] ? filemap_fault+0x1a0/0x420
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff81167d0a>] ? __do_fault+0x3a/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff8116a8ce>] ? do_read_fault.isra.54+0x4e/0x300
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff8116c0fc>] ? handle_mm_fault+0x63c/0x1150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff810582c7>] ? __do_page_fault+0x177/0x4f0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff8109ea37>] ? put_prev_entity+0x57/0x350
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff8105299b>] ? kvm_clock_get_cycles+0x1b/0x20
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff810c9b32>] ? ktime_get_ts+0x42/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff811be157>] ? poll_select_copy_remaining+0xe7/0x140
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] [<ffffffff8151c4d8>] ? async_page_fault+0x28/0x30
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.875236] Mem-Info:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.876102] Node 0 DMA per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.877119] CPU 0: hi: 0, btch: 1 usd: 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.878426] Node 0 DMA32 per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.879490] CPU 0: hi: 186, btch: 31 usd: 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] active_anon:113183 inactive_anon:113245 isolated_anon:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] active_file:46 inactive_file:58 isolated_file:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] unevictable:0 dirty:0 writeback:0 unstable:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] free:12241 slab_reclaimable:3169 slab_unreclaimable:5274
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] mapped:2458 shmem:2849 pagetables:5275 bounce:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] free_cma:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.888200] Node 0 DMA free:4636kB min:700kB low:872kB high:1048kB active_anon:5292kB inactive_anon:5328kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:280kB shmem:280kB slab_reclaimable:104kB slab_unreclaimable:244kB kernel_stack:48kB pagetables:124kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.897611] lowmem_reserve[]: 0 982 982 982
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.899344] Node 0 DMA32 free:44328kB min:44352kB low:55440kB high:66528kB active_anon:447440kB inactive_anon:447652kB active_file:184kB inactive_file:232kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1032184kB managed:1008516kB mlocked:0kB dirty:0kB writeback:0kB mapped:9552kB shmem:11116kB slab_reclaimable:12572kB slab_unreclaimable:20852kB kernel_stack:3008kB pagetables:20976kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:727 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.910174] lowmem_reserve[]: 0 0 0 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.911889] Node 0 DMA: 5*4kB (M) 1*8kB (U) 10*16kB (U) 7*32kB (UM) 10*64kB (UM) 2*128kB (UM) 3*256kB (UM) 1*512kB (M) 0*1024kB 1*2048kB (R) 0*4096kB = 4636kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.917906] Node 0 DMA32: 738*4kB (UE) 390*8kB (EM) 189*16kB (UE) 117*32kB (UEM) 120*64kB (EM) 60*128kB (UEM) 23*256kB (UEM) 8*512kB (UEM) 2*1024kB (U) 0*2048kB 1*4096kB (R) = 44328kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.925589] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.927978] 12189 total pagecache pages
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.928995] 9236 pages in swap cache
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.929978] Swap cache stats: add 1598908, delete 1589672, find 1953234/2271848
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.931903] Free swap = 0kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.932854] Total swap = 1048544kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.933794] 262044 pages RAM
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.934655] 0 pages HighMem/MovableOnly
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.935688] 5917 pages reserved
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.936598] 0 pages hwpoisoned
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.937500] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.940324] [ 172] 0 172 7217 269 20 43 0 systemd-journal
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.942449] [ 184] 0 184 10356 2 22 217 -1000 systemd-udevd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.944605] [ 597] 0 597 3015 16 12 768 0 haveged
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.946531] [ 598] 1 598 1571 14 9 13 0 uptimed
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.948669] [ 599] 101 599 42559 4837 67 6402 0 named
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.950662] [ 600] 0 600 64667 70 29 197 0 rsyslogd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.952633] [ 601] 0 601 4756 5 15 40 0 atd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.954470] [ 602] 0 602 6476 20 18 47 0 cron
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.957340] [ 603] 0 603 39348 46 43 2237 0 lwresd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.960249] [ 607] 0 607 13796 29 31 139 -1000 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.962284] [ 609] 107 609 10563 48 24 69 -900 dbus-daemon
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.964515] [ 617] 0 617 7088 44 19 38 0 systemd-logind
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.966680] [ 671] 105 671 305156 30716 486 181302 0 clamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.968711] [ 687] 0 687 15925 289 35 2831 0 spfd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.970791] [ 699] 103 699 8346 49 22 109 0 ntpd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.973677] [ 701] 0 701 3724 3 12 36 0 agetty
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.976699] [ 702] 0 702 17950 3 40 132 0 login
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.978631] [ 715] 0 715 5062 3 15 113 0 mysqld_safe
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.980675] [ 858] 17041 858 150162 3794 90 11586 0 mysqld
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.982763] [ 859] 0 859 5536 3 16 51 0 logger
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.984811] [ 878] 38 878 15029 249 32 2049 0 mailmanctl
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.987040] [ 879] 38 879 15637 1826 34 1126 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.989403] [ 880] 38 880 15700 1885 35 1145 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.992585] [ 881] 38 881 14998 283 32 2005 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.994953] [ 882] 38 882 15700 548 34 2470 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.997219] [ 883] 38 883 15002 286 33 2019 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.999303] [ 884] 38 884 16807 521 36 2524 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.001351] [ 885] 38 885 15690 968 33 1995 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.003496] [ 886] 38 886 15024 250 33 2036 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.005724] [ 912] 0 912 42169 599 87 17903 0 /usr/sbin/spamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.008183] [ 914] 0 914 49613 670 30 1160 0 fail2ban-server
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.010262] [ 997] 0 997 78222 269 114 3705 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.011990] [ 1027] 0 1027 43875 10022 91 10191 0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.013870] [ 1029] 0 1029 42735 3586 87 15469 0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.015655] [ 1039] 113 1039 25706 45 48 473 0 exim4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.017462] [ 1048] 1607 1048 9051 141 22 119 0 systemd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.020096] [ 1049] 1607 1049 12900 14 26 740 0 (sd-pam)
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.022931] [ 1051] 0 1051 13451 3 33 115 0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.025523] [ 1083] 0 1083 13311 1 32 106 0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.027987] [ 1084] 0 1084 5085 3 15 142 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.030570] [ 2215] 0 2215 23309 3 49 234 0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.033272] [ 2217] 1607 2217 23309 44 47 201 0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.035822] [ 2218] 1607 2218 5081 3 15 138 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.038453] [ 2253] 1607 2253 5992 7 18 55 0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.041654] [ 2258] 1607 2258 6066 69 18 106 0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.044778] [ 2273] 1607 2273 5083 82 15 63 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.047324] [ 2306] 0 2306 13451 3 32 118 0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.050021] [ 2339] 0 2339 13311 1 31 105 0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.051936] [ 2340] 0 2340 5103 73 16 91 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.053694] [15410] 33 15410 88963 13397 135 3271 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.055645] [16284] 33 16284 80083 4647 118 2974 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.058999] [19656] 33 19656 80456 4573 117 3142 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.062614] [20288] 33 20288 79754 3974 117 3037 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.065805] [21280] 33 21280 79333 3244 115 3168 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.068966] [21283] 33 21283 85967 10413 129 3078 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.072292] [21284] 33 21284 86848 10748 130 3174 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.075728] [21306] 33 21306 85528 9358 127 3217 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.078872] [21420] 33 21420 86187 9993 128 3206 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.082082] [21832] 33 21832 85304 9835 128 3046 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.085599] [21937] 33 21937 79546 3702 116 3210 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.090313] [21940] 33 21940 85942 9674 128 3325 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.093417] [21950] 33 21950 88632 12360 133 3305 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.096711] [21951] 33 21951 87235 11223 131 3214 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.099971] [21953] 33 21953 87544 10917 131 3330 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.103143] [21976] 33 21976 87350 10978 131 3313 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.105194] [21984] 33 21984 87404 11163 131 3246 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.107390] [21985] 33 21985 86796 10190 129 3419 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.110497] [22036] 33 22036 87183 10215 129 3421 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.112329] [22039] 33 22039 85816 9314 128 3330 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.114268] [22040] 33 22040 78803 2445 113 3337 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.115973] [22042] 33 22042 79744 2694 116 3336 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.118243] [22043] 33 22043 84175 7485 123 3421 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.120124] [22044] 33 22044 83023 6360 121 3416 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.122435] [22052] 33 22052 78230 288 100 3689 0 apache2
clamd is a fair bit of a memory hog, but knew that, added more RAM to the host earlier, and also, it's only one PID. The big RAM suck is Apache ... not per PID, but in total. 26 Apache PIDs, and that quickly adds up.
/var/log/syslog*
And looks like it was around here that named got whacked:
Sep 29 02:52:04 balug-sf-lug-v2 systemd[1]: bind9.service: main process exited, code=killed, status=9/KILL
Reviewed the Apache logs (/var/log/apache2/*) ... looks mostly like some overzealous web crawlers, plus some bad bots, were simultaneously pounding away quite hard on Apache. Looks like I'd not tuned Apache sufficiently to withstand such a load well, so Apache (mpm_prefork) forked lots of PIDs to handle the requests and try to keep up, but alas, beyond the resources reasonably available on the host ... and that's when things went south rather quickly.
So ... it would be better to limit how much resource Apache can suck up, rather than allow Apache to consume excessive resources relative to those available. That may result in some web service failures/errors ... but that's better than Apache otherwise negatively impacting services on the host.
So ... Apache configuration ...
cd /etc/apache2 && ./.unroll < apache2.conf
...
# ./.unroll START: IncludeOptional mods-enabled/*.load
LoadModule mpm_prefork_module /usr/lib/apache2/modules/mod_mpm_prefork.so
...
# ./.unroll START: IncludeOptional mods-enabled/*.conf
...
<IfModule mpm_prefork_module>
        StartServers             5
        MinSpareServers          5
        MaxSpareServers         10
        MaxRequestWorkers      150
        MaxConnectionsPerChild   0
</IfModule>
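(The ./.unroll there is a wee local helper script, not shown in the original. Purely as a hypothetical sketch of what such a helper might look like - it reads an Apache configuration on stdin and recursively expands Include/IncludeOptional globs, marking where each expansion starts, so directives such as MaxRequestWorkers can be spotted whichever included file sets them. Simplified: it ignores indented Include lines, quoting, and ServerRoot handling, so run it from /etc/apache2 as above.)

#!/bin/sh
# hypothetical ".unroll" sketch -- expand Apache Include/IncludeOptional
# directives read from stdin, recursively, relative to the current directory
unroll() {
    while IFS= read -r line; do
        case "$line" in
            Include\ *|IncludeOptional\ *)
                echo "# ./.unroll START: $line"
                # unquoted expansion on purpose: let the shell glob the pattern
                for f in ${line#* }; do
                    [ -f "$f" ] && unroll < "$f"
                done
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}
unroll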
From the Apache documentation: "Most important is that MaxRequestWorkers be big enough to handle as many simultaneous requests as you expect to receive, but small enough to assure that there is enough physical RAM for all processes."
Oops!
So, looks like things were going badly with 26 Apache PIDs (likely 1 "master" (parent) and the rest spawned children). So, likely MaxRequestWorkers should be something at least below 25 - and even 25 would be a bit too high. That's also a lot of load/work - simultaneously handling up to 25 requests. If something(s) are requesting that much ... some requests can wait - better that than quite negatively impacting the host overall. So ... I'm thinking 20. How does it look at present?
2>&1 fuser /usr/sbin/apache2
/usr/sbin/apache2: 443e 969e 997e 3211e 7433e 13566e 13604e 13605e 13623e 31738e 31749e
11 PIDs, so 20 ought to be pretty reasonable for MaxRequestWorkers. That would be about 21 Apache PIDs total, simultaneously handling up to about 20 requests.
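(And for a rough sanity check on the number itself - illustrative arithmetic and commands, not from the original investigation: average the resident set size of the apache2 processes, and divide that into however much RAM one is willing to let Apache have.)
$ ps -C apache2 -o rss= | awk '{n++; s+=$1} END {if (n) printf "%d procs, avg %.0f KiB RSS\n", n, s/n}'
$ free -k | awk '/^Mem:/ {print $2 " KiB RAM total"}'
If, say, the children average around 40-45 MiB resident each, and one only wants Apache taking up very roughly 800-900 MiB on a host with about 1 GiB of RAM, then about 20 workers is already near the ceiling.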
# ls -ld /etc/apache2/mods-enabled/mpm_prefork.conf
lrwxrwxrwx 1 root root 34 Oct 30  2015 /etc/apache2/mods-enabled/mpm_prefork.conf -> ../mods-available/mpm_prefork.conf
# cd /etc/apache2/mods-available
# ex mpm_prefork.conf
mpm_prefork.conf: unmodified: line 16
:/150
        MaxRequestWorkers       150
:s/150/20/p
        MaxRequestWorkers       20
:w
mpm_prefork.conf: 16 lines, 570 characters
:q
# (cd / && umask 022 && apachectl graceful)
# ci -d -u -M -m'adjust MaxRequestWorkers to avoid excess RAM consumption and dreaded kernel OOM PID killer' mpm_prefork.conf
RCS/mpm_prefork.conf,v  <--  mpm_prefork.conf
new revision: 1.2; previous revision: 1.1
done
#
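(Not shown in the session above, but prudent - these are standard apachectl subcommands, nothing specific to this host: a syntax check before the graceful restart, and a check afterwards of which MPM is actually loaded.)
# apachectl configtest
# apachectl -V | grep -i 'server mpm'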
Could also possibly consider going to a threaded model (if that plays well/safely with the other Apache modules and related software installed).
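(Purely a hypothetical sketch of what that switch might look like on a Debian box - not something done here, and only safe if nothing non-thread-safe such as mod_php is in use:)
# a2dismod mpm_prefork
# a2enmod mpm_event
# apachectl configtest && apachectl graceful
... with correspondingly modest ThreadsPerChild / MaxRequestWorkers settings in mods-available/mpm_event.conf.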
... and then a reboot for good measure (notably in case other PIDs that ought to be restarted got whacked by the kernel OOM PID killer).
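(As an aside - standard systemd/journal commands, not part of the original steps - short of a reboot one can also look for other casualties of the OOM killer:)
# systemctl --state=failed
# journalctl -k | grep -i 'killed process'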
And then hopefully all is well with the universe again ... at least for now.
From: "Michael Paoli" Michael.Paoli@cal.berkeley.edu To: "Rick Moen" rick@linuxmafia.com Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events Date: Sat, 30 Sep 2017 20:41:23 -0700
So ... looks like, at least to a 1st order approximation/guestimation based upon various log data ... a combination of some overzealous web crawlers and bad bots hit hard with the frequency and simultaneity of their requests. Apache - probably still at or near its defaults for handling such, e.g. workers/threads - tried to be very highly accommodating to the excessive requests, starting up lots of processes/threads ... and ... not being properly tuned for the resources actually available, fairly soon into that things started going seriously south.

So ... probably the first bit ... tweak some Apache settings so it's not overly generous in providing resources, and excessive in consumption of resources ... better Apache return a "server busy" or similar type of error than consume excessive resources to the point where it quite negatively impacts the host (dang Linux memory (mis)management ... if it didn't overcommit, it could simply tell Apache, "sorry, no more memory for you - enough already" ... and generally nothing else would suffer, but ... alas, if only ...). So ... anyway, tweak some Apache settings on that, reboot (to revive any other innocent PIDs that may have been unceremoniously slaughtered), and ... keep an eye on it and see how things go.

Have bumped into relatively similar before ... but it's been several years. The issue from many years ago was bad bots massively and excessively registering accounts on the wiki ... the bots were too stupid to manage to do anything with the wiki once they registered all those accounts ... but the registration was so massively parallel it was a DoS attack on the wiki/host, and it ballooned resource consumption so high and fast the host would lock up solid without leaving much of a trace as to what happened ... took a bit of sleuthing and adding some wee bits 'o extra data collection to track down and nail that one. The work-around then was to change the wiki so no web-based registrations were allowed anymore ... folks are added to the wiki rarely enough that such can be handled manually ... and that was sufficient to work around the issue that was going on several years ago.

Anyway, MTA/Mailman/anti-spam etc. have upped the resource requirements/consumption some fairish bit (did also give the virtual host more virtual RAM at the time too) ... but ... probably again time for some more resource allocation/tuning ... and looks like at present Apache is the first logical place to adjust that.
From: "Michael Paoli" Michael.Paoli@cal.berkeley.edu Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events Date: Sat, 30 Sep 2017 20:07:48 -0700
The bloody OOM PID killer kicked in, and at some point, in its infinite wisdom (stupidity), thought that SIGKILL to named on a nameserver was a "good" <cough, cough> idea. I'll have to see if I can isolate what was sucking up too much resource. I never did much care for Linux's overcommitting of memory (or at least some/many distributions leaving it enabled in the kernel by default) - notably as a crude work-around for folks that write crappy programs that request lots of memory they often never need or use. Then, when something actually needs the memory the kernel "gave" (promised) it, having cheated and overcommitted ... yeah ... that's when things get very ugly very fast - a.k.a. OOM PID killer ... ugh!
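(Illustrative only - these are standard Linux sysctls, not something from the original thread, and worth testing carefully before leaving in place on a small-RAM host - the overcommit behavior is tunable:)
# sysctl vm.overcommit_memory=2
# sysctl vm.overcommit_ratio=80
... persisted via /etc/sysctl.conf or a file under /etc/sysctl.d/. With mode 2 the kernel refuses allocations beyond swap plus (here) 80% of RAM, so a greedy allocator gets an up-front "no" rather than a later visit from the OOM killer.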
Anyway, more log sleuthing, to see what ate up so much resource ... and ... probably due for reboot after the OOM kicked in anyway, ... dear knows what else got whacked that ought not have been whacked.
From: "Michael Paoli" Michael.Paoli@cal.berkeley.edu Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events Date: Sat, 30 Sep 2017 19:36:01 -0700
Thanks, ... hopefully it's "all better now". Somehow named wasn't listening 8-O ... and I think that's also why the (still listening) MTA was then, uh, "upset." I'll see what I can find as to why named was down and when it went down ... maybe operator error, maybe ... who knows what. Anyway, I'll see what I can find (and will recheck its general health).
Anyway, thanks for bringing it to my attention - I'd not seen that quite yet.
From: "Rick Moen" rick@linuxmafia.com Subject: (forw) linuxmafia.com 2017-09-30 11:02 System Events Date: Sat, 30 Sep 2017 12:00:28 -0700
Just checking to make sure you're aware of ongoing nameserver downtime. Also, MTA downtime.
$ telnet mx.balug.org 25
Trying 198.144.194.238...
Connected to mx.balug.org.
Escape character is '^]'.
451 Temporary local problem - please try later
Connection closed by foreign host.
$
----- Forwarded message from logcheck system account logcheck@linuxmafia.com -----
Date: Sat, 30 Sep 2017 11:02:01 -0700
From: logcheck system account logcheck@linuxmafia.com
To: root@linuxmafia.com
Subject: linuxmafia.com 2017-09-30 11:02 System Events
System Events
Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN: Transfer started.
Sep 30 10:20:52 linuxmafia named[32734]: transfer of 'sf-lug.org/IN' from 198.144.194.238#53: failed to connect: connection refused
Sep 30 10:20:52 linuxmafia named[32734]: transfer of 'sf-lug.org/IN' from 198.144.194.238#53: Transfer completed: 0 messages, 0 records, 0 bytes, 0.062 secs (0 bytes/sec)
Sep 30 10:24:59 linuxmafia named[32734]: zone balug.org/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
Sep 30 10:30:20 linuxmafia named[32734]: zone sf-lug.com/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
Sep 30 10:45:15 linuxmafia named[32734]: zone e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
Sep 30 10:45:15 linuxmafia named[32734]: zone e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: Transfer started.
Sep 30 10:45:15 linuxmafia named[32734]: transfer of 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from 198.144.194.238#53: failed to connect: connection refused
Sep 30 10:45:15 linuxmafia named[32734]: transfer of 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from 198.144.194.238#53: Transfer completed: 0 messages, 0 records, 0 bytes, 0.065 secs (0 bytes/sec)
Sep 30 10:51:37 linuxmafia named[32734]: zone balug.org/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
----- End forwarded message -----
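(As an aside, and not part of the forwarded report: a couple of standard dig(1) checks of that sort of transfer failure, run from the secondary's side, might look like:
$ dig @198.144.194.238 sf-lug.org SOA +norecurse
$ dig @198.144.194.238 sf-lug.org AXFR
... the first showing whether the master answers at all, the second whether it permits the zone transfer.)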
Quoting Michael Paoli (Michael.Paoli@cal.berkeley.edu):
"She's dead, Jim." - Rick Moen (Thanks Rick!) noticed some issues, checked a bit and ... named was no longer running! 8-O
I figure it's always nice if your secondary DNS person is your wingman. ;->
My secret weapon: A well-tuned instance of logcheck. The trick to running logcheck is to spend some time iteratively 'tuning' its /etc/logcheck/ignore.d.server/local.rules file, to make it cease reporting the routine system events of no interest that show up in the system logs. Eventually, you get to the point where logcheck sends you e-mail only when something _interesting_ and potentially significant happens -- like when your nameserver ceases to be able to pull down zone transfers from a remote master nameserver.
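(For illustration only - a made-up rule of the usual logcheck shape, not one from the actual local.rules file - an ignore entry is just an extended regex matching the whole syslog line, e.g. to silence routine refused cache queries from named:)
^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ named\[[0-9]+\]: client [^ ]+ \([^)]*\): query \(cache\) '.*' denied$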
And that is also why I collect means of out-of-band contact for all the people I do DNS for, or who DNS for me, and include those as comment lines where appropriate in /etc/bind/named.conf.local.
(I include that file for public download as a teaching example in http://linuxmafia.com/pub/linux/network/bind9-examples-linuxmafia.tar.gz , except with telephone numbers redacted.)
Incidentally, you and I should both transition from BIND9 to a better authoritative-only nameserver (such as NSD) and from Apache http to a lighter and more secure httpd (such as Lighty or nginx).
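(By way of illustration only - a rough, untested sketch of the general shape of an authoritative-only NSD setup, using NSD 4 option names and a documentation-range placeholder address rather than any real deployment details:)
# /etc/nsd/nsd.conf
server:
        zonesdir: "/etc/nsd/zones"
        hide-version: yes
zone:
        name: "balug.org"
        zonefile: "balug.org.zone"
        # let the secondary pull zone transfers, and notify it of changes
        provide-xfr: 192.0.2.53 NOKEY
        notify: 192.0.2.53 NOKEY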
I wrote:
Incidentally, you and I should both transition from BIND9 to a better authoritative-only nameserver (such as NSD) and from Apache http to a lighter and more secure httpd (such as Lighty or nginx).
Also on the cutting block: NTP Project ntpd, _also_ traditionally a source of recurring security problems and notably overfeatured. I'd been thinking the leading alternative for my use case would be OpenBSD Foundation's OpenNTPd, but the Red Hat-sponsored Chrony appears surprisingly good: https://www.coreinfrastructure.org/news/blogs/2017/09/securing-network-time
It's a pity that the security audit in question didn't include OpenNTPd.
(Implementations studied: NTP Project ntpd, NTPSec, Chrony. The study notes that the NTPSec fork is still in early days, doing cleanup of NTP Project legacy code, so current results don't necessarily predict well what's coming. The same can probably be said of OpenBSD Foundation's project, likewise a fork of the reference codebase focussed on losing legacy cruft and less-necessary features. Chrony stands out as being a from-scratch fresh implementation.)
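(Speaking of Chrony, for what it's worth - an illustrative, minimal client-style /etc/chrony/chrony.conf, with directives as in Debian's packaged default and the usual Debian pool servers, nothing tailored to either of our hosts:)
pool 2.debian.pool.ntp.org iburst
driftfile /var/lib/chrony/chrony.drift
makestep 1.0 3
rtcsync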
Correcting what I wrote, recently:
The study notes that the NTPSec fork is still in early days, doing cleanup of NTP Project legacy code, so current results don't necessarily predict well what's coming. The same can probably be said of OpenBSD Foundation's project, likewise a fork of the reference
^^^^^^^^^^^^^^^
codebase focussed on losing legacy cruft and less-necessary features.
Actually, OpenBSD Foundation's OpenNTPD is _not_ a fork, but rather another from-scratch reimplementation, as is the Red Hat-sponsored Chrony codebase. It's unfortunate that OpenNTPd wasn't included in the Core Infrastructure Initiative security study, but I would expect on general principles for it to have excellent prospects.