[BALUG-Admin] balug.org(/sf-lug.{org, com}) host OOM oops Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
Michael Paoli
Michael.Paoli@cal.berkeley.edu
Sun Oct 1 02:08:18 PDT 2017
To on-list, because, well, ... why not? :-)
And ... BALUG-Admin, (and/or) SF-LUG?
Well, it's more that the balug.org host is also running some SF-LUG
services, rather than vice versa, so ...
"She's dead, Jim." - Rick Moen (Thanks Rick!) noticed some issues,
checked a bit and ... named was no longer running! 8-O
Host was still up and otherwise seemed (relatively) healthy,
but ... what happened to named? Did some digging ... not dig(1)
particularly, but ... checking logs.
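That sort of log sleuthing can be sketched roughly like this (hedged: on the live host this would run against /var/log/messages* and /var/log/syslog*; here a sample line from the incident stands in for the files):

```shell
# Hunt for OOM-killer invocations and SIGKILLed units in kernel/systemd
# log lines.  On the host: grep -E ... /var/log/messages* /var/log/syslog*
sample='Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.816128] fail2ban-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0'
printf '%s\n' "$sample" | grep -E 'invoked oom-killer|status=9/KILL'
```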
So ... restarted named ... seemed healthy and okay and such.
Started checking logs ...
/var/log/daemon.log*
Sep 29 02:58:37 balug-sf-lug-v2 systemd[1]: Unit bind9.service entered
failed state.
That seemed about the last peep about it ... but there were indications
of other problems ... notably things apparently being killed off. :-/
/var/log/messages*
Uh oh, ... the dreaded OOM PID killer! ...
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.816128]
fail2ban-server invoked oom-killer: gfp_mask=0x201da, order=0,
oom_score_adj=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.830751]
fail2ban-server cpuset=/ mems_allowed=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.832896] CPU: 0 PID:
914 Comm: fail2ban-server Not tainted 3.16.0-4-amd64 #1 Debian
3.16.43-2+deb8u5
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.834810] Hardware name:
Bochs Bochs, BIOS Bochs 01/01/2011
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.835826]
0000000000000000 ffffffff81514291 ffff88003ca049e0 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
ffffffff81511e69 0000000000000000 ffffffff810d6f6f 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
ffffffff81518c4e 0000000000000200 ffffffff81068a53 ffffffff810c44e4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] Call Trace:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81514291>] ? dump_stack+0x5d/0x78
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81511e69>] ? dump_header+0x76/0x1e8
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810d6f6f>] ? smp_call_function_single+0x5f/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81518c4e>] ? mutex_lock+0xe/0x2a
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81068a53>] ? put_online_cpus+0x23/0x80
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810c44e4>] ? rcu_oom_notify+0xc4/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8115431c>] ? do_try_to_free_pages+0x4ac/0x520
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81142ddd>] ? oom_kill_process+0x21d/0x370
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8114299d>] ? find_lock_task_mm+0x3d/0x90
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81143543>] ? out_of_memory+0x473/0x4b0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8114940f>] ? __alloc_pages_nodemask+0x9ef/0xb50
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8118894d>] ? alloc_pages_current+0x9d/0x150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81141b40>] ? filemap_fault+0x1a0/0x420
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff81167d0a>] ? __do_fault+0x3a/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8116a8ce>] ? do_read_fault.isra.54+0x4e/0x300
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8116c0fc>] ? handle_mm_fault+0x63c/0x1150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810582c7>] ? __do_page_fault+0x177/0x4f0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8109ea37>] ? put_prev_entity+0x57/0x350
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8105299b>] ? kvm_clock_get_cycles+0x1b/0x20
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff810c9b32>] ? ktime_get_ts+0x42/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff811be157>] ? poll_select_copy_remaining+0xe7/0x140
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]
[<ffffffff8151c4d8>] ? async_page_fault+0x28/0x30
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.875236] Mem-Info:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.876102] Node 0 DMA per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.877119] CPU 0: hi:
0, btch: 1 usd: 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.878426] Node 0 DMA32 per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.879490] CPU 0: hi:
186, btch: 31 usd: 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]
active_anon:113183 inactive_anon:113245 isolated_anon:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]
active_file:46 inactive_file:58 isolated_file:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] unevictable:0
dirty:0 writeback:0 unstable:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] free:12241
slab_reclaimable:3169 slab_unreclaimable:5274
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] mapped:2458
shmem:2849 pagetables:5275 bounce:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] free_cma:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.888200] Node 0 DMA
free:4636kB min:700kB low:872kB high:1048kB active_anon:5292kB
inactive_anon:5328kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB
mlocked:0kB dirty:0kB writeback:0kB mapped:280kB shmem:280kB
slab_reclaimable:104kB slab_unreclaimable:244kB kernel_stack:48kB
pagetables:124kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.897611]
lowmem_reserve[]: 0 982 982 982
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.899344] Node 0 DMA32
free:44328kB min:44352kB low:55440kB high:66528kB active_anon:447440kB
inactive_anon:447652kB active_file:184kB inactive_file:232kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1032184kB managed:1008516kB mlocked:0kB dirty:0kB
writeback:0kB mapped:9552kB shmem:11116kB slab_reclaimable:12572kB
slab_unreclaimable:20852kB kernel_stack:3008kB pagetables:20976kB
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:727 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.910174]
lowmem_reserve[]: 0 0 0 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.911889] Node 0 DMA:
5*4kB (M) 1*8kB (U) 10*16kB (U) 7*32kB (UM) 10*64kB (UM) 2*128kB (UM)
3*256kB (UM) 1*512kB (M) 0*1024kB 1*2048kB (R) 0*4096kB = 4636kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.917906] Node 0 DMA32:
738*4kB (UE) 390*8kB (EM) 189*16kB (UE) 117*32kB (UEM) 120*64kB (EM)
60*128kB (UEM) 23*256kB (UEM) 8*512kB (UEM) 2*1024kB (U) 0*2048kB
1*4096kB (R) = 44328kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.925589] Node 0
hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.927978] 12189 total
pagecache pages
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.928995] 9236 pages in
swap cache
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.929978] Swap cache
stats: add 1598908, delete 1589672, find 1953234/2271848
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.931903] Free swap = 0kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.932854] Total swap = 1048544kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.933794] 262044 pages RAM
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.934655] 0 pages
HighMem/MovableOnly
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.935688] 5917 pages reserved
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.936598] 0 pages hwpoisoned
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.937500] [ pid ] uid
tgid total_vm rss nr_ptes swapents oom_score_adj name
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.940324] [ 172] 0
172 7217 269 20 43 0 systemd-journal
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.942449] [ 184] 0
184 10356 2 22 217 -1000 systemd-udevd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.944605] [ 597] 0
597 3015 16 12 768 0 haveged
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.946531] [ 598] 1
598 1571 14 9 13 0 uptimed
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.948669] [ 599] 101
599 42559 4837 67 6402 0 named
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.950662] [ 600] 0
600 64667 70 29 197 0 rsyslogd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.952633] [ 601] 0
601 4756 5 15 40 0 atd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.954470] [ 602] 0
602 6476 20 18 47 0 cron
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.957340] [ 603] 0
603 39348 46 43 2237 0 lwresd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.960249] [ 607] 0
607 13796 29 31 139 -1000 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.962284] [ 609] 107
609 10563 48 24 69 -900 dbus-daemon
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.964515] [ 617] 0
617 7088 44 19 38 0 systemd-logind
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.966680] [ 671] 105
671 305156 30716 486 181302 0 clamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.968711] [ 687] 0
687 15925 289 35 2831 0 spfd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.970791] [ 699] 103
699 8346 49 22 109 0 ntpd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.973677] [ 701] 0
701 3724 3 12 36 0 agetty
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.976699] [ 702] 0
702 17950 3 40 132 0 login
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.978631] [ 715] 0
715 5062 3 15 113 0 mysqld_safe
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.980675] [ 858] 17041
858 150162 3794 90 11586 0 mysqld
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.982763] [ 859] 0
859 5536 3 16 51 0 logger
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.984811] [ 878] 38
878 15029 249 32 2049 0 mailmanctl
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.987040] [ 879] 38
879 15637 1826 34 1126 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.989403] [ 880] 38
880 15700 1885 35 1145 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.992585] [ 881] 38
881 14998 283 32 2005 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.994953] [ 882] 38
882 15700 548 34 2470 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.997219] [ 883] 38
883 15002 286 33 2019 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.999303] [ 884] 38
884 16807 521 36 2524 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.001351] [ 885] 38
885 15690 968 33 1995 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.003496] [ 886] 38
886 15024 250 33 2036 0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.005724] [ 912] 0
912 42169 599 87 17903 0 /usr/sbin/spamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.008183] [ 914] 0
914 49613 670 30 1160 0 fail2ban-server
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.010262] [ 997] 0
997 78222 269 114 3705 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.011990] [ 1027] 0
1027 43875 10022 91 10191 0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.013870] [ 1029] 0
1029 42735 3586 87 15469 0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.015655] [ 1039] 113
1039 25706 45 48 473 0 exim4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.017462] [ 1048] 1607
1048 9051 141 22 119 0 systemd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.020096] [ 1049] 1607
1049 12900 14 26 740 0 (sd-pam)
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.022931] [ 1051] 0
1051 13451 3 33 115 0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.025523] [ 1083] 0
1083 13311 1 32 106 0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.027987] [ 1084] 0
1084 5085 3 15 142 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.030570] [ 2215] 0
2215 23309 3 49 234 0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.033272] [ 2217] 1607
2217 23309 44 47 201 0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.035822] [ 2218] 1607
2218 5081 3 15 138 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.038453] [ 2253] 1607
2253 5992 7 18 55 0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.041654] [ 2258] 1607
2258 6066 69 18 106 0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.044778] [ 2273] 1607
2273 5083 82 15 63 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.047324] [ 2306] 0
2306 13451 3 32 118 0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.050021] [ 2339] 0
2339 13311 1 31 105 0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.051936] [ 2340] 0
2340 5103 73 16 91 0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.053694] [15410] 33
15410 88963 13397 135 3271 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.055645] [16284] 33
16284 80083 4647 118 2974 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.058999] [19656] 33
19656 80456 4573 117 3142 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.062614] [20288] 33
20288 79754 3974 117 3037 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.065805] [21280] 33
21280 79333 3244 115 3168 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.068966] [21283] 33
21283 85967 10413 129 3078 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.072292] [21284] 33
21284 86848 10748 130 3174 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.075728] [21306] 33
21306 85528 9358 127 3217 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.078872] [21420] 33
21420 86187 9993 128 3206 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.082082] [21832] 33
21832 85304 9835 128 3046 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.085599] [21937] 33
21937 79546 3702 116 3210 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.090313] [21940] 33
21940 85942 9674 128 3325 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.093417] [21950] 33
21950 88632 12360 133 3305 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.096711] [21951] 33
21951 87235 11223 131 3214 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.099971] [21953] 33
21953 87544 10917 131 3330 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.103143] [21976] 33
21976 87350 10978 131 3313 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.105194] [21984] 33
21984 87404 11163 131 3246 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.107390] [21985] 33
21985 86796 10190 129 3419 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.110497] [22036] 33
22036 87183 10215 129 3421 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.112329] [22039] 33
22039 85816 9314 128 3330 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.114268] [22040] 33
22040 78803 2445 113 3337 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.115973] [22042] 33
22042 79744 2694 116 3336 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.118243] [22043] 33
22043 84175 7485 123 3421 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.120124] [22044] 33
22044 83023 6360 121 3416 0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.122435] [22052] 33
22052 78230 288 100 3689 0 apache2
clamd is a fair bit of a memory hog, but I knew that, and had added more
RAM to the host earlier; also, it's only one PID.
The big RAM suck is Apache ... not per PID, but in total. With 26 Apache
PIDs, that quickly adds up.
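Back-of-the-envelope, using the rss column from the OOM table above (which is in 4 kB pages; the ~10000-page per-child figure is a rough read of the busier apache2 entries, not a measurement):

```shell
# 26 prefork children at roughly 10000 resident pages (4 kB each) apiece:
pages_per_child=10000
kb_per_child=$(( pages_per_child * 4 ))     # ~40 MB per child
total_mb=$(( 26 * kb_per_child / 1024 ))
echo "${total_mb} MB"   # ~1 GB, against ~1 GB RAM + 1 GB swap on the host
```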
/var/log/syslog*
And looks like it was around here that named got whacked:
Sep 29 02:52:04 balug-sf-lug-v2 systemd[1]: bind9.service: main
process exited, code=killed, status=9/KILL
Reviewed the Apache logs (/var/log/apache2/*)
... looks mostly like some overzealous web crawlers, plus some bad bots,
were both simultaneously pounding away quite hard on Apache. Looks like
I'd not tuned Apache sufficiently to well withstand such a load, so
Apache (mpm_prefork) forked lots of PIDs to handle the requests and
attempt to keep up, but alas, beyond the resources reasonably available
on the host ... and that's when things then went South rather quickly.
So ... would be better to limit how much resource Apache would suck up,
rather than allow Apache to consume excessive resources relative to
those available. That may result in some web service failures/errors
... but that's better than Apache otherwise negatively impacting services
on the host.
So ... Apache configuration ...
cd /etc/apache2 && ./.unroll < apache2.conf
...
# ./.unroll START: IncludeOptional mods-enabled/*.load
LoadModule mpm_prefork_module /usr/lib/apache2/modules/mod_mpm_prefork.so
...
# ./.unroll START: IncludeOptional mods-enabled/*.conf
...
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxRequestWorkers 150
MaxConnectionsPerChild 0
</IfModule>
From the Apache documentation:
"Most important is that MaxRequestWorkers be big enough to handle as
many simultaneous requests as you expect to receive, but small enough
to assure that there is enough physical RAM for all processes."
Oops!
So, looks like things were going badly with 26 Apache PIDs (likely 1
"master" (parent) and the rest spawned children). So, likely
MaxRequestWorkers should be something at least below 25 - and even 25
would be (a bit) too high. That's also a lot of load/work -
simultaneously handling up to 25 requests. If something(s) are
requesting that much ... some requests can wait - better that than quite
negatively impacting the host overall. So ... I'm thinking 20.
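The arithmetic behind a number like 20 can be sketched as RAM budget divided by per-child size (hedged: the per-child size is roughly read off the OOM table above, and the headroom figure is a guess, not measured):

```shell
# MaxRequestWorkers ~= (RAM budget for Apache) / (resident size of one child)
ram_for_apache_kb=800000   # illustrative: most of the ~1 GB, leaving
                           # headroom for named, clamd, mysqld, Mailman, ...
per_child_kb=40000         # ~10000 pages * 4 kB, from the OOM table above
echo $(( ram_for_apache_kb / per_child_kb ))
```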
How does it look at present?
2>&1 fuser /usr/sbin/apache2
/usr/sbin/apache2: 443e 969e 997e 3211e 7433e 13566e 13604e
13605e 13623e 31738e 31749e
11 PIDs, so 20 ought be pretty reasonable for MaxRequestWorkers. That
would be about 21 Apache PIDs total, and simultaneously handling up to
about 20 requests.
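Counting those fuser PIDs mechanically (fuser writes the path to stderr, hence the 2>&1 above; here the sample output stands in for a live run):

```shell
# One field per PID after the leading "path:" field:
out='/usr/sbin/apache2: 443e 969e 997e 3211e 7433e 13566e 13604e 13605e 13623e 31738e 31749e'
printf '%s\n' "$out" | awk '{print NF - 1}'
```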
# ls -ld /etc/apache2/mods-enabled/mpm_prefork.conf
lrwxrwxrwx 1 root root 34 Oct 30 2015
/etc/apache2/mods-enabled/mpm_prefork.conf ->
../mods-available/mpm_prefork.conf
# cd /etc/apache2/mods-available
# ex mpm_prefork.conf
mpm_prefork.conf: unmodified: line 16
:/150
MaxRequestWorkers 150
:s/150/20/p
MaxRequestWorkers 20
:w
mpm_prefork.conf: 16 lines, 570 characters
:q
# (cd / && umask 022 && apachectl graceful)
# ci -d -u -M -m'adjust MaxRequestWorkers to avoid excess RAM
consumption and dreaded kernel OOM PID killer' mpm_prefork.conf
RCS/mpm_prefork.conf,v <-- mpm_prefork.conf
new revision: 1.2; previous revision: 1.1
done
#
Could also possibly consider going to a threaded MPM (if that plays
well/safely with the other Apache bits and related software installed).
... and then a reboot for good measure (notably in case other PIDs got
whacked by kernel OOM PID killer that ought be restarted).
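To spot other casualties before (or instead of) a reboot, systemd can list what landed in failed state; a sketch, with a sample line standing in for live `systemctl --failed` output:

```shell
# Extract unit names from systemctl --failed style output; on the host:
#   systemctl --failed
sample='  bind9.service   loaded failed failed BIND Domain Name Server'
printf '%s\n' "$sample" | awk '/failed/ {print $1}'
```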
And then hopefully all is well with the universe again ... at least for
now.
> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
> To: "Rick Moen" <rick@linuxmafia.com>
> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
> Date: Sat, 30 Sep 2017 20:41:23 -0700
> So, ... looks like, at least to a 1st order
> approximation/guestimation based upon various log data ...
> a combination of some overzealous web crawlers and bad bots
> hit hard with the frequency and simultaneity of their requests,
> Apache - probably still at or near to default on its configuration
> for handling such as for workers/threads, tried to be very highly
> accommodating to the excessive requests, starting up lots of
> processes/threads ... and ... probably not properly tuned,
> beyond reasonable for the available resources ... and
> then fairly soon into that things started going seriously South.
> So ... probably first bit, ... tweak some Apache settings so it's
> not overly generous in providing resources, and excessive in
> consumption of resources ... better Apache return a "server busy"
> or similar type of error, than consume excessive resources
> to the point where it quite negatively impacts the host
> (dang Linux memory (mis) management ... if it didn't overcommit,
> it could simply tell Apache, "sorry no more memory for you - enough
> already" ... and generally nothing else would suffer, but ... alas,
> if only ...). So, ... anyway, tweak some Apache settings on that,
> reboot (to revive any other innocent PIDs that may have been
> unceremoniously slaughtered), and ... keep an eye on it and see
> how things go. Have bumped into relatively similar before ...
> but it's been several years. The issue from many years ago was
> bad bots massively excessively registering accounts on the
> wiki ... the bots were too stupid to manage to do anything with
> the wiki once they registered all those accounts ... but the
> registration was so massively parallel it was a DoS attack
> on the wiki/host, and it ballooned resource consumption so high and
> fast, host would lock up solid without leaving much of a trace
> as to what happened ... took a bit of sleuthing and adding
> some wee bits 'o extra data collection to track down
> and nail that one. The work-around was then to change
> the wiki so no web-based registrations were allowed
> anymore ... rare enough folks are added on the wiki,
> that such can be handled manually ... that was sufficient
> change to work around that issue several years ago
> when that was going on. Anyway, MTA/Mailman/anti-spam etc.
> have upped the resource requirements/consumption some fairish
> bit (did also give the virtual host more virtual RAM at the time
> too) ... but ... probably again time for some more resource
> allocation/tuning ... and looks like at present time Apache is
> first logical place to adjust that.
>
>> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
>> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>> Date: Sat, 30 Sep 2017 20:07:48 -0700
>
>> The bloody OOM PID killer kicked in, and at some point
>> in its infinite wisdom (stupidity) thought that SIGKILL
>> to named on a nameserver was a "good" <cough, cough> idea.
>> I'll have to see if I can isolate what was sucking up too
>> much resource. I never did much care for Linux's (or at least
>> some/many distributions leaving it enabled in kernel by default)
>> overcommitting (notably as crude work-around for folks that
>> write crappy programs that request lots of memory which they often
>> never need or use) ... which then, when something actually needs
>> the memory the kernel "gave" (promised) it, when it cheated
>> and overcommitted ... yeah, ... that's when things get very ugly
>> very fast - a.k.a. OOM PID killer, ... ugh!
>>
>> Anyway, more log sleuthing, to see what ate up so much
>> resource ... and ... probably due for reboot after the OOM
>> kicked in anyway, ... dear knows what else got whacked that
>> ought not have been whacked.
>>
>>
>>> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
>>> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>>> Date: Sat, 30 Sep 2017 19:36:01 -0700
>>
>>> Thanks, ... hopefully it's "all better now".
>>> Somehow named wasn't listening 8-O ... and I think that's also
>>> why the (still listening) MTA was then, uh, "upset."
>>> I'll see what I can find as to why named was down and when it
>>> went down ... maybe operator error, maybe ... who knows what.
>>> Anyway, I'll see what I can find (and will recheck its general
>>> health).
>>>
>>> Anyway, thanks for bringing it to my attention - I'd not
>>> seen that quite yet.
>>>
>>>> From: "Rick Moen" <rick@linuxmafia.com>
>>>> Subject: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>>>> Date: Sat, 30 Sep 2017 12:00:28 -0700
>>>
>>>> Just checking to make sure you're aware of ongoing nameserver downtime.
>>>> Also, MTA downtime.
>>>>
>>>> $ telnet mx.balug.org 25
>>>> Trying 198.144.194.238...
>>>> Connected to mx.balug.org.
>>>> Escape character is '^]'.
>>>> 451 Temporary local problem - please try later
>>>> Connection closed by foreign host.
>>>> $
>>>>
>>>> ----- Forwarded message from logcheck system account
>>>> <logcheck@linuxmafia.com> -----
>>>>
>>>> Date: Sat, 30 Sep 2017 11:02:01 -0700
>>>> From: logcheck system account <logcheck@linuxmafia.com>
>>>> To: root@linuxmafia.com
>>>> Subject: linuxmafia.com 2017-09-30 11:02 System Events
>>>>
>>>> System Events
>>>> =-=-=-=-=-=-=
>>>> Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>> Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN:
>>>> Transfer started.
>>>> Sep 30 10:20:52 linuxmafia named[32734]: transfer of
>>>> 'sf-lug.org/IN' from 198.144.194.238#53: failed to connect:
>>>> connection refused
>>>> Sep 30 10:20:52 linuxmafia named[32734]: transfer of
>>>> 'sf-lug.org/IN' from 198.144.194.238#53: Transfer completed: 0
>>>> messages, 0 records, 0 bytes, 0.062 secs (0 bytes/sec)
>>>> Sep 30 10:24:59 linuxmafia named[32734]: zone balug.org/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>> Sep 30 10:30:20 linuxmafia named[32734]: zone sf-lug.com/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>> Sep 30 10:45:15 linuxmafia named[32734]: zone
>>>> e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: refresh: retry limit
>>>> for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
>>>> Sep 30 10:45:15 linuxmafia named[32734]: zone
>>>> e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: Transfer started.
>>>> Sep 30 10:45:15 linuxmafia named[32734]: transfer of
>>>> 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from
>>>> 198.144.194.238#53: failed to connect: connection refused
>>>> Sep 30 10:45:15 linuxmafia named[32734]: transfer of
>>>> 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from
>>>> 198.144.194.238#53: Transfer completed: 0 messages, 0 records, 0
>>>> bytes, 0.065 secs (0 bytes/sec)
>>>> Sep 30 10:51:37 linuxmafia named[32734]: zone balug.org/IN:
>>>> refresh: retry limit for master 198.144.194.238#53 exceeded
>>>> (source 0.0.0.0#0)
>>>>
>>>>
>>>> ----- End forwarded message -----