[BALUG-Admin] balug.org(/sf-lug.{org, com}) host OOM oops Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events

Michael Paoli Michael.Paoli@cal.berkeley.edu
Sun Oct 1 02:08:18 PDT 2017


To on-list, because, well, ... why not?  :-)
And ... BALUG-Admin, (and/or) SF-LUG?
Well, it's more so that the balug.org host also runs some SF-LUG
services, rather than vice versa, so ...

"She's dead, Jim." - Rick Moen (Thanks Rick!) noticed some issues,
checked a bit and ... named was no longer running!  8-O
Host was still up and otherwise seemed (relatively) healthy,
but ... what happened to named?  Did some digging ... not dig(1)
particularly, but ... checking logs.

So ... restarted named ... seemed healthy and okay and such.
Started checking logs ...

/var/log/daemon.log*
Sep 29 02:58:37 balug-sf-lug-v2 systemd[1]: Unit bind9.service entered failed state.
That seemed about the last peep about it ... but there were indications
of other problems ... notably things apparently being killed off.  :-/

/var/log/messages*
Uh oh, ... the dreaded OOM PID killer! ...
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.816128] fail2ban-server invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.830751] fail2ban-server cpuset=/ mems_allowed=0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.832896] CPU: 0 PID: 914 Comm: fail2ban-server Not tainted 3.16.0-4-amd64 #1 Debian 3.16.43-2+deb8u5
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.834810] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.835826]  0000000000000000 ffffffff81514291 ffff88003ca049e0 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  ffffffff81511e69 0000000000000000 ffffffff810d6f6f 0000000000000000
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  ffffffff81518c4e 0000000000000200 ffffffff81068a53 ffffffff810c44e4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197] Call Trace:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff81514291>] ? dump_stack+0x5d/0x78
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff81511e69>] ? dump_header+0x76/0x1e8
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff810d6f6f>] ? smp_call_function_single+0x5f/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff81518c4e>] ? mutex_lock+0xe/0x2a
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff81068a53>] ? put_online_cpus+0x23/0x80
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff810c44e4>] ? rcu_oom_notify+0xc4/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff8115431c>] ? do_try_to_free_pages+0x4ac/0x520
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff81142ddd>] ? oom_kill_process+0x21d/0x370
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff8114299d>] ? find_lock_task_mm+0x3d/0x90
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff81143543>] ? out_of_memory+0x473/0x4b0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff8114940f>] ? __alloc_pages_nodemask+0x9ef/0xb50
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff8118894d>] ? alloc_pages_current+0x9d/0x150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff81141b40>] ? filemap_fault+0x1a0/0x420
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff81167d0a>] ? __do_fault+0x3a/0xa0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff8116a8ce>] ? do_read_fault.isra.54+0x4e/0x300
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff8116c0fc>] ? handle_mm_fault+0x63c/0x1150
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff810582c7>] ? __do_page_fault+0x177/0x4f0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff8109ea37>] ? put_prev_entity+0x57/0x350
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff8105299b>] ? kvm_clock_get_cycles+0x1b/0x20
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff810c9b32>] ? ktime_get_ts+0x42/0xe0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff811be157>] ? poll_select_copy_remaining+0xe7/0x140
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.836197]  [<ffffffff8151c4d8>] ? async_page_fault+0x28/0x30
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.875236] Mem-Info:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.876102] Node 0 DMA per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.877119] CPU    0: hi:    0, btch:   1 usd:   0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.878426] Node 0 DMA32 per-cpu:
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.879490] CPU    0: hi:  186, btch:  31 usd:   0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714] active_anon:113183 inactive_anon:113245 isolated_anon:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]  active_file:46 inactive_file:58 isolated_file:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]  unevictable:0 dirty:0 writeback:0 unstable:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]  free:12241 slab_reclaimable:3169 slab_unreclaimable:5274
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]  mapped:2458 shmem:2849 pagetables:5275 bounce:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.880714]  free_cma:0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.888200] Node 0 DMA free:4636kB min:700kB low:872kB high:1048kB active_anon:5292kB inactive_anon:5328kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:280kB shmem:280kB slab_reclaimable:104kB slab_unreclaimable:244kB kernel_stack:48kB pagetables:124kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.897611] lowmem_reserve[]: 0 982 982 982
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.899344] Node 0 DMA32 free:44328kB min:44352kB low:55440kB high:66528kB active_anon:447440kB inactive_anon:447652kB active_file:184kB inactive_file:232kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1032184kB managed:1008516kB mlocked:0kB dirty:0kB writeback:0kB mapped:9552kB shmem:11116kB slab_reclaimable:12572kB slab_unreclaimable:20852kB kernel_stack:3008kB pagetables:20976kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:727 all_unreclaimable? yes
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.910174] lowmem_reserve[]: 0 0 0 0
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.911889] Node 0 DMA: 5*4kB (M) 1*8kB (U) 10*16kB (U) 7*32kB (UM) 10*64kB (UM) 2*128kB (UM) 3*256kB (UM) 1*512kB (M) 0*1024kB 1*2048kB (R) 0*4096kB = 4636kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.917906] Node 0 DMA32: 738*4kB (UE) 390*8kB (EM) 189*16kB (UE) 117*32kB (UEM) 120*64kB (EM) 60*128kB (UEM) 23*256kB (UEM) 8*512kB (UEM) 2*1024kB (U) 0*2048kB 1*4096kB (R) = 44328kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.925589] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.927978] 12189 total pagecache pages
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.928995] 9236 pages in swap cache
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.929978] Swap cache stats: add 1598908, delete 1589672, find 1953234/2271848
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.931903] Free swap  = 0kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.932854] Total swap = 1048544kB
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.933794] 262044 pages RAM
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.934655] 0 pages HighMem/MovableOnly
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.935688] 5917 pages reserved
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.936598] 0 pages hwpoisoned
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.937500] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.940324] [  172]     0   172     7217      269      20       43             0 systemd-journal
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.942449] [  184]     0   184    10356        2      22      217         -1000 systemd-udevd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.944605] [  597]     0   597     3015       16      12      768             0 haveged
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.946531] [  598]     1   598     1571       14       9       13             0 uptimed
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.948669] [  599]   101   599    42559     4837      67     6402             0 named
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.950662] [  600]     0   600    64667       70      29      197             0 rsyslogd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.952633] [  601]     0   601     4756        5      15       40             0 atd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.954470] [  602]     0   602     6476       20      18       47             0 cron
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.957340] [  603]     0   603    39348       46      43     2237             0 lwresd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.960249] [  607]     0   607    13796       29      31      139         -1000 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.962284] [  609]   107   609    10563       48      24       69          -900 dbus-daemon
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.964515] [  617]     0   617     7088       44      19       38             0 systemd-logind
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.966680] [  671]   105   671   305156    30716     486   181302             0 clamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.968711] [  687]     0   687    15925      289      35     2831             0 spfd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.970791] [  699]   103   699     8346       49      22      109             0 ntpd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.973677] [  701]     0   701     3724        3      12       36             0 agetty
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.976699] [  702]     0   702    17950        3      40      132             0 login
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.978631] [  715]     0   715     5062        3      15      113             0 mysqld_safe
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.980675] [  858] 17041   858   150162     3794      90    11586             0 mysqld
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.982763] [  859]     0   859     5536        3      16       51             0 logger
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.984811] [  878]    38   878    15029      249      32     2049             0 mailmanctl
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.987040] [  879]    38   879    15637     1826      34     1126             0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.989403] [  880]    38   880    15700     1885      35     1145             0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.992585] [  881]    38   881    14998      283      32     2005             0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.994953] [  882]    38   882    15700      548      34     2470             0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.997219] [  883]    38   883    15002      286      33     2019             0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188213.999303] [  884]    38   884    16807      521      36     2524             0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.001351] [  885]    38   885    15690      968      33     1995             0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.003496] [  886]    38   886    15024      250      33     2036             0 python
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.005724] [  912]     0   912    42169      599      87    17903             0 /usr/sbin/spamd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.008183] [  914]     0   914    49613      670      30     1160             0 fail2ban-server
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.010262] [  997]     0   997    78222      269     114     3705             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.011990] [ 1027]     0  1027    43875    10022      91    10191             0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.013870] [ 1029]     0  1029    42735     3586      87    15469             0 spamd child
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.015655] [ 1039]   113  1039    25706       45      48      473             0 exim4
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.017462] [ 1048]  1607  1048     9051      141      22      119             0 systemd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.020096] [ 1049]  1607  1049    12900       14      26      740             0 (sd-pam)
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.022931] [ 1051]     0  1051    13451        3      33      115             0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.025523] [ 1083]     0  1083    13311        1      32      106             0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.027987] [ 1084]     0  1084     5085        3      15      142             0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.030570] [ 2215]     0  2215    23309        3      49      234             0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.033272] [ 2217]  1607  2217    23309       44      47      201             0 sshd
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.035822] [ 2218]  1607  2218     5081        3      15      138             0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.038453] [ 2253]  1607  2253     5992        7      18       55             0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.041654] [ 2258]  1607  2258     6066       69      18      106             0 screen
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.044778] [ 2273]  1607  2273     5083       82      15       63             0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.047324] [ 2306]     0  2306    13451        3      32      118             0 sudo
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.050021] [ 2339]     0  2339    13311        1      31      105             0 su
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.051936] [ 2340]     0  2340     5103       73      16       91             0 bash
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.053694] [15410]    33 15410    88963    13397     135     3271             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.055645] [16284]    33 16284    80083     4647     118     2974             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.058999] [19656]    33 19656    80456     4573     117     3142             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.062614] [20288]    33 20288    79754     3974     117     3037             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.065805] [21280]    33 21280    79333     3244     115     3168             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.068966] [21283]    33 21283    85967    10413     129     3078             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.072292] [21284]    33 21284    86848    10748     130     3174             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.075728] [21306]    33 21306    85528     9358     127     3217             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.078872] [21420]    33 21420    86187     9993     128     3206             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.082082] [21832]    33 21832    85304     9835     128     3046             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.085599] [21937]    33 21937    79546     3702     116     3210             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.090313] [21940]    33 21940    85942     9674     128     3325             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.093417] [21950]    33 21950    88632    12360     133     3305             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.096711] [21951]    33 21951    87235    11223     131     3214             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.099971] [21953]    33 21953    87544    10917     131     3330             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.103143] [21976]    33 21976    87350    10978     131     3313             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.105194] [21984]    33 21984    87404    11163     131     3246             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.107390] [21985]    33 21985    86796    10190     129     3419             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.110497] [22036]    33 22036    87183    10215     129     3421             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.112329] [22039]    33 22039    85816     9314     128     3330             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.114268] [22040]    33 22040    78803     2445     113     3337             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.115973] [22042]    33 22042    79744     2694     116     3336             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.118243] [22043]    33 22043    84175     7485     123     3421             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.120124] [22044]    33 22044    83023     6360     121     3416             0 apache2
Sep 29 01:16:05 balug-sf-lug-v2 kernel: [188214.122435] [22052]    33 22052    78230      288     100     3689             0 apache2
clamd is a fair bit of a memory hog, but knew that - added more RAM to
the host earlier - and also, it's only one PID.
The big RAM suck is Apache ... not per PID, but in total.  26 Apache
PIDs, and that quickly adds up.
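
For a rough sense of that total, something like this quick sketch works
(RSS double-counts pages the prefork children share, so it overstates
somewhat):

# ps -o rss= -C apache2 | awk '{t += $1} END {printf "%d apache2 PIDs, ~%d MiB total RSS\n", NR, t/1024}'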

/var/log/syslog*
And looks like it was around here that named got whacked:
Sep 29 02:52:04 balug-sf-lug-v2 systemd[1]: bind9.service: main process exited, code=killed, status=9/KILL
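
And to round up the OOM killer's other victims from the logs, a simple
grep suffices (zgrep, so the rotated/compressed logs get searched too):

# zgrep -hE 'invoked oom-killer|Out of memory' /var/log/messages* /var/log/syslog* | sort -u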

Reviewed the Apache logs (/var/log/apache2/*)
... looks mostly like some overzealous web crawlers, plus some bad bots,
were simultaneously pounding away quite hard on Apache.  Looks like
I'd not tuned Apache sufficiently to withstand such a load well, so
Apache (mpm_prefork) forked lots of PIDs to handle the requests and
attempt to keep up, but alas, beyond the resources reasonably available
on the host ... and that's when things went south rather quickly.
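
(E.g., a quick look at who was pounding hardest - assuming the default
combined log format, with the client address in field 1:)

# awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head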

So ... it would be better to limit how much resource Apache can suck up,
rather than allow Apache to consume excessive resources relative to
those available.  That may result in some web service failures/errors
... but that's better than Apache otherwise negatively impacting services
on the host.

So ... Apache configuration ...
cd /etc/apache2 && ./.unroll < apache2.conf
...
# ./.unroll START: IncludeOptional mods-enabled/*.load
LoadModule mpm_prefork_module /usr/lib/apache2/modules/mod_mpm_prefork.so
...
# ./.unroll START: IncludeOptional mods-enabled/*.conf
...
<IfModule mpm_prefork_module>
        StartServers              5
        MinSpareServers           5
        MaxSpareServers          10
        MaxRequestWorkers       150
        MaxConnectionsPerChild    0
</IfModule>
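
(That ./.unroll is just a wee local helper that flattens
Include/IncludeOptional directives so the effective configuration reads
straight through.  Roughly along these lines - a hypothetical minimal
sketch, not the actual script; it ignores quoting, indented directives,
and directory includes:)

#!/bin/sh
# copy Apache configuration from stdin to stdout, recursively
# expanding Include/IncludeOptional directives, marking each expansion
unroll() {
  while IFS= read -r line; do
    case $line in
      "Include "*|"IncludeOptional "*)
        echo "# ./.unroll START: $line"
        for f in ${line#* }; do    # let the shell expand any wildcard
          [ -f "$f" ] && unroll < "$f"
        done
        ;;
      *) printf '%s\n' "$line" ;;
    esac
  done
}
unroll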

From the Apache documentation:
"Most important is that MaxRequestWorkers be big enough to handle as
many simultaneous requests as you expect to receive, but small enough
to assure that there is enough physical RAM for all processes."

Oops!

So, looks like things were going badly with 26 Apache PIDs (likely 1
"master" (parent) and the rest spawned children).  So, likely
MaxRequestWorkers should be something below 25 - and even 25
would be a bit too high.  That's also a lot of load/work -
simultaneously handling up to 25 requests.  If something(s) are
requesting that much ... some requests can wait - better that than quite
negatively impacting the host overall.  So ... I'm thinking 20.
How does it look at present?

2>&1 fuser /usr/sbin/apache2
/usr/sbin/apache2:     443e   969e   997e  3211e  7433e 13566e 13604e 13605e 13623e 31738e 31749e
11 PIDs, so 20 ought be pretty reasonable for MaxRequestWorkers.  That
would be about 21 Apache PIDs total, and simultaneously handling up to
about 20 requests.
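
Back-of-envelope check on that: per the rss column in the OOM dump
above, the busy apache2 children were running roughly 10000 pages
apiece - about 40 MiB at 4 KiB/page - so:

# echo "$((20 * 40)) MiB"
800 MiB

... against the host's roughly 1 GiB of RAM.  RSS double-counts the
pages the prefork children share, so the true aggregate is lower, but
it still suggests 20 is about as high as this host can sanely go.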

# ls -ld /etc/apache2/mods-enabled/mpm_prefork.conf
lrwxrwxrwx 1 root root 34 Oct 30  2015 /etc/apache2/mods-enabled/mpm_prefork.conf -> ../mods-available/mpm_prefork.conf
# cd /etc/apache2/mods-available
# ex mpm_prefork.conf
mpm_prefork.conf: unmodified: line 16
:/150
         MaxRequestWorkers         150
:s/150/20/p
         MaxRequestWorkers         20
:w
mpm_prefork.conf: 16 lines, 570 characters
:q
# (cd / && umask 022 && apachectl graceful)
# ci -d -u -M -m'adjust MaxRequestWorkers to avoid excess RAM consumption and dreaded kernel OOM PID killer' mpm_prefork.conf
RCS/mpm_prefork.conf,v  <--  mpm_prefork.conf
new revision: 1.2; previous revision: 1.1
done
#

Could also possibly consider going to a threaded model (if that plays
well/safely with the other Apache bits and related software installed).
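
(On Debian that would be roughly as follows - a sketch, not something
tested here, and only sane if mod_php or other non-thread-safe modules
aren't in the mix:)

# apachectl -M | grep -i mpm    # confirm which MPM is currently loaded
# a2dismod mpm_prefork && a2enmod mpm_event
# apachectl configtest && systemctl restart apache2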

... and then a reboot for good measure (notably in case other PIDs got
whacked by kernel OOM PID killer that ought be restarted).
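
(A quick check for such casualties - systemd makes it easy enough:)

# systemctl --failed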

And then hopefully all is well with the universe again ... at least for
now.

> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
> To: "Rick Moen" <rick@linuxmafia.com>
> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
> Date: Sat, 30 Sep 2017 20:41:23 -0700

> So, ... looks like, at least to a 1st order
> approximation/guestimation based upon various log data ...
> a combination of some overzealous web crawlers and bad bots
> hit hard with the frequency and simultaneity of their requests;
> Apache - probably still at or near default in its configuration
> for such things as workers/threads - tried to be very highly
> accommodating to the excessive requests, starting up lots of
> processes/threads ... and ... probably not properly tuned,
> beyond what's reasonable for the available resources ... and
> then fairly soon into that, things started going seriously south.
> So ... probably first bit, ... tweak some Apache settings so it's
> not overly generous in providing resources, and excessive in
> consumption of resources ... better Apache return a "server busy"
> or similar type of error, than consume excessive resources
> to the point where it quite negatively impacts the host
> (dang Linux memory (mis) management ... if it didn't overcommit,
> it could simply tell Apache, "sorry no more memory for you - enough
> already" ... and generally nothing else would suffer, but ... alas,
> if only ...).  So, ... anyway, tweak some Apache settings on that,
> reboot (to revive any other innocent PIDs that may have been
> unceremoniously slaughtered), and ... keep an eye on it and see
> how things go.  Have bumped into relatively similar before ...
> but it's been several years.  The issue from many years ago was
> bad bots massively, excessively registering accounts on the
> wiki ... the bots were too stupid to manage to do anything with
> the wiki once they registered all those accounts ... but the
> registration was so massively parallel it was a DoS attack
> on the wiki/host, and it ballooned resource consumption so high and
> so fast, the host would lock up solid without leaving much of a trace
> as to what happened ... took a bit of sleuthing and adding
> some wee bits 'o extra data collection to track down
> and nail that one.  The work-around was then to change
> the wiki so no web-based registrations were allowed
> anymore ... rarely enough are folks added on the wiki
> that such can be handled manually ... that was sufficient
> change to work around that issue several years ago
> when that was going on.  Anyway, MTA/Mailman/anti-spam etc.
> have upped the resource requirements/consumption some fairish
> bit (did also give the virtual host more virtual RAM at the time
> too) ... but ... probably again time for some more resource
> allocation/tuning ... and looks like at present time Apache is
> first logical place to adjust that.
>
>> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
>> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>> Date: Sat, 30 Sep 2017 20:07:48 -0700
>
>> The bloody OOM PID killer kicked in, and at some point
>> in its infinite wisdom (stupidity) thought that SIGKILL
>> to named on a nameserver was a "good" <cough, cough> idea.
>> I'll have to see if I can isolate what was sucking up too
>> much resource.  I never did much care for Linux's overcommitting
>> (or at least for some/many distributions leaving it enabled in the
>> kernel by default), notably as a crude work-around for folks that
>> write crappy programs that request lots of memory they often
>> never need or use ... and then, when something actually needs
>> the memory the kernel "gave" (promised) it, having cheated
>> and overcommitted ... yeah, ... that's when things get very ugly
>> very fast - a.k.a. the OOM PID killer, ... ugh!
>>
>> Anyway, more log sleuthing, to see what ate up so much
>> resource ... and ... probably due for reboot after the OOM
>> kicked in anyway, ... dear knows what else got whacked that
>> ought not have been whacked.
>>
>>
>>> From: "Michael Paoli" <Michael.Paoli@cal.berkeley.edu>
>>> Subject: Re: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>>> Date: Sat, 30 Sep 2017 19:36:01 -0700
>>
>>> Thanks, ... hopefully it's "all better now".
>>> Somehow named wasn't listening  8-O ... and I think that's also
>>> why the (still listening) MTA was then, uh, "upset."
>>> I'll see what I can find as to why named was down and when it
>>> went down ... maybe operator error, maybe ... who knows what.
>>> Anyway, I'll see what I can find (and will recheck its general
>>> health).
>>>
>>> Anyway, thanks for bringing it to my attention - I'd not
>>> seen that quite yet.
>>>
>>>> From: "Rick Moen" <rick@linuxmafia.com>
>>>> Subject: (forw) linuxmafia.com 2017-09-30 11:02 System Events
>>>> Date: Sat, 30 Sep 2017 12:00:28 -0700
>>>
>>>> Just checking to make sure you're aware of ongoing nameserver downtime.
>>>> Also, MTA downtime.
>>>>
>>>> $ telnet mx.balug.org 25
>>>> Trying 198.144.194.238...
>>>> Connected to mx.balug.org.
>>>> Escape character is '^]'.
>>>> 451 Temporary local problem - please try later
>>>> Connection closed by foreign host.
>>>> $
>>>>
>>>> ----- Forwarded message from logcheck system account <logcheck@linuxmafia.com> -----
>>>>
>>>> Date: Sat, 30 Sep 2017 11:02:01 -0700
>>>> From: logcheck system account <logcheck@linuxmafia.com>
>>>> To: root@linuxmafia.com
>>>> Subject: linuxmafia.com 2017-09-30 11:02 System Events
>>>>
>>>> System Events
>>>> =-=-=-=-=-=-=
>>>> Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
>>>> Sep 30 10:20:52 linuxmafia named[32734]: zone sf-lug.org/IN: Transfer started.
>>>> Sep 30 10:20:52 linuxmafia named[32734]: transfer of 'sf-lug.org/IN' from 198.144.194.238#53: failed to connect: connection refused
>>>> Sep 30 10:20:52 linuxmafia named[32734]: transfer of 'sf-lug.org/IN' from 198.144.194.238#53: Transfer completed: 0 messages, 0 records, 0 bytes, 0.062 secs (0 bytes/sec)
>>>> Sep 30 10:24:59 linuxmafia named[32734]: zone balug.org/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
>>>> Sep 30 10:30:20 linuxmafia named[32734]: zone sf-lug.com/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
>>>> Sep 30 10:45:15 linuxmafia named[32734]: zone e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
>>>> Sep 30 10:45:15 linuxmafia named[32734]: zone e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN: Transfer started.
>>>> Sep 30 10:45:15 linuxmafia named[32734]: transfer of 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from 198.144.194.238#53: failed to connect: connection refused
>>>> Sep 30 10:45:15 linuxmafia named[32734]: transfer of 'e.9.1.0.5.0.f.1.0.7.4.0.1.0.0.2.ip6.arpa/IN' from 198.144.194.238#53: Transfer completed: 0 messages, 0 records, 0 bytes, 0.065 secs (0 bytes/sec)
>>>> Sep 30 10:51:37 linuxmafia named[32734]: zone balug.org/IN: refresh: retry limit for master 198.144.194.238#53 exceeded (source 0.0.0.0#0)
>>>>
>>>>
>>>> ----- End forwarded message -----



