BALUG VM was down for a fair while earlier today.
Has now been back up for over 7 hours.
Looks like there was an I/O hiccup on the physical host,
which didn't particularly impact the physical host itself, but
was enough of an interruption (delay) that the BALUG VM kernel
panicked.
Did have a 3rd hard drive being tested, etc. on the physical host
at the time ... it might've hit issues and possibly caused a bus
reset? Who knows for sure. Anyway ...
Went down sometime after:
2018-09-02T01:27:36-07:00
and was brought back up around:
2018-09-02T13:39:30-07:00
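Those two timestamps put an upper bound on the outage of a bit over 12 hours. A quick sketch of that arithmetic, assuming GNU date (its -d option is what parses the ISO 8601 timestamps here); the variable names are just illustrative:

```shell
# Timestamps copied from above. "down" is the last time the VM was known
# to be up, so the result is an upper bound on the outage, not an exact figure.
down='2018-09-02T01:27:36-07:00'
up='2018-09-02T13:39:30-07:00'

# GNU date: convert each timestamp to seconds since the epoch.
down_s=$(date -d "$down" +%s)
up_s=$(date -d "$up" +%s)

# Difference in seconds, then formatted as HH:MM:SS.
outage_s=$((up_s - down_s))
outage_hms=$(printf '%02d:%02d:%02d' \
    $((outage_s / 3600)) $((outage_s % 3600 / 60)) $((outage_s % 60)))

echo "outage was at most about $outage_hms"
```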
Various bits I noted in log:
$ curl -s --range 375155-378925 http://www.archive.balug.org/log.txt
2018-09-02 Michael Paoli
host crashed sometime after:
2018-09-02T01:27:36-07:00
but probably before about:
2018-09-02T01:35:00-07:00
on console, we got:
# [54894.969741] sd 0:0:0:0: [sda] tag#3 ABORT operation started
[54900.078084] sd 0:0:0:0: ABORT operation timed-out.
[54900.080312] sd 0:0:0:0: [sda] tag#2 ABORT operation started
[54905.198438] sd 0:0:0:0: ABORT operation timed-out.
[54905.200517] sd 0:0:0:0: [sda] tag#1 ABORT operation started
[54905.357128] Kernel panic - not syncing: assertion "i &&
sym_get_cam_status(cp->cmd) == DID_SOFT_ERROR" failed: file
"/build/linux-AcJpTp/linux-4.9.110/drivers/scsi/sym53c8xx_2/sym_hipd.c", line
3399
[54905.357128]
[54905.367774] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-8-amd64
#1 Debian 4.9.110-3+deb9u4
[54905.370776] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[54905.372768] 0000000000000000 ffffffff84f31e54 ffff9e2f75d5a300
ffff9e2f7fc03e50
[54905.375471] ffffffff84d7f6ad 0000000000000020 ffff9e2f7fc03e60
ffff9e2f7fc03df8
[54905.378226] 3ea9db08406f9671 0000000100d04ae4 ffffffffc048a250
ffffffffc0489e80
[54905.380982] Call Trace:
[54905.381867] <IRQ> [54905.382541] [<ffffffff84f31e54>] ?
dump_stack+0x5c/0x78
[54905.384428] [<ffffffff84d7f6ad>] ? panic+0xe4/0x23f
[54905.386164] [<ffffffffc048512e>] ? sym_interrupt+0x1c9e/0x1e80 [sym53c8xx]
[54905.388543] [<ffffffffc03aa010>] ?
usb_hcd_poll_rh_status+0x170/0x170 [usbcore]
[54905.391102] [<ffffffffc03a9fc9>] ?
usb_hcd_poll_rh_status+0x129/0x170 [usbcore]
[54905.393627] [<ffffffffc03aa010>] ?
usb_hcd_poll_rh_status+0x170/0x170 [usbcore]
[54905.396144] [<ffffffff84ce7562>] ? call_timer_fn+0x32/0x120
[54905.398071] [<ffffffffc047ea4b>] ? sym53c8xx_intr+0x3b/0x70 [sym53c8xx]
[54905.400386] [<ffffffff84cd418e>] ? __handle_irq_event_percpu+0x7e/0x1a0
[54905.402673] [<ffffffff84cd42e0>] ? handle_irq_event_percpu+0x30/0x70
[54905.404898] [<ffffffff84cd4359>] ? handle_irq_event+0x39/0x60
[54905.406901] [<ffffffff84cd7870>] ? handle_fasteoi_irq+0xa0/0x170
[54905.409001] [<ffffffff84c27faf>] ? handle_irq+0x1f/0x30
[54905.410834] [<ffffffff852187ee>] ? do_IRQ+0x4e/0xe0
[54905.412528] [<ffffffff85216556>] ? common_interrupt+0x96/0x96
[54905.414523] <EOI> [54905.415216] [<ffffffff852151f0>] ?
__sched_text_end+0x1/0x1
[54905.417231] [<ffffffff852154c2>] ? native_safe_halt+0x2/0x10
[54905.419235] [<ffffffff8521520a>] ? default_idle+0x1a/0xd0
[54905.421137] [<ffffffff84cbc7da>] ? cpu_startup_entry+0x1ca/0x240
[54905.423215] [<ffffffff8593df5e>] ? start_kernel+0x447/0x467
[54905.425186] [<ffffffff8593d120>] ? early_idt_handler_array+0x120/0x120
[54905.427438] [<ffffffff8593d408>] ? x86_64_start_kernel+0x14c/0x170
[54905.429842] Kernel Offset: 0x3c00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[54905.433484] ---[ end Kernel panic - not syncing: assertion "i &&
sym_get_cam_status(cp->cmd) == DID_SOFT_ERROR" failed: file
"/build/linux-AcJpTp/linux-4.9.110/drivers/scsi/sym53c8xx_2/sym_hipd.c", line
3399
[54905.433484]
... also noted within that same timeframe, on physical host, there
were some storage related events ... but no hard failures seen on that
physical host and no outages or failures or such observed on that
physical host:
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sda [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 69
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 190 Airflow_Temperature_Cel changed from 69 to 70
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 194 Temperature_Celsius changed from 31 to 30
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sdb [SAT], SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 66
$
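To boil smartd messages like those above down to device, attribute, and old/new value, something like the following might do. It is only a sketch: summarize_smartd is a made-up helper name, and it assumes each syslog entry is on a single line (the entries above are wrapped by the mail client), as they would normally be in the log file on the physical host.

```shell
# Hypothetical helper: reads syslog lines on stdin and prints one
# "device attribute old -> new" line per smartd attribute change.
summarize_smartd() {
    awk '/smartd\[/ && / changed from / {
        for (i = 1; i <= NF; i++) {
            # Locate fields relative to the word "changed", so the
            # exact width of the syslog prefix does not matter.
            if ($i == "changed") {
                print $(i - 7), $(i - 1), $(i + 2), "->", $(i + 4)
            }
        }
    }'
}

# Usage, fed with two of the entries quoted above (rejoined onto single lines):
summary=$(summarize_smartd <<'EOF'
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 69
Sep 2 01:29:04 vicki smartd[1093]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 31 to 30
EOF
)
echo "$summary"
```

On the real host the input would presumably come from wherever syslog writes smartd messages, rather than from a here-document.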