-
nomad
I'm getting bunches and bunches of these in dmesg:
-
nomad
Dec 19 11:07:00 hvfs1 ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 5 has task file error
-
nomad
Dec 19 11:07:00 hvfs1 ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 5 is trying to do error recovery
-
nomad
Dec 19 11:07:00 hvfs1 ahci: [ID 693748 kern.warning] WARNING: ahci0: ahci port 5 task_file_status = 0x451
-
nomad
Dec 19 11:07:00 hvfs1 ahci: [ID 332577 kern.warning] WARNING: ahci0: the below command (s) on port 5 are aborted
-
nomad
Dec 19 11:07:00 hvfs1 ahci: [ID 117845 kern.warning] WARNING: satapkt 0xfffffe844ff199e8: cmd_reg = 0xb0 features_reg = 0x0 sec_count_msb = 0x0 lba_low_msb = 0x4f lba_mid_msb = 0x4f lba_high_msb = 0x0 sec_count_lsb = 0x0 lba_low_lsb = 0x0 lba_mid_lsb = 0x4f lba_high_lsb = 0xc2 device_reg = 0x0 addr_type = 0x4 cmd_flags = 0x12
-
nomad
Dec 19 11:07:00 hvfs1 ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 5 succeed
-
nomad
also saw for port 4, suspect lower numbers as well but it's scrolling off the top of the log.
-
nomad
ah, nope, just 4 & 5
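
One way to confirm which ports are affected once the messages have scrolled away is to tally the warnings from the saved log. This is a minimal sketch, assuming the messages look exactly like the lines pasted above; `count_task_file_errors` and `PORT_RE` are hypothetical names made up for illustration, not part of any illumos tool.

```python
import re
from collections import Counter

# Matches the port number in ahci warnings of the form pasted above, e.g.
# "WARNING: ahci0: ahci port 5 has task file error".
PORT_RE = re.compile(r"ahci\d+: ahci port (\d+) has task file error")

def count_task_file_errors(lines):
    """Return a Counter mapping port number -> number of task file errors."""
    counts = Counter()
    for line in lines:
        m = PORT_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

sample = [
    "Dec 19 11:07:00 hvfs1 ahci: [ID 296163 kern.warning] WARNING: ahci0: ahci port 5 has task file error",
    "Dec 19 11:07:00 hvfs1 ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 5 is trying to do error recovery",
]
print(count_task_file_errors(sample))
```

To run it over the real log, pass an open file instead of `sample`, for example `count_task_file_errors(open("/var/adm/messages"))`.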
-
nomad
also, should 'smbd[554]: [ID 801721 daemon.error] SMF initialization problem: %s#012: handle not bound' be reported in the OmniOS or Illumos bug tracker?
-
richlowe
illumos, but that looks unfortunate.
-
richlowe
that's on x86 right?
-
andyf
I have a bug open for the %s thing at least, and I even wrote a patch but I need to rework it
-
richlowe
Ok, that's a thing that sometimes happens?
-
richlowe
last time I saw that was on another platform, and it was Bad.
-
andyf
The fact the handle is not bound might be, but the wonky error message is illumos 15132 (fenix)
-
fenix
BUG 15132: libsmb SMF initialization problem messages incorrect (In Progress)
-
andyf
I just need to find 30 minutes to finish that off
-
richlowe
whew
-
nomad
richlowe, yes, x86.
-
nomad
should I not bother filing the report about the 'handle not bound' thing then?
-
nomad
any hints on what I should be doing about the ahci problem?
-
danmcd
I'm assuming those are real disks on ports 4 and 5?
-
danmcd
Are you running smartctl or smartmon? I've noticed when I run smartctl I see the cmd_reg... line, followed by the recovery line.
-
nomad
I'm running smartd.
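
The register dump pasted earlier is consistent with a SMART command: in the ATA command set, opcode 0xB0 is SMART, and it carries the signature 0x4F / 0xC2 in the LBA mid / LBA high registers. The sketch below checks a satapkt line for that pattern and splits the reported task_file_status. It assumes task_file_status is the raw AHCI PxTFD value (status in bits 7:0, error in bits 15:8), which the log itself does not state; all function names are hypothetical.

```python
import re

ATA_CMD_SMART = 0xB0   # ATA SMART command opcode
SMART_LBA_MID = 0x4F   # signature a SMART command carries in LBA mid
SMART_LBA_HIGH = 0xC2  # signature a SMART command carries in LBA high

def parse_regs(line):
    """Extract 'name = 0x..' pairs from a satapkt warning line into a dict."""
    pairs = re.findall(r"(\w+) = (0x[0-9a-fA-F]+)", line)
    return {name: int(value, 16) for name, value in pairs}

def is_smart_command(regs):
    """True if the registers carry the SMART opcode and LBA signature."""
    return (regs.get("cmd_reg") == ATA_CMD_SMART
            and regs.get("lba_mid_lsb") == SMART_LBA_MID
            and regs.get("lba_high_lsb") == SMART_LBA_HIGH)

def decode_task_file_status(value):
    """Split a PxTFD-style value into status/error bytes (layout is an assumption)."""
    status = value & 0xFF
    error = (value >> 8) & 0xFF
    return {
        "status": status,
        "error": error,
        "ERR": bool(status & 0x01),   # status bit 0: error
        "ABRT": bool(error & 0x04),   # error bit 2: command aborted
    }

line = ("WARNING: satapkt 0xfffffe844ff199e8: cmd_reg = 0xb0 features_reg = 0x0 "
        "sec_count_msb = 0x0 lba_low_msb = 0x4f lba_mid_msb = 0x4f lba_high_msb = 0x0 "
        "sec_count_lsb = 0x0 lba_low_lsb = 0x0 lba_mid_lsb = 0x4f lba_high_lsb = 0xc2 "
        "device_reg = 0x0 addr_type = 0x4 cmd_flags = 0x12")
print(is_smart_command(parse_regs(line)))
print(decode_task_file_status(0x451))
```

If that reading holds, the aborted command is a SMART request the device rejected, which fits the smartctl/smartd observation above and would not by itself indicate failing media.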
-
nomad
hmm. I'm starting to think there's a real hardware problem here :(
-
nomad
I'm migrating a bunch of VMDKs back to this host (just patched & rebooted) and it's going very slowly.
-
nomad
Dec 19 11:34:35 hvfs1 mr_sas: [ID 992625 kern.notice] mr_sas0: cmd retry count = 1
-
nomad
Dec 19 11:34:35 hvfs1 mr_sas: [ID 992625 kern.notice] mr_sas0: cmd retry count = 1
-
nomad
Dec 19 11:34:35 hvfs1 mr_sas: [ID 692495 kern.notice] NOTICE: mr_sas0: TBOLT adapter reset successfully
-
nomad
Dec 19 11:34:00 hvfs1 mr_sas: [ID 270009 kern.warning] WARNING: mr_sas0: io_timeout_checker: FW Fault, calling reset adapter
-
nomad
Dec 19 11:34:00 hvfs1 mr_sas: [ID 643100 kern.notice] mr_sas0: io_timeout_checker: fw_outstanding 0x2A max_fw_cmds 0x39F
-
nomad
Dec 19 11:34:14 hvfs1 mr_sas: [ID 229198 kern.notice] NOTICE: Device scan in progress ...#012
-
nomad
Dec 19 11:34:35 hvfs1 mr_sas: [ID 564675 kern.notice] NOTICE: mr_sas0: mrsas_print_pending_cmds(): Called
-
nomad
oh yeah, not looking happy :(
-
nomad
ZFS isn't complaining at all.
-
nomad
ugh. Looks like 15 DEC saw a burst of these errors. Yeah, it's not unique to today.
-
» nomad feels so special
-
nomad
Is there a way to map "NOTICE: Device scan in progress ...#012" (logged by mr_sas to /var/adm/messages) to an actual device?
-
nomad
all of the hits I'm seeing in splunk mention #012 so I'm hoping that's just a failing SSD I can kick out of the pool.
-
danmcd
I don't know it, but there likely is. (Unless it's dynamic every boot?)
-
nomad
I'm seeing it across multiple boots with the same number so I presume it's static.
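
A caveat on reading "#012" as a device number: some syslog daemons escape control characters as "#" plus three octal digits, and octal 012 is a newline. The same "#012" also appears in the smbd message quoted earlier, which has nothing to do with mr_sas. Whether that escaping is what happened on this log path is an assumption; the sketch below only shows what the strings look like if it is. `unescape_syslog` is a hypothetical helper.

```python
import re

def unescape_syslog(text):
    """Replace '#ooo' octal escapes with the character they encode."""
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), text)

msg = "NOTICE: Device scan in progress ...#012"
print(repr(unescape_syslog(msg)))
```

If the trailing "#012" is just an escaped newline, it would be identical on every boot and in every hit, regardless of which device is involved.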
-
» nomad sighs
-
nomad
I told the vendor we bought this box from we didn't need hardware RAID but they seem to have ignored that specification and given me an HBA that really wants to 'help'.
-
nomad
megaraid is megaannoying.
-
nomad
I have a feeling I'm going to be kicking this over to them to resolve.