13:24:53 jbk: bit delayed sorry, mlxcx has a bunch of internal counters for stuff like that but they are not wired up to kstats. that's been on my to-do list for a while, sorry
13:25:57 and yeah that sounds like it's mostly a victim, though the attach procedure could probably stand to give up at some point if the device is never giving us any interrupts
13:26:23 I think it will currently wait forever for a few of the steps, which is likely the proximal cause of the hang
13:33:46 also if anyone wants the connectx ethernet prm just shoot me a message, I'm very clumsy with PDFs and sometimes leave them in ~~/public on manta
13:33:55 it's deeply unfortunate
15:08:06 arekinath: yeah, I think this isn't specific to mlxcx... however just trying to figure out what in the world is going on
15:08:44 i modified the mlxcx driver (i disabled lmrc completely -- it was exhibiting similar symptoms) and limited mlxcx_attach to only allow instance 0 to succeed
15:09:31 and it did, and snoop at least seems to work, but it's flooding /var/adm/messages with 'EQ {13,15} isn't armed' now
15:10:11 (that's when I gave up for the night)... I'll need to look at the source, but my hope is that all of those have their own interrupts, which means I can try to compare how those two are different from the other ones
15:10:18 and see if that reveals anything...
15:13:51 given that the 2nd CPUs (two discrete CPUs w/ 64 cores ea) are using apic ids > 255, i'm wondering if that might be a factor
15:22:05 possibly a coincidence, but cpu 0 has no interrupts assigned to it (i've not looked at how the system distributes msi-x interrupts across processors, but I'm guessing it's probably not using random numbers and likely would start with 0
16:06:19 this system does define ACPI_MADT_TYPE_LOCAL_X2APIC_NMI (which we don't support), though the value is 1 (which I believe is the default)
16:07:45 so that i suspect is harmless... but lots of ACPI errors which could be bad data (though these systems are very recent, I'd expect HPE at this point to be able to get that right), or an artifact of the version of our acpi implementation...
16:25:25 jbk: Does your distro have any of the ACPI updates that, e.g., Gordon has tried to do?
16:25:35 I do wonder if there are load bearing pieces in there..
16:26:33 Might be interesting to poke at OpenBSD or FreeBSD to see if they have quirks for these systems too
16:27:09 A long time ago I experienced interrupt bullshit on HP machines as well. Though in that case the issue was that fixed interrupts would eventually get stuck on
16:27:11 not MSI-X
16:27:20 the fix was indeed to start using MSI/MSI-X there
16:27:40 I am not surprised to find more interrupt BS though
16:30:07 jclulow: no, but I do have a commit I could cherry-pick with the last attempt at an update (I forget who put it out)
16:31:07 the whole process is incredibly tedious... i'm not sure if it's this web-based vpn/rdp to this lab, or the hp's ssh client itself (I'm leaning towards the latter)
16:31:27 but control-c not only disconnects you from the virtual serial port, it disconnects your ssh session to the ilo
16:31:55 and the delete key generates 'ignoring key 7f' on the output
16:32:27 i don't think i'd ever want to be in the same room as the people that thought this was acceptable behavior
17:34:48 https://code.illumos.org/c/illumos-gate/+/3183 that was the change i had... i've been running it on my home server for the past 2 years...
17:34:49 → CODE REVIEW 3183: 16113 Update acpica to 20230628 (NEW) | https://www.illumos.org/issues/16113
18:01:04 FYI: OpenZFS summit is going on today, youtube etc at https://openzfs.org/wiki/OpenZFS_Developer_Summit
22:58:41 just to make sure i'm not conflating concepts (very possible).. the trapno (as in bad trap) is equivalent to the ('local') vector (e.g. 0 - divide by zero, 2 - nmi, 13 -- gpf, etc)
22:58:45 correct?