-
arekinath
jbk: bit delayed sorry, mlxcx has a bunch of internal counters for stuff like that but they are not wired up to kstats. that's been on my to-do list for a while, sorry
-
arekinath
and yeah that sounds like it's mostly a victim, though the attach procedure could probably stand to give up at some point if the device is never giving us any interrupts
-
arekinath
I think it will currently wait forever for a few of the steps, which is likely the proximal cause of the hang
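(Editor's sketch, not mlxcx code.) The idea above, an attach step that gives up instead of waiting forever, could look roughly like the following. Every name here (`wait_bounded`, `fake_dev_t`, `fake_dev_ready`) is hypothetical, and the loop polls a fixed number of times where a real driver would presumably sleep on a condition variable with a deadline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is an illustration, not driver code. */
typedef bool (*poll_fn_t)(void *arg);

typedef enum { WAIT_OK = 0, WAIT_TIMEOUT = -1 } wait_result_t;

/*
 * Call done(arg) at most max_polls times. Return WAIT_OK as soon as it
 * reports completion, or WAIT_TIMEOUT once the budget is exhausted so
 * the caller can fail the attach instead of hanging.
 */
static wait_result_t
wait_bounded(poll_fn_t done, void *arg, uint32_t max_polls)
{
	assert(done != NULL);
	for (uint32_t i = 0; i < max_polls; i++) {
		if (done(arg))
			return (WAIT_OK);
		/* A real driver would sleep or wait on a CV with a deadline here. */
	}
	return (WAIT_TIMEOUT);
}

/* Simulated device that reports ready after a set number of polls. */
typedef struct {
	uint32_t polls_until_ready;
	uint32_t polls_seen;
} fake_dev_t;

static bool
fake_dev_ready(void *arg)
{
	fake_dev_t *d = arg;

	d->polls_seen++;
	return (d->polls_seen >= d->polls_until_ready);
}
```

A caller would treat `WAIT_TIMEOUT` as an attach failure and unwind whatever it had set up so far.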
-
arekinath
also if anyone wants the connectx ethernet prm just shoot me a message, I'm very clumsy with PDFs and sometimes leave them in ~~/public on manta
-
arekinath
it's deeply unfortunate
-
jbk
arekinath: yeah, I think this isn't specific to mlxcx... however just trying to figure out what in the world is going on
-
jbk
i modified the mlxcx driver (i disabled lmrc completely -- it was exhibiting similar symptoms) and limited mlxcx_attach to only allow instance 0 to succeed
-
jbk
and it did, and snoop at least seems to work, but it's flooding /var/adm/messages with 'EQ {13,15} isn't armed' now
-
jbk
(that's when I gave up for the night)... I'll need to look at the source, but my hope is that all of those have their own interrupts, which means I can try to compare how those two are different from the other ones
-
jbk
and see if that reveals anything...
-
jbk
given that the 2nd CPUs (two discrete CPUs w/ 64 cores ea) are using apic ids > 255, i'm wondering if that might be a factor
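(Editor's sketch.) One reason APIC IDs above 255 can matter for MSI/MSI-X: in the compatibility-format MSI address, the destination ID field is only 8 bits wide (address bits 19:12), so larger IDs generally need interrupt remapping to be targeted. The helper names below are hypothetical and this is not illumos code; it only shows the field width, and does not claim this is what is going wrong on this system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compatibility-format MSI address layout on x86. */
#define	MSI_ADDR_BASE		0xFEE00000u
#define	MSI_ADDR_DEST_SHIFT	12
#define	MSI_ADDR_DEST_MASK	0xFFu

/* Hypothetical helper: does this APIC ID fit the 8-bit destination field? */
static bool
apicid_fits_compat_msi(uint32_t apicid)
{
	return (apicid <= MSI_ADDR_DEST_MASK);
}

/* Hypothetical helper: build the MSI address for a destination that fits. */
static uint32_t
msi_compat_addr(uint32_t apicid)
{
	assert(apicid_fits_compat_msi(apicid));
	return (MSI_ADDR_BASE |
	    ((apicid & MSI_ADDR_DEST_MASK) << MSI_ADDR_DEST_SHIFT));
}
```

For example, `apicid_fits_compat_msi(256)` is false, which is the case where remapped-format interrupts would be needed.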
-
jbk
possibly a coincidence, but cpu 0 has no interrupts assigned to it (i've not looked at how the system distributes msi-x interrupts across processors, but I'm guessing it's probably not using random numbers and likely would start with 0)
-
jbk
this system does define ACPI_MADT_TYPE_LOCAL_X2APIC_NMI (which we don't support), though the value is 1 (which I believe is the default)
-
jbk
so that i suspect is harmless... but lots of ACPI errors which could be bad data (though these systems are very recent, I'd expect HPE at this point to be able to get that right), or an artifact of the version of our acpi implementation...
-
jclulow
jbk: Does your distro have any of the ACPI updates that, e.g., Gordon has tried to do?
-
jclulow
I do wonder if there are load bearing pieces in there..
-
jclulow
Might be interesting to poke at OpenBSD or FreeBSD to see if they have quirks for these systems too
-
jclulow
A long time ago I experienced interrupt bullshit on HP machines as well. Though in that case the issue was that fixed interrupts would eventually get stuck on
-
jclulow
not MSI-X
-
jclulow
the fix was indeed to start using MSI/MSI-X there
-
jclulow
I am not surprised to find more interrupt BS though
-
jbk
jclulow: no, but I do have a commit I could cherry-pick with the last attempt at an update (I forget who put it out)
-
jbk
the whole process is incredibly tedious... i'm not sure if it's this web-based vpn/rdp to this lab, or the hp's ssh client itself (I'm leaning towards the latter)
-
jbk
but control-c not only disconnects you from the virtual serial port, it disconnects your ssh session to the ilo
-
jbk
and the delete key generates 'ignoring key 7f' on the output
-
jbk
i don't think i'd ever want to be in the same room as the people that thought this was acceptable behavior
-
jbk
code.illumos.org/c/illumos-gate/+/3183 that was the change i had... i've been running it on my home server for the past 2 years...
-
fenix
→ CODE REVIEW 3183: 16113 Update acpica to 20230628 (NEW) | illumos.org/issues/16113
-
josephholsten
FYI: OpenZFS summit is going on today, youtube etc at openzfs.org/wiki/OpenZFS_Developer_Summit
-
jbk
just to make sure i'm not conflating concepts (very possible).. the trapno (as in bad trap) is equivalent to the ('local') vector (e.g. 0 - divide by zero, 2 - nmi, 13 - gpf, etc)
-
jbk
correct?
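(Editor's sketch.) For reference while comparing against a trapno, these are the architectural x86 exception vectors (0 through 31 are reserved for exceptions, 32 through 255 are available for external and software interrupts). The table and function names are hypothetical; whether a given kernel's trapno maps one-to-one onto these numbers is the question being asked above, and this sketch does not settle it.

```c
#include <assert.h>
#include <stddef.h>

/* Architecturally defined x86 exception vectors; gaps are reserved. */
static const char *const x86_exception_names[32] = {
	[0]  = "#DE divide error",
	[1]  = "#DB debug",
	[2]  = "NMI",
	[3]  = "#BP breakpoint",
	[4]  = "#OF overflow",
	[5]  = "#BR bound range exceeded",
	[6]  = "#UD invalid opcode",
	[7]  = "#NM device not available",
	[8]  = "#DF double fault",
	[10] = "#TS invalid TSS",
	[11] = "#NP segment not present",
	[12] = "#SS stack-segment fault",
	[13] = "#GP general protection",
	[14] = "#PF page fault",
	[16] = "#MF x87 floating-point error",
	[17] = "#AC alignment check",
	[18] = "#MC machine check",
	[19] = "#XM SIMD floating-point",
};

/* Hypothetical helper: describe a vector number in the range 0..255. */
static const char *
x86_vector_describe(unsigned int vec)
{
	assert(vec < 256);
	if (vec >= 32)
		return ("external or software interrupt");
	if (x86_exception_names[vec] == NULL)
		return ("reserved");
	return (x86_exception_names[vec]);
}
```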