-
jbk
hrm.. one other thing that might just be a coincidence... but when it locks up starting up the CPUs.. one CPU is running thread_reaper() and is executing mutex_sync() and gets stuck
-
jbk
also a bit interesting.. ::stacks -t onproc shows 7 stacks while ::cpuinfo only shows 3 cpu_t's
-
arekinath
rip token ring
-
arekinath
gone to the MAU in the sky
-
Reinhilde
?!
-
Reinhilde
was TR yanked from mainline illumos?
-
tsoome
not yet;)
-
Reinhilde
is this «yet» ominous
-
gitomat
[illumos-gate] 16479 NFS client regression with macOS 14 server -- Gordon Ross <gordon.w.ross⊙gc>
-
cross
Ah, Token Ring. Great idea to reduce collisions, but as much complexity went into managing the token as into dealing with collisions. Then DEC invented the Ethernet bridge and Radia Perlman did the spanning tree protocol, and collision domains became manageable; then multiport bridges (e.g., switches) came around and collisions just stopped being a problem, and TR, FDDI/CDDI, etc., were all swept away by the wave of modernity.
-
cross
Regarding the x2APIC/MSI/etc: the big contribution of x2 mode for the APIC is in widening the APIC ID to 32 bits to accommodate core counts >255. Unfortunately, the die for the MSI message format was already cast, and while Intel nominally said that there was space reserved for larger IDs, it's never really been implemented.
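As a point of reference, here is a minimal sketch of how a compatibility-format MSI encodes its target, following the Intel SDM's description of the 0xFEE address window; the struct and helper names are illustrative, not illumos code. The destination APIC ID gets only 8 bits, which is the limit being described:

```c
#include <stdint.h>

/*
 * Compatibility-format MSI: the device writes `data` to `addr`.
 * The destination APIC ID sits in address bits 19:12 -- only 8 bits wide,
 * which is why APIC IDs above 255 can't be targeted this way.
 */
typedef struct {
	uint64_t msi_addr;
	uint32_t msi_data;
} msi_msg_t;

static msi_msg_t
msi_compose_compat(uint8_t dest_apic_id, uint8_t vector)
{
	msi_msg_t m;

	/* 0xFEExxxxx window; bit 3 = redirection hint, bit 2 = dest mode */
	m.msi_addr = 0xFEE00000ULL | ((uint64_t)dest_apic_id << 12);
	/* vector in bits 7:0; delivery mode bits 10:8 left as 0 (fixed) */
	m.msi_data = vector;
	return (m);
}
```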
-
cross
The upshot is that, to deliver external device interrupts to high-numbered cores, you've pretty much got to use the interrupt remapping capabilities of the IOMMU. Why the IOMMU? Since PCIe did away with the 4 physical, level-triggered interrupt lines of legacy PCI and replaced them with messages, it all looks like a memory access anyway.
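A simplified model of what the remapping step buys (this is deliberately not the real VT-d IRTE layout; the types and names here are made up for illustration): the message carries a handle rather than a destination, and the IOMMU's table entry supplies the full 32-bit APIC ID:

```c
#include <stdint.h>

/* Illustrative model only -- not the actual VT-d IRTE format. */
typedef struct {
	uint8_t		irte_present;	/* entry programmed by the OS */
	uint8_t		irte_vector;
	uint32_t	irte_dest;	/* room for a full 32-bit x2APIC ID */
} irte_model_t;

static irte_model_t remap_table[65536];	/* indexed by the handle in the MSI */

/* The IOMMU's lookup, in miniature: handle in, (vector, destination) out. */
static int
remap_lookup(uint16_t handle, uint8_t *vector, uint32_t *dest)
{
	if (!remap_table[handle].irte_present)
		return (-1);	/* hardware would fault/drop the request */
	*vector = remap_table[handle].irte_vector;
	*dest = remap_table[handle].irte_dest;
	return (0);
}
```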
-
cross
In x2 mode, you can still send IPIs to high-numbered cores without having to do interrupt remapping, since the ICR on the LAPIC understands x2 addresses. And of course you can deliver interrupts to _low_ numbered LPs without doing the remapping thing.
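Roughly what the IPI path looks like in x2 mode; wrmsr() here stands in for whatever MSR-write primitive the kernel already has, and the helper name is made up:

```c
#include <stdint.h>

#define	MSR_X2APIC_ICR	0x830	/* in x2APIC mode the ICR is one 64-bit MSR */

/* Assumed MSR-write primitive; stands in for the kernel's own helper. */
extern void wrmsr(uint32_t msr, uint64_t value);

/*
 * Fixed-delivery, physical-destination IPI. The destination lives in
 * bits 63:32 and takes a full 32-bit x2APIC ID, so no remapping is
 * needed to reach IDs above 255.
 */
static void
x2apic_send_ipi(uint32_t dest, uint8_t vector)
{
	wrmsr(MSR_X2APIC_ICR, ((uint64_t)dest << 32) | vector);
}
```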
-
cross
But if you want to deliver an external device interrupt to an APIC with ID >255, you've got to do the VT-d thing (or AMD equivalent, which I'm working on).
-
cross
-
cross
Responding to something from a few days ago.... Each LP (so thread) gets its own LAPIC.
-
tsoome
um, what is the "best" option for illumos detection today? check the presence of /etc/versions/build ?:)
-
cross
Oh, one other little wrinkle here is VMs: basically, no one wants to have to virtualize the IOMMU for a guest, so Woodhouse (and I imagine others) has been trying to get folks to agree to this proposal, basically extending the size of the destination APIC ID field in the MSI address by eating into some of the reserved bits:
david.woodhou.se/ExtDestId.pdf
-
jbk
the intel iommu is disabled by default in illumos-gate
-
andyf
tsoome - uname -o?
-
tsoome
oh, forgot about that
-
cross
Most of the time you don't need it, I imagine. But if you want to deliver external device interrupts to LPs on your second CPU, you'll have to turn it on and make sure the interrupt remapping stuff is using it. There is support for that in apix (for VT-d, anyway, not so much for AMD's version), at least. I haven't tried to use it, though.
-
jbk
let me kick off a build with it enabled and see what happens
-
jbk
i was also seeing where it'd lock up fairly hard while starting the CPUs.. even the ones with lapics < 255 (in fact one attempt had it hang on cpu#3)
-
jbk
and felt more like some sort of race
-
jbk
since sometimes all 128 would start
-
jbk
(and then mlxcx would hang because it wasn't getting an interrupt during attach)
-
jbk
i did notice that we're only serializing in x2apic on cross calls, and not when sending EOI... which is maybe ok, but raises some eyebrows
-
jbk
just since that's sent in the midst of a lot of rather intricate updating of state
-
jbk
and in apic mode, sending an EOI IS serializing (at least as I've read intel's specs)...
-
jbk
grrr...
-
jbk
there's so many cpus that the online messages basically roll off anything prior to that in msgbuf
-
cross
If the state is all in RAM, it should be ok: x86 is TSO, and if everything that needs to be visible across cores is either protected by a mutex (or rwlock or whatever) or an atomic, then you don't necessarily need fences.
-
cross
Loads can be reordered with stores, so the places where you might need a fence or something are usually mutexes and the like; but those should have them built in already (either a fence instruction or an xchg or something).
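To make that concrete, a minimal C11 spinlock sketch; on x86 the test-and-set compiles to a locked operation, which is a full barrier, so the acquire/release pair already orders whatever the critical section touches (this is an illustration of the principle, not illumos's mutex implementation):

```c
#include <stdatomic.h>

typedef struct {
	atomic_flag sl_locked;	/* initialize with ATOMIC_FLAG_INIT */
} spinlock_t;

static void
spin_lock(spinlock_t *l)
{
	/* test-and-set is a locked RMW on x86: implicitly a full barrier */
	while (atomic_flag_test_and_set_explicit(&l->sl_locked,
	    memory_order_acquire))
		;	/* spin */
}

static void
spin_unlock(spinlock_t *l)
{
	/* under TSO a plain release store suffices; no explicit fence */
	atomic_flag_clear_explicit(&l->sl_locked, memory_order_release);
}
```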
-
cross
The hard hangs on startup are kind of weird; I wouldn't expect that necessarily. I assume this is i86pc? The INIT/SIPI/SIPI path is pretty well trod at this point.
-
cross
In xAPIC mode it makes sense for a write to EOI to serialize, since it's going to update the ISR/IRR registers and the state of those should be reflective of reality. In x2 mode, since you access those via MSRs and not MMIO, less so. But I wonder if there's some deeper reason; let me check the SDM.
-
jbk
it looks like they start -- they start running code, but at some point, when they do hang and I drop into kmdb, it looks like one cpu is always in thread_reaper calling MUTEX_SYNC() -- it looks like it's trying to pause the cpu, but is stuck; ::cpuregs fails on the cpu prior to that with EINVAL
-
jbk
and nmi's only work once
-
jbk
grr.. and for some reason, the intel iommu won't enable on this box... so I can't test that out...
-
jbk
i had been setting ddidebug to 0x401 to look at the assignments, that might be exposing it more easily... a regular verbose boot seems less likely to hang...
-
jbk
uuugh
-
jbk
'invalid segment# <1>in DRHD unit in ACPI DMAR table'
-
jbk
i'm guessing this is why it isn't loading
-
jbk
we seem to limit the # of segments to 1
-
danmcd
Our ACPI could use some more love and updating. :(
-
jbk
i think this is immu_dmar.c
-
jbk
it looks like (I'm still digging into the code -- first time looking at it).. we define IMMU_MAXSEG to 1
-
jbk
and fail if we encounter more than IMMU_MAXSEG in that table
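For illustration, the shape of the check being described might look something like this; this is a sketch, not the actual immu_dmar.c code, and the struct is just an approximation of the DRHD entry fields that matter here:

```c
#include <stdint.h>

#define	IMMU_MAXSEG	1	/* the current limit under discussion */

/* Approximate shape of an ACPI DMAR DRHD (hardware unit) entry. */
typedef struct {
	uint16_t	drhd_type;	/* 0 = DRHD */
	uint16_t	drhd_length;
	uint8_t		drhd_flags;
	uint8_t		drhd_reserved;
	uint16_t	drhd_segment;	/* PCI segment this unit covers */
	uint64_t	drhd_regbase;	/* remapping hardware register base */
} drhd_entry_t;

/* Units whose segment number is out of range get rejected outright. */
static int
drhd_segment_ok(const drhd_entry_t *drhd)
{
	return (drhd->drhd_segment < IMMU_MAXSEG);
}
```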
-
jbk
it _looks_ like (from an initial glance -- again first time looking) that the code itself seems to at least anticipate more than 1
-
jbk
but maybe was limited due to lack of an ability to test??
-
jbk
(this code looks like it's been largely untouched since the sun days)
-
jbk
being an optimist (or just naive :P) I'm going to bump it to 2 and see what happens
-
jbk
(not like this box can get any less functional right now)
-
jbk
i should see what it'd take to bump the size of the kernel's log
-
jbk
i would have found this sooner had all the cpu online messages not bumped the message off the end
-
jbk
oh, immu did _not_ like that... le sigh
-
richlowe
nobody has touched the iommu code at all, no. We talked to Hans about 2 months ago, about how it was all broken and he thought we should start over
-
jbk
i thought that was the amd one, not the intel
-
richlowe
and he was involved the first time at AMD, so I believe that.
-
jbk
or do i have it backwards?
-
richlowe
I would trust his view of both, tbh
-
cross
I was under the impression that the Intel one was ok, but yeah, the AMD one has little that is salvageable.
-
cross
Re: serialization and EOI: I read through the SDM and Intel (at least) relaxed the typical serializing behavior of `wrmsr` so that access to LAPIC registers in x2 mode is more efficient. This is not really a change to xAPIC mode, however; in that mode, the LAPIC registers are memory-mapped but uncached, so reads and writes to them de facto have serializing semantics, but I don't think they necessarily flush the store buffer or anything like that.
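Concretely, the two EOI paths look roughly like this: an MMIO store into the (uncached) LAPIC page in xAPIC mode versus a WRMSR in x2APIC mode, which the SDM lists among the non-serializing WRMSR cases. The default LAPIC base is assumed and wrmsr() stands in for the kernel's primitive:

```c
#include <stdint.h>

#define	LAPIC_DEFAULT_BASE	0xFEE00000UL	/* default xAPIC MMIO base */
#define	LAPIC_EOI_OFFSET	0xB0		/* EOI register offset */
#define	MSR_X2APIC_EOI		0x80B		/* x2APIC EOI register MSR */

/* Assumed MSR-write primitive; stands in for the kernel's own helper. */
extern void wrmsr(uint32_t msr, uint64_t value);

static void
xapic_eoi(void)
{
	/* store to the uncached (UC) LAPIC page: strongly ordered */
	volatile uint32_t *eoi = (volatile uint32_t *)
	    (LAPIC_DEFAULT_BASE + LAPIC_EOI_OFFSET);

	*eoi = 0;
}

static void
x2apic_eoi(void)
{
	/* this WRMSR is documented as non-serializing in x2APIC mode */
	wrmsr(MSR_X2APIC_EOI, 0);
}
```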
-
jbk
cross: it may be, but it looks like perhaps this is a configuration that was not available at the time...
-
jbk
i'll read through the bits in the vt spec about the DMAR table
-
Woodstock
i tried intel once in 2017, at the time it worked reasonably well
-
jbk
this system apparently has multiple pci/pci-e bus segments
-
jbk
which the intel iommu does not support (or at least it limits itself to 1 segment)
-
jbk
i don't know what implications (if any) that has for the rest of the OS
-
jbk
i've not looked much at the pci code, and don't have a copy of the PCI spec to help make sense of the code either
-
jbk
Woodstock: btw, did you by any chance ever start updating the qlc driver as well, or did you just do emlxs ?
-
richlowe
our support for domain/segmentation is non-existent, but in practice that likely only matters if your devices try to talk to each other or if they collide in bdf without the extra info
-
richlowe
basically, you seem to have found the hardware that embodies all of our nightmares at once
-
jbk
yay HPE :P
-
richlowe
my understanding (without the spec) is that there are really two pieces: the extra ID space, and, potentially later, that ID space being used on the wire.
-
richlowe
the first part might line up and work out ok anyway
-
richlowe
cross would be more expert by far
-
richlowe
and also in possession of the spec, which helps :)
-
cross
So...segments expand the BDF space (by adding the segment); the idea is that, within a segment, BDFs are unique; but that's not guaranteed across segments.
-
cross
If I had to hazard a guess, your DMAR has two segments on it, one for each package?
-
jbk
probably... i'd need to build an image with acpidump and iasl in it so I can see...
-
jbk
i've been tempted to port iasl to UEFI -- I don't recall seeing anything in the UEFI shell that'd do anything like that, and it seems like it could be useful...
-
jbk
cross: I don't suppose segments are something that you guys have already supported in your branch? :)
-
jbk
from the acpi spec, it looks like at minimum, we'd need to note the segment for each host bridge (and propagate that to any children)...
-
jbk
and since each segment has its own cfgspace, i'm guessing the relevant code would need to account for that (i've not looked at the pci code too much yet to have an idea of what that would exactly entail)
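For reference, "its own cfgspace" works out to one ECAM window per segment under the ACPI MCFG table, with bus/device/function selecting an offset inside that window; ecam_base_for_segment() is a placeholder for however the per-segment bases end up being stored:

```c
#include <stdint.h>

/* Placeholder: return the MCFG-provided ECAM base for a segment. */
extern uint64_t ecam_base_for_segment(uint16_t segment);

/*
 * ECAM config space address: bus in bits 27:20, device in 19:15,
 * function in 14:12, register offset in 11:0, all relative to the
 * segment's own base -- so identical BDFs in different segments
 * resolve to different windows.
 */
static uint64_t
ecam_cfg_addr(uint16_t segment, uint8_t bus, uint8_t dev, uint8_t func,
    uint16_t reg)
{
	uint64_t off = ((uint64_t)bus << 20) |
	    ((uint64_t)(dev & 0x1F) << 15) |
	    ((uint64_t)(func & 0x7) << 12) |
	    (reg & 0xFFF);

	return (ecam_base_for_segment(segment) + off);
}
```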
-
jbk
but also don't know what I don't know...
-
jbk
(those unknown unknowns :P)
-
richlowe
that all only really applies when you may see them on the wire
-
richlowe
as far as assigning them sensibly, etc, the important part is the extra bit of ID for uniqueness
-
richlowe
I wish I could remember the acronym with which they become part of the network-y specification
-
richlowe
it is in a spec that has only fairly recently shown up in hardware
-
jbk
it seems like the system (ACPI) assigns the segments to a PCI host bridge
-
richlowe
right, you'd probably see a segment per root complex
-
richlowe
I'm trying to remember what this looks like for an x86
-
richlowe
and it's not paging in
-
richlowe
but what I gathered is that the hardware itself doesn't care even a bit about a "segment" until you get to the later spec that puts them on the wire for cross-segment inter-device communication. All it does is give you an extra level of ID space in the host, so you can address more than 256 buses
-
richlowe
so on arm we only talk to devices relative to a root complex, and it more or less works itself out. I can't authoritatively remember how x86 looks
-
cross
richlowe: do you mean the NFM vs FM stuff? (Non-Flit Mode vs Flit Mode?)
-
cross
Flit mode is the "Flow Control Unit" (I think "Unit" is right) stuff they added recently.
-
cross
each segment can have up to 256 buses, but a root complex can have multiple segments, IIUC. I think in practice that doesn't really happen, and it's more likely that each package in a multi-socket configuration gets its own segment number, which, if you think about it, kinda makes sense since PCI is usually overloaded with internal bridges and stuff for the SoC's internal use, and so there has to be some way to disambiguate that when you've got >1 SoC.
-
cross
jbk: sadly (heh) we haven't had a need to deal with it yet.
-
jbk
oh well :(
-
jbk
i'd be happy to try to do the work, but I don't have $2k to drop on a copy of the pci spec
-
richlowe
jbk: this is what I meant by hitting all of the problems people fear at once, since you now have segmented PCI-e in a way that matters and also cpu ids that need the iommu to interrupt
-
richlowe
on a random remote machine
-
jbk
yeah :(
-
cross
Well... I guess one question is: do you actually need to deal with it? UEFI will do the work of setting up the IO buses and assigning APIC IDs and all that stuff for i86pc.
-
jbk
well that's what I don't know :)
-
jbk
it seems like the iommu already 'parameterizes' on the segment, but it just always sets the segment to 0
-
jbk
and i don't think it'd be too bad to (just to pick an implementation) add a 'pci-segment' property on devices -- using the _SEG method (if present) -- propagate it to children, and have the iommu grab that when it's looking up the BDF for a device (where it currently just sets the segment to 0)... but are there pci-specific things to deal with, or is it all 'in' the iommu (so to speak)
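A hedged sketch of the lookup side of that idea, using the standard DDI property routines; the 'pci-segment' name is just the one floated above, nothing in the gate defines it, and falling back to 0 keeps today's behavior:

```c
#include <sys/ddi.h>
#include <sys/sunddi.h>

/*
 * Hypothetical helper: find the PCI segment for a device via a
 * "pci-segment" property. Leaving DDI_PROP_DONTPASS out of the flags
 * lets the lookup walk up the tree, so a value set on the host bridge
 * (from _SEG) would be inherited by its children. If nothing is found,
 * default to segment 0, which is what the immu code assumes today.
 */
static int
pci_get_segment(dev_info_t *dip)
{
	return (ddi_prop_get_int(DDI_DEV_T_ANY, dip, 0, "pci-segment", 0));
}
```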
-
jbk
the intel vt spec seems like it might have enough for the iommu
-
cross
And I guess it sort of depends on what the IO interface looks like; if the actual peripherals in the machine are mostly lined out to a single socket, then you don't really need to worry about the second segment.
-
cross
(Sorry, got a phone call from a plumber working on our heating while I was typing that.)
-
cross
And as long as you can talk to the peripherals you care about and access config space for them, then you can _probably_ just turn on the interrupt remapping part of the IOMMU, and program the MSI/MSI-X capabilities and it should all Just Work(TM).
-
cross
That is, the interesting thing you want from the IOMMU, at least in the short term, is the interrupt remapping capability, not the DMA remapping stuff.
-
cross
(Sorry again, got to go make dinner for the kids)
-
cross
(Of course, if you've got devices split across the sockets then you might have two IOMMUs, in which case we're back to where we just were. But that's not a very flexible design, I guess, since it means you can't take out one of the packages, but maybe you're not meant to with that machine.)
-
jbk
hmmm.. looking at the 'quickspecs'... there's 2 PCI risers in the system, and the 2nd one is noted as requiring a 2nd processor...
-
jbk
and the 1st riser notes proc 1, so that's suggestive of how things are wired together...