01:28:29 hrm.. one other thing that might just be a coincidence... but when it locks up while starting the CPUs, one CPU is running thread_reaper() and is executing mutex_sync() and gets stuck
01:43:07 also a bit interesting.. ::stacks -t onproc shows 7 stacks while ::cpuinfo only shows 3 cpu_t's
08:47:43 rip token ring
08:48:24 gone to the MAU in the sky
10:28:47 ?!
10:28:58 was TR yanked from mainline illumos?
10:30:24 not yet ;)
10:32:25 is this «yet» ominous
14:23:37 [illumos-gate] 16479 NFS client regression with macOS 14 server -- Gordon Ross
14:31:54 Ah, Token Ring. Great idea to reduce collisions, but as much complexity went into managing the token as into dealing with collisions. Then DEC invented the Ethernet bridge and Radia Perlman did the spanning tree protocol, and collision domains became manageable; then multiport bridges (e.g., switches) came around and collisions just stopped being a problem, and TR, FDDI/CDDI, etc., all were swept
14:32:00 away by the wave of modernity.
14:33:25 Regarding x2APIC/MSI/etc.: the big contribution of x2 mode for the APIC is in widening the APIC ID to 32 bits to accommodate core counts >255. Unfortunately, the die for the MSI data field was already cast, and while Intel nominally said that there was space reserved for larger IDs, it's never really been implemented.
14:34:45 The upshot is that, to deliver external device interrupts to high-numbered cores, you've pretty much got to use the interrupt remapping capabilities of the IOMMU. Why the IOMMU? Since PCIe did away with the 4 physical, level-triggered interrupt lines of legacy PCI and replaced them with messages, it all looks like a memory access anyway.
14:35:51 In x2 mode, you can still send IPIs to high-numbered cores without having to do interrupt remapping, since the ICR on the LAPIC understands x2 addresses. And of course you can deliver interrupts to _low_-numbered LPs without doing the remapping thing.
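A minimal sketch of the constraint described above, assuming the legacy (xAPIC-compatible) MSI layout from the Intel SDM: the destination APIC ID occupies bits 19:12 of the message address, so only 8 bits are available, which is why a device can't target an APIC ID >255 without IOMMU interrupt remapping. The helper names here are invented for illustration.

```c
#include <stdint.h>

/* Base of the MSI address window; the destination APIC ID sits in
 * bits 19:12, so the type below can't even express an ID > 255. */
#define MSI_ADDR_BASE	0xFEE00000u

static inline uint32_t
msi_compose_addr(uint8_t dest_apic_id)
{
	return (MSI_ADDR_BASE | ((uint32_t)dest_apic_id << 12));
}

static inline uint32_t
msi_compose_data(uint8_t vector)
{
	/* Fixed delivery mode, edge-triggered: only the vector matters. */
	return ((uint32_t)vector);
}
```

Note that the 8-bit parameter type makes the limit structural: widening the destination is exactly what x2 mode did for the ICR, but not for device-originated MSIs.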
14:36:48 But if you want to deliver an external device interrupt to an APIC with ID >255, you've got to do the VT-d thing (or the AMD equivalent, which I'm working on).
14:39:14 This actually gives a pretty good overview: https://david.woodhou.se/more-than-you-ever-wanted-to-know-about-x86-msis.txt
14:45:10 Responding to something from a few days ago... each LP (so, thread) gets its own LAPIC.
15:25:11 um, what is the "best" option for illumos detection today? check for the presence of /etc/versions/build? :)
15:25:16 Oh, one other little wrinkle here is VMs: basically, no one wants to have to virtualize the IOMMU for a guest, so Woodhouse (and I imagine others) has been trying to get folks to agree to this proposal, basically extending the size of the destination APIC ID field in the MSI data register by eating into some of the reserved bits: http://david.woodhou.se/ExtDestId.pdf
15:25:19 the intel iommu is disabled by default in illumos-gate
15:25:35 tsoome - uname -o?
15:26:02 oh, forgot about that
15:27:30 Most of the time you don't need it, I imagine. But if you want to deliver external device interrupts to LPs on your second CPU, you'll have to turn it on and make sure the interrupt remapping stuff is using it. There is support for that in apix (for VT-d, anyway, not so much for AMD's version), at least. I haven't tried to use it, though.
16:10:24 let me kick off a build with it enabled and see what happens
16:10:56 i was also seeing it lock up fairly hard while starting the CPUs.. even the ones with lapic IDs < 255 (in fact one attempt had it hang on cpu#3)
16:11:06 and it felt more like some sort of race
16:11:17 since sometimes all 128 would start
16:11:36 (and then mlxcx would hang because it wasn't getting an interrupt during attach)
16:48:48 i did notice that we're only serializing in x2apic mode on cross calls, and not when sending EOI... which is maybe ok, but raises some eyebrows
16:51:04 just since that's sent in the midst of a lot of rather intricate updating of state
16:51:30 and in xAPIC mode, sending an EOI IS serializing (at least as I've read intel's specs)...
17:23:21 grrr...
17:23:36 there are so many cpus that the online messages basically roll off anything prior to that in msgbuf
17:41:13 If the state is all in RAM, it should be ok: x86 is TSO, and if everything that needs to be visible across cores is either protected by a mutex (or rwlock or whatever) or an atomic, then you don't necessarily need fences.
17:42:47 Loads can be reordered with stores, so the places where you might need a fence or something are usually mutexes and the like; but those should have them built in already (either a fence instruction or an xchg or something).
17:43:37 The hard hangs on startup are kind of weird; I wouldn't expect that necessarily. I assume this is i86pc? The INIT/SIPI/SIPI path is pretty well trod at this point.
17:45:50 In xAPIC mode it makes sense for a write to EOI to serialize, since it's going to update the ISR/IRR registers and the state of those should be reflective of reality. In x2 mode, since you access those via MSRs and not MMIO, less so. But I wonder if there's some deeper reason; let me check the SDM.
18:02:31 it looks like they start -- they start running code, but at some point, when they start to hang and I drop into kmdb, it looks like one cpu is always in thread_reaper() calling MUTEX_SYNC() -- it looks like it's trying to pause the cpu, but is stuck; ::cpuregs fails on the cpu prior to that with EINVAL
18:02:55 and nmi's only work once
18:04:40 grr.. and for some reason, the intel iommu won't enable on this box... so I can't test that out...
18:06:40 i had been setting ddidebug to 0x401 to look at the assignments, that might be exposing it more easily... a regular verbose boot seems less likely to hang...
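The "fences are built into the lock" point above can be sketched with C11 atomics. This is a hypothetical toy spinlock, not the illumos mutex implementation: on x86 (TSO) the only reordering visible to software is store->load, and the locked xchg used for acquisition drains the store buffer, so state protected by the lock needs no additional fences.

```c
#include <stdatomic.h>

typedef struct {
	atomic_int held;
} sketch_lock_t;

static void
sketch_lock(sketch_lock_t *lp)
{
	/* atomic_exchange compiles to a locked xchg on x86, which is
	 * a full barrier -- the "xchg or something" mentioned above. */
	while (atomic_exchange(&lp->held, 1) != 0)
		;	/* spin until we observe it previously clear */
}

static void
sketch_unlock(sketch_lock_t *lp)
{
	/* On TSO a release store suffices: stores made inside the
	 * critical section cannot be reordered after this store. */
	atomic_store_explicit(&lp->held, 0, memory_order_release);
}
```

Single-threaded use just toggles the flag; the interesting property is the implicit barrier on the acquisition path.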
18:27:52 uuugh
18:28:06 'invalid segment# <1> in DRHD unit in ACPI DMAR table'
18:28:14 i'm guessing this is why it isn't loading
18:31:16 we seem to limit the # of segments to 1
18:47:36 Our ACPI could use some more love and updating. :(
18:48:30 i think this is immu_dmar.c
18:48:55 it looks like (I'm still digging into the code -- first time looking at it) we define IMMU_MAXSEG to 1
18:49:06 and fail if we encounter more than IMMU_MAXSEG in that table
18:49:31 it _looks_ like (from an initial glance -- again, first time looking) that the code itself at least anticipates more than 1
18:49:45 but maybe was limited due to lack of an ability to test??
18:49:56 (this code looks like it's been largely untouched since the sun days)
18:50:44 being an optimist (or just naive :P) I'm going to bump it to 2 and see what happens
18:50:58 (not like this box can get any less functional right now)
19:00:56 i should see what it'd take to bump the size of the kernel's log
19:01:25 i would have found this sooner had all the cpu online messages not bumped it off the end
19:27:54 oh, immu did _not_ like that... le sigh
19:32:55 nobody has touched the iommu code at all, no. We talked to Hans about 2 months ago, about how it was all broken, and he thought we should start over
19:33:09 i thought that was the amd one, not the intel
19:33:12 and he was involved the first time at AMD, so I believe that.
19:33:23 or do i have it backwards?
19:33:25 I would trust his view of both, tbh
19:45:32 I was under the impression that the Intel one was ok, but yeah, the AMD one has little that is salvageable.
19:47:59 Re: serialization and EOI: I read through the SDM, and Intel (at least) relaxed the typical serializing behavior of `wrmsr` so that access to LAPIC registers in x2 mode is more efficient. This is not really a change to xAPIC mode, however; in that mode, the LAPIC registers are memory-mapped but uncached, so reads and writes to them de facto have serializing semantics, but I don't think they necessarily
19:48:05 flush the store buffer or anything like that.
19:51:58 cross: i may be, but it looks like perhaps this is a configuration that was not available at the time...
19:52:53 i'll read through the bits in the vt spec about the DMAR table
19:55:07 i tried intel once in 2017; at the time it worked reasonably well
19:56:12 this system apparently has multiple pci/pci-e bus segments
19:56:39 which the intel iommu does not support (or at least it limited itself to 1 segment)
19:58:31 i don't know what implications (if any) that has for the rest of the OS
19:59:04 i've not looked much at the pci code, and don't have a copy of the PCI spec to help make sense of the code either
19:59:45 Woodstock: btw, did you by any chance ever start updating the qlc driver as well, or did you just do emlxs?
19:59:48 our support for domains/segmentation is non-existent, but in practice that likely only matters if your devices try to talk to each other or if they collide in BDF without the extra info
20:00:05 basically, you seem to have found the hardware that embodies all of our nightmares at once
20:00:13 yay HPE :P
20:01:34 my understanding (without the spec) is that there are really two pieces: the extra ID space, and, potentially later, that ID space being used on the wire.
20:02:18 the first part might line up and work out ok anyway
20:02:24 cross would be more expert by far
20:02:47 and also in possession of the spec, which helps :)
20:23:06 So... segments expand the BDF space (by adding the segment); the idea is that, within a segment, BDFs are unique, but that's not guaranteed across segments.
20:23:42 If I had to hazard a guess, your DMAR has two segments on it, one for each package?
20:24:30 probably... i'd need to build an image with acpidump and iasl in it so I can see...
20:25:19 i've been tempted to port iasl to UEFI -- I don't recall seeing anything in the UEFI shell that'd do anything like that, and it seems like it could be useful...
21:19:12 cross: I don't suppose segments are something that you guys have already supported in your branch? :)
21:22:15 from the acpi spec, it looks like at minimum, we'd need to note the segment for each host bridge (and propagate that to any children)...
21:22:53 and since each segment has its own cfgspace, i'm guessing the relevant code would need to account for that (i've not looked at the pci code too much yet to have an idea of what that would exactly entail)
21:23:04 but I also don't know what I don't know...
21:23:16 (those unknown unknowns :P)
21:23:21 that all only really applies when you may see them on the wire
21:23:43 as far as assigning them sensibly, etc., the important part is the extra bit of ID space for uniqueness
21:23:59 I wish I could remember the acronym with which they become part of the network-y specification
21:24:10 it is in a spec that is only fairly recently in hardware
21:24:20 it seems like the system (ACPI) assigns the segments to a PCI host bridge
21:24:33 right, you'd probably see a segment per root complex
21:25:12 I'm trying to remember what this looks like for an x86
21:25:20 and it's not paging in
21:27:27 but what I gathered is that the hardware itself doesn't care even a bit about a "segment" until you get to the later spec that puts them on the wire for cross-segment inter-device communication. All it does is give you an extra level of ID space in the host, so you can talk to more than 256 buses
21:28:54 so on arm we only talk to devices relative to a root complex, and it sort of works itself out just about. I can't authoritatively remember how x86 looks
21:38:21 richlowe: do you mean the NFM vs FM stuff? (Non-Flit Mode vs Flit Mode?)
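The "segments expand the BDF space" point above can be sketched concretely, assuming the ECAM layout from the PCI Firmware/PCIe specs: each segment gets its own memory-mapped config window (one base address per segment in the ACPI MCFG table), and bus/device/function only index within that window, which is why BDFs need to be unique only per segment. Function names here are invented for illustration.

```c
#include <stdint.h>

/* Offset of a config register within one segment's ECAM window:
 * bus in bits 27:20, device in 19:15, function in 14:12,
 * register offset in 11:0. */
static inline uint64_t
ecam_offset(uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg)
{
	return (((uint64_t)bus << 20) | ((uint64_t)(dev & 0x1f) << 15) |
	    ((uint64_t)(func & 0x7) << 12) | (reg & 0xfff));
}

/* The segment number just selects which window (MCFG base) to use;
 * two devices with the same BDF in different segments don't collide. */
static inline uint64_t
ecam_addr(const uint64_t seg_base[], uint16_t seg, uint8_t bus,
    uint8_t dev, uint8_t func, uint16_t reg)
{
	return (seg_base[seg] + ecam_offset(bus, dev, func, reg));
}
```

This also shows why per-segment config space matters for the code: any lookup keyed only on BDF silently conflates devices once a second segment exists.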
21:39:08 Flit mode is the "Flow Control Unit" (I think "Unit" is right) stuff they added recently.
21:42:44 each segment can have up to 256 buses, but a root complex can have multiple segments, IIUC. I think in practice that doesn't really happen, and it's more likely that each package in a multi-socket configuration gets its own segment number, which, if you think about it, kinda makes sense since PCI is usually overloaded with internal bridges and stuff for the SoC's internal use, and so there has to be some
21:42:50 way to disambiguate that when you've got >1 SoC.
21:46:59 jbk: sadly (heh) we haven't had a need to deal with it yet.
21:48:42 oh well :(
21:49:27 i'd be happy to try to do the work, but I don't have $2k to drop on a copy of the pci spec
21:49:30 jbk: this is what I meant by hitting all of the problems people fear at once, since you now have segmented PCIe in a way that matters and also cpu ids that need the iommu to interrupt
21:49:45 on a random remote machine
21:51:38 yeah :(
21:51:43 Well... I guess some questions are: do you actually need to deal with it? UEFI will do the work of setting up the IO buses and assigning APIC IDs and all that stuff for i86pc.
21:52:07 well, that's what I don't know :)
21:52:30 it seems like the iommu already 'parameterizes' on the segment, but it just always sets the segment to 0
21:54:30 and i don't think it'd be too bad (just to pick an implementation) to add a 'pci-segment' property on devices -- using the _SEG method (if there) and propagating it to children -- and have the iommu grab that when it's looking up the BDF for a device (where it currently just sets the segment to 0)... but are there pci-specific things to deal with, or is it all 'in' the iommu (so to speak)?
21:54:47 the intel vt spec seems like it might have enough for the iommu
22:05:30 And I guess it sort of depends on what the IO interface looks like; if the actual peripherals in the machine are mostly lined out to a single socket, then you don't really need to worry about the second segment.
22:05:48 (Sorry, got a phone call from a plumber working on our heating while I was typing that.)
22:06:56 And as long as you can talk to the peripherals you care about and access config space for them, then you can _probably_ just turn on the interrupt remapping part of the IOMMU, program the MSI/MSI-X capabilities, and it should all Just Work(TM).
22:07:25 That is, the interesting thing you want from the IOMMU, at least in the short term, is the interrupt remapping capability, not the DMA remapping stuff.
22:07:42 (Sorry again, got to go make dinner for the kids)
22:11:02 (Of course, if you've got devices split across the sockets then you might have two IOMMUs, in which case we're back to where we just were. But that's not a very flexible design, I guess, since it means you can't take out one of the packages, but maybe you're not meant to with that machine.)
22:18:14 hmmm.. looking at the 'quickspecs'... there are 2 PCI risers in the system, and the 2nd one is noted as requiring a 2nd processor...
22:26:34 and the 1st riser notes proc 1, so that's suggestive of how things are wired together...
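The 'pci-segment' property idea floated at 21:54:30 -- note the _SEG value on each host bridge and propagate it to children so the IOMMU can key on (segment, BDF) instead of hard-coding segment 0 -- can be sketched as a simple tree walk. Everything here (the node type, the propagation function) is invented for illustration; these are not illumos DDI interfaces.

```c
#include <stddef.h>

/* Toy device-tree node: seg is the segment number, -1 if not yet
 * assigned. Host bridges get theirs from ACPI _SEG; everything
 * below them inherits. */
typedef struct devnode {
	int		seg;
	struct devnode	*child;
	struct devnode	*sibling;
} devnode_t;

static void
propagate_segment(devnode_t *dn, int seg)
{
	if (dn == NULL)
		return;
	if (dn->seg < 0)		/* inherit unless _SEG set one */
		dn->seg = seg;
	propagate_segment(dn->child, dn->seg);	/* children inherit ours */
	propagate_segment(dn->sibling, seg);	/* siblings inherit parent's */
}
```

The IOMMU lookup that currently assumes segment 0 would then read the propagated value off the device instead; whether anything in the generic PCI code also needs to become segment-aware is the open question from the discussion.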