-
danmcd
Is anyone running code on a Xeon Scalable Gen4 ("Sapphire Rapids")? Bonus points if it's
-
danmcd
"GenuineIntel 806F8 family 6 model 143 step 8"
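For reference, the signature `806F8` and the "family 6 model 143 step 8" text are the same value in two forms. A minimal sketch of the decoding, assuming the standard Intel layout of CPUID leaf 1 EAX (the function name `decode_cpuid_signature` is made up for illustration):

```python
def decode_cpuid_signature(eax: int) -> dict:
    """Split a CPUID leaf 1 EAX value into display family, model, and stepping."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family

    # On Intel, the extended model supplies the high nibble for families 6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model

    return {"family": family, "model": model, "stepping": stepping}


print(decode_cpuid_signature(0x806F8))
# {'family': 6, 'model': 143, 'stepping': 8}
```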
-
danmcd
MORE bonus points if it's a Xeon Gold 5418Y
-
danmcd
I'm seeing some strange behavior, and I want to know if anyone else is seeing anything strange or not before I completely rabbit-hole myself.
-
richlowe
I'm not, but you should describe the behaviour
-
jbk
we might have some in our lab, but I'll need to check later
-
danmcd
A specific version of one of our node.js binaries craps out on this specific server. Don't wanna go blame bad-HW just yet.
-
jbk
unfortunately, we don't
-
danmcd
My BIG and hopefully overblown worry is that Sapphire Rapids has something CPU-indirect-jump-protection wise that's kneecapping us, and I haven't accounted for it. Both pre-and-post latest-ucode misbehave that way.
-
danmcd
It's gonna be eating my time all today and into next week if I don't figure it out quickly.
-
danmcd
I'm insta-core-dumping with SEGV, and it's likely in an indirect jump inside elf_rtbndr+0x2c
-
richlowe
that is us binding the symbols of dependencies
-
richlowe
so the indirect jump is expected, where does it go?
-
richlowe
another place that might be sad is the FPU saving stuff
-
danmcd
This binary works on other non-Sapphire-Rapids machines.
-
richlowe
the FPU saving stuff is dynamic
-
richlowe
it figures out the size of AVX* stuff at runtime, to avoid the massive hit the maximum would be
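A rough sketch of why the runtime sizing matters, using only the register counts and widths to estimate how many bytes each set of enabled features needs. This is an assumption-laden lower bound, not what rtld or the kernel computes: the real size and layout come from CPUID leaf 0xD and can include padding and other components. The names `COMPONENT_BYTES` and `min_save_bytes` are hypothetical.

```python
# XSAVE state-component bits (x87, SSE, AVX, and the three AVX-512 pieces).
X87, SSE, AVX = 1 << 0, 1 << 1, 1 << 2
OPMASK, ZMM_HI256, HI16_ZMM = 1 << 5, 1 << 6, 1 << 7

# The 512-byte legacy region (x87 + SSE) and 64-byte header are always present.
LEGACY_AND_HEADER = 512 + 64

# Sizes derived from register count * width; real offsets come from CPUID.
COMPONENT_BYTES = {
    AVX: 16 * 16,        # upper 128 bits of YMM0-15
    OPMASK: 8 * 8,       # k0-k7
    ZMM_HI256: 16 * 32,  # upper 256 bits of ZMM0-15
    HI16_ZMM: 16 * 64,   # ZMM16-31
}


def min_save_bytes(enabled: int) -> int:
    """Lower bound on the save area for a bitmask of enabled components."""
    extra = sum(size for bit, size in COMPONENT_BYTES.items() if enabled & bit)
    return LEGACY_AND_HEADER + extra


avx_only = X87 | SSE | AVX
avx512 = avx_only | OPMASK | ZMM_HI256 | HI16_ZMM
print(min_save_bytes(avx_only), min_save_bytes(avx512))
# 832 2432
```

Even by this conservative estimate, always saving the AVX-512 state roughly triples the bytes touched on every lazy symbol binding.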
-
danmcd
Also it's possible we need to compile it with better options. I'm less worried now about illumos-kernel-interventions than I am about out stuff. And yes, that could be a concern re: AVX*.
-
richlowe
yeah, I would bet this is the FPU save
-
richlowe
the first call *(%r10)
-
richlowe
so the difference in CPU model is going to be the info provided for _plt_save_size et al.
-
richlowe
danmcd: if mdb is annoying you with `::dis`, you need to specify the object (``ld.so.1`elf_rtbndr::dis``), even though the tab complete works without
-
danmcd
Hang on. I haven't gotten that far yet.
-
danmcd
Trying to inventory what's going on, and what is common vs. not-common.
-
danmcd
And I meant "our stuff" instead of "out stuff" above.
-
danmcd
<sigh> The Sapphire Rapids box kernel panicked with the very current `master` SmartOS build.
-
richlowe
if it is weird about the FPU save, it seems reasonable to suspect kfpu will be broken too
-
richlowe
I can't remember if we can/how to disable that
-
danmcd
Lemme see WHAT the panic is first. Lucky me it's small, in spite of the machine's 512GiB memory.
-
danmcd
Ooooh, "bad mutex".
-
danmcd
Maybe it IS a hardware problem?
-
danmcd
Or something unrelated to my node SEGVs? It's a deep stack starting with hubd_root_hub_cleanup_thread() and going deep from there.
-
danmcd
HAH! Use-after-free.
-
danmcd
It's fenix illumos#17443
-
fenix
BUG 17443: xhci panics when unable to stop device (New)
-
danmcd
Hey @jbk --> want a kernel dump from a Sapphire Rapids machine (that also happens to be a Dell)?
-
danmcd
I can even boot a DEBUG kernel on it if you'd like. :)
-
danmcd
@richlowe you nailed it:
-
danmcd
> <r10=P
-
danmcd
ld.so.1`_plt_fp_save
-
danmcd
>
-
danmcd
> <r10=J
-
danmcd
fffff9ffef3fed90
-
danmcd
> fffff9ffef3fed90/P
-
danmcd
ld.so.1`_plt_fp_save:
-
danmcd
ld.so.1`_plt_fp_save: ld.so.1`_elf_rtbndr_fp_xsave
-
danmcd
>
-
danmcd
Am I gonna have to introduce a new AT_386_FPINFO_ value for Sapphire Rapids ?
-
danmcd
related, I think: fenix illumos#9595 illumos#9596
-
fenix
FEATURE 9595: rtld should conditionally save AVX-512 state (Closed)
-
fenix
BUG 9596: Initial xsave xstate_bv should not include all features (Closed)
-
danmcd
Those two bugs are supposed to make things easier for future things, if I read it carefully.
-
jbk
danmcd: turns out we do have one if you still need some testing