19:43:52 Is anyone running code on a Xeon Scalable Gen4 ("Sapphire Rapids")? Bonus points if it's
19:44:48 "GenuineIntel 806F8 family 6 model 143 step 8"
19:44:56 MORE bonus points if it's a Xeon Gold 5418Y
19:45:24 I'm seeing some strange behavior, and I want to know if anyone else is seeing anything strange or not before I completely rabbit-hole myself.
19:58:25 I'm not, but you should describe the behaviour
19:59:54 we might have some in our lab, but I'll need to check later
20:08:47 A specific version of one of our node.js binaries craps out on this specific server. Don't wanna go blame bad-HW just yet.
20:09:58 unfortunately, we don't
20:10:10 My BIG and hopefully overblown worry is that Sapphire Rapids has something CPU-indirect-jump-protection wise that's kneecapping us, and I haven't accounted for it. Both pre-and-post latest-ucode misbehave that way.
20:11:01 It's gonna be eating my time all today and into next week if I don't figure it out quickly.
20:11:32 I'm insta-core-dumping with SEGV, and it's likely in an indirect jump inside elf_rtbndr+0x2c
20:25:48 that is us binding the symbols of dependencies
20:26:08 so the indirect jump is expected, where does it go?
20:27:37 another place that might be sad is the FPU saving stuff
20:37:01 This binary works on other non-Sapphire-Rapids machines.
20:37:39 the FPU saving stuff is dynamic
20:37:53 it figures out the size of AVX* stuff at runtime, to avoid the massive hit the maximum would be
20:38:15 Also it's possible we need to compile it with better options. I'm less worried now about illumos-kernel-interventions than I am about out stuff. And yes, that could be a concern re: AVX*.
20:38:52 yeah, I would bet this is the FPU save
20:39:00 the first call *(%r10)
20:39:21 so the difference in CPU model is going to be the info provided for _plt_save_size et al.
20:48:13 danmcd: if mdb is annoying you with `::dis`, you need to specify the object (`ld.so.1`elf_rtbndr::dis`), even though the tab complete works without
20:48:55 Hang on. I haven't gotten that far yet.
20:49:10 Trying to inventory what's going on, and what is common vs. not-common.
20:49:30 And I meant "our stuff" instead of "out stuff" above.
21:01:35 The Sapphire Rapids box kernel panicked with the very current `master` SmartOS build.
21:03:34 if it is weird about the FPU save, it seems reasonable to suspect kfpu will be broken too
21:03:45 I can't remember if we can/how to disable that
21:07:04 Lemme see WHAT the panic is first. Lucky me it's small, in spite of the machine's 512GiB memory.
21:07:28 Ooooh, "bad mutex".
21:07:34 Maybe it IS a hardware problem?
21:09:12 Or something unrelated to my node SEGVs? It's a deep stack starting with hubd_root_hub_cleanup_thread() and going deep from there.
21:10:16 HAH! Use-after-free.
21:12:15 It's fenix illumos#17443
21:12:16 BUG 17443: xhci panics when unable to stop device (New)
21:12:16 ↳ https://www.illumos.org/issues/17443
21:12:49 Hey @jbk --> want a kernel dump from a Sapphire Rapids machine (that also happens to be a Dell)?
21:12:56 I can even boot a DEBUG kernel on it if you'd like. :)
21:29:15 @richlowe you nailed it:
21:29:18 > ld.so.1`_plt_fp_save
21:29:18 >
21:33:52 > fffff9ffef3fed90
21:33:52 > fffff9ffef3fed90/P
21:33:54 ld.so.1`_plt_fp_save:
21:33:56 ld.so.1`_plt_fp_save: ld.so.1`_elf_rtbndr_fp_xsave
21:33:58 >
21:41:28 Am I gonna have to introduce a new AT_386_FPINFO_ value for Sapphire Rapids?
21:44:40 related, I think: fenix illumos#9595 illumos#9596
21:44:42 FEATURE 9595: rtld should conditionally save AVX-512 state (Closed)
21:44:42 ↳ https://www.illumos.org/issues/9595
21:44:42 BUG 9596: Initial xsave xstate_bv should not include all features (Closed)
21:44:42 ↳ https://www.illumos.org/issues/9596
21:50:55 Those two bugs are supposed to make things easier for future things, if I read it carefully.
23:17:05 danmcd: turns out we do have one if you still need some testing
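
The 20:37 remarks about the FPU save being sized at runtime can be illustrated with a small sketch. This is not rtld's code: `xsave_area_size` and the `COMPONENTS` table are hypothetical names, and the offsets and sizes are values commonly reported by Intel CPUs for the standard (non-compacted) XSAVE layout, hard-coded here as an assumption. Real code would read them from CPUID leaf 0xD. The log does not establish which state components are enabled on the Sapphire Rapids box, so the last case below is only an example of how much the required area can grow when more components are present.

```python
# Sketch: how large an XSAVE area must be for a given set of enabled
# state components (standard, non-compacted layout).
#
# ASSUMPTION: the (offset, size) pairs below are typical Intel values.
# They are enumerated per CPU by CPUID.(EAX=0xD, ECX=component) and
# should be queried at runtime rather than hard-coded.

LEGACY_AREA = 512   # x87/SSE region, always present
XSAVE_HEADER = 64   # XSAVE header, follows the legacy region

# component bit -> (description, offset in bytes, size in bytes)
COMPONENTS = {
    2: ("AVX (upper halves of YMM)", 576, 256),
    5: ("AVX-512 opmask", 1088, 64),
    6: ("AVX-512 ZMM_Hi256", 1152, 512),
    7: ("AVX-512 Hi16_ZMM", 1664, 1024),
    17: ("AMX TILECFG", 2752, 64),
    18: ("AMX TILEDATA", 2816, 8192),
}


def xsave_area_size(feature_mask: int) -> int:
    """Return the bytes needed to save every component set in feature_mask."""
    size = LEGACY_AREA + XSAVE_HEADER
    for bit, (_desc, offset, length) in COMPONENTS.items():
        if feature_mask & (1 << bit):
            # In the standard layout each component sits at a fixed offset,
            # so the area must reach the end of the furthest enabled one.
            size = max(size, offset + length)
    return size


if __name__ == "__main__":
    for label, mask in [
        ("x87 + SSE", 0x3),
        ("+ AVX", 0x7),
        ("+ AVX-512", 0xE7),
        ("+ AMX", 0xE7 | (0x3 << 17)),
    ]:
        print(f"{label:12s} mask={mask:#08x} size={xsave_area_size(mask)}")
```

A caller that sizes its save buffer from a table like this only pays for the components the system actually enables, which is the trade-off the 20:37:53 message describes. It also means a buffer or a lookup sized for older CPUs can be too small when a new CPU reports more state.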