-
gitomat
[illumos-gate] 16137 Update tzdata to 2023d -- Robert Mustacchi <rm⊙fo>
-
tsoome
introducing qemu to kernel did not eliminate the hang on multi-core VM startup, so I'm rather out of ideas. The only thing I'm sure of is that it happens while we are attempting to bring the non-boot cpus online :)
-
richlowe
in the only stack I've seen from you, we have an interrupt pinning the pause thread
-
tsoome
oh, that I can do.
-
tsoome
-
richlowe
huh
-
richlowe
the warnings are worrying, too
-
richlowe
apparently we don't know how to save fpu state
-
richlowe
and we got preempted starting that cpu?
-
richlowe
does everything stay like that?
-
richlowe
and does this look like the hang in qemu 6 thing I forgot the details of? :)
-
sommerfeld
but it works some of the time?
-
richlowe
I don't think the fpu state warnings would break us yet(?), but from what I remember of rmustacc's xsave stuff, it's not good
-
sommerfeld
also the: WARNING: kmem_alloc(): sleeping allocation with size of 0; see kmem_zerosized_log for details
-
sommerfeld
I'd track that one down and figure out why it's happening. at the very least it would get the noise out of the boot log but it might indicate something is mis-sized.
-
sommerfeld
what do we know about the multi-core startup? I assume you're trying 2 cpus - is cpu 0 hanging waiting for cpu 1 to show signs of life?
-
sommerfeld
the one time I had to debug something like this I ended up embedding a bunch of "store constant to global variable" statements throughout the early cpu startup code to see evidence that cpu 1 was doing stuff.
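A minimal sketch of that "store constant to global variable" breadcrumb technique, as a user-space stand-in rather than real startup code. Every name here (`sketch_crumb`, `sketch_mark`, `sketch_cpu_startup`, the `STAGE_*` constants) is hypothetical and does not come from illumos; the idea is only that each CPU overwrites its own slot as it passes a checkpoint, so the last value visible in the debugger shows how far that CPU got.

```c
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_NCPU	4	/* hypothetical; not the kernel's NCPU */

/* Hypothetical checkpoint constants; distinct values are easy to spot. */
#define	STAGE_NONE	0x0000u
#define	STAGE_ENTRY	0x1001u
#define	STAGE_TRAPS	0x1002u
#define	STAGE_INTR	0x1003u
#define	STAGE_ONLINE	0x1004u

/* One slot per CPU; volatile so every store is actually emitted. */
volatile uint32_t sketch_crumb[SKETCH_NCPU];

/* Record that 'cpu' reached 'stage'. */
static void
sketch_mark(unsigned int cpu, uint32_t stage)
{
	if (cpu < SKETCH_NCPU)
		sketch_crumb[cpu] = stage;
}

/*
 * Stand-in for an early CPU startup path. 'stall_before' simulates the
 * checkpoint at which the real code would hang: nothing from that stage
 * onward is recorded. Pass STAGE_NONE to run to completion.
 */
static void
sketch_cpu_startup(unsigned int cpu, uint32_t stall_before)
{
	static const uint32_t stages[] = {
		STAGE_ENTRY, STAGE_TRAPS, STAGE_INTR, STAGE_ONLINE
	};

	for (size_t i = 0; i < sizeof (stages) / sizeof (stages[0]); i++) {
		if (stages[i] == stall_before)
			return;
		sketch_mark(cpu, stages[i]);
	}
}
```

If cpu 1 stalls before the `STAGE_INTR` checkpoint, its slot is left holding `STAGE_TRAPS`, which narrows the search to the code between those two markers.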
-
jbk
what system is this? the broken cpuid bits seem strange... I thought some things will key off of data returned from them, but not sure if any of that would be relevant to this
-
richlowe
sommerfeld: i admit I assumed, but I think that's probably the save area we got told has a size of 0
-
richlowe
jbk: could you re-up the err format review? :)
-
jbk
done
-
tsoome
yes, it does work sometimes. I actually did try to use some Intel cpus there (to eliminate those AMD-related warnings), but that did not really change the behavior.
-
tsoome
sommerfeld: the actual setup is 4 cores, but 2 will hang just as well. at this point, we actually do have 2 cpus detected and somewhat configured.
-
tsoome
well, apparently it is not completely hung: the thread on cpu0 is processing interrupts. it looks like it is rather busy doing so, and maybe at some point it will get back to onlining cpu1 :D
-
sommerfeld
maybe a stuck/not properly acked interrupt?
-
tsoome
-
tsoome
cpu1 has thread fffffe000fca3c20, but it has nothing on its stack. ::stacks reports 'fffffe000fca3c20 ONPROC <NONE> 2' and '<no consistent stack found>'
-
sommerfeld
either stuck spinning, or not actually running anything yet...
-
tsoome
ok, got the RIP of CPU1 from qemu info regs: it is sitting at fffffffffb8a4f68, which is pause_cpus+0x128. it seems to be
src.illumos.org/source/xref/illumos….c?r=3df2e8b2&mo=54021&fi=1968#1069
-
tsoome
pause_cpus+0x128: movzbl 0xfffffffffbc59640(%rdx),%eax <safe_list>
-
sommerfeld
so it's spinning, waiting for the other cpu(s) to pause..
-
tsoome
yep
-
tsoome
RAX=0000000000000001 so it is != PAUSE_WAIT (2)
-
tsoome
[0]> safe_list::print
-
tsoome
[ '\001', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002
-
tsoome
', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\002', '\00
-
tsoome
2', '\002', '\002', '\002', '\002', ... ]
-
sommerfeld
and cpu0 is not cooperating.
-
tsoome
so the cpu0 does not want to move from PAUSE_READY to PAUSE_WAIT
-
tsoome
yep
-
tsoome
at least the mechanism is understandable now:D
-
tsoome
-
tsoome
so we got _interrupt while running cpu_pause
-
sommerfeld
interrupted in the: while (cpi->cp_go == 0) ; ?
-
sommerfeld
before it could splhigh() ?
-
tsoome
it seems so, in sema_v(). just a sec, getting a paste...
-
tsoome
-
tsoome
cpu_pause+0x58, but yes, before splhigh()
-
tsoome
seems like a race condition
-
tsoome
in cpu_pause_info we have cp_go = 0x1
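A small single-threaded model of the handshake as pieced together above, as a sketch rather than the kernel code. The state names PAUSE_READY and PAUSE_WAIT and the values 1 and 2 are taken from the chat; the struct, the step functions, and the `in_interrupt` flag are hypothetical stand-ins used to show why the requester never sees PAUSE_WAIT when the thread being paused is held off by interrupts after announcing PAUSE_READY.

```c
#include <stdbool.h>

/* Values as quoted in the chat: observed 1, expected PAUSE_WAIT (2). */
enum { PAUSE_READY = 1, PAUSE_WAIT = 2 };

/* Hypothetical model state; field names only mirror the chat. */
struct pause_model {
	char safe;		/* stands in for safe_list[0] */
	int go;			/* stands in for cp_go */
	bool in_interrupt;	/* pause thread pinned by an interrupt */
	bool at_splhigh;	/* pause thread has blocked interrupts */
};

/* Start with the pause thread having announced PAUSE_READY. */
static struct pause_model
model_init(bool in_interrupt)
{
	struct pause_model m = { PAUSE_READY, 0, in_interrupt, false };
	return (m);
}

/* One step of the thread being paused (cpu0 in the chat). */
static void
target_step(struct pause_model *m)
{
	if (m->in_interrupt)
		return;		/* never gets back to testing go */
	if (m->safe == PAUSE_READY && m->go != 0) {
		m->at_splhigh = true;	/* models the splhigh() step */
		m->safe = PAUSE_WAIT;
	}
}

/* Requester (cpu1 in the chat): set go, then poll for PAUSE_WAIT. */
static bool
requester_wait(struct pause_model *m, int spins)
{
	m->go = 1;
	for (int i = 0; i < spins; i++) {
		target_step(m);
		if (m->safe == PAUSE_WAIT)
			return (true);
	}
	return (false);		/* the real loop has no such bound */
}
```

With `in_interrupt` set, the model ends in the state seen in the debugger: `safe` still 1 (PAUSE_READY) and `go` already 1, while the requester keeps spinning.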
-
sommerfeld
so perhaps looping in the interrupt handler on cpu0, not able to reach the test on cp_go ..
-
tsoome
"...we don't want to spl because that might block clock interrupts needed to preempt threads on other CPUs." -- but there are no other cpus yet.
-
tsoome
so that's the catch there.
-
sommerfeld
possibly looping in tsc_gethrtime() ?
-
tsoome
no, I mean, we only have 2 cpus set up so far. we are about to online cpu1, but we need to quiesce cpu0 first, and there is a small window between raising the semaphore and going splhigh() during which we actually get interrupted ...
-
tsoome
um, well, to put it the other way around: if all other cpus are paused, we do not need to keep that window any more