-
gemelen
Hi! (I chat here quite rarely; last time it was still on freenode...) I bumped into what I assume was a panic: the system became unresponsive over the network for a few minutes, and after I was able to get back in, the uptime had been reset and I had a system mail about a Fault Management Event. I was able to reproduce the fault scenario a second time (I didn't try to repeat it further since it's my only SmartOS
-
gemelen
box). Should I describe the case in the bug tracker, or here first so somebody can say whether it's worth fixing? The scenario involves running snoop on VNICs over an etherstub
-
gemelen
(I searched the bug tracker for something similar, but didn't find anything resembling what I did and saw)
-
neuroserve
gemelen: have you looked for a core dump yet?
-
rmustacc
gemelen: I would check to see if a crash dump was created. It'd usually be under something like /var/crash/volatile and have a name like vmdump.<num>
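
A minimal sketch of that check, assuming the /var/crash/volatile location mentioned above as the default. `list_vmdumps` is a hypothetical helper name used only for illustration, not an existing tool.

```shell
#!/bin/sh
# List crash dump files named vmdump.<num> in a directory.
# Assumption: the directory defaults to /var/crash/volatile; pass another path to override.
list_vmdumps() {
    dir="${1:-/var/crash/volatile}"
    for f in "$dir"/vmdump.*; do
        # If nothing matches, the glob stays literal, so confirm the file exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

list_vmdumps
```

If nothing is printed, no dump was saved in that directory.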
-
gemelen
yep, I could see two dump files
-
rmustacc
OK, so that'll have information about what went wrong.
-
rmustacc
A useful starting point is to run savecore -f vmdump.<num> somewhere, which will decompress the dump, creating two files. That gives us some initial information about what's going wrong. My usual default is to run something like mdb -e '::status; $C' <num>
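
The two steps above can be sketched as a dry run that only prints the commands, so it is safe to try on any machine. The function name `crash_triage_cmds`, the default directory, and the trailing `.` (asking savecore to write into the current directory) are assumptions added for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands for a first look at a crash dump.
# Assumptions: dumps are named vmdump.<num> and live in /var/crash/volatile by default.
crash_triage_cmds() {
    num="$1"                          # dump number, e.g. 0 for vmdump.0
    dir="${2:-/var/crash/volatile}"   # assumed dump directory
    printf 'cd %s\n' "$dir"
    # Decompress vmdump.<num> into two files; "." as the target directory is an assumption.
    printf 'savecore -f vmdump.%s .\n' "$num"
    # Print the panic summary and the stack trace, as suggested above.
    printf "mdb -e '::status; \$C' %s\n" "$num"
}

crash_triage_cmds 0
```

Once the printed commands look right for your dump number, run them by hand on the SmartOS box.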
-
gemelen
Sounds good. Could you recommend a guide with more detailed instructions, if there is one? I understand roughly what to do; I just don't have experience dealing with panics on this platform
-
rmustacc
There's illumos.org/books/mdb which talks about the debugger itself.
-
rmustacc
There's illumos.org/books/dev/debugging.html#panics which has a short example of what I described.
-
rmustacc
But I don't have a good guide off hand on how to really go diagnose this. The hope is that those two commands with mdb will give us a good starting point. From there, we can either suggest some follow ups or if you can share the dump, someone may be able to look.
-
gemelen
No problem. I'll look myself first. Sharing the dumps is no problem; there's nothing special or confidential about this system. I'll be back with questions if any arise. Thanks
-
rmustacc
If you can share the stack and status output, that'll at least let us tell you if it's a known bug or not.
-
gemelen
oh, yes, give me a few minutes
-
rmustacc
Take your time, it's at your convenience. It'll be a bit before I could look at anything myself.
-
gemelen
-
danmcd
Interesting...
-
wiedi
sounds similar to what was fixed in illumos/illumos-gate 85dff7a
-
fenix
→ GitHub commit 85dff7a: 15167 Panic when halting a zone with self-created links 15407 (committed)
-
wiedi
-
danmcd
The release is very new though: joyent_20250206T001102Z
-
danmcd
@gemelen can you email me (danmcd⊙mi) the reproduction?
-
gemelen
danmcd: I've added the steps in a comment to the gist.
-
danmcd
Oooh thank you. Etherstub use like that isn't tested ... and you say `snoop -z` (A SmartOS specific change)?
-
danmcd
I *think* I know what happened, but I'll have to find out myself for sure.
-
gemelen
yes, I was running it over interfaces in zones, which aren't available to snoop from the GZ without that flag
-
gemelen
I'll share a link to the dump in a moment, if it'd be useful
-
danmcd
The `snoop -z` you're running is likely causing the extra reference that is triggering the VERIFY.
-
danmcd
As a workaround, you could `zlogin $ZONE snoop -d ....` instead. That way when the vmadm stop happens, the snoop process dies first BEFORE datalinks get unloaded.
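
A minimal sketch of that workaround as a dry run: it prints the in-zone capture command so you can check it before running it. `in_zone_snoop_cmd` and both arguments are hypothetical placeholders; the link name must be the one visible inside the zone.

```shell
#!/bin/sh
# Dry-run sketch: print the command that runs snoop inside the zone via zlogin,
# rather than capturing from the global zone.
# Both arguments are placeholders supplied by the caller.
in_zone_snoop_cmd() {
    zone="$1"   # zone name or UUID
    link="$2"   # datalink name as seen inside the zone (assumed example: net0)
    printf 'zlogin %s snoop -d %s\n' "$zone" "$link"
}

in_zone_snoop_cmd "$ZONE" net0
```

Because snoop then runs inside the zone, it is stopped along with the zone, which is the ordering the workaround relies on.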
-
gemelen
yeah, makes sense
-
gemelen
just a short disclaimer: it's my personal box, so there's no production pressure, and I'm happy to help with more info if required
-
danmcd
The use case is interesting, however, and deserves some deeper dives.
-
gemelen
it felt natural: I have a network config to fix and it was reasonable to sniff what's going on between zones (as in a router and a regular client)
-
danmcd
I don't think it's unnatural, I do think it's not been used on SmartOS all that much before, esp. in combination with `snoop -z`
-
danmcd
Dumb question: Your zones... are they native, LX, KVM, or BHYVE ?
-
danmcd
And yes, the dump link would be helpful.
-
gemelen
I started with one native "router" zone and one "bhyve" client, but the second time it was with both zones native
-
danmcd
I'd like the second dump, please? Easier to reproduce here.
-
gemelen
sure
-
gemelen
danmcd: gemelen.net/share/vmdump.1 (not sure if it makes sense to try to compress it again)
-
danmcd
It does not. Thank you.
-
danmcd
I'm downloading it now, but I'm out for the rest of the day starting in less than an hour. Thank you for bringing this up. I *do* wonder if it can be recreated merely by using `snoop -z` or not?
-
gemelen
yes, it should be
-
danmcd
I think I just induced it myself with a quick little try.
-
danmcd
Okay, this is all useful data, thank you gemelen. I don't know when I can dig deeply into it, but it's definitely a bug.
-
gemelen
glad to help