-
myrkraverk
Pheugh, for a moment there I thought you meant the binary, but it looks like you're converting the mdb.1 man page to mdoc, gitomat.
-
sommerfeld
jbk: tx lifespan is a lot more deterministic (and, if congestion control is working and avoiding bufferbloat, shorter) than rx
-
gitomat
[illumos-gate] 18147 viona_rx should not interrupt the guest if all packets are dropped -- Kyle Simpson <kyle⊙oc>
-
gitomat
[illumos-gate] 17691 acpi_get_possible_irq_resources() will always succeed -- Shardul Bankar <shardulsb08⊙gc>
-
cgr
How can I analyze an OS crash dump that was initiated by an expired deadman timer?
-
myrkraverk
cgr, look at my mdb short intro? Does that help?
myrkraverk.com/blog/2014/04
-
myrkraverk
Basically, cgr, have a serial debugger running in the kernel when the deadman timer expires. Can you do that?
-
cgr
The whole thing: The hanging OS is a Solaris 11.4 bhyve guest running on OmniOS.
-
cgr
I have a crash dump. The zpool does not seem to be the problem. How can I see which threads are hanging? > ::walk thread | ::findstack delivers a lot of stacks, most of them in _resume_from_idle
-
cgr
is this a normal state?
-
cgr
stack pointer for thread ffffe33000005ac0: ffffe330000ddeb0
  ffffe330000ddf50 xc_serv+0x16d((char *) 0, (char *) 0)
  ffffe330000ddfa0 av_dispatch_autovect+0x81((uint_t) f0)
  ffffe330000ddff0 dispatch_hilevel+0x5d((uint_t) f0, (uint_t) 0)
  ffffe330000058c0 switch_sp_and_call+0x13()
  ffffe33000005920 do_interrupt+0x1df((struct regs *) 0xffffe33000005930, (trap_trace_rec_t *) 0)
  ffffe33000005930 _interrupt+0xc4()
  ffffe33000005a20 mach_cpu_idle+0x17()
  ffffe33000005a60 cpu_idle+0x2b7()
  ffffe33000005a70 cpu_idle_adaptive+0x19()
  ffffe33000005aa0 idle+0x11e()
  ffffe33000005ab0 thread_start+8()
-
cgr
?
-
cgr
::memstat has a problem
-
cgr
vmcore.4> ::memstat
-
cgr
mdb: unable to read 5 page_ts at ffffe10005f58d80: no mapping for address
-
cgr
mdb: can't walk rm: walker failed
-
cgr
vmcore.4> ffffe10005f58d80::whatis
-
cgr
ffffe10005f58d80 is the page_t for PFN 0xbeb1b (NOTE: read of pp failed)
-
jbk
is this a crash dump from the solaris guest, or from the omnios host?
-
cgr
the Solaris guest
-
cgr
The OmniOS host has noticed nothing
-
jbk
::stacks (note plural) can be useful -- it'll 'summarize' all the different stacks
-
jbk
with a count
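
To make the idea of a stack summary with counts concrete, here is a minimal Python sketch. It is only a conceptual illustration, not how mdb implements `::stacks`: it assumes each thread's stack has already been extracted as a list of frame strings, and it groups threads whose frame lists are identical. The idle frames come from the stack pasted earlier in this chat; the second stack (`hypothetical_worker`) is invented purely for the example.

```python
from collections import Counter


def summarize_stacks(stacks):
    """Group identical stacks and return (frames, count) pairs, most common first."""
    counts = Counter(tuple(frames) for frames in stacks)
    return counts.most_common()


# Frames taken from the idle stack pasted in the chat.
IDLE = [
    "mach_cpu_idle+0x17",
    "cpu_idle+0x2b7",
    "cpu_idle_adaptive+0x19",
    "idle+0x11e",
    "thread_start+8",
]

# Invented stack, for illustration only.
WORKER = ["cv_wait+0x70", "hypothetical_worker+0x20", "thread_start+8"]

if __name__ == "__main__":
    for frames, count in summarize_stacks([IDLE, IDLE, WORKER, IDLE]):
        print(count, " <- ".join(frames))
```

A stack that appears only once or twice among hundreds of identical idle or sleeping stacks is usually the one worth reading first.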
-
jbk
the above stack (without digging a bit more) looks like it might just be a periodic interrupt firing and being handled at the time the dump started
-
cgr
ok
-
jbk
if it's hung, presumably everything is probably sleeping
-
jbk
also ::cpuinfo might be interesting
-
jbk
the 'SWITCH' column shows how long it's been since the cpu has switched threads... if there's one that's been sitting there for a long time (the amount will stand out), that might suggest a stuck kernel thread
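
As a rough sketch of that check, the following Python snippet parses whitespace-separated `::cpuinfo` output and reports CPUs whose SWITCH age exceeds a threshold. The function name `stale_cpus` and the threshold are assumptions made for illustration, and the unit of the `t-N` value is assumed to be whatever mdb prints (believed to be clock ticks). The sample text is the `::cpuinfo` output pasted in this chat.

```python
# ::cpuinfo output as pasted in this chat.
SAMPLE = """\
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 ffffe08000000000 1f 0 0 -1 no no t-2 ffffe33000005ac0 (idle)
1 fffffffffc0d3c60 1b 0 0 -1 no no t-1 ffffe33000c63ac0 (idle)
2 ffffe08000400000 1f 0 0 -1 no no t-0 ffffe33000d8bac0 (idle)
3 ffffe08000600000 1f 1 0 -1 no no t-0 ffffe33000e0aac0 (idle)
"""


def stale_cpus(cpuinfo_text, threshold):
    """Return (cpu_id, age) pairs for CPUs whose SWITCH age exceeds threshold."""
    rows = [line.split() for line in cpuinfo_text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    id_col = header.index("ID")
    switch_col = header.index("SWITCH")
    stale = []
    for row in body:
        # SWITCH is printed as "t-N": N units since the CPU last switched threads.
        age = int(row[switch_col].split("-", 1)[1])
        if age > threshold:
            stale.append((int(row[id_col]), age))
    return stale


if __name__ == "__main__":
    print(stale_cpus(SAMPLE, threshold=100))
```

With the pasted output every CPU switched threads within the last couple of units, so nothing stands out at any sensible threshold.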
-
jbk
unfortunately, it's a bit of knowing what looks 'typical' vs. what doesn't in those stacks to see which one seems odd
-
jbk
e.g. if you see a driver sitting in its attach method, that's probably interesting
-
jbk
or stacks in the device tree code (ndi_xxxx)
-
cgr
vmcore.4> ::cpuinfo
ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
 0 ffffe08000000000  1f    0    0  -1   no    no    t-2 ffffe33000005ac0 (idle)
 1 fffffffffc0d3c60  1b    0    0  -1   no    no    t-1 ffffe33000c63ac0 (idle)
 2 ffffe08000400000  1f    0    0  -1   no    no    t-0 ffffe33000d8bac0 (idle)
 3 ffffe08000600000  1f    1    0  -1   no    no    t-0 ffffe33000e0aac0 (idle)
-
jbk
if you pastebin (please don't paste it here, it'd be too long) the ::stacks output, we could see if something looks off..
-
jbk
the door_upcall() thread is _potentially_ interesting -- however the thread it's in appears to be something unique to solaris, so it's unclear what it's doing (so could be normal)
-
jbk
there is also a thread blocked waiting to read from a pipe
-
jbk
::msgbuf and ::ps might help as well
-
cgr
::msgbuf seems normal:
pastebin.com/0MPUFqtZ
-
cgr
the normal running system has no stack with nfs4_async_unmount_manager_thread
-
cgr
it is neither an NFS client nor an NFS server
-
jbk
hmm..ok the panic message (I thought it was a forced panic)...
-
cgr
forced by deadman timer
-
jbk
sorry.. meetings for a while...
-
cgr
ok I am also
-
myrkraverk
cgr, when you continue, can you let me know if it's better when you have a physical serial cable connected to the virtual machine? I believe that's how I originally tested the OS X debugging; a coworker at the time had a Macbook of some sort; I was running the Illumos kernel in a VirtualBox on Windows, and I believe I mention that in the tutorial, but nothing about a physical serial cable [it wasn't directly relevant to my case.]
-
myrkraverk
I believe I also tested the command on a BeOS laptop -- or was it HaikuOS? I don't remember the details -- and I believe the OS X command also magically worked there.
-
myrkraverk
You know, through the serial cable.
-
myrkraverk
Via the USB protocol.
-
myrkraverk
I now realize I've installed OS/2, actually ArcaOS, on that laptop; but forgot where it is. I'll have to do this experiment again via two virtual machines, the debuggee on VirtualBox, and the debugger on a different VirtualBox running OS/2. Possibly on a different physical machine, because I have two USB RS232 protocol cables, and a null modem.
-
myrkraverk
Unless I find the different laptop where I've dual booted OS/2 [ArcaOS] and FreeDOS. I have to do this experiment there; via the docking feature and real RS-232 port.
-
myrkraverk
This is getting too pornographic, I'll have to do this, and upload pictures to my Pornographic Operating System collection on Xvideos.
-
myrkraverk
And I'll absolutely have to find my physical copy of the NextStep for PC, and see if that'll boot on either the same physical laptop, or a VirtualBox. You know, for pornographic purposes.
-
myrkraverk
If that works, I'll have to label it for content warning: extreme.
-
myrkraverk
cgr, please let me know what operating system you find best for the other and of the virtual, or physical serial cable.
-
myrkraverk
*end
-
gitomat
[illumos-gate] 15209 Fix C++ math function visibility -- Gordon Ross <gwr⊙rc>
-
gitomat
[illumos-gate] 18138 Document C++ header guidance -- Gordon Ross <gordon.w.ross⊙gc>
-
gitomat
[illumos-gate] 18124 want test case for dladm create-vnic -- Hans Rosenfeld <rosenfeld⊙gho>
-
cgr63
I have 3 other crash dumps initiated by the deadman timeout, not all have the nfs thread, all have the door_upcall and all
-
cgr63
deadman: timed out after 300 seconds of deadman cyclic inactivity on CPU 2
-
cgr63
every time it is CPU 2
-
jbk
IIRC, unless Oracle has made major changes in the area, we use the CPU's APIC to trigger an interrupt when the next cyclic is supposed to fire
-
jbk
since we don't have access to the solaris source, the message suggests that somehow that's not happening but some other watchdog process is managing to fire and notice the lack of cyclic activity
-
cgr63
The strange thing is that the VM has been running for two years, yet the problems have only started appearing in the last two days—with no changes and no updates to Solaris or OmniOS.
-
jbk
that is odd...
-
jbk
almost makes me wonder if there's some sort of overflow or truncation happening somewhere
-
cgr63
all crash dumps have the ::memstat problem
-
cgr63
vmcore.1> ::memstat
-
cgr63
mdb: unable to read 5 page_ts at ffffe10005f58d80: no mapping for address
-
cgr63
mdb: can't walk rm: walker failed
-
jbk
are you viewing the dump using mdb on a solaris system or an omnios system?
-
myrkraverk
Why not dbx on Linux?
-
myrkraverk
Still has Linux downloads.
-
jbk
because it doesn't have the modules to interpret internal kernel structures or similarly for any loaded kernel modules
-
jbk
mdb does
-
myrkraverk
Ah, sorry about that.
-
jbk
and (unless oracle has changed things) the mdb that ships with the OS is probably going to have the modules that know best about the structures in a dump from the same one
-
jbk
though with CTF and depending on how the modules are written, you can often get away with using a different build because internals don't change too much typically
-
jbk
but it's not guaranteed
-
myrkraverk
Yeah, you need to be careful if you update the operating system from one crash dump to the next.
-
myrkraverk
I've had that experience on FreeBSD.
-
myrkraverk
But that was, I believe, more a factor of the FreeBSD team deleting or overwriting the symbol files for the in-kernel data structures. I just couldn't retroactively create a compatible system, given the crash dump I had in my hands.
-
myrkraverk
[Or rather, the budget for the debugging project didn't account for me creating a build system for FreeBSD and seeing if I could recreate all the symbol files for this minor kernel update that was in the past.]
-
alanc
Oracle has not changed that - things that just rely on CTF data should be fine, but compiled mdb modules are best used with the same OS version as the crash dump (often they'll still work as many things don't change in each release, but that's relying on luck, not design)
-
gitomat
[illumos-gate] 18153 loader: clean up smatch warnings in loader (pre-userboot) -- Toomas Soome <tsoome⊙mc>
-
cgr63
I did everything on the same Solaris 11.4 instance the dumps are from.
-
jbk
what arguments are being used w/ bhyve?
-
jbk
maybe with or without the -x option might be interesting to see if the problem persists
-
jbk
... uugh.. i do wish we could split sd.c into smaller pieces w/o losing (an easy) git blame
-
richlowe
I thought I had configuration for blame to follow things, but I don't?
-
richlowe
is it just that you have to use -M -C etc?
-
richlowe
or does it just totally break the ancestry?
-
jbk
from what i tried, the only way I could get it to work is if you basically cp sd.c sd_copy.c; commit, then remove the bits from each (or some variant of that)
-
richlowe
sounds likely :\
-
jbk
(don't need to compile sd_copy.c in that example, just need a commit with both)
-
richlowe
the unix history repo does horrible things to make things like that work
-
richlowe
dspinellis/unix-history I think
-
richlowe
but at the cost of looking like that
-
jbk
unfortunately, i don't think having two commits like that would be received well
-
richlowe
I don't judge such things, but I would weigh it against the benefits at least
-
richlowe
but I also frequently require blame
-
jbk
yeah, I wouldn't want to do it without preserving blame..
-
jbk
i mean, you could work around it by checking out the prior version and then using blame
-
jbk
but that i think would be needlessly annoying
-
jbk
and i wouldn't want to throw it away either
-
jbk
esp for something like that...
-
richlowe
the workaround doesn't work because you have to know there was a prior combined version
-
richlowe
the unfortunate git approach to these things means that you don't actually know when you're missing data, you just have to go check
-
richlowe
(see also -M, -C, etc.)
-
richlowe
I think the git-author-approved way is that you should be using log -S/-G/etc. to find what you want, not blaming the file it happens to be in right now.
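
A small, hedged illustration of the `log -S` (pickaxe) approach: the Python sketch below builds a throwaway repository, moves a function from one file into a new one, and then asks `git log -S` which commits changed the number of occurrences of that function's text. The file names (`big.c`, `small.c`), the commit subjects, and the helper names `git` and `pickaxe_demo` are all invented for the example; it assumes a `git` binary is on the PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run a git command inside repo and return its stdout."""
    cmd = ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout


def pickaxe_demo():
    """Return subjects of the commits `git log -S` reports for a moved function."""
    needle = "int helper(void)"
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init", "-q")
        # First commit: one file holding two functions.
        (repo / "big.c").write_text(
            needle + " { return 42; }\nint other(void) { return 0; }\n"
        )
        git(repo, "add", "big.c")
        git(repo, "commit", "-q", "-m", "add helper")
        # Second commit: split helper() out into its own file.
        (repo / "big.c").write_text("int other(void) { return 0; }\n")
        (repo / "small.c").write_text(needle + " { return 42; }\n")
        git(repo, "add", "big.c", "small.c")
        git(repo, "commit", "-q", "-m", "split helper into small.c")
        # -S<string> lists commits that add or remove occurrences of the string,
        # whichever file it lives in today.
        return git(repo, "log", "-S" + needle, "--format=%s").splitlines()


if __name__ == "__main__":
    if shutil.which("git"):
        print(pickaxe_demo())
```

Both the original commit and the split commit show up, so the history of the function stays discoverable without relying on blame following it across files.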