#illumos

00:06

myrkraverk

Pheugh, for a moment there I thought you meant the binary, but it looks like you're converting the mdb.1 man page to mdoc, gitomat.
01:08

sommerfeld

jbk: tx lifespan is a lot more deterministic (and, if congestion control is working and avoiding bufferbloat, shorter) than rx
09:31

gitomat

[illumos-gate] 18147 viona_rx should not interrupt the guest if all packets are dropped -- Kyle Simpson <kyle⊙oc>
09:40

gitomat

[illumos-gate] 17691 acpi_get_possible_irq_resources() will always succeed -- Shardul Bankar <shardulsb08⊙gc>
13:27

cgr

How can I analyze an OS crash dump that was initiated by an expired deadman timer?
13:29

myrkraverk

cgr, look at my mdb short intro? Does that help? myrkraverk.com/blog/2014/04
13:29

myrkraverk

Basically, cgr, have a serial debugger running in the kernel when the deadman timer expires. Can you do that?
13:29

myrkraverk

Sorry, full link: myrkraverk.com/blog/2014/04/illumos-getting-started-with-mdb/#more-41
13:31

cgr

The whole thing: The hanging OS is a Solaris 11.4 bhyve guest running on OmniOS.
13:37

cgr

I have a crash dump. The zpool does not seem to be the problem. How can I see which threads are hanging? >::walk thread | ::findstack returns a lot of stacks, most of them in _resume_from_idle
13:37

cgr

is this a normal state
13:37

cgr

stack pointer for thread ffffe33000005ac0: ffffe330000ddeb0
13:37

cgr

ffffe330000ddf50 xc_serv+0x16d((char *) 0, (char *) 0)
13:37

cgr

ffffe330000ddfa0 av_dispatch_autovect+0x81((uint_t) f0)
13:37

cgr

ffffe330000ddff0 dispatch_hilevel+0x5d((uint_t) f0, (uint_t) 0)
13:37

cgr

ffffe330000058c0 switch_sp_and_call+0x13()
13:37

cgr

ffffe33000005920 do_interrupt+0x1df((struct regs *) 0xffffe33000005930, (trap_trace_rec_t *) 0)
13:37

cgr

ffffe33000005930 _interrupt+0xc4()
13:37

cgr

ffffe33000005a20 mach_cpu_idle+0x17()
13:37

cgr

ffffe33000005a60 cpu_idle+0x2b7()
13:38

cgr

ffffe33000005a70 cpu_idle_adaptive+0x19()
13:38

cgr

ffffe33000005aa0 idle+0x11e()
13:38

cgr

ffffe33000005ab0 thread_start+8()
13:38

cgr

?
13:38

cgr

::memstat has a problem
13:38

cgr

vmcore.4> ::memstat
13:38

cgr

mdb: unable to read 5 page_ts at ffffe10005f58d80: no mapping for address
13:38

cgr

mdb: can't walk rm: walker failed
13:39

cgr

vmcore.4> ffffe10005f58d80::whatis
13:39

cgr

ffffe10005f58d80 is the page_t for PFN 0xbeb1b (NOTE: read of pp failed)
13:40

jbk

is this a crash dump from the solaris guest, or from the omnios host?
13:40

cgr

the Solaris guest
13:41

cgr

The OmniOS host has not noticed anything
13:42

jbk

::stacks (note plural) can be useful -- it'll 'summarize' all the different stacks
13:42

jbk

with a count
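
The grouping jbk describes can be illustrated outside mdb. This is only a sketch of the idea, not how mdb implements ::stacks. The idle frames are taken from the trace pasted above; the thread names t1..t3 and the frame hypothetical_worker are made up for illustration.

```python
from collections import Counter


def summarize_stacks(stacks):
    """Group identical stacks and count how many threads share each one."""
    counts = Counter(tuple(frames) for frames in stacks.values())
    # Most common stacks come first, so the rare (often more interesting)
    # ones end up at the bottom of the list.
    return counts.most_common()


# Hypothetical input: thread name -> frames, innermost frame first.
IDLE = ["mach_cpu_idle", "cpu_idle", "cpu_idle_adaptive", "idle", "thread_start"]
threads = {
    "t1": list(IDLE),
    "t2": list(IDLE),
    "t3": ["cv_wait", "hypothetical_worker", "thread_start"],
}

for frames, count in summarize_stacks(threads):
    print(count, " <- ".join(frames))
```

With many hundreds of threads, the handful of stacks that occur only once or twice are usually the ones worth reading first.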
13:43

jbk

the above stack (without digging a bit more) looks like it might just be a periodic interrupt firing and being handled at the time the dump started
13:43

cgr

ok
13:43

jbk

if it's hung, presumably everything is probably sleeping
13:44

jbk

also ::cpuinfo might be interesting
13:45

jbk

the 'SWITCH' column shows how long it's been since the cpu has switched threads... if there's one that's been sitting there for a long time (the amount will stand out), that might suggest a stuck kernel thread
13:46

jbk

unfortunately, it's a bit of knowing what looks 'typical' vs. what doesn't in those stacks to see which one seems odd
13:46

jbk

e.g. if you see a driver sitting in its attach method, that's probably interesting
13:47

jbk

or stacks in the device tree code (ndi_xxxx)
13:47

cgr

vmcore.4> ::cpuinfo
13:47

cgr

ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
13:47

cgr

0 ffffe08000000000 1f 0 0 -1 no no t-2 ffffe33000005ac0 (idle)
13:47

cgr

1 fffffffffc0d3c60 1b 0 0 -1 no no t-1 ffffe33000c63ac0 (idle)
13:47

cgr

2 ffffe08000400000 1f 0 0 -1 no no t-0 ffffe33000d8bac0 (idle)
13:47

cgr

3 ffffe08000600000 1f 1 0 -1 no no t-0 ffffe33000e0aac0 (idle)
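
A small sketch of the check jbk suggests for the SWITCH column, applied to the ::cpuinfo output pasted above. It assumes the column always has the form t-N, with N counting some unit of time since the CPU last switched threads. The helper names and the threshold of 100 are arbitrary choices for illustration, not anything mdb defines.

```python
CPUINFO = """\
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 ffffe08000000000 1f 0 0 -1 no no t-2 ffffe33000005ac0 (idle)
1 fffffffffc0d3c60 1b 0 0 -1 no no t-1 ffffe33000c63ac0 (idle)
2 ffffe08000400000 1f 0 0 -1 no no t-0 ffffe33000d8bac0 (idle)
3 ffffe08000600000 1f 1 0 -1 no no t-0 ffffe33000e0aac0 (idle)
"""


def switch_ages(text):
    """Map CPU id -> number in the SWITCH column (the part after 't-')."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    id_col = header.index("ID")
    switch_col = header.index("SWITCH")
    return {int(r[id_col]): int(r[switch_col].split("-", 1)[1]) for r in body}


def stale_cpus(text, threshold):
    """Return the ids of CPUs whose SWITCH value exceeds the threshold."""
    return sorted(cpu for cpu, age in switch_ages(text).items() if age > threshold)


print(switch_ages(CPUINFO))
print(stale_cpus(CPUINFO, threshold=100))
```

In this dump no value stands out: every CPU shows a SWITCH value between 0 and 2 and is running its idle thread.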
13:48

jbk

if you pastebin (please don't paste it here, it'd be too long) the ::stacks output, we could see if something looks off..
13:52

cgr

>::stacks pastebin.com/PJiHWZCE
13:56

jbk

the door_upcall() thread is _potentially_ interesting -- however the thread it's in appears to be something unique to solaris, so it's unclear what it's doing (so could be normal)
13:57

jbk

there is also a thread blocked waiting to read from a pipe
13:58

jbk

::msgbuf and ::ps might help as well
14:01

cgr

::msgbuf seems normal: pastebin.com/0MPUFqtZ
14:02

cgr

::ps pastebin.com/uBjX6WiH
14:05

cgr

the normal running system has no stack with nfs4_async_unmount_manager_thread
14:08

cgr

it is not a nfs client nor nfs server
14:09

jbk

hmm..ok the panic message (I thought it was a forced panic)...
14:12

cgr

forced by deadman timer
14:22

jbk

sorry.. meetings for a while...
14:24

cgr

ok I am also
15:16

myrkraverk

cgr, when you continue, can you let me know if it's better when you have a physical serial cable connected to the virtual machine? I believe that's how I originally tested the OS X debugging; a coworker at the time had a Macbook of some sort; I was running the Illumos kernel in a VirtualBox on Windows, and I believe I mention that in the tutorial, but nothing about a physical serial cable [it wasn't directly relevant to my case.]
15:17

myrkraverk

I believe I also tested the command on a BeOS laptop -- or was it HaikuOS? I don't remember the details -- and I believe the OS X command also magically worked there.
15:17

myrkraverk

You know, through the serial cable.
15:17

myrkraverk

Via the USB protocol.
15:19

myrkraverk

I now realize I've installed OS/2, actually ArcaOS, on that laptop; but forgot where it is. I'll have to do this experiment again via two virtual machines, the debuggee on VirtualBox, and the debugger on a different VirtualBox running OS/2. Possibly on a different physical machine, because I have two USB RS232 protocol cables, and a null modem.
15:20

myrkraverk

Unless I find the different laptop where I've dual booted OS/2 [ArcaOS] and FreeDOS. I have to do this experiment there; via the docking feature and real RS-232 port.
15:20

myrkraverk

This is getting too pornographic, I'll have to do this, and upload pictures to my Pornographic Operating System collection on Xvideos.
15:23

myrkraverk

And I'll absolutely have to find my physical copy of the NextStep for PC, and see if that'll boot on either the same physical laptop, or a VirtualBox. You know, for pornographic purposes.
15:24

myrkraverk

If that works, I'll have to label it for content warning: extreme.
15:25

myrkraverk

cgr, please let me know what operating system you find best for the other and of the virtual, or physical serial cable.
15:25

myrkraverk

*end
16:15

gitomat

[illumos-gate] 15209 Fix C++ math function visibility -- Gordon Ross <gwr⊙rc>
16:15

gitomat

[illumos-gate] 18138 Document C++ header guidance -- Gordon Ross <gordon.w.ross⊙gc>
16:42

gitomat

[illumos-gate] 18124 want test case for dladm create-vnic -- Hans Rosenfeld <rosenfeld⊙gho>
17:10

cgr63

ls
17:14

cgr63

I have 3 other crash dumps initiated by the deadman timeout, not all have the nfs thread, all have the door_upcall and all
17:14

cgr63

deadman: timed out after 300 seconds of deadman cyclic inactivity on CPU 2
17:14

cgr63

every time it is CPU 2
17:16

jbk

IIRC, unless Oracle has made major changes in the area, we use the CPU's APIC to trigger an interrupt when the next cyclic is supposed to fire
17:17

jbk

since we don't have access to the solaris source, the message suggests that somehow that's not happening but some other watchdog process is managing to fire and notice the lack of cyclic activity
17:22

cgr63

The strange thing is that the VM has been running for two years, yet the problems have only started appearing in the last two days—with no changes and no updates to Solaris or OmniOS.
17:23

jbk

that is odd...
17:23

jbk

almost makes me wonder if there's some sort of overflow or truncation happening somewhere
17:23

cgr63

all the crash dumps have the ::memstat problem
17:24

cgr63

vmcore.1> ::memstat
17:24

cgr63

mdb: unable to read 5 page_ts at ffffe10005f58d80: no mapping for address
17:24

cgr63

mdb: can't walk rm: walker failed
17:32

jbk

are you viewing the dump using mdb on a solaris system or an omnios system?
17:37

myrkraverk

Why not dbx on Linux?
17:37

myrkraverk

oracle.com/tools/developerstudio/do…/solaris-studio-v123-downloads.html
17:37

myrkraverk

Still has Linux downloads.
17:38

jbk

because it doesn't have the modules to interpret internal kernel structures or similarly for any loaded kernel modules
17:38

jbk

mdb does
17:38

myrkraverk

Ah, sorry about that.
17:39

jbk

and (unless oracle has changed things) the mdb that ships with the OS is probably going to have the modules that know best about the structures in a dump from the same one
17:39

jbk

though with CTF and depending on how the modules are written, you can often get away with using a different build because internals don't change too much typically
17:39

jbk

but it's not guaranteed
17:39

myrkraverk

Yeah, you need to be careful if you update the operating system from one crash dump to the next.
17:40

myrkraverk

I've had that experience on FreeBSD.
17:41

myrkraverk

But that was, I believe, more a factor of the FreeBSD team deleting or overwriting the symbol files for the in-kernel data structures. I just couldn't retroactively create a compatible system, given the crash dump I had in my hands.
17:42

myrkraverk

[Or rather, the budget for the debugging project didn't account for me creating a build system for FreeBSD and seeing whether I could recreate all the symbol files for this minor kernel update that was in the past.]
17:42

alanc

Oracle has not changed that - things that just rely on CTF data should be fine, but compiled mdb modules are best used with the same OS version as the crash dump (often they'll still work as many things don't change in each release, but that's relying on luck, not design)
18:39

gitomat

[illumos-gate] 18153 loader: clean up smatch warnings in loader (pre-userboot) -- Toomas Soome <tsoome⊙mc>
18:51

cgr63

I did all on the same Solaris 11.4 instance where the dumps are from.
19:11

jbk

what arguments are being used w/ bhyve?
19:14

jbk

maybe with or without the -x option might be interesting to see if the problem persists
19:53

jbk

... uugh.. i do wish we could split sd.c into smaller pieces w/o losing (an easy) git blame
19:53

richlowe

I thought I had configuration for blame to follow things, but I don't?
19:53

richlowe

is it just you have to use -M -C etc?
19:54

richlowe

or does it just totally break the ancestry?
19:54

jbk

from what i tried, the only way I could get it to work is if you basically cp sd.c sd_copy.c; commit, then remove the bits from each (or some variant of that)
19:54

richlowe

sounds likely :\
19:54

jbk

(don't need to compile sd_copy.c in that example, just need a commit with both)
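
A rough sketch of the copy-then-trim sequence jbk describes, run against a throwaway repository. The file names big.c and big_copy.c are hypothetical stand-ins for sd.c and sd_copy.c, and the script assumes git is on PATH. Passing -C twice to git blame is an assumption about which flags are needed here: it asks blame to also look for copies from other files in the commit that creates the file.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run git in `repo` with a throwaway identity and return its stdout."""
    cmd = ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.invalid",
           "-c", "commit.gpgsign=false", *args]
    return subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True).stdout


def blamed_subjects(repo, path, *flags):
    """Return the set of commit subjects that blame assigns lines of `path` to."""
    out = git(repo, "blame", "--line-porcelain", *flags, "--", path)
    prefix = "summary "
    return {l[len(prefix):] for l in out.splitlines() if l.startswith(prefix)}


def demo_split_with_blame():
    """Split big.c in two via an intermediate copy commit and compare blame."""
    if shutil.which("git") is None:
        return None  # git is not installed, nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init", "-q")
        lines = [f"int function_number_{i:02d}(void) {{ return {i}; }}"
                 for i in range(20)]
        (repo / "big.c").write_text("\n".join(lines) + "\n")
        git(repo, "add", "big.c")
        git(repo, "commit", "-q", "-m", "original")
        # Step 1: commit an untouched copy next to the original.
        shutil.copy(repo / "big.c", repo / "big_copy.c")
        git(repo, "add", "big_copy.c")
        git(repo, "commit", "-q", "-m", "copy")
        # Step 2: trim each file down to its own half.
        (repo / "big.c").write_text("\n".join(lines[:10]) + "\n")
        (repo / "big_copy.c").write_text("\n".join(lines[10:]) + "\n")
        git(repo, "commit", "-q", "-am", "split")
        return {
            "plain": blamed_subjects(repo, "big_copy.c"),
            "copies": blamed_subjects(repo, "big_copy.c", "-C", "-C"),
        }


print(demo_split_with_blame())
```

The "plain" entry shows what a bare git blame reports for the new file, and the "copies" entry shows what it reports once copy detection is turned on, which is the difference richlowe is asking about.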
19:55

richlowe

the unix history repo does horrible things to make things like that work
19:55

richlowe

dspinellis/unix-history I think
19:55

richlowe

but at the cost of looking like that
20:00

jbk

unfortunately, i don't think having two commits like that would be received well
20:08

richlowe

I don't judge such things, but I would weigh it against the benefits at least
20:08

richlowe

but I also frequently require blame
20:09

jbk

yeah, I wouldn't want to do it without preserving blame..
20:09

jbk

i mean, you could work around it by checking out the prior version and then using blame
20:10

jbk

but that i think would be needlessly annoying
20:10

jbk

and i wouldn't want to throw it away either
20:10

jbk

esp for something like that...
20:11

richlowe

the workaround doesn't work because you have to know there was a prior combined version
20:11

richlowe

the unfortunate git approach to these things means that you don't actually know when you're missing data, you just have to go check
20:11

richlowe

(see also -M, -C, etc.)
20:46

richlowe

I think the git-author-approved way is that you should be using log -S/-G/etc. to find what you want, not blaming the file it happens to be in right now.
