00:06:45 <myrkraverk> Pheugh, for a moment there I thought you meant the binary, but it looks like you're converting the mdb.1 man page to mdoc, gitomat.
01:08:23 <sommerfeld> jbk: tx lifespan is a lot more deterministic (and, if congestion control is working and avoiding bufferbloat, shorter) than rx
09:31:01 <gitomat> [illumos-gate] 18147 viona_rx should not interrupt the guest if all packets are dropped -- Kyle Simpson <kyle⊙oc>
09:40:06 <gitomat> [illumos-gate] 17691 acpi_get_possible_irq_resources() will always succeed -- Shardul Bankar <shardulsb08⊙gc>
13:27:17 <cgr> How can I analyze an OS crash dump that was triggered by an expired deadman timer?
13:29:01 <myrkraverk> cgr, look at my mdb short intro?  Does that help?  http://www.myrkraverk.com/blog/2014/04/
13:29:23 <myrkraverk> Basically, cgr, have a serial debugger running in the kernel when the deadman timer expires.  Can you do that?
13:29:57 <myrkraverk> Sorry, full link: http://www.myrkraverk.com/blog/2014/04/illumos-getting-started-with-mdb/#more-41
13:31:37 <cgr> The whole thing: The hanging OS is a Solaris 11.4 bhyve guest running on Omnios.
13:37:37 <cgr> I have a crash dump. Zpool seems not to be the problem. How can I see which threads are hanging? >::walk thread | ::findstack delivers a lot of stacks, most of them in _resume_from_idle
13:37:55 <cgr> is this a normal state
13:37:55 <cgr> stack pointer for thread ffffe33000005ac0: ffffe330000ddeb0
13:37:56 <cgr>   ffffe330000ddf50 xc_serv+0x16d((char *) 0, (char *) 0)
13:37:56 <cgr>   ffffe330000ddfa0 av_dispatch_autovect+0x81((uint_t) f0)
13:37:57 <cgr>   ffffe330000ddff0 dispatch_hilevel+0x5d((uint_t) f0, (uint_t) 0)
13:37:57 <cgr>   ffffe330000058c0 switch_sp_and_call+0x13()
13:37:58 <cgr>   ffffe33000005920 do_interrupt+0x1df((struct regs *) 0xffffe33000005930, (trap_trace_rec_t *) 0)
13:37:58 <cgr>   ffffe33000005930 _interrupt+0xc4()
13:37:59 <cgr>   ffffe33000005a20 mach_cpu_idle+0x17()
13:37:59 <cgr>   ffffe33000005a60 cpu_idle+0x2b7()
13:38:00 <cgr>   ffffe33000005a70 cpu_idle_adaptive+0x19()
13:38:00 <cgr>   ffffe33000005aa0 idle+0x11e()
13:38:01 <cgr>   ffffe33000005ab0 thread_start+8()
13:38:07 <cgr> ?
13:38:45 <cgr> ::memstat has a problem
13:38:57 <cgr> vmcore.4> ::memstat
13:38:57 <cgr> mdb: unable to read 5 page_ts at ffffe10005f58d80: no mapping for address
13:38:58 <cgr> mdb: can't walk rm: walker failed
13:39:26 <cgr> vmcore.4> ffffe10005f58d80::whatis
13:39:27 <cgr> ffffe10005f58d80 is the page_t for PFN 0xbeb1b (NOTE: read of pp failed)
13:40:42 <jbk> is this a crash dump from the solaris guest, or from the omnios host?
13:40:58 <cgr> the Solaris guest
13:41:53 <cgr> The OmniOS host has not noticed anything
13:42:25 <jbk> ::stacks (note plural) can be useful -- it'll 'summarize' all the different stacks
13:42:28 <jbk> with a count
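A rough illustration of the kind of summary jbk describes: the Python sketch below groups `::findstack`-style output by its sequence of function names and counts identical stacks. It only approximates the idea and is not how mdb's `::stacks` is implemented. The function name `summarize_stacks`, the parsing rules, and the sample input (shortened, with an invented stack pointer for the second thread) are assumptions for illustration.

```python
import re
from collections import Counter

# Frame lines in ::findstack output look like:
#   "  ffffe33000005a60 cpu_idle+0x2b7()"
# Assumption: every frame has the form <hex address> <function>+<offset>(...).
FRAME_RE = re.compile(r"^\s*[0-9a-f]+\s+([A-Za-z_][\w.]*)\+")


def summarize_stacks(text):
    """Count identical stacks, keyed by their tuple of function names."""
    counts = Counter()
    current = None
    for line in text.splitlines():
        if line.startswith("stack pointer for thread"):
            # A new thread begins; record the previous one if it had frames.
            if current:
                counts[tuple(current)] += 1
            current = []
            continue
        match = FRAME_RE.match(line)
        if match and current is not None:
            current.append(match.group(1))
    if current:
        counts[tuple(current)] += 1
    return counts


# Hypothetical, abbreviated sample: two idle threads with the same stack.
SAMPLE = """\
stack pointer for thread ffffe33000005ac0: ffffe330000ddeb0
  ffffe33000005a60 cpu_idle+0x2b7()
  ffffe33000005aa0 idle+0x11e()
  ffffe33000005ab0 thread_start+8()
stack pointer for thread ffffe33000c63ac0: ffffe33000c639b0
  ffffe33000c63a60 cpu_idle+0x2b7()
  ffffe33000c63aa0 idle+0x11e()
  ffffe33000c63ab0 thread_start+8()
"""

for stack, count in summarize_stacks(SAMPLE).most_common():
    print(count, " <- ".join(stack))
```

Stacks that appear only once or twice in such a summary are usually the ones worth reading first, since the common ones tend to be idle or sleeping threads.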
13:43:06 <jbk> the above stack (without digging a bit more) looks like it might just be a periodic interrupt firing and being handled at the time the dump started
13:43:39 <cgr> ok
13:43:41 <jbk> if it's hung, presumably everything is probably sleeping
13:44:39 <jbk> also ::cpuinfo might be interesting
13:45:30 <jbk> the 'SWITCH' column shows how long it's been since the cpu has switched threads... if there's one that's been sitting there for a long time (the amount will stand out), that might suggest a stuck kernel thread
13:46:05 <jbk> unfortunately, it's a bit of knowing what looks 'typical' vs. what doesn't in those stacks to see which one seems odd
13:46:40 <jbk> e.g. if you see a driver sitting in its attach method, that's probably interesting
13:47:20 <jbk> or stacks in the device tree code (ndi_xxxx)
13:47:45 <cgr> vmcore.4> ::cpuinfo
13:47:45 <cgr>   ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
13:47:46 <cgr>    0 ffffe08000000000  1f    0    0  -1   no    no t-2    ffffe33000005ac0 (idle)
13:47:46 <cgr>    1 fffffffffc0d3c60  1b    0    0  -1   no    no t-1    ffffe33000c63ac0 (idle)
13:47:47 <cgr>    2 ffffe08000400000  1f    0    0  -1   no    no t-0    ffffe33000d8bac0 (idle)
13:47:47 <cgr>    3 ffffe08000600000  1f    1    0  -1   no    no t-0    ffffe33000e0aac0 (idle)
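To make jbk's remark about the SWITCH column concrete, here is a small Python sketch that pulls the number after `t-` out of `::cpuinfo` output like the paste above and lists CPUs whose value exceeds a chosen limit. The helper names `switch_ages` and `stale_cpus` and the threshold of 1000 are hypothetical. The sketch assumes the value is printed in decimal and makes no claim about its unit, which the log does not state.

```python
def switch_ages(cpuinfo_text):
    """Map CPU id -> the number after 't-' in the SWITCH column."""
    lines = [line for line in cpuinfo_text.splitlines() if line.strip()]
    header = lines[0].split()
    id_col = header.index("ID")
    switch_col = header.index("SWITCH")
    ages = {}
    for line in lines[1:]:
        fields = line.split()
        # Assumption: the field always looks like "t-<decimal number>".
        ages[int(fields[id_col])] = int(fields[switch_col][2:])
    return ages


def stale_cpus(ages, threshold):
    """Return the ids of CPUs whose SWITCH value is at least `threshold`."""
    return sorted(cpu for cpu, age in ages.items() if age >= threshold)


# The ::cpuinfo output pasted in the channel.
CPUINFO = """\
  ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
   0 ffffe08000000000  1f    0    0  -1   no    no t-2    ffffe33000005ac0 (idle)
   1 fffffffffc0d3c60  1b    0    0  -1   no    no t-1    ffffe33000c63ac0 (idle)
   2 ffffe08000400000  1f    0    0  -1   no    no t-0    ffffe33000d8bac0 (idle)
   3 ffffe08000600000  1f    1    0  -1   no    no t-0    ffffe33000e0aac0 (idle)
"""

ages = switch_ages(CPUINFO)
print(ages)
print(stale_cpus(ages, threshold=1000))  # threshold is an arbitrary choice
```

With this dump, every CPU shows a small value, so none stands out by this measure.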
13:48:33 <jbk> if you pastebin (please don't paste it here, it'd be too long) the ::stacks output, we could see if something looks off..
13:52:20 <cgr> >::stacks https://pastebin.com/PJiHWZCE
13:56:23 <jbk> the door_upcall() thread is _potentially_ interesting -- however the thread it's in appears to be something unique to solaris, so it's unclear what it's doing (so could be normal)
13:57:19 <jbk> there is also a thread blocked waiting to read from a pipe
13:58:42 <jbk> ::msgbuf and ::ps might help as well
14:01:03 <cgr> ::msgbuf seems normal: https://pastebin.com/0MPUFqtZ
14:02:28 <cgr> ::ps https://pastebin.com/uBjX6WiH
14:05:43 <cgr> the normal running system has no stack with nfs4_async_unmount_manager_thread
14:08:09 <cgr> it is not a nfs client nor nfs server
14:09:14 <jbk> hmm..ok the panic message (I thought it was a forced panic)...
14:12:35 <cgr> forced by deadman timer
14:22:48 <jbk> sorry.. meetings for a while...
14:24:02 <cgr> ok I am also
15:16:22 <myrkraverk> cgr, when you continue, can you let me know if it's better when you have a physical serial cable connected to the virtual machine?  I believe that's how I originally tested the OS X debugging; a coworker at the time had a Macbook of some sort; I was running the Illumos kernel in a VirtualBox on Windows, and I believe I mention that in the tutorial, but nothing about a physical serial cable [it wasn't directly relevant to my case.]
15:17:02 <myrkraverk> I believe I also tested the command on a BeOS laptop -- or was it HaikuOS? I don't remember the details -- and I believe the OS X command also magically worked there.
15:17:11 <myrkraverk> You know, through the serial cable.
15:17:16 <myrkraverk> Via the USB protocol.
15:19:26 <myrkraverk> I now realize I've installed OS/2, actually ArcaOS, on that laptop; but forgot where it is.  I'll have to do this experiment again via two virtual machines, the debuggee on VirtualBox, and the debugger on a different VirtualBox running OS/2.  Possibly on a different physical machine, because I have two USB RS232 protocol cables, and a null modem.
15:20:15 <myrkraverk> Unless I find the different laptop where I've dual booted OS/2 [ArcaOS] and FreeDOS.  I have to do this experiment there; via the docking feature and real RS-232 port.
15:20:40 <myrkraverk> This is getting too pornographic, I'll have to do this, and upload pictures to my Pornographic Operating System collection on Xvideos.
15:23:47 <myrkraverk> And I'll absolutely have to find my physical copy of the NextStep for PC, and see if that'll boot on either the same physical laptop, or a VirtualBox.  You know, for pornographic purposes.
15:24:16 <myrkraverk> If that works, I'll have to label it for content warning: extreme.
15:25:30 <myrkraverk> cgr, please let me know what operating system you find best for the other end of the virtual, or physical serial cable.
16:15:25 <gitomat> [illumos-gate] 15209 Fix C++ math function visibility -- Gordon Ross <gwr⊙rc>
16:15:25 <gitomat> [illumos-gate] 18138 Document C++ header guidance -- Gordon Ross <gordon.w.ross⊙gc>
16:42:07 <gitomat> [illumos-gate] 18124 want test case for dladm create-vnic -- Hans Rosenfeld <rosenfeld⊙gho>
17:10:03 <cgr63> ls
17:14:12 <cgr63> I have 3 other crash dumps initiated by the deadman timeout, not all have the nfs thread, all have the door_upcall and all
17:14:18 <cgr63> deadman: timed out after 300 seconds of deadman cyclic inactivity on CPU 2
17:14:44 <cgr63> every time it is CPU 2
17:16:28 <jbk> IIRC, unless Oracle has made major changes in the area, we use the CPU's APIC to trigger an interrupt when the next cyclic is supposed to fire
17:17:11 <jbk> since we don't have access to the solaris source, the message suggests that somehow that's not happening but some other watchdog process is managing to fire and notice the lack of cyclic activity
17:22:49 <cgr63> The strange thing is that the VM has been running for two years, yet the problems have only started appearing in the last two days—with no changes and no updates to Solaris or OmniOS.
17:23:12 <jbk> that is odd...
17:23:31 <jbk> almost makes me wonder if there's some sort of overflow or truncation happening somewhere
17:23:51 <cgr63> all the crash dumps have the ::memstat problem
17:24:30 <cgr63> vmcore.1> ::memstat
17:24:31 <cgr63> mdb: unable to read 5 page_ts at ffffe10005f58d80: no mapping for address
17:24:31 <cgr63> mdb: can't walk rm: walker failed
17:32:50 <jbk> are you viewing the dump using mdb on a solaris system or an omnios system?
17:37:12 <myrkraverk> Why not dbx on Linux?
17:37:41 <myrkraverk> https://www.oracle.com/tools/developerstudio/downloads/solaris-studio-v123-downloads.html
17:37:45 <myrkraverk> Still has Linux downloads.
17:38:24 <jbk> because it doesn't have the modules to interpret internal kernel structures or similarly for any loaded kernel modules
17:38:27 <jbk> mdb does
17:38:35 <myrkraverk> Ah, sorry about that.
17:39:15 <jbk> and (unless oracle has changed things) the mdb that ships with the OS is probably going to have the modules that know best about the structures in a dump from the same one
17:39:50 <jbk> though with CTF and depending on how the modules are written, you can often get away with using a different build because internals don't change too much typically
17:39:57 <jbk> but it's not guaranteed
17:39:58 <myrkraverk> Yeah, you need to be careful if you update the operating system from one crash dump to the next.
17:40:10 <myrkraverk> I've had that experience on FreeBSD.
17:41:14 <myrkraverk> But that was, I believe, more a factor of the FreeBSD team deleting or overwriting the symbol files for the in-kernel data structures.  I just couldn't retroactively create a compatible system, given the crash dump I had in my hands.
17:42:23 <myrkraverk> [Or rather, the budget for the debugging project didn't account for me creating a build system for FreeBSD, and see if I could recreate all the symbol files for this minor kernel update that was in the past.]
17:42:24 <alanc> Oracle has not changed that - things that just rely on CTF data should be fine, but compiled mdb modules are best used with the same OS version as the crash dump (often they'll still work as many things don't change in each release, but that's relying on luck, not design)
18:39:48 <gitomat> [illumos-gate] 18153 loader: clean up smatch warnings in loader (pre-userboot) -- Toomas Soome <tsoome⊙mc>
18:51:35 <cgr63> I did all on the same Solaris 11.4 instance where the dumps are from.
19:11:16 <jbk> what arguments are being used w/ bhyve?
19:14:39 <jbk> maybe with or without the -x option might be interesting to see if the problem persists
19:53:03 <jbk> ... uugh.. i do wish we could split sd.c into smaller pieces w/o losing (an easy) git blame
19:53:45 <richlowe> I thought I had configuration for blame to follow things, but I don't?
19:53:58 <richlowe> is it just you have to use -M -C etc?
19:54:15 <richlowe> or does it just totally break the ancestry?
19:54:38 <jbk> from what i tried, the only way I could get it to work is if you basically cp sd.c sd_copy.c; commit, then remove the bits from each (or some variant of that)
19:54:47 <richlowe> sounds likely :\
19:54:55 <jbk> (don't need to compile sd_copy.c in that example, just need a commit with both)
19:55:24 <richlowe> the unix history repo does horrible things to make things like that work
19:55:32 <richlowe> dspinellis/unix-history I think
19:55:51 <richlowe> but at the cost of looking like that
20:00:49 <jbk> unfortunately, i don't think having two commits like that would be received well
20:08:20 <richlowe> I don't judge such things, but I would weigh it against the benefits at least
20:08:43 <richlowe> but I also frequently require blame
20:09:43 <jbk> yeah, I wouldn't want to do it without preserving blame..
20:09:59 <jbk> i mean, you could work around it by checking out the prior version and then using blame
20:10:13 <jbk> but that i think would be needlessly annoying
20:10:21 <jbk> and i wouldn't want to throw it away either
20:10:25 <jbk> esp for something like that...
20:11:31 <richlowe> the workaround doesn't work because you have to know there was a prior combined version
20:11:44 <richlowe> the unfortunate git approach to these things means that you don't actually know when you're missing data, you just have to go check
20:11:51 <richlowe> (see also -M, -C, etc.)
20:46:30 <richlowe> I think the git-author-approved way is that you should be using log -S/-G/etc. to find what you want, not blaming the file it happens to be in right now.