00:06:45 Pheugh, for a moment there I thought you meant the binary, but it looks like you're converting the mdb.1 man page to mdoc, gitomat.
01:08:23 jbk: tx lifespan is a lot more deterministic (and, if congestion control is working and avoiding bufferbloat, shorter) than rx
09:31:01 [illumos-gate] 18147 viona_rx should not interrupt the guest if all packets are dropped -- Kyle Simpson
09:40:06 [illumos-gate] 17691 acpi_get_possible_irq_resources() will always succeed -- Shardul Bankar
13:27:17 How can I analyze an OS crash dump which was initiated by an expired deadman timer?
13:29:01 cgr, look at my mdb short intro? Does that help? http://www.myrkraverk.com/blog/2014/04/
13:29:23 Basically, cgr, have a serial debugger running in the kernel when the deadman timer expires. Can you do that?
13:29:57 Sorry, full link: http://www.myrkraverk.com/blog/2014/04/illumos-getting-started-with-mdb/#more-41
13:31:37 The whole thing: The hanging OS is a Solaris 11.4 bhyve guest running on OmniOS.
13:37:37 I have a crash dump. The zpool does not seem to be the problem. How can I see which threads are hanging? >::walk thread| ::findstack delivers a lot of stacks, most in _resume_from_idle
13:37:55 is this a normal state
13:37:55 stack pointer for thread ffffe33000005ac0: ffffe330000ddeb0
13:37:56   ffffe330000ddf50 xc_serv+0x16d((char *) 0, (char *) 0)
13:37:56   ffffe330000ddfa0 av_dispatch_autovect+0x81((uint_t) f0)
13:37:57   ffffe330000ddff0 dispatch_hilevel+0x5d((uint_t) f0, (uint_t) 0)
13:37:57   ffffe330000058c0 switch_sp_and_call+0x13()
13:37:58   ffffe33000005920 do_interrupt+0x1df((struct regs *) 0xffffe33000005930, (trap_trace_rec_t *) 0)
13:37:58   ffffe33000005930 _interrupt+0xc4()
13:37:59   ffffe33000005a20 mach_cpu_idle+0x17()
13:37:59   ffffe33000005a60 cpu_idle+0x2b7()
13:38:00   ffffe33000005a70 cpu_idle_adaptive+0x19()
13:38:00   ffffe33000005aa0 idle+0x11e()
13:38:01   ffffe33000005ab0 thread_start+8()
13:38:07 ?
13:38:45 ::memstat has a problem
13:38:57 vmcore.4> ::memstat
13:38:57 mdb: unable to read 5 page_ts at ffffe10005f58d80: no mapping for address
13:38:58 mdb: can't walk rm: walker failed
13:39:26 vmcore.4> ffffe10005f58d80::whatis
13:39:27 ffffe10005f58d80 is the page_t for PFN 0xbeb1b (NOTE: read of pp failed)
13:40:42 is this a crash dump from the solaris guest, or from the omnios host?
13:40:58 the Solaris guest
13:41:53 The OmniOS host has noticed nothing
13:42:25 ::stacks (note plural) can be useful -- it'll 'summarize' all the different stacks
13:42:28 with a count
13:43:06 the above stack (without digging a bit more) looks like it might just be a periodic interrupt firing and being handled at the time the dump started
13:43:39 ok
13:43:41 if it's hung, presumably everything is probably sleeping
13:44:39 also ::cpuinfo might be interesting
13:45:30 the 'SWITCH' column shows how long it's been since the cpu has switched threads... if there's one that's been sitting there for a long time (the amount will stand out), that might suggest a stuck kernel thread
13:46:05 unfortunately, it's a bit of knowing what looks 'typical' vs. what doesn't in those stacks to see which one seems odd
13:46:40 e.g. if you see a driver sitting in its attach method, that's probably interesting
13:47:20 or stacks in the device tree code (ndi_xxxx)
13:47:45 vmcore.4> ::cpuinfo
13:47:45   ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
13:47:46    0 ffffe08000000000  1f    0    0  -1   no    no t-2    ffffe33000005ac0 (idle)
13:47:46    1 fffffffffc0d3c60  1b    0    0  -1   no    no t-1    ffffe33000c63ac0 (idle)
13:47:47    2 ffffe08000400000  1f    0    0  -1   no    no t-0    ffffe33000d8bac0 (idle)
13:47:47    3 ffffe08000600000  1f    1    0  -1   no    no t-0    ffffe33000e0aac0 (idle)
13:48:33 if you pastebin (please don't paste it here, it'd be too long) the ::stacks output, we could see if something looks off..
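For anyone following along, here is a minimal sketch of how the ::stacks and ::cpuinfo output could be captured into a file for a pastebin instead of being pasted into the channel. It assumes the dump files are named unix.4 and vmcore.4 in the current directory (the layout savecore produces can differ between releases), and the DUMP_SUFFIX variable is made up for this example. If mdb or the dump is not present, the script only prints what it would have run.

```shell
#!/bin/sh
# Sketch: run a fixed list of dcmds against a crash dump and save the output.
# Assumption: the dump lives in the current directory as unix.N / vmcore.N.
set -eu

dump=${DUMP_SUFFIX:-4}     # hypothetical variable: the N in vmcore.N
cmds=$(mktemp)             # dcmds to feed to mdb, one per line
out=$(mktemp)              # collected output, ready to upload

# The two dcmds suggested in the discussion above.
printf '%s\n' '::stacks' '::cpuinfo' > "$cmds"

if command -v mdb >/dev/null 2>&1 && [ -e "vmcore.$dump" ]; then
    # mdb reads dcmds from standard input when it is not a terminal.
    mdb "unix.$dump" "vmcore.$dump" < "$cmds" > "$out" 2>&1 || true
    echo "output saved in $out"
else
    echo "mdb or vmcore.$dump not found; would run: mdb unix.$dump vmcore.$dump < $cmds"
fi
```

Keeping the dcmd list in a file makes it easy to rerun the same set against the other dumps and compare the results.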
13:52:20 >::stacks https://pastebin.com/PJiHWZCE
13:56:23 the door_upcall() thread is _potentially_ interesting -- however the thread it's in appears to be something unique to solaris, so it's unclear what it's doing (so could be normal)
13:57:19 there is also a thread blocked waiting to read from a pipe
13:58:42 ::msgbuf and ::ps might help as well
14:01:03 ::msgbuf seems normal: https://pastebin.com/0MPUFqtZ
14:02:28 ::ps https://pastebin.com/uBjX6WiH
14:05:43 the normal running system has no stack with nfs4_async_unmount_manager_thread
14:08:09 it is neither an nfs client nor an nfs server
14:09:14 hmm..ok the panic message (I thought it was a forced panic)...
14:12:35 forced by deadman timer
14:22:48 sorry.. meetings for a while...
14:24:02 ok I am also
15:16:22 cgr, when you continue, can you let me know if it's better when you have a physical serial cable connected to the virtual machine? I believe that's how I originally tested the OS X debugging; a coworker at the time had a Macbook of some sort; I was running the Illumos kernel in a VirtualBox on Windows, and I believe I mention that in the tutorial, but nothing about a physical serial cable [it wasn't directly relevant to my case.]
15:17:02 I believe I also tested the command on a BeOS laptop -- or was it HaikuOS? I don't remember the details -- and I believe the OS X command also magically worked there.
15:17:11 You know, through the serial cable.
15:17:16 Via the USB protocol.
15:19:26 I now realize I've installed OS/2, actually ArcaOS, on that laptop; but forgot where it is. I'll have to do this experiment again via two virtual machines, the debuggee on VirtualBox, and the debugger on a different VirtualBox running OS/2. Possibly on a different physical machine, because I have two USB RS232 protocol cables, and a null modem.
15:20:15 Unless I find the different laptop where I've dual booted OS/2 [ArcaOS] and FreeDOS. I have to do this experiment there; via the docking feature and real RS-232 port.
15:20:40 This is getting too pornographic, I'll have to do this, and upload pictures to my Pornographic Operating System collection on Xvideos.
15:23:47 And I'll absolutely have to find my physical copy of the NextStep for PC, and see if that'll boot on either the same physical laptop, or a VirtualBox. You know, for pornographic purposes.
15:24:16 If that works, I'll have to label it for content warning: extreme.
15:25:30 cgr, please let me know what operating system you find best for the other and of the virtual, or physical serial cable.
15:25:37 *end
16:15:25 [illumos-gate] 15209 Fix C++ math function visibility -- Gordon Ross
16:15:25 [illumos-gate] 18138 Document C++ header guidance -- Gordon Ross
16:42:07 [illumos-gate] 18124 want test case for dladm create-vnic -- Hans Rosenfeld
17:10:03 ls
17:14:12 I have 3 other crash dumps initiated by the deadman timeout, not all have the nfs thread, all have the door_upcall and all
17:14:18 deadman: timed out after 300 seconds of deadman cyclic inactivity on CPU 2
17:14:44 every time it's CPU 2
17:16:28 IIRC, unless Oracle has made major changes in the area, we use the CPU's APIC to trigger an interrupt when the next cyclic is supposed to fire
17:17:11 since we don't have access to the solaris source, the message suggests that somehow that's not happening but some other watchdog process is managing to fire and notice the lack of cyclic activity
17:22:49 The strange thing is that the VM has been running for two years, yet the problems have only started appearing in the last two days, with no changes and no updates to Solaris or OmniOS.
17:23:12 that is odd...
17:23:31 almost makes me wonder if there's some sort of overflow or truncation happening somewhere
17:23:51 all crash dumps have the ::memstat problem
17:24:30 vmcore.1> ::memstat
17:24:31 mdb: unable to read 5 page_ts at ffffe10005f58d80: no mapping for address
17:24:31 mdb: can't walk rm: walker failed
17:32:50 are you viewing the dump using mdb on a solaris system or an omnios system?
17:37:12 Why not dbx on Linux?
17:37:41 https://www.oracle.com/tools/developerstudio/downloads/solaris-studio-v123-downloads.html
17:37:45 Still has Linux downloads.
17:38:24 because it doesn't have the modules to interpret internal kernel structures or similarly for any loaded kernel modules
17:38:27 mdb does
17:38:35 Ah, sorry about that.
17:39:15 and (unless oracle has changed things) the mdb that ships with the OS is probably going to have the modules that know best about the structures in a dump from the same one
17:39:50 though with CTF and depending on how the modules are written, you can often get away with using a different build because internals don't change too much typically
17:39:57 but it's not guaranteed
17:39:58 Yeah, you need to be careful if you update the operating system from one crash dump to the next.
17:40:10 I've had that experience on FreeBSD.
17:41:14 But that was, I believe, more a factor of the FreeBSD team deleting or overwriting the symbol files for the in-kernel data structures. I just couldn't retroactively create a compatible system, given the crash dump I had in my hands.
17:42:23 [Or rather, the budget for the debugging project didn't account for me creating a build system for FreeBSD, and seeing if I could recreate all the symbol files for this minor kernel update that was in the past.]
17:42:24 Oracle has not changed that - things that just rely on CTF data should be fine, but compiled mdb modules are best used with the same OS version as the crash dump (often they'll still work as many things don't change in each release, but that's relying on luck, not design)
18:39:48 [illumos-gate] 18153 loader: clean up smatch warnings in loader (pre-userboot) -- Toomas Soome
18:51:35 I did all on the same Solaris 11.4 instance where the dumps are from.
19:11:16 what arguments are being used w/ bhyve?
19:14:39 maybe with or without the -x option might be interesting to see if the problem persists
19:53:03 ... uugh.. i do wish we could split sd.c into smaller pieces w/o losing (an easy) git blame
19:53:45 I thought I had configuration for blame to follow things, but I don't?
19:53:58 is it just you have to use -M -C etc?
19:54:15 or does it just totally break the ancestry?
19:54:38 from what i tried, the only way I could get it to work is if you basically cp sd.c sd_copy.c; commit, then remove the bits from each (or some variant of that)
19:54:47 sounds likely :\
19:54:55 (don't need to compile sd_copy.c in that example, just need a commit with both)
19:55:24 the unix history repo does horrible things to make things like that work
19:55:32 dspinellis/unix-history I think
19:55:51 but at the cost of looking like that
20:00:49 unfortunately, i don't think having two commits like that would be received well
20:08:20 I don't judge such things, but I would weigh it against the benefits at least
20:08:43 but I also frequently require blame
20:09:43 yeah, I wouldn't want to do it without preserving blame..
20:09:59 i mean, you could work around it by checking out the prior version and then using blame
20:10:13 but that i think would be needlessly annoying
20:10:21 and i wouldn't want to throw it away either
20:10:25 esp for something like that...
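The copy-then-trim idea from 19:54:38 can be tried in a throwaway repository. The sketch below is only an illustration: big.c, big_copy.c and their contents are invented stand-ins rather than sd.c, and it relies on `git blame -C -C` (copy detection in the commit that creates the file) instead of assuming that plain blame would follow the copy.

```shell
#!/bin/sh
# Sketch: split a file in two steps so blame can still reach the original commit.
# All file names and contents are hypothetical.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"

# Commit 1: the original combined file.
cat > big.c <<'EOF'
int attach_routine_for_the_example_driver(void) { return (0); }
int detach_routine_for_the_example_driver(void) { return (0); }
int strategy_routine_for_the_example_driver(void) { return (0); }
int ioctl_routine_for_the_example_driver(void) { return (0); }
EOF
git add big.c
git commit -q -m "original combined file"
orig=$(git rev-parse HEAD)

# Commit 2: a verbatim copy, so one commit contains both files.
cp big.c big_copy.c
git add big_copy.c
git commit -q -m "copy big.c to big_copy.c"

# Commit 3: remove the bits from each, leaving every line in exactly one file.
head -n 2 big.c > big.c.tmp && mv big.c.tmp big.c
tail -n 2 big_copy.c > big_copy.c.tmp && mv big_copy.c.tmp big_copy.c
git commit -q -am "trim both halves of the split"

# With -C given twice, blame also looks for copies in the commit that created the file.
blamed=$(git blame -C -C --porcelain big_copy.c | head -n 1 | cut -d' ' -f1)
echo "big_copy.c line 1 is attributed to $blamed (original commit is $orig)"
```

The cost is the one raised in the discussion: the history carries an extra commit whose only purpose is to hold both copies.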
20:11:31 the workaround doesn't work because you have to know there was a prior combined version
20:11:44 the unfortunate git approach to these things means that you don't actually know when you're missing data, you just have to go check
20:11:51 (see also -M, -C, etc.)
20:46:30 I think the git-author approved way is that you should be using log -S/-G/etc. to find what you want, not blame the file it happens to be in right now.
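As a rough illustration of the `log -S` suggestion: the pickaxe option lists commits in which the number of occurrences of a string changes, so it finds where a line was introduced regardless of which file holds it today. The repository, file names and the `example_marker_fn` string below are invented for this sketch.

```shell
#!/bin/sh
# Sketch: find the commit that introduced a string, even after it moved files.
# Everything here is a hypothetical throwaway repository.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"

printf 'int alpha(void);\n' > a.c
git add a.c
git commit -q -m "add alpha"

printf 'int example_marker_fn(void);\n' >> a.c
git commit -q -am "add example_marker_fn"
intro=$(git rev-parse HEAD)

# Later the declaration moves from a.c to b.c.
printf 'int alpha(void);\n' > a.c
printf 'int example_marker_fn(void);\n' > b.c
git add a.c b.c
git commit -q -m "move example_marker_fn to b.c"

# Oldest commit in which the occurrence count of the string changed.
found=$(git log --reverse --format=%H -S'example_marker_fn' | head -n 1)
echo "string first appeared in $found"
```

The move commit is listed as well, since the count changes in both a.c and b.c there, so the search shows the line's later relocation along with its origin.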