13:01:29 jbk: and others, re your talk about talking to mainframes: back in the early '90s we built a system that used a pho-printer queue to get data from SunOS to mainframes :-)
13:02:09 lpd (or lprNG) was the translator :-)
13:29:38 I had another 'fun' mainframe experience..
13:30:49 they were doing a data migration which involved downloading whatever the mainframe equivalent of exported tables was, transferring them over a WAN link, and importing them into the new database (running on Solaris 10)
13:31:17 the group had originally decided that they should just mount that dataset over the WAN with NFSv4
13:31:55 72 hours into their 48-hour transfer window (rehearsal) they realized they had a problem
13:32:23 also.. for some reason, the mainframe 'spontaneously' stopped accepting NFSv4 mounts and they had to resort to NFSv3
13:33:34 because at the time ssh still had fixed-size windows, which limited its throughput, I had to concoct this Rube Goldberg-esque setup
13:34:41 where another system in the same building as the mainframe would instead use ftp, pipe it through gzip (default level), then pipe it through a connection across the WAN to the destination system, which would uncompress and save the file
13:35:15 it took 4 hours
13:35:37 when the group was told 'ok, you can import everything now' they thought we were joking
17:01:11 hrm.. there's no way to display all longest-prefix-matching routes for a given IP, is there?
17:01:27 route get IP just picks one (when there's more than one)
17:05:22 jbk: yeah, I think I've noticed that.
20:53:24 I had a ZFS deadman panic today on an otherwise idle system running DEBUG bits of my latest cxgbe work. This is my first time ever seeing such a panic. zpool scrub showed no issues. Looking at the stack I see that intrd had an ioctl in progress and was sitting in apix_wait_till_seen(). I'm wondering if IRM and/or DEBUG bits can induce this failure mode?
20:56:02 I ran DEBUG bits for a while, and things were not fun, but they did not crash
20:56:36 rzezeski: apix was on the stack above ZFS, like we'd pinned an i/o thread for an interrupt?
20:57:27 or, at the time it died, interrupt management things were happening (but in another thread)?
20:57:36 (I can't tell which you mean)
20:58:48 https://gist.github.com/rzezeski/175396b3d8423dff66db9b2bfec28515
21:01:03 I'm actually impressed intrd was doing something...
21:03:05 oh, it's actually r'ing ints
21:03:08 I have never seen that
21:03:44 rzezeski: I guess I would look at IRM first
21:04:07 though the combination of IRM|DEBUG can only be more uncommon
21:07:12 Yeah, I have a lot of stuff on my plate, so I don't necessarily have time to learn how all this works. That's why I thought I'd get a spot check in here and see if IRM could put one in this position, given that I have no reason to suspect an actual hardware/software issue on this host. I guess if I could figure out what interrupt it was trying to remove/remap, and if that interrupt was related to one of my nvme devices...
21:08:14 I think that's not that complicated, but someone like rm would be the one to try as far as remembering the x86 interrupt support well
21:08:20 maybe not well enough for intrd though...
21:08:20 And I know DEBUG loves to throw random curveballs, like trying to unload modules randomly; I figured this could be another thing like that. But I guess it could also point to a legit bug that's just a lot harder to hit in non-DEBUG
21:08:41 if I were wanting to make forward progress, I'd disable intrd etc., and keep DEBUG
21:09:31 I don't remember ever enabling it in the first place. I honestly know almost nothing about it.
21:10:46 okay, yeah, looks like it's enabled on my stock OmniOS bits
21:17:32 yeah, it's enabled by default, but usually not particularly active
21:17:41 or at all active
21:17:54 richlowe: I have a way of activating things; the computer gods hate me
21:18:19 fucking yak farm over here
21:24:18 okay, mapped the apix_vector_t argument back to the device info; looks like it's for ixgbe, which makes sense since it participates in IRM
21:30:32 rmustacc: do you remember why `CTASSERT` doesn't stringify `x`?
21:32:34 So with this deadman panic there should be an outstanding I/O, right? I wonder if that would give me any clues?
21:33:12 yeah
21:33:23 well mostly
21:33:39 Yes, there is outstanding I/O; maybe it would give you clues
21:33:48 technically, it could get stuck in the ZFS i/o scheduler without ever being issued to the disk
21:34:16 though that'd be unlikely to happen on an idle system
21:34:49 I think if we are in the mood to remap interrupts, the system isn't, to our view, idle
21:34:56 since the deadman basically looks at when it gets sent to the scheduler and then when it completes
21:35:55 richlowe: of the 16 CPUs, 14 were idle, two were in sched
21:36:02 (and if a disk is busy or slow enough, you can get into a situation where zios to higher LBAs get starved because zios to lower-numbered LBAs keep cutting in front of them)
21:36:43 I was literally out for a walk, with nothing of significance running on the host (AFAIK)
21:37:23 rmustacc: https://gist.github.com/richlowe/398a1ddf8a0db48685e7445f736fd8d8 makes the sky turn clear and the sun shine brighter, so I'm just worried about something adverse I haven't thought of
21:38:00 ::zfs_dbgmsg shows two "slow spa_sync" messages followed by a "SLOW IO".
21:38:30 rzezeski: jclulow is someone I would talk to
21:38:38 but that's almost always true, unfortunately
21:39:28 I wonder if maybe the HBA lost a completion interrupt (though I would think it would then time out)
21:39:32 oh, and I see in the deadman function that right before it panics it writes this SLOW IO message to the debug log
21:39:48 HBAs are usually pretty (sometimes extremely) vocal about I/O timeouts
21:40:14 did you do any sort of scrub prior?
21:40:18 nope
21:40:41 my pool consists of two NVMe M.2 devices
21:41:01 we had a customer that kept hitting that
21:41:27 because the i/o throttling in the scrub code needs help
21:42:13 (for Night Court fans, I dubbed the drive behavior 'tortoise nervosa' :P)
21:42:20 that was contributing to it
22:14:39 for the morbidly curious, I added more info from mdb/zdb in that gist, but at this point I can't say I'm any closer to understanding why this would have happened; I guess I'll just have to write up an issue and move on
22:21:04 maybe I missed it, but what is that sched thread on cpu 10 (...d7c20) doing?
22:44:50 random question.. ISTR (though I could easily be confusing and misremembering).. didn't someone write a utility that'd print the device tree a bit like prtconf, but interactively?
23:14:53 richlowe: well, the good thing about CTASSERT is that if it compiles you don't have to worry about it again, so it seems like a low-risk change...