13:01:29 jbk: and others, re your talk about talking to mainframes: back in the early '90s we built a system that used a pho-printer queue to get data from SunOS to mainframes :-)
13:02:09 lpd (or lprNG) was the translator :-)
13:29:38 I had another 'fun' mainframe experience..
13:30:49 they were doing a data migration which involved downloading whatever the mainframe equivalent of exported tables was, transferring them over a WAN link, and importing them into the new database (running on Solaris 10)
13:31:17 the group had originally decided that they should just mount that dataset over the WAN with NFSv4
13:31:55 72 hours into their 48-hour transfer window (rehearsal) they realized they had a problem
13:32:23 also.. for some reason, the mainframe 'spontaneously' stopped accepting NFSv4 mounts and they had to resort to NFSv3
13:33:34 because at the time ssh still had fixed-size windows, which limited its throughput, I had to concoct this Rube Goldberg-esque setup
13:34:41 where another system in the same building as the mainframe would instead use ftp, pipe it through gzip (default level), then pipe it through a connection across the WAN to the destination system, which would uncompress and save the file
13:35:15 it took 4 hours
13:35:37 when the group was told 'ok, you can import everything now' they thought we were joking
17:01:11 hrm.. there's no way to display all longest-prefix-matching routes for a given IP, is there?
17:01:27 route get IP just picks one (when there's more than one)
17:05:22 jbk: yeah, I think I've noticed that.
20:53:24 I had a ZFS deadman panic today on an otherwise idle system running DEBUG bits of my latest cxgbe work. This is my first time ever seeing such a panic. zpool scrub showed no issues. Looking at the stack I see that intrd had an ioctl in progress and was sitting in apix_wait_till_seen(). I'm wondering if IRM and/or DEBUG bits can induce this failure mode?
20:56:02 I ran DEBUG bits for a while, and things were not fun, but they did not crash
20:56:36 rzezeski: apix was on the stack above ZFS, like we'd pinned an i/o thread for an interrupt?
20:57:27 or, at the time it died, interrupt management things were happening (but in another thread)?
20:57:36 (I can't tell which you mean)
20:58:48 https://gist.github.com/rzezeski/175396b3d8423dff66db9b2bfec28515
21:01:03 I'm actually impressed intrd was doing something...
21:03:05 oh, it's actually r'ing ints
21:03:08 I have never seen that
21:03:44 rzezeski: I guess I would look at IRM first
21:04:07 though the combination of IRM|DEBUG can only be more uncommon
21:07:12 Yeah, I have a lot of stuff on my plate, so I don't necessarily have time to learn how all this works. That's why I thought I'd get a spot check in here and see if IRM could put one in this position, given that I have no reason to suspect an actual hardware/software issue on this host. I guess if I could figure out what interrupt it was trying to remove/remap, and if that interrupt was related to one of my nvme devices...
21:08:14 I think that's not that complicated, but someone like rm would be the one to try as far as remembering the x86 interrupt support well
21:08:20 maybe not well enough for intrd though...
21:08:20 And I know DEBUG loves to throw random curveballs, like trying to unload modules randomly; I figured this could be another thing like that. But I guess it could also point to a legit bug that's just a lot harder to hit in non-DEBUG
21:08:41 if I were wanting to make forward progress, I'd disable intrd etc., and keep DEBUG
21:09:31 I don't remember ever enabling it in the first place. I honestly know almost nothing about it.
21:10:46 okay, yeah, looks like it's enabled on my stock OmniOS bits
21:17:32 yeah, it's enabled by default, but usually not particularly active
21:17:41 or at all active
21:17:54 richlowe: I have a way of activating things; the computer gods hate me
21:18:19 fucking yak farm over here
21:24:18 okay, mapped the apix_vector_t argument back to the device info; looks like it's for ixgbe, which makes sense since it participates in IRM
21:30:32 rmustacc: do you remember why `CTASSERT` doesn't stringify `x`?
21:32:34 So with this deadman panic there should be an outstanding I/O, right? I wonder if that would give me any clues?
21:33:12 yeah
21:33:23 well mostly
21:33:39 Yes, there is outstanding I/O; maybe it would give you clues
21:33:48 technically, it could get stuck in the ZFS i/o scheduler without ever being issued to the disk
21:34:16 though that'd be unlikely to happen on an idle system
21:34:49 I think if we are in the mood to remap interrupts, the system isn't, to our view, idle
21:34:56 since the deadman basically looks at when it gets sent to the scheduler and then when it completes
21:35:55 richlowe: of the 16 CPUs, 14 were idle, two were in sched
21:36:02 (and if a disk is busy or slow enough, you can get into a situation where zios to higher LBAs get starved because zios to lower-numbered LBAs keep cutting in front of them)
21:36:43 I was literally out for a walk, with nothing of significance running on the host (AFAIK)
21:37:23 rmustacc: https://gist.github.com/richlowe/398a1ddf8a0db48685e7445f736fd8d8 makes the sky turn clear and the sun shine brighter, so I'm just worried about something adverse I haven't thought of
21:38:00 ::zfs_dbgmsg shows two "slow spa_sync" messages followed by a "SLOW IO".
21:38:30 rzezeski: jclulow is someone I would talk to
21:38:38 but that's almost always true, unfortunately
21:39:28 I wonder if maybe the HBA lost a completion interrupt (though I would think it would then time out)
21:39:32 oh, and I see in the deadman function that right before it panics it writes this SLOW IO message to the debug log
21:39:48 HBAs are usually pretty (sometimes extremely) vocal about I/O timeouts
21:40:14 did you do any sort of scrub prior?
21:40:18 nope
21:40:41 my pool consists of two NVMe M.2 devices
21:41:01 we had a customer that kept hitting that
21:41:27 because the i/o throttling in the scrub code needs help
21:42:13 (for Night Court fans, I dubbed the drive behavior 'tortoise nervosa' :P)
21:42:20 that was contributing to it
22:14:39 for the morbidly curious, I added more info from mdb/zdb in that gist, but at this point I can't say I'm any closer to understanding why this would have happened; I guess I'll just have to write up an issue and move on
22:21:04 maybe I missed it, but what is that sched thread on cpu 10 (...d7c20) doing?
22:44:50 random question.. ISTR (though I could easily be confusing and misremembering).. didn't someone write a utility that'd print the device tree a bit like prtconf, but interactively?
23:14:53 richlowe: well, the good thing about CTASSERT is that if it compiles you don't have to worry about it again, so it seems like a low-risk change...