-
aquamo4k
jbk: and others, re your talk about talking to mainframes, back in the early '90s we built a system that used a pho-printer queue to get data from SunOS to mainframes :-)
-
aquamo4k
lpd (or LPRng) was the translator :-)
-
jbk
i had another 'fun' mainframe experience..
-
jbk
they were doing a data migration which involved downloading whatever the mainframe equivalent of exported tables was, transferring them over a wan link and importing them into the new database (running on Solaris 10)
-
jbk
the group had originally decided that they should just mount that dataset over the WAN with NFS4
-
jbk
72 hours into their 48 hour transfer window (rehearsal) they realized they had a problem
-
jbk
also.. for some reason, the mainframe 'spontaneously' stopped accepting nfs4 mounts and they had to resort to nfs3
-
jbk
because at the time ssh still had fixed sized windows which limited its throughput, I had to concoct this Rube Goldberg-esque setup
-
jbk
where another system in the same building as the mainframe would instead use ftp, pipe it through gzip (default level), then pipe it through a connection across the wan to the destination system that'd uncompress and save the file
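A minimal sketch of the compress-on-one-side, uncompress-on-the-other idea, assuming the data arrives as a stream of byte chunks. The function names are hypothetical, and the ftp fetch and the WAN connection are stood in for by plain Python iterables; this is not the setup jbk actually built.

```python
import zlib


def gzip_stream(chunks, level=-1):
    """Compress an iterable of byte chunks into gzip-framed chunks.

    level=-1 is zlib's default level, matching plain `gzip`.
    wbits=31 asks zlib for the gzip container rather than raw zlib.
    """
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    # Emit whatever is still buffered, plus the gzip trailer.
    yield comp.flush()


def gunzip_stream(chunks):
    """Decompress an iterable of gzip-framed chunks back into bytes."""
    decomp = zlib.decompressobj(31)
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    tail = decomp.flush()
    if tail:
        yield tail


# Hypothetical stand-in for the exported table data read over ftp.
source = [b"row,%d\n" % i for i in range(1000)]
received = b"".join(gunzip_stream(gzip_stream(source)))
```

Neither side ever holds the whole file in memory, so the same shape works when the two generators sit on opposite ends of a socket.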
-
jbk
it took 4 hours
-
jbk
when the group was told 'ok, you can import everything now' they thought we were joking
-
jbk
hrm.. there's no way to display all longest prefix matching routes for a given IP, is there?
-
jbk
route get IP just picks one (when there's more than one)
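For illustration, a small sketch of what "show all longest prefix matching routes" would mean, using Python's `ipaddress` module over a hypothetical in-memory route table. The table contents and the function name are assumptions; this does not query the kernel routing table or reproduce what `route get` does internally.

```python
import ipaddress


def longest_prefix_matches(routes, ip):
    """Return every (destination, gateway) pair tied for the longest match."""
    addr = ipaddress.ip_address(ip)
    candidates = []
    for dest, gateway in routes:
        net = ipaddress.ip_network(dest)
        # Only compare like with like (IPv4 vs IPv6).
        if net.version == addr.version and addr in net:
            candidates.append((net, gateway))
    if not candidates:
        return []
    best = max(net.prefixlen for net, _ in candidates)
    return [(str(net), gw) for net, gw in candidates if net.prefixlen == best]


# Hypothetical route table with two equally specific routes.
routes = [
    ("0.0.0.0/0", "192.0.2.1"),
    ("10.0.0.0/8", "192.0.2.2"),
    ("10.1.0.0/16", "192.0.2.3"),
    ("10.1.0.0/16", "192.0.2.4"),
]
matches = longest_prefix_matches(routes, "10.1.2.3")
```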
-
sommerfeld
jbk: yeah, I think I've noticed that.
-
rzezeski
I had a zfs deadman panic today on an otherwise idle system running DEBUG bits of my latest cxgbe work. This is my first time ever seeing such a panic. zpool scrub showed no issues. Looking at the stack I see that intrd had an ioctl in progress and was sitting in apix_wait_till_seen(). I'm wondering if IRM and/or DEBUG bits can induce this failure mode?
-
richlowe
I ran debug bits for a while, and things were not fun, but they did not crash
-
richlowe
rzezeski: apix was on the stack above ZFS, like we'd pinned an i/o thread for interrupt?
-
richlowe
or at the time it died interrupt management things were happening (but in another thread)
-
richlowe
(I can't tell which you mean)
-
rzezeski
-
jbk
i'm actually impressed intrd was doing something...
-
richlowe
oh, it's actually r'ing ints
-
richlowe
I have never seen that
-
richlowe
rzezeski: I guess I would look at IRM first
-
richlowe
though the combination of IRM|DEBUG can only be more uncommon
-
rzezeski
Yea, I have a lot of stuff on my plate, so I don't necessarily have time to learn how all this works. That's why I thought I'd get a spot check in here and see if IRM could put one in this position given that I have no reason to suspect an actual hardware/software issue on this host. I guess if I could figure out what interrupt it was trying to remove/remap, and if that interrupt was related to one of my nvme devices...
-
richlowe
I think that's not that complicated, but someone like rm would be the one to try as far as remembering the x86 interrupt support well
-
richlowe
maybe not well enough for intrd though...
-
rzezeski
And I know DEBUG loves to throw random curveballs like trying to unload modules randomly, I figured this could be another thing like that. But I guess it could also point to a legit bug that's just a lot harder to hit in non-debug
-
richlowe
if I were wanting to make forward progress, I'd disable intrd etc, and keep DEBUG
-
rzezeski
I don't remember ever enabling it in the first place. I honestly know almost nothing about it.
-
rzezeski
okay, yea looks like it's enabled on my stock omnios bits
-
richlowe
yeah, it's enabled by default, but usually not particularly active
-
richlowe
or at all active
-
rzezeski
richlowe: I have a way of activating things, the computer gods hate me
-
rzezeski
fucking yak farm over here
-
rzezeski
okay, mapped the apix_vector_t argument back to the device info, looks like it's for ixgbe, which makes sense since it participates in IRM
-
richlowe
rmustacc: do you remember why `CTASSERT` doesn't stringify `x`?
-
rzezeski
So with this deadman panic there should be an outstanding I/O, right? I wonder if that would give me any clues?
-
jbk
yeah
-
jbk
well mostly
-
richlowe
Yes there is outstanding I/O, maybe it would give you clues
-
jbk
technically, it could get stuck in the zfs i/o scheduler without ever being issued to the disk
-
jbk
though it'd be unlikely to happen on an idle system
-
richlowe
I think if we are in the mood to remap interrupts the system isn't, to our view, idle
-
jbk
since the deadman basically looks at when it gets sent to the scheduler and then when it completes
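A toy model of that bookkeeping, assuming only what is said above: an I/O is stamped when it is handed to the scheduler, cleared when it completes, and anything outstanding for too long is flagged. The class and method names are hypothetical and this is not the ZFS code.

```python
class DeadmanTracker:
    """Track when each I/O entered the scheduler and flag overdue ones."""

    def __init__(self, limit_s):
        self.limit_s = limit_s
        self.outstanding = {}  # io_id -> time it was sent to the scheduler

    def enqueue(self, io_id, now):
        self.outstanding[io_id] = now

    def complete(self, io_id):
        self.outstanding.pop(io_id, None)

    def overdue(self, now):
        # An I/O counts as overdue whether it is stuck in the queue or
        # was issued to the device and never completed.
        return sorted(
            io_id
            for io_id, queued_at in self.outstanding.items()
            if now - queued_at > self.limit_s
        )


tracker = DeadmanTracker(limit_s=10)
tracker.enqueue("a", now=0)
tracker.enqueue("b", now=5)
tracker.complete("a")
```

Because the clock starts at enqueue time, the model cannot tell a lost completion from an I/O that never left the queue, which is the distinction being drawn here.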
-
rzezeski
richlowe: of the 16 CPUs, 14 were idle, two were in sched
-
jbk
(and if a disk is busy or slow enough, you can get into a situation where zios to higher LBAs get starved because zios to lower numbered LBAs keep cutting in front of them)
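The starvation pattern can be shown with a toy queue that always services the lowest LBA first. The single-queue model, the one-arrival-per-service rate, and the names are assumptions made for illustration, not a description of the real ZFS I/O scheduler.

```python
import heapq


def simulate(ticks, high_lba=1_000_000):
    """Service one request per tick while a new low-LBA request arrives."""
    queue = [high_lba]  # the request that is waiting at a high LBA
    serviced = []
    for tick in range(ticks):
        # A new request at a low LBA arrives and sorts ahead of the old one.
        heapq.heappush(queue, tick)
        # The slow disk only manages one request per tick: the lowest LBA.
        serviced.append(heapq.heappop(queue))
    return serviced, queue


serviced, still_waiting = simulate(1000)
```

As long as arrivals at lower LBAs keep pace with the service rate, the high-LBA request stays at the back of the queue indefinitely.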
-
rzezeski
I was literally out for a walk with nothing of significance running on the host (AFAIK)
-
richlowe
rmustacc: gist.github.com/richlowe/398a1ddf8a0db48685e7445f736fd8d8 makes the sky turn clear and the sun shine brighter, so I'm just worried about something adverse I haven't thought of
-
rzezeski
::zfs_dbgmsg shows two "slow spa_sync" messages followed by a "SLOW IO".
-
richlowe
rzezeski: jclulow is someone I would talk to
-
richlowe
but that's almost always true, unfortunately
-
jbk
i wonder if maybe the HBA lost a completion interrupt (though I would think it would then time out)
-
rzezeski
oh and I see in the deadman function that right before it panics it writes this SLOW IO message to the debug log
-
jbk
which the HBAs are usually pretty (sometimes extremely) vocal about I/O timeouts
-
jbk
did you do any sort of scrub prior?
-
rzezeski
nope
-
rzezeski
my pool consists of two NVMe M.2 devices
-
jbk
we had a customer that kept hitting that
-
jbk
because the i/o throttling in the scrub code needs help
-
jbk
(for night court fans, I dubbed the drive behavior 'tortoise nervosa' :P)
-
jbk
that was contributing to it
-
rzezeski
for the morbidly curious I added more info from mdb/zdb in that gist, but at this point can't say I'm any closer to understanding why this would have happened, I guess I'll just have to write up an issue and move on
-
jbk
maybe i missed it, but what is that sched thread on cpu 10 (...d7c20) doing?
-
jbk
random question.. ISTR (though I could easily be confusing and misremembering).. didn't someone write a utility that'd print the device tree a bit like prtconf, but interactive?
-
sommerfeld
richlowe: well the good thing about CTASSERT is that if it compiles you don't have to worry about it again so it seems like a low-risk change...