05:58:54 <yuripv> are smartos issues visible publicly? wondering about the OS-6216 (https://github.com/TritonDataCenter/illumos-joyent/commit/80caa2e90432a6929d577979f0d62a577d583b69)
06:43:19 <rmustacc> yuripv: https://smartos.org/bugview/OS-6216
06:43:21 <fenix> → OS-6216: VOP_ACCESS() use in sdev_readdir() leads to deadlock (Resolved) | https://github.com/joyent/illumos-joyent/commit/80caa2e90432a6929d577979f0d62a577d583b69
06:43:28 <yuripv> rmustacc: thank you!
10:09:43 <jlevon`> yuripv: oh god the horror
10:47:28 <yuripv> jlevon`: why? :)
10:47:48 <yuripv> we seem to be hitting that issue intermittently
11:48:22 <jimklimov> got ZFS crashes: https://pasteboard.co/1gedb0wwniJJ.png (after an update 2 days ago), https://pasteboard.co/aBtHf4N3xvsg.png (from possibly a month-old OI, mid-December'ish)
11:49:47 <jimklimov> A pool scrub from linux rescue ISO (the only one DigitalOcean lets me use) did not complain. I think the crashes may be triggered by znapzend activity, but can't vouch for that.
11:53:20 <jimklimov> the later one is from OpenIndiana Hipster 2023.10 Version illumos-00f13cd38a 64-bit (per early boot prompt)
11:55:24 <jimklimov> disabled znapzend and launched a scrub from OI... will see if it crashes or finds anything that illumos codebase dislikes...
12:24:00 <jimklimov> so, illumos zpool scrub also passed well
12:24:16 <otis> did you try to import without zpool cache file?
12:24:39 <otis> i vaguely remember that i've seen mysterious crashes with solaris 11 zfs that were connected to zpool cache somehow
12:26:14 <jimklimov> I think rpool gets imported without one - it's got nowhere to read it from yet?
12:26:31 <otis> ah ok, it goes on rpool.
12:26:43 <otis> (i did not check your pastebins)
12:27:23 <jimklimov> but FWIW, /etc/zfs/zpool.cache is updated today on the running system (probably during boot, roughly same age as uptime)
12:30:20 <otis> i was looking for screenshot from "my" crashes (i surely have some, but can't seem to find them)
12:35:26 <jlevon`> yuripv: just sdev is a locking horrorshow
14:07:08 <jimklimov> Checking that the fault was related to a running znapzend service - and indeed it seems to have been due to its clean-up of older snapshots (in manual run with debug), as seen in topmost line at https://pasteboard.co/OdhyUQgFbPoG.png . The `zpool scrub` is clean however...
14:09:10 <jimklimov> Locking up during reboot (claimed in earlier discussions to be something between illumos kernel and QEMU running the VM) is annoying - gotta power-cycle the VM or suffer an outage of about an hour or two until it does reboot unattended. But to get such screenshots it is actually helpful :D
14:13:14 <jimklimov> I'll try to clean that snapshot away manually, maybe with the DigitalOcean Linux recovery ISO, in hopes that it is the only such troublemaker
14:26:56 <jimklimov> that was fun... I told DigitalOcean to turn off the vm, it said it did. I changed the boot device and turned it on - and the console still showed the ZFS stacktrace and attempt to reboot. Power-cycling helped, though...
14:34:47 <jimklimov> so, linux zfs had no qualms dropping that snapshot chain
14:34:55 <jimklimov> trying znapzend from OI again...
14:53:15 <jimklimov> yeah, seems that one snap was somehow special; now znapzend steps through a lot of other obsolete iterations without crashing the kernel
15:12:50 <yuripv> jlevon: luckily for me, you already fixed at least this issue :D
15:22:47 <jlevon> true
16:02:00 <jimklimov> yep, so not the whole run of znapzend went well
16:02:29 <jimklimov> having competing implementations is useful :)
16:40:25 <sommerfeld> jimklimov: traceback mentions zio_ddt_free - do you have dedup enabled on any filesystems in the pool?
20:19:23 <jclulow> jimklimov: https://www.illumos.org/issues/14526
20:19:23 <fenix> → BUG 14526: illumos guest hangs on reboot under QEMU 6.0.0 (New)
20:19:27 <jclulow> You are most welcome to debug it!
20:20:40 <jclulow> It's not surprising that it fucks up the DO control plane FWIW.  I expect they're just proxying a request to reboot through to QEMU through the monitor protocol, and it's pretty clear that QEMU has a ridiculous bug that wedges the whole emulation stack in this case.
20:21:07 <jclulow> If you turn it off, they probably go and actually kill the process and then fire it up on a new server when you start it up again.