-
yuripv
are SmartOS issues visible publicly? wondering about OS-6216 (TritonDataCenter/illumos-joyent 80caa2e)
-
rmustacc
-
fenix
→ OS-6216: VOP_ACCESS() use in sdev_readdir() leads to deadlock (Resolved) |
joyent/illumos-joyent 80caa2e
-
yuripv
rmustacc: thank you!
-
jlevon`
yuripv: oh god the horror
-
yuripv
jlevon`: why? :)
-
yuripv
we seem to be hitting that issue intermittently
-
jimklimov
got ZFS crashes:
pasteboard.co/1gedb0wwniJJ.png (after an update 2 days ago),
pasteboard.co/aBtHf4N3xvsg.png (from possibly a month-old OI, mid-December'ish)
-
jimklimov
A pool scrub from the Linux rescue ISO (the only one DigitalOcean lets you use) did not complain. I think the crashes may be triggered by znapzend activity, but can't vouch for that.
-
jimklimov
the later one is from OpenIndiana Hipster 2023.10 Version illumos-00f13cd38a 64-bit (per early boot prompt)
-
jimklimov
disabled znapzend and launched a scrub from OI... will see if it crashes or finds anything that illumos codebase dislikes...
-
jimklimov
so, illumos zpool scrub also passed well
-
otis
did you try to import without zpool cache file?
-
otis
i vaguely remember that i've seen mysterious crashes with solaris 11 zfs that were connected to zpool cache somehow
-
jimklimov
I think rpool gets imported without one; there's nowhere to read it from yet?
-
otis
ah ok, it goes on rpool.
-
otis
(i did not check your pastebins)
-
jimklimov
but FWIW, /etc/zfs/zpool.cache is updated today on the running system (probably during boot, roughly same age as uptime)
-
otis
i was looking for screenshot from "my" crashes (i surely have some, but can't seem to find them)
-
jlevon`
yuripv: just sdev is a locking horrorshow
-
jimklimov
Checked whether the fault was related to the running znapzend service, and indeed it seems to have been due to its clean-up of older snapshots (in a manual run with debug), as seen in the topmost line at
pasteboard.co/OdhyUQgFbPoG.png. The `zpool scrub` is clean, however...
-
jimklimov
Locking up during reboot (claimed in earlier discussions to be something between illumos kernel and QEMU running the VM) is annoying - gotta power-cycle the VM or suffer an outage of about an hour or two until it does reboot unattended. But to get such screenshots it is actually helpful :D
-
jimklimov
I'll try to clean that snapshot away manually, maybe with the DigitalOcean Linux recovery ISO, in hopes that it is the only such troublemaker
-
jimklimov
that was fun... I told DigitalOcean to turn off the VM, it said it did. I changed the boot device and turned it on, and the console still showed the ZFS stacktrace and the attempt to reboot. Power-cycling helped, though...
-
jimklimov
so, linux zfs had no qualms dropping that snapshot chain
-
jimklimov
trying znapzend from OI again...
-
jimklimov
yeah, seems that one snap was somehow special; now znapzend steps through a lot of other obsolete iterations without crashing the kernel
-
yuripv
jlevon: luckily for me, you already fixed at least this issue :D
-
jlevon
true
-
jimklimov
yep, so not the whole run of znapzend went well
-
jimklimov
having competing implementations is useful :)
-
sommerfeld
jimklimov: traceback mentions zio_ddt_free - do you have dedup enabled on any filesystems in the pool?
-
jclulow
-
fenix
→ BUG 14526: illumos guest hangs on reboot under QEMU 6.0.0 (New)
-
jclulow
You are most welcome to debug it!
-
jclulow
It's not surprising that it fucks up the DO control plane FWIW. I expect they're just proxying a request to reboot through to QEMU through the monitor protocol, and it's pretty clear that QEMU has a ridiculous bug that wedges the whole emulation stack in this case.
-
jclulow
If you turn it off, they probably go and actually kill the process and then fire it up on a new server when you start it up again.