00:01:27 <antranigv> I left the poor guy at the datacenter at 4 in the morning
00:01:36 * antranigv is a horrible boss
00:03:47 <neirac> any new message in /var/adm/messages ?
00:20:45 <antranigv> neirac nope :/ altho I asked ararat to go home now. I'll go tomorrow and check myself.
00:21:08 <antranigv> neirac what's the best NIC that I can get for basic support? We just need to run DNS and DHCP zones
01:28:26 <neirac> antranigv check this: https://docs.tritondatacenter.com/private-cloud/hardware#network-interface-cards 
01:29:28 <neirac> antranigv there is also this list https://illumos.org/hcl/
07:08:05 <gitomat> [illumos-gate] 16202 swapfs_minfree defaults to an absurdly high value -- Bryan Cantrill <bryan⊙oc>
08:23:12 <Aedil> Hibernate to disk like in Linux, would be a very nice feature for OpenIndiana illumos.
10:20:43 <sjorge> antranigv I got stock overnight once, security guard forgot about me. Couldn't leave afterwards.
10:20:47 <sjorge> *stuck
10:21:06 <sjorge> On the upside, I got paid the entire time I was stuck so ... 
10:21:43 <antranigv> sjorge omgggg same story here! :D and I was like... 16, so panicked like hell. then I said "fuck it, back to work" and delivered all of my weekly tasks in a night xD
10:28:16 <antranigv> I'm having a hard time understanding if illumos has SIGINFO (^T) or not.
10:52:14 <ptribble> It's there, not much actually uses it though (can't think of anything other than dd offhand)
11:00:46 <ptribble> see https://www.illumos.org/issues/4493
11:00:47 <fenix> → FEATURE 4493: want siginfo (Resolved)
11:07:42 <antranigv> ptribble looks like I know what I want to contribute :))
19:05:53 <jclulow> sommerfeld: FWIW, I suspect you can just move to test and RTI on 16163.  If you want to get more review that's fine, but I don't think it's needed!
19:39:07 <sommerfeld> jclulow: just wanted to let anyone else speak up on it.  As I noted in the bug I've run the zfs test suite; there are a few failures but none in scrub-related tests.
19:39:32 <sommerfeld> probably could use someone with experience with the test suite looking over the fails to make sure I didn't miss anything.
19:40:17 <jclulow> I think that's pretty much on us to try and figure out at the RTI point generally, FWIW.  It's a bit of a mess in there right now
19:40:42 <jclulow> Would love to have a reliably green board on those tests, but we're not quite there haha
20:07:53 <jbk> i have a perhaps related fix as well but i've been drowning in customer issues, so I didn't have a chance to look at it closely
20:08:32 <jbk> also arguably fallout from the sequential scrub work
20:08:55 <jbk> in that we try to keep the distribution of I/Os across the pool roughly balanced
20:09:11 <jbk> or rather there's supposed to be a limit on the # of outstanding zios per top level vdev
20:09:45 <jbk> except that the way it's enforced means things can get extremely lopsided
20:10:18 <jbk> (basically instead of looking at each top level vdev, it takes the per top level vdev limit * # top level vdevs and forces that total limit on the whole pool)
20:10:45 <jbk> my change fixes that
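A minimal sketch of the limit problem jbk describes above, as a toy model rather than ZFS code. The function names, the limit of 4, and the 3-vdev pool are all assumptions made for illustration, and the per-vdev check shown is only one plausible alternative, not necessarily what jbk's change does.

```python
# Hypothetical model, not ZFS code: all names and numbers are illustrative.
PER_VDEV_LIMIT = 4  # assumed cap on outstanding zios per top-level vdev


def admit_pool_wide(outstanding, vdev, limit=PER_VDEV_LIMIT):
    """Aggregate check as described in the chat: (limit * number of
    top-level vdevs) is enforced against the pool total, so the target
    vdev is never looked at."""
    return sum(outstanding) < limit * len(outstanding)


def admit_per_vdev(outstanding, vdev, limit=PER_VDEV_LIMIT):
    """Alternative check: each top-level vdev is capped on its own."""
    return outstanding[vdev] < limit


def issue(admit, targets, n_vdevs):
    """Offer one zio per entry in targets; return per-vdev outstanding counts."""
    outstanding = [0] * n_vdevs
    for vdev in targets:
        if admit(outstanding, vdev):
            outstanding[vdev] += 1
    return outstanding


# A skewed workload: 20 zios that all land on vdev 0 of a 3-vdev pool.
skewed = [0] * 20
lopsided = issue(admit_pool_wide, skewed, 3)
balanced = issue(admit_per_vdev, skewed, 3)
```

With the aggregate check, a single vdev can absorb the whole pool's budget while the other vdevs sit idle; with the per-vdev check it stops at its own cap.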
20:12:23 <jbk> the problem with that is since (aside from syncing zios) we issue zios sorted by (priority, LBA)
20:12:47 <jbk> you can get starvation for zios to higher numbered LBAs
20:12:47 <sommerfeld> jbk: not that closely related (mine puts on the brakes for the "scan metadata for block pointers" side of sequential scrub if the system is short on memory) but probably would also help..
20:13:03 <jbk> which normally isn't a problem
20:13:16 <jbk> unless the queue for the top level vdev gets too large
20:13:28 <jbk> which can happen because of the way that limit is enforced
20:13:43 <jbk> at which point, it's possible the zio starvation ends up triggering the spa deadman
20:13:59 <jbk> (since it looks at the time the zio enters the queue, and not when it's issued to the disk)
20:14:19 <jbk> so if it's queued for 16 minutes, the instant it finally gets to run, *boom*
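The starvation and deadman interaction described above can be sketched with a toy queue ordered by (priority, LBA). This is an illustration under stated assumptions, not the ZFS implementation: the one-zio-per-second service rate, the function name, and the threshold (borrowed from the "16 minutes" mentioned in the chat) are all hypothetical.

```python
import heapq

# Assumption: threshold borrowed from the "16 minutes" in the chat, in seconds.
DEADMAN_SECS = 16 * 60


def wait_of_high_lba_zio(busy_secs, high_lba=10**9):
    """Toy model: the device services one zio per second, and for busy_secs
    seconds a new low-LBA zio arrives every second. Returns how long the
    single high-LBA zio sat in the queue before it was issued."""
    queue = [(0, high_lba, 0)]  # (priority, LBA, enqueue time)
    now = 0
    while True:
        if now < busy_secs:
            # A fresh zio at a lower LBA always sorts ahead of the old one.
            heapq.heappush(queue, (0, now, now))
        _prio, lba, enqueued = heapq.heappop(queue)
        if lba == high_lba:
            return now - enqueued
        now += 1


waited = wait_of_high_lba_zio(DEADMAN_SECS)

# Age measured from enqueue time: the timer has already expired at issue.
trips_from_enqueue = waited >= DEADMAN_SECS
# Age measured from issue time: the zio has only just started.
trips_from_issue = 0 >= DEADMAN_SECS
```

In this model the high-LBA zio is never lost, only delayed, yet a timer that starts at enqueue treats it as hung the moment it is issued.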
20:15:16 <sommerfeld> jbk: I was seeing pageout deadman triggers, not spa deadman.
20:15:48 <jbk> yeah, we'd see that from snapshot activity on vmware vms :)
20:15:56 <jbk> well snapshot and/or zfs diff
20:16:42 <jbk> when vmware deletes one of its snapshots, your disk I/O is better measured in IOs per minute instead of per second
20:17:43 <jbk> what really needs to happen is a backpressure mechanism in the zio pipeline
20:18:14 <jbk> so if things are backing up, but still progressing, it can throttle stuff at the higher layers in zfs
20:18:44 <jbk> AFAICT, zfs diff basically just issues all the zios as fast as it can and hopes the disks can keep up
20:18:59 <jbk> and if they can't, it'll just keep doing it until it exhausts memory
20:29:20 <sommerfeld> backpressure and some sort of fair-share-scheduler-like thing.
20:31:03 <jbk> annoyingly, we can get an event from the hypervisor (via vmtoolsd) when things start, but not when it ends
20:31:54 <jbk> so we can quiesce things to avoid that from happening, but have no way to bring things back
20:48:28 <nomad> That reminds me, might https://www.illumos.org/issues/15884 be related to bug 14526 or am I just triggering on QEMU being in both?
20:48:29 <fenix> → BUG 15884: NVMe driver panics on xcp-ng when trying to enable volatile write cache (New)
20:54:13 <jbk> no
20:54:30 <jbk> that's omnios specific
20:55:00 <jbk> but I suspect the PR I have up for it (those drivers aren't upstream yet) might help
20:55:56 <jbk> the problem is that somehow the VM guest xcp-ng is presenting looks enough like a Hyper-V VM that the Hyper-V drivers are loading and get out of sorts when things aren't working as expected
21:06:01 <jbk> oh he has two different problems in that ticket
21:06:18 <jbk> the one in the description is a different one from what he reported in the comments
21:10:11 <nomad> wait? I do?
21:10:16 * nomad goes back to re-read the ticket.
21:11:18 <nomad> oh right, we updated the ticket after additional debugging.
21:11:25 <nomad> it was long ago so I've forgotten all the details.
21:17:45 <jbk> the xcp-ng is related to the hyper-v drivers in omnios
21:17:58 <jbk> but the nvme on the physical box is arguably two issues
21:18:28 <jbk> the namespace thing rmustacc mentioned, but also (as he mentioned) we shouldn't panic if the command fails
21:19:17 <jbk> (a similar thing happened to me with a VMware VM in that their emulated NVMe -- much like the rest of their storage emulation -- didn't seem to like certain things it should, and caused a panic loop)
21:19:54 <nomad> IIRC (mind you, 5 months ago so bad memory) all tests were done on hypervisors running XCP-ng.
21:20:37 <nomad> supermicro bare-iron had spinning rust and the dell was nVME.
21:20:42 <nomad> but I won't swear to that now.
21:21:25 <nomad> entry #4 seems to bear that out: using
21:21:26 <nomad> boot -B disable-nvme=true,disable-hv_vmbus=true
21:21:26 <nomad> from the bootloader gets a running installer on both the dell and SuperMicro hypervisors.
21:21:55 <nomad> anyway, if it's not related it's not related.
21:22:27 <nomad> I had completely forgotten about the ticket until today's conversation made it bubble back up from the recesses of memory.
21:24:06 <jbk> I have a PR open with omnios (the people that'll review it were away for a few days) that adds Hyper-V Gen2VM support.. as part of that, the way the vmbus driver is instantiated is smarter
21:24:24 <nomad> cool
21:24:34 <jbk> basically instead of always trying and getting fooled (though i thought it looked at the hypervisor signature, which xcp-ng shouldn't be returning 'Microsoft')
21:24:36 <rmustacc> If someone can help with xcp-ng, I did a bit more digging there, but wasn't able to really understand exactly why it was happening the way it was.
21:24:48 <nomad> I'll be happy to test it when you're ready.
21:25:31 <jbk> rmustacc: it looks like the vmbus driver thought it should attach and try to talk to the hypervisor
21:25:46 <rmustacc> Oh, if that's easy, then yeah we can. The one thing I was still trying to figure out is what the actual underlying hypervisor was in xcp-ng.
21:26:07 <nomad> ISTR it is Xen.
21:26:35 <rmustacc> OK. I was having trouble finding the underlying nvme source.
21:27:11 <jbk> for the vmbus panic, i'd need to dig into it, but my guess is that it's actually the hypercall that's causing things to crash
21:27:13 <rmustacc> I wasn't sure if it was leveraging QEMU or something else there. nomad is it doing xpv or not?
21:27:27 <nomad> https://xcp-ng.org/ says XenServer.
21:27:53 <jbk> it works similar to xen in that you write the address of a physical page to an msr, and the hypervisor maps in a page of code that contains the hw-specific instructions for making the hypercall
21:28:00 <nomad> rmustacc, I lack that information. Is there an easy way for me to check?
21:28:03 <jbk> i'm guessing the msr values are probably different
21:28:24 * nomad is a sysadmin spread very, very thin so never digs deep unless he has to.
21:28:35 <jbk> so then that page it's using is empty
21:29:00 <jbk> so when it jumps to that page to make the hypercall
21:29:06 <jbk> sadness ensues
21:30:27 <rmustacc> nomad: No worries. If I can figure out an easy way to ask that, I'll reach out. I didn't realize you still had access to that.
21:30:41 <rmustacc> I'll try to spin something on top of all the other NVMe work I have outstanding.
21:34:21 <nomad> I have access on the SuperMicro hardware. It's our prod cluster.
22:34:25 <jbk> i'd offer you an omnios build w/ the xcp-ng fixes (though as you found, you can just disable the driver to work around it), but i wouldn't have a good way to get you the image
22:45:58 <nomad> I'm wrapping up work in that lab for the week. I'll be able to take a look again next week. Maybe we can find a way to do the exchange, depending on how large the build is.