00:01:27 I left the poor guy at the datacenter at 4 in the morning
00:01:36 * antranigv is a horrible boss
00:03:47 any new message in /var/adm/messages ?
00:20:45 neirac nope :/ altho I asked ararat to go home now. I'll go tomorrow and check myself.
00:21:08 neirac what's the best NIC that I can get for basic support? We just need to run DNS and DHCP zones
01:28:26 antranigv check this: https://docs.tritondatacenter.com/private-cloud/hardware#network-interface-cards
01:29:28 antranigv there is also this list https://illumos.org/hcl/
07:08:05 [illumos-gate] 16202 swapfs_minfree defaults to an absurdly high value -- Bryan Cantrill
08:23:12 Hibernate to disk like in Linux, would be a very nice feature for OpenIndiana illumos.
10:20:43 antranigv I got stock overnight once, security guard forgot about me. Couldn't leave afterwards.
10:20:47 *stuck
10:21:06 On the upside, I got paid the entire time I was stuck so ...
10:21:43 sjorge omgggg same story here! :D and I was like... 16, so panicked like hell. then I said "fuck it, back to work" and delivered all of my weekly tasks in a night xD
10:28:16 I'm having a hard time understanding if illumos has SIGINFO (^T) or not.
10:52:14 It's there, not much actually uses it though (can't think of anything other than dd offhand)
11:00:46 see https://www.illumos.org/issues/4493
11:00:47 → FEATURE 4493: want siginfo (Resolved)
11:07:42 ptribble looks like I know what I want to contribute :))
19:05:53 sommerfeld: FWIW, I suspect you can just move to test and RTI on 16163. If you want to get more review that's fine, but I don't think it's needed!
19:39:07 jclulow: just wanted to let anyone else speak up on it. As I noted in the bug I've run the zfs test suite; there are a few failures but none in scrub-related tests.
19:39:32 probably could use someone with experience with the test suite looking over the fails to make sure I didn't miss anything.
19:40:17 I think that's pretty much on us to try and figure out at the RTI point generally, FWIW. It's a bit of a mess in there right now
19:40:42 Would love to have a reliably green board on those tests, but we're not quite there haha
20:07:53 i have a perhaps related fix as well but i've been drowning in customer issues, so I didn't have a chance to look at it closely
20:08:32 also arguably fallout from the sequential scrub work
20:08:55 in that we try to keep the distribution of I/Os across the pool roughly balanced
20:09:11 or rather there's supposed to be a limit on the # of outstanding zios per top level vdev
20:09:45 except that the way it's enforced means things can get extremely lopsided
20:10:18 (basically instead of looking at each top level vdev, it takes the per top level vdev limit * # top level vdevs and forces that total limit on the whole pool)
20:10:45 my change fixes that
20:12:23 the problem with that is since (aside from syncing zios) we issue zios sorted by (priority, LBA)
20:12:47 you can get starvation for zios to higher numbered LBAs
20:12:47 jbk: not that closely related (mine puts on the brakes for the "scan metadata for block pointers" side of sequential scrub if the system is short on memory) but probably would also help..
20:13:03 which normally isn't a problem
20:13:16 unless the queue for the top level vdev gets too large
20:13:28 which can happen because of the way that limit is enforced
20:13:43 at which point, it's possible the zio starvation ends up triggering the spa deadman
20:13:59 (since it looks at the time the zio enters the queue, and not when it's issued to the disk)
20:14:19 so if it's queued for 16 minutes, the instant it finally gets to run, *boom*
20:15:16 jbk: I was seeing pageout deadman triggers, not spa deadman.
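A toy model of the limit enforcement described above, as a rough sketch only. This is not ZFS code: `admit_pooled`, `admit_per_vdev`, `simulate`, and all the numbers are hypothetical, and it assumes no I/O ever completes. It contrasts applying (per top level vdev limit * # top level vdevs) to the whole pool with checking each top level vdev's own queue, using a workload where every I/O lands on one vdev.

```python
# Hypothetical model for illustration; none of these names exist in ZFS.

def admit_pooled(queues, target, per_vdev_limit):
    """Admit while the pool-wide total is under per_vdev_limit * number of vdevs."""
    return sum(queues) < per_vdev_limit * len(queues)


def admit_per_vdev(queues, target, per_vdev_limit):
    """Admit while the target vdev's own queue is under per_vdev_limit."""
    return queues[target] < per_vdev_limit


def simulate(admit, targets, n_vdevs, per_vdev_limit):
    """Try to queue one I/O per entry in targets; return the final queue depths.

    Assumption: nothing completes during the run, so queues only grow.
    """
    queues = [0] * n_vdevs
    for target in targets:
        if admit(queues, target, per_vdev_limit):
            queues[target] += 1
    return queues


# Skewed workload: all 100 I/Os are aimed at top level vdev 0.
skewed = [0] * 100

pooled = simulate(admit_pooled, skewed, n_vdevs=4, per_vdev_limit=10)
per_vdev = simulate(admit_per_vdev, skewed, n_vdevs=4, per_vdev_limit=10)

print("pooled limit:  ", pooled)
print("per-vdev limit:", per_vdev)
```

In this model the pooled check lets vdev 0 queue four times its nominal limit while the other vdevs sit idle, which is the lopsided case described in the log; the per-vdev check caps it at the nominal limit.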
20:15:48 yeah, we'd see that from snapshot activity on vmware vms :)
20:15:56 well snapshot and/or zfs diff
20:16:42 when vmware deletes one of its snapshots, your disk I/O is better measured in IOs per minute instead of per second
20:17:43 what really needs to happen is a backpressure mechanism in the zio pipeline
20:18:14 so if things are backing up, but still progressing, it can throttle stuff at the higher layers in zfs
20:18:44 AFAICT, zfs diff basically just issues all the zios as fast as it can and hopes the disks can keep up
20:18:59 and if they can't, it'll just keep doing it until it exhausts memory
20:29:20 backpressure and some sort of fair-share-scheduler-like thing.
20:31:03 annoyingly, we can get an event from the hypervisor (via vmtoolsd) when things start, but not when it ends
20:31:54 so we can quiesce things to avoid that from happening, but have no way to bring things back
20:48:28 That reminds me, might https://www.illumos.org/issues/15884 be related to bug 14526 or am I just triggering on QEMU being in both?
20:48:29 → BUG 15884: NVMe driver panics on xcp-ng when trying to enable volatile write cache (New)
20:54:13 no
20:54:30 that's omnios specific
20:55:00 but I suspect the PR I have up for it (those drivers aren't upstream yet) might help
20:55:56 the problem is that somehow the VM guest xcp-ng is presenting looks enough like a Hyper-V VM that the Hyper-V drivers are loading and get out of sorts when things aren't working as expected
21:06:01 oh he has two different problems in that ticket
21:06:18 the one in the description is a different one from what he reported in the comments
21:10:11 wait? I do?
21:10:16 * nomad goes back to re-read the ticket.
21:11:18 oh right, we updated the ticket after additional debugging.
21:11:25 it was long ago so I've forgotten all the details.
21:17:45 the xcp-ng is related to the hyper-v drivers in omnios
21:17:58 but the nvme on the physical box is arguably two issues
21:18:28 the namespace thing rmustacc mentioned, but also (as he mentioned) we shouldn't panic if the command fails
21:19:17 (a similar thing happened to me with a VMware VM in that their emulated NVMe -- much like the rest of their storage emulation -- didn't seem to like certain things it should, and caused a panic loop)
21:19:54 IIRC (mind you, 5 months ago so bad memory) all tests were done on hypervisors running XCP-ng.
21:20:37 supermicro bare-iron had spinning rust and the dell was NVMe.
21:20:42 but I won't swear to that now.
21:21:25 entry #4 seems to bear that out: using
21:21:26 boot -B disable-nvme=true,disable-hv_vmbus=true
21:21:26 from the bootloader gets a running installer on both the dell and SuperMicro hypervisors.
21:21:55 anyway, if it's not related it's not related.
21:22:27 I had completely forgotten about the ticket until today's conversation made it bubble back up from the recesses of memory.
21:24:06 I have a PR open with omnios (the people that'll review it were away for a few days) that adds Hyper-V Gen2VM support.. as part of that, the way the vmbus driver is instantiated is smarter
21:24:24 cool
21:24:34 basically instead of always trying and getting fooled (though i thought it looked at the hypervisor signature, which xcp-ng shouldn't be returning 'Microsoft')
21:24:36 If someone can help with xcp-ng, I did a bit more digging there, but wasn't able to really understand exactly why it was happening the way it was.
21:24:48 I'll be happy to test it when you're ready.
21:25:31 rmustacc: it looks like the vmbus driver thought it should attach and try to talk to the hypervisor
21:25:46 Oh, if that's easy, then yeah we can. The one thing I was still trying to figure out is what the actual underlying hypervisor was in xcp-ng.
21:26:07 ISTR it is Xen.
21:26:35 OK. I was having trouble finding the underlying nvme source.
21:27:11 for the vmbus panic, i'd need to dig into it, but my guess is that it's actually the hypercall that's causing things to crash
21:27:13 I wasn't sure if it was leveraging QEMU or something else there. nomad is it doing xpv or not?
21:27:27 https://xcp-ng.org/ says XenServer.
21:27:53 it works similar to xen in that you write the address of a physical page to an msr, and the hypervisor maps in a page of code that contains the hw-specific instructions for making the hypercall
21:28:00 rmustacc, I lack that information. Is there an easy way for me to check?
21:28:03 i'm guessing the msr values are probably different
21:28:24 * nomad is a sysadmin spread very, very thin so never digs deep unless he has to.
21:28:35 so then that page it's using is empty
21:29:00 so when it jumps to that page to make the hypercall
21:29:06 sadness ensues
21:30:27 nomad: No worries. If I can figure out an easy way to ask that, I'll reach out. I didn't realize you still had access to that.
21:30:41 I'll try to spin something on top of all the other NVMe work I have outstanding.
21:34:21 I have access on the SuperMicro hardware. It's our prod cluster.
22:34:25 i'd offer you an omnios build w/ the xcp-ng fixes (though as you found, you can just disable the driver to work around it), but i wouldn't have a good way to get you the image
22:45:58 I'm wrapping up work in that lab for the week. I'll be able to take a look again next week. Maybe we can find a way to do the exchange, depending on how large the build is.