00:57:28 updating neovim to 0.10.3, and working around the crashing problem: https://github.com/omniosorg/omnios-extra/pull/1573
01:05:46 Well... work happened and just getting back to bhyve zones. Trying to install Rocky 9.5 from ISO. I get as far as connecting to VNC and seeing the GRUB boot menu. Then I'm getting a crash and not sure what to do next: "VM-entry failure due to invalid guest state" // Has anyone run into this error with RHEL/EL/etc? Not sure if it's a UEFI vs BIOS boot issue?
03:05:59 found a tutorial for getting Void running, and their settings are very close to mine. So, maybe a hardware issue. I'll try the same config on a different machine tomorrow.
03:06:26 i guess i could try void and follow the tutorial exactly and then (if it works) start bisecting the differences.
12:17:43 Hello! I'm using an OmniOS r151046 machine and I wanted to upgrade the storage, replacing the current NVMe drives with a pair of 2TB Samsung 990 Pros. The drives work fine under Linux and are reported correctly in the BIOS, but OmniOS doesn't see them at all. Neither format nor iostat -eE shows anything, and nvmeadm list says there are no controllers. The old NVMe disk (SK hynix BC711 HFM512GD3JX013N) is
12:17:49 shown correctly in all three. Is the drive not supported in illumos?
12:45:51 pilonsi - if they are not showing up that seems like a bug.
12:46:21 Is there anything useful in the logs? Try `dmesg | grep nvme` or similar.
12:47:54 also prtconf -vD
12:48:26 I am just checking which OmniOS version got support for newer NVMe standards, but I think it was backported.
12:53:36 In OmniOS r151046 there is no support for NVMe 2.x, so that's a likely reason.
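The triage commands suggested above can be run as one pass. This is an illustrative sketch of illumos admin commands (run with root privileges or via pfexec); device instance numbers such as nvme3 vary per machine:

```shell
# Kernel messages from the nvme driver at attach time
dmesg | grep -i nvme

# Device tree with bound drivers; look for the controller's PCI node
# and whether the 'nvme' driver attached to it
prtconf -vD | less

# Controllers and namespaces as seen by the NVMe admin tool
pfexec nvmeadm list

# Per-device error counters and identity strings
iostat -eE
```

If the controller's PCI node appears in prtconf output with no driver bound, that points at the driver refusing to attach rather than a hardware detection problem.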
If that turns out to be the reason, try uncommenting the 'strict-version=0' line in /kernel/drv/nvme.conf
12:53:43 andyf the new drives are NVMe 2.0, not sure about the old ones
12:53:51 Aaah so yeah that must be it
12:54:35 Relaxing that strict version check should be fine, or you could upgrade to r151052 if you don't have a particular desire to stay on the LTS version
12:54:46 May's r151054 will be the next LTS.
12:55:52 I was going to say that: if skipping the version check might cause stability issues, I may just upgrade. I have no special requirement to be on that version; I just chose the LTS when I set up the system last year and so far had no reason to upgrade
12:56:15 Thanks!
13:21:45 You're welcome. The reason I say it should be fine is that relaxing that restriction mostly involved a validation exercise. When the actual change was integrated it was effectively just bumping the maximum allowed version in the driver - https://github.com/illumos/illumos-gate/commit/f7ce6a749c79a107447356b130df647b6b60d449
13:22:19 Having said that, there have been several further improvements in the nvme driver since r46, so newer should be better!
13:23:42 andyf: I'm on r52 now and it's improved, but the drive is still not fully recognized
13:23:51 iostat -eE shows the following
13:24:08 blkdev3 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
13:24:08 Model: (null) Revision: (null) Serial No: (null)
13:24:08 Size: 2000,40GB <2000398934016 bytes>
13:24:08 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
13:24:10 Illegal Request: 0 Predictive Failure Analysis: 0
13:24:21 while nvmeadm list
13:25:03 nvmeadm: failed to open 'nvme3': failed to open device path /devices/pci@0,0/pci8086,6c0@1b/pci144d,a801@0:devctl: No such device or address: NVME_ERR_OPEN_DEV (libnvme: 0xd, sys: 6)
13:27:25 Anything in the logs this time?
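The r151046 workaround mentioned above would be a config fragment along these lines (the exact comment text shipped in nvme.conf may differ; a reboot is the simplest way to have the driver re-read its configuration):

```
# /kernel/drv/nvme.conf (fragment, illustrative)
#
# Uncomment to disable the strict NVMe specification version check,
# allowing the driver to attach to controllers that report a newer
# spec version (e.g. NVMe 2.0) than the driver was validated against.
strict-version=0;
```

As noted in the discussion, this is only needed on releases that predate the version bump; on r151052 and later the driver accepts NVMe 2.x controllers without it.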
(dmesg)
13:31:37 Seems like it initializes ok
13:31:46 Jan 22 14:20:26 localhost nvme: [ID 259564 kern.info] nvme3: NVMe spec version 2.0
13:31:49 Jan 22 14:20:26 localhost nvme: [ID 259564 kern.info] nvme2: NVMe spec version 1.3
13:31:52 Jan 22 14:20:26 localhost blkdev: [ID 643073 kern.info] NOTICE: blkdev2: dynamic LUN expansion
13:31:55 Jan 22 14:20:26 localhost blkdev: [ID 348765 kern.info] Block device: blkdev@wACE42E00156E7F45,0, blkdev2
13:31:59 Jan 22 14:20:26 localhost genunix: [ID 936769 kern.info] blkdev2 is /pci@0,0/pci8086,6ac@1b,4/pci1c5c,174a@0/blkdev@wACE42E00156E7F45,0
13:32:02 Jan 22 14:20:26 localhost genunix: [ID 408114 kern.info] /pci@0,0/pci8086,6ac@1b,4/pci1c5c,174a@0/blkdev@wACE42E00156E7F45,0 (blkdev2) online
13:32:06 Jan 22 14:20:26 localhost blkdev: [ID 643073 kern.info] NOTICE: blkdev3: dynamic LUN expansion
13:32:09 Jan 22 14:20:26 localhost blkdev: [ID 348765 kern.info] Block device: blkdev@w0025384A41402A94,0, blkdev3
13:32:12 Jan 22 14:20:26 localhost genunix: [ID 936769 kern.info] blkdev3 is /pci@0,0/pci8086,6c0@1b/pci144d,a801@0/blkdev@w0025384A41402A94,0
13:32:15 Jan 22 14:20:26 localhost genunix: [ID 408
13:34:28 Can you try a reconfigure with `devfsadm -vC` ?
13:39:21 actually maybe it's a hardware issue :(
13:41:34 I've rebooted and now it's not detected at all. The drive that was in that slot failed previously as well (it wasn't recognized). I thought it was a drive issue, since Linux could see both Samsungs, but now with these inconsistencies I'm starting to think that maybe there is some issue with the connector itself.
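For reference, the reconfigure step suggested above rebuilds the /dev namespace (run with root privileges; this sketch assumes pfexec is available):

```shell
# -v prints each device link added or removed,
# -C cleans up dangling /dev links for devices no longer present
pfexec devfsadm -vC
```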
I'm going to swap drives and check in case it's that, and I'll report back
14:21:43 so after moving the disk to the slot that hasn't given any issues I still see the same
14:22:17 and running devfsadm -vC removes a lot of device nodes related to disks:
14:24:03 https://pastebin.com/bPjFhmj7
14:24:14 Here is the full output of devfsadm
14:31:33 Can you go into the kernel debugger with `mdb -k` and then enter `::prtconf -d nvme -i 3 | ::devinfo -d | ::print nvme_t` at the prompt, and pastebin that output?
14:32:46 to exit the debugger again you can press ^d or enter $q
14:38:35 or ::quit
14:39:48 Or $%
14:40:00 (no idea why)
14:41:58 the $ prefix originates from adb
14:42:45 if you have the good old "Panic!" book, it is all based on adb :)
14:44:22 after I once blew up a production DB system with mdb -kw, I switched to using ::quit for safety :D
14:47:24 I didn't use adb much, I used SCAT a few times on panics. Maybe $% was adb then, but it was definitely carried over to mdb but not documented in the man page or help output.
15:00:12 andyf: I'm going to try that. Meanwhile I swapped the disk, in case it was faulty, with the other copy I had. Then it wasn't recognized at all. Put the 1st Samsung back in and it wasn't recognized at all either (while before it was). So I loaded Linux and it is recognized fine. I have populated the other slot with the second disk and Linux does recognize both. So I would rule out hardware issues then.
15:00:20 So now, with both disks connected to the board
15:00:51 I see one of them properly recognized, but the other one does not appear at all...
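The debugger session suggested above looks roughly like this as a transcript (the instance number 3 matches the nvme3 device in this thread and is just an example; note mdb -k is read-only, unlike the mdb -kw mentioned later):

```
# pfexec mdb -k
> ::prtconf -d nvme -i 3 | ::devinfo -d | ::print nvme_t
... (driver soft-state structure for that instance) ...
> $q
```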
No strange things in dmesg either
15:02:14 It's strange, because whether it is recognized completely, not at all, or partially is not consistent, and I'm not sure what makes it change
15:02:54 ~ > pfexec nvmeadm list
15:02:55 nvme4: model: Samsung SSD 990 PRO 2TB, serial: S7PJNJ0XA07900D, FW rev: 4B2QJXD7, NVMe v2.0 nvme4/1 (c3t0025384A41402A94d0): Size = 1.82 TB, Capacity = 1.82 TB, Used = 0 B
15:08:34 https://pastebin.com/hUtJHdv0
15:09:26 This is the output of the debugger command. Although now the disk that does appear is fully recognized. I changed -i to 4 since it is nvme4 now
15:16:48 prtconf -d shows both disks, but one of them appears as (retired)
15:18:37 https://pastebin.com/dhjrrkhy
15:20:20 Could it be fmadm? Just learned about it
15:20:42 But maybe when the old disk failed it disabled that port on the mobo?
15:21:18 Since it failed I see a "NOTICE: One or more I/O devices have been retired" message at boot, which still appears now; I thought it was related to the ZFS pool the disk was part of
15:22:27 That's interesting. We can clear the retire store and see what happens.
15:22:53 Try moving '/etc/devices/retire_store' to one side and rebooting
15:27:46 Ok so I've moved the retire_store and now I see both of them, but the one in the slot that failed is still weird
15:27:55 ~ > pfexec nvmeadm list
15:27:56 nvmeadm: failed to open 'nvme3': failed to open device path /devices/pci@0,0/pci8086,6c0@1b/pci144d,a801@0:devctl: No such device or address: NVME_ERR_OPEN_DEV (libnvme: 0xd, sys: 6)
15:27:59 nvme4: model: Samsung SSD 990 PRO 2TB, serial: S7PJNJ0XA07900D, FW rev: 4B2QJXD7, NVMe v2.0 nvme4/1 (c3t0025384A41402A94d0): Size = 1.82 TB, Capacity = 1.82 TB, Used = 0 B
15:29:54 ~ > https://pastebin.com/PPmg6WZE
15:30:13 This is the mdb command for the nvme3 disk
15:32:52 https://pastebin.com/4E3GkUJD
15:33:45 And this is the fmadm output; the device it listed as faulted is the same one that is not being read properly
15:34:07 So maybe there is indeed something wrong with the motherboard?
15:37:43 .join #illumos
15:38:35 / not .
15:40:00 Got it working :)
15:40:13 It was fmadm
15:41:10 I had renamed the retire_store file to something else, but I noticed it had been recreated, and the drive still appeared as retired
15:41:35 (Renaming it is what I understood by moving it to one side; maybe you meant something else)
15:42:06 Oh, you had to do something like `fmadm repaired`?
15:42:09 But in any case, since I'm trying to actually replace the faulted disk, I marked it as replaced in fmadm, and both are now detected correctly
15:42:10 Yes!
15:42:55 Maybe there was a race condition between the driver loading and fmadm retiring the device
15:43:07 And sometimes it was retired before the driver loaded (so, not detected)
15:43:32 And sometimes after (so, listed as present, but when trying to access it later it appeared as missing)
15:43:35 Very interesting
15:43:46 illumos is indeed in another league compared to Linux :)
15:44:12 Thank you very much for your help
15:45:29 I'm glad you got it working!
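The resolution described above, sketched as a command sequence (the FMRI argument is a placeholder; `fmadm faulty` prints the real identifier for the retired device):

```shell
# List outstanding faults and note the UUID/FMRI of the retired device
pfexec fmadm faulty

# Tell FMA the faulted device has been physically replaced, so the
# retire is cleared and the driver is allowed to attach again
pfexec fmadm replaced '<fmri-or-uuid-from-fmadm-faulty>'

# (If the same physical device was fixed rather than swapped out,
# `fmadm repaired <fmri>` is the corresponding command.)
```

This also explains why moving retire_store aside alone didn't stick: FMA still considered the fault outstanding and recreated the retire entry until it was told the device had been replaced.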