00:57:28 updating neovim to 0.10.3, and working around the crashing problem: https://github.com/omniosorg/omnios-extra/pull/1573
01:05:46 Well... work happened and just getting back to bhyve zones. Trying to install Rocky 9.5 from ISO. I get as far as connecting to VNC and seeing the GRUB boot menu. Then I'm getting a crash and not sure what to do next: "VM-entry failure due to invalid guest state" // Has anyone run into this error with RHEL/EL/etc? Not sure if it's a UEFI vs BIOS boot issue?
03:05:59 found a tutorial for getting Void running, and their settings are very close to mine. So, maybe a hardware issue. I'll try the same config on a different machine tomorrow.
03:06:26 i guess i could try void and follow the tutorial exactly and then (if it works) start bisecting the differences.
12:17:43 Hello! I'm using an OmniOS r151046 machine and I wanted to upgrade the storage, replacing the current NVMe drives with a pair of 2TB Samsung 990 Pros. The drives work fine under Linux and are reported correctly in the BIOS, but OmniOS doesn't see them at all. Neither format nor iostat -eE shows anything, and nvmeadm list says there are no controllers. The old NVMe disk (SK hynix BC711 HFM512GD3JX013N) is
12:17:49 shown correctly in all three. Is the drive not supported in illumos?
12:45:51 pilonsi - if they are not showing up that seems like a bug.
12:46:21 Is there anything useful in the logs? Try `dmesg | grep nvme` or similar.
12:47:54 also prtconf -vD
12:48:26 I am just checking which OmniOS version got support for newer NVMe standards, but I think it was backported.
12:53:36 In OmniOS r151046 there is no support for NVMe 2.x, so that's a likely reason.
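The triage commands suggested above can be run as one pass. This is an illustrative sketch of illumos admin commands (run with root privileges or via pfexec); device instance numbers such as nvme3 vary per machine:

```shell
# Kernel messages from the nvme driver at attach time
dmesg | grep -i nvme

# Device tree with bound drivers; look for the controller's PCI node
# and whether the 'nvme' driver attached to it
prtconf -vD | less

# Controllers and namespaces as seen by the NVMe admin tool
pfexec nvmeadm list

# Per-device error counters and identity strings
iostat -eE
```

If the controller's PCI node appears in prtconf output with no driver bound, that points at the driver refusing to attach rather than a hardware detection problem.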
If that turns out to be the reason, try uncommenting the 'strict-version=0' line in /kernel/drv/nvme.conf
12:53:43 andyf the new drives are NVMe 2.0, not sure about the old ones
12:53:51 Aaah so yeah that must be it
12:54:35 Relaxing that strict version check should be fine, or you could upgrade to r151052 if you don't have a particular desire to stay on the LTS version
12:54:46 May's r151054 will be the next LTS.
12:55:52 I was going to say that: if skipping the version check might cause stability issues, I may just upgrade. I have no special requirement to be on that version; I just chose the LTS when I set up the system last year and so far had no reason to upgrade
12:56:15 Thanks!
13:21:45 You're welcome. The reason I say it should be fine is that relaxing that restriction mostly involved a validation exercise. When the actual change was integrated it was effectively just bumping the maximum allowed version in the driver - https://github.com/illumos/illumos-gate/commit/f7ce6a749c79a107447356b130df647b6b60d449
13:22:19 Having said that, there have been several further improvements in the nvme driver since r46, so newer should be better!
13:23:42 andyf: I'm on r52 now and it's improved, but the drive is still not fully recognized
13:23:51 iostat -eE shows the following
13:24:08 blkdev3 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
13:24:08 Model: (null) Revision: (null) Serial No: (null)
13:24:08 Size: 2000,40GB <2000398934016 bytes>
13:24:08 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
13:24:10 Illegal Request: 0 Predictive Failure Analysis: 0
13:24:21 while nvmeadm list
13:25:03 nvmeadm: failed to open 'nvme3': failed to open device path /devices/pci@0,0/pci8086,6c0@1b/pci144d,a801@0:devctl: No such device or address: NVME_ERR_OPEN_DEV (libnvme: 0xd, sys: 6)
13:27:25 Anything in the logs this time?
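The r151046 workaround mentioned above would be a config fragment along these lines (the exact comment text shipped in nvme.conf may differ; a reboot is the simplest way to have the driver re-read its configuration):

```
# /kernel/drv/nvme.conf (fragment, illustrative)
#
# Uncomment to disable the strict NVMe specification version check,
# allowing the driver to attach to controllers that report a newer
# spec version (e.g. NVMe 2.0) than the driver was validated against.
strict-version=0;
```

As noted in the discussion, this is only needed on releases that predate the version bump; on r151052 and later the driver accepts NVMe 2.x controllers without it.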
(dmesg)
13:31:37 Seems like it initializes ok
13:31:46 Jan 22 14:20:26 localhost nvme: [ID 259564 kern.info] nvme3: NVMe spec version 2.0
13:31:49 Jan 22 14:20:26 localhost nvme: [ID 259564 kern.info] nvme2: NVMe spec version 1.3
13:31:52 Jan 22 14:20:26 localhost blkdev: [ID 643073 kern.info] NOTICE: blkdev2: dynamic LUN expansion
13:31:55 Jan 22 14:20:26 localhost blkdev: [ID 348765 kern.info] Block device: blkdev@wACE42E00156E7F45,0, blkdev2
13:31:59 Jan 22 14:20:26 localhost genunix: [ID 936769 kern.info] blkdev2 is /pci@0,0/pci8086,6ac@1b,4/pci1c5c,174a@0/blkdev@wACE42E00156E7F45,0
13:32:02 Jan 22 14:20:26 localhost genunix: [ID 408114 kern.info] /pci@0,0/pci8086,6ac@1b,4/pci1c5c,174a@0/blkdev@wACE42E00156E7F45,0 (blkdev2) online
13:32:06 Jan 22 14:20:26 localhost blkdev: [ID 643073 kern.info] NOTICE: blkdev3: dynamic LUN expansion
13:32:09 Jan 22 14:20:26 localhost blkdev: [ID 348765 kern.info] Block device: blkdev@w0025384A41402A94,0, blkdev3
13:32:12 Jan 22 14:20:26 localhost genunix: [ID 936769 kern.info] blkdev3 is /pci@0,0/pci8086,6c0@1b/pci144d,a801@0/blkdev@w0025384A41402A94,0
13:32:15 Jan 22 14:20:26 localhost genunix: [ID 408
13:34:28 Can you try a reconfigure with `devfsadm -vC` ?
13:39:21 actually maybe it's a hardware issue :(
13:41:34 I've rebooted and now it's not detected at all. The drive that was in that slot failed previously as well (it wasn't recognized). I thought it was a drive issue, since Linux could see both Samsungs, but now with these inconsistencies I'm starting to think that maybe there is some issue with the connector itself.
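For reference, the reconfigure step suggested above rebuilds the /dev namespace (run with root privileges; this sketch assumes pfexec is available):

```shell
# -v prints each device link added or removed,
# -C cleans up dangling /dev links for devices no longer present
pfexec devfsadm -vC
```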
I'm going to swap drives and check in case it's that, and I'll report back
14:21:43 so after moving the disk to the slot that hasn't given any issues I still see the same
14:22:17 and running devfsadm -vC removes a lot of device nodes related to disks:
14:24:03 https://pastebin.com/bPjFhmj7
14:24:14 Here is the full output of devfsadm
14:31:33 Can you go into the kernel debugger with `mdb -k` and then enter `::prtconf -d nvme -i 3 | ::devinfo -d | ::print nvme_t` at the prompt, and pastebin that output?
14:32:46 to exit the debugger again you can press ^d or enter $q
14:38:35 or ::quit
14:39:48 Or $%
14:40:00 (no idea why)
14:41:58 the $ prefix originates from adb
14:42:45 if you have the good old "Panic!" book, it is all based on adb :)
14:44:22 after I once blew up a production DB system with mdb -kw, I switched to using ::quit for safety :D
14:47:24 I didn't use adb much, I used SCAT a few times on panics. Maybe $% was adb then, but it was definitely carried over to mdb but not documented in the man page or help output.
15:00:12 andyf: I'm going to try that. Meanwhile I swapped the disk, in case it was faulty, with the other copy I had. Then it wasn't recognized at all. Put the 1st Samsung back in and it wasn't recognized at all either (while before it was). So I loaded Linux and it is recognized fine. I have populated the other slot with the second disk and Linux does recognize both. So I would rule out hardware issues then.
15:00:20 So now, with both disks connected to the board
15:00:51 I see one of them properly recognized, but the other one does not appear at all...
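The debugger session suggested above looks roughly like this as a transcript (the instance number 3 matches the nvme3 device in this thread and is just an example; note mdb -k is read-only, unlike the mdb -kw mentioned later):

```
# pfexec mdb -k
> ::prtconf -d nvme -i 3 | ::devinfo -d | ::print nvme_t
... (driver soft-state structure for that instance) ...
> $q
```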
No strange things in dmesg either
15:02:14 It's strange, because whether it is recognized completely, not at all, or partially is not consistent, and I'm not sure what makes it change
15:02:54 ~ > pfexec nvmeadm list
15:02:55 nvme4: model: Samsung SSD 990 PRO 2TB, serial: S7PJNJ0XA07900D, FW rev: 4B2QJXD7, NVMe v2.0 nvme4/1 (c3t0025384A41402A94d0): Size = 1.82 TB, Capacity = 1.82 TB, Used = 0 B
15:08:34 https://pastebin.com/hUtJHdv0
15:09:26 This is the output of the debugger command. Although now the disk that does appear is fully recognized. I changed -i to 4 since it is nvme4 now
15:16:48 prtconf -d shows both disks, but one of them appears as (retired)
15:18:37 https://pastebin.com/dhjrrkhy
15:20:20 Could it be fmadm? Just learned about it
15:20:42 But maybe when the old disk failed it disabled that port on the mobo?
15:21:18 Since it failed I see a "NOTICE: One or more I/O devices have been retired" message at boot, which still appears now; I thought it was related to the ZFS pool the disk was part of
15:22:27 That's interesting. We can clear the retire store and see what happens.
15:22:53 Try moving '/etc/devices/retire_store' to one side and rebooting
15:27:46 Ok so I've moved the retire_store and now I see both of them, but the one in the slot that failed is still weird
15:27:55 ~ > pfexec nvmeadm list
15:27:56 nvmeadm: failed to open 'nvme3': failed to open device path /devices/pci@0,0/pci8086,6c0@1b/pci144d,a801@0:devctl: No such device or address: NVME_ERR_OPEN_DEV (libnvme: 0xd, sys: 6)
15:27:59 nvme4: model: Samsung SSD 990 PRO 2TB, serial: S7PJNJ0XA07900D, FW rev: 4B2QJXD7, NVMe v2.0 nvme4/1 (c3t0025384A41402A94d0): Size = 1.82 TB, Capacity = 1.82 TB, Used = 0 B
15:29:54 ~ > https://pastebin.com/PPmg6WZE
15:30:13 This is the mdb command for the nvme3 disk
15:32:52 https://pastebin.com/4E3GkUJD
15:33:45 And this is the fmadm output; the device it listed as faulted is the same one that is not being read properly
15:34:07 So maybe there is indeed something wrong with the motherboard?
15:37:43 .join #illumos
15:38:35 / not .
15:40:00 Got it working :)
15:40:13 It was fmadm
15:41:10 I had renamed the retire_store file to something else, but I noticed it had been recreated, and the drive still appeared as retired
15:41:35 (Renaming it is what I understood by moving it to one side; maybe you meant something else)
15:42:06 Oh, you had to do something like `fmadm repaired`?
15:42:09 But in any case, since I'm trying to actually replace the faulted disk, I marked it as replaced in fmadm, and both are now detected correctly
15:42:10 Yes!
15:42:55 Maybe there was a race condition between the driver loading and fmadm retiring the device
15:43:07 And sometimes it was retired before the driver loaded (so, not detected)
15:43:32 And sometimes after (so, listed as present, but when trying to access it later it appeared as missing)
15:43:35 Very interesting
15:43:46 illumos is indeed in another league compared to Linux :)
15:44:12 Thank you very much for your help
15:45:29 I'm glad you got it working!
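The resolution described above, sketched as a command sequence (the FMRI argument is a placeholder; `fmadm faulty` prints the real identifier for the retired device):

```shell
# List outstanding faults and note the UUID/FMRI of the retired device
pfexec fmadm faulty

# Tell FMA the faulted device has been physically replaced, so the
# retire is cleared and the driver is allowed to attach again
pfexec fmadm replaced '<fmri-or-uuid-from-fmadm-faulty>'

# (If the same physical device was fixed rather than swapped out,
# `fmadm repaired <fmri>` is the corresponding command.)
```

This also explains why moving retire_store aside alone didn't stick: FMA still considered the fault outstanding and recreated the retire entry until it was told the device had been replaced.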