-
jclulow
updating neovim to 0.10.3, and working around the crashing problem:
omniosorg/omnios-extra #1573
-
justinw312
Well... work happened and I'm just getting back to bhyve zones. Trying to install Rocky 9.5 from ISO. I get as far as connecting to VNC and seeing the GRUB boot menu. Then I get a crash and I'm not sure what to do next: "VM-entry failure due to invalid guest state". Has anyone run into this error with RHEL/EL/etc? Not sure if it's a UEFI vs BIOS boot issue?
-
justinw312
Found a tutorial for getting Void running, and their settings are very close to mine. So, maybe a hardware issue. I'll try the same config on a different machine tomorrow.
-
justinw312
I guess I could try Void and follow the tutorial exactly and then (if it works) start bisecting the differences.
-
pilonsi
Hello! I'm using an OmniOS r151046 machine and I wanted to upgrade the storage, replacing the current NVMe drives with a pair of 2TB Samsung 990 Pros. The drives work fine under Linux and are reported correctly in the BIOS, but OmniOS doesn't see them at all. Neither format nor iostat -eE shows anything, and nvmeadm list says there are no controllers. The old NVMe disk (SK hynix BC711 HFM512GD3JX013N) is
-
pilonsi
shown correctly in all three. Is the drive not supported in illumos?
-
andyf
pilonsi - if they are not showing up that seems like a bug.
-
andyf
Is there anything useful in the logs? Try `dmesg | grep nvme` or similar.
-
tsoome
also prtconf -vD
-
andyf
I am just checking which OmniOS version got support for newer NVMe standards, but I think it was backported.
-
andyf
In OmniOS r151046 there is no support for NVMe 2.x, so that's a likely reason. If that turns out to be the reason, try uncommenting the 'strict-version=0' line in /kernel/drv/nvme.conf
-
pilonsi
andyf: the new drives are NVMe 2.0, not sure about the old ones
-
pilonsi
Aaah so yeah that must be it
-
andyf
Relaxing that strict version check should be fine, or you could upgrade to r151052 if you don't have a particular desire to stay on the LTS version
-
andyf
May's r151054 will be the next LTS.
-
pilonsi
I was going to say that if skipping the version check might cause stability issues, I may just upgrade. I have no special requirement to stay on that version; I just chose the LTS when I set up the system last year and so far had no reason to upgrade
-
pilonsi
Thanks!
-
andyf
You're welcome. The reason I say it should be fine is that relaxing that restriction mostly involved a validation exercise. When the actual change was integrated it was effectively just bumping the maximum allowed version in the driver -
illumos/illumos-gate f7ce6a7
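To make the two options concrete (relaxing the strict check versus running a driver with a higher maximum version), here is a toy model in Python. It is not the illumos driver code: the function name, the tuple comparison, and the 1.4 ceiling are all assumptions for illustration; only the `strict-version` setting and the "bump the maximum allowed version" idea come from the discussion above.

```python
from typing import NamedTuple


class NvmeVersion(NamedTuple):
    """NVMe specification version reported by a controller."""
    major: int
    minor: int


def controller_accepted(reported: NvmeVersion,
                        max_supported: NvmeVersion,
                        strict_version: bool = True) -> bool:
    """Toy model of a driver version gate (hypothetical, not the real driver)."""
    if reported <= max_supported:
        # Within the range the driver was validated against.
        return True
    # Newer than validated: refuse only when strict checking is enabled.
    return not strict_version


# Illustrative numbers only; the 1.4 ceiling is an assumption.
old_ceiling = NvmeVersion(1, 4)
new_ceiling = NvmeVersion(2, 0)
drive = NvmeVersion(2, 0)

print(controller_accepted(drive, old_ceiling))                        # strict check on
print(controller_accepted(drive, old_ceiling, strict_version=False))  # strict check relaxed
print(controller_accepted(drive, new_ceiling))                        # ceiling raised
```

In this model, both workarounds lead to the same result for a 2.0 drive: the controller is attached either because the check is skipped or because the ceiling now covers it.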
-
andyf
Having said that, there have been several further improvements in the nvme driver since r46 so newer should be better!
-
pilonsi
andyf: I'm on r52 now and it's improved, but the drive is still not fully recognized
-
pilonsi
iostat -eE shows the following
-
pilonsi
blkdev3 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
-
pilonsi
Model: (null) Revision: (null) Serial No: (null)
-
pilonsi
Size: 2000,40GB <2000398934016 bytes>
-
pilonsi
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
-
pilonsi
Illegal Request: 0 Predictive Failure Analysis: 0
-
pilonsi
while nvmeadm list
-
pilonsi
nvmeadm: failed to open 'nvme3': failed to open device path /devices/pci@0,0/pci8086,6c0@1b/pci144d,a801@0:devctl: No such device or address: NVME_ERR_OPEN_DEV (libnvme: 0xd, sys: 6)
-
andyf
Anything in the logs this time? (dmesg)
-
pilonsi
Seems like it initializes ok
-
pilonsi
Jan 22 14:20:26 localhost nvme: [ID 259564 kern.info] nvme3: NVMe spec version 2.0
-
pilonsi
Jan 22 14:20:26 localhost nvme: [ID 259564 kern.info] nvme2: NVMe spec version 1.3
-
pilonsi
Jan 22 14:20:26 localhost blkdev: [ID 643073 kern.info] NOTICE: blkdev2: dynamic LUN expansion
-
pilonsi
Jan 22 14:20:26 localhost blkdev: [ID 348765 kern.info] Block device: blkdev@wACE42E00156E7F45,0, blkdev2
-
pilonsi
Jan 22 14:20:26 localhost genunix: [ID 936769 kern.info] blkdev2 is /pci@0,0/pci8086,6ac@1b,4/pci1c5c,174a@0/blkdev@wACE42E00156E7F45,0
-
pilonsi
Jan 22 14:20:26 localhost genunix: [ID 408114 kern.info] /pci@0,0/pci8086,6ac@1b,4/pci1c5c,174a@0/blkdev@wACE42E00156E7F45,0 (blkdev2) online
-
pilonsi
Jan 22 14:20:26 localhost blkdev: [ID 643073 kern.info] NOTICE: blkdev3: dynamic LUN expansion
-
pilonsi
Jan 22 14:20:26 localhost blkdev: [ID 348765 kern.info] Block device: blkdev@w0025384A41402A94,0, blkdev3
-
pilonsi
Jan 22 14:20:26 localhost genunix: [ID 936769 kern.info] blkdev3 is /pci@0,0/pci8086,6c0@1b/pci144d,a801@0/blkdev@w0025384A41402A94,0
-
pilonsi
Jan 22 14:20:26 localhost genunix: [ID 408
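When scanning a long dmesg paste like this one, it can help to pull out just the spec version each controller reported. A minimal Python sketch, assuming the message format matches the "NVMe spec version" lines quoted above; the function and variable names are made up for illustration:

```python
import re

# Matches e.g. "nvme3: NVMe spec version 2.0" anywhere in a log line.
SPEC_VERSION_RE = re.compile(r"\b(nvme\d+): NVMe spec version (\d+)\.(\d+)")


def spec_versions(lines):
    """Map controller instance name -> (major, minor) from kernel log lines."""
    found = {}
    for line in lines:
        match = SPEC_VERSION_RE.search(line)
        if match:
            found[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return found


# Sample lines copied from the log excerpt above.
sample = [
    "Jan 22 14:20:26 localhost nvme: [ID 259564 kern.info] nvme3: NVMe spec version 2.0",
    "Jan 22 14:20:26 localhost nvme: [ID 259564 kern.info] nvme2: NVMe spec version 1.3",
    "Jan 22 14:20:26 localhost blkdev: [ID 643073 kern.info] NOTICE: blkdev2: dynamic LUN expansion",
]

versions = spec_versions(sample)
for name, (major, minor) in sorted(versions.items()):
    print(f"{name}: NVMe {major}.{minor}")
```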
-
andyf
Can you try a reconfigure with `devfsadm -vC` ?
-
pilonsi
actually maybe it's a hardware issue :(
-
pilonsi
I've rebooted and now it's not detected at all. The drive that was in that slot previously failed as well (it wasn't recognized). I thought it was a drive issue since Linux could see both Samsungs, but with these inconsistencies I'm starting to think that maybe there is some issue with the connector itself. I'm going to swap drives to check whether that's it, and I'll report back
-
pilonsi
so after moving the disk to the slot that hasn't given any issues I still see the same
-
pilonsi
and running devfsadm -vC removes a lot of device nodes related to disks:
-
pilonsi
Here is the full output of devfsadm
-
andyf
Can you go into the kernel debugger with `mdb -k` and then enter `::prtconf -d nvme -i 3 | ::devinfo -d | ::print nvme_t` at the prompt, and pastebin that output?
-
andyf
to exit the debugger again you can press ^d or enter $q
-
tsoome
or ::quit
-
andyf
Or $%
-
andyf
(no idea why)
-
tsoome
The $ prefix originates from adb
-
tsoome
if you have the good old "Panic!" book, it is all based on adb:)
-
tsoome
After I once blew up a production DB system with mdb -kw, I switched to using ::quit for safety :D
-
andyf
I didn't use adb much; I used SCAT a few times on panics. Maybe $% was from adb then; it was definitely carried over to mdb, but it's not documented in the man page or help output.
-
pilonsi
andyf: I'm going to try that. Meanwhile I swapped the disk for the other one I had, in case it was faulty. Then it wasn't recognized at all. I put the first Samsung back in and it wasn't recognized at all either (while before it was). So I booted Linux and it is recognized fine. I have populated the other slot with the second disk and Linux recognizes both. So I would rule out hardware issues then.
-
pilonsi
So now, with both disks connected to the board
-
pilonsi
I see one of them properly recognized, but the other one does not appear at all... No strange things in dmesg either
-
pilonsi
It's strange because whether it is recognized completely, not at all, or partially is not consistent, and I'm not sure what makes it change
-
pilonsi
~ > pfexec nvmeadm list
-
pilonsi
nvme4: model: Samsung SSD 990 PRO 2TB, serial: S7PJNJ0XA07900D, FW rev: 4B2QJXD7, NVMe v2.0 nvme4/1 (c3t0025384A41402A94d0): Size = 1.82 TB, Capacity = 1.82 TB, Used = 0 B
-
pilonsi
This is the output of the debugger command. Although now the disk that does appear is fully recognized. I changed -i to 4 since it is nvme4 now
-
pilonsi
prtconf -d shows both disks, but one of them appears as (retired)
-
pilonsi
Could it be fmadm? Just learned about it
-
pilonsi
But maybe when the old disk failed it disabled that port in the mobo?
-
pilonsi
Since it failed I see a "Notice: One or more i/o devices have been retired" message at boot, which still appears now. I thought it was related to the ZFS pool the disk was part of
-
andyf
That's interesting. We can clear the retire store and see what happens.
-
andyf
Try moving '/etc/devices/retire_store' to one side and rebooting
-
pilonsi
Ok so I've moved the retire_store and now I see both of them, but the one in the slot that failed is weird still
-
pilonsi
~ > pfexec nvmeadm list
-
pilonsi
nvmeadm: failed to open 'nvme3': failed to open device path /devices/pci@0,0/pci8086,6c0@1b/pci144d,a801@0:devctl: No such device or address: NVME_ERR_OPEN_DEV (libnvme: 0xd, sys: 6)
-
pilonsi
nvme4: model: Samsung SSD 990 PRO 2TB, serial: S7PJNJ0XA07900D, FW rev: 4B2QJXD7, NVMe v2.0 nvme4/1 (c3t0025384A41402A94d0): Size = 1.82 TB, Capacity = 1.82 TB, Used = 0 B
-
pilonsi
This is the mdb command for the nvme3 disk
-
pilonsi
And this is the fmadm output, the device it listed as failed is the same one that is not being read properly
-
pilonsi
So maybe there is indeed something wrong with the motherboard?
-
GrayGhost
.join #illumos
-
GrayGhost
/ not .
-
pilonsi
Got it working :)
-
pilonsi
It was fmadm
-
pilonsi
I had renamed the retire_store file to something else, but I noticed it had been recreated, and the drive still appeared as retired
-
pilonsi
(Renaming it is what I understood by moving it to one side, maybe you meant something else)
-
andyf
Oh, you had to do something like `fmadm repaired` ?
-
pilonsi
But in any case, since I'm actually trying to replace the faulted disk, I marked it as replaced in fmadm, and both are now detected correctly
-
pilonsi
Yes!
-
pilonsi
Maybe there was a race condition between the driver loading and fmadm retiring the device
-
pilonsi
And sometimes it was retired before the driver loaded (So, not detected)
-
pilonsi
And sometimes after (So, listed as present, but when trying to access it later it appeared as missing)
-
pilonsi
Very interesting
-
pilonsi
illumos is indeed in another league compared to Linux :)
-
pilonsi
Thank you very much for your help
-
andyf
I'm glad you got it working!