-
nomad
figure I might as well ask here as well. I'm trying to install OmniOS 151046 (latest) in XCP-ng. I'm trying to boot the installer but it just powers off as soon as it starts configuring devices.
-
rmustacc
Do you have access to the hypervisor logs at all? Otherwise I would consider booting with kmdb.
-
nomad
I should. It's XCP-ng. Just have to figure out where they're stashing that specific log info.
-
rmustacc
The fact that it's powering off is a bit surprising. But clearly it's upset at something we did. So I'd either use kmdb to figure out what that is or see if the logs have a bit of help.
-
rmustacc
Presumably this is using the hvm as opposed to xpv kernel?
-
nomad
I've never used kmdb but I'm happy to take a walk through if you are willing.
-
nomad
There's a warning popping up just before it crashes. I'm trying to catch it.
-
nomad
"WARNING: illegal PCI request: offset = 100, size = 4"
-
nomad
about 2 seconds later it powers down.
-
rmustacc
If it's 2s later, it's likely unrelated, but hard to say.
-
nomad
looking at the video, "configuring devices." shows less than a second before the shutdown.
-
rmustacc
Sure, there's a lot of stuff there.
-
nomad
I can't count frames but it's certainly well under a second.
-
rmustacc
Sure, but even 10ms is a lot of time for the computer to do things.
-
rmustacc
Anyways, I'm not going to be able to walk you through everything today, but I would toggle into the boot menu, ask for some amount of verbose settings and then set the kmdb bit to maybe 'on panic'.
-
» nomad nods
-
nomad
The kmdb options I have are loaded, on NMI, on boot, on boot/NMI, or off.
-
nomad
I'm going with boot/NMI
-
rmustacc
I think what I was thinking of was loaded.
-
nomad
I'll try that after this boot.
-
rmustacc
But if you do on boot, you'll need to continue with ':c'.
-
nomad
I'm in the kmdb prompt, should I reboot with loaded or proceed?
-
nomad
k
-
nomad
Lots for me to type. Do you want me to proceed or are you busy?
-
rmustacc
I mean, you should proceed and figure out what the last thing you see is and then focus on the xcp-ng log.
-
rmustacc
If it says we triple faulted, that'd be one reason we power off and then we'll end up back in kmdb.
-
nomad
panic[cpu1]/thread=fffffffe0011e43c20: programming error: invalid NS/format in cmd ffffffe0bd3b780d40
-
nomad
(I suspect I didn't get those hex strings correct)
-
rmustacc
Don't worry about the hex. It sounds like you ended up back in kmdb?
-
nomad
yes,
-
nomad
last line is "kmdb_enter+0xb: movq %rax,%rdi"
-
nomad
I can post a photo if you'd like.
-
rmustacc
You'll want to run something like $C
-
rmustacc
And then sure, a photo or something.
-
nomad
I've got this running on a cluster without XCP-ng support. I'm going to try to duplicate it on a cluster that has support so I can ask them questions.
-
rmustacc
OK, so that panic is something we can work with.
-
nomad
like, which of the 520 logs in /var/log would be meaningful for this? :)
-
rmustacc
For xcp-ng?
-
rmustacc
No idea.
-
nomad
is there any other data (aside from the log that I'm still trying to find) that you would find useful?
-
nomad
no, that was the question I want to ask them.
-
rmustacc
No, based on the panic message there's not much from them now.
-
nomad
I didn't expect anyone here to have any idea about XCP-ng :)
-
rmustacc
The nvme driver is doing something that upsets them.
-
rmustacc
So we'll need to figure out what.
-
nomad
Interesting. I didn't configure any nvme devices in the VDI. I wonder if it got that from the template.
-
nomad
If you have what you need from this one I'm going to blow it away and create a new one with a different template.
-
nomad
(They don't offer an illumos template so I just grabbed the one for AlmaLinux 9)
-
rmustacc
I mean, we can debug more, but I don't have the cycles to drive that right this moment.
-
» nomad nods
-
rmustacc
We need to figure out what they dislike about the command we sent; however, the good news is it is probably reproducible.
-
nomad
ok. I'll try another template. We've just started using XCP-ng and I don't have a lot of insight into what's in the templates so it might take me a while.
-
rmustacc
It is upset about us trying to enable the volatile write cache it looks like.
-
rmustacc
Can you run '*nvme_state::walk softstate | ::print nvme_t n_idctl->id_vwc.vwc_present'
-
nomad
n_idctl->id_vwc.vwc_present = 0x1
-
rmustacc
Hmm. Interesting, so there is a write cache present, but it doesn't like something about our set features.
-
nomad
photos.app.goo.gl/qCN5GstjsL69rHUD7 - an install done without a template. Looks like a different panic but this time I ran it with 'loaded' so maybe I'm just mis-reading it.
-
nomad
I think I found a logfile that might work. It's *very* verbose though.
-
nomad
Not sure how to pull anything meaningful out of it.
-
nomad
I had to lock it to hv5 to make sure I was getting the right log file. So this *might* be a red herring. That said...
-
nomad
Sep 5 10:17:47 hv5 xenopsd-xc: [error||8 ||xenops] QMP result for domid 17: {"error":{"class":"GenericError","desc":"State blocked by non-migratable device '0000:00:07.0/nvme'","data":{}},"id":"qmp-000141-17"} (File "xc/device_common.ml", line 565, characters 62-69)
-
nomad
looking back in the log it shows up before I locked the vm to that hypervisor so maybe it's a valid warning.
-
nomad
I have a dmesg output that might have something useful. I'm looking for a pastebin that isn't full of adverts to share it with you.
-
sjorge
gist.github.com ?
-
nomad
I suppose I should just open a ticket for this.
-
nomad
dpaste.com/BZXV3W6NP (xe dmesg output for the crashed vm)
-
gitomat
[illumos-gate] 15755 ilbd: the comparison will always evaluate as 'false' -- Toomas Soome <tsoome⊙mc>
-
rmustacc
Yeah, so I think we want to focus on debugging why the nvme driver is blowing up on the set features in combination with the hardware platform.
-
» nomad nods
-
nomad
Happy to do anything I can to help.
-
nomad
This isn't fully blocking, I was just hoping to get this set up so I could evaluate something before relocating an OmniOS file server to a new location (with different authentication services).
-
rmustacc
I'll try to sneak in a few suggestions on how to look at the command when I get some breaks between things.
-
nomad
ok.
-
nomad
Should I drop the links to these photos and copy/paste this conversation into a ticket?
-
jbk
if they offer a virtual serial port, you should be able to select it as the console.. the advantage being you can copy/paste from that (as text)
-
rmustacc
I think we'll probably want to open a ticket that focuses on the nvme bit and getting that fixed.
-
rmustacc
I can follow up here or on there I guess based on your druthers, but recording stuff in the ticket longer term is helpful.
-
nomad
jbk, if they do it's well hidden. Not saying it isn't there, just that I've never been able to find one.
-
rmustacc
My focus would be to look at the xcp-ng nvme issue and get that figured out first.
-
nomad
rmustacc, I'll open a ticket then.
-
jbk
what hypervisor is it?
-
nomad
XCP-ng (xen)
-
nomad
we just started using it so, like almost everything I barely have time to dabble in, I'm not an expert on it at all.
-
nomad
from the XCP-ng people: "This is not great.
-
nomad
To answer your question, when you boot in UEFI, before the guest OS loads the Xen PV drivers, it's QEMU that's using NVMe emulation. Can you try in BIOS mode? (this way, the VM will never see any NVMe stuff).
-
nomad
I'm not sure it's Dom0 related at all, since Xen detects a triple fault in the guest before killing it."
-
nomad
so I'm going to try in BIOS mode.
-
nomad
...which seems to result in a boot loop instead of just powering down.
-
rmustacc
The focus is honestly on figuring out why we're getting the bad command.
-
» nomad nods
-
rmustacc
The triple fault is a red herring.
-
nomad
roger
-
rmustacc
That is a terrible fallback pattern for reboot on x86, though not sure we're hitting that or something else.
-
nomad
nothing more shows up in 'xl dmesg' when it does that.
-
nomad
I'm trying to think of the most reasonable ticket to submit.
-
rmustacc
For illumos?
-
nomad
yes
-
rmustacc
'NVMe driver panics on xcp-ng when trying to enable volatile write cache'
-
nomad
thanks. Kernel category, I presume.
-
nomad
found a serial console. :)
-
nomad
now to collect the data again.
-
rmustacc
Yeah, that's fine. Mind adding me as a watcher on the ticket?
-
nomad
don't mind at all. What name do I add? I don't see a 'rmustacc' :)
-
nomad
ah, found you (I think).
-
nomad
ugh. How do I put the installer into serial console mode? As soon as I start it, it stops showing anything on the serial console.
-
nomad
ah, onconsole ttya.
-
» nomad is dumb.
-
nomad
remember how I mentioned when I selected loaded instead of boot/NMI I got a different set of errors?
-
nomad
well:
-
nomad
[1]> *nvme_state::walk softstate | ::print nvme_t n_idctl->id_vwc.vwc_present
-
nomad
kmdb: failed to dereference symbol: unknown symbol name
-
nomad
I'll drop this into the ticket. Let me know if you want me to issue a different set of commands.
-
nomad
panic[cpu1]/thread=fffffe0bd4c1c7c0: BAD TRAP: type=d (#gp General protection) rp=fffffe00117fb480 addr=fffffe0bd24123d8
-
rmustacc
We'll want to look at the stack and compare notes. What's the bug id?
-
fenix
→ BUG 15884: NVMe driver panics on xcp-ng when trying to enable volatile write cache (New)
-
nomad
rebooting with boot/NMI still gives the general protection error instead of nvme.
-
rmustacc
Was this due to changing the machine type?
-
nomad
I don't think so. I put it back in UEFI mode.
-
rmustacc
OK, hmm. Well, that's going to be a very different path than the first time.
-
nomad
FTR: Vates support confirms they see the same general problem (power off or boot loop).
-
nomad
I'm going to link those photos to the ticket so anyone else looking at it will see what we were first observing.
-
nomad
ok, is there anything else I should add to the ticket?
-
rmustacc
I'm not familiar with the xen hvm bits at all there so someone'll probably have to help there.
-
» nomad nods
-
rmustacc
The NVMe bit is something I know more about.
-
nomad
Now that I have a working serial console I should be able to help. I hope. :)
-
jbk
hmm..
-
jbk
what is the syntax in loader to prevent a driver from loading?
-
jbk
nomad: looking at your ticket, i'm wondering if it's the hyper-v hv_vmbus driver that's causing a problem
-
jbk
or issue
-
jbk
(sorry.. habit from work :P)
-
nomad
I just realized something.
-
jbk
hyper-v supposedly was highly 'inspired' by xen
-
nomad
The nvme error was presented on the first XCP-ng I tried to use. That one doesn't have support so I tried on the one that does and that's where we're seeing the different error.
-
nomad
the unsupported cluster uses dell hosts for hypervisors, the supported one uses SuperMicro.
-
nomad
by 'unsupported' I mean it doesn't have a support contract - it still has all the software, etc.
-
nomad
so the difference was likely not in any way related to how I invoked the kmdb.
-
jbk
unless something's not ported back though
-
jbk
hv_vmbus shouldn't even be getting as far as it is
-
jbk
because get_hwenv() shouldn't be returning that it's a microsoft hypervisor
-
jbk
the panic in #15884 has hv_vmbus`vmbux_xcall in the call stack
-
jbk
the driver shouldn't even be attaching
-
jbk
tsoome: isn't there support for running the cpuid command from loader?
-
nomad
confirmed, the dell cluster has the nvme error.
-
nomad
I've updated the ticket.
-
nomad
I presume at some point it'll require splitting out into a different ticket.
-
tsoome
jbk as an ui command? no
-
jbk
even just from the cmdline
-
tsoome
we do use it with bios loader to check if we can get 64 bit
-
nomad
on the dell:
-
nomad
hyperv_init: Checking Hyper-V support...
-
nomad
hyperv_identify: NOT Hyper-V environment: 0x4
hyperv_init: Hyper-V not supported on this environment
pseudo-device: pseudo1
-
nomad
on the supermicro:
-
nomad
hyperv_init: Checking Hyper-V support...
-
nomad
hyperv_identify: Checking Hyper-V features...
-
nomad
do_cpuid: cpuid leaf=0x40000000, eax=0x40000006, ebx=0x7263694d, ecx=0x666f736f, edx=0x76482074
-
nomad
do_cpuid: cpuid leaf=0x40000001, eax=0x31237648, ebx=0x00000000, ecx=0x00000000, edx=0x00000000
-
nomad
panic[cpu1]/thread=fffffe0bd4c1c7c0: BAD TRAP: type=d (#gp General protection) rp=fffffe00117fb480 addr=fffffe0bd24123d8
-
jbk
yeah
-
jbk
cpuid leaf 0x40000000 has to be returning 'Microsoft Hv' in ebx/ecx/edx
-
jbk
for that to work
-
jbk
which if it's not hyper-v is 100% wrong
-
jbk
it would be nice to confirm that from the bootloader though
-
nomad
Nope, latest stable release of XCP-ng on both. :)
-
jbk
yeah... something isn't right there
-
jbk
tsoome: but is there a way to invoke it from the cmdline? or is it not exposed to the forth environment?
-
tsoome
no, but it is quite trivial to add the command - see biossmap.c for instance in libi386
-
jbk
hrm..
-
jbk
ok what about appending a boot argument?
-
tsoome
boot argument?
-
sommerfeld
jbk: yeah, that decodes to "Microsoft HvHv#1" !
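(Editor's sketch, not part of the original conversation.) The decoding above can be checked by packing the register values from the do_cpuid lines as little-endian 32-bit words and reading the bytes as ASCII. The helper name `regs_to_ascii` is hypothetical; only the register values come from the log.

```python
import struct


def regs_to_ascii(*regs):
    """Pack 32-bit register values little-endian and decode the bytes as ASCII."""
    return b"".join(struct.pack("<I", r) for r in regs).decode("ascii")


# Values copied from the do_cpuid output pasted earlier in the log.
# Leaf 0x40000000: ebx, ecx, edx hold the hypervisor vendor signature.
vendor = regs_to_ascii(0x7263694D, 0x666F736F, 0x76482074)
# Leaf 0x40000001: eax holds the interface signature.
interface = regs_to_ascii(0x31237648)

print(repr(vendor), repr(interface))
```

Run as-is, this prints the two strings that sommerfeld concatenated in the message above.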
-
tsoome
oh, I misread:D
-
jbk
tsoome: well if we can disable the hv_vmbus driver, that might at least allow nomad (on the one system) to progress further
-
jbk
(though it might end up in the same place as the other one, but it'd at least be progress)
-
tsoome
boot -B disable-hv_vmbus=true
-
tsoome
the syntax is disable-NAME=true
-
jbk
ahh
-
tsoome
I have /boot/conf.d/audiohd
-
tsoome
disable-audiohd="true"
-
jbk
nomad: might want to try what tsoome posted
-
nomad
I presume I do that at the loader prompt?
-
jbk
yeah
-
tsoome
to skip some spam from buggy driver:)
-
jbk
might want to add '-v' to the cmdline as well
-
jbk
tsoome: in this case, it sounds like a buggy hypervisor :)
-
tsoome
yes, you can enter that from ok prompt, if it will do the trick, you can add it into /boot/conf.d
-
jbk
if it's reporting it's hyper-v when it's not :)
-
nomad
hmm. I must have done it wrong.
-
nomad
hyperv_init: Checking Hyper-V support...
-
nomad
hyperv_identify: NOT Hyper-V environment: 0x4
hyperv_init: Hyper-V not supported on this environment
pseudo-device: pseudo1
-
nomad
I'll try again.
-
jbk
that's what you want to see on that system (I think the supermicro) that was spitting out the 'do_cpuid' lines
-
jbk
the one panic was because the hv_vmbus driver was trying to do some hyper-v stuff because it thought it was running under hyper-v
-
jbk
because the hypervisor is reporting the HV type as 'Microsoft Hv'
-
nomad
NOTICE: driver hv_vmbus disabled
-
nomad
followed by:
-
nomad
WARNING: illegal PCI request: offset = 100, size = 4
-
nomad
Configuring devices.
-
nomad
panic[cpu0]/thread=fffffe0011d9fc20: programming error: invalid NS/format in cmd fffffe0bd3d87d40
-
jbk
ok so then you get a similar error
-
nomad
(the PCI request error has been there all along, reported earlier.)
-
nomad
I'm going to try on the supermicro host now.
-
tsoome
but that programming error was from nvme, wasn't it?
-
jbk
yes
-
jbk
but the one stack trace you put in the issue was from hv_vmbus
-
tsoome
so you can boot -B disable-nvme=true,disable-hv_vmbus=true
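(Editor's sketch, not part of the original conversation.) The `disable-NAME=true` pattern tsoome describes, with several properties joined by commas after `-B`, can be generated for any list of driver names. The function name `disable_drivers_bootargs` is hypothetical; the output format simply mirrors the command quoted above.

```python
def disable_drivers_bootargs(*drivers):
    """Build a loader 'boot -B' line that disables each named driver."""
    # One disable-NAME=true property per driver, comma-separated.
    props = ",".join(f"disable-{name}=true" for name in drivers)
    return f"boot -B {props}"


cmd = disable_drivers_bootargs("nvme", "hv_vmbus")
print(cmd)
```

The resulting line is what gets typed at the loader's ok prompt; per tsoome, a setting that works can then be made persistent under /boot/conf.d.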
-
nomad
NOTICE: driver hv_vmbus disabled
-
nomad
WARNING: illegal PCI request: offset = 100, size = 4
-
nomad
Configuring devices.
-
nomad
panic[cpu1]/thread=fffffe0011e00c20: programming error: invalid NS/format in cmd fffffe0bd51c6d40
-
rmustacc
OK, so we're back to that issue.
-
nomad
NOTICE: driver hv_vmbus disabled
-
nomad
NOTICE: driver nvme disabled
-
nomad
WARNING: illegal PCI request: offset = 100, size = 4
-
nomad
Configuring devices.
-
nomad
Applying initial boot settings...
-
nomad
Scanning for media...
-
nomad
Found r151046 media at /devices/pci@0,0/pci-ide@1,1/ide@1/sd@1,0:q
-
nomad
Press return to start the OmniOS installer.
-
nomad
To use the old text installer, press T within 6 seconds...
-
nomad
(on the supermicro)
-
nomad
trying the same on the dell now.
-
jbk
i'd put $5 on the emulated NVMe advertising support for write cache and then returning an error when you try to enable it (whatever version of ESX we run internally does the same thing for the dataset management command w/ accompanying panic)
-
jbk
and basically we're panicking when the command returns 'not supported'
-
nomad
and success again.
-
tsoome
basically we do not want to panic there:)
-
nomad
I've updated the ticket with that information.
-
jbk
so yeah.. if we can fix the NVMe issue then, you can probably get things installed and working if it's able to get to userland
-
jbk
it might be useful to drop to a shell and do 'dladm show-phys' just to make sure it sees a network device
-
jbk
or if you can use some other type of emulated storage
-
jbk
that might be a workaround as well
-
nomad
I'm going to try an install on the dell-based cluster (where I needed this host). I presume it'll crash on boot but at least I can (in theory) get into /boot/conf.d and do the thing I needed to do.
-
nomad
oh no.
-
nomad
'no disks found'
-
tsoome
if you only have nvme, and nvme disabled...
-
nomad
I was hoping the nvme was not the only disk.
-
tsoome
loader lsdev command should list them
-
tsoome
but yea...
-
nomad
on the dell I don't need the disable-hv_vmbus. I do need both for the SuperMicro though.
-
nomad
ok lsdev
-
nomad
cd devices:
-
nomad
cd0: 145550 X 2048 blocks
-
nomad
cd0: ISO9660
-
nomad
disk devices:
-
nomad
disk0: 52428800 X 512 blocks
-
nomad
net devices:
-
nomad
net0:
-
nomad
(that's from the SuperMicro)
-
nomad
same for the dell.
-
gitomat
[illumos-gate] 15754 format: the comparison will always evaluate as 'true' -- Toomas Soome <tsoome⊙mc>
-
jbk
so the bootloader doesn't see any devices..
-
jbk
if you boot into the installer do you see any network devices?
-
nomad
jbk, the installer stops when it doesn't see any drives. I'm not sure how I'd get past that to see.
-
nomad
I've got the installer shell up now, looking around
-
jbk
dladm show-phys
-
nomad
kayak-r151046# dladm
-
nomad
LINK CLASS MTU STATE BRIDGE OVER
-
nomad
e1000g0 phys 1500 unknown -- --
-
jbk
ok so you're set there at least
-
nomad
kayak-r151046# dladm show-phys
-
nomad
LINK MEDIA STATE SPEED DUPLEX DEVICE
-
nomad
e1000g0 Ethernet unknown 0 half e1000g0
-
nomad
half duplex? Is that normal?
-
rmustacc
The state is unknown so.
-
nomad
k
-
nomad
anything else I should check while I'm here?
-
jbk
i don't think so... unless there was something special you were trying to do
-
nomad
The 'special' thing I was trying to do is see what configuration I need to get OmniOS to talk with the LDAP/Kerberos setup on this new-to-me network.
-
nomad
but that's kind of hard to do without a working host :)
-
jbk
so it sounds like once the NVMe issue is solved, you should have something usable
-
nomad
yeah, but given the old server is likely moving on Monday I don't think it's really going to matter.
-
nomad
I'm not going to be able to do anything meaningful past to test past tomorrow.
-
nomad
so at this point I'm looking at this as a chance to contribute to a project that I like so someone else in the future can benefit.
-
nomad
:s/meaningful past/meaningful past tomorrow/
-
» nomad sighs
-
nomad
I really need to do a better job of editing.
-
nomad
anyway, I'll leave the VM in this state for a while in case anyone thinks of something they want examined.
-
rmustacc
nomad: I think the thing that'd help is to get into kmdb again with the nvme panic.
-
rmustacc
And then I want to print the full sqe / cqe.
-
rmustacc
So when you run '$C', you'll see there's an address passed to nvme_check_generic_cmd_status in hex.
-
rmustacc
And then do hexaddr::print nvme_cmd_t nc_sqe nc_cqe
-
rmustacc
That is the hex address that looks like an argument.