15:56:41 figure I might as well ask here as well. I'm trying to install OmniOS 151046 (latest) in XCP-ng. I'm trying to boot the installer but it just powers off as soon as it starts configuring devices.
16:18:54 Do you have access to the hypervisor logs at all? Otherwise I would consider booting with kmdb.
16:24:43 I should. It's XCP-ng. Just have to figure out where they're stashing that specific log info.
16:27:18 The fact that it's powering off is a bit surprising. But clearly it's upset at something we did. So I'd either use kmdb to figure out what that is or see if the logs have a bit of help.
16:27:30 Presumably this is using the hvm as opposed to xpv kernel?
16:27:41 I've never used kmdb but I'm happy to take a walk through if you are willing.
16:28:05 There's a warning popping up just before it crashes. I'm trying to catch it.
16:29:19 "WARNING: illegal PCI request: offset = 100, size = 4"
16:29:27 about 2 seconds later it powers down.
16:29:41 If it's 2s later, it's likely unrelated, but hard to say.
16:29:53 looking at the video, "configuring devices." shows less than a second before the shutdown.
16:30:02 Sure, there's a lot of stuff there.
16:30:05 I can't count frames but it's certainly well under a second.
16:30:20 Sure, but even 10ms is a lot of time for the computer to do things.
16:30:54 Anyways, I'm not going to be able to walk you through everything today, but I would toggle into the boot menu, ask for some amount of verbose settings and then set the kmdb bit to maybe 'on panic'.
16:31:09 * nomad nods
16:32:32 kmdb options I have are loaded, on NMI, on boot, on boot/NMI, or off.
16:32:37 I'm going with boot/NMI
16:32:48 I think what I was thinking of was loaded.
16:32:58 I'll try that after this boot.
16:33:09 But if you do on boot, you'll need to continue with ':c'.
16:33:11 I'm in the kmdb prompt, should I reboot with loaded or proceed?
16:33:20 k
16:33:41 Lots for me to type. Do you want me to proceed or are you busy?
16:34:14 I mean, you should proceed and figure out what the last thing you see is and then focus on the xcp-ng log.
16:34:30 If it says we triple faulted, that'd be one reason we power off and then we'll end up back in kmdb.
16:34:31 panic[cpu1]/thread=fffffffe0011e43c20: programming error: invalid NS/format in cmd ffffffe0bd3b780d40
16:34:41 (I suspect I didn't get those hex strings correct)
16:34:54 Don't worry about the hex. It sounds like you ended up back in kmdb?
16:35:11 yes,
16:35:37 last line is "kmdb_enter+0xb: movq %rax,%rdi"
16:35:45 I can post a photo if you'd like.
16:35:49 You'll want to run something like $C
16:35:54 And then sure, a photo or something.
16:36:54 https://photos.app.goo.gl/KXxi6ZwT9Nxw9iYi9
16:38:04 I've got this running on a cluster without XCP-ng support. I'm going to try to duplicate it on a cluster that has support so I can ask them questions.
16:38:22 OK, so that panic is something we can work with.
16:38:37 like, which of the 520 logs in /var/log would be meaningful for this? :)
16:38:50 For xcp-ng?
16:38:54 No idea.
16:38:55 is there any other data (aside from the log that I'm still trying to find) that you would find useful?
16:39:02 no, that was the question I want to ask them.
16:39:13 No, based on the panic message there's not much from them now.
16:39:13 I didn't expect anyone here to have any idea about XCP-ng :)
16:39:22 The nvme driver is doing something that upsets them.
16:39:26 So we'll need to figure out what.
16:40:11 Interesting. I didn't configure any nvme devices in the VDI. I wonder if it got that from the template.
16:40:32 If you have what you need from this one I'm going to blow it away and create a new one with a different template.
16:40:49 (They don't offer an illumos template so I just grabbed the one for AlmaLinux 9)
16:41:12 I mean, we can debug more, but I don't have the cycles to drive that right this moment.
16:41:18 * nomad nods
16:41:44 We need to figure out what they dislike about the command we sent; however, the good news is it is probably reproducible.
16:41:59 ok. I'll try another template. We've just started using XCP-ng and I don't have a lot of insight into what's in the templates so it might take me a while.
16:42:11 It is upset about us trying to enable the volatile write cache it looks like.
16:46:34 Can you run '*nvme_state::walk softstate | ::print nvme_t n_idctl->id_vwc.vwc_present'
16:53:18 n_idctl->id_vwc.vwc_present = 0x1
16:53:31 Hmm. Interesting, so there is a write cache present, but it doesn't like something about our set features.
16:57:36 https://photos.app.goo.gl/qCN5GstjsL69rHUD7 - an install done without a template. Looks like a different panic but this time I ran it with 'loaded' so maybe I'm just mis-reading it.
17:12:41 I think I found a logfile that might work. It's *very* verbose though.
17:12:52 Not sure how to pull anything meaningful out of it.
17:19:09 I had to lock it to hv5 to make sure I was getting the right log file. So this *might* be a red herring. That said...
17:19:13 Sep 5 10:17:47 hv5 xenopsd-xc: [error||8 ||xenops] QMP result for domid 17: {"error":{"class":"GenericError","desc":"State blocked by non-migratable device '0000:00:07.0/nvme'","data":{}},"id":"qmp-000141-17"} (File "xc/device_common.ml", line 565, characters 62-69)
17:20:16 looking back in the log it shows up before I locked the vm to that hypervisor so maybe it's a valid warning.
17:37:12 I have a dmesg output that might have something useful. I'm looking for a pastebin that isn't full of adverts to share it with you.
17:39:53 gist.github.com ?
17:40:02 I suppose I should just open a ticket for this.
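
On pulling something meaningful out of the very verbose xenopsd log: one option is to keep only the QMP results and decode their JSON payloads. The following is a minimal Python sketch, assuming every line of interest has the same shape as the line quoted at 17:19:13. The function names `parse_qmp_result` and `qmp_errors` are made up for illustration, and nothing here is an XCP-ng tool or API.

```python
import json
import re

# Sample line copied from the xenopsd log quoted at 17:19:13.
LINE = (
    'Sep 5 10:17:47 hv5 xenopsd-xc: [error||8 ||xenops] QMP result for domid 17: '
    '{"error":{"class":"GenericError","desc":"State blocked by non-migratable device '
    "'0000:00:07.0/nvme'"
    '","data":{}},"id":"qmp-000141-17"} '
    '(File "xc/device_common.ml", line 565, characters 62-69)'
)

# Assumption: the JSON payload starts right after this marker text.
QMP_RE = re.compile(r"QMP result for domid (\d+): ")


def parse_qmp_result(line):
    """Return (domid, payload) for a QMP result line, or None if it is not one."""
    match = QMP_RE.search(line)
    if match is None:
        return None
    # raw_decode parses one JSON value and ignores the trailing "(File ...)" text.
    payload, _end = json.JSONDecoder().raw_decode(line, match.end())
    return int(match.group(1)), payload


def qmp_errors(lines):
    """Yield (domid, error class, description) for each QMP error in the log lines."""
    for line in lines:
        parsed = parse_qmp_result(line)
        if parsed is not None and "error" in parsed[1]:
            domid, payload = parsed
            yield domid, payload["error"]["class"], payload["error"]["desc"]


if __name__ == "__main__":
    for domid, err_class, desc in qmp_errors([LINE]):
        print(domid, err_class, desc)
```

Filtering on the domid of the crashing VM would narrow the output further, which may also help decide whether a message like this one is related to the crash or a red herring.
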
17:42:40 https://dpaste.com/BZXV3W6NP (xe dmesg output for the crashed vm)
17:43:19 [illumos-gate] 15755 ilbd: the comparison will always evaluate as 'false' -- Toomas Soome
17:44:25 Yeah, so I think we want to focus on debugging why the nvme driver is blowing up on the set features in combination with the hardware platform.
17:45:20 * nomad nods
17:45:25 Happy to do anything I can to help.
17:46:05 This isn't fully blocking, I was just hoping to get this set up so I could evaluate something before relocating an OmniOS file server to a new location (with different authentication services).
17:46:36 I'll try to sneak in a few suggestions on how to look at the command when I get some breaks between things.
17:46:42 ok.
17:46:57 Should I drop the links to these photos and copy/paste this conversation into a ticket?
17:48:41 if they offer a virtual serial port, you should be able to select it as the console.. the advantage being you can copy/paste from that (as text)
17:49:09 I think we'll probably want to open a ticket that focuses on the nvme bit and getting that fixed.
17:49:33 I can follow up here or on there I guess based on your druthers, but recording stuff in the ticket longer term is helpful.
17:49:45 jbk, if they do it's well hidden. Not saying it isn't there, just that I've never been able to find one.
17:49:53 My focus would be to look at the xcp-ng nvme issue and get that figured out first.
17:49:54 rmustacc, I'll open a ticket then.
17:50:01 what hypervisor is it?
17:50:16 XCP-ng (xen)
17:50:47 we just started using it so, like almost everything I barely have time to dabble in, I'm not an expert on it at all.
17:53:10 from the XCP-ng people: "This is not great.
17:53:10 To answer your question, when you boot in UEFI, before the guest OS loads the Xen PV drivers, it's QEMU that's using NVMe emulation. Can you try in BIOS mode? (this way, the VM will never see any NVMe stuff).
17:53:10 I'm not sure it's Dom0 related at all, since Xen detects a triple fault in the guest before killing it."
17:53:15 so I'm going to try in BIOS mode.
17:55:08 ...which seems to result in a boot loop instead of just powering down.
17:55:22 The focus is honestly on figuring out why we're getting the bad command.
17:55:27 * nomad nods
17:55:30 The triple fault is a red herring.
17:55:36 roger
17:56:03 That is a terrible fallback pattern for reboot on x86, though not sure we're hitting that or something else.
17:57:35 nothing more shows up in 'xl dmesg' when it does that.
17:57:47 I'm trying to think of the most reasonable ticket to submit.
17:58:15 For illumos?
17:58:22 yes
17:58:42 'NVMe driver panics on xcp-ng when trying to enable volatile write cache'
17:59:15 thanks. Kernel category, I presume.
18:03:34 found a serial console. :)
18:03:44 now to collect the data again.
18:03:48 Yeah, that's fine. Mind adding me as a watcher on the ticket?
18:05:34 don't mind at all. What name do I add? I don't see a 'rmustacc' :)
18:06:01 ah, found you (I think).
18:08:30 ugh. How do I put the installer into serial console mode? As soon as I start it stops showing anything in the serial.
18:10:20 ah, onconsole ttya.
18:10:22 * nomad is dumb.
18:14:49 remember how I mentioned when I selected loaded instead of boot/NMI I got a different set of errors?
18:14:56 well:
18:15:04 [1]> *nvme_state::walk softstate | ::print nvme_t n_idctl->id_vwc.vwc_present
18:15:05 kmdb: failed to dereference symbol: unknown symbol name
18:15:25 I'll drop this into the ticket. Let me know if you want me to issue a different set of commands.
18:18:18 panic[cpu1]/thread=fffffe0bd4c1c7c0: BAD TRAP: type=d (#gp General protection) rp=fffffe00117fb480 addr=fffffe0bd24123d8
18:20:41 We'll want to look at the stack and compare notes. What's the bug id?
18:20:52 https://www.illumos.org/issues/15884
18:20:54 → BUG 15884: NVMe driver panics on xcp-ng when trying to enable volatile write cache (New)
18:21:30 rebooting with boot/NMI still gives the general protection error instead of nvme.
18:22:52 Was this due to changing the machine type?
18:23:11 I don't think so. I put it back in UEFI mode.
18:23:30 OK, hmm. Well, that's going to be a very different path than the first time.
18:25:55 FTR: Vates support confirms they see the same general problem (power off or boot loop).
18:26:24 I'm going to link those photos to the ticket so anyone else looking at it will see what we were first observing.
18:35:40 ok, is there anything else I should add to the ticket?
18:37:25 I'm not familiar with the xen hvm bits at all there so someone'll probably have to help there.
18:37:45 * nomad nods
18:37:47 The NVMe bit is something I know more about.
18:38:09 Now that I have a working serial console I should be able to help. I hope. :)
18:52:23 hmm..
18:52:40 what is the syntax in loader to prevent a driver from loading?
18:53:43 nomad: looking at your ticket, i'm wondering if it's the hyper-v hv_vmbus driver that's causing a problem
18:53:49 or issue
18:53:54 (sorry.. habit from work :P)
18:55:02 I just realized something.
18:55:27 hyper-v supposedly was highly 'inspired' by xen
18:55:43 The nvme error was presented on the first XCP-ng I tried to use. That one doesn't have support so I tried on the one that does and that's where we're seeing the different error.
18:56:06 the unsupported cluster uses dell hosts for hypervisors, the supported one uses SuperMicro.
18:56:50 by 'unsupported' I mean it doesn't have a support contract - it still has all the software, etc.
18:57:39 so the difference was likely not in any way related to how I invoked the kmdb.
18:58:02 unless something's not ported back though
18:58:18 hv_vmbus shouldn't even be getting as far as it is
18:58:39 because get_hwenv() shouldn't be returning that it's a microsoft hypervisor
19:01:55 the panic in #15884 from hv_vmbus`vmbux_xcall in the call stack
19:02:05 the driver shouldn't even be attaching
19:04:31 https://github.com/omniosorg/illumos-omnios/blob/r151046/usr/src/uts/intel/io/hyperv/vmbus/hyperv.c#L228-L264 -- somehow all of that is succeeding so that _init() succeeds
19:19:38 tsoome: isn't there support for running the cpuid command from loader?
19:32:14 confirmed, the dell cluster has the nvme error.
19:32:17 I've updated the ticket.
19:32:29 I presume at some point it'll require splitting out into a different ticket.
19:36:18 jbk as an ui command? no
19:36:28 even just from the cmdline
19:37:03 we do use it with bios loader to check if we can get 64 bit
19:37:30 on the dell:
19:37:33 hyperv_init: Checking Hyper-V support...
19:37:33 hyperv_identify: NOT Hyper-V environment: 0x4hyperv_init: Hyper-V not supported on this environmentpseudo-device: pseudo1
19:37:36 on the supermicro:
19:37:45 hyperv_init: Checking Hyper-V support...
19:37:45 hyperv_identify: Checking Hyper-V features...
19:37:45 do_cpuid: cpuid leaf=0x40000000, eax=0x40000006, ebx=0x7263694d, ecx=0x666f736f, edx=0x76482074
19:37:45 do_cpuid: cpuid leaf=0x40000001, eax=0x31237648, ebx=0x00000000, ecx=0x00000000, edx=0x00000000
19:37:45 panic[cpu1]/thread=fffffe0bd4c1c7c0: BAD TRAP: type=d (#gp General protection) rp=fffffe00117fb480 addr=fffffe0bd24123d8
19:37:54 yeah
19:39:22 cpuid leaf 0x40000000 has to be returning 'Microsoft Hv' in ebx/ecx/edx
19:39:26 for that to work
19:39:36 which if it's not hyper-v is 100% wrong
19:39:49 it would be nice to confirm that from the bootloader though
19:39:55 Nope, latest stable release of XCP-ng on both. :)
19:42:07 yeah... something isn't right there
19:43:36 tsoome: but is there a way to invoke it from the cmdline?
or is it not exposed to the forth environment?
19:46:09 no, but it is quite trivial to add the command - see biossmap.c for instance in libi386
19:47:56 hrm..
19:48:14 ok what about appending a boot argument?
19:53:38 boot argument?
19:53:40 jbk: yeah, that decodes to "Microsoft HvHv#1" !
19:53:59 oh, I misread:D
19:54:58 tsoome: well if we can disable the hv_vmbus driver, that might at least allow nomad (on the one system) to progress further
19:55:11 (though it might end up in the same place as the other one, but it'd at least be progress)
19:56:21 boot -B disable-hv_vmbus=true
19:57:39 the syntax is disable-NAME=true
19:58:28 ahh
19:59:26 I have /boot/conf.d/audiohd
19:59:26 disable-audiohd="true"
19:59:44 nomad: might want to try what tsoome posted
19:59:46 I presume I do that at the loader prompt?
19:59:50 yeah
19:59:57 to skip some spam from buggy driver:)
20:00:03 might want to add '-v' to the cmdline as well
20:00:35 tsoome: in this case, it sounds like a buggy hypervisor :)
20:00:40 yes, you can enter that from ok prompt, if it will do the trick, you can add it into /boot/conf.d
20:00:44 if it's reporting it's hyper-v when it's not :)
20:01:05 hmm. I must have done it wrong.
20:01:07 hyperv_init: Checking Hyper-V support...
20:01:08 hyperv_identify: NOT Hyper-V environment: 0x4hyperv_init: Hyper-V not supported on this environmentpseudo-device: pseudo1
20:01:11 I'll try again.
20:01:34 that's what you want to see on that system (I think the supermicro) that was spitting out the 'do_cpuid' lines
20:02:04 the one panic was because the hv_vmbus driver was trying to do some hyper-v stuff because it thought it was running under hyper-v
20:02:16 because the hypervisor is reporting the HV type as 'MicrosoftHv'
20:02:30 NOTICE: driver hv_vmbus disabled
20:02:35 followed by:
20:02:48 WARNING: illegal PCI request: offset = 100, size = 4
20:02:48 Configuring devices.
20:02:48 panic[cpu0]/thread=fffffe0011d9fc20: programming error: invalid NS/format in cmd fffffe0bd3d87d40
20:02:59 ok so then you get a similar error
20:03:00 (the PCI request error has been there all along, reported earlier.)
20:03:29 I'm going to try on the supermicro host now.
20:03:42 but that programming error was from nvme, wasn't it?
20:03:47 yes
20:03:57 but the one stack trace you put in the issue was from hv_vmbus
20:04:19 so you can boot -B disable-nvme=true,disable-hv_vmbus=true
20:04:47 NOTICE: driver hv_vmbus disabled
20:04:47 WARNING: illegal PCI request: offset = 100, size = 4
20:04:47 Configuring devices.
20:04:47 panic[cpu1]/thread=fffffe0011e00c20: programming error: invalid NS/format in cmd fffffe0bd51c6d40
20:05:04 OK, so we're back to that issue.
20:05:59 NOTICE: driver hv_vmbus disabled
20:05:59 NOTICE: driver nvme disabled
20:05:59 WARNING: illegal PCI request: offset = 100, size = 4
20:05:59 Configuring devices.
20:05:59 Applying initial boot settings...
20:06:00 Scanning for media...
20:06:02 Found r151046 media at /devices/pci@0,0/pci-ide@1,1/ide@1/sd@1,0:q
20:06:04 Press return to start the OmniOS installer.
20:06:06 To use the old text installer, press T within 6 seconds...
20:06:08 (on the supermicro)
20:06:12 trying the same on the dell now.
20:06:15 i'd put $5 on the emulated NVMe advertising support for write cache and then returning an error when you try to disable it (whatever version of ESX we run internally does the same thing for the dataset management command w/ accompanying panic)
20:06:55 and basically we're panicking when the command returns 'not supported'
20:07:38 and success again.
20:07:59 basically we do not want to panic there:)
20:09:00 I've updated the ticket with that information.
20:11:58 so yeah..
if we can fix the NVMe issue then, you can probably get things installed and working if it's able to get to userland
20:12:34 it might be useful to drop to a shell and do 'dladm show-phys' just to make sure it sees a network device
20:12:50 or if you can use some other type of emulated storage
20:12:54 that might be a workaround as well
20:13:20 I'm going to try an install on the dell-based cluster (where I needed this host). I presume it'll crash on boot but at least I can (in theory) get into /boot/conf.d and do the thing I needed to do.
20:15:19 oh no.
20:15:23 'no disks found'
20:15:52 if you only have nvme, and nvme disabled...
20:17:16 I was hoping the nvme was not the only disk.
20:18:02 loader lsdev command should list them
20:18:23 but yea...
20:20:16 on the dell I don't need the disable-hv_vmbus. I do need both for the SuperMicro though.
20:21:08 ok lsdev
20:21:08 cd devices:
20:21:08 cd0: 145550 X 2048 blocks
20:21:08 cd0: ISO9660
20:21:08 disk devices:
20:21:09 disk0: 52428800 X 512 blocks
20:21:11 net devices:
20:21:13 net0:
20:21:15 (that's from the SuperMicro)
20:22:20 same for the dell.
20:31:19 [illumos-gate] 15754 format: the comparison will always evaluate as 'true' -- Toomas Soome
22:03:02 so the bootloader doesn't see any devices..
22:03:36 if you boot into the installer do you see any network devices?
22:04:54 jbk, the installer stops when it doesn't see any drives. I'm not sure how I'd get past that to see.
22:07:36 I've got the installer shell up now, looking around
22:07:44 dladm show-phys
22:07:46 kayak-r151046# dladm
22:07:46 LINK CLASS MTU STATE BRIDGE OVER
22:07:46 e1000g0 phys 1500 unknown -- --
22:07:54 ok so you're set there at least
22:08:03 kayak-r151046# dladm show-phys
22:08:03 LINK MEDIA STATE SPEED DUPLEX DEVICE
22:08:03 e1000g0 Ethernet unknown 0 half e1000g0
22:08:12 half duplex? Is that normal?
22:08:25 The state is unknown so.
22:08:33 k
22:09:05 anything else I should check while I'm here?
22:09:49 i don't think so...
unless there was something special you were trying to do
22:10:26 The 'special' thing I was trying to do is see what configuration I need to get OmniOS to talk with the LDAP/Kerberos setup on this new-to-me network.
22:10:34 but that's kind of hard to do without a working host :)
22:10:36 so it sounds like once the NVMe issue is solved, you should have something usable
22:11:00 yeah, but given the old server is likely moving on Monday I don't think it's really going to matter.
22:11:25 I'm not going to be able to do anything meaningful past to test past tomorrow.
22:11:50 so at this point I'm looking at this as a chance to contribute to a project that I like so someone else in the future can benefit.
22:12:07 :s/meaningful past/meaningful past tomorrow/
22:12:12 * nomad sighs
22:12:18 I really need to do a better job of editing.
22:13:10 anyway, I'll leave the VM in this state for a while in case anyone thinks of something they want examined.
22:42:10 nomad: I think the thing that'd help is to get into kmdb again with the nvme panic.
22:42:33 And then I want to print the full sqe / cqe.
22:43:13 So when you run '$C', you'll see there's an address passed to nvme_check_generic_cmd_status in hex.
22:43:31 And then do hexaddr::print nvme_cmd_t nc_sqe nc_cqe
22:43:44 That is the hex address that looks like an argument.
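
For anyone wanting to check the "Microsoft Hv" decode from 19:39:22 and 19:53:40 by hand: each CPUID register holds four ASCII bytes, least significant byte first. A minimal Python sketch follows, using only the register values from the do_cpuid lines pasted at 19:37:45. The helper name `regs_to_ascii` is made up for illustration and is not part of the driver or the loader.

```python
import struct


def regs_to_ascii(*regs):
    """Pack 32-bit register values little-endian and decode the bytes as ASCII."""
    return struct.pack("<%dI" % len(regs), *regs).decode("ascii")


# Values copied from the do_cpuid output on the SuperMicro host.
# Leaf 0x40000000: the signature is spread over ebx, ecx, edx in that order.
vendor = regs_to_ascii(0x7263694D, 0x666F736F, 0x76482074)
# Leaf 0x40000001: eax alone carries a four-byte signature.
interface = regs_to_ascii(0x31237648)

if __name__ == "__main__":
    print(repr(vendor), repr(interface))
```

Concatenating the two results gives the "Microsoft HvHv#1" string quoted at 19:53:40, which is why hv_vmbus goes on to attach on that host.
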