00:47:48 danmcd I see you made the last change on ipxe. I was trying to build it by just running make in the src directory; how do I generate smartos.ipxe? My intention is to see where the PXE boot is failing on my side by compiling with debug flags.
01:17:33 I was just missing 2 node packages to begin building.
04:47:06 Sorry I missed this, neirac, but you got it, it seems.
11:57:03 danmcd yes, the setup for the build image took time, but things were documented well enough to follow without problems.
13:38:40 bahamat: danmcd: looks like one of my CNs is reporting an nvme device missing, but I think it's definitely there... not sure what happened
13:38:41 https://gist.github.com/Smithx10/fa946cc1c208af4f362bcd918945df0c
13:38:51 Trying to figure out how to get it back.
14:02:44 jperkin: would it be possible to bring back the jemalloc package in trunk? It is in 2021Q4 but not in trunk.
14:12:54 jfqd1: possibly, could you create a ticket? Generally you really want to just use libumem on illumos; you'll get better performance and debugging.
14:13:18 Looks like IPMI shows 10, but slot 1 SN BTLN9050022R3P2BGN isn't showing in the OS.
14:19:47 jperkin: done https://github.com/TritonDataCenter/pkgsrc/issues/347
14:20:16 I also reopened the following ticket: https://github.com/TritonDataCenter/pkgsrc/issues/340
14:21:23 And did you see this comment related to dnsdist? https://github.com/TritonDataCenter/pkgsrc/issues/344#issuecomment-1320928341
14:22:02 Looks like cfgadm shows 10 connected nvme devices, but diskinfo and nvmeadm show 9.
14:27:32 I fixed 340, it'll be in the next build.
14:33:31 Ok, so I've bumped the gist. Looks like cfgadm shows that "Jan 1 1970 nvme/hp n /devices/pci@36,0/pci8086,2031@1:Slot238" doesn't have any links in /dev/rdsk but shows connected and configured. Not really sure what the next steps would be.
14:46:01 try unconfiguring and configuring it again?
15:21:44 Smithx10: fmadm faulty shows the one device is faulted.
15:23:15 You *could* try and acquit the fault and see if it reappears.
15:23:19 `fmadm acquit 0180fc24-f033-e01e-e422-c49ee0b99996`
15:23:30 but if it's really faulty it'll come right back as faulty.
15:29:31 still missing: https://gist.github.com/Smithx10/fa946cc1c208af4f362bcd918945df0c#file-gistfile1-txt-L14 still hasn't shown up. Maybe I should bounce it from configured to unconfigured and back to configured?
15:30:29 Did it come back as faulty?
15:30:40 (is there a new fmadm faulty report for this device?)
15:30:41 no
15:31:26 https://gist.github.com/Smithx10/8fa88958239be4f04f6a171b76e1c29a
15:31:30 Then try the bump. (You may induce a fault again...)
15:32:14 cfgadm -v -c unconfigure pcieb4:Slot238 && cfgadm -v -c configure pcieb4:Slot238 ?
15:32:21 BTW, that overlay fault is caused when the portolan service goes down (a reboot of the HN does this all the time); you can acquit that one too unless your fabrics on that node are faulty.
15:32:56 Smithx10: I think so? I've not used cfgadm much except for USB. It should work the same, however.
15:39:38 All that will do is attach/detach the device driver.
15:40:02 Not knowing more about what's going on, that may be sufficient.
15:40:43 rmustacc: an nvme device is showing connected and configured in cfgadm https://gist.github.com/Smithx10/8fa88958239be4f04f6a171b76e1c29a#file-gistfile1-txt-L53
15:41:15 but the disk doesn't show up in nvmeadm list or diskinfo. It doesn't have a link in /dev/rdsk either: https://gist.github.com/Smithx10/8fa88958239be4f04f6a171b76e1c29a#file-gistfile1-txt-L41
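A quick way to line up the different views of the same controller and confirm the symptom being described here is to compare the hotplug, driver, and /dev views side by side. A minimal sketch, assuming a SmartOS/illumos global zone; these are the same commands used in the discussion, only collected in one place, and the grep patterns are illustrative:

    cfgadm -al | grep -i nvme       # hotplug view: slot state (connected/configured)
    nvmeadm list                    # controllers the nvme driver actually attached
    diskinfo                        # disks visible to the system
    ls -l /dev/rdsk | grep -i nvme  # /dev links created for nvme namespaces
    fmadm faulty                    # any outstanding FMA faults for the missing device

A device that shows up as configured in cfgadm but is absent from nvmeadm, diskinfo, and /dev/rdsk, as reported above, points at the driver instance itself having failed, which is what the rest of the afternoon digs into.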
15:41:41 I noticed this while looking at zpool status and seeing that device as UNAVAIL.
15:41:44 Go into mdb, and figure out if it set the n_dead flag.
15:41:59 (I'm sparsely available so don't have the time right now to guide that)
15:42:11 Thanks for the pointer.
15:43:56 (I can help you find the n_dead.)
15:44:19 I'm giving the unconfigure/configure a shot.
15:44:22 if you don't see a retired flag in prtconf (not -v, iirc) it won't be there.
15:44:24 Wait.
15:44:29 If you do that you will destroy the state.
15:44:45 It may be the path to recovery, but it won't let you see that it was marked dead.
15:44:47 I've resilvered that device with a spare and detached it.
15:45:03 ahhhh :( I ran that command already, sorry.
15:45:05 yes, but unconfigure will delete the kernel nvme_t.
15:45:09 If it worked.
15:45:10 OK.
15:45:15 trying to configure now
15:45:23 my bad
15:45:40 It is what it is.
15:45:44 Hopefully configure works.
15:45:54 Anyways, gotta go afk. I'll circle back in a few hours.
15:46:11 If it doesn't it SHOULD send a new fault to FMA, AIUI.
15:46:41 It's not clear there was ever a fault to fma.
15:46:46 Did you have things in fmdump -e?
15:46:56 (But I came late so it's hard to follow)
15:47:06 The n_dead is a driver thing, not an fm thing.
15:48:36 danmcd: https://gist.github.com/Smithx10/eb839a3728c6cf118c29b32c33f5e0ad faulted
15:50:03 Hmmm. Now you can likely see if the n_dead is set.
15:50:23 and if n_dead is set, this device is probably bad? time to replace?
15:51:18 To my mind the continual throwing of faults says that, but we can/should check for n_dead anyway.
15:52:26 We need to find its dev_info first. In `mdb -k` utter "::prtconf !grep nvme"
15:54:18 https://gist.github.com/Smithx10/865aad812e90cfa2f3396bced2b09274
15:57:14 So let's make sure we are reaching things properly. Pick one pointer, say fffffd65700666f0, and utter this...
15:58:08 0xfffffd65700666f0::print struct dev_info devi_driver_data | ::print nvme_t n_dead
16:00:32 n_dead = 0 (0)
16:00:49 > fffffd6c3a780c10::print struct dev_info devi_driver_data | ::print nvme_t n_dead; n_dead = 0x1 (B_TRUE)
16:00:53 Aha!
16:01:07 The other ones should also be 0.
16:01:19 yea
16:01:29 looks that way, I didn't loop over them all but the next 3 were
16:01:47 Good. So that one is n_dead.
16:02:01 What does n_dead signify?
16:02:29 Hang on... source link to block-comment explaining coming...
16:03:21 In my experience it means the driver found something it didn't like, something that isn't reasonable.
16:03:33 The last time we had this it was a driver bug.
16:03:59 Check logs to see if it at least said why.
16:04:04 Oh!
16:04:35 rmustacc: is it filed? Fixed? (And if fixed, what PI are you running, Smithx10?)
16:04:49 SunOS ac-1f-6b-a5-ab-f2 5.11 joyent_20220505T001410Z i86pc i386 i86pc
16:05:13 danmcd - yes, illumos 15202 (fenix)
16:05:24 and if fenix does not work here, ah well
16:05:49 jinni illumos 15202
16:05:54 jinni illumos#15202
16:05:55 https://www.illumos.org/issues/15202
16:06:18 That is only in the just-pushed-out 20221201.
16:06:44 That one is specifically related to running an `nvmeadm format` though.
16:07:02 But Smithx10 isn't running that, presumably.
16:07:08 This is part of an active pool.
16:07:21 It was; it's since been removed.
16:08:29 brb, gotta go to target before the missus ends me. lol
16:08:39 Ack.
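Pulling the n_dead check above together in one place: a minimal mdb sketch, assuming a SmartOS global zone. The pointer values are the ones from Smithx10's gist and will differ on any other machine; devi_driver_data on an nvme dev_info node points at the driver's nvme_t soft state, which is why the ::print chain works.

    # On the live kernel:
    mdb -k
    > ::prtconf !grep nvme            # note the dev_info pointer for each nvme instance
    > fffffd65700666f0::print struct dev_info devi_driver_data | ::print nvme_t n_dead
    n_dead = 0 (0)                    # healthy instance
    > fffffd6c3a780c10::print struct dev_info devi_driver_data | ::print nvme_t n_dead
    n_dead = 0x1 (B_TRUE)             # this instance was marked dead by the driver

Repeat the ::print chain for each pointer ::prtconf reported; only the dead instance should show B_TRUE.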
16:08:48 When you return: fmdump -e -u 184b2bc2-a925-40aa-bfc0-be2133e06f62
16:42:57 danmcd I debugged pxe today and it gets stuck executing the image; after recording the boot attempt it tries to execute the image, in core/image.c.
16:48:27 it takes a long time before it returns to the efi shell again
16:52:31 danmcd: Dec 02 15:44:57.0724 ereport.io.service.lost
16:56:38 Add -v to it?
16:57:09 or -V gives more.
16:57:18 (it'll be gist-able.)
16:57:37 https://gist.github.com/Smithx10/2c65e2f9812ca54397152d8673140371
16:58:19 (both -vV ?)
16:58:23 (Or is that the same?)
16:58:32 same
16:58:41 okay. Thanks.
16:58:47 little -v just added the ENA
16:58:48 lemme check nvme.c
16:59:59 Lots of places that throw Lost in nvme.c. Does `grep nvme /var/adm/messages` have anything unusual?
17:02:46 https://gist.github.com/Smithx10/b3ff04b7552daca2da118a4e4f4b9630
17:03:02 nvme4 unable to reset controller
17:05:19 Yeah. Init-time failure... link coming.
17:06:07 https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/io/nvme/nvme.c#L3222-L3230
17:06:29 nvme_reset() is failing.
17:08:23 So WHY it's failing is beyond me at the moment. Also, it's lunchtime here, so AFK for now.
17:54:39 this is the last step before it returns to the efi shell https://pasteboard.co/VUzmG0r9XJ4d.png
18:05:39 danmcd: no problem, that pool is healthy. We also have spare drives on site if you'd like to swap it out at some point.
18:38:01 On the off chance rmustacc might have additional clues to dispense, don't do anything just yet. nvme_reset() failing convinces *me* it needs removal and replacement, but I may be missing something.
18:41:54 danmcd: sounds good, will wait for further instruction.
18:48:08 The first nvme4 reset in your messages isn't all that delayed, but the second one has a 1-minute delay.
18:48:14 Suggesting it's not listening.
19:03:25 rmustacc, whatever the ixgbe/i40e connection problem was for me (months ago), it was either fixed by a newer platform or was otherwise transient. Finally upgraded and rebooted the nodes, and iperf gives a healthy 8.21 Gbits/sec with 30 clients.
19:23:03 That's good news!
20:24:41 well, it can still be a lurking bug that'll trigger randomly =(
20:25:13 anyway, will get back if it fails again =)
20:25:45 does anybody know what pkgin -y upgrade is supposed to return on failure?
20:26:50 apparently nothing-to-change is 0, and upgraded-something is 1?
20:26:55 (return value, that is)
20:33:18 Don't know if @jperkin is around, but his will be the canonical answer.
20:37:13 yeah, I was assuming he'd already be off this late
20:48:50 actually, irrelevant
20:49:00 I should be using the ansible pkgin module
20:49:54 should always do a basic sanity check first with old playbooks
21:11:48 FWIW, the ansible pkgin module seems to lack the ability to restrict which version of a package is installed.
21:17:18 jesse_: yeah, that seems wrong; feel free to raise an issue for that.
21:58:04 Is there a way to block a USB device from being used at boot?
21:58:26 probably the same way we block them for PCI passthrough?
21:58:56 https://movementarian.org/blog/posts/2018-10-26-pci-pass-through-support-with-bhyve-and-smartos/
21:59:25 https://docs.tritondatacenter.com/private-cloud/instances/pass-through-config
22:31:00 danmcd, Smithx10: Actually there is a case where I've seen a reset fail/time out. I'll need to dig a bit and get back to you and see if it's the same.
22:31:55 * danmcd sets aside some popcorn for whenever the digging is done
22:32:11 Thank you, and no rush.
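As a recap before the thread picks back up: the triage path for the lost controller, collapsed into a few commands. A sketch only; the ereport UUID and the nvme4 instance name come from Smithx10's gists and logs, and the note about the reset timing out is an inference from the one-minute delay danmcd points out, not something confirmed against the driver source here.

    fmdump -e -u 184b2bc2-a925-40aa-bfc0-be2133e06f62 -V   # full detail of the ereport.io.service.lost event
    grep nvme4 /var/adm/messages                           # driver messages for the affected instance
    # The failing path is the nvme_reset() call linked above in
    # usr/src/uts/common/io/nvme/nvme.c: the driver appears to wait for the
    # controller to report ready after the reset, and the second attempt only
    # returning after about a minute suggests the device never answers, at
    # which point init fails and the disk never shows up in nvmeadm/diskinfo.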
22:32:31 There was an issue Jordan debugged around a reset that was failing / going out to lunch, and I'd need to figure that out.
22:34:31 Again, take your time. It's almost (Friday) dinnertime here, so really no big rush.
22:35:30 for ipxe, I cannot go further; the last thing executed is get_efi_mmap and it stays there
22:42:21 neirac: so I'm clear, it's still stuck in ipxe.
22:42:30 Yeah, source says so.
22:55:25 danmcd I was trying to check at which statement the pxe boot stops; it was executing the image and then calling get_efi_mmap. It's called 2 times, then the third is stuck
22:55:54 on exit_boot_services
22:56:43 that's in multiboot2, so now I'm trying to get the last return code from get_efi_mmap
23:39:07 gah.. hit that damn usb panic again... forgot I was running a DEBUG kernel
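For anyone retracing neirac's iPXE debugging: upstream iPXE can compile per-object debug output into the binary, which is a less painful way to see how far image execution and the multiboot2/EFI handoff get than inferring it from the EFI shell. A sketch under assumptions: the object names below are guesses based on the source files mentioned here (core/image.c and the multiboot2 code in the Triton fork), and whether the smartos.ipxe build wrapper passes DEBUG through to the underlying make is not confirmed.

    # In the iPXE src directory. DEBUG= takes a comma-separated list of object
    # names (source file basenames), optionally with a :<level> suffix.
    # "image" and "multiboot2" are assumed object names, not verified here.
    make bin-x86_64-efi/ipxe.efi DEBUG=image,multiboot2:3
    # For a legacy BIOS build the target would be something like bin/undionly.kpxe.

The debug messages go to the iPXE console, so they should be visible on the same screen where the boot currently falls back to the EFI shell.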