-
neirac
danmcd: I see you made the last change on ipxe. I was trying to build it by just running make in the src directory; how do I generate the smartos.ipxe? My intention is to see where the PXE boot is failing on my side by compiling with debug flags
-
neirac
I was just missing 2 node packages to begin building
-
danmcd
Sorry I missed this, neirac, but you got it, it seems.
-
neirac
danmcd yes, the setup for the build image took time, but things were documented well enough to follow without problems
-
Smithx10
bahamat, danmcd: looks like one of my CNs is reporting an nvme device missing, but I think it's definitely there... not sure what happened
-
Smithx10
Trying to figure out how to get it back
-
jfqd1
jperkin: would it be possible to bring back the jemalloc package in trunk? It is in 2021Q4 but not in trunk.
-
jperkin
jfqd1: possibly, could you create a ticket? generally you really want to just use libumem on illumos, you'll get better performance and debugging
-
Smithx10
Looks like IPMI shows 10, but slot1 SN BTLN9050022R3P2BGN isn't showing in the OS
-
jfqd1
I also reopened the following ticket:
TritonDataCenter/pkgsrc #340
-
Smithx10
Looks like cfgadm shows 10 nvme connected devices, but diskinfo and nvmeadm shows 9
-
jperkin
I fixed 340, it'll be in the next build
-
Smithx10
Ok, so I've bumped the gist, looks like cfgadm shows that "Jan 1 1970 nvme/hp n /devices/pci@36,0/pci8086,2031@1:Slot238" doesn't have any links to /dev/rdsk but shows connected and configured. Not really sure what the next steps would be
-
jesse_
try unconfiguring and configuring it again?
-
danmcd
Smithx10: fmadm faulty shows the one device is faulted.
-
danmcd
You *could* try and acquit the fault and see if it reappears.
-
danmcd
`fmadm acquit 0180fc24-f033-e01e-e422-c49ee0b99996`
-
danmcd
but if it's really faulty it'll come right back as faulty.
-
Smithx10
still missing,
gist.github.com/Smithx10/fa946cc1c2…cd918945df0c#file-gistfile1-txt-L14 still hasn't shown up, maybe I should bounce it from configured to unconfigured to configured?
-
danmcd
Did it come back as faulty?
-
danmcd
(is there a new fmadm faulty report for this device?)
-
Smithx10
no
-
danmcd
Then try the bump. (You may induce a fault again...)
-
Smithx10
cfgadm -v -c unconfigure pcieb4:Slot238 && cfgadm -v -c configure pcieb4:Slot238 ?
-
danmcd
BTW that overlay fault is caused when portolan service goes down (reboot of HN does this all the time), you can acquit that one too unless your fabrics on that node are faulty.
-
danmcd
Smithx10: I think so? I've not used cfgadm a lot save for USB. Should work the same however.
-
rmustacc
All that will do is attach / detach the device driver.
-
rmustacc
Not knowing more about what's going on, that may be sufficient.
-
Smithx10
but the disk doesn't show up in nvmeadm list or diskinfo. Doesn't have a ln in /dev/rdsk either
gist.github.com/Smithx10/8fa8895823…171b76e1c29a#file-gistfile1-txt-L41
-
Smithx10
I noticed this while looking at zpool status and seeing that device as UNAVAIL.
-
rmustacc
Go into mdb, and figure out if it set the n_dead flag.
-
rmustacc
(I'm sparsely available so don't have the time right now to guide that)
-
Smithx10
Thanks for the pointer
-
danmcd
(I can help you find the n_dead.)
-
Smithx10
I'm giving the unconfigure configure a shot
-
rmustacc
if you don't see a retired flag in prtconf (not -v, iirc) it won't be there.
-
rmustacc
Wait.
-
rmustacc
If you do that you will destroy the state.
-
rmustacc
It may be the path to recovery but it won't let you see that it was marked dead.
-
Smithx10
I've resilvered that device with a spare, and detached it
-
Smithx10
ahhhh :( I ran that command alright sorry
-
rmustacc
yes, but unconfigure will delete the kernel nvme_t.
-
rmustacc
If it worked.
-
rmustacc
OK.
-
Smithx10
already* trying to configure now
-
Smithx10
my bad
-
rmustacc
It is what it is.
-
rmustacc
Hopefully configure works.
-
rmustacc
Anyways, gotta go afk. I'll circle back in a few hours.
-
danmcd
If it doesn't it SHOULD send a new fault to FMA, AIUI.
-
rmustacc
It's not clear there was ever a fault to fma.
-
rmustacc
Did you have things in fmdump -e?
-
rmustacc
(But I came late so hard to follow)
-
rmustacc
The n_dead is a driver thing, not an fm thing.
-
danmcd
Hmmm. Now you can likely see if the n_dead is set.
-
Smithx10
and if n_dead is set, this device is probably bad? time to replace?
-
danmcd
To my mind the continual throwing of faults says that, but we can/should check for n_dead anyway.
-
danmcd
We need to find its ddi_info first. In `mdb -k` utter "::prtconf !grep nvme"
-
danmcd
So let's make sure we are reaching things properly. Pick one pointer, say fffffd65700666f0 and utter this...
-
danmcd
0xfffffd65700666f0::print struct dev_info devi_driver_data |::print nvme_t n_dead
-
Smithx10
n_dead = 0 (0)
-
Smithx10
> fffffd6c3a780c10::print struct dev_info devi_driver_data |::print nvme_t n_dead
n_dead = 0x1 (B_TRUE)
-
danmcd
Aha!
-
danmcd
The other ones should also be 0
-
Smithx10
yea
-
Smithx10
looks that way, i didn't loop over them all but the next 3 were
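
A minimal sketch of looping over every pointer instead of spot-checking, assuming the dev_info addresses have already been copied out of the `::prtconf !grep nvme` output by hand. The helper names (`n_dead_dcmd`, `parse_n_dead`, `check_all`) are hypothetical; the dcmd pipeline and the two output shapes are the ones quoted above.

```python
import re
import subprocess

# dcmd pipeline quoted in the chat; {ptr} is a dev_info address from ::prtconf.
DCMD = "{ptr}::print struct dev_info devi_driver_data | ::print nvme_t n_dead"


def n_dead_dcmd(ptr):
    """Build the mdb command for one dev_info pointer given as a hex string."""
    if not ptr.startswith("0x"):
        ptr = "0x" + ptr
    return DCMD.format(ptr=ptr)


def parse_n_dead(output):
    """Return True/False for output like 'n_dead = 0x1 (B_TRUE)', None if absent."""
    m = re.search(r"n_dead\s*=\s*(0x[0-9a-fA-F]+|\d+)", output)
    if m is None:
        return None
    return int(m.group(1), 0) != 0


def check_all(ptrs):
    """Feed each dcmd to `mdb -k` on stdin (assumes root on the affected node)."""
    results = {}
    for ptr in ptrs:
        proc = subprocess.run(["mdb", "-k"], input=n_dead_dcmd(ptr) + "\n",
                              capture_output=True, text=True)
        results[ptr] = parse_n_dead(proc.stdout)
    return results
```

Calling `check_all(["fffffd65700666f0", "fffffd6c3a780c10"])` on the node would map each pointer to True, False, or None (None when mdb printed nothing parseable, e.g. the pointer was not an nvme instance).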
-
danmcd
Good. So that one is n_dead
-
Smithx10
What does n_dead signify ?
-
danmcd
Hang on... source link to block-comment explaining coming...
-
rmustacc
In my experience it means the driver found something it didn't like which isn't reasonable.
-
rmustacc
The last time we had this it was a driver bug.
-
rmustacc
Check logs to see if it at least said why.
-
danmcd
Oh!
-
danmcd
rmustacc: is it filed? Fixed? (And if fixed, what PI are you running Smithx10?)
-
Smithx10
SunOS ac-1f-6b-a5-ab-f2 5.11 joyent_20220505T001410Z i86pc i386 i86pc
-
andyf
danmcd - yes, illumos 15202 (fenix)
-
andyf
and if fenix does not work here, ah well
-
danmcd
jinni illumos 15202
-
danmcd
jinni illumos#15202
-
danmcd
That is only in just-pushed-out 20221201
-
andyf
That one is specifically related to running an `nvmeadm format` though.
-
danmcd
But Smithx10 isn't running that, presumably.
-
danmcd
This is part of an active pool.
-
Smithx10
It was; it's since been removed
-
Smithx10
brb, gotta go to target before the missus ends me. lol
-
danmcd
Ack.
-
danmcd
When you return: fmdump -e -u 184b2bc2-a925-40aa-bfc0-be2133e06f62
-
neirac
I debugged pxe today and it gets stuck executing the image: after recording the boot attempt it tries to execute the image, core/image.c
-
neirac
takes a long time before it returns again to the efi shell
-
Smithx10
danmcd: Dec 02 15:44:57.0724 ereport.io.service.lost
-
danmcd
Add -v to it?
-
danmcd
or -V gives more.
-
danmcd
(it'll be gist-able.)
-
danmcd
(both -vV ?)
-
danmcd
(Or is that the same?)
-
Smithx10
same
-
danmcd
okay. Thanks.
-
Smithx10
little -v just added the ENA
-
danmcd
lemme check nvme.c
-
danmcd
Lots of places that throw Lost in nvme.c. Does `grep nvme /var/adm/messages ` have anything unusual?
-
Smithx10
nvme4 unable to reset controller
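
A rough sketch of the same check as the `grep nvme /var/adm/messages` suggested above, grouping lines per driver instance so a failing controller stands out. The function names are hypothetical, and the only message text assumed is the "unable to reset controller" string quoted here; the rest of the log line format is not relied on.

```python
import re
from collections import defaultdict

# Matches a driver instance name such as "nvme4" anywhere in a log line.
INSTANCE = re.compile(r"\b(nvme\d+)\b")


def nvme_messages(lines):
    """Group log lines by nvme instance name, preserving their order."""
    grouped = defaultdict(list)
    for line in lines:
        m = INSTANCE.search(line)
        if m:
            grouped[m.group(1)].append(line.rstrip("\n"))
    return dict(grouped)


def reset_failures(lines):
    """Return the instances that logged 'unable to reset controller'."""
    return sorted(inst for inst, msgs in nvme_messages(lines).items()
                  if any("unable to reset controller" in msg for msg in msgs))
```

On the node this could be run as `reset_failures(open("/var/adm/messages"))`, which iterates the file line by line.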
-
danmcd
Yeah. init-time failure... link coming
-
danmcd
nvme_reset() is failing.
-
danmcd
So WHY it's failing is beyond me at the moment. Also, it's lunchtime here, so AFK for now.
-
neirac
this is the last step before it returns to the efi shell
pasteboard.co/VUzmG0r9XJ4d.png
-
Smithx10
danmcd: no problem, that pool is healthy. We also have spare drives on site if you'd like to swap it out at some point
-
danmcd
On the off chance rmustacc might have additional clues to dispense don't do anything just yet. nvme_reset() failing convinces *me* it needs removal and replacement, but I may be missing something.
-
Smithx10
danmcd: sounds good, will wait for further instruction.
-
danmcd
The first nvme4 reset in your message isn't all that delayed, but the second one has a 1min delay.
-
danmcd
Suggesting it's not listening.
-
jesse_
rmustacc, whatever the ixgbe/i40e connection problem was for me (months ago), it was either fixed by newer platform or otherwise transient. Upgraded and rebooted the nodes finally and iperf gives healthy 8.21 Gbits/sec with 30 clients
-
danmcd
That's good news!
-
jesse_
well, it can still be a lurking bug that'll trigger randomly=(
-
jesse_
anyway, will get back if it fails again=)
-
jesse_
does anybody know what pkgin -y upgrade is supposed to return on failure?
-
jesse_
apparently nothing-to-change is 0, and upgraded-something is 1?
-
jesse_
(return value, that is)
-
danmcd
Don't know if @jperkin is around but his will be the canonical answer.
-
jesse_
yeah, I was assuming he'd be off this late already
-
jesse_
actually, irrelevant
-
jesse_
I should be using the ansible pkgin module
-
jesse_
should always do a basic sanity check first with old playbooks
-
pjustice
FWIW, the ansible pkgin module seems to lack the ability to restrict which version of a package is installed.
-
jperkin
jesse_: yeh that seems wrong, feel free to raise an issue for that
-
Smithx10
Is there a way to block a usb device from being used at boot?
-
Smithx10
probably the same way we block them for PCIP?
-
rmustacc
danmcd, Smithx10: Actually there is a case where I've seen a reset fail/time out. i'll need to dig a bit and get back to you and see if it's the same.
-
» danmcd sets aside some popcorn for whenever the digging is done
-
danmcd
Thank you, and no rush.
-
rmustacc
There was an issue Jordan debugged around a reset that was failing / going out to lunch and i'd need to figure that out.
-
danmcd
Again, take your time. It's almost (friday) dinnertime here so really no big rush.
-
neirac
for ipxe, I cannot go further: the last thing executed is get_efi_mmap and it stays there
-
danmcd
neirac: so I'm clear, it's still stuck in ipxe.
-
danmcd
Yeah, source says so.
-
neirac
danmcd I was trying to check at which statement the pxe stopped: it was executing the image and then calling get_efi_mmap, which is called twice, then the third call gets stuck
-
neirac
on exit_boot_services
-
neirac
that's on multiboot2 so now I'm trying to get the last return code from get_efi_mmap
-
jbk
gah.. hit that damn usb panic again... forgot i was running a DEBUG kernel