-
jvl
danmcd: it happened again overnight, here's a screenshot from the iDRAC console
pasteboard.co/33PGQCyhNjJr.png
-
nikolam
I have put "route add -inet default _IP_address_" in a post-boot script in SmartOS, intending for the global zone 'admin' interface's internet access to go through an alternate gateway on the admin network instead of the one given by DHCP.
-
nikolam
The result is that when I run traceroute in the global zone, one time it goes through the gateway it got via DHCP and the next time through the alternate IP address given in the route add command.
-
nikolam
How do I make it always use the alternate IP as the gateway, instead of the one given to the admin interface through DHCP?
-
jvl
danmcd: I've reinstalled to those two SSDs after sdc-factoryreset and made a striped pool out of those 12x4T SATA drives. dd if=/dev/zero of=/data/testfile bs=1M panics the box after a while. I've got a video of the console, and a screenshot of the panic and dump is at
pasteboard.co/oJjGhng7CTFC.png. It is reproducible
-
jvl
the quicktime video is 20M, but I have no idea where to share that
-
jvl
there's nothing in DRAC and the box was rock solid under Linux for 14 days, so I don't have a reason to doubt the HW.
-
nikolam
jvl, where did you create the pool? On SmartOS/illumos or OpenZFS/ZoL/Linux? There might be some feature flags on the pool that are in OpenZFS and not available on illumos.
-
nikolam
Plus, having a striped pool of many drives sounds like an invitation to data loss. Maybe several raidz vdevs?
-
jvl
it's a fresh install from the latest SmartOS PI
-
jvl
only smartos
-
jvl
box was wiped clean after Linux; it's on loan from the vendor for something else. I'm just testing SmartOS there, primarily the ucode update, because it has a new enough Intel CPU, but I ran into trouble with lmrc
-
jvl
nikolam: this is really just testing, I'm not going to use it for anything and need to return it on thursday
-
nikolam
why striped 12 drives?
-
jvl
having 12 striped drives shouldn't lead to null pointer dereference in lmrc IMO
-
nikolam
sysinfo?
-
nikolam
jvl, sysinfo to paste somewhere ?
-
nikolam
jvl, the message you shared previously says exactly that pool zones encountered an I/O failure. So with those drives, striping 12 drives seems like a bad choice.
-
nikolam
If speed is needed, 6 mirrored-pair vdevs might help. Or 4 raidz vdevs for capacity. But 4 three-drive mirrors or 2 six-drive raidz2 vdevs are more sound.
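
(Annotation, not part of the original conversation.) The layouts mentioned above can be compared with some rough arithmetic. This is only a sketch: it assumes the 12 nominal 4 TB drives from the log, assumes "4 raidz VDEVs" means four 3-disk raidz1 vdevs, and ignores ZFS metadata, padding and reserved space, so real usable capacity will be lower. The `Layout` class and its field names are made up for illustration.

```python
from dataclasses import dataclass

DRIVES = 12      # "12x4T SATA drives" from the log
DRIVE_TB = 4     # nominal size per drive, in TB


@dataclass(frozen=True)
class Layout:
    name: str
    vdevs: int            # number of top-level vdevs in the pool
    disks_per_vdev: int   # disks in each vdev
    redundancy: int       # disks each vdev can lose without data loss

    def usable_tb(self) -> int:
        # Nominal capacity only: ignores ZFS metadata, padding and slop space.
        return self.vdevs * (self.disks_per_vdev - self.redundancy) * DRIVE_TB


LAYOUTS = [
    Layout("12-disk stripe", 12, 1, 0),
    Layout("6 x 2-way mirror", 6, 2, 1),
    Layout("4 x 3-disk raidz1 (assumed)", 4, 3, 1),
    Layout("4 x 3-way mirror", 4, 3, 2),
    Layout("2 x 6-disk raidz2", 2, 6, 2),
]

for layout in LAYOUTS:
    # Every layout must account for all 12 drives.
    assert layout.vdevs * layout.disks_per_vdev == DRIVES
    print(f"{layout.name:28} usable ~{layout.usable_tb():2d} TB, "
          f"survives {layout.redundancy} failure(s) per vdev")
```

Losing any single vdev loses the whole pool, which is why the plain stripe (redundancy 0) is the risky option here regardless of its capacity.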
-
Woodstock
jvl: do you have a core dump of that lmrc null pointer dereference?
-
Woodstock
jvl: and do you know what model of the HBA you have in that box?
-
jvl
nikolam: sysinfo here:
pastebin.com/4nApyXyM
-
nikolam
jvl, with 12 striped drives, YMMV
-
jvl
nikolam: previously I had raidz2 with 12 drives, the SSDs as mirrored log devices, and one spare (as suggested by the installer). I was seeing write errors on all of them and lost I/O on that box, so it seems that all drives went away. The first screenshot I posted up there suggests that the whole HBA tried to reset, but never came back out of it.
-
jvl
Woodstock: card is DELL PERC H700
-
jvl
looking for the dump
-
jvl
I don't have any vmdump.x under /var
-
jvl
/var/crash/volatile is empty
-
jvl
Woodstock: netbooting linux to get you better HBA name from lspci
-
jvl
Woodstock: 18:00.0 RAID bus controller: Broadcom / LSI MegaRAID 12GSAS/PCIe Secure SAS39xx Subsystem: Dell PERC H755 Adapter
-
Woodstock
hm
-
jvl
DRAC sees it as PERC H755 Adapter
-
jvl
too
-
Woodstock
ok, thanks
-
Woodstock
so i assume you have your zones pool attached to that?
-
jvl
this is a 2U server with 4x3 3.5" bays in the front and two 2.5" bays in the back.
-
jvl
I'm booting from those SSDs, but in DRAC and even in bios of the PERC it seems that there are just two backplanes for drives connected to the PERC
-
Woodstock
hm yes, looking at that screenshot it's quite obvious that you don't get a dump because the same panic in lmrc happens again in the dump code path.
-
Woodstock
i think i see what the problem is, and it makes me wonder why i never saw that happen during testing.
-
Woodstock
jvl: are you aware of this issue?
illumos.org/issues/15935
-
Woodstock
it still shouldn't panic, though.
-
jvl
Woodstock: didn't see the issue before
-
jvl
seems like the issue I had at first with that huge raidz2 pool with log devices.
-
jvl
first instance in 19 minutes after boot, second in an hour or so
-
jvl
We're returning this particular piece to the vendor tomorrow, but we'll be buying those for linux next year, so I can test some more if necessary
-
jvl
seems like we have two more of these PE750xs already in to act as backup servers, but I don't have the time schedule when they'll go into production.
-
jvl
if I can get some more info for debugging, I'm available
-
jvl
fwiw, I'm seeing it on firmware ControllerFirmwareVersion 52.21.1-5149; DELL has 52.26.0-5179 now.
-
jvl
same result with the latest PERC firmware (sorry I said I had the latest before, I hadn't checked again in those couple of weeks)
-
jvl
stack trace for the latest firmware looks longer
snipboard.io/SXHCnT.jpg
-
jvl
"same result" here means box panics on that dd. I've yet to confirm the behavior with drives missing from the BUS
-
danmcd
(Thank you @Woodstock for jumping in on this.)
-
jvl
Woodstock: with regards to that
illumos.org/issues/15935 - it was created on 28th Sept; DELL released 52.21.1-5149 on 5th Oct, which should solve controller hangs with "improved DDR4 memory settings"
-
jvl
that's one version before current.
-
jvl
w
-
jvl
I'm still waiting for the disks to disappear with 52.26.0-5179
-
Woodstock
jvl: it doesn't
-
Woodstock
we've tested that. we even had a call with dell about this issue, but there hasn't been any progress so far
-
Woodstock
i'm going to fix that panic, and if you like you can test it :)
-
jvl
Woodstock: sure thing with the test!
-
jvl
I'm booting from an SSD mirrored pool, so I guess I'll need an ISO for piadm (or DRAC)