-
xmerlin
I am installing SmartOS on some Dell 7525 all-NVMe servers equipped with Mellanox CX-4 LX cards, and I am seeing intermittent errors such as 'PCI1318 A fatal error was detected on a component at bus XXX device X function X' on the Mellanox card and on the default-installed dual-gigabit card. Has anyone experienced similar issues? Do you recommend Mellanox CX4 25Gbit cards or CX5 25Gbit cards? Do you have any feedback on the 25Gbit Intel alternatives? I saw that the XXV710-AM2 is on the HCL but the E810-XXVAM2 is not, correct? Essentially, which dual-port 25Gbit card do you recommend with SmartOS?
-
nwilkens
xmerlin: we just spun up similar servers, Dell 7525, but we're deploying with CX-5 2 x 100gb. These have not been burned in yet, so we won't have much feedback for another week or so.
-
nwilkens
If I were looking for alternatives, I'd look at Chelsio too -- maybe the T6225-CR (we're not running these, so no actual usage recommendation here).
-
xmerlin
nwilkens, From the Chelsio website, I see official documentation on the SmartOS driver... Is it a driver developed directly by the manufacturer? Does anyone have feedback on the 25Gbit Chelsio cards & SmartOS?
-
jbk
I believe they've contributed the drivers to illumos, though I've not used them to date
-
jbk
we've got CX-4 and CX-5 cards in our lab (we're not running smartos, but different illumos distro) and haven't had any issues like that
-
xmerlin
I'm asking because I'm starting to hypothesize that some of the problems we've encountered may be due to the Mellanox CX4-LX cards (or their firmware). We are migrating some Supermicro AS-2123US-TN24R25M servers (dual Epyc Rome, full NVMe, Mellanox CX4-LX) to Dell 7525 servers because we've encountered various issues that prevent correct bus enumeration and NVMe disk detection during some cold boots (Supermicro has been unable to solve the problem for 3 years). We put the first 7525 server into production without encountering any problems, but on servers 2 and 3 we noticed the PCI1318 error on the Mellanox card and on the integrated Broadcom card. As with the Supermicro servers, the error does not always appear, and in the case of the Dell server, it halts the boot before the operating system attempts to start. I'm trying to find an answer/solution before I go crazy.
-
rmustacc
These are errors that illumos is reporting?
-
xmerlin
PCI1318 is the iDRAC error ...if I reboot the server all works as expected
-
xmerlin
on Supermicro, sometimes the NVMe disks are not detected and SmartOS goes into install mode
-
danmcd
xmerlin: I'm assuming you're running the very latest PIs? (More than a few PCIe fixes have landed in the past 6 mos.)
-
xmerlin
right
-
xmerlin
the last one
-
xmerlin
danmcd, When the problem occurs, often at reboot, I see for a moment a message like 'apic error on interrupt' in SmartOS.
-
xmerlin
on DELL servers
-
danmcd
Hmmm... I assume you're using X2APIC and your BIOS is set to that... which is weird.
-
xmerlin
yes, I have x2apic enabled
-
» danmcd wonders if acpica needs an update... uggh.
-
xmerlin
danmcd, I'll try to disable x2apic
-
rmustacc
xmerlin: So these errors happen *before* illumos ever runs?
-
rmustacc
Just to make sure I understand this correctly?
-
rmustacc
Or does that error show up sometime after boot?
-
xmerlin
rmustacc, Correct, the error 'PCI1318 A fatal error was detected on a component at bus 225 device 0 function 0.' is reported before the operating system starts. If I proceed with the boot, the system starts, but then I notice that a message like the one posted above appears. I'm starting to think that the problems on the Supermicro servers and these Dell servers are connected and are related to the Mellanox CX-4 LX cards, which are the only constant (the Supermicro servers are based on Epyc 7402 Rome and have Intel 5620 disks, while the Dell servers are 7525 with 7443 Milan processors and Western Digital SN840 disks).
-
rmustacc
Hmm. So to me that reads like something got a PCIe AER.
-
rmustacc
If you are in a position to, I would try to ask Dell what AER was generated or what that really means.
-
rmustacc
But I'm not sure how to advise that given this is ahead of us running and in their firmware.
-
rmustacc
These are 1P or 2P systems?
-
xmerlin
2P systems
-
rmustacc
Actually you said you get a normal boot right?
-
rmustacc
Sometimes.
-
xmerlin
right
-
rmustacc
OK, bus 225 is e1.
-
rmustacc
So we can tell you what device that is with certainty if that helps.
-
rmustacc
If you run /usr/lib/pci/pcieadm show-devs it'll likely tell you.
-
rmustacc
e0 is going to be one of the 4 root complexes that exist on each CPU.
-
rmustacc
(Most likely)
-
xmerlin
rmustacc, I've already found the devices ...mellanox cx4 and the broadcom network card
-
xmerlin
both on cpu 2
-
rmustacc
Only one of them will be that bus e1 though.
-
rmustacc
Unless Dell did something really weird. But fair.
-
xmerlin
mellanox
-
xmerlin
-PCI1318 A fatal error was detected on a component at bus 129 device 0 function 0.
-
xmerlin
-PCI1318 A fatal error was detected on a component at bus 225 device 0 function 0.
-
xmerlin
every time I get two messages
-
xmerlin
first one is broadcom
-
xmerlin
the second one mellanox cx4
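
The bus numbers in the two iDRAC messages are decimal, while rmustacc's "bus 225 is e1" is the same number in hex. A minimal sketch of that conversion, assuming the message wording quoted in this log and assuming that pcieadm show-devs lists devices as hex bus/device/function (the helper name idrac_to_bdf is hypothetical):

```python
import re

# Matches the "bus N device N function N" part of the iDRAC PCI1318 text
# as quoted in this log (decimal numbers).
PCI1318_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def idrac_to_bdf(message: str) -> str:
    """Return a hex bus/device/function string for an iDRAC PCI1318 message.

    The hex "b/d/f" layout is an assumption about how the devices are
    listed on the illumos side; adjust it if your output differs.
    """
    match = PCI1318_RE.search(message)
    if match is None:
        raise ValueError("no bus/device/function found in message")
    bus, dev, func = (int(field) for field in match.groups())
    return f"{bus:x}/{dev:x}/{func:x}"


if __name__ == "__main__":
    for msg in (
        "PCI1318 A fatal error was detected on a component at bus 129 device 0 function 0.",
        "PCI1318 A fatal error was detected on a component at bus 225 device 0 function 0.",
    ):
        print(idrac_to_bdf(msg))
```

With the two messages above, bus 129 maps to 81 and bus 225 maps to e1, which can then be looked up in the device list.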
-
danmcd
xmerlin: DO NOT DISABLE x2apic!
-
xmerlin
danmcd, ok
-
danmcd
There will come a day very soon where you MUST MUST MUST use x2apic.
-
danmcd
(sorry for the shouting.... my goal is to get everyone whose bios is NOT set to x2apic for some reason to set it TO x2apic.)
-
danmcd
xmerlin: Also, sorry for not asking what rmustacc did... if it's happening before illumos it's gonna be an issue with your HW vendor.
-
xmerlin
danmcd, Likely, but I'm having similar problems with two different manufacturers. Moreover, the latest issue is on two servers. I haven't received the other two machines yet, but I'm starting to think the worst.
-
xmerlin
It's as if, after a reboot, one of the two cards remains in a state that triggers these issues. With Red Hat Linux, I haven't encountered similar problems.
-
xmerlin
The major difference between the Supermicro servers and the Dell servers is that while the Supermicro servers require a shutdown with power removal / a BMC reset to make the problem disappear... the Dell servers only need a reboot to no longer see the problem.
-
rmustacc
Hmm. So this suggests that maybe there's some state that they're not properly resetting here.
-
rmustacc
I guess arekinath isn't here?
-
rmustacc
Who has a similar combo.