13:01:05 I am installing SmartOS on some Dell 7525 full-NVMe servers equipped with Mellanox CX-4 LX cards, and I am encountering intermittent problems of the type 'PCI1318 A fatal error was detected on a component at bus XXX device X function X' on the Mellanox card and the default-installed dual gigabit card. Has anyone experienced similar issues? Do you recommend Mellanox CX4 25Gbit cards or CX5 25Gbit cards? ...do you have any feedback on the 25Gbit Intel alternatives? I saw that the XXV710-AM2 are on the HCL but not the E810-XXVAM2 ...correct? ...essentially, which dual port 25Gbit card do you recommend with SmartOS?
14:35:26 xmerlin: we just spun up similar servers, Dell 7525, but we're deploying with CX-5 2 x 100gb. These have not been burned in yet, so we won't have much feedback for another week or so.
14:38:25 If I were looking for alternatives, I'd look at Chelsio too -- maybe the T6225-CR (we're not running these, so no actual usage recommendation here).
14:55:30 nwilkens, From the Chelsio website, I see official documentation on the SmartOS driver... Is it a driver developed directly by the manufacturer? Does anyone have feedback on the 25Gbit Chelsio cards & SmartOS?
14:59:02 I believe they've contributed the drivers to illumos, though I've not used them to date
14:59:51 we've got CX-4 and CX-5 cards in our lab (we're not running smartos, but different illumos distro) and haven't had any issues like that
15:06:35 I'm asking because I'm starting to hypothesize that some of the problems we've encountered may be due to the Mellanox CX4-LX cards (or their firmware). We are migrating some Supermicro AS-2123US-TN24R25M servers (dual Epyc Rome - full NVMe - Mellanox CX4-LX) to DELL 7525 servers because we've encountered various enumeration issues that prevent the correct bus enumeration and NVMe disk detection during some cold boots (Supermicro has been unable to solve the problem for 3 years).
We put the first 7525 server into production without encountering any problems, but on servers 2 and 3, we noticed that PCI1318 error on the Mellanox card and on the integrated Broadcom card. Similarly to the Supermicro servers, the error does not always present itself, and in the case of the DELL server, it blocks the boot before the operating system startup attempt. I'm trying to find an answer/solution before I go crazy.
15:19:02 These are errors that illumos is reporting?
15:20:03 PCI1318 is the iDRAC error ...if I reboot the server all works as expected
15:20:29 on supermicro sometimes nvme disks are not detected and smartos goes in install mode
15:33:21 xmerlin: I'm assuming you're running the very latest PIs? (More than a few PCIe fixes have landed in the past 6 mos.)
15:53:11 right
15:53:17 the last one
16:47:05 danmcd, When the problem occurs, often at reboot, I see for a moment a message like 'apic error on interrupt' in SmartOS.
16:47:15 on DELL servers
16:47:40 Hmmm... I assume you're using X2APIC and your BIOS is set to that... which is weird.
16:48:00 yes I've x2apic
16:48:13 * danmcd wonders if acpica needs an update... uggh.
16:49:16 danmcd, I'll try to disable x2apic
16:49:37 xmerlin: So these errors happen *before* illumos ever runs?
16:49:44 Just to make sure I understand this correctly?
16:49:55 Or does that error show up sometime after boot?
16:56:27 rmustacc, Correct, the error 'PCI1318 A fatal error was detected on a component at bus 225 device 0 function 0.' is reported before the operating system starts. If I proceed with the boot, the system starts, but then I notice that a message like the one posted above appears.
I'm starting to think that the problems on the Supermicro servers and these Dell servers are connected and are related to the Mellanox CX-4 LX cards, which are the only constant (the Supermicro servers are based on Epyc 7402 Rome and have Intel 5620 disks, while the Dell servers are 7525 with 7443 Milan processors and Western Digital SN840 disks).
16:59:16 Hmm. So to me that reads like something got a PCIe AER.
16:59:28 If you are in a position, I would try to ask Dell what AER was generated or what that really means.
16:59:41 But I'm not sure how to advise that given this is ahead of us running and in their firmware.
17:00:15 These are 1P or 2P systems?
17:01:01 2P systems
17:01:05 Actually you said you get a normal boot right?
17:01:07 Sometimes.
17:01:11 right
17:01:13 OK, bus 225 is e1.
17:01:21 So we can tell you what device that is with certainty if that helps.
17:01:36 If you run /usr/lib/pci/pcieadm show-devs it'll likely tell you.
17:01:47 e0 is going to be one of the 4 root complexes that exist on each CPU.
17:01:50 (Most likely)
17:02:05 rmustacc, I've already found the devices ...mellanox cx4 and the broadcom network card
17:02:14 both on cpu 2
17:02:32 Only one of them will be that bus e1 though.
17:02:41 Unless Dell did something really weird. But fair.
17:02:46 mellanox
17:02:55 -PCI1318 A fatal error was detected on a component at bus 129 device 0 function 0.
17:02:56 -PCI1318 A fatal error was detected on a component at bus 225 device 0 function 0.
17:03:02 every time I've two messages
17:03:09 first one is broadcom
17:03:16 the second one mellanox cx4
17:03:52 xmerlin: DO NOT DISABLE x2apic!
17:03:58 danmcd, ok
17:04:06 There will come a day very soon where you MUST MUST MUST use x2apic.
17:04:27 (sorry for the shouting.... my goal is to get everyone whose bios is NOT set to x2apic for some reason to set it TO x2apic.)
17:06:00 xmerlin: Also, sorry for not asking what rmustacc did...
if it's happening before illumos it's gonna be an issue with your HW vendor.
17:09:03 danmcd, Likely, but I'm having similar problems with two different manufacturers. Moreover, the latest issue is on two servers. I haven't received the other two machines yet, but I'm starting to think the worst.
17:10:25 It's as if, after a reboot, one of the two cards remains in a state that triggers these issues. With Red Hat Linux, I haven't encountered similar problems.
17:15:32 The major difference between the Supermicro servers and the Dell servers is that while the Supermicro servers require a shutdown with power removal / a BMC reset to make the problem disappear... the Dell servers only need a reboot to no longer see the problem.
17:17:15 Hmm. So this suggests that maybe there's some state that they're not properly resetting here.
17:17:24 I guess arekinath isn't here?
17:17:34 Who has a similar combo.
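
Editor's note on the "bus 225 is e1" step in the log: the iDRAC PCI1318 messages quoted above give the bus number in decimal, while rmustacc refers to the same bus in hex (e1). A minimal sketch of that conversion follows. The function name `idrac_to_hex_bdf` is hypothetical, and the `bus/device/function` output layout is only an assumption about what is convenient to compare against `pcieadm show-devs`; check it against the actual output on your system.

```python
import re

# Matches the decimal bus/device/function triple in an iDRAC PCI1318 message.
_BDF_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def idrac_to_hex_bdf(message: str) -> str:
    """Return the bus/device/function from a PCI1318 message, in hex.

    Hypothetical helper: the slash-separated layout is an assumption,
    not a documented pcieadm format.
    """
    match = _BDF_RE.search(message)
    if match is None:
        raise ValueError("no 'bus N device N function N' triple found")
    bus, dev, func = (int(field) for field in match.groups())
    return f"{bus:x}/{dev:x}/{func:x}"


if __name__ == "__main__":
    # The two messages quoted in the log.
    for line in (
        "PCI1318 A fatal error was detected on a component at bus 129 device 0 function 0.",
        "PCI1318 A fatal error was detected on a component at bus 225 device 0 function 0.",
    ):
        print(idrac_to_hex_bdf(line))
```

For the two messages in the log this gives 81/0/0 (reported as the Broadcom card) and e1/0/0 (reported as the Mellanox CX4).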