00:09:52 Smithx10: I haven't looked through everything yet, but there's definitely a counter related to this that is worrying: `i40e:1:trqpair_tx_7:tx_err_nodescs 6218704`
00:10:09 that's when a Tx is attempted but there are no descriptors available
00:10:39 you could run something like the following to watch this counter: `kstat -p :::tx_err_nodescs 1`
00:12:15 yeah, on a machine that I couldn't get it to happen on, they are all 0
00:12:16 also, given how small the `tx_descriptors` (total descriptors that have been completed over the ring's lifetime) is for that trqpair, it could be a sign of a frozen ring...
00:13:33 This machine had a bunch of large bhyve instances on it; I wonder if memory pressure could have played a factor
00:14:56 Smithx10: on the problem host, you could try running the following for a while: `kstat -p :::tx_descriptors :::tx_err_nodescs 1`
00:15:11 if the first count stays the same but the second count goes up, you have a frozen ring
00:15:51 hmmm, memory pressure could maybe do this, I'd have to look at the code to refresh my memory
00:16:18 I think out of a TB there is only 30 GB free
00:17:21 `mdb -ke '::memstat'` will give you a general idea of where things are
00:19:29 rzezeski: frozen ring
00:23:43 https://gist.github.com/Smithx10/8babe2510393ed3298e847eac873e884#file-memory
00:24:55 Low on memory too; think if I freed up some memory the descriptor count would go up?
00:34:04 Smithx10: worth a try to see if it unsticks the ring.
00:55:51 @sommerfeld yeah, gonna move one of these like 80 GB machines over and see what happens
01:07:44 rzezeski: that behavior usually indicates a bug in the driver?
01:13:03 I'm trying to page the i40e code back in while also working on other stuff. But based on my initial look this probably isn't being caused by low memory (though it could be). At the place where we increment this counter the mblk and TCB are already allocated; all we need is enough free Tx descriptors to send the packet. But if we don't have enough we bail, increment that counter, and put the mac client in flow control...
01:13:21 until the device sends some data and notifies the driver so it can recycle the descriptors
01:14:01 In the past I've diagnosed several i40e device ring freezes, where the hardware refused to make progress because of malformed descriptor data
01:14:16 I'm not saying this is a ring freeze, but it has the markings of one
01:14:57 And the reason this is intermittent is probably that I only saw this count on one of the rings. So depending on how things fall out, some connections land on that ring while others land on the other 6.
01:26:54 makes sense
01:33:39 To debug what the malformed data was, did you dump what was in those descriptors?
01:51:16 yes, I used mdb to dump the memory and then wrote a program to decipher it, let me see if I can resurrect those tools
02:03:13 https://smartos.org/bugview/OS-7492
02:03:14 → OS-7492: i40e Tx freeze when b_cont chain exceeds 8 descriptors (Resolved) | https://github.com/joyent/illumos-joyent/commit/0d3f2b61dcfb18edace4fd257054f6fdbe07c99c
02:04:19 that's a writeup of one of the i40e freeze issues I worked on; the first 4 mdb commands are used to extract the Tx descriptor data. If you want to try to get me that data I could run it through my program. None of that holds customer data, it's purely metadata about the packet being sent
02:11:45 That second command, they all came back 1024. I used the addr for i40e1, since the kstat was i40e:1:trqpair_tx_7
02:12:20 Do I need to specify tx7 in that second command somehow?
I'm trying to grok it but small brain
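
(A quick aside on the 01:13 messages above: the behavior rzezeski describes is a standard NIC Tx flow-control pattern. The sketch below is not the illumos i40e source; every structure and function name in it is hypothetical, and it only illustrates the bail-and-block shape he is talking about.)

```c
/*
 * Hypothetical sketch (not the i40e source) of the path described at
 * 01:13: the mblk and TCB are already allocated, and only the free
 * descriptor count decides whether we send or bail into flow control.
 */
typedef struct fake_tx_ring {
	unsigned int	free_descs;	/* descriptors currently available */
	unsigned int	err_nodescs;	/* what kstat shows as tx_err_nodescs */
	int		blocked;	/* ring is in flow control */
} fake_tx_ring_t;

/* Returns 0 if queued, -1 if the caller must hold the packet. */
static int
fake_ring_tx(fake_tx_ring_t *ring, unsigned int descs_needed)
{
	if (ring->free_descs < descs_needed) {
		ring->err_nodescs++;	/* the counter being watched above */
		ring->blocked = 1;	/* mac client goes into flow control */
		return (-1);
	}
	ring->free_descs -= descs_needed;
	/* ... fill descriptors, ring the doorbell ... */
	return (0);
}

/* Runs when the device reports Tx completions and descriptors recycle. */
static void
fake_ring_recycle(fake_tx_ring_t *ring, unsigned int descs_done)
{
	ring->free_descs += descs_done;
	if (ring->blocked) {
		ring->blocked = 0;
		/* ... tell mac it may resume sending on this ring ... */
	}
}
```

If the hardware stalls on a bad descriptor, the recycle side never runs, so `tx_descriptors` stops moving while `tx_err_nodescs` climbs -- exactly the signature the `kstat -p :::tx_descriptors :::tx_err_nodescs 1` loop above is watching for.
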
02:13:33 Ran the second command for all of them
02:13:50 (all i40e instances)
02:13:50 if everything is 1024, then it means you don't have this problem
02:14:27 what is i40e_trqpair_t 0t7
02:14:31 "0t7"
02:15:19 0t7 is how you specify the number 7 in base 10 in mdb (by default mdb treats everything as base 16)
02:15:36 it's saying the array is 7 elements long
02:15:40 because there are 7 Tx rings
02:16:09 for each address you see in the first command, you want to run the second command with that address
02:16:22 so you'll print all 7 Tx rings for all of your i40e instances
02:16:37 nice
02:17:12 https://gist.github.com/Smithx10/411a62d1014c1eb7989891a652bf58e2, they all look to be the appropriate size
02:18:22 i40e:1:trqpair_tx_7:tx_descriptors 1297, i40e:1:trqpair_tx_7:tx_err_nodescs 6254849 keeps growing
02:19:14 I've got about 100 GB of memory free now, didn't help
02:20:30 are you sure that mdb output is from the problem host?
02:20:40 none of the `itxs_err_nodescs` are non-zero
02:22:37 same machine https://gist.github.com/Smithx10/411a62d1014c1eb7989891a652bf58e2#file-gistfile2-txt
02:23:25 can you try changing that `0t7` to `0t8`
02:23:30 in the mdb command
02:23:40 interestingly, on a pretty fresh boot, the other box that had this issue pretty much instantly showed i40e:1:trqpair_tx_6:tx_err_nodescs 1681
02:25:11 https://gist.github.com/Smithx10/411a62d1014c1eb7989891a652bf58e2#file-with-8-L9
02:25:24 off by one, always, haha
02:25:35 yea, so there's our problem child
02:25:56 ahh, so print out through the 8th element in the array of rings
02:26:34 I need to just spend time mucking around in mdb and the illumos source
02:27:52 yea, either the old driver had only 7 rings and now it has 8, or I was always doing it wrong in the past and got lucky. But anyways, now you can run the third command in that ticket with the address of the ring
02:28:12 the address of the ring is the second column above
02:28:21 `fffffa09b6c2d560`
02:29:18 `fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_ring | ::array i40e_tx_desc_t 0t1024 | ::printf "0x%x\n" i40e_tx_desc_t cmd_type_offset_bsz ! xargs -L1 echo > /var/tmp/i40e_desc_ring_0xfffffa09b6c2d560.out`
02:29:22 should hopefully work
02:31:57 got that file
02:32:52 okay, send it to me or gist it, it's just a list of hex numbers that I'm going to use as input for my descriptor decoder program
02:34:24 and after you do that, show me the output of this: `fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_head |=D`
02:34:25 https://github.com/Smithx10/debug-images/blob/main/i40e_desc_ring_fffffa09b6c2d560.out
02:35:04 > `fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_head |=D` → 280
02:41:12 https://gist.github.com/rzezeski/8ead9c44ec94982ad07f06eaa2c4b09e
02:41:24 that TSO descriptor looks... busted
02:41:47 that 180, IIRC, is supposed to be the size of the packet
02:41:54 a 180-byte TSO makes no sense
02:42:12 and the driver is probably like wtf do you want me to do with this
02:43:27 Interesting
02:45:38 If I'm right, then this is probably a bug in the driver.
02:49:15 nice
02:49:26 Wonder if it's always been lurking or was recently introduced
02:55:27 well, I haven't used these tools in a while so they could also be doing something wrong, and it's getting late so my eyes are starting to bleed, but certainly this has the smell of a ring freeze, and i40e has a history here
02:56:19 Thank you a million
03:05:33 rzezeski: can I just turn off the frozen Tx ring?.....
I imagine whatever was causing it to happen will likely poke its head up again
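
(For anyone wanting to eyeball that dump themselves: the file produced by the 02:29 mdb pipeline is just one `cmd_type_offset_bsz` quadword per line, so a stand-in decoder is short. The one below is not rzezeski's program; the bit positions are an assumption taken from the XL710 datasheet as commonly reflected in i40e headers -- DTYPE in bits 3:0, data buffer size in bits 47:34, TSO length in bits 47:30 and MSS in bits 63:50 for context descriptors -- and should be double-checked against the driver before trusting the output.)

```c
/*
 * Hypothetical stand-in for the descriptor decoder discussed above.
 * Reads one hex quadword (cmd_type_offset_bsz) per line, as produced
 * by the mdb pipeline from OS-7492, and prints the decoded fields.
 * Bit positions follow the XL710 datasheet; verify against the headers.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define	DTYPE(qw)	((qw) & 0xF)			/* bits 3:0 */
#define	DATA_BUFSZ(qw)	(((qw) >> 34) & 0x3FFF)		/* data desc, bits 47:34 */
#define	CTX_TSOLEN(qw)	(((qw) >> 30) & 0x3FFFF)	/* context desc, bits 47:30 */
#define	CTX_MSS(qw)	(((qw) >> 50) & 0x3FFF)		/* context desc, bits 63:50 */

int
main(void)
{
	char line[64];
	unsigned int idx = 0;

	while (fgets(line, sizeof (line), stdin) != NULL) {
		uint64_t qw;

		if (sscanf(line, "%" SCNx64, &qw) != 1)
			continue;

		switch (DTYPE(qw)) {
		case 0x0:	/* data descriptor */
			printf("[%4u] data    bufsz=%" PRIu64 "\n",
			    idx, DATA_BUFSZ(qw));
			break;
		case 0x1:	/* context descriptor (TSO parameters) */
			printf("[%4u] context tso_len=%" PRIu64
			    " mss=%" PRIu64 "\n",
			    idx, CTX_TSOLEN(qw), CTX_MSS(qw));
			break;
		default:	/* anything else: dump it raw */
			printf("[%4u] dtype=0x%" PRIx64 " raw=0x%016" PRIx64
			    "\n", idx, (uint64_t)DTYPE(qw), qw);
		}
		idx++;
	}
	return (0);
}
```

If that layout is right, a context descriptor whose decoded TSO length comes out to something like 180 is the "busted" entry rzezeski points at: that field is meant to carry the total payload being segmented, which is why a 180-byte TSO looks nonsensical to him.
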
03:07:26 danmcd: you around?
03:08:00 I have some i40e fixes I probably need to upstream
03:08:27 Smithx10: there is another issue I potentially see, where too many descriptors are being used for a single segment, but I need to look at my old code with fresh eyes to be sure.
03:08:31 I thought dan might have built a SmartOS image w/ them in the past
03:08:39 that addresses a lot of these...
03:09:20 I could build you one, but I don't have a good place where I could make it available to download...
03:10:19 Smithx10: I don't know how to turn off a ring; we do expose a txrings linkprop, but I'm not sure what, if anything, would happen if you try to change that. Honestly the only thing I know for sure is to reset. But that also potentially loses us the smoking-gun state. So if you could take an NMI so we could have a crash dump, that would be great. But I understand that might not be possible if it's production.
03:10:54 They live in Europe... so everything is possible
03:11:10 jbk: addresses known Tx ring freezes?
03:11:15 is there an issue written up anywhere?
03:11:58 there's probably something internal I can copy out
03:13:14 IIRC (this was 2 years ago)... SMB in particular seemed to (for whatever reason) generate a run of big, big, big, tiny, tiny, tiny, tiny, big, big chains of fragments (b_cont), and the right mix of sizes could stop the ring
03:13:26 anyways, I've got to get to bed, I might be able to look more tomorrow Smithx10, but I'm also on the line for other work that needs to happen ASAP
03:13:42 no problem, thanks a ton... very helpful!
03:13:48 I think the other changes we made to address the DMA memory might have made it more sensitive
03:13:57 jbk: okay, yea, I used to have a traffic generator that did that as well
03:14:37 iperf is great, but it doesn't replicate real traffic that can mess up that intricate logic
03:14:48 (we capped the size of the buffers for DMA at 4k, so you were more likely to be doing more splitting and then DMA mapping and copying)
03:15:08 so you could create a vnic and not take a cross-state trip before it'd finish
03:15:21 after the system had been up for a while
03:16:10 k, well if I get a chance I might try replicating that traffic scheme and seeing if it reproduces. If it does, I'd rather just make the most targeted fix possible; I don't have the time to write or review more involved memory/DMA changes to i40e at the moment, though I agree it would be nice to have that stuff
03:16:53 it turns out the docs for the earlier Intel NICs suggest what the actual limitations are
03:17:20 anyways, I have to run to bed
03:17:23 jerry (and others) have reviewed it internally, and we've not had a single customer incident in the 2 years since we deployed the fix
03:18:33 we ended up hosting like 8 VMs' images off NFS, all running fio; that seemed to be able to cause it fairly regularly
03:18:38 but it'd still take 10-40 mins
03:18:57 jbk, I could probably put something up real quick if you wanted to send over a platform
03:18:59 since it was really hard to control how the kernel decided to segment the Tx packets before they got passed to mac
03:19:18 Smithx10: that's the thing, I don't really have a place I could host the file
03:19:33 or the problem...
03:19:46 1 sec
03:20:04 .... though I suppose now that I have gigabit symmetric (fiber), I could probably set something up at some point for stuff like this...
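
(The SMB write pattern jbk describes at 03:13 -- runs of big fragments followed by runs of tiny ones -- suggests a crude reproducer: a client that alternates large and small writes on a single TCP connection toward the i40e-connected host. The generator below is only a guess at such a tool, not the one either of them used; the sink address, port, and write sizes are all made up, and whether the kernel actually builds the awkward b_cont chains out of it is not guaranteed.)

```c
/*
 * Crude traffic-shape generator: alternates runs of big and tiny writes
 * on one TCP connection, roughly mimicking the big/big/big/tiny/tiny/...
 * pattern described above.  Host, port, and sizes are arbitrary guesses.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int
main(void)
{
	/* Write-size pattern, looped forever: big x3, tiny x4, big x2. */
	static const size_t sizes[] = {
		60000, 60000, 60000, 64, 64, 64, 64, 60000, 60000
	};
	static char buf[65536];
	struct sockaddr_in sin;
	int fd;

	memset(buf, 'x', sizeof (buf));
	memset(&sin, 0, sizeof (sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(5001);				/* hypothetical sink port */
	inet_pton(AF_INET, "192.168.1.10", &sin.sin_addr);	/* hypothetical sink host */

	if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0 ||
	    connect(fd, (struct sockaddr *)&sin, sizeof (sin)) != 0) {
		perror("connect");
		return (1);
	}

	for (;;) {
		for (size_t i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++) {
			if (write(fd, buf, sizes[i]) < 0) {
				perror("write");
				close(fd);
				return (1);
			}
		}
	}
	/* NOTREACHED */
}
```

Point it at something listening behind the i40e port (e.g. `nc -l 5001 > /dev/null`, flags vary by nc flavor) and watch the earlier kstat loop while it runs.
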
03:22:48 (for those curious) from looking at the docs for older Intel NICs (which spell this out more clearly than the i40e or ice docs)
03:23:11 and which suggest the limitation is present in i40e and ice
03:23:51 it appears that for every packet written on the wire, it can only do at most 8 DMA transfers to write it out
03:24:34 and if using TSO, the header is always done as its own transfer for each packet it generates (regardless of whether there's data in the same descriptor -- that will be done in a separate DMA transfer)
03:25:35 (which also makes trying to map/copy mblk_t's, which themselves can be segmented on arbitrary boundaries, extremely complicated if you want to not just copy everything into pre-allocated buffers)
13:11:15 [illumos-gate] 17610 build_cscope should stop looking for SPARC uts Makefiles -- Hans Rosenfeld
13:12:45 [illumos-gate] 17609 vioscsi doesn't probe for the target with the maximum target id -- Hans Rosenfeld
20:14:45 Tried running the jbk bits, and got i40e:1:trqpair_tx_5:tx_err_nodescs 1390. Wonder if I should just slowly work my way backwards until I land on a platform that doesn't do this....
20:24:04 not backwards, binary search
20:24:31 I don't know if you can easily fake 'git bisect' by just downloading SmartOS builds
20:24:44 danmcd: `piadm bisect`? :)
20:28:54 richlowe: gonna try the jbk build with these https://github.com/illumos/illumos-gate/compare/master...jasonbking:illumos-gate:i40e without an aggr in the mix and see if the counters go up
21:03:44 nope, happened with no aggr too.... boo!!!!
21:16:48 Smithx10: not surprising if it's triggered by the mblk chain shape -- aggr's not going to rewrite that.
21:43:28 Looks like it gets frozen in joyent_20250306T000316Z https://gist.github.com/Smithx10/648f03dce630f8b530f49b6f12dfa9ba
21:43:57 gonna give 20240307T000552Z a go
21:53:18 jbk: yes, I was aware of these limitations. I dealt with several of these i40e freeze issues at Joyent.
21:53:36 Smithx10: what makes you believe an older platform will fix the issue? Have there been changes to i40e in SmartOS?
21:54:45 rzezeski: we were on an older platform a few months ago, like a 2023 platform, and the customer wasn't complaining
21:54:54 * rzezeski realizes he doesn't know where illumos-joyent lives anymore
21:55:16 https://github.com/TritonDataCenter/illumos-joyent/
21:55:57 at the moment, I'm just trying to get to a platform where on Monday this euro team comes in and doesn't put more tickets in my queue
21:57:52 rzezeski: is it possible that anyone touching TSO could get that ring frozen? or do you think it's just been there? you don't think it's bad hw, right?
22:00:24 rzezeski: 20230713T000721Z was what we were on a few months ago and no one was complaining... that could just be a coincidence.
22:00:25 well, if it wasn't happening on an older platform, my first question is what changed
22:00:40 the only change I see in the Tx path is Patrick's meoi changes
22:01:07 I grepped through git log, but I don't know what files to git blame or look through
22:01:10 you just checking i40e?
22:01:32 yea, just checked i40e
22:01:44 how did you reproduce?
22:01:52 just booted the box
22:01:52 just send traffic over i40e?
22:01:54 lol
22:02:01 fair enough
22:02:07 and just checked the counters
22:02:22 there are a bunch of zones running on here
22:02:36 it's the weekend so I get away with being a little naughty
22:02:44 I don't think there are any SmartOS-exclusive changes here, so in theory I could put one of my i40e parts into my OmniOS box and see if I can replicate
22:03:12 yea, that's why I asked, those zones could be producing interesting traffic patterns that trip this behavior
22:03:28 so it may not be as easy to replicate at home
22:03:49 it's a mix of bhyve / OS zones, BHYV 65536
22:03:51 but one of the things I used back in the day was a generator that used various-sized writes
22:03:51 not sure if memory matters
22:04:00 probably not
22:06:57 The good news is, within like 25-30 minutes this one box trips it pretty regularly
22:07:04 another one is almost instant
23:40:14 rzezeski: yes, but all of the comments and the commentary at the time suggested no one was really sure what was going on with i40e and descriptors... I was just noting that the older Intel NIC docs seem to explain the actual behavior much more clearly than the i40e programming manual does
23:42:14 (basically with TSO, the header for each packet it produces from the input is done as its own DMA transfer -- leaving 7)
23:42:46 which means for the initial packet, a descriptor that contains the header + data will take 2 transfers (leaving 6 for the rest of that first packet)
23:44:56 or to put it another way, for each subsequent packet, it re-copies the initial header from RAM (then adjusts the fields as appropriate based on how much data has been sent so far)
23:45:14 and that always counts as 1 descriptor of the 8
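
(Pulling jbk's arithmetic together: 8 DMA transfers per wire packet, with the re-copied header always consuming one of them, leaves at most 7 data segments per TSO-generated packet, and 6 for the first packet when the header descriptor also carries data. The sketch below only illustrates the kind of guard the OS-7492 title implies -- count the fragments a b_cont chain would need and flatten it with msgpullup(9F) when over budget. It is not the actual illumos fix; the limit macro and function names are invented, and it oversimplifies by counting the whole chain rather than the segments that land in each post-TSO wire packet.)

```c
/*
 * Illustrative guard, not the actual OS-7492 change: if a b_cont chain
 * would need more descriptors than the hardware can DMA for one wire
 * packet, collapse it into a single contiguous mblk first.
 */
#include <sys/types.h>
#include <sys/stream.h>
#include <sys/strsun.h>		/* MBLKL; msgpullup is a real DDI routine */

#define	FAKE_TX_MAX_SEGS	8	/* hypothetical per-packet DMA budget */

static uint_t
count_tx_segs(mblk_t *mp)
{
	uint_t segs = 0;

	for (; mp != NULL; mp = mp->b_cont) {
		if (MBLKL(mp) > 0)
			segs++;		/* each non-empty fragment needs a descriptor */
	}
	return (segs);
}

/*
 * Returns the (possibly replaced) chain to map, or NULL on allocation
 * failure; on success with a new chain, the caller frees the original.
 */
static mblk_t *
maybe_pullup(mblk_t *mp)
{
	if (count_tx_segs(mp) <= FAKE_TX_MAX_SEGS)
		return (mp);

	/* Too fragmented for one packet's DMA budget: copy it flat. */
	return (msgpullup(mp, -1));
}
```

msgpullup(mp, -1) copies the whole chain into one contiguous mblk, trading a bcopy for a descriptor layout the hardware can always handle.
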