00:09:52 Smithx10: I haven't looked through everything yet, but there's definitely a counter related to this that is worrying: `i40e:1:trqpair_tx_7:tx_err_nodescs 6218704`
00:10:09 that's when a Tx is attempted but there are no descriptors available
00:10:39 you could run something like the following to watch this counter: `kstat -p :::tx_err_nodescs 1`
00:12:15 yeah, on a machine that I couldn't get it to happen on, they are all 0
00:12:16 also, given how small the `tx_descriptors` (total descriptors that have been completed over the ring's lifetime) is for that trqpair, it could be a sign of a frozen ring...
00:13:33 This machine had a bunch of large bhyve instances on it; I wonder if memory pressure could have played a factor
00:14:56 Smithx10: on the problem host, you could try running the following for a while: `kstat -p :::tx_descriptors :::tx_err_nodescs 1`
00:15:11 if the first count stays the same but the second count goes up, you have a frozen ring
00:15:51 hmmm, memory pressure could maybe do this, I'd have to look at the code to refresh my memory
00:16:18 I think out of a TB there is only 30 GB free
00:17:21 `mdb -ke '::memstat'` will give you a general idea of where things are
00:19:29 rzezeski: frozen ring
00:23:43 https://gist.github.com/Smithx10/8babe2510393ed3298e847eac873e884#file-memory
00:24:55 Low on memory too; think if I freed up some memory the descriptor count would go up?
00:34:04 Smithx10: worth a try to see if it unsticks the ring.
00:55:51 @sommerfeld yeah, gonna move one of these like 80 GB machines over and see what happens
01:07:44 rzezeski: that behavior usually indicates a bug in the driver?
01:13:03 I'm trying to page the i40e code back in while also working on other stuff. But based on my initial look this probably isn't being caused by low memory (though it could be). At the place where we increment this counter the mblk and TCB are already allocated; all we need is enough free Tx descriptors to send the packet. But if we don't have enough we bail, increment that counter, and put the mac client in flow control...
01:13:21 until the device sends some data and notifies the driver so it can recycle the descriptors
01:14:01 In the past I've diagnosed several i40e device ring freezes, where the hardware refused to make progress because of malformed descriptor data
01:14:16 I'm not saying this is a ring freeze, but it has the markings of one
01:14:57 And the reason this is intermittent is probably that I only saw this count on one of the rings. So depending on how things fall out, some connections land on that ring while others land on the other 6.
01:26:54 makes sense
01:33:39 To debug what the malformed data was, did you dump what was in those descriptors?
01:51:16 yes, I used mdb to dump the memory and then wrote a program to decipher it, let me see if I can resurrect those tools
02:03:13 https://smartos.org/bugview/OS-7492
02:03:14 → OS-7492: i40e Tx freeze when b_cont chain exceeds 8 descriptors (Resolved) | https://github.com/joyent/illumos-joyent/commit/0d3f2b61dcfb18edace4fd257054f6fdbe07c99c
02:04:19 that's a writeup of one of the i40e freeze issues I worked on; the first 4 mdb commands are used to extract the Tx descriptor data. If you want to try to get me that data I could run it through my program. None of that holds customer data, it's purely metadata about the packet being sent
02:11:45 That second command, they all came back 1024. I used the addr for i40e1, since the kstat was i40e:1:trqpair_tx_7
02:12:20 Do I need to specify tx7 in that second command somehow?
I'm trying to grok it but small brain
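
(A quick aside on the 01:13 messages above: the behavior rzezeski describes is a standard NIC Tx flow-control pattern. The sketch below is not the illumos i40e source; every structure and function name in it is hypothetical, and it only illustrates the bail-and-block shape he is talking about.)

```c
/*
 * Hypothetical sketch (not the i40e source) of the path described at
 * 01:13: the mblk and TCB are already allocated, and only the free
 * descriptor count decides whether we send or bail into flow control.
 */
typedef struct fake_tx_ring {
	unsigned int	free_descs;	/* descriptors currently available */
	unsigned int	err_nodescs;	/* what kstat shows as tx_err_nodescs */
	int		blocked;	/* ring is in flow control */
} fake_tx_ring_t;

/* Returns 0 if queued, -1 if the caller must hold the packet. */
static int
fake_ring_tx(fake_tx_ring_t *ring, unsigned int descs_needed)
{
	if (ring->free_descs < descs_needed) {
		ring->err_nodescs++;	/* the counter being watched above */
		ring->blocked = 1;	/* mac client goes into flow control */
		return (-1);
	}
	ring->free_descs -= descs_needed;
	/* ... fill descriptors, ring the doorbell ... */
	return (0);
}

/* Runs when the device reports Tx completions and descriptors recycle. */
static void
fake_ring_recycle(fake_tx_ring_t *ring, unsigned int descs_done)
{
	ring->free_descs += descs_done;
	if (ring->blocked) {
		ring->blocked = 0;
		/* ... tell mac it may resume sending on this ring ... */
	}
}
```

If the hardware stalls on a bad descriptor, the recycle side never runs, so `tx_descriptors` stops moving while `tx_err_nodescs` climbs -- exactly the signature the `kstat -p :::tx_descriptors :::tx_err_nodescs 1` loop above is watching for.
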
02:13:33 Ran the second command for all of them
02:13:50 (all i40e instances)
02:13:50 if everything is 1024, then it means you don't have this problem
02:14:27 what is i40e_trqpair_t 0t7
02:14:31 "0t7"
02:15:19 0t7 is how you specify the number 7 in base 10 in mdb (by default mdb treats everything as base 16)
02:15:36 it's saying the array is 7 elements long
02:15:40 because there are 7 Tx rings
02:16:09 for each address you see in the first command, you want to run the second command with that address
02:16:22 so you'll print all 7 Tx rings for all of your i40e instances
02:16:37 nice
02:17:12 https://gist.github.com/Smithx10/411a62d1014c1eb7989891a652bf58e2, they all look to be the appropriate size
02:18:22 i40e:1:trqpair_tx_7:tx_descriptors 1297, i40e:1:trqpair_tx_7:tx_err_nodescs 6254849 keeps growing
02:19:14 I've got about 100 GB of memory free now, didn't help
02:20:30 are you sure that mdb output is from the problem host?
02:20:40 none of the `itxs_err_nodescs` are non-zero
02:22:37 same machine https://gist.github.com/Smithx10/411a62d1014c1eb7989891a652bf58e2#file-gistfile2-txt
02:23:25 can you try changing that `0t7` to `0t8`
02:23:30 in the mdb command
02:23:40 interestingly, on a pretty fresh boot, the other box that had this issue pretty much instantly showed i40e:1:trqpair_tx_6:tx_err_nodescs 1681
02:25:11 https://gist.github.com/Smithx10/411a62d1014c1eb7989891a652bf58e2#file-with-8-L9
02:25:24 off by one, always, haha
02:25:35 yea, so there's our problem child
02:25:56 ahh, so print out through the 8th element in the array of rings
02:26:34 I need to just spend time mucking around in mdb and the illumos source
02:27:52 yea, either the old driver had only 7 rings and now it has 8, or I was always doing it wrong in the past and got lucky. But anyways, now you can run the third command in that ticket with the address of the ring
02:28:12 the address of the ring is the second column above
02:28:21 `fffffa09b6c2d560`
02:29:18 `fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_ring | ::array i40e_tx_desc_t 0t1024 | ::printf "0x%x\n" i40e_tx_desc_t cmd_type_offset_bsz ! xargs -L1 echo > /var/tmp/i40e_desc_ring_0xfffffa09b6c2d560.out`
02:29:22 should hopefully work
02:31:57 got that file
02:32:52 okay, send it to me or gist it, it's just a list of hex numbers that I'm going to use as input for my descriptor decoder program
02:34:24 and after you do that, show me the output of this: `fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_head |=D`
02:34:25 https://github.com/Smithx10/debug-images/blob/main/i40e_desc_ring_fffffa09b6c2d560.out
02:35:04 > `fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_head |=D` → 280
02:41:12 https://gist.github.com/rzezeski/8ead9c44ec94982ad07f06eaa2c4b09e
02:41:24 that TSO descriptor looks... busted
02:41:47 that 180, IIRC, is supposed to be the size of the packet
02:41:54 a 180-byte TSO makes no sense
02:42:12 and the driver is probably like wtf do you want me to do with this
02:43:27 Interesting
02:45:38 If I'm right, then this is probably a bug in the driver.
02:49:15 nice
02:49:26 Wonder if it's always been lurking or was recently introduced
02:55:27 well, I haven't used these tools in a while so they could also be doing something wrong, and it's getting late so my eyes are starting to bleed, but certainly this has the smell of a ring freeze, and i40e has a history here
02:56:19 Thank you a million
03:05:33 rzezeski: can I just turn off the frozen Tx ring?.....
I imagine whatever was causing it to happen will likely poke its head up again
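
(For anyone wanting to eyeball that dump themselves: the file produced by the 02:29 mdb pipeline is just one `cmd_type_offset_bsz` quadword per line, so a stand-in decoder is short. The one below is not rzezeski's program; the bit positions are an assumption taken from the XL710 datasheet as commonly reflected in i40e headers -- DTYPE in bits 3:0, data buffer size in bits 47:34, TSO length in bits 47:30 and MSS in bits 63:50 for context descriptors -- and should be double-checked against the driver before trusting the output.)

```c
/*
 * Hypothetical stand-in for the descriptor decoder discussed above.
 * Reads one hex quadword (cmd_type_offset_bsz) per line, as produced
 * by the mdb pipeline from OS-7492, and prints the decoded fields.
 * Bit positions follow the XL710 datasheet; verify against the headers.
 */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define	DTYPE(qw)	((qw) & 0xF)			/* bits 3:0 */
#define	DATA_BUFSZ(qw)	(((qw) >> 34) & 0x3FFF)		/* data desc, bits 47:34 */
#define	CTX_TSOLEN(qw)	(((qw) >> 30) & 0x3FFFF)	/* context desc, bits 47:30 */
#define	CTX_MSS(qw)	(((qw) >> 50) & 0x3FFF)		/* context desc, bits 63:50 */

int
main(void)
{
	char line[64];
	unsigned int idx = 0;

	while (fgets(line, sizeof (line), stdin) != NULL) {
		uint64_t qw;

		if (sscanf(line, "%" SCNx64, &qw) != 1)
			continue;

		switch (DTYPE(qw)) {
		case 0x0:	/* data descriptor */
			printf("[%4u] data    bufsz=%" PRIu64 "\n",
			    idx, DATA_BUFSZ(qw));
			break;
		case 0x1:	/* context descriptor (TSO parameters) */
			printf("[%4u] context tso_len=%" PRIu64
			    " mss=%" PRIu64 "\n",
			    idx, CTX_TSOLEN(qw), CTX_MSS(qw));
			break;
		default:	/* anything else: dump it raw */
			printf("[%4u] dtype=0x%" PRIx64 " raw=0x%016" PRIx64
			    "\n", idx, (uint64_t)DTYPE(qw), qw);
		}
		idx++;
	}
	return (0);
}
```

If that layout is right, a context descriptor whose decoded TSO length comes out to something like 180 is the "busted" entry rzezeski points at: that field is meant to carry the total payload being segmented, which is why a 180-byte TSO looks nonsensical to him.
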
03:07:26 danmcd: you around?
03:08:00 I have some i40e fixes I probably need to upstream
03:08:27 Smithx10: there is another issue I potentially see, where too many descriptors are being used for a single segment, but I need to look at my old code with fresh eyes to be sure.
03:08:31 I thought dan might have built a SmartOS image w/ them in the past
03:08:39 that addresses a lot of these...
03:09:20 I could build you one, but I don't have a good place where I could make it available to download...
03:10:19 Smithx10: I don't know how to turn off a ring; we do expose a txrings linkprop, but I'm not sure what, if anything, would happen if you try to change that. Honestly the only thing I know for sure is to reset. But that also potentially loses us the smoking-gun state. So if you could take an NMI so we could have a crash dump, that would be great. But I understand that might not be possible if it's production.
03:10:54 They live in Europe... so everything is possible
03:11:10 jbk: addresses known Tx ring freezes?
03:11:15 is there an issue written up anywhere?
03:11:58 there's probably something internal I can copy out
03:13:14 IIRC (this was 2 years ago)... SMB in particular seemed to (for whatever reason) generate a run of big, big, big, tiny, tiny, tiny, tiny, big, big chains of fragments (b_cont), and the right mix of sizes could stop the ring
03:13:26 anyways, I've got to get to bed, I might be able to look more tomorrow Smithx10, but I'm also on the line for other work that needs to happen ASAP
03:13:42 no problem, thanks a ton... very helpful!
03:13:48 I think the other changes we made to address the DMA memory might have made it more sensitive
03:13:57 jbk: okay, yea, I used to have a traffic generator that did that as well
03:14:37 iperf is great, but it doesn't replicate real traffic that can mess up that intricate logic
03:14:48 (we capped the size of the buffers for DMA at 4k, so you were more likely to be doing more splitting and then DMA mapping and copying)
03:15:08 so you could create a vnic and not take a cross-state trip before it'd finish
03:15:21 after the system had been up for a while
03:16:10 k, well if I get a chance I might try replicating that traffic scheme and seeing if it reproduces. If it does, I'd rather just make the most targeted fix possible; I don't have the time to write or review more involved memory/DMA changes to i40e at the moment, though I agree it would be nice to have that stuff
03:16:53 it turns out the docs for the earlier Intel NICs suggest what the actual limitations are
03:17:20 anyways, I have to run to bed
03:17:23 jerry (and others) have reviewed it internally, and we've not had a single customer incident in the 2 years since we deployed the fix
03:18:33 we ended up hosting like 8 VMs' images off NFS, all running fio; that seemed to be able to cause it fairly regularly
03:18:38 but it'd still take 10-40 mins
03:18:57 jbk, I could probably put something up real quick if you wanted to send over a platform
03:18:59 since it was really hard to control how the kernel decided to segment the Tx packets before they got passed to mac
03:19:18 Smithx10: that's the thing, I don't really have a place I could host the file
03:19:33 or the problem...
03:19:46 1 sec
03:20:04 .... though I suppose now that I have gigabit symmetric (fiber), I could probably set something up at some point for stuff like this...
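
(The SMB write pattern jbk describes at 03:13 -- runs of big fragments followed by runs of tiny ones -- suggests a crude reproducer: a client that alternates large and small writes on a single TCP connection toward the i40e-connected host. The generator below is only a guess at such a tool, not the one either of them used; the sink address, port, and write sizes are all made up, and whether the kernel actually builds the awkward b_cont chains out of it is not guaranteed.)

```c
/*
 * Crude traffic-shape generator: alternates runs of big and tiny writes
 * on one TCP connection, roughly mimicking the big/big/big/tiny/tiny/...
 * pattern described above.  Host, port, and sizes are arbitrary guesses.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int
main(void)
{
	/* Write-size pattern, looped forever: big x3, tiny x4, big x2. */
	static const size_t sizes[] = {
		60000, 60000, 60000, 64, 64, 64, 64, 60000, 60000
	};
	static char buf[65536];
	struct sockaddr_in sin;
	int fd;

	memset(buf, 'x', sizeof (buf));
	memset(&sin, 0, sizeof (sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(5001);				/* hypothetical sink port */
	inet_pton(AF_INET, "192.168.1.10", &sin.sin_addr);	/* hypothetical sink host */

	if ((fd = socket(AF_INET, SOCK_STREAM, 0)) < 0 ||
	    connect(fd, (struct sockaddr *)&sin, sizeof (sin)) != 0) {
		perror("connect");
		return (1);
	}

	for (;;) {
		for (size_t i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++) {
			if (write(fd, buf, sizes[i]) < 0) {
				perror("write");
				close(fd);
				return (1);
			}
		}
	}
	/* NOTREACHED */
}
```

Point it at something listening behind the i40e port (e.g. `nc -l 5001 > /dev/null`, flags vary by nc flavor) and watch the earlier kstat loop while it runs.
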
03:22:48 (for those curious) from looking at the docs for older Intel NICs (which spell this out more clearly than the i40e or ice docs)
03:23:11 and which suggest the limitation is present in i40e and ice
03:23:51 it appears that for every packet written on the wire, it can only do at most 8 DMA transfers to write it out
03:24:34 and if using TSO, the header is always done as its own transfer for each packet it generates (regardless of whether there's data in the same descriptor -- that will be done in a separate DMA transfer)
03:25:35 (which also makes trying to map/copy mblk_t's, which themselves can be segmented on arbitrary boundaries, extremely complicated if you want to not just copy everything into pre-allocated buffers)
13:11:15 [illumos-gate] 17610 build_cscope should stop looking for SPARC uts Makefiles -- Hans Rosenfeld
13:12:45 [illumos-gate] 17609 vioscsi doesn't probe for the target with the maximum target id -- Hans Rosenfeld
20:14:45 Tried running the jbk bits, and got i40e:1:trqpair_tx_5:tx_err_nodescs 1390. Wonder if I should just slowly work my way backwards until I land on a platform that doesn't do this....
20:24:04 not backwards, binary search
20:24:31 I don't know if you can easily fake 'git bisect' by just downloading SmartOS builds
20:24:44 danmcd: `piadm bisect`? :)
20:28:54 richlowe: gonna try the jbk build with these https://github.com/illumos/illumos-gate/compare/master...jasonbking:illumos-gate:i40e without an aggr in the mix and see if the counters go up
21:03:44 nope, happened with no aggr too.... boo!!!!
21:16:48 Smithx10: not surprising if it's triggered by the mblk chain shape -- aggr's not going to rewrite that.
21:43:28 Looks like it gets frozen in joyent_20250306T000316Z https://gist.github.com/Smithx10/648f03dce630f8b530f49b6f12dfa9ba
21:43:57 gonna give 20240307T000552Z a go
21:53:18 jbk: yes, I was aware of these limitations. I dealt with several of these i40e freeze issues at Joyent.
21:53:36 Smithx10: what makes you believe an older platform will fix the issue? Have there been changes to i40e in SmartOS?
21:54:45 rzezeski: we were on an older platform a few months ago, like a 2023 platform, and the customer wasn't complaining
21:54:54 * rzezeski realizes he doesn't know where illumos-joyent lives anymore
21:55:16 https://github.com/TritonDataCenter/illumos-joyent/
21:55:57 at the moment, I'm just trying to get to a platform where on Monday this euro team comes in and doesn't put more tickets in my queue
21:57:52 rzezeski: is it possible that anyone touching TSO could get that ring frozen? or do you think it's just been there? you don't think it's bad hw, right?
22:00:24 rzezeski: 20230713T000721Z was what we were on a few months ago and no one was complaining... that could just be a coincidence.
22:00:25 well, if it wasn't happening on an older platform, my first question is what changed
22:00:40 the only change I see in the Tx path is Patrick's meoi changes
22:01:07 I grepped through git log, but I don't know what files to git blame or look through
22:01:10 you just checking i40e?
22:01:32 yea, just checked i40e
22:01:44 how did you reproduce?
22:01:52 just booted the box
22:01:52 just send traffic over i40e?
22:01:54 lol
22:02:01 fair enough
22:02:07 and just checked the counters
22:02:22 there are a bunch of zones running on here
22:02:36 it's the weekend so I get away with being a little naughty
22:02:44 I don't think there are any SmartOS-exclusive changes here, so in theory I could put one of my i40e parts into my OmniOS box and see if I can replicate
22:03:12 yea, that's why I asked, those zones could be producing interesting traffic patterns that trip this behavior
22:03:28 so it may not be as easy to replicate at home
22:03:49 it's a mix of bhyve / OS zones, BHYV 65536
22:03:51 but one of the things I used back in the day was a generator that used various-sized writes
22:03:51 not sure if memory matters
22:04:00 probably not
22:06:57 The good news is, within like 25-30 minutes this one box trips it pretty regularly
22:07:04 another one is almost instant
23:40:14 rzezeski: yes, but all of the comments and the commentary at the time suggested no one was really sure what was going on with i40e and descriptors... I was just noting that the older Intel NIC docs seem to explain the actual behavior much more clearly than the i40e programming manual does
23:42:14 (basically with TSO, the header for each packet it produces from the input is done as its own DMA transfer -- leaving 7)
23:42:46 which means for the initial packet, a descriptor that contains the header + data will take 2 transfers (leaving 6 for the rest of that first packet)
23:44:56 or to put it another way, for each subsequent packet, it re-copies the initial header from RAM (then adjusts the fields as appropriate based on how much data has been sent so far)
23:45:14 and that always counts as 1 descriptor of the 8
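
(Pulling jbk's arithmetic together: 8 DMA transfers per wire packet, with the re-copied header always consuming one of them, leaves at most 7 data segments per TSO-generated packet, and 6 for the first packet when the header descriptor also carries data. The sketch below only illustrates the kind of guard the OS-7492 title implies -- count the fragments a b_cont chain would need and flatten it with msgpullup(9F) when over budget. It is not the actual illumos fix; the limit macro and function names are invented, and it oversimplifies by counting the whole chain rather than the segments that land in each post-TSO wire packet.)

```c
/*
 * Illustrative guard, not the actual OS-7492 change: if a b_cont chain
 * would need more descriptors than the hardware can DMA for one wire
 * packet, collapse it into a single contiguous mblk first.
 */
#include <sys/types.h>
#include <sys/stream.h>
#include <sys/strsun.h>		/* MBLKL; msgpullup is a real DDI routine */

#define	FAKE_TX_MAX_SEGS	8	/* hypothetical per-packet DMA budget */

static uint_t
count_tx_segs(mblk_t *mp)
{
	uint_t segs = 0;

	for (; mp != NULL; mp = mp->b_cont) {
		if (MBLKL(mp) > 0)
			segs++;		/* each non-empty fragment needs a descriptor */
	}
	return (segs);
}

/*
 * Returns the (possibly replaced) chain to map, or NULL on allocation
 * failure; on success with a new chain, the caller frees the original.
 */
static mblk_t *
maybe_pullup(mblk_t *mp)
{
	if (count_tx_segs(mp) <= FAKE_TX_MAX_SEGS)
		return (mp);

	/* Too fragmented for one packet's DMA budget: copy it flat. */
	return (msgpullup(mp, -1));
}
```

msgpullup(mp, -1) copies the whole chain into one contiguous mblk, trading a bcopy for a descriptor layout the hardware can always handle.
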