-
rzezeski
Smithx10: I haven't looked through everything yet, but there's definitely a counter related to this that is worrying: `i40e:1:trqpair_tx_7:tx_err_nodescs 6218704`
-
rzezeski
that's when a Tx is attempted but there are no descriptors available
-
rzezeski
you could run something like the following to watch this counter: `kstat -p :::tx_err_nodescs 1`
-
Smithx10
yeah on a machine that i couldn't get it to happen on, they are all 0
-
rzezeski
also, given how small the `tx_descriptors` (total descriptors that have been completed over lifetime) is for that trqpair, it could be a sign of a frozen ring...
-
Smithx10
This machine had a bunch of large bhyve instances on it, I wonder if memory pressure could have played a factor
-
rzezeski
Smithx10: on the problem host, you could try running the following for a while: `kstat -p :::tx_descriptors :::tx_err_nodescs 1`
-
rzezeski
if the first count stays the same, but the second count goes up, you have a frozen ring
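A rough Python sketch of that check (not something pasted in the channel): it compares two `kstat -p` samples and reports rings where `tx_descriptors` stayed flat while `tx_err_nodescs` grew. The function names are made up for illustration, and it assumes the usual `module:instance:name:statistic<TAB>value` output of `kstat -p`.

```python
def parse_kstat(text):
    """Parse `kstat -p` style lines ("mod:inst:name:stat<whitespace>value") into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        counters[key] = int(value)
    return counters


def frozen_rings(before, after):
    """Return rings whose tx_descriptors did not move while tx_err_nodescs went up."""
    frozen = []
    for key, old_errs in before.items():
        if not key.endswith(":tx_err_nodescs"):
            continue
        ring = key.rsplit(":", 1)[0]
        desc_key = ring + ":tx_descriptors"
        if desc_key not in before or desc_key not in after or key not in after:
            continue  # need both counters in both samples
        if after[desc_key] == before[desc_key] and after[key] > old_errs:
            frozen.append(ring)
    return sorted(frozen)
```

Feeding it two samples taken some seconds apart is enough; a ring that is merely busy will show both counters moving, while a frozen one only shows the error counter climbing.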
-
rzezeski
hmmm, memory pressure could maybe do this, I'd have to look at the code to refresh my memory
-
Smithx10
think out of a tb there is only 30 gb free
-
rzezeski
`mdb -ke '::memstat'` will give you a general idea of where things are
-
Smithx10
rzezeski: frozen ring
-
Smithx10
Low on memory too, think if I freed up some memory the descriptor count would go up?
-
sommerfeld
Smithx10: worth a try to see if it unsticks the ring.
-
Smithx10
@sommerfeld yeah, gonna move one of these like 80gb machines over and see what happens
-
Smithx10
rzezeski: that behavior usually indicates a bug in the driver?
-
rzezeski
I'm trying to page the i40e code back in while also working on other stuff. But based on my initial look this probably isn't being caused by low memory (but it could be). At the place where we increment this counter the mblk and TCB are already allocated, all we need is the free Tx descriptors to send the packet. But if we don't have enough we bail, increment that counter, and put the mac client in flow control...
-
rzezeski
until the device sends some data and notifies the driver so it can recycle the descriptors
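A toy Python model of that behaviour, purely illustrative and not the i40e driver code: the class name, method names, and the 1024-entry ring size are assumptions for the sketch. It shows why a ring whose hardware never completes anything ends up with a flat `tx_descriptors` and a growing `tx_err_nodescs`.

```python
class ToyTxRing:
    """Simplified model of one Tx ring: free descriptors, in-flight descriptors, two counters."""

    def __init__(self, size=1024):  # ring size is an assumption for this sketch
        self.free = size
        self.in_flight = 0
        self.tx_descriptors = 0   # descriptors completed over the ring's lifetime
        self.tx_err_nodescs = 0   # Tx attempts that found too few free descriptors
        self.blocked = False      # stands in for "mac client is flow controlled"

    def send(self, needed):
        """Try to queue a packet needing `needed` descriptors; bail if there are not enough."""
        if needed > self.free:
            self.tx_err_nodescs += 1
            self.blocked = True
            return False
        self.free -= needed
        self.in_flight += needed
        return True

    def recycle(self, completed):
        """The device reported `completed` descriptors done; return them to the free pool."""
        completed = min(completed, self.in_flight)
        self.in_flight -= completed
        self.free += completed
        self.tx_descriptors += completed
        if completed:
            self.blocked = False
```

If `recycle()` is never called (the device stops making progress), every later `send()` fails once the ring fills, which is the frozen-ring signature.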
-
rzezeski
In the past I've diagnosed several i40e device ring freezes, where the hardware refused to make progress because of malformed descriptor data
-
rzezeski
I'm not saying this is a ring freeze, but it has the markings of one
-
rzezeski
And the reason this is intermittent is probably because I only saw this count on one of the rings. So depending on how things fall out some connections land on that ring while others land on the other 6.
-
Smithx10
makes sense
-
Smithx10
To debug what the malformed data was, did you dump what was in those descriptors ?
-
rzezeski
yes, I used mdb to dump the memory and then wrote a program to decipher it, let me see if I can resurrect those tools
-
fenix
→ OS-7492: i40e Tx freeze when b_cont chain exceeds 8 descriptors (Resolved) |
joyent/illumos-joyent 0d3f2b6
-
rzezeski
that's a writeup of one of the i40e freeze issues I worked on. The first 4 mdb commands are used to extract the Tx descriptor data. If you want to try to get me that data I could run it through my program; none of it holds customer data, it's purely metadata about the packet being sent
-
Smithx10
That second command, they all came back 1024, i used the addr for i40e1 since i40e:1:trqpair_tx_7
-
Smithx10
Do I need to specify tx7 in that second command somehow? I'm trying to grok it but small brain
-
rzezeski
Run the second command for all of them
-
rzezeski
(all i40e instances)
-
rzezeski
if everything is 1024, then it means you don't have this problem
-
Smithx10
what is i40e_trqpair_t 0t7
-
Smithx10
"0t7"
-
rzezeski
0t7 is how you specify the number 7 in base 10 in mdb (by default mdb treats everything as base16)
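As a small illustration of that radix rule (a Python sketch, not an mdb feature; `parse_mdb_int` is a made-up helper name): mdb reads a bare number as hex, while the `0i`, `0o`, `0t`, and `0x` prefixes force binary, octal, decimal, and hex.

```python
# Radix prefixes understood by mdb; anything without a prefix is read as hex by default.
_PREFIX_RADIX = {"0i": 2, "0o": 8, "0t": 10, "0x": 16}


def parse_mdb_int(token, default_radix=16):
    """Interpret an integer literal the way mdb would with its default base of 16."""
    prefix = token[:2].lower()
    if prefix in _PREFIX_RADIX and len(token) > 2:
        return int(token[2:], _PREFIX_RADIX[prefix])
    return int(token, default_radix)
```

So `0t7` is seven, `0t1024` is 1024, and a bare `1024` would be read as 0x1024, which is why array lengths in these commands carry the `0t` prefix.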
-
rzezeski
it's saying the array is 7 elements long
-
rzezeski
because there are 7 Tx rings
-
rzezeski
for each address you see in the first command, you want to run the second command with that address
-
rzezeski
so you'll print all 7 Tx rings for all of your i40e instances
-
Smithx10
nice
-
Smithx10
i40e:1:trqpair_tx_7:tx_descriptors 1297, i40e:1:trqpair_tx_7:tx_err_nodescs 6254849 keeps growing
-
Smithx10
ive got about 100gb of memory free now, didnt help
-
rzezeski
are you sure that mdb output is from the problem host?
-
rzezeski
none of the `itxs_err_nodescs` are non-zero
-
rzezeski
can you try changing that `0t7` to `0t8`
-
rzezeski
in the mdb command
-
Smithx10
interestingly on a pretty fresh boot, the other gbox that had this issue.... pretty much instantly i40e:1:trqpair_tx_6:tx_err_nodescs 1681 had this issue
-
rzezeski
off by one, always, haha
-
rzezeski
yea, so there's our problem child
-
Smithx10
ahh, so print out until the 8th element in the array of rings
-
Smithx10
I need to just spend time mucking around in mdb and the illumos source
-
rzezeski
yea, either the old driver had only 7 rings and now it has 8, or I was always doing it wrong in the past and got lucky. But anyways, now you can run the third command in that ticket with the address of the ring
-
rzezeski
the address of the ring is the second column above
-
rzezeski
`fffffa09b6c2d560`
-
rzezeski
`fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_ring | ::array i40e_tx_desc_t 0t1024 | ::printf "0x%x\n" i40e_tx_desc_t cmd_type_offset_bsz ! xargs -L1 echo > /var/tmp/i40e_desc_ring_0xfffffa09b6c2d560.out`
-
rzezeski
should hopefully work
-
Smithx10
got that file
-
rzezeski
okay, send it to me or gist it, it's just a list of hex numbers that I'm going to use as input for my descriptor decoder program
-
rzezeski
and after you do that show me the output of this: `fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_head |=D`
-
Smithx10
> fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_head |=D; 280
-
rzezeski
that TSO descriptor looks...busted
-
rzezeski
that 180, IIRC, is supposed to be the size of the packet
-
rzezeski
a 180 byte TSO makes no sense
-
rzezeski
and the driver is probably like wtf do you want me to do with this
-
Smithx10
Interesting
-
rzezeski
If I'm right, then this is a bug in the driver probably.
-
Smithx10
nice
-
Smithx10
Wonder if its always been lurking or recently introduced
-
rzezeski
well I haven't used these tools in a while so they could also be doing something wrong, and it's getting late so my eyes are starting to bleed, but certainly this has the smell of a ring freeze, and i40e has a history here
-
Smithx10
Thank You a million
-
Smithx10
rzezeski: can I just turn off the frozen tx ring?..... I imagine whatever was causing it to happen will likely rear its head again
-
jbk
danmcd: you around?
-
jbk
I have some i40e fixes I probably need to upstream
-
rzezeski
Smithx10: there is another issue I potentially see where too many descriptors are being used for a single segment, but I need to look at my old code with fresh eyes to be sure.
-
jbk
I thought dan might have built a smartos image w/ them in the past
-
jbk
that addresses a lot of these...
-
jbk
i could build you one, but I don't have a good place where I could make it available to download...
-
rzezeski
Smithx10: I don't know how to turn off a ring, we do expose a txrings linkprop, but I'm not sure what, if anything, would happen if you try to change that. Honestly the only thing I know for sure is to reset. But that also potentially loses us the smoking gun state. So if you could take an NMI so we could have a crash dump that would be great. But I understand that might not be possible if it's production.
-
Smithx10
They live in europe... so everything is possible
-
rzezeski
jbk: addresses known Tx ring freezes?
-
rzezeski
is there an issue written up anywhere?
-
jbk
there's probably something internal I can copy out
-
jbk
IIRC (this was 2 years ago)... smb in particular seemed to (for whatever reason) generate a run of big, big, big, tiny, tiny, tiny, tiny, big, big chains of fragments (b_cont) and the right mix of sizes could stop the ring
-
rzezeski
anyways, I've got to get to bed, I might be able to look more tomorrow Smithx10 but I'm also on the line for other work that needs to happen ASAP
-
Smithx10
no problem, thanks a ton... very helpful!
-
jbk
i think the other changes we made to address the DMA memory might have made it more sensitive
-
rzezeski
jbk: okay, yea I used to have a traffic generator that did that as well
-
rzezeski
iperf is great, but it doesn't replicate real traffic that can mess up that intricate logic
-
jbk
(we capped the size of the buffers for DMA at 4k, so you were more likely to be doing more splitting and then dma mapping and copying)
-
jbk
so you could create a vnic and not take a cross-state trip before it'd finish
-
jbk
after the system had been up for a while
-
rzezeski
k, well if I get a chance I might try replicating that traffic scheme and seeing if it reproduces, if it does I'd rather just make the most targeted fix possible, I don't have the time to write or review more involved memory/DMA changes to i40e at the moment, though I agree it would be nice to have that stuff
-
jbk
it turns out the docs for the earlier intel nics hint at what the actual limitations are
-
rzezeski
anyways, I have to run to bed
-
jbk
jerry (and others) have reviewed it internally, and we've not had a single customer incident in the 2 years since we've deployed the fix
-
jbk
we ended up hosting like 8 VMs' images off NFS all running fio, which seemed to be able to fairly regularly cause it
-
jbk
but it'd still take 10-40 mins
-
Smithx10
jbk, I could probably put something up real quick if you wanted to send over a platform
-
jbk
since it was really hard to control how the kernel decided to segment the TX packets before they got passed to mac
-
jbk
Smithx10: that's the thing, I don't really have a place I could host the file
-
jbk
or the problem...
-
Smithx10
1 sec
-
jbk
.... though I suppose now that i have gigabit symmetric (fiber), i could probably set something up at some point for stuff like this...
-
jbk
(for those curious) from looking at the docs for older intel nics (which spell this out more clearly than the i40e or ice docs)
-
jbk
and suggest the limitation is present in i40e and ice
-
jbk
it appears for every packet written on the wire, it can only do at most 8 DMA transfers to write it out
-
jbk
and if using TSO, the header is always done as its own transfer for each packet it generates (regardless of whether there's data in the same descriptor -- that will be done in a separate dma transfer)
-
jbk
(which also makes trying to map/copy mblk_ts which themselves can be segmented on arbitrary boundaries extremely complicated if you want to not just copy everything into pre-allocated buffers)
-
gitomat
[illumos-gate] 17610 build_cscope should stop looking for SPARC uts Makefiles -- Hans Rosenfeld <rosenfeld⊙gho>
-
gitomat
[illumos-gate] 17609 vioscsi doesn't probe for the target with the maximum target id -- Hans Rosenfeld <rosenfeld⊙gho>
-
Smithx10
Tried running the jbk bits, and got i40e:1:trqpair_tx_5:tx_err_nodescs 1390, Wonder if I should just slowly work my way backward until I land on a platform that doesn't do this....
-
richlowe
not backwards, binary search
-
richlowe
I don't know if you can easily fake 'git bisect' just downloading smartos builds
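A minimal Python sketch of that bisect idea, assuming you have an ordered list of platform builds where the oldest is known good and the newest is known bad. The `first_bad` name and the `is_bad` callback (boot the build, watch the counters, report the result) are hypothetical stand-ins for the manual work.

```python
def first_bad(builds, is_bad):
    """Binary search for the earliest bad build.

    `builds` is sorted oldest to newest; builds[0] is assumed good and
    builds[-1] is assumed bad. `is_bad(build)` returns True if the problem
    shows up on that build.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid
        else:
            lo = mid
    return builds[hi]
```

With N candidate builds this needs about log2(N) boots instead of walking back one build at a time, though it only works if the problem reproduces reliably enough to trust a "good" verdict.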
-
richlowe
danmcd: `piadm bisect`? :)
-
Smithx10
richlowe: gonna try the jbk build with these
github.com/illumos/illumos-gate/com…ster...jasonbking:illumos-gate:i40e without an aggr in the mix and see if the counters go up
-
Smithx10
nope happened with no aggr too.... boo!!!!
-
sommerfeld
Smithx10: not surprising if it's triggered by the mblk chain shape -- aggr's not going to rewrite that.
-
Smithx10
gonna give 20240307T000552Z a go
-
rzezeski
jbk: yes, I was aware of these limitations. I dealt with several of these i40e freeze issues at Joyent.
-
rzezeski
Smithx10: what makes you believe an older platform will fix the issue? Have there been changes to i40e in SmartOS?
-
Smithx10
rzezeski: we were on an older platform a few months ago, like a 2023 platform, and the customer wasn't complaining
-
» rzezeski realizes he doesn't know where illumos-joyent lives anymore
-
Smithx10
at the moment, I'm just trying to get to a platform where on monday this euro team comes in and doesn't put more tickets in my queue
-
Smithx10
rzezeski: is it possible that anyone touching TSO could get that ring frozen? or you think its just been there? you dont think its bad hw right?
-
Smithx10
rzezeski: 20230713T000721Z was what we were on a few months ago and no one was complaining...that could just be a coincidence.
-
rzezeski
well if it wasn't happening on an older platform my first question is what changed
-
rzezeski
the only change I see in the Tx path is Patrick's meoi changes
-
Smithx10
I grepped through git log, but I don't know what files to git blame or look through
-
Smithx10
you just checking i40e?
-
rzezeski
yea, just checked i40e
-
rzezeski
how did you reproduce?
-
Smithx10
just booted the box
-
rzezeski
just send traffic over i40e?
-
rzezeski
lol
-
rzezeski
fair enough
-
Smithx10
and just checked the counters
-
Smithx10
there are a bunch of zones on here running
-
Smithx10
its the weekend so i get away with being a little naughty
-
rzezeski
I don't think there are any smartos-exclusive changes here, so in theory I could put one of my i40e parts into my omnios box and see if I can replicate
-
rzezeski
yea, that's why I asked, those zones could be producing interesting traffic patterns that trip this behavior
-
rzezeski
so it may not be as easy to replicate at home
-
Smithx10
its a mix of bhyve os zones BHYV 65536
-
rzezeski
but one of the things I used back in the day was a generator that used writes of various sizes
-
Smithx10
not sure if memory matters
-
Smithx10
probably not
-
Smithx10
The good news is, within like 25-30 minutes this one box trips it up pretty regularly
-
Smithx10
another one is almost instant
-
jbk
rzezeski: yes, but all of the comments and the commentary at the time suggested no one was really sure what was going on with i40e and descriptors... I was just noting that the docs for older intel NICs seem to explain the actual behavior much more clearly than the i40e programming manual does
-
jbk
(basically with TSO, the header for each packet it produces from the input is done as its own DMA transfer -- leaving 7)
-
jbk
which means for the initial packet, a descriptor that contains the header + data will take 2 transfers (leaving 6 for the rest of that first packet)
-
jbk
or to put it another way, for each subsequent packet, it re-copies the initial header from RAM (then adjusts the fields as appropriate based on how much data has been sent so far)
-
jbk
and that always counts as 1 descriptor of the 8
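To make that arithmetic concrete, here is a rough Python model (an illustration only, not driver or hardware code): it assumes one DMA buffer per payload fragment, counts one transfer for the replayed header plus one per buffer a segment's payload touches, and compares against the limit of 8 described above. `transfers_per_segment`, `over_limit`, and the fragment sizes in the usage note are all hypothetical.

```python
MAX_TRANSFERS = 8  # per on-wire packet, per the discussion above


def transfers_per_segment(payload_frags, mss):
    """Return, for each TSO segment, 1 (header) + number of payload buffers it touches.

    `payload_frags` lists the byte size of each payload buffer in chain order
    (header bytes excluded); `mss` is the payload carried by each segment.
    """
    counts = []
    remaining = sum(payload_frags)
    frag_idx = 0
    frag_left = payload_frags[0] if payload_frags else 0
    while remaining > 0:
        need = min(mss, remaining)
        remaining -= need
        touched = 0
        while need > 0:
            while frag_left == 0:  # move on to the next non-empty buffer
                frag_idx += 1
                frag_left = payload_frags[frag_idx]
            take = min(need, frag_left)
            need -= take
            frag_left -= take
            touched += 1
        counts.append(1 + touched)  # the header always costs one transfer
    return counts


def over_limit(payload_frags, mss, limit=MAX_TRANSFERS):
    """Indexes of segments that would need more than `limit` transfers."""
    counts = transfers_per_segment(payload_frags, mss)
    return [i for i, n in enumerate(counts) if n > limit]
```

For example, ten 100-byte fragments followed by one 4000-byte fragment with a 1000-byte MSS gives `[11, 2, 2, 2, 2]`: the first segment spans all ten tiny buffers and blows past 8, while the rest fit easily. That is the kind of run of tiny fragments that has to be copied or coalesced before it is handed to the device.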