#illumos

00:09

rzezeski

Smithx10: I haven't looked through everything yet, but there's definitely a counter related to this that is worrying: `i40e:1:trqpair_tx_7:tx_err_nodescs 6218704`
00:10

rzezeski

that's when a Tx is attempted but there are no descriptors available
00:10

rzezeski

you could run something like the following to watch this counter: `kstat -p :::tx_err_nodescs 1`
00:12

Smithx10

yeah on a machine that i couldn't get it to happen on, they are all 0
00:12

rzezeski

also, given how small the `tx_descriptors` (total descriptors that have been completed over lifetime) is for that trqpair, it could be a sign of a frozen ring...
00:13

Smithx10

This machine had a bunch of large bhyve instances on there, I wonder if memory pressure could have played a factor
00:14

rzezeski

Smithx10: on the problem host, you could try running the following for a while: `kstat -p :::tx_descriptors :::tx_err_nodescs 1`
00:15

rzezeski

if the first count stays the same, but the second count goes up, you have a frozen ring
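
The check described above can be automated. The following is a minimal sketch, not a tool used in this conversation: it assumes two snapshots of `kstat -p` output taken some time apart, and the helper names (`parse_kstat`, `frozen_rings`) and the tx_6 sample values are made up for illustration. The tx_7 values loosely follow numbers quoted in this log.

```python
def parse_kstat(text):
    """Parse `kstat -p` output: 'module:instance:name:statistic value' per line."""
    stats = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        module, instance, name, stat = fields[0].split(":", 3)
        stats[(module, int(instance), name, stat)] = int(fields[1])
    return stats


def frozen_rings(before, after):
    """Rings where tx_descriptors stood still while tx_err_nodescs grew."""
    frozen = []
    for (mod, inst, name, stat), old_done in before.items():
        if stat != "tx_descriptors":
            continue
        done_key = (mod, inst, name, "tx_descriptors")
        err_key = (mod, inst, name, "tx_err_nodescs")
        if done_key not in after or err_key not in before or err_key not in after:
            continue
        if after[done_key] == old_done and after[err_key] > before[err_key]:
            frozen.append(f"{mod}:{inst}:{name}")
    return sorted(frozen)


# Illustrative snapshots (tx_6 values are invented).
SAMPLE_T0 = """
i40e:1:trqpair_tx_6:tx_descriptors 500000
i40e:1:trqpair_tx_6:tx_err_nodescs 0
i40e:1:trqpair_tx_7:tx_descriptors 1297
i40e:1:trqpair_tx_7:tx_err_nodescs 6218704
"""
SAMPLE_T1 = """
i40e:1:trqpair_tx_6:tx_descriptors 500900
i40e:1:trqpair_tx_6:tx_err_nodescs 0
i40e:1:trqpair_tx_7:tx_descriptors 1297
i40e:1:trqpair_tx_7:tx_err_nodescs 6254849
"""

print(frozen_rings(parse_kstat(SAMPLE_T0), parse_kstat(SAMPLE_T1)))
```

A ring that is merely busy would show both counters moving, so it is not reported; only the stuck-plus-growing combination is.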
00:15

rzezeski

hmmm, memory pressure could maybe do this, I'd have to look at the code to refresh my memory
00:16

Smithx10

think out of a tb there is only 30 gb free
00:17

rzezeski

`mdb -ke '::memstat'` will give you a general idea of where things are
00:19

Smithx10

rzezeski: frozen ring
00:23

Smithx10

gist.github.com/Smithx10/8babe2510393ed3298e847eac873e884#file-memory
00:24

Smithx10

Low on memory too. Think if I freed up some memory the descriptor count would go up?
00:34

sommerfeld

Smithx10: worth a try to see if it unsticks the ring.
00:55

Smithx10

@sommerfeld yeah, gonna move one of these like 80gb machines over and see what happens
01:07

Smithx10

rzezeski: that behavior usually indicates a bug in the driver?
01:13

rzezeski

I'm trying to page the i40e code back in while also working on other stuff. But based on my initial look this probably isn't being caused by low memory (but it could be). At the place where we increment this counter the mblk and TCB are already allocated, all we need is the free Tx descriptors to send the packet. But if we don't have enough we bail, increment that counter, and put the mac client in flow control...
01:13

rzezeski

until the device sends some data and notifies the driver so it can recycle the descriptors
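
To make the mechanism above concrete, here is a toy model. It is a sketch only and does not mirror the real i40e driver code: the class `ToyTxRing`, its fields, and the `hw_stuck` flag are hypothetical, and the two counters are named after the kstats discussed in this log.

```python
class ToyTxRing:
    """Toy Tx ring: descriptors are consumed on send and returned on recycle."""

    def __init__(self, size):
        self.free = size          # descriptors currently available
        self.in_flight = []       # descriptor counts handed to the "hardware"
        self.blocked = False      # stands in for MAC flow control
        self.tx_descriptors = 0   # descriptors recycled over the ring's lifetime
        self.tx_err_nodescs = 0   # sends refused for lack of descriptors

    def send(self, needed):
        if needed > self.free:
            # Not enough descriptors: bail, count it, and block the client.
            self.tx_err_nodescs += 1
            self.blocked = True
            return False
        self.free -= needed
        self.in_flight.append(needed)
        return True

    def recycle(self, hw_stuck=False):
        """Completion notification; a stuck device never completes anything."""
        if hw_stuck:
            return
        done = sum(self.in_flight)
        self.in_flight.clear()
        self.free += done
        self.tx_descriptors += done
        if done:
            self.blocked = False


ring = ToyTxRing(8)
ring.send(8)
ring.recycle(hw_stuck=True)   # hardware makes no progress
for _ in range(3):
    ring.send(1)              # every attempt now fails
print(ring.tx_descriptors, ring.tx_err_nodescs)
```

In the model, a healthy ring unblocks as soon as `recycle()` returns descriptors; with the hardware stuck, `tx_descriptors` never moves while `tx_err_nodescs` keeps climbing.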
01:14

rzezeski

In the past I've diagnosed several i40e device ring freezes, where the hardware refused to make progress because of malformed descriptor data
01:14

rzezeski

I'm not saying this is a ring freeze, but it has the markings of one
01:14

rzezeski

And the reason this is intermittent is probably because I only saw this count on one of the rings. So depending on how things fall out some connections land on that ring while others land on the other 6.
01:26

Smithx10

makes sense
01:33

Smithx10

To debug what the malformed data was, did you dump what was in those descriptors ?
01:51

rzezeski

yes, I used mdb to dump the memory and then wrote a program to decipher it, let me see if I can resurrect those tools
02:03

rzezeski

smartos.org/bugview/OS-7492
02:03

fenix

→ OS-7492: i40e Tx freeze when b_cont chain exceeds 8 descriptors (Resolved) | joyent/illumos-joyent 0d3f2b6
02:04

rzezeski

that's a writeup of one of the i40e freeze issues I worked on. The first 4 mdb commands are used to extract the Tx descriptor data. If you want to try to get me that data, I could run it through my program. None of it holds customer data, it's purely metadata about the packet being sent
02:11

Smithx10

That second command, they all came back 1024, i used the addr for i40e1 since i40e:1:trqpair_tx_7
02:12

Smithx10

Do I need to specify tx7 in that second command some how? Im trying to grok it but small brain
02:13

rzezeski

Run the second command for all of them
02:13

rzezeski

(all i40e instances)
02:13

rzezeski

if everything is 1024, then it means you don't have this problem
02:14

Smithx10

what is i40e_trqpair_t 0t7
02:14

Smithx10

"0t7"
02:15

rzezeski

0t7 is how you specify the number 7 in base 10 in mdb (by default mdb treats everything as base16)
02:15

rzezeski

it's saying the array is 7 elements long
02:15

rzezeski

because there are 7 Tx rings
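
As a small illustration of the radix point above, here is a sketch of how mdb-style integer literals read. The function `parse_mdb_int` is hypothetical (it is not part of mdb); the prefixes `0i`, `0o`, `0t`, and `0x` are mdb's binary, octal, decimal, and hexadecimal markers, and unprefixed numbers are assumed to use the default radix of 16.

```python
def parse_mdb_int(token, default_radix=16):
    """Interpret an integer literal the way mdb's prefixes suggest."""
    prefixes = {"0i": 2, "0o": 8, "0t": 10, "0x": 16}
    lowered = token.lower()
    for prefix, radix in prefixes.items():
        if lowered.startswith(prefix):
            return int(lowered[len(prefix):], radix)
    # No prefix: fall back to the default radix (hexadecimal in mdb).
    return int(lowered, default_radix)


print(parse_mdb_int("0t7"), parse_mdb_int("0t1024"), parse_mdb_int("1024"))
```

This is why the array lengths in the commands carry a `0t` prefix: a bare `1024` would be read as hexadecimal, which is a much larger number than the 1024 descriptors intended.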
02:16

rzezeski

for each address you see in the first command, you want to run the second command with that address
02:16

rzezeski

so you'll print all 7 Tx rings for all of your i40e instances
02:16

Smithx10

nice
02:17

Smithx10

gist.github.com/Smithx10/411a62d1014c1eb7989891a652bf58e2, they look to be all the appropriate size
02:18

Smithx10

i40e:1:trqpair_tx_7:tx_descriptors 1297, i40e:1:trqpair_tx_7:tx_err_nodescs 6254849 keeps growing
02:19

Smithx10

ive got about 100gb of memory free now, didnt help
02:20

rzezeski

are you sure that mdb output is from the problem host?
02:20

rzezeski

none of the `itxs_err_nodescs` are non-zero
02:22

Smithx10

same machine gist.github.com/Smithx10/411a62d101…989891a652bf58e2#file-gistfile2-txt
02:23

rzezeski

can you try changing that `0t7` to `0t8`
02:23

rzezeski

in the mdb command
02:23

Smithx10

interestingly, on a pretty fresh boot, the other gbox that had this issue hit it pretty much instantly: i40e:1:trqpair_tx_6:tx_err_nodescs 1681
02:25

Smithx10

gist.github.com/Smithx10/411a62d101…1eb7989891a652bf58e2#file-with-8-L9
02:25

rzezeski

off by one, always, haha
02:25

rzezeski

yea, so there's our problem child
02:25

Smithx10

ahh, so print out up to the 8th element in the array of ringz
02:26

Smithx10

I need to just spend time mucking around in mdb and the illumos source
02:27

rzezeski

yea, either the old driver had only 7 rings and now it has 8, or I was always doing it wrong in the past and got lucky. But anyways, now you can run the third command in that ticket with the address of the ring
02:28

rzezeski

the address of the ring is the second column above
02:28

rzezeski

`fffffa09b6c2d560`
02:29

rzezeski

`fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_ring | ::array i40e_tx_desc_t 0t1024 | ::printf "0x%x\n" i40e_tx_desc_t cmd_type_offset_bsz ! xargs -L1 echo > /var/tmp/i40e_desc_ring_0xfffffa09b6c2d560.out`
02:29

rzezeski

should hopefully work
02:31

Smithx10

got that file
02:32

rzezeski

okay, send it to me or gist it, it's just a list of hex numbers that I'm going to use as input for my descriptor decoder program
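
rzezeski's decoder is not shown in this log. As a rough idea of what such a tool might do, here is a sketch that splits each `cmd_type_offset_bsz` value into fields. The bit positions are an assumption based on recollection of Intel's i40e shared-code header and should be checked against the datasheet before relying on them; the function names are hypothetical.

```python
# Assumed descriptor types in the low 4 bits of the quadword.
DTYPE_NAMES = {0x0: "data", 0x1: "context", 0xF: "done"}


def decode_qw1(qw1):
    """Split one cmd_type_offset_bsz value into named fields (assumed layout)."""
    dtype = qw1 & 0xF
    out = {"dtype": DTYPE_NAMES.get(dtype, f"unknown({dtype:#x})")}
    if dtype == 0x0:
        # Data descriptor: command bits, header offsets, buffer size, VLAN tag.
        out["cmd"] = (qw1 >> 4) & 0x3FF
        out["offset"] = (qw1 >> 16) & 0x3FFFF
        out["buf_sz"] = (qw1 >> 34) & 0x3FFF
        out["l2tag1"] = (qw1 >> 48) & 0xFFFF
    elif dtype == 0x1:
        # Context descriptor: TSO payload length and MSS.
        out["tso_len"] = (qw1 >> 30) & 0x3FFFF
        out["mss"] = (qw1 >> 50) & 0x3FFF
    return out


def decode_lines(lines):
    """Decode an iterable of hex strings such as the lines of the .out file."""
    return [decode_qw1(int(line.strip(), 16)) for line in lines if line.strip()]


# Synthetic example value, not taken from the real dump.
example = 0x1 | (180 << 30) | (1448 << 50)
print(decode_qw1(example))
```

The descriptor's position in the ring is simply its line number in the file, so pairing the decoded list with `itrq_desc_head` shows which descriptor the ring is parked on.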
02:34

rzezeski

and after you do that show me the output of this: `fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_head |=D`
02:34

Smithx10

github.com/Smithx10/debug-images/bl…i40e_desc_ring_fffffa09b6c2d560.out
02:35

Smithx10

> fffffa09b6c2d560::print i40e_trqpair_t itrq_desc_head |=D; 280
02:41

rzezeski

gist.github.com/rzezeski/8ead9c44ec94982ad07f06eaa2c4b09e
02:41

rzezeski

that TSO descriptor looks...busted
02:41

rzezeski

that 180, IIRC, is supposed to be the size of the packet
02:41

rzezeski

a 180 byte TSO makes no sense
02:42

rzezeski

and the driver is probably like wtf do you want me to do with this
02:43

Smithx10

Interesting
02:45

rzezeski

If I'm right, then this is a bug in the driver probably.
02:49

Smithx10

nice
02:49

Smithx10

Wonder if its always been lurking or recently introduced
02:55

rzezeski

well I haven't used these tools in a while so they could also be doing something wrong, and it's getting late so my eyes are starting to bleed, but certainly this has the smell of a ring freeze, and i40e has a history here
02:56

Smithx10

Thank You a million
03:05

Smithx10

rzezeski: can I just turn off the frozen tx descriptor?..... I imagine whatever was causing it to happen will likely poke its head up again
03:07

jbk

danmcd: you around?
03:08

jbk

I have some i40e fixes I probably need to upstream
03:08

rzezeski

Smithx10: there is another issue I potentially see where too many descriptors are being used for a single segment, but I need to look at my old code with fresh eyes to be sure.
03:08

jbk

I thought dan might have built a smartos image w/ them in the past
03:08

jbk

that addresses a lot of these...
03:09

jbk

i could build you one, but I don't have a good place where I could make it available to download...
03:10

rzezeski

Smithx10: I don't know how to turn off a ring, we do expose a txrings linkprop, but I'm not sure what, if anything, would happen if you try to change that. Honestly the only thing I know for sure is to reset. But that also potentially loses us the smoking gun state. So if you could take an NMI so we could have a crash dump that would be great. But I understand that might not be possible if it's production.
03:10

Smithx10

They live in europe... so everything is possible
03:11

rzezeski

jbk: addresses known Tx ring freezes?
03:11

rzezeski

is there an issue written up anywhere?
03:11

jbk

there's probably something internal I can copy out
03:13

jbk

IIRC (this was 2 years ago)... smb in particular seemed to (for whatever reason) generate a run of big, big, big, tiny, tiny, tiny, tiny, big, big chains of fragments (b_cont) and the right mix of sizes could stop the ring
03:13

rzezeski

anyways, I've got to get to bed, I might be able to look more tomorrow Smithx10 but I'm also on the line for other work that needs to happen ASAP
03:13

Smithx10

no problem, thanks a ton... very helpful!
03:13

jbk

i think the other changes we made to address the DMA memory might have made it more sensitive
03:13

rzezeski

jbk: okay, yea I used to have a traffic generator that did that as well
03:14

rzezeski

iperf is great, but it doesn't replicate real traffic that can mess up that intricate logic
03:14

jbk

(we capped the size of the buffers for DMA at 4k, so you were more likely to be doing more splitting and then dma mapping and copying)
03:15

jbk

so you could create a vnic and not take a cross-state trip before it'd finish
03:15

jbk

after the system had been up for a while
03:16

rzezeski

k, well if I get a chance I might try replicating that traffic scheme and seeing if it reproduces, if it does I'd rather just make the most targeted fix possible, I don't have the time to write or review more involved memory/DMA changes to i40e at the moment, though I agree it would be nice to have that stuff
03:16

jbk

it turns out the docs for the earlier intel nics suggest at what the actual limitations are
03:17

rzezeski

anyways, I have to run to bed
03:17

jbk

jerry (and others) have reviewed it internally, and we've not had a single customer incident in the 2 years since we've deployed the fix
03:18

jbk

we ended up hosting like 8 VMs' images off NFS all running fio, seemed to be able to fairly regularly cause it
03:18

jbk

but it'd still take 10-40 mins
03:18

Smithx10

jbk, I could probably put something up real quick if you wanted to send over a platform
03:18

jbk

since it was really hard to control how the kernel decided to segment the TX packets before they got passed to mac
03:19

jbk

Smithx10: that's the thing, I don't really have a place I could host the file
03:19

jbk

or the problem...
03:19

Smithx10

1 sec
03:20

jbk

.... though I suppose now that i have gigabit symmetric (fiber), i could probably set something up at some point for stuff like this...
03:22

jbk

(for those curious) from looking at the docs for older intel nics (which spell this out more clearly than the i40e or ice docs)
03:23

jbk

and suggest the limitation is present in i40e and ice
03:23

jbk

it appears for every packet written on the wire, it can only do at most 8 DMA transfers to write it out
03:24

jbk

and if using TSO, the header is always done as its own transfer for each packet it generates (regardless if there's data in the same descriptor -- that will be done in a separate dma transfer)
03:25

jbk

(which also makes trying to map/copy mblk_ts which themselves can be segmented on arbitrary boundaries extremely complicated if you want to not just copy everything into pre-allocated buffers)
13:11

gitomat

[illumos-gate] 17610 build_cscope should stop looking for SPARC uts Makefiles -- Hans Rosenfeld <rosenfeld⊙gho>
13:12

gitomat

[illumos-gate] 17609 vioscsi doesn't probe for the target with the maximum target id -- Hans Rosenfeld <rosenfeld⊙gho>
20:14

Smithx10

Tried running the jbk bits, and got i40e:1:trqpair_tx_5:tx_err_nodescs 1390, Wonder if I should just slowly work my way backward until I land on a platform that doesn't do this....
20:24

richlowe

not backwards, binary search
20:24

richlowe

I don't know if you can easily fake 'git bisect' just downloading smartos builds
20:24

richlowe

danmcd: `piadm bisect`? :)
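
For reference, the binary search richlowe suggests looks roughly like this. It is a sketch under assumptions: the build names are hypothetical placeholders, `is_bad` stands in for the manual step of booting a platform image and watching the counters, and the oldest build is assumed known-good and the newest known-bad.

```python
def first_bad(builds, is_bad):
    """Return the earliest bad build.

    builds: ordered oldest to newest; builds[0] is assumed good and
    builds[-1] is assumed bad. is_bad(build) is the (manual) test.
    """
    if len(builds) < 2:
        raise ValueError("need at least one good and one bad build")
    lo, hi = 0, len(builds) - 1   # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid
        else:
            lo = mid
    return builds[hi]


# Hypothetical build labels; a real run would use platform image stamps.
builds = ["p01", "p02", "p03", "p04", "p05", "p06", "p07", "p08"]
tested = []

def is_bad(build):
    tested.append(build)
    return build >= "p05"     # pretend the regression landed in p05

print(first_bad(builds, is_bad), tested)
```

Each boot-and-observe cycle is expensive here, so the point is the count: eight candidate builds need about three tests rather than up to seven when walking backward one release at a time.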
20:28

Smithx10

richlowe: gonna try the jbk build with these github.com/illumos/illumos-gate/com…ster...jasonbking:illumos-gate:i40e without an aggr in the mix and see if the counters go up
21:03

Smithx10

nope happened with no aggr too.... boo!!!!
21:16

sommerfeld

Smithx10: not surprising if it's triggered by the mblk chain shape -- aggr's not going to rewrite that.
21:43

Smithx10

Looks like it gets frozen in joyent_20250306T000316Z gist.github.com/Smithx10/648f03dce630f8b530f49b6f12dfa9ba
21:43

Smithx10

gonna give 20240307T000552Z a go
21:53

rzezeski

jbk: yes, I was aware of these limitations. I dealt with several of these i40e freeze issues at Joyent.
21:53

rzezeski

Smithx10: what makes you believe an older platform will fix the issue? Have there been changes to i40e in SmartOS?
21:54

Smithx10

rzezeski: we were on an older platform a few months ago, like a 2023 platform, and the customer wasn't complaining
21:54

» rzezeski realizes he doesn't know where illumos-joyent lives anymore
21:55

Smithx10

github.com/TritonDataCenter/illumos-joyent
21:55

Smithx10

at the moment, I'm just trying to get to a platform where on monday this euro team comes in and doesn't put more tickets in my queue
21:57

Smithx10

rzezeski: is it possible that anyone touching TSO could get that ring frozen? or you think its just been there? you dont think its bad hw right?
22:00

Smithx10

rzezeski: 20230713T000721Z was what we were on a few months ago and no one was complaining...that could just be a coincidence.
22:00

rzezeski

well if it wasn't happening on an older platform my first question is what changed
22:00

rzezeski

the only change I see in the Tx path is Patrick's meoi changes
22:01

Smithx10

I grepped through git log, but I don't know what files to git blame or look through
22:01

Smithx10

you just checking i40e?
22:01

rzezeski

yea, just checked i40e
22:01

rzezeski

how did you reproduce?
22:01

Smithx10

just booted the box
22:01

rzezeski

just send traffic over i40e?
22:01

rzezeski

lol
22:02

rzezeski

fair enough
22:02

Smithx10

and just checked the counters
22:02

Smithx10

there are a bunch of zones on here running
22:02

Smithx10

its the weekend so i get away with being a little naughty
22:02

rzezeski

I don't think there are any smartos-exclusive changes here, so in theory I could put one of my i40e parts into my omnios box and see if I can replicate
22:03

rzezeski

yea, that's why I asked, those zones could be producing interesting traffic patterns that trip this behavior
22:03

rzezeski

so it may not be as easy to replicate at home
22:03

Smithx10

its a mix of bhyve os zones BHYV 65536
22:03

rzezeski

but one of the things I used back in the day was a generator that used writes of various sizes
22:03

Smithx10

not sure if memory matters
22:04

Smithx10

probably not
22:06

Smithx10

The good news is that within like 25-30 minutes this one box trips it up pretty regularly
22:07

Smithx10

another one is almost instant
23:40

jbk

rzezeski: yes, but all of the comments and the commentary at the time suggested no one was really sure what was going on with i40e and descriptors... I was just noting that the docs for older intel NICs seem to explain the actual behavior much more clearly than the i40e programming manual does
23:42

jbk

(basically with TSO, the header for each packet it produces from the input is done as its own DMA transfer -- leaving 7)
23:42

jbk

which means for the initial packet, a descriptor that contains the header + data will take 2 transfers (leaving 6 for the rest of that first packet)
23:44

jbk

or to put it another way, for each subsequent packet, it re-copies the initial header from RAM (then adjusts the fields as appropriate based on how much data has been sent so far)
23:45

jbk

and that always counts as 1 descriptor of the 8
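
The budget jbk describes can be modelled in a few lines. This is a simplified sketch, not driver code: it assumes each payload buffer maps to exactly one data descriptor, that the header is excluded from `payload_frags`, and that every wire packet costs one transfer for the header plus one per buffer its payload touches. The function names and the fragment sizes are hypothetical.

```python
MAX_TRANSFERS = 8  # per wire packet, per the limit described above


def transfers_per_segment(payload_frags, mss):
    """Count DMA transfers for each TSO segment cut from a buffer chain."""
    frags = [f for f in payload_frags if f > 0]
    counts = []
    i = 0                              # index of the buffer being consumed
    left = frags[0] if frags else 0    # bytes left in that buffer
    while i < len(frags):
        need = mss                     # payload bytes still wanted by this segment
        touched = 0                    # buffers this segment reads from
        while need > 0 and i < len(frags):
            take = min(need, left)
            need -= take
            left -= take
            touched += 1
            if left == 0:
                i += 1
                if i < len(frags):
                    left = frags[i]
        counts.append(1 + touched)     # +1 for the per-packet header transfer
    return counts


def over_limit(payload_frags, mss):
    """Indices of segments that would exceed the transfer budget."""
    counts = transfers_per_segment(payload_frags, mss)
    return [n for n, c in enumerate(counts) if c > MAX_TRANSFERS]


# A run of tiny fragments followed by a big one (sizes are invented).
chain = [50] * 10 + [4096]
print(transfers_per_segment(chain, 1448), over_limit(chain, 1448))
```

Under this model a chain of MSS-sized buffers costs two transfers per packet, while a run of tiny fragments pushes a single packet past the budget, which matches the "big, big, tiny, tiny" traffic shape described earlier in the log.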
