-
richlowe
I'm not sure I'd use the metaphor, but I can see the function
-
richlowe
it's a shame ufs is probably too dead for "bread" to be a pun
-
richlowe
(I swear, perhaps because I'm in a foul mood, that maybe 45% of software is literary analysis)
-
jbk
i'm still kinda sad none of the vm2 stuff made it out the door... it would have been interesting to see the design
-
jbk
(sysv ipc teething issues aside)
-
AmyMalik
is the bread line like a final reserve of some sort?
-
AmyMalik
or, i suppose that's classified
-
jbk
unfortunately AFAIK, none of the design was ever made public... just little hints here and there in the code in a few places
-
sommerfeld
so I spent a chunk of yesterday poking some more at tcp performance anomalies and realized we could use tcp small queues --
illumos.org/issues/17909
-
fenix
→
BUG 17909: tcp could benefit from a small-queues transmit throttle (In Progress)
-
richlowe
I read the bug, and it was interesting, but I'm uninformed
-
richlowe
another thing I'd ping rzezeski about though
-
jbk
actually looking at that... what layer is the queueing happening in? mac? tcp?
-
sommerfeld
jbk: what I've observed is queueing in the driver TX rings.
-
jbk
ok
-
sommerfeld
I wrote a quick dtrace script that hooked e1000g_recycle() and printed out ring occupancy
-
jbk
would it be correct (just trying to see if I'm understanding) that basically tcp can't respond because the response it wants to send is (to be a bit loose) stuck at the back of the line?
-
jbk
(to tx)
-
sommerfeld
driver hangs on to the queued mblks until the NIC says the tx descriptor is sent
-
sommerfeld
jbk: bingo. packet is transmitted by tcp as soon as it sees the signs (3 dup acks) but it sits in the tx ring for ~20ms
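
For a sense of scale (illustrative numbers only, not measured e1000g ring geometry): the worst-case delay a retransmit sees behind a full tx ring is just the queued bytes divided by the link rate. A quick Python check:

```python
def ring_delay_ms(descriptors, bytes_per_pkt, link_bps):
    """Worst-case drain time for a full tx ring: queued bits / link rate."""
    queued_bits = descriptors * bytes_per_pkt * 8
    return queued_bits / link_bps * 1000.0

# 256 full-size frames ahead of you on a 100 Mbit/s link: ~30 ms of delay.
print(round(ring_delay_ms(256, 1500, 100e6), 1))
```

The same ring drains in about 3 ms at 1 Gbit/s, which is why a fixed-depth ring that is harmless at one link speed becomes a latency problem at another.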
-
jbk
yeah that's pretty typical in a lot of the drivers for bound mblk_ts (they stick around until after the HW has signaled the stuff's been transmitted, or in some drivers we've manually checked to see if it's been sent)
-
rzezeski
Yes, I'll be very interested to see how that pans out, sommerfeld. I've been doing many different experiments over the last few months at all levels of the network stack and have a long grocery list, but not that specific issue. Nice find. I've seen plenty of other issues that this probably contributed to. E.g., our poor hashing on Tx.
-
jbk
it sounds like (thinking out loud) we'd want the throttling to happen in the mac layer so we're not handing off too many packets to the NIC for a given ring
-
sommerfeld
jbk: yeah, that's also potentially a good place for a throttle.
-
jbk
though i'm guessing there might be a bit of a balancing act as well?
-
sommerfeld
yeah, unless you provide backpressure from mac into tcp you want a throttle in TCP as well.
-
jbk
just because if we have the traffic to do so, we probably want to keep the NIC as busy as we can and not idle
-
rzezeski
how is mac making that decision? why not use the entire tx queue? it seems to me TCP is in the best place to make the throttling decision since it has the most information
-
sommerfeld
you want the rings full enough that tx doesn't stall but not so full that you add latency.
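
That "full enough not to stall, not so full it adds latency" target is what Linux's byte queue limits automate. A loose Python sketch of that kind of adjustment (a toy loosely modeled on the idea, not the actual BQL algorithm):

```python
class ByteQueueLimit:
    """Toy BQL-style limit: grow when the NIC ran dry while traffic was
    still waiting, shrink when completions show the limit was oversized."""
    def __init__(self, limit=10_000, step=1_000, floor=3_000):
        self.limit = limit
        self.step = step
        self.floor = floor
        self.inflight = 0        # bytes currently handed to the ring

    def can_queue(self, nbytes):
        return self.inflight + nbytes <= self.limit

    def queued(self, nbytes):
        self.inflight += nbytes

    def completed(self, nbytes, starved):
        # starved: the hardware went idle while the stack had more to send.
        self.inflight -= nbytes
        if starved:
            self.limit += self.step                # too small: tx stalled
        elif self.inflight < self.limit // 2:
            self.limit = max(self.floor, self.limit - self.step)  # too big
```

The point is that the limit is in bytes rather than packets, so it tracks drain time at the current link speed instead of a fixed ring depth.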
-
jbk
(which I guess means the throttle would vary by the speed of the NIC)
-
jbk
at least to a degree.. multiple rings blurs things
-
sommerfeld
BBR congestion control makes a stab at this by pacing tx and watching RTT and looking for the rate that causes RTT to increase and staying just below that.
-
jbk
i've not looked at the details of that
-
jbk
though just from the description, i'm picturing a feedback loop of some sort that's adjusting things up/down
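
That feedback loop can be sketched as a toy simulation (nothing like real BBR internals, just the shape described above): push the pacing rate up until the measured RTT starts climbing above the baseline, then settle just below that point.

```python
def pace(base_rtt_ms, rtt_at, start_rate=10.0, step=5.0, slack=1.1):
    """Probe upward in rate while the next step keeps RTT within
    `slack` of baseline; stop just below the rate that inflates RTT."""
    rate = start_rate
    while rtt_at(rate + step) <= base_rtt_ms * slack:
        rate += step
    return rate

# Hypothetical link: RTT sits at ~10 ms until offered load passes
# 100 Mbit/s, then queueing delay grows with the overload.
def rtt_at(rate_mbps, capacity=100.0, base=10.0):
    return base if rate_mbps <= capacity else base + (rate_mbps - capacity)

print(pace(10.0, rtt_at))
```

In this toy the probe converges on the link capacity, because that is exactly the rate beyond which queues (and therefore RTT) start to build.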
-
rzezeski
cxgbe has code that tries to reclaim relative to queue size, but it's all broken, I'm hopefully coming back around to that after the big wad in the sky lands, but for any given driver I would think they'd want to do it based on current queue usage and some type of hysteresis like we have for the SRS polling. But even then I feel like the driver is kind of guessing compared to the window scaling of TCP.
-
jbk
(having had to take classes in control systems in college, there's a lot of things that end up resembling them without explicitly saying it)...
-
sommerfeld
yeah, this is all control loops within control loops
-
richlowe
jbk: a terrifying lack of backpressure is also something that turns up basically everywhere in life, when you start to look
-
jbk
oh trust me I know :)
-
jbk
if zfs has an original sin, it's the lack of backpressure in the zio pipeline
-
jbk
soooooooooooooooooooooooo many problems from that if your backend is too slow
-
jbk
(it's also not trivial to fix at this point unfortunately)
-
jbk
well problems/outages
-
sommerfeld
rzezeski: when I was at google I worked in the same group with the guy who did the original TSQ for linux so I was primed to realize we were missing this when I saw the giant queues.
-
sommerfeld
lwn.net/Articles/507065 <- writeup from 2012 on the original
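
The core of the TSQ idea in that writeup, very roughly: cap how many bytes each TCP socket may have sitting below the TCP layer (qdisc plus driver ring) at once, and only hand down more as tx completions come back. A minimal Python sketch with invented names, not the Linux implementation:

```python
class TsqSocket:
    """Per-socket small-queue throttle: at most `limit` bytes may sit
    below TCP (qdisc + driver ring) at any moment."""
    def __init__(self, limit=128 * 1024):
        self.limit = limit
        self.below = 0           # bytes currently queued below TCP
        self.backlog = []        # segment sizes TCP wants to send

    def try_send(self):
        sent = []
        while self.backlog and self.below + self.backlog[0] <= self.limit:
            seg = self.backlog.pop(0)
            self.below += seg
            sent.append(seg)
        return sent

    def tx_completed(self, nbytes):
        # Driver freed the buffer: room opens up, push more down.
        self.below -= nbytes
        return self.try_send()
```

The segments that stay in `backlog` remain under TCP's control, so a retransmit or a window update never gets stuck behind 20 ms of the socket's own earlier data.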
-
jbk
what little hair i have left has largely turned grey from it
-
sommerfeld
with respect to limiting queueing at the driver, see
medium.com/@tom_84912/byte-queue-li…unauthorized-biography-61adc5730b83 for a discussion of similar work in linux
-
jbk
i wonder as a proof of concept, maybe just have a driver return any excess packets from its mc_tx(9E) and use mac_tx{_ring}_update() to signal more... though (not having looked at the code in there) there might be undesirable knock-on effects (but as a test..)
-
jbk
(a side note: it'd be nice when we return packets, if we could assume they don't have to be the identical mblk_ts in case we had to msgpullup() before deciding we couldn't TX it)
-
jbk
it just complicates the TX processing since you have to keep the original one around until you're sure you're not going to try to send it back up to mac (at which point you're almost certainly going to have to call msgpullup() again on it when it gets handed back up later)
-
sommerfeld
jbk: might be simpler to just allow one extra packet, and just return the ones after that one.
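
That "take what fits plus one extra, hand the rest back" rule could look something like the following userland toy (not real mc_tx(9E) code; the names and shapes are made up for illustration):

```python
def tx(ring, capacity, chain):
    """Queue packets while the ring has room, allow exactly one packet
    past the fill point, and return the remainder unsent (mac would
    hold these and retry after a tx update)."""
    overflowed = False
    for i, pkt in enumerate(chain):
        if len(ring) < capacity:
            ring.append(pkt)
        elif not overflowed:
            ring.append(pkt)      # the single allowed extra
            overflowed = True
        else:
            return chain[i:]      # give the rest back to the caller
    return []
```

Returning whole untouched packets this way sidesteps the msgpullup() bookkeeping above, since nothing partially processed ever goes back up.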
-
jbk
usually it happens when there's no space left on the ring (at least in most of our drivers today).. trying to make everything fit and can't
-
jbk
not qute the same as bufferbloat, but one thing if I ever get the time I'd like to dig into more is the I/O path in illumos
-
jbk
it seems like depending on the driver (sd, blkdev) as well as the filesystem (e.g. zfs) there's multiple levels of queueing as well as retries
-
jbk
which I think can often get in the way more than it helps these days
-
jbk
e.g. it'd be better for a disk (or bio or sd) to not retry a failed I/O too much
-
jbk
mark the block as bad and let ZFS self heal
-
jbk
there's probably some other things that'd be nice to think about as well..
-
jbk
when we enumerate a scsi disk, we send (IIRC) a START STOP UNIT and TEST UNIT READY
-
jbk
or attach one i guess
-
jbk
and that is I believe single threaded
-
jbk
so a misbehaving disk can _really_ slow things down during boot
-
jbk
(oh and send an INQUIRY as well)
-
jbk
sommerfeld: nice link btw (still reading, but was interrupted)