-
richlowe
I'm not sure I'd use the metaphor, but I can see the function
-
richlowe
it's a shame ufs is probably too dead for "bread" to be a pun
-
richlowe
(I swear, perhaps because I'm in a foul mood, that maybe 45% of software is literary analysis)
-
jbk
i'm still kinda sad none of the vm2 stuff made it out the door... it would have been interesting to see the design
-
jbk
(sysv ipc teething issues aside)
-
AmyMalik
is the bread line like a final reserve of some sort?
-
AmyMalik
or, i suppose that's classified
-
jbk
unfortunately AFAIK, none of the design was ever made public... just little hints here and there in the code in a few places
-
sommerfeld
so I spent a chunk of yesterday poking some more at tcp performance anomalies and realized we could use tcp small queues --
illumos.org/issues/17909
-
fenix
→
BUG 17909: tcp could benefit from a small-queues transmit throttle (In Progress)
-
richlowe
I read the bug, and it was interesting, but I'm uninformed
-
richlowe
another thing I'd ping rzezeski about though
-
jbk
actually looking at that... what layer is the queueing happening in? mac? tcp?
-
sommerfeld
jbk: what I've observed is queueing in the driver TX rings.
-
jbk
ok
-
sommerfeld
I wrote a quick dtrace script that hooked e1000g_recycle() and printed out ring occupancy
-
jbk
would it be correct (just trying to see if I'm understanding) that basically tcp can't respond because the response it wants to send is (to be a bit loose) stuck at the back of the line?
-
jbk
(to tx)
-
sommerfeld
driver hangs on to the queued mblks until the NIC says the tx descriptor is sent
-
sommerfeld
jbk: bingo. packet is transmitted by tcp as soon as it sees the signs (3 dup acks) but it sits in the tx ring for ~20ms
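
For a sense of scale (illustrative numbers only, not measured e1000g ring geometry): the worst-case delay a retransmit sees behind a full tx ring is just the queued bytes divided by the link rate. A quick Python check:

```python
def ring_delay_ms(descriptors, bytes_per_pkt, link_bps):
    """Worst-case drain time for a full tx ring: queued bits / link rate."""
    queued_bits = descriptors * bytes_per_pkt * 8
    return queued_bits / link_bps * 1000.0

# 256 full-size frames ahead of you on a 100 Mbit/s link: ~30 ms of delay.
print(round(ring_delay_ms(256, 1500, 100e6), 1))
```

The same ring drains in about 3 ms at 1 Gbit/s, which is why a fixed-depth ring that is harmless at one link speed becomes a latency problem at another.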
-
jbk
yeah that's pretty typical in a lot of the drivers for bound mblk_ts (they stick around until after the HW has signaled the stuff's been transmitted, or in some drivers we've manually checked to see if it's been sent)
-
rzezeski
Yes, I'll be very interested to see how that pans out, sommerfeld. I've been doing many different experiments over the last few months at all levels of the network stack and have a long grocery list, but not that specific issue. Nice find. I've seen plenty of other issues that this probably contributed to. E.g., our poor hashing on Tx.
-
jbk
it sounds like (thinking out loud) we'd want the throttling to happen in the mac layer so we're not handing off too many packets to the NIC for a given ring
-
sommerfeld
jbk: yeah, that's also potentially a good place for a throttle.
-
jbk
though i'm guessing there might be a bit of a balancing act as well?
-
sommerfeld
yeah, unless you provide backpressure from mac into tcp you want a throttle in TCP as well.
-
jbk
just because if we have the traffic to do so, we probably want to keep the NIC as busy as we can and not idle
-
rzezeski
how is mac making that decision? why not use the entire tx queue? it seems to me TCP is in the best place to make the throttling decision since it has the most information
-
sommerfeld
you want the rings full enough that tx doesn't stall but not so full that you add latency.
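
That "full enough not to stall, not so full it adds latency" target is what Linux's byte queue limits automate. A loose Python sketch of that kind of adjustment (a toy loosely modeled on the idea, not the actual BQL algorithm):

```python
class ByteQueueLimit:
    """Toy BQL-style limit: grow when the NIC ran dry while traffic was
    still waiting, shrink when completions show the limit was oversized."""
    def __init__(self, limit=10_000, step=1_000, floor=3_000):
        self.limit = limit
        self.step = step
        self.floor = floor
        self.inflight = 0        # bytes currently handed to the ring

    def can_queue(self, nbytes):
        return self.inflight + nbytes <= self.limit

    def queued(self, nbytes):
        self.inflight += nbytes

    def completed(self, nbytes, starved):
        # starved: the hardware went idle while the stack had more to send.
        self.inflight -= nbytes
        if starved:
            self.limit += self.step                # too small: tx stalled
        elif self.inflight < self.limit // 2:
            self.limit = max(self.floor, self.limit - self.step)  # too big
```

The point is that the limit is in bytes rather than packets, so it tracks drain time at the current link speed instead of a fixed ring depth.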
-
jbk
(which I guess means the throttle would vary by the speed of the NIC)
-
jbk
at least to a degree.. multiple rings blurs things
-
sommerfeld
BBR congestion control makes a stab at this by pacing tx and watching RTT and looking for the rate that causes RTT to increase and staying just below that.
-
jbk
i've not looked at the details of that
-
jbk
though just from the description, i'm picturing a feedback loop of some sort that's adjusting things up/down
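
That feedback loop can be sketched as a toy simulation (nothing like real BBR internals, just the shape described above): push the pacing rate up until the measured RTT starts climbing above the baseline, then settle just below that point.

```python
def pace(base_rtt_ms, rtt_at, start_rate=10.0, step=5.0, slack=1.1):
    """Probe upward in rate while the next step keeps RTT within
    `slack` of baseline; stop just below the rate that inflates RTT."""
    rate = start_rate
    while rtt_at(rate + step) <= base_rtt_ms * slack:
        rate += step
    return rate

# Hypothetical link: RTT sits at ~10 ms until offered load passes
# 100 Mbit/s, then queueing delay grows with the overload.
def rtt_at(rate_mbps, capacity=100.0, base=10.0):
    return base if rate_mbps <= capacity else base + (rate_mbps - capacity)

print(pace(10.0, rtt_at))
```

In this toy the probe converges on the link capacity, because that is exactly the rate beyond which queues (and therefore RTT) start to build.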
-
rzezeski
cxgbe has code that tries to reclaim relative to queue size, but it's all broken, I'm hopefully coming back around to that after the big wad in the sky lands, but for any given driver I would think they'd want to do it based on current queue usage and some type of hysteresis like we have for the SRS polling. But even then I feel like the driver is kind of guessing compared to the window scaling of TCP.
-
jbk
(having had to take classes in control systems in college, there's a lot of things that end up resembling them without explicitly saying it)...
-
sommerfeld
yeah, this is all control loops within control loops
-
richlowe
jbk: a terrifying lack of backpressure is also something that turns up basically everywhere in life, when you start to look
-
jbk
oh trust me I know :)
-
jbk
if zfs has an original sin, it's the lack of backpressure in the zio pipeline
-
jbk
soooooooooooooooooooooooo many problems from that if your backend is too slow
-
jbk
(it's also not trivial to fix at this point unfortunately)
-
jbk
well problems/outages
-
sommerfeld
rzezeski: when I was at google I worked in the same group with the guy who did the original TSQ for linux so I was primed to realize we were missing this when I saw the giant queues.
-
sommerfeld
lwn.net/Articles/507065 <- writeup from 2012 on the original
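
The core of the TSQ idea in that writeup, very roughly: cap how many bytes each TCP socket may have sitting below the TCP layer (qdisc plus driver ring) at once, and only hand down more as tx completions come back. A minimal Python sketch with invented names, not the Linux implementation:

```python
class TsqSocket:
    """Per-socket small-queue throttle: at most `limit` bytes may sit
    below TCP (qdisc + driver ring) at any moment."""
    def __init__(self, limit=128 * 1024):
        self.limit = limit
        self.below = 0           # bytes currently queued below TCP
        self.backlog = []        # segment sizes TCP wants to send

    def try_send(self):
        sent = []
        while self.backlog and self.below + self.backlog[0] <= self.limit:
            seg = self.backlog.pop(0)
            self.below += seg
            sent.append(seg)
        return sent

    def tx_completed(self, nbytes):
        # Driver freed the buffer: room opens up, push more down.
        self.below -= nbytes
        return self.try_send()
```

The segments that stay in `backlog` remain under TCP's control, so a retransmit or a window update never gets stuck behind 20 ms of the socket's own earlier data.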
-
jbk
what little hair i have left has largely turned grey from it
-
sommerfeld
with respect to limiting queueing at the driver, see
medium.com/@tom_84912/byte-queue-li…unauthorized-biography-61adc5730b83 for a discussion of similar work in linux
-
jbk
i wonder as a proof of concept, maybe just have a driver return any excess packets from its mc_tx(9E) and use mac_tx{_ring}_update() to signal more... though (not having looked at the code in there) there might be undesirable knock-on effects (but as a test..)
-
jbk
(a side note: it'd be nice when we return packets, if we could assume they don't have to be the identical mblk_ts in case we had to msgpullup() before deciding we couldn't TX it)
-
jbk
it just complicates the TX processing since you have to keep the original one around until you're sure you're not going to try to send it back up to mac (at which point you're almost certainly going to have to call msgpullup() again on it when it gets handed back up later)
-
sommerfeld
jbk: might be simpler to just allow one extra packet, and just return the ones after that one.
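
That "take what fits plus one extra, hand the rest back" rule could look something like the following userland toy (not real mc_tx(9E) code; the names and shapes are made up for illustration):

```python
def tx(ring, capacity, chain):
    """Queue packets while the ring has room, allow exactly one packet
    past the fill point, and return the remainder unsent (mac would
    hold these and retry after a tx update)."""
    overflowed = False
    for i, pkt in enumerate(chain):
        if len(ring) < capacity:
            ring.append(pkt)
        elif not overflowed:
            ring.append(pkt)      # the single allowed extra
            overflowed = True
        else:
            return chain[i:]      # give the rest back to the caller
    return []
```

Returning whole untouched packets this way sidesteps the msgpullup() bookkeeping above, since nothing partially processed ever goes back up.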
-
jbk
usually it happens when there's no space left on the ring (at least in most of our drivers today).. trying to make everything fit and can't
-
jbk
not qute the same as bufferbloat, but one thing if I ever get the time I'd like to dig into more is the I/O path in illumos
-
jbk
it seems like depending on the driver (sd, blkdev) as well as the filesystem (e.g. zfs) there's multiple levels of queueing as well as retries
-
jbk
which I think can often get in the way more than it helps these days
-
jbk
e.g. it'd be better for a disk (or bio or sd) to not retry a failed I/O too much
-
jbk
mark the block as bad and let ZFS self heal
-
jbk
there's probably some other things that'd be nice to think about as well..
-
jbk
when we enumerate a scsi disk, we send (IIRC) a START STOP UNIT and TEST UNIT READY
-
jbk
or attach one i guess
-
jbk
and that is I believe single threaded
-
jbk
so a misbehaving disk can _really_ slow things down during boot
-
jbk
(oh and send an INQUIRY as well)
-
jbk
sommerfeld: nice link btw (still reading, but was interrupted)