00:42:26 I'm not sure I'd use the metaphor, but I can see the function
00:42:42 it's a shame ufs is probably too dead for "bread" to be a pun
00:48:31 (I swear, perhaps because I'm in a foul mood, that maybe 45% of software is literary analysis)
01:49:19 i'm still kinda sad none of the vm2 stuff made it out the door... it would have been interesting to see the design
01:49:27 (sysv ipc teething issues aside)
05:57:17 is the bread line like a final reserve of some sort?
05:57:25 or, i suppose that's classified
13:53:26 unfortunately AFAIK, none of the design was ever made public... just little hints here and there in the code in a few places
17:21:25 so I spent a chunk of yesterday poking some more at tcp performance anomalies and realized we could use tcp small queues -- https://www.illumos.org/issues/17909
17:21:27 → BUG 17909: tcp could benefit from a small-queues transmit throttle (In Progress)
20:00:28 I read the bug, and it was interesting, but I'm uninformed
20:00:40 another thing I'd ping rzezeski about though
20:05:16 actually looking at that... what layer is the queueing happening in? mac? tcp?
20:06:01 jbk: what I've observed is queueing in the driver TX rings.
20:06:15 ok
20:06:48 I wrote a quick dtrace script that hooked e1000g_recycle() and printed out ring occupancy
20:07:11 would it be correct (just trying to see if I'm understanding) that basically tcp can't respond because the response it wants to send is (to be a bit loose) stuck at the back of the line?
20:07:17 (to tx)
20:07:32 the driver hangs on to the queued mblks until the NIC says the tx descriptor is sent
20:08:09 jbk: bingo.
the packet is transmitted by tcp as soon as it sees the signs (3 dup acks), but it sits in the tx ring for ~20ms
20:12:08 yeah that's pretty typical in a lot of the drivers for bound mblk_ts (they stick around until after the HW has signaled the stuff's been transmitted, or in some drivers we've manually checked to see if it's been sent)
20:13:26 Yes, I'll be very interested to see how that pans out, sommerfeld. I've been doing many different experiments over the last few months at all levels of the network stack and have a long grocery list, but not that specific issue. Nice find. I've seen plenty of other issues that this probably contributed to. E.g., our poor hashing on Tx.
20:14:46 it sounds like (thinking out loud) we'd want the throttling to happen in the mac layer so we're not handing off too many packets to the NIC for a given ring
20:15:44 jbk: yeah, that's also potentially a good place for a throttle.
20:15:51 though i'm guessing there might be a bit of a balancing act as well?
20:16:13 yeah, unless you provide backpressure from mac into tcp you want a throttle in TCP as well.
20:16:31 just because if we have the traffic to do so, we probably want to keep the NIC as busy as we can and not idle
20:16:59 how is mac making that decision? why not use the entire tx queue? it seems to me TCP is in the best place to make the throttling decision since it has the most information
20:17:13 you want the rings full enough that tx doesn't stall but not so full that you add latency.
20:17:21 (which I guess means the throttle would vary by the speed of the NIC)
20:17:45 at least to a degree.. multiple rings blurs things
20:18:20 BBR congestion control makes a stab at this by pacing tx, watching RTT, looking for the rate that causes RTT to increase, and staying just below that.
20:20:10 i've not looked at the details of that
20:20:49 though just from the description, i'm picturing a feedback loop of some sort that's adjusting things up/down
20:23:35 cxgbe has code that tries to reclaim relative to queue size, but it's all broken. I'm hopefully coming back around to that after the big wad in the sky lands, but for any given driver I would think they'd want to do it based on current queue usage and some type of hysteresis, like we have for the SRS polling. But even then I feel like the driver is kind of guessing compared to the window scaling of TCP.
20:24:07 (having had to take classes in control systems in college, there are a lot of things that end up resembling them without explicitly saying it)...
20:24:32 yeah, this is all control loops within control loops
20:24:50 jbk: a terrifying lack of backpressure is also something that turns up basically everywhere in life, when you start to look
20:25:05 oh trust me I know :)
20:25:35 if zfs has an original sin, it's the lack of backpressure in the zio pipeline
20:25:59 soooooooooooooooooooooooo many problems from that if your backend is too slow
20:26:17 (it's also not trivial to fix at this point unfortunately)
20:26:56 well problems/outages
20:28:17 rzezeski: when I was at google I worked in the same group with the guy who did the original TSQ for linux, so I was primed to realize we were missing this when I saw the giant queues.
20:28:59 https://lwn.net/Articles/507065/ <- writeup from 2012 on the original
20:29:20 what little hair i have left has largely turned grey from it
20:33:28 with respect to limiting queueing at the driver, see https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83 for a discussion of similar work in linux
20:53:29 i wonder, as a proof of concept, maybe just have a driver return any excess packets from its mc_tx(9E) and use mac_tx{_ring}_update() to signal more...
though (not having looked at the code in there) there might be undesirable knock-on effects (but as a test..)
20:56:52 (a side note: it'd be nice when we return packets if we could assume they don't have to be the identical mblk_ts, in case we had to msgpullup() before deciding we couldn't TX it)
20:58:21 it just complicates the TX processing since you have to keep the original one around until you're sure you're not going to try to send it back up to mac (at which point you're almost certainly going to have to call msgpullup() again on it when it gets handed back up later)
20:59:24 jbk: might be simpler to just allow one extra packet, and just return the ones after that one.
21:00:20 usually it happens when there's no space left on the ring (at least in most of our drivers today).. trying to make everything fit and can't
22:04:04 not quite the same as bufferbloat, but one thing I'd like to dig into more, if I ever get the time, is the I/O path in illumos
22:04:37 it seems like depending on the driver (sd, blkdev) as well as the filesystem (e.g. zfs) there are multiple levels of queueing as well as retries
22:06:20 which I think can often get in the way more than it helps these days
22:07:57 e.g. it'd be better for a disk (or bio or sd) to not retry a failed I/O too much
22:08:04 mark the block as bad and let ZFS self-heal
22:09:31 there's probably some other things that'd be nice to think about as well..
22:10:10 when we enumerate a scsi disk, we send (IIRC) a START STOP UNIT and a TEST UNIT READY
22:10:15 or attach one i guess
22:10:22 and that is I believe single threaded
22:10:35 so a misbehaving disk can _really_ slow things down during boot
22:11:19 (oh and send an INQUIRY as well)
22:13:58 sommerfeld: nice link btw (still reading, but was interrupted)