02:53:52 filed https://www.illumos.org/issues/17895 on the LSO visibility thing.
02:53:53 → BUG 17895: need better visibility into ip offload state (New)
03:46:38 that does remind me... i should probably put up my i40e changes for review at some point...
03:47:00 just i don't have a testing setup, so if someone wants changes during review, I have no way of really re-testing any of them
03:47:48 (it limits the size of the DMA buffers to avoid the VM having to do its best defrag impression with jumbo frames and vnics)
05:54:26 jbk: I have a couple i40e-equipped machines so when you have something to test...
13:45:53 it's been running on I don't know how many of our customers now for probably 2 years, so I'm pretty confident it works fine.. the big thing is trying to recreate the pathological cases that can lead to a TX ring stopping if you get it wrong...
13:46:59 we had a bunch of VMs causing continuous massive network I/O...
13:47:38 running for 24+ hours (any problems usually showed up in 2-3 hours)
13:49:10 the Intel NICs have some rather annoying restrictions on LSO that make translating 'large' (> MTU) packets (segmented across many b_cont-linked mblk_t chains) complicated
13:49:49 (and the restrictions have become more poorly documented/explained in the Intel documentation with each successive family of NIC :P)
13:50:07 yeah, I was suffering frozen rings
13:52:23 for reasons I don't understand, it turns out the smb code is pretty good at inducing pathological mblk_ts (for intel nics)
13:52:45 (either directly or indirectly through the tcp layer)
13:53:38 basically a chain of 'big, big, big, tiny, tiny, tiny, tiny, big, big' dblk_ts for a packet tends to do it
13:54:32 for ice (E810), I took a _slightly_ different approach that I'm hoping (once I'm able to torture test it) works better...
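A rough sketch of why the 'big, big, tiny, tiny, ...' chains above are pathological: model the dblk_t sizes in a b_cont chain as an array of fragment lengths, then count how many fragments each MSS-sized window of the packet touches. A run of tiny fragments inside one window means that one wire packet needs many separate DMA binds, which is what busts the hardware's per-packet budget. This is a hypothetical model for illustration, not the driver's actual code; the function name and the simplified overlap test are mine.

```c
#include <stddef.h>

/*
 * Hypothetical sketch (not the actual i40e code): given the fragment
 * lengths of an LSO packet's b_cont chain, return the largest number
 * of fragments that any single MSS-sized window overlaps.  A chain of
 * 'big, big, tiny, tiny, tiny, ...' dblk_ts concentrates many tiny
 * fragments into one window and drives this number up.
 */
size_t
max_frags_per_window(const size_t *frag_len, size_t nfrags, size_t mss)
{
	size_t total = 0, worst = 0, off, i;

	for (i = 0; i < nfrags; i++)
		total += frag_len[i];

	for (off = 0; off < total; off += mss) {
		size_t win_end = off + mss;
		size_t pos = 0, count = 0;

		for (i = 0; i < nfrags; i++) {
			size_t f_start = pos;
			size_t f_end = pos + frag_len[i];

			/* Does this fragment overlap [off, win_end)? */
			if (f_end > off && f_start < win_end)
				count++;
			pos = f_end;
		}
		if (count > worst)
			worst = count;
	}
	return (worst);
}
```

With an evenly split chain the worst window touches one or two fragments; sandwich a run of 50-byte fragments between two big ones and a single window can touch nine or more.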
13:55:17 (not finished with it yet -- having to configure an entire network switch and applications -- even if the answer is just 'yeah, don't do that stuff', it's a lot of work :P)
14:03:05 (for those curious, the problem with Intels is you still have to do a quasi-segmentation of the packet in the driver -- they will do at most 8 DMA transfers to assemble each 'chunked' packet on the wire from the original larger packet... except the headers (and only the headers) are done as 1 transfer (so a descriptor with header + data takes 2 transfers))
14:06:03 in the ice driver, as we're mapping/copying things, I add a bit of extra state (I wrap all of it in a struct since it makes it easier for debugging) so we can assume a 'happy path' most of the time, and if we exceed a limit for a portion, we can roll everything back for just that 'chunk' (not really sure what the terminology would be) and just copy that chunk (since that will always fit)
19:05:28 do we leave the ACPI tables at their physical address, or do we do any remapping? (I'm wondering if I can stash a pointer to a table early in boot and have it stay valid once the system is up or not..)
19:05:43 (when accessing in the kernel)
19:36:45 [illumos-gate] 17605 !dohwcksum should also disable LSO -- Bill Sommerfeld
22:43:48 jbk: I don't think this is the scenario you mean, but I just built a fileserver for $homelab that's got two i40e SFP+ connections. It's a bare-iron OmniOS server that only feeds hypervisors their VDIs (via NFS). Happy to test if that would be in any way useful to you.
22:44:16 the big thing is if you use jumbo frames + vnics
22:44:52 oh heck, I think I forgot to configure jumbo frames on that server! I think I should be. Give me a few minutes to check.
22:45:11 the upstream driver will allocate 9kb physically contiguous buffers when you create a vnic over it
22:45:21 (or it can at least)
22:45:30 and that's where the problem can come in..
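The transfer accounting and the rollback idea described above can be sketched roughly like this. Per the discussion, the hardware does at most 8 DMA transfers per wire packet, and the headers always count as one transfer of their own; if binding every data fragment separately would bust that budget, the fallback is to copy the chunk's data into one contiguous bounce buffer, which always costs exactly 2 transfers (headers + one data buffer). The names and the constant are mine for illustration; the real ice driver logic is considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed per-wire-packet DMA transfer budget, per the discussion. */
#define	MAX_XFERS_PER_PKT	8

/*
 * Transfers needed if we DMA-bind each of ndata data fragments
 * separately: 1 for the headers, plus 1 per fragment.
 */
static size_t
pkt_xfers(size_t ndata)
{
	return (1 + ndata);
}

/*
 * Hypothetical sketch of the rollback: try the 'happy path' (bind
 * everything); if that exceeds the budget, roll back this one chunk
 * and copy it into a single bounce buffer instead, which is always
 * headers + one data transfer = 2.
 */
size_t
chunk_xfers_with_fallback(size_t ndata, bool *copied)
{
	size_t n = pkt_xfers(ndata);

	if (n <= MAX_XFERS_PER_PKT) {
		*copied = false;
		return (n);
	}
	*copied = true;
	return (2);
}
```

So a chunk spread over 5 fragments binds cleanly (6 transfers), while one spread over 9 fragments would need 10 and falls back to the copy path.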
22:46:17 if the system has been up for a while, or is busy, the VM system will have to move pages around in memory (basically like DEFRAG.EXE but for ram :P)
22:46:46 Yep, should have configured jumbo frames on the NAS interface. Time to go read documentation to see how to do that in OmniOS.
22:46:50 which we've seen take upwards of 45 minutes(!).. all while dladm create-vnic is stuck unkillable in the kernel waiting for that to finish
22:48:31 the fix is to limit the size of the buffers allocated to <= 4k (page size) and instead split packets across multiple buffers (the cards support this just fine)
22:49:18 (technically that isn't _guaranteed_ to work across all possible platforms, but in practice on x86 it does)
22:49:55 while I wait for my XOA VDI to migrate off that server (I'm just going to halt everything else hosted there) I'll look up how to configure jumbo frames... I don't use them at $job[1] so have never needed to do that before.
22:50:24 the thing with jumbo frames is you _really, really_ want to be sure everything on the vlan is using them (and agrees on the size)
22:50:37 otherwise you will be pulling your hair out trying to figure out strange behavior
22:51:15 because sometimes things will work, other times things will act like they're down or something's broken on the network
22:51:42 and it will seem random (if a packet is too big, it just gets unceremoniously dropped w/o any sort of notification on the network)
22:51:52 the ethernet packet (frame)
22:52:20 yep, everything on the NAS VLAN is using jumbo frames (except for hvfs, which I'm running 'sudo dladm set-linkprop -p mtu=9000 i40e1' on now)
22:52:25 and since the size of an ethernet frame is largely non-deterministic
22:52:39 you can't easily predict when that's going to happen
22:52:49 well... will be running that command once the VDI is migrated and the remaining VMs are halted.
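The buffer-size fix described above (22:48:31) amounts to simple arithmetic: cap each buffer at one page and split the frame across however many buffers it needs, rather than demanding one large physically contiguous allocation that the VM system may have to defragment for. A minimal sketch, with the names and constant assumed for illustration:

```c
#include <stddef.h>

/* Cap each DMA buffer at <= one (x86) page, per the fix above. */
#define	BUF_MAX		4096

/*
 * Hypothetical sketch: number of page-sized buffers needed to hold a
 * frame of the given length once packets are split across buffers.
 */
size_t
nbufs_for_frame(size_t frame_len)
{
	return ((frame_len + BUF_MAX - 1) / BUF_MAX);
}
```

A standard 1514-byte frame still fits in one buffer; a 9000-byte jumbo frame takes three page-sized buffers instead of one 9kb contiguous allocation.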
22:57:27 My old file server is so slow by comparison (14-year-old server blah blah spinning rust blah blah blah) that I didn't notice the VMs weren't running optimally from the new file server (6-year-old hypervisor + SSDs).
22:58:32 oh, right, link busy. Can't set linkprop. ugh. More readin'!
22:59:00 I fear an irix-bloat-like situation, just because everything is so tolerably fast these days
22:59:06 when I'm jaded, that is
22:59:45 I bet jbk's byte-at-a-time i/o isn't even _that_ rough, on a modern system
23:00:07 you can make a lot of really bad decisions, and fully get away with them on modern hardware
23:00:14 get away with them for ages
23:00:26 things that keep me up at night dot text
23:00:43 oh... that's confusing. Now I'm seeing I should use ipadm set-ifprop. So which is it?
23:02:33 richlowe: if you mean for i40e, it's more I think just a question of time
23:03:18 intel made some decisions on how they support TSO that make mapping our mblk_ts _really_ painful
23:03:19 looks like ipadm set-ifprop refuses to set mtu over 1500.
23:03:44 you might need to delete the IP interface, then do dladm, then recreate the ip interface
23:03:58 (I feel like we should be able to do better there, but that's how it is now)
23:04:17 I mean in general
23:04:37 but specifically nomad's system being so fast now that you can just not notice major misconfiguration :)
23:04:45 richlowe: if we had a bit more rigor in the network stack surrounding how things could be split across multiple buffers, that might help a bit
23:06:08 there's more than a few checks that boil down to 'things can potentially be split at an arbitrary byte boundary'
23:08:36 jbk: delete and recreate (with a dladm set-linkprop before create-ip) is what I wound up doing.
23:12:01 ping -s 9000 is a happy camper now.
23:19:26 richlowe: "new server, who dis?"