-
sommerfeld
filed
illumos.org/issues/17895 on the LSO visibility thing.
-
fenix
→
BUG 17895: need better visibility into ip offload state (New)
-
jbk
that does remind me... i should probably put up my i40e changes for review at some point...
-
jbk
just i don't have a testing setup, so if someone wants changes during review, I have no way of really re-testing any of them
-
jbk
(it limits the size of the DMA buffers to avoid the VM having to do its best defrag impression with jumbo frames and vnics)
-
sommerfeld
jbk: I have a couple i40e-equipped machines so when you have something to test...
-
jbk
it's been running on I don't know how many of our customers now for probably 2 years, so I'm pretty confident it works fine.. the big thing is trying to recreate the pathological cases that can lead to a TX ring stopping if you get it wrong...
-
jbk
we had a bunch of VMs causing continuous massive network I/O...
-
jbk
running for 24+ hours (any problems usually showed up in 2-3 hours)
-
jbk
the Intel NICs have some rather annoying restrictions on LSO that make translating 'large' (> MTU) packets (segmented across many b_cont linked mblk_t chains) complicated
-
jbk
(and the restrictions have become more poorly documented/explained in the Intel documentation with each successive family of NIC :P)
-
Smithx10
yeah, I was suffering frozen rings
-
jbk
for reasons I don't understand, it turns out the smb code is pretty good at inducing pathological mblk_ts (for intel nics)
-
jbk
(either directly or indirectly through the tcp layer)
-
jbk
basically a chain of 'big, big, big, tiny, tiny, tiny, tiny, big, big' dblk_ts for a packet tends to do it
-
jbk
for ice (E810), I took a _slightly_ different approach that I'm hoping (once I'm able to torture test it) works better...
-
jbk
(not finished with it yet -- having to configure an entire network switch and applications, even if it's just 'yeah, don't do that stuff', is a lot of work :P)
-
jbk
(for those curious, the problem with the Intel NICs is you still have to do a quasi-segmentation of the packet in the driver -- they will do at most 8 DMA transfers to assemble each 'chunked' packet on the wire from the original larger packet.... except the headers (and only the headers) are done as 1 transfer (so a descriptor with header + data takes 2 transfers))
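A rough userspace model of that restriction may make it concrete (all names and the exact limits here are illustrative, not taken from the actual driver; the real constraints live in Intel's datasheets):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the restriction described above: the NIC
 * assembles each wire-sized chunk of a large LSO packet from at most
 * 8 DMA transfers, and the headers always count as one transfer of
 * their own.
 */
#define MAX_XFERS_PER_CHUNK 8

/* One entry per dblk_t/b_cont fragment of the packet's payload. */
typedef struct {
	size_t len;	/* bytes in this fragment */
} frag_t;

/*
 * Walk an MSS-sized window across the payload fragments and return
 * the worst-case number of DMA transfers any one wire chunk would
 * need (header transfer included).  A result > MAX_XFERS_PER_CHUNK
 * is the case the driver must handle, or the TX ring can wedge.
 */
size_t
worst_chunk_xfers(const frag_t *frags, size_t nfrags, size_t mss)
{
	size_t worst = 0;
	size_t i = 0, off = 0;	/* current fragment and offset into it */

	while (i < nfrags) {
		size_t left = mss, xfers = 1;	/* 1 = the header transfer */

		while (left > 0 && i < nfrags) {
			size_t avail = frags[i].len - off;
			size_t take = avail < left ? avail : left;

			xfers++;	/* each fragment touched = 1 transfer */
			left -= take;
			off += take;
			if (off == frags[i].len) {
				i++;
				off = 0;
			}
		}
		if (xfers > worst)
			worst = xfers;
	}
	return (worst);
}
```

A chain of many tiny dblk_ts, as in the smb case above, is exactly what pushes one chunk past the transfer limit.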
-
jbk
in the ice driver, as we're mapping/copying things, I add a bit of extra state (I wrap all of it in a struct since it makes it easier for debugging) so we can assume a 'happy path' most of the time, and if we exceed a limit for a portion, we can roll everything back for just that 'chunk' (not really sure what the terminology would be) and just copy that chunk (since that will always fit)
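A sketch of that 'happy path plus rollback' idea, in userspace (every name, constant, and the bounce-buffer fallback here are invented for illustration; the real ice code presumably differs):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	MAX_XFERS	8	/* hypothetical per-chunk DMA transfer limit */
#define	MSS		1460	/* payload bytes per wire chunk */

/* Invented per-chunk bookkeeping, in the spirit of the wrapper struct. */
typedef struct {
	size_t	nxfers;		/* DMA transfers committed (headers = 1) */
	int	copied;		/* fell back to the copy path */
} chunk_state_t;

/*
 * Try to bind one MSS-sized chunk's fragments directly (the happy
 * path).  If that would exceed MAX_XFERS, roll this chunk back and
 * copy its payload into one bounce buffer instead, which always fits.
 */
chunk_state_t
place_chunk(const size_t *frag_lens, size_t nfrags, char *bounce)
{
	chunk_state_t cs = { .nxfers = 1, .copied = 0 };  /* 1 = headers */
	size_t left = MSS;

	for (size_t i = 0; i < nfrags && left > 0; i++) {
		size_t take = frag_lens[i] < left ? frag_lens[i] : left;

		cs.nxfers++;
		left -= take;
		if (cs.nxfers > MAX_XFERS) {
			/* Roll back the per-fragment bindings... */
			memset(bounce, 0, MSS);	/* ...stand-in for the copy */
			cs.nxfers = 2;		/* headers + one copied buffer */
			cs.copied = 1;
			return (cs);
		}
	}
	return (cs);
}
```

The appeal of the scheme is that the rollback is scoped to a single chunk, so well-behaved packets never pay for the copy.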
-
jbk
do we leave the ACPI tables at their physical address, or do we do any remapping? (I'm wondering if I can stash a pointer to a table early in boot and it stays valid once the system is up or not..)
-
jbk
(when accessing in the kernel)
-
gitomat
[illumos-gate] 17605 !dohwcksum should also disable LSO -- Bill Sommerfeld <sommerfeld⊙ho>
-
nomad
jbk: I don't think this is the scenario you mean but I just built a fileserver for $homelab that's got two i40e SFP+ connections. It's a bare-iron OmniOS server that only feeds hypervisors their VDIs (via NFS). Happy to test if that would be in any way useful to you.
-
jbk
the big thing is if you use jumbo frames + vnics
-
nomad
oh heck, I think I forgot to configure jumbo frames on that server! I think I should be. Give me a few minutes to check.
-
jbk
the upstream driver will allocate 9kb physically contiguous buffers when you create a vnic over it
-
jbk
(or it can at least)
-
jbk
and that's where the problem can come in..
-
jbk
if the system has been up for a while, or is busy, the VM system will have to move pages around in memory (basically like DEFRAG.EXE but for ram :P)
-
nomad
Yep, should have configured jumbo frames on the NAS interface. Time to go read documentation to see how to do that in OmniOS.
-
jbk
which we've seen take upwards of 45 minutes(!).. all while dladm create-vnic is stuck unkillable in the kernel waiting for that to finish
-
jbk
the fix is to limit the size of the buffers allocated to <= 4k (page size) and instead split packets across multiple buffers (cards support this just fine)
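The arithmetic of that fix is simple; a tiny sketch (PAGE_SIZE and the function name are assumptions for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define	PAGE_SIZE	4096	/* x86 base page */

/*
 * Instead of demanding one physically contiguous 9 KB buffer per
 * jumbo frame (which can stall for ages behind VM page migration),
 * split each frame across page-sized buffers the allocator can
 * always hand out.  Returns how many buffers one frame needs.
 */
size_t
bufs_per_frame(size_t frame_len)
{
	return ((frame_len + PAGE_SIZE - 1) / PAGE_SIZE);
}
```

So a 9000-byte jumbo frame costs three page-sized buffers rather than one contiguous 9 KB allocation.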
-
jbk
(technically that isn't _guaranteed_ to work across all possible platforms, but in practice on x86 it does)
-
nomad
while I wait for my XOA VDI to migrate off that server (I'm just going to halt everything else hosted there) I'll look up how to configure jumbo frames... I don't use them at $job[1] so have never needed to do that before.
-
jbk
the thing with jumboframes is you _really, really_ want to be sure everything on the vlan is using them (and agrees on the size)
-
jbk
otherwise you will be pulling your hair out trying to figure out strange behavior
-
jbk
because sometimes things will work, other times things will act like they're down or something's broken on the network
-
jbk
and it will seem random (if a packet is too big, it just gets unceremoniously dropped w/o any sort of notification on the network)
-
jbk
the ethernet packet (frame)
-
nomad
yep, everything on the NAS VLAN is using jumbo frames (except for hvfs, which I'm running 'sudo dladm set-linkprop -p mtu=9000 i40e1' on now)
-
jbk
and since the size of an ethernet frame is largely non-deterministic
-
jbk
you can't easily predict when that's going to happen
-
nomad
well... will be running that command once the VDI is migrated and the remaining VMs are halted.
-
nomad
My old file server is so slow by comparison (14 year old server blah blah spinning rust blah blah blah) that I didn't notice the VMs weren't running optimally from the new file server (6 year old hypervisor + SSDs).
-
nomad
oh, right, link busy. Can't set linkprop. ugh. More readin'!
-
richlowe
I fear an irix-bloat like situation, just because everything is so tolerably fast these days
-
richlowe
when I'm jaded, that is
-
richlowe
I bet jbk's byte-at-a-time i/o isn't even _that_ rough, on a modern system
-
richlowe
you can make a lot of really bad decisions, and fully get away with them on modern hardware
-
richlowe
get away with them for ages
-
richlowe
things that keep me up at night dot text
-
nomad
oh... that's confusing. Now I'm seeing I should use ipadm set-ifprop. So which is it?
-
jbk
richlowe: if you mean for i40e, it's more I think just a question of time
-
jbk
intel made some decisions on how they support TSO that makes mapping our mblk_ts _really_ painful
-
nomad
looks like ipadm set-ifprop refuses to set mtu over 1500.
-
jbk
you might need to delete the IP interface, then do dladm, then recreate the ip interface
-
jbk
(I feel like we should be able to do better there, but that's how it is now)
-
richlowe
I mean in general
-
richlowe
but specifically nomad's system being so fast now, that you can just not notice major misconfiguration :)
-
jbk
richlowe: if we had a bit more rigor in the network stack surrounding how things could be split across multiple buffers, that might help a bit
-
jbk
there's more than a few checks that boil down to 'things can potentially be split at an arbitrary byte boundary'
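A minimal userspace illustration of why arbitrary-boundary splits force copies (the types and the `pullup` name are stand-ins, not the actual mblk_t API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-in for an mblk_t-style fragment chain. */
typedef struct frag {
	const char	*data;
	size_t		len;
	struct frag	*next;
} frag_t;

/*
 * Because the stack allows a packet to be split at any byte boundary,
 * a consumer that needs the first n bytes contiguous (e.g. to parse
 * headers) may have to copy them out of several fragments.  Returns
 * 0 on success, -1 if the chain is shorter than n.
 */
int
pullup(const frag_t *f, char *dst, size_t n)
{
	while (n > 0 && f != NULL) {
		size_t take = f->len < n ? f->len : n;

		memcpy(dst, f->data, take);
		dst += take;
		n -= take;
		f = f->next;
	}
	return (n == 0 ? 0 : -1);
}
```

If the stack could guarantee, say, that headers never straddle a fragment boundary, checks and copies like this could often be skipped.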
-
nomad
jbk: delete and recreate (with a dladm set-linkprop before create-ip) is what I wound up doing.
-
nomad
ping -s 9000 is a happy camper now.
-
nomad
richlowe: "new server, who dis?"