-
sommerfeld
filed
illumos.org/issues/17895 on the LSO visibility thing.
-
fenix
→
BUG 17895: need better visibility into ip offload state (New)
-
jbk
that does remind me... i should probably put up my i40e changes for review at some point...
-
jbk
just i don't have a testing setup, so if someone wants changes during review, I have no way of really re-testing any of them
-
jbk
(it limits the size of the DMA buffers to avoid the VM having to do its best defrag impression with jumbo frames and vnics)
-
sommerfeld
jbk: I have a couple i40e-equipped machines so when you have something to test...
-
jbk
it's been running on I don't know how many of our customers now for probably 2 years, so I'm pretty confident it works fine.. the big thing is trying to recreate the pathological cases that can lead to a TX ring stopping if you get it wrong...
-
jbk
we had a bunch of VMs causing continuous massive network I/O...
-
jbk
running for 24+ hours (any problems usually showed up in 2-3 hours)
-
jbk
the Intel NICs have some rather annoying restrictions on LSO that make translating 'large' (> MTU) packets (segmented across many b_cont linked mblk_t chains) complicated
-
jbk
(and the restrictions have become more poorly documented/explained in the Intel documentation with each successive family of NIC :P)
-
Smithx10
yeah, I was suffering frozen rings
-
jbk
for reasons I don't understand, it turns out the smb code is pretty good at inducing pathological mblk_ts (for intel nics)
-
jbk
(either directly or indirectly through the tcp layer)
-
jbk
basically a chain of 'big, big, big, tiny, tiny, tiny, tiny, big, big' dblk_ts for a packet tends to do it
-
jbk
for ice (E810), I took a _slightly_ different approach that I'm hoping (once I'm able to torture test it) works better...
-
jbk
(not finished with it yet -- having to configure an entire network switch and applications, even if it's just 'yeah, don't do that stuff', is a lot of work :P)
-
jbk
(for those curious, the problem with the Intel NICs is you still have to do a quasi-segmentation of the packet in the driver -- they will do at most 8 DMA transfers to assemble each 'chunked' packet on the wire from the original larger packet.... except the headers (and only the headers) are done as 1 transfer (so a descriptor with header + data takes 2 transfers))
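A rough userspace model of that restriction may make it concrete (all names and the exact limits here are illustrative, not taken from the actual driver; the real constraints live in Intel's datasheets):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the restriction described above: the NIC
 * assembles each wire-sized chunk of a large LSO packet from at most
 * 8 DMA transfers, and the headers always count as one transfer of
 * their own.
 */
#define MAX_XFERS_PER_CHUNK 8

/* One entry per dblk_t/b_cont fragment of the packet's payload. */
typedef struct {
	size_t len;	/* bytes in this fragment */
} frag_t;

/*
 * Walk an MSS-sized window across the payload fragments and return
 * the worst-case number of DMA transfers any one wire chunk would
 * need (header transfer included).  A result > MAX_XFERS_PER_CHUNK
 * is the case the driver must handle, or the TX ring can wedge.
 */
size_t
worst_chunk_xfers(const frag_t *frags, size_t nfrags, size_t mss)
{
	size_t worst = 0;
	size_t i = 0, off = 0;	/* current fragment and offset into it */

	while (i < nfrags) {
		size_t left = mss, xfers = 1;	/* 1 = the header transfer */

		while (left > 0 && i < nfrags) {
			size_t avail = frags[i].len - off;
			size_t take = avail < left ? avail : left;

			xfers++;	/* each fragment touched = 1 transfer */
			left -= take;
			off += take;
			if (off == frags[i].len) {
				i++;
				off = 0;
			}
		}
		if (xfers > worst)
			worst = xfers;
	}
	return (worst);
}
```

A chain of many tiny dblk_ts, as in the smb case above, is exactly what pushes one chunk past the transfer limit.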
-
jbk
in the ice driver, as we're mapping/copying things, I add a bit of extra state (I wrap all of it in a struct since it makes it easier for debugging) so we can assume a 'happy path' most of the time, and if we exceed a limit for a portion, we can roll everything back for just that 'chunk' (not really sure what the terminology would be) and just copy that chunk (since that will always fit)
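A sketch of that 'happy path plus rollback' idea, in userspace (every name, constant, and the bounce-buffer fallback here are invented for illustration; the real ice code presumably differs):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	MAX_XFERS	8	/* hypothetical per-chunk DMA transfer limit */
#define	MSS		1460	/* payload bytes per wire chunk */

/* Invented per-chunk bookkeeping, in the spirit of the wrapper struct. */
typedef struct {
	size_t	nxfers;		/* DMA transfers committed (headers = 1) */
	int	copied;		/* fell back to the copy path */
} chunk_state_t;

/*
 * Try to bind one MSS-sized chunk's fragments directly (the happy
 * path).  If that would exceed MAX_XFERS, roll this chunk back and
 * copy its payload into one bounce buffer instead, which always fits.
 */
chunk_state_t
place_chunk(const size_t *frag_lens, size_t nfrags, char *bounce)
{
	chunk_state_t cs = { .nxfers = 1, .copied = 0 };  /* 1 = headers */
	size_t left = MSS;

	for (size_t i = 0; i < nfrags && left > 0; i++) {
		size_t take = frag_lens[i] < left ? frag_lens[i] : left;

		cs.nxfers++;
		left -= take;
		if (cs.nxfers > MAX_XFERS) {
			/* Roll back the per-fragment bindings... */
			memset(bounce, 0, MSS);	/* ...stand-in for the copy */
			cs.nxfers = 2;		/* headers + one copied buffer */
			cs.copied = 1;
			return (cs);
		}
	}
	return (cs);
}
```

The appeal of the scheme is that the rollback is scoped to a single chunk, so well-behaved packets never pay for the copy.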
-
jbk
do we leave the ACPI tables at their physical address, or do we do any remapping? (I'm wondering if I can stash a pointer to a table early in boot and it stays valid once the system is up or not..)
-
jbk
(when accessing in the kernel)
-
gitomat
[illumos-gate] 17605 !dohwcksum should also disable LSO -- Bill Sommerfeld <sommerfeld⊙ho>
-
nomad
jbk: I don't think this is the scenario you mean but I just built a fileserver for $homelab that's got two i40e SFP+ connections. It's a bare-iron OmniOS server that only feeds hypervisors their VDIs (via NFS). Happy to test if that would be in any way useful to you.
-
jbk
the big thing is if you use jumbo frames + vnics
-
nomad
oh heck, I think I forgot to configure jumbo frames on that server! I think I should be. Give me a few minutes to check.
-
jbk
the upstream driver will allocate 9kb physically contiguous buffers when you create a vnic over it
-
jbk
(or it can at least)
-
jbk
and that's where the problem can come in..
-
jbk
if the system has been up for a while, or is busy, the VM system will have to move pages around in memory (basically like DEFRAG.EXE but for ram :P)
-
nomad
Yep, should have configured jumbo frames on the NAS interface. Time to go read documentation to see how to do that in OmniOS.
-
jbk
which we've seen take upwards of 45 minutes(!).. all while dladm create-vnic is stuck unkillable in the kernel waiting for that to finish
-
jbk
the fix is to limit the size of the buffers allocated to <= 4k (page size) and instead split packets across multiple buffers (cards support this just fine)
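The arithmetic of that fix is simple; a tiny sketch (PAGE_SIZE and the function name are assumptions for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define	PAGE_SIZE	4096	/* x86 base page */

/*
 * Instead of demanding one physically contiguous 9 KB buffer per
 * jumbo frame (which can stall for ages behind VM page migration),
 * split each frame across page-sized buffers the allocator can
 * always hand out.  Returns how many buffers one frame needs.
 */
size_t
bufs_per_frame(size_t frame_len)
{
	return ((frame_len + PAGE_SIZE - 1) / PAGE_SIZE);
}
```

So a 9000-byte jumbo frame costs three page-sized buffers rather than one contiguous 9 KB allocation.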
-
jbk
(technically that isn't _guaranteed_ to work across all possible platforms, but in practice on x86 it does)
-
nomad
while I wait for my XOA VDI to migrate off that server (I'm just going to halt everything else hosted there) I'll look up how to configure jumbo frames... I don't use them at $job[1] so have never needed to do that before.
-
jbk
the thing with jumboframes is you _really, really_ want to be sure everything on the vlan is using them (and agrees on the size)
-
jbk
otherwise you will be pulling your hair out trying to figure out strange behavior
-
jbk
because sometimes things will work, other times things will act like they're down or something's broken on the network
-
jbk
and it will seem random (if a packet is too big, it just gets unceremoniously dropped w/o any sort of notification on the network)
-
jbk
the ethernet packet (frame)
-
nomad
yep, everything on the NAS VLAN is using jumbo frames (except for hvfs, which I'm running 'sudo dladm set-linkprop -p mtu=9000 i40e1' on now)
-
jbk
and since the size of an ethernet frame is largely non-deterministic
-
jbk
you can't easily predict when that's going to happen
-
nomad
well... will be running that command once the VDI is migrated and the remaining VMs are halted.
-
nomad
My old file server is so slow by comparison (14 year old server blah blah spinning rust blah blah blah) that I didn't notice the VMs weren't running optimally from the new file server (6 year old hypervisor + SSDs).
-
nomad
oh, right, link busy. Can't set linkprop. ugh. More readin'!
-
richlowe
I fear an irix-bloat like situation, just because everything is so tolerably fast these days
-
richlowe
when I'm jaded, that is
-
richlowe
I bet jbk's byte-at-a-time i/o isn't even _that_ rough, on a modern system
-
richlowe
you can make a lot of really bad decisions, and fully get away with them on modern hardware
-
richlowe
get away with them for ages
-
richlowe
things that keep me up at night dot text
-
nomad
oh... that's confusing. Now I'm seeing I should use ipadm set-ifprop. So which is it?
-
jbk
richlowe: if you mean for i40e, it's more I think just a question of time
-
jbk
intel made some decisions on how they support TSO that makes mapping our mblk_ts _really_ painful
-
nomad
looks like ipadm set-ifprop refuses to set mtu over 1500.
-
jbk
you might need to delete the IP interface, then do dladm, then recreate the ip interface
-
jbk
(I feel like we should be able to do better there, but that's how it is now)
-
richlowe
I mean in general
-
richlowe
but specifically nomad's system being so fast now, that you can just not notice major misconfiguration :)
-
jbk
richlowe: if we had a bit more rigor in the network stack surrounding how things could be split across multiple buffers, that might help a bit
-
jbk
there's more than a few checks that boil down to 'things can potentially be split at an arbitrary byte boundary'
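A minimal userspace illustration of why arbitrary-boundary splits force copies (the types and the `pullup` name are stand-ins, not the actual mblk_t API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-in for an mblk_t-style fragment chain. */
typedef struct frag {
	const char	*data;
	size_t		len;
	struct frag	*next;
} frag_t;

/*
 * Because the stack allows a packet to be split at any byte boundary,
 * a consumer that needs the first n bytes contiguous (e.g. to parse
 * headers) may have to copy them out of several fragments.  Returns
 * 0 on success, -1 if the chain is shorter than n.
 */
int
pullup(const frag_t *f, char *dst, size_t n)
{
	while (n > 0 && f != NULL) {
		size_t take = f->len < n ? f->len : n;

		memcpy(dst, f->data, take);
		dst += take;
		n -= take;
		f = f->next;
	}
	return (n == 0 ? 0 : -1);
}
```

If the stack could guarantee, say, that headers never straddle a fragment boundary, checks and copies like this could often be skipped.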
-
nomad
jbk: delete and recreate (with a dladm set-linkprop before create-ip) is what I wound up doing.
-
nomad
ping -s 9000 is a happy camper now.
-
nomad
richlowe: "new server, who dis?"