02:53:52 filed https://www.illumos.org/issues/17895 on the LSO visibility thing.
02:53:53 → BUG 17895: need better visibility into ip offload state (New)
03:46:38 that does remind me... i should probably put up my i40e changes for review at some point...
03:47:00 just i don't have a testing setup, so if someone wants changes during review, I have no way of really re-testing any of them
03:47:48 (it limits the size of the DMA buffers to avoid the VM having to do its best defrag impression with jumbo frames and vnics)
05:54:26 jbk: I have a couple i40e-equipped machines so when you have something to test...
13:45:53 it's been running on I don't know how many of our customers now for probably 2 years, so I'm pretty confident it works fine.. the big thing is trying to recreate the pathological cases that can lead to a TX ring stopping if you get it wrong...
13:46:59 we had a bunch of VMs causing continuous massive network I/O...
13:47:38 running for 24+ hours (any problems usually showed up in 2-3 hours)
13:49:10 the Intel NICs have some rather annoying restrictions on LSO that make translating 'large' (> MTU) packets (segmented across many b_cont-linked mblk_t chains) complicated
13:49:49 (and the restrictions have become more poorly documented/explained in the Intel documentation with each successive family of NIC :P)
13:50:07 yeah, I was suffering frozen rings
13:52:23 for reasons I don't understand, it turns out the smb code is pretty good at inducing pathological mblk_ts (for intel nics)
13:52:45 (either directly or indirectly through the tcp layer)
13:53:38 basically a chain of 'big, big, big, tiny, tiny, tiny, tiny, big, big' dblk_ts for a packet tends to do it
13:54:32 for ice (E810), I took a _slightly_ different approach that I'm hoping (once I'm able to torture test it) works better...
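A rough sketch of why the 'big, big, tiny, tiny, ...' chains above are pathological: model the dblk_t sizes in a b_cont chain as an array of fragment lengths, then count how many fragments each MSS-sized window of the packet touches. A run of tiny fragments inside one window means that one wire packet needs many separate DMA binds, which is what busts the hardware's per-packet budget. This is a hypothetical model for illustration, not the driver's actual code; the function name and the simplified overlap test are mine.

```c
#include <stddef.h>

/*
 * Hypothetical sketch (not the actual i40e code): given the fragment
 * lengths of an LSO packet's b_cont chain, return the largest number
 * of fragments that any single MSS-sized window overlaps.  A chain of
 * 'big, big, tiny, tiny, tiny, ...' dblk_ts concentrates many tiny
 * fragments into one window and drives this number up.
 */
size_t
max_frags_per_window(const size_t *frag_len, size_t nfrags, size_t mss)
{
	size_t total = 0, worst = 0, off, i;

	for (i = 0; i < nfrags; i++)
		total += frag_len[i];

	for (off = 0; off < total; off += mss) {
		size_t win_end = off + mss;
		size_t pos = 0, count = 0;

		for (i = 0; i < nfrags; i++) {
			size_t f_start = pos;
			size_t f_end = pos + frag_len[i];

			/* Does this fragment overlap [off, win_end)? */
			if (f_end > off && f_start < win_end)
				count++;
			pos = f_end;
		}
		if (count > worst)
			worst = count;
	}
	return (worst);
}
```

With an evenly split chain the worst window touches one or two fragments; sandwich a run of 50-byte fragments between two big ones and a single window can touch nine or more.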
13:55:17 (not finished with it yet -- having to configure an entire network switch and applications -- even if the answer is just 'yeah, don't do that stuff', it's a lot of work :P)
14:03:05 (for those curious, the problem with Intels is you still have to do a quasi-segmentation of the packet in the driver -- they will do at most 8 DMA transfers to assemble each 'chunked' packet on the wire from the original larger packet... except the headers (and only the headers) are done as 1 transfer (so a descriptor with header + data takes 2 transfers))
14:06:03 in the ice driver, as we're mapping/copying things, I add a bit of extra state (I wrap all of it in a struct since it makes it easier for debugging) so we can assume a 'happy path' most of the time, and if we exceed a limit for a portion, we can roll everything back for just that 'chunk' (not really sure what the terminology would be) and just copy that chunk (since that will always fit)
19:05:28 do we leave the ACPI tables at their physical address, or do we do any remapping? (I'm wondering if I can stash a pointer to a table early in boot and have it stay valid once the system is up or not..)
19:05:43 (when accessing in the kernel)
19:36:45 [illumos-gate] 17605 !dohwcksum should also disable LSO -- Bill Sommerfeld
22:43:48 jbk: I don't think this is the scenario you mean, but I just built a fileserver for $homelab that's got two i40e SFP+ connections. It's a bare-iron OmniOS server that only feeds hypervisors their VDIs (via NFS). Happy to test if that would be in any way useful to you.
22:44:16 the big thing is if you use jumbo frames + vnics
22:44:52 oh heck, I think I forgot to configure jumbo frames on that server! I think I should be. Give me a few minutes to check.
22:45:11 the upstream driver will allocate 9kb physically contiguous buffers when you create a vnic over it
22:45:21 (or it can at least)
22:45:30 and that's where the problem can come in..
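The transfer accounting and the rollback idea described above can be sketched roughly like this. Per the discussion, the hardware does at most 8 DMA transfers per wire packet, and the headers always count as one transfer of their own; if binding every data fragment separately would bust that budget, the fallback is to copy the chunk's data into one contiguous bounce buffer, which always costs exactly 2 transfers (headers + one data buffer). The names and the constant are mine for illustration; the real ice driver logic is considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed per-wire-packet DMA transfer budget, per the discussion. */
#define	MAX_XFERS_PER_PKT	8

/*
 * Transfers needed if we DMA-bind each of ndata data fragments
 * separately: 1 for the headers, plus 1 per fragment.
 */
static size_t
pkt_xfers(size_t ndata)
{
	return (1 + ndata);
}

/*
 * Hypothetical sketch of the rollback: try the 'happy path' (bind
 * everything); if that exceeds the budget, roll back this one chunk
 * and copy it into a single bounce buffer instead, which is always
 * headers + one data transfer = 2.
 */
size_t
chunk_xfers_with_fallback(size_t ndata, bool *copied)
{
	size_t n = pkt_xfers(ndata);

	if (n <= MAX_XFERS_PER_PKT) {
		*copied = false;
		return (n);
	}
	*copied = true;
	return (2);
}
```

So a chunk spread over 5 fragments binds cleanly (6 transfers), while one spread over 9 fragments would need 10 and falls back to the copy path.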
22:46:17 if the system has been up for a while, or is busy, the VM system will have to move pages around in memory (basically like DEFRAG.EXE but for ram :P)
22:46:46 Yep, should have configured jumbo frames on the NAS interface. Time to go read documentation to see how to do that in OmniOS.
22:46:50 which we've seen take upwards of 45 minutes(!).. all while dladm create-vnic is stuck unkillable in the kernel waiting for that to finish
22:48:31 the fix is to limit the size of the buffers allocated to <= 4k (page size) and instead split packets across multiple buffers (the cards support this just fine)
22:49:18 (technically that isn't _guaranteed_ to work across all possible platforms, but in practice on x86 it does)
22:49:55 while I wait for my XOA VDI to migrate off that server (I'm just going to halt everything else hosted there) I'll look up how to configure jumbo frames... I don't use them at $job[1] so have never needed to do that before.
22:50:24 the thing with jumbo frames is you _really, really_ want to be sure everything on the vlan is using them (and agrees on the size)
22:50:37 otherwise you will be pulling your hair out trying to figure out strange behavior
22:51:15 because sometimes things will work, other times things will act like they're down or something's broken on the network
22:51:42 and it will seem random (if a packet is too big, it just gets unceremoniously dropped w/o any sort of notification on the network)
22:51:52 the ethernet packet (frame)
22:52:20 yep, everything on the NAS VLAN is using jumbo frames (except for hvfs, which I'm running 'sudo dladm set-linkprop -p mtu=9000 i40e1' on now)
22:52:25 and since the size of an ethernet frame is largely non-deterministic
22:52:39 you can't easily predict when that's going to happen
22:52:49 well... will be running that command once the VDI is migrated and the remaining VMs are halted.
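The buffer-size fix described above (22:48:31) amounts to simple arithmetic: cap each buffer at one page and split the frame across however many buffers it needs, rather than demanding one large physically contiguous allocation that the VM system may have to defragment for. A minimal sketch, with the names and constant assumed for illustration:

```c
#include <stddef.h>

/* Cap each DMA buffer at <= one (x86) page, per the fix above. */
#define	BUF_MAX		4096

/*
 * Hypothetical sketch: number of page-sized buffers needed to hold a
 * frame of the given length once packets are split across buffers.
 */
size_t
nbufs_for_frame(size_t frame_len)
{
	return ((frame_len + BUF_MAX - 1) / BUF_MAX);
}
```

A standard 1514-byte frame still fits in one buffer; a 9000-byte jumbo frame takes three page-sized buffers instead of one 9kb contiguous allocation.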
22:57:27 My old file server is so slow by comparison (14-year-old server blah blah spinning rust blah blah blah) that I didn't notice the VMs weren't running optimally from the new file server (6-year-old hypervisor + SSDs).
22:58:32 oh, right, link busy. Can't set linkprop. ugh. More readin'!
22:59:00 I fear an irix-bloat-like situation, just because everything is so tolerably fast these days
22:59:06 when I'm jaded, that is
22:59:45 I bet jbk's byte-at-a-time i/o isn't even _that_ rough, on a modern system
23:00:07 you can make a lot of really bad decisions, and fully get away with them on modern hardware
23:00:14 get away with them for ages
23:00:26 things that keep me up at night dot text
23:00:43 oh... that's confusing. Now I'm seeing I should use ipadm set-ifprop. So which is it?
23:02:33 richlowe: if you mean for i40e, it's more I think just a question of time
23:03:18 intel made some decisions on how they support TSO that make mapping our mblk_ts _really_ painful
23:03:19 looks like ipadm set-ifprop refuses to set mtu over 1500.
23:03:44 you might need to delete the IP interface, then do dladm, then recreate the ip interface
23:03:58 (I feel like we should be able to do better there, but that's how it is now)
23:04:17 I mean in general
23:04:37 but specifically nomad's system being so fast now that you can just not notice major misconfiguration :)
23:04:45 richlowe: if we had a bit more rigor in the network stack surrounding how things could be split across multiple buffers, that might help a bit
23:06:08 there's more than a few checks that boil down to 'things can potentially be split at an arbitrary byte boundary'
23:08:36 jbk: delete and recreate (with a dladm set-linkprop before create-ip) is what I wound up doing.
23:12:01 ping -s 9000 is a happy camper now.
23:19:26 richlowe: "new server, who dis?"