01:43:41 it's probably something i need to dig into.. IIRC, vnics will take a HW ring (if available) from the underlying device, but not sure if it's just 1 tx and rx, or if it'll use multiple HW rings
01:56:55 jbk: It's more useful to think of it as: a vnic will take one group from mac today.
01:57:25 For i40e it'll only be one tx, as we use the pseudo-rings feature for tx, where each ring becomes a group.
02:00:08 Ultimately we'll set the number of qpairs per vsi, which turns into groups, basically based on the number of interrupts we got.
08:48:17 rmustacc: btw, what about support for mellanox/nvidia NICs, like CX6?
08:50:47 rmustacc: just checked the HCL, seems to be in, nice. So I may use a VF instead of a vmware vnic when moving this VM over to the HPC env
14:44:32 [illumos-gate] 16476 lmrc panic on MFI passthru ioctl that sets I/O direction flags but has no buffer -- Hans Rosenfeld
15:17:40 I've got an NMI-induced dump from a hung VM. Where do I go from here to find what was locked up, and then hopefully why it locked up?
15:35:46 Meths: mdb ::cpuinfo -v is one good starting point. Are the CPUs idle because no threads are runnable, or are they all busy with spinning threads that aren't making useful forward progress?
15:38:24 look at stack traces of suspect threads, build a dependency graph of who's waiting for what
15:44:28 Will do, thanks
17:01:20 ok.. this seems odd... i got a packet capture from both sides, and there's a pattern of one side sending a chunk of data, then the other side responding with a bunch of duplicate acks in a row, rinse, wash, repeat
17:07:30 like the source sends 40 duplicate ACKs in a row
17:15:34 i don't think this is the i40e duplicate packet bug (as i'd expect that to manifest on _every_ packet)
17:58:17 jbk: that sounds like expected behavior when a packet is dropped. the receiver sends dup acks when it gets out-of-order data (segments beyond the hole), as a hint to the sender to retransmit one packet to fill the hole.
18:00:32 dig further into the forward sequence number vs acked sequence numbers (and sack blocks if they're present).
18:01:42 one stat to look at on the receiver is whether any packets are being dropped for bad checksums or bad CRCs
18:02:33 (to be thorough, look at ethernet CRCs at each switch/router along the path)
18:03:24 dup acks could also be window updates ("i've read some data so you can send more")
18:07:58 i don't have access to anything but the end systems
18:10:41 if the advertised window is consistently small, you may be receiver-limited (if it's nibbling one "thing" at a time out of the dump stream and taking a while to process it).
18:18:36 jbk: first step with network weirdness: get in touch with someone who can see the network
18:18:44 (good luck)
18:24:07 jbk: another thing to try is to put something like "mbuffer" in the path (it was originally built to smooth out dump/restore to/from tape, but might be useful here)
18:29:24 If you have captures from both sides, you can see if packets are failing to traverse the switch. Also, you'll want to look at the i40e kstats to see if it's deciding to drop any packets due to full buffers and such (e.g. "rx_discards").
18:33:22 richlowe: I believe this is more of a systems-performance-weirdness thing, where there isn't yet clear evidence of whether or not to blame the network.
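On the vnic ring/group question from earlier: one quick way to see how mac has carved a NIC's hardware rings into groups, and which clients own them, is dladm's hardware-resource view. A sketch only; the link name i40e0 is a placeholder:

    dladm show-phys -H i40e0    # list ring groups, rings, and the mac clients using them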
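On the NMI dump: a minimal sketch of the triage flow suggested above, assuming the dump has been saved with savecore; the thread and mutex addresses below are placeholders:

    mdb unix.0 vmcore.0                  # open the saved dump, e.g. in /var/crash/<hostname>
    > ::cpuinfo -v                       # per-CPU state: idle, or stuck spinning?
    > ::stacks                           # deduplicated kernel stacks; look for clusters of blocked threads
    > fffffe0012345678::findstack -v     # full stack of one suspect thread (placeholder address)
    > fffffe0087654321::mutex            # who owns the lock it's blocked on (placeholder address)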
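On the dup-ack capture: a rough sketch for quantifying what both sides saw, using tshark's standard tcp.analysis display filters, plus the NIC-level counter mentioned above (the capture filename is a placeholder):

    # how many segments each side's capture flags as suspicious
    tshark -r capture.pcap -Y tcp.analysis.duplicate_ack  | wc -l
    tshark -r capture.pcap -Y tcp.analysis.retransmission | wc -l
    # NIC-level drops on the receiver; a nonzero, still-climbing counter points at full rx buffers
    kstat -p 'i40e:*:*:rx_discards'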
18:36:00 jbk: maybe also check i40e "tx_errors" on both sides
18:36:28 (side note: a nice addition to kstat would be a way to show just the ones that increment in an interval)
18:36:47 i see some i40e rings where at some point there haven't been available descriptors (on tx and rx)
18:36:53 but I can't tell how long ago that was
18:37:03 this connection has been running for several days
18:53:27 hrm.. this is interesting (and likely unfortunate).. when assembling a PDU, about 25% of the segments are arriving out of order
18:54:09 err no.. retransmitted (wireshark is a bit confusing here)
19:32:24 jbk: that kstat RFE would be wonderful, a *stat-type ticking view
19:32:32 that wasn't impossible to read
19:32:47 (similarly, basically all of *stat are impossible to read on a big enough machine)
19:33:25 (intrstat's smart line-wrapping actually makes me _less_ happy reading it)
19:46:55 sommerfeld: there's already a piece that buffers and then sends the data over a TLS connection...
19:47:09 i am going to ask to see if there's any known issue with packet loss across the connection
19:49:10 the window size is too small, though not sure how I can bump that on an existing connection; might have to make it re-connect and resume the transfer
19:49:25 but not sure if that'd explain the rest of the behavior i saw in that capture
19:49:31 or if there might also be some loss happening
19:49:58 i've not looked enough at fast retransmit and such to know if 19,000 acks for the same seq are 'normal' or not though...
20:09:19 well, maybe if the rx window is full and the receiver isn't reading anything because it's busy doing some zfs op that takes 10 minutes to commit...
20:17:53 jbk: bumping it on an existing connection would be tricky. Best way to do that would likely be to implement pr_setsockopt() (see pr_getsockopt() in libproc)
20:22:05 they just get the agent to do it, right?
20:22:06 nothing fancy
20:28:05 yeah
20:34:21 not as simple as poking a structure field with mdb -kw
21:10:53 /join #freebsd
21:11:07 ha!
21:25:59 Did I miss a perl flag day or something? I know it'd been a few months, but usr/src/cmd/perl/contrib/Sun/Solaris/BSM seems mad at me
21:27:02 which reminds me, would it be useful to collect the flag day emails from the last few years and update https://illumos.org/docs/developers/flagdays/?
21:27:53 check the version of perl and the one defined in your env file
21:29:24 it is probably Marcel's change to calculate all of that automatically
21:33:34 "did you make sure this repo was updated according to Building illumos"
21:33:41 sigh.
21:52:42 yes, updating flagdays would be useful.
22:27:05 yay, first nearly useless changeset: https://github.com/illumos/docs/pull/92
22:32:51 huh, I've never noticed https://illumos.org/rb/r/ before, looks inactive. Should the "Review Board" link on https://illumos.org/docs/community/conduct/ be updated to something more gerritty?
22:41:11 yeah, it was used before gerrit.
22:52:10 So I have a question: nightly runs `make clobber`, and then runs what amounts to a secret second, heavier clobber
22:53:18 Is this just more historical 'make clobber' being very broken (it's less broken now), or is there deeper meaning?
22:54:51 you're talking about the thing after the "Get back to a clean workspace" comment?
22:57:49 Yes
22:57:56 I filed #16482 and #16483 (fenix?)
22:57:57 BUG 16482: nightly should tell you when it's deleting your work (New)
22:57:57 ↳ https://www.illumos.org/issues/16482
22:58:04 oh, it can't do two
22:58:07 16483 (fenix?)
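On the kstat interval RFE above: until something like that exists, a crude approximation is to diff two -p snapshots. A sketch only; the interval and temp file names are arbitrary:

    # take two full snapshots ten seconds apart
    kstat -p > /tmp/ks.1; sleep 10; kstat -p > /tmp/ks.2
    # show only the statistics whose values changed during the interval
    diff /tmp/ks.1 /tmp/ks.2 | grep '^>'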
22:58:18 I give up, it says "don't delete stuff in source control, you ninny"
22:59:33 because while having a file that matches that pattern is very unlikely (it's been years and years!), when it does happen, it requires a _lot_ of work to find out why you're getting screwed
23:10:45 hmm, ran into this today: https://pastebin.com/bN2RqALX
23:11:49 I think it _might_ be this issue: https://www.illumos.org/issues/15024
23:11:50 → BUG 15024: NFS can exhaust pool threads getting RPCSEC_GSS credentials (In Progress) | https://code.illumos.org/c/illumos-gate/+/2402
23:12:09 the machine was hung, logins weren't working, and the panic finally happened after pressing the power button to invoke a shutdown
23:12:35 I can probably extract the coredump if need be
23:36:18 fenix: can you show us illumos 16483?
23:36:19 BUG 16483: nightly should refuse to delete source-controlled files (New)
23:36:19 ↳ https://www.illumos.org/issues/16483
23:56:18 richlowe: I think it's just a backstop for broken "make clobber". Especially since "make clobber" runs before "bringover" (if nightly does that); any fixes to clobber wouldn't run until the next build.
23:57:46 (but, in general, I think that ordering is correct -- if a subtree is deleted from the gate, you want to "make clobber" it before its makefiles go away)
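On bug 16483: a hypothetical sketch of the guard it asks for, in nightly's own shell idiom. Assumptions here: git as the SCM, nightly's $CODEMGR_WS workspace variable, and $f as the stray file the cleanup is about to remove:

    # before the post-clobber cleanup removes $f, refuse if it is under source control
    if git -C "$CODEMGR_WS" ls-files --error-unmatch "$f" >/dev/null 2>&1; then
        echo "nightly: refusing to delete source-controlled file: $f" >&2
    else
        rm -f "$f"
    fi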