07:36:15 <sjorge> "zrepl status" always errors out for me, on two hosts
07:36:31 <sjorge> while on the latest bloody bits but is fine on the previous ones
07:51:46 <andyf> That's strange, but at least it's a narrow window to look at!
08:01:11 <andyf> Since it's happening on OI too, it must be some change in gate, and I think richlowe narrowed it down to between June 9th and 30th
08:03:00 <andyf> Well, not "must", but likely..
08:12:26 <sjorge> just booted back to my previous be as it was getting annoying
08:14:13 <andyf> I am not seeing anything strange in vim. Are you seeing this on the console or is it an ssh session or something?
08:21:23 <sjorge> ssh session
08:21:27 <sjorge> vim was a mixed bag
08:21:34 <sjorge> zrepl status triggered it 100% of the time for me
08:21:45 <sjorge> (And I am debugging a sync issue so it was a problem)
08:25:02 <andyf> That looks fine for me too. iterm2 and mac terminal, various sizes
08:27:12 <sjorge> weird
08:27:18 <sjorge> Where is your primary console?
08:27:24 <andyf> serial
08:27:29 <sjorge> Mine is on serial, where zrepl/vim have always been funky for me
08:27:51 <sjorge> client side I tried iterm2 terminal.app and putty
08:29:01 <sjorge> So it must be something in combination with a configuration I have I guess
08:29:03 <sjorge> But what
08:29:16 <andyf> Well, you and yuripv (who saw it on OI too)
08:31:51 <sjorge> Once I have fixed the zrepl issue, I'll update again, assuming there will be another bloody build soonish with the openssh stuff fixed.
08:32:03 <sjorge> I'll only do one then so it's easier to compare A and B
08:32:12 <andyf> The updated openssh is already in bloody.
08:32:47 <andyf> But yes, somebody who is seeing this problem will have to bisect packages and try and find where it was introduced
08:37:21 <sjorge> Sadly neither of mine are easy to reboot, it takes a long time
08:38:11 <sjorge> https://gist.github.com/sjorge/defa35d58989ba40bf2c0dcf1680f432 does this help?
08:38:18 <sjorge> That's the panic zrepl throws from last night
08:38:21 <sjorge> I still had a terminal window open
08:48:32 <sjorge> Doesn't seem to be any changes between now and the working be that would come even close to terminal handling I think
08:49:36 <sjorge> 5.11-151047.0:20230609T142524Z
08:49:42 <sjorge> Is the exact ts from the working build
09:06:30 <sjorge> At least I found the zrepl not syncing problem now I can actually run zrepl status, my certs expired
09:06:48 <sjorge> I guess it's time to move that to wireguard + tcp transport instead as I had the same issue last year too
09:07:04 <sjorge> I'll switch back to the new be on one of the nodes after
09:19:22 <sjorge> re-updating one of my hosts now
09:28:41 <sjorge> Done and I can trigger it again with zrepl
09:28:48 <sjorge> Let me do a truss on both I guess
09:29:21 <sjorge> OMGFG
09:29:28 <sjorge> when I run it through truss it works
09:29:33 <sjorge> without truss it fails -_-
09:31:56 <sjorge> Aha it's not truss
09:32:20 <sjorge> I guess like yuripv I can't trigger it all the time I just got unlucky ;(
09:33:14 <sjorge> I did get a truss output now from when it fails
09:33:52 <sjorge> https://gist.github.com/sjorge/c26852eb753129fc35fae38747fa015f
09:35:24 <sjorge> It seems to panic shortly after a TCSETS (L1003)
09:36:18 <sjorge> Based on the lines above it started to draw its box layout and then calls that ioctl and it poofs out of existence
09:37:21 <sjorge> That would also explain why andyf could not reproduce it as it's indeed 'random', I must have just gotten unlucky last night to repeatedly hit it
15:06:12 <papertigers> sjorge: I'm doing zrepl over tailscale with the TLS certs. I think I set them to expire in a year, so I will likely be puzzled in a year's time when things break haha. I should add a cron job to alert me when they are close to expiring
15:08:16 <nahamu> Does zrepl support using certs that you provide? I wonder if you could use tailscale-provided certs...
15:10:20 <nahamu> I think "tailscale serve" might even auto-renew them for you...
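The cron-job idea papertigers mentions could look roughly like the sketch below. It is only an illustration, not something from the channel: the function names and the 30-day threshold are made up, and it assumes the expiry date is supplied as the string that `openssl x509 -enddate -noout` prints after `notAfter=`.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return whole days left before a certificate expires.

    `not_after` is a date string such as "Jan  5 09:34:43 2018 GMT",
    i.e. the part after "notAfter=" in `openssl x509 -enddate` output.
    `now` is seconds since the epoch; it defaults to the current time.
    A negative result means the certificate has already expired.
    """
    if now is None:
        now = time.time()
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - now) // SECONDS_PER_DAY)


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within `warn_days` days.

    The 30-day default is an arbitrary choice for this sketch.
    """
    return days_until_expiry(not_after, now) <= warn_days
```

A cron wrapper would call `needs_renewal()` with the certificate's end date and send a mail or similar alert when it returns true.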
15:19:56 <sjorge> papertigers I was like I can either rotate them or... push it over wireguard
15:20:09 <sjorge> I wanted to setup a tunnel between the tiny backup box at my mom and home anyway
15:20:26 <sjorge> Since it's encrypted over the internet via wg I just went to tcp transport
15:20:35 <sjorge> No point in double encryption tbh
15:20:56 <sjorge> Seems to work well enough, with nahamu's work on getting wireguard-go into omnios-extra
15:20:57 <nahamu> Is the wg connection using my bits, or is the tunnel handled by a non-illumos kernel?
15:21:05 <sjorge> See above :p
15:21:08 <nahamu> asked and answered
15:21:10 <nahamu> nice!
15:21:28 <sjorge> I wrote my own wg-quick manifest... then found the one from the omnios package
15:21:33 <sjorge> So i dropped mine
15:21:51 <nahamu> If your old one had any improvements over mine, please let me know.
15:21:57 <sjorge> It did not
15:22:39 <sjorge> Only downside is that the interfaces are named tunX tbh
15:22:51 <nahamu> yeah, that's a shortcoming of the tun driver.
15:23:08 <nahamu> but at least you can name your config file however you want.
15:23:16 <sjorge> Yeah I noticed that
15:23:30 <sjorge> I had it called tun0.conf first but that got messy so now I'm back to wgX.conf naming
15:23:35 <sjorge> And let wg-quick handle it
15:23:48 <nahamu> the tun driver works just well enough that no one has enough drive to write a better one.
15:24:03 <sjorge> Having it fold into ipadm some day would rock, but I guess that means we need a non-wireguard-go implementation first
15:24:09 <sjorge> Exactly :D
15:24:28 <sjorge> Eh it's a bit quirky but it works well enough -> setup and stop caring about the quirkiness
15:24:47 <nahamu> uh, my tailscale port does use ipadm. does my wg-quick script not??
15:24:59 <sjorge> It does for the ip stuff
15:25:21 <sjorge> but something like ipadm create-wg -l wg0  ... would be nice too
15:25:53 <sjorge> Well I guess you need a dladm create-wg + ipadm create-addr technically
15:26:00 <nahamu> Ah, yeah, we'd need to port wireguard into the kernel. Someone smart enough might be able to port the FreeBSD kernel port over...
15:26:59 <sjorge> We can probably get the wg dev to look over a port before integration if one ever gets made
15:28:00 <sjorge> How the freebsd one went was a mess, but he did end up reviewing and fixing a ton of bugs in the first attempt, ideally it would just be reviewing but the port for freebsd has a whole history about it -_-
15:28:25 <sjorge> It was the hallway talk at the EuroBSDcon I went to before covid
15:28:31 <nahamu> Yeah, I think the current version is fine, but yeah, the first version was a mess.
15:28:52 <nbjoerg> we still don't know what's wrong with the netbsd version :)
15:29:13 <nahamu> nbjoerg: is it based on the FreeBSD one or something else?
15:29:21 <nbjoerg> written from scratch
15:29:43 <nbjoerg> ("no, you are not supposed to do that, you are not smart enough!")
15:31:02 <rmustacc> Wrong as in doesn't work or wrong because it's not 'approved'
15:31:33 <nbjoerg> rmustacc: both
15:31:38 <nbjoerg> according to him
15:33:00 <sjorge> He does seem very opinionated on the crypto stuff used
15:33:38 <sjorge> But that's not necessarily bad (assuming he knows what he's doing and is not a total ass about it)
15:34:48 <jbk> the amusing part is if his choices ever prove to be wrong, the protocol is designed such that you can't practically do any sort of rolling upgrade
15:36:03 <jbk> there's really no way to specify a 'v2' set of mechanisms... basically upgrade everything everywhere at once or stand up entirely parallel infrastructure and move things over
15:36:32 <nbjoerg> yeah, that's my pet peeve with the design
15:36:47 <jbk> i also find the whole 'you must assign every client a static IP' thing annoying
15:36:53 <nbjoerg> that and the lack of a configuration protocol
15:37:28 <jbk> like that's fine for tiny deployments
15:37:39 <jbk> i don't see how that really works well in large deployments
15:38:11 <nbjoerg> it doesn't, that's why some vpn providers offer shrinkwrapped wireguard
15:38:22 <nbjoerg> kind of like the various commercial vpn tools on windows
15:38:40 <nbjoerg> ...so compared e.g. to openvpn, it can actually be a step back
15:40:13 <sjorge> Yeah I do hate that it needs static assignment
15:40:40 <sjorge> I like my pipes encrypted and setup, addressing should not be part of the pipe laying so to speak
15:41:01 <sjorge> But the performance is a lot better for me than openvpn (which I also use for mobile devices -> home connections)
15:41:49 <jbk> isn't openvpn basically vpn over http[s] ?
15:48:44 <sjorge> Yes
15:48:50 <sjorge> Well TLS
15:49:11 <sjorge> I run mine on port 443 for 'random wifi that is not mine' compat reasons
16:11:30 <papertigers> jbk: I think that's tailscale's value add. Management layer on top of wg that seems to work really well
16:30:55 <jbk> i had (back at joyent) started on the _very_ beginning bits for kernel support
16:31:18 <jbk> starting with chacha20-poly1305 support
16:31:59 <jbk> though the openbsd code i was using to try to add to our existing chacha20 support was headache inducing
16:33:17 <jbk> because of the rampant type punning happening everywhere
16:35:11 <jbk> which by extension made it rely on the implicit C struct layout and padding for correctness, which made me uncomfortable
16:35:14 <jbk> at least from my recollection
16:35:34 <gitomat> [illumos-gate] 15793 When preferred_dc is set, DNS outage can cause smb/server and idmap startup to timeout -- Matt Barden <mbarden⊙rc>
16:36:02 <jbk> in retrospect, it probably would have been easier to not even try to reuse the existing chacha20 code and just have a completely separate chacha20-poly1305 implementation
16:37:17 <jbk> (I remember being kind of shocked with the quality of the code given openbsd's vaunted reputation for security)
16:37:55 <jbk> but the thought was to add the needed mechanisms to the crypto api, then could implement the protocol in the kernel using those
16:39:50 <gitomat> [illumos-gate] 15794 Want file name in oplock break timeout messages -- Gordon Ross <gwr⊙rc>
16:41:50 <jbk> and to maybe extend pf_key
16:42:20 <jbk> since IIRC at least openbsd (and maybe freebsd) basically requires the kernel to hold all the info for every client in kernel mem
16:42:33 <jbk> which (again) seems bad for larger deployments
16:43:08 <jbk> and might allow for more flexibility on how you manage client keys
17:19:44 <richlowe> andyf: I didn't narrow it down, I thought I had, but it isn't happening to me in general.
17:19:53 <richlowe> I thought I narrowed it based on me being fine on the 9th and yuri broken on the 30th
17:19:56 <richlowe> but I'm also fine in july
17:20:06 <richlowe> or I'm mis-organizing the repro steps.
18:09:42 <sjorge> i had a 100% reproducible rate yesterday but like 1/25 ish now
18:09:55 <sjorge> i hope it's not timing based
18:10:16 <sjorge> i did capture it with truss
18:35:34 <yuripv> for me it's like `sudo vim ...` is broken, and then everything else starts to behave weirdly
18:43:36 <sjorge> did the openssh update and reboot, it's now close to a 100% again for zrepl status :/
19:27:23 <Agnar> moin, assuming I want to understand what goes wrong in illumos when I get: NOTICE: NVRM: GPU 0000:01:00.0: Failed to copy vbios to system memory. - how would I start debugging it?
19:41:47 <Agnar> ej tsoome!
20:12:36 <rmustacc> Agnar: So I can't find the string vbios in illumos.
20:12:56 <rmustacc> The nvrm makes me think this maybe is nvidia related. Is that the case?
20:17:26 <Agnar> rmustacc: yes, sorry I forgot to mention the important part. It's the /kernel/drv/amd64/nvidia driver version 470 that ships with OI. It should work, the card IS supported but it seems we have some issues in OI with it
20:18:01 <Agnar> rmustacc: the driver from freebsd works, so does Linux
20:18:46 <rmustacc> OK. I'm not really sure how to suggest debugging that. I guess try to see what APIs it's calling around when it emits that message.
20:20:26 <Agnar> rmustacc: the question is, can I somehow get more information about the driver with kmdb? ::modinfo doesn't seem to provide more information
20:20:51 <Agnar> rmustacc: I also think of trying Ghidra on the module...
20:22:21 <Agnar> It would be very helpful if we had some illumos folks at nvidia :/
20:32:23 <rmustacc> There are some folks who used to work on illumos there.
20:32:39 <rmustacc> Anyways, I would use kmdb to step through stuff until you find where the message is coming from or similar.
20:32:45 <rmustacc> I don't have great suggestions here.
20:32:54 <richlowe> I don't think the people we know at nvidia are at the part of nvidia that would be interesting to you.
20:33:11 <rmustacc> Well, actually, if you try to load that module again after boot can you get that message to appear again?
20:33:25 <rmustacc> If so, then you can start doing a bit of DTrace. Though is there even a symbol table?
20:33:44 <richlowe> nvidia used to provide "source" within the package they themselves provided, to the non-juicy bits.
20:34:00 <richlowe> back when it was an SVR4 package, anyway.
20:34:35 <richlowe> it's unlikely to tell you anything about NVRM, but might get you a little somewhere.
20:35:05 <Agnar> richlowe: I guess starting Xorg will trigger that again. good idea. 
20:35:49 <Agnar> richlowe: I'll play with that tomorrow, thanks so far!
20:38:03 <rmustacc> If you can get it to reproduce the message then you can use DTrace to get a stack trace for what it's worth.
20:42:58 <Agnar> rmustacc: yes! I'll get my dtrace book tomorrow and try to get a stack trace and maybe I get an idea why the copy fails :)
20:43:27 <rmustacc> I mean, if you can get that message to reproduce reliably after boot then you could do something like fbt::cmn_err:entry{ stack(); }
20:43:51 <Agnar> there are even fbt probes in the nvidia module
20:45:36 <nahamu> I was under the impression that fbt can automatically find function boundaries to trace at runtime rather than needing those probes to be configured at build time.
20:45:52 <rmustacc> That's correct. fbt is just looking at the symbol table basically.
20:46:01 <rmustacc> sdt is build time static probes.
20:46:12 <Agnar> ah!
20:47:47 <Agnar> well, it's late, I'll try that tomorrow. thank you all so far