07:36:15 "zrepl status" always errors out for me, on two hosts
07:36:31 while on the latest bloody bits but is fine on the previous ones
07:51:46 That's strange, but at least it's a narrow window to look at!
08:01:11 Since it's happening on OI too, it must be some change in gate, and I think richlowe narrowed it down to between june 9th and 30th
08:03:00 Well, not "must", but likely..
08:12:26 just booted back to my previous be as it was getting annoying
08:14:13 I am not seeing anything strange in vim. Are you seeing this on the console or is it an ssh session or something?
08:21:23 ssh session
08:21:27 vim was a mixed bag
08:21:34 zrepl status triggered it 100% of the time for me
08:21:45 (And I am debugging a sync issue so it was a problem)
08:25:02 That looks fine for me too. iterm2 and mac terminal, various sizes
08:27:12 weird
08:27:18 Where is your primary console?
08:27:24 serial
08:27:29 Mine is on serial, where zrepl/vim have always been funky for me
08:27:51 client side I tried iterm2, terminal.app and putty
08:29:01 So it must be something in combination with a configuration I have I guess
08:29:03 But what
08:29:16 Well, you and yuripv (who saw it on OI too)
08:31:51 Once I have fixed the zrepl issue, I'll update again, assuming there will be another bloody build soonish with the openssh stuff fixed.
08:32:03 I'll only do one then so it's easier to compare A and B
08:32:12 The updated openssh is already in bloody.
08:32:47 But yes, somebody who is seeing this problem will have to bisect packages and try and find where it was introduced
08:37:21 Sadly neither of mine are easy to reboot, it takes a long time
08:38:11 https://gist.github.com/sjorge/defa35d58989ba40bf2c0dcf1680f432 does this help?
08:38:18 That's the panic zrepl throws from last night
08:38:21 I still had a terminal window open
08:48:32 Doesn't seem to be any changes between now and the working be that would come even close to terminal handling I think
08:49:36 5.11-151047.0:20230609T142524Z
08:49:42 Is the exact ts from the working build
09:06:30 At least I found the zrepl not-syncing problem now that I can actually run zrepl status: my certs expired
09:06:48 I guess it's time to move that to wireguard + tcp transport instead as I had the same issue last year too
09:07:04 I'll switch back to the new be on one of the nodes after
09:19:22 re-updating one of my hosts now
09:28:41 Done and I can trigger it again with zrepl
09:28:48 Let me do a truss on both I guess
09:29:21 OMGFG
09:29:28 when I run it through truss it works
09:29:33 without truss it fails -_-
09:31:56 Aha it's not truss
09:32:20 I guess like yuripv I can't trigger it all the time, I just got unlucky ;(
09:33:14 I did get a truss output now from when it fails
09:33:52 https://gist.github.com/sjorge/c26852eb753129fc35fae38747fa015f
09:35:24 It seems to panic shortly after a TCSETS (L1003)
09:36:18 Based on the lines above it started to draw its box layout and then calls that ioctl and it poofs out of existence
09:37:21 That would also explain why andyf could not reproduce it as it's indeed 'random', I must have just gotten unlucky last night to repeatedly hit it
15:06:12 sjorge: I'm doing zrepl over tailscale with the tls certs. I think I set them to expire in a year so I will likely be puzzled in a year's time when things break haha. I should add a cron job to alert me when they are close to expiring
15:08:16 Does zrepl support using certs that you provide? I wonder if you could use tailscale-provided certs...
15:10:20 I think "tailscale serve" might even auto-renew them for you...
15:19:56 papertigers I was like I can either rotate them or...
push it over wireguard
15:20:09 I wanted to setup a tunnel between the tiny backup box at my mom and home anyway
15:20:26 Since it's encrypted over the internet via wg I just went to tcp transport
15:20:35 No point in double encryption tbh
15:20:56 Seems to work well enough, with nahamu's work on getting wireguard-go into omnios-extra
15:20:57 Is the wg connection using my bits, or is the tunnel handled by a non-illumos kernel?
15:21:05 See above :p
15:21:08 asked and answered
15:21:10 nice!
15:21:28 I wrote my own wg-quick manifest... then found the one from the omnios package
15:21:33 So i dropped mine
15:21:51 If your old one had any improvements over mine, please let me know.
15:21:57 It did not
15:22:39 Only downside is that the interfaces are named tunX tbh
15:22:51 yeah, that's a shortcoming of the tun driver.
15:23:08 but at least you can name your config file however you want.
15:23:16 Yeah I noticed that
15:23:30 I had it called tun0.conf first but that got messy so now I'm back to wgX.conf naming
15:23:35 And let wg-quick handle it
15:23:48 the tun driver works just well enough that no one has enough drive to write a better one.
15:24:03 Having it fold into ipadm some day would rock, but I guess that means we need a non-wireguard-go implementation first
15:24:09 Exactly :D
15:24:28 Eh it's a bit quirky but it works well enough -> setup and stop caring about the quirkiness
15:24:47 uh, my tailscale port does use ipadm. does my wg-quick script not??
15:24:59 It does for the ip stuff
15:25:21 but something like ipadm create-wg -l wg0 ... would be nice too
15:25:53 Well I guess you need a dladm create-wg + ipadm create-addr technically
15:26:00 Ah, yeah, we'd need to port wireguard into the kernel. Someone smart enough might be able to port the FreeBSD kernel port over...
15:26:59 We can probably get the wg dev to look over a port before integration if one ever gets made
15:28:00 How the freebsd one went was a mess, but he did end up reviewing and fixing a ton of bugs in the first attempt. Ideally it would just be reviewing, but the port for freebsd has a whole history about it -_-
15:28:25 It was the hallway talk at the EuroBSDcon I went to before covid
15:28:31 Yeah, I think the current version is fine, but yeah, the first version was a mess.
15:28:52 we still don't know what's wrong with the netbsd version :)
15:29:13 nbjoerg: is it based on the FreeBSD one or something else?
15:29:21 written from scratch
15:29:43 ("no, you are not supposed to do that, you are not smart enough!")
15:31:02 Wrong as in doesn't work or wrong because it's not 'approved'
15:31:33 rmustacc: both
15:31:38 according to him
15:33:00 He does seem very opinionated on the crypto stuff used
15:33:38 But that's not necessarily bad (assuming he knows what he's doing and is not a total ass about it)
15:34:48 the amusing part is if his choices ever prove to be wrong, the protocol is designed such that you can't practically do any sort of rolling upgrade
15:36:03 there's really no way to specify a 'v2' set of mechanisms... basically upgrade everything everywhere at once or stand up entirely parallel infrastructure and move things over
15:36:32 yeah, that's my pet peeve with the design
15:36:47 i also find the whole 'you must assign every client a static IP' thing annoying
15:36:53 that and the lack of a configuration protocol
15:37:28 like that's fine for tiny deployments
15:37:39 i don't see how that really works well in large deployments
15:38:11 it doesn't, that's why some vpn vendors provide shrinkwrapped wireguard
15:38:22 kind of like the various commercial vpn tools on windows
15:38:40 ...so compared e.g.
to openvpn, it can actually be a step back
15:40:13 Yeah I do hate that it needs static assignment
15:40:40 I like my pipes encrypted and setup, addressing should not be part of the pipe laying so to speak
15:41:01 But the performance is a lot better for me than openvpn (which I also use for mobile devices -> home connections)
15:41:49 isn't openvpn basically vpn over http[s] ?
15:48:44 Yes
15:48:50 Well TLS
15:49:11 I run mine on port 443 for 'random wifi that is not mine' compat reasons
16:11:30 jbk: I think that's tailscale's value add. Management layer on top of wg that seems to work really well
16:30:55 i had (back at joyent) started on the _very_ beginning bits for kernel support
16:31:18 starting with chacha20-poly1305 support
16:31:59 though the openbsd code i was using to try to add to our existing chacha20 support was headache inducing
16:33:17 because of the rampant type punning happening everywhere
16:35:11 which by extension made it rely on the implicit C struct layout and padding for correctness, which made me uncomfortable
16:35:14 at least from my recollection
16:35:34 [illumos-gate] 15793 When preferred_dc is set, DNS outage can cause smb/server and idmap startup to timeout -- Matt Barden
16:36:02 in retrospect, it probably would have been easier to not even try to reuse the existing chacha20 code and just have a completely separate chacha20-poly1305 implementation
16:37:17 (I remember being kind of shocked with the quality of the code given openbsd's vaunted reputation for security)
16:37:55 but the thought was to add the needed mechanisms to the crypto api, then could implement the protocol in the kernel using those
16:39:50 [illumos-gate] 15794 Want file name in oplock break timeout messages -- Gordon Ross
16:41:50 and to maybe extend pf_key
16:42:20 since IIRC at least openbsd (and maybe freebsd) basically requires the kernel to hold all the info for every client in kernel mem
16:42:33 which (again) seems bad for larger deployments
16:43:08 and
might allow for more flexibility on how you manage client keys
17:19:44 andyf: I didn't narrow it down, I thought I had, but it isn't happening to me in general.
17:19:53 I thought I narrowed it based on me being fine on the 9th and yuri broken on the 30th
17:19:56 but I'm also fine in july
17:20:06 or I'm mis-organizing the repro steps.
18:09:42 i had a 100% reproducible rate yesterday but like 1/25 ish now
18:09:55 i hope it's not timing based
18:10:16 i did capture it with truss
18:35:34 for me it's like `sudo vim ...` is broken, and then everything else starts to behave weirdly
18:43:36 did the openssh update and reboot, it's now close to 100% again for zrepl status :/
19:27:23 moin, assuming I want to understand what goes wrong in illumos when I get: NOTICE: NVRM: GPU 0000:01:00.0: Failed to copy vbios to system memory. - how would I start debugging it?
19:41:47 ej tsoome!
20:12:36 Agnar: So I can't find the string vbios in illumos.
20:12:56 The nvrm makes me think this maybe is nvidia related. Is that the case?
20:17:26 rmustacc: yes, sorry I forgot to mention the important part. It's the /kernel/drv/amd64/nvidia driver version 470 that ships with OI. It should work, the card IS supported but it seems we have some issues in OI with it
20:18:01 rmustacc: the driver from freebsd works, so does Linux
20:18:46 OK. I'm not really sure how to suggest debugging that. I guess try to see what APIs it's calling around when it emits that message.
20:20:26 rmustacc: the question is, can I somehow get more information about the driver with kmdb? ::modinfo does not seem to provide more information
20:20:51 rmustacc: I also think of trying Ghidra on the module...
20:22:21 It would be very helpful if we had some illumos folks at nvidia :/
20:32:23 There are some folks who used to work on illumos there.
20:32:39 Anyways, I would use kmdb to step through stuff until you find where the message is coming from or similar.
20:32:45 I don't have great suggestions here.
20:32:54 I don't think the people we know at nvidia are at the part of nvidia that would be interesting to you.
20:33:11 Well, actually, if you try to load that module again after boot can you get that message to appear again?
20:33:25 If so, then you can start doing a bit of DTrace. Though is there even a symbol table?
20:33:44 nvidia used to provide "source" within the package they themselves provided, to the non-juicy bits.
20:34:00 back when it was an SVR4 package, anyway.
20:34:35 it's unlikely to tell you anything about NVRM, but might get you a little somewhere.
20:35:05 richlowe: I guess starting Xorg will trigger that again. good idea.
20:35:49 richlowe: I'll play with that tomorrow, thanks so far!
20:38:03 If you can get it to reproduce the message then you can use DTrace to get a stack trace for what it's worth.
20:42:58 rmustacc: yes! I'll get my dtrace book tomorrow and try to get a stack trace and maybe I get an idea why the copy fails :)
20:43:27 I mean, if you can get that message to reproduce reliably after boot then you could do something like fbt::cmn_err:entry{ stack(); }
20:43:51 there are even fbt probes in the nvidia module
20:45:36 I was under the impression that fbt can automatically find function boundaries to trace at runtime rather than needing those probes to be configured at build time.
20:45:52 That's correct. fbt is just looking at the symbol table basically.
20:46:01 sdt is build time static probes.
20:46:12 ah!
20:47:47 well, it's late, I'll try that tomorrow. thank you all so far