14:23:43 ll
14:24:54 whoops, wrong window
14:32:22 But I do have a question: I'm seeing an NTP issue between a routed network (RAN_Admin) and the headnode. Wondering if I missed something
14:32:37 joysetup keeps failing
14:42:23 nodes can ping each other fine
14:42:33 netboot works fine
14:42:47 I disabled ipfilter on the headnode but NTP still refuses
14:43:48 compute node PIs 20230727T000733Z and 20230824T000638Z
15:04:26 nvmd
15:04:35 looks like the BIOS clock is off by 6 minutes
15:14:38 nope, not it
15:18:22 IIRC, joysetup creates a log file in /tmp
15:18:34 on the CN
15:18:59 that might be worth looking at if you haven't yet
15:19:25 (ISTR a few times having to add some extra debugging to it, but my memory is hazy now)
15:20:50 I'm tailing it; one sec, I'll send some output
15:21:04 10:18 < jbk> IIRC, joysetup creates a log file in /
15:21:06 whoops
15:21:20 [2023-08-29T15:19:49Z] ./joysetup.sh:244: check_ntp(): ntp_wait_seconds=90
15:21:21 [2023-08-29T15:19:49Z] ./joysetup.sh:246: check_ntp(): [[ 90 -gt 600 ]]
15:21:21 [2023-08-29T15:19:49Z] ./joysetup.sh:233: check_ntp(): true
15:21:21 [2023-08-29T15:19:49Z] ./joysetup.sh:234: check_ntp(): ntpdate -b 192.168.35.254
15:21:23 [2023-08-29T15:19:57Z] ./joysetup.sh:234: check_ntp(): output='29 Aug 15:19:57 ntpdate[4842]: no server suitable for synchronization found'
15:21:26 [2023-08-29T15:19:57Z] ./joysetup.sh:235: check_ntp(): retval=1
15:21:29 [2023-08-29T15:19:57Z] ./joysetup.sh:236: check_ntp(): (( retval == 0 ))
15:21:32 I ran a snoop
15:21:48 I'm thinking hwclock issue on the headnode.
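The xtrace above maps onto a retry loop along these lines. This is a hedged reconstruction, not the actual joysetup.sh source: the 90-second value, the 600-second cap, and the ntpdate/retval steps are read straight from the trace, while the loop structure, NTP_SERVER variable, and back-off are guesses:

```shell
#!/bin/bash
# Hedged sketch of joysetup.sh's check_ntp() as implied by the xtrace above.
# NTP_SERVER is a stand-in for the headnode address (192.168.35.254 here).
check_ntp() {
    local ntp_wait_seconds=0 output retval
    while true; do
        (( ntp_wait_seconds += 90 ))            # trace shows ntp_wait_seconds=90
        if [[ $ntp_wait_seconds -gt 600 ]]; then
            echo "check_ntp: giving up after ${ntp_wait_seconds}s" >&2
            return 1
        fi
        output=$(ntpdate -b "$NTP_SERVER" 2>&1) # line 234 in the trace
        retval=$?
        if (( retval == 0 )); then              # line 236 in the trace
            echo "check_ntp: synced: $output"
            return 0
        fi
        sleep 1                                 # assumed pause between attempts
    done
}
```

With a failing ntpdate, this loop exhausts its budget and returns 1, which matches the repeated failures seen in the log.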
Log says "TIME_ERROR" clock unsynchronized on the headnode
15:22:11 29 Aug 15:18:31 ntpd[31385]: Listening on routing socket on fd #54 for interface updates
15:22:15 29 Aug 15:18:31 ntpd[31385]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
15:22:18 29 Aug 15:18:31 ntpd[31385]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
15:22:21 29 Aug 15:18:33 ntpd[31385]: Soliciting pool server 142.147.88.111
15:22:23 29 Aug 15:18:33 ntpd[31385]: Soliciting pool server 217.180.209.214
15:22:26 29 Aug 15:18:34 ntpd[31385]: Soliciting pool server 45.55.58.103
15:22:37 What is the equivalent `hwclock` command on SmartOS?
15:22:40 like on Linux
15:23:06 seems like only "date" is available
15:24:55 so `ntpdate -b 0.pool.smartos.ntp.org` returns "no server suitable". digging into this a bit further.
15:35:51 29 Aug 15:23:40 ntpd[31385]: Soliciting pool server 162.159.200.123
15:35:52 29 Aug 15:25:34 ntpd[31385]: receive: Unexpected origin timestamp 0xe8988c6e.9602e113 does not match aorg 0000000000.00000000 from server@1121 xmt 0xe8988c6e.96501509
15:36:22 I wonder if this has anything to do with the `smartos.ntp.org` vanity name
15:38:03 https://www.suse.com/support/kb/doc/?id=000018710
15:38:33 29 Aug 15:38:16 ntpdate[32564]: ntpdate 4.2.8p15@1 Thu Dec 15 04:20:11 UTC 2022 (1)
15:38:55 Guess I'll update the headnode PI and see if it gets resolved by a newer version of ntpdate
15:40:50 jbk: thank you for helping
15:41:39 Wonder if I should report a SmartOS bug in PI:
15:41:40 offline 15:09:34 svc:/system/fm/smtp-notify:default
15:41:55 ignore that^ PI joyent_20221215T000744Z
15:51:07 hmmm... last time we did any serious updates of NTP was late 2021. I wonder if smartos.ntp.org has an issue?
15:51:19 That was my original thought
15:51:28 4.2.8p15@1
15:51:41 Yeah... updated in 2021.
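To the hwclock question above: as far as I know, illumos-based systems like SmartOS have no separate hwclock(8); date(1) reads and sets the kernel clock, and the kernel keeps the hardware time-of-day clock in step with it. A minimal sketch (the clock-changing lines are commented out so nothing here alters a live system):

```shell
# illumos/SmartOS has no hwclock(8). date(1) sets the kernel clock, and
# the kernel writes that time back to the hardware TOD clock, so there is
# no separate "sync the RTC" step as on Linux (an assumption to verify
# against your platform's docs).
date -u                        # show the current time (UTC)
# Set it by hand (format mmddHHMM, see date(1)) -- commented out:
#   date 08291523
# Or step it from NTP (disable the ntp service first so port 123 is free):
#   ntpdate -b 0.smartos.pool.ntp.org
```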
15:51:44 Hmmm
15:51:51 It's pretty close to 4.2.8p7.1, as mentioned in the SUSE bug report I referenced earlier
15:52:25 But of course, I updated the headnode image, rebooted, and it can't find the PI. So now to fix the USB key or determine if I set it up with piadm
15:54:05 Guess I'll have some manatee work to do too lol
15:57:34 do you happen to have the loader `boot /os/X` command handy so I can tell this thing to boot from the previous image?
15:57:42 0.smartos.pool.ntp.org, that's the correct host.
15:57:46 We still have a problem, however.
15:58:08 nevermind, I'll check another headnode
15:58:24 Yeah, I hadn't even noticed this until I tried setting up new compute nodes. What problem are you seeing?
16:01:09 29 Aug 15:59:30 ntpdate[1805]: no server suitable for synchronization found
16:01:21 but I see traffic. It's like our NTP doesn't like the answers for some bizarre reason.
16:01:49 reference clock is 00:00:00
16:02:11 NTP: ----- Network Time Protocol -----
16:02:12 NTP:
16:02:12 NTP: Leap = 0x3 (alarm condition (clock unsynchronized))
16:02:12 NTP: Version = 4
16:02:12 NTP: Mode = 3 (client)
16:02:14 NTP: Stratum = 0 (unspecified)
16:02:16 NTP: Poll = 3
16:02:19 NTP: Precision = -6 seconds
16:02:21 NTP: Synchronizing distance = 0x0001.0000 (1.000000)
16:02:24 NTP: Synchronizing dispersion = 0x0001.0000 (1.000000)
16:02:26 NTP: Reference clock =
16:02:29 NTP: Reference time = 0x00000000.00000000 ()
16:02:31 NTP: Originate time = 0x00000000.00000000 ()
16:02:34 NTP: Receive time = 0x00000000.00000000 ()
16:02:36 NTP: Transmit time = 0xe8988a86.754192e1 (2023-08-29 15:17:26.45803)
16:02:44 29 Aug 16:02:31 ntpd[69188]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
16:02:47 29 Aug 16:02:31 ntpd[69188]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
16:02:51 from /var/log/ntp.log
16:03:57 very strange. We had a bad NTP host in the ntp.org pool affect all of the AP-North and Southeast regions for Samsung back in December.
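A server or pool name like the one confirmed above can be sanity-checked without stepping the clock or stopping ntpd. This small wrapper is my addition; the flags are standard ntpdate ones, and the default pool name is the one from this session:

```shell
#!/bin/bash
# Query an NTP server without touching the clock: -q means query only,
# and -u sends from an unprivileged source port so it does not collide
# with a running ntpd that already owns UDP port 123.
ntp_check() {
    ntpdate -q -u "${1:-0.smartos.pool.ntp.org}"
}
# Example: ntp_check 192.168.35.254   # the headnode from this session
```

Because of -u, this works even while the ntp service is online, which makes it a safer first step than a bare `ntpdate -b`.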
16:04:16 On the non-Triton platform
16:05:29 One bad pool member wreaked havoc on everything, including SSL certs, causing "untrusted site" errors across all of the api-gateway hosts. It was pretty terrible. I found the source host from the folks over at Team Cymru
16:05:33 Well, let me read back to catch up...
16:06:05 Welcome to the party :)
16:06:17 So barfield --> Two things:
16:06:47 1.) bahamat reminded me to disable the NTP service (so I uttered `svcadm disable -t ntp`)
16:07:14 2.) I tried once after that and failed, but tried again and succeeded. I have packet captures for both the failure and the success.
16:07:15 Shit. I knew that. Then manually update
16:07:31 So when you re-enabled did it come back online?
16:08:32 worked like a champ. What a rookie mistake.
16:08:48 Considering I was the person who fixed this 9 months ago LMAO
16:09:17 What threw me off was another user I saw on the mailing list having NTP issues as well. Made me go down a rabbit hole that was more complex than needed
16:09:27 Thank you guys.
16:10:16 however, let me backtrack a bit: after re-enabling the ntp service, I'm still getting "clock unsynchronized" kernel messages in the logs.
16:10:35 Yeah, so IIRC, what's happening is that because NTP is over UDP, if you have the ntp service online then it's already bound and listening on port 123.
16:11:05 So then when ntpdate runs, it sends the request and fails to bind to port 123 (because ntpd is already there). Then ntpd gets all the replies and ntpdate thinks it got no response.
16:11:14 Yeah, I totally knew that. Clearly haven't had enough caffeine today.
16:11:26 So if you're doing it by hand you need to make sure to disable ntp first.
16:11:33 But are the kernel error logs normal? I can't say I have spent a lot of time monitoring them in the past
16:11:49 And the -b is important with ntpdate.
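The port-123 conflict described above can be avoided by bracketing the one-shot sync. The svcadm and ntpdate invocations match what was used in this session; wrapping them in a function is my addition:

```shell
#!/bin/bash
# One-shot manual clock step on SmartOS. ntpd owns UDP port 123 while the
# service is online, so ntpdate never sees the replies; take the service
# down first, step the clock, then bring it back.
ntp_step() {
    local server=$1
    svcadm disable -t ntp || return 1   # -t: temporary, not persisted across reboot
    ntpdate -b "$server"                # -b: step the clock no matter how far off
    local rv=$?
    svcadm enable ntp                   # resume normal ntpd operation
    return $rv
}
# Usage: ntp_step 0.smartos.pool.ntp.org
```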
That says "update the clock no matter how far off it is"
16:12:01 I ran `svcadm disable ntp`, then ran `ntpdate -b 0.smartos.pool.ntp.org`, then ran `svcadm enable ntp`
16:12:04 The default is that it only adjusts if it's within like 2m or 5m or something.
16:12:12 Aha
16:12:18 Thank you for clarifying
16:12:23 It all makes sense now.
16:12:31 Because it assumes that the clock is mostly right, and that a response that's too far off must be malicious.
16:12:33 You guys are the best.
16:12:59 And -b says to ignore all that and just take whatever you get.
16:12:59 Probably not a bad assumption. I thought that my IPS may be dropping the traffic as well, because we drop DDoS on port 123 consistently.
16:13:25 reflection attacks happen all day long against our networks. Now if I can just get my headnode to boot :)
16:13:32 Or in other words, "I as an operator know this clock is very wrong, so no matter how bad someone else's clock is, it's better than ours"
16:14:03 Again, sorry for the rabbit hole. I should've been able to fix this quickly. ...sigh...
16:14:28 Another indication that something was off was that ntp had no logs since December of 2022.
16:14:39 Now I see it logging consistently again.
16:15:07 danmcd: bahamat: jbk: thank you all for assisting with this jr admin problem lol
16:17:08 On another note, this is the first zone that I enabled rack-aware networking in. I have some hurdles to work through, so it played into the complexity as well. I found a way to get past the old Dell/Intel PXE boot issues we used to see when a node didn't have a USB key to load iPXE
16:17:35 I didn't want 1000 USB keys like JPC/SPC had.
16:17:54 If you guys are interested in how I finally got around it, I'll send you a write-up
16:17:59 Well, I always recommend booting the first time with a USB, then you can switch to pool booting.
16:18:17 That takes a while though if you're preflighting 100+ nodes
16:18:23 I never fully trust hardware PXE.
And if there are issues with it, we can't fix them.
16:18:30 Very true
16:18:33 Depends on your HW.
16:18:50 The answer to "my hardware PXE isn't working with Triton" is always "well, that's what the USB is for".
16:18:55 Does piadm support compute nodes yet?
16:18:58 Yes
16:19:03 My HN boots from disk. My 3 Supermicro CNs boot three different ways: BIOS PXE, EFI netboot, and iPXE-from-disk
16:19:11 barfield: piadm doesn't per se.
16:19:21 piadm can, for disk-booting-to-iPXE only.
16:19:29 Oh hell
16:19:31 That's genius
16:19:46 Hang on...
16:19:52 Yeah, it manages just the boot bits, but the PI still comes from Triton.
16:19:53 I've actually had some issues with these OCP blades booting from the onboard NVMes
16:20:12 Also why the USB exists :-D
16:20:19 So I'm still letting the NIC PXE boot first because of that. I leave a USB key in the headnode for emergencies.
16:20:23 See "TRITON COMPUTE NODES and iPXE" in the piadm(8) man page.
16:20:35 I read that a while back. I'll read it again.
16:20:52 Also, it doesn't need to be a USB. You can use SD/CF, etc. cards.
16:20:55 The old "compute node won't boot" support tickets because of a bad USB drive are what I was trying to avoid.
16:21:49 Yeah, the SPC datacenters had issues with heat management that caused an excessive number of USBs to fail.
16:21:50 Regardless, USB booting is still awesome if you ask me.
16:22:08 So did JPC not see the same issues as often?
16:22:19 I came in after JPC was EOL'ed
16:22:35 But I don't remember it happening as often in JAC
16:23:36 Yeah, JPC had like one, maybe two USBs fail in the time I was there, which was like 6 years.
16:23:58 I don't know how many there were before that, but my understanding is that it was not many.
16:25:24 We also had very high-end, low-capacity drives.
16:25:36 I think the drives we had were physically only 4G.
16:25:44 You can't even find those these days.
16:26:13 Yeah, I noticed 64-128GB drives at Best Buy the other day for like 10 bucks.
16:26:13 And higher capacity in the same space means you've got more electricity flowing through smaller circuits, and that increases heat.
16:26:52 so, last question
16:27:01 My HN cannot find the latest PI that I assigned
16:27:05 I am stuck in the loader menu
16:27:18 I use these: https://amzn.to/47VJwQP
16:27:21 I copied and pasted the vars from another headnode for bootfile, bootarchive, etc.
16:27:34 but it's still saying it can't find unix, etc., when I type boot
16:27:49 WTH am I doing wrong now
16:28:27 I use the tiny metal Kingston 16GB drives and I have never had one go bad.
16:28:38 And they're not even in stock anymore. I'm glad I bought a 5-pack when I could.
16:28:42 I'll find the model, but I do not know if you can still get them.
16:29:13 I think I bought like 50 a couple of years ago lol. I'll ship them to you if you want them. I intend to go USB-free all the way
16:30:07 copy/pasting from another node is not recommended...
16:30:33 There should be an option to boot the old PI, unless you broke it by hand.
16:30:54 I didn't break it by hand. But in EFI mode the loader menu doesn't come up
16:31:07 It just tries to boot, fails, and drops to the prompt.
16:31:24 I guess I'll switch it to BIOS, fix it, and switch back
16:34:28 Too bad this damn Intel VMD cannot just be removed from the firmware.
16:35:25 danmcd: do you know if there is any illumos/SmartOS work going on for Intel VMD support?
16:35:39 Not that I care about using it
16:36:48 That's probably a better question for #illumos
16:36:54 I figured
16:36:56 I've not heard of any, but that doesn't mean there isn't.
16:39:14 okay, well, I got the loader menu up in legacy mode and chose option 3 to roll back. Seems to be working. Thanks for everything.
17:39:55 well, back to square one.
17:40:18 ntp -b 192.168.35.254 (headnode IP in another rack) still fails on the new compute nodes
17:40:50 It's ntpdate -b
17:41:22 sorry
17:41:25 that is what I meant
17:41:27 ntpdate -b
17:41:50 so the next step is to snoop the packets.
17:41:57 got it
17:42:08 Are they sent? Is the headnode receiving them? Is it sending replies? Are the replies received?
17:42:12 yep
17:42:19 one second, let me check something
17:43:32 https://pastebin.com/eKAR5hWf
17:44:26 The reference time being zero seems wrong.
17:44:39 yeah, that is what I was referencing earlier in the chat
17:45:12 I suppose it could be a hardware clock issue on the headnode
17:45:13 So check that it matches what's being sent by the headnode.
17:45:24 Maybe the headnode's clock needs to be force-set.
17:46:23 compute node snoop
17:46:24 https://pastebin.com/skq0FV9b
17:46:30 The first was the headnode
17:53:17 Oh, the reference time 0 is the request.
17:53:32 > IP: Destination address = 192.168.35.254, 192.168.35.254
17:54:08 If that's the only thing you're seeing, then you're not getting responses.
18:00:27 oh hell
18:00:31 let me check that out
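The question above ("are the replies received?") can be answered quickly from a saved text capture taken on each side, e.g. `snoop -v -d net0 port 123 > ntp.cap` (the interface name net0 is an assumption; check with dladm). This little helper is my own convenience, not a standard tool; it just counts the decoded NTP modes, which snoop prints as in the paste earlier in this log:

```shell
#!/bin/bash
# Triage a saved verbose snoop capture of NTP traffic. In snoop -v output,
# "Mode = 3" lines are client requests and "Mode = 4" lines are server
# replies; a capture with requests but zero replies means the answers are
# being dropped somewhere (filtering, routing, or the far side).
ntp_triage() {
    local cap=$1 req rep
    req=$(grep -c 'Mode = 3' "$cap")
    rep=$(grep -c 'Mode = 4' "$cap")
    echo "requests=$req replies=$rep"
    if (( req > 0 && rep == 0 )); then
        echo "requests on the wire but no replies: check filtering/routing"
    fi
}
# Usage: ntp_triage ntp.cap
```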