-
barfield
ll
-
barfield
whoops wrong window
-
barfield
But I do have a question. I'm seeing an NTP issue between a routed network (RAN_Admin) and the headnode. Wondering if I missed something
-
barfield
joysetup keeps failing
-
barfield
nodes can ping each other fine
-
barfield
netboot works fine
-
barfield
I disabled ipfilter on the headnode but ntp refuses
-
barfield
compute node PI 20230727T000733Z and 20230824T000638Z
-
barfield
nvm
-
barfield
looks like bios clock off by 6 minutes
-
barfield
nope. Not it
-
jbk
IIRC, joysetup creates a log file in /tmp
-
jbk
on the CN
-
jbk
that might be worth looking at if you haven't yet
-
jbk
(ISTR a few times having to add some extra debugging to it, but my memory is hazy now)
-
barfield
I'm tailing it one sec I'll send some output
-
barfield
10:18 < jbk> IIRC, joysetup creates a log file in /
-
barfield
whoops
-
barfield
[2023-08-29T15:19:49Z] ./joysetup.sh:244: check_ntp(): ntp_wait_seconds=90
-
barfield
[2023-08-29T15:19:49Z] ./joysetup.sh:246: check_ntp(): [[ 90 -gt 600 ]]
-
barfield
[2023-08-29T15:19:49Z] ./joysetup.sh:233: check_ntp(): true
-
barfield
[2023-08-29T15:19:49Z] ./joysetup.sh:234: check_ntp(): ntpdate -b 192.168.35.254
-
barfield
[2023-08-29T15:19:57Z] ./joysetup.sh:234: check_ntp(): output='29 Aug 15:19:57 ntpdate[4842]: no server suitable for synchronization found'
-
barfield
[2023-08-29T15:19:57Z] ./joysetup.sh:235: check_ntp(): retval=1
-
barfield
[2023-08-29T15:19:57Z] ./joysetup.sh:236: check_ntp(): (( retval == 0 ))
-
barfield
I ran a snoop
-
barfield
I'm thinking hwclock issue on the headnode. Log says "TIME_ERROR" clock unsynchronized on the headnode
-
barfield
29 Aug 15:18:31 ntpd[31385]: Listening on routing socket on fd #54 for interface updates
-
barfield
29 Aug 15:18:31 ntpd[31385]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
-
barfield
29 Aug 15:18:31 ntpd[31385]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
-
barfield
29 Aug 15:18:33 ntpd[31385]: Soliciting pool server 142.147.88.111
-
barfield
29 Aug 15:18:33 ntpd[31385]: Soliciting pool server 217.180.209.214
-
barfield
29 Aug 15:18:34 ntpd[31385]: Soliciting pool server 45.55.58.103
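-
(editor's note)
The `0x41` in the `TIME_ERROR` lines above is a bitmask of kernel clock status flags. A minimal sketch of decoding it, assuming the conventional `STA_*` values from `<sys/timex.h>` (the table below is an assumption about the platform's header, not something shown in the chat). The message is commonly seen right after ntpd starts and before its first successful sync, so on its own it may not point at a hardware clock fault.

```python
# Conventional STA_* status bits from <sys/timex.h> (assumed to match
# the headnode's platform; verify against the local header).
STA_FLAGS = {
    0x0001: "STA_PLL",
    0x0002: "STA_PPSFREQ",
    0x0004: "STA_PPSTIME",
    0x0008: "STA_FLL",
    0x0010: "STA_INS",
    0x0020: "STA_DEL",
    0x0040: "STA_UNSYNC",
    0x0080: "STA_FREQHOLD",
}


def decode_status(status):
    """Return the names of the status bits set in `status`, lowest bit first."""
    return [name for bit, name in sorted(STA_FLAGS.items()) if status & bit]


print(decode_status(0x41))  # ['STA_PLL', 'STA_UNSYNC']
```

Read this way, `0x41` says the kernel PLL is enabled and the clock is currently marked unsynchronized.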
-
barfield
What is the equivalent `hwclock` command on SmartOS?
-
barfield
like on linux
-
barfield
seems like only "date" is available
-
barfield
so `ntpdate -b 0.pool.smartos.ntp.org` returns "no server suitable". digging into this a bit further.
-
barfield
29 Aug 15:23:40 ntpd[31385]: Soliciting pool server 162.159.200.123
-
barfield
29 Aug 15:25:34 ntpd[31385]: receive: Unexpected origin timestamp 0xe8988c6e.9602e113 does not match aorg 0000000000.00000000 from server@1121 xmt 0xe8988c6e.96501509
-
barfield
I wonder if this has anything to do with the `smartos.ntp.org` vanity name
-
barfield
-
barfield
29 Aug 15:38:16 ntpdate[32564]: ntpdate 4.2.8p15@1 Thu Dec 15 04:20:11 UTC 2022 (1)
-
barfield
Guess I'll update the headnode PI and see if it gets resolved by a newer version of ntpdate
-
barfield
jbk: thank you for helping
-
barfield
Wonder if I should report a SmartOS bug in PI:
-
barfield
offline 15:09:34 svc:/system/fm/smtp-notify:default
-
barfield
ignore that^ PI joyent_20221215T000744Z
-
danmcd
hmmm... last time we did any serious updates of NTP was late 2021. I wonder if smartos.ntp.org has an issue?
-
barfield
That was my original thought
-
barfield
4.2.8p15@1
-
danmcd
Yeah... updated in 2021.
-
danmcd
Hmmm
-
barfield
Is pretty close to 4.2.8p7.1 as mentioned in the suse bug report I referenced earlier
-
barfield
But of course, I updated the headnode image, rebooted, and it can't find the PI. So now to fix the USB key or determine if I set it up with piadm
-
barfield
Guess I'll have some manatee work to do too lol
-
barfield
do you happen to have the loader `boot /os/X` command handy so I can tell this thing to boot from the previous image?
-
danmcd
0.smartos.pool.ntp.org that's the correct host.
-
danmcd
We still have a problem, however.
-
barfield
nevermind I'll check another headnode
-
barfield
Yeah I hadn't even noticed this until I tried setting up new compute nodes. What problem are you seeing?
-
danmcd
29 Aug 15:59:30 ntpdate[1805]: no server suitable for synchronization found
-
danmcd
but I see traffic. It's like our NTP doesn't like the answers for some bizarre reason.
-
barfield
reference clock is 00:00:00
-
barfield
NTP: ----- Network Time Protocol -----
-
barfield
NTP:
-
barfield
NTP: Leap = 0x3 (alarm condition (clock unsynchronized))
-
barfield
NTP: Version = 4
-
barfield
NTP: Mode = 3 (client)
-
barfield
NTP: Stratum = 0 (unspecified)
-
barfield
NTP: Poll = 3
-
barfield
NTP: Precision = -6 seconds
-
barfield
NTP: Synchronizing distance = 0x0001.0000 (1.000000)
-
barfield
NTP: Synchronizing dispersion = 0x0001.0000 (1.000000)
-
barfield
NTP: Reference clock =
-
barfield
NTP: Reference time = 0x00000000.00000000 ()
-
barfield
NTP: Originate time = 0x00000000.00000000 ()
-
barfield
NTP: Receive time = 0x00000000.00000000 ()
-
barfield
NTP: Transmit time = 0xe8988a86.754192e1 (2023-08-29 15:17:26.45803)
-
barfield
29 Aug 16:02:31 ntpd[69188]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
-
barfield
29 Aug 16:02:31 ntpd[69188]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
-
barfield
from /var/log/ntp.log
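-
(editor's note)
The snoop decode above can be cross-checked by hand. A minimal sketch, assuming the standard NTPv4 header layout (2-bit leap, 3-bit version, 3-bit mode in the first byte; 64-bit timestamps counted from 1900-01-01 UTC, era 0). The function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# NTP timestamps count seconds from this epoch (era 0 assumed).
NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def split_flags(first_byte):
    """Return (leap, version, mode) from the first byte of an NTP header."""
    return (first_byte >> 6) & 0x3, (first_byte >> 3) & 0x7, first_byte & 0x7


def ntp_to_utc(seconds, fraction):
    """Convert a 32.32 fixed-point NTP timestamp to a UTC datetime."""
    micros = (fraction * 1_000_000) >> 32
    return NTP_EPOCH + timedelta(seconds=seconds, microseconds=micros)


# First byte rebuilt from the decoded fields: Leap = 3, Version = 4, Mode = 3.
first_byte = (3 << 6) | (4 << 3) | 3
print(split_flags(first_byte))

# Transmit time from the capture: 0xe8988a86.754192e1
print(ntp_to_utc(0xE8988A86, 0x754192E1).isoformat())
```

The transmit timestamp lands on 2023-08-29 15:17:26 UTC, which agrees with snoop. Mode 3 marks the packet as a client request, and in a request the reference, originate, and receive fields are normally zero.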
-
barfield
very strange. We had a bad NTP host in the ntp.org pool affect all of the AP-North and Southeast regions for Samsung back in December.
-
barfield
On the non-triton platform
-
barfield
1 bad pool member wreaked havoc on everything, including SSL certs, causing "untrusted site" errors across all of the api-gateway hosts. It was pretty terrible. I found the source host from the folks over at: teamcymru
-
bahamat
Well, let me read back to catch up...
-
barfield
Welcome to the party :)
-
danmcd
So barfield --> Two things:
-
danmcd
1.) bahamat reminded me to disable the NTP service (so I uttered `svcadm disable -t ntp`)
-
danmcd
2.) I tried once after that and failed, but tried again and succeeded. I have packet captures for both the failure and the success.
-
barfield
Shit. I knew that. Then manually update
-
barfield
So when you re-enabled did it come back online?
-
barfield
worked like a champ. What a rookie mistake.
-
barfield
Considering I was the person who fixed this 9 months ago LMAO
-
barfield
What threw me off was another user I saw on the mailing list having NTP issues as well. Made me go down a rabbit-hole that was more complex than needed
-
barfield
Thank you guys.
-
barfield
however, let me back track a bit, after re-enabling ntp service, I'm still getting "clock unsynchronized" kernel messages in the logs.
-
bahamat
Yeah, so IIRC, what's happening is that because ntp is over UDP, if you have the ntp service online then it's already bound listening on port 123.
-
bahamat
So then when ntpdate runs, it sends the request and fails to bind to port 123 (because ntpd is already there). Then ntpd gets all the replies and ntpdate thinks it got no response.
-
barfield
Yeah I totally knew that. Clearly haven't had enough caffeine today.
-
bahamat
So if you're doing it by hand you need to make sure to disable ntp first.
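-
(editor's note)
The bind conflict bahamat describes can be reproduced without NTP at all. A minimal sketch, assuming plain UDP sockets without `SO_REUSEADDR` or `SO_REUSEPORT`; it uses an ephemeral loopback port as a stand-in for port 123, which needs privileges to bind. It shows only the failed bind, not how ntpdate reacts to it.

```python
import socket


def second_bind_fails():
    """Bind one UDP socket, then try to bind another to the same address and port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        first.bind(("127.0.0.1", 0))      # plays the role of ntpd holding port 123
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))  # plays the role of ntpdate
        except OSError:
            return True                   # address already in use
        return False
    finally:
        second.close()
        first.close()


print(second_bind_fails())
```

The ntpdate man page also documents a `-u` option that sends from an unprivileged port, which may sidestep the conflict when ntpd has to stay online.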
-
barfield
But are the kernel error logs normal? I can't say I have spent a lot of time monitoring them in the past
-
bahamat
And the -b is important with ntpdate. That says "update the clock no matter how far off it is"
-
barfield
I ran `svcadm disable ntp` then ran `ntpdate -b 0.smartos.pool.ntp.org` then ran `svcadm enable ntp`
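-
(editor's note)
The disable, force-set, re-enable sequence can be wrapped so the service comes back even when the sync fails. A minimal sketch using only the commands quoted in the chat; `force_sync` and its `run` parameter are hypothetical names, and `run` exists so the sequence can be exercised without touching a real system. Note that `svcadm disable` can return before the service has actually stopped.

```python
import subprocess


def force_sync(server, run=subprocess.run):
    """Temporarily stop ntp, step the clock from `server`, then re-enable ntp."""
    run(["svcadm", "disable", "-t", "ntp"], check=True)
    try:
        run(["ntpdate", "-b", server], check=True)
    finally:
        # Re-enable even if ntpdate failed, so the node is not left without ntp.
        run(["svcadm", "enable", "ntp"], check=True)
```

On a headnode this would be called as `force_sync("0.smartos.pool.ntp.org")`, with enough privileges to manage services and set the clock.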
-
bahamat
The default is that it only adjusts if it's within like 2m or 5m or something.
-
barfield
Aha
-
barfield
Thank you for clarifying
-
barfield
It all makes sense now.
-
bahamat
Because it assumes that the clock is mostly right, and that a response that's too far off must be malicious.
-
barfield
You guys are the best.
-
bahamat
And -b says to ignore all that and just take whatever you get.
-
barfield
Probably not a bad assumption. I thought that my IPS may be dropping the traffic as well. Because we drop DDOS on port 123 consistently.
-
barfield
reflection attacks happen all day long against our networks. Now if I can just get my headnode to boot :)
-
bahamat
Or in other words "I as an operator know this clock is very wrong, so no matter how bad someone else's clock is, it's better than us"
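-
(editor's note)
A minimal sketch of the step-versus-slew choice behind `-b`, assuming the behaviour described in the upstream ntpdate man page: offsets above 0.5 s are stepped with `settimeofday()`, smaller ones are slewed with `adjtime()`, `-b` forces a step and `-B` forces a slew. The function and parameter names are illustrative. The larger sanity limit bahamat recalls resembles ntpd's panic threshold (1000 s by default, as far as I know), and exact limits may differ by tool and version.

```python
STEP_THRESHOLD_SECONDS = 0.5  # assumed from the ntpdate man page


def ntpdate_action(offset_seconds, force_step=False, force_slew=False):
    """Return 'step' or 'slew' for a measured clock offset in seconds."""
    if force_step:        # ntpdate -b
        return "step"
    if force_slew:        # ntpdate -B
        return "slew"
    if abs(offset_seconds) > STEP_THRESHOLD_SECONDS:
        return "step"
    return "slew"


print(ntpdate_action(360.0))                  # six minutes off, like the BIOS clock guess
print(ntpdate_action(0.1, force_step=True))   # -b on a nearly correct clock
```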
-
barfield
Again, sorry for the rabbit hole. I should've been able to fix this quickly. ...sigh...
-
barfield
Another indication that something was off was that ntp had no logs since December of 2022.
-
barfield
Now I see it logging consistently again.
-
barfield
danmcd: bahamat: jbk: thank you all for assisting with this jr admin problem lol
-
barfield
On another note, this is the first zone that I enabled rack-aware networking in. I have some hurdles to work through, so it played into the complexity as well. I found a way to get past the old dell/intel pxe boot issues we used to see, when a node didn't have a USB key to load ipxe
-
barfield
I didn't want 1000 USB keys like JPC/SPC had.
-
barfield
If you guys are interested in how I finally got around it I'll send you a write up
-
bahamat
Well, I always recommend booting the first time with a USB, then you can switch to pool booting.
-
barfield
That takes a while though if you're preflighting 100+ nodes
-
bahamat
I never fully trust hardware pxe. And if there are issues with it, we can't fix them.
-
barfield
Very true
-
danmcd
Depends on your HW.
-
bahamat
The answer to "my hardware pxe isn't working with Triton" is always "well that's what the USB is for".
-
barfield
Does piadm support compute nodes yet?
-
bahamat
Yes
-
danmcd
My HN boots from disk. My 3 supermicro CNs boot three different ways: BIOS PXE, EFI Netboot, and iPXE-from-disk
-
danmcd
barfield: piadm doesn't per se.
-
danmcd
piadm can, for disk-booting-to-iPXE only.
-
barfield
Oh hell
-
barfield
That's genius
-
danmcd
Hang on...
-
bahamat
Yeah, it manages just the boot bits, but the PI still comes from Triton.
-
barfield
I've actually had some issues with these OCP blades booting from the NVME's on board
-
bahamat
Also why the USB exists :-D
-
barfield
So I'm still letting the NIC PXE boot first because of that. I leave a USB key in the headnode for emergencies.
-
danmcd
See "TRITON COMPUTE NODES and iPXE" in the piadm(8) man page.
-
barfield
I read that a while back. I'll read it again.
-
bahamat
Also, it doesn't need to be a USB. You can use SD/CF, etc. cards.
-
barfield
The old "compute node won't boot" from support tickets because of a bad USB drive is what I was trying to avoid.
-
bahamat
Yeah, the SPC datacenters had issues with heat management that caused an excessive number of USBs to fail.
-
barfield
Regardless the USB booting is still awesome if you ask me.
-
barfield
So did JPC not see the same issues as often?
-
barfield
I came in after JPC was EOL'ed
-
barfield
But I don't remember it happening as often in JAC
-
bahamat
Yeah, JPC had like one, maybe two USBs fail in the time I was there, which was like 6 years.
-
bahamat
I don't know how many there were before that, but my understanding is that it was not many.
-
bahamat
We also had very high end and low capacity drives.
-
bahamat
I think the drives we had were physically only 4G.
-
bahamat
You can't even find those these days.
-
barfield
Yeah, I noticed 64-128GB drives at bestbuy the other day for like 10 bucks.
-
bahamat
And higher capacity in the same space means you've got more electricity flowing through smaller circuits, and that increases heat.
-
barfield
so last question
-
barfield
My HN cannot find the latest PI that I assigned
-
barfield
I am stuck in the loader menu
-
bahamat
-
barfield
I copied and pasted the var's from another headnode for bootfile bootarchive etc
-
barfield
but its still saying can't find unix, etc, when I type boot
-
barfield
WTH am I doing wrong now
-
barfield
I use the tiny metal kingston 16GB drives and I have never had one go bad.
-
bahamat
And they're not even in stock anymore. I'm glad I bought a 5 pack when I could.
-
barfield
I'll find the model but I do not know if you can still get them.
-
barfield
I think I bought like 50 a couple of years ago lol. I'll ship them to you if you want them. I intend to go usb-free all the way
-
bahamat
copy/pasting from another node is not recommended...
-
bahamat
There should be an option to boot the old PI, unless you broke it by hand.
-
barfield
I didn't break it by hand. But in EFI mode the loader menu doesn't come up
-
barfield
It just tries to boot, fails, and drops to prompt.
-
barfield
I guess I'll switch it to BIOS, fix it, and switch back
-
barfield
Too bad this damn intel VMD cannot just be removed from the firmware.
-
barfield
danmcd: do you know if there is any illumos/smartos work going on for Intel VMD support?
-
barfield
Not that I care about using it
-
bahamat
That's probably a better question for #illumos
-
barfield
I figured
-
bahamat
I've not heard of any, but that doesn't mean there isn't.
-
barfield
okay well I got the loader menu up in legacy mode and chose option 3 to rollback. Seems to be working. Thanks for everything.
-
barfield
well back to square 1.
-
barfield
ntp -b 192.168.35.254 (headnode ip in another rack) still fails on the new compute nodes
-
bahamat
It's ntpdate -b <address>
-
barfield
sorry
-
barfield
that is what I meant
-
barfield
ntpdate -b
-
bahamat
so the next step is to snoop the packets.
-
barfield
got it
-
bahamat
Are they sent? Is the headnode receiving them? Is it sending replies? Are the replies received?
-
barfield
yep
-
barfield
one second let me check something
-
barfield
-
bahamat
The reference time being zero seems wrong.
-
barfield
yeah, that is what I was referencing earlier in the chat
-
barfield
I suppose it could be a hardware clock issue on the headnode
-
bahamat
So check that it matches what's being sent by the headnode.
-
bahamat
Maybe the headnode's clock needs to be force set.
-
barfield
compute node snoop
-
barfield
-
barfield
The first was the headnode
-
bahamat
Oh, the reference time 0 is the request.
-
bahamat
> IP: Destination address = 192.168.35.254, 192.168.35.254
-
bahamat
If that's the only thing you're seeing, then you're not getting responses.
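-
(editor's note)
bahamat's four questions reduce to counting requests and replies in each capture. A minimal sketch, assuming each captured packet has already been reduced to source, destination, and NTP mode (3 = client, 4 = server). The `Packet` tuple and the compute node address `192.168.36.10` are hypothetical; only the headnode address comes from the chat.

```python
from collections import namedtuple

Packet = namedtuple("Packet", "src dst mode")

MODE_CLIENT = 3  # request
MODE_SERVER = 4  # reply


def diagnose(packets, server_ip):
    """Summarize whether a capture shows requests to and replies from server_ip."""
    requests = sum(1 for p in packets if p.dst == server_ip and p.mode == MODE_CLIENT)
    replies = sum(1 for p in packets if p.src == server_ip and p.mode == MODE_SERVER)
    if requests == 0:
        return "no requests seen"
    if replies == 0:
        return "requests seen, no replies"
    return "requests and replies seen"


headnode = "192.168.35.254"
compute_node = "192.168.36.10"  # hypothetical address for illustration

# A capture that contains only outbound requests, like the one quoted above.
capture = [Packet(compute_node, headnode, MODE_CLIENT)] * 4
print(diagnose(capture, headnode))  # requests seen, no replies
```

Running the same check on a capture from the headnode side shows whether the requests arrive there and whether replies leave, which narrows down where in the routed path the traffic is lost.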
-
barfield
oh hell
-
barfield
let me check that out