#smartos

14:23

barfield

ll
14:24

barfield

whoops wrong window
14:32

barfield

But I do have a question: I'm seeing an NTP issue between a routed network (RAN_Admin) and the headnode. Wondering if I missed something
14:32

barfield

joysetup keeps failing
14:42

barfield

nodes can ping each other fine
14:42

barfield

netboot works fine
14:42

barfield

I disabled ipfilter on the headnode but ntp refuses
14:43

barfield

compute node PI 20230727T000733Z and 20230824T000638Z
15:04

barfield

nvmd
15:04

barfield

looks like bios clock off by 6 minutes
15:14

barfield

nope. Not it
15:18

jbk

IIRC, joysetup creates a log file in /tmp
15:18

jbk

on the CN
15:18

jbk

that might be worth looking at if you haven't yet
15:19

jbk

(ISTR a few times having to add some extra debugging to it, but my memory is hazy now)
15:20

barfield

I'm tailing it one sec I'll send some output
15:21

barfield

10:18 < jbk> IIRC, joysetup creates a log file in /
15:21

barfield

whoops
15:21

barfield

[2023-08-29T15:19:49Z] ./joysetup.sh:244: check_ntp(): ntp_wait_seconds=90
15:21

barfield

[2023-08-29T15:19:49Z] ./joysetup.sh:246: check_ntp(): [[ 90 -gt 600 ]]
15:21

barfield

[2023-08-29T15:19:49Z] ./joysetup.sh:233: check_ntp(): true
15:21

barfield

[2023-08-29T15:19:49Z] ./joysetup.sh:234: check_ntp(): ntpdate -b 192.168.35.254
15:21

barfield

[2023-08-29T15:19:57Z] ./joysetup.sh:234: check_ntp(): output='29 Aug 15:19:57 ntpdate[4842]: no server suitable for synchronization found'
15:21

barfield

[2023-08-29T15:19:57Z] ./joysetup.sh:235: check_ntp(): retval=1
15:21

barfield

[2023-08-29T15:19:57Z] ./joysetup.sh:236: check_ntp(): (( retval == 0 ))
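(Editor's aside: the trace above looks like a retry loop that runs `ntpdate -b`, checks the return value, and gives up once a wait counter exceeds 600 seconds. Below is a minimal Python sketch of that pattern, reconstructed only from the trace and not from the joysetup.sh source. The function name `wait_for_ntp`, the 10-second step, and the injectable `sync_once` callable are all assumptions for illustration.)

```python
import time


def wait_for_ntp(sync_once, max_wait_seconds=600, step_seconds=10, sleep=time.sleep):
    """Retry sync_once() until it returns 0 or the wait budget is exhausted.

    sync_once is any callable returning an exit status (0 = success), for
    example a wrapper around `ntpdate -b <headnode>`. The step size is an
    assumption; only the 600-second ceiling appears in the trace.
    """
    waited = 0
    while True:
        if sync_once() == 0:
            return True
        waited += step_seconds
        if waited > max_wait_seconds:
            return False
        sleep(step_seconds)
```

If the real script behaves like this, a compute node whose NTP queries never get an answer would spin in this loop until the budget runs out, which would fit the "joysetup keeps failing" symptom.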
15:21

barfield

I ran a snoop
15:21

barfield

I'm thinking hwclock issue on the headnode. Log says "TIME_ERROR" clock unsynchronized on the headnode
15:22

barfield

29 Aug 15:18:31 ntpd[31385]: Listening on routing socket on fd #54 for interface updates
15:22

barfield

29 Aug 15:18:31 ntpd[31385]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
15:22

barfield

29 Aug 15:18:31 ntpd[31385]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
15:22

barfield

29 Aug 15:18:33 ntpd[31385]: Soliciting pool server 142.147.88.111
15:22

barfield

29 Aug 15:18:33 ntpd[31385]: Soliciting pool server 217.180.209.214
15:22

barfield

29 Aug 15:18:34 ntpd[31385]: Soliciting pool server 45.55.58.103
15:22

barfield

What is the equivalent `hwclock` command on smartOS?
15:22

barfield

like on linux
15:23

barfield

seems like only "date" is available
15:24

barfield

so `ntpdate -b 0.pool.smartos.ntp.org` returns "no server suitable". digging into this a bit further.
15:35

barfield

29 Aug 15:23:40 ntpd[31385]: Soliciting pool server 162.159.200.123
15:35

barfield

29 Aug 15:25:34 ntpd[31385]: receive: Unexpected origin timestamp 0xe8988c6e.9602e113 does not match aorg 0000000000.00000000 from server⊙1121 xmt 0xe8988c6e.96501509
15:36

barfield

I wonder if this has anything to do with the `smartos.ntp.org` vanity name
15:38

barfield

suse.com/support/kb/doc/?id=000018710
15:38

barfield

29 Aug 15:38:16 ntpdate[32564]: ntpdate 4.2.8p15⊙1 Thu Dec 15 04:20:11 UTC 2022 (1)
15:38

barfield

Guess I'll update the headnode PI and see if it gets resolved by a newer version of ntpdate
15:40

barfield

jbk: thank you for helping
15:41

barfield

Wonder if I should report a smartOS bug in PI:
15:41

barfield

offline 15:09:34 svc:/system/fm/smtp-notify:default
15:41

barfield

ignore that^ PI joyent_20221215T000744Z
15:51

danmcd

hmmm... last time we did any serious updates of NTP was late 2021. I wonder if smartos.ntp.org has an issue?
15:51

barfield

That was my original thought
15:51

barfield

4.2.8p15⊙1
15:51

danmcd

Yeah... updated in 2021.
15:51

danmcd

Hmmm
15:51

barfield

Is pretty close to 4.2.8p7.1 as mentioned in the suse bug report I referenced earlier
15:52

barfield

But of course, I updated the headnode image, rebooted, and it can't find the pi. So now to fix the USB key or determine if I set it up with piadm
15:54

barfield

Guess I'll have some manatee work to do too lol
15:57

barfield

do you happen to have the loader `boot /os/X` command handy so I can tell this thing to boot from the previous image?
15:57

danmcd

0.smartos.pool.ntp.org that's the correct host.
15:57

danmcd

We still have a problem, however.
15:58

barfield

nevermind I'll check another headnode
15:58

barfield

Yeah, I hadn't even noticed this until I tried setting up new compute nodes. What problem are you seeing?
16:01

danmcd

29 Aug 15:59:30 ntpdate[1805]: no server suitable for synchronization found
16:01

danmcd

but I see traffic. It's like our NTP doesn't like the answers for some bizarre reason.
16:01

barfield

reference clock is 00:00:00
16:02

barfield

NTP: ----- Network Time Protocol -----
16:02

barfield

NTP:
16:02

barfield

NTP: Leap = 0x3 (alarm condition (clock unsynchronized))
16:02

barfield

NTP: Version = 4
16:02

barfield

NTP: Mode = 3 (client)
16:02

barfield

NTP: Stratum = 0 (unspecified)
16:02

barfield

NTP: Poll = 3
16:02

barfield

NTP: Precision = -6 seconds
16:02

barfield

NTP: Synchronizing distance = 0x0001.0000 (1.000000)
16:02

barfield

NTP: Synchronizing dispersion = 0x0001.0000 (1.000000)
16:02

barfield

NTP: Reference clock =
16:02

barfield

NTP: Reference time = 0x00000000.00000000 ()
16:02

barfield

NTP: Originate time = 0x00000000.00000000 ()
16:02

barfield

NTP: Receive time = 0x00000000.00000000 ()
16:02

barfield

NTP: Transmit time = 0xe8988a86.754192e1 (2023-08-29 15:17:26.45803)
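(Editor's aside: the hex value on the Transmit time line is a 64-bit NTP timestamp, with seconds since 1900-01-01 UTC in the upper 32 bits and a binary fraction of a second in the lower 32. A minimal Python sketch of the conversion follows. The name `ntp_to_utc` is made up for illustration, and the sketch assumes NTP era 0, i.e. dates before 2036.)

```python
from datetime import datetime, timedelta, timezone

# NTP timestamps count from 1900-01-01 00:00:00 UTC (era 0 assumed here).
NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def ntp_to_utc(seconds, fraction):
    """Convert the two 32-bit halves of an NTP timestamp to a UTC datetime."""
    # The fraction is in units of 1/2**32 second; scale it to microseconds.
    micros = (fraction * 1_000_000) >> 32
    return NTP_EPOCH + timedelta(seconds=seconds, microseconds=micros)


# The transmit timestamp from the snoop output above.
print(ntp_to_utc(0xE8988A86, 0x754192E1))
```

Run against the value above, this should land on the same 2023-08-29 15:17:26 UTC that snoop printed. An all-zero field, like the reference, originate, and receive times in this packet, simply means "not set".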
16:02

barfield

29 Aug 16:02:31 ntpd[69188]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
16:02

barfield

29 Aug 16:02:31 ntpd[69188]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
16:02

barfield

from /var/log/ntp.log
16:03

barfield

very strange. We had a bad NTP host in the ntp.org pool affect all of the AP-North and Southeast regions for Samsung back in December.
16:04

barfield

On the non-triton platform
16:05

barfield

1 bad pool member wreaked havoc on everything, including SSL certs, causing "untrusted site" errors across all of the api-gateway hosts. It was pretty terrible. I found the source host from the folks over at: teamcymru
16:05

bahamat

Well, let me read back to catch up...
16:06

barfield

Welcome to the party :)
16:06

danmcd

So barfield --> Two things:
16:06

danmcd

1.) bahamat reminded me to disable the NTP service (so I uttered `svcadm disable -t ntp`)
16:07

danmcd

2.) I tried once after that and failed, but tried again and succeeded. I have packet captures for both the failure and the success.
16:07

barfield

Shit. I knew that. Then manually update
16:07

barfield

So when you re-enabled did it come back online?
16:08

barfield

worked like a champ. What a rookie mistake.
16:08

barfield

Considering I was the person who fixed this 9 months ago LMAO
16:09

barfield

What threw me off was another user I saw on the mailing list having NTP issues as well. Made me go down a rabbit hole that was more complex than needed
16:09

barfield

Thank you guys.
16:10

barfield

however, let me back track a bit, after re-enabling ntp service, I'm still getting "clock unsynchronized" kernel messages in the logs.
16:10

bahamat

Yeah, so IIRC, what's happening is that because ntp is over UDP, if you have the ntp service online then it's already bound listening on port 123.
16:11

bahamat

So then when ntpdate runs, it sends the request and fails to bind to port 123 (because ntpd is already there). Then ntpd gets all the replies and ntpdate thinks it got no response.
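(Editor's aside: the port conflict described here can be reproduced in isolation. The Python sketch below only shows that a second UDP socket cannot bind to an address and port that another socket already holds. It does not model what ntpdate actually does after the bind fails. It uses an ephemeral loopback port rather than 123 so it runs without privileges, and the function name `second_bind_fails` is made up for illustration.)

```python
import socket


def second_bind_fails(host="127.0.0.1"):
    """Return True if a second UDP socket cannot bind to an in-use port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Stand-in for ntpd: grab a port (ephemeral here, 123 in real life).
        first.bind((host, 0))
        port = first.getsockname()[1]
        try:
            # Stand-in for ntpdate: try to bind the very same port.
            second.bind((host, port))
        except OSError:
            return True  # typically EADDRINUSE
        return False
    finally:
        first.close()
        second.close()


print(second_bind_fails())
```

Replies addressed to that port are delivered to whichever socket holds it, which is consistent with the description above of ntpd receiving the answers while ntpdate sees nothing.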
16:11

barfield

Yeah I totally knew that. Clearly haven't had enough caffeine today.
16:11

bahamat

So if you're doing it by hand you need to make sure to disable ntp first.
16:11

barfield

But are the kernel error logs normal? I can't say I have spent a lot of time monitoring them in the past
16:11

bahamat

And the -b is important with ntpdate. That says "update the clock no matter how far off it is"
16:12

barfield

I ran `svcadm disable ntp` then ran `ntpdate -b 0.smartos.pool.ntp.org` then ran `svcadm enable ntp`
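(Editor's aside: the three-step sequence above, written out as a small Python sketch. The commands are exactly the ones quoted in the chat. The helper names `manual_sync_commands` and `run_manual_sync`, and the dry-run default, are assumptions added for illustration. Nothing is executed unless `dry_run=False` is passed on a host that actually has `svcadm` and `ntpdate`.)

```python
import subprocess


def manual_sync_commands(server="0.smartos.pool.ntp.org"):
    """Return the disable / force-set / re-enable commands in order."""
    return [
        ["svcadm", "disable", "ntp"],   # free port 123 for ntpdate
        ["ntpdate", "-b", server],      # step the clock regardless of offset
        ["svcadm", "enable", "ntp"],    # hand timekeeping back to ntpd
    ]


def run_manual_sync(server="0.smartos.pool.ntp.org", dry_run=True):
    """Print the commands (dry run) or execute them, always re-enabling ntp."""
    disable, sync, enable = manual_sync_commands(server)
    if dry_run:
        for cmd in (disable, sync, enable):
            print(" ".join(cmd))
        return 0
    subprocess.run(disable, check=True)
    try:
        return subprocess.run(sync).returncode
    finally:
        # Re-enable ntp even if ntpdate failed, so the service is not left off.
        subprocess.run(enable, check=True)


run_manual_sync()
```

Earlier in the log danmcd used `svcadm disable -t ntp`, the temporary form, which may be preferable if the host could reboot in the middle of the procedure.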
16:12

bahamat

The default is that it only adjusts if it's within like 2m or 5m or something.
16:12

barfield

Aha
16:12

barfield

Thank you for clarifying
16:12

barfield

It all makes sense now.
16:12

bahamat

Because it assumes that the clock is mostly right, and that a response that's too far off must be malicious.
16:12

barfield

You guys are the best.
16:12

bahamat

And -b says to ignore all that and just take whatever you get.
16:12

barfield

Probably not a bad assumption. I thought that my IPS may be dropping the traffic as well. Because we drop DDOS on port 123 consistently.
16:13

barfield

reflection attacks happen all day long against our networks. Now if I can just get my headnode to boot :)
16:13

bahamat

Or in other words "I as an operator know this clock is very wrong, so no matter how bad someone else's clock is, it's better than us"
16:14

barfield

Again, sorry for the rabbit hole. I should've been able to fix this quickly. ...sigh...
16:14

barfield

Another indication that something was off, was that ntp had no logs since december of 2022.
16:14

barfield

Now I see it logging consistently again.
16:15

barfield

danmcd: bahamat: jbk: thank you all for assisting with this jr admin problem lol
16:17

barfield

On another note, this is the first zone that I enabled rack-aware networking in. I have some hurdles to work through, so it played into the complexity as well. I found a way to get past the old dell/intel pxe boot issues we used to see when a node didn't have a USB key to load ipxe
16:17

barfield

I didn't want 1000 USB keys like JPC/SPC had.
16:17

barfield

If you guys are interested in how I finally got around it I'll send you a write up
16:17

bahamat

Well, I always recommend booting the first time with a USB, then you can switch to pool booting.
16:18

barfield

That takes a while though if you're preflighting 100+ nodes
16:18

bahamat

I never fully trust hardware pxe. And if there are issues with it, we can't fix them.
16:18

barfield

Very true
16:18

danmcd

Depends on your HW.
16:18

bahamat

The answer to "my hardware pxe isn't working with Triton" is always "well that's what the USB is for".
16:18

barfield

Does piadm support compute nodes yet?
16:18

bahamat

Yes
16:19

danmcd

My HN boots from disk. My 3 supermicro CNs boot three different ways: BIOS PXE, EFI Netboot, and iPXE-from-disk
16:19

danmcd

barfield: piadm doesn't per se.
16:19

danmcd

piadm can, for disk-booting-to-iPXE only.
16:19

barfield

Oh hell
16:19

barfield

That's genius
16:19

danmcd

Hang on...
16:19

bahamat

Yeah, it manages just the boot bits, but the PI still comes from Triton.
16:19

barfield

I've actually had some issues with these OCP blades booting from the NVME's on board
16:20

bahamat

Also why the USB exists :-D
16:20

barfield

So I'm still letting the NIC PXE boot first because of that. I leave a USB key in the headnode for emergencies.
16:20

danmcd

See "TRITON COMPUTE NODES and iPXE" in the piadm(8) man page.
16:20

barfield

I read that a while back. I'll read it again.
16:20

bahamat

Also, it doesn't need to be a USB. You can use SD/CF, etc. cards.
16:20

barfield

The old "compute node won't boot" from support tickets because of a bad usb drive is what I was trying to avoid.
16:21

bahamat

Yeah, the SPC datacenters had issues with heat management that caused an excessive number of USBs to fail.
16:21

barfield

Regardless the USB booting is still awesome if you ask me.
16:22

barfield

So did JPC not see the same issues as often?
16:22

barfield

I came in after JPC was EOL'ed
16:22

barfield

But I don't remember it happening as often in JAC
16:23

bahamat

Yeah, JPC had like one, maybe two USBs fail in the time I was there, which was like 6 years.
16:23

bahamat

I don't know how many there were before that, but my understanding is that it was not many.
16:25

bahamat

We also had very high end and low capacity drives.
16:25

bahamat

I think the drives we had were physically only 4G.
16:25

bahamat

You can't even find those these days.
16:26

barfield

Yeah, I noticed 64-128GB drives at bestbuy the other day for like 10 bucks.
16:26

bahamat

And the higher the capacity in the same space, means you've got more electricity flowing through smaller circuits and that increases heat.
16:26

barfield

so last question
16:27

barfield

My HN cannot find the latest PI that I assigned
16:27

barfield

I am stuck in the loader menu
16:27

bahamat

I use these: amzn.to/47VJwQP
16:27

barfield

I copied and pasted the var's from another headnode for bootfile bootarchive etc
16:27

barfield

but it's still saying can't find unix, etc, when I type boot
16:27

barfield

WTH am I doing wrong now
16:28

barfield

I use the tiny metal kingston 16GB drives and I have never had one go bad.
16:28

bahamat

And they're not even in stock anymore. I'm glad I bought a 5 pack when I could.
16:28

barfield

I'll find the model but I do not know if you can still get them.
16:29

barfield

I think I bought like 50. a couple of years ago lol. I'll ship them to you if you want them. I intend to go usb-free all the way
16:30

bahamat

copy/pasting from another node is not recommended...
16:30

bahamat

There should be an option to boot the old PI, unless you broke it by hand.
16:30

barfield

I didn't break it by hand. But in EFI mode the loader menu doesn't come up
16:31

barfield

It just tries to boot, fails, and drops to prompt.
16:31

barfield

I guess I'll switch it to BIOS, fix it, and switch back
16:34

barfield

Too bad this damn intel VMD cannot just be removed from the firmware.
16:35

barfield

danmcd: do you know if there is any illumos/smartos work going on for Intel VMD support?
16:35

barfield

Not that I care about using it
16:36

bahamat

That's probably a better question for #illumos
16:36

barfield

I figured
16:36

bahamat

I've not heard of any, but that doesn't mean there isn't.
16:39

barfield

okay well I got the loader menu up in legacy mode and chose option 3 to rollback. Seems to be working. Thanks for everything.
17:39

barfield

well back to square 1.
17:40

barfield

ntp -b 192.168.35.254 (headnode ip in another rack) still fails on the new compute nodes
17:40

bahamat

It's ntpdate -b <address>
17:41

barfield

sorry
17:41

barfield

that is what I meant
17:41

barfield

ntpdate -b
17:41

bahamat

so the next step is to snoop the packets.
17:41

barfield

got it
17:42

bahamat

Are they sent? Is the headnode receiving them? Is it sending replies? Are the replies received?
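(Editor's aside: alongside snoop, the "are replies received?" question can be checked from the compute node with a tiny client-mode probe. This is a hedged Python sketch, not a replacement for ntpdate. It sends one 48-byte NTPv4 client request from an unprivileged ephemeral source port and reports whether any datagram comes back. The function name `probe_ntp` and the 2-second timeout are assumptions for illustration.)

```python
import socket


def probe_ntp(host, port=123, timeout=2.0):
    """Send one NTPv4 client request; return the reply bytes or None."""
    # First byte 0x23 = leap 0, version 4, mode 3 (client); rest zeroed.
    request = bytes([0x23]) + bytes(47)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(request, (host, port))
        reply, _addr = sock.recvfrom(512)
        return reply
    except OSError:
        # Covers timeouts as well as errors such as "network unreachable".
        return None
    finally:
        sock.close()


# Example (not run here): probe_ntp("192.168.35.254") is None means no reply.
```

Because the probe does not use source port 123, it can run while ntpd is online. A reply to the probe combined with a failing `ntpdate -b` would point at something specific to port 123, such as a filter on the routed path, rather than at the headnode's ntpd.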
17:42

barfield

yep
17:42

barfield

one second let me check something
17:43

barfield

pastebin.com/eKAR5hWf
17:44

bahamat

The reference time being zero seems wrong.
17:44

barfield

yeah, that is what I was referencing earlier in the chat
17:45

barfield

I suppose it could be a hardware clock issue on the headnode
17:45

bahamat

So check that it matches what's being sent by the headnode.
17:45

bahamat

Maybe the headnode's clock needs to be force set.
17:46

barfield

compute node snoop
17:46

barfield

pastebin.com/skq0FV9b
17:46

barfield

The first was the headnode
17:53

bahamat

Oh, the reference time 0 is the request.
17:53

bahamat

> IP: Destination address = 192.168.35.254, 192.168.35.254
17:54

bahamat

If that's the only thing you're seeing, then you're not getting responses.
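(Editor's aside: one quick way to tell requests from replies in a capture is the mode field in the first byte of the NTP header, which packs leap indicator, version, and mode. Mode 3 is a client request and mode 4 is a server reply. The Python sketch below decodes that byte. The helper names are made up for illustration, and the sample byte 0xE3 is derived from the Leap = 0x3, Version = 4, Mode = 3 values in the earlier snoop output, not copied from the pastebin.)

```python
def ntp_header_fields(first_byte):
    """Split the first NTP header byte into (leap, version, mode)."""
    leap = (first_byte >> 6) & 0x3     # top 2 bits
    version = (first_byte >> 3) & 0x7  # next 3 bits
    mode = first_byte & 0x7            # low 3 bits
    return leap, version, mode


def is_server_reply(first_byte):
    """True when the packet is a mode 4 (server) response."""
    return ntp_header_fields(first_byte)[2] == 4


# Leap 3, version 4, mode 3 packs to 0xE3: a client request, not a reply.
print(ntp_header_fields(0xE3), is_server_reply(0xE3))
```

A capture that contains only mode 3 packets headed to 192.168.35.254 shows requests going out and nothing coming back, which matches the reading given above.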
18:00

barfield

oh hell
18:00

barfield

let me check that out
