-
john__
Triton gurus, I'm looking for help with a compute node that continuously claims to fail setup even though it is set up. The job has no errors, says setup was successful and that it is updating cnapi, but cnapi never gets updated. I cannot find any logs pertaining to a server setup failure. It even drops the /var/svc/setup_complete file upon completion.
-
john__
I'm at a loss
-
barfield
I'm not sure why it didn't update my nick
-
barfield
Those john__ messages are from barfield
-
barfield
The only thing that I do see occurring is that `ur` is running fine on the node, but on the headnode `sdc-server list` shows the status as "unknown" after a few minutes. I checked disk space. No issues. When I restart ur (`svcadm restart ur`), it pops back up as running in the `sdc-server list` output, but after about a minute it goes back to the unknown state.
-
barfield
It also only appears to install the amon-agent and show the agent package version. cn-agent, net-agent, basically *-agent* doesn't exist on the CN.
-
barfield
This is a CN in a different rack with rack-aware networking deployed as a site-note.
-
barfield
side*
-
barfield
bahamat: the ntp issue yesterday ended up being a restriction in /etc/inet/ntp.conf to only the original admin network subnet. The RAN subnets were not in the config file. I updated the config and it works now. I am assuming I need an ntp.conf in /usbkey/config.inc in order to make this permanent. I cannot remember if this is how we did it at Joyent though, for some reason
-
barfield
danmcd: Do you know if the latest Intel 100Gb NICs are working correctly in SmartOS as of recently? Do they happen to have the same memory issues as i40e in libdl?
-
barfield
another weird thing I am noticing on this separate compute_node issue is that `sdc-oneachnode -n <UUID> 'CMDS'` works just fine. It isn't giving the old "missing compute node" error like you would expect from a CN with an "unknown" status.
-
barfield
sorry to be a pest the last couple of days lol
-
bahamat
I don't remember exactly how we handled that for SPC, but I know there was an allowance for it.
-
barfield
yeah I was thinking the same thing. Maybe a cnapi setting somewhere. I'll dig into the source
-
barfield
weird looks like agent setup may be failing because of a lack of usbkey
-
bahamat
We handled it in chef :-(
-
barfield
Aha, I found the issue. cn_tools.tar.gz is "maybe" missing from the assets zone
-
bahamat
that's definitely not a normal situation.
-
barfield
it's definitely there
-
barfield
curl: (22) The requested URL returned error: 404 Not Found
-
barfield
But getting this
-
barfield
that* I mean. Looks like a DNS issue. like the "assets" server name is required in the URL
-
barfield
yeah super weird. can ping sapi, napi, etc, but not assets.
-
danmcd
barfield: We don't support ice (their 100G 800-series part).
-
barfield
danmcd: shit. So only chelsio?
-
jbk
what libdl memory issue?
-
jbk
there's an issue with the driver+jumbo frames+kernel memory fragmentation
-
barfield
There was an issue where dladm would use up a bunch of contiguous memory blocks on vnic creation and cause contention with bhyve VMs
-
barfield
At least that was my understanding
-
jbk
though hopefully I can get a fix for that upstreamed (we've been testing it, and have started rolling it out to customers)
-
barfield
Yeah I think that is it. I may not have described it perfectly
-
jbk
ahh.. that's actually the driver
-
jbk
dladm just triggers it :)
-
barfield
haha, that makes sense
-
barfield
I've hit the issue a lot
-
jbk
vnics create additional rings, and each entry wants a mtu-sized chunk of physically contiguous memory
-
jbk
yeah
-
barfield
but it's because our customer purchased nodes that really didn't have enough RAM to begin with. Like 128GB in the node, and they wanted to assign 112GB to the bhyve VM with a 40Gb network in the loop
-
jbk
our fix is we cap the size of the buffers used on the RX ring to 4k (PAGESIZE)
-
jbk
and let the HW split larger packets
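
A rough sketch of the arithmetic being described here, not the driver's actual code: with a jumbo MTU and 4k pages, an RX buffer sized to the MTU spans several physically contiguous pages, while a buffer capped at PAGESIZE needs only one page and the hardware spreads a large packet over several buffers. The function names, the 9000-byte MTU, and the ring size of 1024 are assumptions for illustration; header overhead and allocator rounding are ignored.

```python
import math

PAGESIZE = 4096  # 4k pages, as mentioned in the chat


def pages_per_buffer(buf_size, pagesize=PAGESIZE):
    """Physically contiguous pages needed to back one RX buffer."""
    return math.ceil(buf_size / pagesize)


def buffers_per_packet(packet_size, buf_size):
    """RX buffers consumed when one packet is split across buffers."""
    return math.ceil(packet_size / buf_size)


mtu = 9000        # assumed jumbo MTU
ring_size = 1024  # hypothetical number of RX entries per ring

# Buffer sized to the MTU: every ring entry needs a multi-page contiguous chunk.
uncapped = pages_per_buffer(mtu)

# Buffer capped at PAGESIZE: every ring entry is a single page.
capped = pages_per_buffer(min(mtu, PAGESIZE))

print(f"uncapped: {ring_size} entries x {uncapped} contiguous pages each")
print(f"capped:   {ring_size} entries x {capped} page each")
print(f"a {mtu}-byte packet then uses {buffers_per_packet(mtu, PAGESIZE)} buffers")
```

The total memory per ring is similar either way; what changes is that single pages can still be found when kernel memory is fragmented, whereas runs of contiguous pages may not be.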
-
barfield
I've seen it on FreeBSD with netmap interfaces too
-
jbk
which should help
-
jbk
of course if the system is tight on memory in general, it's not going to help
-
barfield
I think it was the 202306 PI that got us past the GZ crash hump here
-
jbk
but at least it shouldn't have to work extra hard for jumbo frames
-
barfield
What is the deal with the Intel 100Gb NICs? This concerns me because our supplier sourced 100 of them for some racks we're building out even though I asked multiple times for Chelsio NICs
-
barfield
Or would Mellanox be a better choice?
-
jbk
well i think we have drivers for chelsio or mellanox, but not intel
-
barfield
Oh, so it's a vendor issue, not a community issue. My ignorance just led me to believe that "100Gb Intel couldn't be that different from 40Gb" lol.
-
jbk
i haven't looked at the drivers to know what the difference is
-
barfield
Yeah me either
-
barfield
Was just checking to see the status is all before I get a bunch of compute_nodes that we can't use.
-
jbk
it's possible that the existing driver just needs a small update, or it could be different enough to need a whole new driver, dunno
-
barfield
Sorry if my humor doesn't come across as humorous in the chat :)
-
jbk
oh it's ok
-
jbk
i have a rather dark sense of humor, though i try to limit it to friends that expect it :)
-
barfield
LOL same here, had to create a "dark humor only" channel on slack for some select co-workers
-
rzezeski
one thing to keep in mind with chelsio is that there is still no support for mac groups, so no hardware classification for VNICs, not sure how much that matters
-
rzezeski
the intel ice driver would be significantly different from i40e IIRC
-
rzezeski
like, nothing shared
-
barfield
I'm not sure if mac groups affects us or not. Just running triton compute node workload in the traditional sense
-
jbk
annoyingly, i don't think intel has published any programming guide for it either
-
rzezeski
yea, if your zones aren't heavy on network I/O, then it shouldn't matter as much
-
jbk
so it's all going off of existing drivers
-
barfield
Another random question: Is anyone going to opencompute world in October?
-
rzezeski
jbk: Robert started a driver for it back in the Joyent days, very very bare bones, lots left to do, I imagine he had a PG at that time
-
rzezeski
one discussion we had was eschewing the common code this time around and just going our own way
-
danmcd
barfield: Mellanox (now nVidia) should get some improvements into mlxcx(4D) SOMETIME this year or early next.
-
» danmcd has had a challenging 2H2023...
-
barfield
update from my earlier compute node setup issue...looks like the admin network in the other rack which is: 10.24.20.0/22 was not allowed to access imgapi. I think that was the issue. But will know for certain after this node finishes booting. I found logs in /opt/smartcd/agents/log/install.log which pointed me to this
-
barfield
*/opt/smartdc/agents/log --- excuse the typo
-
bahamat
barfield: Yeah, the default fwrules for the admin account only allow the admin network prefix to tag smartdc_role. When setting up RAN you need to manually add the new subnet(s).
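
To illustrate why a CN on the new rack's admin subnet gets rejected until that subnet is added, here is a minimal sketch of the subnet membership check using Python's standard `ipaddress` module. It does not use Triton's firewall rule syntax or API. The `10.24.20.0/22` subnet is from the chat; the original admin subnet `10.99.99.0/24`, the CN address `10.24.21.15`, and the helper name `is_allowed` are hypothetical.

```python
import ipaddress


def is_allowed(source_ip, allowed_subnets):
    """Return True if source_ip falls inside any of the allowed subnets."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(s) for s in allowed_subnets)


admin_subnet = "10.99.99.0/24"  # hypothetical original admin network
ran_subnet = "10.24.20.0/22"    # admin network in the other rack (from the chat)
cn_ip = "10.24.21.15"           # hypothetical CN address in the RAN subnet

# Default rules: only the original admin prefix is allowed.
before = is_allowed(cn_ip, [admin_subnet])

# After manually adding the RAN subnet.
after = is_allowed(cn_ip, [admin_subnet, ran_subnet])

print(before, after)  # prints: False True
```

The same kind of check describes the ntp.conf restriction mentioned earlier in the log, where only the original admin subnet was listed.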
-
barfield
I think there is still more. It still failed.
-
barfield
WTf
-
barfield
lol
-
bahamat
Unfortunately RAN is a quasi-supported feature. The parts that work do work, but how to set it up isn't fully documented.
-
barfield
It's okay, I'm doing things like I always do anyway, the hard way
-
barfield
It's how I learn best anyway
-
bahamat
I don't even have a RAN setup to try to test any of this.
-
barfield
When I run the joysetup.sh script locally it always just ends at the last line, "exit 0". How do I run it locally like sdc-server would remotely, so that I can watch the -x output and try to find something failing?
-
bahamat
We (MNX) don't use it (yet), and we don't have any paying customers (hint, hint) that use it so filling in the gaps isn't the best use of our time, when considering other things we know we'd like to get done.
-
bahamat
The output goes to a log file. The best way is to follow that.
-
barfield
I'll write some docs when it is done
-
barfield
Yes, but it isn't leading me anywhere. I think that I am missing some environment variables set by sdc
-
barfield
I'll source the configs, maybe that's it...+ /lib/sdc or whatever the path is
-
barfield
yep that did it
-
barfield
yeah failed again. seen_states are zpool_created filesystems_setup and imgadm_setup...must be failing on tools
-
bahamat
barfield: I just created OS-8484.
-
barfield
Nice