17:49:28 triton gurus, looking for assistance with a compute node that, upon setup, continuously claims to have failed even though it is set up: the job has no errors, reports that setup completed successfully and that it is updating cnapi, yet cnapi never gets updated. I cannot find any logs pertaining to a server setup failure. It even drops the /var/svc/setup_complete file upon completion.
17:49:35 I'm at a loss
17:50:35 I'm not sure why it didn't update my nick
17:50:46 Those john__ messages are from barfield
17:52:34 The only thing I do see is that `ur` is running fine on the node, but on the headnode `sdc-server list` shows the status as "unknown" after a few minutes. I checked disk space; no issues. If I restart ur with `svcadm restart ur`, it pops back up as running in the sdc-server list output, but after about a minute it goes back to the unknown state.
17:57:14 It also only appears to install the amon-agent and show that agent's package version. cn-agent, net-agent, basically *-agent* doesn't exist on the CN.
17:58:20 As a side note, this is a CN in a different rack with rack-aware networking deployed.
18:00:02 bahamat: the ntp issue yesterday ended up being a restriction in /etc/inet/ntp.conf to only the original admin network subnet. The RAN subnets were not in the config file. I updated the config and it works now. I am assuming I need an ntp.conf in /usbkey/config.inc to make this permanent. I cannot remember if this is how we did it at Joyent, though, for some reason.
18:02:33 danmcd: Do you know if the latest Intel 100GB NICs are working correctly in SmartOS as of recently? Do they happen to have the same memory issues as i40e does with libdl?
18:06:10 another weird thing I am noticing on this separate compute node issue: sdc-oneachnode -n 'CMDS' works just fine. It isn't giving the old "missing compute node" error like you would expect from a CN with an "unknown" status.
18:06:39 sorry to be a pest the last couple of days lol
18:12:56 I don't remember exactly how we handled that for SPC, but I know there was an allowance for it.
18:13:36 yeah I was thinking the same thing. Maybe a cnapi setting somewhere. I'll dig into the source
18:17:52 weird, looks like agent setup may be failing because of a lack of usbkey
18:19:54 We handled it in chef :-(
18:20:22 Aha, I found the issue. cn_tools.tar.gz is "maybe" missing from the assets zone
18:20:46 that's definitely not a normal situation.
18:21:28 it's definitely there
18:21:33 [root@SCS_5MPHXQ1 (dfw003) ~]# /usr/bin/curl -sSf http://192.168.32.21/assets/extra/joysetup/cn_tools.tar.gz
18:21:36 curl: (22) The requested URL returned error: 404 Not Found
18:21:37 But getting that
18:24:55 Looks like a DNS issue, like the "assets" server name is required in the URL
18:27:37 yeah super weird. can ping sapi, napi, etc., but not assets.
18:27:39 barfield: We don't support ice (their 100G 800-series part).
18:29:37 danmcd: shit. So only chelsio?
18:30:02 what libdl memory issue?
18:30:33 there's an issue with the driver + jumbo frames + kernel memory fragmentation
18:30:43 There was an issue where dladm would use up a bunch of contiguous memory blocks on vnic creation and cause contention with bhyve VMs
18:30:47 At least that was my understanding
18:31:11 though hopefully I can get a fix for that upstreamed (we've been testing it, and have started rolling it out to customers)
18:31:17 Yeah I think that is it. I may not have described it perfectly
18:31:20 ahh.. that's actually the driver
18:31:26 dladm just triggers it :)
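(As a rough illustration of the memory pressure being described here, and not the driver fix itself: on a SmartOS/illumos global zone the stock dladm and mdb tools can show whether vnic creation with jumbo MTUs is eating into kernel memory. Everything below is generic inspection; nothing is specific to the patched driver.)

    # links, their drivers, and the MTU in use (with jumbo frames each RX
    # ring entry wants a larger physically contiguous buffer)
    dladm show-phys
    dladm show-linkprop -p mtu

    # vnics created on top of those links, e.g. for bhyve guests
    dladm show-vnic

    # kernel vs. free memory breakdown; re-run while creating vnics to see
    # whether the kernel share keeps growing
    echo ::memstat | mdb -k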
18:31:39 haha, that makes sense
18:31:50 I've hit the issue a lot
18:31:53 vnics create additional rings, and each entry wants an MTU-sized chunk of physically contiguous memory
18:31:56 yeah
18:32:31 but it's because our customer purchased nodes that really didn't have enough RAM to begin with. Like 128GB in the node, and they wanted to assign 112GB to the bhyve VM with a 40GB network in the loop
18:33:03 our fix is we cap the size of the buffers used on the RX ring to 4k (PAGESIZE)
18:33:11 and let the HW split larger packets
18:33:24 I've seen it on freebsd with netmap interfaces too
18:33:34 which should help
18:33:50 of course if the system is tight on memory in general, it's not going to help
18:34:02 I think it was the 202306 PI that got us past the GZ crash hump here
18:34:04 but at least it shouldn't have to work extra hard for jumbo frames
18:34:56 What is the deal with the Intel 100GB NICs? This concerns me because our supplier sourced 100 of them for some racks we're building out, even though I asked multiple times for Chelsio NICs
18:35:20 Or would Mellanox be a better choice?
18:35:55 well i think we have drivers for chelsio or mellanox, but not intel
18:36:32 Oh, so it's a vendor issue, not a community issue. My ignorance just led me to believe that "100GB Intel couldn't be that different from 40GB" lol
18:36:48 i haven't looked at the drivers to know what the difference is
18:36:56 Yeah, me either
18:37:25 Was just checking the status, is all, before I get a bunch of compute nodes that we can't use.
18:37:32 it's possible that the existing driver just needs a small amount of updating; it could be different enough to need a whole new driver, dunno
18:37:45 Sorry if my humor doesn't come across as humorous in the chat :)
18:37:57 oh it's ok
18:38:35 i have a rather dark sense of humor, though i try to limit it to friends that expect it :)
18:39:13 LOL same here, had to create a "dark humor only" channel on slack for some select co-workers
18:40:02 one thing to keep in mind with chelsio is that there is still no support for mac groups, so no hardware classification for VNICs; not sure how much that matters
18:40:19 the intel ice driver would be significantly different from i40e IIRC
18:40:25 like, nothing shared
18:41:11 I'm not sure if mac groups affect us or not. Just running triton compute node workloads in the traditional sense
18:41:50 annoyingly, i don't think intel has published any programming guide for it either
18:41:52 yea, if your zones aren't heavy on network I/O, then it shouldn't matter as much
18:42:01 so it's all going off of existing drivers
18:42:20 Another random question: is anyone going to OpenCompute world in October?
18:42:33 jbk: Robert started a driver for it back in the Joyent days, very very bare bones, lots left to do; I imagine he had a PG at that time
18:43:17 one discussion we had was eschewing the common code this time around and just going our own way
18:47:23 barfield: Mellanox (now nVidia) should get some improvements into mlxcx(4D) SOMETIME this year or early next.
18:47:52 * danmcd has had a challenging 2H2023...
20:38:20 update from my earlier compute node setup issue... looks like the admin network in the other rack, which is 10.24.20.0/22, was not allowed to access imgapi. I think that was the issue, but I will know for certain after this node finishes booting. I found logs in /opt/smartdc/agents/log/install.log which pointed me to this.
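(A quick way to confirm that kind of reachability problem from the compute node side, assuming the usual admin-network DNS names resolve there; /ping is the standard health-check path on the core SDC APIs, but treat the hostname below as a placeholder for this environment.)

    # follow the agent installer log mentioned above while setup runs
    tail -f /opt/smartdc/agents/log/install.log

    # from the CN, check that imgapi answers over the admin network;
    # substitute the real datacenter name and DNS domain, or the zone's admin IP
    /usr/bin/curl -sSf http://imgapi.<datacenter>.<dns_domain>/ping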
20:44:57 barfield: Yeah, the default fwrules for the admin account only allow the admin network prefix to reach VMs tagged smartdc_role. When setting up RAN you need to manually add the new subnet(s).
20:45:16 I think there is still more. It still failed.
20:45:17 WTF
20:45:18 lol
20:46:33 Unfortunately RAN is a quasi-supported feature. The parts that work do work, but how to set it up isn't fully documented.
20:46:52 It's okay, I'm doing things like I always do anyway: the hard way
20:47:15 It's how I learn best anyway
20:47:33 I don't even have a RAN setup to try to test any of this.
20:48:39 When I run the joysetup.sh script locally, it always just ends at the last line, "exit 0". How do I run it locally the way sdc-server runs it remotely, so that I can watch the -x output and try to find what's failing?
20:48:55 We (MNX) don't use it (yet), and we don't have any paying customers (hint, hint) that use it, so filling in the gaps isn't the best use of our time when considering other things we know we'd like to get done.
20:49:16 The output goes to a log file. The best way is to follow that.
20:49:20 I'll write some docs when it is done
20:49:45 Yes, but it isn't leading me anywhere. I think that I am missing some environment variables set by sdc
20:50:11 I'll source the configs, maybe that's it... + /lib/sdc or whatever the path is
20:51:00 yep that did it
20:52:25 yeah, failed again. seen_states are zpool_created, filesystems_setup, and imgadm_setup... must be failing on tools
20:58:56 barfield: I just created OS-8484.
20:58:56 https://smartos.org/bugview/OS-8484
21:06:55 Nice
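(Roughly what the local joysetup trace above amounts to: /lib/sdc/config.sh and its load_sdc_config helper are standard on SmartOS, but the script path is illustrative, since server setup normally fetches and runs joysetup.sh itself.)

    # pull in the SDC config helpers the setup job normally provides
    . /lib/sdc/config.sh
    load_sdc_config

    # re-run the setup script with xtrace and keep the output so the
    # failing step (after zpool_created/filesystems_setup/imgadm_setup)
    # is visible
    bash -x /var/tmp/joysetup.sh 2>&1 | tee /var/tmp/joysetup.xtrace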