00:37:58 Does anybody know what this means? `first of 2 errors: timed out waiting for /var/svc/provisioning to move for 416146bb-4d88-4834-b3f2-d5b15b4fa2cb`
01:11:28 toasterson: It's almost always an issue internal to the zone. That file marker gets moved as part of the mdata:execute service and something is preventing that from happening. Another potential cause is when you have way too many zfs datasets (often caused by heavy docker use, in which case you can run imgadm vacuum).
01:12:18 Or if you have something other than docker making tons of snapshots.
01:18:25 bahamat: that's trying to spawn base64⊙13 as a VM
01:18:44 As in, inside of bhyve?
01:18:56 Or a joyent brand zone?
01:19:58 joyent branded zone. It's on coal.
01:20:26 OK, well then it might be the performance of vmware
01:21:00 it's on libvirt
01:21:07 on a pretty beefy server
01:21:22 well, I'll try minimal-64-lts
01:23:09 But ultimately, /var/svc/provisioning needs to be moved to /var/svc/provision_success
01:23:43 It's not about "beefy". It's about whatever is causing latency.
01:24:54 a qcow2 on NVMe RAID1 via virtio should not have a problem
01:24:59 Here's where it happens: https://github.com/TritonDataCenter/illumos-joyent/blob/master/usr/src/cmd/svc/milestone/mdata-execute#L34-L36
01:25:26 Well, you're going to have to figure that part out.
01:25:45 But I can tell you, the *only* way that happens is if the zone can't move that file before the timeout.
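A rough sketch of how one might poke at a zone stuck like this from the global zone (the UUID is the one from the error above; the exact commands are illustrative, not an official procedure):

```
# Illustrative only: inspect the stuck zone from the global zone.
UUID=416146bb-4d88-4834-b3f2-d5b15b4fa2cb

# Any SMF services in maintenance inside the zone?
zlogin $UUID svcs -xv

# Where does the mdata:execute service log to?
zlogin $UUID svcs -L mdata:execute

# Has the provisioning marker been renamed yet?
zlogin $UUID ls /var/svc/provisioning /var/svc/provision_success /var/svc/provision_failure
```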
01:27:40 Can the IO slowness happen if memory is insufficient? i.e. coal is swapping?
01:28:20 Yes.
01:28:40 Anything that causes latency might be the culprit.
01:29:30 You might have better luck with standalone SmartOS so that you don't have the whole headnode's worth of services competing for CPU cycles.
01:30:05 True. I might just boot a SmartOS USB image instead and clean the disk
01:30:12 coal defaults to 8G of RAM, but if you've got enough physical memory it will do much better with more.
01:30:30 Well, that sounds like a simple tweak :(
01:30:33 :)
01:30:37 wrong smiley
01:57:10 ok, both enlarging RAM to 32GB and stopping the CPU-hungry Arm VM did not help. I'll try a clean SmartOS next.
03:01:12 well, we can rule out the headnode being too overwhelmed
03:01:22 on SmartOS I now get `first of 1 error: first of 1 error: vminfod watchForChanges "VM.js startZone (e23a358d-08b5-6603-92c6-ea75655674df)" timeout exceeded`
03:01:39 looks like we don't like something from virtio?
07:52:25 toasterson: out of interest, why are you running a zone image that was obsolete nearly 8 years ago?
12:58:57 jperkin: because that's what the guide tells me to run :)
12:59:22 is that the wrong image?
12:59:43 which guide?
13:02:13 https://github.com/TritonDataCenter/mibe
13:02:49 ok, so nothing current, just wanted to check
13:03:08 what is the current guide to build images then?
13:04:05 we don't have one, as we don't ship per-product images any longer; they just aren't worth the time spent on them when you can effectively get the same thing from a base image + pkgin
13:07:06 fair. but what is the name of the base image in that case? base-64?
13:08:07 base-64-lts for the yearly LTS releases, or base-64-trunk for the latest
13:11:48 thanks
13:21:12 and I again get `first of 2 errors: timed out waiting for /var/svc/provisioning to move for 0ec7acbd-36ed-cc97-b2e7-b1810e72c0b3`
13:21:19 so it's not the image either :)
13:29:41 there should be a log inside the image with the output from sm-prepare-image or whatever it is
13:31:58 toasterson: if you misconfigured the network this might also happen
13:37:03 jperkin: grepping for sm-prepare-image in /var/log yields nothing. Is there a troubleshooting guide with the log locations, or somewhere to start debugging?
13:37:41 wiedi_: I hope... (full message at )
13:38:04 no, it's just a case of reading the scripts and figuring it out; these aren't really supported parts of the stack
13:43:57 toasterson: well, if your dhcp server doesn't provide you with an IP on the admin nictag, that might happen
13:49:14 hmmmmm, SmartOS does not have dhcp by default?
14:15:39 a server? no
15:01:50 https://twitter.com/an0n_r0/status/1589405818885398528 <- Samba flaw
15:22:28 toasterson: there's no default networking. But if you specify dhcp, it will do it.
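For reference, a minimal vmadm payload asking for DHCP on the admin nictag might look like this sketch (the image_uuid and sizing are placeholders; the image has to be imported with imgadm first):

```
# Illustrative payload; image_uuid and sizing are placeholders.
cat > build-zone.json <<'EOF'
{
  "brand": "joyent",
  "image_uuid": "<uuid of an imported base-64-lts image>",
  "alias": "build-zone",
  "max_physical_memory": 1024,
  "nics": [
    { "nic_tag": "admin", "ip": "dhcp", "primary": true }
  ]
}
EOF
vmadm create -f build-zone.json
```

Without a reachable DHCP server on that nictag, the zone sits in provisioning until the timeout above fires.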
16:35:40 it was the dhcp option :)
16:52:28 wiedi_: your tool is not usable on the same SmartOS host you want to build on, I guess?
19:34:45 Well, dhcp absolutely works.
19:36:46 Well, not for me :) as I had no dhcp on that network
19:40:16 Well, if you told it to use dhcp but you don't have a server to give it an address, it's not going to make one up on its own.
19:44:39 True that. Unfortunately there is no dhcp embedded in my coffee when I do these things late at night.
21:21:23 Ordered some CNs with Mellanox 100Gb NICs, hope all goes well lol
21:53:05 Cool re: Mellanox.
21:57:15 toasterson: you could do that, but it might be a bit more complicated to install the dependencies. The idea is to use it on your local laptop or so, and it will connect to the SmartOS box you want to build on via ssh
21:58:24 yeah, but my SmartOS box is behind a jump server :)
22:00:03 danmcd: maybe @arekinath wants some MNX time to work on the driver ;p ?
22:00:17 it will respect your ~/.ssh/config, where you can use ProxyJump
22:01:38 Tailscale can be run on a SmartOS GZ if that's something you want to try.
22:21:36 wiedi_: hmmm, does it still need a DNS entry to add the host?
22:22:28 Because I get `host add failed` when I try to add it by the name in .ssh/config
22:33:24 it should not; as long as you can do "ssh host" it should be ok
22:43:24 @Smithx10 Chelsio is another alternative that gets actual illumos drops from the company.
22:44:00 Yeah, I got screwed and had to go Mellanox because the 200 -> 100 splitter cables had to be QSFP56
22:44:35 the switch was 32 ports of 200Gb. Chelsio only has QSFP28 cards until the next gen comes out
22:45:37 I checked with you earlier and they should be supported. Luckily they are only for storage; we have known-working Intel currently, so if we have an issue the local pool will still work fine, they just won't be available for storage.
22:50:40 Ack.
23:09:09 Smithx10: don't need paid MNX time, UQ will pay me for it since we use those parts extensively anyway
23:09:44 but it should Just Work (TM)
23:10:01 Mellanox are pretty good at keeping the same base hardware interface at the moment
23:12:02 just don't start asking about RDMA :)
23:15:04 jbk: I promise.
23:15:19 after talking to rmustacc I learned a bunch
23:15:31 do ask about memory usage and LSO though, I have plans for those ;)
23:15:41 haha
23:15:44 just going to be mounting nvme-tcp from within the guest atm
23:15:57 unless we can get nvme-tcp into the OS / bhyve
23:16:10 or Propolis, which might eventually get into SmartOS
23:16:24 hopefully I'll get some testing of my vmxnet3 changes (once they're good they'll be going upstream)
23:17:02 since it'll hopefully provide at least some conceptual testing for dealing with jumbo frames better
23:17:42 though it seems it and i40e at least both end up hitting a similar problem (kernel memory fragmentation making 9k DMA allocations 'slow')
23:18:28 it might be useful for other NICs -- at least as a guide
23:18:28 not sure what would be involved in adding an nvme-tcp client to illumos
23:19:01 mlxcx parts can have SRQs, which let you share one big pool of RX buffers between a lot of much smaller rings at once, and the hardware splits them up as needed
23:19:12 which I would like to use for machines with lots of vnics
23:19:24 oh, that would be nice
23:19:28 so they can have full-depth rings without paying for all the buffers for all of them all the time
23:20:39 https://nvmexpress.org/wp-content/uploads/NVM-Express-TCP-Transport-Specification-2021.06.02-Ratified-1.pdf rev 1
23:20:56 it looked like other OSes basically deal with the problem by just allocating PAGESIZE chunks and letting jumbo frames be segmented by the NIC on tx and rx
23:21:11 https://nvmexpress.org/changes-in-nvm-express-revision-2-0/
23:22:15 we are going to use DRBD on a bunch of Linux machines to replicate volumes, and just have users attach via nvme-tcp from the Linux HVM guest
23:22:26 but it could be smoother to add that into the OS
23:23:03 and add them at the vmadm layer vs within the guest. Not sure how much perf that would gain
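For a sense of what that in-guest attach looks like today, a sketch using nvme-cli inside a Linux guest (the target address and NQN are made up):

```
# Inside the Linux HVM guest; the target address and NQN are placeholders.
modprobe nvme-tcp
nvme discover -t tcp -a 10.0.0.5 -s 4420
nvme connect  -t tcp -a 10.0.0.5 -s 4420 -n nqn.2022-11.org.example:vol0
# The remote namespace then shows up as a normal /dev/nvmeXnY block device.
```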