04:20:04 Good news: at least on the big NUC my keyboard suddenly works.
04:20:56 Bad news: before that happened I mistyped this: `cfgadm l` and I seem to have induced an ndi_devi_enter() wedge of some sort. Yes, I do have the core dump.
04:30:58 More bad news: the other NUC appears to be still failing.
14:05:21 [illumos-gate] 17959 nlm_copy_netbuf(): NULL pointer dereference -- Toomas Soome
14:08:27 [illumos-gate] 17961 klm: memory leak in nlm_host_destroy() -- Toomas Soome
16:46:59 I'm missing something, probably something quite obvious, but $newfileserver is reporting zpools of absurd size and I'm confused. A raidz2 with 5 20TB drives should not be reporting a size of 90.9T, right? It should be closer to 60T.
16:47:31 but: lvd@fs2 ~ [519] ; zpool get size pool0
16:47:31 NAME PROPERTY VALUE SOURCE
16:47:31 pool0 size 90.9T -
16:48:03 the ssdp0 in the same host is also reporting an impossible size.
16:53:05 https://pastebin.com/QLuwpc72
16:57:04 raidz sizing reports the raw size -- unlike mirrors
16:58:25 so that's the obvious thing I was missing.
16:58:50 raidz overhead depends in part on the block sizes written, which is impossible to predict
16:59:07 I'm trying to chase down why our new fileserver is slower than our 5-year-old one. Back to the drawing board.
16:59:08 so it punts and reports raw capacity while inflating usage
16:59:32 (I was hoping I'd just made the pools wrong.)
17:00:52 any SSDs involved? One thing I just found was that m.2 NVMe sticks without power loss protection will be slower log devices than a SATA SSD with PLP.
17:01:24 I/O is faster but the cache flush op is slower.
17:01:39 the pool we were testing was entirely SSD based.
17:01:58 (ssdp0 in that paste)
17:02:18 c3t58CE38EE22F3C9C2d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
17:02:18 Vendor: KIOXIA Product: KPM7XVUG6T40 Revision: 0103 Serial No: 1E60A01G0993
17:02:18 Size: 6401.25GB <6401252745216 bytes>
17:02:23 that's the SSD in question.
17:03:10 IIRC, they're connected via SAS.
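[The 90.9T figure is also exactly what five decimal-marketing "20 TB" drives sum to when printed in binary (TiB) units, on top of raidz reporting raw rather than parity-adjusted capacity. A quick sanity-check sketch, assuming the 5x20TB raidz2 layout from the paste:]

```python
# zpool reports raidz capacity as the raw sum of all member disks,
# printed in binary (TiB) units. Five "20 TB" (decimal) drives thus
# show up as ~90.9T, not the ~60T one might naively expect.
TB = 10**12          # marketing terabyte (decimal)
TiB = 2**40          # binary tebibyte, what zpool's "T" suffix means

raw_bytes = 5 * 20 * TB
print(f"raw pool size:  {raw_bytes / TiB:.1f} TiB")   # -> 90.9 TiB

# Rough usable space after raidz2 parity (2 of the 5 disks):
usable = 3 * 20 * TB
print(f"approx usable:  {usable / TiB:.1f} TiB")      # -> 54.6 TiB
```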
17:04:27 claims to have power loss protection, so probably not the thing I ran into (with lower-end hardware..)
17:07:27 to check something, oldpool (also entirely SSD, but SATA-connected) has 6 devices in the vdev while newpool has 4. I presume that isn't going to impact speed to any noticeable degree.
17:11:35 might help to have a crisper definition of "speed". (The slowness I was chasing turned out to be fsync() latency; adding a log device that didn't suck made a big difference)
17:12:31 sommerfeld, I'm trying to work on that definition. Problem is, oldfileserver is under load. The new one isn't.
17:12:49 what we're seeing isn't really useful to quantify things, it's too many layers deep
17:13:42 (VM-based database server with all disks coming as VDIs from an NFS mount from the fileserver. VDIs hosted on the old fileserver give X database speed. Same VDIs migrated to the new fileserver, and the database is slightly slower.)
17:14:22 I want to do a speed test on the fileserver itself to see if it's the pool or NFS or something else.
17:14:39 but noticed the size report thing first so thought I'd check it before digging deeper.
17:14:53 IIRC, NFS does a fair amount of synchronous writes, so I guess on zfs it will exercise the log
17:15:29 this pool doesn't have a dedicated log (or cache) device since it's entirely SSD already.
17:15:38 (same for the existing pool.)
17:15:45 the log always exists
17:15:57 just a question of if it's on a dedicated disk or not
17:16:29 though did you say the new disks are SAS vs the old being SATA?
17:16:50 that's my recollection, yes.
17:17:47 one thing I've wanted to investigate, but never had the time, is queue depth..
SATA has a fixed depth (I'd need to re-read the specs, but NCQ tops out at 32 tags; without NCQ it's effectively 1)
17:18:16 while SCSI doesn't really have a way to discover what it is for a disk
17:18:35 other than at some point, it might return an error that means 'queue full'
17:19:38 i've wondered (and the answer might be 'no') if sd is perhaps not always letting enough work get to SSDs
17:19:39 would prtconf -v confirm it's SAS?
17:20:12 I think iostat -e shows SATA drives as 'ATA' for the vendor
17:20:31 since we pretend they're SCSI drives and then translate the SCSI commands into ATA commands
17:21:24 iostat -En returns a vendor name.
17:21:49 then it's probably SAS (or NVMe)
17:22:12 (NVMe would have a device path that ends with 'blkdev@...' while SCSI would end with 'sd@...')
17:22:46 the diskinfo command should show all of that..
17:22:56 we don't have any NVMe in this. Everything is old-school hot-plug drive bays.
17:22:57 (as I just remembered :P)
17:24:08 the rpool drives are SATA and show as "ATA" in the vendor field. The SAS SSDs (and the spinning rust in pool0) show the proper vendor name.
17:24:14 so I guess I remembered correctly.
17:24:55 I can pastebin diskinfo output if you'd like.
17:26:33 jbk, is there an easy way to verify your queue depth question since I'm here with these drives?
17:26:41 (with a server that isn't in production yet.)
17:27:36 hmm. interesting. diskinfo says these disks aren't removable when they very much are.
17:28:02 I guess that doesn't really reflect hotswap, though. Never mind.
17:32:55 it was more of an 'in general' type question.. since SCSI doesn't have a way to tell you..
17:33:11 diskinfo "removable" means something more like "cd, floppy, or usb stick"
17:33:11 k
17:33:24 unless I'm badly mistaken..
17:34:01 and there are probably a lot of things in sd.c that aren't much use anymore
17:35:15 sommerfeld, the manpage agrees with you.
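[The "ATA vs. real vendor name" heuristic above is easy to mechanize. A sketch that classifies disks from `iostat -En` output; the parsing is modeled only on the two-line record format quoted earlier in this log, and the second sample disk is made up for illustration:]

```python
def classify_disks(iostat_en_output):
    """Classify disks from `iostat -En` output: SATA disks behind sd
    report 'ATA' as the vendor, while SAS disks show the real vendor."""
    disks = {}
    current = None
    for line in iostat_en_output.splitlines():
        line = line.strip()
        if "Soft Errors:" in line:
            current = line.split()[0]   # e.g. c3t58CE38EE22F3C9C2d0
        elif line.startswith("Vendor:") and current:
            vendor = line.split()[1]
            disks[current] = "SATA" if vendor == "ATA" else f"SAS/other ({vendor})"
    return disks

# Sample modeled on the iostat -En output pasted above; the second
# (SATA) entry is hypothetical:
sample = """\
c3t58CE38EE22F3C9C2d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: KIOXIA Product: KPM7XVUG6T40 Revision: 0103 Serial No: 1E60A01G0993
c1t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: Samsung SSD 860 Revision: 1B6Q Serial No: S3Z8NB0K
"""
print(classify_disks(sample))
```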
I just spoke before reading it :)
17:36:09 (there's that whole 'prototype'
17:36:15 xbuf stuff
17:37:00 amongst other things)
17:40:26 So I'm wondering if there may be benefits to a slog even in an all-SSD pool. But that's in part based on experience with a pool which started with mirrored m.2 NVMes as "special" vdevs (metadata) plus mirrored spinning rust for main storage. Adding a bad slog took fsync latency from ~1s to 2-5ms; adding a better one took fsync latency to more like 0.5ms
17:42:18 sadly, we don't have budget to add anything to this host. It's already way over budget thanks to the [expletives deleted] AI bullshit.
17:46:00 the person who did the timing tests on the database notes writes were worse than reads. I presume he means in terms of the delta between current and new.
17:46:09 He's smart enough to know writes will be slower than reads.
17:49:47 https://gist.github.com/Bill-Sommerfeld/b6cf6713ed0edd72d04e4350dd47dc7f <- looks at fsync latency with dtrace
17:55:32 thanks sommerfeld. I haven't used dtrace in years so I'll need to review how to do it. I'll let you know what I find out.
18:37:28 in case you're curious, I did some fio tests. https://pastebin.com/uBkKGVfq has the raw data. I haven't even looked at it yet.
18:40:17 looking at the run status of the two runs doesn't make me happy.
18:40:25 ok, time to look into dtrace.
18:45:55 there was no load on fs2 (the new server) so I'm running fio over NFS to it.
18:49:02 sommerfeld, can you help me understand these results, please? https://pastebin.com/S6bfkpKM
19:00:01 hmm. ignore that paste. I re-ran with fio running against each server and the charts look a lot more similar.
19:01:00 just to ask, have you verified the zfs settings for the dataset are the same on both systems?
19:01:32 https://pastebin.com/0V46NHNC updated result.
19:01:47 jbk: yes
19:05:03 ok, zpool settings are the same but the directory I'm writing fio stuff into isn't.
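[For anyone without dtrace handy, a rough userland analogue of the linked gist: this just times os.fsync() from the caller's side rather than probing zfs_fsync() in the kernel, so it includes syscall overhead, but the orders of magnitude discussed here (sub-ms vs tens of ms) show up fine. A sketch, not the gist's actual script:]

```python
import os
import tempfile
import time

def fsync_latencies(path, writes=100, size=4096):
    """Time each write+fsync pair; return latencies in milliseconds."""
    latencies = []
    buf = b"x" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(writes):
            os.write(fd, buf)
            start = time.monotonic()
            os.fsync(fd)          # forces the data to stable storage
            latencies.append((time.monotonic() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies

with tempfile.NamedTemporaryFile() as tmp:
    lats = sorted(fsync_latencies(tmp.name, writes=50))
    print(f"median fsync: {lats[len(lats) // 2]:.3f} ms, "
          f"max: {lats[-1]:.3f} ms")
```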
compression is off for the old fileserver but on for the new one. hmmm. hmmmmmmmmm.
19:05:25 need to test that, obviously.
19:08:45 jbk, it is quite possible you've not only hit the nail on the head but driven it through the board and out the other side.
19:08:54 Run status group 0 (all jobs):
19:08:54 READ: bw=309MiB/s (324MB/s), 6241KiB/s-11.7MiB/s (6391kB/s-12.3MB/s), io=18.4GiB (19.8GB), run=60009-61036msec
19:08:54 WRITE: bw=312MiB/s (327MB/s), 6660KiB/s-12.1MiB/s (6820kB/s-12.7MB/s), io=18.6GiB (20.0GB), run=60009-61036msec
19:15:10 I was absolutely certain I'd set that right. Thank you for encouraging me to recheck my work.
19:17:58 no problem -- i've had enough issues myself that I always try to do an explicit sanity check to catch the 'easy' fixes
19:18:28 I thought I had checked it. Clearly I hadn't. Always good to ask the "It's obvious, but..." questions.
19:18:49 yeah.. been there many times :)
19:19:12 I'm very glad you folks are here. I don't have anyone here to sanity-check me on these things. The joy of being the only 'nix admin in my group.
19:19:55 Glad to help!
19:20:11 (Though I did nothing of the sort this time.)
19:20:34 danmcd, you are a font of help.
19:30:44 (s/help/$SOMETHING_ELSE/g I think. :D)
19:30:53 a bold thing to say
19:31:22 sorry, had stepped out for a bit.
19:33:18 old server: most fsyncs take 0.5-1ms. new server: most fsyncs take 16-32ms.
19:34:14 directionally similar to what I was seeing on my new server, which I fixed by adding a pair of old SATA SSDs with power loss protection.
19:35:50 Your fsync test is on-local-disk? Or over NFS, @sommerfeld?
19:35:56 on-local-disk.
19:36:21 or, well, it measures wall-clock time taken by zfs_fsync()
19:36:25 Is it a simple write-a-test-program-and-time-it, or are you running dtrace on the syscall under load?
19:36:39 sommerfeld, I never updated the chart. The re-run on oldserver looks a lot more similar to the new server with fio running.
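[The compression mismatch is exactly the kind of thing that's easy to miss by eyeballing two terminals. A small sketch that diffs dataset properties captured from two hosts; it assumes the tab-separated format of `zfs get -H -o property,value all`, and the sample data below is made up to mirror the discrepancy found here:]

```python
def parse_zfs_props(output):
    """Parse `zfs get -H -o property,value all` output into a dict."""
    props = {}
    for line in output.strip().splitlines():
        prop, _, value = line.partition("\t")
        props[prop] = value
    return props

def diff_props(old, new):
    """Return {property: (old_value, new_value)} where the two differ."""
    return {
        p: (old.get(p), new.get(p))
        for p in sorted(set(old) | set(new))
        if old.get(p) != new.get(p)
    }

# Made-up samples standing in for the old and new fileservers:
old_fs = parse_zfs_props("compression\toff\nrecordsize\t128K\nsync\tstandard")
new_fs = parse_zfs_props("compression\ton\nrecordsize\t128K\nsync\tstandard")
print(diff_props(old_fs, new_fs))   # -> {'compression': ('off', 'on')}
```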
19:36:41 (prob the latter, as you beat me to the typing punch).
19:37:07 I'll probably re-do it with compression disabled on the new server shortly, but right now we're setting up to see if the database server notices the difference.
19:37:13 timing fsyncs as used by whatever the workload is.
19:38:08 link I pasted earlier with the dtrace script: https://gist.github.com/Bill-Sommerfeld/b6cf6713ed0edd72d04e4350dd47dc7f
19:39:21 [illumos-gate] 17960 want format_arg attribute -- Toomas Soome
19:41:03 TY
19:42:03 without a slog I saw fsync latency in the 100-200ms range, which was higher than I was expecting.
19:48:31 here's the fio-based dtrace against both servers. https://pastebin.com/0V46NHNC (I meant to paste this earlier but I think I got distracted and forgot.)
19:49:34 that was from before I found the compression discrepancy.
20:00:15 FTR: our database speed test results are much better now, so yeah, me being stupid was the problem.
20:08:39 one thing I recall someone at Oxide mentioning is that for NVMe at least (not sure if it'd more broadly apply to SATA or SCSI SSDs or not), the reordering of I/Os by the zio scheduler wasn't beneficial, and I think was actually somewhat detrimental to performance... which seems reasonable since SSDs don't have to worry about rotational latency and such
20:21:22 jbk: yeah. I'd imagine there are probably cases where there would be a benefit to coalescing adjacent small-block I/Os, but they'll be relatively small compared to the benefit of an elevator sort on a real disk..
20:22:04 is scheduling I/O wrongly actually real pain, or just something that shows up when you measure it?
20:25:09 at least some SSDs appear to use some combination of supercapacitors, RAM, and SLC flash to 'buffer' I/Os to the 'main' flash..
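[Those "0.5-1ms" vs "16-32ms" figures read like dtrace's quantize() power-of-two buckets. The same bucketing is easy to reproduce in userland if you already have raw latencies; a sketch, assuming millisecond inputs, not the gist's actual script:]

```python
import math
from collections import Counter

def quantize_ms(latencies_ms):
    """Bucket latencies by power-of-two lower edge, like dtrace quantize().
    A 0.7ms fsync lands in the 0.5-1ms bucket; a 20ms one in 16-32ms."""
    buckets = Counter()
    for ms in latencies_ms:
        # floor(log2) picks the lower edge of the power-of-two bucket
        lo = 2.0 ** math.floor(math.log2(ms)) if ms > 0 else 0.0
        buckets[lo] += 1
    return dict(buckets)

hist = quantize_ms([0.7, 0.9, 20.0, 25.0, 31.0])
print(hist)   # -> {0.5: 2, 16.0: 3}
```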
so it wouldn't surprise me if at least some of them can handle small I/Os to adjacent (or adjacent-enough) blocks internally
20:29:26 danmcd: I would be surprised if there was only that AF from LX that you probably want to reserve
20:30:05 For now, it's the one we have. Fixing this can probably set a precedent for bizarro-brand stuff like this.
20:30:35 Also, I wonder about some of the lesser-used AF values that are probably around only for historical purposes. (Looks askew at AF_X25...)
20:31:25 all removing them can do is break weirdo software, right?
20:31:43 danmcd: AF_DATAKIT!
20:32:02 wow, and chaosnet
20:32:03 and imp!
20:32:30 the anachronism of AF_INET < AF_IMPLINK
20:34:17 I've *written code* that established an X.25 call and sent data, but it was a completely different API, something more akin to DLPI or SOCK_RAW, and it was (checks calendar) 36 years ago this summer.
20:34:48 I have written code that did something over chaosnet, but only as an historical adventure
20:35:18 This was paid-summer-intern duty. I was supposed to *test* the aforementioned API.
20:35:28 Found more bugs in the C compiler than I did in the API, FWIW.
20:36:23 as recently as the late 90s, you had to use X.25 to communicate with phone switches... which was fun
20:36:29 well, some switches..
20:38:30 It's why the OS/400 "Open FM API" supported Ethernet (both normal and 802.whatever), Token Ring, and X.25.
20:39:05 I've found ONE reference to it in a Google search, in this PDF: https://public.dhe.ibm.com/systems/power/docs/systemi/v5r2/it_IT/apiexmp.pdf
20:39:27 i'm surprised it supported ethernet...
20:39:42 It HAD to... $MOTIVATING_CUSTOMER needed it.
20:39:47 ahh
20:40:31 I see their naming is almost as bad as mainframes'
20:40:36 'can you spare a vowel'? :)
20:41:09 LLC... that was what I was trying to remember. I *can't* recall if the Ethernet support REQUIRED LLC or not. It has been 36 years.
20:41:29 heh
20:41:51 i remember trying to use DLPI with the llc2 driver on Solaris 7 and having it immediately panic after opening the device :)
20:42:01 'ok then.. guess I won't be messing with that'
20:46:36 It was not that many years ago when I had to maintain a Solaris Cluster setup which was fetching call records for accounting over X.25 and feeding them to a database :D
20:48:37 i think i mentioned this before, but one of my coworkers had a system like that, except it was using Sun's SNA (cue screams) stack to talk to an IBM mainframe, since it had copies of all the call records (prior to this, there was a group that had to manually load 9-track 6250 BPI tapes sent from the various phone companies)
20:49:22 it was amusing in that after the mainframe did some upgrades, it couldn't communicate at the full 1.544Mbps (T1)
20:49:32 but the mainframe is never at fault :)
20:49:44 (the fix was to clock the circuit down to 768k)
20:50:03 oops :)
20:50:30 but at one point, my coworker had the direct # for the engineer at Sun who wrote (or at least maintained at that point) the SNA stack..
20:50:39 because it took quite a while to figure that out