-
danmcd
Good news: At least on the big NUC my keyboard suddenly works.
-
danmcd
Bad news: Before that happened I mistyped this: `cfgadm l` and I seem to have induced an ndi_devi_enter() wedge of some sort. Yes I do have the coredump.
-
danmcd
More bad news: the other NUC appears to be still failing.
-
gitomat
[illumos-gate] 17959 nlm_copy_netbuf(): NULL pointer dereference -- Toomas Soome <tsoome⊙mc>
-
gitomat
[illumos-gate] 17961 klm: memory leak in nlm_host_destroy() -- Toomas Soome <tsoome⊙mc>
-
ENOMAD
I'm missing something, probably something quite obvious, but $newfileserver is reporting zpools of absurd size and I'm confused. A raidz2 with 5 20TB drives should not be reporting a size of 90.9T, right? It should be closer to 60T.
-
ENOMAD
but: : || lvd@fs2 ~ [519] ; zpool get size pool0
-
ENOMAD
NAME PROPERTY VALUE SOURCE
-
ENOMAD
pool0 size 90.9T -
-
ENOMAD
the ssdp0 in the same host is also reporting an impossible size.
-
ENOMAD
-
sommerfeld
raidz sizing reports the raw size -- unlike mirrors
-
ENOMAD
so that's the obvious thing I was missing.
-
sommerfeld
raidz overhead depends in part on the block sizes written which is impossible to predict
-
ENOMAD
I'm trying to chase down why our new fileserver is slower than our 5 year old one. Back to the drawing board.
-
sommerfeld
so it punts and reports raw capacity while inflating usage
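(a quick way to see the difference, assuming the pool is named pool0 as above:
  zpool list pool0   # SIZE here is raw capacity across all devices
  zfs list pool0     # USED/AVAIL here is usable space after parity overhead
)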
-
ENOMAD
(I was hoping I'd just made the pools wrong.)
-
sommerfeld
any ssds involved? One thing I just found was that m.2 NVMe sticks without power loss protection make slower log devices than a SATA ssd with PLP.
-
sommerfeld
I/O is faster but the cache flush op is slower.
-
ENOMAD
the pool we were testing was entirely SSD based.
-
ENOMAD
(ssdp0 in that paste)
-
ENOMAD
c3t58CE38EE22F3C9C2d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
-
ENOMAD
Vendor: KIOXIA Product: KPM7XVUG6T40 Revision: 0103 Serial No: 1E60A01G0993
-
ENOMAD
Size: 6401.25GB <6401252745216 bytes>
-
ENOMAD
that's the SSD in question.
-
ENOMAD
IIRC, they're connected via SAS.
-
sommerfeld
claims to have power loss protection so probably not the thing I ran into (with lower end hardware..)
-
ENOMAD
to check something, oldpool (also entirely SSD, but SATA-connected), has 6 devices in the VDEV while newpool has 4. I presume that isn't going to impact speed to any noticeable amount.
-
sommerfeld
might help to have a crisper definition of "speed". (The slowness I was chasing turned out to be fsync() latency; adding a log device that didn't suck made a big difference)
-
ENOMAD
sommerfeld, I'm trying to work on that definition. Problem is, oldfileserver is under load. The new one isn't.
-
ENOMAD
what we're seeing isn't really useful to quantify things, it's too many layers deep
-
ENOMAD
(VM-based database server with all disks coming as VDI from an NFS mount from the fileserver. VM's VDIs hosted on old fileserver gives X database speed. same VDIs migrated to new fileserver, database is slightly slower.)
-
ENOMAD
I want to do a speed test on the fileserver itself to see if it's the pool or NFS or something else.
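(roughly what I have in mind, run locally against the pool -- paths and sizes are just illustrative:
  fio --name=synctest --directory=/ssdp0/fiotest --rw=randrw --bs=8k --size=1g \
      --numjobs=4 --ioengine=psync --fsync=1 --time_based --runtime=60 --group_reporting
)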
-
ENOMAD
but noticed the size report thing first so thought I'd check it before digging deeper.
-
jbk
IIRC, NFS does a fair amount of synchronous writes, or I guess on zfs will exercise the log
-
ENOMAD
this pool doesn't have a dedicated log (or cache) device since it's entirely SSD already.
-
ENOMAD
(same for the existing pool.)
-
jbk
the log always exists
-
jbk
just a question of whether it's on a dedicated disk or not
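(easy to check -- the device name below is purely hypothetical:
  zpool status pool0          # a separate 'logs' section only shows up if a slog was added
  zpool add pool0 log c9t0d0  # how one would be added
)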
-
jbk
though did you say the new disks are SAS vs the old being SATA?
-
ENOMAD
that's my recollection, yes.
-
jbk
one thing I've wanted to investigate, but never had the time is queue depth.. SATA has a fixed depth (I'd need to re-read the specs, but I think it's basically either 1 or 16)
-
jbk
while SCSI doesn't really have a way to discover what it is for a disk
-
jbk
other than at some point, it might return an error that means 'queue full'
-
jbk
i've wondered (and the answer might be 'no') if sd is perhaps not always letting enough work get to SSDs
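(if someone wanted to poke at that, the sd driver's per-target command limit is its throttle, tunable via /etc/system -- the value below is just illustrative, not a recommendation:
  set sd:sd_max_throttle=256
)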
-
ENOMAD
would prtconf -v confirm it's SAS?
-
jbk
I think iostat -E shows SATA drives as 'ATA' for the vendor
-
jbk
since we pretend they're scsi drives and then translate the SCSI commands into ATA commands
-
ENOMAD
iostat -En returns a vendor name.
-
jbk
then it's probably SAS (or NVMe)
-
jbk
(NVMe would have a device path ending in 'blkdev@...' while SCSI would end with 'sd@...')
-
jbk
the diskinfo command should show all of that..
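(e.g. plain `diskinfo` lists vendor/product plus SSD and removable flags for each disk, and something like `ls -l /dev/dsk/c3t58CE38EE22F3C9C2d0s0` shows the /devices path, which ends in sd@... for SCSI/SAS vs blkdev@... for NVMe)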
-
ENOMAD
we don't have any NVMe in this. Everything is old-school hot-plug drive bays.
-
jbk
(as I just remembered :P)
-
ENOMAD
the rpool drives are SATA and show as "ATA" in the vendor field. The SAS SSDs (and the spinning rust in pool0) show the proper vendor name.
-
ENOMAD
so I guess I remembered correctly.
-
ENOMAD
I can pastebin diskinfo output if you'd like.
-
ENOMAD
jbk, is there an easy way to verify your queue depth question since I'm here with these drives?
-
ENOMAD
(with a server that isn't in production yet.)
-
ENOMAD
hmm. interesting. diskinfo says these disks aren't removable when they very much are.
-
ENOMAD
I guess that doesn't really reflect hotswap, though. Never mind.
-
jbk
it was more of an 'in general' type question.. since scsi doesn't have a way to tell you..
-
sommerfeld
diskinfo "removable" means something more like "cd, floppy, or usb stick"
-
ENOMAD
k
-
sommerfeld
unless I'm badly mistaken..
-
jbk
and there are probably a lot of things in sd.c that aren't much use anymore
-
ENOMAD
sommerfeld, the manpage agrees with you. I just spoke before reading it :)
-
jbk
(there's that whole 'prototype'
-
jbk
xbuf stuff
-
jbk
amongst other things
-
sommerfeld
So I'm wondering if there may be benefits to a slog even in an all ssd pool. But that's in part based on experience with a pool which started with mirrored m.2 nvme's as "special" vdevs (metadata) plus mirrored spinning rust for main storage. adding a bad slog took fsync latency from ~1s to 2-5ms; adding a better one took fsync latency to more like 0.5ms
-
ENOMAD
sadly, we don't have budget to add anything to this host. It's already way over budget thanks to the [expletives deleted] AI bullshit.
-
ENOMAD
the person who did the timing tests on the database notes writes were worse than reads. I presume he means in terms of the delta between current and new.
-
ENOMAD
He's smart enough to know writes will be slower than reads.
-
sommerfeld
-
ENOMAD
thanks sommerfeld. I haven't used dtrace in years so I'll need to review how to do it. I'll let you know what I find out.
-
ENOMAD
in case you're curious. I did some fio tests.
pastebin.com/uBkKGVfq has the raw data. I haven't even looked at it yet.
-
ENOMAD
looking at the run status of the two runs doesn't make me happy.
-
ENOMAD
ok, time to look into dtrace.
-
ENOMAD
there was no load on fs2 (the new server) so I'm running fio over NFS to it.
-
ENOMAD
sommerfeld, can you help me understand these results, please?
pastebin.com/S6bfkpKM
-
ENOMAD
hmm. ignore that paste. I re-ran with fio running against each server and the charts look a lot more similar.
-
jbk
just to ask, have you verified the zfs settings for the dataset are the same on both systems?
-
ENOMAD
-
ENOMAD
jbk: yes
-
ENOMAD
ok, zpool settings are the same but the directory I'm writing fio stuff to isn't. compression is off for the old fileserver but on for the new one. hmmm. hmmmmmmmmm.
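(i.e. something along these lines -- the dataset name is just a placeholder:
  zfs get compression pool0/fiotest
  zfs set compression=off pool0/fiotest
)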
-
ENOMAD
need to test that, obviously.
-
ENOMAD
jbk, it is quite possible you've not only hit the nail on the head but driven it through the board and out the other side.
-
ENOMAD
Run status group 0 (all jobs):
-
ENOMAD
READ: bw=309MiB/s (324MB/s), 6241KiB/s-11.7MiB/s (6391kB/s-12.3MB/s), io=18.4GiB (19.8GB), run=60009-61036msec
-
ENOMAD
WRITE: bw=312MiB/s (327MB/s), 6660KiB/s-12.1MiB/s (6820kB/s-12.7MB/s), io=18.6GiB (20.0GB), run=60009-61036msec
-
ENOMAD
I was absolutely certain I'd set that right. Thank you for encouraging me to recheck my work.
-
jbk
no problem -- i've had enough issues myself that I always try to do an explicit sanity check to catch the 'easy' fixes
-
ENOMAD
I thought I had checked it. Clearly I hadn't. Always good to ask the "It's obvious but..." questions.
-
jbk
yeah.. been there many times :)
-
ENOMAD
I'm very glad you folks are here. I don't have anyone here to sanity check me on these things. The joy of being the only 'nix admin in my group.
-
danmcd
Glad to help!
-
danmcd
(Though I did nothing of the sort for this time.)
-
ENOMAD
danmcd, you are a font of help.
-
danmcd
( s/help/$SOMETHING_ELSE/g I think. :D )
-
jbk
a bold thing to say
-
sommerfeld
sorry, had stepped out for a bit.
-
sommerfeld
old server: most fsyncs take 0.5-1ms. new server: most fsyncs take 16-32ms.
-
sommerfeld
directionally similar to what I was seeing on my new server and which I fixed by adding a pair of old sata ssds with power loss protection.
-
danmcd
Your fsync test is on-local-disk? Or over NFS @sommerfeld ?
-
sommerfeld
on-local-disk.
-
sommerfeld
or, well, it measures wall-clock-time taken by zfs_fsync()
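(roughly along these lines, though not my exact script:
  dtrace -n 'fbt::zfs_fsync:entry { self->ts = timestamp }
    fbt::zfs_fsync:return /self->ts/ { @["fsync ns"] = quantize(timestamp - self->ts); self->ts = 0; }'
)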
-
danmcd
Is it a simple write-a-test-program-and-time-it, or are you running dtrace on the syscall over a workload?
-
ENOMAD
sommerfeld, I never updated the chart. The re-run on oldserver looks a lot more similar to the new server with fio running.
-
danmcd
(prob the latter as you beat me to the typing punch).
-
ENOMAD
I'll probably re-do with compression disabled on the new server shortly but right now we're setting up to see if the database server notices the difference.
-
sommerfeld
timing fsyncs as used by whatever the workload is.
-
sommerfeld
-
gitomat
[illumos-gate] 17960 want format_arg attribute -- Toomas Soome <tsoome⊙mc>
-
danmcd
TY
-
sommerfeld
without a slog I saw fsync latency in the 100-200ms range which was higher than I was expecting.
-
ENOMAD
here's the fio-based dtrace against both servers.
pastebin.com/0V46NHNC (I meant to paste this earlier but I think I got distracted and forgot.)
-
ENOMAD
that was from before I found the compression discrepancy.
-
ENOMAD
FTR: our database speed test results are much better now so yeah, me being stupid was the problem.
-
jbk
one thing I recall someone at oxide mentioning is that, for NVMe at least (not sure if it'd apply more broadly to SATA or SCSI SSDs or not), the reordering of I/Os by the zio scheduler wasn't beneficial and (IIRC) was actually somewhat detrimental to performance... which seems reasonable since SSDs don't have to worry about rotational latency and such
-
sommerfeld
jbk: yeah. I'd imagine there are probably cases where there would be a benefit to coalescing adjacent small-block i/o's, but it'll be relatively small compared to the benefit of an elevator sort on a real disk..
-
richlowe
is scheduling I/O wrongly actually real pain, or just something that shows up when you measure it?
-
jbk
at least some SSDs appear to use some combination of super capacitors, ram, and SLC flash to 'buffer' I/Os to the 'main' flash.. so it wouldn't surprise me if at least some of them can handle small I/Os to adjacent (or adjacent enough) blocks internally
-
richlowe
danmcd: I would be surprised if there was only that AF from LX that you probably want to reserve
-
danmcd
For now, it's the one we have. Fixing this can probably set a precedent for bizarro-brand stuff like this.
-
danmcd
Also, I wonder about some of the lesser-used AF values that are probably around only for historical purposes. (Looks askew at AF_X25...)
-
richlowe
all removing them can do is break weirdo software, right?
-
richlowe
danmcd: AF_DATAKIT!
-
richlowe
wow, and chaosnet
-
richlowe
and imp!
-
richlowe
the anachronism of AF_INET < AF_IMPLINK
-
danmcd
I've *written code* that established an X.25 call and sent data, but it was a completely different API, something more akin to DLPI or SOCK_RAW, and it was (checks calendar) 36 years ago this summer.
-
richlowe
I have written code that did something over chaosnet, but only as an historical adventure
-
danmcd
This was paid-summer-intern duty. I was supposed to *test* the aforementioned API.
-
danmcd
Found more bugs in the C compiler than I did in the API, FWIW.
-
jbk
as recent as the late 90s, you had to use X.25 to communicate with phone switches... which was fun
-
jbk
well some switches..
-
danmcd
It's why the OS/400 "Open FM API" supported Ethernet (both normal and 802.whatever), Token Ring, and X.25.
-
danmcd
-
jbk
i'm surprised it supported ethernet...
-
danmcd
It HAD to... $MOTIVATING_CUSTOMER needed it.
-
jbk
ahh
-
jbk
I see their naming is almost as bad as mainframes
-
jbk
'can you spare a vowel'? :)
-
danmcd
LLC... that was what I was trying to remember. I *can't* recall if the Ethernet support REQUIRED LLC or not. It has been 36 years.
-
jbk
heh
-
jbk
i remember trying to use dlpi with the llc2 driver on solaris 7 and having it immediately panic after opening the device :)
-
jbk
'ok then..guess I won't be messing with that'
-
tsoome
It was not that many years ago that I had to maintain a solaris cluster setup which was fetching call records for accounting over X.25 and feeding them to a database :D
-
jbk
i think i mentioned this before, but one of my coworkers had a system like that, except it was using sun's SNA (cue screams) stack to talk to an IBM mainframe, since it had copies of all the call records (prior to this, there was a group that had to manually load 9-track 6250 bpi tapes sent from the various phone companies)
-
jbk
it was amusing in that after the mainframe did some upgrades, it couldn't communicate at the full 1.54Mbps (T1)
-
jbk
but the mainframe is never at fault :)
-
jbk
(the fix was to clock the circuit down to 768k)
-
tsoome
oops:)
-
jbk
but at one point, my coworker had the direct # for whoever the engineer at Sun was that wrote (or at least maintained at that point) the SNA stack..
-
jbk
because it took quite a while to figure that out