-
danmcd
Good news: At least on the big NUC my keyboard suddenly works.
-
danmcd
Bad news: Before that happened I mistyped this: `cfgadm l` and I seem to have induced an ndi_devi_enter() wedge of some sort. Yes I do have the coredump.
-
danmcd
More bad news: the other NUC appears to be still failing.
-
gitomat
[illumos-gate] 17959 nlm_copy_netbuf(): NULL pointer dereference -- Toomas Soome <tsoome⊙mc>
-
gitomat
[illumos-gate] 17961 klm: memory leak in nlm_host_destroy() -- Toomas Soome <tsoome⊙mc>
-
ENOMAD
I'm missing something, probably something quite obvious, but $newfileserver is reporting zpools of absurd size and I'm confused. A raidz2 with 5 20TB drives should not be reporting a size of 90.9T, right? It should be closer to 60T.
-
ENOMAD
but: : || lvd@fs2 ~ [519] ; zpool get size pool0
-
ENOMAD
NAME PROPERTY VALUE SOURCE
-
ENOMAD
pool0 size 90.9T -
-
ENOMAD
the ssdp0 in the same host is also reporting an impossible size.
-
ENOMAD
-
sommerfeld
raidz sizing reports the raw size -- unlike mirrors
-
ENOMAD
so that's the obvious thing I was missing.
-
sommerfeld
raidz overhead depends in part on the block sizes written which is impossible to predict
-
ENOMAD
I'm trying to chase down why our new fileserver is slower than our 5 year old one. Back to the drawing board.
-
sommerfeld
so it punts and reports raw capacity while inflating usage
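(a quick way to see the difference, assuming the pool is named pool0 as above:
  zpool list pool0   # SIZE here is raw capacity across all devices
  zfs list pool0     # USED/AVAIL here is usable space after parity overhead
)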
-
ENOMAD
(I was hoping I'd just made the pools wrong.)
-
sommerfeld
any ssds involved? One thing I just found was that m.2 NVMe sticks without power loss protection make slower log devices than a SATA ssd with PLP.
-
sommerfeld
I/O is faster but the cache flush op is slower.
-
ENOMAD
the pool we were testing was entirely SSD based.
-
ENOMAD
(ssdp0 in that paste)
-
ENOMAD
c3t58CE38EE22F3C9C2d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
-
ENOMAD
Vendor: KIOXIA Product: KPM7XVUG6T40 Revision: 0103 Serial No: 1E60A01G0993
-
ENOMAD
Size: 6401.25GB <6401252745216 bytes>
-
ENOMAD
that's the SSD in question.
-
ENOMAD
IIRC, they're connected via SAS.
-
sommerfeld
claims to have power loss protection so probably not the thing I ran into (with lower end hardware..)
-
ENOMAD
to check something, oldpool (also entirely SSD, but SATA-connected), has 6 devices in the VDEV while newpool has 4. I presume that isn't going to impact speed to any noticeable amount.
-
sommerfeld
might help to have a crisper definition of "speed". (The slowness I was chasing turned out to be fsync() latency; adding a log device that didn't suck made a big difference)
-
ENOMAD
sommerfeld, I'm trying to work on that definition. Problem is, oldfileserver is under load. The new one isn't.
-
ENOMAD
what we're seeing isn't really useful to quantify things, it's too many layers deep
-
ENOMAD
(VM-based database server with all disks coming as VDI from an NFS mount from the fileserver. VM's VDIs hosted on old fileserver gives X database speed. same VDIs migrated to new fileserver, database is slightly slower.)
-
ENOMAD
I want to do a speed test on the fileserver itself to see if it's the pool or NFS or something else.
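(roughly what I have in mind, run locally against the pool -- paths and sizes are just illustrative:
  fio --name=synctest --directory=/ssdp0/fiotest --rw=randrw --bs=8k --size=1g \
      --numjobs=4 --ioengine=psync --fsync=1 --time_based --runtime=60 --group_reporting
)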
-
ENOMAD
but noticed the size report thing first so thought I'd check it before digging deeper.
-
jbk
IIRC, NFS does a fair amount of synchronous writes, or I guess on zfs will exercise the log
-
ENOMAD
this pool doesn't have a dedicated log (or cache) device since it's entirely SSD already.
-
ENOMAD
(same for the existing pool.)
-
jbk
the log always exists
-
jbk
just a question of whether it's on a dedicated disk or not
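(easy to check -- the device name below is purely hypothetical:
  zpool status pool0          # a separate 'logs' section only shows up if a slog was added
  zpool add pool0 log c9t0d0  # how one would be added
)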
-
jbk
though did you say the new disks are SAS vs the old being SATA?
-
ENOMAD
that's my recollection, yes.
-
jbk
one thing I've wanted to investigate, but never had the time is queue depth.. SATA has a fixed depth (I'd need to re-read the specs, but I think it's basically either 1 or 16)
-
jbk
while SCSI doesn't really have a way to discover what it is for a disk
-
jbk
other than at some point, it might return an error that means 'queue full'
-
jbk
i've wondered (and the answer might be 'no') if sd is perhaps not always letting enough work get to SSDs
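(if someone wanted to poke at that, the sd driver's per-target command limit is its throttle, tunable via /etc/system -- the value below is just illustrative, not a recommendation:
  set sd:sd_max_throttle=256
)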
-
ENOMAD
would prtconf -v confirm it's SAS?
-
jbk
I think iostat -E shows SATA drives as 'ATA' for the vendor
-
jbk
since we pretend they're scsi drives and then translate the SCSI commands into ATA commands
-
ENOMAD
iostat -En returns a vendor name.
-
jbk
then it's probably SAS (or NVMe)
-
jbk
(NVMe would have a device path ending in 'blkdev@...' while SCSI would end with 'sd@...')
-
jbk
the diskinfo command should show all of that..
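(e.g. plain `diskinfo` lists vendor/product plus SSD and removable flags for each disk, and something like `ls -l /dev/dsk/c3t58CE38EE22F3C9C2d0s0` shows the /devices path, which ends in sd@... for SCSI/SAS vs blkdev@... for NVMe)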
-
ENOMAD
we don't have any NVMe in this. Everything is old-school hot-plug drive bays.
-
jbk
(as I just remembered :P)
-
ENOMAD
the rpool drives are SATA and show as "ATA" in the vendor field. The SAS SSDs (and the spinning rust in pool0) show the proper vendor name.
-
ENOMAD
so I guess I remembered correctly.
-
ENOMAD
I can pastebin diskinfo output if you'd like.
-
ENOMAD
jbk, is there an easy way to verify your queue depth question since I'm here with these drives?
-
ENOMAD
(with a server that isn't in production yet.)
-
ENOMAD
hmm. interesting. diskinfo says these disks aren't removable when they very much are.
-
ENOMAD
I guess that doesn't really reflect hotswap, though. Never mind.
-
jbk
it was more of an 'in general' type question.. since scsi doesn't have a way to tell you..
-
sommerfeld
diskinfo "removable" means something more like "cd, floppy, or usb stick"
-
ENOMAD
k
-
sommerfeld
unless I'm badly mistaken..
-
jbk
and there are probably a lot of things in sd.c that aren't much use anymore
-
ENOMAD
sommerfeld, the manpage agrees with you. I just spoke before reading it :)
-
jbk
(there's that whole 'prototype'
-
jbk
xbuf stuff
-
jbk
amongst other things
-
sommerfeld
So I'm wondering if there may be benefits to a slog even in an all ssd pool. But that's in part based on experience with a pool which started with mirrored m.2 nvme's as "special" vdevs (metadata) plus mirrored spinning rust for main storage. adding a bad slog took fsync latency from ~1s to 2-5ms; adding a better one took fsync latency to more like 0.5ms
-
ENOMAD
sadly, we don't have budget to add anything to this host. It's already way over budget thanks to the [expletives deleted] AI bullshit.
-
ENOMAD
the person who did the timing tests on the database notes writes were worse than reads. I presume he means in terms of the delta between current and new.
-
ENOMAD
He's smart enough to know writes will be slower than reads.
-
sommerfeld
-
ENOMAD
thanks sommerfeld. I haven't used dtrace in years so I'll need to review how to do it. I'll let you know what I find out.
-
ENOMAD
in case you're curious. I did some fio tests.
pastebin.com/uBkKGVfq has the raw data. I haven't even looked at it yet.
-
ENOMAD
looking at the run status of the two runs doesn't make me happy.
-
ENOMAD
ok, time to look into dtrace.
-
ENOMAD
there was no load on fs2 (the new server) so I'm running fio over NFS to it.
-
ENOMAD
sommerfeld, can you help me understand these results, please?
pastebin.com/S6bfkpKM
-
ENOMAD
hmm. ignore that paste. I re-ran with fio running against each server and the charts look a lot more similar.
-
jbk
just to ask, have you verified the zfs settings for the dataset are the same on both systems?
-
ENOMAD
-
ENOMAD
jbk: yes
-
ENOMAD
ok, zpool settings are the same but the directory I'm writing fio stuff to isn't. compression is off for the old fileserver but on for the new one. hmmm. hmmmmmmmmm.
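(i.e. something along these lines -- the dataset name is just a placeholder:
  zfs get compression pool0/fiotest
  zfs set compression=off pool0/fiotest
)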
-
ENOMAD
need to test that, obviously.
-
ENOMAD
jbk, it is quite possible you've not only hit the nail on the head but driven it through the board and out the other side.
-
ENOMAD
Run status group 0 (all jobs):
-
ENOMAD
READ: bw=309MiB/s (324MB/s), 6241KiB/s-11.7MiB/s (6391kB/s-12.3MB/s), io=18.4GiB (19.8GB), run=60009-61036msec
-
ENOMAD
WRITE: bw=312MiB/s (327MB/s), 6660KiB/s-12.1MiB/s (6820kB/s-12.7MB/s), io=18.6GiB (20.0GB), run=60009-61036msec
-
ENOMAD
I was absolutely certain I'd set that right. Thank you for encouraging me to recheck my work.
-
jbk
no problem -- i've had enough issues myself that I always try to do an explicit sanity check to catch the 'easy' fixes
-
ENOMAD
I thought I had checked it. Clearly I hadn't. Always good to ask the "It's obvious but..." questions.
-
jbk
yeah.. been there many times :)
-
ENOMAD
I'm very glad you folks are here. I don't have anyone here to sanity check me on these things. The joy of being the only 'nix admin in my group.
-
danmcd
Glad to help!
-
danmcd
(Though I did nothing of the sort for this time.)
-
ENOMAD
danmcd, you are a font of help.
-
danmcd
( s/help/$SOMETHING_ELSE/g I think. :D )
-
jbk
a bold thing to say
-
sommerfeld
sorry, had stepped out for a bit.
-
sommerfeld
old server: most fsyncs take 0.5-1ms. new server: most fsyncs take 16-32ms.
-
sommerfeld
directionally similar to what I was seeing on my new server and which I fixed by adding a pair of old sata ssds with power loss protection.
-
danmcd
Your fsync test is on-local-disk? Or over NFS @sommerfeld ?
-
sommerfeld
on-local-disk.
-
sommerfeld
or, well, it measures wall-clock-time taken by zfs_fsync()
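(roughly along these lines, though not my exact script:
  dtrace -n 'fbt::zfs_fsync:entry { self->ts = timestamp }
    fbt::zfs_fsync:return /self->ts/ { @["fsync ns"] = quantize(timestamp - self->ts); self->ts = 0; }'
)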
-
danmcd
Is it a simple write-a-test-program-and-time-it, or are you running dtrace on the syscall over a workload?
-
ENOMAD
sommerfeld, I never updated the chart. The re-run on oldserver looks a lot more similar to the new server with fio running.
-
danmcd
(prob the latter as you beat me to the typing punch).
-
ENOMAD
I'll probably re-do with compression disabled on the new server shortly but right now we're setting up to see if the database server notices the difference.
-
sommerfeld
timing fsyncs as used by whatever the workload is.
-
sommerfeld
-
gitomat
[illumos-gate] 17960 want format_arg attribute -- Toomas Soome <tsoome⊙mc>
-
danmcd
TY
-
sommerfeld
without a slog I saw fsync latency in the 100-200ms range which was higher than I was expecting.
-
ENOMAD
here's the fio-based dtrace against both servers.
pastebin.com/0V46NHNC (I meant to paste this earlier but I think I got distracted and forgot.)
-
ENOMAD
that was from before I found the compression discrepancy.
-
ENOMAD
FTR: our database speed test results are much better now so yeah, me being stupid was the problem.
-
jbk
one thing I recall someone at oxide mentioning is that, for NVMe at least (not sure if it'd apply more broadly to SATA or SCSI SSDs or not), the reordering of I/Os by the zio scheduler wasn't beneficial and (IIRC) was actually somewhat detrimental to performance... which seems reasonable since SSDs don't have to worry about rotational latency and such
-
sommerfeld
jbk: yeah. I'd imagine there are probably cases where there would be a benefit to coalescing adjacent small-block i/o's, but it'll be relatively small compared to the benefit of an elevator sort on a real disk..
-
richlowe
is scheduling I/O wrongly actually real pain, or just something that shows up when you measure it?
-
jbk
at least some SSDs appear to use some combination of super capacitors, ram, and SLC flash to 'buffer' I/Os to the 'main' flash.. so it wouldn't surprise me if at least some of them can handle small I/Os to adjacent (or adjacent enough) blocks internally
-
richlowe
danmcd: I would be surprised if there was only that AF from LX that you probably want to reserve
-
danmcd
For now, it's the one we have. Fixing this can probably set a precedent for bizarro-brand stuff like this.
-
danmcd
Also, I wonder about some of the lesser-used AF values that are probably around only for historical purposes. (Looks askew at AF_X25...)
-
richlowe
all removing them can do is break weirdo software, right?
-
richlowe
danmcd: AF_DATAKIT!
-
richlowe
wow, and chaosnet
-
richlowe
and imp!
-
richlowe
the anachronism of AF_INET < AF_IMPLINK
-
danmcd
I've *written code* that established an X.25 call and sent data, but it was a completely different API, something more akin to DLPI or SOCK_RAW, and it was (checks calendar) 36 years ago this summer.
-
richlowe
I have written code that did something over chaosnet, but only as an historical adventure
-
danmcd
This was paid-summer-intern duty. I was supposed to *test* the aforementioned API.
-
danmcd
Found more bugs in the C compiler than I did in the API, FWIW.
-
jbk
as recent as the late 90s, you had to use X.25 to communicate with phone switches... which was fun
-
jbk
well some switches..
-
danmcd
It's why the OS/400 "Open FM API" supported Ethernet (both normal and 802.whatever), Token Ring, and X.25.
-
danmcd
-
jbk
i'm surprised it supported ethernet...
-
danmcd
It HAD to... $MOTIVATING_CUSTOMER needed it.
-
jbk
ahh
-
jbk
I see their naming is almost as bad as mainframes
-
jbk
'can you spare a vowel'? :)
-
danmcd
LLC... that was what I was trying to remember. I *can't* recall if the Ethernet support REQUIRED LLC or not. It has been 36 years.
-
jbk
heh
-
jbk
i remember trying to use dlpi with the llc2 driver on solaris 7 and having it immediately panic after opening the device :)
-
jbk
'ok then..guess I won't be messing with that'
-
tsoome
It was not that many years ago that I had to maintain a solaris cluster setup which was fetching call records for accounting over X.25 and feeding them to a database :D
-
jbk
i think i mentioned this before, but one of my coworkers had a system like that, except it was using sun's SNA (cue screams) stack to talk to an IBM mainframe, since it had copies of all the call records (prior to this, there was a group that had to manually load 9-track 6250 bpi tapes sent from the various phone companies)
-
jbk
it was amusing in that after the mainframe did some upgrades, it couldn't communicate at the full 1.54Mbps (T1)
-
jbk
but the mainframe is never at fault :)
-
jbk
(the fix was to clock the circuit down to 768k)
-
tsoome
oops:)
-
jbk
but at one point, my coworker had the direct # for whoever the engineer at Sun was that wrote (or at least maintained at that point) the SNA stack..
-
jbk
because it took quite a while to figure that out