-
alanc
I rode through the desert on a device with no name, it felt good to get out of the rain
-
SarahMalik
a bicycle?
-
gitomat
[illumos-gate] 16338 pynfs/nfs4.1: COUR2 st_courtesy.testLockSleepLock FAILURE -- Vitaliy Gusev <gusev.vitaliy⊙gc>
-
gitomat
[illumos-gate] 17495 NFSv4.2 unsupported operations missing reply encoding -- Gordon Ross <gwr⊙rc>
-
gitomat
[illumos-gate] 17515 ucodeadm could handle binary AMD microcode files -- Andy Fiddaman <illumos⊙fn>
-
danmcd_
Is Carsten Grzemba on here ? (Maybe under a nickname that I probably know but didn't connect with $REAL_NAME ?)
-
Smithx10
We are thinking about doing better operations monitoring so we know when zpools have issues / drive errors, and maybe also exporting information about the host. I saw github.com/oxidecomputer/kstat-rs, which I imagine could be used to expose metrics for some kind of illumos exporter for prometheus. Did oxide or anyone write any fault management / zfs libs to be used from rust ?
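For the kstat half, a hedged sketch of what a collector loop over the zfs:arcstats kstat might look like with kstat-rs; the exact crate API (Ctl::new, Ctl::iter, Ctl::read, the Data and Named types) should be checked against the crate docs before relying on any of this:

```rust
// Sketch only: walk the kstat chain and print ZFS ARC counters, which
// a Prometheus exporter could expose as gauges/counters.
use kstat_rs::{Ctl, Data};

fn main() {
    let ctl = Ctl::new().expect("failed to open /dev/kstat");
    // `kstat -m zfs -n arcstats` is the CLI equivalent of this filter.
    for mut ks in ctl
        .iter()
        .filter(|k| k.ks_module == "zfs" && k.ks_name == "arcstats")
    {
        if let Ok(Data::Named(fields)) = ctl.read(&mut ks) {
            for f in fields {
                // One line per named statistic, e.g. "zfs_arcstats_hits ...".
                println!("zfs_arcstats_{} {:?}", f.name, f.value);
            }
        }
    }
}
```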
-
jbk
i think your question is actually two different questions
-
jbk
kstats are just stats
-
jbk
fault management is/should be receiving fault events and determining if a component is faulty (and if so, marking it)
-
jbk
that data may include kstats, but the system itself has a mechanism for things to generate a fault event, which is distinct from bumping any stats
-
Smithx10
How would I listen to those events from rust / go to get informed ?
-
Smithx10
I really just want to get away from running sdc-oneachnode to know when disks / pools may not be good.
-
jbk
the only 'official' way currently is either running fmadm, or running net-snmp with the fmd module and doing an snmp-get on the table... i don't know if oxide has written a wrapper for the fmd libraries or not (they're 'private', but that just means 'these may change in a way that makes you have to patch or recompile your code'), and distros can also manage that
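A minimal sketch of that unofficial route, polling the CLIs from Rust; note that fmadm and zpool usually need elevated privileges, and their human-readable output is not a committed interface:

```rust
// Sketch: poll fmadm(8) and zpool(8) rather than linking against the
// private fmd libraries.
use std::process::Command;

fn run(cmd: &str, args: &[&str]) -> std::io::Result<String> {
    let out = Command::new(cmd).args(args).output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}

fn main() -> std::io::Result<()> {
    // `fmadm faulty` prints nothing when no faults are outstanding.
    let faults = run("/usr/sbin/fmadm", &["faulty"])?;
    if !faults.trim().is_empty() {
        println!("outstanding FMA faults:\n{faults}");
    }
    // `zpool status -x` reports only pools that have problems.
    let pools = run("/usr/sbin/zpool", &["status", "-x"])?;
    if !pools.contains("all pools are healthy") {
        println!("pool problems:\n{pools}");
    }
    Ok(())
}
```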
-
jbk
having said that, there is some room for improvement with zfs and fmd
-
jbk
i've been wanting to do it for a while, but getting sufficient time with hardware (especially for testing -- that's probably far more challenging than the work itself)
-
jbk
has meant i haven't done it yet
-
sommerfeld
for real testing of the upper layers of the fault management you probably want to augment bhyve's drivers with virtualized brokenness/fault injection.
-
jbk
HDDs at least (though I suspect SSDs -- either as a SCSI device or NVMe device -- are probably similar enough) may have a bad 'spot' in them while the rest of the device is fine, and they have some amount of 'replacement' blocks
-
jbk
today, if the drive can't read from a block, it returns an error
-
jbk
zfs interprets that as 'the whole disk is bad'
-
jbk
and will kick in a spare if available
-
jbk
what would arguably be better is
-
jbk
to mark that block as bad (drives can do this automatically) and, if sufficient pool redundancy is available (mirrors, raidz, ...), self heal after the block(s) have been marked bad
-
jbk
as the drive will remap those to some of the reserve blocks
-
jbk
and most vendors will tell you how many bad blocks means 'bad drive'
-
jbk
(zfs only self heals if the drive returns success, but with bad data)
-
jbk
at least today
-
Smithx10
jbk where can i read about the fmd module.... having this available via snmp would be decent.
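For the SNMP route, a hedged sketch that shells out to snmpwalk from Rust; the table name SUN-FM-MIB::sunFmProblemTable, the community string, and the snmpd configuration are assumptions to verify against the MIB that ships with illumos:

```rust
// Sketch: walk the fmd net-snmp module's problem table. Assumes the
// fmd module is loaded in snmpd and SUN-FM-MIB is installed where
// net-snmp can find it.
use std::process::Command;

fn main() -> std::io::Result<()> {
    let out = Command::new("snmpwalk")
        .args(["-v2c", "-c", "public", "localhost",
               "SUN-FM-MIB::sunFmProblemTable"])
        .output()?;
    // Each row describes one outstanding FMA problem.
    print!("{}", String::from_utf8_lossy(&out.stdout));
    Ok(())
}
```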
-
jbk
oh, there is also a process that will send an event via smtp
-
jbk
annoyingly, i'm not seeing any man pages for any of this (though i just might have missed them)
-
jbk
also https://github.com/illumos/illumos-gate/tree/master/usr/src/cmd/fm/notify/smtp-notify/common has the source for the smtp notification service
-
jbk
that could be used as a guide for talking to fmd via its library interface (which should be doable from rust, I would think)
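As a very rough starting point for that, an FFI sketch against libfmevent, the library the notification daemons build on; every signature, constant, and class name below is an assumption transcribed from memory of <fm/libfmevent.h> and must be checked against the header, since the interface is private and can change:

```rust
// Rough sketch only: all declarations are assumptions to verify against
// <fm/libfmevent.h>. The library is private and lives under /usr/lib/fm,
// so linking needs something like `-L/usr/lib/fm -lfmevent`.
use std::ffi::{c_char, c_void, CStr, CString};

type FmevShdl = *mut c_void; // opaque subscription handle (fmev_shdl_t)
type Fmev = *mut c_void;     // opaque event (fmev_t)

#[link(name = "fmevent")]
extern "C" {
    // fmev_shdl_t fmev_shdl_init(uint32_t, alloc, zalloc, free);
    // Passing null allocator pointers here assumes defaults are used.
    fn fmev_shdl_init(
        version: u32,
        alloc: *const c_void,
        zalloc: *const c_void,
        free: *const c_void,
    ) -> FmevShdl;
    // fmev_err_t fmev_shdl_subscribe(fmev_shdl_t, const char *, cb, void *);
    fn fmev_shdl_subscribe(
        hdl: FmevShdl,
        class: *const c_char,
        cb: unsafe extern "C" fn(Fmev, *const c_char, *mut c_void, *mut c_void),
        arg: *mut c_void,
    ) -> u32;
}

// Invoked by the event transport for each matching fault event.
unsafe extern "C" fn on_event(
    _ev: Fmev,
    class: *const c_char,
    _nvl: *mut c_void,
    _arg: *mut c_void,
) {
    println!("fault event: {}", CStr::from_ptr(class).to_string_lossy());
}

fn main() {
    // Assumption: the real LIBFMEVENT_VERSION_* value must come from the
    // header; 2 is a placeholder.
    const LIBFMEVENT_VERSION: u32 = 2;
    unsafe {
        let hdl = fmev_shdl_init(LIBFMEVENT_VERSION,
            std::ptr::null(), std::ptr::null(), std::ptr::null());
        assert!(!hdl.is_null(), "fmev_shdl_init failed");
        // "list.suspect" is the class fmd publishes for new suspect lists.
        let class = CString::new("list.suspect").unwrap();
        fmev_shdl_subscribe(hdl, class.as_ptr(), on_event, std::ptr::null_mut());
        loop {
            std::thread::sleep(std::time::Duration::from_secs(60));
        }
    }
}
```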
-
gitomat
[illumos-gate] 17517 loader: errors with pointer conversion -- Toomas Soome <tsoome⊙mc>