-
alanc
I rode through the desert on a device with no name, it felt good to get out of the rain
-
SarahMalik
a bicycle?
-
gitomat
[illumos-gate] 16338 pynfs/nfs4.1: COUR2 st_courtesy.testLockSleepLock FAILURE -- Vitaliy Gusev <gusev.vitaliy⊙gc>
-
gitomat
[illumos-gate] 17495 NFSv4.2 unsupported operations missing reply encoding -- Gordon Ross <gwr⊙rc>
-
gitomat
[illumos-gate] 17515 ucodeadm could handle binary AMD microcode files -- Andy Fiddaman <illumos⊙fn>
-
danmcd_
Is Carsten Grzemba on here ? (Maybe under a nickname that I probably know but didn't connect with $REAL_NAME ?)
-
Smithx10
We are thinking about doing better operations monitoring so we know when zpools have issues / drive errors, and maybe also exporting information about the host. I saw github.com/oxidecomputer/kstat-rs, which I imagine could be used to expose metrics for some kind of illumos exporter for prometheus. Did oxide or anyone write any fault management / zfs libs to be used from rust ?
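For the kstat half, a hedged sketch of what a collector loop over the zfs:arcstats kstat might look like with kstat-rs; the exact crate API (Ctl::new, Ctl::iter, Ctl::read, the Data and Named types) should be checked against the crate docs before relying on any of this:

```rust
// Sketch only: walk the kstat chain and print ZFS ARC counters, which
// a Prometheus exporter could expose as gauges/counters.
use kstat_rs::{Ctl, Data};

fn main() {
    let ctl = Ctl::new().expect("failed to open /dev/kstat");
    // `kstat -m zfs -n arcstats` is the CLI equivalent of this filter.
    for mut ks in ctl
        .iter()
        .filter(|k| k.ks_module == "zfs" && k.ks_name == "arcstats")
    {
        if let Ok(Data::Named(fields)) = ctl.read(&mut ks) {
            for f in fields {
                // One line per named statistic, e.g. "zfs_arcstats_hits ...".
                println!("zfs_arcstats_{} {:?}", f.name, f.value);
            }
        }
    }
}
```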
-
jbk
i think your question is actually two different questions
-
jbk
kstats are just stats
-
jbk
fault management is/should be receiving fault events and determining if a component is faulty (and if so, marking it)
-
jbk
that data may include kstats, but the system itself has a mechanism for things to generate a fault event, which is distinct from bumping any stats
-
Smithx10
How would I listen to those events from rust / go to get informed ?
-
Smithx10
I really just want to get away from running sdc-oneachnode to know when disks / pools may not be good.
-
jbk
the only 'official' way currently is either running fmadm, or running net-snmp with the fmd module and doing an snmp-get on the table... i don't know if oxide has written a wrapper for the fmd libraries or not (they're 'private', but that just means 'these may change in a way that makes you have to patch or recompile your code'), and distros can also manage that
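A minimal sketch of that unofficial route, polling the CLIs from Rust; note that fmadm and zpool usually need elevated privileges, and their human-readable output is not a committed interface:

```rust
// Sketch: poll fmadm(8) and zpool(8) rather than linking against the
// private fmd libraries.
use std::process::Command;

fn run(cmd: &str, args: &[&str]) -> std::io::Result<String> {
    let out = Command::new(cmd).args(args).output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}

fn main() -> std::io::Result<()> {
    // `fmadm faulty` prints nothing when no faults are outstanding.
    let faults = run("/usr/sbin/fmadm", &["faulty"])?;
    if !faults.trim().is_empty() {
        println!("outstanding FMA faults:\n{faults}");
    }
    // `zpool status -x` reports only pools that have problems.
    let pools = run("/usr/sbin/zpool", &["status", "-x"])?;
    if !pools.contains("all pools are healthy") {
        println!("pool problems:\n{pools}");
    }
    Ok(())
}
```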
-
jbk
having said that, there is some room for improvement with zfs and fmd
-
jbk
i've been wanting to do it for a while, but getting sufficient time with hardware (especially for testing -- that's probably far more challenging than the work itself)
-
jbk
has meant i haven't done it yet
-
sommerfeld
for real testing of the upper layers of the fault management you probably want to augment bhyve's drivers with virtualized brokenness/fault injection.
-
jbk
HDDs at least (though I suspect SSDs -- either as a SCSI device or NVMe device -- are probably similar enough) may have a bad 'spot' in them while the rest of the device is fine, and they have some amount of 'replacement' blocks
-
jbk
today, if the drive can't read from a block, it returns an error
-
jbk
zfs interprets that as 'the whole disk is bad'
-
jbk
and will kick in a spare if available
-
jbk
what would arguably be better is
-
jbk
to mark that block as bad (drives can do this automatically) and, if sufficient pool redundancy is available (mirrors, raidz, ...), self heal after the block(s) have been marked bad
-
jbk
as the drive will remap those to some of the reserve blocks
-
jbk
and most vendors will tell you how many bad blocks means 'bad drive'
-
jbk
(zfs only self heals if the drive returns success, but with bad data)
-
jbk
at least today
-
Smithx10
jbk where can i read about the fmd module.... having this available via snmp would be decent.
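For the SNMP route, a hedged sketch that shells out to snmpwalk from Rust; the table name SUN-FM-MIB::sunFmProblemTable, the community string, and the snmpd configuration are assumptions to verify against the MIB that ships with illumos:

```rust
// Sketch: walk the fmd net-snmp module's problem table. Assumes the
// fmd module is loaded in snmpd and SUN-FM-MIB is installed where
// net-snmp can find it.
use std::process::Command;

fn main() -> std::io::Result<()> {
    let out = Command::new("snmpwalk")
        .args(["-v2c", "-c", "public", "localhost",
               "SUN-FM-MIB::sunFmProblemTable"])
        .output()?;
    // Each row describes one outstanding FMA problem.
    print!("{}", String::from_utf8_lossy(&out.stdout));
    Ok(())
}
```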
-
jbk
oh, there is also a process that will send an event via smtp
-
jbk
annoyingly, i'm not seeing any man pages for any of this (though i just might have missed them)
-
jbk
also https://github.com/illumos/illumos-gate/tree/master/usr/src/cmd/fm/notify/smtp-notify/common has the source for the smtp notification service
-
jbk
that could be used as a guide for talking to fmd via its library interface (which should be doable from rust, I would think)
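As a very rough starting point for that, an FFI sketch against libfmevent, the library the notification daemons build on; every signature, constant, and class name below is an assumption transcribed from memory of <fm/libfmevent.h> and must be checked against the header, since the interface is private and can change:

```rust
// Rough sketch only: all declarations are assumptions to verify against
// <fm/libfmevent.h>. The library is private and lives under /usr/lib/fm,
// so linking needs something like `-L/usr/lib/fm -lfmevent`.
use std::ffi::{c_char, c_void, CStr, CString};

type FmevShdl = *mut c_void; // opaque subscription handle (fmev_shdl_t)
type Fmev = *mut c_void;     // opaque event (fmev_t)

#[link(name = "fmevent")]
extern "C" {
    // fmev_shdl_t fmev_shdl_init(uint32_t, alloc, zalloc, free);
    // Passing null allocator pointers here assumes defaults are used.
    fn fmev_shdl_init(
        version: u32,
        alloc: *const c_void,
        zalloc: *const c_void,
        free: *const c_void,
    ) -> FmevShdl;
    // fmev_err_t fmev_shdl_subscribe(fmev_shdl_t, const char *, cb, void *);
    fn fmev_shdl_subscribe(
        hdl: FmevShdl,
        class: *const c_char,
        cb: unsafe extern "C" fn(Fmev, *const c_char, *mut c_void, *mut c_void),
        arg: *mut c_void,
    ) -> u32;
}

// Invoked by the event transport for each matching fault event.
unsafe extern "C" fn on_event(
    _ev: Fmev,
    class: *const c_char,
    _nvl: *mut c_void,
    _arg: *mut c_void,
) {
    println!("fault event: {}", CStr::from_ptr(class).to_string_lossy());
}

fn main() {
    // Assumption: the real LIBFMEVENT_VERSION_* value must come from the
    // header; 2 is a placeholder.
    const LIBFMEVENT_VERSION: u32 = 2;
    unsafe {
        let hdl = fmev_shdl_init(LIBFMEVENT_VERSION,
            std::ptr::null(), std::ptr::null(), std::ptr::null());
        assert!(!hdl.is_null(), "fmev_shdl_init failed");
        // "list.suspect" is the class fmd publishes for new suspect lists.
        let class = CString::new("list.suspect").unwrap();
        fmev_shdl_subscribe(hdl, class.as_ptr(), on_event, std::ptr::null_mut());
        loop {
            std::thread::sleep(std::time::Duration::from_secs(60));
        }
    }
}
```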
-
gitomat
[illumos-gate] 17517 loader: errors with pointer conversion -- Toomas Soome <tsoome⊙mc>