00:10:11 I rode through the desert on a device with no name, it felt good to get out of the rain
01:04:36 a bicycle?
11:12:48 [illumos-gate] 16338 pynfs/nfs4.1: COUR2 st_courtesy.testLockSleepLock FAILURE -- Vitaliy Gusev
11:12:48 [illumos-gate] 17495 NFSv4.2 unsupported operations missing reply encoding -- Gordon Ross
11:28:03 [illumos-gate] 17515 ucodeadm could handle binary AMD microcode files -- Andy Fiddaman
14:05:43 Is Carsten Grzemba on here? (Maybe under a nickname that I probably know but didn't connect with $REAL_NAME?)
18:41:16 We are thinking about doing some better operations monitoring around knowing if zpools have issues / have drive errors, and maybe also exporting information about the host. I saw https://github.com/oxidecomputer/kstat-rs, which I imagine could be used to expose metrics for some kind of illumos exporter for prometheus. Did oxide or anyone write any fault management / zfs libs to be used from rust?
18:46:55 i think your question is actually two different questions
18:47:06 kstats are just stats
18:48:20 fault management is/should be receiving fault events and determining whether a component is faulty (and if so, marking it)
18:49:05 that data may include kstats, but the system itself has a mechanism for things to generate a fault event, which is distinct from bumping any stats
18:49:54 How would I listen to those events from rust / go to get informed?
18:50:33 I really just want to get away from running sdc-oneachnode to know when disks / pools may be going bad.
18:52:54 the only 'official' way currently is either running fmadm, or running net-snmp w/ the fmd module and doing an snmp-get on the table... i don't know if oxide has written a wrapper to the fmd libraries or not (they're 'private', but that just means 'these may change in a way that requires you to patch or recompile your code'), though distros can also manage that
18:53:27 having said that, there is some room for improvement with zfs and fmd
18:53:51 i've been wanting to do it for a while, but getting sufficient time with hardware (especially for testing -- that's probably far more challenging than the work itself)
18:54:08 has meant i haven't done it yet
18:55:07 for real testing of the upper layers of the fault management you probably want to augment bhyve's drivers with virtualized brokenness / fault injection.
18:55:09 HDDs at least (though I suspect SSDs -- either as a SCSI device or an NVMe device -- are probably similar enough) may have a bad 'spot' while the rest of the device is fine, and they have some number of 'replacement' blocks
18:55:38 today, if the drive can't read from a block, it returns an error
18:55:47 zfs interprets that as 'the whole disk is bad'
18:55:57 and will kick in a spare if available
18:56:15 what would arguably be better is
18:57:05 to mark that block as bad (drives can do this automatically) and, if sufficient pool redundancy is available (mirrors, raidz, ...), self-heal after the block(s) have been marked bad
18:57:14 as the drive will remap those to some of the reserve blocks
18:57:52 and most vendors will tell you how many bad blocks means 'bad drive'
18:58:31 (zfs only self-heals if the drive returns success, but with bad data)
18:58:35 at least today
19:05:53 jbk where can i read about the fmd module.... having this available via snmp would be decent.
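
[A minimal sketch of the poll-the-CLIs approach described above (18:52:54), for feeding a prometheus-style exporter: it just shells out to zpool status -x and fmadm faulty and turns the results into booleans. The function names, metric names, and reliance on the exact "all pools are healthy" string are illustrative assumptions rather than a stable interface, and fmadm generally needs elevated privileges.]

    use std::process::Command;

    /// True if `zpool status -x` reports every pool healthy.
    /// (Relies on the well-known "all pools are healthy" output.)
    fn pools_healthy() -> std::io::Result<bool> {
        let out = Command::new("zpool").args(["status", "-x"]).output()?;
        Ok(String::from_utf8_lossy(&out.stdout).trim() == "all pools are healthy")
    }

    /// True if `fmadm faulty` prints nothing, i.e. fmd has no active faults.
    fn no_active_faults() -> std::io::Result<bool> {
        let out = Command::new("fmadm").arg("faulty").output()?;
        Ok(out.stdout.iter().all(|b| b.is_ascii_whitespace()))
    }

    fn main() -> std::io::Result<()> {
        // In a real exporter these would become gauge metrics; here they are
        // just printed in a Prometheus-ish text format.
        println!("zpool_all_healthy {}", pools_healthy()? as u8);
        println!("fma_no_active_faults {}", no_active_faults()? as u8);
        Ok(())
    }
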
19:12:11 oh, there is also a process that will send an event via smtp as well
19:12:43 annoyingly, i'm not seeing any man pages for any of this (though i just might have missed them)
19:13:21 https://github.com/illumos/illumos-gate/tree/master/usr/src/lib/fm/libfmd_snmp/mibs has the SNMP MIBs
19:13:58 also, https://github.com/illumos/illumos-gate/tree/master/usr/src/cmd/fm/notify/smtp-notify/common has the source for the smtp notification service
19:14:27 that could be used as a guide for talking to fmd via its library interface (which should be doable from rust, I would think)
21:58:55 [illumos-gate] 17517 loader: errors with pointer conversion -- Toomas Soome
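
[A related sketch for the SNMP route mentioned at 19:13:21, assuming the net-snmp agent is running with the fmd module loaded and SUN-FM-MIB from the linked tree is on the MIB search path. sunFmProblemTable is the table name as remembered from that MIB and should be checked against the MIB files; the community string and host are placeholders.]

    use std::process::Command;

    fn main() -> std::io::Result<()> {
        // Walk the FMA problem table exposed by the fmd SNMP module.
        // "public" and "localhost" are placeholder community/host values.
        let out = Command::new("snmpwalk")
            .args(["-v2c", "-c", "public", "-m", "SUN-FM-MIB",
                   "localhost", "sunFmProblemTable"])
            .output()?;
        // Each row in the walk corresponds to an open problem; an empty walk
        // means fmd is not reporting any active problems over SNMP.
        print!("{}", String::from_utf8_lossy(&out.stdout));
        Ok(())
    }
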