-
gitomat
[illumos-gate] 16289 gss_mechs: left shift of negative value -- Toomas Soome <tsoome⊙mc>
-
gitomat
[illumos-gate] 16281 libnsl: left shift of negative value -- Toomas Soome <tsoome⊙mc>
-
gitomat
[illumos-gate] 16292 gss_mechs: symbol 'asn1buf_insert_octet' is multiply-defined -- Toomas Soome <tsoome⊙mc>
-
gitomat
[illumos-gate] 16293 libsldap: passing argument 1 to 'restrict'-qualified parameter aliases with argument 4 -- Toomas Soome <tsoome⊙mc>
-
danmcd
@tsoome_ --> I'd prefer richlowe look at 16301 (sgs related).
-
tsoome_
I figured so:)
-
gitomat
[illumos-gate] 16295 nc: 'restrict'-qualified parameter aliases with argument -- Toomas Soome <tsoome⊙mc>
-
gitomat
[illumos-gate] 16296 snoop: 'restrict'-qualified parameter aliases with argument 4 -- Toomas Soome <tsoome⊙mc>
-
gitomat
[illumos-gate] 16298 mem: 'restrict'-qualified parameter aliases with argument 4 -- Toomas Soome <tsoome⊙mc>
-
KungFuJesus
jbk: sorry, I realize this conversation has like 1.5 day latency, but what am I missing to make this fully light up as failed automatically?
-
danmcd
Thanks richlowe
-
jclulow
KungFuJesus: The lights are generally managed by FMA, and for the fault light to be on, the system needs to think that the disk is broken and match that up with an FMA case as far as I know. So we'd need to look at things like: why do you think the disk is faulty, and why does the system apparently not think that; also, what actions have you taken, etc, since determining the disk is faulty.
-
richlowe
do you need topo for the chassis too?
-
jclulow
Earlier they mentioned they have a SES enclosure, which is self-describing
-
jclulow
Also that the lights _are_ able to be turned on, manually, we just turn them off later.
-
jclulow
There is often no good way to ask the firmware in these things "is your light on already?" so we tend to resort to flushing out new LED state periodically
-
richlowe
oh, I missed that
-
jclulow
In the limit, if you want to be able to manually override the LED state, we'll need to come up with some way to allow that -- to mediate the desires of both the operator _and_ the fault management stuff, basically
-
jclulow
Which is totally possible but is not a mechanism that exists today
-
richlowe
unrelated: I want to split pkg:/system/kernel/platform, because it's a mess if there's more than one. Do people prefer pkg:/system/kernel/platform/{i386,aarch64} (cf. SUNWcakr.x), or two distinct manifests that each deliver the same FMRI?
-
richlowe
rather than the mess of $(i386_ONLY) that happens otherwise
-
richlowe
you can see the mess, in fact, even _without_ 2 platforms.
-
richlowe
technically there's a 3rd option: make pkg:/system/kernel/platform i386-only, and other platforms deliver pkg:/system/kernel/platform/<name>, but I hate that one
-
gitomat
[illumos-gate] 16301 sgs: 'restrict'-qualified parameter aliases with argument 4 -- Toomas Soome <tsoome⊙mc>
-
andyf
I think I prefer a separate FMRI for each. There is some precedent with pkg:/driver/i86pc/platform for the arch to come before 'platform' though
-
andyf
I definitely support splitting it up!
-
KungFuJesus
jclulow: so, is fmd the thing that's actually turning the LED back off?
-
KungFuJesus
so if that's the case, fmd apparently then isn't properly connecting that bay and disk. What's weird is that I used fmtopo -V to tell me the faulted disk in the pool
-
KungFuJesus
fmd reported the disk as faulted and fmtopo knows the bay it's in
-
KungFuJesus
one thing that perhaps makes it weird is that it's going through VHCI for multipathing. Is that perhaps part of the problem?
-
KungFuJesus
unfortunately at the moment the disk is already being replaced, so we maybe lost our chance at a guinea pig opportunity
-
KungFuJesus
jclulow: the fault event:
pastebin.com/7dnpKUhi
-
KungFuJesus
wtf, it even names the slot in the event, lol
-
KungFuJesus
at this point I don't know what else I could give fmd to know any better than it does
-
jclulow
KungFuJesus: That's good information!
-
KungFuJesus
indeed, why doesn't it know how to act on it?
-
jclulow
An excellent question! This fmd module is supposed to drive the lights based on that sort of thing:
github.com/illumos/illumos-gate/tre…c/cmd/fm/modules/common/disk-lights
-
jclulow
So you have a FRU with the right shaped FMRI: hc://:product-id=SMC-SC846P:server-id=:chassis-id=5003048017d2467f:serial=8DK8LE1H:part=HGST-HUH721212AL4200:revision=A3D0/ses-enclosure=0/bay=12/disk=0
-
jclulow
i.e., a DISK under a BAY
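
(A rough sketch of what "a DISK under a BAY" means for the FMRI string quoted above. This is plain string splitting for illustration, not how libtopo or fmd parse FMRIs; the helper names `parse_hc_fmri` and `is_disk_under_bay` are hypothetical, and real hc FMRIs can carry extra parts that this ignores.)

```python
def parse_hc_fmri(fmri):
    """Split an hc-scheme FMRI string into (authority dict, path list).

    Simplified illustration only: assumes the layout
    hc://:key=value:key=value/name=instance/name=instance
    """
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI")

    # Everything before the first "/" is the authority, the rest is the path.
    authority_str, _, path_str = fmri[len(prefix):].partition("/")

    authority = {}
    for field in authority_str.split(":"):
        if field:  # skip the empty piece before the leading ":"
            key, _, value = field.partition("=")
            authority[key] = value

    path = []
    for component in path_str.split("/"):
        if not component:
            continue
        name, _, instance = component.partition("=")
        path.append((name, int(instance)))

    return authority, path


def is_disk_under_bay(path):
    """True if the last two path components are bay=N/disk=M."""
    return len(path) >= 2 and path[-2][0] == "bay" and path[-1][0] == "disk"


# The FRU FMRI from the fault event discussed above.
FRU = ("hc://:product-id=SMC-SC846P:server-id=:chassis-id=5003048017d2467f"
       ":serial=8DK8LE1H:part=HGST-HUH721212AL4200:revision=A3D0"
       "/ses-enclosure=0/bay=12/disk=0")

authority, path = parse_hc_fmri(FRU)
print(path)
print(is_disk_under_bay(path))
```

The path ends in `bay=12` followed by `disk=0`, which is the shape the lights logic would need in order to find the bay that owns the faulted disk.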
-
jclulow
I think maybe the disconnect here is that we've got a "fault.fs.zfs.vdev.io" fault
-
jclulow
But the disk-lights module is subscribing to fault.io.{disk,scsi}.*
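
(A minimal sketch of the mismatch, assuming subscription patterns behave like shell-style globs over the dotted class name. That glob behaviour is an assumption made for illustration, not a description of fmd's matcher. The two patterns are the `fault.io.{disk,scsi}.*` subscriptions written out in full, and `fault.io.disk.example` is a made-up class name.)

```python
from fnmatch import fnmatchcase

# The two subscriptions mentioned above, written out in full.
SUBSCRIPTIONS = ["fault.io.disk.*", "fault.io.scsi.*"]


def matching_subscriptions(fault_class, patterns):
    """Return the patterns that a given fault class would match."""
    return [p for p in patterns if fnmatchcase(fault_class, p)]


# A made-up disk-level class matches one subscription.
print(matching_subscriptions("fault.io.disk.example", SUBSCRIPTIONS))

# The vdev-level class from the fault event matches none of them.
print(matching_subscriptions("fault.fs.zfs.vdev.io", SUBSCRIPTIONS))

# Adding a subscription for the vdev class would at least deliver the event.
print(matching_subscriptions("fault.fs.zfs.vdev.io",
                             SUBSCRIPTIONS + ["fault.fs.zfs.vdev.io"]))
```

Under that assumption the vdev fault never reaches the module at all. Whether delivering it is enough to light the bay is a separate question.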
-
KungFuJesus
so, modify the config file to point to fault.fs.zfs.vdev.io?
-
jclulow
Well
-
jclulow
I am not sure
-
jclulow
But perhaps!
-
KungFuJesus
hah, could it hurt?
-
jclulow
Well, anything could hurt haha -- but in this case I think the likelihood is low
-
jclulow
It doesn't look like the actual C code has anything specific about the fault classes,
github.com/illumos/illumos-gate/blo…disk-lights/disk_lights.c#L188-L195
-
jclulow
So maybe subscribing it to more fault classes will be enough, if fmd_nvl_fmri_has_fault() manages to link up the vdev fault with the disk FMRI -- but I don't know if it will
-
jclulow
It's also possible that the problem is at a much higher level, and that we should have some kind of translation of a vdev-level fault into a disk-level fault
-
jclulow
But yeah I think this is where the disconnect is!
-
jclulow
I also think this will be reproducible without your specific system
-
jclulow
But you could preserve the contents of /var/fm right now to be sure that we have all the data
-
jclulow
(it has all the log data and stuff that fmd uses to make these diagnoses)
-
KungFuJesus
sure thing - the disk itself has been removed from the pool and is being replaced but I suspect that won't matter if it's historically logged
-
KungFuJesus
I'll tar it up, where should I send it to?
-
richlowe
jclulow: thanks for making me look at what `fmsim` does :\
-
jclulow
KungFuJesus: how big is the tar.gz
-
KungFuJesus
~2MB
-
jclulow
KungFuJesus: I think you can just upload it to this issue as an attachment probably:
illumos.org/issues/16353
-
fenix
→
BUG 16353: disk fault lights should come on for ZFS vdev-level faults (New)
-
KungFuJesus
cool, uploaded
-
richlowe
andyf: (and others interested), my current preference is to treat platform as a variable, that is, s/platform/i86pc/ s/platform/aarch64/ s/platform/raspberry-pi-4/ in those (and other?) packages.
-
richlowe
looking at *platform*.p5m, it looks the most reasonable
-
jclulow
richlowe: I would like to be able to install _just_ the oxide parts and not the i86pc parts, if that helps
-
jclulow
(oxide being an i86pc-level peer in the taxonomy)
-
jclulow
Which I suspect means we should codify some kind of facet structure for pulling those in, as well
-
jclulow
Although I suppose it's possible nothing explicitly depends on /system/kernel/platform today so maybe that won't be needed
-
richlowe
jclulow: In the model I suggested above, oxide would be system/kernel/oxide
-
richlowe
oxide as a moral equivalent to a raspberry-pi-4
-
richlowe
except a touch more expensive
-
richlowe
so system/kernel/i86pc system/kernel/i86xpv(?) system/kernel/oxide system/kernel/aarch64 system/library/i86pc
-
richlowe
etc.
-
richlowe
literally replace the word "platform" with the platform
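
(A small sketch of that substitution, purely to illustrate the naming idea. The helper `per_platform_names` is hypothetical and has nothing to do with how pkg(7) or the gate's manifests work; the platform names are the ones listed above.)

```python
def per_platform_names(generic_name, platforms):
    """Replace the 'platform' path component with each platform name."""
    components = generic_name.split("/")
    return [
        "/".join(plat if c == "platform" else c for c in components)
        for plat in platforms
    ]


print(per_platform_names("system/kernel/platform",
                         ["i86pc", "oxide", "aarch64"]))
print(per_platform_names("system/library/platform", ["i86pc"]))
```

Only a path component that is exactly `platform` is replaced, so a name like `system/kernel/cpu-counters` would pass through unchanged.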
-
richlowe
and split the shitty ones as necessary (that's just system/kernel/platform)
-
sommerfeld
rhetorical question: what else lives under system/kernel/foo that isn't a platform?
-
richlowe
oh, hmmmm, specific kernel features.
-
richlowe
system/kernel/cpu-counters for eg
-
jclulow
Yeah I think you want /system/kernel/platform/$platform
-
richlowe
yeah, and then that really needs to be /system/kernel/$platform/platform, maybe, to follow the majority of the mess we've made
-
richlowe
see driver/i86pc/* etc.
-
jclulow
Apparently there was once a /system/kernel/platform/netra so this is not really new
-
richlowe
I hate naming stuff
-
richlowe
though not as much as I hate that package
-
richlowe
jclulow: oh, in that case I can go back to how I did it at first!
-
richlowe
that's good to know
-
jclulow
But yeah we currently, when building an oxide-specific ramdisk, end up just removing a lot of the stuff from that (and similar) packages anyway:
github.com/oxidecomputer/helios/blo…gimlet/ramdisk-02-trim.json#L28-L35
-
richlowe
yeah, hopefully I can split this in a way that makes that better.
-
jclulow
Incremental improvements to making it easier to do that by just installing the bits you need would be awesome
-
richlowe
I'll do what I can, I just hate having a manifest that is literally two entirely distinct packages :)