-
jbk
github.com/illumos/illumos-gate/blo…c/io/mp_platform_common.c#L838-L840 <-- would anyone get too upset with a change that only prints that message if Lint != 1 (the value we expect)?
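(A hedged sketch of the shape of that change; the function, variable, and message text below are illustrative placeholders, not the actual lines at the link.)

    #include <sys/types.h>
    #include <sys/cmn_err.h>

    /*
     * Illustrative only: emit the notice only when the LINT value is not
     * the expected 1, and stay quiet in the common case.
     */
    static void
    report_lint(uint_t lint)
    {
            if (lint != 1)
                    cmn_err(CE_NOTE, "!unexpected LINT value %u", lint);
    }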
-
sommerfeld
that seems like an entirely reasonable change..
-
jbk
i'll at least file a bug on it..
-
jbk
and I have found a legitimate bug in the iommu code, though it's one that doesn't happen unless there's more than one segment
-
jbk
if IMMU_MAXSEG is > 1, it'll try to create duplicate instances
-
jbk
as I discovered :)
-
sommerfeld
unit=0 should move outside the for (i=0 ; ..) loop, presumably?
-
jbk
yeah that's what I'm trying
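(For context, a minimal sketch of the bug pattern being discussed; the structure and names are illustrative, not the actual immu code. Re-initializing the unit counter inside the per-segment loop restarts the numbering for every segment, so with IMMU_MAXSEG > 1 the second segment tries to create instances with ids the first segment already used.)

    #define IMMU_MAXSEG     2       /* > 1 to expose the problem */

    /* hypothetical helper standing in for the real instance creation */
    extern void create_drhd_unit(int seg, int idx, int unit);

    void
    setup_units(int ndrhd[IMMU_MAXSEG])
    {
            int unit = 0;   /* the fix: initialize once, before the segment loop */
            int i, d;

            for (i = 0; i < IMMU_MAXSEG; i++) {
                    /* the bug: a "unit = 0;" here restarts numbering per segment */
                    for (d = 0; d < ndrhd[i]; d++)
                            create_drhd_unit(i, d, unit++);
            }
    }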
-
jbk
are there any resources for decoding PCIe errors -- i'm thinking 'ereport.io.pciex.rc.ce-msg' implies a correctable error event
-
jbk
but it'd be nice to be able to decode the particulars
-
jbk
(in this case it seems to be originating from an NVMe disk)
-
Denis
hello all. I've got a couple questions regarding name resolution in illumos
-
Denis
1. I work on software that has a test which tries to resolve "host.invalid" as an Internet hostname
-
Denis
the test runs fine on other OSes (the hostname does not resolve, which is the intended scenario, because the test verifies that the code handles the failure to resolve correctly)
-
Denis
on illumos the test times out because the timeout is rather low (around 0.1s): the attempt to resolve the hostname does indeed result in NXDOMAIN, but it sometimes takes around 0.5-1.0s
-
Denis
which to me seems like the illumos host does not treat the hostname as special (see RFC 6761 Section 6.4 point 3) and queries the caching resolver, which also does not recognise the name as special (point 4 ibid.), which then ends up as a regular query
-
Denis
which takes the time to resolve for the first time, then serves NXDOMAIN fast until the negative response expires, at which point it incurs another hiccup, and so on
-
Denis
(which is point 5 ibid.)
-
Denis
with this in mind, would it make sense to implement point 3 in the illumos name resolver library? everything that ends in ".invalid." would, just before it goes to a resolver, instead produce an immediate "host not found"
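(A minimal sketch of what such a short-circuit could look like; this is a hypothetical helper, not existing illumos code. The caller would fail the lookup immediately instead of sending any query.)

    #include <string.h>
    #include <strings.h>

    /*
     * Return nonzero if the name falls under the reserved .invalid TLD
     * (RFC 6761 section 6.4), so the caller can report "host not found"
     * without ever contacting a resolver.
     */
    static int
    name_is_invalid_tld(const char *name)
    {
            size_t len = strlen(name);

            /* tolerate a fully qualified form such as "host.invalid." */
            if (len > 0 && name[len - 1] == '.')
                    len--;

            if (len == 7 && strncasecmp(name, "invalid", 7) == 0)
                    return (1);
            if (len > 8 && strncasecmp(name + len - 8, ".invalid", 8) == 0)
                    return (1);
            return (0);
    }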
-
Denis
2. another test in the same software calls ether_hostton() twice: once for a lower-case hostname and once for the same hostname in upper case
-
Denis
both variants are in /etc/ethers, and on most other OSes both calls to ether_hostton() succeed
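(A minimal reproducer along the lines described; the hostname is hypothetical and both spellings would need corresponding /etc/ethers entries. These functions are documented in section 3SOCKET on illumos, so the link line is presumably cc ... -lsocket -lnsl.)

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <netinet/in.h>
    #include <netinet/if_ether.h>

    int
    main(void)
    {
            struct ether_addr lower, upper;

            /* ether_hostton() returns 0 on success, nonzero on failure */
            printf("myhost:  %s\n",
                ether_hostton("myhost", &lower) == 0 ? "ok" : "FAILED");
            printf("MYHOST:  %s\n",
                ether_hostton("MYHOST", &upper) == 0 ? "ok" : "FAILED");
            return (0);
    }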
-
Denis
however, on illumos the upper-case name randomly fails to resolve: if you run the test repeatedly, it will consistently fail for the first few minutes, then apparently some change happens somewhere and the name starts to resolve fine
-
Denis
then if you wait a few hours and repeat the same thing, it will again fail to resolve, and may not even recover later
-
Denis
this behaviour is the same as what I observed on Solaris 11.4 CBE before the most recent one (didn't test the most recent one in this regard)
-
Denis
seems to be a weird caching effect somewhere, and does not seem to be the intended behaviour
-
tsoome
I think your dns setup has some flaws. the default timeout in libresolv is 5 seconds. if you are using libresolv, the only cache involved is your dns server; otherwise disable nscd for the test.
-
Denis
for now I have exempted illumos from the tests in issue 2, but issue 1 affects more practical scenarios, so it would be nice to be able to test it as originally intended
-
tsoome
well, you can test dns with dig (and with dig @8.8.8.8, for example)
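(For example, comparing the configured resolver with a public one for the name from the earlier test; commands only, output and timings will vary.)

    dig host.invalid
    dig @8.8.8.8 host.invalid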
-
Denis
tsoome: it does not time out, it seems to resort ultimately to the root servers, which do return an NXDOMAIN, but it takes time and there is no point in sending that query in the first place, as the RFC explains
-
Denis
I agree that the caching nameserver ought to short-circuit the query, but this does not seem to be done either, and is a separate issue
-
tsoome
well, if you want a reliable test case, you cannot rely on a global name server but need to have your own local caching server. In my home setup, 8.8.8.8 does return an answer with Query time: 6 msec, and my local caching server with Query time: 0 msec
-
tsoome
that's according to dig.
-
Denis
thank you for discussing this, but the point is that it would be nice not to query a caching resolver in the first place
-
Denis
because "host.invalid" by specification must produce an NXDOMAIN, and the recommended logical shortcut is to get the same result at the host that is trying to resolve the hostname
-
Denis
in practical terms, test 1 runs fine (fails to resolve close enough to instantly) on a number of OSes, apart from sporadic timeouts on illumos, for example:
ci.tcpdump.org/#/builders/57/builds/1759
-
Denis
given a sufficient number of retries (or favourable random conditions), it passes, so it is not an acute problem
-
Denis
but to me it clearly looks like a place for a simple and useful improvement, so maybe it would be best just to make a feature request, though I decided to ask first
-
jbk
what variable in the kernel actually gets set with boot -v?
-
richlowe
boothowto
-
richlowe
& RB_VERBOSE
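(A minimal example of how that check typically looks in kernel code; the message text is illustrative.)

    #include <sys/reboot.h>         /* RB_VERBOSE */
    #include <sys/cmn_err.h>

    extern int boothowto;           /* set early in boot; "boot -v" adds RB_VERBOSE */

    static void
    maybe_note_verbose(void)
    {
            if (boothowto & RB_VERBOSE)
                    cmn_err(CE_CONT, "?verbose boot requested\n");
    }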
-
jbk
oh it's not verbose boot.. it's the debug build...
-
jbk
just to see (though maybe we should go ahead and put it up) I updated acpica to the latest release 20250807
-
jbk
but my god does it spam the kernel log
-
richlowe
so we have a few branches of acpica updates somewhat in flight
-
richlowe
the challenge is testing the damn thing
-
richlowe
oh, and that isn't to say that your changes are superfluous, just that you're probably going to run into the problem everyone else did :)
-
richlowe
if you have a variety of hardware to test _and_ those changes, though, I think a bunch of people are at least interested
-
danmcd
LOTS of changes in the category of, "LARGE TEST SPACE due to amount of HW." (side-eyes at any "Intel Common Code" updates for any Intel NIC drivers)
-
danmcd
"Pay no attention to that `ixgbe-e610` branch in github.com/danmcd/illumos-gate ... NONE I tell you. I haven't even BOOTED it yet, but it does compile..."
-
danmcd
(E610 is an ixgbe extension, thankfully. How much of a PITA it is remains to be seen. I wish Intel had released E610 instead of E700 10Gbit a long time ago.)
-
jbk
at the same time.. you either have to figure out where the line is and accept the risk, or decide not to do it...
-
jbk
i suspect most of the issues are likely to be with very old machines (just because otherwise they'd probably break windows or at least linux)
-
jbk
since AFAIK they're using the same common code we'd be using
-
jbk
if we know we need to test on x,y,z we can work on trying to get that to happen
-
jbk
if it's just some amorphous list of systems, it's not going to happen
-
rzezeski
Well, it's more complicated than that (works on windows/linux, must be fine) in some cases. E.g., i40e common code updates may also require a newer NVM image. If your hardware is not flashed with that image you now have effectively dead parts (ask me how I know). If you want to update that image you need to boot into Linux/FreeBSD and use the tools provided by intel to do so.
-
danmcd
I thought you could update it as an EFI binary.
-
rzezeski
danmcd: we have no way to update it AFAIK
-
danmcd
s/as an/with an/^
-
danmcd
Also I thought your mobo-vendor could help there too.
-
rzezeski
and pretty sure Robert confirmed as much to me recently (or maybe in this very room)
-
danmcd
Oh oh oh... I'm thinking onboard i40e, not PCIe card i40e.
-
rzezeski
yea I'm talking standalone
-
danmcd
I wouldn't buy an i40e card, but many folks are stuck with 'em on their mobos.
-
danmcd
My advice to people who want > 10Gbit is "Chelsio or Mellanox, and just jump straight to 100Gbit".
-
rzezeski
I have "dead" i40e cards in my office because of our last update to the common code
-
danmcd
OUCH OUCH OUCH.
-
nomad
rzezeski, dead as in fried? How'd that happen?
-
rzezeski
nomad: no, dead as in our driver won't attach to it because the last common code update gated itself on a newer NVM image which my parts from circa 2017/2018 do not have
-
nomad
ah
-
nomad
well... if you're looking for someone who might be able to make use of them, I've got old stuff in $HOMELAB that might be happy :)
-
rzezeski
my point is that while updating common code that is already running in the wild on other operating systems is generally acceptable (to me at least), it might still cause heartburn in other ways, especially since we don't have the first class support for third-party tools like the other operating systems have
-
» nomad nods
-
richlowe
do we have information in the PCI space to attach different drivers to different parts in that case?
-
richlowe
not that that is a particularly tidy solution
-
» nomad is ~1 hr from vacation.
-
nomad
What is this concentration people are so interested in?
-
rzezeski
richlowe: took me a second to realize what you are getting at, I don't remember how the NVM version number is determined (and I don't feel like looking that up right now), but even if we could technically do such a trick it would mean carrying two different source code versions (or I guess carrying feature flags throughout the code). I feel like it could get really messy.
-
richlowe
yeah, no, that's what I meant by not tidy
-
richlowe
but I know joyent, for example, went that route once with mr/dr_sas just for perceived safety
-
rzezeski
oh interesting, yea it's a neat idea that I've never thought of
-
andyf_
We introduced some mediators to allow people to switch between a couple of drivers, maybe that was lmrc and cpqary.. something like that.
-
rzezeski
I guess it really depends on how much it diverges and how often the churn is. In this case it maybe wouldn't be terrible since the common code updates aren't too frequent.
-
rzezeski
But I guess we didn't really break anyone but myself given no one ever showed up to complain
-
jbk
i mean, if we wanted to be snarky.. we could probably use ebpf on linux to figure out exactly what intel's tool is sending to the driver and recreate that :P
-
jbk
hah
-
jbk
there's an option immu-dmar-print
-
jbk
except it's not _quite_ hooked up
-
jbk
it sets immu_dmar_print
-
jbk
which is otherwise never referenced, and instead there's another 'dmar_print' that's actually used
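(A hedged sketch of the mismatch; only the two variable names come from the chat, the surrounding code is illustrative rather than the actual immu source. One plausible fix is simply to test the flag the option actually sets.)

    #include <sys/types.h>

    /* set from the "immu-dmar-print" boot option */
    boolean_t immu_dmar_print = B_FALSE;

    /* tested by the print path today, but never set from the option */
    static boolean_t dmar_print = B_FALSE;

    extern void print_dmar_table(void);     /* hypothetical print routine */

    static void
    maybe_print_dmar(void)
    {
            if (immu_dmar_print)            /* was: if (dmar_print) */
                    print_dmar_table();
    }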
-
jbk
(also.. 'dmar' sounds like something from star trek)
-
richlowe
rzezeski: I tried to update your test-runner bug, but I think the AI loons ate it: I get definite warning messages when I run the tests saying the missing ones have "failed verification"
-
gitomat
[illumos-gate] 17534 fct: buffer freed to wrong cache -- Vitaliy Gusev <gusev.vitaliy⊙gc>
-
jbk
hrm..
-
jbk
does a pci host bridge have a pci bus/device/function or is that strictly for the 'stuff' behind the host bridge?
-
Denis
in Linux lspci my PCI bridges show up with various BDF values
-
Denis
the function is sometimes 0 and sometimes other values
-
Denis
the "host bridge", however, is 00:00.0, that must be the domain root or whatever is the proper term for it