-
ptribble
Our implementation of posix_fallocate(3c) returns EINVAL unless the underlying filesystem is ufs
-
ptribble
This sort of makes sense - on zfs, copy-on-write makes any guarantee of allocation worthless
-
ptribble
I tripped over this recently, as Java can use posix_fallocate to place the Java heap on a disk-based file rather than memory
-
ptribble
Interestingly, it works on Solaris, which confused me somewhat
-
ptribble
I wondered if there were opinions on whether implementing it would be regarded as sensible, or if we just say it's unsupported
-
alanc
really? I could have sworn Solaris returned EINVAL for posix_fallocate on zfs for exactly the reason you mention
-
ptribble
It used to, and indeed used to be explicit in the man page about what filesystems were supported.
-
ptribble
But the man page no longer says 'ufs only' in 11.4 at least, and the Java functionality appears to work fine
-
alanc
but apparently I'm thinking of the underlying F_ALLOCSP fcntl, which our posix_fallocate calls, and then if the fcntl returns EINVAL to indicate the file system doesn't support it, we fake it using an F_FREESP fcntl instead
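For readers following along, the "try to allocate, fake it on EINVAL" behaviour described here can be approximated from application code. This is a minimal sketch under stated assumptions, not the Solaris or illumos libc source: the helper name `reserve_or_extend` is hypothetical, and it uses the portable ftruncate() in place of the F_FREESP fcntl.

```c
#ifndef _POSIX_C_SOURCE
#define	_POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not libc code: try a real preallocation first and,
 * if the filesystem refuses with EINVAL, only extend the file length so
 * that a later mmap() of [0, len) is valid.
 * Returns 0 on success or an errno value, like posix_fallocate().
 */
int
reserve_or_extend(int fd, off_t len)
{
	struct stat st;
	int err;

	/* Reject bad arguments ourselves so the fallback cannot mask them. */
	if (len <= 0)
		return (EINVAL);

	err = posix_fallocate(fd, 0, len);
	if (err != EINVAL)
		return (err);	/* success, or a failure we should not hide */

	/* Fallback: no blocks are reserved, the file is just made long enough. */
	if (fstat(fd, &st) != 0)
		return (errno);
	if (st.st_size >= len)
		return (0);	/* never shrink an existing file */
	if (ftruncate(fd, len) != 0)
		return (errno);
	return (0);
}
```

The fallback path gives no guarantee that later writes will find space; it only makes the file the expected size.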
-
ptribble
So, basically, be economical with the truth to keep applications happy
-
alanc
the bug discussion mentions that with a COW filesystem like ZFS, actually allocating blocks would increase the chance of failure by using up blocks you may need for the real write later, but at least doing the equivalent of ftruncate() means you can mmap() the expected size
-
alanc
yep
-
jclulow
Yeah I think that's unfortunate but reasonable. In general, software that strongly believes in Preallocation For Performance is going to take the failure to fallocate and write zeroes out anyway
-
jclulow
see also: PostgreSQL and their unfortunate WAL behaviour
-
alanc
it also mentions that (at least back in 2012) glibc also did a bit of fakery for filesystems that didn't support the linux syscall for this, and would write out zero bytes the old fashioned way, which is neither good for performance nor for allocations on COW filesystems
-
richlowe
this all checks out, but it feels like the greater "we" should maybe like, talk to people about that?
-
richlowe
because it's clearly not great, and surely ZFS has enough traction that people will 100% care even though we're weird
-
alanc
sourceware.org/git/?p=glibc.git;a=b…b;f=sysdeps/posix/posix_fallocate.c - apparently its fallback is "write one null byte to every block", which seems awful
-
richlowe
that feels like it probably works on ancient filesystems
-
richlowe
and ... I guess hurts less on zfs?
-
richlowe
but it's definitely a solution that given two ok-ish options, does neither
-
jbk
jclulow: they did at least eventually add a flag to disable that behavior
-
jbk
even if it felt like 'now that I've thought of it, it's a brilliant idea' :)
-
jclulow
jbk: Yes, I was there while Jerry and dap were convincing folks to do it haha
-
jbk
i remember them arguing for it, but meeting resistance and the actual addition of the flag happening later
-
jclulow
richlowe: I don't think writing a single 0 into every block is going to be materially different from writing 0 into the whole block on ZFS
-
alanc
openzfs/zfs #326 has thoughts from a ZFS perspective
-
richlowe
jclulow: I thought the variable block size would at least cause ZFS to give away _less_ COW space
-
richlowe
(though obviously, if compression is on, neither matters?)
-
jclulow
Perhaps that's true! I'm not sure.
-
jclulow
Compression would certainly change things and probably make it a wash if it was indeed all zeroes
-
richlowe
jclulow: fwiw, the 2nd comment alan just linked is exactly what I just DM'd you surprised nobody made the ZFS team do :)
-
alanc
I suspect the difference between writing a single zero byte vs. a whole zero block is mainly copyin() performance for the system call, and bytes transmitted for the NFS case, while producing the same underlying changes on disk
-
jclulow
Right
-
ptribble
Just to capture this, I've opened illumos.org/issues/16887
-
fenix
→ FEATURE 16887: Status of posix_fallocate (New)
-
richlowe
thanks
-
richlowe
I've filed #16888 (fenix?) so I don't lose it. If anyone has battled this and won with a better workaround than /* begin/end cstyled */, I'd love to hear.
-
fenix
BUG 16888: cstyle(1ONBLD) can't handle C11 static assertions with continuation lines (New)
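For context, the workaround mentioned above presumably looks something like the sketch below. It assumes the comment pair meant is `/* BEGIN CSTYLED */` and `/* END CSTYLED */`, which cstyle treats as a region to skip; the struct `wire_hdr` and the function `wire_hdr_size` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-the-wire header used only for this illustration. */
struct wire_hdr {
	uint32_t wh_magic;
	uint32_t wh_len;
};

/*
 * A C11 static assertion that spans a continuation line, wrapped so that
 * the style checker ignores the region entirely.
 */
/* BEGIN CSTYLED */
_Static_assert(sizeof (struct wire_hdr) == 8,
    "wire_hdr must stay 8 bytes to match the wire format");
/* END CSTYLED */

/* Trivial accessor so the size can also be checked at run time. */
size_t
wire_hdr_size(void)
{
	return (sizeof (struct wire_hdr));
}
```

The cost of this approach is that nothing between the two comments is style-checked at all, which is why a better workaround is being asked for.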
-
fenix