-
vetal
sjorge: It looks like your application performs a lot of lock/unlock syscalls. NFSv4 Delegation eliminates it.
-
vetal
sjorge: I assume if you try NFSv3 mount from client side you will have the same issue. BTW could you do it?
-
sjorge
the media mounts have ACLs on them, I never got it to work with nfs3, which is also why server_versmin=4
-
sjorge
but yes, I'd assume it would try to lock/unlock the db file for every insert/update; without delegation those end up being separate calls
-
sjorge
oops, plex did not like me trying to mount it as nfs3
-
sjorge
it deleted most stuff and recreated a bunch
-
sjorge
time to restore from a snapshot
-
sjorge
ah that makes sense, it can't read the media mounts on nfs3 so it assumes it's been deleted and nukes everything
-
vetal
sjorge: Hmm. Did it log something ?
-
vetal
sjorge: You can't due to permission?
-
sjorge
no, but when the UI loaded everything had the X over it and it was slowly cleaning it up
-
sjorge
like when i delete media it also does it, since plex does not have access without the ACLs being there
-
sjorge
it assumes it's gone and tries to clean up
-
sjorge
seems stuck again though
-
sjorge
ah yeah, time to boot single user mode, as fstab now has nfs3 in it, plex just starts doing it again after rebooting the vm
-
sjorge
it's actually probably easier to also rollback to a snapshot, let's see if i have a recent one
-
vetal
sjorge: It looks like plex remembers last used nfs options?
-
sjorge
no, I edited fstab
-
sjorge
so when plex wrecked itself i killed the vm, restored a snapshot and booted the vm
-
sjorge
so it mounted it with nfs3 again and plex has the same issue: it can't access the media and starts deleting
-
vetal
sjorge: Do I understand right that you have only one nfs client? Or several nfs clients and connections?
-
sjorge
doh just setting minver back to 4 fixed it as now it can't mount anything, let me fix fstab
-
sjorge
It depends, the media ds and some others have multiple
-
sjorge
The dataset for /var/lib/plex... just one client
-
sjorge
Since it makes no sense to have multiple client try and use that
-
sjorge
I wonder if I can trigger the behavior with just sqlite3, as plex uses their own modified variant of that; that would be a nice test case
-
vetal
sjorge: I wonder whether locking is really needed if only one application accesses the files.
-
sjorge
is delegation just not supported on anything that is not 4.0 (spec wise)?
-
sjorge
I mean if we document somewhere that it's only supported on 4.0 it's probably OK then
-
vetal
sjorge: Or is there a possibility that several applications can access the same file simultaneously?
-
vetal
sjorge: are you asking about nfsv4.1?
-
sjorge
In theory yes, if plex crashes and doesn't cleanup properly one process could still have the db open while the new one tries to start
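The scenario described here, a stale process still holding the database while a new one starts, is exactly what SQLite's file locks guard against. A minimal local sketch, assuming the default rollback-journal mode and using two connections in one process as stand-ins for the two Plex processes (the function and table names are hypothetical):

```python
import sqlite3


def second_writer_is_blocked(db_path):
    """Return True if a second connection cannot start a write
    transaction while the first one still holds the write lock."""
    # isolation_level=None: we issue BEGIN ourselves.
    # timeout=0: fail immediately instead of retrying on a busy database.
    first = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        first.execute("BEGIN IMMEDIATE")       # stand-in for the stale process
        first.execute("INSERT INTO t VALUES (1)")
        try:
            second.execute("BEGIN IMMEDIATE")  # stand-in for the new process
        except sqlite3.OperationalError:
            return True                        # SQLite reports the database as locked
        second.execute("ROLLBACK")
        return False
    finally:
        first.close()
        second.close()
```

Without those locks the second writer would proceed and both could modify the file at once, which is why simply turning locking off is only safe when a single process ever touches the database.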
-
sjorge
Since server_delegation=off has it broken on 4.0 and 4.1, but server_delegation=on only on 4.1
-
vetal
sjorge: sorry, if it crashes, are all processes killed then?
-
sjorge
vetal no, plex forks a bunch of things so if one of those crashes/gets stuck plex just spawns another one.
-
vetal
sjorge: "Since server_delegation=off has it broken on 4.0 and 4.1" Sorry, didn't get it.
-
sjorge
Not sure since plex is closed source, not sure if multiple of those try to open the db
-
sjorge
vetal you had me test 4.0 with delegation=off and plex showed the same behavior as on 4.1, where it hangs.
-
sjorge
So it might just be the lock overhead without delegation that is the issue for my workload, if delegation is not supported on 4.1, documenting it should be enough to get the stuff back in gate right?
-
vetal
sjorge: Ok, but what does "server_delegation=on only on 4.1" mean?
-
sjorge
That with server_delegation=on, with 4.0 there is no issue, only with 4.1 is the issue present
-
vetal
sjorge: Or you wanted to write "server_delegation=on only on 4.0" ?
-
vetal
sjorge: "if delegation is not supported on 4.1" - In the protocol it is supported, but the recently added code does not implement it yet.
-
sjorge
I see
-
sjorge
I think the issue with plex and some of my other workloads is that they all have sqlite like databases in the /var/lib I have on NFS
-
sjorge
And the locking overhead just trips them up when they can't do delegation
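To make the overhead concrete: SQLite on Unix uses POSIX advisory locks taken through `fcntl()`. On an NFS mount without a delegation (and without local locking) each lock and unlock has to go to the server. The sketch below is not SQLite's actual locking protocol, only a simplified stand-in that takes one exclusive lock per write and counts the requests; `locked_appends` is a hypothetical helper name.

```python
import fcntl
import os


def locked_appends(path, records):
    """Append each record under its own exclusive lock.

    Returns the number of lock plus unlock requests issued."""
    lock_calls = 0
    with open(path, "ab") as f:
        for record in records:
            fcntl.lockf(f, fcntl.LOCK_EX)   # lock request
            lock_calls += 1
            f.write(record)
            f.flush()
            os.fsync(f.fileno())            # push the write out before unlocking
            fcntl.lockf(f, fcntl.LOCK_UN)   # unlock request
            lock_calls += 1
    return lock_calls
```

With N small writes this issues 2N lock requests, so a workload of many tiny transactions pays a server round trip for each of them whenever the client cannot handle the locks locally.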
-
sjorge
I guess documenting it somewhere that delegation is only supported on 4.0 would be enough? (Or better maybe not negotiate 4.1 when server_delegation=on ?)
-
vetal
sjorge: Yes, that was my opinion, do not do anything with code until it is clear where the issue is.
-
vetal
sjorge: "better maybe not negotiate 4.1 when server_delegation=on" - 4.1 protocol can work w/o delegation.
-
vetal
sjorge: For instance, VMware ESX works: it locks the VM (via a .lock file) once when it is started, so it doesn't need delegation.
-
vetal
sjorge: Your case just increases the importance of adding backchannel and delegation support to the nfsv4.1 implementation.
-
sjorge
It's still unexpected if your workload works on 4.0 and then suddenly might not after the initial 4.1 merge
-
sjorge
I certainly wasn't aware mine was delegation dependent
-
sjorge
Perhaps having something like illumos.org/issues/15871 go in with it, so I can at least set server_versmax=4.0 and not have to edit my fstabs, would be desirable
-
fenix
→ FEATURE 15871: nfs server_versmin and server_versmax should allow specifying a minor (New)
-
vetal
sjorge: Right. I think we need to set the max minorversion to 0 for now and keep doing development. If people want to use 4.1 or 4.2 they can manually set it via sharectl.
-
sjorge
Yes, that would prevent unpleasant surprise but still allow people that want it an easy way to opt in
-
vetal
sjorge: Agree.
-
sjorge
And if we document the gotcha that server_delegation is not yet implemented for 4.1 that should cover most bases I think.
-
sjorge
It looks like currently the server_vers{min,max} only take a major version though
-
vetal
sjorge: Could you try to disable locking for some time with 4.1 mount ? It could help to understand whether locking is the issue.
-
vetal
sjorge: Of course, after the experiment is finished, enable it again.
-
sjorge
Seems 'nolock' is ignored with vers=4.0 or higher ?
-
sjorge
[hyperon :: sjorge][~/Desktop]
-
sjorge
[■]$ tshark -r vers40sqlite.pcap| grep -i lock | wc -l
-
sjorge
20
-
sjorge
[hyperon :: sjorge][~/Desktop]
-
sjorge
[■]$ tshark -r vers41sqlite.pcap| grep -i lock | wc -l
-
sjorge
696
-
sjorge
I managed to replicate the behavior with just vanilla sqlite
-
sjorge
vers40 shows only 20 locks, while with vers41 it shows >600 locks
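One caveat with `grep -i lock` is that it counts any line containing the substring, whether that is a lock, an unlock, or a filename that happens to include "lock". A small sketch that does the same line count but also tallies which words matched, assuming the tshark summary output has been saved as text; the sample lines in the comment are made up for illustration and are not real tshark output:

```python
from collections import Counter
import re

# Any run of letters/underscores that contains "lock", e.g. LOCK, LOCKU, UNLOCK.
LOCK_WORD = re.compile(r"[A-Za-z_]*lock[A-Za-z_]*", re.IGNORECASE)


def count_lock_lines(lines):
    """Return (matching_line_count, Counter of matched words).

    The line count is what `grep -ic lock` would report."""
    total = 0
    words = Counter()
    for line in lines:
        found = LOCK_WORD.findall(line)
        if found:
            total += 1
            words.update(w.upper() for w in found)
    return total, words


# Hypothetical, simplified input lines:
#   "NFS V4 Call LOCK", "NFS V4 Reply LOCK", "NFS V4 Call LOCKU", "NFS V4 Call GETATTR"
```

Breaking the total down by word makes it easier to tell whether a count like 696 is mostly lock/unlock pairs or is inflated by unrelated matches.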
-
vetal
sjorge: You can try to use "local_lock=all" ?
-
sjorge
no effect
-
sjorge
It saw the opt at least according to mount -v
-
sjorge
But ended as local_lock=none
-
vetal
sjorge: "nfsvers=4.0,noresvport,fsc,local_lock=all" - you use 4.0 ?
-
sjorge
I tested with both
-
sjorge
4.0 and 4.1
-
sjorge
Neither takes the local_lock=all or nolock options
-
vetal
sjorge: Well. Those are only for nfsv3.
-
vetal
sjorge: Thanks!
-
sjorge
I did manage to switch the mount I was using for my little test.sql to vers3
-
sjorge
[hyperon :: sjorge][~]
-
sjorge
[■]$ tshark -r vers3sqlite.pcap | grep -ic lock
-
sjorge
836
-
sjorge
[hyperon :: sjorge][~]
-
sjorge
[■]$ tshark -r vers3locallocksqlite.pcap | grep -ic lock
-
sjorge
2
-
sjorge
with local_lock=all it only saw 2 lock requests, but that could have been from another mount as that vm has a few too
-
sjorge
(note that is on a completely different VM, to avoid confusion)
-
sjorge
But since I am comparing sqlite3 test.db < test.sql in all 4 of the tshark greps it should show a consistent picture
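The contents of test.sql are not shown in the log. A comparable workload can be sketched under the assumption that it consists of many small statements that each run as their own transaction, since that is the pattern that makes SQLite lock and unlock the database file repeatedly. The table and function names below are hypothetical.

```python
import sqlite3


def run_autocommit_inserts(db_path, n):
    """Run n single-statement transactions and return the resulting row count."""
    # isolation_level=None puts the connection in autocommit mode, so every
    # INSERT is its own transaction, as in a SQL script without BEGIN/COMMIT.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)"
        )
        for i in range(n):
            conn.execute("INSERT INTO t (v) VALUES (?)", (f"row {i}",))
        return conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    finally:
        conn.close()
```

Pointing `db_path` at a file on each mount in turn, while capturing traffic on the client, should give one capture per mount configuration to compare.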
-
sjorge
That vers=3,lock,local_lock=all => (nearly) no locks ( like I said the VM still has other mounts too), vers=3,lock => lots of lock calls, vers=4.0 + delegation => very few lock calls, vers=4.1 => lots of locks
-
sjorge
Let me try that with everything else unmounted to be sure, give me a bit as that's a lot of work
-
sjorge
server_delegation=on for all of these
-
sjorge
3 + local_lock=all => 0 lock calls
-
sjorge
4.0 + delegation => 0 lock calls
-
sjorge
4.1 or 3 with locking => lots of lock calls
-
sjorge
Since the actions in test.sql are the same the lock call count also lines up which makes sense I guess
-
sjorge
i'm off for a while now
-
jclulow
I think that makes sense, given the symptoms. Delegations are pretty critical for client performance in general