08:55:50 <vetal> sjorge: It looks like your application performs a lot of lock/unlock syscalls. NFSv4 Delegation eliminates it.
08:57:33 <vetal> sjorge: I assume if you try NFSv3 mount from client side you will have the same issue. BTW could you do it?
09:00:34 <sjorge> the media mounts have ACLs on them, and I never got them to work with nfs3, which is also why server_versmin=4
09:03:54 <sjorge> but yes, I'd assume it would try to lock/unlock the db file for every insert/update; without delegation those end up being separate calls
09:08:41 <sjorge> oops, plex did not like me trying to mount it as nfs3
09:08:53 <sjorge> it deleted most stuff and recreated a bunch
09:09:06 <sjorge> time to restore from a snapshot
09:09:36 <sjorge> ah that makes sense, it can't read the media mounts on nfs3 so it assumes it's been deleted and nukes everything
09:09:37 <vetal> sjorge: Hmm. Did it log something ?
09:10:10 <vetal> sjorge: It can't read them due to permissions?
09:10:22 <sjorge> no, but when the UI loaded everything had the X over it and it was slowly cleaning it up
09:10:58 <sjorge> like when I delete media it also does that, since plex does not have access without the ACLs being there
09:11:29 <sjorge> it assumes it's gone and tries to clean up
09:11:36 <sjorge> seems stuck again though
09:12:52 <sjorge> ah yeah, time to boot into single-user mode, as fstab now has nfs 3 in it; plex just starts doing it again after rebooting the vm
09:13:17 <sjorge> it's actually probably easier to also rollback to a snapshot, let's see if i have a recent one
09:14:07 <vetal> sjorge: It looks like plex remembers the last-used nfs options?
09:14:53 <sjorge> no, I edited fstab
09:15:19 <sjorge> so when plex wrecked itself I killed the vm, restored a snapshot and booted the vm
09:15:44 <sjorge> so it mounted it with nfs3 again and plex has the same issue: it can't access the media and starts deleting
09:17:25 <vetal> sjorge: Do I understand right that you have only one nfs client? Or several nfs clients and connections?
09:19:24 <sjorge> doh just setting minver back to 4 fixed it as now it can't mount anything, let me fix fstab
09:19:36 <sjorge> It depends, the media ds and some others have multiple
09:19:50 <sjorge> The dataset for /var/lib/plex... just one client
09:20:19 <sjorge> Since it makes no sense to have multiple clients try to use that
09:21:13 <sjorge> I wonder if I can trigger the behavior with just sqlite3, as plex uses their own modified variant of that; that would be a nice test case
09:22:25 <vetal> sjorge: I wonder whether locking is really needed, if only one application accesses the files.
09:22:50 <sjorge> is delegation just not supported on anything that is not 4.0 (spec wise)?
09:23:11 <sjorge> I mean if we document somewhere that it's only supported on 4.0 it's probably OK then
09:23:15 <vetal> sjorge: Or is there a possibility that several applications can access the same file simultaneously?
09:23:44 <vetal> sjorge: are you asking about nfsv4.1?
09:23:46 <sjorge> In theory yes, if plex crashes and doesn't cleanup properly one process could still have the db open while the new one tries to start
09:24:19 <sjorge> Since server_delegation=off has it broken on 4.0 and 4.1, but server_delegation=on only on 4.1
09:24:21 <vetal> sjorge: sorry if it crashes all processes are killed then?
09:25:05 <sjorge> vetal no, plex forks a bunch of things so if one of those crashes/gets stuck plex just spawns another one.
09:25:13 <vetal> sjorge: "Since server_delegation=off has it broken on 4.0 and 4.1" Sorry, didn't get it.
09:25:29 <sjorge> Not sure since plex is closed source, not sure if multiple of those try to open the db
09:25:48 <sjorge> vetal you had me test 4.0 with delegation=off and plex showed the same behavior as on 4.1, where it hangs.
09:26:27 <sjorge> So it might just be the lock overhead without delegation that is the issue for my workload, if delegation is not supported on 4.1, documenting it should be enough to get the stuff back in gate right?
09:26:55 <vetal> sjorge: Ok, but what does "server_delegation=on only on 4.1" mean?
09:27:28 <sjorge> That with server_delegation=on, with 4.0 there is no issue, only with 4.1 is the issue present
09:27:43 <vetal> sjorge: Or you wanted to write "server_delegation=on only on 4.0" ?
09:29:39 <vetal> sjorge: "if delegation is not supported on 4.1" - In the protocol it is supported, but the recently added code does not include it yet.
09:30:04 <sjorge> I see
09:30:25 <sjorge> I think the issue with plex and some of my other workloads is that they all have sqlite-like databases in the /var/lib I have on NFS
09:30:34 <sjorge> And the locking overhead just trips them up when they can't do delegation
09:31:20 <sjorge> I guess documenting it somewhere that delegation is only supported on 4.0 would be enough? (Or better maybe not negotiate 4.1 when server_delegation=on ?)
09:31:45 <vetal> sjorge: Yes, that was my opinion, do not do anything with code until it is clear where the issue is. 
09:32:27 <vetal> sjorge: "better maybe not negotiate 4.1 when server_delegation=on" - 4.1 protocol can work w/o delegation.
09:33:40 <vetal> sjorge: For instance, VMware ESX works: it locks a VM (via a .lock file) once when it is started, so it doesn't need delegation.
09:34:41 <vetal> sjorge: Your case just increases the importance of adding backchannel and delegation support to the nfsv4.1 implementation.
09:35:07 <sjorge> It's still unexpected that a workload which works on 4.0 might suddenly not after the initial 4.1 merge
09:35:24 <sjorge> I certainly wasn't aware mine was delegation dependent
09:35:50 <sjorge> Perhaps having something like https://www.illumos.org/issues/15871 go in with it, so I can at least set server_versmax=4.0 and not have to edit my fstabs, would be desirable
09:35:51 <fenix> → FEATURE 15871: nfs server_versmin and server_versmax should allow specifying a minor (New)
09:36:40 <vetal> sjorge: Right. I think we need to set the max minorversion to 0 for now and keep doing development. If people want to use 4.1 or 4.2 they can manually set it via sharectl.
09:37:15 <sjorge> Yes, that would prevent unpleasant surprises but still allow people that want it an easy way to opt in
09:37:28 <vetal> sjorge: Agree. 
09:37:52 <sjorge> And if we document the gotcha that server_delegation is not yet implemented for 4.1, that should cover most bases I think.
09:38:08 <sjorge> It looks like currently the server_vers{min,max} only take a major version though
09:38:26 <vetal> sjorge: Could you try to disable locking for some time with a 4.1 mount? It could help to understand whether locking is the issue.
09:39:25 <vetal> sjorge: Of course, re-enable it after the experiment is finished.
09:43:58 <sjorge> Seems 'nolock' is ignored with vers=4.0 or higher ?
09:45:16 <sjorge> [hyperon :: sjorge][~/Desktop]
09:45:16 <sjorge> [■]$ tshark -r vers40sqlite.pcap| grep -i lock | wc -l
09:45:16 <sjorge>       20
09:45:16 <sjorge> [hyperon :: sjorge][~/Desktop]
09:45:16 <sjorge> [■]$ tshark -r vers41sqlite.pcap| grep -i lock | wc -l
09:45:17 <sjorge>      696
09:45:36 <sjorge> I managed to replicate the behavior with just vanilla sqlite
09:46:13 <sjorge> https://gist.github.com/sjorge/ea61edcf5f36e26721fb6121cc4710e7
09:46:27 <sjorge> vers40 shows only 20 locks, while with vers41 it shows >600 locks
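For readers without access to the gist, a minimal sketch of a comparable workload follows. It is not sjorge's test.sql: the table name, row count and database path are made up for illustration. It relies only on the fact that SQLite takes and releases its file locks around every transaction, so running one transaction per INSERT on an NFS-backed file should produce many lock/unlock round trips when the client holds no delegation.

```python
import os
import sqlite3
import tempfile


def run_workload(db_path, rows=100):
    """Insert `rows` rows, one transaction per INSERT, and return the row count."""
    # isolation_level=None puts the sqlite3 module in autocommit mode, so each
    # INSERT is its own transaction and SQLite locks/unlocks the db file each time.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        # Hypothetical schema, only here to have something to write to.
        conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
        for i in range(rows):
            conn.execute("INSERT INTO t (v) VALUES (?)", (f"row-{i}",))
        return conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical path: point this at a file on the NFS mount under test.
    path = os.path.join(tempfile.mkdtemp(), "test.db")
    print(run_workload(path))
```

Capturing traffic while this runs against each mount variant should allow the same kind of comparison as the tshark counts above.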
09:48:03 <vetal> sjorge: Can you try using "local_lock=all"?
09:49:40 <sjorge> no effect
09:50:31 <sjorge> It saw the opt at least according to mount -v
09:50:38 <sjorge> But it ended up as local_lock=none
09:50:39 <sjorge> https://gist.github.com/sjorge/b2622bf6d1921bfd79aee26c937b8024
09:51:35 <vetal> sjorge: "nfsvers=4.0,noresvport,fsc,local_lock=all" - you use 4.0 ?
09:52:23 <sjorge> I tested with both
09:52:28 <sjorge> 4.0 and 4.1
09:52:36 <sjorge> Neither takes the local_lock=all or nolock options
10:02:23 <vetal> sjorge: Well. Those are only for nfsv3.
10:03:53 <vetal> sjorge: Thanks!
10:08:23 <sjorge> I did manage to switch the mount I was using for my little test.sql to vers3
10:08:25 <sjorge> [hyperon :: sjorge][~]
10:08:25 <sjorge> [■]$ tshark -r vers3sqlite.pcap | grep -ic lock
10:08:25 <sjorge> 836
10:08:25 <sjorge> [hyperon :: sjorge][~]
10:08:25 <sjorge> [■]$ tshark -r vers3locallocksqlite.pcap | grep -ic lock
10:08:25 <sjorge> 2
10:09:00 <sjorge> with local_lock=all it only saw 2 lock requests, but that could have been from another mount as that vm has a few too
10:09:44 <sjorge> (note that is on a completely different VM, to avoid confusion)
10:10:18 <sjorge> But since I am comparing sqlite3 test.db < test.sql in all 4 of the tshark greps it should show a consistent picture
10:11:33 <sjorge> So: vers=3,lock,local_lock=all => (nearly) no locks (like I said, the VM still has other mounts too); vers=3,lock => lots of lock calls; vers=4.0 + delegation => very few lock calls; vers=4.1 => lots of locks
10:14:10 <sjorge> Let me try that with everything else unmounted to be sure, give me a bit as that's a lot of work
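The per-capture counts above come from `grep -ic lock`, which counts lines containing "lock" in any case. A small sketch of the same counting in Python is shown here for comparing several captures at once. The sample lines are invented for illustration and are not real tshark output, and the function names are hypothetical.

```python
def count_lock_lines(lines):
    """Count lines containing 'lock' case-insensitively, like `grep -ic lock`."""
    return sum(1 for line in lines if "lock" in line.lower())


def compare(captures):
    """Map each capture name to its lock-line count."""
    return {name: count_lock_lines(lines) for name, lines in captures.items()}


if __name__ == "__main__":
    # Invented sample lines, standing in for `tshark -r <file>.pcap` output.
    captures = {
        "with_locks": ["call LOCK", "reply LOCK", "call LOCKU", "call GETATTR"],
        "no_locks": ["call GETATTR", "call WRITE"],
    }
    for name, count in compare(captures).items():
        print(name, count)
```

As with the grep, this is a substring match, so unlock operations and any unrelated line that happens to contain "lock" are counted too. That is fine for a relative comparison of the same workload across mounts.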
10:27:19 <sjorge> https://gist.github.com/sjorge/8df5121cc8a006370bb75f8ff25b9756 OK so that indeed shows a clear picture
10:27:34 <sjorge> server_delegation=on for all of these
10:28:06 <sjorge> 3 + local_lock=all => 0 lock calls
10:28:13 <sjorge> 4.0 + delegation => 0 lock calls
10:28:21 <sjorge> 4.1 or 3 with locking => lots of lock calls
10:28:44 <sjorge> Since the actions in test.sql are the same the lock call count also lines up which makes sense I guess
10:29:12 <sjorge> i'm off for a while now
19:49:54 <jclulow> I think that makes sense, given the symptoms.  Delegations are pretty critical for client performance in general