08:55:50 sjorge: It looks like your application performs a lot of lock/unlock syscalls. NFSv4 delegation eliminates them.
08:57:33 sjorge: I assume that if you try an NFSv3 mount from the client side you will have the same issue. BTW, could you try it?
09:00:34 the media mounts have ACLs on them, i never got it to work with nfs3, that's also why server_versmin=4
09:03:54 but yes, I'd assume it would try to lock/unlock the db file for every insert/update; without delegation those end up being separate calls
09:08:41 oops, plex did not like me trying to mount it as nfs3
09:08:53 it deleted most stuff and recreated a bunch
09:09:06 time to restore from a snapshot
09:09:36 ah that makes sense, it can't read the media mounts on nfs3 so it assumes it's been deleted and nukes everything
09:09:37 sjorge: Hmm. Did it log something?
09:10:10 sjorge: You can't due to permission?
09:10:22 no, but when the UI loaded everything had the X over it and it was slowly cleaning it up
09:10:58 like when i delete media it also does it, since plex does not have access without the ACLs being there
09:11:29 it assumes it's gone and tries to clean up
09:11:36 seems stuck again though
09:12:52 ah yeah, time to boot single-user mode, as fstab now has nfs 3 in it, plex just starts doing it again after rebooting the vm
09:13:17 it's actually probably easier to also rollback to a snapshot, let's see if i have a recent one
09:14:07 sjorge: It looks like plex remembers the last used nfs options?
09:14:53 no, I edited fstab
09:15:19 so when plex wrecked itself i killed the vm, restored a snapshot and booted the vm
09:15:44 so it mounted it with nfs3 again and plex has the same issue, it can't access the media and starts deleting
09:17:25 sjorge: Do I understand right that you have only one nfs client? Or several nfs clients and connections?
09:19:24 doh, just setting minver back to 4 fixed it as now it can't mount anything, let me fix fstab
09:19:36 It depends, the media ds and some others have multiple
09:19:50 The dataset for /var/lib/plex... just one client
09:20:19 Since it makes no sense to have multiple clients try and use that
09:21:13 I wonder if I can trigger the behavior with just sqlite3 as plex uses their own modified variant of that, that would be a nice test case
09:22:25 sjorge: I wonder whether locking is really needed if only one application accesses the files.
09:22:50 is delegation just not supported on anything that is not 4.0 (spec wise)?
09:23:11 I mean if we document that it's only supported on 4.0 somewhere it's probably OK then
09:23:15 sjorge: Or is there a possibility that several applications can access the same file simultaneously?
09:23:44 sjorge: are you asking about nfsv4.1?
09:23:46 In theory yes, if plex crashes and doesn't clean up properly one process could still have the db open while the new one tries to start
09:24:19 Since server_delegation=off has it broken on 4.0 and 4.1, but server_delegation=on only on 4.1
09:24:21 sjorge: sorry, if it crashes all processes are killed then?
09:25:05 vetal no, plex forks a bunch of things so if one of those crashes/gets stuck plex just spawns another one.
09:25:13 sjorge: "Since server_delegation=off has it broken on 4.0 and 4.1" Sorry, didn't get it.
09:25:29 Not sure since plex is closed source, not sure if multiple of those try to open the db
09:25:48 vetal you had me test 4.0 with delegation=off and plex showed the same behavior as on 4.1, where it hangs.
09:26:27 So it might just be the lock overhead without delegation that is the issue for my workload, if delegation is not supported on 4.1, documenting it should be enough to get the stuff back in gate, right?
09:26:55 sjorge: Ok, but what does "server_delegation=on only on 4.1" mean?
09:27:28 That with with server_delegation=on, with 4.0 there is no issue, only with 4.1 is the issue present
09:27:34 s/with with/with/g
09:27:43 sjorge: Or you wanted to write "server_delegation=on only on 4.0"?
09:29:39 sjorge: "if delegation is not supported on 4.1" - In the protocol it is supported, but in the recently added code it is not in the box yet.
09:30:04 I see
09:30:25 I think the issue with plex and some of my other workloads is that they all have sqlite-like databases in the /var/lib I have on NFS
09:30:34 And the locking overhead just trips them up when they can't do delegation
09:31:20 I guess documenting it somewhere that delegation is only supported on 4.0 would be enough? (Or better maybe not negotiate 4.1 when server_delegation=on ?)
09:31:45 sjorge: Yes, that was my opinion, do not do anything with the code until it is clear where the issue is.
09:32:27 sjorge: "better maybe not negotiate 4.1 when server_delegation=on" - the 4.1 protocol can work w/o delegation.
09:33:40 sjorge: For instance, VMware ESX works and locks the VM (via a .lock file) once it is started, so it doesn't need delegation.
09:34:41 sjorge: Your case just increases the importance of adding backchannel and delegation support to the nfsv4.1 implementation.
09:35:07 It's still not expected that if your workload works on 4.0 it suddenly might not after the initial 4.1 merge
09:35:24 I certainly wasn't aware mine was delegation dependent
09:35:50 Perhaps having something like https://www.illumos.org/issues/15871 go in with it so I can at least set server_versmax=4.0 to not have to edit my fstabs would be desirable
09:35:51 → FEATURE 15871: nfs server_versmin and server_versmax should allow specifying a minor (New)
09:36:40 sjorge: Right. I think we need to set the max minorversion to 0 for now and keep doing development. If people want to use 4.1 or 4.2 they can manually set it via sharectl.
09:37:15 Yes, that would prevent unpleasant surprises but still allow people that want it an easy way to opt in
09:37:28 sjorge: Agree.
09:37:52 And if we document the gotcha that server_delegation is not yet implemented for 4.1 that should cover most bases I think.
09:38:08 It looks like currently the server_vers{min,max} only take a major version though
09:38:26 sjorge: Could you try to disable locking for some time with a 4.1 mount? It could help to understand whether locking is the issue.
09:39:25 sjorge: Of course, after the experiment is finished, enable it again.
09:43:58 Seems 'nolock' is ignored with vers=4.0 or higher?
09:45:16 [hyperon :: sjorge][~/Desktop]
09:45:16 [■]$ tshark -r vers40sqlite.pcap| grep -i lock | wc -l
09:45:16 20
09:45:16 [hyperon :: sjorge][~/Desktop]
09:45:16 [■]$ tshark -r vers41sqlite.pcap| grep -i lock | wc -l
09:45:17 696
09:45:36 I managed to replicate the behavior with just vanilla sqlite
09:46:13 https://gist.github.com/sjorge/ea61edcf5f36e26721fb6121cc4710e7
09:46:27 vers40 shows only 20 locks, while with vers41 it shows >600 locks
09:48:03 sjorge: Can you try to use "local_lock=all"?
09:49:40 no effect
09:50:31 It saw the opt at least according to mount -v
09:50:38 But ended up as local_lock=none
09:50:39 https://gist.github.com/sjorge/b2622bf6d1921bfd79aee26c937b8024
09:51:35 sjorge: "nfsvers=4.0,noresvport,fsc,local_lock=all" - you use 4.0?
09:52:23 I tested with both
09:52:28 4.0 and 4.1
09:52:36 Neither takes the local_lock=all or nolock options
10:02:23 sjorge: Well. Those are only for nfsv3.
10:03:53 sjorge: Thanks!
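A note on why a plain sqlite run is enough to reproduce this: in autocommit mode every INSERT is its own transaction, and SQLite takes and releases its file locks on the database once per transaction. Below is a minimal sketch of such a workload, assuming Python's built-in sqlite3 module as a stand-in for the `sqlite3 test.db < test.sql` run. The table name, row count, and the `run_workload` helper are hypothetical, and the contents of the test.sql in the gist are not reproduced here.

```python
import os
import sqlite3
import tempfile


def run_workload(db_path, rows=100):
    """Insert `rows` rows, one transaction per INSERT (hypothetical workload).

    isolation_level=None puts the sqlite3 module in autocommit mode, so SQLite
    locks and unlocks the database file around each statement. Per the
    discussion above, on an NFS mount without a delegation those lock
    operations end up as separate calls to the server.
    """
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
        for i in range(rows):
            conn.execute("INSERT INTO t (v) VALUES (?)", (f"row-{i}",))
        # Return the total row count so callers can check the run completed.
        return conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    # A temp dir is used here only so the sketch runs anywhere; point the path
    # at a file on the NFS mount under test to generate lock traffic.
    with tempfile.TemporaryDirectory() as tmp:
        print(run_workload(os.path.join(tmp, "test.db"), rows=100))
```

Capturing traffic while this runs against each mount configuration should give counts comparable to the tshark numbers in the log, though the absolute values will differ from the gist's test.sql.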
10:08:23 I did manage to switch the mount I was using for my little test.sql to vers3
10:08:25 [hyperon :: sjorge][~]
10:08:25 [■]$ tshark -r vers3sqlite.pcap | grep -ic lock
10:08:25 836
10:08:25 [hyperon :: sjorge][~]
10:08:25 [■]$ tshark -r vers3locallocksqlite.pcap | grep -ic lock
10:08:25 2
10:09:00 with local_lock=all it only saw 2 lock requests, but that could have been from another mount as that vm has a few too
10:09:44 (note that is on a completely different VM, to avoid confusion)
10:10:18 But since I am comparing sqlite3 test.db < test.sql in all 4 of the tshark greps it should show a consistent picture
10:11:33 That vers=3,lock,local_lock=all => (nearly) no locks (like I said the VM still has other mounts too), vers=3,lock => lots of lock calls, vers=4.0 + delegation => very few lock calls, vers=4.1 => lots of locks
10:14:10 Let me try that with everything else unmounted to be sure, give me a bit as that's a lot of work
10:27:19 https://gist.github.com/sjorge/8df5121cc8a006370bb75f8ff25b9756 OK so that indeed shows a clear picture
10:27:34 server_delegation=on for all of these
10:28:06 3 + local_lock=all => 0 lock calls
10:28:13 4.0 + delegation => 0 lock calls
10:28:21 4.1 or 3 with locking => lots of lock calls
10:28:44 Since the actions in test.sql are the same the lock call count also lines up which makes sense I guess
10:29:12 i'm off for a while now
19:49:54 I think that makes sense, given the symptoms. Delegations are pretty critical for client performance in general
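The comparison above boils down to counting the lines that mention "lock" in tshark's text output for each capture. Here is a small sketch of that tally, assuming the text output of `tshark -r <file>` is piped in on stdin. `count_lock_lines` and the script name `count_locks.py` are hypothetical names, not part of any tool used in the log.

```python
import sys


def count_lock_lines(lines):
    """Count lines containing 'lock', case-insensitively, like `grep -ic lock`.

    As with the grep, this is a substring match, so lines mentioning unlock
    operations are counted as well.
    """
    return sum(1 for line in lines if "lock" in line.lower())


if __name__ == "__main__":
    # Example: tshark -r vers41sqlite.pcap | python count_locks.py
    print(count_lock_lines(sys.stdin))
```

Because the same workload is replayed for every mount configuration, the raw counts can be compared directly without normalizing them.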