

Parallel directory operations

By Jake Edge
April 16, 2025

LSFMM+BPF

Allowing directories to be modified in parallel was the topic of Jeff Layton's filesystem-track session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF). There are certain use cases, including for the NFS and Lustre filesystems, as mentioned in a patch set referenced in the topic proposal, where contention in creating multiple files in a directory is causing noticeable performance problems. In some testing, Layton has found that the inode read-write semaphore (i_rwsem) for the directory is serializing operations; he wanted to discuss alternatives.

He and Christian Brauner had worked on a patch set that picked off some of the low-hanging fruit of the problem. It would avoid taking the exclusive lock on the directory if an open with the O_CREAT flag did not actually create a new file. It provided a substantial improvement for certain workloads, Layton said.

But what is really desired, he said, is to be able to do all sorts of parallel directory operations (mkdir, unlink, rmdir, etc.). Neil Brown developed the patch set referenced earlier (which was up to v7 as of the summit) that would stop taking the exclusive (write) lock for every operation on the directory; it would take the shared (read) lock instead. There are filesystems, such as NFS, that do not need to have those operations serialized, Layton said.
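In VFS terms, directory-modifying operations today are wrapped in an exclusive hold of the parent's i_rwsem; Brown's series relaxes that to a shared hold, pushing per-name exclusion down to the dentry. A rough before-and-after sketch, simplified from the actual fs/namei.c paths rather than taken from the patches themselves:

```
/* Today: every create in a directory serializes on the parent. */
inode_lock(dir);                 /* down_write(&dir->i_rwsem) */
err = dir->i_op->create(idmap, dir, dentry, mode, true);
inode_unlock(dir);

/* With the proposed series: concurrent operations in one directory. */
inode_lock_shared(dir);          /* down_read(&dir->i_rwsem) */
/* per-name exclusion comes from locking the dentry instead */
err = dir->i_op->create(idmap, dir, dentry, mode, true);
inode_unlock_shared(dir);
```

It is that combination — shared directory lock plus exclusive dentry lock — that opens the deadlock possibilities discussed below.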

Brauner and Al Viro reviewed Brown's patch set and gave some good feedback, Layton said. The stumbling block for those patches is that they take the directory i_rwsem as a shared lock, rather than as an exclusive one, but take an exclusive lock on the directory entry (dentry), which opens up a lot of possibilities for deadlock. There are lots of corner cases, so the problem is with making it "provably correct, which is going to be an order of magnitude more difficult than the existing exclusive lock".

[Jeff Layton]

Brown has been working on the patches for a while. The most recent version adds many new inode operations with _async added, "which is a little ugly", Layton said, but that is "cosmetic stuff". The hard part is getting the locking right.

That is where things stand, he said, and wondered what filesystem developers could do to help with the "long slog", because it is important for scalability in directories. The workloads where the problems are seen are "big untar-type workloads", where there are many threads all creating files in parallel in the same directories.
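The contended workload is easy to reproduce from user space: every O_CREAT open that makes a new file takes the parent directory's i_rwsem exclusively in the kernel, so threads creating files in one directory serialize on that single lock. A minimal sketch of such a workload (the thread count, file count, and path template here are arbitrary choices for illustration; compile with -pthread and call run_workload() from a main()):

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS   4
#define PER_THREAD 100

static char dirpath[32];
static int created_total;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread creates PER_THREAD files in the same directory; in the
 * kernel, each create takes the parent's i_rwsem exclusively, so the
 * threads make progress one create at a time. */
static void *creator(void *arg)
{
	long id = (long)arg;
	char path[64];

	for (int i = 0; i < PER_THREAD; i++) {
		snprintf(path, sizeof(path), "%s/t%ld-f%d", dirpath, id, i);
		int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
		if (fd >= 0) {
			close(fd);
			pthread_mutex_lock(&count_lock);
			created_total++;
			pthread_mutex_unlock(&count_lock);
		}
	}
	return NULL;
}

/* Returns the number of files created, or -1 on setup failure. */
int run_workload(void)
{
	pthread_t tids[NTHREADS];

	strcpy(dirpath, "/tmp/pdiropsXXXXXX");
	created_total = 0;
	if (!mkdtemp(dirpath))
		return -1;
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, creator, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	return created_total;
}
```

Timing a run of this against the same files spread over NTHREADS separate directories shows the cost of the shared parent.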

Brauner asked if individual filesystems would have to opt into supporting the feature. Layton said that he did not see any other possibility. One path might be to, say, create a mkdir_async() inode operation, slowly convert all of the filesystems to use that, and then replace the mkdir() operation. Brauner said that he did not like adding lots of new inode operations, but thought that if the idea was viable, the existing operations could add an asynchronous form that would only be used for filesystems that opt in. Layton suggested starting slow, with a single inode operation for unlink, say; Brauner thought that sounded like a reasonable approach.

Josef Bacik wondered if it made sense to push the locking down to individual filesystems; "I hate doing this" but it would mean that filesystems could try alternative locking that would allow more parallelism. There would need to be code that embodies the existing locking for every filesystem (except those trying out alternatives) to use.

Mark Harmstone asked about batching the operations for these workloads, which could be done, but requires rewriting applications. Brauner pointed out that io_uring could probably be used to do the batching today if that was desirable, but Harmstone said that he had been using io_uring for directory-creation (mkdir) operations, which still take the directory lock for each one. Layton said that a fast-path option could perhaps be added to io_uring, but there is a more general problem to be solved.
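io_uring has had a mkdirat operation since 5.15, so the batching Brauner mentioned is possible today — though, as Harmstone noted, each queued request still takes the directory lock in turn. A liburing sketch of submitting such a batch (link with -luring; error handling is trimmed for brevity):

```
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>

int make_dirs(void)
{
	static char paths[64][16];   /* buffers must outlive the SQEs */
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int made = 0;

	if (io_uring_queue_init(64, &ring, 0) < 0)
		return -1;

	/* queue 64 mkdir requests, then submit them in one syscall */
	for (int i = 0; i < 64; i++) {
		snprintf(paths[i], sizeof(paths[i]), "dir%d", i);
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		io_uring_prep_mkdirat(sqe, AT_FDCWD, paths[i], 0755);
	}
	io_uring_submit(&ring);

	for (int i = 0; i < 64; i++) {
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		if (cqe->res == 0)
			made++;
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return made;
}
```

The batching saves system-call overhead and round trips, but each mkdir still contends on the parent's i_rwsem inside the kernel, which is the more general problem Layton referred to.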

For example, if multiple files are created in the same directory using NFS, those operations are all serialized on the client side, which means that it requires lots of network round trips before anything can happen. Fixing that would still end up serializing many of the operations on the server side, but that would be better. Bacik concurred, noting that you can take the untar workload out of the picture; creating a file in a directory takes an exclusive lock, so any lookups that involve the directory are held off as well.

Part of the problem is that it is not entirely clear what i_rwsem is protecting, Brauner said. For example, the setuid exit path recently started taking that lock to avoid some potential security problems. David Howells said that it is also used by network filesystems to ensure that direct and buffered I/O do not conflict.

Chris Mason asked which local filesystems would be able to handle parallel creation in directories. Layton did not know of any, but Bacik said that Btrfs would be able to; he agreed when someone else suggested that bcachefs might as well. Timothy Day from the Lustre filesystem project said that it could do so as well.

Bacik noted that once parallel creates are in place, "for sure we are going to run into the next thing"; there will be other constraints that will be encountered. He said that network filesystems are likely to be able to more fully take advantage of the feature, while local filesystems will only have modest gains. He has an ulterior motive in that Filesystem in Userspace (FUSE) servers may be able to find better locking schemes in user space if the kernel is not taking exclusive locks on their behalf.


Index entries for this article
Kernel: Filesystems/Virtual filesystem layer
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2025



Benefits for Lustre

Posted Apr 16, 2025 17:32 UTC (Wed) by tim-day-387 (subscriber, #171751) [Link]

Parallel directory operations would actually benefit Lustre in two ways. Firstly, the client would see a significant speed-up with (hopefully) only minor changes [1]. Lustre supports multiple metadata servers and a single directory can be striped over multiple servers. With the locking in VFS, you can't take advantage of that from a single client.

Secondly, the Lustre server (which also lives in the kernel, kind of like NFS) uses a patched ext4 as a storage backend. Lustre calls ext4 directly (bypassing VFS) and implements a custom locking scheme [2] to enable parallel directory operations on ext4. This allows for a single metadata server to scale performance much higher. This work could serve as a starting point for supporting parallel directory operations on vanilla ext4 as well.

[1] https://jira.whamcloud.com/browse/LU-17776
[2] https://wiki.whamcloud.com/display/PUB/Parallel+Directory...

My current plan

Posted Apr 18, 2025 1:49 UTC (Fri) by neilbrown (subscriber, #359) [Link]

My current plan is to provide an alternate fine-grained locking scheme that provides all that the VFS needs. These locks (which focus on locking individual names in the directory rather than the whole directory) would be taken before the whole-directory locks.

Then filesystems could indicate (via a flag in a new inode_operations->flags) that they didn't need the directory locks. They would then be responsible for doing their own locking.

For network filesystems, the server would likely provide all the needed locking.
For local storage-based filesystems, locking could likely happen at the directory-block level (as Lustre does for ldiskfs, its modified ext4).
For local virtual filesystems, ->d_lock in the parent dentry is likely to be sufficient.

Eventually we might get all filesystems setting this flag and we could then remove whole directory locks completely on these paths.
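The flag-based opt-in Brown describes might look something like the following; the flag name and per-name helper are invented here for illustration (inode_operations has no flags field in mainline today — adding one is part of what the series proposes):

```
struct inode_operations {
	unsigned int flags;          /* new field proposed by the series */
	/* ... existing operations ... */
};
#define IOP_HANDLES_DIR_LOCKING 0x1  /* hypothetical flag name */

/* VFS, roughly: always take the new fine-grained per-name lock,
 * then the whole-directory lock only for non-opted-in filesystems */
lock_name(parent, dentry);           /* hypothetical per-name lock */
if (!(dir->i_op->flags & IOP_HANDLES_DIR_LOCKING))
	inode_lock(dir);             /* legacy whole-directory lock */
```

Once every filesystem sets the flag, the whole-directory branch could be removed, as Brown suggests above.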

I've put the async idea on the back-burner for now. I think it would be great for io_uring to be able to submit 10,000 unlink requests in parallel then wait for them to complete. I think it can be added later without changing function signatures.

If anyone is curious my current devel tree is

https://github.com/neilbrown/linux/tree/pdirops

or you can see the patch list at:

https://github.com/torvalds/linux/compare/master...neilbr...

A lot of it is definitely WIP, particularly the last patch, which is exploring a more lightweight approach for cross-directory rename, but a lot of it is unlikely to change much before I submit it.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds