Following up on file-position locking
At the end of the previous article, a change had been merged into the mainline to unconditionally take the file-position lock (which ensures that only one thread is manipulating the current file read/write position at any given time). The article noted that the performance impact of this change had not been measured. That changed on August 3, when Mateusz Guzik reported that there was indeed a performance change — specifically, a 5% regression on a test he had run. VFS layer maintainer Christian Brauner initially discounted the report, but also said that the problem, if it truly existed, could be mitigated by only taking the position lock for directories (and not for regular files). The original locking problem had only affected directory reads, so the fix is only needed there as well.
Linus Torvalds was quick to respond to the report, though, posting
a patch that implemented the directory-only behavior; it worked by
doing the locking explicitly in the implementation of the directory-related
system calls. Brauner pointed
out that this approach missed the case of calling lseek()
on a directory; "that's the heinous part
". Since the whole purpose
of lseek() is to manipulate the file position, that was indeed a
bit of an oversight.
Torvalds acknowledged
the omission, but additionally realized that there are some filesystems
that support read() and write() on directories as well.
"They may be broken, but people actually did do things like that
historically, maybe there's a reason adfs and ceph allow it
". Those
cases, too, would need to be addressed for a complete fix. To create that
fix, he put together a new patch that changes __fdget_pos() to
simply test whether the object in question is a directory, and to always
take the lock in that case. Brauner was still unconvinced
about the need for the patch, but agreed that Torvalds's approach would
work; that patch was duly applied to the
mainline for 6.5-rc5.
After that happened, Brauner suggested a slightly different approach. __fdget_pos() takes a file structure pointer as an argument; to determine whether that file corresponds to an open directory, the code must traverse the f_inode pointer to get to the associated inode structure. A simpler test, he said, might be to test whether the given file provides either of the iterate() or iterate_shared() operations, which are only provided for directories.
Once upon a time, the kernel's file_operations structure included a member called readdir() that, as might be expected, would be provided by filesystems to support reading directory contents. The 3.11 release in 2013 saw that member renamed to iterate(), with a different API. This function had exclusive access to the inode, which would be write-locked before the call was made. That limited performance for filesystems that were able to support multiple, concurrent read operations on the same directory. To speed things up, a new iterate_shared() variant was added for the 4.7 release in 2016. This version is called with a shared (read) lock on the inode, so multiple calls can be running concurrently.
At the time, the documentation was updated with the admonition: "Old
method is only used if the new one is absent; eventually it will be
removed. Switch while you still can; the old one won't stay.
" Few
people who are familiar with the kernel development process will be shocked
to learn that, in fact, iterate() did stay, and that it is
still present in current kernels, seven years later.
Torvalds, having been "shamed" by this reminder that iterate() still exists, decided to finish the job. With a new patch set, he eliminated the iterate() function. For the filesystems that were still using it (ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, procfs, and vboxfs), he created a new wrapper that can be provided as iterate_shared(); it drops the read lock, then acquires a full write lock before calling into the filesystem. The exclusive locking, in other words, has been pushed down a level and only appears in the few remaining places where it is needed.
Torvalds had intended to apply this change during the 6.6 merge window; it is, after all, rather late in the 6.5 cycle for core changes of this nature. One of the advantages of being Linus Torvalds, though, is that you can break the rules when it seems appropriate; he took that approach in this case and merged the change into the mainline, also for the 6.5-rc5 release; it was quickly followed by a change from Brauner to change the position-locking test to just look for the existence of an iterate_shared() function.
Naturally, none of that work updated the associated documentation, giving
the tireless documentation maintainer something to do.
Otherwise, though, perhaps this little story has finally come to a real
conclusion, with code that is more correct and a little bit of longstanding
technical debt removed. Of course, it is once again true that no
performance results have been posted, so the possibility of a third
installment cannot be entirely ruled out.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/Virtual filesystem layer |