Deferring mtime and ctime updates
Until now, that is. Unix-like systems actually manage three timestamps for each file: along with atime, the system maintains the time of the last modification of the file's contents ("mtime") and of the last metadata change ("ctime"). At first glance, maintaining these times would appear to be less of a performance problem; updating mtime or ctime requires writing the file's inode back to disk, but the operation that causes the time to be updated will cause a write to happen anyway. So, one would think, any extra cost would be lost in the noise.
It turns out, though, that there is a situation where that is not the case: when a file is written through a mapping created with mmap(). Writable memory-mapped files are a bit of a challenge for the operating system: the application can change any part of the file with a simple memory reference, without notifying the kernel, but the kernel must learn about the write somehow so that it can eventually push the modified data back to persistent storage. So, when a file is mapped for write access and a page is brought into memory, the kernel will mark that page as read-only in the hardware page tables. An attempt to write to that page will generate a fault, notifying the kernel that the page has been changed. At that point, the page can be made writable so that further writes do not generate more faults; it stays writable until the kernel cleans the page by writing it back to disk. Once the page is clean, it must be marked read-only once again.
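To make that sequence concrete, here is a minimal user-space sketch of the access pattern in question; the file name and sizes are purely illustrative and are not taken from the discussion:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t page = 4096;
	const size_t len = page * 1024;		/* 1024 pages */

	/* "data.bin" is a hypothetical file used only for illustration */
	int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/*
	 * Touch every page: the first store to each clean page takes a
	 * write-protect fault, and, with the current behavior, each of
	 * those faults performs a synchronous timestamp update in the
	 * filesystem.
	 */
	for (size_t off = 0; off < len; off += page)
		p[off] = 1;

	munmap(p, len);
	close(fd);
	return 0;
}
```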
The problem, as explained by Dave Chinner, is this: as soon as the kernel receives the page fault and makes the page writable, it must update the file's timestamps, and, for some filesystem types, an associated revision counter as well. That update is done synchronously in a filesystem transaction as part of the process of handling the page fault and allowing write access. So a quick operation to make a page writable turns into a heavyweight filesystem operation, and it happens every time the application attempts to write to a clean page. If the application writes large numbers of pages that have been mapped into memory, the result will be a painful slowdown. And most of that effort is wasted; the timestamp updates overwrite each other, so only the last one will persist for any useful period of time.
As it happens, Andy Lutomirski has an application that is affected badly by this problem. One of his previous attempts to address the associated performance problems — MADV_WILLWRITE — was covered here recently. Needless to say, he is not a big fan of the current behavior associated with mtime and ctime updates. He also asserted that the current behavior violates the Single Unix Specification, which states that those times must be updated between any write to a page and either the next msync() call or the writeback of the data in question. The kernel, he said, does not currently implement the required behavior.
In particular, he pointed out that the timestamp update happens only for the first write to a given clean page. After that first reference, the page is left writable, and the kernel will be unaware of any subsequent modifications until the page is written back. If the page remains in memory for a long time (multiple seconds) before being written back — as is often the case — the timestamp will incorrectly reflect the time of the first write, not the last one.
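The effect can be observed from user space with a small test program along these lines (a sketch, not taken from the discussion; it assumes a pre-existing file of at least one page): the first store stamps the time, a later store to the already-writable page does not, so the mtime reported after msync() can predate the final modification that msync() flushes.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void show_mtime(int fd, const char *tag)
{
	struct stat st;

	if (fstat(fd, &st) == 0)
		printf("%s: mtime=%ld\n", tag, (long)st.st_mtime);
}

int main(void)
{
	int fd = open("data.bin", O_RDWR);	/* hypothetical, pre-existing file */
	if (fd < 0)
		return 1;

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	p[0] = 1;	/* first store: with current kernels, the time is stamped here */
	show_mtime(fd, "after first store");

	sleep(5);	/* the page stays writable and dirty in the meantime */
	p[1] = 2;	/* no fault is taken, so no timestamp update happens */

	msync(p, 4096, MS_SYNC);	/* the spec requires the times be current by now */
	show_mtime(fd, "after msync");

	munmap(p, 4096);
	close(fd);
	return 0;
}
```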
In an attempt to fix both the performance and correctness issues, Andy has put together a patch set that changes the way timestamp updates are handled. In the new scheme, timestamps are not updated when a page is made writable; instead, a new flag (AS_CMTIME) is set in the associated address_space structure. So there is no longer a filesystem transaction that must be done when the page is made writable. At some future time, the kernel will call the new flush_cmtime() address space operation to tell the filesystem that an inode's times should be updated; that call will happen in response to a writeback operation or an msync() call. So, if thousands of pages are dirtied before writeback happens, the timestamp updates will be collapsed into a single transaction, speeding things up considerably. Additionally, the timestamp will reflect the time of the last update instead of the first.
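The AS_CMTIME and flush_cmtime() names come from the patch set itself, but the fragment below is only a rough sketch of how such a scheme could be wired up; the helper functions, exact signatures, and call sites are assumptions for illustration, not the actual patch:

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Write-fault path: rather than running a synchronous filesystem
 * transaction, just record that the inode's times need updating.
 * (Illustrative helper; not part of the posted patch.)
 */
static void note_cmtime(struct address_space *mapping)
{
	set_bit(AS_CMTIME, &mapping->flags);
}

/*
 * Writeback or msync() path: perform one timestamp update covering all
 * of the page dirtying that happened since the flag was last cleared.
 */
static void maybe_flush_cmtime(struct address_space *mapping)
{
	if (test_and_clear_bit(AS_CMTIME, &mapping->flags) &&
	    mapping->a_ops->flush_cmtime)
		mapping->a_ops->flush_cmtime(mapping);
}
```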
There have been some quibbles with this approach. One concern is that there are tight requirements around the handling of timestamps and revision numbers in filesystems that are exported via NFS. NFS clients use those timestamps to learn when cached copies of file data have gone stale; if the timestamp updates are deferred, there is a risk that a client could work with stale data for some period of time. Andy claimed that, with the current scheme, the timestamp could be wrong for a far longer period, so, he said, his patch represents an improvement, even if it's not perfect. David Lang suggested that perfection could be reached by updating the timestamps in memory on the first fault but not flushing that change to disk; Andy saw merit in the idea, but has not implemented it thus far.
As of this writing, the responses to the patch set itself have mostly been related to implementation details. Andy will have a number of things to change in the patch; it also needs filesystem implementations beyond just ext4 and a test for the xfstests package to show that things work correctly. But the core idea no longer seems to be controversial. Barring a change of opinion within the community, faster write fault handling for file-backed pages should be headed toward a mainline kernel sometime soon.