
Deferring mtime and ctime updates

By Jonathan Corbet
August 21, 2013
Back in 2007, the kernel developers realized that the maintenance of the last-accessed time for files ("atime") was a significant performance problem. Atime updates turned every read operation into a write, slowing the I/O subsystem significantly. The response was to add the "relatime" mount option that reduced atime updates to the minimum frequency that did not risk breaking applications. Since then, little thought has gone into the performance issues associated with file timestamps.

Until now, that is. Unix-like systems actually manage three timestamps for each file: along with atime, the system maintains the time of the last modification of the file's contents ("mtime") and the last metadata change ("ctime"). At first glance, maintaining these times would appear to be less of a performance problem; updating mtime or ctime requires writing the file's inode back to disk, but the operation that causes the time to be updated will be causing a write to happen anyway. So, one would think, any extra cost would be lost in the noise.

It turns out, though, that there is a situation where that is not the case — workloads where a file is written through a mapping created with mmap(). Writable memory-mapped files are a bit of a challenge for the operating system: the application can change any part of the file with a simple memory reference without notifying the kernel. But the kernel must learn about the write somehow so that it can eventually push the modified data back to persistent storage. So, when a file is mapped for write access and a page is brought into memory, the kernel will mark that page (in the hardware page tables) as read-only. An attempt to write that page will generate a fault, notifying the kernel that the page has been changed. At that point, the page can be made writable so that further writes will not generate any more faults; it can stay writable until the kernel cleans the page by writing it back to disk. Once the page is clean, it must be marked read-only once again.

The problem, as explained by Dave Chinner, is this: as soon as the kernel receives the page fault and makes the page writable, it must update the file's timestamps, and, for some filesystem types, an associated revision counter as well. That update is done synchronously in a filesystem transaction as part of the process of handling the page fault and allowing write access. So a quick operation to make a page writable turns into a heavyweight filesystem operation, and it happens every time the application attempts to write to a clean page. If the application writes large numbers of pages that have been mapped into memory, the result will be a painful slowdown. And most of that effort is wasted; the timestamp updates overwrite each other, so only the last one will persist for any useful period of time.

As it happens, Andy Lutomirski has an application that is affected badly by this problem. One of his previous attempts to address the associated performance problems — MADV_WILLWRITE — was covered here recently. Needless to say, he is not a big fan of the current behavior associated with mtime and ctime updates. He also asserted that the current behavior violates the Single Unix Specification, which states that those times must be updated between any write to a page and either the next msync() call or the writeback of the data in question. The kernel, he said, does not currently implement the required behavior.

In particular, he pointed out that the timestamp updates happen after the first write to a given page. After that first reference, the page is left writable and the kernel will be unaware of any subsequent modifications until the page is written back. If the page remains in memory for a long time (multiple seconds) before being written back — as is often the case — the timestamp update will incorrectly reflect the time of the first write, not the last one.

In an attempt to fix both the performance and correctness issues, Andy has put together a patch set that changes the way timestamp updates are handled. In the new scheme, timestamps are not updated when a page is made writable; instead, a new flag (AS_CMTIME) is set in the associated address_space structure. So there is no longer a filesystem transaction that must be done when the page is made writable. At some future time, the kernel will call the new flush_cmtime() address space operation to tell the filesystem that an inode's times should be updated; that call will happen in response to a writeback operation or an msync() call. So, if thousands of pages are dirtied before writeback happens, the timestamp updates will be collapsed into a single transaction, speeding things considerably. Additionally, the timestamp will reflect the time of the last update instead of the first.

There have been some quibbles with this approach. One concern is that there are tight requirements around the handling of timestamps and revision numbers in filesystems that are exported via NFS. NFS clients use those timestamps to learn when cached copies of file data have gone stale; if the timestamp updates are deferred, there is a risk that a client could work with stale data for some period of time. Andy claimed that, with the current scheme, the timestamp could be wrong for a far longer period, so, he said, his patch represents an improvement, even if it's not perfect. David Lang suggested that perfection could be reached by updating the timestamps in memory on the first fault but not flushing that change to disk; Andy saw merit in the idea, but has not implemented it thus far.

As of this writing, the responses to the patch set itself have mostly been related to implementation details. Andy will have a number of things to change in the patch; it also needs filesystem implementations beyond just ext4 and a test for the xfstests package to show that things work correctly. But the core idea no longer seems to be controversial. Barring a change of opinion within the community, faster write fault handling for file-backed pages should be headed toward a mainline kernel sometime soon.

Index entries for this article
Kernel: Filesystems/Access-time tracking



Deferring mtime and ctime updates

Posted Aug 22, 2013 2:30 UTC (Thu) by Ben_P (guest, #74247) [Link]

Great article, thanks.

I had no idea about the mmap()/mtime semantics, nor the perf impacts.

Deferring mtime and ctime updates

Posted Aug 22, 2013 11:23 UTC (Thu) by jlayton (subscriber, #31672) [Link] (17 responses)

Yes, great article!

I'm not convinced that the NFS problem is really a huge issue. mmap and NFS have a long and problematic history together...

The answer for people trying to enforce cache-coherency across multiple clients has always been "use POSIX locking". If we simply have nfsd force the c/mtime update whenever a lock is released then that may be enough to keep that in check.

Another idea might be to somehow allow the c/mtime to be updated in memory without requiring that to be flushed to disk until flush_cmtime() is called. knfsd could then report the in-memory time on GETATTR calls. That could be problematic though if the host crashes, I guess. The client could see c/mtime move backward. For the Linux client, that's not such a huge problem -- it watches for the c/mtime to change and that change doesn't necessarily need to move toward the future. Other clients might not like it though.

Deferring mtime and ctime updates

Posted Aug 22, 2013 15:28 UTC (Thu) by bfields (subscriber, #19510) [Link] (16 responses)

Yeah, I don't think it's worth trying for perfect consistency between a local application using mmap and a remote NFS client.

"The answer for people trying to enforce cache-coherency across multiple clients has always been "use POSIX locking". If we simply have nfsd force the c/mtime update whenever a lock is released then that may be enough to keep that in check."

That might not be a bad idea.

For ordinary NFS clients the requirement that "times must be updated between any write to a page and either the next msync() call or the writeback of the data in question" may be all we need to guarantee that clients will see updates when they should, as they should normally be committing data before unlocking or opening. (Though I wonder about the case where they hold a write delegation.)

"that change doesn't necessarily need to move toward the future."

Though in the v4 case there's a chance the change attribute could repeat a value after reboot. I wonder if the filesystem could somehow add N to all change attributes after an unclean shutdown, for some big enough N.

Deferring mtime and ctime updates

Posted Aug 22, 2013 20:26 UTC (Thu) by luto (subscriber, #39314) [Link] (2 responses)

The trouble with forcing early cmtime updates is that we don't really know whether there's a dirty pte without walking all dirty pages.

(Given that the kernel doesn't currently support clean+writable ptes for shared mappings, this could be hacked around by tracking the number of writable ptes, but this is IMO gross. The other reason I don't want to do that is because I want the system to have a chance of working on writable XIP devices and on some magic future kernel that does support clean+writable.)

It would be possible to write a function to collect dirty ptes for an entire address_space much faster than calling page_mkclean on every page.

Deferring mtime and ctime updates

Posted Aug 23, 2013 12:08 UTC (Fri) by etienne (guest, #25256) [Link] (1 responses)

Maybe the other trouble with forcing early cmtime updates is that you do not know if anyone will ever ask for it - i.e. the amount of work to maintain cmtime should be reduced to a minimum, even if that means more work when someone stat()s the file.
Maybe even scan both versions (on disk and in memory) to look for differences when a stat() of an mmap'ed file is done...
I wonder if someone has an idea about measuring the amount of time taken by managing a "page written" bit by setting the page read-only and faulting at the first write (on Intel/AMD), and on ARM also managing a "page accessed" bit by setting the page non-accessible and faulting at the first access... that is a lot of faults (lots of pages), and each one may slow down the faulting process (cache and jump-target cache pollution, out-of-order CPU stalls)...

Deferring mtime and ctime updates

Posted Aug 23, 2013 17:33 UTC (Fri) by dlang (guest, #313) [Link]

think about a multi-GB video file that's accessed via NFS, then look at your proposal to scan the entire file again and see if it still seems reasonable.

Deferring mtime and ctime updates

Posted Aug 23, 2013 11:55 UTC (Fri) by jlayton (subscriber, #31672) [Link] (12 responses)

> Though in the v4 case there's a chance the change attribute could repeat a
> value after reboot.

Is that really a problem though? IIRC, the change_attribute is supposed to be opaque. The client is only supposed to care if it's different from the last one it saw, not necessarily that it has gone forward.

Oh but...I suppose you could get bitten if you saw the change attribute transition from (for instance) 3->4 and then server reboots without committing that to disk. It then comes back again and does another 3->4 transition with a different set of data. Client then sees change attribute is "still" at 4 and doesn't purge the cache.

In that case...yeah -- maybe adding some sort of random offset to the change_attribute that's generated at boot time might make sense.

Deferring mtime and ctime updates

Posted Aug 23, 2013 13:41 UTC (Fri) by bfields (subscriber, #19510) [Link] (11 responses)

Yeah, exactly. At the time of a crash the in-memory change attribute may be well ahead of the on-disk one, and when the client resends the uncommitted data after boot it probably doesn't send exactly the same number and sequence of write rpc's, so as the server processes those resends it could reuse old change attributes with different data.

I don't know if the problem would be easy to hit in practice.

For a fix: we'd rather not invalidate all caches on every boot. We can't know which inodes are affected as that would require a disk write before allowing dirtying of any pages. Especially if there's a possibility of multiple reboots and network partitions I don't think we even know which boots are affected (maybe this is a boot after a clean shutdown but we still have a back-in-time change attribute left over from a previous crash).

Maybe a simple fix would be: instead of making the change attribute a simple 64-bit counter, instead put current unix time in the top 32 bits and a counter in the bottom 32 bits. Print a warning and congratulations to the log the first time anyone manages to sustain more than 4 billion writes in a second....

Deferring mtime and ctime updates

Posted Aug 23, 2013 17:34 UTC (Fri) by dlang (guest, #313) [Link] (7 responses)

keep in mind that the in-memory version should only be used if you are NOT exporting the file via NFS.

So you don't have to worry about clients after a reboot.

Deferring mtime and ctime updates

Posted Aug 23, 2013 18:45 UTC (Fri) by bfields (subscriber, #19510) [Link] (6 responses)

You're suggesting ensuring that any pending ctime/mtime/change attribute updates be committed to disk before responding to an nfs stat? I'm not sure that's practical.

Deferring mtime and ctime updates

Posted Aug 24, 2013 1:31 UTC (Sat) by dlang (guest, #313) [Link] (5 responses)

remember that the NFS spec requires that any writes to a NFS volume must be safe on disk before the write completes.

This requires a fsync after every write, which absolutely kills performance (unless you avoid ext3 and you have NVRAM or battery backed cache to write to), updating the attribute at the same time seems to be required by the standard.

Now, many people configure their systems outside the standard, accepting less data reliability in the name of performance, but if you are trying to provide all of the NFS guarantees, you need to update the timestamp after every write.

This is why it's a _really_ bad idea to put things like Apache logs on NFS, unless you have a server with a lot of NVRAM to buffer your writes, and even then it's questionable.

Deferring mtime and ctime updates

Posted Aug 24, 2013 2:18 UTC (Sat) by mjg59 (subscriber, #23239) [Link] (3 responses)

I… think that reminding the maintainer of the kernel NFS server how NFS works might be a touch unnecessary.

Deferring mtime and ctime updates

Posted Aug 24, 2013 2:28 UTC (Sat) by dlang (guest, #313) [Link] (2 responses)

could be (and for the record, I didn't recognize that was who he was), but I've seen people manage to miss obvious things before in their area of expertise (and I've done it myself)

If I'm wrong about my understanding of what NFS requires, I'm interested in being corrected; I'll learn something and be in a better position to set up or troubleshoot in the future.

David Lang

Deferring mtime and ctime updates

Posted Aug 24, 2013 19:45 UTC (Sat) by bfields (subscriber, #19510) [Link] (1 responses)

No problem, I can overlook the obvious....

But as jlayton says, what you describe is not the typical case for NFS since v3, and reverting to NFSv2-like behavior would be a significant regression in some common cases.

And on a quick check.... I think the Linux v4 client, as one example, does request the change attribute on every write (assuming it doesn't hold a delegation), so the server would be forcing a commit to disk on every write.

Deferring mtime and ctime updates

Posted Aug 24, 2013 20:11 UTC (Sat) by dlang (guest, #313) [Link]

Ok, I wasn't aware that newer versions of NFS had relaxed the standard. (I've been dealing with NFS for a while, but for the last 10 years or so it's either been with home-grade machines that I didn't expect great performance from, or with EMC/Netapp high-end devices that include a lot of NVRAM to handle writes fast anyway.)

just so I can see if I've got the use cases correct, I am understanding that we have the following cases

1. no NFS: ctime and mtime updates can be deferred

2. NFSv2 in use: all writes are synchronous and ctime/mtime updates should be as well.

3. NFSv3+ in use: writes can be delayed (which should include ctime/mtime updates), unless the client says they can't, in which case NFSv2 rules apply

It seems to me that having a mount option like relctime or relmtime, where the timestamp gets written out when the file is closed or munmap()ed, when an fsync is done, or sooner if the kernel feels like it, should work (assuming NFS does flushes).

The only gap I can see is if the writes to the file are being done locally (mmap for example), then the writes may not be visible to NFS clients immediately, but if this is a mount option like relatime is, people who care about this case just don't use the mount option and get the old (slower but reliable) mode.

Deferring mtime and ctime updates

Posted Aug 24, 2013 11:23 UTC (Sat) by jlayton (subscriber, #31672) [Link]

> remember that the NFS spec requires that any writes to a NFS volume must
> be safe on disk before the write completes. This requires a fsync after
> every write, which absolutely kills performance (unless you avoid ext3 and
> you have NVRAM or battery backed cache to write to), updating the
> attribute at the same time seems to be required by the standard.

That was true for NFSv2, but NFSv3 and later allow you to do UNSTABLE writes. Those don't need to be written to stable storage until the client issues a COMMIT (though the server is free to write them out earlier if it needs to). Most clients (Linux's included) will use UNSTABLE writes for the bulk of the writes that they do. STABLE (NFSv2-ish) writes are still used in some cases, but only where we deem that it's more efficient to do it that way.

Deferring mtime and ctime updates

Posted Aug 24, 2013 11:28 UTC (Sat) by jlayton (subscriber, #31672) [Link] (2 responses)

> Maybe a simple fix would be: instead of making the change attribute a
> simple 64-bit counter, instead put current unix time in the top 32 bits
> and a counter in the bottom 32 bits. Print a warning and congratulations
> to the log the first time anyone manages to sustain more than 4 billion
> writes in a second....

I suspect it wouldn't be too hard to hit that mark ;)

This scheme might work, but you'd still have the same problem that all caches would end up invalidated when the server reboots. You're quite correct that that *is* a problem that can crush an NFS server if it has a lot of clients dealing with large files. I do think we'll need to come up with some scheme that avoids that.

Deferring mtime and ctime updates

Posted Aug 24, 2013 19:59 UTC (Sat) by bfields (subscriber, #19510) [Link] (1 responses)

> I suspect it wouldn't be too hard to hit that mark ;)

I'm not so sure, but actually this is really only a problem if the counter wraps around *and* a client's two successive stats manage to hit the same value each time through, which sounds pretty unlikely.

> This scheme might work, but you'd still have the same problem that all caches would end up invalidated when the server reboots.

I'm suggesting replacing inode_inc_version by something that does this instead of just i_version++. So existing change attributes wouldn't change on reboot. It'd just ensure that when we write the file again, we choose a genuinely new change attribute and not one we might have used on the previous boot.

Deferring mtime and ctime updates

Posted Aug 26, 2013 16:25 UTC (Mon) by raven667 (subscriber, #5198) [Link]

> manage to hit the same value each time through, which sounds pretty unlikely

Just throwing this out there because I'm not an expert on this, but it might also be useful to look at it from a security perspective: how could someone intentionally cause this to fail? If there is a thing that can fail, then someone is going to try very hard to make it fail just to screw up your system if they can.

Deferring mtime and ctime updates

Posted Aug 23, 2013 0:41 UTC (Fri) by PaulWay (guest, #45600) [Link]

So is the idea that the mtime will be set as the time when the page is finally written to disk? Or the time of the page fault? The former seems to make sense to me, but may be less accurate for some. Or is it something else?

Obviously the correct answer is the time when the memory was actually written, but knowing that time is bordering on impossible.

Great reporting on a very interesting improvement, Jon!

Have fun,

Paul

Deferring mtime and ctime updates

Posted Aug 29, 2013 3:05 UTC (Thu) by heijo (guest, #88363) [Link] (1 responses)

Wow, how did this apparently huge issue get missed for 20 years?

Does this *really* mean that if you write a 4GB file via mmap on a 4KB page architecture, the mtime field is updated on disk a million times, or is it mitigated somehow?

Deferring mtime and ctime updates

Posted Aug 29, 2013 10:21 UTC (Thu) by cladisch (✭ supporter ✭, #50193) [Link]

The inode in the page cache will be updated and scheduled for an eventual write to disk a million times. How many of those writes are merged or do actually happen depends on the caching policy, and on how seriously a journaling FS takes metadata updates.

Deferring mtime and ctime updates

Posted Sep 5, 2013 7:01 UTC (Thu) by kevinm (guest, #69913) [Link]

I wonder if it's really worth trying to fix the remote NFS client thing perfectly, since there's still the issue that subsequent writes to the dirty page won't cause more mtime updates - the fault only happens when the page is first marked as dirty. So if the NFS client does a GETATTR and a read after the first write on a page, but then the application with a mapping does another write to the same page, followed by the NFS client doing another GETATTR, it will see the timestamp as unchanged when the file contents have changed.

It seems to me like the "perfect" fix would require marking all the PTEs mapping the file as read-only again at every remote GETATTR call, which is likely to be expensive enough to rule that out entirely.

In my view you shouldn't be relying on changes to mmap()ed files propagating to remote NFS clients until you msync().


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds