Deferring mtime and ctime updates

Posted Aug 23, 2013 13:41 UTC (Fri) by bfields (subscriber, #19510)
In reply to: Deferring mtime and ctime updates by jlayton
Parent article: Deferring mtime and ctime updates

Yeah, exactly. At the time of a crash the in-memory change attribute may be well ahead of the on-disk one, and when the client resends the uncommitted data after boot it probably doesn't send exactly the same number and sequence of write RPCs, so as the server processes those resends it could reuse old change attributes with different data.
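A minimal sketch of the reuse described above, assuming a toy model in which the change attribute is a plain counter that is only occasionally flushed to disk. The struct and function names are hypothetical and are not kernel code.

```c
#include <stdint.h>

/* Hypothetical model of one inode's change attribute (not kernel code). */
struct inode_model {
    uint64_t in_memory;  /* bumped on every write */
    uint64_t on_disk;    /* last value that reached stable storage */
};

/* A write bumps the in-memory counter and returns the new value. */
uint64_t model_write(struct inode_model *ino)
{
    return ++ino->in_memory;
}

/* A flush makes the current in-memory value durable. */
void model_flush(struct inode_model *ino)
{
    ino->on_disk = ino->in_memory;
}

/* A crash loses everything that was not flushed. */
void model_crash(struct inode_model *ino)
{
    ino->in_memory = ino->on_disk;
}

/*
 * Returns 1 if a change attribute a client saw before the crash is
 * handed out again after the crash, possibly for different data.
 */
int demo_change_attr_reuse(void)
{
    struct inode_model ino = { 0, 0 };
    uint64_t seen_before_crash;
    uint64_t seen_after_crash;

    model_write(&ino);                      /* attribute 1 */
    model_flush(&ino);                      /* on-disk value is now 1 */
    model_write(&ino);                      /* attribute 2, memory only */
    seen_before_crash = model_write(&ino);  /* attribute 3, memory only */

    model_crash(&ino);                      /* server restarts at 1 */

    model_write(&ino);                      /* resend: attribute 2 again */
    seen_after_crash = model_write(&ino);   /* resend: attribute 3 again */

    return seen_after_crash == seen_before_crash;
}
```

In this model the resent writes walk back over values 2 and 3, so a client that cached data under attribute 3 before the crash cannot tell that the file may now hold something different.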

I don't know if the problem would be easy to hit in practice.

For a fix: we'd rather not invalidate all caches on every boot. We can't know which inodes are affected as that would require a disk write before allowing dirtying of any pages. Especially if there's a possibility of multiple reboots and network partitions I don't think we even know which boots are affected (maybe this is a boot after a clean shutdown but we still have a back-in-time change attribute left over from a previous crash).

Maybe a simple fix would be: rather than making the change attribute a simple 64-bit counter, put the current Unix time in the top 32 bits and a counter in the bottom 32 bits. Print a warning and congratulations to the log the first time anyone manages to sustain more than 4 billion writes in a second....
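The layout proposed above could be sketched roughly as follows. The helper names are hypothetical and exist only for illustration; nothing here is taken from the kernel.

```c
#include <stdint.h>

/* Hypothetical helper: seconds in the top 32 bits, counter in the bottom 32. */
uint64_t change_attr_pack(uint32_t unix_seconds, uint32_t counter)
{
    return ((uint64_t)unix_seconds << 32) | counter;
}

/* Hypothetical helper: extract the time half of a change attribute. */
uint32_t change_attr_seconds(uint64_t attr)
{
    return (uint32_t)(attr >> 32);
}

/* Hypothetical helper: extract the per-second counter half. */
uint32_t change_attr_counter(uint64_t attr)
{
    return (uint32_t)attr;
}
```

Because the time occupies the high bits, any value generated in a later second compares greater than every value generated in an earlier second, whatever the counter had reached.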

Deferring mtime and ctime updates

Posted Aug 23, 2013 17:34 UTC (Fri) by dlang (guest, #313) [Link] (7 responses)

keep in mind that the in-memory version should only be used if you are NOT exporting the file via NFS.

So you don't have to worry about clients after a reboot.

Deferring mtime and ctime updates

Posted Aug 23, 2013 18:45 UTC (Fri) by bfields (subscriber, #19510) [Link] (6 responses)

You're suggesting ensuring that any pending ctime/mtime/change attribute updates be committed to disk before responding to an nfs stat? I'm not sure that's practical.

Deferring mtime and ctime updates

Posted Aug 24, 2013 1:31 UTC (Sat) by dlang (guest, #313) [Link] (5 responses)

remember that the NFS spec requires that any writes to a NFS volume must be safe on disk before the write completes.

This requires a fsync after every write, which absolutely kills performance (unless you avoid ext3 and you have NVRAM or battery backed cache to write to), updating the attribute at the same time seems to be required by the standard.

Now, many people configure their systems outside the standard, accepting less data reliability in the name of performance, but if you are trying to provide all of the NFS guarantees, you need to update the timestamp after every write

This is why it's a _really_ bad idea to put things like Apache logs on NFS, unless you have a server with a lot of NVRAM to buffer your writes, and even then it's questionable.

Deferring mtime and ctime updates

Posted Aug 24, 2013 2:18 UTC (Sat) by mjg59 (subscriber, #23239) [Link] (3 responses)

I… think that reminding the maintainer of the kernel NFS server how NFS works might be a touch unnecessary.

Deferring mtime and ctime updates

Posted Aug 24, 2013 2:28 UTC (Sat) by dlang (guest, #313) [Link] (2 responses)

could be (and for the record, I didn't recognize that was who he was), but I've seen people manage to miss obvious things before in their area of expertise (and I've done it myself)

If I'm wrong about my understanding of what NFS requires, I'm interested in being corrected; I'll learn something and be in a better position to set up or troubleshoot in the future.

David Lang

Deferring mtime and ctime updates

Posted Aug 24, 2013 19:45 UTC (Sat) by bfields (subscriber, #19510) [Link] (1 responses)

No problem, I can overlook the obvious....

But as jlayton says, what you describe is not the typical case for NFS since v3, and reverting to NFSv2-like behavior would be a significant regression in some common cases.

And on a quick check.... I think the Linux v4 client, as one example, does request the change attribute on every write (assuming it doesn't hold a delegation), so if attribute updates had to be committed before each reply, the server would be forcing a commit to disk on every write.

Deferring mtime and ctime updates

Posted Aug 24, 2013 20:11 UTC (Sat) by dlang (guest, #313) [Link]

Ok, I wasn't aware that newer versions of NFS had relaxed the standard (I've been dealing with NFS for a while, but for the last 10 years or so it's either been with home-grade machines that I didn't expect great performance from, or with EMC/Netapp high end devices that include a lot of NVRAM to handle writes fast anyway)

just so I can see if I've got the use cases correct, I am understanding that we have the following cases

1. no NFS: ctime and mtime updates can be deferred

2. NFSv2 in use: all writes are synchronous and ctime/mtime updates should be as well.

3. NFSv3+ in use: writes can be delayed (which should include ctime/mtime updates), unless the client says they can't, in which case NFSv2 rules apply

It seems to me that having a mount option like relctime or relmtime, where the timestamp gets written out when the file is closed or unmapped (munmap), when an fsync is done, or sooner if the kernel feels like it, should work (assuming NFS does flushes)

The only gap I can see is if the writes to the file are being done locally (mmap for example), then the writes may not be visible to NFS clients immediately, but if this is a mount option like relatime is, people who care about this case just don't use the mount option and get the old (slower but reliable) mode.

Deferring mtime and ctime updates

Posted Aug 24, 2013 11:23 UTC (Sat) by jlayton (subscriber, #31672) [Link]

> remember that the NFS spec requires that any writes to a NFS volume must
> be safe on disk before the write completes. This requires a fsync after
> every write, which absolutely kills performance (unless you avoid ext3 and
> you have NVRAM or battery backed cache to write to), updating the
> attribute at the same time seems to be required by the standard.

That was true for NFSv2, but NFSv3 and later allow you to do UNSTABLE writes. Those don't need to be written to stable storage until the client issues a COMMIT (though the server is free to write them out earlier if it needs to). Most clients (Linux's included) use UNSTABLE writes for the bulk of the writes they do. STABLE (NFSv2-ish) writes are still used in some cases, but only where we deem it more efficient to do it that way.
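As a rough illustration of the difference, here is a toy model that tracks byte counts only and does no real I/O. The `stable_how` values match the NFSv3 protocol definition; the struct and the `nfs_model_*` functions are hypothetical.

```c
#include <stddef.h>

/* Values as defined for the NFSv3 WRITE procedure. */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Hypothetical server-side model of one file: byte counts only. */
struct file_model {
    size_t cached_bytes;  /* accepted, not yet on stable storage */
    size_t stable_bytes;  /* safe on disk */
};

/* WRITE: UNSTABLE data may sit in cache; other modes are durable before the reply. */
void nfs_model_write(struct file_model *f, size_t count, enum stable_how how)
{
    if (how == UNSTABLE)
        f->cached_bytes += count;
    else
        f->stable_bytes += count;
}

/* COMMIT: everything cached so far must reach stable storage. */
void nfs_model_commit(struct file_model *f)
{
    f->stable_bytes += f->cached_bytes;
    f->cached_bytes = 0;
}

/* Crash before COMMIT: cached data is lost and the client must resend it. */
void nfs_model_crash(struct file_model *f)
{
    f->cached_bytes = 0;
}
```

In this model a run of UNSTABLE writes followed by one COMMIT costs a single flush, whereas the NFSv2-style behaviour corresponds to every write taking the non-UNSTABLE branch.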

Deferring mtime and ctime updates

Posted Aug 24, 2013 11:28 UTC (Sat) by jlayton (subscriber, #31672) [Link] (2 responses)

> Maybe a simple fix would be: instead of making the change attribute a
> simple 64-bit counter, instead put current unix time in the top 32 bits
> and a counter in the bottom 32 bits. Print a warning and congratulations
> to the log the first time anyone manages to sustain more than 4 billion
> writes in a second....

I suspect it wouldn't be too hard to hit that mark ;)

This scheme might work, but you'd still have the same problem that all caches would end up invalidated when the server reboots. You're quite correct that that *is* a problem that can crush an NFS server if it has a lot of clients dealing with large files. I do think we'll need to come up with some scheme that avoids that.

Deferring mtime and ctime updates

Posted Aug 24, 2013 19:59 UTC (Sat) by bfields (subscriber, #19510) [Link] (1 responses)

> I suspect it wouldn't be too hard to hit that mark ;)

I'm not so sure, but actually this is really only a problem if the counter wraps around *and* a client's two successive stats manage to hit the same value each time through, which sounds pretty unlikely.

> This scheme might work, but you'd still have the same problem that all
> caches would end up invalidated when the server reboots.

I'm suggesting replacing inode_inc_version by something that does this instead of just i_version++. So existing change attributes wouldn't change on reboot. It'd just ensure that when we write the file again, we choose a genuinely new change attribute and not one we might have used on the previous boot.
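One way the increment step might look, as a sketch only: `next_change_attr` is a hypothetical stand-in for the `i_version++` step, and the rule for what happens within a single second (keep counting from the current value) is an assumption, not something spelled out above.

```c
#include <stdint.h>

/*
 * Hypothetical replacement for "i_version++": the top 32 bits carry the
 * Unix time of the write, the bottom 32 bits a per-second counter.
 */
uint64_t next_change_attr(uint64_t current, uint32_t now_seconds)
{
    uint64_t floor = (uint64_t)now_seconds << 32;

    /* First write in a new second: jump ahead, counter restarts at 0. */
    if (current < floor)
        return floor;

    /* Same second (or clock not ahead of the stored value): keep counting. */
    return current + 1;
}
```

Only writes call this, so a value already stored on disk is left alone across a reboot. The first write after a reboot jumps to the current second, which is ahead of anything handed out before the crash as long as the clock has moved past the second of the last pre-crash write.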

Deferring mtime and ctime updates

Posted Aug 26, 2013 16:25 UTC (Mon) by raven667 (subscriber, #5198) [Link]

> manage to hit the same value each time through, which sounds pretty unlikely

Just throwing this out there because I'm not an expert on this, but it might also be useful to look at it from a security perspective: how could someone intentionally cause this to fail? If there is a thing which can fail, then someone is going to try very hard to make it fail just to screw up your system if they can.