Journal flushing

Posted Apr 29, 2021 17:28 UTC (Thu) by tytso (subscriber, #9993)
In reply to: Journal flushing by Wol
Parent article: Preventing information leaks from ext4 filesystems

Um, no. There's no danger. While we are checkpointing, we are forcing the dirty metadata blocks to be written to the file system. Only *after* all of the dirty metadata blocks are written, do we truncate the journal.

In general, file systems should not take ages to checkpoint a transaction. On average, we will force a journal commit every 5 seconds (or perhaps sooner if fsync is called). As part of the commit processing, once the commit block is written, all of the metadata buffers will be marked dirty, and 30 seconds later, the normal buffer cache writeback will start. Hence in practice, the only writes that we need to do when doing a full checkpoint are the metadata blocks that were touched in the last 30 seconds, plus any pending writeback.

Also, normally, we don't actually try to checkpoint all completed transactions. If we checkpoint a transaction that completed, say, 50 seconds ago, and all of its dirty buffers have been written back (or taken over by a newer transaction), then we don't need to do any I/O before we declare, "yep, this transaction no longer needs to be kept in the journal; we can reuse that space for newer transactions". So normally, checkpointing old transactions requires no I/O at all, and we can move the tail of the journal without needing to do any synchronous writes.
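The "taken over by a newer transaction" rule can be sketched in a few lines of Python. This is a toy model with made-up names, not the actual jbd2 code: a transaction can be dropped from the journal once every buffer it logged is either already written back or has been re-logged by a later transaction.

```python
from dataclasses import dataclass, field

@dataclass
class Buffer:
    block_nr: int
    written_back: bool = False
    owner_tid: int = 0          # newest transaction that logged this block

@dataclass
class Transaction:
    tid: int
    buffers: list = field(default_factory=list)

def can_checkpoint(txn: Transaction) -> bool:
    """True if dropping txn from the journal requires no I/O."""
    return all(b.written_back or b.owner_tid > txn.tid
               for b in txn.buffers)

# Transaction 1 logged blocks 10 and 11.  Block 10 has since been
# written back; block 11 was re-logged ("taken over") by transaction 2,
# which is now responsible for it.
b10 = Buffer(10, written_back=True, owner_tid=1)
b11 = Buffer(11, written_back=False, owner_tid=2)
t1 = Transaction(1, [b10, b11])
print(can_checkpoint(t1))       # → True: the tail can advance, no I/O
```

In the common case every old transaction passes this test, which is why moving the journal tail is usually free.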

Now, if the journal is too small, or the block device is too slow, it's possible that this will not free enough space in the journal for newer transactions. In that case, we will need to do a synchronous checkpoint where we actually force the buffers to be written back so we can free space in the journal. While this is happening, file system mutations are blocked, so this is terrible for file system performance. Fortunately, this rarely happens, and in modern versions of mke2fs, we create file systems with larger journals to prevent this from happening.
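That fallback can be modeled in miniature (hypothetical names throughout; nothing here is the real jbd2 interface): the journal is a fixed-size log, old transactions are normally reclaimed for free, and only when that fails to make enough room do we force the oldest transaction's dirty buffers out synchronously.

```python
JOURNAL_BLOCKS = 8        # a deliberately tiny journal

def journal_free(journal):
    used = sum(len(t["dirty"]) + len(t["clean"]) for t in journal)
    return JOURNAL_BLOCKS - used

def lazy_checkpoint(journal):
    """Drop leading transactions that need no I/O (the common, cheap case)."""
    while journal and not journal[0]["dirty"]:
        journal.pop(0)

def reserve(journal, need):
    """Make room for a new transaction needing `need` journal blocks."""
    lazy_checkpoint(journal)                  # usually enough, no I/O
    while journal and journal_free(journal) < need:
        oldest = journal[0]
        oldest["clean"] |= oldest["dirty"]    # forced synchronous writeback;
        oldest["dirty"].clear()               # mutations would stall here
        lazy_checkpoint(journal)

# Two old transactions fill most of the tiny journal; reserving space
# for a large new one forces the synchronous path.
journal = [{"dirty": {10, 11}, "clean": set()},
           {"dirty": set(), "clean": {12, 13, 14}}]
reserve(journal, 6)
print(journal_free(journal))    # → 8: both old transactions reclaimed
```

A larger `JOURNAL_BLOCKS` makes the forced branch correspondingly rarer, which is the point of mke2fs creating bigger journals.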

The only time we need to do a full, synchronous checkpoint, to completely empty the journal, is when we unmount the file system, or if we are trying to freeze the file system in order to take a snapshot. But even then, it's perfectly safe, because we don't actually truncate the journal until the metadata writeback is complete.



Journal flushing

Posted Apr 30, 2021 0:30 UTC (Fri) by Wol (subscriber, #4433)

> Hence in practice, the only writes that we need to do when doing a full checkpoint are the metadata blocks that were touched in the last 30 seconds, plus any pending writeback.

So you're saying that - in normal circumstances - ext has a maximum of 30 secs of writes in flight. Anything older than that SHOULD have been flushed through to the underlying layers (or binned as having been overwritten :-)

That's good news from ext. I'm just wondering whether other filesystems and layers provide similar guarantees. It may just be that I'm misunderstanding the terminology, but I just don't know ...

Cheers,
Wol

Journal flushing

Posted Apr 30, 2021 4:21 UTC (Fri) by tytso (subscriber, #9993)

Well, "maximum of 30 seconds in flight" isn't guaranteed. After 30 seconds, we will start the writeback. However, writeback is not instantaneous. Synchronous reads are higher priority than background write operations. So if there are processes that are continuously reading files in order to do useful work, those reads will be prioritized ahead of the writebacks.

Secondly, if the metadata block is modified so it's part of a newer checkpoint, that block's writeback will be suppressed. So if a particular inode table block is getting constantly modified every ten seconds, that inode table block won't get written back. So.... if you are appending to a log file every five seconds, such that the inode size is changing, then that inode table block will be involved in every journal transaction (assuming no fsyncs). So if you have a super-reliable file system device, and a super-UN-reliable external journal device, then yeah, it's possible that if the unreliable external journal device explodes, you might lose the inode table block. But that's a stupid, stupid configuration. Don't do that. (And indeed, most people don't use external journals, so this is really an unrealistic, or at least uncommon, configuration.)

You could also complain that if you had a RAID 0 configuration, where one disk is super-reliable, and the other disk is super-fragile, the chances of data loss are much larger. So is having 3 tires on a car that are brand new, and one front tire with completely worn-out treads and broken steel belts sticking out the side; that could kill people, even though 3 of the tires are expensive, brand-new tires. But there are always unwise ways you can configure any system, and the fault is not that you chose to use 3 expensive brand-new tires. It's the fact that one of the tires is not like the others. :-)

As far as other systems are concerned, in general, the journal is a circular buffer, and it's not the case that a file system or database would wait until the write-ahead log is 100% full, and then all useful work stops while the log is written to the file system or database, and then the log is truncated down to be completely empty. People would be really unhappy with the performance, for one thing --- imagine how you would feel if you were using a web browser, and all of a sudden, KER-CHUNK --- checkpointing now, and the web browser freezes until the checkpoint is done. As they say, that would not be a great user experience. :-) So in general, writeback happens in the background, in parallel with other file system or database operations.
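The circular-buffer bookkeeping itself is just head/tail arithmetic (a generic sketch, not ext4's on-disk layout): the writer appends at the head, background checkpointing advances the tail, and free space is the gap between them modulo the log size.

```python
LOG_SIZE = 1024   # log length in blocks (arbitrary for this sketch)

def free_space(head, tail):
    # One block is kept unused so that a full log is distinguishable
    # from an empty one (head == tail means empty).
    return (tail - head - 1) % LOG_SIZE

print(free_space(head=10, tail=10))    # → 1023: empty log
print(free_space(head=100, tail=20))   # → 943: head has wrapped past tail
```

Because the tail is advanced in the background, appends at the head normally never have to wait for it.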


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds