
Journal flushing

Posted Apr 27, 2021 21:41 UTC (Tue) by corbet (editor, #1)
In reply to: Preventing information leaks from ext4 filesystems by Paf
Parent article: Preventing information leaks from ext4 filesystems

My phrasing was a bit sloppy there, sorry.

What was meant by "flushing the journal" was "playing the journal out to the filesystem proper". Yes, the journal is written to persistent storage before any other filesystem metadata changes are made. Once the data is in the journal, though, finishing the job becomes less important - the data is saved and won't be lost. The new ioctl() forces the journal to be reflected in the filesystem metadata, then optionally does FITRIM to clear the data from the journal.



Journal flushing

Posted Apr 27, 2021 22:05 UTC (Tue) by tytso (subscriber, #9993) [Link] (10 responses)

If we want to be super nit-picky, what we are doing is "checkpointing the journal". This means assuring that everything is written to the final location on disk, so that it is no longer necessary to replay the journal after a crash. We checkpoint the journal when we freeze a file system before taking a snapshot, and as the last step before unmounting the file system.

It's not really a case of replaying the journal, because we don't actually read the data back from the journal. The dirty blocks (dirty because they haven't been written to the disk yet) are still in the buffer cache, so we are writing the contents of those blocks from the buffer cache. Once all of the dirty metadata blocks are written out, it is safe to truncate the journal, as the journal is no longer needed.

After a crash, or an unplanned power cycle, that's when we replay the journal. In that case, we are reading the contents of the blocks from the journal, and then writing them back to their correct locations in the file system. Since we are reading from the journal, that's why we are calling it "playback" or "replay".
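
The distinction drawn here can be sketched as a toy model (the structures and names below are illustrative, not real ext4/jbd2 internals): a checkpoint writes dirty blocks out of the in-memory cache and never reads the journal, while replay after a crash reads the journal because the cache is gone.

```python
# Toy model of checkpoint vs. replay. "journal" records committed
# metadata updates, "cache" holds the dirty in-memory copies, and
# "disk" stands in for the filesystem proper. Illustrative only.

def checkpoint(cache, journal, disk):
    """Write dirty blocks from the cache (not the journal) to disk,
    then truncate the journal -- no journal reads involved."""
    for block, data in cache.items():
        disk[block] = data
    cache.clear()
    journal.clear()          # safe only after the writes above complete

def replay(journal, disk):
    """Crash recovery: the cache is gone, so read each logged update
    back out of the journal and write it to its final location."""
    for block, data in journal:
        disk[block] = data

# Normal operation: a commit puts the update in the journal and leaves
# the block dirty in the cache; the checkpoint then empties both.
journal = [("inode_7", "size=4096")]
cache = {"inode_7": "size=4096"}
disk = {}

checkpoint(cache, journal, disk)
print(disk)      # {'inode_7': 'size=4096'}
print(journal)   # [] -- nothing left to replay
```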

Why yes, file system developers are sometimes really anal-retentive. Why do you ask? :-)

Journal flushing

Posted Apr 27, 2021 22:25 UTC (Tue) by Paf (subscriber, #91811) [Link]

Thanks (both) for the clarification, that's roughly what I figured but it's nice to have that clarified and detailed.

"Why yes, file system developers are sometimes really anal-retentive. Why do you ask? :-)"
Speaking as one, I have no idea. ;)

Journal flushing

Posted Apr 29, 2021 3:20 UTC (Thu) by dgc (subscriber, #6611) [Link] (8 responses)

> If we want to be super nit-picky, what we are doing is "checkpointing the journal".

If we want to be super nit-picky, this is what *ext4 developers* call "checkpointing the journal". :/

In XFS, a "journal checkpoint" is what we write to the journal (XFS_TRANS_CHECKPOINT) and recover from the journal during replay. What you've defined as "checkpointing the journal" is called "quiescing the log" in XFS. See xfs_log_quiesce() for details.

And, if we are really going to pick nits, what does "checkpoint the journal" mean to a filesystem with copy-on-write metadata and no journal?

-Dave.

Journal flushing

Posted Apr 29, 2021 3:48 UTC (Thu) by tytso (subscriber, #9993) [Link] (7 responses)

It's not just ext4 developers. It's also the terminology used by SQLite and Postgres when using write-ahead logging:

"Of course, one wants to eventually transfer all the transactions that are appended in the WAL file back into the original database. Moving the WAL file transactions back into the database is called a "checkpoint" -- https://sqlite.org/wal.html

"Put simply, a checkpoint is a point in the transaction sequence at which all data files (where tables and indexes reside) are guaranteed to be written and flushed to disk to reflect the data modified in memory by regular operation. All those changes have previously been written and flushed to WAL." https://www.commandprompt.com/blog/the_write_ahead_log/

Journal flushing

Posted Apr 29, 2021 11:23 UTC (Thu) by Wol (subscriber, #4433) [Link] (6 responses)

So to use tytso's terminology, between the file I/O layer and the file system checkpointing the journal, A MIRRORED FILE SYSTEM IS IN DANGER. That's what I was getting at - I get the impression (maybe I'm wrong, because of all the confusing terminology thrown around like Humpty Dumpty) that some filesystems can take absolutely ages to run their checkpoint.

If I've got a couple of hours' data waiting to be checkpointed, I would be well pissed off if the journal drive died and I discovered that work from a long time ago hadn't made it to the mirror ...

I appreciate it's not easy, but users expect every layer to complete in the shortest time practical, not hang around on a "why bother rushing" basis.

Cheers,
Wol

Journal flushing

Posted Apr 29, 2021 17:28 UTC (Thu) by tytso (subscriber, #9993) [Link] (2 responses)

Um, no. There's no danger. While we are checkpointing, we are forcing the dirty metadata blocks to be written to the file system. Only *after* all of the dirty metadata blocks are written, do we truncate the journal.

In general, file systems should not take ages to checkpoint a transaction. On average, we will force a journal commit every 5 seconds (or perhaps sooner if fsync is called). As part of the commit processing, once the commit block is written, all of the metadata buffers will be marked dirty, and 30 seconds later, the normal buffer cache writeback will start. Hence in practice, the only writes that we need to do when doing a full checkpoint are the metadata blocks that were touched in the last 30 seconds, plus any pending writeback.

Also, normally, we don't actually try to checkpoint all completed transactions. If we checkpoint a transaction that completed, say, 50 seconds ago, and all of its dirty buffers have been written back (or taken over by a newer transaction), then we don't need to do any I/O before we declare, "yep, this transaction no longer needs to be kept in the journal; we can reuse that space for newer transactions". So normally, we just try to checkpoint all old transactions without needing to do any I/O, and we can move the tail of the journal without needing to do any synchronous I/O.

Now, if the journal is too small, or the block device is too slow, it's possible that this will not free enough space in the journal for newer transactions. In that case, we will need to do a synchronous checkpoint where we actually force the buffers to be written back so we can free space in the journal. While this is happening, file system mutations are blocked, so this is terrible for file system performance. Fortunately, this rarely happens, and in modern versions of mke2fs, we create file systems with larger journals to prevent this from happening.

The only time we need to do a full, synchronous checkpoint, to completely empty the journal, is when we unmount the file system, or if we are trying to freeze the file system in order to take a snapshot. But even then, it's perfectly safe, because we don't actually truncate the journal until the metadata writeback is complete.
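
A rough sketch of the tail-advance logic described in the last three paragraphs (hypothetical structures, not jbd2's actual code): transactions whose buffers were already written back are freed without any I/O, and only under journal pressure (or at unmount/freeze) are the remaining dirty buffers forced out synchronously.

```python
# Illustrative model of advancing the journal tail. Each transaction
# carries the buffers it dirtied; a buffer already written back by the
# normal 30-second writeback is no longer dirty.

def checkpoint_tail(log, force=False):
    """Advance the journal tail. Transactions whose buffers were already
    written back are dropped for free; with force=True (journal full, or
    unmount/freeze), remaining dirty buffers are synchronously written
    first. Returns the number of forced writes performed."""
    writes = 0
    while log:
        dirty = [b for b in log[0]["blocks"] if b["dirty"]]
        if dirty and not force:
            break                      # can't free this transaction cheaply
        for b in dirty:                # forced, synchronous writeback
            b["dirty"] = False
            writes += 1
        log.pop(0)
    return writes

# Two old transactions already written back, one still dirty.
log = [{"blocks": [{"dirty": False}]},
       {"blocks": [{"dirty": False}]},
       {"blocks": [{"dirty": True}]}]
print(checkpoint_tail(log))              # 0 -- two transactions freed with no I/O
print(len(log))                          # 1
print(checkpoint_tail(log, force=True))  # 1 -- the journal-pressure path
```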

Journal flushing

Posted Apr 30, 2021 0:30 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> Hence in practice, the only writes that we need to do when doing a full checkpoint are the metadata blocks that were touched in the last 30 seconds, plus any pending writeback.

So you're saying that - in normal circumstances - ext has a maximum of 30 secs of writes in flight. Anything older than that SHOULD have been flushed through to the underlying layers (or binned as having been overwritten :-)

That's good news from ext. I'm just wondering whether other filesystems and layers provide similar guarantees. It may just be that I'm misunderstanding the terminology, but I just don't know ...

Cheers,
Wol

Journal flushing

Posted Apr 30, 2021 4:21 UTC (Fri) by tytso (subscriber, #9993) [Link]

Well, "maximum of 30 seconds in flight" isn't guaranteed". After 30 seconds, we will start the writeback. However, writeback is not instaneous. Sumchronous reads are higher priority than background write operations. So if there are processes that are continuously reading files in order to do useful work, those will be prioritized ahead of the write backs.

Secondly, if the metadata block is modified so it's part of a newer checkpoint, that block's writeback will be suppressed. So if a particular inode table block is getting constantly modified every ten seconds, that inode table block won't get written back. So.... if you are appending to a log file every five seconds, such that the inode size is changing, then that inode table block will be involved in every journal transaction (assuming no fsyncs). So if you have a super-reliable file system device, and a super-UN-reliable external journal device, then yeah, it's possible that if the unreliable external journal device explodes, you might lose the inode table block. But that's a stupid, stupid configuration. Don't do that. (And indeed, most people don't use external journals, so this is really an unrealistic, or at least uncommon, configuration.)
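
The writeback-suppression behaviour described above can be illustrated like this (an assumed ownership map, not the real jbd2 mechanism): a block taken over by a newer transaction is skipped by the older checkpoint, since the newer transaction now owns it.

```python
# Illustrative sketch: each block is "owned" by the newest transaction
# that dirtied it; an older checkpoint only writes back blocks it still
# owns. Names and structures are invented for this example.

def blocks_to_write(txn, newest_owner):
    """Return the blocks this transaction must write back itself; blocks
    re-dirtied by (taken over by) a newer transaction are skipped."""
    return [b for b in txn["blocks"] if newest_owner[b] == txn["id"]]

# Inode table block 12 was modified again in transaction 2, so
# transaction 1 no longer needs to write it back.
owner = {12: 2, 34: 1}
txn1 = {"id": 1, "blocks": [12, 34]}
print(blocks_to_write(txn1, owner))   # [34]
```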

You could also complain that if you had a RAID 0 configuration, where one disk is super-reliable, and the other disk is super-fragile, the chances of data loss are much larger. The same goes for having 3 tires on a car that are brand new, and one front tire whose tread is completely worn out, with broken steel belts sticking out the side; that could kill people, even though 3 of the tires are expensive, brand new tires. But there are always unwise ways you can configure any system, and the fault is not that you chose to use 3 expensive brand new tires. It's the fact that the fourth tire is not like the others. :-)

As far as other systems are concerned, in general, the journal is a circular buffer, and it's not the case that a file system or database would wait until the write-ahead log is 100% full, and then stop all useful work while the log is written to the file system or database, and then truncate the log down to be completely empty. People would be really unhappy with the performance, for one thing --- imagine how you would feel if you were using a web browser, and all of a sudden, KER-CHUNK --- checkpointing now, and the web browser freezes until the checkpoint is done. As they say, that would not be a great user experience. :-) So in general, writeback happens in the background, in parallel with other file system or database operations.
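
As a sketch of that circular-buffer behaviour (sizes and names invented): commits advance the head, checkpointing advances the tail, and foreground work only ever pauses long enough to free the space it needs, rather than draining the whole log.

```python
# Minimal model of a circular write-ahead log. New commits advance the
# head; checkpointing advances the tail one transaction at a time, so a
# full log causes a short stall, never a full drain.

class CircularLog:
    def __init__(self, size):
        self.size = size
        self.head = 0          # next slot to write
        self.tail = 0          # oldest un-checkpointed slot
        self.used = 0

    def commit(self, checkpoint_one):
        # Free just enough space instead of draining the whole log,
        # so foreground work never sees a long "KER-CHUNK" pause.
        while self.used == self.size:
            checkpoint_one()
            self.tail = (self.tail + 1) % self.size
            self.used -= 1
        self.head = (self.head + 1) % self.size
        self.used += 1

stalls = []
log = CircularLog(size=4)
for i in range(10):
    log.commit(lambda: stalls.append(1))
print(len(stalls))   # 6 -- only enough checkpoints to make room, one at a time
```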

Journal flushing

Posted Apr 29, 2021 17:43 UTC (Thu) by zlynx (guest, #2285) [Link] (2 responses)

When people want lots of reliability they also mirror any journal devices in the system.

For example, the ZFS SLOG is often a single Flash or Optane or NVRAM device. And for many situations this works very well.

However, for installations that can't allow losing the transactions within a 5 second window during failure of that single SLOG device, you can use two. Or three.

This same consideration needs to be done for external EXT4 journals. Usually the journal is on the same device, so if the data is mirrored the journal is also mirrored.

Journal flushing

Posted Apr 29, 2021 20:21 UTC (Thu) by tytso (subscriber, #9993) [Link] (1 responses)

Right, I was really unclear about what Wol was concerned about. A mirrored file system doesn't make things more unsafe. If you have an external journal, the fact that you have an external journal does add an extra point of failure, sure. But that has nothing to do with the mirrored file system.

As I said in my direct reply to Wol, after 30 seconds, a dirtied metadata block will be subject to writeback. So if you are using an external journal device, and it dies, then there will certainly be some metadata that may have only been written to the journal, but which hasn't been written back to the file system, where you may end up losing that metadata update. But hours of data waiting to be checkpointed? I'm not sure when that would *ever* be the case. I guess if you had a super slow USB thumb drive, it might take a while to sync the file system and unmount the file system, but it's mainly the data blocks that take all of the time, not the journal checkpoint. But even with the slowest USB thumb drive, it's never taken me an hour to write out everything so the unmount can proceed. So I'm not sure what Wol was talking about in that case.

Journal flushing

Posted Apr 30, 2021 0:37 UTC (Fri) by Wol (subscriber, #4433) [Link]

> Right, I was really unclear about what Wol was concerned about. A mirrored file system doesn't make things more unsafe. If you have an external journal, the fact that you have an external journal does add an extra point of failure, sure. But that has nothing to do with the mirrored file system.

Yup, that was it. If stuff gets delayed in the journal and not flushed to the mirror, you have a single point of failure. From what you've said ext only has a minimal risk window, which is great :-)

Maybe I'm naive (probably am :-) but I was thinking of journals as being a simplistic "write log, write to permanent storage, delete log", and thinking "if stuff gets to the log and the system is tardy about the next step, then that's not good ..."

Cheers,
Wol

Journal flushing

Posted Apr 28, 2021 8:01 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

And, as the OP who started this thread, and who is also interested in RAID :-), I'm thinking of situations where we have the file system on a mirror, and the journal on a single disk ...

The longer it takes for the journal to flush, the longer the user is being misled as to the safety of said data ...

(And yes, I know disasters are unlikely, but would you like to lose hours or days of supposedly mirrored data because your single journal disk died ...?)

Cheers,
Wol

Journal flushing

Posted Apr 28, 2021 14:29 UTC (Wed) by Paf (subscriber, #91811) [Link]

Wol,

The journal has always been able to remain in this state for arbitrary lengths of time; this change just performs the flush in new circumstances.

And, the journal not being flushed doesn’t mean the data/metadata isn’t on the primary storage/written to the main file system. It is, it’s just also recorded in the journal. This is - as I understand it - essentially a journal “clean” or “purge”. At a time when the journal and the disk are known to be in accord, the journal is purged because it’s known to be not necessary. (The journal handles the case of crashing when there is data/metadata that is not known to have reached permanent storage in the actual file system.)

If there *are* any outstanding changes to the file system, I would guess this also ensures those reach permanent storage on the primary file system, and then purges the associated journal entries, but that’s a transient state while operations are in progress.

The journal is not kept forever. This is just making sure it’s gone at certain times. (Hope I’ve got that right. ;) )


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds