
Preventing information leaks from ext4 filesystems

By Jonathan Corbet
April 27, 2021
A filesystem's role is to store information and retrieve it in its original form on request. But filesystems are also expected to prevent the retrieval of information by people who should not see it. That requirement extends to data that has been deleted; users expect that data to be truly gone and will not welcome its reappearance in surprising places. Some work being done with ext4 shows the kind of measures that are required to live up to that expectation.

In early April, Leah Rumancik posted a two-patch series making a couple of small changes to the ext4 filesystem implementation. The first of those caused the filesystem to, after a file is deleted, overwrite the space (on disk) where that file's name was stored. In response to a question about why this was needed, ext4 maintainer Ted Ts'o explained that it was meant to deal with the case where users were storing personally identifiable information (PII) in the names of files. When a file of that nature is removed, the user would like to be sure that the PII is no longer stored on the disk; that means wiping out the file names as well.

Dave Chinner quickly objected to this explanation. The real problem, he argued, is that users are storing PII as clear text; wiping directory entries is not a real solution to that problem:

This sounds more and more like "Don't encode PII in clear text anywhere" is a best practice that should be enforced with a big hammer. Filenames get everywhere and there's no easy way to prevent that because path lookups can be done by anyone in the kernel. This so much sounds like you're starting a game of whack-a-mole that can never be won.

From a security perspective, this is just bad design. Storing PII in clear text filenames pretty much guarantees that the PII will leak because it can't be scrubbed/contained within application controlled boundaries. Trying to contain the spread of filenames within random kernel subsystems sounds like a fool's errand to me, especially given how few kernel developers will even know that filenames are considered sensitive information from a security perspective...

The problem with that approach, as Ts'o explained, is that the people involved may not even know what their "legacy workloads" are doing in this regard. Rather than risk the possibility of exposing PII that nobody even knew was there, he said, it is better to simply clear the file names.

Of course, if a file's name constitutes PII, its contents are likely to be interesting as well. It is possible, though expensive, to overwrite the data in files when they are deleted. An alternative, on modern storage devices at least, is to use the FITRIM ioctl() command to tell the drive to discard the data. Even if this operation does not physically erase that data, it should make the data inaccessible afterward. For this reason (and others), administrators who are concerned about the persistence of deleted data tend to arrange for FITRIM operations to be run regularly.
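
For the curious, the user-space side of a manual trim is small; the sketch below is roughly what the fstrim(8) utility does when administrators schedule those periodic runs. The mount point is made up for illustration.

    #include <fcntl.h>
    #include <limits.h>
    #include <linux/fs.h>      /* FITRIM, struct fstrim_range */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        struct fstrim_range range = {
            .start  = 0,
            .len    = ULLONG_MAX,   /* trim the whole filesystem */
            .minlen = 0,            /* no minimum extent size */
        };
        /* Any file or directory on the mounted filesystem will do. */
        int fd = open("/mnt/data", O_RDONLY);

        if (fd < 0 || ioctl(fd, FITRIM, &range) != 0) {
            perror("FITRIM");
        } else {
            /* The kernel updates .len to the number of bytes trimmed. */
            printf("trimmed %llu bytes\n", (unsigned long long)range.len);
        }
        if (fd >= 0)
            close(fd);
        return 0;
    }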

There is one little remaining issue, though, for the truly paranoid. Ext4, like many contemporary filesystems, uses journaling to ensure that the filesystem remains consistent even if the system is interrupted at an inopportune time. To implement that sort of robustness, metadata written to ext4 filesystems is also written to the journal (and file data can optionally be written there too), so the possibility exists that the journal will contain PII even after all traces of it have been removed from the rest of the filesystem. A bad actor who gains access to the disk could harvest that data from the journal, even though it had already been deleted from the filesystem.

To address this problem, Rumancik's second patch adds a new ioctl() command, called FS_IOC_CHKPT_JRNL, that forces the journal to be flushed out to persistent storage. There is an optional flag, CHKPT_JRNL_DISCARD, which causes the filesystem to run the equivalent of a FITRIM operation on the journal itself once the flush is complete. That ensures that no PII remains in the journal itself.
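
For a sense of how an administrative tool might call the proposed interface, here is a sketch using the names from the patch posting. The ioctl number, the argument type, and even whether the final names survive review are assumptions on my part, so treat this purely as an illustration.

    #include <fcntl.h>
    #include <linux/ioctl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Placeholder definitions; the real values would come from kernel headers
     * once (and if) the patch is merged, possibly under different names. */
    #ifndef FS_IOC_CHKPT_JRNL
    #define FS_IOC_CHKPT_JRNL   _IOW('f', 99, unsigned int)
    #endif
    #ifndef CHKPT_JRNL_DISCARD
    #define CHKPT_JRNL_DISCARD  0x1U
    #endif

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/mnt/data"; /* any path on the fs */
        unsigned int flags = CHKPT_JRNL_DISCARD; /* also discard the journal blocks */
        int fd = open(path, O_RDONLY);

        if (fd < 0 || ioctl(fd, FS_IOC_CHKPT_JRNL, &flags) != 0)
            perror("journal checkpoint");
        if (fd >= 0)
            close(fd);
        return 0;
    }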

Chinner (in the same email linked above) suggested that this behavior should be an option on the FITRIM operation rather than a separate command — before noting that FITRIM has no "flags" field and, thus, cannot be extended. Perhaps, he suggested, it is time for a separate fstrim() system call that could also trim the journal on request. A separate system call would also, by default, make the functionality available to all filesystems rather than being an ext4-specific feature.

The two patches together are intended to help administrators ensure that data that has been deleted from a filesystem truly disappears. It looks like they will be taking separate paths into the kernel from here, though. Rumancik recently posted a new version of the file-name overwrite patch on its own, and Ts'o subsequently applied it. The journal side of the problem, Ts'o said, is going to require some more discussion. Eventually, though, one can expect solutions to both problems to find their way into the kernel. That will help prevent the accidental disclosure of sensitive information from ext4 filesystems, even if the user is storing it in ill-advised ways.

Index entries for this article
Kernel: Filesystems/ext4
Security: Information leak



Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 16:23 UTC (Tue) by dullfire (guest, #111432) [Link] (10 responses)

This sounds like it would be better solved by marking the containing dir as having encrypted contents, and then just encrypting the file names (in addition to the file contents).

That would also protect the file contents, and it would clearly mark the data as 'to-be-scrubbed'.

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 22:18 UTC (Tue) by dcg (subscriber, #9198) [Link] (1 responses)

This is what APFS (Apple's new file system) does. I don't remember the details well, but IIRC there is a per-subvolume key that is used to encrypt the data in that volume (there are other cryptographic features whose keys are mixed with this one in some way, I think). The key is kept in some special metadata block, and the act of deleting all the data securely consists in simply erasing that key - no need to overwrite all the blocks.

This method could potentially be implemented on a per-file basis, but the metadata overhead would be too much, I guess.

Preventing information leaks from ext4 filesystems

Posted May 6, 2021 7:40 UTC (Thu) by bluss (guest, #47454) [Link]

for fsck and repair, I hope they have some copies of that key or the superblock, so that you can't lose the whole disk in one bit flip?

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 0:10 UTC (Wed) by jlayton (subscriber, #31672) [Link] (6 responses)

This is actually available on ext4 as well via fscrypt, and I'm surprised no one has suggested it. Surely the truly paranoid should use fscrypt (or maybe LUKS if that's an option).
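
For readers who have not used it, here is a rough sketch of what "encrypt the directory, names included" looks like at the ioctl level. It only applies a policy to an empty directory; the master key must be added separately (with FS_IOC_ADD_ENCRYPTION_KEY or the userspace fscrypt tool), and the constants and struct layout are quoted from linux/fscrypt.h as I remember them, so treat this as an assumption-laden sketch rather than a recipe.

    #include <fcntl.h>
    #include <linux/fscrypt.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        struct fscrypt_policy_v1 policy;
        /* Hypothetical empty directory on a filesystem with the encrypt
         * feature enabled (e.g. tune2fs -O encrypt). */
        int fd = open("/mnt/secure-dir", O_RDONLY);

        memset(&policy, 0, sizeof(policy));
        policy.version = FSCRYPT_POLICY_V1;
        policy.contents_encryption_mode = FSCRYPT_MODE_AES_256_XTS;
        policy.filenames_encryption_mode = FSCRYPT_MODE_AES_256_CTS;
        policy.flags = FSCRYPT_POLICY_FLAGS_PAD_32;
        /* policy.master_key_descriptor would identify the key added earlier. */

        if (fd < 0 || ioctl(fd, FS_IOC_SET_ENCRYPTION_POLICY, &policy) != 0)
            perror("FS_IOC_SET_ENCRYPTION_POLICY");
        if (fd >= 0)
            close(fd);
        return 0;
    }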

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 0:42 UTC (Wed) by tytso (subscriber, #9993) [Link] (5 responses)

It was suggested, and the reason why encryption is not always the best answer was described both on the ext4 mailing list and further down in the comments on this article.

The short version is, where and how do you store the encryption key, especially for a cloud VM, which is the target use case for this feature? Ext4 encryption, which was later generalized to fscrypt, is designed to protect against off-line attacks, such as the "evil hotel maid" attack where you leave your mobile handset in the hotel room and someone carries out an off-line attack against the storage device. The use case for zero'ing out the directory entry when it is unlinked is to make sure any sensitive information is removed the moment it is no longer necessary. This is more of an on-line property.

The tradeoff is that you can no longer as easily use debugfs's "lsdel" command to find deleted directory entries. On the other hand, because of how ext3 handles truncates, we've made finding the data blocks of deleted inodes impractical for decades, and no one has really complained (see the ext4 mailing list for all of the details). So if you can no longer find the data blocks easily, is finding the inode number from a deleted directory entry all that important? For modern file systems, "rm is forever" and the solution for accidental deletion of files is to refer to your backups.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 9:48 UTC (Wed) by dullfire (guest, #111432) [Link] (4 responses)

> The short version is, where and how to you store the encryption key, especially for a cloud VM, which is the target use case for this feature?

So I'm just going to stop right here to point out: If you are attempting to prevent the machine owners from getting PII because you stuck it, in the clear, in a file name, you are screwed in so many ways that it doesn't really matter how the fs scrubs the file name. (Backups, the VM's virtual HD using any sort of remapping, the list could go on and on.)

If someone really wants to do this: the only sane thing is to encrypt it, and stuff the key in the inode itself (after all it's not really secret to the system.... your kernel will always know what files its direct child processes are accessing). Obviously if keys are to be deleted they are always scrubbed... because it's a key.

Basically don't store PII in a clear fs that anyone can access. It's a terrible idea.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 9:53 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> So I'm just going to stop right here to point out: If you are attempting to prevent the machine owners from getting PII because you stuck it, in the clear, in a file name, you are screwed in so many ways that it doesn't really matter how the fs scrubs the file name. (Backups, the VM's virtual HD using any sort of remapping, the list could go on and on.)
The problem is not hiding the PII from the machine's owner, but making sure it's truly deleted when the user requests that.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 11:43 UTC (Wed) by lsl (guest, #86508) [Link] (2 responses)

Does overwriting file names actually accomplish that, though? That depends on the cloud's block storage implementation.

In other words, this feature's usefulness is to the cloud operator who's in a position to assess the effect this overwriting has on the actual stored data.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 11:47 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> Does overwriting file names actually accomplish that, though?
It's a part of the overall puzzle.

For data block cleanup, TRIM is supposed to make the data inaccessible (except maybe through a very thorough forensic scan).

Preventing information leaks from ext4 filesystems

Posted Apr 29, 2021 11:13 UTC (Thu) by Wol (subscriber, #4433) [Link]

> > Does overwriting file names actually accomplish that, though?

> It's a part of the overall puzzle.

Yup. What everybody seems to be missing is that ext* is saying "I've done my bit, it's not my problem any more".

Ext can't provide guarantees ELSEWHERE in the stack, but it can provide its own guarantees, which is the intent here. It's not passing the buck, it's taking responsibility for what it can. If others shirk their responsibility, it's not ext's problem.

Cheers,
Wol

Preventing information leaks from ext4 filesystems

Posted May 8, 2021 3:15 UTC (Sat) by SomeOtherGuy (guest, #151918) [Link]

Encryption on a directory-per-directory or file-by-file basis is not that good, but as pretty much every CPU (for us x86-64 users anyway) has AES instructions (cat /proc/cpuinfo | grep aes) we're good. IIRC the Sandy Bridge i3s didn't have it, Skylake i3s do; not sure about the generations in between.

There's really no reason not to use it, but even then we should write random crap to the drive first (though this hurts SSDs). Most of us probably don't, and this means sectors filled with 0s leak "there's no data here"; an attacker who can image the drive, even if you did write random crap first, can at least see which sectors are changing.

This has its own problems, like watermarking attacks, but that was something solved in 2.63 or even older kernels; however, it can still leak some information.

File names, I've gotta say, along with directory names, shouldn't need to be "scrubbed" - I'm not even sure we over-write the data (I don't think we do?) - but obviously if a file is allocated a range, that range should be zeroed or otherwise not actually readable by the file until it writes there.

TheTimeIDidTheCrime.txt - if it's that sensitive it should be on an encrypted drive.

A netbook of mine was once stolen (good riddance, hated that fad and the tiny keyboard and the Atom CPU) - full disk encryption means I get to write about it with a smile not worrying about whatever was on there being a problem for me (not to sound dodgy but you see what I mean)

If we did do this, symlinks may also need protecting, and NFS/sshfs-type setups too - otherwise it'd be bad.

So I doubt encryption-in-the-fs would be safe anyway; it'd certainly complicate things (see the watermarking attack I mentioned). I really like our block-device solution of LUKS and the like, letting ext4 operate on /dev/mapper/whatever as if it were a drive.

We can also pass through trim and the like (at the cost that the drive now contains zeros) - but it helps our SSDs.

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 17:35 UTC (Tue) by dxin (guest, #136611) [Link] (4 responses)

This, together with case-insensitive file names and other legacy-oriented anti-features, is making ext4 more and more complex and, at the same time, less interesting.

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 21:33 UTC (Tue) by ddevault (subscriber, #99589) [Link]

Full agreement regarding case insensitivity. But this is a case that I'd like to see addressed, as a service operator. I am obligated by law to ensure that personal information is deleted properly, and having better tools to be sure of this is a welcome change. Aye, data should be encrypted - but the most secure data is data that you don't have.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 6:45 UTC (Wed) by ibukanov (subscriber, #3942) [Link] (2 responses)

A case-insensitive file system is very helpful for cross-compiling Windows code on Linux. Include statements in that code often contain the wrong case, like #include <windows.h> where it should be Windows.h.

Now, one can fix it with various workarounds like using symlinks for all case variations used in code but I am glad that I do not need to do that.
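
For what it's worth, ext4's optional casefold support makes this a per-directory attribute; below is a rough sketch of turning it on programmatically (roughly what chattr +F does), assuming a filesystem created with mkfs.ext4 -O casefold and an empty directory. The flag and ioctl names are taken from linux/fs.h as I recall them.

    #include <fcntl.h>
    #include <linux/fs.h>      /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_CASEFOLD_FL */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int flags;
        /* Hypothetical empty directory that should get case-insensitive lookups. */
        int fd = open("/mnt/winsrc", O_RDONLY);

        if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &flags) != 0) {
            perror("FS_IOC_GETFLAGS");
            return 1;
        }
        flags |= FS_CASEFOLD_FL;   /* fold case on lookups in this directory */
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) != 0)
            perror("FS_IOC_SETFLAGS");
        close(fd);
        return 0;
    }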

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 7:05 UTC (Wed) by taladar (subscriber, #68407) [Link]

Applying case-insensitive semantics to a filesystem for use-cases like this seems like something that should be part of a separate layer on top of existing filesystems similar to bindfs.

Preventing information leaks from ext4 filesystems

Posted Apr 29, 2021 22:54 UTC (Thu) by sandsmark (guest, #62172) [Link]

Doesn't winegcc etc. already do this?

And last time I had to clean up this kind of mess, "automatically" letting it handle case-insensitivity would have broken things, since the first include in the checked paths wasn't always the right one if you checked case insensitively.

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 17:46 UTC (Tue) by ikm (subscriber, #493) [Link] (3 responses)

This seems wrong on so many levels. Why is this specifically about PII? There's no shortage of other sensitive types of information that can be stored. Why is this ext4-specific? Other filesystems would happily leak the same information. Instead of trying to scrub the data from all the places where it could possibly linger, why not just store it encrypted? Looks like someone is trying to solve a specific problem of their own in the wrong place. I for one certainly don't want to always spend resources scrubbing file names when files are deleted - on a CI server, for instance, it would be a colossal waste.

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 20:01 UTC (Tue) by SLi (subscriber, #53131) [Link]

How do you ensure you lose the ability to decrypt it without wiping it and without losing the ability to decrypt names of files that haven't been deleted?

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 20:12 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

> Why is this specifically about PII? There's no shortage of other sensitive types of information that can be stored.

1. PII is subject to GDPR and other laws and regulations. While other types of information are also subject to such regulation, GDPR is explicitly a *general* regulation that applies to all businesses that transact in PII (in practice, most of them).
2. Lots of things are PII despite seemingly not being "sensitive" - GDPR uses a rather generous interpretation which includes basically anything connected to a specific, identifiable individual, such as an IP address.
3. If a file pertains to a specific individual, it is difficult to imagine a scheme in which a name could be assigned to that file without that name becoming PII itself. Even a randomly-generated opaque identifier is generally considered PII if it is linked to a specific individual.
4. Encrypted filenames must still be unique, and are therefore still unique identifiers and might still be subject to GDPR (depending on which lawyers you ask).

As a result, PII is a pervasive and highly common type of sensitive information that can easily find itself in metadata or other ancillary storage, which might not have the same level of protection as the underlying data. It is also very difficult to handle from "inside the system," because of point (4).

> Why is this ext4-specific? Other filesystems would happily leak the same information.

Because everyone uses ext4. But fixing other filesystems is probably a good idea, too.

> Instead of trying to scrub the data from all the places where it could possibly linger, why not just store it encrypted?

Full-disk encryption is probably a good idea. But it has the problem of being imprecise; it may be desirable to hand out privileges in a finer-grained manner than "either you can read the whole disk, or you can't."

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 21:39 UTC (Tue) by tytso (subscriber, #9993) [Link]

First of all, encryption is not free. There is real cost in encrypting file names, and the encryption key needs to be available when the file system is mounted. Where are you going to store the key? In /etc/fstab as a mount option? How do you prevent an attacker from gaining access to it?

Secondly, the cost of zero'ing the directory entry is negligible compared to the cost of writing the directory block to disk, which we have to do anyway. Even with the fastest SSD, writing zeroes to memory is going to be super-small in comparison. Main memory access is on the order of ~100 nanoseconds; SSD access is on the order of ~150 microseconds. So we're talking 2 or 3 orders of magnitude.

This is why we're not gating this on a mount option. It's just not worth it, because it would be a mount option we would have to support forever, compared to the overhead of zeroing directory entries which won't be measurable in benchmarks.

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 17:57 UTC (Tue) by re:fi.64 (subscriber, #132628) [Link]

This seems like a terrible default, at least it can be flipped off I guess...

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 18:04 UTC (Tue) by mss (subscriber, #138799) [Link] (1 responses)

This seems like a rather fragile solution, especially when using a SSD, where discard operations are usually (nearly always?) best-effort with respect to the content of the actual backing flash chips.

It's rather surprising to see this feature in ext4 specifically as this filesystem is one of the very few that support filesystem-level encryption.

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 21:55 UTC (Tue) by tytso (subscriber, #9993) [Link]

Getting deleted information from an SSD requires physical access to the storage device or the ability to somehow install evil firmware into the device. Neither attack is particularly trivial, compared to being able to send read requests using standard SATA/SCSI/NVMe requests from the host OS.

On virtual machines in cloud platforms such as Google's GCP, Amazon's AWS, Microsoft's Azure, etc. you don't get access to the raw block devices; most of these hyperscale cloud providers don't let SSDs and HDDs leave the data center except in tiny pieces[1]. So for that use case, zero'ing the directory entries is quite sufficient and robust in terms of protecting against information leaks.

[1] https://youtu.be/kd33UVZhnAA?t=291

(And, no, companies shouldn't be using social security numbers as filenames.... but enterprise customers do the darnedest things. And even if they don't, it's a lot easier for them to tell their compliance auditors that they are using file systems that protect them even if they _do_ have some legacy software somewhere in their enterprise software stack that might be doing something stupid....)

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 18:10 UTC (Tue) by JoeBuck (guest, #2330) [Link] (3 responses)

I hope that this can be turned off. For a machine that is internal to a company, there's no reason to waste any cycles making file deletion more expensive. The disk can be scrubbed when it is disposed of.

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 19:13 UTC (Tue) by james (guest, #1325) [Link] (2 responses)

Depends on your legal system and regulatory framework: GDPR says "Data must be stored for the shortest time possible.".

Having said that, insisting that deleted data is then scrubbed would be a very exacting interpretation of a law that's widely ignored. Other regulatory regimes might be actually enforced.

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 21:37 UTC (Tue) by JoeBuck (guest, #2330) [Link] (1 responses)

The file names I'm thinking of are mainly randomly generated temporary file names, in very large numbers, that do not refer to people but that are used to distribute jobs over a compute grid. In cases where data protection laws are relevant, the filesystem code is the wrong level of abstraction to handle the problem, and if there's an issue, zeroing out the filename but leaving the deleted data on disk won't do.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 9:04 UTC (Wed) by james (guest, #1325) [Link]

Fascinating to see the use-cases people come up with.

You can come up with whatever naming standards you want inside an organisation, but if you're receiving documents about people from partner organisations (for example, over HTTPS or email) that are going to be handled by users with conventional desktop programs, the filesystem is what you have to work with.

Also, encryption doesn't help when the regulation insists you can't get the data back -- not without some complicated key rotation scheme (read: likely to cause problems). (And the fine article handles the case of deleting file data.)

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 18:14 UTC (Tue) by Wol (subscriber, #4433) [Link] (20 responses)

> To address this problem, Rumancik's second patch adds a new ioctl() command, called FS_IOC_CHKPT_JRNL, that forces the journal to be flushed out to persistent storage.

As someone interested in this stuff ... this feels just so wrong to me that it hasn't been implemented before! How many filesystems don't flush the journal even on shutdown? How much stuff can you have in the journal that hasn't been flushed to "permanent storage"? And if you do have a serious problem, how much grief will it cause you if you have to not only hexedit the filesystem to recover, but hexedit the journal on top!

There seems to be (I may be wrong) this assumption that "once it's flushed to journal it's safe". It's almost as if they think that you could store the entire filesystem in the journal without actually ever flushing anything to permanent storage!

NTFS, XFS, and how many others all seem to require you to "rerun the journal" on boot to get a working filesystem, even after a clean shutdown!?

Cheers,
Wol

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 18:53 UTC (Tue) by sbaugh (guest, #103291) [Link] (1 responses)

>It's almost as if they think that you could store the entire filesystem in the journal without actually ever flushing anything to permanent storage!

Sure, that's a completely valid approach to a filesystem: https://en.wikipedia.org/wiki/Log-structured_file_system

Preventing information leaks from ext4 filesystems

Posted Apr 29, 2021 11:15 UTC (Thu) by Wol (subscriber, #4433) [Link]

No problem with that, if the *user* asks for a filesystem like that. It's when the journal is a half-way stage that breaks guarantees the user thought they had.

Cheers,
Wol

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 21:35 UTC (Tue) by Paf (subscriber, #91811) [Link] (14 responses)

Journals are flushed to permanent storage all the time. Journals are assumed to be permanent in general. I think - I would love to be corrected here, though - this is more about windows where the journal has not been committed yet.

I’m a little confused about the framing of this all together, to be honest. If the journal isn’t going to permanent storage what is it even *for*?

Journal flushing

Posted Apr 27, 2021 21:41 UTC (Tue) by corbet (editor, #1) [Link] (13 responses)

My phrasing was a bit sloppy there, sorry.

What was meant by "flushing the journal" was "playing the journal out to the filesystem proper". Yes, the journal is written to persistent storage before any other filesystem metadata changes are made. Once the data is in the journal, though, finishing the job becomes less important - the data is saved and won't be lost. The new ioctl() forces the journal to be reflected in the filesystem metadata, then optionally does FITRIM to clear the data from the journal.

Journal flushing

Posted Apr 27, 2021 22:05 UTC (Tue) by tytso (subscriber, #9993) [Link] (10 responses)

If we want to be super nit-picky, what we are doing is "checkpointing the journal". This means assuring that everything is written to the final location on disk, so that it is no longer necessary to replay the journal after a crash. We checkpoint the journal when we freeze a file system before taking a snapshot, and as the last step before unmounting the file system.

It's not really a case of replaying the journal, because we don't actually read the data back from the journal. The dirty blocks (dirty because they haven't been written to the disk yet) are still in the buffer cache, so we are writing the contents of those blocks from the buffer cache. Once all of the dirty metadata blocks are written out, then it is safe to truncate the journal, as the journal is no longer needed any more.

After a crash, or an unplanned power cycle, that's when we replay the journal. In that case, we are reading the contents of the blocks from the journal, and then writing them back to their correct locations in the file system. Since we are reading from the journal, that's why we are calling it "playback" or "replay".

Why yes, file system developers are sometimes really anal-retentive. Why do you ask? :-)

Journal flushing

Posted Apr 27, 2021 22:25 UTC (Tue) by Paf (subscriber, #91811) [Link]

Thanks (both) for the clarification, that's roughly what I figured but it's nice to have that clarified and detailed.

"Why yes, file system developers are sometimes really anal-retentive. Why do you ask? :-)"
Speaking as one, I have no idea. ;)

Journal flushing

Posted Apr 29, 2021 3:20 UTC (Thu) by dgc (subscriber, #6611) [Link] (8 responses)

> If we want to be super nit-picky, what we are doing is "checkpointing the journal".

If we want to be super nit-picky, this is what *ext4 developers* call "checkpointing the journal". :/

In XFS, a "journal checkpoint" is what we write to the journal (XFS_TRANS_CHECKPOINT) and recover from the journal during replay. What you've defined as "checkpointing the journal" is called "quiescing the log" in XFS. See xfs_log_quiesce() for details.

And, if we are really going to pick nits, what does "checkpoint the journal" mean to a filesystem with copy-on-write metadata and no journal?

-Dave.

Journal flushing

Posted Apr 29, 2021 3:48 UTC (Thu) by tytso (subscriber, #9993) [Link] (7 responses)

It's not just ext4 developers. It's also the terminology used by SQLite and Postgres when using write-ahead logging:

"Of course, one wants to eventually transfer all the transactions that are appended in the WAL file back into the original database. Moving the WAL file transactions back into the database is called a "checkpoint" -- https://sqlite.org/wal.html

"Put simply, a checkpoint is a point in the transaction sequence at which all data files (where tables and indexes reside) are guaranteed to be written and flushed to disk to reflect the data modified in memory by regular operation. All those changes have previously been written and flushed to WAL." https://www.commandprompt.com/blog/the_write_ahead_log/

Journal flushing

Posted Apr 29, 2021 11:23 UTC (Thu) by Wol (subscriber, #4433) [Link] (6 responses)

So to use tytso's terminology, between the file i/o layer, and the file system checkpointing the journal, A MIRRORED FILE SYSTEM IS IN DANGER. That's what I was getting at - I get the impression (maybe I'm wrong, because of all the confusing terminology thrown around like Humpty Dumpty) that some filesystems can take absolutely ages to run their checkpoint.

If I've got a couple of hour's data waiting to be checkpointed, I would be well pissed off if the journal drive died and I discovered that work from a long time ago hadn't made it to the mirror ...

I appreciate it's not easy, but users expect every layer to complete in the shortest time practical, not hang around on a "why bother rushing" basis.

Cheers,
Wol

Journal flushing

Posted Apr 29, 2021 17:28 UTC (Thu) by tytso (subscriber, #9993) [Link] (2 responses)

Um, no. There's no danger. While we are checkpointing, we are forcing the dirty metadata blocks to be written to the file system. Only *after* all of the dirty metadata blocks are written, do we truncate the journal.

In general, file systems should not take ages to checkpoint a transaction. On average, we will force a journal commit every 5 seconds (or perhaps sooner if fsync is called). As part of the commit processing, once the commit block is written, all of the metadata buffers will be marked dirty, and 30 seconds later, the normal buffer cache writeback will start. Hence in practice, the only writes that we need to do when doing a full checkpoint are the metadata blocks that were touched in the last 30 seconds, plus any pending writeback.

Also, normally, we don't actually try to checkpoint all completed transactions. If we checkpoint a transaction that completed, say, 50 seconds ago, and all of its dirty buffers have been written back (or taken over by a newer transaction), then we don't need to do any I/O before we declare, "yep, this transaction no longer needs to be kept in the journal; we can reuse that space for newer transactions". So normally, we just try to checkpoint all old transactions without needing to do any I/O, and we can move the tail of the journal without needing to do any synchronous I/O.

Now, if the journal is too small, or the block device is too slow, it's possible that this will not free enough space in the journal for newer transactions. In that case, we will need to do a synchronous checkpoint where we actually force the buffers to be written back so we can free space in the journal. While this is happening, file system mutations are blocked, so this is terrible for file system performance. Fortunately, this rarely happens, and in modern versions of mke2fs, we create file systems with larger journals to prevent this from happening.

The only time we need to do a full, synchronous checkpoint, to completely empty the journal, is when we unmount the file system, or if we are trying to freeze the file system in order to take a snapshot. But even then, it's perfectly safe, because we don't actually truncate the journal until the metadata writeback is complete.

Journal flushing

Posted Apr 30, 2021 0:30 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> Hence in practice, the only writes that we need to do when doing a full checkpoint are the metadata blocks that were touched in the last 30 seconds, plus any pending writeback.

So you're saying that - in normal circumstances - ext has a maximum of 30 secs of writes in flight. Anything older than that SHOULD have been flushed through to the underlying layers (or binned as having been overwritten :-)

That's good news from ext. I'm just wondering whether other filesystems and layers provide similar guarantees. It may just be mis-understanding the terminology, but I just don't know ...

Cheers,
Wol

Journal flushing

Posted Apr 30, 2021 4:21 UTC (Fri) by tytso (subscriber, #9993) [Link]

Well, "maximum of 30 seconds in flight" isn't guaranteed". After 30 seconds, we will start the writeback. However, writeback is not instaneous. Sumchronous reads are higher priority than background write operations. So if there are processes that are continuously reading files in order to do useful work, those will be prioritized ahead of the write backs.

Secondly, if the metadata block is modified so it's part of a newer checkpoint, that block's writeback will be suppressed. So if a particular inode table block is getting constantly modified every ten seconds, that inode table block won't get written back. So.... if you are appending to a log file every five seconds, such that the inode size is changing, then that inode table block will be involved in every journal transaction (assuming no fsyncs). So if you have a super-reliable file system device, and a super-UN-reliable external journal device, then yeah, it's possible that if the unreliable external journal device explodes, you might lose the inode table block. But that's a stupid, stupid configuration. Don't do that. (And indeed, most people don't use external journals, so this is really an unrealistic, or at least uncommon, configuration.)

You could also complain that if you had a RAID 0 configuration, where one disk is super-reliable, and the other disk is super-fragile, the chances of data loss are much larger. So is having 3 tires on a car that are brand new, and one front tire with the tread completely worn out and broken steel belts sticking out the side; that could kill people, even though 3 of the tires are expensive, brand new tires. But there are always unwise ways you can configure any system, and the fault is not that you chose to use 3 expensive brand new tires. It's the fact that one of the other tires is not like the others. :-)

As far as other systems are concerned, in general, the journal is a circular buffer, and it's not the case that a file system or database would wait until the write-ahead log is 100% full, and then all useful work stops while the log is written to the file system or database, and then the log is truncated down to be completely empty. People would be really unhappy with the performance, for one thing --- imagine how you would feel if you were using a web browser, and all of a sudden, KER-CHUNK --- checkpointing now, and the web browser freezes until the checkpoint is done. As they say, that would not be a great user experience. :-) So in general, writeback happens in the background, in parallel with other file system or database operations.

Journal flushing

Posted Apr 29, 2021 17:43 UTC (Thu) by zlynx (guest, #2285) [Link] (2 responses)

When people want lots of reliability they also mirror any journal devices in the system.

For example, the ZFS SLOG is often a single Flash or Optane or NVRAM device. And for many situations this works very well.

However, for installations that can't allow losing the transactions within a 5 second window during failure of that single SLOG device, you can use two. Or three.

This same consideration applies to external ext4 journals. Usually the journal is on the same device, so if the data is mirrored the journal is also mirrored.

Journal flushing

Posted Apr 29, 2021 20:21 UTC (Thu) by tytso (subscriber, #9993) [Link] (1 responses)

Right, I was really unclear about what Wol was concerned about. A mirrored file system doesn't make things more unsafe. If you have an external journal, the fact that you have an external journal does add an extra point of failure, sure. But that has nothing to do with the mirrored file system.

As I said in my direct reply to Wol, after 30 seconds, a dirtied metadata block will be subject to writeback. So if you are using an external journal device, and it dies, then there will certainly be some metadata that may have only been written to the journal, but which hasn't been written back to the file system, where you may end up losing that metadata update. But having hours of data waiting to be checkpointed? I'm not sure when that would *ever* be the case. I guess if you had a super slow USB thumb drive, it might take a while to sync the file system and unmount the file system, but that's mainly the data blocks taking all of the time, not the journal checkpoint. But even with the slowest USB thumb drive, it's never taken me an hour to write out everything so the unmount can proceed. So I'm not sure what Wol was talking about in that case.

Journal flushing

Posted Apr 30, 2021 0:37 UTC (Fri) by Wol (subscriber, #4433) [Link]

> Right, I was really unclear about what Wol was concerned about. A mirrored file system doesn't make things more unsafe. If you have an external journal, the fact that you have an external journal does add an extra point of failure, sure. But that has nothing to do with the mirrored file system.

Yup, that was it. If stuff gets delayed in the journal and not flushed to the mirror, you have a single point of failure. From what you've said ext only has a minimal risk window, which is great :-)

Maybe I'm naive (probably am :-) but I was thinking of journals as being a simplistic "write log, write to permanent storage, delete log", and thinking "if stuff gets to the log and the system is tardy about the next step, then that's not good ..."

Cheers,
Wol

Journal flushing

Posted Apr 28, 2021 8:01 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

And, as the OP who started this thread, and who is also interested in RAID :-), I'm thinking of situations where we have the file system on a mirror, and the journal on a single disk ...

The longer it takes for the journal to flush, the longer the user is being misled as to the safety of said data ...

(And yes, I know disasters are unlikely, but would you like to lose hours or days of supposedly mirrored data because your single journal disk died ...?)

Cheers,
Wol

Journal flushing

Posted Apr 28, 2021 14:29 UTC (Wed) by Paf (subscriber, #91811) [Link]

Wol,

The journal has always been able to remain in this state for arbitrary lengths of time. This is in fact a change to do this in new circumstances.

And, the journal not being flushed doesn’t mean the data/metadata isn’t on the primary storage/written to the main file system. It is, it’s just also recorded in the journal. This is - as I understand it - essentially a journal “clean” or “purge”. At a time when the journal and the disk are known to be in accord, the journal is purged because it’s known to be not necessary. (The journal handles the case of crashing when there is data/metadata that is not known to have reached permanent storage in the actual file system.)

If there *are* any outstanding changes to the file system, I would guess this also ensures those reach permanent storage on the primary file system, and then purges the associated journal entries, but that’s a transient state while operations are in progress.

The journal is not kept forever. This is just making sure it’s gone at certain times. (Hope I’ve got that right. ;) )

Preventing information leaks from ext4 filesystems

Posted Apr 29, 2021 17:42 UTC (Thu) by nix (subscriber, #2304) [Link] (2 responses)

> How many filesystems don't flush the journal even on shutdown?
[...]
> There seems to be (I may be wrong) this assumption that "once it's flushed to journal it's safe".

Yes, that is what the journal is for :) the journal has to be usable to produce a filesystem in a consistent state after a power failure at any time, or it's useless: therefore, once it's flushed to the journal, metadata *is* safe. Given that, flushing the journal on shutdown rather than replaying it on startup is a more-or-less arbitrary decision which filesystem developers can and do make in either direction (ext4 goes one way, xfs goes the other).

(Obviously, things like data which are rarely journalled do have to get flushed to disk on shutdown. The data has to go *somewhere* persistent or it's lost.)

Preventing information leaks from ext4 filesystems

Posted Apr 30, 2021 0:44 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> (Obviously, things like data which are rarely journalled do have to get flushed to disk on shutdown. The data has to go *somewhere* persistent or it's lost.)

And this is my peeve with filesystem developers. Okay, I do understand that if metadata isn't saved then the data is toast, but as a user I don't give a monkeys about the metadata - it's the DATA that pays for the computer!

Cheers,
Wol

Preventing information leaks from ext4 filesystems

Posted Apr 30, 2021 9:24 UTC (Fri) by farnz (subscriber, #17727) [Link]

The metadata matters a lot more than the data because a loss of the metadata loses you all files on the filesystem, even older files. A loss of data loses recently written data, but leaves older data alone.

That said, ext3/4 have data=journal to get the exact same guarantees for data as metadata, and the default data=ordered which does not write metadata to the journal until the data is written to its final resting place. It's only data=writeback (which is fastest) that has an ordering issue.

Similarly, btrfs's default mode is equivalent to data=ordered (but with data checksums to detect bitflips), and it offers a flushoncommit mount option that gets you the same guarantees for your data as btrfs has for its metadata.

Basically, FS developers really do care about the data - it just doesn't seem like that because the metadata is needed to keep all the data safe, where the data is less critical to keeping all data safe.
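
To make the data= discussion concrete, here is a minimal sketch of asking for full data journalling from a program via mount(2); the device and mount point are hypothetical, and in practice this would normally be done through /etc/fstab or mount(8) instead.

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Equivalent to: mount -o data=journal /dev/vdb1 /mnt/secure */
        if (mount("/dev/vdb1", "/mnt/secure", "ext4", 0, "data=journal") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }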

Preventing information leaks from ext4 filesystems

Posted Apr 27, 2021 23:58 UTC (Tue) by gus3 (guest, #61103) [Link] (4 responses)

Ext4 and f2fs already have native support for encrypting files (and filenames). I don't understand why this is a problem.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 6:59 UTC (Wed) by ibukanov (subscriber, #3942) [Link] (3 responses)

As others have already pointed out above, encryption does not help with possible future malware infection that will look at deleted information.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 7:50 UTC (Wed) by comicfans (subscriber, #117233) [Link] (2 responses)

To my knowledge, if malware can look at deleted information even with encryption enabled (as in 'encryption does not help'), it must be a run-time attack (since encryption prevents cold attacks), and if it can bypass the OS filesystem layer restrictions (because you cannot access an already-deleted file's information through the normal fs API), it must have gained raw disk read permission (at least some level of administrator permission, for example access to debugfs). In that situation, without encryption the malware can already parse the whole filesystem, not just a filename. With encryption enabled, the malware at least needs to obtain the encryption key (otherwise the encryption still works). And if the malware can obtain the encryption key (whether from the kernel or from a badly-placed plain-text password), then the whole system is already cracked (almost). That means: if such malware does exist and makes encryption useless, then the whole OS is already insecure, and it can leak much more than just a filename. Encryption at least makes the attack harder.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 21:46 UTC (Wed) by ibukanov (subscriber, #3942) [Link] (1 responses)

I was talking about defense against future infection that can recover deleted information.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 22:10 UTC (Wed) by gus3 (guest, #61103) [Link]

Thanks to both of you for your comments. I see your points.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 0:31 UTC (Wed) by comicfans (subscriber, #117233) [Link] (2 responses)

Emm... what if a user deletes a file with important-information-as-filename by mistake and wants to restore it?

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 0:45 UTC (Wed) by tytso (subscriber, #9993) [Link] (1 responses)

See https://lwn.net/Articles/854689/ for the answer to your question.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 2:05 UTC (Wed) by comicfans (subscriber, #117233) [Link]

# "rm is forever" and the solution for accidental deletion of files is to refer to your backups.

I think if this is true, then "clearing the filename is not a solution for preventing sensitive information leaks" should also be true. Take a cloud VM as an example: whether you trust the provider or not, every memory bit is transparent to it. I can't see any advantage of filename-clearing over encryption (neither works there). If you're just afraid of sibling VMs belonging to others (a hardware side-channel attack, maybe), this may reduce the attack window, but it still doesn't resolve the problem; if such an attack exists, it can read more than just a filename.

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 10:32 UTC (Wed) by hummassa (guest, #307) [Link] (4 responses)

This seems to me a clear case of "wants A, does B". What is *wanted* is guaranteeing the PII protections in the law. I don't think this can be done at the filesystem level. The proposed "solutions" either don't work and leave breadcrumbs all over the system, or are just akin to scrubbing the whole volume all the time, and good luck with your SSDs. PII protections laws would be much better satisfied with policies of "collect less data" and "compartmentalize PII data whenever possible".

Preventing information leaks from ext4 filesystems

Posted Apr 28, 2021 13:42 UTC (Wed) by nim-nim (subscriber, #34454) [Link]

PII tends to leak everywhere, the more so because some apps take years to be re-engineered, while (evil) data miners are demonstrating every week their ability to extract more info from the smallest sliver of data.

Ultimately, because the goal posts change all the time, everything will need to be treated like PII with strong just-in-time erasure requirements.

This is no different from the general deprecation of http in favor of https.

The social costs of big data are mounting every day, but who has a reasonable plan to put horses back in the barn? The next years are going to out-Orwell the worst nightmare people imagined.

Preventing information leaks from ext4 filesystems

Posted Apr 29, 2021 11:28 UTC (Thu) by Wol (subscriber, #4433) [Link]

> I don't think this can be done on the filesystem level.

The point is, some information can be leaked at the filesystem level. So the filesystem taking responsibility for those things it has control over is a GOOD THING.

All it's doing is providing a guarantee that when a file is deleted, *all trace* of said file is deleted from the file system. Traces elsewhere, well the filesystem can't do anything about that. The file system can't be blamed for the existence of backups ...

Cheers,
Wol

Preventing information leaks from ext4 filesystems

Posted Apr 29, 2021 14:03 UTC (Thu) by kleptog (subscriber, #1183) [Link] (1 responses)

> PII protections laws would be much better satisfied with policies of "collect less data" and "compartmentalize PII data whenever possible".

This is precisely what happens though.

The story goes that when GDPR was originally passed there were many people that were asking where the checklists were so they could validate their systems. The answer of course is there is no such checklist. You are expected to analyse your own business processes, determine what could be considered PII, justify why you actually need that data and how you're going to protect it and then publish that information for customers.

The process is the important part, and it does make rigorous enforcement hard. On the other hand, it has made a lot of businesses think, for the first time, about why their business processes are the way they are.

There is still room for improvement though. My pet peeve is sites asking for your date of birth when what they really want to know is if you're over 18. Just use a checkbox please.

Preventing information leaks from ext4 filesystems

Posted May 3, 2021 13:01 UTC (Mon) by jezuch (subscriber, #52988) [Link]

>> PII protections laws would be much better satisfied with policies of "collect less data" and "compartmentalize PII data whenever possible".

> This is precisely what happens though.

And I'll add to that: that was precisely the intent of GDPR. Though not spelled out explicitly, because how would you do it?


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds