Log-structured file systems: There's one in every SSD
What is a log-structured file system?
Log-structured file systems, oddly enough, evolved from logging file systems. A logging (or journaling) file system is a normal write-in-place file system in the style of ext2 or FFS, just with a log of write operations bolted on to the side of it. (We'll use the term "journaling file system" in the rest of this article to avoid confusion between "logging" and "log-structured" file systems.) A journaling file system keeps the on-disk state of the file system consistent by writing a summary of each write operation to the log, stored somewhere non-volatile like disk (or NVRAM if you have the money), before writing the changes directly to their long-term place in the file system. This summary, or log record, contains enough information to repeat the entire operation if the direct write to the file system gets interrupted mid-way through (e.g., by a system crash); doing so is called replaying the log. So, in short, every change to the file system gets written to disk twice: once to the log, and once in the permanent location.

Around 1988, John K. Ousterhout and several collaborators realized that they could skip the second write entirely if they treated the entire file system as one enormous log. Instead of writing the operation to the log and then rewriting the changes in place somewhere else on the disk, the file system would just write it once to the end of the log (wherever that is) and be done with it. Writes to existing files and inodes are copy-on-write: the old version is marked as free space, and the new version is written at the end of the log. Conceptually, finding the current state of the file system is a matter of replaying the log from beginning to end. In practice, a log-structured file system periodically writes checkpoints to disk; these checkpoints describe the state of the file system at that point in time without requiring any log replay. Any changes made after the checkpoint are recovered by replaying the relatively small number of log entries following the checkpoint.
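To make the copy-on-write, append-to-the-log idea concrete, here's a minimal sketch in C; log_write() and the inode map are invented for illustration and don't correspond to Sprite LFS or any real file system:

    #include <stdio.h>

    #define LOG_BLOCKS 16

    static char log_data[LOG_BLOCKS][16];   /* the log, block by block */
    static int  log_tail;                   /* next free block at the log's tail */
    static int  block_of_inode[4] = { -1, -1, -1, -1 };  /* inode -> current log block */

    /* Copy-on-write: a "write" never updates a block in place.  The new version
     * goes to the end of the log and the old block becomes dead space for the
     * cleaner to reclaim later.  (Bounds checks elided for brevity.) */
    static int log_write(int inode, const char *contents)
    {
        int new_block = log_tail++;
        snprintf(log_data[new_block], sizeof(log_data[new_block]), "%s", contents);
        block_of_inode[inode] = new_block;  /* previous block, if any, is now garbage */
        return new_block;
    }

    int main(void)
    {
        log_write(0, "v1 of file 0");
        log_write(1, "v1 of file 1");
        log_write(0, "v2 of file 0");       /* appends; never overwrites block 0 */
        printf("inode 0 now lives in log block %d\n", block_of_inode[0]);
        return 0;
    }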
One of the interesting benefits of the log-structured file system (LFS) structure is that most writes to the file system are sequential. The section describing the motivation for Sprite LFS, written nearly 20 years ago, demonstrates how little has changed in the storage world:
But wait, why are we still talking about disk seeks? SSDs have totally changed the performance characteristics of storage! Disks are dead! Long live flash!
Surprisingly, log-structured file systems are more relevant than ever when it comes to SSDs. The founding assumption of log-structured file systems - that reads are cheap and writes are expensive - is emphatically true for the bare-metal building blocks of SSDs, NAND-based flash. (For the rest of this article, "flash" refers to NAND-based flash and "SSD" refers to a NAND-based flash device with a wear-leveling, write-gathering flash translation layer.) When it comes to flash, reads may be done at small granularities - a few hundred bytes - but writes must be done in large contiguous blocks - on the order of tens of thousands or hundreds of thousands of bytes. A write to flash takes two steps: First, the entire block is erased, setting all the bits to the same value (usually 1, counter-intuitively). Second, individual bits in the block are flipped back to 0 until the block contains the data you wanted.
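As a toy model of that two-step erase-then-program behavior (the block size and function names here are invented for illustration, not taken from any real flash part or driver):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define ERASE_BLOCK_BYTES 8   /* real erase blocks are far larger */

    static uint8_t block[ERASE_BLOCK_BYTES];

    /* Step one: an erase sets every bit in the block to 1. */
    static void flash_erase(void)
    {
        memset(block, 0xFF, sizeof(block));
    }

    /* Step two: programming can only flip bits from 1 to 0. */
    static int flash_program(const uint8_t *data, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            /* Asking to turn a 0 back into a 1 without an erase is an error. */
            if (data[i] & ~block[i])
                return -1;
            block[i] &= data[i];
        }
        return 0;
    }

    int main(void)
    {
        uint8_t payload[ERASE_BLOCK_BYTES] = "flash!";

        flash_erase();
        if (flash_program(payload, sizeof(payload)) == 0)
            printf("programmed: %s\n", (char *)block);
        return 0;
    }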
Log-structured file systems turn out to be a natural fit for flash. One of the details of the log-structured design is that the log is written in large contiguous chunks, called "segments," on the order of several megabytes in size. To cut down on metadata overhead and get the best performance, log entries are gathered up and written out sequentially to a completely free segment. At any given time, though, most segments are partially in use and partially free, so when the file system needs a fresh segment, it first cleans an existing partially-used segment by moving all of its in-use, or live, data to another free segment - basically, it garbage-collects. Once the live data is out of the way, the file system can do one big streaming write to the now-empty segment. This system of segments and cleaning is exactly what is needed to write efficiently to a flash device, given the necessity of erasing large contiguous blocks of flash before writing to them.
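Here's a minimal sketch of segment cleaning in C, under the simplifying assumption that a segment is just a small array of blocks with liveness flags; the structures and names are illustrative, not taken from Sprite LFS or any real implementation:

    #include <stdbool.h>
    #include <stdio.h>

    #define BLOCKS_PER_SEGMENT 4

    struct segment {
        int  data[BLOCKS_PER_SEGMENT];
        bool live[BLOCKS_PER_SEGMENT];
        int  used;                    /* number of live blocks */
    };

    /* Copy every live block out of the victim into dst; the victim ends up
     * completely free.  (Assumes dst has room; a real cleaner would check.) */
    static void clean_segment(struct segment *victim, struct segment *dst)
    {
        for (int i = 0; i < BLOCKS_PER_SEGMENT; i++) {
            if (!victim->live[i])
                continue;
            dst->data[dst->used] = victim->data[i];
            dst->live[dst->used] = true;
            dst->used++;
            victim->live[i] = false;
        }
        victim->used = 0;             /* whole segment is ready for a streaming write */
    }

    int main(void)
    {
        struct segment victim = { .data = { 10, 11, 12, 13 },
                                  .live = { true, false, true, false },
                                  .used = 2 };
        struct segment fresh  = { .used = 0 };

        clean_segment(&victim, &fresh);
        printf("victim now has %d live blocks, fresh has %d\n",
               victim.used, fresh.used);
        return 0;
    }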
The match between log-structured file systems and flash is obvious when you look at file systems written for the bare flash programming interface - that is, for devices without built-in wear-leveling or write-gathering. File systems that know about and have to manage erase blocks and other details of the flash hardware are almost invariably log-structured in design. The most widely used such file system for Linux is JFFS2, used in many embedded devices, such as ticket machines and seatback airline entertainment systems. More than once, I've boarded a plane and seen a JFFS2 error message reporting flash corruption on a hung seatback entertainment system. (Sadly, many thousands of people probably now associate the Tux penguin bootup logo with the inability to watch TV on long distance flights.)
For SSDs that export a disk-style block interface - most consumer-grade SSDs these days - the operating system uses a regular file system to talk to the SSD via the block interface (that is, read block #37 into this buffer, write this buffer to block #42, etc.). However, the overall system still contains the logical equivalent of a log-structured file system; it's just hidden inside the SSD. The firmware that implements wear-leveling, write-gathering, and any other features has to solve the same problems as a log-structured file system.
Most SSD manufacturers refuse to reveal any details of their internal firmware, but we can be fairly confident that it has a lot in common with log-structured file systems. First, the only way to implement efficient random writes is to buffer them and write them out to a single erase block together. This requires clearing an erase block, moving all the in-use blocks to another area, and keeping a mapping between the logical location of blocks and their physical locations - exactly what a log-structured file system does. Second, when we do get SSD implementation details from research publications, they look like log-structured file systems. Third, when we look at long-term performance testing of SSDs, we see the same pattern of performance degradation over time that we do with log-structured file systems. We'll talk about this in detail in the next section.
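Since manufacturers don't publish their firmware, the following sketch is only a guess at the kind of logical-to-physical remapping and write-gathering an SSD is believed to do; every name and structure in it is hypothetical:

    #include <stdio.h>

    #define LOGICAL_BLOCKS  8
    #define PHYSICAL_PAGES 16

    static int l2p[LOGICAL_BLOCKS];        /* logical block -> physical flash page */
    static int stale[PHYSICAL_PAGES];      /* 1 if the page holds a dead copy */
    static int next_free_page;

    /* A "random" logical write always lands on the next free physical page;
     * the page holding the old copy is simply marked stale for later cleaning. */
    static void ftl_write(int logical)
    {
        int old = l2p[logical];

        if (old >= 0)
            stale[old] = 1;                /* old copy is now garbage */
        l2p[logical] = next_free_page++;   /* write-gathering: always append */
    }

    int main(void)
    {
        for (int i = 0; i < LOGICAL_BLOCKS; i++)
            l2p[i] = -1;

        ftl_write(3);
        ftl_write(3);                      /* overwrite of logical block 3 */
        printf("logical 3 -> physical page %d; page %d is now stale\n",
               l2p[3], l2p[3] - 1);
        return 0;
    }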
Log-structured file system performance
Log-structured file systems are a natural fit for flash-based storage today, but back in 1990, they appeared to have great potential for disk-based file systems as well. Yet, as we all know, we're not using log-structured file systems on our disk-based laptops and servers. What happened?

In short, log-structured file systems performed relatively well as long as most of the segment cleaning - movement of live data out of a segment so it can be re-used - could be done in the background when the file system wasn't busy with "real" work. The first major follow-up paper on LFS [PDF] found that performance of LFS degraded by up to 40% from the best case at real-world levels of disk utilization, memory-to-disk ratio, and file system traffic: in the steady state, the file system was spending a significant amount of disk access time cleaning segments - moving old data out of a segment so it could be used for new writes. This segment cleaning problem was the subject of active research for at least another decade, but none of the solutions could consistently beat state-of-the-art write-in-place file systems at practical levels of disk utilization. It's a little bit like comparing garbage collection to explicit reference counting for memory management: when memory usage is low and the occasional high latency hit is okay, the convenience of garbage collection outweighs the performance benefits of explicit management. But at "high" levels of disk utilization - as little as 50% - the cleaning cost and the periodic high latencies waiting for space to be freed up become a problem.
As the first LFS paper showed, the key to good performance in a log-structured file system is to place data such that nearly empty segments are created about as quickly as they are used. The file system's write bandwidth is limited by the rate at which it can produce clean segments. The worst case happens when, in a file system that is X% full, every segment is also X% full. Producing one clean segment requires collecting the live data from

    N = ceiling(1 / (1 - X))

segments and writing that data out to N - 1 of those segments. For a disk utilization of 80%, we get

    N = ceiling(1 / (1 - 0.80)) = ceiling(1 / 0.20) = 5

segments to clean. If segments were 1MB in size, we'd have to read

    5 * 800KB = 4MB

of data seekily and write 4MB sequentially before we could write 1MB of new data. (Note to pedants: I'm using MB/KB in powers of 10, not 2.)
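Here is the same arithmetic as a small C program; the 80% utilization and 1MB segment size are just the figures from the example above (integer math keeps the ceiling exact):

    #include <stdio.h>

    int main(void)
    {
        int utilization_pct = 80;          /* every segment is 80% live */
        int segment_kb      = 1000;        /* 1MB segments, powers of 10 */

        /* Segments to clean to yield one free segment:
         * N = ceiling(1 / (1 - X)), computed as ceiling(100 / free_pct). */
        int free_pct = 100 - utilization_pct;
        int n = (100 + free_pct - 1) / free_pct;

        /* Live data that must be read out of, and rewritten into, segments. */
        int live_kb = n * segment_kb * utilization_pct / 100;

        printf("clean %d segments: read %dKB seekily and write %dKB sequentially "
               "to free %dKB for new data\n", n, live_kb, live_kb, segment_kb);
        return 0;
    }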
The best case, by contrast, is a file system with only two kinds of segments: completely full and completely empty. The ideal write pattern is one that changes all of the metadata and data in a single segment, so that when the new versions are written out, the old versions are no longer needed and the entire segment becomes free again. Reality lies somewhere between these two cases. The goal for a log-structured file system is to create a bimodal segment usage distribution: most segments are either very full or very empty, and the full segments tend to go unchanged. This turns out to be difficult to achieve.
SSDs have an extra interesting constraint: wear-leveling. Even in the best case in which most segments are 100% full and no writes ever change the data in them, the SSD must still move those segments around occasionally because it has to spread writes out over every available flash block. This adds an extra segment move in some cases and makes achieving good performance even harder than in a disk-based log-structured file system.
Lessons - learned?
It's great that SSD manufacturers can learn from two decades of prior work on log-structured file systems. What's not clear is whether they are doing so. Most manufacturers take a very closed approach to SSD firmware development - it's the secret sauce that turns cheap commodity flash with very low margins into extremely expensive, reliable, high-performance storage devices with high margins. Some manufacturers are clearly better at this task than others. Currently, manufacturers are taking the trade secret strategy for maintaining their competitive advantage - apply for patents on individual elements of the design, but keep the overall implementation a secret. The message to file systems developers is "Just trust us" and "Don't worry your pretty little systems programmers' heads about it" whenever we ask for more information on SSD implementation. You can't particularly argue with this strategy at present, but it tends to come from (and reinforce) the mindset that not only refuses to share information with the outside, but also ignores information from the outside, such as previously published academic work.
One of the greatest missed opportunities for optimization based on lessons learned from log-structured file systems is the slow adoption of TRIM support for SSDs. TRIM is a command to a block device informing it that a certain range of blocks is no longer in use by the file system - basically a free() call for blocks. As described earlier, the best performance comes when empty segments are created as a side effect of ongoing writes. As a simple example, imagine a segment that contains only a single inode and all of its file data. If the next set of writes to the file system overwrites all of the file data (and the inode as a side effect), then that segment becomes completely free and the file system doesn't have to move any live data around before it uses that segment again. The equivalent action for an SSD is to write to a block that has already been written in the past. Internally, the SSD knows that the old copy of that block is now free, and it can reuse it without copying its data elsewhere.
But log-structured file systems have a distinct advantage over pre-TRIM SSDs (basically all commercially available SSDs as of this writing, September 2009). Log-structured file systems know when on-disk data has been freed even when it isn't overwritten. Consider the case of deleting the one-segment file from the example above: the entire segment is freed, but no overwrite occurred. A log-structured file system knows this happened and now has a free segment to work with. All the SSD sees is a couple of tiny writes to other blocks on the disk. As far as it's concerned, the blocks used by the now-deleted file are still precious data in use by the file system, and it must continue to move that data around forever. Once every block in the device has been written at least once, the SSD is doomed to a worst-case performance state in which its spare blocks are at a minimum and data must be moved each time a new block is rotated into use.
As we've seen, the key to good performance in a log-structured file system is the availability of free or nearly-free segments. An SSD without TRIM support never learns that many of its blocks are actually free and accrues an immense performance disadvantage, which makes it somewhat shocking that any SSD ever shipped without the TRIM feature. My guess is that SSDs were initially performance-tested only with write-in-place file systems (cough, cough, NTFS) and low total file system usage (say, 70% or less).
Unfortunately, TRIM in its current form performs incredibly poorly, by both design and implementation: TRIM commands aren't tagged, so they can't be queued alongside other I/O, and at least one SSD takes hundreds of milliseconds to process a single TRIM command.

Kernel developers have debated exactly how to implement TRIM support at the Linux Plumbers Conference, at the Linux Storage and File System Workshop, and on mailing lists: what the performance cost of each TRIM is, what granularity TRIMs should have, how often they should be issued, and whether it's okay to forget or miss TRIM commands. In my opinion, the in-use/free state of a block on a TRIM-enabled device should be tracked as carefully as that of a page of memory. The file system implementation can take the form of explicit synchronous alloc()/free() calls, or else asynchronous garbage collection (during a file system check or scrubbing run), but we shouldn't "leak" in-use blocks for all the same reasons we don't leak memory.
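As a rough sketch of that alloc()/free() idea - with issue_trim() standing in for whatever discard mechanism the kernel and the hardware eventually provide, not any real API:

    #include <stdbool.h>
    #include <stdio.h>

    #define DEVICE_BLOCKS 64

    static bool in_use[DEVICE_BLOCKS];     /* file system's view of block state */

    /* Placeholder: a real implementation would queue a discard for the device. */
    static void issue_trim(int block, int count)
    {
        printf("TRIM blocks %d-%d\n", block, block + count - 1);
    }

    static void fs_alloc_blocks(int block, int count)
    {
        for (int i = 0; i < count; i++)
            in_use[block + i] = true;
    }

    /* Every free tells the device, so we never "leak" blocks the SSD
     * still believes are live data it must keep copying around. */
    static void fs_free_blocks(int block, int count)
    {
        for (int i = 0; i < count; i++)
            in_use[block + i] = false;
        issue_trim(block, count);
    }

    int main(void)
    {
        fs_alloc_blocks(8, 4);             /* e.g., a small file's data blocks */
        fs_free_blocks(8, 4);              /* file deleted: tell the device */
        return 0;
    }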
Additionally, in an ideal world, TRIM would be redesigned or replaced by a command that is a full-featured, well-designed, first-class citizen of the ATA spec, rather than a hack bolted on after the fact.
Of course, all this is speculation in the absence of implementation details from the SSD manufacturers. Perhaps some SSD firmware programmers have come up with entirely new algorithms for remapping and write-gathering that don't resemble log-structured file systems at all, and the performance characteristics and optimizations we have seen so far just happen to match those for log-structured file systems. However, so far it appears that treating an SSD as though it were backed by a log-structured file system is a good rule of thumb for getting good performance. Full TRIM support by both SSDs and file systems will be key to long-term good performance.