Tux3: the other next-generation filesystem
Daniel is not a newcomer to filesystem development. His Tux2 filesystem was announced in 2000; it attracted a fair amount of interest until it turned out that Network Appliance, Inc. held patents on a number of techniques used in Tux2. There was some talk of filing for defensive patents, and Jeff Merkey popped up for long enough to claim to have hired a patent attorney to help with the situation. What really happened is that Tux2 simply faded from view. Tux3 is built on some of the same ideas as Tux2, but many of those ideas have evolved over the eight intervening years. The new filesystem, one hopes, has changed enough to avoid the attention of NetApp, which has shown a willingness to use software patents to defend its filesystem turf.
Like any self-respecting contemporary filesystem, Tux3 is based on B-trees. The inode table is such a tree; each file stored within is also a B-tree of blocks. Blocks are mapped using extents, of course - another obligatory feature for new filesystems. Most of the expected features are present. In many ways, Tux3 looks like yet another POSIX-style filesystem, but there are some interesting differences.
Tux3 implements transactions through a forward-logging mechanism. A set of changes to the filesystem will be batched together into a "phase," which is then written to the journal. Once the phase is committed to the journal, the transaction is considered to be safely completed. At some future time, the filesystem code will "roll up" the journal changes and write them back to the static version of the filesystem.
The logging implementation is interesting. Tux3 uses a variant of the copy-on-write mechanism employed by Btrfs; it will not allow any filesystem block to be overwritten in place. So writing to a block within a file will cause a new block to be allocated, with the new data written there. That, in turn, will require that the filesystem data structure which maps file-logical blocks to physical blocks (the extent) will need to be changed to reflect the new block location. Tux3 handles this by writing the new blocks directly to their final location, then putting a "promise" to update the metadata block into the log. At roll-up time, that promise will be fulfilled through the allocation of a new block and, if necessary, the logging of a promise to change the next-higher block in the tree. In this way, changes to files propagate up through the filesystem one step at a time, without the need to make a recursive, all-at-once change.
The end result is that the results of a specific change can remain in the log for some time. In Tux3, the log can be thought of as an integral part of the filesystem's metadata. This is true to the point that Tux3 doesn't even bother to roll up the log when the filesystem is unmounted; it just initializes its state from the log when the next mount happens. Among other things, Daniel says, this approach ensures that the journal recovery code will be well-tested and robust - it will be exercised at every filesystem mount.
In most filesystems, on-disk inodes are fixed-size objects. In Tux3, instead, their size will be variable. Inodes are essentially containers for attributes; in Tux3, normal filesystem data and extended attributes are treated in almost the same way. So an inode with more attributes will be larger. Extended attributes are compressed through the use of an "atom table" which remaps attribute names onto small integers. Filesystems with extended attributes tend to have large numbers of files using attributes with a small number of names, so the space savings across an entire filesystem could be significant.
Also counted among a file's attributes are the blocks where the data is stored. The Tux3 design envisions a number of different ways in which file blocks can be tracked. A B-tree of extents is a common solution to this problem, but its benefits are generally seen with larger files. For smaller files - still the majority of files on a typical Linux system - data can be stored either directly in the inode or at the other end of a simple block pointer. Those representations are more compact for small files, and they provide quicker data access as well. For the moment, though, only extents are implemented.
Another interesting - but unimplemented - idea for Tux3 is the concept of versioned pointers. The btrfs filesystem implements snapshots by retaining a copy of the entire filesystem tree; one of these copies exists for every snapshot. The copy-on-write mechanism in btrfs ensures that those snapshots share data which has not been changed, so it is not as bad as it sounds. Tux3 plans to take a different approach to the problem; it will keep a single copy of the filesystem tree, but keep track of different versions of blocks (or extents, really) within that tree. So the versioning information is stored in the leaves of the tree, rather than at the top. But the versioned extents idea has been deferred for now, in favor of getting a working filesystem together.
Also removed from the initial feature list is support for subvolumes. This feature initially seemed like an easy thing to do, but interaction with fsync() proved hard. So Daniel finally concluded that volume management was best left to volume managers and dropped the subvolume feature from Tux3.
One feature which has never been on the list is checksumming of data. Daniel once commented:
Tux3 development is far from the point where the developers can worry about "decorations"; it remains, at this point, an embryonic project being pushed by a developer with a bit of a reputation for bright ideas which never quite reach completion. The code, thus far, has been developed in user space using FUSE. There is, however, an in-kernel version which is now ready for further development. According to Daniel:
The potential user community for a stripped-down ext2 with bugs is likely to be relatively small. But the Tux3 design just might have enough to offer to make it a contender eventually.
First, though, there are a few little problems to solve. At the top of the list, arguably, is the complete lack of locking - locking being the rocks upon which other filesystem projects have run badly aground. The code needs some cleanups - little problems like the almost complete lack of comments and the use of macros as formal function parameters are likely to raise red flags on wider review. Work on an fsck utility does not appear to have begun. There has been no real benchmarking work done; it will be interesting to see how Daniel can manage the "never overwrite a block" policy in a way which does not fragment files (and thus hurt performance) over time. And so on.
That said, a lot of these problems could end up being resolved rather
quickly. Daniel has put the code out there and appears to have attracted an
energetic (if small) community of contributors. Tux3 represents the core
of a new filesystem with some interesting ideas. Code comments may be
scarce, but Daniel - never known as a tight-lipped developer - has posted a
wealth of information which can be found in the Tux3
mailing list archives. Potential contributors should be aware of Daniel's licensing scheme - GPLv3 with a
reserved unilateral right to relicense the code to anything else - but
developers who are comfortable with that are likely to find an interesting
and fast-moving project to play in.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/Tux3 |