Tux3: the other next-generation filesystem

By Jonathan Corbet
December 2, 2008

There is a great deal of activity around Linux filesystems currently. Of the many ongoing efforts, two receive the most attention: ext4, the extension of ext3 expected to keep that filesystem design going for a few more years, and btrfs, which is seen by many as the long-term filesystem of the future. But there is another project out there which is moving quickly and is worth a look: Daniel Phillips's Tux3 filesystem.

Daniel is not a newcomer to filesystem development. His Tux2 filesystem was announced in 2000; it attracted a fair amount of interest until it turned out that Network Appliance, Inc. held patents on a number of techniques used in Tux2. There was some talk of filing for defensive patents, and Jeff Merkey popped up for long enough to claim to have hired a patent attorney to help with the situation. What really happened is that Tux2 simply faded from view. Tux3 is built on some of the same ideas as Tux2, but many of those ideas have evolved over the eight intervening years. The new filesystem, one hopes, has changed enough to avoid the attention of NetApp, which has shown a willingness to use software patents to defend its filesystem turf.

Like any self-respecting contemporary filesystem, Tux3 is based on B-trees. The inode table is such a tree; each file stored within is also a B-tree of blocks. Blocks are mapped using extents, of course - another obligatory feature for new filesystems. Most of the expected features are present. In many ways, Tux3 looks like yet another POSIX-style filesystem, but there are some interesting differences.

Tux3 implements transactions through a forward-logging mechanism. A set of changes to the filesystem will be batched together into a "phase," which is then written to the journal. Once the phase is committed to the journal, the transaction is considered to be safely completed. At some future time, the filesystem code will "roll up" the journal changes and write them back to the static version of the filesystem.

The logging implementation is interesting. Tux3 uses a variant of the copy-on-write mechanism employed by Btrfs; it will not allow any filesystem block to be overwritten in place. So writing to a block within a file will cause a new block to be allocated, with the new data written there. That, in turn, will require that the filesystem data structure which maps file-logical blocks to physical blocks (the extent) will need to be changed to reflect the new block location. Tux3 handles this by writing the new blocks directly to their final location, then putting a "promise" to update the metadata block into the log. At roll-up time, that promise will be fulfilled through the allocation of a new block and, if necessary, the logging of a promise to change the next-higher block in the tree. In this way, changes to files propagate up through the filesystem one step at a time, without the need to make a recursive, all-at-once change.
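To make the promise mechanism concrete, here is a minimal Python sketch. Every name in it (`ToyVolume`, `write_block`, `roll_up`) is hypothetical, and it flattens the metadata tree into a single map, so it shows only the data-then-promise ordering and not the step-by-step propagation up a real B-tree. It should not be read as Tux3's actual code or on-disk format.

```python
class ToyVolume:
    """Toy model: copy-on-write data blocks plus logged metadata promises."""

    def __init__(self):
        self.blocks = {}      # physical block number -> data
        self.extent_map = {}  # (inode, logical block) -> physical block ("static" metadata)
        self.log = []         # pending promises: (inode, logical, physical)
        self.next_free = 0

    def _allocate(self):
        block = self.next_free
        self.next_free += 1
        return block

    def write_block(self, inode, logical, data):
        # Never overwrite in place: the data goes straight to a fresh block.
        physical = self._allocate()
        self.blocks[physical] = data
        # The metadata is not touched yet; a promise to update it is logged.
        self.log.append((inode, logical, physical))

    def lookup(self, inode, logical):
        # The log is part of the metadata, so the newest promise wins.
        for l_inode, l_logical, physical in reversed(self.log):
            if (l_inode, l_logical) == (inode, logical):
                return physical
        return self.extent_map.get((inode, logical))

    def read_block(self, inode, logical):
        physical = self.lookup(inode, logical)
        return None if physical is None else self.blocks[physical]

    def roll_up(self):
        # Fulfil the promises by folding them into the static metadata.
        for inode, logical, physical in self.log:
            self.extent_map[(inode, logical)] = physical
        self.log.clear()
```

Reads consult the log before the static map, which is why, in this model, nothing is lost if `roll_up()` is postponed indefinitely.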

The end result is that the results of a specific change can remain in the log for some time. In Tux3, the log can be thought of as an integral part of the filesystem's metadata. This is true to the point that Tux3 doesn't even bother to roll up the log when the filesystem is unmounted; it just initializes its state from the log when the next mount happens. Among other things, Daniel says, this approach ensures that the journal recovery code will be well-tested and robust - it will be exercised at every filesystem mount.

In most filesystems, on-disk inodes are fixed-size objects. In Tux3, instead, their size will be variable. Inodes are essentially containers for attributes; in Tux3, normal filesystem data and extended attributes are treated in almost the same way. So an inode with more attributes will be larger. Extended attributes are compressed through the use of an "atom table" which remaps attribute names onto small integers. Filesystems with extended attributes tend to have large numbers of files using attributes with a small number of names, so the space savings across an entire filesystem could be significant.
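The atom table idea can be illustrated with a short Python sketch. The class and function names here are invented for illustration, and the real table lives on disk rather than in a dictionary; the point is only that each distinct attribute name is stored once and inodes carry a small integer instead.

```python
class AtomTable:
    """Maps extended attribute names to small integers ("atoms")."""

    def __init__(self):
        self._atom_by_name = {}
        self._name_by_atom = []

    def intern(self, name):
        # Reuse the existing atom if the name has been seen before.
        atom = self._atom_by_name.get(name)
        if atom is None:
            atom = len(self._name_by_atom)
            self._atom_by_name[name] = atom
            self._name_by_atom.append(name)
        return atom

    def name(self, atom):
        return self._name_by_atom[atom]


def pack_xattrs(table, xattrs):
    # What an inode would store: atoms instead of full attribute names.
    return {table.intern(name): value for name, value in xattrs.items()}
```

With a thousand files all carrying a `user.mime_type` attribute, the name string appears once in the table and each inode stores only the atom and the value.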

Also counted among a file's attributes are the blocks where the data is stored. The Tux3 design envisions a number of different ways in which file blocks can be tracked. A B-tree of extents is a common solution to this problem, but its benefits are generally seen with larger files. For smaller files - still the majority of files on a typical Linux system - data can be stored either directly in the inode or at the other end of a simple block pointer. Those representations are more compact for small files, and they provide quicker data access as well. For the moment, though, only extents are implemented.

Another interesting - but unimplemented - idea for Tux3 is the concept of versioned pointers. The btrfs filesystem implements snapshots by retaining a copy of the entire filesystem tree; one of these copies exists for every snapshot. The copy-on-write mechanism in btrfs ensures that those snapshots share data which has not been changed, so it is not as bad as it sounds. Tux3 plans to take a different approach to the problem; it will keep a single copy of the filesystem tree, but keep track of different versions of blocks (or extents, really) within that tree. So the versioning information is stored in the leaves of the tree, rather than at the top. But the versioned extents idea has been deferred for now, in favor of getting a working filesystem together.
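Since versioned pointers are not implemented, any code can only be a guess at the shape of the idea. The Python sketch below assumes that each version records its parent version and that a lookup falls back through ancestors until it finds a pointer; those details, and all of the names, are assumptions made for illustration rather than anything taken from the Tux3 design notes.

```python
class VersionedLeaf:
    """One tree, with per-version pointers kept at the leaf level."""

    def __init__(self):
        self.parent = {0: None}  # version -> parent version (0 is the root version)
        self.pointers = {}       # logical block -> {version: physical block}

    def snapshot(self, base, new_version):
        # Creating a version copies nothing; it only records ancestry.
        self.parent[new_version] = base

    def set_pointer(self, logical, version, physical):
        self.pointers.setdefault(logical, {})[version] = physical

    def lookup(self, logical, version):
        # Walk from the requested version toward the root until a pointer is found.
        entries = self.pointers.get(logical, {})
        while version is not None:
            if version in entries:
                return entries[version]
            version = self.parent[version]
        return None
```

Blocks that a snapshot never modifies cost nothing extra in this model, since the lookup simply inherits the ancestor's pointer.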

Also removed from the initial feature list is support for subvolumes. This feature initially seemed like an easy thing to do, but interaction with fsync() proved hard. So Daniel finally concluded that volume management was best left to volume managers and dropped the subvolume feature from Tux3.

One feature which has never been on the list is checksumming of data. Daniel once commented:

Having been checksumming filesystem data during continuous replication for two years now on multiple machines, and having caught exactly zero blocks of bad data passed as good in that time, I consider the spectre of disks passing bad data as good to be largely vendor FUD. That said, checksumming will likely appear in the feature list at some point, I just consider it a decoration, not an essential feature.

Tux3 development is far from the point where the developers can worry about "decorations"; it remains, at this point, an embryonic project being pushed by a developer with a bit of a reputation for bright ideas which never quite reach completion. The code, thus far, has been developed in user space using FUSE. There is, however, an in-kernel version which is now ready for further development. According to Daniel:

The functionality we have today is roughly like a buggy Ext2 with missing features. While it is very definitely not something you want to store your files on, this undeniably is Tux3 and demonstrates a lot of new design elements that I have described in some detail over the last few months. The variable length inodes, the attribute packing, the btree design, the compact extent encoding and deduplication of extended attribute names are all working out really well.

The potential user community for a stripped-down ext2 with bugs is likely to be relatively small. But the Tux3 design just might have enough to offer to make it a contender eventually.

First, though, there are a few little problems to solve. At the top of the list, arguably, is the complete lack of locking - locking being the rocks upon which other filesystem projects have run badly aground. The code needs some cleanups - little problems like the almost complete lack of comments and the use of macros as formal function parameters are likely to raise red flags on wider review. Work on an fsck utility does not appear to have begun. There has been no real benchmarking work done; it will be interesting to see how Daniel can manage the "never overwrite a block" policy in a way which does not fragment files (and thus hurt performance) over time. And so on.

That said, a lot of these problems could end up being resolved rather quickly. Daniel has put the code out there and appears to have attracted an energetic (if small) community of contributors. Tux3 represents the core of a new filesystem with some interesting ideas. Code comments may be scarce, but Daniel - never known as a tight-lipped developer - has posted a wealth of information which can be found in the Tux3 mailing list archives. Potential contributors should be aware of Daniel's licensing scheme - GPLv3 with a reserved unilateral right to relicense the code to anything else - but developers who are comfortable with that are likely to find an interesting and fast-moving project to play in.


Tux3: the other next-generation filesystem

Posted Dec 2, 2008 19:20 UTC (Tue) by martinfick (subscriber, #4455) [Link] (6 responses)

I look forward to a stable Tux3 FS!

I know that with the whole reiserfs debate there was talk of adding a generic journaling layer to the kernel, and now tux3 will have some form of transaction support! But, has anyone considered adding entire FS transactions to the VFS API layer (including the ability to rollback) to help with the future development of distributed redundant filesystems?

It seems like there are many new distributed filesystems also in development. If they do have data redundancy, most of them do not do it in a transactional manner yet, probably because it is hard. However, if these FSes had sub filesystem kernel support for transactions, this might become much easier.

Hmm, maybe some tricks could even be played to use snapshots in this way? A brute force approach might even be to use lvm snapshots, but this might seriously stress lvm if a new snapshot were required for every FS write and it could also mean severe performance penalties. However, an lvm fallback method would allow transactions to be added to the kernel VFS layer even for older filesystems such as FAT.

If this suggested in kernel transaction support could allow commit/rollback decisions to be exported to userspace, I would think that it could easily be used (and would be very welcomed) by distributed FS designers.

Tux3: the other next-generation filesystem

Posted Dec 2, 2008 19:35 UTC (Tue) by sbergman27 (guest, #10767) [Link] (5 responses)

"""
I know that with the whole reiserfs debate there was talk of adding a generic journaling layer to the kernel
"""

I thought that was jbd?

http://en.wikipedia.org/wiki/Journaled_block_device

Tux3: the other next-generation filesystem

Posted Dec 2, 2008 19:59 UTC (Tue) by avik (guest, #704) [Link] (2 responses)

jbd is an in-kernel interface for journalling changes to block devices. What is described here is a filesystem-level, user-visible transaction support.

Consider:

begin transaction
yum update
test test test
commit transaction (or abort transaction)

Useful work can continue to be performed while the update takes place, and is not lost in case of rollback.

I believe NTFS supports this feature.

Tux3: the other next-generation filesystem

Posted Dec 9, 2008 14:48 UTC (Tue) by rwmj (guest, #5474) [Link] (1 responses)

> begin transaction
> yum update
> test test test
> commit transaction

This sounds like a nice idea at first, but you're forgetting an essential step: if you have multiple transactions outstanding, you need some way to combine the results to get a consistent filesystem.

For example, suppose that the yum transaction modified /etc/motd, and a user edited this file at the same time (before the yum transaction was committed). What is the final, consistent value of this file after the transaction?

From DBMSes you can find lots of different strategies to deal with these cases. A non-exhaustive list might include: Don't permit the second transaction to succeed. Always take the result of the first (or second) transaction and overwrite the other. Use a merge strategy (and there are many different sorts).
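The strategies listed above can be sketched as a small Python function. The strategy names and the `merge` callback are made up for this illustration and do not correspond to any existing filesystem or DBMS interface.

```python
def resolve(base, first, second, strategy, merge=None):
    """Pick the final content when two transactions both started from `base`."""
    if first == second or second == base:
        return first   # no real conflict: the second transaction changed nothing new
    if first == base:
        return second  # only the second transaction changed the file
    if strategy == "reject_second":
        return first   # the second transaction is not permitted to succeed
    if strategy == "second_wins":
        return second  # the later result overwrites the earlier one
    if strategy == "merge":
        if merge is None:
            raise ValueError("the merge strategy needs a merge function")
        return merge(base, first, second)
    raise ValueError(f"unknown strategy: {strategy}")
```

For the /etc/motd case, `first` would be the yum version and `second` the user's edit; which answer is "consistent" depends entirely on the strategy chosen.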

As usual in computer science, there is a whole load of interesting, accessible theory here, which is being completely ignored. My favorite which is directly relevant here is Oleg's Zipper filesystem.

Rich.

Tux3: the other next-generation filesystem

Posted Dec 9, 2008 20:26 UTC (Tue) by martinfick (subscriber, #4455) [Link]

You are correct, that is actually quite a bit more advanced than what I was proposing. But since I did not go into any details about what I was asking for, I can hardly object. :) The real problem with the above, apart from perhaps being difficult to achieve, is that it would likely break POSIX semantics!

The yum proposal probably assumes that I could have multiple writes interleaved with reads from the same locations that could succeed in one transaction and then possibly be rolled back. Posix requires that once a write succeeds any reads to the same location that succeed after the write report the newly written bytes. To return a read of some written bytes to any process, even the writer, with the transaction pending, and to then rollback the transaction and return in a read what was there before the write, to any process, would break this requirement. The yum example above probably requires such "broken" semantics.

What I was suggesting is actually something much simpler than the above: a method to allow a transaction coordinator to intercept every individual write action (anything that modifies the FS) and decide whether to commit or rollback the write (transaction).

The coordinator would intercept the write after the FS signals "ready to commit". The write action would then block until either a commit or a rollback is received from the coordinator. This would not allow any concurrent reads or writes to the portion of the object being modified during this block, ensuring full POSIX semantics.

For this to be effective with distributed redundant filesystems, once the FS has signaled ready to commit, the write has to be able to survive a crash so that if the node hosting the FS crashes, the rollback or commit can be issued upon recovery (depending on the coordinator's decision) and reads/writes must continue to be blocked until then (even after the crash!)

If the commit is performed, things continue as usual, if there is a rollback, the write simply fails. Nothing would seem different to applications using such an FS, except for a possible (undetermined) delay while the coordinator decides to commit or rollback the transaction.

That's all I had in mind, not bunching together multiple writes. It should not actually be that difficult to implement, the tricky part is defining a useful generic interface to the controller that would allow higher level distributed FSes to use it effectively.
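A rough Python sketch of that single-write protocol follows. It is only a model: the names are hypothetical, the `WouldBlock` exception stands in for the places where a real implementation would block the caller, and the `pending` dictionary would have to live on stable storage to survive a crash.

```python
class WouldBlock(Exception):
    """Raised where a real implementation would block the caller."""


class CoordinatedStore:
    def __init__(self):
        self.committed = {}  # data visible to readers
        self.pending = {}    # writes that are "ready to commit" (must be crash-safe)

    def prepare_write(self, key, data):
        # Stage the write and signal "ready to commit" to the coordinator.
        if key in self.pending:
            raise WouldBlock(key)
        self.pending[key] = data

    def read(self, key):
        # No reads of an object while a decision on it is outstanding.
        if key in self.pending:
            raise WouldBlock(key)
        return self.committed.get(key)

    def decide(self, key, commit):
        # The coordinator's verdict: apply the staged write or discard it.
        data = self.pending.pop(key)
        if commit:
            self.committed[key] = data
        return commit
```

After a rollback the application simply sees a failed write, and no reader ever observes data that was later withdrawn.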

Tux3: the other next-generation filesystem

Posted Dec 3, 2008 0:00 UTC (Wed) by jengelh (subscriber, #33263) [Link] (1 responses)

Not that JBD has many users. Reiser4, JFS, XFS and btrfs all use their own journalling. Leaves... ext3 to use jbd. Wow.

Tux3: the other next-generation filesystem

Posted Dec 3, 2008 12:40 UTC (Wed) by daniel (guest, #3181) [Link]

> Reiser4, JFS, XFS and btrfs all use their own journalling. Leaves... ext3 to use jbd.

And OCFS2. JBD was created at a time when it seemed as though all future filesystems would be journalling filesystems. Incidentally, any filesystem developer who overlooks Stephen Tweedie's copious writings on the JBD design, does so at their peril whether they intend to use journalling or some other atomic commit model.

Tux3: the other next-generation filesystem

Posted Dec 2, 2008 20:33 UTC (Tue) by aigarius (subscriber, #7329) [Link] (10 responses)

Tux3: the other next-generation filesystem

Posted Dec 2, 2008 23:01 UTC (Tue) by jmorris42 (guest, #2203) [Link] (5 responses)

> undelete - a tool that would analyse the filesystem and
> report what files and what versions of files it can recover.

Sounds like you are stuck in DOS mode. For an undelete in a real OS, beyond the Windows-style 'trashcan' that desktop GUIs implement, it should be a "Do or Do Not, there is no try." deal. Either have real file versioning, snapshots, etc. or don't bother. Snuffling around on the platters for raw blocks and just blanking out the first letter of file names are bad ideas best left in the dustbin of history.

Tux3: the other next-generation filesystem

Posted Dec 2, 2008 23:26 UTC (Tue) by Ze (guest, #54182) [Link] (4 responses)

undelete through versioning is bloody useful. However it's no substitute for a tool that tries to recover data when there has been a failure. When people are writing filesystems they often don't think about how they could make it easier for themselves or others to write a tool to recover corrupted data in the event of various failures (both software and hardware). Personally I don't see the loss of logical volumes as a big deal. I've never understood the point of them when you could just have a fast index in the first place.

Tux3: the other next-generation filesystem

Posted Dec 3, 2008 9:48 UTC (Wed) by niner (guest, #26151) [Link] (2 responses)

Hardware failures and accidental deletion is what we have backups for.

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 3:54 UTC (Thu) by Ze (guest, #54182) [Link]

> Hardware failures and accidental deletion is what we have backups for

I would argue that accidental deletion is one of the things that versioning should handle. Unfortunately, backups offer only limited granularity, and people often fail to use or test them. When you combine all that, you can see why there is a clear need for data recovery tools; the number of such tools on the market, both free and commercial, shows that people feel it. It makes sense to consider that use case when designing a file system. It can only lead to a better documented and better designed file system.

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 17:40 UTC (Thu) by lysse (guest, #3190) [Link]

Yes, and you'll still have backups for it in the future. But wouldn't it be nice to have a way out of your last backup having gone up in flames at a really inconvenient time? Is there some reason why it would be desirable to limit the number of ways of thwarting Murphy we permit ourselves? Because honestly, I can't think of one...

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 12:15 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]

As a side effect, consider that you could end up making real deletion of information (say, credit card numbers) harder.
Amidst all the great work (which is well above my skill level, kudos to all) there are ramifications.

Tux3: the other next-generation filesystem

Posted Dec 2, 2008 23:04 UTC (Tue) by daniel (guest, #3181) [Link] (3 responses)

> One other common utility that Linux filesystem developers often forget is undelete - a tool that would analyse the filesystem and report what files and what versions of files it can recover. This should be simple enough to implement in Tux3.

It's on the to.do list. The standard argument against undelete is that it can be implemented at a higher level, as a move to a Trash folder in place of a delete. In practice, there is often not a gui around, and it doesn't help when you are running a shell under the gui. So if it turns out to be easy to do as part of the versioning, Tux3 will have it.

Regards, Daniel

Tux3: the other next-generation filesystem

Posted Dec 3, 2008 2:55 UTC (Wed) by rsidd (subscriber, #2582) [Link] (2 responses)

Doesn't Tux3 support snapshots (or won't it, in the future)? With snapshots you don't need undelete.

Tux3: the other next-generation filesystem

Posted Dec 3, 2008 7:23 UTC (Wed) by daniel (guest, #3181) [Link] (1 responses)

Snapshots do not take care of all undelete cases - you might not have taken a snapshot before deleting something important. But I am thinking in terms of using the snapshot mechanism to support undelete, essentially creating an "anonymous snapshot" just for the directory where the delete happened. This could be done without wasting a lot of space by using Tux3's versioned attribute model, which would avoid having to use a full disk block just to remember a single undeleted name.

Tux3: the other next-generation filesystem

Posted Dec 3, 2008 8:05 UTC (Wed) by rsidd (subscriber, #2582) [Link]

I've seen your email exchange with Matt Dillon on tux3 and hammer on the dragonfly lists some time back. Hammer can take snapshots at very fine-grained intervals -- say every few minutes -- via a cron job and it is not very expensive (supposedly -- I haven't used it myself, yet). That would be incredibly useful. Undelete doesn't take care of all recovery requirements either -- you may have deleted text in a document that you later want back, for example. With hammer, just look at the snapshot from 2 minutes before you deleted it.

I suppose if you have a filesystem where many bulky files are being altered frequently, this is not a great idea, but you can tune the frequency of the snapshot and pruning (or disable snapshotting entirely, if need be...)

Tux3: the other next-generation filesystem

Posted Dec 3, 2008 8:38 UTC (Wed) by plougher (guest, #21620) [Link]

> Like any self-respecting contemporary filesystem, Tux3 is based on B-trees [...] Blocks are mapped using extents, of course - another obligatory feature for new filesystems

Of course this should be qualified as any self-respecting read/write filesystem. B-trees and extents are completely unnecessary for read-only filesystems.

Tux3 seems to have some nice design decisions which should offer high performance (reduced seeking). I like the variable sized inodes, (potential) optimised inodes for small files, and the packed attributes. Though I'm obviously bound to say that Squashfs has had variable sized inodes optimised for different file types/sizes for many years.

Correctness

Posted Dec 4, 2008 4:06 UTC (Thu) by ncm (guest, #165) [Link] (10 responses)

> having caught exactly zero blocks of bad data passed as good

Evidently Daniel hasn't worked much with disks that are powered off unexpectedly. There's a widespread myth (originating where?!) that disks detect a power drop and use the last few milliseconds to do something safe, such as finish up the sector they're writing. It's not true. A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds. Some hard read failures, even if duly reported, count as real damage, and are not unlikely.

Your typical journaled file system doesn't protect against power-off scribbling damage, as fondly as so many people wish and believe with all their little hearts.

Even without unexpected power drops, it's foolish to depend on more reliable reads than the manufacturer promises, because they trade off marginal correctness (which is hard to measure) against density (which is on the box in big bold letters). What does the money say to do? PostgreSQL uses 64-bit block checksums because they care about integrity. It's possibly reasonable to say that theirs is the right level for such checking, but not to say there's no need for it.

Correctness

Posted Dec 5, 2008 18:22 UTC (Fri) by man_ls (guest, #15091) [Link] (3 responses)

Everything you say can be prevented by a more robust filesystem with data journaling, even without checksums. Ext3 with data=ordered is an example.

Even with checksumming data integrity is not guaranteed: yes, the filesystem will detect that a sector is corrupt, but it still needs to locate a good previous version and be able to roll back to that version. Isn't it easier to just do data journaling?

Correctness

Posted Dec 5, 2008 22:18 UTC (Fri) by ncm (guest, #165) [Link] (2 responses)

> Everything you say can be prevented by a more robust filesystem ...

FALSE. I'm talking about hardware-level sector failures. A filesystem without checksumming can be made robust against reported bad blocks, but a bad block that the drive delivers as good can completely bollix ext3 or any fs without its own checksums. Drive manufacturers specify and (just) meet a rate of such bad blocks, low enough for non-critical applications, and low enough not to kill performance of critical applications that perform their own checking and recovery methods.

Denial is not a sound engineering practice.

Correctness

Posted Dec 6, 2008 0:06 UTC (Sat) by man_ls (guest, #15091) [Link] (1 responses)

Interesting point: it seems I misread your post, so let me re-elaborate. Data journaling protects against half-written sectors, since they will not count as written. This leaves a power-off which causes physical damage to a disk, and yet the disk will not realize the sector is bad. Keep in mind that we have data journaling, so this particular sector will not be used until it is completely overwritten. The kind of damage must be permanent yet remain hidden when writing, which is why I deemed it impossible. It seems you have good cause to believe it can happen, so it would be most enlightening to hear any data points you may have.

As to your concerns about high data density and error rates, they are exactly what Mr Phillips happily dismisses: in practice they do not seem to cause any trouble.

Over-engineering is not a sound engineering practice either.

Correctness

Posted Dec 7, 2008 22:28 UTC (Sun) by ncm (guest, #165) [Link]

We have, elsewhere in this same thread, reports of bad data delivered as good, and causing trouble, Mr. Phillips's opinion notwithstanding. The incidence is, therefore, not negligible for data many people care about. Partially-written blocks are only one cause of bad sectors, which I noted only because they are an example of one that occurs much more frequently for some users than for others. Bad sectors may occur in the journal as well as in file contents. The drive will detect and report only a large, but not always a large enough, fraction of these.

File checksums needed?

Posted Dec 6, 2008 18:57 UTC (Sat) by giraffedata (guest, #1954) [Link] (3 responses)

> A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds.

Actually, I think the probability of reading such a sector without error indication is negligible. There are much more likely failure modes for which file checksums are needed. One is where the disk writes the data to the wrong track. Another is where it doesn't write anything but reports that it did. Another is that the power left the client slightly before the disk drive and the client sent garbage to the drive, which then correctly wrote it.

I've seen a handful of studies that showed these failure modes, and I'm pretty sure none of them showed simple sector CRC failure.

If sector CRC failure were the problem, adding a file checksum is probably no better than just using stronger sector CRC.

File checksums needed?

Posted Dec 16, 2008 1:57 UTC (Tue) by daniel (guest, #3181) [Link] (2 responses)

> There are much more likely failure modes for which file checksums are needed. One is where the disk writes the data to the wrong track. Another is where it doesn't write anything but reports that it did. Another is that the power left the client slightly before the disk drive and the client sent garbage to the drive, which then correctly wrote it.

Scribble on final write is something we plan to detect, by checksumming the commit block. I seem to recall reading that SGI ran into hardware that would lose power to the memory before the drive controller lost its power-good, and had to do something special in XFS to survive it. Better would be if hardware was engineered not to do that.
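As an illustration of what checksumming a commit block could look like, here is a small Python sketch. The layout (a length and a CRC32 in front of the payload) and the function names are assumptions chosen for the example; nothing here reflects the format Tux3 actually uses or plans to use.

```python
import struct
import zlib

# Hypothetical header: payload length, then CRC32 of the payload (little-endian).
HEADER = struct.Struct("<II")


def seal_commit_block(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum before writing it."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def open_commit_block(block: bytes):
    """Return the payload, or None if the block looks torn or scribbled."""
    if len(block) < HEADER.size:
        return None
    length, crc = HEADER.unpack_from(block)
    payload = block[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload
```

At mount time, a commit block that fails this check would be treated as if the commit never happened, so the previous phase remains the latest valid state.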

Please, stop...

Posted Dec 20, 2008 3:31 UTC (Sat) by sandeen (guest, #42852) [Link] (1 responses)

Can we just drop the whole "XFS expects and/or works around special hardware" meme? This has been kicked around for years without a shred of evidence. I may as well assert that XFS requires death-rays from mars for proper functionality.

XFS, like any journaling filesystem, expects that when the storage says data is safe on disk, it is safe on disk and the filesystem can proceed with whatever comes next. That's it; no special capacitors, no power-fail interrupts, no death-rays from mars. There is no special-ness required (unless you consider barriers to prevent re-ordering to be special, and xfs is not unique in that respect either).

Please, stop...

Posted Dec 20, 2008 3:55 UTC (Sat) by giraffedata (guest, #1954) [Link]

You must have seriously misread the post to which you responded. It doesn't mention special features of hardware. It does mention special flaws in hardware and how XFS works in spite of them.

I too remember reports that in testing, systems running early versions of XFS didn't work because XFS assumed, like pretty much everyone else, that the hardware would not write garbage to the disk and subsequently read it back with no error indication. The testing showed that real world hardware does in fact do that and, supposedly, XFS developers improved XFS so it could maintain data integrity in spite of it.

Correctness

Posted Dec 11, 2008 16:50 UTC (Thu) by anton (subscriber, #25547) [Link]

> A disk will happily write half a sector and scribble trash. Most times reading that sector will report a failure, but you only get reasonable odds.

Given that disk drives do their own checksumming, you get pretty good odds. And if you think they are not good, why would you think that FS checksums are any better?

Concerning getting such damage on power-off, most drives don't do that; we would hear a lot about drive-level read errors after turning off computers if that was a general characteristic. However, I have seen such things a few times, and it typically leads to me avoiding the brand of the drive for a long time (i.e., no Hitachi drives for me, even though they were still IBM when it happened, and no Maxtor, either; hmm, could it be that selling such drives leads to having to sell the division/company soon after?); they usually did not happen on an ordinary power-off, but in some unusual situations that might result in funny power characteristics (that's still no excuse to corrupt the disk).

Correctness

Posted Dec 15, 2008 21:06 UTC (Mon) by grundler (guest, #23450) [Link]

ncm wrote:
> There's a widespread myth (originating where?!) that disks
> detect a power drop and use the last few milliseconds to do
> something safe, such as finish up the sector they're writing.
> It's not true. A disk will happily write half a sector and scribble trash.

It was true for SCSI disks in the 90's. The feature was called "Sector Atomicity". As expected, there is a patent for one implementation:
http://www.freepatentsonline.com/5359728.html

AFAIK, every major server vendor required it. I have no idea if this was ever implemented for IDE/ATA/SATA drives. But UPS's became the norm for avoiding power failure issues.

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 6:11 UTC (Thu) by mjthayer (guest, #39183) [Link] (2 responses)

What about online fsck-ing, which I have not seen mentioned here yet? Surely that ought to be feasible if there is no update in place. (I'm not actually sure why it is not generally feasible with journalling filesystems, possibly excluding the journal itself, at least if you temporarily disable journal write out).

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 7:26 UTC (Thu) by daniel (guest, #3181) [Link] (1 responses)

Online checking is planned. Offline checking is in progress.

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 11:09 UTC (Thu) by mjthayer (guest, #39183) [Link]

Actually this patch

http://patchwork.ozlabs.org/patch/6047/ (filesystem-freeze-implement-generic-freeze-feature.patch)

might make general online fs checking doable.

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 8:05 UTC (Thu) by zmi (guest, #4829) [Link] (10 responses)

I must strongly object here. Over the last years, I have had 3 different customers, using 2 different RAID-controller vendors with 2 different disk types (SCSI, SATA), who got destroyed RAID contents because of a broken disk that did not report (or detect) its errors.

The problem is that even RAID controllers do not "read-after-write" and thus verify the contents of a disk. So if the disk says "OK" after a write when in reality it's not, your RAID and filesystem contents still get destroyed (because the drive reads back other data than it wrote).

Another check could be "on every read also calculate the RAID checksum to verify", but for performance reasons nobody does that.

There REALLY should be filesystem-level checksumming, and a generic interface between filesystem and disk controller, where the filesystem can tell the RAID controller to switch to "paranoid mode", doing read-after-write of disk data. It's gonna be slow then, but at least the controller will find a broken disk and disable it - after that, it can switch to performance mode again.

Yes, our customers were quite unsatisfied that even with RAID controllers their data got broken. But the worst is, it takes a long time for customers to see and identify there is a problem - you can only hope for a good backup strategy! Or for a filesystem doing checksumming.

mfg zmi

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 12:47 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]

For RAIDs, there should be a few selectable options:
- read (all mirrors) after writes, report error if content differ (slow)
- write (all mirrors) and return if all writes successful, post a read of the same data and report delayed error if content differ.
- write (one mirror) and return as soon as possible, post writes to other mirrors, then post a read of the same data (all mirrors) and report delayed error if content differ.
Obviously, for previous test, you should run the disks with their cache disabled.

Those can run with cache enabled:
- read all mirrors and compare content, report error to the read operation if content differ (slow)
- read and return first available data, but keep data and compare when other mirrors deliver data; report delayed error if mirrors have different data.
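
The last option can be sketched roughly as follows. This is only an illustration in Python, not how any real controller works: the mirrors are modelled as hypothetical read callbacks, and the "delayed error" is just a block number appended to a list.

```python
from typing import Callable, List, Sequence


def read_with_delayed_check(
    mirrors: Sequence[Callable[[int], bytes]],
    block_no: int,
    delayed_errors: List[int],
) -> bytes:
    """Return the first mirror's data; record block_no if any mirror disagrees."""
    first = mirrors[0](block_no)
    # A real controller would issue the remaining reads asynchronously and
    # hand `first` to the caller right away; here they simply run in sequence.
    for read in mirrors[1:]:
        if read(block_no) != first:
            delayed_errors.append(block_no)
            break
    return first
```

Note that the caller learns only that the mirrors differ, not which copy is the good one.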

That is better handled in the controller hardware itself, I do not know if some hardware RAID controllers do it correctly.
I am not sure there is a defined clean way to report "delayed errors" in either SCSI or SATA, there isn't any in ATA interface (so booting from those RAID drives using the BIOS may be difficult).
Moreover the "check data" (i.e. read and compare) in SCSI is sometimes simply ignored by devices, so that may have to be implemented by reads in the controller itself.
I am not sure a lot of users would accept the delay penalties due to the amount of data transferred in between controller and RAID disks...

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 21:38 UTC (Thu) by njs (subscriber, #40338) [Link] (7 responses)

Luckily the box happened to be hosting a modern DVCS server (the first, in fact), which was doing its own strong validation on everything it read from the disk, and started complaining very loudly. No saying how much stuff on this (busy, multi-user, shared) machine would have gotten corrupted before someone noticed otherwise, though... and backups are no help, either.

I totally understand not being able to implement everything at once, but if there comes a day when there are two great filesystems and one is a little slower but has checksumming, I'm choosing the checksumming one. Saving milliseconds (of computer time) is not worth losing years (of work).

Tux3: the other next-generation filesystem

Posted Dec 5, 2008 22:52 UTC (Fri) by ncm (guest, #165) [Link] (4 responses)

Checksumming only the file system's metadata and log, but not the user-level data, is a reasonable compromise. Then applications that matter (e.g. PostgreSQL, or your DVCS) can provide their own data checksums (and not pay twice) and operate on a reliable file system.

This suggests a reminder for applications providing their own checksums: mix in not just the data, but your own metadata (block number, file id). Getting the right checksum on the wrong block is just embarrassing.
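
A minimal sketch of that reminder, assuming CRC32 from Python's zlib module and a made-up header layout (file id and block number packed as two 64-bit integers). The function names are hypothetical and not taken from PostgreSQL or any DVCS.

```python
import struct
import zlib


def block_checksum(file_id: int, block_no: int, data: bytes) -> int:
    """CRC32 over the block's identity followed by its contents."""
    # Hypothetical header layout: two little-endian unsigned 64-bit integers.
    header = struct.pack("<QQ", file_id, block_no)
    # zlib.crc32 accepts a starting value, so the header CRC seeds the data CRC.
    return zlib.crc32(data, zlib.crc32(header))


def verify_block(file_id: int, block_no: int, data: bytes, stored: int) -> bool:
    """True only if the data is intact and belongs at this location."""
    return block_checksum(file_id, block_no, data) == stored
```

With this, intact data that turns up at the wrong block number fails verification just as corrupted data does.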

Tux3: the other next-generation filesystem

Posted Dec 5, 2008 23:58 UTC (Fri) by njs (subscriber, #40338) [Link] (3 responses)

>Checksumming only the file system's metadata and log, but not the user-level data, is a reasonable compromise

Well, maybe...

Within reason, my goal is to have as much confidence as possible in my data's safety, with as little investment of my time and attention. Leaving safety up to individual apps is a pretty wretched system for achieving this -- it defaults to "unsafe", then I have to manually figure out which stuff needs more guarantees, which I'll screw up, plus I have to worry about all the bugs that may exist in the eleventeen different checksumming systems being used in different codebases... This is the same reason I do whole disk backups instead of trying to pick and choose which files to save, or leaving backup functionality up to each individual app. (Not as crazy an idea as it sounds -- that DVCS basically has its own backup system, for instance; but I'm not going around adding that functionality to my photo editor and word processor too.)

Obviously if checksumming ends up causing unacceptable slowdowns, then compromises have to be made. But I'm pretty skeptical; it's not like CRC (or even SHA-1) is expensive compared to disk access latency, and the Btrfs and ZFS folks seem to think usable full disk checksumming is possible.

If it's possible I want it.

Tux3: the other next-generation filesystem

Posted Dec 6, 2008 8:26 UTC (Sat) by ncm (guest, #165) [Link] (2 responses)

This is another case where the end-to-end argument applies. Either (a) it's a non-critical application, and backups (which you have to do anyway) provide enough reliability; or (b) it's a critical application, and the file system can't provide enough assurance anyway, and what it could do would interfere with overall performance.

Similarly, if your application is seek-bound, it's in trouble anyway. If performance matters, it should be limited by the sustained streaming capacity of the file system, and then delays from redundant checksum operations really do hurt.

Hence the argument for reliable metadata, anyway: the application can't do that for itself, and it had better not depend on metadata operations being especially fast. Traditionally, serious databases used raw block devices to avoid depending on file system metadata.

Tux3: the other next-generation filesystem

Posted Dec 6, 2008 8:55 UTC (Sat) by njs (subscriber, #40338) [Link] (1 responses)

End-to-end is great, and it absolutely makes sense that special purpose systems like databases may want both additional guarantees and low-overhead access to the drive. But basically none of my important data is in a database; it's scattered all over my hard drive in ordinary files, in a dozen or more formats. If the filesystem *is* your database, as it is for ordinary desktop storage, then that's the only place you can reasonably put your integrity checking.

Backups are also great, but there are cases (slow quiet unreported corruption that can easily persist undetected for weeks+, see upthread) where they do not protect you.

(In some cases you can actually increase integrity too -- if your app checks its checksum when loading a file and it fails, then the data is lost but at least you know it; if btrfs checks a checksum while loading a block and it fails, then it can go pull an uncorrupted copy from the RAID mirror and prevent the data from being lost at all.)
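
To make the difference concrete, here is a rough sketch of the filesystem-side case. It is not btrfs code: the mirror copies are modelled as in-memory byte strings, CRC32 stands in for whatever checksum is really used, and the function name is invented.

```python
import zlib
from typing import Optional, Sequence


def read_verified(copies: Sequence[bytes], expected_crc: int) -> Optional[bytes]:
    """Return the first copy whose CRC32 matches, or None if all are bad."""
    for data in copies:
        if zlib.crc32(data) == expected_crc:
            return data
    return None
```

An application holding only one copy can detect the mismatch but has nothing to fall back on; the layer that owns the mirrors can pick the good copy.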

>If performance matters, it should be limited by the sustained streaming capacity of the file system, and then delays from redundant checksum operations really do hurt.

Again, I'm not convinced. My year-old laptop does SHA-1 at 200 MB/s (using one core only); the fastest hard-drive in the world (according to storagereview.com) streams at 135 MB/s. Not that you want to devote a CPU to this sort of thing, and RAID arrays can stream faster than a single disk, but CRC32 goes *way* faster than SHA-1 too, and my laptop has neither RAID nor a fancy 15k RPM server drive anyway.

And anyway my desktop is often seek-bound, alas, and yours is too; it does make things slow, but I don't see why it should make me care less about my data.

Tux3: the other next-generation filesystem

Posted Dec 7, 2008 21:33 UTC (Sun) by ncm (guest, #165) [Link]

For most uses we would benefit from the file system doing as much as it can, and even backing itself up -- although we'd like to be able to bypass whatever gets in the way. But if the file system does less, at first, the first thing to checksum is the metadata.

Tux3: the other next-generation filesystem

Posted Dec 16, 2008 1:42 UTC (Tue) by daniel (guest, #3181) [Link] (1 responses)

> I've only lived with maybe a few dozen disks in my life, but I've still seen corruption like that too -- in this case, it turned out that the disk was fine, but one of the connections on the RAID card was bad, and was silently flipping single bits on reads that went to that disk (so it was nondeterministic, depending on which mirror got hit on any given cache fill, and quietly persisted even after the usual fix of replacing the disk).

> Luckily the box happened to be hosting a modern DVCS server (the first, in fact), which was doing its own strong validation on everything it read from the disk, and started complaining very loudly. No saying how much stuff on this (busy, multi-user, shared) machine would have gotten corrupted before someone noticed otherwise, though... and backups are no help, either.

Our ddsnap-style checksumming at replication time would have caught that corruption promptly.

> if there comes a day when there are two great filesystems and one is a little slower but has checksumming, I'm choosing the checksumming one. Saving milliseconds (of computer time) is not worth losing years (of work).

It is not milliseconds, it is a significant fraction of your CPU, no matter how powerful. But yes, if extra checking is important to you, you should be able to have it. Whether block checksums belong in the filesystem rather than volume manager is another question. There may be a powerful efficiency argument that checksumming has to be done by the filesystem, not the volume manager. If so, I would like to see it.

Anyway, when the time comes that block checksumming rises to the top of the list of things to do, we will make sure Tux3 has something respectable, one way or another. Note that checksumming at replication time already gets nearly all the benefit at a very modest CPU cost.

If you want to rank the relative importance of features, replication way beats checksumming. It takes you instantly from having no backup or really awful backup, to having great backup with error detection. So getting to that state with minimal distractions seems like an awfully good idea.

Tux3: the other next-generation filesystem

Posted Dec 21, 2008 12:26 UTC (Sun) by njs (subscriber, #40338) [Link]

> Our ddsnap-style checksumming at replication time would have caught that corruption promptly.

What is that, and how does it work? I'm curious...

In general, I don't see how replication can help in the situation I encountered -- basically, some data on the disk magically changed without OS intervention. The only way to distinguish between that and a real data change is if you are somehow hooked into the OS and watching the writes it issues. Maybe ddsnap does that?

>It is not milliseconds, it is a significant fraction of your CPU, no matter how powerful.

Can you elaborate? On my year-old laptop, crc32 over 4k-blocks does >625 MiB/s on one core (adler32 is faster still), and the disk with perfect streaming manages to write at ~60 MiB/s, so by my calculation the worst case is 5% CPU. Enough that it could matter occasionally, but in fact seek-free workloads are very rare... and CPUs continue to follow Moore's law (checksumming is parallelizable), so it seems to me that that number will be <1% by the time tux3 is in production :-).
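
The estimate is just a ratio of two throughputs. A rough sketch for redoing it on other hardware, using zlib.crc32 from the Python standard library; the function names are made up for illustration, and Python call overhead means the measured rate will understate what C code achieves. Plugging in the figures quoted above (60 and 625 MiB/s) gives a little under 10% of one core, which would be about 5% of the whole machine if the laptop has two cores (an assumption).

```python
import time
import zlib


def measure_crc32_mib_s(total_mib: int = 64, block_size: int = 4096) -> float:
    """Measure CRC32 throughput over fixed-size blocks, in MiB/s, on one core."""
    block = bytes(block_size)
    blocks = total_mib * 1024 * 1024 // block_size
    start = time.perf_counter()
    for _ in range(blocks):
        zlib.crc32(block)
    elapsed = time.perf_counter() - start
    return total_mib / elapsed


def checksum_cpu_fraction(disk_mib_s: float, checksum_mib_s: float) -> float:
    """Fraction of one core needed to checksum a stream at full disk speed."""
    return disk_mib_s / checksum_mib_s
```

For example, `checksum_cpu_fraction(60, measure_crc32_mib_s())` gives the worst-case share of one core for a disk streaming at 60 MiB/s.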

No opinion on volume manager vs. filesystem (so long as the interface doesn't devolve into distinct camps of developers pushing responsibilities off on each other); I could imagine there being locality benefits if your merkle tree follows the filesystem topology, but eh.

>If you want to rank the relative importance of features, replication way beats checksumming.

Fair enough, but I'll just observe that since I do have a perfectly adequate backup system in place already, replication doesn't get *me* anything extra, while checksumming does :-).

File checksums needed?

Posted Dec 6, 2008 19:07 UTC (Sat) by giraffedata (guest, #1954) [Link]

> Having been checksumming filesystem data during continuous replication for two years now on multiple machines, and having caught exactly zero blocks of bad data passed as good in that time,

If TUX3 is for small systems, Phillips is probably right. I don't know what "continuous replication" means or how much data he's talking about here, but I have a feeling that studies I've seen calling for file checksumming did maybe 10,000 times as much I/O as this.

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 11:32 UTC (Thu) by biolo (guest, #1731) [Link] (1 responses)

Does anyone know if there has been any talk about implementing HSM (Hierarchical Storage Management) in Linux? I'm aware there are one or two (very expensive) proprietary solutions out there, but it strikes me that now is a good time to at least consider how you would implement it and what you need from the various layers to handle it. Since we have two potential new generic file systems in the works, whose on-disk layout hasn't been fixed yet, I can't think of a better time.

Obviously HSM is one of those things that crosses the traditional layering, but BTRFS at least is already handling multi layer issues.

Implementing a linux native HSM strikes me as one of those game changers: we'd have a huge feature none of the other OS's can currently match without large expenditure. I've lost count of the number of situations where organizations have bought hugely expensive SCSI or FC storage systems with loads of capacity, where what they actually needed was just a few high performance disks (or even SSDs nowadays) backed by a slower but high capacity set of SATA disks. Even small servers or desktops probably have a use for this: that new disk you just bought to expand capacity is probably faster than the old one.

Using tape libraries at the second or third level of the HSM has a few more complications, but could be tackled later.

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 17:42 UTC (Thu) by dlang (guest, #313) [Link]

there is interest, but currently the only way to do this is via FUSE

the hooks that are being proposed for file scanning are also being looked at as possibly being used for HSM type uses.

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 11:55 UTC (Thu) by meuh (guest, #22042) [Link] (1 responses)

> The code needs some cleanups - little problems like the almost complete lack of comments and the use of macros as formal function parameters are likely to raise red flags on wider review

And here is changeset 580: "The "Jon Corbet" patch. Get rid of SB and BTREE macros, spell it like it is."

Tux3: the other next-generation filesystem

Posted Dec 5, 2008 19:39 UTC (Fri) by liljencrantz (guest, #28458) [Link]

Cool. Doesn't fix the lack of comments, though. *hint* :)

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 16:13 UTC (Thu) by joern (guest, #22392) [Link] (4 responses)

> Tux3 handles this by writing the new blocks directly to their final location, then putting a "promise" to update the metadata block into the log. At roll-up time, that promise will be fulfilled through the allocation of a new block and, if necessary, the logging of a promise to change the next-higher block in the tree. In this way, changes to files propagate up through the filesystem one step at a time, without the need to make a recursive, all-at-once change.

Excellent, you had the same idea. How do you deal with inode->i_size and inode->i_blocks changing on behalf of the "promise"?

Tux3: the other next-generation filesystem

Posted Dec 4, 2008 21:05 UTC (Thu) by joern (guest, #22392) [Link] (2 responses)

Oh, and there is another more subtle problem. When mounting the filesystem with very little DRAM available, it may not be possible to cache all "promised" metadata blocks. So one must start writing them back at mount time. However, before they have all been loaded, one might have an incorrect (slightly dated) picture of the filesystem. The easiest example I can come up with involves three blocks, A, B and C, where A points to B and B points to C:
A -> B -> C

Now both B and C are rewritten, without updating their respective parent blocks (A and B):
A -> B -> C
     B'   C'

B' and C' appear disconnected without reading up on all the promises. At this point, when mounting under memory pressure, order becomes important. If A is written out first, to release the "promise" on B', everything works fine. But when B is written out first, to release the "promise" on C', we get something like this:
A -> B -> C
     B'   C'
     B"---^

And now there are two conflicting "promises" on B' and B". A rather ugly situation.

Tux3: the other next-generation filesystem

Posted Dec 11, 2008 8:01 UTC (Thu) by daniel (guest, #3181) [Link] (1 responses)

Hi Joern,

> there is another more subtle problem. When mounting the filesystem with very little DRAM available, it may not be possible to cache all "promised" metadata blocks. So one must start writing them back at mount time.

You mean, first run with lots of ram, get tons of metadata blocks pinned, then remount with too little ram to hold all the pinned metadata blocks. A rare situation, you would have to work at that. All of ram is available for pinned metadata on remount, and Tux3 is pretty stingy about metadata size.

In your example, when B is rewritten (a btree split or merge) the promise made by C' to update B is released because B' is on disk. So the situation is not as complex as you feared.

I expect we can just ignore the problem of running out of dirtyable cache on replay and nobody will ever hit it. But for completeness, note that writing out the dirty metadata is not the only option. By definition, one can reconstruct each dirty metadata block from the log. So choose a dirty metadata block with no dirty children, reconstruct it and write it out, complete with promises (a mini-rollup). Keep doing that until all the dirty metadata fits in cache, then go live. This may not be fast, but it clearly terminates. Unwinding these promises is surely much easier than unwinding credit default swaps :-)
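
The selection order described above can be sketched as follows. This is an illustration only, not Tux3 code: the dirty blocks are modelled as a hypothetical dict mapping each block to the blocks it points to, the structure is assumed to be a tree (no cycles), and the sketch does not model the new promises that each write-out logs.

```python
from typing import Dict, List, Set


def shrink_dirty_set(children: Dict[str, Set[str]], cache_slots: int) -> List[str]:
    """Write out leaf-most dirty blocks until the remainder fits in cache.

    `children` maps each dirty block to the blocks it points to.
    Returns the blocks written out, in order.
    """
    dirty = set(children)
    written: List[str] = []
    while len(dirty) > cache_slots:
        # Pick a dirty block none of whose children are still dirty.
        # sorted() only makes the choice deterministic for this sketch.
        block = next(b for b in sorted(dirty) if not (children[b] & dirty))
        written.append(block)  # stand-in for "reconstruct from log, write out"
        dirty.remove(block)
    return written
```

Each pass removes one block from the dirty set, so the loop ends after at most as many passes as there are dirty blocks.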

Regards,

Daniel

Tux3: the other next-generation filesystem

Posted Dec 20, 2008 13:08 UTC (Sat) by joern (guest, #22392) [Link]

>> I expect we can just ignore the problem

In that case I am a step ahead of you. :)

The situation may be easier to reach than you expect. Removable media can move from a beefy machine to some embedded device with 8M of RAM. Might not be likely for tux3, but is reasonably likely for logfs.

And order is important. If B is rewritten _after_ C, the promise made by C' is released. If it is rewritten _before_ C, both promises exist in parallel.

What I did to handle this problem may not apply directly to tux3, as the filesystem designs don't match 100%. Logfs has the old-fashioned indirect blocks and stores a "promise" by marking a pointer in the indirect block as such. Each commit walks a list of promise-containing indirect blocks and writes all promises to the journal.

On mount the promises are added to an in-memory btree. Each promise occupies about 32 bytes - while it would occupy a full page if stored in the indirect block and no other promises share this block. That allows the read-only case to work correctly and consume fairly little memory.

When going to read-write mode, the promises can be moved into the indirect blocks again. If those consume too much memory, they are written back. However, for some period promises may exist both in the btree and in indirect blocks. Care must be taken that those two never disagree.

Requires a bit more RAM than your outlined algorithm, but still bounded to a reasonable amount - nearly identical to the size occupied in the journal.

Tux3: the other next-generation filesystem

Posted Dec 11, 2008 6:42 UTC (Thu) by daniel (guest, #3181) [Link]

> How do you deal with inode->i_size and inode->i_blocks changing on behalf of the "promise"?

These are updated with the inode table block and not affected by promises. Note that we can sometimes infer the i_size and i_blocks changes from the logical positions of the written data blocks and could defer inode table block updates until rollup time. And in the cases where we can't infer it, write the i_size into the log commit block. More optimization fun.