The Btrfs filesystem: An introduction

By Jonathan Corbet
December 11, 2013

The Btrfs filesystem has been through almost every part of the hype cycle at one point or another in its short history. For a time, it was the next-generation filesystem that was going to solve many of our problems; distributors were racing to see who could be the first to ship it by default. Then it became clear that all those longtime filesystem developers weren't totally out to lunch when they warned that it takes many years to get a filesystem to a point where it can be trusted with important data. At this point, it is possible to go to a conference and hear almost nothing positive about Btrfs; disillusionment appears to have set in. By the time Btrfs is truly ready, some seem to think, it will be thoroughly obsolete.

The truth may not be quite so grim. Development on Btrfs continues, with a strong emphasis on stability and performance. Problems are getting fixed, and more users are beginning to take another look at this promising filesystem and play with it; openSUSE considered the idea of using it by default back in September. Your editor's sense is that the situation may be bottoming out, and that we may be heading, slowly, into a new phase where Btrfs takes its place as one of the key Linux filesystems.

This article is intended to be the first in a series for users interested in experimenting with and evaluating the Btrfs filesystem. We'll start with the basics of the design of the filesystem and how it is being developed; that will be followed by a detailed look at specific Btrfs features. One thing that will not appear in this series, though, is benchmark results: experience says that proper filesystem benchmarking is hard to do right, and the results are highly workload- and hardware-dependent. Poor-quality results would not be helpful to anybody, so your editor will simply not try.

What makes Btrfs different?

Not that long ago, Linux users were still working with filesystems that had evolved little since the Unix days. The ext3 filesystem, for example, was still using block pointers: each file's inode (the central data structure holding all the information about the file) contained a list of pointers to each individual block holding the file's data. That design worked well enough when files were small, but it scales poorly: a 1GB file would require 256K individual block pointers. More recent filesystems (including ext4) use pointers to "extents" instead; each extent is a group of contiguous blocks. Since filesystems work to store data contiguously anyway, extent-based storage greatly reduces the overhead of managing a file's space.
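The arithmetic behind that figure is easy to check. The sketch below assumes a 4KB block size (the article does not state one, but it is the size that yields 256K pointers for a 1GB file); the function names and the two-run layout are hypothetical, chosen only to show how contiguous blocks collapse into a few extents.

```python
BLOCK_SIZE = 4 * 1024  # assumed block size in bytes; the article does not state one


def block_pointers_needed(file_size, block_size=BLOCK_SIZE):
    """Return how many per-block pointers are needed to map file_size bytes."""
    return -(-file_size // block_size)  # ceiling division


def coalesce_into_extents(physical_blocks):
    """Group a file's physical block numbers (in file order) into (start, length) extents."""
    extents = []
    for block in physical_blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)  # block continues the current run
        else:
            extents.append((block, 1))  # block starts a new run
    return extents


one_gb = 1024 ** 3
print(block_pointers_needed(one_gb))  # 262144, that is, 256K pointers

# Hypothetical layout: the same 1GB file stored as two contiguous runs of blocks.
half = block_pointers_needed(one_gb) // 2
layout = list(range(1000, 1000 + half)) + list(range(500000, 500000 + half))
print(len(coalesce_into_extents(layout)))  # 2 extents instead of 262144 pointers
```

A badly fragmented file would, of course, need many more extents; the savings depend on how contiguous the layout actually is.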

Naturally, Btrfs uses extents as well. But it differs from most other Linux filesystems in a significant way: it is a "copy-on-write" (or "COW") filesystem. When data is overwritten in an ext4 filesystem, the new data is written on top of the existing data on the storage device, destroying the old copy. Btrfs, instead, will write the new data to a free location elsewhere in the filesystem and update its metadata to point there, leaving the older copy of the data in place.

The COW mode of operation brings some significant advantages. Since old data is not overwritten, recovery from crashes and power failures should be more straightforward; if a transaction has not completed, the previous state of the data (and metadata) will be where it always was. So, among other things, a COW filesystem does not need to implement a separate journal to provide crash resistance.
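As a rough illustration of why an incomplete transaction leaves the old state intact, here is a toy copy-on-write store. It is a sketch under simplifying assumptions, not a description of Btrfs's on-disk format: the `CowStore` class and its methods are hypothetical, and "commit" is modeled as a single switch of the block map.

```python
class CowStore:
    """Toy copy-on-write store mapping logical block numbers to physical slots."""

    def __init__(self):
        self.physical = {}   # physical slot -> data
        self.next_slot = 0   # next never-used physical slot
        self.committed = {}  # logical block -> physical slot (last committed state)

    def begin(self):
        """Start a transaction with a private working copy of the block map."""
        return dict(self.committed)

    def write(self, mapping, logical, data):
        """Write data to a fresh slot; the previously committed slot is untouched."""
        slot = self.next_slot
        self.next_slot += 1
        self.physical[slot] = data
        mapping[logical] = slot

    def commit(self, mapping):
        """Make the transaction visible by switching to the new block map."""
        self.committed = mapping

    def read(self, logical):
        """Read a logical block as of the last committed transaction."""
        return self.physical[self.committed[logical]]


store = CowStore()
txn = store.begin()
store.write(txn, 0, b"old contents")
store.commit(txn)

txn = store.begin()
store.write(txn, 0, b"new contents")
# If the system "crashed" here, before commit(), readers would still see the old data.
print(store.read(0))  # b'old contents'
store.commit(txn)
print(store.read(0))  # b'new contents'
```

Nothing here ever overwrites a slot in place, which is why no separate journal is needed in this model to get back to a consistent state.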

Copy-on-write also enables some interesting new features, the most notable of which is snapshots. A snapshot is a virtual copy of the filesystem's contents; it can be created without copying any of the data at all. If, at some later point, a block of data is changed (in either the snapshot or the original), that one block is copied while all of the unchanged data remains shared. Snapshots can be used to provide a sort of "time machine" functionality, or to simply roll back the system after a failed update.
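The sharing behavior described above can be modeled in a few lines. This is only a sketch: the `Volume` class is hypothetical, and for simplicity it copies the block map when taking a snapshot, while the data blocks themselves are shared rather than copied.

```python
class Volume:
    """Toy volume: a map from logical block number to a data block."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})

    def snapshot(self):
        """Create a snapshot that shares every data block with this volume."""
        return Volume(self.blocks)  # copies the map only, never the data

    def write(self, logical, data):
        """Point one logical block at new data; other volumes are unaffected."""
        self.blocks[logical] = data


original = Volume({0: b"a" * 4096, 1: b"b" * 4096})
snap = original.snapshot()

# Both volumes refer to the very same block objects: no data was copied.
print(snap.blocks[0] is original.blocks[0])  # True

original.write(1, b"c" * 4096)
print(snap.blocks[1] == b"b" * 4096)         # True: the snapshot keeps the old block
print(snap.blocks[0] is original.blocks[0])  # True: unchanged data is still shared
```

Rolling back after a failed update amounts, in this model, to using `snap` in place of `original`.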

Another important Btrfs feature is its built-in volume manager. A Btrfs filesystem can span multiple physical devices in a number of RAID configurations. Any given volume (collection of one or more physical drives) can also be split into "subvolumes," which can be thought of as independent filesystems sharing a single physical volume set. So Btrfs makes it possible to group part or all of a system's storage into a big pool, then share that pool among a set of filesystems, each with its own usage limits.

Btrfs offers a wide range of other features not supported by other Linux filesystems. It can perform full checksumming of both data and metadata, making it robust in the face of data corruption by the hardware. Full checksumming is expensive, though, so it remains likely to be used in only a minority of installations. Data can be stored on-disk in compressed form. The send/receive feature can be used as part of an incremental backup scheme, among other things. The online defragmentation mechanism can fix up fragmented files in a running filesystem. The 3.12 kernel saw the addition of an offline de-duplication feature; it scans for blocks containing duplicated data and collapses them down to a single, shared copy. And so on.
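Two of these features, checksumming and de-duplication, rest on simple ideas that can be sketched briefly. The code below is illustrative only: `zlib.crc32` and SHA-256 are stand-ins chosen because they ship with Python, not a statement of which algorithms Btrfs uses, and the function names are hypothetical.

```python
import hashlib
import zlib


def checksum(block):
    """Compute a checksum to store alongside a block when it is written."""
    return zlib.crc32(block)


def verify(block, expected):
    """Return True if a block read back from storage matches its stored checksum."""
    return checksum(block) == expected


def deduplicate(blocks):
    """Collapse identical blocks; return (unique blocks by digest, per-block references)."""
    unique = {}
    refs = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        unique.setdefault(digest, block)  # keep only the first copy of each content
        refs.append(digest)
    return unique, refs


blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]

stored = checksum(blocks[0])
corrupted = b"X" + blocks[0][1:]   # simulate the hardware flipping the first byte
print(verify(blocks[0], stored))   # True
print(verify(corrupted, stored))   # False

unique, refs = deduplicate(blocks)
print(len(blocks), len(unique))    # 3 2
```

The extra hashing work on every read and write is the cost that the paragraph above alludes to.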

It is worth noting that the copy-on-write approach is not without its costs. Obviously, some sort of garbage collection is required or all those block copies will quickly eat up all of the available space on the filesystem. Copying blocks can take more time than simply overwriting them, and it can significantly increase the filesystem's memory requirements. COW operations will also have a tendency to fragment files, wrecking the nice, contiguous layout that the filesystem code put so much effort into creating. Fragmentation hurts less with solid-state devices than on rotational storage, but, even in the former case, fragmented files will not be as quick to access.
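Reference counting is one common way to decide when an old block copy can be reclaimed. The sketch below shows the general technique only; it is an assumption made for illustration, not a description of how Btrfs tracks its blocks, and the `BlockRefCounts` class is hypothetical.

```python
class BlockRefCounts:
    """Toy reference counter: a slot may be reused once nothing points to it."""

    def __init__(self):
        self.refs = {}  # physical slot -> number of users (live file, snapshots)
        self.free = []  # slots available for reuse

    def acquire(self, slot):
        """Record one more user of a slot."""
        self.refs[slot] = self.refs.get(slot, 0) + 1

    def release(self, slot):
        """Drop one user of a slot; reclaim the slot when the last user is gone."""
        self.refs[slot] -= 1
        if self.refs[slot] == 0:
            del self.refs[slot]
            self.free.append(slot)


counts = BlockRefCounts()
counts.acquire(7)   # the live file references slot 7
counts.acquire(7)   # a snapshot shares the same slot
counts.release(7)   # the live file is overwritten; its new data lives elsewhere
print(counts.free)  # [] because the snapshot still needs slot 7
counts.release(7)   # the snapshot is deleted
print(counts.free)  # [7]
```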

So all this shiny new Btrfs functionality does not come for free. In many settings, administrators may well decide that the costs associated with Btrfs outweigh the benefits; those sites will stick with filesystems like ext4 or XFS. For others, though, the flexibility and feature set provided with Btrfs are likely to be quite appealing. Once it is generally accepted that Btrfs is ready for real-world use, chances are it will start popping up on a lot of systems.

Development

One concern your editor has heard in conference hallways is that the pace of Btrfs development has slowed. For the curious, here's the changeset count history for the Btrfs code in the kernel, grouped into approximately one-year periods:

Year	Changesets	Developers
2008 (2.6.25—29)	913	42
2009 (2.6.30—33)	279	45
2010 (2.6.34—37)	193	33
2011 (2.6.38—3.2)	610	67
2012 (3.3—8)	773	63
2013 (3.9—13)	671	68

These numbers, on their own, do not demonstrate a slowing of development; there was an apparent slow period in 2010, but the number of changesets and the number of developers contributing them have held steady thereafter. That said, there are a couple of things to bear in mind when looking at those numbers. One is that the early work involved the addition of features to a brand-new filesystem, while work in 2013 is almost entirely fixes. So the size of the changes has shrunk considerably, but one could easily argue that things should be just that way.

The other relevant point is that contributions by Btrfs creator Chris Mason have clearly fallen in recent years. Partly that is because he has been working on the user-space btrfs-progs code — work which is not reflected in the above, kernel-side-only numbers — but it also seems clear that he has been busy with other work-related issues. It will be interesting to see how things change now that Chris and prolific Btrfs contributor Josef Bacik have found a new home at Facebook.

In summary, the amount of new code going into Btrfs has clearly fallen in recent years, but that will be seen as good news by anybody hoping for a stable filesystem anytime soon. There is still some significant effort going into this filesystem, and chances are good that developer attention will increase as distributors look more closely at using Btrfs by default.

What's next

All told, Btrfs still looks interesting, and it seems like the right time to take a closer look at what is still the next-generation Linux filesystem. Now that the introductory material is out of the way, the remaining articles in this series will start to actually play with Btrfs and explore its feature set. Those articles (appearing here as they are published) are:

Getting started: where to get the software, and the basics of creating and using a Btrfs filesystem.
Working with multiple devices: a Btrfs filesystem is not limited to a single physical device; instead, it supports multiple-device filesystems in a number of RAID and RAID-like configurations. This article describes that functionality and how to make use of it.
Subvolumes and snapshots: creating multiple filesystems on a single storage volume, along with the associated snapshot mechanism.
Send/receive and ioctl(): using the send/receive feature for incremental backups, a brief overview of functionality available with ioctl(), and concluding notes.

By the end of the series, we plan to have a reasonably comprehensive introduction to Btrfs in place; stay tuned.



The Btrfs filesystem: An introduction

Posted Dec 12, 2013 4:39 UTC (Thu) by geuder (subscriber, #62854) [Link] (4 responses)

Hmm, what about reliability? I think it's not too long that one could hear at least rumours about people losing data.

And wasn't there an issue with fsck?

Myself I have only 2 experiences:

- it was the default in the Meego systems I used. But for obvious reason that usage did not last long.

- when I once wanted a "portable" Ubuntu installed on a USB stick, I used btrfs because of the possibility to compress everything. Besides that it probably was only a 4GB stick, I thought the bigger the speed gap between CPU and "disk", the more beneficial compression should be for overall performance, because there are plenty of spare cycles. The overall result was catastrophic, because every bigger apt-get upgrade took really several hours. The reason was that apt seems to be really cautious about not ending up in an inconsistent state if aborted in the middle of an operation, so it does plenty of fsync (IIRC) calls. At least back then that was a known performance problem in btrfs. I shortly experimented with hooking eatmydata underneath apt, but for some reasons the project then faded out...
(I still wonder how Meego could live with that problem. Is it just the number of packages, which probably was only a fraction in a Meego tablet compared to a full Ubuntu desktop installation? Or is rpm less paranoid than apt about ending up with inconsistent results if something goes wrong?)

The Btrfs filesystem: An introduction

Posted Dec 13, 2013 10:47 UTC (Fri) by jezuch (subscriber, #52988) [Link] (3 responses)

> Hmm, what about reliability? I think it's not too long that one could hear at least rumours about people losing data.

I've been running btrfs on / for about 4 years, I think, and I can remember just one, non-data-eating incident. So no problems for me. But others[1] have very different experiences, so...

[1] http://changelog.complete.org/archives/9123-results-with-...

> The reason was that apt seems to be really cautious about not ending up in an inconsistent state if aborted in the middle of an operation, so it does plenty of fsync (IIRC) calls. At least back then that was a known performance problem in btrfs.

Yes, this was rather bad, but there was a workaround (tell APT to not be so damn paranoid). And this problem is long gone now.

The Btrfs filesystem: An introduction

Posted Dec 13, 2013 15:48 UTC (Fri) by jezuch (subscriber, #52988) [Link]

Oh, and BTW: my / contains only the things that I can lose and/or can be easily restored (like the operating system) so I'm not *that* crazy to give all my precious data to an experimental FS ;) (My precious data is on a RAID'ed XFS partition on at least two plain-old, trusted rotating-rust drives; the root is on an old-ish SSD from Intel.)

But I really like snapshotting. It revolutionized the way I do custom builds of Debian packages: create a snapshot, make a mess (installing build-dependencies etc.), build Chromium (lots and lots of thrashing, up to 20 GB of build artifacts), move away what's important, delete snapshot. And after all of this the main filesystem doesn't have a clue that anything happened. And I don't have to clean up anything at all :)

The Btrfs filesystem: An introduction

Posted Apr 1, 2014 16:12 UTC (Tue) by mcortese (guest, #52099) [Link] (1 responses)

> (tell APT to not be so damn paranoid)

How? Especially when apt is called by the installer, not directly by the user?

The Btrfs filesystem: An introduction

Posted Apr 1, 2014 16:43 UTC (Tue) by hummassa (guest, #307) [Link]

load "eatmydata" and call the installer again :D

The Btrfs filesystem: An introduction

Posted Dec 12, 2013 9:23 UTC (Thu) by ebirdie (guest, #512) [Link] (3 responses)

Incidentally I just picked this up, FWIW adding to the list of real world usage, Netgear has put Btrfs into real world use in its ReadyNAS 312 appliance.

<http://www.anandtech.com/show/7500/netgear-readynas-312-2...>

I'm pretty confident there are other vendors too, but to me this was a wake-up for Btrfs.

Great that the editor is publishing the set of articles. Reading the article raised an interest to further information, what vendors/employers might have been active and how they have changed during the presented time period in development of Btrfs.

The Btrfs filesystem: An introduction

Posted Dec 18, 2013 23:18 UTC (Wed) by Lennie (guest, #49641) [Link] (2 responses)

My guess would be they use a large BTRFS filesystem on top of RAID and don't use the multi device support in BTRFS. This is one of the things Suse has also suggested, BTRFS is stable, without using multi device support.

The Btrfs filesystem: An introduction

Posted Dec 19, 2013 23:41 UTC (Thu) by Pc5Y9sbv (guest, #41328) [Link] (1 responses)

We've been testing BTRFS in this configuration (BTRFS formatted on one large hardware RAID disk) with mixed results.

We are formatting about 20-60 TB of raw disk space (different test scenarios), and copying a wide range of different data trees which include large files and huge numbers of small files generated by programs. There might be about 40-70 TB of uncompressed data in around 10M files (using compress-force=zlib, it shrinks to 10-15 TB).

We wanted to store near-line backups with daily/weekly/monthly snapshot history and it failed miserably. It seems we can use the transparent compression and good old-fashioned rsync --link-dest tricks to store our backup history, but if we instead try to take sub-volume snapshots and just keep modifying the "head" via rsync, it blows up and takes the filesystem with it. So, it can handle the huge number of inodes involved in representing trees of millions of files for each day, but it cannot handle the equivalent sub-volume snapshot workload.

The Btrfs filesystem: An introduction

Posted Dec 19, 2013 23:46 UTC (Thu) by Lennie (guest, #49641) [Link]

Well, that is good news.

It's a start. Slowly but surely it will also be (more) stable for other workloads.

btrfs on raw flash

Posted Dec 12, 2013 14:14 UTC (Thu) by seanyoung (subscriber, #28711) [Link] (2 responses)

It just occurred to me that btrfs might be very well suited for flash (i.e. without the silly ftl) since it does not overwrite data.

In fact, if it can be shown that btrfs performance is significantly faster without FTL would that motivate the manufacturers to produce flash kit where you can bypass the FTL? That would give them an edge over their competition.

btrfs on raw flash

Posted Dec 12, 2013 19:16 UTC (Thu) by drag (guest, #31333) [Link]

I suppose it could be ported or the ideas applied to a direct-to-flash file system.

But right now you can't use btrfs directly on flash.

btrfs on raw flash

Posted Dec 13, 2013 8:49 UTC (Fri) by iq-0 (subscriber, #36655) [Link]

You'd need erase block logic and btrfs only does mild write leveling on its superblock. But it should be a good fit for flash with a simple ftl.

The Btrfs filesystem: An introduction

Posted Dec 12, 2013 16:30 UTC (Thu) by masoncl (subscriber, #47138) [Link]

Thanks Jon, looking forward to the rest of the series.

The Btrfs filesystem: An introduction

Posted Dec 12, 2013 23:33 UTC (Thu) by dowdle (subscriber, #659) [Link] (6 responses)

I thought Oracle Enterprise Linux and SUSE Linux Enterprise Server both offered Btrfs as a supported option. Since I don't use either of those, I'm not sure.

The Btrfs filesystem: An introduction

Posted Dec 12, 2013 23:43 UTC (Thu) by dowdle (subscriber, #659) [Link] (2 responses)

Btrfs is listed as a feature in Oracle:

http://www.oracle.com/us/technologies/linux/product/featu...

Btrfs is listed as a feature in SLES:

https://www.suse.com/products/server/features/

Lastly, Btrfs is also shown as available in the RHEL 7 beta that came out recently. They don't have it listed as "preview only" anymore.

The Btrfs filesystem: An introduction

Posted Dec 13, 2013 0:04 UTC (Fri) by rahulsundaram (subscriber, #21946) [Link] (1 responses)

RHEL 7 beta lists Btrfs as a preview still

http://www.redhat.com/about/news/archive/2013/12/red-hat-...

"Btrfs, an emerging file system, will be available as a technology preview within Red Hat Enterprise Linux 7"

http://rhelblog.redhat.com/2013/12/11/testers-wanted-red-...

"btrfs file system .. now available to test"

The Btrfs filesystem: An introduction

Posted Dec 13, 2013 13:11 UTC (Fri) by dowdle (subscriber, #659) [Link]

Thanks for the correction on RHEL7 beta. I was going by the Release Notes which I didn't see preview or testing mentioned in.

The Btrfs filesystem: An introduction

Posted Dec 13, 2013 0:33 UTC (Fri) by anselm (subscriber, #2796) [Link] (2 responses)

In its most recent incarnation, SUSE Linux Enterprise Server prods you with considerable verve towards using Btrfs for your root file system.

On the other hand, perhaps interestingly, SUSE Linux Enterprise Server doesn't even support ext4 except as a read-only filesystem to get stuff off ext4-formatted disks.

The Btrfs filesystem: An introduction

Posted Dec 18, 2013 8:25 UTC (Wed) by salimma (subscriber, #34460) [Link] (1 responses)

At least, with SLES 11 SP3, ext4 r/w support is a kernel option away, unlike in SP2 where they make you hunt around for the additional RPM needed

https://www.suse.com/releasenotes/x86_64/SUSE-SLES/11-SP3/

Still not supported though, as you said. Bizarre decision.

The Btrfs filesystem: An introduction

Posted Dec 18, 2013 9:37 UTC (Wed) by anselm (subscriber, #2796) [Link]

> Still not supported though, as you said.

And that's exactly the thing.

Even if »ext4 r/w support is a kernel option away«, on SLES you're not supposed to run your own kernels if you want your installation to be supported. And who would ever want to run SLES in the first place if it wasn't for the support?

ZFS

Posted Dec 13, 2013 3:42 UTC (Fri) by grahame (guest, #5823) [Link] (2 responses)

While ZFS can't be merged into the kernel, it's still compelling on Linux. Have been using the Ubuntu-ZFS PPA for years with no problems. I don't run / on ZFS, but I run pretty much everything else with it. It's being run on some fairly large storage installations I know of - I wonder if that's one reason BTRFS isn't getting so much traction?

Anyone know why full data checksums are considered too expensive for BTRFS? I'm running ZFS and seeing great performance (you do need a fair whack of memory) -- the checksumming doesn't seem to be a big problem on a modern system. It's very nice to know your data is actually there, too -- once you get out to storing petabytes of data, you will start to see data corruption occasionally.

Scrub vs. fsck is a huge win on systems with large filesystems. My experience of ext4 is that it will develop problems over time, and if you've got a huge partition fsck can easily take a day -- and you're offline for that time. From what I know btrfs scrub isn't quite so solid as ZFS?

ZFS

Posted Dec 13, 2013 10:11 UTC (Fri) by cwillu (guest, #67268) [Link]

First I've heard of the checksumming thing, and I follow the mailing list and irc channels fairly closely.

Checksumming is the default; it's typically only turned off for vm images and the like, as a side-effect of disabling copy-on-write on those files (and even this is being addressed).

ZFS

Posted Dec 13, 2013 10:21 UTC (Fri) by iq-0 (subscriber, #36655) [Link]

It isn't true that data checksums are too expensive for btrfs, but using them does cost you some performance. And there are people that will tolerate almost *no* performance degradation (often for good reasons).

Data checksumming is a good feature but there are enough cases where people might not want to bother with it but where they're still interested in e.g. snapshotting support, transparent compression, deduplication or incremental send/receive.

The reason why btrfs isn't being picked up as much as zfs: Maturity. In the beginning zfs had much the same issues and uptake was rather slow. But it has aged pretty well and is now a pretty much proven filesystem. Btrfs still has some rough edges which make it a less than ideal filesystem for the layman, but it does have the features and they really work. When more people are using it, so will the tooling improve and will the filesystem be considered the default choice for most common uses.

SailfishOS on Jolla phones

Posted Dec 13, 2013 6:16 UTC (Fri) by zdzichu (subscriber, #17118) [Link] (1 responses)

The recently released Jolla phone, running Sailfish, does use btrfs. The "recovery image" is stored as a btrfs snapshot of the stock install.

SailfishOS on Jolla phones

Posted Dec 13, 2013 11:36 UTC (Fri) by ttonino (guest, #4073) [Link]

Indeed the snapshots are very useful to integrate into an OS. For example the way Windows installs updates into a transaction: it all gets installed or nothing is installed. Or the reverse: if an update is not suitable, roll it back by switching to another snapshot.

It could solve quite some problems...

The Btrfs filesystem: An introduction

Posted Dec 13, 2013 14:45 UTC (Fri) by ibukanov (subscriber, #3942) [Link] (3 responses)

A year ago I tried Btrfs at home on a MacMini with Fedora 17. After a couple of months it failed. The included fsck also crashed with a double-free, so it was a complete loss of data. After that the box ran flawlessly with ext4 and periodic checksumming of all files just to make sure that the hard drive works.

I could accept a bug in a new file system code, but a bug with double-free in fsck in read-only mode just told me about bad test coverage for a very important recovery tool.

The Btrfs filesystem: An introduction

Posted Dec 13, 2013 15:30 UTC (Fri) by leoc (guest, #39773) [Link]

I used it exclusively on fedora 19 and found it extremely stable, but it really suffered from too much thrashing on my T61 with a 64GB SSD. When I upgraded to Fedora 20 I went back to ext4 but I find I miss many of the more useful features like inline compression and volume management.

Btrfs/fsck bugs

Posted Dec 14, 2013 0:54 UTC (Sat) by giraffedata (guest, #1954) [Link] (1 responses)

> I could accept a bug in a new file system code, but a bug with double-free in fsck in read-only mode just told me about bad test coverage for a very important recovery tool.

I don't follow the comparison. Why is a bug in new file system code more acceptable than one in fsck? Or harder to catch with testing?

I can make a case for the opposite: If I were allocating resources for finding (or preventing) bugs between file system code and fsck, I would give lower priority to fsck. You have all the time in the world to fix the double-free in fsck and recover your data, but if broken file system code failed to store the data, you're screwed regardless of how well fsck works.

Btrfs/fsck bugs

Posted Dec 14, 2013 8:58 UTC (Sat) by ibukanov (subscriber, #3942) [Link]

fsck is a user-space utility that can be run with all those extra checks like setting MALLOC_CHECK_, using valgrind etc. Tools for the kernel code are just not so good.

> I would give lower priority to fsck.

For me a working fsck gives extra confidence that the data can be recovered, as there are at least 2 types of code (the filesystem itself mounted read-only and the checker) that one can use after bugs. That was the reason that I tried it only at Fedora 17, when the long-promised fsck for Btrfs finally appeared after at least a 2 year delay. I suppose that in turn contributed to the delays with wider Btrfs usage.

The Btrfs filesystem: An introduction

Posted Dec 19, 2013 2:55 UTC (Thu) by heijo (guest, #88363) [Link] (2 responses)

Is it free of known bugs causing unrecoverable data corruption?

How much time has passed since the last bug causing unrecoverable data corruption has been fixed?

Is there any study of the probability that btrfs is free from such bugs? (since, obviously, a filesystem is only usable if that is considered to be near certainty)

The Btrfs filesystem: An introduction

Posted Dec 19, 2013 2:57 UTC (Thu) by heijo (guest, #88363) [Link]

Also, what about SSD performance?

Is the btrfs design optimal for SSDs?

With 1TB SSDs now going for $500, everyone is soon going to use them for all their main data storage needs, which is what needs to be fast on desktops, so it's essential that the default filesystem is optimal for them.

The Btrfs filesystem: An introduction

Posted Dec 20, 2013 20:01 UTC (Fri) by JanC_ (guest, #34940) [Link]

I am pretty sure you can find fairly recent "data corruption" bug fixes for almost every popular linux filesystem. It would be more useful to check how often such bugs were found/fixed during the last 1-2 years, and how likely those were to be triggered in real world use cases.