A brief history of union mounts
Several weeks ago, I mentioned on my blog that I planned to move out of programming in the near future. A few days later I received this email from a kernel hacker friend:
At first, I thought we were losing a great hacker... But then I read on your blog: "Don't worry, I'm going to get union mounts into mainline before I change careers," and I realized this means you'll be with us for a few years yet! :)
How long has union mounts existed without going into the mainline Linux kernel? Well, to put it in a human perspective, if you'd been born the same year as the first Linux implementation of union mounts, you'd be writing your college application essays right now. Werner Almsberger began work on the Inheriting File System, one of the early ancestors of Linux union mounts, in 1993 - 17 years ago!
Background
A union mount does the opposite of a normal mount: Instead of hiding the namespace of the file system covered by the new mount, it shows a combination of the namespaces of the unioned file systems. Some use cases include a writable live CD/DVD-based system (without a complicated mess of symbolic links, bind mounts, and writable directories), and a shared base file system used by multiple clients. For an extremely detailed review of unioning file systems in general, see the LWN series:
- Unioning file systems:
Architecture, features, and design
- Unioning file systems:
Union directories and mounts
- Unioning file systems: unionfs and aufs
This article will provide a high-level overview of various implementations of union mounts from the original 1993 Inheriting File System through the present day VFS-based union mount implementation and plans for near-term development. We deliberately leave aside unionfs, aufs, and other non-VFS implementations of unioning, in large part because the probability of merging a non-VFS unioning file system into mainline appears to be even lower than that of a VFS-based solution.
readdir() redux
Throughout this article, we will place special emphasis on the evolution of readdir(), since historically it has been the greatest stumbling block for any implementation of union mounts. A summary from the first article in the LWN unioning file systems series:One of the great tragedies of the UNIX file system interface is the enshrinement of readdir(), telldir(), seekdir(), etc. family in the POSIX standard. An application may begin reading directory entries and pause at any time, restarting later from the "same" place in the directory. The kernel must give out 32-bit magic values which allow it to restart the readdir() from the point where it last stopped. Originally, this was implemented the same way as positions in a file: the directory entries were stored sequentially in a file and the number returned was the offset of the next directory entry from the beginning of the directory. Newer file systems use more complex schemes and the value returned is no longer a simple offset. To support readdir(), a unioning file system must merge the entries from lower file systems, remove duplicates and whiteouts, and create some sort of stable mapping that allows it to resume readdir() correctly. Support from userspace libraries can make this easier by caching the results in user memory.
Union mounts development time line
As mentioned earlier, one of the first implementations of a unioning was the Inheriting File System. In a pattern to be repeated by many future developers, Werner quickly became disenchanted with the complexity of the implementation of IFS and stopped working on it, suggesting that future developers try a mixed user/kernel implementation instead:
Well, I completed it to the point where it was a nice proof of concept, but still with problems (leaks inodes, probably has a few races left, was also a bit too liberal with locking, etc.).
Then I looked back at what I did and was disgusted by its complexity. So I decided that, before I might even consider proposing inclusion into the mainstream kernel, I'd have to see how much poorer (performance-wise) a user-space implementation would be. I did some initial hacking on NFS until I convinced myself that userfs might be the better approach. Unfortunately, I never found the time to work on that.
Many other kernel developers agreed with Werner. One of Linus Torvalds' earliest recorded NAKs of a kernel-based union file system came in 1996:
In 1998, Werner updated his IFS page to suggest working on a unioning file system as a good academic research topic:
Sounds like a very nice master's thesis topic for some good Linux hacker ;-) [...] So far nobody has taken the challenge. So, if you're an aspiring kernel hacker, aren't afraid of complexity, have a lot of time, and are looking for an interesting but useful project, you may just have found it :-)
Around 2003 - 2004, Jan Blunck took up the gauntlet Werner threw down and began working on union mounts for his thesis. The union mount implementation Jan produced lay dormant until 2007, when discussion about merging unionfs into mainline triggered renewed interest in a VFS-based version of unioning. At that point, Bharata B. Rao took the lead and began working with Jan Blunck on a new version of union mounts. Bharata and Jan posted several versions in 2007.
The first version posted in April 2007 used Jan's original strategy of keeping two pointers in the dentry for each directory, one pointing to the directory below this dentry's in the union stack, and one to the dentry of the topmost directory. The drawback to this implementation is that each file system can only be in one union stack at a time, since dentries are shared between all mounts of the same underlying file system.
The second version posted in May 2007 implemented yet another minor variation on in-kernel readdir(), this time using per file pointer cookies. From the patch set's documentation:
When two processes issue readdir()/getdents() call on the same unioned directory, both of them would be referring to the same dentries via their file structures. So it becomes necessary to maintain rdstate separately for these two instances. This is achieved by using a cookie variable in the rdstate. Each of these rdstate instances would get a different cookie thereby differentiating them.
In June 2007, Bharata and Jan posted a third version with an important and novel change to the way union stacks are formed. They replaced the in-dentry links to the topmost and lower directories with an external structure of pointers to (vfsmount, dentry) pairs. For the first time, a file system could be part of more than one union mount. From the patch set's documentation:
In this new approach, the way union stack is built and traversed has been changed. Instead of dentry-to-dentry links forming the stack between different layers, we now have (vfsmount, dentry) pairs as the building blocks of the union stack. Since this (vfsmount, dentry) combination is unique across all namespaces, we should be able to maintain the union stack sanely even if the filesystem is union mounted privately in different namespaces or if it appears under different mounts due to various types of bind mounts.
In July 2007, Jan
posted
a fourth version with some relatively minor changes to the way
whiteouts were implemented, among a few other things. Jan says,
"I'm able to compile the kernel with this patches applied on a 3
layer union mount with the [separate] layers bind mounted to different
locations. I haven't done any performance tests since I think there is
a more important topic ahead: better readdir() support.
"
In December 2007, Bharata B. Rao posted a fifth version that implemented another in-kernel version of readdir():
In this approach, the cached dirents are given offsets in the form of linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to uniformly define offsets across all the directories of the union irrespective of the type of filesystem involved. Also this is needed to define a seek behaviour on the union mounted directory. This cache is stored as part of the struct file of the topmost directory of the union and will remain as long as the directory is kept open.
However, this approach had multiple problems, including excessive use of kernel memory to cache directory entries and to keep the mapping of indices to dentries.
readdir() continued to be a stumbling block, and union mounts
development slowed down for most of 2008. In April 2008, Nagabhushan
BS implemented
and posted
a version of union mounts with most of the readdir() logic
moved to glibc. "I went through Bharata's RFC post on glibc
based Union Mount readdir solution
(http://lkml.org/lkml/2008/3/11/34)
and have come up with patches against glibc to implement the
same.
"
However, moving the complexity to user space wasn't the panacea everyone had hoped for. The glibc maintainers had many objections, the kernel interface was an obvious kludge (returning whiteouts for "." to signal a unioned directory), and no one could figure out how to handle NFS sanely.
In November 2008, Miklos Szeredi posted a simplified version of union mounts that implemented readdir() in the kernel.
The directory entries are read starting from the top layer and they are maintained in a cache. Subsequently when the entries from the bottom layers of the union stack are read they are checked for duplicates (in the cache) before being passed out to the user space. There can be multiple calls to readdir/getdents routines for reading the entries of a single directory. But union directory cache is not maintained across these calls. Instead for every call, the previously read entries are re-read into the cache and newly read entries are compared against these for duplicates before being they are returned to user space.
This implementation only worked for file systems that return a simple increasing offset in the d_off field for readdir(). So ext2 worked, but any file system with a modern directory hashing scheme did not.
In early 2009, I started to get interested in union mounts. I talked
to several groups inside Red Hat and asked them what they needed most
from file systems. I heard the same story over and over: "We
really really need a unioning file system, but for some reason no one
at Red Hat will support unionfs...
" I did some research on the
available implementations and decided to go to work on Jan Blunck's
union mount patch set.
In May 2009, Jan Blunck and I posted a version of union mounts that implemented in-kernel readdir() using a new concept: the fallthru directory entry. The basic idea is that the first time readdir() is called on a directory, the visible directory entries from all the underlying directories are copied up to the topmost directory as fallthru directory entries. This eliminated all the problems I knew of in previous readdir() implementations, but required the topmost file system to always be read-write. This implementation also was limited to only two layers: one read-only file system overlaid with one read-write file system because we were concerned with lock ordering problems.
In October 2009, I posted a version of union mounts that implemented some of the more difficult system calls, such as truncate(), pivot_root(), and rename(). However, implementing chmod() and other system calls that modified files without opening them turned out to be fairly difficult with the current code base. We thought the hard part was copying up file data in open(), rename, and link(), but it turned out they were somewhat easier to implement because they already looked up the parent directory of the file to be altered. For union mounts, we need the parent directory's dentry and vfsmount in order to create a new version of the file in the topmost level of the union file system if necessary. open(), rename, and link() also needed the parent directory in order to create new directory entries, so we just reused the parent in the union mount in-kernel copyup code. But system calls like chmod() that only alter existing files did not bother to lookup the parent directory's path, only the target. Regretfully, I decided to start on a major rewrite.
In March 2010, I posted a rewrite of the pathname lookup mechanism for union mounts, taking into account Al Viro's recent VFS cleanups and removing a great deal of unnecessary code duplication.
In May 2010, I posted the first version of union mounts that implemented nearly all file related system calls correctly. The four exceptions were fchmod(), fchown(), fsetxattr(), and futimensat(), which will fail on read-only file descriptors. (UNIX is full of surprises; none of the VFS experts I talked to knew that these system calls would succeed on a read-only fd.)
The central primitive in this version is a function called user_path_nd(). It is a combination of user_path(), which looks up a pathname and returns the corresponding dentry and vfsmount, and user_path_parent(), which looks up the parent directory of the file or directory given by the pathname and returns the struct nameidata for the parent. (struct nameidata is too complex to describe in full here, but suffice to say it is usually needed to create an entry in a directory.) user_path_nd() returns both the parent's nameidata and the child's path. Once we have both these pieces of information, we can do an in-kernel copyup of a file in chmod() or any other system call that modifies a file. Unfortunately, user_path_nd() is also the weakest point in this version of union mounts: it's racy, inefficient, and copies up files even if the system call fails.
The day after I posted that version, I flew to North Carolina for a long-anticipated in-person code review with Al Viro. We spent three days in his office painfully reviewing the entire union mount implementation. Al immediately figured out how to delete a third of the code I'd spent the last year carefully massaging, and then outlined how to rewrite the other two-thirds of the code more elegantly, including user_path_nd(). As a result of this code review marathon, Linux has a feature-complete implementation of union mounts that has undergone a full code review by the Linux VFS maintainer for the first time. Of course, the resulting todo list is long and complex, and some problems may turn out to be insoluble, but it's an important step forward.
The biggest design change Al suggested was to move the head of the union stack back into the dentry, while keeping the rest of the union stack in a singly linked list of struct union_dir's allocated external to the dentries for the read-only parts of the union stack. This combines the speed and elegance of Jan Blunck's original design using in-dentry pointers to the union stack, with the flexibility of Bharata B. Rao's (vfsmount, dentry) pairs, which allow file systems to be part of many read-only layers. This change removed the entire union stack hash table and the associated lookup logic and shrunk the union_dir struct from 7 members to 2. I posted this hybrid linked list version on June 15, 2010.
Most recently, on June 25th, 2010, I posted a version that implemented submounts in the read-only layers, as well as allowed more than two read-only layers again. Then I went on a two week vacation - the longest vacation I've had since I started working on union mounts - and tried to forget everything I knew about it.
Future Work
The next step is to implement the remainder of Al Viro's review comments. The last big-ticket item is rewriting user_path_nd() and the in-kernel file copyup boilerplate. After that, it's back for another round of code review from Al and the other VFS maintainers. The 2010 Linux Storage and File Systems workshop is in early August. With luck we can hash out any remaining architectural problems face-to-face at the workshop and possibly merge union mounts into mainline before it's old enough to vote. Or it might languish for another 17 years outside the kernel. Such are the vicissitudes of Linux kernel development.
Acknowledgments: I want to extend special thanks to the following people: Kevin Roderick, who provided moral support, Tim Bowen, who gave me a free day at the Spoke6 co-working space while I worked on this article, and, of course, Jake Edge, whose editorial suggestions were, as usual, right on.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/Union |
| Kernel | Union mounts |
| GuestArticles | Aurora, Valerie |