Unifying filesystems with union mounts

December 24, 2008

This article was contributed by Goldwyn Rodrigues

Unification of filesystems is the concept of mounting several filesystems on a single mount point, with the resulting mount showing the logical combination of all the filesystems. Traditionally, when a filesystem is mounted on a directory, the existing contents of the directory are masked, and the content of the latest mounted filesystem is shown. These masked files are available only after the mounted filesystem is unmounted. Even though these files exist, they are inaccessible to the user. Union mount overcomes this by providing access to all directories and files present in the directory, even after a mount.

In the kernel, the filesystems are stacked in order of their mount sequence, the first mounted filesystem is at the bottom of the mount stack, and the latest mount is at the top of the stack. Only the files and directories of the top of the mount stack are visible. With union mounts, directory entries from the lower filesystems are merged with the directory entries of upper filesystem, thus making a logical combination of all mounted filesystems. Files with the same name in a lower filesystem are masked, as the upper one takes precedence.

Union mounts could be used to update packages of a distribution on a DVD. A writable filesystem could be mounted over the read-only filesystem on the DVD. All new and updated package files would be written to the writable, topmost filesystem, while hiding the duplicate files of the read-only media, or even deleting files (this is done through white-outs discussed later). This allows the user to change any of the files on the system, with the new file stored transparently in the image. Such a setup could be used to roll-up an updated DVD, or maintain a package repository with the latest packages for network installs.

As compared to other implementations, such as unionFS, union mounts try to do all directory entry unification handling in the VFS layer, instead of creating a new filesystem type. Some of the advantages of this approach are:

Simple and Lightweight Design: Since all merges happen inside VFS, there is no need for an additional filesystem layer to maintain and merge metadata.
No need to re-iterate the mount stack by the user while mounting: the user is not required to list the directories participating in the union as a part of the mount command. Only the mount point is enough.
Bind mount works without any problems: this is a VFS feature to remount part of the filesystem hierarchy at additional mount points.

Union mount, developed by Jan Blunck, Bharta B Rao, and Miklos Szeredi, is the first step in unifying mounts in the VFS. The patch implementation is similar to that of the Plan 9/Inferno operating system. Currently, it only does namespace unification at the root directory level and not in the subdirectories.

To mount directories through union mount, the mount command must be modified to recognize and set the union mount options. The util-linux patches that update the mount command can be found at ftp://ftp.suse.com/pub/people/jblunck/union-mount/

As an example, consider the following directory structure of two filesystems:

Issuing the following commands will perform a union mount:

    # mount /dev/sdb /mnt
    # ls /mnt
    dir1 file1 link1

    # mount --union /dev/sdc /mnt
    # ls /mnt
    dir1 dir4 file1 link1

After the union, the directory structure looks like:

Unmounting the /mnt directory unwinds the filesystem mount stack:

    # umount /mnt
    # ls /mnt
    dir1 file1 link1

The filesystems are stacked in the mount order in the kernel. The MNT_UNION flag in vfsmnt is set while mounting union mounts. This helps to identify that the directory entries of the stacked filesystems are supposed to be merged. While performing the lookup sequence, if the MNT_UNION flag is set, all root directory entries of all filesystems are scanned. Scanning happens from top of the filesystem stack to bottom, and the first matching entry is returned. This way any duplicate entries in underlying filesystems are automatically ignored.

Similarly, for the readdir() call, the directory entries are read from the topmost union mount directory to the lowest, and collected in the cache. The cache is responsible for collecting and keeping the directory entries across the stacked filesystem, with different callbacks for each filesystem. Like regular files, directories are seekable and the position of the following read is marked by the file position filp->f_pos. When reading from directories across filesystems, it is possible that the file position exceeds the inode size of the directory where it is merged. In such a situation, the file position is rearranged to select the correct directory in the union stack. This is done by subtracting the inode size if the file position exceeds it and selecting the next member of the union.

This works for filesystems such as ext2 that use flat file directories. The directory entry offsets are arranged linearly and are always smaller than the inode size of the directory. However, some filesystems return special cookies as directory entry offsets which are unrelated to the position in the directory or the inode size. Updating file->f_pos to accommodate more directories does not not work for such filesystems.

There can be multiple calls to readdir()/getdents() routines for reading the entries of a single directory. Currently, the union directory cache is not maintained across these calls. Instead, for every call the previously read entries are re-read into the cache and newly read entries are compared against these for duplicates before being returned to user space. The developers are working on making this efficient by maintaining the cache across readdir()/getdents() calls.

Future Plans: Writable Unions

Currently, the namespace unification is limited to the root filesystem directory entries. Future plans, known as writable unions, would come close to the implementations of unionfs namespace unification. Directory entry merging would not be limited to the root filesystem, but would be done for subdirectories as well. Though these patches have been developed, they still require some time and clean up for the mainline.

Using the example above, a writable union mount of the two filesystems would contain:

Note that dir1 directory now contains both file_b1 and file_c1.

All writes are directed to the topmost mounted filesystem if it is mounted read-write. Mounting a new filesystem upon the current union mount makes all filesystems lower in the stack read-only, though the unified namespace would appear read-write to the user. Any modifications in the files of lower filesystems is handled through copy-on-write. If a file belonging to the lower layers of the stack is opened, the entire file is copied on the topmost filesystem on the stack. This is also known as copy-up, where the file is copied to the topmost layer if it has to record a change. While performing a copy-up, the directory path of the file is also recreated on the topmost filesystem, so that the next time it is mounted as a union, it appears in the same location. The older file gets masked during the directory merge the next time the filesystems are union-mounted in the same order.

Rename on union mounts is handled through -EXDEV. -EXDEV is returned in a rename() operation if the source and destination file paths are on different mounted filesystems. In such a case, the application, such as mv, resorts to a copy operation, and unlinks the file from which the filesystem moved. On union mounts, since any writes are performed in the topmost layer, a move operation to directories in the lower layers returns -EXDEV, which means the application must copy the file to the new directory. If both the source and destination of the rename() operation are in the topmost later, the traditional rename method is used.

Deletion of files is handled by a special file type called white-outs. The white-out file type is similar to negative dentries: they describe a filename which isn't there. This is used to mark a file in the lower read-only filesystem as deleted, since only the topmost layer can be modified. However, white-outs would require support from all the filesystems, to store and recognize such a special file type. Currently, there is a special type, DT_WHT defined in include/linux/fs.h which defines a white-out, but is not in use.

Directory namespace unification is a tough task. FreeBSD implementations gave up after calling it "messy code", while unionfs entered the -mm tree for a brief period, it did not make it to mainline. Since the unification is a pathname-based it is best handled in the VFS instead of using a separate stacked filesystem. The union mount offers a cleaner and more lightweight approach for merging directories, however getting it to adhere to POSIX compliant directory calls such as telldir() or seekdir() is still a challenge and is currently being worked on.

The git repository to track union mounts is located at:

    git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git

under the union-dir branch. The union mounts developers intend to release the patches in a phased manner, starting with the current patch of root directory level merging. Further developments would see patches related to merging at the subdirectory level as well.

Index entries for this article
Kernel	Filesystems/Union
Kernel	Union mounts
GuestArticles	Rodrigues, Goldwyn

to post comments

Unifying filesystems with union mounts

Posted Dec 25, 2008 16:30 UTC (Thu) by ms (subscriber, #41272) [Link] (3 responses)

I think the directory listing under the text "Unmounting the /mnt directory unwinds the filesystem mount stack:" is wrong.

Unifying filesystems with union mounts

Posted Dec 26, 2008 1:05 UTC (Fri) by pr1268 (guest, #24648) [Link]

Agreed. Either that, or I'm totally misunderstanding something here.

???

Unifying filesystems with union mounts

Posted Dec 26, 2008 4:11 UTC (Fri) by goldwynr (guest, #55322) [Link] (1 responses)

Yes, the filesystem listing after unmount should be:

# umount /mnt
# ls /mnt
dir1 file1 link1

Unifying filesystems with union mounts

Posted Dec 26, 2008 4:13 UTC (Fri) by jake (editor, #205) [Link]

and now it is fixed in the article ...

jake

Unifying filesystems with union mounts

Posted Dec 26, 2008 0:44 UTC (Fri) by rvfh (guest, #31018) [Link]

Rename on union mounts is handled through -EXDEV. -EXDEV is returned in a rename() operation if the source and destination file paths are on different mounted filesystems.

Maybe difficult, but I wish there could be a way to just mark the file deleted on the one side and create a hard link to the filesystem where the file actually lives on the other, rather than having to copy the file on a mv operation, and using space on the top mounted filesystem (which might often be memory)...

There would be no unexpected -EXDEV either then: after all, it´s all meant to be the same FS from an application POW.

Unifying filesystems with union mounts

Posted Dec 26, 2008 10:37 UTC (Fri) by dlang (guest, #313) [Link] (2 responses)

a question about the copy-up feature. what triggers this?

opening a file for read-write?

opening a file for read-only with atime enabled? (after all the atime updates need to go somewhere)

opening a file for read-only with noatime?

I can see times when each of these would be desirable

Unifying filesystems with union mounts

Posted Dec 28, 2008 22:21 UTC (Sun) by AnswerGuy (guest, #1256) [Link] (1 responses)

Updating the atime is in the filesystem code, below the VFS. So opening read-only would not trigger copy-up ... it simple passes through to whichever fs in the stack contained the file (inode) ... and that fs then handles it (or ignores it if noatime was set on that mount.

It's interesting to ask if it's possible to specify noatime on some layers and not on others. I see no reason it shouldn't be possible.

I would hope that the open for read-write or read-only or append would only *mark* the file for copy-up. Any subsequent write() on that file descriptor should trigger the actual copying.

The ugly bits are the -EXDEV (directory renames triggering the need for whole directory copies) and the "whiteout" entries. (I presume the later are technically vnodes ... inode-like structures existing only in the VFS and not written to any of the filesystems in the stack?)

Actually all of this raises interesting questions about different use cases for union file stacking. Should the removal of a file simply be an operation that only affects the top layer? Or should it drill down to all write-able layers in the stack? Should this "whiteout" persist through a umount/mount? How should the system handle cases where we are mounting filesystem in different stacks at different times. The simple use cases where we have one filesystem (CD-ROM, network read-only mount etc) are easy to imagine. Add a "magic" file on the top-level filesystem (like ext2/3's "bad-blocks" and "journals" --- files/inodes with no visible links, or like the .nfsXXXXXXXXX server files we see when files are removed on some NFS clients while still open on others; have this contain the whiteout data and any other semantics).

But what about when we want to use union filesystem stacks as a sort of transparent version control system? (Similar to the infamous MVFS/VOBS from Clearcase and other such systems). It gets messy to even define the desired semantics, much less implement them.

Unifying filesystems with union mounts

Posted Dec 29, 2008 0:46 UTC (Mon) by dlang (guest, #313) [Link]

if a layer is read-only then atime should not be updated on it. the definition of union mounts is that all layers except the top one will be read-only (which avoids the headache you mention of deciding which writable layer to modify)

which either means that atime is implicitly disabled by union mounts, or the upper layer needs to store it.

I don't see how directory renames are any worse than large files that get renamed or small modifications (think logfiles for example), in both cases you have a (potentially large) disk item that needs to get copied (after all a directory is just a 'file' containing pointers to file contents). or are you saying that if a directory gets renamed all the files in that directory get copied? if so then the layering also is incompatible with hard linked files.

the other option is to extend the whiteout concept to provide a 'link' type entry as well (similar to a symlink, when file X is opened on the top layer, open file Y on lower layers)

the whiteouts (and any other modifications) definantly need to persist across mounts, otherwise the result can't be considered coherent. now if you remount with a different combination of things (or modify the underlying filesystem(s) through some other mechanism) things do get interesting.

if you want version control semantics then you probably want to use FUSE, not union mounts. if you used union mounts then for each new version you would mount another 'filesystem' on top of the stack, and each version could be accessed (read only) by other processes by mounting the appropiate layers of the stack.

the issues that you would run into here would just be efficiancy in drilling down through (potentially) many layers of the stack.

but that does bring up a question, how can you query what the stack is? If many layers have been mounted over time it may not be trivial to know how to recreate the stack for the next boot if you can't query the kernel to find out.

Unifying filesystems with union mounts

Posted Jan 7, 2009 0:24 UTC (Wed) by mlankhorst (subscriber, #52260) [Link]

How is a merging of a case insensitive and a case-sensitive fs handled?

Unifying filesystems with union mounts

Posted Jan 7, 2009 7:26 UTC (Wed) by jengelh (subscriber, #33263) [Link]

It shall be noted that you cannot do everything with union mounts. unionfs/AUFS will continue to stick around for sure.

http://lkml.org/lkml/2007/5/15/68

Unifying filesystems with union mounts

Posted Jan 10, 2009 3:11 UTC (Sat) by willp (guest, #52971) [Link] (1 responses)

"All writes are directed to the topmost mounted filesystem..."

This seems really obscure to me. So, if you union-mount everything onto the / mountpoint, then if you 'mkdir /blah' it gets created on the first (true root) partition, and you have no way to change that -- ever after bootup? Or am I confused?

If the choice of which device gets the writes is determined by the mount-order, won't that lead to some really strange work flows? I'm thinking, suppose we have:

# at bootup, from fstab processing
mount /dev/sda1 /
mount --union /dev/sda2 / (contains /opt)
mount --union /dev/sda3 / (contains /opt/whatever)

If you need to perform an fsck on /dev/sda2 for some reason, because of hardware issues or something, then you would want to unmount /opt then do an fsck on /dev/sda2, then remount it...

BUT! If you did that, you would inadvertently change the behavior of "mkdir /opt/foo" so that it gets created on /dev/sda3 AFTER the fsck and remount, versus being created on /dev/sda2 BEFORE the fsck and remount... Or am I misunderstanding this...?

It seems to me that implicit behavior is far more risky than explicit behavior. I'd ammend the idea of union mounts with an explicit numeric ordering that you specify with the --union option, like:
mount --union=1 /dev/sda2 / (contains /opt)
mount --union=2 /dev/sda3 / (contains /opt/whatever)

Then when you unmount /dev/sda2, and then remount it, you could specify the --union=1 option to restore /dev/sda2 to its ranking(ordering) with respect to the other devices mounted on the same mountpoint... And if you specify an ordering value that is already "in-use", then the 'mount' command could exit and abort with an error message.

It just seems like a huge risk to incur to have regular maintenance (umount,fsck,mount) lead to unexpected changes in the behavior of basic filesystem operations like mkdir. Especially if it's easily solved with an explicit numeric ordering provided by the admin.

Boy, I hope I'm just really ignorant here and missing something obvious... Anyone able to correct me here and allay my concern?

Unifying filesystems with union mounts

Posted Jan 10, 2009 4:23 UTC (Sat) by dlang (guest, #313) [Link]

you wouldn't use union mounts to mount /, /opt, /opt/whatever

you would use union mounts to mount three different filesystems on /opt/whatever

and yes, all your writes would land in the top filesystem.

to work on a lower-level filesystem you would need to unmount the layers on top of it, and to get back where you started you would need to remount the layers above.

if you write something on one filesystem, but decide you want it on the filesystem below that, you would need to unmount the top layer, mount it somewhere else, and move the file from one filesystem to the other.

I don't understand why you think you would need to mount them in a different order.

if you have dev/1 over dev/2 and decide that they should be in the other order just unmount them and re-mount them in the other order.

your filesystem is already dependant on the order that you mount things today.

if you mount /dev/1 on /opt and then mount /dev/2 on opt you will see the files in /dev/2 and not /dev/1, if you mount them in the other order you will see the files in /dev/1 but not in /dev/2

Alternative approach

Posted Jan 11, 2009 7:34 UTC (Sun) by pjm (guest, #2080) [Link]

The difficulties with readdir and atime and perhaps writes generally make me think it may be worthwhile targetting a simple common case, such as immutable base filesystem (typically squashfs) and a partition for storing changes (perhaps including atime changes).