A brief history of union mounts

Posted Jul 22, 2010 13:43 UTC (Thu) by neilbrown (subscriber, #359)
In reply to: A brief history of union mounts by obi
Parent article: A brief history of union mounts

Yes, we need "Why POSIX made the Editor Grumpy". It could be a nice long series - with lots of guest editors probably.

But telldir/seekdir isn't something that POSIX got wrong. Maybe the cookie size (32-bit) is a bit small, but that is as easy to overcome as the Y2038 bug.

Either you require readdir to return all the entries in a directory in one hit, or you need a stable pointer into the directory so you can ask for the 'next' chunk.
The stable pointer doesn't *have* to be exposed to user-space, but if you are going to have any hope of supporting a network filesystem like NFS, then it has to be exposed to the network protocol, so it has to exist.
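
As a rough illustration of what a "stable pointer" buys you, here is a minimal sketch using the standard opendir/readdir/telldir/seekdir calls. The function names count_entries and count_entries_chunked are made up for this example. The chunked version saves a cookie with telldir() after each chunk, deliberately throws its position away with rewinddir(), and then resumes from the cookie with seekdir(). On an unchanging directory both functions should agree.

```c
#define _XOPEN_SOURCE 700   /* telldir()/seekdir() are XSI interfaces */
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/* Count the entries of 'path' in one pass. Returns -1 on error. */
long count_entries(const char *path)
{
    DIR *d = opendir(path);
    if (d == NULL)
        return -1;
    long n = 0;
    while (readdir(d) != NULL)
        n++;
    closedir(d);
    return n;
}

/*
 * Count the entries of 'path' in chunks of 'chunk' entries.
 * Between chunks the position is saved with telldir(), lost on
 * purpose with rewinddir(), and restored with seekdir(). This only
 * works if the cookie is a stable pointer into the directory.
 */
long count_entries_chunked(const char *path, int chunk)
{
    assert(chunk > 0);
    DIR *d = opendir(path);
    if (d == NULL)
        return -1;
    long n = 0;
    for (;;) {
        int got = 0;
        while (got < chunk && readdir(d) != NULL)
            got++;
        n += got;
        if (got < chunk)            /* readdir() hit the end */
            break;
        long cookie = telldir(d);   /* "where am I?" */
        if (cookie == -1) {
            closedir(d);
            return -1;
        }
        rewinddir(d);               /* forget the position ... */
        seekdir(d, cookie);         /* ... and ask for the 'next' chunk */
    }
    closedir(d);
    return n;
}
```

An NFS server is in much the same position as the chunked loop: it has to hand the client a cookie and be able to continue from it later, whether or not user-space ever sees telldir.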

It isn't hard to design a directory layout which allows stable indexes - it just requires a bit of forethought.

It *is* hard to synthesize such pointers from a union of two directories as you cannot predict or control the pointers you get from each. However it is possible to create a stable solution.

Given that the current union-mount proposal requires "white-out" objects to be created in the on-top filesystem to make objects from the below filesystem disappear, it would not be unreasonable to instead require 'white-in' objects which make objects from the below filesystem appear.

This would require a 'copy-up' of the directory when it is read (though more typically, when the directory is changed) which is a bit more harsh than the 'copy-up' that is required of files on e.g. a chmod. But it would give reliable semantics and in many real cases would not be a real burden.

To be a little more explicit: The common case with union mount is (I expect) that you union-mount an empty filesystem on top of a read-only filesystem, and then make changes. Each time you make changes in a new directory you would need to copy-up that directory and all parents that have not yet been copied-up. The copy-up involves creating a white-in object in the top directory for each object in the bottom. (It is a little more complicated than white-out as you want to store the 'DT_*' type of the underlying object). Then further changes simply happen to the top level directory.
A readdir simply uses the top-level directory.
Any lookup which hits a white-in object (or name) continues the lookup in the underlying filesystem.
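
A toy in-memory model of the scheme, purely as a sketch: none of these names (struct dir, copy_up, union_lookup, union_unlink, put_entry, find_index) come from any real patch set, the enum 'type' merely stands in for the DT_* values, and a real implementation would live in the VFS and the on-disk format rather than in arrays. A readdir in this model is just a walk over top->e[], so the top directory's own stable offsets can be used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 16

enum kind { ENTRY_REAL, ENTRY_WHITE_IN };
enum type { TYPE_REG, TYPE_DIR };   /* stand-in for the DT_* values */

struct entry {
    const char *name;
    enum kind   kind;
    enum type   type;   /* for a white-in: cached type of the lower object */
    const char *data;   /* toy payload; unused for white-ins */
};

struct dir {
    struct entry e[MAX_ENTRIES];
    size_t n;
};

/* Index of 'name' in 'd', or -1 if there is no such entry. */
long find_index(const struct dir *d, const char *name)
{
    for (size_t i = 0; i < d->n; i++)
        if (strcmp(d->e[i].name, name) == 0)
            return (long)i;
    return -1;
}

/* Add an entry, replacing any existing entry of the same name. */
void put_entry(struct dir *d, struct entry ent)
{
    long i = find_index(d, ent.name);
    if (i < 0) {
        assert(d->n < MAX_ENTRIES);
        i = (long)d->n++;
    }
    d->e[i] = ent;
}

/* Copy-up: one white-in in 'top' for each name in 'bottom' that top lacks. */
void copy_up(struct dir *top, const struct dir *bottom)
{
    for (size_t i = 0; i < bottom->n; i++) {
        const struct entry *b = &bottom->e[i];
        if (find_index(top, b->name) < 0)
            put_entry(top, (struct entry){ b->name, ENTRY_WHITE_IN, b->type, NULL });
    }
}

/* Lookup in a copied-up directory: only a white-in falls through. */
const struct entry *union_lookup(const struct dir *top,
                                 const struct dir *bottom, const char *name)
{
    long i = find_index(top, name);
    if (i < 0)
        return NULL;                /* nothing in top: name is not visible */
    if (top->e[i].kind != ENTRY_WHITE_IN)
        return &top->e[i];          /* real object in the top layer */
    i = find_index(bottom, name);   /* white-in: continue below */
    return i < 0 ? NULL : &bottom->e[i];
}

/* Unlink: drop the top entry. Returns 0 on success, -1 if absent. */
int union_unlink(struct dir *top, const char *name)
{
    long i = find_index(top, name);
    if (i < 0)
        return -1;
    top->e[i] = top->e[--top->n];   /* move the last entry into the hole */
    return 0;
}
```

In this toy model, removing a white-in is enough to hide the lower object, and creating a file simply replaces the white-in with a real entry.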

(unfortunately the margin is too small to contain my elegant implementation....)

A brief history of union mounts

Posted Jul 22, 2010 21:20 UTC (Thu) by nix (subscriber, #2304) [Link] (6 responses)

> It could be a nice long series - with lots of guest editors probably.

Hell yes. And the suck extends to fairly simple areas. Just saying 'EINTR' and 'short reads' is enough to make anyone who's ever written even a trivial C program on a Unix platform wince. (What do you mean I need a horrible-looking for loop to read a file reliably?!)

A brief history of union mounts

Posted Jul 23, 2010 2:51 UTC (Fri) by neilbrown (subscriber, #359) [Link] (5 responses)

?? You don't get EINTR on read from a regular file - only pipes, sockets, char devices and similar things.

But in general I agree - signals make it very hard to write correct programs.

A brief history of union mounts

Posted Jul 24, 2010 18:16 UTC (Sat) by nix (subscriber, #2304) [Link] (4 responses)

You do get EINTR on read from a regular file if you're unlucky enough to have that file on a network device (e.g. NFS with intr turned on). And before you say 'don't do that then': until very recently the choice was between turning intr on and losing the whole mount point (and, very shortly afterwards, often the whole machine) if the network went down. (And, yes, I have encountered both short reads and EINTR in NFS-based regular file reads on both Linux and Solaris. So it does happen.)

(Also, POSIX doesn't ban getting EINTR on reads from a regular file, so prudence dictates expecting it.)
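
For anyone who hasn't had the pleasure, the loop in question looks roughly like this. It is a minimal sketch: read_full is a made-up helper name, not a standard function, and it uses only read(), errno and EINTR. It retries on EINTR, keeps going after a short read, and stops at end-of-file or on a real error.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to 'count' bytes from 'fd' into 'buf', coping with EINTR
 * and short reads. Returns the number of bytes read (less than
 * 'count' only at end-of-file), or -1 with errno set on a real error.
 */
ssize_t read_full(int fd, void *buf, size_t count)
{
    /* read() with count > SSIZE_MAX is implementation-defined */
    assert(count <= (size_t)SSIZE_MAX);

    char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;          /* genuine error */
        }
        if (n == 0)
            break;              /* end-of-file */
        done += (size_t)n;      /* possibly a short read: go round again */
    }
    return (ssize_t)done;
}
```

One design wart worth noting: if an error arrives after some bytes have already been read, this sketch returns -1 and the caller cannot tell how much was consumed. Whether that matters depends on the caller.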

A brief history of union mounts

Posted Jul 25, 2010 4:20 UTC (Sun) by neilbrown (subscriber, #359) [Link] (3 responses)

Fair point, though that is really an NFS issue rather than a general POSIX issue. And NFS has a lot more than just that to answer for.

POSIX has a concept of 'slow' and 'not slow' reads where 'slow' reads can result in a short read or EINTR, and disk IO is explicitly not a slow read. So if your file is on disk you cannot get EINTR.
I guess being on disk on another machine doesn't count. :-(

A brief history of union mounts

Posted Jul 31, 2010 20:27 UTC (Sat) by nix (subscriber, #2304) [Link] (2 responses)

I've heard this over and over again, but I've looked through the POSIX specs and I can't find it. No mention of slow reads, no mention that some devices are guaranteed not to get EINTR, no mention in the rationale either.

Now perhaps this is a de facto universal implementation detail, but as far as I can see it isn't in POSIX itself. (Maybe I just haven't looked in the right place?)

A brief history of union mounts

Posted Aug 1, 2010 10:01 UTC (Sun) by neilbrown (subscriber, #359) [Link] (1 responses)

It seems you are right.

http://www.opengroup.org/onlinepubs/9699919799/functions/...

appears to allow any read to be interrupted, and says in the "informative" section "The issue of which files or file types are interruptible is considered an implementation design issue. This is often affected primarily by hardware and reliability issues." which is singularly unhelpful.

I was basing my statements on "man 7 signal" which does talk about "slow" devices. Clearly this isn't normative....

As you say, POSIX by itself is enough to make one wince...

A brief history of union mounts

Posted Aug 4, 2010 22:45 UTC (Wed) by nix (subscriber, #2304) [Link]

Quite so :/

Even 'man 7 signal' says clearly that 'The details vary across Unix systems; below, the details for Linux', and that's not terribly useful really for the vast majority of software. (I suppose you can rely on it in mdadm ;} )

A brief history of union mounts

Posted Jul 23, 2010 20:01 UTC (Fri) by vaurora (guest, #38407) [Link] (1 responses)

Excellent idea - you've just described fallthru dentries. :)

The implementation of fallthrus is pretty small, around a hundred lines in main VFS and then you reuse the whiteout infrastructure in the client file systems.

A brief history of union mounts

Posted Jul 24, 2010 2:10 UTC (Sat) by neilbrown (subscriber, #359) [Link]

> Excellent idea - you've just described fallthru dentries. :)

Yes.... after writing that I went back through the original article, noticed 'fallthru' this time, and felt a bit sheepish.

I don't quite see, though, either how you would implement fallthru using whiteout, or why you would still want whiteout if you were using copy-up + fallthru.

It is a pity that a block-based COW solution is so inefficient - it is such a simple solution that would address many of the use-cases (not the NFS-as-underlying-filesystem case of course).