Widening ext4's readdir() cookie
In a separate article, we explained how an ext4 change to the kernel-user-space ABI in Linux 3.4 broke the GlusterFS filesystem; here, we look in detail at the change and why it was needed. The change in question was a patch by Fan Yong that widened the readdir() "cookies" produced by ext4 from 32 to 64 bits. Understanding why Fan's patch was necessary first requires a bit of background on the readdir() API.
The readdir API consists of a number of functions that allow an application to walk through the entries in a directory list. The opendir() function opens a directory stream for a specified directory. The readdir() function returns the contents of a directory stream, one entry at a time. The telldir() and seekdir() functions provide lseek-style functionality: an application can remember its current position in a directory stream using telldir(), scan further entries with readdir(), and then return to the remembered position using seekdir().
It turns out that supporting the readdir API is a source of considerable pain for filesystem developers. The API was designed in a simpler age, when directories were essentially linear tables of filenames plus inode numbers. The first of the widely used Linux filesystems, ext2, followed that design. In such filesystems, one can meaningfully talk about an offset within a directory table.
However, in the interests of improving performance and supporting new features, modern filesystems (such as ext4) have long since adopted more complex data structures—typically B-trees (PDF)—for representing directories. The problem with B-tree structures, from the point of view of implementing the readdir() API, is that the nodes in a tree can undergo (sometimes drastic) rearrangements as entries are added to and removed from the tree. This reordering of the tree renders the concept of a directory "offset" meaningless. The lack of a stable offset value is obviously a difficulty when implementing telldir() and seekdir(). However, it is also a problem for the implementation of readdir(), which must be done in such a way that a loop using readdir() to scan an entire directory will return a list of all files in the directory, without duplicates. Consequently, readdir() must internally also maintain some kind of stable representation of a position within the directory stream.
Although there is no notion of an offset inside a B-tree, the implementers of modern filesystems must still support the readdir API (albeit reluctantly); indeed, support for the API is a POSIX requirement. Therefore, it is necessary to find some means of supporting "directory position" semantics. This is generally done by fudging the returned offset value, instead returning an internally understood "cookie" value. The idea is that the kernel computes a hash value that encodes some notion of the current position in a directory (tree) and returns that value (the cookie) to user space. A subsequent readdir() or seekdir() will pass the cookie back to the kernel, at which point the kernel decodes the cookie to derive a position within the directory.
Encoding the directory position as a cookie works, more or less, but has some limitations. The cookie has historically been a 31-bit hash value, because older NFS implementations could handle only 32-bit cookies. (The hash is 31-bit because the off_t type used to represent the information is defined as a signed type, and negative offsets are not allowed.) In earlier times, a 31-bit hash was not too much of a problem: filesystem limitations meant that directories were usually small, so the chance that two directory entries would hash to the same value was small.
However, modern filesystems allow for large directories—so large that the chance of two files producing the same 31-bit hash is significant. For example, in a directory with 2000 entries, the chance of a collision is around 0.1%. In a directory with 32,768 entries (the historical limit in ext2), the chance is somewhat more than 20%. (For the math behind these numbers, see the Wikipedia article on the Birthday Paradox.) Modern filesystems have much higher limits on the number of files in a directory, with a corresponding increase in the chance of hash collisions; in a directory with 100,000 entries, the probability is over 90%.
Two files that hash to the same cookie value can lead to problems when using readdir(), especially on NFS. Suppose that we want to scan all of the files in a directory. And suppose that two files, say abc and xyz, hash to the same value, and that the directory is ordered such that abc is scanned first. When an NFS client readdir() later reaches the file xyz, it will receive a cookie that is exactly the same as for abc. Upon passing that cookie back to the NFS server, the next readdir() will commence at the file following abc. The NFS client code has some logic to detect this situation; that logic causes readdir() to give the (somewhat counter-intuitive) error ELOOP, "Too many levels of symbolic links".
This error can be fairly easily reproduced on NFS with older kernels. One simply has to create an ext4 directory containing enough files, mount that directory over NFS, and run any program that performs a readdir() loop over the directory on the NFS client. When working with a local filesystem (no NFS involved), the same problem exists, but in a different form. One does not encounter it when using readdir(), because of the way in which that function is implemented on top of the getdents() system call. Essentially, opendir() opens a file descriptor that is used by getdents(); the kernel is able to internally associate a directory position with that file descriptor, so cookies play no part in the implementation of readdir(). By contrast, because NFS is stateless, each readdir() over NFS requires that the NFS server explicitly locate the directory position corresponding to the cookie sent by the client.
On the other hand, the problem can be observed with a local ext4 filesystem when using telldir(), because that function explicitly returns the directory "offset" cookie to the caller. If two directory entries produce the same "offset" cookie when calling telldir(), then a call to seekdir() after either of the telldir() calls will go back to the same location. A user-space loop such as the following easily reveals the problem, encountering a difficulty analogous to a readdir() loop over NFS:
dirp = opendir("/path/to/ext4/dir");
while ((dirent = readdir(dirp)) != NULL) {
...
seekdir(dirp, telldir(dirp));
...
}
The seekdir(dirp, telldir(dirp)) call is a seeming no-op, simply resetting the directory position to its current location. However, where a directory entry hashes to the same value as an earlier directory entry, the effect of the call will be to reset the directory position to the earlier entry with the same hash. An infinite loop thus results. Real programs would of course not use telldir() and seekdir() in this manner. However, every now and then programs that use those calls would obtain a surprising result: a seekdir() would reposition the directory stream to a completely unexpected location.
Thus, the cookie collision problem needed to be fixed for the benefit of both ext4 and (especially) NFS. The simplest way of reducing the likelihood of hash collisions is to increase the size of the hash space. That was the purpose of Fan's patch, which increased the size of the hash space for the offset cookies produced by ext4 from 31 bits to 63. (A similar change has also been merged for ext3.) With a 63-bit hash space, even a directory containing one million entries would have less than one chance in four million of producing a hash collision. Of course, a corresponding change is required in NFS, so that the NFS server is able to deal with the larger cookie sizes. That change was provided in a patch by Bernd Schubert.
Reading this article and the GlusterFS article together, one might
wonder why GlusterFS doesn't have the same problems with XFS that it has
with ext4. The answer, as noted by Dave
Chinner, is that XFS uses a rather different scheme to produce
readdir() cookies. That scheme produces cookies that require only
32 bits, and the cookies are produced in such a way as to guarantee that no
two files can generate the same cookie. XFS is able to produce unique
32-bit cookies due to the virtual mapping it overlays onto the directory
index; adding such a mapping to ext4 (which does not otherwise need it)
would be a large job.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/ext4 |