
Widening ext4's readdir() cookie

By Michael Kerrisk
March 27, 2013

In a separate article, we explained how an ext4 change to the kernel-user-space ABI in Linux 3.4 broke the GlusterFS filesystem; here, we look in detail at the change and why it was needed. The change in question was a patch by Fan Yong that widened the readdir() "cookies" produced by ext4 from 32 to 64 bits. Understanding why Fan's patch was necessary first requires a bit of background on the readdir() API.

The readdir API consists of a number of functions that allow an application to walk through the entries in a directory. The opendir() function opens a directory stream for a specified directory. The readdir() function returns the contents of a directory stream, one entry at a time. The telldir() and seekdir() functions provide lseek-style functionality: an application can remember its current position in a directory stream using telldir(), scan further entries with readdir(), and then return to the remembered position using seekdir().
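
As a minimal illustration of these calls in action (error checking omitted; the directory path is just a placeholder):

    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *dirp = opendir("/etc");        /* placeholder directory */
        struct dirent *d;
        long pos = -1;

        while ((d = readdir(dirp)) != NULL) {
            if (pos == -1)
                pos = telldir(dirp);        /* remember the position
                                               after the first entry */
            printf("%s\n", d->d_name);
        }

        seekdir(dirp, pos);                 /* rewind to that position */
        d = readdir(dirp);                  /* returns the second entry */
        closedir(dirp);
        return 0;
    }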

It turns out that supporting the readdir API is a source of considerable pain for filesystem developers. The API was designed in a simpler age, when directories were essentially linear tables of filenames plus inode numbers. The first of the widely used Linux filesystems, ext2, followed that design. In such filesystems, one can meaningfully talk about an offset within a directory table.

However, in the interests of improving performance and supporting new features, modern filesystems (such as ext4) have long since adopted more complex data structures—typically B-trees (PDF)—for representing directories. The problem with B-tree structures, from the point of view of implementing the readdir() API, is that the nodes in a tree can undergo (sometimes drastic) rearrangements as entries are added to and removed from the tree. This reordering of the tree renders the concept of a directory "offset" meaningless. The lack of a stable offset value is obviously a difficulty when implementing telldir() and seekdir(). However, it is also a problem for the implementation of readdir(), which must be done in such a way that a loop using readdir() to scan an entire directory will return a list of all files in the directory, without duplicates. Consequently, readdir() must internally also maintain some kind of stable representation of a position within the directory stream.

Although there is no notion of an offset inside a B-tree, the implementers of modern filesystems must still support the readdir API (albeit reluctantly); indeed, support for the API is a POSIX requirement. Therefore, it is necessary to find some means of supporting "directory position" semantics. This is generally done by fudging the returned offset value, instead returning an internally understood "cookie" value. The idea is that the kernel computes a hash value that encodes some notion of the current position in a directory (tree) and returns that value (the cookie) to user space. A subsequent readdir() or seekdir() will pass the cookie back to the kernel, at which point the kernel decodes the cookie to derive a position within the directory.
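
By way of illustration, a much-simplified sketch of deriving a cookie from a name hash might look like the following. The hash function and masking here are inventions for the example, not ext4's actual scheme (ext4 uses a seeded, considerably more elaborate hash):

    #include <stdint.h>

    /* Illustrative only: derive a 31-bit directory cookie from a name.
     * The kernel would later search the directory tree for the first
     * entry whose hash is >= this value. */
    static uint32_t cookie_from_name(const char *name)
    {
        uint32_t h = 5381;                      /* djb2, for the example */
        while (*name)
            h = h * 33 + (unsigned char)*name++;
        return h & 0x7fffffff;  /* off_t is signed: keep the cookie
                                   non-negative, hence 31 bits */
    }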

Encoding the directory position as a cookie works, more or less, but has some limitations. The cookie has historically been a 31-bit hash value, because older NFS implementations could handle only 32-bit cookies. (The hash is 31-bit because the off_t type used to represent the information is defined as a signed type, and negative offsets are not allowed.) In earlier times, a 31-bit hash was not too much of a problem: filesystem limitations meant that directories were usually small, so the chance that two directory entries would hash to the same value was small.

However, modern filesystems allow for large directories—so large that the chance of two files producing the same 31-bit hash is significant. For example, in a directory with 2000 entries, the chance of a collision is around 0.1%. In a directory with 32,768 entries (the historical limit in ext2), the chance is somewhat more than 20%. (For the math behind these numbers, see the Wikipedia article on the Birthday Paradox.) Modern filesystems have much higher limits on the number of files in a directory, with a corresponding increase in the chance of hash collisions; in a directory with 100,000 entries, the probability is over 90%.
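
For reference, the approximation behind those figures is the usual birthday bound. With n entries hashed into H = 2^31 possible values, the probability of at least one collision is approximately:

    p(n) \approx 1 - e^{-n(n-1)/(2H)}, \qquad H = 2^{31}

    p(2000)   \approx 1 - e^{-0.00093} \approx 0.09\%
    p(32768)  \approx 1 - e^{-0.25}    \approx 22\%
    p(100000) \approx 1 - e^{-2.33}    \approx 90\%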

Two files that hash to the same cookie value can lead to problems when using readdir(), especially on NFS. Suppose that we want to scan all of the files in a directory. And suppose that two files, say abc and xyz, hash to the same value, and that the directory is ordered such that abc is scanned first. When a readdir() loop on an NFS client later reaches the file xyz, it will receive a cookie that is exactly the same as for abc. When that cookie is passed back to the NFS server, the next readdir() will commence at the file following abc. The NFS client code has some logic to detect this situation; that logic causes readdir() to give the (somewhat counter-intuitive) error ELOOP, "Too many levels of symbolic links".

This error can be fairly easily reproduced on NFS with older kernels. One simply has to create an ext4 directory containing enough files, mount that directory over NFS, and run any program that performs a readdir() loop over the directory on the NFS client. When working with a local filesystem (no NFS involved), the same problem exists, but in a different form. One does not encounter it when using readdir(), because of the way in which that function is implemented on top of the getdents() system call. Essentially, opendir() opens a file descriptor that is used by getdents(); the kernel is able to internally associate a directory position with that file descriptor, so cookies play no part in the implementation of readdir(). By contrast, because NFS is stateless, each readdir() over NFS requires that the NFS server explicitly locate the directory position corresponding to the cookie sent by the client.
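
The cookie is nonetheless visible locally: each record returned by getdents() carries a d_off field, an opaque value that can be passed to lseek() to resume the scan after that entry. A minimal sketch, along the lines of the example in the getdents(2) manual page (the path is a placeholder):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct linux_dirent64 {          /* matches the kernel's layout */
        unsigned long long d_ino;
        long long          d_off;    /* cookie for resuming after here */
        unsigned short     d_reclen;
        unsigned char      d_type;
        char               d_name[];
    };

    int main(void)
    {
        int fd = open("/etc", O_RDONLY | O_DIRECTORY);  /* placeholder */
        char buf[8192];
        long nread;

        while ((nread = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
            for (long off = 0; off < nread; ) {
                struct linux_dirent64 *d = (void *)(buf + off);
                printf("%-24s d_off=%lld\n", d->d_name, d->d_off);
                off += d->d_reclen;
            }
        }

        close(fd);
        return 0;
    }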

On the other hand, the problem can be observed with a local ext4 filesystem when using telldir(), because that function explicitly returns the directory "offset" cookie to the caller. If two directory entries produce the same "offset" cookie when calling telldir(), then a call to seekdir() after either of the telldir() calls will go back to the same location. A user-space loop such as the following easily reveals the problem, encountering a difficulty analogous to a readdir() loop over NFS:

    #include <dirent.h>

    DIR *dirp;
    struct dirent *dirent;

    dirp = opendir("/path/to/ext4/dir");
    while ((dirent = readdir(dirp)) != NULL) {
        ...
        seekdir(dirp, telldir(dirp));   /* a seeming no-op */
        ...
    }

The seekdir(dirp, telldir(dirp)) call is a seeming no-op, simply resetting the directory position to its current location. However, where a directory entry hashes to the same value as an earlier directory entry, the effect of the call will be to reset the directory position to the earlier entry with the same hash. An infinite loop thus results. Real programs would of course not use telldir() and seekdir() in this manner. However, every now and then programs that use those calls would obtain a surprising result: a seekdir() would reposition the directory stream to a completely unexpected location.

Thus, the cookie collision problem needed to be fixed for the benefit of both ext4 and (especially) NFS. The simplest way of reducing the likelihood of hash collisions is to increase the size of the hash space. That was the purpose of Fan's patch, which increased the size of the hash space for the offset cookies produced by ext4 from 31 bits to 63. (A similar change has also been merged for ext3.) With a 63-bit hash space, even a directory containing one million entries would have less than one chance in four million of producing a hash collision. Of course, a corresponding change is required in NFS, so that the NFS server is able to deal with the larger cookie sizes. That change was provided in a patch by Bernd Schubert.

Reading this article and the GlusterFS article together, one might wonder why GlusterFS doesn't have the same problems with XFS that it has with ext4. The answer, as noted by Dave Chinner, is that XFS uses a rather different scheme to produce readdir() cookies. That scheme produces cookies that require only 32 bits, and the cookies are produced in such a way as to guarantee that no two files can generate the same cookie. XFS is able to produce unique 32-bit cookies due to the virtual mapping it overlays onto the directory index; adding such a mapping to ext4 (which does not otherwise need it) would be a large job.

Index entries for this article
Kernel: Filesystems/ext4



Widening ext4's readdir() cookie

Posted Mar 28, 2013 8:58 UTC (Thu) by paulj (subscriber, #341) [Link] (18 responses)

It's interesting to read of the readdir offset debate, and read the hate for it. Most interesting is the "reluctantly" link in the article, http://lwn.net/Articles/544237/ - where Ted seems to make 2 claims:

  1. The readdir/telldir offset is a legacy problem (NFS), and if not for that it would not be in the kernel.
  2. It would not generally be practical for userspace to keep directories in memory while walking them for telldir/readdir.

Aren't these claims potentially contradictory? If it is not practical to keep directories in memory for telldir/readdir, why would it be practical for any other API? How would a dir-walking API work that didn't require userspace to keep a shadow copy (or didn't require the kernel to keep a shadow copy)? Is this perhaps a fundamental problem of a trade-off of where to keep state, rather than anything particular to readdir?

Also strange is the language in the SUSv3 spec which Ted quotes in another email (http://lwn.net/Articles/544257/):

" One of the perceived problems of implementation is that returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered."

Difficult to describe formally? That's odd. Surely it is quite trivial? You define that directory entries have an absolute ordering (lexical, numerical, whatever you want - it doesn't matter). Why are B-trees a problem? B-trees are defined to provide an index over absolutely ordered objects. Given, say, a B-tree and a rank in that order (e.g. the cookie of a previous dirent), there should be no problem to determine what the next dirent in the order is - even if the B-tree has been changed or re-balanced, even if the dirent for the cookie no longer exists! Surely? All ordered indices should have this property (i.e. hash tables do not, but pretty much all commonly used trees would, no?), surely?

Basically, I'm puzzled, because the issues as stated don't fully make sense (either not issues, or potentially not consistent) to me. So I'm wondering what the deeper issues are, that the developers and spec writers haven't explained; or else whether perhaps the developers have gotten so "in" to the details of the problem that they're overlooking some things that'd be clearer if they stepped back? (Or it could just be me having one of my amazingly stupid moments ;) - in which case, my apologies tentatively in advance).

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:07 UTC (Thu) by paulj (subscriber, #341) [Link] (7 responses)

And further, this bit of the article made me scratch my head and wonder which of my understanding, the description of the issues involved in the article, and/or the code is suboptimal:

"Although there is no notion of an offset inside a B-tree, the implementers of modern filesystems must still support the readdir API (albeit reluctantly); indeed, support for the API is a POSIX requirement. Therefore, it is necessary to find some means of supporting "directory position" semantics. This is generally done by fudging the returned offset value, instead returning an internally understood "cookie" value. The idea is that the kernel computes a hash value that encodes some notion of the current position in a directory (tree) and returns that value (the cookie) to user space.

Encoding the position in the tree in the cookie through a hash? But the tree is indexing something with an absolute order - the very node in the tree MUST be associated with a unique value already (e.g. the name), surely? Why would you then hash this?

(I'm assuming I'm having a pre-morning-caffeine stupid moment :) ).

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:13 UTC (Thu) by paulj (subscriber, #341) [Link]

I guess the issue must be that the order the B-tree is indexing is ephemeral (e.g. it's an index into the current set of entries). But then, is there no way to make that order more stable?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:23 UTC (Thu) by dark (guest, #8483) [Link] (5 responses)

I think the problem here is that the underlying ordered data entries are simply too big (they're whole filenames) and telldir() returns a long. No matter how you plan to map the names to long offsets, you'll find some combination of mutations to the directory that makes the offset invalid.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:32 UTC (Thu) by paulj (subscriber, #341) [Link] (3 responses)

Yes, names would be too big. Of course, you don't need an index into the space of all possible names. You need an index into the entries (not necessarily names) that exist and have existed in this directory. I wonder is there not a way to get a stable order, with a reliable "next" operation, out of that?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:57 UTC (Thu) by dlang (guest, #313) [Link]

remember, the OS has no idea when (or even if) the application is going to make its next request. The spec allows the application to read one filename a week and still be guaranteed to see all files that existed when it started the read with no duplicates.

If there were no resource constraints, the system could use RCU type strategies to make a copy of the directory and just use the token as a numeric offset into this copy, but since the system has no way of ever knowing when it could release this copy, it would have to keep it forever.

remember that with NFS, a server is required to be consistent even across reboots.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 10:24 UTC (Thu) by paulj (subscriber, #341) [Link] (1 responses)

E.g. why not make the order a ring? Store last created ID with the directory. Use an index into that order that is easily capable of answering "What is the next value in the order?" - tree indices typically are capable of this.

To create an entry:

- locate the ID in your ordered index (B-tree, whatever)
- Determine the next free ID by walking the index to the next existing ID
- Create your entry with that free ID, update your index
- Update the directory's last created ID

Insertions have to be done atomically wrt directory and index and any readers. Removals do not need to modify the last-created ID in the directory though. But it does not matter a jot how the internal structure of the index changes.

No collisions, no hashes, no encoding the structure of indices into cookies. No nasty hacks. Just work with the grain by making the underlying order you're indexing have the properties you want.

You are limited to # of directory entries being the size of the ring, but it seems NFS compatibility would allow that ring to be (2^64)-1.

Why wouldn't that work?
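
A toy model of the idea, for concreteness (an illustration of the scheme proposed above, not real filesystem code; a flat array and linear scans stand in for the B-tree index):

    #include <stdbool.h>
    #include <stdint.h>

    #define DIR_MAX 1024

    struct toy_dir {
        uint64_t id[DIR_MAX];   /* stands in for the ordered index */
        size_t   nr;
        uint64_t last_id;       /* stored with the directory */
    };

    static bool id_used(const struct toy_dir *d, uint64_t v)
    {
        for (size_t i = 0; i < d->nr; i++)
            if (d->id[i] == v)
                return true;
        return false;
    }

    /* "Get unused ID": walk forward from the last created ID. */
    static uint64_t alloc_id(struct toy_dir *d)
    {
        uint64_t v = d->last_id + 1;    /* wraps naturally at 2^64 */
        while (id_used(d, v))
            v++;
        d->last_id = v;
        return v;
    }

    /* "Get next used ID": resume a scan from a cookie. This survives
       any rebalancing of the real index, and works even if the entry
       the cookie came from has since been deleted. */
    static bool next_used(const struct toy_dir *d, uint64_t cookie,
                          uint64_t *next)
    {
        bool found = false;
        for (size_t i = 0; i < d->nr; i++)
            if (d->id[i] > cookie && (!found || d->id[i] < *next)) {
                *next = d->id[i];
                found = true;
            }
        return found;
    }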

Widening ext4's readdir() cookie

Posted Mar 28, 2013 11:31 UTC (Thu) by paulj (subscriber, #341) [Link]

Actually the ring isn't necessary. Any way to pick a free ID will do.

Widening ext4's readdir() cookie

Posted Apr 4, 2013 1:56 UTC (Thu) by vasi (subscriber, #83946) [Link]

HFS+ could do this. Each directory entry has a unique ID (the "catalog node ID"), which fits in a long. You can look up the CNID to get the file name, and then use that name as your position in the directory.

This is still imperfect. Because each CNID is associated with a file name, hard links require an extra level of indirection. And obviously if the file at the current position is deleted, or renamed, things won't work as expected.

This is all moot, since Linux's HFS+ driver just uses the offset as f_pos; I'm not sure why. Maybe the f_pos values are required to be increasing? CNIDs are not.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:18 UTC (Thu) by dlang (guest, #313) [Link] (9 responses)

the big problem is that the directory contents may be changed by something else while you are going through the list of files.

So your b-tree or other complex structure may change, potentially significantly.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:41 UTC (Thu) by paulj (subscriber, #341) [Link] (8 responses)

It doesn't matter if the ordered index gets changed, even significantly, if the order it is indexing remains stable. I just wonder if there's some better way to manufacture a stable order from directory entries, rather than hash the name?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 9:46 UTC (Thu) by dlang (guest, #313) [Link] (7 responses)

The filesystem authors have been looking for a good way to deal with this ordering problem for years. If there was a simple answer I would have expected that _someone_ would have found it and everyone else would copy them.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 10:25 UTC (Thu) by paulj (subscriber, #341) [Link]

Why wouldn't a ring work? (see other comment too).

Widening ext4's readdir() cookie

Posted Mar 28, 2013 11:59 UTC (Thu) by neilbrown (subscriber, #359) [Link] (5 responses)

Do any filesystems other than ext3/4 have this problem?

Some address it by having two indexes - one by name and one by cookie. It is more expensive, but is actually reliable (rather than only being stochastically reliable).

My preference is internal-chaining. This adds a little more book-keeping for the case when hashes collide, but I believe that is preferable to strange errors when hashes collide. And they really don't collide often even with 32bit hashes.

Widening ext4's readdir() cookie

Posted Mar 28, 2013 12:17 UTC (Thu) by paulj (subscriber, #341) [Link] (3 responses)

I was going to post again to speculate whether or not the root of this problem was a desire to avoid keeping any index other than the name-based one! :)

A simple index over a finite, ordered ID space, with 2 operations "get unused ID" and "get next used ID in the order" would solve this problem trivially, it really seems to me. ("get unused ID" could be done efficiently in a number of ways, e.g. picking a random ID in a ringed space, then using the "next unused" operation'd probably work well in a sparsely used space).

Trying to bodge the properties of an order in a finite space out of the properties of an index of (effectively) infinite IDs seems doomed to failure and/or strange corner-cases.

Hacking the internal structure of the index also seems potentially fragile and needlessly complex - though I don't know exactly what you mean by internal chaining. It sounds possibly like adding indices inside the index. In which case, why not just keep a separate, simpler index?

Widening ext4's readdir() cookie

Posted Mar 28, 2013 14:14 UTC (Thu) by etienne (guest, #25256) [Link] (1 responses)

> In which case, why not just keep a separate, simpler index?

There is an obvious index, which is the index of the inode in the directory as written to the hard disk sectors - but any such index is worse than cookies, because if you are currently at file 4 (out of 10) and someone removes file number 2, then when you ask for file 5 you will skip a file...

Widening ext4's readdir() cookie

Posted Mar 28, 2013 14:27 UTC (Thu) by paulj (subscriber, #341) [Link]

Because you've changed the ID of file 5, because you made its ID dependent on an ephemeral order.

If you make the IDs stable, so they never change so long as the entry exists, then the problem is gone. The IDs are stable, and so the ordering between them is stable. You can do this through an additional index (in the data-structure sense).

The "pain" seems to be that the ext3+ devs are trying to achieve this stable order by hacking around in the details an existing index of an ID that does NOT have an absolute order (the hashes of names). That wouldn't be necessary at all if they'd just add an additional stable, absolutely ordered, finite ID with its own index. The namespace-hash and the finite-ID could each have their own, simple index - no nasty hacks needed.

(It seems to me, but I could be dreadfully wrong, or perhaps there are some critical details missing from the reporting ;) ).

Widening ext4's readdir() cookie

Posted Mar 29, 2013 2:56 UTC (Fri) by neilbrown (subscriber, #359) [Link]

"jfs" has a fairly simple second index. Each directory entry contains a stable index number, and there is a table indexed by this number that leads to the name in the btree. Look for "modify_index" in jfs_dtree.c and explore uses of "struct dir_table_slot".

Some problems with this are:
- readdir will not return names in natural b-tree order, so you will get lots of seeking. Not a problem when the whole directory fits easily in cache, may be a problem for v.large directories.
- every B-tree modification (page-split, page-delete) requires updating random entries in the index table.

i.e. there is a genuine IO cost here.

Having readdir return entries in natural B-Tree order is clearly appealing, and a separate table doesn't really allow this.

By "Internal chaining" I mean that if a particular hash value is in use, you add 1 and try again. When you find a hash value to use, you store it (or the difference between the hash of the name and the chosen value, which can probably be stored with fewer bits) with the name. When you delete entries you need to be careful to leave place-holders if any subsequent entries are part of an internal chain leading from here.

In the common case of no collisions there is zero IO overhead (whereas with the jfs approach there is always IO overhead). In the rarer case where there are collisions, there is an overhead that probably gets substantially worse as the chain length increases. As long as any internal chain is within one block there is a processing overhead but no IO overhead. Once it crosses a block boundary you get a real overhead, but probably comparable with the JFS overhead. Once you have a chain touching three blocks it probably gets a lot worse. But that should be extremely unlikely.

So it seems like a solution with excellent common-case performance, which degrades only in rare circumstances. Worst case performance (where all names hash to the same value) is probably not much worse than a linear search.
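
A toy sketch of the internal-chaining idea as described above (illustration only, not any filesystem's real code): when a name's natural hash is already in use as a cookie, probe upward to the next free value, and store the probe distance alongside the entry. The 31-bit hash itself is assumed to be computed elsewhere:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define DIR_MAX 1024

    struct entry {
        const char *name;
        uint32_t    cookie;     /* hash(name) + probe distance */
        uint16_t    probe;      /* stored with the entry, usually 0 */
    };

    struct toy_dir {
        struct entry e[DIR_MAX];
        size_t       nr;
    };

    static bool cookie_in_use(const struct toy_dir *d, uint32_t c)
    {
        for (size_t i = 0; i < d->nr; i++)
            if (d->e[i].cookie == c)
                return true;
        return false;
    }

    /* hash31 is the name's 31-bit hash, computed by the caller. */
    static void insert(struct toy_dir *d, const char *name, uint32_t hash31)
    {
        uint32_t c = hash31;
        while (cookie_in_use(d, c))
            c++;                        /* chain: try the next value */
        d->e[d->nr++] = (struct entry){ name, c, (uint16_t)(c - hash31) };
    }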

Widening ext4's readdir() cookie

Posted Mar 28, 2013 13:42 UTC (Thu) by bfields (subscriber, #19510) [Link]

Do any filesystems other than ext3/4 have this problem?

I haven't heard of any.

And, yes, I suspect there are any number of solutions. Just pick one of them, turn it into a patch that's bulletproof and backwards compatible, and we're done.

telldir() and seekdir()

Posted Mar 28, 2013 10:43 UTC (Thu) by meuh (guest, #22042) [Link]

Surprisingly, telldir() and seekdir() use a long int (instead of an off_t) as the location (offset) value. So, from a user-space point of view, going from 32 to 64 bits to reduce the collision problem seen in NFS seems rather strange.
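
For reference, the POSIX prototypes in question:

    #include <dirent.h>

    long telldir(DIR *dirp);
    void seekdir(DIR *dirp, long loc);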

Why am I seeing a security exploit?

Posted Mar 29, 2013 4:10 UTC (Fri) by jmorris42 (guest, #2203) [Link] (4 responses)

The details of hashes are known, because the kernel source is public.

So why can't someone precompute carefully constructed directory entries that will cause a collision, even with 63 bit hashes? Once you have those, drop them somewhere and wait for the infinite loop. I do believe that is called 'denial of service'. Is there a good reason these puppies can't be injected in all sorts of places? How many Linux mail servers unpack zipped attachments to perform malware detection? Is everyone CERTAIN none will ever traverse the unpacked directory in a way that will trigger this?

No, this is bogus. Designing a filesystem that will work most of the time isn't any way to write code for production use.

Why am I seeing a security exploit?

Posted Mar 29, 2013 13:49 UTC (Fri) by bfields (subscriber, #19510) [Link] (3 responses)

So why can't someone precompute carefully constructed directory entries that will cause a collision, even with 63 bit hashes?

Looks like the hash algorithm takes a seed, which is per-superblock, generated at mkfs time, and should be unknown to an unprivileged user; grep for s_hash_seed in kernel and e2fsprogs source.

Why am I seeing a security exploit?

Posted Mar 29, 2013 16:37 UTC (Fri) by jmorris42 (guest, #2203) [Link] (2 responses)

Ok, at least they were on the ball. Still just doesn't seem right to design a filesystem that only works almost all the time. Computers are generally expected to be more deterministic than that.

Why am I seeing a security exploit?

Posted Apr 3, 2013 20:26 UTC (Wed) by bronson (guest, #4806) [Link] (1 responses)

The odds of a random 64-bit collision are worse than 1 in 10^18. Since you'll win the lottery thousands of times before that happens, you'll be too rich to care about two gobbledygook filenames colliding. (But then, you've probably also been fried by lightning and crushed twice by the International Space Station...)

More seriously, there are other sources of error in your computer that are far more worthy of your attention: http://en.wikipedia.org/wiki/Soft_error

It's true that all bets are off if an attacker can break the hash. But, if/when that happens, the fix will probably be a straightforward kernel patch.

Why am I seeing a security exploit?

Posted Apr 4, 2013 16:20 UTC (Thu) by jimparis (guest, #38647) [Link]

The odds of two hashes colliding may be 1 in 10^18, but this grows with the number of files. With 60,000 files it's 1 in 10^10. Make up a use case where you're frequently creating or recreating directories with that many files, and it's not far-fetched to expect someone to run into a collision pretty soon.

(numbers from http://preshing.com/20110504/hash-collision-probabilities)
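
(The same birthday approximation quoted earlier in the article bears this out: with H = 2^63 and n = 60,000,

    p \approx \frac{n(n-1)}{2H} \approx \frac{1.8\times10^{9}}{9.2\times10^{18}} \approx 2\times10^{-10}

i.e. on the order of 1 in 10^10.)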

Widening ext4's readdir() cookie

Posted Mar 29, 2013 11:32 UTC (Fri) by magnus (subscriber, #34778) [Link] (1 responses)

If a new opendir/readdir option was introduced that allowed returning the same directory entry twice, would that be easier to implement? Then you could go back to the earliest known node you have already passed if the tree gets rebalanced, which at least feels to me like a simpler problem to solve. And for extreme cases like an NFS server restart you could go back to the start of the dir.

Widening ext4's readdir() cookie

Posted Mar 29, 2013 13:54 UTC (Fri) by bfields (subscriber, #19510) [Link]

Off the top of my head one objection is that there's no guarantee of forward progress for readers.

Widening ext4's readdir() cookie

Posted Mar 30, 2013 5:40 UTC (Sat) by Mog (subscriber, #29529) [Link]

> a directory with 32,768 entries (the historical limit in ext2)

Is that correct? I remember a limitation on the number of subdirs due to the nlink field in the stat structure being a short, but not on the total number of entries (subdirs+files+...).

Widening ext4's readdir() cookie

Posted Apr 4, 2013 19:40 UTC (Thu) by aakef (subscriber, #38030) [Link]

> adding such a mapping to ext4 (which does not otherwise need it) would be a large job.

I think "which does not otherwise need it" is not absolutely correct. Ext4s HTREE/dirindex readdir has another issue. Entries are not returned in inode, but in hash order and for example for 'ls -l' (readdir + stat) or 'rm -rf dir' that causes lots of random disk access. Recent core utils try to eliminate this by sorting results returned by readdir(), but that logic is not used by everything. Depending on the implementation "such a mapping" from above would also eliminate the random access problem at all.

Bernd


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds