Messiness in removing directories

Posted Jun 14, 2018 4:56 UTC (Thu) by viro (subscriber, #7872)
Parent article: Messiness in removing directories

Umm... Some corrections/clarifications. Timeline:

* pre-2.1 - no rmdir if i_count of victim inode is greater than 1 (i.e. something else holds references). Checked in ->rmdir() instances.
* 2.1.40-ish - introduction of dcache, check mutates into checking d_count of dentry (since now dentry is what somebody else will be holding references to). False positives if negative lookups had been hashed.
* shortly after that - shrink_dcache_parent() calls added in ->rmdir() instances as part of the refcount check - basically, if we see more than one reference, try to call shrink_dcache_parent() and see if dropping those hashed negative lookups will help.
Still racy, though - hash lookup coming right after we'd checked refcount and decided that nobody else had references and somebody else got just that.
* somewhere during 2.1 - ext2 gets rid of those checks by making sure that nobody will be able to add anything to directory once it's been deleted (i_size trick)
* 2.1.120-ish - back-and-forth between me and Linus, ending up with closing the race in busy checks; _before_ calling ->rmdir() we try to do shrink_dcache_parent(), unhashing the dentry if we ended up with the only remaining reference. That way nobody could get extra references by hash lookups and checks could be turned into "is it still hashed?". A lot of code duplication has disappeared. The cost: now those checks (and shrink_dcache_parent()) had to happen before we'd checked if directory had been empty.
* 2.4-something - ext2 trick generalized and taken to VFS level. "Is it busy" checks are becoming needless, some go away.
* considerably later (3.7 or so) - the "unhash if no other references are there" race prevention (from circa 2.1.120) is taken down into individual ->rmdir() instances and then killed off there on per-filesystem basis, along with the "is it busy" checks
* shortly after (same merge window, I think) somebody (Dave Chinner?) spots a problem - we *do* want that shrink_dcache_parent() for reasons other than (now gone) busy checks. Namely, by the time we call d_delete() (after filesystem reports successful rmdir) extra references mean the difference between "we remember that foo has been removed, so lookups should come negative" and "foo, what foo? I don't remember, go check the filesystem". shrink_dcache_parent() brought back, unfortunately *still* into the place before ->rmdir() call. Which is pointless (we only need it in case of success) and potentially quite costly - it means that one can trigger shrink_dcache_parent() attempt on an arbitrary large dentry tree; worse, other such attempts can happen to subtrees of that tree at the same time.
* this spring: syzbot starts hitting livelock reproducers based upon such duelling shrink_dcache_parent(), at which point I went looking at the callers
* shortly after: "WTF do we call it on non-empty trees and what could be done?", followed by moderately painful exercise in software coproarchaeology (the above is basically the results of that dig)
* this merge window: commit 8767712f26d1 "rmdir(),rename(): do shrink_dcache_parent() only on success" goes in, dealing with that mess.

Unfortunately, that wasn't the only bogosity found on that dig - another one had been an unpleasant consequence of very ugly userland API in configfs and its ilk - rmdir() there is specified to succeed on some non-empty directories. Which is somewhat unexpected, to put it mildly, so the logics in vfs_rmdir() dealing with dissolving mounts somebody might have on the victim dentry (e.g. in another namespace) doesn't cope with it. It *does* dissolve any mounts on the victim directory, but what of something mounted on its children? What children - rmdir would've failed if there had been any... Alas, with configfs it could succeed in such case ;-/

That, BTW, was the place where containers got brought in - somebody asked why do we allow rmdir when something is mounted there, with "think of the situation when the same fs is mounted in another container and something's mounted on that directory in there; disallowing rmdir in such situation leads to an inter-container DoS, so container folks asked for rmdir to be allowed in such cases" in response to that. Tangential to everything else...

While one could argue that logics in vfs_rmdir() should've coped with the possibility of something mounted on children in non-empty directory with filesystem that allows removal of that directory, no amount of stretching will extend that to write(2), would it? sys_write()/vfs_write()/etc. can't be expected to know that writing "-1\n" to /proc/sys/fs/binfmt_misc/status happens to remove almost all files from /proc/sys/fs/binfmt_misc/ so any mounts (of mount --bind variety) on those would need to be dissolved. It's obviously responsibility of the code in fs/binfmt_misc.c that does remove those files.

Not every place where the kernel removes files on a synthetic filesystem has that problem (e.g. anything that has ->d_revalidate() doesn't need to do anything special), but unfortunately there are places that do have that problem or its close analogues. Fixes for the coming cycle - and -stable fodder, obviously.

The last bit had been about the things that are *not* forbidden to do after rmdir() - e.g. if you have a directory (or a file) opened, then remove it, you still can fstat() it. Or fchmod(). Or ioctl(), etc. Looking for potential problems in those when we have an opened-but-removed directory might be a good idea - almost all filesystems don't have any problems in those, but... it's a corner case easily forgotten about and that's the stuff of which surprising bugs are made.

Example (also fixed this window): ext2 logics for freeing the data blocks of unlinked file shared the code with truncate ("free them all" == "truncate everything", after all). truncate() checks for file being marked append-only (or immutable) and refuses to free anything on those. Again, reasonable, and not a problem for sharing that code with "evict the unlinked inode" path - unlink() checks if the file is append-only or immutable and flat-out refuses to remove those. So those checks are harmless - they won't trigger in that case. Except that... append-only and immutable are set via ioctl(2). Which can be done *after* the unlink, with all its checks. It was easily fixed once noticed, but the point is, nobody had thought of that case. Moreover, I have run into exact same issue on UFS a couple of years ago, fixed it *and* hadn't realized that modification I'd done (move the check a bit upstream on the truncate(2) path) hadn't been just a cleanup/microoptimization. I bloody well remember going "OK, it's an equivalent transformation - we reach that from ufs_evict_inode() as well, but we can't get there with immutable/append-only, so nothing changes there". Completely missing the way it _could_ get there with such and not realizing that the change has quietly fixed a bug. Had been rather embarrassing to realize that later when looking in ext2_ioctl() for potentially interesting things that could be done to removed objects...

to post comments

Messiness in removing directories

Posted Jun 14, 2018 10:08 UTC (Thu) by matthias (subscriber, #94967) [Link] (1 responses)

This is probably, what I love most about LWN: If some article is not yet perfect, the expert of the topic shows up and gives all needed details, some corrections, a bunch of background information, or all of the above.

Thanks a lot, Al.

Messiness in removing directories

Posted Jun 14, 2018 10:25 UTC (Thu) by Kamilion (subscriber, #42576) [Link]

Agreed; even on something as "simple" as rmdir().

And on 4.18, btrfs will now dissolve empty subvolumes with a rmdir() call, no need to have the 'btrfs' tool.
https://lwn.net/Articles/756898/
This is a win for embedded systems using btrfs seed filesystems and minimal roots.

Messiness in removing directories

Posted Jun 14, 2018 12:51 UTC (Thu) by doogie (guest, #2445) [Link] (2 responses)

This is where I would insert a eye-heart emoji, but I don't know if LWN would cope with that. The layers of the onion that had to be peeled back to figure all this out is just amazing to see.

I agree with others that this story and this comment are why I pay for access.

Messiness in removing directories

Posted Jun 15, 2018 12:15 UTC (Fri) by dwmw2 (subscriber, #2063) [Link]

> This is where I would insert a eye-heart emoji, but I don't know if LWN would cope with that.

👁 ♥ ?

Messiness in removing directories

Posted Jun 15, 2018 14:02 UTC (Fri) by farnz (subscriber, #17727) [Link]

LWN's Unicode support is pretty decent, and 😍 seems to work just fine.

Messiness in removing directories

Posted Jun 19, 2018 20:44 UTC (Tue) by normanshulman (guest, #93249) [Link]

Does any of this explain the following?

currentGentooOnly64Bit50Prod ~ # ls /usr/src
linux linux-4.9.74-grsecurity linux-4.9.95-gentoo
currentGentooOnly64Bit50Prod ~ # rm -rf /usr/src/linux-4.9.95-gentoo
currentGentooOnly64Bit50Prod ~ # ls /usr/src
currentGentooOnly64Bit50Prod ~ # ls /usr/src
currentGentooOnly64Bit50Prod ~ # ls /usr/src
currentGentooOnly64Bit50Prod ~ # ls /usr/src
linux linux-4.9.74-grsecurity