Messiness in removing directories
Messiness in removing directories
Posted Jun 14, 2018 4:56 UTC (Thu) by viro (subscriber, #7872)Parent article: Messiness in removing directories
* pre-2.1 - no rmdir if i_count of victim inode is greater than 1 (i.e. something else holds references). Checked in ->rmdir() instances.
* 2.1.40-ish - introduction of dcache, check mutates into checking d_count of dentry (since now dentry is what somebody else will be holding references to). False positives if negative lookups had been hashed.
* shortly after that - shrink_dcache_parent() calls added in ->rmdir() instances as part of the refcount check - basically, if we see more than one reference, try to call shrink_dcache_parent() and see if dropping those hashed negative lookups will help.
Still racy, though - hash lookup coming right after we'd checked refcount and decided that nobody else had references and somebody else got just that.
* somewhere during 2.1 - ext2 gets rid of those checks by making sure that nobody will be able to add anything to directory once it's been deleted (i_size trick)
* 2.1.120-ish - back-and-forth between me and Linus, ending up with closing the race in busy checks; _before_ calling ->rmdir() we try to do shrink_dcache_parent(), unhashing the dentry if we ended up with the only remaining reference. That way nobody could get extra references by hash lookups and checks could be turned into "is it still hashed?". A lot of code duplication has disappeared. The cost: now those checks (and shrink_dcache_parent()) had to happen before we'd checked if directory had been empty.
* 2.4-something - ext2 trick generalized and taken to VFS level. "Is it busy" checks are becoming needless, some go away.
* considerably later (3.7 or so) - the "unhash if no other references are there" race prevention (from circa 2.1.120) is taken down into individual ->rmdir() instances and then killed off there on per-filesystem basis, along with the "is it busy" checks
* shortly after (same merge window, I think) somebody (Dave Chinner?) spots a problem - we *do* want that shrink_dcache_parent() for reasons other than (now gone) busy checks. Namely, by the time we call d_delete() (after filesystem reports successful rmdir) extra references mean the difference between "we remember that foo has been removed, so lookups should come negative" and "foo, what foo? I don't remember, go check the filesystem". shrink_dcache_parent() brought back, unfortunately *still* into the place before ->rmdir() call. Which is pointless (we only need it in case of success) and potentially quite costly - it means that one can trigger shrink_dcache_parent() attempt on an arbitrary large dentry tree; worse, other such attempts can happen to subtrees of that tree at the same time.
* this spring: syzbot starts hitting livelock reproducers based upon such duelling shrink_dcache_parent(), at which point I went looking at the callers
* shortly after: "WTF do we call it on non-empty trees and what could be done?", followed by moderately painful exercise in software coproarchaeology (the above is basically the results of that dig)
* this merge window: commit 8767712f26d1 "rmdir(),rename(): do shrink_dcache_parent() only on success" goes in, dealing with that mess.
Unfortunately, that wasn't the only bogosity found on that dig - another one had been an unpleasant consequence of very ugly userland API in configfs and its ilk - rmdir() there is specified to succeed on some non-empty directories. Which is somewhat unexpected, to put it mildly, so the logics in vfs_rmdir() dealing with dissolving mounts somebody might have on the victim dentry (e.g. in another namespace) doesn't cope with it. It *does* dissolve any mounts on the victim directory, but what of something mounted on its children? What children - rmdir would've failed if there had been any... Alas, with configfs it could succeed in such case ;-/
That, BTW, was the place where containers got brought in - somebody asked why do we allow rmdir when something is mounted there, with "think of the situation when the same fs is mounted in another container and something's mounted on that directory in there; disallowing rmdir in such situation leads to an inter-container DoS, so container folks asked for rmdir to be allowed in such cases" in response to that. Tangential to everything else...
While one could argue that logics in vfs_rmdir() should've coped with the possibility of something mounted on children in non-empty directory with filesystem that allows removal of that directory, no amount of stretching will extend that to write(2), would it? sys_write()/vfs_write()/etc. can't be expected to know that writing "-1\n" to /proc/sys/fs/binfmt_misc/status happens to remove almost all files from /proc/sys/fs/binfmt_misc/ so any mounts (of mount --bind variety) on those would need to be dissolved. It's obviously responsibility of the code in fs/binfmt_misc.c that does remove those files.
Not every place where the kernel removes files on a synthetic filesystem has that problem (e.g. anything that has ->d_revalidate() doesn't need to do anything special), but unfortunately there are places that do have that problem or its close analogues. Fixes for the coming cycle - and -stable fodder, obviously.
The last bit had been about the things that are *not* forbidden to do after rmdir() - e.g. if you have a directory (or a file) opened, then remove it, you still can fstat() it. Or fchmod(). Or ioctl(), etc. Looking for potential problems in those when we have an opened-but-removed directory might be a good idea - almost all filesystems don't have any problems in those, but... it's a corner case easily forgotten about and that's the stuff of which surprising bugs are made.
Example (also fixed this window): ext2 logics for freeing the data blocks of unlinked file shared the code with truncate ("free them all" == "truncate everything", after all). truncate() checks for file being marked append-only (or immutable) and refuses to free anything on those. Again, reasonable, and not a problem for sharing that code with "evict the unlinked inode" path - unlink() checks if the file is append-only or immutable and flat-out refuses to remove those. So those checks are harmless - they won't trigger in that case. Except that... append-only and immutable are set via ioctl(2). Which can be done *after* the unlink, with all its checks. It was easily fixed once noticed, but the point is, nobody had thought of that case. Moreover, I have run into exact same issue on UFS a couple of years ago, fixed it *and* hadn't realized that modification I'd done (move the check a bit upstream on the truncate(2) path) hadn't been just a cleanup/microoptimization. I bloody well remember going "OK, it's an equivalent transformation - we reach that from ufs_evict_inode() as well, but we can't get there with immutable/append-only, so nothing changes there". Completely missing the way it _could_ get there with such and not realizing that the change has quietly fixed a bug. Had been rather embarrassing to realize that later when looking in ext2_ioctl() for potentially interesting things that could be done to removed objects...