Avoiding a read-only filesystem on errors
Errors in the storage subsystem can happen due to a variety of known or unknown reasons. A filesystem turns read-only when it encounters errors in the storage subsystem, or a code path which the filesystem code base should not have taken (i.e. a BUG() path). Making the filesystem read-only is a safeguard feature that filesystems implement to avoid further damage because of the errors encountered. Filesystems that change to read-only end up stalling services relying on data writes and, in some cases, may lead to an unresponsive system. In embedded devices, dying applications due to a filesystem turning read-only may render the device useless, leaving the user confused.
Filesystem errors can either be recoverable or non-recoverable. Errors from bad block links in the inode data structures or block pointers can be recovered by filesystem checks. On the other hand, errors due to memory corruption, or a programming error, might not be recoverable because one cannot be sure which data is correct.
Denis Karpov proposed a patchset that would notify user space, so that a user-space policy can be defined to take the appropriate action to avoid turning the filesystem read-only. The patchset is currently limited to FAT filesystems. User-space notifications would allow a filesystem to continue to be used after encountering "correctable errors" if proactive measures are taken to correct them. Such corrective actions can obviate the need for lengthy filesystem checks when the device is mounted next.
The patchset adds a /sys/fs/fat/<volume>/fs_fault file which is initialized to 0 when the filesystem is mounted. The value of fs_fault changes to 1 on error. A user-space program can poll() on this file to check if the value of the file changed which is an indication that an error has occurred. Besides the file polling, a KOBJ_CHANGE uevent is generated, with the uevents environment variable FS_UNCLEAN set to 0 or 1. A udev rule can then trigger the correct program to take care of the damage done by the error.
Kay Sievers is not convinced with the idea of using uevents for user-space notifications, as uevents are designed for device notifications, so they do not fit the design goals of reporting filesystem errors. Filesystem errors are quite often a repeated series of events. For example, a read failure may result in printing multiple read errors in dmesg for each block it is not able to read. An event generated for each block may be too much for udev to handle. Some of the events may get lost, or worse still, may cause udev to ignore uevents from other devices which occurred during the burst of errors.
They could be used to get attention when a superblock does a one-time transition from "clean" to "error", everything else would just get us into serious trouble later.
Keeping <volume>/fs_fault in sysfs is also not the best solution, because sysfs is unaware of filesystem namespaces. The primary responsibility of sysfs is to expose core kernel objects. Filesystem namespaces are a set of filesystem mounts that are only visible to a particular process and may be invisible to the rest of the processes.
A process with a private namespace contains a copy of the namespace instead of a shared namespace. When the system starts, it contains one global namespace which is shared among all processes. Mounts and unmounts in a shared namespace are seen by all the processes in the namespace. A process creates a private namespace if it was created by the clone() system call using the CLONE_NEWNS flag (clone New NameSpace). A process sharing a public namespace can also create a private namespace by calling unshare() with CLONE_NEWNS flag. Mounts and unmounts within a private namespace are only seen by the process that created the namespace. A child process created by fork() shares its parent's namespace.
Because of this, processes might receive errors on a filesystem in a different namespace, so they would not know which device to act on. The problem is also noticeable with processes accessing bind mounts created in a different namespace (bind mounts are a feature in which a sub-tree of a filesystem can be mounted on another directory). Moreover, filesystems spanning multiple devices, such as btrfs, would not be able to report the device name under the current naming structure.
Kay recommends /proc/self/mountinfo as a better alternative, because it contains the list of mount points in the namespace of the process with the specified PID (self). Currently, /proc/self/mountinfo changes when the mount tree changes. This can be extended to propagate errors to user space in the correct namespace using poll(), regardless of the device name. /proc/self/mountinfo can accommodate optional fields which hold values in the form of "tag[:value]" that can be used to communicate the nature of the problem. Instead of using the existing udev infrastructure, this would require a dedicated application to monitor /proc/self/mountinfo, identify the error by parsing the argument, and act accordingly.
Jan Kara further suggests using /proc/self/mountinfo as a link to identify the filesystem device generating the errors:
Despite these objections, everyone agrees that error reporting to
user space must not be limited to dmesg messages. User space
should be notified of the errors reported by the filesystem, so that it can
proactively handle errors and try to correct them. The namespace-unaware
/sys filesystem or notifications through uevent may not be the best
solution, but, for a lack of a better alternative interface,
the developers used sysfs and uevents. The design is still under
discussion, and will take some time to evolve, though it is likely
that some kind of solution to this problem will make its way into the kernel.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems |
| GuestArticles | Rodrigues, Goldwyn |