Kernel development
Brief items
Kernel release status
The current development kernel is 4.11-rc5, released on April 2. Linus said: "Ok, things have definitely started to calm down, let's hope it stays this way and it wasn't just a fluke this week."
The April 2 regression report shows 13 known problems.
Stable updates: 4.10.7, 4.9.19, and 4.4.58 were released on March 30, followed immediately by 4.10.8, 4.9.20, and 4.4.59 on March 31.
Linux Kernel Podcast for 2017/04/04
The April 4 kernel podcast is out. "Linus Torvalds announces Linux 4.11-rc5, Donald Drumpf drains the maintainer swamp in April, Intel FPGA Device Drivers, FPU state cacheing, /dev/mem access crashing machines, and assorted ongoing development."
Kernel development news
Handling writeback errors
Error handling during writeback is something of a mess in Linux these days, Jeff Layton said in his plenary session to open the second day of the 2017 Linux Storage, Filesystem, and Memory-Management Summit. He has investigated the situation and wanted to discuss it with attendees; he also presented a proposal for a way to make things better. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem.
The process started when he was looking into ENOSPC (out-of-space) errors for the Ceph filesystem. He found that the PG_error page flag, which is set when a problem occurs during writeback, would override any other error status and cause EIO (I/O error) to be returned. That inspired him to start looking at other filesystems to see what they did for this case and he found that there was no consistency among them. Dmitry Monakhov thought that writeback time was too late to get an ENOSPC error, but Layton said there are other ways to get it at that time.
If there is an error during the writeback process, it should be returned when user space calls close() or fsync(), Layton said. The errors are tracked using the AS_EIO and AS_ENOSPC flags in the flags field of struct address_space. The errors are also sometimes tracked at the page level, using PG_error in the page flags.
A stray sync operation can clear the error flag without the error getting reported to user space, Layton said. PG_error is also used to track and report read errors, so a mixed read-write pattern can cause the flag to be lost before getting reported. There is also a question of what to do with the page after there is a writeback error for it; right now, the page is left in the page cache marked clean and up-to-date, so user space does not know there is a problem.
So, Layton asked, "how do we fix this mess?" James Bottomley asked what granularity the filesystems want for error tracking. Layton said that address_space was the right level to track the information. Jan Kara pointed out that PG_error was meant to be used for read errors, but Layton said that some writeback implementations use it.
Layton suggested cleaning up the tree to remove the use of PG_error for writeback errors. That would entail taking a pass through the tree to clean that up and to see if the errors on writeback are propagating out to user space or whether they may be getting cleared incorrectly. Ted Ts'o said there may be a need for a way to do writeback without getting any errors because they cannot be returned to user space.
Bottomley said that he would rather not mark pages clean if they have not been written to disk. The error information would need to be tracked by sector, he said, so that the block layer can tell the filesystem where in a BIO (the structure used to describe block I/O requests) the error happened. Chris Mason suggested that anyone working on "redoing the radix tree" (i.e. Matthew Wilcox) might want to provide a way to store an error for a specific spot in the file. That way, the error could be reported then cleared once that reporting is done.
Layton then presented an idea he had for tracking the errors. Writeback error counter and writeback error code fields would be added to the address_space structure and a writeback error counter would be added to struct file. When a writeback error occurs, the counter in address_space would be incremented and the error code recorded. At fsync() or close(), that error would be reported, but only if the counter in the file structure is less than that in address_space. In that case, the address_space counter value would be copied to the file structure.
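To make the scheme concrete, here is a minimal model of the proposed counters; the structure and field names are invented for illustration and are not actual kernel code:

/*
 * Illustrative model of Layton's proposal. In the kernel, the counter
 * and error code would live in struct address_space, and the "seen"
 * counter in struct file; these names are invented for the sketch.
 */
struct mapping_err {
    unsigned long wb_err_count;   /* bumped on each writeback error */
    int           wb_err_code;    /* most recent error (e.g. EIO) */
};

struct file_err {
    unsigned long wb_err_seen;    /* mapping counter at last report */
};

/* Writeback path: record a failure against the mapping. */
static void record_wb_error(struct mapping_err *m, int err)
{
    m->wb_err_count++;
    m->wb_err_code = err;
}

/* fsync()/close() path: report an error only if one has occurred
 * since this file last checked, then mark it as seen. */
static int check_wb_error(struct file_err *f, struct mapping_err *m)
{
    if (f->wb_err_seen < m->wb_err_count) {
        f->wb_err_seen = m->wb_err_count;
        return m->wb_err_code;
    }
    return 0;
}

Because each open file compares its own counter against the mapping's, every file descriptor sees a failure at most once, and an error that occurs between two fsync() calls cannot be silently lost.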
Monakhov asked why a counter was needed; Layton said it was to handle multiple overlapping writebacks. Effectively, the counter would record whether a writeback had failed since the file was opened or since the last fsync(). Ts'o said that should be fine; for most applications, knowing that an error occurred somewhere in the file is all that is necessary, and applications that require better granularity already use O_DIRECT.
Layton's approach will result in some false positives, but there will be no false negatives, Ts'o said. Guaranteeing anything more would be much more expensive; avoiding the false positives would be a great goal, but the proposed method is simpler and has minimal performance impact. Layton said that it would definitely improve the situation for user space.
Ts'o noted that there are some parallels in the handling of ENOMEM (out of memory). If that error gets returned deep in the guts of writeback, many of the same problems that Layton described occur. So, some filesystems desperately try not to return ENOMEM but, to avoid that, they have to hold locks, which gets in the way of the out-of-memory (OOM) killer. Ts'o has patches to defer writeback and leave pages in a dirty state rather than acquiring a set of locks and preventing OOM kills.
But Layton thinks the first step should be to get better generic error handling for writeback. Filesystems avoid using the current error handling because it doesn't work well. After that, better handling of temporary errors (e.g. ENOMEM, EAGAIN, and EINTR) could be added. Another thing that should be addressed down the road is to have more killable or interruptible calls when doing writeback; that would help with NFS blocking indefinitely when the server goes away, for example.
One other question Layton had is what should be done with the page when writeback fails. Right now, it is left clean, which is something of a surprise. If there is a hard error that will never be retried, should the page be invalidated? Ts'o said that they can't be left dirty in that case because the system will run out of memory if there are lots of writeback errors. If the radix tree could track the information, though, some advanced filesystem might try writing the page somewhere else.
There are four bytes per 64 pages available in the radix tree, Wilcox said, if that would be useful for this tracking. Mason said that he would rather have one or two more tags available, but Wilcox said that could only happen with a 17% increase in the size of the tree—in which case, there would be lots more tags available. Jan Kara suggested storing the error information somewhere outside of the radix tree. The only applications that would benefit if filesystems could do something smarter are databases (many of which already use O_DIRECT) and large loopback files, Mason said.
Steve French thought that the error handling should be done at a higher level, but Layton said that is how we got to where things are now: a bad API that is unclear how to use. He is going to try to fix that, he said, and developers should watch for his patches.
An update on storage standards
In a second-day plenary session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Fred Knight updated the attendees on what has happened in the storage standards world over the last year. While the transports (e.g. Fibre Channel, Ethernet) and the SCSI protocol have not seen a ton of changes over the last year, the NVM Express (NVMe) standards have had a lot of action.
On the transport standards side, Fibre Channel has some new speeds available (32Gb/128Gb) and some in the works (64Gb/256Gb). Terabit Fibre Channel is on the roadmap; Knight said "it will be interesting to see if they can reach it". Ethernet has added new speeds (from 2.5Gb up through terabit) as well as new markets (e.g. automotive). NVMe has a new command set and multiple connectivity options (e.g. PCIe, RDMA over Ethernet, InfiniBand).
SCSI
For SCSI, a simplified binding status command has been added (TEST BIND). The WRITE ATOMIC command has been added; it will allow writes that either write all of the data or none of it. In addition, a 32-byte variant of WRITE SCATTER has been added. The combination of atomic with scatter writes, which was discussed at LSFMM 2016, is not likely to make an appearance, Knight said. Some storage companies are objecting to it; they have figured out that Linux gets this right and they might get it wrong, so they want to leave it alone, he said.
The WRITE STREAM command for stream IDs has been added to the standard. In addition, the BACKGROUND CONTROL command has been added to allow some control over the background tasks that a storage device may be doing (e.g. garbage collection). That background processing can impact I/O at times, so users have wanted some ability to change the parameters governing it.
Some predefined feature sets (e.g. "2010 Basic", "2016 Basic", "Maintenance", "Provisioning") have been established that storage devices can advertise and hosts can check to see what features are available. New feature sets will be added over time, Knight said, so it will be an ongoing process. Ted Ts'o asked if devices can advertise some feature set but implement additional features beyond those in the set; Knight said they could.
The biggest thing in the SCSI world over the last year has been drive depopulation, Knight said. The way that drives fail is different than what was expected, so it is often just a part of the drive that goes bad. If you have an 8TB drive where one head goes bad, you can just disable that portion of the drive and turn it into a 6TB drive, for example.
Repurposing support is available for both SCSI and ATA. The drive will be available at the reduced capacity; the data that was there will either no longer be available or will be at a different logical block address (LBA) than before. The GET PHYSICAL ELEMENT STATUS command can be used to determine which element is failing and what capacity it has. The REMOVE ELEMENT AND TRUNCATE command can then be used to disable the failing piece. The drive can be reformatted; if another element fails, the process can be repeated.
There has been longtime support in the standard for reading the whole drive to try to recover the data that is still there. There are commands that will provide information that the next N LBAs are bad, so they can be skipped. That will allow the host to recover as much data as it can after an element failure.
There was some talk last year about a "data preserving" mode for depopulation. It is complex, however, so it will take a lot longer to be added, if it is added at all. There are "a bunch of people" asking whether there really is a need for it, so it could "end up on the cutting room floor", Knight said.
NVMe
It has been a busy year for the NVMe group, he said. The formal specifications for the fabric were published. The "sanitize" command has been formalized; it can be used to clean the drive before it is repurposed. Other mechanisms, such as crypto erase and block erase, will ruin SSDs, so those devices do not implement them.
A device crash-dump facility, called "telemetry", has been added to the specification. Stream IDs have been added as well. Some support for persistent reservation has been added, but it is incompatible with the SCSI feature, which is not what was wanted. A compatible version is currently a work in progress.
Virtualized/emulated controllers are another feature that has been added. This allows virtual machines to think they have their own dedicated NVMe controller. Each guest gets a dedicated queue pair mediated by the hypervisor. That allows multiple guests to share a single physical NVMe adapter.
The NVMe group has around 80 people on a two-hour call each week, while the T10 (SCSI), T11 (Fibre Channel), and T13 (ATA) committees have three-day meetings every few months. There is some concern that those groups are running out of things to do. The NVMe group is ramping up, he said, while the others are much more stable than they were a few years ago.
Matthew Wilcox asked how Knight felt about all the different errors that can be reported by storage devices essentially all boiling down to EIO (I/O error) in Linux. Knight chuckled and pointed to all of the kinds of sense codes and media errors that are reported by storage devices. But, as Martin Petersen pointed out, for POSIX they really all do need to be mapped to EIO because applications don't understand anything else.
Online filesystem scrubbing and repair
In his traditional LSFMM session to "whinge about various things", Darrick Wong mostly discussed his recent work on online filesystem repair for XFS, but also strayed into some other topics. Online filesystem scrubbing for XFS was one of those, as was a new ioctl() command to determine block ownership.
He started with the GETFSMAP ioctl() command, which allows querying a filesystem to determine how its blocks are being used (which blocks contain data or metadata and which are free). He has patches for XFS that are reviewed and ready to go for 4.12. Ext4 support is out for review. Chris Mason said that there are ioctl() commands for Btrfs that can be remapped to GETFSMAP.
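As a sketch of how a GETFSMAP query might look from user space (based on the linux/fsmap.h header from the patch set; the structure layout should be treated as illustrative until the interface is merged):

/* Sketch: dump the first batch of mappings for a filesystem via the
 * proposed GETFSMAP ioctl(). Built against the linux/fsmap.h header
 * from the patch set; details could still change. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fsmap.h>

#define NR_RECS 32

int main(int argc, char **argv)
{
    struct fsmap_head *head;
    unsigned int i;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    head = calloc(1, sizeof(*head) + NR_RECS * sizeof(struct fsmap));
    if (!head)
        return 1;
    head->fmh_count = NR_RECS;
    /* The low key is all zeroes (from calloc); set the high key to
     * "everything" so the query covers the whole filesystem. */
    head->fmh_keys[1].fmr_device = UINT32_MAX;
    head->fmh_keys[1].fmr_physical = UINT64_MAX;
    head->fmh_keys[1].fmr_owner = UINT64_MAX;
    head->fmh_keys[1].fmr_offset = UINT64_MAX;
    if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
        perror("FS_IOC_GETFSMAP");
        return 1;
    }
    for (i = 0; i < head->fmh_entries; i++)
        printf("dev %u: phys %llu len %llu owner %llu\n",
               head->fmh_recs[i].fmr_device,
               (unsigned long long)head->fmh_recs[i].fmr_physical,
               (unsigned long long)head->fmh_recs[i].fmr_length,
               (unsigned long long)head->fmh_recs[i].fmr_owner);
    return 0;
}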
Wong then described an online scrubbing tool for XFS that he has been working on. Right now, it examines the metadata in the filesystem and complains if there is something that is "clearly insane". Part two will be posted soon and will cross-check metadata with other metadata to make sure they are all in agreement. The tool will also use GETFSMAP to read all file-data blocks using direct I/O to check to see if any give read errors.
Damien Le Moal suggested that the RAID rebuild assist feature of some storage devices could be used to either get the data or get a failure quickly. Wong asked if there was a way to get a list of LBAs with read errors, but that isn't supported. Fred Knight said that using RAID rebuild assist will cause the device not to try any "heroic recovery" and simply return errors. Wong said that he hasn't done much work on reading from the device yet as he has been concentrating on the filesystem side.
For online repair for XFS, he has been working on using the reverse mappings to reconstruct the primary metadata for corrupted filesystems. He does not want to use the metadata that is left over because it may be corrupted too. There is a problem, though, if you have to recreate the secondary metadata (e.g. the reverse mapping) because you run into a deadlock situation.
The deadlock is avoided by freezing the filesystem so there is no access to the inodes. Inode by inode, this restriction is relaxed as the reverse mapping is rebuilt. The block device is not frozen, just the filesystem, but he wondered if that was a sane thing to do. It is not the intended use case for filesystem freezing, but it seems to work. That is the only part of online repair where the filesystem "has to come to a screeching halt", he said.
There is no secondary metadata for directories right now, so the online repair just moves corrupted directories to lost+found or removes them if they are too corrupted. XFS doesn't have parent directory pointers that might be used to help repair directory corruption.
There is still no good way to remove open files, Wong said. He thought it might be possible to rip the file's pages out of the page cache, replace the inode operations and mapping fields with dummy values that just return errors, and to try to put the inode on the orphan list. Matthew Wilcox asked if writeback would be done before the pages are removed from the page cache; Wong said that was an open question.
Mason said that Facebook does no online repair. If a filesystem gets corrupted, it is taken out of production and restored to a working state before being returned to service. Various aspects of the filesystem change (e.g. latencies) while online repair is in progress, so the company just does not bother. Ted Ts'o said that the cloud world tends to have enough redundancy that doing online repair isn't really needed, but in other worlds that is not true. Wong said that the feature is targeted at users who are willing to trade "some to a lot of latency" for the ability to not have to take a volume offline.
A way to specify a "desperation" flag in block I/O requests is on his wish list, Wong said. Setting that would mean that the block layer should go read RAID-1 mirrors or invoke stronger error recovery to read a block. He would like to use that when a block read succeeds but the checksum doesn't match.
He asked the room if anyone was making progress on filesystem freezing before suspend operations. Mason said that the use of the work queue WQ_FREEZABLE flag was "all wrong", as described in a Kernel Summit session; the code for the filesystems had all been cut and pasted and was simply wrong. He had thought that someone was planning to fix that, but was not sure where that effort stood.
Extending statx()
When Andreas Dilger proposed the statx() topic for the 2017 Linux Storage, Filesystem, and Memory-Management Summit, the system call had still not been merged. But that all changed in the 4.11 development cycle when Al Viro merged the system call to provide additional file information. So, unlike previous years, the discussion was not about how to merge such a system call but, instead, how to extend statx() for additional file information.
Both Dilger and David Howells (who authored the statx() patches) led the discussion. Dilger noted that the lack of statx() has been a pain point for him; for example, ls tries to stat() every file in order to properly color it in its output. But stat() is expensive on network filesystems, even though the kernel often already has the needed information and no network round trip should be required; statx() solves that problem.
The core of the system call was merged; now ext4, Gluster, and others have support. Due to bikeshedding, Dilger said, the core has been pared back several times over the years. Howells noted that statx() will hopefully make it into glibc soon, so that developers can start using it. One attendee suggested that meant it was "ten years out" from actually being widely used, to some chuckles.
Steve French said that Samba has ways to add statx() functionality easily, but wondered what the best way is to test whether the kernel has the feature. Dilger said the usual approach is to try to use it, then to set a flag if it doesn't exist and not to try again until a restart. But French thought that it didn't really make sense for Samba to build a special module for statx() if the kernel header files were missing the proper pieces. Jeff Layton pointed out that Samba might be built on a kernel that has support, but run on one that doesn't, so a runtime check is still required.
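That runtime check is straightforward to write. A sketch, assuming 4.11-era kernel headers for __NR_statx and struct statx (the AT_STATX_DONT_SYNC constant is defined by hand in case the C library's headers lack it):

/* Sketch: call statx() directly, falling back when the running
 * kernel returns ENOSYS, along the lines Dilger described. */
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/stat.h>

#ifndef AT_STATX_DONT_SYNC
#define AT_STATX_DONT_SYNC 0x4000   /* from linux/fcntl.h */
#endif

static int have_statx = 1;   /* cleared on the first ENOSYS */

static int get_btime(const char *path, struct statx_timestamp *btime)
{
    struct statx stx;

    if (have_statx) {
        /* Ask only for the birth time; AT_STATX_DONT_SYNC lets a
         * network filesystem answer from cached attributes. */
        if (syscall(__NR_statx, AT_FDCWD, path, AT_STATX_DONT_SYNC,
                    STATX_BTIME, &stx) == 0) {
            if (stx.stx_mask & STATX_BTIME) {
                *btime = stx.stx_btime;
                return 0;
            }
            return -1;   /* filesystem has no birth time */
        }
        if (errno == ENOSYS)
            have_statx = 0;   /* old kernel: don't retry */
    }
    return -1;   /* caller falls back to stat(); no birth time there */
}

int main(int argc, char **argv)
{
    struct statx_timestamp bt;

    if (argc == 2 && get_btime(argv[1], &bt) == 0)
        printf("born: %lld\n", (long long)bt.tv_sec);
    return 0;
}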
Dilger said that he would like to get statx() support into GNU ls, though the code is "eye-scratchingly bad". He does not know what the project's policies are with regard to changes like that, however. ls is a "poster child" for using all of the different functionality of statx(), he said.
The subject of testing for the system call came up; there are some tests, Dilger said. Ted Ts'o thought those tests should be ported to xfstests for more widespread testing. Howells suggested overriding stat() with statx() in xfstests, but it was not clear how serious he was about that. Eric Sandeen pointed out that Christoph Hellwig had talked about reverting statx() if no tests were forthcoming. Darrick Wong noted that his patches to support statx() in XFS needed review, since Hellwig has refused to do so awaiting more documentation and tests. But Howells said that both exist, though more tests are needed, a point the attendees generally agreed with.
Talk then turned to adding a call for additional filesystem information, which Howells was calling statfsx(). Information like the granularity of timestamps and the maximum/minimum I/O size of the filesystem could be retrieved with such a call. Layton agreed that a call like that may be needed, "possibly with a better name". There is more that could be returned, such as server addresses, Howells said. Some of that kind of information might be filesystem-type-specific, Dilger said, but Layton thought it was best to have attributes that are not tied to the filesystem type.
The session ended with some ideas for things that could be added to statx() itself, including more Windows attribute bits (only two of the attributes are supported by the call currently), which would help Samba, Howells said. Trond Myklebust suggested that quota information might be another good addition, but Howells seemed skeptical that statx() was the right interface; "give us a patch and let's see". There are, evidently, some patches floating around to extend statx() further, which are likely to be dusted off and posted before long.
A new API for mounting filesystems
The mount() system call tries to do too many things, Miklos Szeredi said at the start of a filesystem-only discussion at LSFMM 2017. He has been interested in cleaning that up for a long time. So he wanted to discuss some ideas he had for a new interface to mount filesystems.
mount() rolls many operations into a single call; its various fields are used in different ways depending on what needs to be done, and it has almost run out of flags. It supports regular mounts, bind mounts, remounting, moving a mount, and changing the propagation type of a mount, but those operations are mutually exclusive within one call, and some changes require a remount. For example, you cannot create a read-only bind mount directly; you must first do the bind mount, then remount it read-only. Similarly, you cannot change the propagation parameters while doing a bind mount or changing other mount flags.
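The read-only bind mount case makes the awkwardness concrete; with the current interface it takes two mount(2) calls (a small sketch):

/* Sketch: with today's mount(2), a read-only bind mount requires a
 * bind followed by a read-only remount of the new mount point. */
#include <sys/mount.h>

int ro_bind(const char *src, const char *dst)
{
    if (mount(src, dst, NULL, MS_BIND, NULL) < 0)
        return -1;
    return mount(NULL, dst, NULL,
                 MS_BIND | MS_REMOUNT | MS_RDONLY, NULL);
}

Between the two calls the bind mount is briefly writable, which is part of the reason a more expressive API is attractive.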
Szeredi has come up with a proposed solution with several new system calls, starting with:
int fsopen(const char *fstype, unsigned int flags);
It would be used to get a file descriptor to communicate with a filesystem driver and might be called as follows:
fsfd = fsopen("ext4", 0);
That would provide a connection to the ext4 filesystem driver so that parameters could be set via a protocol.
The talk of a protocol prompted Jeff Layton to ask about using a netlink socket instead. But Al Viro said that a netlink protocol would need to be fully specified right from the start, which would not fit well. Josef Bacik said that he thought netlink would allow adding new attributes and values after the fact. There could be a different protocol specification for each filesystem type, perhaps based on a common set for all filesystems with extensions for specific types. Layton agreed, but said the mechanism for the protocol could be determined at a later point.
The protocol Szeredi is envisioning would have a set of configuration commands, each with a NUL-delimited set of parameters. It might look something like:
SETDEV\0/dev/sda1
SETOPTS\0seclabel\0data=ordered
...
That data would be written to the filesystem file descriptor returned from fsopen().
Jeff Mahoney asked if there was a need for a system call at all. Perhaps sysfs or the /proc filesystem could be used instead. One attendee pointed out that would mean that some other mechanism would need to be used to mount /proc or /sys. There might also be implications for booting, since those filesystems may not be available early enough to mount the boot partition.
Additional system calls would be needed, Szeredi said, moving back to his proposed interface. Attaching a filesystem to a mount point would be done with mountat(), changes to a mount would be done using mountupdate(), while mountmove() to move a mount and mountclone() to clone one round out the set. There were suggestions that some of those could be combined into one call, mountmove() and mountclone() in particular.
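Pulling the pieces together, the proposed flow might look something like the fragment below. This is entirely speculative: none of these system calls exists yet, the mountat() arguments are invented, and the command strings are just the examples above. Note that the NUL-delimited parameters mean the write lengths must be passed explicitly:

/* Hypothetical sketch of Szeredi's proposal; fsopen() and mountat()
 * are proposed calls, and the mountat() signature is invented. */
int fsfd = fsopen("ext4", 0);

/* Configure the filesystem via the protocol; the embedded NULs mean
 * strlen() cannot be used for the lengths. */
write(fsfd, "SETDEV\0/dev/sda1", 16);
write(fsfd, "SETOPTS\0seclabel\0data=ordered", 29);

/* Attach the configured filesystem to a mount point. */
mountat(AT_FDCWD, "/mnt", fsfd, 0);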
Szeredi said that he would look into using a netlink socket rather than fsopen(). One attendee said that netlink would make the simple case of a straightforward mount more complicated, but Szeredi said that the existing mount() would not be going away.
David Howells wondered if netlink was an optional kernel component; if so, mounting using this new mechanism would be impossible in some cases, another attendee said. But, again, Szeredi said that the existing mount() system call could be used. There was some concern that filesystems will come to depend on these new interfaces, so that using mount() won't work well.
Layton noted that there have been requests for better error messages from mounting operations; often there is not enough detail in the error code returned. Szeredi said that more detailed information could potentially be read from the descriptor returned by fsopen().
Overall, the attendees seemed interested in having a better API for mounting filesystems, but it would seem there is a ways to go before there is something concrete to ponder.
Container-aware filesystems
We are getting closer to being able to do unprivileged mounts inside containers, but there are still some pieces that do not work well in that scenario. In particular, the user IDs (and group IDs) that are embedded into filesystem images are problematic for this use case. James Bottomley led a discussion on the problem in a session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit.
The various containerization solutions in Linux (Docker, LXC, rkt, etc.) all use the same container interfaces, he said. That leads to people pulling in different directions for different use cases. But the problem with UIDs stored in filesystem images affects all of them. These images are typically full root filesystems for the containers that have lots of files owned by the root user.
Bottomley has proposed shiftfs as a potential solution to this problem. It is similar to a bind mount, but translates the filesystem UIDs based on the user namespace mapping. It can be used by unprivileged containers to mount a subtree that has been specifically marked by the administrator as being shiftfs-mountable.
An earlier effort to solve the problem added the s_userns field to the superblock in order to do UID translations, but that is a per-superblock solution that does not work well when a single mounted filesystem must be shared among containers with different UID mappings. With shiftfs, an inode operation will translate the UID, based on the namespace mapping, to that of the underlying filesystem before passing the operation to the lower level. That means the virtual filesystem (VFS) does not need changes, which makes for a cleaner solution, Bottomley said.
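At its core, that translation is a round trip through the kernel's ID-mapping helpers; a rough kernel-style sketch (the surrounding function is invented for illustration):

/* Rough sketch of the kind of shift shiftfs performs: re-map a
 * kernel UID from the container's user namespace into the namespace
 * the underlying filesystem was mounted in. The helpers are the
 * kernel's ID-mapping API; the function itself is invented. */
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

static kuid_t shift_kuid(struct user_namespace *container_ns,
                         struct user_namespace *fs_ns, kuid_t kuid)
{
    /* The UID as the container sees it... */
    uid_t uid = from_kuid(container_ns, kuid);

    /* ...re-encoded for the underlying filesystem's namespace. */
    return make_kuid(fs_ns, uid);
}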
There are some significant security implications to allowing arbitrary directory trees to be shift-mounted in unprivileged containers, including the ability for users to create setuid-root binaries. So the administrator must mark those subtrees (using extended attributes in his prototype) that are safe to be mounted that way.
Al Viro asked if there is a plan to allow mounting hand-crafted XFS or ext4 filesystem images. That is an easy way for an attacker to run their own code in ring 0, he said. The filesystems are not written to expect that kind of (ab)use. When asked if it really was that easy to crash the kernel with a hand-crafted filesystem image, Viro said: "is water wet?"
Amir Goldstein said that the current mechanism is to use FUSE to mount the filesystems in the unprivileged containers. But Bottomley is concerned that the FUSE daemon can be exploited, so it should run in the unprivileged container as well. If the mounts are restricted to USB sticks, an attacker would need physical access, which opens plenty of other paths to system compromise, so it is "safe" in that sense. But if loopback mounting of filesystems is to be supported at some point, the filesystem code will need to have no exploitable bugs.
In something of an aside, Goldstein reminded filesystem developers that their filesystems may be running under overlayfs. He suggested that there needs to be more testing of different filesystems underneath overlayfs.
While the attendees recognized the problem for unprivileged containers, there does not seem to be a consensus on the right route to take to solve it.
Eliminating Android wrapfs "hackery"
As it has evolved over the years, Android has acquired some hacks in how it handles its filesystems. Ted Ts'o would like to see those hacks eliminated, so he led a session at LSFMM 2017 to look at the problem and see what, if any, upstream-acceptable solution could be found.
Ts'o started with some history. Early Android devices had SD card slots with a FAT filesystem on them. The Android team tried to kill SD cards for the devices, but were ultimately unsuccessful. Meanwhile, the /sdcard mount moved into the /data partition and a FUSE filesystem was added to emulate the case-insensitive behavior of FAT.
In fact, there are three separate FUSE mounts used today to enforce different levels of app privilege: read-only, read-write, or no access to /data. Based on the capabilities of the app, nsenter is used to put it into the namespace where the filesystem is mounted with the proper access restrictions. But this FUSE-based solution has started to become a performance problem, he said.
The weird permission regime could be handled with a stackable Linux Security Module (LSM) on top of the SELinux module that is already used, he said. But supporting case insensitivity is harder. That had been discussed some at LSFMM 2016, but objections were raised because of UTF-8 case-folding requirements and the need to do a brute-force directory search.
Ts'o looked at what Android is doing; it is only handling ASCII file names. He wondered if a case-folding feature could be added to the virtual filesystem (VFS) layer. It could just use strcasecmp() for comparing file names.
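Existing case-insensitive filesystems (vfat, for example) implement this through their dentry operations; a VFS-level option would presumably look similar. A rough kernel-style sketch, with invented names, of ASCII-only folding:

/* Rough sketch of ASCII-only case-insensitive dentry operations,
 * along the lines of what vfat already implements; the names here
 * are invented. Hashing and comparison must fold case the same way. */
#include <linux/dcache.h>
#include <linux/stringhash.h>
#include <linux/string.h>
#include <linux/ctype.h>

static int ci_d_hash(const struct dentry *dentry, struct qstr *q)
{
    unsigned long hash = init_name_hash(dentry);
    unsigned int i;

    for (i = 0; i < q->len; i++)
        hash = partial_name_hash(tolower(q->name[i]), hash);
    q->hash = end_name_hash(hash);
    return 0;
}

static int ci_d_compare(const struct dentry *dentry, unsigned int len,
                        const char *str, const struct qstr *name)
{
    return len != name->len || strncasecmp(str, name->name, len);
}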
Al Viro pointed out that using strcasecmp() would mean that negative directory entries (dentries), which denote names that were looked up but not found, would not work correctly. Unsuccessful searches for "Makefile" with several case variations would leave negative dentries behind, but then "some joker" creates a file with that name and there would be both positive and negative dentries for the "same" name. He suggested that dropping negative dentries for this case might work; it is, he said, what case-insensitive XFS is doing.
The performance implication of dropping negative dentries is probably just fine for Android. The current solution uses FUSE, after all, so it is not hard to do better than that, Ts'o said. Case-insensitive name handling could be added as a kernel configuration option and as a mount option for specific mounts.
So, Ts'o asked, is that something that could go upstream? Clearly the stacked FUSE approach is not going to cut it, so perhaps some variant of this could? Viro suggested simply doing it in ext4, but Ts'o said that Christoph Hellwig had suggested adding something to the VFS so that all filesystems could use it. Some ideas were thrown around about ways to solve the problem without losing the ability to have negative dentries. No one seemed to come up with a solution there. It seems like it might be possible to get the feature in without negative dentry support, though.
Superblock watch for fsnotify
At the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Amir Goldstein led a discussion about the fsnotify filesystem notification subsystem and some changes he would like to see. Unfortunately, due to a bit of confusion about where the session would be held, I missed half of it; here's what I can reconstruct from the second half. Fsnotify is the internal kernel support subsystem for all three of the file notification APIs (dnotify, inotify, and fanotify).
Goldstein is trying to make fsnotify more scalable for getting notifications of changes in a large filesystem. To that end, he has proposed a "superblock watch" mechanism to efficiently report all changes made to a filesystem. For his use case, he just needs to be able to receive notifications when any file in any directory in the filesystem has changed (been created, deleted, or moved). There was a question about whether the names of the files that are changed should be included in the event, but Goldstein said he did not need that functionality (though others might); his application simply rescans the directory if anything has changed in it.
Al Viro was concerned that the file names would not stay valid while notifications were being delivered. Jan Kara said that there could be races that would make it hard to reproduce the sequence of changes that were made to the directory. But adding names to the fsnotify events does add significant complexity to the code. There is a clear demand for being able to get notification events on a large directory tree, however, Kara said. For now, he is not convinced that adding file names into the event is warranted and it could lead to various kinds of problems.
Goldstein said that the superblock watch is the simplest approach, rather than having a recursive fanotify watch on the mount point, which does not scale well. That API could eventually be extended to allow the creation of a change journal like NTFS supports, he said. There did not seem to be any fundamental opposition to the superblock watch feature as it stands.
Filesystem management interfaces
In a filesystem-only session at LSFMM 2017, Steven Whitehouse wanted to discuss an interface for filesystem management. There is currently no interface for administrators and others to receive events of interest from filesystems (and their underlying storage devices), though two have been proposed over the years. Whitehouse wanted to describe the need for such an interface and see if progress could be made on adding something to the kernel.
Events like ENOSPC (out of space) for thin-provisioned volumes or various kinds of disk errors need to get to the attention of administrators. There are two existing proposals for an interface for filesystems to report these events to user space. Both use netlink sockets, which is a reasonable interface for these kinds of notifications, he said.
Lukas Czerner posted one back in 2011, while Beata Michalska proposed another in 2015. The latter is too detailed, Whitehouse said, and has some performance issues. It notifies on events like changes to the block allocation in the filesystem, which is overkill for the kind of monitoring he is looking for.
The interface needs to provide a way to enumerate the superblocks of filesystems that are mounted on the system. Applications would register their interest in particular mounts and get notification messages from them. The messages would consist of two parts, a key that identified the kind of event being reported along with a set of messages with further information about the event.
The messages would have a unique ID to identify the mount, which would consist of a device number (either the real one or one that was synthesized by the subsystem), supplemented with a UUID and/or volume label. Some kind of generation number might also be needed to distinguish between different mounts of the same filesystem.
Steve French asked which filesystems can provide a UUID; network filesystems can do so easily, but what about others? Ted Ts'o said that all server-class filesystems have a way to generate a UUID. He also said that the device number would be useful to help correlate device errors. Trond Myklebust suggested that the information returned by /proc/self/mountinfo might be enough to uniquely identify mounts.
Ts'o said that this management interface is really only needed for servers, since what Whitehouse is looking for is a realtime alarm that some attention needs to be paid to a volume. That might be because it is thin-provisioned and is running out of space or because it has encountered disk errors of some sort.
There was some discussion of how management applications might filter the messages so that they only process those of interest. Ts'o said that filtering based on device, message severity, filesystem type, and others would probably be needed. There was general agreement for the need for this kind of interface, though it was not clear what the next step would be.
A network filesystem wish list
In the filesystem track at the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Steve French led a discussion on various topics of interest for network filesystems. As with the discussion at LSFMM 2016, there are a number of features that the network filesystem developers would like to see added, though there has been progress on one of the primary items from last year: the statx() system call.
French said that the addition of statx() gave Samba and other network filesystems some things that were wanted for a long time. The new system call adds a "birth time" for files, as well as two of the Windows attribute flags. But more is needed. Additional attribute flags, an interface to set attributes, and something like the READDIRPLUS NFS command are all on that list.
Samba leases, which allow the client side to aggressively cache file data, are not fully supported. The API lacks a way to provide a lease key in order to upgrade an existing lease, so upgrading requires dropping the lease, which is inefficient. There is also no way to cache metadata and directory contents, which Microsoft says can result in an enormous reduction in network traffic for typical home-directory-oriented users.
French also noted that version 28 of the rich access-control list (RichACLs) had recently been posted. Some small pieces of the patch set were merged over the last year, but there is a need for full RichACL support. Right now, NFSv4 and Samba ACLs are mapped as best as they can be using extended attributes (xattrs). But not all filesystems store xattrs efficiently and he is also worried that the mapping is imperfect, which could lead to security problems.
There are races when creating files with ACLs and other attributes right now, French said. Jeff Layton thought that using O_TMPFILE, setting the ACLs and attributes on that file, then moving the file to its real name should be sufficient to avoid those races. French said he did not have the details but, from what he understands, the races are unavoidable with the current interfaces. There is a surprising amount of work needed at file creation time, he said.
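The approach Layton described is the documented O_TMPFILE idiom; a sketch (the xattr name is just an example, and linking through /proc requires that it be mounted):

/* Sketch of the O_TMPFILE idiom: create a file with no name, set
 * its attributes while nothing else can see it, then link it into
 * place. The xattr here is only an example. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/xattr.h>

int create_with_attrs(const char *dir, const char *name)
{
    char tmp[64], path[512];
    int fd;

    fd = open(dir, O_TMPFILE | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    /* Set ACLs/xattrs while the file is still invisible, so it can
     * never be opened in a half-initialized state. */
    fsetxattr(fd, "user.example", "value", 5, 0);

    /* Atomically give the file its real name (needs /proc). */
    snprintf(tmp, sizeof(tmp), "/proc/self/fd/%d", fd);
    snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (linkat(AT_FDCWD, tmp, AT_FDCWD, path, AT_SYMLINK_FOLLOW) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}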
Network filesystems need broader support for the fast copy options (e.g. server-side copying and copy offload). Almost everyone wants their copies to be done quickly by default, but there is no Linux interface that can simply be handed a source and a target and make a best effort to copy quickly. There is a need for a per-file snapshot interface. Right now, there are filesystem-specific ways to request snapshots, but it would be helpful to have a single interface for all filesystems that can support it. Windows has that support, he said.
There is currently no interface to get metadata about the filesystem itself. There are things that XFS and Btrfs know that could help client applications make better decisions. The alignment of the device, whether there is a seek penalty, or if the TRIM command is supported are all things that would be helpful to know. He noted David Howells's proposal for a filesystem query system call (possibly statfsx()) that was made in the statx() session; the timestamp granularity example given there was a good one, French said.
Ted Ts'o said that most would probably agree that many of the interfaces French is talking about would be good additions, but they are also easy to bikeshed over so it "takes forever to get them upstream". The filesystem information system call is definitely needed and there is no need to convince the kernel developers of that, but there will be three months to three years of bikeshedding over it. The idea behind statx() was not controversial, Ts'o said, it's just that the details took a long time to be worked out.
But French said that statx() is an example of progress. It was finally pared down to just adding birth time and two attributes, which is great, but it is an extensible interface. The "compressed" and "encrypted" attributes were added with the new call, but more would be helpful and he is optimistic that they will be added.
Performance problems reading files with holes
In the last filesystem session of LSFMM 2017, Anna Schumaker presented some performance problems she has found when working with sparse files. There are some major differences between ext4, XFS, and Btrfs when accessing files with holes.
NFSv4.2 adds a READ_PLUS command that incorporates some optimizations for reading sparse files. READ_PLUS returns both data and holes, but the latter are represented by a range rather than a series of blocks of zeroes. She has been implementing that but ran into some performance problems.
When using the SEEK_DATA and SEEK_HOLE functionality on a file that has an even mix of data and holes, ext4 and XFS perform reasonably well (though XFS is a bit worse than ext4), but Btrfs is completely unusable, she said. On Btrfs, it takes roughly twice as long to read the sparse file as it does to read a non-sparse file of the same size.
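The access pattern in question walks the file segment by segment with lseek(); a sketch of the loop whose cost varies so much between filesystems:

/* Sketch: enumerate a sparse file's data segments using
 * SEEK_DATA/SEEK_HOLE; each call drops into the filesystem's
 * llseek() implementation. */
#define _GNU_SOURCE   /* for SEEK_DATA and SEEK_HOLE */
#include <stdio.h>
#include <unistd.h>

static void walk_segments(int fd)
{
    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;

    while (pos < end) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        off_t hole;

        if (data < 0)   /* only a trailing hole remains */
            break;
        hole = lseek(fd, data, SEEK_HOLE);
        printf("data: %lld..%lld\n",
               (long long)data, (long long)hole);
        pos = hole;
    }
}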
Several theories were offered as to what might be going wrong; it would seem that caching probably plays a role. Ted Ts'o noted that SEEK_HOLE is an lseek() operation that is implemented by the individual filesystems. Both ext4 and XFS cache the hole information; ext4 used to do it badly. He has not looked at Btrfs to see what it does, but suggested the problem might lie in this area.
Ts'o thought a quick microbenchmark might help point to the problem. Jan Kara, though, was puzzled by the results reported. He said he needed more information; it was generally agreed by the attendees that more data was needed to shed some light on the problem.
Trond Myklebust pointed out that the fact that only Btrfs exhibits the problem would seem to indicate that there isn't an NFS implementation bug. Jeff Layton suggested writing a small user-space program that exhibits the behavior. If that were to get posted to the mailing lists, it should be pretty easy for the filesystem developers to reproduce the problem and diagnose what is going on.
Booting from remote storage
In the only storage-only LSFMM 2017 session that LWN was able to attend—it was scheduled opposite the one-and-only filesystem and memory management combined session—Lee Duncan explored some of the questions and problems he sees in booting from remote storage. He said that he wanted to get feedback from the assembled developers to see where solutions might lie.
Ethernet booting works just fine, he said, as long as everything is configured correctly. Some of the hardware makes that difficult, however. For example, there are Broadcom network cards that use the same MAC address on multiple ports and other network adapters that can be used for iSCSI, but not for general networking. When things are misconfigured, systems do not boot and it is hard to diagnose and fix them.
There is a standard for booting over iSCSI from Microsoft that is loosely followed, he said. But there is nothing for Fibre Channel over Ethernet (FCoE) so various hacks are used to make things work.
Error handling is a sore spot as well. Disconnection and reconnection need to be handled by either user space or the kernel; which one does so is transport-specific. For iSCSI, it is done in user space, but it is sometimes difficult to determine which errors indicate that a reconnection should be tried. In addition, some iSCSI cards fail over to another card or port in some cases, but it is not clear what those cases are.
Multipath I/O is also problematic. Systemd handles the sequencing of setting up the paths; if it gets the sequence wrong, the system will hang on start or stop.
There is a need for standards in this area and for more testing. The vendors do not run their hardware on enterprise kernels, he said, so the distributions should be qualifying this hardware. But Jes Sorensen objected that testing cannot capture all the different scenarios; even if a hardware device is qualified, you could "put it in a data center with slightly different routing and it falls over".
Systemd is part of the problem because it changes so frequently, Sorensen said. Hannes Reinecke said that systemd maintainer Lennart Poettering told him that multipath will never work because "that is not how they designed systemd".
It may make sense to have some standard like the Linux Standard Base (LSB) for what the distribution vendors are willing to support for remote boot, Sorensen said. It would require getting the right distribution representatives together in a room to determine what the officially supported processes for remote booting would be. Representatives from Red Hat, SUSE, Ubuntu, and others would need to come together, but he did not see enough of the right people in the room, he said. James Bottomley said that it is a problem that impacts the kernel developers but not one they can solve; distributions need to fix it.
The md (software RAID) unit tests are available for some of this testing, Sorensen said. He is interested in making them test more functionality and in having them be used more widely. Right now they are quite md-specific and run only on loopback devices, but they could be extended, and options could be added to choose the storage device used.
But Mike Snitzer thought that all of those layers should come into play after booting. The remote booting case should be simplified and tested that way; local boot should be used to test the more complex stacking options. Duncan agreed that the stacking of these different layers was not the main problem; the most fundamental problem is just to get the system to boot from remote storage.
There is a "common set of problems booting from the HBAs [host bus adapters] of the world", Martin Petersen said. Duncan said that the problem often manifests itself after an installation from CD; when the system tries to reboot, it fails and complains about a missing device. Sorensen said that it is often a driver or firmware file missing from the Dracut initramfs image, some of which could perhaps be simulated in virtual machines (VMs). Bottomley said that there is no coherent information to give to the user to help them diagnose the problem; the Dracut image needs to have that so that it can report it.
Creating a virtual machine environment to do this testing is going to take a lot of work, Sorensen said. But canned tests in VMs would be helpful, Duncan said. What it takes to boot versus what it takes to bring up all of the attached storage should be tackled separately, Bottomley said. Otherwise there is a combinatorial explosion of different options to test. It would be useful to be able to create a list of what modules are needed to bring a given block device online, Petersen said.
There was some discussion about having a microconference at the 2017 Linux Plumbers Conference to start the process of defining what will be supported and how it will be tested. There was general agreement that it might be the right venue to advance the process of rationalizing remote boot for Linux.
Page editor: Jonathan Corbet