Handling writeback errors

By Jake Edge
April 4, 2017

LSFMM 2017

Error handling during writeback is something of a mess in Linux these days, Jeff Layton said in his plenary session to open the second day of the 2017 Linux Storage, Filesystem, and Memory Management Summit. He has investigated the situation and wanted to discuss it with attendees. He also presented a proposal for a way to make things better. Writeback is the process of asynchronously writing dirty pages from the page cache back to the underlying filesystem.

The process started when he was looking into ENOSPC (out-of-space) errors for the Ceph filesystem. He found that the PG_error page flag, which is set when a problem occurs during writeback, would override any other error status and cause EIO (I/O error) to be returned. That inspired him to start looking at other filesystems to see what they did for this case and he found that there was no consistency among them. Dmitry Monakhov thought that writeback time was too late to get an ENOSPC error, but Layton said there are other ways to get it at that time.

If there is an error during the writeback process, it should be returned when user space calls close() or fsync(), Layton said. The errors are tracked using the AS_EIO and AS_ENOSPC bits in the flags field of struct address_space. They are also sometimes tracked at the page level, using PG_error in the page flags.
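From the application side, that reporting path looks roughly like the sketch below. It is a minimal illustration, not code from the session; the helper name write_durably() is hypothetical, and which errors actually surface at fsync() or close() depends on the filesystem.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and flush them to storage.
 * Returns 0 on success or a negative errno value.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data was written */
            int err = errno;
            close(fd);
            return -err;
        }
        p += n;
        len -= (size_t)n;
    }

    /* A successful write() has only dirtied the page cache; writeback
     * errors such as EIO or ENOSPC are reported here, if at all. */
    if (fsync(fd) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    /* close() may also report an error; the descriptor is not closed again. */
    if (close(fd) < 0)
        return -errno;
    return 0;
}
```

A caller that ignores the return values of fsync() and close() has no way to learn that its data never reached storage.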

A stray sync operation can clear the error flag without the error getting reported to user space, Layton said. PG_error is also used to track and report read errors, so a mixed read-write pattern can cause the flag to be lost before getting reported. There is also a question of what to do with the page after there is a writeback error for it; right now, the page is left in the page cache marked clean and up-to-date, so user space does not know there is a problem.
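The lost-error problem can be illustrated with a small user-space model of a test-and-clear flag. This is a simplified sketch, not kernel code; the model_* names are hypothetical, and the choice to let EIO override ENOSPC simply mirrors the behavior Layton described.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified user-space model of per-mapping error flags; not kernel code. */
enum { MODEL_AS_EIO = 1u << 0, MODEL_AS_ENOSPC = 1u << 1 };

struct model_mapping {
    unsigned int flags;
};

/* Called when writeback of some page in the mapping fails. */
void model_record_error(struct model_mapping *m, int err)
{
    assert(m != NULL);
    m->flags |= (err == -ENOSPC) ? MODEL_AS_ENOSPC : MODEL_AS_EIO;
}

/* Report the pending error once and clear it. Whoever calls this first
 * consumes the error; later callers see success. */
int model_check_and_clear(struct model_mapping *m)
{
    int ret = 0;

    if (m->flags & MODEL_AS_ENOSPC)
        ret = -ENOSPC;
    if (m->flags & MODEL_AS_EIO)
        ret = -EIO;             /* EIO overrides ENOSPC in this model */
    m->flags = 0;
    return ret;
}
```

If a stray sync calls model_check_and_clear() before the application's fsync() does, the application's call returns 0 even though its data was lost.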

So, Layton asked, "how do we fix this mess?" James Bottomley asked what granularity the filesystems want for error tracking. Layton said that address_space was the right level to track the information. Jan Kara pointed out that PG_error was meant to be used for read errors, but Layton said that some writeback implementations use it.

Layton suggested removing the use of PG_error for writeback errors. That would entail a pass through the tree to clean up those uses and to see whether writeback errors are propagating out to user space or are being cleared incorrectly. Ted Ts'o said there may be a need for a way to do writeback without getting any errors because they cannot be returned to user space.

Bottomley said that he would rather not mark pages clean if they have not been written to disk. The error information would need to be tracked by sector, he said, so that the block layer can tell the filesystem where in a BIO (the structure used to describe block I/O requests) the error happened. Chris Mason suggested that anyone working on "redoing the radix tree" (i.e. Matthew Wilcox) might want to provide a way to store an error for a specific spot in the file. That way, the error could be reported then cleared once that reporting is done.

Layton then presented an idea he had for tracking the errors. Writeback error counter and writeback error code fields would be added to the address_space structure and a writeback error counter would be added to struct file. When a writeback error occurs, the counter in address_space would be incremented and the error code recorded. At fsync() or close(), that error would be reported, but only if the counter in the file structure is less than that in address_space. In that case, the address_space counter value would be copied to the file structure.
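A rough user-space model of that proposal might look like the following. The structure and function names are hypothetical stand-ins for the fields Layton described, and the sketch assumes that a file samples the counter when it is opened; it also ignores counter wraparound, which a real implementation would need to handle.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the fields proposed for struct address_space. */
struct sketch_mapping {
    unsigned int wb_err_count;  /* bumped on every writeback failure */
    int wb_err;                 /* most recent error code, e.g. -EIO */
};

/* Stand-in for the field proposed for struct file. */
struct sketch_file {
    struct sketch_mapping *mapping;
    unsigned int wb_err_count;  /* counter value this file last saw */
};

/* Assumption: open samples the counter, so older errors are not reported. */
void sketch_open(struct sketch_file *f, struct sketch_mapping *m)
{
    assert(f != NULL && m != NULL);
    f->mapping = m;
    f->wb_err_count = m->wb_err_count;
}

/* Called when writeback fails for any page in the mapping. */
void sketch_writeback_failed(struct sketch_mapping *m, int err)
{
    m->wb_err_count++;
    m->wb_err = err;
}

/* Report an error only if one was recorded since this file last looked. */
int sketch_fsync(struct sketch_file *f)
{
    struct sketch_mapping *m = f->mapping;

    if (f->wb_err_count < m->wb_err_count) {
        f->wb_err_count = m->wb_err_count;
        return m->wb_err;
    }
    return 0;
}
```

Because each open file keeps its own copy of the counter, one descriptor's fsync() does not consume the error for the others; every file that was open when the failure happened sees it exactly once.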

Monakhov asked why a counter was needed; Layton said it was to handle multiple overlapping writebacks. Effectively, the counter would record whether a writeback had failed since the file was opened or since the last fsync(). Ts'o said that should be fine; applications that want more information should use O_DIRECT. For most applications, knowledge that an error occurred somewhere in the file is all that is necessary; applications that require better granularity already use O_DIRECT.

Layton's approach will result in some false positives, but there will be no false negatives, Ts'o said. Guaranteeing anything more would be much more expensive; avoiding the false positives would be a great goal, but the proposed method is simpler and has minimal performance impact. Layton said that it would definitely improve the situation for user space.

Ts'o noted that there are some parallels in the handling of ENOMEM (out of memory). If that error gets returned deep in the guts of writeback, many of the same problems that Layton described occur. So, some filesystems desperately try not to return ENOMEM but, to avoid that, they have to hold locks, which gets in the way of the out-of-memory (OOM) killer. Ts'o has patches to defer writeback and leave pages in a dirty state rather than acquiring a set of locks and preventing OOM kills.

But Layton thinks the first step should be to get better generic error handling for writeback. Filesystems avoid using the current error handling because it doesn't work well. After that, better handling of temporary errors (e.g. ENOMEM, EAGAIN, and EINTR) could be added. Another thing that should be addressed down the road is to have more killable or interruptible calls when doing writeback; that would help with NFS blocking indefinitely when the server goes away, for example.

One other question Layton had is what should be done with the page when writeback fails. Right now, it is left clean, which is something of a surprise. If there is a hard error that will never be retried, should the page be invalidated? Ts'o said that they can't be left dirty in that case because the system will run out of memory if there are lots of writeback errors. If the radix tree could track the information, though, some advanced filesystem might try writing the page somewhere else.

There are four bytes per 64 pages available in the radix tree, Wilcox said, if that would be useful for this tracking. Mason said that he would rather have one or two more tags available, but Wilcox said that could only happen with a 17% increase in the size of the tree—in which case, there would be lots more tags available. Jan Kara suggested storing the error information somewhere outside of the radix tree. The only applications that would benefit if filesystems could do something smarter are databases (many of which already use O_DIRECT) and large loopback files, Mason said.

Steve French thought that the error handling should be done at a higher level, but Layton said that is how we got to where things are now. It is a bad API, and it is not clear how to use it. He is going to try to fix that, he said, and developers should be looking for his patches to do so.


Handling writeback errors

Posted Apr 4, 2017 23:19 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (12 responses)

close(2) should never fail with EIO. It just confuses people. Anyone who cares about durability needs to sync somehow (e.g., fsync); writeback may or may not have occurred at close time, so checking for EIO can't give you any guarantees. Besides: close(2) always succeeds in the sense that it always closes its passed file descriptor.

Handling writeback errors

Posted Apr 5, 2017 0:11 UTC (Wed) by jlayton (subscriber, #31672) [Link] (3 responses)

POSIX does allow you to return -EIO on close:

http://pubs.opengroup.org/onlinepubs/9699919799/functions...

"If an I/O error occurred while reading from or writing to the file system during close(), it may return -1 with errno set to [EIO]; if this error is returned, the state of fildes is unspecified."

I know for a fact that many filesystems do report errors at close. NFS is one, I think CIFS does too. This is probably another area where we should strive for consistent behavior between filesystems.

At this point I'm convinced enough to leave that out of the next posting of the patchset, and focus on fsync. We can always revisit that later.

Handling writeback errors

Posted Apr 5, 2017 18:18 UTC (Wed) by mtaht (guest, #11087) [Link] (2 responses)

The impedance mismatch between the errno types we have and the errors we actually get has been on my mind lately. Perhaps we need to more rigorously explore what POSIX doesn't have and start re-evolving that back in line with a consistent view of current reality? The last time errnos were updated was in the 90s.

At the very least, seeing the range of errnos start to evolve again in mainline posix to better represent new realities would be a goodness, I think. Instead of arguing about which posix errno to return in the case of ambiguity, we’d be trying to resolve the ambiguity with a new error definition in posix.

Bugs accumulate in the gaps between interfaces, and work on improving the rigor and expressiveness of those interface standards in face of changing reality is increasingly needed.

errno related rant here:

https://github.com/dtaht/libv6/blob/master/erm/doc/ebusl.org

Handling writeback errors

Posted Apr 6, 2017 8:11 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

I would suggest some hierarchical error system, so POSIX EIO will grow several subtypes, accessible via a new API for applications that want to distinguish among them. Those in turn could be subtyped later.

However, this is perhaps a bit too baroque and instead the errstr from Plan 9 could be adopted.

Handling writeback errors

Posted Apr 6, 2017 8:56 UTC (Thu) by NAR (subscriber, #1313) [Link]

"grow several subtypes,"

Sounds like exception class hierarchy, e.g. in Java there's IOException, then FileNotFoundException or InterruptedByTimeoutException is a child of IOException. Then the file operation can specify that it throws IOException and if the caller needs more details, it can have a narrower catch section.

Handling writeback errors

Posted Apr 5, 2017 12:58 UTC (Wed) by epa (subscriber, #39769) [Link] (7 responses)

> Anyone who cares about durability needs to sync somehow

That's almost equivalent to saying that everyone needs to sync in all cases. If you don't care about durability, why would you write it to the filesystem to start with, apart from the occasional tempfile?

Handling writeback errors

Posted Apr 6, 2017 2:55 UTC (Thu) by ringerc (subscriber, #3071) [Link] (6 responses)

It makes more sense if you turn it around: anyone who cares about performance needs to delay sync sometimes.

Handling writeback errors

Posted Apr 6, 2017 8:09 UTC (Thu) by epa (subscriber, #39769) [Link] (5 responses)

Yes, you delay sync, but surely you want to have the file persisted when you close() it? The grandparent seems to suggest that expecting data to be synced on close is too much to ask and you should have to fsync() as well.

Handling writeback errors

Posted Apr 6, 2017 11:57 UTC (Thu) by neilbrown (subscriber, #359) [Link] (4 responses)

When I'm compiling a multifile project I do NOT want every object file to be persisted when it is closed. That brings no value, and real costs.
Ditto when copying a large directory tree. Maybe running sync after the copy would make sense, but not during.

Handling writeback errors

Posted Apr 6, 2017 19:56 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

I agree that those are cases when you're happy to delay syncing. The object files when building are almost in the category of temporary files, and could in principle go to a tempfs, except that current build systems almost always place them in the same place as the source files. Copying a large directory tree, if aborted part way through, needs to be redone as a whole -- so as you say it makes no sense to persist each individual file.

However, I suggest that the default should be for safety, and if particular applications want to optimize performance by delaying sync then they can ask for it. Making sure that a file, once closed, is safely written seems like a good default choice. (The purist position would be to have all file operations synchronous by default, but I recognize that doing so would be a step too far -- sync on close is a fair compromise.) Then some richer API can let you turn that off or sync several files as a group. Better that than making the default unsafe and declaring that all applications need to fsync() if they, for some strange reason, care about the persistence of the data they have written.

Handling writeback errors

Posted Apr 7, 2017 1:30 UTC (Fri) by ringerc (subscriber, #3071) [Link]

Maybe that would've been a better default, but it's a lot too late for that now. POSIX and the Single UNIX Specification have kind of locked that in. No other platform would do it that way so correct code would still need to fsync().

Handling writeback errors

Posted Apr 9, 2017 13:30 UTC (Sun) by anton (subscriber, #25547) [Link]

I think that we are happy to delay syncing in most circumstances. Even if I edit a file, and lose power a few seconds after I hit the Save key, and therefore don't get the final file, I just recover the autosave file and have lost maybe a minute of work. What is important, however, is that the on-disk state of the file system is consistent, though not necessarily up-to-date (i.e., it represents one of the application-visible states of the file system), without requiring syncs after every write.

One case where we really need sync (or an appropriate asynchronous interface) is when reporting the completion of an operation to a remote client who won't notice if the system has crashed.

Handling writeback errors

Posted Apr 6, 2017 19:57 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

Similarly, in a database, that's almost never desired.

Handling writeback errors

Posted Apr 5, 2017 8:20 UTC (Wed) by ringerc (subscriber, #3071) [Link] (11 responses)

I recently ran into issues with EIO on fsync(), where fsync() clears the I/O error flag on the first call after a page is lost on writeback.

Writeup here:

http://stackoverflow.com/q/42434872/398670

PostgreSQL is an application that uses buffered I/O, not O_DIRECT, and would quite like to know which writes failed.

Handling writeback errors

Posted Apr 5, 2017 9:04 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (3 responses)

> PostgreSQL is an application that uses buffered I/O, not O_DIRECT, and would quite like to know which writes failed.

I'm not seeing much need for write or page granularity, as long as a corresponding fsync/fdatasync were guaranteed to fail. There's not much we can do in the face of such failures, besides correctly reporting that transactions didn't successfully commit.

Handling writeback errors

Posted Apr 5, 2017 12:29 UTC (Wed) by ringerc (subscriber, #3071) [Link] (2 responses)

That's fine for flushing WAL at commit-time. No problem there.

It's more problematic than that with the bgwriter, etc, where lost pages from async writes mean that if we get an EIO from fsync() we don't know which page(s) we didn't flush, and we don't know which xact(s) those changes corresponded to or whether newer ones have since committed too. We retry fsync()s, which turns out to be problematic. I can discuss it with you offlist; it's part of an ongoing investigation.

If the platform gave us a way to say "these writes failed" - identified by offset+size, some counter value, pageno, or whatever - we could use 2-phase eviction from our shared_buffers, where we mark a page "written" and then mark it "flushed" only once fsync on the file completes. Or something along those lines.

Right now we should probably be PANICing if fsync() fails with EIO on a heap write, so we go back and redo from the last checkpoint. I can give you some details out of band.
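The two-phase eviction idea from this comment can be sketched as a small state machine. This is only an illustration with hypothetical names, not PostgreSQL code; since the kernel cannot currently say which writes failed, the sketch assumes that a failed fsync() makes every written-but-unflushed page suspect, and that the buffer pool still holds the contents of those pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum buf_state { BUF_DIRTY, BUF_WRITTEN, BUF_FLUSHED };

struct buf_page {
    enum buf_state state;
};

/* Phase 1: the page was handed to the kernel with write(). */
void buf_mark_written(struct buf_page *pg)
{
    assert(pg != NULL);
    if (pg->state == BUF_DIRTY)
        pg->state = BUF_WRITTEN;
}

/* Phase 2: apply the result of fsync() on the file to all of its pages.
 * Returns the number of pages that must be written again. */
size_t buf_after_fsync(struct buf_page *pages, size_t n, bool fsync_ok)
{
    size_t redo = 0;

    for (size_t i = 0; i < n; i++) {
        if (pages[i].state != BUF_WRITTEN)
            continue;
        if (fsync_ok) {
            pages[i].state = BUF_FLUSHED;
        } else {
            pages[i].state = BUF_DIRTY;     /* assume the write was lost */
            redo++;
        }
    }
    return redo;
}

/* Only pages known to be on stable storage may leave the buffer pool. */
bool buf_can_evict(const struct buf_page *pg)
{
    return pg->state == BUF_FLUSHED;
}
```

With per-write failure information from the platform, only the pages that actually failed would need to return to the dirty state.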

Handling writeback errors

Posted Apr 5, 2017 14:19 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (1 responses)

I think the real issue here is that fsync can lose track of which pages it hasn't successfully flushed. And thus an fsync that's retried isn't guaranteed to fail. Imo *that's* the issue.

Handling writeback errors

Posted Apr 6, 2017 5:11 UTC (Thu) by ringerc (subscriber, #3071) [Link]

It is, but it doesn't sound like something the kernel's going to fix.

From my reading of the relevant bits of POSIX / SUS I could find, we are wrong to rely on fsync() continuing to return EIO until it succeeds in flushing anyway. It's only required that all writes since the last fsync() call be flushed to disk if fsync succeeds. Not that all writes since the last *successful* fsync are on disk.

http://pubs.opengroup.org/onlinepubs/009695399/functions/...

Even if Linux 4.12 added an utterly reliable queue mechanism tomorrow that could retry indefinitely, return EIO until the underlying issue was solved, and wasn't subject to failure under memory pressure, Pg couldn't rely on it appearing in users' systems in the wild for many years. Not to mention other platforms.

So we'd better adapt to the semantics fsync() actually has, not the ones we want it to have.
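In code, adapting to those semantics means treating the first fsync() failure as final rather than retrying until it succeeds. The sketch below is a minimal illustration; the names sync_or_recover() and SYNC_MUST_RECOVER are hypothetical, and what "recover" means (for a database, presumably replaying its log) is left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum sync_result { SYNC_OK, SYNC_MUST_RECOVER };

/*
 * Call fsync() exactly once. A retry could return success even though the
 * failed pages were never written, so any failure is reported as
 * "data may be lost" and the caller must recover by other means.
 */
enum sync_result sync_or_recover(int fd)
{
    if (fsync(fd) == 0)
        return SYNC_OK;

    fprintf(stderr, "fsync failed: %s; not retrying\n", strerror(errno));
    return SYNC_MUST_RECOVER;
}
```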

Handling writeback errors

Posted Apr 5, 2017 12:45 UTC (Wed) by jlayton (subscriber, #31672) [Link] (4 responses)

We won't be able to report exactly which writes failed with this scheme, unless you're fsync'ing after every write. fsync is unfortunately rather coarse-grained.

This will allow us to tell you if any writeback failed since the last fsync, but you do have to be aware that it could have happened in a range that was not touched via the fd you're fsyncing.

The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior.

We had a little discussion about allowing that, but that's a bit more dicey -- we might need that memory to keep the system going. One thing that was proposed was keeping them dirty until the last write fd has closed, or until we need to do writeback in order to reclaim memory.

IOW, if the system has plenty of free memory then we could leave the pages dirty after a failed fsync and allow that to be retried. Doing this right though might require new interfaces of some sort.

For now, I want to focus on just not losing writeback errors and ensuring that they are reported on every write fd that's open. Once that's working better, then I think we should be better positioned to explore some of these other ideas.

Handling writeback errors

Posted Apr 5, 2017 13:37 UTC (Wed) by ringerc (subscriber, #3071) [Link]

> We won't be able to report exactly which writes failed with this scheme, unless you're fsync'ing after every write. fsync is unfortunately rather coarse-grained.

... and anyone who's fsync'ing after each write can use O_SYNC instead, so there's little to be gained there.

> The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior.

Right. Which is why I filed a bug on fsync()'s manpage to document that. Right now it's unclear that fsync clears the error condition on any affected pages when it returns EIO. Bug 194757

> The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior.

Right, I checked 2.6.something and found the behaviour was consistent from back then through to now. It's just not very visible behaviour.

> We had a little discussion about allowing that, but that's a bit more dicey -- we might need that memory to keep the system going. One thing that was proposed was keeping them dirty until the last write fd has closed, or until we need to do writeback in order to reclaim memory.

> IOW, if the system has plenty of free memory then we could leave the pages dirty after a failed fsync and allow that to be retried. Doing this right though might require new interfaces of some sort.

That'd be ideal from my point of view, and fsync could probably get away with returning EAGAIN in this case. Apps would need some way to tell the difference between "writes have been lost" and "writes are not on disk yet but may be recoverable".

However, what's really important is making sure the current behaviour is documented clearly and visibly somewhere devs will actually look.

Handling writeback errors

Posted Apr 5, 2017 13:37 UTC (Wed) by tomik (guest, #93004) [Link] (2 responses)

Marking the pages as not dirty even after a failed fsync() puts the applications (and PostgreSQL in particular) into a rather unfortunate position, though. I mean, how do you ensure data durability in such cases?

What PostgreSQL does is it evicts dirty pages from shared buffers (DB cache) to page cache, and relies on writeback+fsync to push the dirty pages to disk in the background. Once in a while it issues an fsync on the data files (e.g. at the end of a checkpoint, say, once per hour), but at that point it has no idea which pages were actually modified, not to mention that the contents of the pages was evicted from shared buffers.

So, what are the options at this point? The assumption was that we can repeat the fsync (which as you point out is not the case), or shut down the database and perform recovery from WAL (which should be fine, as it's written as O_DIRECT). That might be fair enough, considering such failures are supposed to be fairly rare, though.

Handling writeback errors

Posted Apr 5, 2017 14:46 UTC (Wed) by jlayton (subscriber, #31672) [Link]

Replaying the WAL synchronously sounds like the simplest approach when you get an error on fsync. These are uncommon occurrences for the most part, so having to fall back to slow, synchronous error recovery modes when this occurs is probably what you want to do.

The main thing I'm working on is to better guarantee that you actually get an error when this occurs rather than silently corrupting your data. The circumstances where that can occur require some corner-cases, but I think we need to make sure that it doesn't occur.

Handling writeback errors

Posted Apr 6, 2017 3:00 UTC (Thu) by ringerc (subscriber, #3071) [Link]

Right. We have to PANIC on fsync error during a checkpoint and do crash-recovery.

Fun.

Or alternately we might be able to restart the whole checkpoint by re-reading WAL with the xlogreader and re-applying it directly, in some kind of diminished service mode. But that sounds like asking for trouble for what's supposed to be a rare incident, and I think we'd be better off just crashing.

It's a pity that the kernel doesn't offer us better visibility into what went wrong and when.

I looked into using AIO for this, but the API seems quite bad at making guarantees about flushing, and poorly supported.

Handling writeback errors

Posted Apr 5, 2017 14:05 UTC (Wed) by trondmy (subscriber, #28934) [Link]

It should be possible to set up an instantaneous notification mechanism to allow a listening application to figure out which areas of a given file failed to sync. Such a mechanism might be implemented either through a modification to fanotify, or the addition of a similar but dedicated interface.

Handling writeback errors

Posted Apr 8, 2017 17:04 UTC (Sat) by walex (guest, #69836) [Link]

Oh don't mention PostgreSQL :-).

Stonebraker wrote in 1981 a paper "Operating System Support for Database Management" that listed some very basic things that the kernel could do to make IO better for DBMSes. 35 years later, and not much has happened (for the better at least).

It has taken 15 years for a particularly stupid aspect of page_cluster implementation to get fixed, yet new JavaScript frameworks are released every day :-). "Management" just don't care.

NFS blocking

Posted Apr 6, 2017 5:04 UTC (Thu) by fratti (subscriber, #105722) [Link] (1 responses)

>that would help with NFS blocking indefinitely when the server goes away, for example.

That always seemed a little awkward to me. While I get that NFS wasn't made to be used in settings where you don't have really stable connections, getting a bunch of zombies whenever you accidentally yank the USB Ethernet dongle from your laptop during a backup is certainly not pleasant.

Somewhat related, are FUSE filesystems like sshfs unaffected by this entirely?

NFS blocking

Posted Apr 8, 2017 19:23 UTC (Sat) by jlayton (subscriber, #31672) [Link]

No, fuse's fsync calls filemap_write_and_wait_range, which does an uninterruptible sleep. Almost everything in the kernel that waits on pages to be written out does so with a TASK_UNINTERRUPTIBLE sleep, mostly via wait_on_page_writeback().

In broad strokes, I'd like to see us add wait_on_page_writeback_killable/interruptible, and use those where we can. That does mean looking closely at different places where it's called and making a reasonable judgement of whether it's safe and what to do with errors from being signalled there.