Buffered I/O without page-cache thrashing
Direct I/O can give better performance than buffered I/O in a couple of ways. One of those is simply avoiding the cost of copying the data between user space and the page cache; that cost can be significant, but in many cases it is not the biggest problem. The real issue may be the effect of buffered I/O on the page cache.
A process that performs large amounts of buffered I/O spread out over one or more large (relative to available memory) files will quickly fill the page cache (and thus memory) with cached file data. If the process in question does not access those pages after performing I/O, there is no benefit to keeping the data in memory, but it's there anyway. To be able to allocate memory for other uses, the kernel will have to reclaim some pages from somewhere. That can be expensive for the system as a whole, even if "somewhere" is the data associated with this I/O activity.
The memory-management subsystem tries to do the right thing in this situation. Pages added to the cache via buffered I/O go onto the inactive list; unless they are accessed a second time in the near future, they will be the first pages to be kicked back out. But there is still a fair amount of overhead associated with implementing this behavior; Axboe ran a simple test and described the results this way:
This kind of problem can be avoided by switching to direct I/O, but that brings challenges and problems of its own. Axboe has concluded that there may be a third way that can provide the best of both worlds.
That third way is a new flag, RWF_UNCACHED, which is provided to the preadv2() and pwritev2() system calls. If present, this flag changes the requested I/O operation in two ways, depending on whether the affected file pages are currently in the page cache or not. When the data is present in the page cache, the operation proceeds as if the RWF_UNCACHED flag were not present; data is copied to or from the pages in the cache. If the pages are absent, instead, they will be added to the page cache, but only for the duration of the operation; those pages will be removed from the page cache once the operation completes.
The result, in other words, is buffered I/O that does not change the state of the page cache; whatever was present there before will still be there afterward, but nothing new will be added. I/O performed in this way will gain most of the benefits of buffered I/O, including ease of use and access to any data that is already cached, but without filling memory with unneeded cached data. The result, Axboe says, is readily observable:
This new flag thus seems like a significant improvement for a variety of workloads. In particular, workloads where it is known that the data will only be used once, or where the application performs its own caching in user space, may well benefit from running with the RWF_UNCACHED flag.
The implementation of this new behavior is not complicated; the entire patch set (which also adds support to io_uring) involves just over 200 lines of code. Of course, as Dave Chinner pointed out, there is something missing: all of the testing infrastructure needed to ensure that RWF_UNCACHED behaves as expected and does not corrupt data. Chinner also noted some performance issues in the write implementation, suggesting that an entire I/O operation should be flushed out at a time rather than the page-at-a-time approach taken in the original patch set. Axboe has already reworked the code to address that problem; the boring work of writing tests and precisely documenting semantics will follow at some future point.
If RWF_UNCACHED proves to work as well in real-world workloads, it
may eventually be seen as one of those things that somebody should have
thought of many years ago. Things often turn out this way. Solving the
problem isn't hard; the hard part is figuring out which problem needs to be
solved in the first place. That, and writing tests and documentation, of
course.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Page cache |