Reconsidering swapping

By Jonathan Corbet
June 7, 2016

"Swapping" is generally considered to be a bit of a dirty word among long-time Linux users, who will often go to considerable lengths to avoid it. The memory-management (MM) subsystem has been designed to facilitate that avoidance whenever possible. Now, though, MM developer Johannes Weiner is suggesting that, in light of recent developments in hardware, swapping deserves another look. His associated patch set includes benchmark results indicating that he may be on to something.

Is swapping still bad?

User-accessible memory on a Linux system is divided into two broad classes: file-backed and anonymous. File-backed pages (or page-cache pages) correspond to a segment of a file on disk; if they do not contain newly written data that has not yet made it back to persistent storage, these pages can be easily reclaimed for other uses. Anonymous pages do not correspond to a file on disk; they hold the run-time data generated and used by a process. Reclaiming an anonymous page requires writing its contents to the swap device.
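The two classes can be seen from user space. The following is a minimal sketch using Python's standard mmap module; the temporary file and the variable names are purely illustrative, and nothing here forces the kernel to actually reclaim anything.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Anonymous mapping: there is no file behind it, so the kernel can only
# evict these pages by writing them to a swap device.
anon = mmap.mmap(-1, PAGE)
anon[:5] = b"hello"
anon_data = bytes(anon[:5])

# File-backed mapping: the pages mirror a region of a file on disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * PAGE)
    path = f.name

fd = os.open(path, os.O_RDWR)
backed = mmap.mmap(fd, PAGE)
backed[:5] = b"world"   # the page now holds data newer than the file
backed.flush()          # written back; the page matches the file again

with open(path, "rb") as f:
    on_disk = f.read(5)

# Clean up the mappings and the illustrative temporary file.
backed.close()
os.close(fd)
anon.close()
os.unlink(path)
```

Once flush() has returned, the file-backed page could be dropped and later reread from the file; the anonymous page has no such home other than swap.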

As a general rule, reclaiming anonymous pages (swapping) is seen as being considerably more expensive than reclaiming file-backed pages. One of the key reasons for this difference is that file-backed pages can be read from (and written to) persistent storage in large, contiguous chunks, while anonymous pages tend to be scattered randomly on the swap device. On a rotating storage device, scattered I/O operations are expensive, so a system that is doing a lot of swapping will slow down considerably. It is far faster to read a bunch of sequentially stored file-backed pages — and, since the file is usually current on disk, those pages may not need to be written at reclaim time at all.

Swapping is so much slower that many administrators try to configure their systems to do as little swapping as possible. At its most extreme, this can involve not setting up a swap device at all; this common practice deprives the kernel of any way to reclaim anonymous pages, regardless of whether that memory could be put to better use elsewhere. An intermediate step is to use the swappiness tuning knob (described on LWN in 2004) to bias the system strongly toward reclaiming file-backed pages. Setting swappiness to zero will cause the kernel to swap only when memory pressure reaches dire levels.
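For readers who want to check how their own system is configured, the knob is exposed as /proc/sys/vm/swappiness. A small sketch for reading it follows; the helper name read_swappiness is made up for this example, and it simply returns None on systems where the file does not exist.

```python
from pathlib import Path


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current swappiness setting, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not a Linux system, no permission, or unexpected file contents.
        return None


if __name__ == "__main__":
    print("vm.swappiness =", read_swappiness())
```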

Johannes starts off his patch set by noting that this mechanism was designed around the characteristics of rotating storage. Anytime the drive used for swapping needed to perform a seek — which would happen often with randomly placed I/O — throughput would drop dramatically. Hence the strong aversion to swapping if it could possibly be avoided. But, Johannes notes, technology has moved on and some of these decisions should be reconsidered:

"With the proliferation of fast random IO devices such as SSDs and persistent memory, though, swap becomes interesting again, not just as a last-resort overflow, but as an extension of memory that can be used to optimize the in-memory balance between the page cache and the anonymous workingset even during moderate load. Our current reclaim choices don't exploit the potential of this hardware."

Not only should the system be more willing to swap out anonymous memory, Johannes claims, but, at times, swapping may well be a better option than reclaiming page-cache pages. That could be true if the swap device is faster than the drives used to hold files; it is also true if the system is reclaiming needed file-backed pages while memory is clogged with unused anonymous pages.

Deciding when to swap

The first step in the patch set is to widen the range of possible settings for the swappiness knob. In current kernels, it can go from zero (no swapping at all if possible) to 100 (reclaim anonymous and file-backed pages equally). Johannes raises the maximum to 200; at that value, the system will strongly favor swapping. That is a possibility nobody has ever wanted before, but fast drives have now made it useful.
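One way to picture the wider range is to treat swappiness as the weight given to the anonymous LRU, with the file-backed LRU getting 200 minus that value. The sketch below assumes exactly that simple weighting and ignores everything else the reclaim code considers; the function names are hypothetical.

```python
def scan_weights(swappiness):
    """Return (anon_weight, file_weight) for a swappiness value in 0..200.

    Assumption for illustration: the anonymous list is weighted by
    swappiness and the file-backed list by (200 - swappiness).
    """
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be between 0 and 200")
    return swappiness, 200 - swappiness


def anon_share(swappiness):
    """Fraction of reclaim pressure aimed at anonymous pages."""
    anon, file = scan_weights(swappiness)
    return anon / (anon + file)


if __name__ == "__main__":
    for value in (0, 60, 100, 200):
        print(value, scan_weights(value), anon_share(value))
```

Under this assumption, 100 yields equal weights for the two lists, which matches the old maximum, while 200 leaves the file-backed list with no weight at all.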

While there may always be a use for knobs like swappiness, the best kind of system is one that tunes itself without the need for administrator intervention. So Johannes goes on to change the mechanism that decides whether to reclaim pages from the anonymous least-recently-used (LRU) list or the file-backed LRU. For each list, he introduces the concept of the "cost" of reclaiming a page from that list; the reclaim code then directs its efforts toward the list that costs the least to reclaim pages from.

The first step is to track the cost of "rotations" on each LRU. The MM code does its best to reclaim pages that are not in active use. This is done by occasionally passing through the list and clearing the "referenced" bit on each page. The pages that are used thereafter will have that bit set again; those that still have the referenced bit cleared on a subsequent scan have not been touched in the meantime. Those pages are the least likely to be missed and are, thus, the first to be reclaimed. Pages which have been referenced, instead, are "rotated" to the head of the list, giving them a period of time before they are again considered for reclaim.
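A toy model may make the rotation idea more concrete. This is a simplified sketch, not the kernel's implementation: it folds the clearing of the referenced bit and the reclaim decision into a single pass, and the Page class and scan() function are invented for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    referenced: bool = False


def scan(lru, nr_to_scan):
    """Scan pages from the tail of the list (the right end of the deque).

    Referenced pages have their bit cleared and are rotated to the head;
    unreferenced pages are reclaimed. Returns (reclaimed_names, rotations).
    """
    reclaimed = []
    rotations = 0
    for _ in range(min(nr_to_scan, len(lru))):
        page = lru.pop()              # oldest page sits at the tail
        if page.referenced:
            page.referenced = False   # it must be touched again to survive
            lru.appendleft(page)      # rotate to the head of the list
            rotations += 1
        else:
            reclaimed.append(page.name)
    return reclaimed, rotations


# Head of the list is on the left, tail on the right.
lru = deque([Page("d"), Page("c", True), Page("b"), Page("a", True)])
reclaimed, rotations = scan(lru, 3)
remaining = [p.name for p in lru]
```

In this run, pages "a" and "c" have been referenced and are rotated, while "b" is reclaimed: two rotations for a single freed page.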

That rotation costs a bit of CPU time. If a particular LRU list has a lot of referenced pages in it, scanning that list will use a relatively large amount of time for a relatively small payoff in reclaimable pages; in this case, the kernel may well be better off scanning the other list, which may have more unused pages. To that end, Johannes's patch set tracks the number of rotated pages and uses it to establish the cost of reclaiming from each list.
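As a rough illustration of that trade-off, one could compare the fraction of scanned pages that ended up being rotated on each list. The ratio used here is an assumption made for the example, not the formula from the patch set, and both function names are hypothetical.

```python
def rotation_ratio(scanned, rotated):
    """Fraction of scanned pages that were rotated rather than reclaimed."""
    return rotated / scanned if scanned else 0.0


def cheaper_list(stats):
    """Pick the LRU list whose scans waste the least effort on rotations.

    stats maps a list name to a (scanned, rotated) pair.
    """
    return min(stats, key=lambda name: rotation_ratio(*stats[name]))


# Illustrative numbers: the anonymous list is mostly in active use.
stats = {"anon": (1000, 900), "file": (1000, 100)}
choice = cheaper_list(stats)
```

With these made-up numbers, nine out of ten anonymous pages are rotated, so the file-backed list is the better place to look.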

While rotation has a cost, that cost pales relative to that of reclaiming a page that will be quickly faulted back into memory — even if it is written to a fast device in the meantime. As it happens, Johannes added a mechanism to track "refaulted" pages back in 2012; it is used in current kernels to determine how large the active working set is at any given time. This mechanism can also tell the kernel whether it is reclaiming too many anonymous or file-backed pages. The final patch in the set uses refault information to adjust the cost of reclaiming from each LRU; if pages taken from one LRU are quickly faulted back in, the kernel will turn its attention to the other LRU instead.

In the current patch set, the cost of a refaulted page is set to be 32 times the cost of a rotated page. Johannes suggests in the comments that this value is arbitrary and may change in the future. For now, the intent is to cause refaults to dominate in the cost calculation, but, he says, there may well be settings where refaults cost less than rotations.
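The weighting can be sketched as follows. Only the 32:1 ratio comes from the patch set as described above; the way the two costs are turned into a split of scanning effort (inversely proportional to cost) is an assumption made for this example, as are the function names.

```python
ROTATION_COST = 1
REFAULT_COST = 32 * ROTATION_COST   # ratio described for the current patch set


def reclaim_cost(rotations, refaults):
    """Combined cost of reclaiming from one LRU list."""
    return ROTATION_COST * rotations + REFAULT_COST * refaults


def scan_split(anon_cost, file_cost):
    """Return (anon_share, file_share) of scanning effort.

    Assumption: effort is directed at each list in inverse proportion
    to its cost, so the more expensive list is scanned less.
    """
    total = anon_cost + file_cost
    if total == 0:
        return 0.5, 0.5
    return file_cost / total, anon_cost / total


# Illustrative numbers: file pages keep refaulting, anonymous pages only rotate.
anon_cost = reclaim_cost(rotations=300, refaults=0)
file_cost = reclaim_cost(rotations=100, refaults=50)
anon_share, file_share = scan_split(anon_cost, file_cost)
```

Here 50 refaults outweigh 300 rotations by a wide margin, so most of the scanning effort shifts to the anonymous list.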

The patch set comes with a number of benchmarks to show its impact on performance. A PostgreSQL benchmark goes from 81 to 105 transactions per second with the patches applied; the refault rate is halved, and kernel CPU time is reduced. A streaming I/O benchmark, which shouldn't create serious memory pressure, runs essentially unchanged. So, as far as Johannes's testing goes, the numbers look good.

Memory-management changes are fraught with potential hazards, though, and it is entirely possible that other workloads will be hurt by these changes. The only way to gain confidence that this won't happen is wider testing and review. This patch set is quite young; there have been some favorable reviews, but that testing has not yet happened. Thus, it may be a while before this code goes anywhere near a mainline kernel. But it has been clear for a while that the MM subsystem is going to need a number of changes to bring its design in line with current hardware; this work may be a promising step in that direction.



Posted Jun 9, 2016 0:55 UTC (Thu) by eduard.munteanu (guest, #66641)

I get the feeling swapping is an instance of a more general concept which could be implemented in a generic fashion. We've also seen things like bcache which address combining and transparently migrating between faster and slower storage. Perhaps they all fit into a unified hierarchy like memory atop SSDs atop rotational disks, which takes volatility, direct addressability and access times into account.

Posted Jun 9, 2016 13:50 UTC (Thu) by pj (subscriber, #4506)

My similar thought was wondering how/if this would apply to some of the new nonvolatile RAM hardware.

Posted Jun 9, 2016 5:09 UTC (Thu) by Homer512 (subscriber, #85295)

I've used high swappiness values for years now, even on conventional HDDs. There are just too many programs that leak memory or keep stuff allocated that they never need. For example, the Eclipse IDE imports tons of Java classes, and those are never garbage-collected even though they are rarely if ever used. The X server can also be pretty leaky.

I'd rather swap this memory out and use the free memory as disk cache than reboot or periodically restart applications.

Posted Jun 12, 2016 19:26 UTC (Sun) by giraffedata (guest, #1954)

> I've used high swappiness values for years now, even on conventional HDDs. There are just too many programs that leak memory or keep stuff allocated that they never need.
> ...
> I'd rather swap this memory out and use the free memory as disk cache than reboot or periodically restart applications.

I think you're confusing two dichotomies. That unreferenced memory will get swapped out regardless of swappiness, as long as it is greater than zero. So all you have to do to reclaim that memory without rebooting is have enough swap space that it never fills up.

Swappiness affects how quickly that memory gets reclaimed, i.e. how long it takes up space before Linux realizes it is leaked memory that will never be accessed again. If your concern is only leaked memory, then you should set swappiness low to avoid swapping out memory that is just accessed infrequently - the leaked memory's going to get swapped anyway.

Swappiness really needs to be set according to the cost of writing out.

Posted Jun 9, 2016 18:40 UTC (Thu) by flussence (guest, #85566)

This sounds like it'd work well with existing swap-on-zram setups too, as a sort of inverse of the existing swap prefetch mechanism. It's better than spinning disk I/O, but there are still noticeable latency spikes when a large chunk of resident memory needs freeing up in a hurry.

Posted Jun 10, 2016 4:07 UTC (Fri) by gwolf (subscriber, #14632)

...There's the opposite comment as well. I have often read (and given!) the recommendation _not_ to use solid-state devices as swap devices because of the usage patterns: memory-mapped files usually have a relatively low modification rate, so the pages are mostly clean (or "just" have to be flushed periodically), but raw memory sent to the backing store is much less likely to stay stable; pages are often far smaller than flash cells (say, 4KB vs. 8MB, or two thousand memory pages per flash cell), so most updates that go through a FTL mean just copying over the same 2047 pages (plus a little modification on one) to a new cell for any modification. This, of course, means a much shortened MTBF for solid-state devices.

Am I stuck in a years-old view of SSDs, MTDs and similar beasts?

Posted Jun 10, 2016 5:48 UTC (Fri) by matthias (subscriber, #94967)

I have used SSDs for swap for several years and have never had an issue. Looking at the numbers (actual LBAs written vs. what the device is expected to handle), I am on the safe side. So for me it works, and it is much faster.

I have no idea how the swap algorithm works, but in theory, the flash cells should not be much of a problem for swap. If a page gets changed, it does not need to be stored in the same place as before. The algorithm can just write to previously trimmed space. So read-modify-write cycles should only be necessary when the swap is almost full or heavily fragmented. Even if the swap algorithm is not SSD-aware, it will usually not swap single pages. If space gets tight with several GB of RAM, then freeing single 4KB pages will not help much.

Posted Jun 10, 2016 18:52 UTC (Fri) by flussence (guest, #85566)

I can't see any corresponding code for that on the kernel side, but `swapon` seems to distinguish between SSD and rotational devices. It may just be a mislabelled "supports discard" flag, though.

Posted Jun 11, 2016 0:13 UTC (Sat) by BenHutchings (subscriber, #37955)

> most updates that go through a FTL mean just copying over the same 2047 pages (plus a little modification on one) to a new cell for any modification.

The erase-blocks for NAND may be that large, but the pages (the smallest units that can be read or written at once) are typically only a few kilobytes. A smart FTL will map linear addresses in units of the page size, not the erase-block size. This leads to single erase-blocks having some used and some free pages, and from time to time those partially-used erase-blocks have to be rewritten to reclaim the free pages. But the average write amplification really won't be nearly as high as 2000 times. You should be able to get some idea of the write amplification factor by comparing random write vs. linear write benchmarks.

Posted Jun 11, 2016 13:34 UTC (Sat) by barryascott (subscriber, #80640)

For a three-year-old Toshiba SSD that I have specs for, it looks like about 7 times amplification.
The family is THNSNJ/THNSFJ.

The JESD219A numbers for a 1TB drive are 172TiB written for the random workload and 1228TiB written for the client workload.
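
The arithmetic behind that estimate, as a quick sketch. It assumes that the ratio of the two endurance ratings is a fair stand-in for the extra wear caused by random writes, and it compares the result against the 4KB-page, 8MB-cell worst case raised earlier in the thread; the variable names are made up.

```python
# Endurance ratings quoted above for the 1TB drive, in TiB written.
RANDOM_TIB = 172
CLIENT_TIB = 1228

# Assumption: the drive wears out this many times faster under random writes.
amplification = CLIENT_TIB / RANDOM_TIB

# Worst case from earlier in the thread: every 4KB write rewrites an 8MB cell.
worst_case = (8 * 1024 * 1024) // (4 * 1024)

print(f"estimated amplification: {amplification:.1f}x")
print(f"feared worst case:       {worst_case}x")
```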

Posted Jun 16, 2016 13:49 UTC (Thu) by Wol (subscriber, #4433)

> This, of course, means a much shortened MTBF for solid-state devices.

> Am I stuck in a years-old view of SSDs, MTDs and similar beasts?

http://techreport.com/review/27909/the-ssd-endurance-expe...

tl;dr summary - you're unlikely to wear out a modern SSD, even with heavy usage. They tested, iirc, about 6 SSDs in a massive torture test. All the drives lasted way longer than their rated life (which was about 3yrs-worth of hammering), and the last few drives only failed when the test rig suffered a power outage. The test, iirc, was basically a continuous cycle of writing, reading, and wiping the drive as fast as the bus could cope.

So using an SSD as a swap drive will probably last 5 to 10 years no problem ... :-)

Cheers,
Wol

Posted Aug 16, 2019 9:55 UTC (Fri) by mikegav (guest, #128597)

Why were the patches to balance the LRU lists based on relative thrashing declined?
I am asking because I have the same situation: a fast SSD for swap and a slow HDD for all data.
If I understand correctly, the patch optimizes the page-cache size when memory is short.
Currently, in such cases, the system is left without caching, which causes a great drop in I/O performance (the HDD starts to rustle continuously).

I would happily help to test these patches for further adoption in mainline.

Thanks.