Reservations for must-succeed memory allocations

By Jonathan Corbet
March 17, 2015

LSFMM 2015

When the schedule for the 2015 Linux Storage, Filesystem, and Memory Management Summit was laid out, its authors optimistically set aside 30 minutes on the first day for the thorny issue of memory-allocation problems in low-memory situations. That session (covered here) didn't get past the issue of whether small allocations should be allowed to fail, so the remainder of the discussion, focused on finding better solutions for the problem of allocations that simply cannot fail, was pushed into a plenary session on the second day.

Michal Hocko started off by saying that the memory-management developers would like to deprecate the __GFP_NOFAIL flag, which is used to mark allocation requests that must succeed at any cost. But doing so, it turns out, just drives developers to put infinite retry loops into their own code rather than using the allocator's version. That, he noted dryly, is not a step forward. Retry loops spread throughout the kernel are harder to find and harder to fix, and they hide the "must succeed" nature of the request from the memory-management code.
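
As a rough illustration of that last point, consider the following toy model in Python. It is not kernel code: the `ToyAllocator` class, its `visible_nofail_requests` counter, and the `open_coded_retry()` helper are all invented for this sketch. Both callers eventually get their page, but only the flagged request is visible to the allocator as one that cannot fail.

```python
class ToyAllocator:
    """Toy allocator whose first `failures_before_success` attempts fail."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        # Must-succeed requests that the allocator itself knows about.
        self.visible_nofail_requests = 0

    def _try_once(self):
        if self.failures_left > 0:
            self.failures_left -= 1
            return None            # simulated allocation failure
        return bytearray(4096)     # simulated page

    def alloc(self, nofail=False):
        if not nofail:
            return self._try_once()
        # The flag tells the allocator that this request cannot fail, so it
        # can count it (and, in a real system, could treat it specially).
        self.visible_nofail_requests += 1
        while True:
            page = self._try_once()
            if page is not None:
                return page


def open_coded_retry(allocator):
    """Caller-side infinite retry loop; the allocator only sees ordinary requests."""
    while True:
        page = allocator.alloc()
        if page is not None:
            return page


hidden = ToyAllocator(failures_before_success=3)
open_coded_retry(hidden)
flagged = ToyAllocator(failures_before_success=3)
flagged.alloc(nofail=True)
print(hidden.visible_nofail_requests, flagged.visible_nofail_requests)  # prints "0 1"
```

In this model the open-coded loop looks, from the allocator's side, like four unrelated ordinary requests, three of which simply failed.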

Getting rid of those loops is thus, from the point of view of the memory-management developers, a desirable thing to do. So Michal asked the gathered developers to work toward their elimination. Whenever such a loop is encountered, he said, it should just be replaced by a __GFP_NOFAIL allocation. Once that's done, the next step is to figure out how to get rid of the must-succeed allocation altogether. Michal has been trying to find ways of locating these retry loops automatically, but attempts to use Coccinelle to that end have shown that the problem is surprisingly hard.

Johannes Weiner mentioned that he has been working recently to improve the out-of-memory (OOM) killer, but that goal proved hard to reach as well. No matter how good the OOM killer is, it is still based on heuristics and will often get things wrong. The fact that almost everything involved with the OOM killer runs in various error paths does not help; it makes OOM-killer changes hard to verify.

The OOM killer is also subject to deadlocks. Whenever code requests a memory allocation while holding a lock, it is relying on there being a potential OOM-killer victim task out there that does not need that particular lock. There are some workloads, often involving a small number of processes running in a memory control group, where every task depends on the same lock. On such systems, a low-memory situation that brings the OOM killer into play may well lead to a full system lockup.

Rather than depend on the OOM killer, he said, it is far better for kernel code to ensure that the resources it needs are available before starting a transaction or getting into some other situation where things cannot be backed out. To that end, there has been talk recently of creating some sort of reservation system for memory. Reservations have downsides too, though; they can be more wasteful of memory overall. Some of that waste can be reduced by placing reclaimable pages in the reserve; that memory is in use, but it can be reclaimed and reallocated quickly should the need arise.

James Bottomley suggested that reserves need only be a page or so of memory, but XFS maintainer Dave Chinner was quick to state that this is not the case. Imagine, he said, a transaction to create a file in an XFS filesystem. It starts with allocations to create an inode and update the directory; that may involve allocating memory to hold and manipulate free-space bitmaps. Some blocks may need to be allocated to hold the directory itself; it may be necessary to work through 1MB of stuff to find the directory block that can hold the new entry. Once that happens, the target block can be pinned.

This work cannot be backed out once it has begun. Actually, it might be possible to implement a robust back-out mechanism for XFS transactions, but it would take years and double the memory requirements, making the actual problem worse. All of this is complicated by the fact that the virtual filesystem (VFS) layer will have already taken locks before calling into the filesystem code. It is not worth the trouble to implement a rollback mechanism, he said, just to be able to handle a rare corner case.

Since the amount of work required to execute the transaction is not known ahead of time, it is not possible to preallocate all of the needed memory before crossing the point of no return. It should be possible, though, to produce a worst-case estimate of memory requirements and set aside a reserve in the memory-management layer. The size of that reserve, for an XFS transaction, would be on the order of 200-300KB, but the filesystem would almost never use it all. That memory could be used for other purposes while the transaction is running as long as it can be grabbed if need be.

XFS has a reservation system built into it now, but it manages space in the transaction log rather than memory. The amount of concurrency in the filesystem is limited by the available log space; on a busy system with a large log, Dave has seen 7,000 to 8,000 transactions active at once. The reservation system works well and is already generating estimates of the amount of space required; all that is needed is to extend it to memory.

A couple of developers raised concerns about the rest of the I/O stack; even if the filesystem knows what it needs, it has little visibility into what the lower I/O layers will require. But Dave replied that these layers were all converted to use mempools years ago; they are guaranteed to be able to make forward progress, even if it's slow. Filesystems layered on top of other filesystems could add some complication; it may be necessary to add a mechanism where the lower-level filesystem can report its worst-case requirement to the upper-level filesystem.
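
The forward-progress guarantee comes from the way a mempool works: it first tries an ordinary allocation, falls back to a preallocated stash of elements if that fails, and sleeps until an element is returned if the stash is empty. A minimal user-space sketch of that behavior in Python might look like the following. `ToyMempool`, its constructor arguments, and the `timeout` parameter (added so the sketch can be exercised without blocking forever) are assumptions made for illustration; they do not mirror the kernel's mempool API.

```python
import threading


class ToyMempool:
    """Toy model of a mempool: ordinary allocation first, reserved elements second."""

    def __init__(self, min_nr, try_alloc, make_element):
        self._min_nr = min_nr
        self._try_alloc = try_alloc  # non-waiting allocator: returns an element or None
        # Elements set aside up front, while memory is still plentiful.
        self._spare = [make_element() for _ in range(min_nr)]
        self._cond = threading.Condition()

    def alloc(self, timeout=None):
        # 1. Try the ordinary allocator; this step never waits.
        element = self._try_alloc()
        if element is not None:
            return element
        with self._cond:
            # 2. Fall back to the stash, or 3. wait until free() refills it.
            if not self._cond.wait_for(lambda: len(self._spare) > 0, timeout):
                return None  # only reachable when a timeout was given
            return self._spare.pop()

    def free(self, element):
        with self._cond:
            if len(self._spare) < self._min_nr:
                # Refill the stash first and wake one waiter.
                self._spare.append(element)
                self._cond.notify()
            # Otherwise the element would go back to the ordinary allocator.
```

As long as every user of the pool eventually frees what it took, a waiter in `alloc()` is guaranteed to be woken, which is the "slow but forward" progress described above.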

The reserve would be maintained by the memory-management subsystem. Prior to entering a transaction, a filesystem (or other module with similar memory needs) would request a reservation for its worst-case memory use. If that memory is not available, the request will stall at this point, throttling the users of reservations. Thereafter, a special GFP flag would indicate that an allocation should dip into the reserve if memory is tight. There is a slight complication around demand paging, though: as XFS is reading in all of those directory blocks to find a place to put a new file, it will have to allocate memory to hold them in the page cache. Most of the time, though, the blocks are not needed for any period of time and can be reclaimed almost immediately; these blocks, Dave said, should not be counted against the reserve. Actual accounting of reserved memory should, instead, be done when a page is pinned.
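
The reserve-then-stall behavior described here can be modeled in a few lines of user-space Python. This is only a sketch under stated assumptions: no such interface existed at the time of the session, and the `ReservePool` and `Reservation` names, the page-based units, and the `timeout` parameter are all hypothetical.

```python
import threading


class ReservePool:
    """Toy model of a memory reserve shared by all reservation users."""

    def __init__(self, total_pages):
        self._free = total_pages
        self._cond = threading.Condition()

    def reserve(self, pages, timeout=None):
        """Stall until `pages` can be set aside; return None if `timeout` expires."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._free >= pages, timeout):
                return None
            self._free -= pages
            return Reservation(self, pages)

    def _give_back(self, pages):
        with self._cond:
            self._free += pages
            # Wake any transactions throttled in reserve().
            self._cond.notify_all()


class Reservation:
    """Worst-case reservation held for the duration of one transaction."""

    def __init__(self, pool, pages):
        self._pool = pool
        self.pages = pages

    def release(self):
        """Return the reservation once the transaction has completed."""
        self._pool._give_back(self.pages)
        self.pages = 0
```

A filesystem would call `reserve()` before crossing its point of no return and `release()` after the transaction commits; the throttling happens in `reserve()`, before any locks are taken or any work has been done.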

Johannes pointed out that all reservations would be managed in a single, large pool. If one user underestimates their needs and allocates beyond their reservation, it could ruin the guarantees for all users. Dave answered that this eventuality is what the reservation accounting is for. The accounting code can tell when a transaction overruns its reservation and put out a big log message showing where things went wrong. On systems configured for debugging it could even panic the system, though one would not do that on a production system, of course.
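
The accounting Dave described might, in rough outline, look like the following Python sketch. The `ReservationAccount` class and its `panic_on_overrun` switch are hypothetical names chosen for illustration; a raised exception stands in for a kernel panic and a logged warning for the "big log message."

```python
import logging

log = logging.getLogger("toy_reserve")


class ReservationAccount:
    """Toy per-transaction accounting of pages charged against a reservation."""

    def __init__(self, owner, reserved_pages, panic_on_overrun=False):
        self.owner = owner
        self.reserved = reserved_pages
        self.used = 0
        self.overrun = False
        self.panic_on_overrun = panic_on_overrun  # debugging configurations only

    def charge(self, pages=1):
        """Charge pages to the reservation, for example when a page is pinned."""
        self.used += pages
        if self.used > self.reserved:
            self.overrun = True
            message = "%s overran its reservation: used %d of %d pages" % (
                self.owner, self.used, self.reserved)
            if self.panic_on_overrun:
                raise RuntimeError(message)  # stand-in for panicking the system
            log.warning(message)             # stand-in for the log message
```

The point of the sketch is that an underestimate is detected and attributed to a specific owner at the moment it happens, rather than surfacing later as a broken guarantee for some unrelated user of the pool.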

The handling of slab allocations brings some challenges of its own. The way forward there seems to be to assume that every object allocated from a slab requires a full page allocation to support it. That adds a fair amount to the memory requirements — an XFS transaction can require as many as fifty slab allocations.
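
To put a number on that, here is the arithmetic as a short Python sketch. The 4KB page size is an assumption (it is the common size on x86 systems), and `worst_case_slab_reserve()` is a name invented for this example.

```python
PAGE_SIZE = 4096  # assumption: 4KB pages


def worst_case_slab_reserve(slab_allocations, page_size=PAGE_SIZE):
    """Bytes to reserve if every slab object is assumed to need a full page."""
    return slab_allocations * page_size


reserve_bytes = worst_case_slab_reserve(50)
print(reserve_bytes // 1024, "KB")  # prints "200 KB"
```

Under that assumption, the fifty slab allocations alone account for 200KB, which is the same order of magnitude as the per-transaction reserve size mentioned earlier.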

Many (or most) transactions will not need to use their full reservation to complete. Given that there may be a large number of transactions running at any given time, it was suggested, perhaps the kernel could get away with a reservation pool that is smaller than the total number of pages requested in all of the active reservations. But Dave was unenthusiastic, describing this as another way of overcommitting memory that would lead to problems eventually.

Johannes worried that a reservation system would add a bunch of complexity to the system. And, perhaps, nobody will want to use it; instead, they will all want to enable overcommitting of the reserve to get their memory and (maybe) performance back. Ted Ts'o also thought that there might not be much demand for this capability; in the real world, deadlocks related to low-memory conditions are exceedingly rare. But Dave said that the extra complexity should be minimal; XFS, in particular, already has almost everything that it needs.

Ted insisted, though, that this work is addressing a corner case; things work properly, he said, 99.9999% of the time. Do we really want to add the extra complexity just to make things work better on under-resourced systems? Ric Wheeler responded that we really shouldn't have a system where unprivileged users can fire off too much work and crash the box. Dave agreed that such problems can, and should, be fixed.

Even if there is a reserve, Ted said, administrators will often turn it off in order to eliminate the performance hit from the reservation system (which he estimated at 5%); they'll do so with local kernel patches if need be. Dave agreed that it should be possible to turn the reservation system off, but doubted that there would be any significant runtime impact. Chris Mason agreed, saying that there is no code yet, so we should not assume that it will cause a performance hit. Dave said that the real effect of a reservation would be to move throttling from the middle of a transaction to the beginning; the throttling happens in either case. James was not entirely ready to accept that, though; in current systems, he said, we usually muddle through a low-memory situation, while with a reservation we will be actively throttling requests. Throughput could well suffer in that situation.

The only reliable way to judge the performance impact of a reservation system, though, will be to observe it in operation; that will be hard to do until this system is implemented. Johannes closed out the session by stating the apparent consensus: the reservation system should be implemented, but it should be configurable for administrators who want to turn it off. So the next step is to wait for the patches to show up.

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 17:23 UTC (Tue) by amonnet (guest, #54852) [Link] (2 responses)

Reservation looks like a lot of dirty work for an unknown result.

Keeping it dirty, but much simpler: why not reserve a fixed amount of memory that can be used in the failing use case (i.e., allocation while the OOM killer is waiting for a process to be killed)?

+++

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 22:11 UTC (Tue) by fandingo (guest, #67019) [Link] (1 responses)

This seems like the preferable solution. Make a hard reservation of emergency memory that cannot be allocated. Give it a kernel parameter, so the administrator can set the number of reserved pages. When the OOM killer is invoked, it can dip into this reservation if needed, and as it decides which processes to kill, it can tell the allocator that specific processes (and system calls on their behalf) have access to this memory.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 14:23 UTC (Wed) by mm7323 (subscriber, #87386) [Link]

I was thinking something similar - that there should be a reservation area for these occasions. That memory doesn't have to be totally ring-fenced and idle though - just trivially reclaimable without need for more allocations, locks or IO. Things like read-only pages of mmap()'d files (e.g. text and rodata segments of running programs) which could be re-read from disk later when needed, or any pages which have already been flushed to disk and are clean (e.g. write through caches) could be accounted as in the reservation area already.

In general there's probably enough stuff floating around in the system that there would always be a sizeable reservation area, but the 99.999% occasion could still be problematic, so an API would be needed to check that the reservation area is at least a certain size before XFS or other things go off on a path of no return. The reservation request function could have a blocking variant (which tries to increase the reservation area to meet demand if needed, or waits for other reservation users to complete), or return a failure which could be propagated back to userspace well before any critical actions have taken place in the caller; e.g., open() might return ENOMEM if the reservation area isn't sufficiently large to meet the demands needed to ensure that the system call can progress in the worst case. After a critical operation completes, the reservation area request should be released.

Some other API changes may be needed so that a caller can request pages that use the reservation area, and book-keeping to ensure callers don't request more from the reservation than they have previously 'reserved' would be prudent.

Finally, I was also thinking that a simple swap device could also help. Simple swap would mean that pages can be read and written trivially without calling memory allocators or introducing lock dependencies. If a block device could indicate that it was 'simple swap' compatible according to these requirements, then any of its free space could be accounted to the 'reservation area' by enabling dirty pages to be swapped out without ending up in the circular allocation and lock dependency battles which seem to be the cause of all these problems. Directly accessed partitions on a locally attached hard-drive, or zram could probably be made to flag as 'simple swap' compatible.

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 18:04 UTC (Tue) by post-factum (subscriber, #53836) [Link] (2 responses)

Really, reservations are some kind of dirty hack. Instead of fixing the issue (introducing reliable roll-back mechanisms), they bring weird workarounds.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 1:15 UTC (Wed) by roblucid (guest, #48964) [Link] (1 responses)

Why is reservation intrinsically "dirty"? Rollback would involve a memory penalty and a considerable change in FS code; I don't see why that is objectively a better solution. Pre-allocation of resources for key work is not, IMO, "dirty"; reservation softens the resource usage by allowing the pages to be used for other purposes that allow rapid reclaim.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 13:41 UTC (Thu) by CChittleborough (subscriber, #60775) [Link]

Using rollback also means having lots of largely unused code paths that have to do some tricky work and get it exactly right in all circumstances. This strikes me as asking for trouble. Reservations are much easier to get right, and much more testable.

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 18:57 UTC (Tue) by josh (subscriber, #17465) [Link] (5 responses)

> A couple of developers raised concerns about the rest of the I/O stack; even if the filesystem knows what it needs, it has little visibility into what the lower I/O layers will require. But Dave replied that these layers were all converted to use mempools years ago; they are guaranteed to be able to make forward progress, even if it's slow.

Does this mean that GFP_NOIO and similar are obsolete?

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 19:55 UTC (Tue) by dlang (guest, #313) [Link]

and does "all layers" include things like iSCSI that talk over the network and so include anything that can happen in the network stack?

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 20:41 UTC (Tue) by neilbrown (subscriber, #359) [Link] (3 responses)

> Does this mean that GFP_NOIO and similar are obsolete?

There is some subtlety here...

Superficially: no, mempools don't make GFP_NOIO obsolete. They protect against different things.

GFP_NOIO is all about the locks. GFP_NOIO is used when allocating while holding a lock that might be taken during "IO". Reclaim to satisfy such an allocation must not perform "IO" as that could block on a lock that is already held, resulting in a deadlock.

mempools are about which actively used memory you are willing to wait for to become freed. So a mempool allocation does a normal memory allocation which may initiate reclaim and fs-writeback and IO etc. But it will not wait for memory to become free. If nothing is available, it will then use something in the pool, or wait for a previous pool allocation to be returned.
So it is about waiting for memory, not waiting for locks.

However ... direct reclaim doesn't do IO any more at all - it just kicks kswapd and lets it do all the reclaim and IO. So it is possible that GFP_NOIO is obsolete, but not because of mempool.
And it is only a maybe. The change to avoid direct reclaim will have made the role of GFP_NOIO quite different, but I would need to examine code carefully before proclaiming that it was dead.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 1:02 UTC (Thu) by Paf (subscriber, #91811) [Link] (2 responses)

I think I don't understand why kswapd doing the work instead changes things - the work must still be done, and may still involve locks the process which needs memory is holding, right?

In an entirely real example (I've debugged it), kswapd can call shrinkers (in at least some cases registered by file systems) which attempt to clear caches which (if the file system was asking for the memory) can require locks which are not available.

In our specific case (Lustre), we were actually spawning threads rather than allocating memory directly, so we had to spawn our threads without the relevant locks held... But when directly allocating memory, we must be careful to, most of the time, use GFP_NOFS. That's not GFP_NOIO, of course, but I don't see a fundamental difference.

Have I missed something here?

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 1:04 UTC (Thu) by Paf (subscriber, #91811) [Link]

I should clarify: we were spawning threads, which required memory, causing kthreadd to call kswapd.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 1:39 UTC (Thu) by neilbrown (subscriber, #359) [Link]

> Have I missed something here?

No, I was.

kswapd does all the "writeback to filesystems", but direct reclaim can still call the shrinkers.
So GFP_NOFS and GFP_NOIO are still needed (at least) so shrinkers can decide if it is safe to take various locks.

Thanks.

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 20:29 UTC (Tue) by neilbrown (subscriber, #359) [Link] (23 responses)

> Michal Hocko started off by saying that the memory-management developers would like to deprecate the __GFP_NOFAIL flag

All of the mm developers, or just some? And do we know why?

> Ted insisted, though, that this work is addressing a corner case; things work properly, he said, 99.9999% of the time.

So team-red says "It's broken, we need to fix it", and team blue says "it ain't broke, don't fix it".
It seems that discussing a solution might be premature - more airtime needed on the problem?

> There are some workloads, often involving a small number of processes running in a memory control group, where every task depends on the same lock.

I suspect this is the elephant in the room - memory control groups. Things work properly 99.9999% of the time .... when you aren't using control groups!?!

So the proposal is to rewrite some filesystems to make implementing memory control groups easier. Did I get that right?

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 20:36 UTC (Tue) by dlang (guest, #313) [Link]

> So team-red says "It's broken, we need to fix it", and team blue says "it ain't broke, don't fix it".

I think that team blue isn't saying "it ain't broke" but rather "the fix is worse than the problem"

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 22:59 UTC (Tue) by nix (subscriber, #2304) [Link] (18 responses)

I have seen deadlocks on non-memcg memory-constrained systems when memory runs out. It *can* happen, generally when you have a few big processes eating lots of memory, then doing some simultaneous I/O by mischance and all blocking on an allocation down in the fs layer (metadata allocation is where I've seen it). The oom-killer kicks into action and slaughters the little processes that are lying around (since it can't slaughter the big ones), whereupon one of the big processes fires up, eats that memory too, blocks again... and then everything deadlocks.

Often (the vast majority of the time, I expect) you're lucky and the big process trips the oom-killer while it's doing other work in the middle of that big I/O (few processes do solid metadata-heavy I/O all the time), but that's *luck*, not judgement. And I don't much like relying on luck to keep my systems from deadlocking! :) particularly given that this sort of situation seems like something it wouldn't be *all* that terribly hard to engineer. It's not like the various contending processes need to run in different privilege domains or anything.

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 23:02 UTC (Tue) by dlang (guest, #313) [Link] (3 responses)

well, unless there are no limits on the amount of memory that the processes are allowed to use, they won't be able to run the system completely out of memory to trigger the problem.

or am I missing something here?

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 23:37 UTC (Tue) by neilbrown (subscriber, #359) [Link] (2 responses)

I'm having a bit of trouble parsing what you wrote, so please forgive me if I misunderstand.
But I think you are suggesting that a memory-constrained process cannot run the whole system out of memory and so cannot cause problems - is that right?

That perspective misses the point. The problem isn't exactly being out of memory. The problem is memory allocation requests failing or blocking indefinitely. A memory-constrained process can have a memory allocation fail even when the system as a whole has plenty of free memory. If the code which makes that failing request isn't written to expect that behaviour, it could easily cause further problems.

There is a lot of complexity and subtlety in the VM to try to keep memory balanced between different needs, and to avoid deadlocks and maintain liveness. For memory cgroups to impose limits on in-kernel allocations, it needs to replicate all that subtlety inside the memcg system. Certainly that should be possible, but I doubt it would be easy.

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 23:56 UTC (Tue) by dlang (guest, #313) [Link] (1 responses)

I was responding to the portion that seemed to be implying that the problem could be caused by an unprivileged user, or a user constrained within a container/VM

As long as the overall system isn't out of memory, the fact that a user/container/vm is using all the memory it's allowed shouldn't cause this sort of problem for things outside of that user/container/vm

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 11:04 UTC (Wed) by dgm (subscriber, #49227) [Link]

It concerns me deeply that this situation happens at the FS level. If a situation arises where failure is not an option and progress cannot be made, the logical conclusion is filesystem corruption that a constrained user/vm can trigger at will.

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 23:28 UTC (Tue) by neilbrown (subscriber, #359) [Link] (13 responses)

> I have seen deadlocks on non-memcg memory-constrained systems when memory runs out.

yes, I have too. In those cases they were removed by relatively simple code fixes.

While there are some common patterns, each deadlock is potentially quite different.

Without looking at the precise details of a particular deadlock, you cannot know what sort of approach might be needed to ensure it never happens again.

So saying "I've seen deadlocks" is like saying "there are bugs". Undoubtedly true, but not very helpful.

Whether there are deadlocks that can only (or most easily) be fixed by new memory reservation schemes is the important question. It is one that can only be answered by careful analysis of lots of details.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 15:30 UTC (Wed) by vbabka (subscriber, #91706) [Link] (12 responses)

>> I have seen deadlocks on non-memcg memory-constrained systems when memory runs out.

>yes, I have too. In those cases they were removed by relatively simple code fixes.

>While there are some common patterns, each deadlock is potentially quite different.

>Without looking at the precise details of a particular deadlock, you cannot know what sort of approach might be needed to ensure it never happens again.

>So saying "I've seen deadlocks" is like saying "there are bugs". Undoubtedly true, but not very helpful.

Yes, in some cases the fix is simple. But AFAIU in general it's not feasible for OOM killer to know which task is holding which locks (without the kind of overhead that enabling lockdep has), so it's not possible to guarantee it will select victims in a way that guarantees forward progress.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 22:13 UTC (Wed) by neilbrown (subscriber, #359) [Link] (11 responses)

> But AFAIU in general it's not feasible for OOM killer to know which task is holding which locks

What I keep wondering is why this matters so much.
Once the OOM killer has identified a process and sent it SIGKILL, why not just pro-actively unmap all its user-space memory. That should immediately resolve the memory problems, and the shell of the old process can be left to sort itself out as locks become available.

I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 22:26 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

I remember that it has something to do with the threads. A signal must be delivered to all the threads, some of which are quite possibly blocked inside the kernel space.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 23:31 UTC (Wed) by nix (subscriber, #2304) [Link] (5 responses)

If the process is being SIGKILLed, the process cannot receive the signal anyway, so there's no need to queue it and no need to do anything with its userspace component. You should just be able to tear it down, then let the kernel side unwind itself up to the syscall level and then go away. I too don't see why this isn't practical.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 23:31 UTC (Wed) by nix (subscriber, #2304) [Link] (1 responses)

I meant, of course, 'cannot *catch* the signal anyway'.

I clearly need to go to sleep...

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 1:12 UTC (Thu) by Paf (subscriber, #91811) [Link]

Two problems.

Uninterruptible sleeping, and sleeping with sigkill blocked. Doing either one in a syscall means the process won't act on sigkill until it is woken up. I believe when sleeping uninterruptibly, sigkill is ignored. (I'm pretty sure.)

One particularly fun thing in multi-threaded systems I've actually seen: The intended waker is killed and the sleeper is now unwakeable and unkillable.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 0:03 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Kernel threads might be reading memory that is currently being reclaimed, so you _need_ to deliver the signal to all threads before starting to free the RAM used.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 0:32 UTC (Thu) by neilbrown (subscriber, #359) [Link] (1 responses)

> Kernel threads might be reading memory that is currently being reclaimed,

So either they will have called get_user_pages() and will hold references to the pages, which will keep them safe, or they will be calling copy_{to,from}_user, which is designed to handle missing addresses and will return an appropriate error status if the memory isn't there.

Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 18:45 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> So either they will have called get_user_pages() and will hold references to the pages which will keep them safe
Wouldn't this require splitting the victim's VMA to free pages that are not pinned (requiring more RAM to do it)? On the other hand, in most cases only a couple of pages are going to be pinned at any given moment.

> Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?
Other than weird zero-copy scenarios I think you're not missing anything.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 8:08 UTC (Thu) by vbabka (subscriber, #91706) [Link] (3 responses)

> Once the OOM killer has identified a process and sent it SIGKILL, why not just pro-actively unmap all its user-space memory. That should immediately resolve the memory problems, and the shell of the old process can be left to sort itself out as locks become available.

> I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?

Yeah Mel suggested this to Dave before the session, but it didn't seem a sufficient solution to avoid the need for reservations completely.

I'm not sure about the exact reason, but if you think about it, there's not much difference between the pages you can reclaim and pages you can unmap. And as long as you can reclaim, OOM is not invoked.

- file pages that are clean, could have been reclaimed, those that are dirty cannot be simply discarded (maybe except some temporary files that have been already unlinked)
- anonymous pages could have been swapped out. Yes, there might be a difference if your swap is full, or file-backed (thus potentially blocking). Otherwise mempools in I/O layer should have guaranteed progress swapping out during reclaim.
- unevictable pages (mlock) - here unmapping on OOM could help, but we could also maybe just breach mlock guarantees and reclaim the pages if the system is in trouble - at that point, any performance guarantees are probably lost anyway. OK, maybe not, since you might be using mlock to prevent sensitive data in anonymous private mappings from hitting persistent storage...
- pages holding the page tables, once you empty them - that will gain you some memory, but likely not guaranteed enough to save the situation

Also did you know that SLE11 (SP1? not sure) kernel already has some limited form of memory reservations? For swap over NFS, I heard :)

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 8:30 UTC (Thu) by neilbrown (subscriber, #359) [Link] (2 responses)

Surely an excess of anonymous or mlocked pages while swap is full is the only situation that can trigger OOM? Those are exactly the pages that can be unmapped but not reclaimed.

There may still be a need for reservations, but that seems to be a largely separate problem from the OOM killer not being able to free memory from the worst offender.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 19:45 UTC (Thu) by mm7323 (subscriber, #87386) [Link] (1 responses)

I think the problems become related when the system deadlocks due to OOM killer not being able to make progress due to memory requesters holding locks or needing more memory for transactions to complete!

Now if XFS could check (and temporarily reserve) how much reclaimable memory is available before starting a transaction, XFS could fail early, or perhaps the OOM killer could be started before the situation deteriorates to the point where no progress can be made because un-reclaimable memory and swap are exhausted.

Reservations for must-succeed memory allocations

Posted Mar 21, 2015 11:42 UTC (Sat) by mtanski (guest, #56423) [Link]

That is exactly what was proposed in the talk and what a lot of the commenters are missing. These changes make a reservation before the transaction starts. At that point you have a choice of cleaning space, returning an error, or waiting on previous transactions to finish.

Think of this as back pressure in a low resource scenario...and it's the right place to apply back pressure: before the transaction starts, before it's too late (not enough memory to make progress).

The downside is that it will lower concurrency on heavily loaded but under resourced (memory) systems.
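
The back-pressure idea described above can be sketched as a toy userspace model in C. This is only an illustration under stated assumptions, not kernel or XFS code: `struct mem_pool`, `struct txn`, `txn_begin()` and `txn_end()` are hypothetical names, and the pool is nothing more than a byte counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool: tracks how many bytes may still be reserved. */
struct mem_pool {
    size_t capacity;  /* total bytes set aside for transactions */
    size_t reserved;  /* bytes currently promised to running transactions */
};

/* Hypothetical transaction handle that remembers its reservation. */
struct txn {
    struct mem_pool *pool;
    size_t reservation;
};

/*
 * Reserve the worst-case memory before the transaction starts.
 * Returns false when the pool cannot cover it; the caller can then
 * free memory, return an error, or wait for another transaction to end.
 */
bool txn_begin(struct txn *t, struct mem_pool *pool, size_t worst_case)
{
    if (worst_case > pool->capacity - pool->reserved)
        return false;  /* back pressure, applied before any work is done */
    pool->reserved += worst_case;
    t->pool = pool;
    t->reservation = worst_case;
    return true;
}

/* Release the reservation once the transaction has finished. */
void txn_end(struct txn *t)
{
    assert(t->pool->reserved >= t->reservation);
    t->pool->reserved -= t->reservation;
    t->reservation = 0;
}
```

In this model a second transaction that does not fit is refused up front instead of stalling halfway through, which is also where the reduced concurrency comes from.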

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 15:23 UTC (Wed) by vbabka (subscriber, #91706) [Link] (1 responses)

>> Michal Hocko started off by saying that the memory-management developers would like to deprecate the __GFP_NOFAIL flag

>All of the mm developers, or just some? And do we know why?

Actually, it has already been deprecated for years. Which is exactly what led to people working around it (literally :) with retry loops. Which means the MM subsystem cannot know (without seeing the flag) that the particular allocation site in fact cannot fail, and cannot treat it specially.

I think the article is a bit misleading on the "would like to deprecate" part here. In fact, Michal has already posted a patch to clarify the wording:

* __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures. This modifier is deprecated and no new
- * users should be added.
+ * cannot handle allocation failures. New users should be evaluated carefuly
+ * (and the flag should be used only when there is no reasonable failure policy)
+ * but it is definitely preferable to use the flag rather than opencode endless
+ * loop around allocator.

So AFAIU the goal now is not to deprecate the flag, but to handle the allocations that do need to use it in a more reliable way than relying just on OOM.
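
To illustrate why the flag is preferable to an open-coded loop, here is a small userspace model in C. Everything in it is an assumption made for illustration: `TOY_NOFAIL` stands in for __GFP_NOFAIL, `toy_alloc()` stands in for the page allocator, and the counters only simulate memory pressure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NOFAIL 0x1u   /* hypothetical stand-in for __GFP_NOFAIL */

int toy_failures_left;    /* simulated transient memory pressure */
int toy_nofail_seen;      /* must-succeed requests the allocator knows about */

/* One allocation attempt; fails while the simulated pressure lasts. */
static void *toy_try_alloc(size_t size)
{
    if (toy_failures_left > 0) {
        toy_failures_left--;
        return NULL;
    }
    return malloc(size);
}

/* Allocator entry point: with TOY_NOFAIL it does the retrying itself. */
void *toy_alloc(size_t size, unsigned flags)
{
    void *p;

    if (flags & TOY_NOFAIL)
        toy_nofail_seen++;  /* the allocator could treat this request specially */
    do {
        p = toy_try_alloc(size);
    } while (!p && (flags & TOY_NOFAIL));
    return p;
}

/* Open-coded retry loop: same outcome, but the allocator never sees the intent. */
void *caller_open_coded(size_t size)
{
    void *p;

    while (!(p = toy_alloc(size, 0)))
        ;  /* retry forever, invisibly to the allocator */
    return p;
}

/* Flag-based caller: the must-succeed nature is visible to the allocator. */
void *caller_with_flag(size_t size)
{
    return toy_alloc(size, TOY_NOFAIL);
}
```

Both callers end up with memory, but only the flag-based one leaves a trace (`toy_nofail_seen`) that the allocator could act on.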

> > Ted insisted, though, that this work is addressing a corner case; things work properly, he said, 99.9999% of the time.

> So team-red says "It's broken, we need to fix it", and team blue says "it ain't broke, don't fix it".
> It seems that discussing a solution might be premature - more airtime needed on the problem?

I guess there might be users which want things to work properly 100% of the time, and not rely on luck.

> So the proposal is to rewrite some filesystems to make implementing memory control groups easier. Did I get that right?

As has been already mentioned, this is not at all limited to memcg.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 15:34 UTC (Wed) by vbabka (subscriber, #91706) [Link]

>> Michal Hocko started off by saying that the memory-management developers would like to deprecate the __GFP_NOFAIL flag
> I think the article is a bit misleading on the "would like to deprecate" part here.

More precisely, this was meant in a way that MM devs had wanted to deprecate the __GFP_NOFAIL flag (and even in fact they did that, in its description), but have later realized that there are allocation sites that do need it and while it would be nice to get rid of it, it doesn't seem to be a realistic goal.

Reservations for must-succeed memory allocations

Posted Mar 27, 2015 15:17 UTC (Fri) by mstsxfx (subscriber, #41804) [Link]

[...]
> > There are some workloads, often involving a small number of processes
> > running in a memory control group, where every task depends on the
> > same lock.
>
> I suspect this is the elephant in the room - memory control
> groups. Things work properly 99.9999% of the time .... when you aren't
> using control groups!?!
>
> So the proposal is to rewrite some filesystems to make implementing
> memory control groups easier. Did I get that right?

Not at all. Memory control groups had a similar issue, but this has been solved because we now trigger the memcg OOM killer only from the page fault path after all the previous locks were dropped (have a look at pagefault_out_of_memory). It is the !memcg case which is hitting the same problem now. There are certainly ways to make the OOM killer smarter (e.g. tearing down parts of the address space). The point remains though. There are non-failing allocations (__GFP_NOFAIL) which are holding locks (i_mutex to name the most visible example) which might be preventing further progress. It is surprisingly easy to hit some of those corner cases without privileges.

Now I am not suggesting that a full reservation system is a must, but we might eventually need it if all our other options are not sufficient. Filesystem people are already using a reservation system, so it is not surprising they are pushing for a similar thing in the MM code as well. I suspect that the MM implementation will be tricky, so we will push back and try everything before going that route.

Reservations for must-succeed memory allocations

Posted Mar 23, 2015 16:30 UTC (Mon) by meyert (subscriber, #32097) [Link] (1 responses)

"This work cannot be backed out once it has begun. Actually, it might be possible to implement a robust back-out mechanism for XFS transactions, but it would take years and double the memory requirements, making the actual problem worse."

That's really an interesting definition of a "transaction"...

Reservations for must-succeed memory allocations

Posted Mar 23, 2015 17:36 UTC (Mon) by pizza (subscriber, #46) [Link]

> That's really an interesting definition of a "transaction"...

Think of it from the perspective of data-on-disk.

Reservations for must-succeed memory allocations

Posted Mar 29, 2015 21:26 UTC (Sun) by toyotabedzrock (guest, #88005) [Link]

Given hackers know this problem exists they will target it to put the system in an unstable state. And they will find a way to cause memory to be overwritten and executed.

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 2:41 UTC (Wed) by zblaxell (subscriber, #26385) [Link] (12 responses)

Usually I run most of the interesting processes on a system in a memory cgroup and limit that cgroup to 80% of the RAM with no OOM killer...because such terrible things happen when I let the default Linux memory anarchy have its way. With the cgroup, processes block, but they don't fail or die until an administrator (or appropriately deputized automaton) intervenes.

It seems odd to see all the debate and complexity to save a few hundred kB, and I question the priorities of some of these apocryphal "administrators" who would prefer a 5% performance gain over non-deterministic lockups and a visit from the Chaos Monkey. I threw several gigabytes of RAM at this problem years ago and never looked back. I would love to see filesystems just grab a megabyte or ten of RAM at mount time to handle their worst-case peak memory demands, and be done with this kind of problem forever.

IMHO "no-fail" allocations should come out of a previously reserved (and fully committed!) pool. Allocations in excess of the amount reserved should fail in _all_ cases, not just low-memory cases. This eliminates low-memory corner cases by making them identical to the normal cases.
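
A minimal sketch of such a pool, as a userspace model in C: the storage is set aside up front, and a request beyond the reserved amount fails whether or not the rest of the system is short of memory. The names `struct nofail_pool`, `pool_alloc()` and `pool_free()` and the sizes are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS 4    /* hypothetical worst-case number of blocks */
#define BLOCK_SIZE  256  /* hypothetical block size in bytes */

/* All storage is part of the pool itself, so nothing is allocated later. */
struct nofail_pool {
    unsigned char storage[POOL_BLOCKS][BLOCK_SIZE];
    int in_use[POOL_BLOCKS];
};

/* Hand out a reserved block; fail as soon as the reservation is exceeded. */
void *pool_alloc(struct nofail_pool *pool)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = 1;
            return pool->storage[i];
        }
    }
    return NULL;  /* same failure in normal and low-memory conditions */
}

/* Return a block to the pool so that it can be handed out again. */
void pool_free(struct nofail_pool *pool, void *block)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (block == (void *)pool->storage[i]) {
            assert(pool->in_use[i]);
            pool->in_use[i] = 0;
            return;
        }
    }
    assert(!"block does not belong to this pool");
}
```

Because the failure depends only on the pool's own state, the over-reservation path is exercised in ordinary testing rather than only under memory pressure.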

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 3:43 UTC (Wed) by dlang (guest, #313) [Link] (2 responses)

if you are working on a system with 512MB of ram (like a raspberry Pi), you can't just throw a few GB at the problem.

Even in the server space, a lot of people running VMs are constrained far more by the amount of RAM that they can cram into the system than the CPU cycles available.

It's seldom as simple as "trade 5% speed for frequent random crashes"

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 18:04 UTC (Wed) by zblaxell (subscriber, #26385) [Link] (1 responses)

> if you are working on a system with 512MB of ram (like a raspberry Pi), you can't just throw a few GB at the problem.

The workloads are proportionally smaller too (no multi-layer filesystem + LVM + RAID, maximum burst write size is smaller, etc), so in practice less than 100MB needs to be set aside. It's still 20% of the machine, though, and it wouldn't need to be set aside at all if the kernel could be trusted to manage its own memory.

If the worst-case transaction RAM usage in my favorite filesystem is 20MB, and I have a 16MB machine, I cannot use that filesystem on that machine. Even if the filesystem only requires 1MB 99% of the time, as soon as that 1%-of-the-time case pops up, the filesystem will fail, and it will probably take most or all of the application stack down with it. There is no option in this case that does not lead to failure. The only question is when the failure is detected. I'd much prefer mount to fail at the start because the filesystem can't reserve space for one instance of its worst case space requirement. The alternative is to fail later in the field. Possibly literally in the field, if the Pi has been installed in some kind of autonomous robot...

I don't expect a Pi to be able to sustain 2000 simultaneous writing threads for a dozen reasons, only one of which is not having enough RAM--reserved or otherwise--to handle all the filesystem transactions at once. I'd expect either serialization or ENOMEM, but what I get is a hang or a Chaos Monkey.

> Even in the server space, a lot of people running VMs are constrained far more by the amount of RAM that they can cram into the system than the CPU cycles available.

RAM size determines workload size and vice versa. If the workload exceeds the available RAM the application will fail whether we use a reservation scheme or not. The admin's job is to adjust workload or RAM sizes until there's enough RAM for the workload and not too much workload for the RAM.

In practice, the admin currently has to determine what the workload's worst-case RAM requirement is, and guess how much headroom to add on top to prevent the kernel from randomly failing. Ideally, the kernel would just manage the headroom it needs by itself, and eliminate the guesswork for the admin.

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 18:42 UTC (Wed) by dlang (guest, #313) [Link]

The problem here isn't dependent on the size of the workload overall, just the number of kernel threads working through the particular portion of the kernel.

In general people do just 'throw memory at it', this entire situation only comes up when a system is using all the memory it has, and can't easily free more.

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 9:43 UTC (Wed) by etienne (guest, #25256) [Link] (8 responses)

> I would love to see filesystems just grab a megabyte or ten of RAM at mount time to handle their worst-case peak memory demands...

But would you love grabbing "a megabyte or ten" each time someone writes one byte to a file/device because the file may be on a userspace filesystem of a unionfs of a complex filesystem on a RAID partition of a...
The reservation shall be at the kernel entry (from userspace) because that is where no locks are taken, that is also where nobody knows how much memory will be needed to write that byte.
Maybe a solution could be a new error code internal to Linux, something like -EKERNELRETRY, where it is like an error and everything is cancelled, but just before returning to the usermode application the request is retried entirely. Still a very complex solution...

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 16:50 UTC (Wed) by zblaxell (subscriber, #26385) [Link] (5 responses)

> But would you love grabbing "a megabyte or ten" each times someone writes one byte to a file/device because the file may be on a userspace filesystem of a unionfs of a complex filesystem on a RAID partition of a...

If the alternative is a messy sort of failure, then yes. I either need the memory or I don't. If I need the memory, I need the memory. The algorithms implemented in the software I'm running won't work without the memory they require _by definition_. This should not be controversial.

Currently I have to let gigabytes of RAM lie fallow because my kernel can't be trusted to manage its own memory sanely. How could a few hundred or even a few thousand 1MB preallocations make that worse? If anything, such a scheme would _save_ memory for me.

> The reservation shall be at the kernel entry (from userspace) because that is where no locks are taken, that is also where nobody knows how much memory will be needed to write that byte.

The syscall entry is really too late. The reservations should be done much earlier, e.g. when the filesystem is mounted or when files are opened for writing. We should reserve RAM as soon as usage becomes possible so we are not surprised when the bad cases pop up later.

The filesystem should always know what it would need to write _a_ byte. Multiply that amount by the number of simultaneously active writing threads (possibly less if the filesystem can combine similar requests into a single RAM reservation requirement, or serialize large requests to reduce peak RAM usage). Reserve that amount for the filesystem to use. Recurse and repeat for each lower layer until all the memory required to write the byte is reserved.
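
The arithmetic described above can be sketched in C as follows. This is an illustrative model only: `struct layer`, its fields, and `reservation_needed()` are hypothetical names, the per-layer figures would have to come from each layer's own worst-case analysis, and overflow is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one layer in the storage stack. */
struct layer {
    const char *name;
    size_t bytes_per_write;     /* worst-case memory to push one write through */
    const struct layer *lower;  /* next layer down, or NULL at the bottom */
};

/*
 * Walk down the stack and add up the worst case of every layer,
 * multiplied by the number of simultaneously active writers.
 */
size_t reservation_needed(const struct layer *top, size_t writers)
{
    size_t total = 0;

    for (const struct layer *l = top; l != NULL; l = l->lower)
        total += l->bytes_per_write * writers;
    return total;
}
```

For example, a filesystem needing 8192 bytes per write on top of an LVM layer needing 1024 and a RAID layer needing 4096 would reserve 13312 bytes per writer in this model; a layer that can combine or serialize requests would report a smaller figure.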

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 18:44 UTC (Wed) by dlang (guest, #313) [Link] (2 responses)

> Currently I have to let gigabytes of RAM lie fallow because my kernel can't be trusted to manage its own memory sanely

where in the world did you get this from?

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 19:37 UTC (Wed) by nix (subscriber, #2304) [Link]

This is his scheme for running everything in a cgroup and constraining its memory to much less than that available in the machine. It's not the only possible scheme, which he seems to overlook...

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 21:28 UTC (Wed) by zblaxell (subscriber, #26385) [Link]

I discovered it mostly by accident. cgroups are obviously the wrong tool, but so far I haven't found any other tool that even _touches_ some of the problems I'm facing.

Most of my big workload applications are based on a single constant-sized blob of data (e.g. a RDBMS server, git repo, or similar) with latency-bound processing requirements (i.e. waiting for disk I/O is not permitted). I'd love to say "processes in group A get access to 75% of RAM all the time, and everything else on the system gets to fight over the remainder." As far as I can tell, cgroups provide only the exact opposite of that: "if I limit everything else to 25% of RAM, group A gets its RAM by default maybe 98% of the time." (the other 2% is a failure mode where all the RAM gets eaten by something invisible to cgroups and slabtop, and the machine watchdog-resets).

So I've got all of userspace throttled by cgroups, and suddenly many of the stupid things that the kernel does when memory is low just go away. No more high CPU usage in kernel space when free RAM is low, no more randomly killed processes, fewer random crashes, fewer spurious I/O errors, and fewer other random and bizarre bugs that only seem to occur when something uses the last free pages of RAM. Occasionally there's a kernel stack trace with "memcg" in it, but that's usually followed a few days later by a kernel patch to fix it.

I've experimentally found that somewhere between 10 and 30% of the RAM has to be inaccessible to userspace before I get predictable performance results, which is a few gigabytes on a typical 16GB system. Most of the variation comes from VFS dentry/inode cache, which isn't directly controlled by cgroups, but uses space roughly proportional to cgroup page cache limits.

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 19:40 UTC (Wed) by nix (subscriber, #2304) [Link] (1 responses)

Filesystems don't write by starting 'writing threads' and dumping the work into them. They write by having an existing userspace thread transition into kernel mode and initiate the write. Your proposed scheme would bound the number of *userspace threads* that could be invoked, and would require every thread creation to be accompanied by God only knows how much peripheral allocation by everything in the kernel that might potentially need to do work on behalf of that thread in the future. Thread creation is already slow enough, thank you very much!

(Note: it might be acceptable to have threads block at the point when they would otherwise be about to initiate an fs write if an allocation cannot be obtained -- but how is that different from what we have now, particularly given that writes that are necessary to resolve memory pressure can still lead to deadlocks in this scenario?)

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 21:22 UTC (Wed) by zblaxell (subscriber, #26385) [Link]

> Your proposed scheme would bound the number of *userspace threads* that could be invoked

That number is already bounded by the amount of RAM you have to support those threads. I'd propose just lowering it slightly, e.g. to "only half the number of threads that will cripple the system by exhausting all available memory."

> would require every thread creation to be accompanied by God only knows how much peripheral allocation by everything in the kernel that might potentially need to do work on behalf of that thread in the future

When a thread modifies a file, it is the file (or the filesystem containing it) that ultimately needs the reserved space. The threads were just there incidentally.

In practice the thread doesn't have to own anything. The reserved space for writing the file would be owned by the file or its filesystem. The thread that first wrote something would create the reserved allocation and attach it to the file, and whichever thread got stuck with the job of flushing the page to disk would consume the reserved allocation from the file. There may be recycling of allocations. Or rather the filesystem code executed by the thread would do all that, since the filesystem is the expert on how much memory it needs in the first place.

I am proposing that before we let a thread dirty a page, enough space is reserved to be able to reliably clean it in the future. That doesn't seem unreasonable to me. It may not even be extra work (for the machine) since the filesystem was going to allocate and use that memory anyway, just at a different time.

> Note: it might be acceptable to have threads block at the point when they would otherwise be about to initiate an fs write if an allocation cannot be obtained

We can block earlier, before (many) locks are held.

> but how is that different from what we have now, particularly given that writes that are necessary to resolve memory pressure can still lead to deadlocks in this scenario?

I'd really prefer to not have memory pressure and writes interact at all. That's half the reason why I'm using cgroup hacks right there: to prevent any group of processes from using dirty pages to export memory pressure to the rest of the system. Among many other things, cgroups are a crude way of forcing dirty pages to be counted separately from any other kind of page. "cgroup" is just close enough to "page type" to be effective, since most of my cgroups tend to be dominated by a single page type.

Reservations for must-succeed memory allocations

Posted Apr 1, 2015 22:45 UTC (Wed) by neilbrown (subscriber, #359) [Link] (1 responses)

> Maybe a solution could be a new error code internal to Linux, something like -EKERNELRETRY,

We already have that.

#define ERESTARTSYS 512

Reservations for must-succeed memory allocations

Posted Apr 2, 2015 6:00 UTC (Thu) by kleptog (subscriber, #1183) [Link]

Right, here you could essentially reuse the mechanism used to restart system calls interrupted by signals. What I wonder is how much userspace would break if you changed this, since currently (I believe) a read()/write() on a file never returns this code.
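
The "cancel everything and retry at the entry point" idea can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not how the kernel's restart machinery is implemented: `toy_syscall_entry()` and `toy_write_one_byte()` are hypothetical, `TOY_ERESTART` merely mirrors the ERESTARTSYS value quoted above, and the restart limit is an added assumption so that the loop terminates.

```c
#include <assert.h>

#define TOY_ERESTART 512  /* mirrors the ERESTARTSYS value quoted above */
#define TOY_ENOMEM   12   /* conventional ENOMEM value on Linux */

typedef int (*handler_fn)(void *arg);

/*
 * Hypothetical entry point: rerun the whole request for as long as the
 * handler reports that it backed everything out and wants a restart.
 */
int toy_syscall_entry(handler_fn handler, void *arg, int max_restarts)
{
    int restarts = 0;

    for (;;) {
        int ret = handler(arg);

        if (ret != -TOY_ERESTART)
            return ret;          /* success or a real error */
        if (restarts++ >= max_restarts)
            return -TOY_ENOMEM;  /* give up instead of looping forever */
    }
}

/* Example handler: pretends memory is short for the first few attempts. */
int toy_write_one_byte(void *arg)
{
    int *shortages_left = arg;

    if (*shortages_left > 0) {
        (*shortages_left)--;
        return -TOY_ERESTART;  /* everything undone, no locks held */
    }
    return 1;  /* one byte written */
}
```

The restart code never escapes `toy_syscall_entry()` in this model, which matches the intent that userspace should not see it.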