Expedited memory reclaim from killed processes
The kernel requires processes to clean up their own messes when they exit. So, before a process targeted by an OOM killer can rest in peace, it must first run (in the kernel) to free its memory resources and return them to the system. If the process is running with a low priority, or if it gets hung up waiting for a resource elsewhere in the system, this cleanup work may take a long time. Meanwhile, the system is still struggling, since the hoped-for influx of free memory has not yet happened; at some point, another process will also be killed in the hope of getting better results.
This problem led to the development of the OOM reaper in 2015. When the kernel's OOM killer targets a process, the OOM reaper will attempt to quickly strip that process of memory that the process will never attempt to access again. But the OOM reaper is not available to user-space OOM killers, which can only send a SIGKILL signal and hope that the target exits quickly. The user-space killer never really knows how long it will take for a killed process to exit; as Baghdasaryan pointed out, that means that it must maintain a larger reserve of free pages and kill processes sooner than might be optimal.
Baghdasaryan's proposal is to add a new flag (SS_EXPEDITE) to the (new in 5.1) pidfd_send_signal() system call. If that flag is present, the caller has the CAP_SYS_NICE capability, and the signal is SIGKILL, then the OOM reaper will be unleashed on the killed process. That should result in quicker and more predictable freeing of the target's memory, regardless of anything that might slow down the target itself.
The comments on the proposal were heavily negative, which is interesting
because most of the people involved were supportive of the objective
itself. The strongest critic, perhaps, was Michal Hocko (the author of the
OOM reaper), who
complained
that it "is abusing an implementation detail of the OOM
implementation and makes it an official API
". He questioned whether
this capability was useful at all, saying that relying on cleanup speed is
"a fundamental design problem
". Johannes Weiner, though, argued that
the idea was, instead, "just optimizing the user experience as
best as it can
". Others generally agreed that freeing memory
quickly when a process is on its way out is a good thing.
Daniel Colascione liked the idea but wanted a different interface. Rather than adding semantics to a specific signal, he suggested a new system call along the lines of:
size_t try_reap_dying_process(int pidfd, int flags, size_t max_bytes);
This call would attempt to strip memory from the process indicated by pidfd, but only if the process is currently dying. The max_bytes parameter would tell the kernel how much memory the caller would liked to see freed; that would allow the kernel to avoid doing the full work of stripping the target process if max_bytes has been reached. Notably, the reaping would be done in the context of the calling process, allowing user space to determine how important this work is relative to other tasks.
Matthew Wilcox had a simpler question: why not just expedite the reclaim of memory every time a process exits, rather than on specific request? Weiner agreed with that idea, noting that the work has to be done anyway, and doing it always would avoid the need to add a new interface for that purpose. Daniel Colascione pointed out, though, that such a mechanism would slow down the delivery of SIGKILL in general; that slowdown might be felt when running something like killall.
Baghdasaryan didn't
think that expedited reaping makes sense all the time. But, he said,
there are times when it is critical. "It would be great if we can
identify urgent cases without userspace hints, so I'm open to suggestions
that do not involve additional flags
". If such a solution can be
found, it is likely to end up as the preferred alternative here. Cleaning
up after a killed process is, in the end, the kernel's responsibility, and
there is little desire to create a new interface to control how that
responsibility is carried out. Solutions that "just work" are thus to be
preferred over the addition of yet another Linux-specific API for this
purpose.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Out-of-memory handling |
| Kernel | OOM killer |