Expedited memory reclaim from killed processes


Posted Apr 13, 2019 2:05 UTC (Sat) by quotemstr (subscriber, #45331)
In reply to: Expedited memory reclaim from killed processes by wahern
Parent article: Expedited memory reclaim from killed processes

Believe me, I'm no fan of overcommit --- but in this specific instance, it's amazingly not overcommit's fault. The same delay in memory reclaim can occur in a strict-commit system! The issue is the timing of page reclaim, not how many pages get allocated in the first place and from what sources --- which is the point of overcommit.
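(For reference, the overcommit policy being debated here is a runtime tunable; a read-only look at it, assuming a Linux shell:)

```shell
# Inspect the current overcommit policy:
#   0 = heuristic overcommit (the default), 1 = always overcommit,
#   2 = strict commit accounting (the "strict commit" mode discussed above)
cat /proc/sys/vm/overcommit_memory

# Under mode 2, the commit limit is swap plus overcommit_ratio percent of
# RAM (or a fixed overcommit_kbytes); the kernel's current accounting is
# visible in /proc/meminfo:
grep -E '^Commit' /proc/meminfo
```

Switching modes is `sysctl -w vm.overcommit_memory=2` (as root). Note that mode 2 changes only allocation-time accounting; the timing of page reclaim after a kill is unaffected, which is the point above.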


to post comments

Expedited memory reclaim from killed processes

Posted Apr 13, 2019 3:55 UTC (Sat) by wahern (subscriber, #37304)

I don't doubt the criticisms of the patch. But from my most recent foray into the OOM logic, my general sense is that some of these timing and priority issues are largely a consequence of the complexity added to forestall the OOM killer and minimize its harm. For example, right now I'm helping to deal with the OOM killer being invoked in situations where there's plenty of uncommitted memory, but, because of heavy concurrent dirtying, the reclaim code attempting (or waiting on) page flush gives up too soon. Without overcommit there would be no reason to give up --- "gives up too soon" is a characterization that is only meaningful in a best-effort, no-guarantee environment.
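(As context for the "heavy concurrent dirtying" scenario: how much dirty page cache writers may accumulate before the kernel throttles them or starts background writeback is governed by a handful of sysctls; a read-only look, assuming a Linux shell:)

```shell
# Percentage of reclaimable memory that may be dirty before writers are
# throttled (dirty_ratio) and before background flushing kicks in
# (dirty_background_ratio):
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio

# Current amount of dirty and under-writeback page cache, in kB:
grep -E '^(Dirty|Writeback):' /proc/meminfo
```

When concurrent writers keep Dirty high, reclaim has to wait on (or trigger) writeback to free those pages --- the wait that, per the comment, is abandoned too soon.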

Overcommit seems to have resulted in kernel code that emphasizes mitigations rather than providing consistent, predictable, and tunable behavior. There is no correct, or even best, heuristic for mitigating OOM; it's a constantly moving target. Any interface to the heuristic mechanisms and policies that looks best today will look horrible tomorrow.

One may want to opt into overcommit and wrestle with such lose-lose scenarios, but designing *that* interface (opt-in) is much easier because you're already in a position of being able to contain the system-wide fallout--architectural, performance, code maintenance, etc. Likewise, one may want the ability to cancel ("give up") some operation, but an interface describing when to cancel (a finite set of criteria) is much easier to devise and use than one describing when not to cancel (a possibly infinite set of criteria).

So, yeah, I don't doubt that one sh*t sandwich can be objectively preferable to another sh*t sandwich, but the spectacle of sh*t sandwich review is tragic.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds