Explicit pinning of user-space pages

By Jonathan Corbet
December 13, 2019

The saga of get_user_pages() — and the problems it causes within the kernel — has been extensively chronicled here; see the LWN kernel index for the full series. In short, get_user_pages() is used to pin user-space pages in memory for some sort of manipulation outside of the owning process(es); that manipulation can sometimes surprise other parts of the kernel that think they have exclusive rights to the pages in question. This patch series from John Hubbard does not solve all of the problems, but it does create some infrastructure that may make a solution easier to come by.

To simplify the situation somewhat, the problems with get_user_pages() come about in two ways. One of those happens when the kernel thinks that the contents of a page will not change, but some peripheral device writes new data there. The other arises with memory that is located on persistent-memory devices managed by a filesystem; pinning pages into memory deprives the filesystem of the ability to make layout changes involving those pages. The latter problem has been "solved" for now by disallowing long-lasting page pins on persistent-memory devices, but there are use cases calling for creating just that kind of pin, so better solutions are being sought.

Part of the problem comes down to the fact that get_user_pages() does not perform any sort of special tracking of the pages it pins into RAM. It does increment the reference count for each page, preventing it from being evicted from memory, but pages that have been pinned in this way are indistinguishable from pages that have acquired references in any of a vast number of other ways. So, while one can ask whether a page has references, it is not possible for kernel code to ask whether a page has been pinned for purposes like DMA I/O.

Hubbard's patch set addresses the tracking part of the problem; it starts by introducing some new internal functions as alternatives to get_user_pages() and its variants:

    long pin_user_pages(unsigned long start, unsigned long nr_pages,
                        unsigned int gup_flags, struct page **pages,
                        struct vm_area_struct **vmas);
    long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
                               unsigned long start, unsigned long nr_pages,
                               unsigned int gup_flags, struct page **pages,
                               struct vm_area_struct **vmas, int *locked);
    int pin_user_pages_fast(unsigned long start, int nr_pages,
                            unsigned int gup_flags, struct page **pages);

From the caller's perspective, these new functions behave just like the get_user_pages() versions. Switching callers over is just a matter of changing the name of the function called. Pages pinned in this way must be released with the new unpin_user_page() and unpin_user_pages() functions; these are a replacement for put_user_page(), which was introduced by Hubbard earlier in 2019.

The question of how a developer should choose between get_user_pages() and pin_user_pages() is somewhat addressed in the documentation update found in this patch. In short, if pages are being pinned for access to the data contained within those pages, pin_user_pages() should be used. For cases where the intent is to manipulate the page structures corresponding to the pages rather than the data within them, get_user_pages() is the correct interface.

The new functions inform the kernel about the intent of the caller, but there is still the question of how pinned pages should be tracked. Some sort of reference count is required, since a given page might be pinned multiple times and must remain pinned until the last user has called unpin_user_pages(). The logical place for this reference count is in struct page, but there is a little problem: that structure is tightly packed with the information stored there now, and increasing its size is not an option.

The solution that was chosen is to overload the page reference count. A call to get_user_pages() will increase that count by one, pinning the page in place. A call to pin_user_pages(), instead, will increase the reference count by GUP_PIN_COUNTING_BIAS, which is defined in patch 23 of the series as 1024. Kernel code can now check whether a page has been pinned in this way by calling page_dma_pinned(), which simply needs to check whether the reference count for the page in question is at least 1024.
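The biased-count scheme can be illustrated with a small user-space model. This is only a sketch, not kernel code: struct model_page and the model_*() helpers are hypothetical stand-ins invented for the example, and only the bias value of 1024 comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Bias value as described for the patch set. */
#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for struct page; only the reference count is modeled. */
struct model_page {
	int refcount;
};

/* Ordinary reference, as get_user_pages() would take: count goes up by one. */
static inline void model_get(struct model_page *p)
{
	p->refcount += 1;
}

/* Drop an ordinary reference. */
static inline void model_put(struct model_page *p)
{
	assert(p->refcount >= 1);
	p->refcount -= 1;
}

/* Pin, as pin_user_pages() would take: count goes up by the bias. */
static inline void model_pin(struct model_page *p)
{
	p->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Release one pin; the page stays pinned until the last pin is dropped. */
static inline void model_unpin(struct model_page *p)
{
	assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
	p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* The check described for page_dma_pinned(): is the count at least the bias? */
static inline bool model_dma_pinned(const struct model_page *p)
{
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model, a page that is pinned twice keeps reporting as pinned after the first model_unpin() and stops doing so only after the second, while ordinary references taken with model_get() leave the answer unchanged as long as there are fewer than 1024 of them.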

Using the reference count in this way does cause a few little quirks. Should a page acquire 1024 or more ordinary references, it will now appear to be pinned for DMA. This behavior is acknowledged in the patch set, but is seen not to be a problem; false positives created in this way should not adversely affect the behavior of the system. A potentially more serious issue has to do with the fact that the reference count only has 31 bits of usable space; since the bias occupies the low 10 bits, only 21 bits are available for counting pins. That might be considered to be enough for most uses, but pinning a compound page causes the head page to be pinned once for each of the tail pages. A 1GB compound page contains 256K (262,144) 4KB pages, so such a page could only be pinned eight times before the reference count overflows.
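Both quirks come down to simple arithmetic, which can be checked with a short user-space sketch. The helper names here are hypothetical, and the number of usable bits in the count is a parameter supplied by the caller rather than a value taken from the patches; the example figures below assume a 32-bit signed count, giving 31 usable bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUP_PIN_COUNTING_BIAS 1024u	/* bias value as described above */
#define BIAS_BITS 10u			/* log2(1024) */

/* True when a raw count would be reported as pinned for DMA. */
static inline bool looks_dma_pinned(uint32_t refcount)
{
	return refcount >= GUP_PIN_COUNTING_BIAS;
}

/*
 * Number of times a compound page can be pinned before the count
 * overflows, if each pin of the compound page adds the bias to the
 * head page once per subpage. usable_bits is an assumption made by
 * the caller about the width of the count.
 */
static inline uint64_t max_compound_pins(unsigned int usable_bits,
					 uint64_t compound_bytes,
					 uint64_t base_page_bytes)
{
	uint64_t pin_slots, subpages;

	assert(usable_bits > BIAS_BITS && usable_bits < 64);
	assert(base_page_bytes != 0 && compound_bytes >= base_page_bytes);

	pin_slots = UINT64_C(1) << (usable_bits - BIAS_BITS);
	subpages = compound_bytes / base_page_bytes;
	return pin_slots / subpages;
}
```

Under those assumptions, looks_dma_pinned(1024) is true even if all 1024 references are ordinary ones, which is the false positive. A 1GB compound page built from 4KB base pages has 2^30 / 2^12 = 262,144 subpages, so max_compound_pins(31, 1 << 30, 4096) comes to 2^21 / 2^18 = 8, while a 2MB compound page (512 subpages) could be pinned 4096 times.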

The solution to that problem, Hubbard says, is to teach get_user_pages() (and all the variants) about huge pages so that they can be managed with a single reference count. He notes that "some work is required" to implement this behavior, though, so it might not happen right away; it is certainly not a part of this patch set which, at 25 individual patches, is already large enough.

There is one other little detail that isn't part of this set: how the kernel should actually respond to pages that have been pinned in this way. Or, as Hubbard puts it: "What to do in response to encountering such a page is left to later patchsets". One possibility can be found in the layout lease proposal from Ira Weiny, which would provide a mechanism by which long-term-pinned pages could be unpinned when the need arises. There is not yet a lot of agreement on how such a mechanism should work, though, so a full solution to the get_user_pages() problem is still a somewhat distant prospect. Expect it to be a topic for more heated discussion at the 2020 Linux Storage, Filesystem, and Memory-Management Summit.

Meanwhile, though, the kernel may have at least gained a mechanism by which pinned pages can be recognized and tracked, which is a small step in the right direction. These patches have been through a number of revisions and look poised to enter Andrew Morton's -mm tree sometime in the near future. That would make their merging for the 5.6 kernel a relatively likely prospect.

Index entries for this article
Kernel	Memory management/get_user_pages()

Explicit pinning of user-space pages

Posted Dec 14, 2019 4:05 UTC (Sat) by jeremyhetzler (subscriber, #127663) [Link] (2 responses)

"A 1GB compound page contains 256 4KB pages"

4KB * 256 = 1MB, not 1GB.

Is this an error, or am I misunderstanding something?

Explicit pinning of user-space pages

Posted Dec 22, 2019 22:31 UTC (Sun) by liam (guest, #84133) [Link] (1 responses)

11 bits==2048==256 * 4 * 8

Explicit pinning of user-space pages

Posted Dec 23, 2019 17:34 UTC (Mon) by rschroev (subscriber, #4164) [Link]

11 bits == 2048 == 256 * 8 != 256 * 4 * 8

As I understand it, the compound page in the text has 256 pages, and there are 11 bits for counting pins. Each page needs to be pinned each time the compound page is pinned, so the maximum number of pins is 2^11 / 256.

That's not what jeremyhetzler was referring to though. AFAICS his point still stands: if that compound page contains 256 pages, each being 4 KiB large, the total size is 256 * 4 KiB == 1 MiB, not 1 GiB.

Explicit pinning of user-space pages

Posted Dec 15, 2019 2:02 UTC (Sun) by snajpa (subscriber, #73467) [Link] (8 responses)

I believe Linux is almost beyond the point where it would have made sense to prepare for the future rather than to chase every little bit of performance possible. If the kernel doesn't step more in the direction of greater scalability, it will eventually be replaced by some other kernel in the datacentric domain, and all Linux will be good for is handsets. Who could have anticipated when we saw Red Hat go public...

clone3() is just one of the great examples of what could have been anticipated but wasn't; needing only 1024 reference upcounts to deem a page pinned seems like something we're going to hit tomorrow or the day after.

I wonder why everyone is still OK with adding more work/technical debt into the future - it creates a lot more work ^squared, not just because of all of the QA associated...

I don't understand why this is. Maybe I'm too young to get it, but this seems like rather a recurring theme with Linux.

Why is it acceptable to even propose a patchset containing anything like this...

Explicit pinning of user-space pages

Posted Dec 15, 2019 2:11 UTC (Sun) by snajpa (subscriber, #73467) [Link]

Another example of that, relevant to our use-case, is eg. here -> https://github.com/vpsfreecz/linux/commit/261c5d8a38a849f... and I wonder, what pops up next when we deploy more vpsAdminOS servers in coming weeks :))

Explicit pinning of user-space pages

Posted Dec 16, 2019 12:20 UTC (Mon) by jan.kara (subscriber, #59161) [Link] (6 responses)

We are certainly aware that 1024 ordinary page references are going to happen. We just don't think such pages are common, and so the false-positive rate of identification of pinned pages is going to be low. And when we falsely identify a page as pinned, the result will be higher overhead when dealing with such a page (such as bouncing the page during writeback) but nothing catastrophic. Finally, the mechanics of identifying that a page is pinned are actually a detail in this series and can be switched easily if practice proves it isn't workable. The difficult part is the API change and converting all the call sites...

Explicit pinning of user-space pages

Posted Dec 17, 2019 6:43 UTC (Tue) by snajpa (subscriber, #73467) [Link] (5 responses)

That didn't answer my question as to *why* excuses like this are even acceptable...

I get that it might be more work to do it properly (looks to be about ~2k lines of changes if made so, at least). Oh yeah. That happens.

But why is it acceptable to propose some lazy nonsolution, which adds more technical debt into the future?

Explicit pinning of user-space pages

Posted Dec 17, 2019 6:55 UTC (Tue) by snajpa (subscriber, #73467) [Link] (4 responses)

From current Linus's master:

    snajpa@kerneldev:~/linux ((ea200dec5128...))$ gg get_user_page | wc -l
    302

We're talking about a project that has >> 15M lines of code, *and* we're talking about a part which obviously needs a proper redesign but which only gets about 300 mentions in the whole codebase.

I still don't understand, what am I missing?

Explicit pinning of user-space pages

Posted Dec 17, 2019 15:08 UTC (Tue) by corbet (editor, #1) [Link] (3 responses)

Given the amount of effort that has gone into this problem over the last couple of years or so, I'd say that calling the people involved "lazy" does them a major disservice. This is not an easy problem, but if you have an idea for a better way to solve it, I don't doubt that there will be interest in hearing it.

Explicit pinning of user-space pages

Posted Dec 18, 2019 14:28 UTC (Wed) by snajpa (subscriber, #73467) [Link] (2 responses)

I'm not calling anyone lazy, I'm genuinely wondering if it's that hard to change such a "smallish" piece of such a large beast that Linux has become...

Because this seems more like a systematic issue with the development process, if changing a thing like this is seen as such an uphill battle...

Explicit pinning of user-space pages

Posted Dec 18, 2019 15:49 UTC (Wed) by snajpa (subscriber, #73467) [Link] (1 responses)

But I did use the word lazy, however I was talking about a solution, not people. So what if we replaced that word with 'too simple'?

I do understand that the larger the patchset is, the harder it is for anyone to review it. But I believe it is just as hard for anyone to sort through a review process as mad as the "random streams of Re:'s on mailing list" are; they might not reference the thread being replied to properly, and they might even be cross-mailing-list...

I've seen that the debate about better review tooling has already started, this sort of core kernel change would greatly benefit from having something better than such loosely coupled all-over-the-place in-line reviews in mailing lists as the main review channel.

And if the review process gets fixed (and pretty much up to standards of any newer large project), then there will still be a discussion left to be had, how to go about deeper core kernel changes then. It can't be next-to-impossible (or this hard) to change the core internals...

When it comes to generational jumps in the evolution of computing, Linux is still just barely catching up with the last one - which is the multi/many-core approach.

We've been circling around persistent RAM for a while; what happens to Linux when that finally rolls out? Is Linux going to catch up with that in less than 20 years after general availability of such HW? Because I don't think it has truly caught up with the many-core jump (even though it might still be the best OS today for most manycore use-cases, it's far from optimal, and true manycore scalability is still pretty much a TODO for many corners of the kernel).

Explicit pinning of user-space pages

Posted Jan 2, 2020 22:27 UTC (Thu) by Wol (subscriber, #4433) [Link]

> And if the review process gets fixed (and pretty much up to standards of any newer large project), then there will still be a discussion left to be had, how to go about deeper core kernel changes then. It can't be next-to-impossible (or this hard) to change the core internals...

What I think you're missing is that Linus Torvalds is an *engineer*, not an *academic*. And he is VERY strict about "do NOT break userspace".

Linux evolves slowly because yes it carries a lot of technical debt. If they swept that debt away, which would be nice I do agree with you - a LOT of user programs would stop working. And if said programs are not Open Source, and fixable by the user, then the fall-out would be unpleasant to say the least ...

(I'd like to see some effort put into marking stuff as deprecated in the kernel and putting it into #ifdef's so development kernels can be compiled without a lot of technical debt / cruft, but that in itself is a lot of work. That, and making it possible for new kernels to run old kernels in a VM so old programs continue to run, a bit like IBM emulates old hardware every time it brings out something new, so a new Z800 mainframe might run a S390 instance that itself runs an S370 instance that is running CICS or whatever it was that ran on S360s and earlier ...)

Cheers,
Wol