Readahead: the documentation I wanted to read


Posted Apr 10, 2022 11:29 UTC (Sun) by donald.buczek (subscriber, #112892)
Parent article: Readahead: the documentation I wanted to read

Thanks for the article and the docs!

A year or so ago I dug my way through the code without the help of your new documentation, because I was looking for a way to fix something I consider sub-optimal behavior of mm: we've noticed that a single sequential read of a single big file (bigger than the memory available for the page cache) triggers the shrinkers and makes you lose all your valuable caches to the never-again-needed data from the big file.

As the readahead code is in a position to detect sequential access patterns and has access to information about the backing file (is it "big"?), I wonder if that would be the right place to detect the scenario and perhaps drop specific pages from that file before more valuable pages and other caches are affected.

I made some experiments with the detection part, which in our use case is complicated by the fact that accesses come over NFS, so they are occasionally out of order. Additional NFS drawbacks: fadvise() data is not available, and alternative mitigations based on cgroups are not available either...

I could more or less identify this pattern and do a printk to show that an opportunity was detected, but I didn't get to the other parts, which would be:

- Decide which pages of the sequential file to sacrifice. LRU might not be optimal here, because if the file is read a second time, it will be read from the beginning again.

- How to drop specific pages from the cache. I guess there are a lot of things that can be done wrongly.

Probably I won't get very far. Maybe other work (multi-generational LRU?) will help in this problem area.



Readahead: the documentation I wanted to read

Posted Apr 10, 2022 18:06 UTC (Sun) by willy (subscriber, #9762)

Thanks for bringing this up! It's a problem I'm aware of and want to fix, but don't have a plan yet for how to fix it.

I am a bit confused that it evicts useful data. The way it *should* work is that the use-once pages go on the inactive list, then the inactive list gets pruned, and the pages on the active list stay there.

Do you see a difference if the files are accessed locally versus over NFS? If so, it may be a bug in NFSd (that it's adding pages to the active list instead of the inactive list, perhaps).

I'm not sure that LWN comments are the best place to help you debug this; would you mind taking this to linux-mm, and/or linux-fsdevel?

I don't think that readahead is the right place to fix this, but it may be the right place to fix a related problem (which might end up fixing your problem). That is, on an SMP (indeed, NUMA) system, a sequential read much larger than memory will end up triggering reclaim, as it should. The problem is that each CPU tries to take pages from the end of the LRU list and then remove them from the page cache. But all the pages belong to the same file, so they all fight over the same i_pages lock and do not make useful progress.

Since readahead is already using the i_pages lock to add the new pages to the page cache, I think it's the right place to remove unwanted pages from the page cache, but as you note, we need to find the right pages to toss (or not ... there's an argument that throwing away the wrong pages is cheaper than finding the right ones ...)

Readahead: the documentation I wanted to read

Posted Apr 11, 2022 8:19 UTC (Mon) by donald.buczek (subscriber, #112892)

> I am a bit confused that it evicts useful data.

Sorry, I was unclear with the term "valuable". I'm not talking about hot pages that are being accessed by the system; those can probably avoid eviction by returning to the active list fast enough. The (possibly) useful data lost, which I talked about, are other inactive pages and data from other caches (namely the dcache). The original user complaint was "`ls` takes ages in the morning". So only when the user took a break was his data replaced. That by itself is not wrong and is just the basic LRU strategy; how should the system know that the user is going to return the next morning? On the other hand, the system *could* notice that a big file, which is never going to fit into the cache, is being read sequentially from the beginning. Keeping the already-processed head of that file when memory is needed is even more likely to be useless, because it will be evicted anyway if the observed pattern continues.

> Do you see a difference if the files are accessed locally versus over NFS?

No, the same is true for access from the local system. NFS is just a complication in the respects I mentioned (sometimes out of order, no fadvise(), no cgroups). In the thread referenced below, I've posted a reproducer script for local file access.
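Whether the head pages of such a file are still resident can also be observed from userspace. A small diagnostic along these lines (not the script posted on the list; mmap()-plus-mincore() based and therefore Linux-specific) counts a file's resident pages:

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Count how many pages of `path` are currently resident in the page
 * cache.  Returns the count, or -1 on error. */
long resident_pages(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;
    }

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    long pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesz - 1) / pagesz;
    unsigned char *vec = malloc(npages);
    long count = -1;

    /* mincore() fills one byte per page; bit 0 set means resident. */
    if (vec && mincore(map, st.st_size, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;
    }
    free(vec);
    munmap(map, st.st_size);
    return count;
}
```

Running this on the big file before and after memory pressure kicks in makes the eviction pattern visible without any kernel instrumentation.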

> would you mind taking this to linux-mm, and/or linux-fsdevel

A colleague of mine did so in August 2021 [1].

Best
Donald

[1]: https://lore.kernel.org/all/878157e2-b065-aaee-f26b-5c87e...

Readahead: the documentation I wanted to read

Posted Apr 11, 2022 13:39 UTC (Mon) by willy (subscriber, #9762)

Ah, I see that in my inbox now ... I read the third email in the chain (the first two went only to xfs?), but didn't read the fourth and fifth. The usual too-much-email problem.

Anyway, I think recognising this special case probably isn't the right solution. Backup is always tricky, and your proposal would fix one-large-file but do nothing for many-small-files.

I suspect the right way to go is to recognise that the page cache is large and has many easily-reclaimable pages, and shrink only the page cache. I.e., the problem is that backup is exerting general memory pressure when we'd really like it to exert pressure only on the page cache. Or rather, we'd like the page cache to initially exert pressure only on the page cache. The dcache should initially exert pressure only on the dcache. Etc. If a cache can't reclaim enough memory easily, then it should pressure other caches to shrink.

Readahead: the documentation I wanted to read

Posted Apr 12, 2022 19:08 UTC (Tue) by donald.buczek (subscriber, #112892)

> Backup is always tricky, and your proposal would fix one-large-file but do nothing for many-small-files.

To be exact, it's not backup. Our maintenance jobs run locally and are tamed via cgroups. It's (other) users. Our scientific users often process rather big files.

> Or rather, we'd like the page cache to initially exert pressure only on the page cache. The dcache should initially exert pressure only on the dcache. Etc. If a cache can't reclaim enough memory easily, then it should pressure other caches to shrink.

This would probably help a lot in the problem area I described, and also in some others. It's good to know that this is on your mind. The negative-dentries discussion mentioned in the later LWN article seems to get into the same field.

Readahead: the documentation I wanted to read

Posted Apr 11, 2022 14:17 UTC (Mon) by bfields (subscriber, #19510)

> Additional NFS drawbacks: fadvise() data not available

There is actually an IO_ADVISE operation that Linux doesn't implement: https://www.rfc-editor.org/rfc/rfc7862.html#section-1.4.2

Maybe there's a good reason it hasn't been implemented yet, but, anyway, it might be another thing worth looking into here.

