The end of the DAX experiment

By Jonathan Corbet
May 2, 2019

Since its inception, the DAX mechanism (which provides for direct access to files stored on persistent memory) has been seen as somewhat experimental and incomplete. At the 2019 Linux Storage, Filesystem, and Memory-Management Summit, Dan Williams ran a session where he said that perhaps the time has come to end that experiment. Some of the unimplemented DAX features may never actually need to be implemented, and it might just be possible to declare DAX finished. But first there are a few more details to take care of.

He started with the general question of what DAX actually means; he defined it as the mechanism that is used to bypass the page cache and map persistent memory directly into user space. There are obvious performance advantages to doing this, but it causes problems as well: the virtual address given to a page of data is also the address of the permanent storage for that data. Many parts of the system were not designed with that in mind, so various features have to be turned off when DAX is in use, and others must be bypassed. It gives the whole subsystem the feel of a permanent experiment, and that makes people not want to use it.

Within the kernel, there are a few ways to know that DAX is involved with any given memory mapping. A call to vma_is_dax() asks whether a given virtual memory area (VMA) maps pages located in persistent storage. Then there is vma_is_fsdax(), which is similar, except that it also says that some other agent could break that association at any time; it is used to prevent certain long-term mappings to DAX memory. The PTE_DEVMAP page-table flag notes that a given page's lifetime is tied to that of an underlying device; users have to take extra references to that device to keep it from going away. Outside of the kernel, though, there is only one sure indication that DAX is in use: the MAP_SYNC flag to mmap(). If a mapping operation with that flag fails, persistent memory is not present.
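As an illustration of the user-space check described above, here is a minimal sketch of probing a file with MAP_SYNC. It was not shown in the session. The function name probe_map_sync() is hypothetical, and the fallback flag values are assumed to match the generic Linux UAPI headers. MAP_SYNC is only honored in combination with MAP_SHARED_VALIDATE, which makes the kernel reject flags it does not support instead of silently ignoring them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values assumed from the generic Linux UAPI. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: try to map the first len bytes of path with MAP_SYNC.
 * Returns 1 if the mapping succeeds, 0 if the kernel or filesystem refuses
 * the flag, and -1 on any other error (for example, the file cannot be opened).
 */
int probe_map_sync(const char *path, size_t len)
{
    assert(path != NULL && len > 0);    /* a zero-length mmap() always fails */

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    int saved = errno;                  /* close() may overwrite errno */
    close(fd);

    if (p == MAP_FAILED) {
        errno = saved;
        /* EOPNOTSUPP: flag not supported here; EINVAL: older kernels. */
        return (saved == EOPNOTSUPP || saved == EINVAL) ? 0 : -1;
    }

    munmap(p, len);
    return 1;
}
```

A caller would pass the path of a file on the filesystem of interest. A return value of 0 only says that this particular file cannot be mapped synchronously, not why.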

What do we need to finish the experiment? There are still some semantics to figure out, Williams said. There are a lot of applications that are not interested in persistent memory at the level of performing individual cache-flush operations, but they do want to know whether any given mapping involves the page cache. Jan Kara added that such applications are trying to optimize the amount of memory they use; they tend to be programs like a database manager that knows it has the system to itself. If a given file is mapped through ordinary memory, the application must allow for the page cache and reduce its own memory use. If DAX is available, the application can use more memory for its own purposes. Dave Hansen suggested that what is really needed is a way to ask directly about the performance characteristics of a memory mapping.

Williams continued by asking whether there might be a need for a MAP_DIRECT option that asks the kernel to minimize the use of the page cache. Filesystems offer both buffered and direct APIs, with the latter avoiding the page cache; perhaps memory management should do the same. The problem is that these semantics might break with upcoming filesystems like NOVA, which do not use the page cache but also do not offer direct mappings. Applications are often not interested in the "direct" part, but they do care about page-cache usage.

Michal Hocko jumped in with a different point of view: the real problem, he said, is that the page cache simply sucks. It is an implementation detail in current kernels, but might well prove not to be a long-term concern. Rather than adding new APIs to enable applications to get around the page cache, we should just make caching work properly. That would help to eliminate a lot of nasty application code. This suggestion, while interesting, does not solve the short-term problem and so was set aside.

Getting back to DAX, Williams noted that there are a lot of features in the kernel that are not implemented for the DAX case; many places in the code read something like:

    if (dax)
        goto fail;

These include long-term mappings with get_user_pages(), which can pin down DAX pages indefinitely. There are also some problems with reflink() functionality, but that discussion was set aside for the filesystem track.

There are also problems with private mappings and DAX; pages created via copy-on-write will end up in ordinary RAM rather than persistent memory, which may not be what users want. Some may prefer that these copies remain in cheaper memory. Hansen suggested that this problem could be solved with NUMA policies. Williams said that the right solution might be to call back into the underlying filesystem to inform it of a copy-on-write fault and make it allocate a new page to handle it; the filesystem would then have to track those pages specially until they are released. Kara said that he didn't really see a use case for this feature, though. Mel Gorman described it as "difficult semantics and non-obvious behavior for a non-existent use case".
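The copy-on-write behavior behind this discussion can be observed on any file, DAX or not. The following is a minimal sketch rather than anything presented in the session: the function name and the scratch-file location under /tmp are assumptions made for illustration. It writes through a MAP_PRIVATE mapping and then re-reads the file to show that the write landed in a private copy rather than in the file's own storage.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns 0 if a write through a private mapping left the
 * file untouched, 1 if the file changed, and -1 if the setup failed.
 */
int private_write_reaches_file(void)
{
    char path[] = "/tmp/dax-cow-XXXXXX";    /* assumed scratch location */
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file goes away when fd is closed */

    if (write(fd, "A", 1) == 1) {
        char *map = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            char in_file = 0;

            map[0] = 'B';                   /* triggers a copy-on-write fault */
            assert(map[0] == 'B');          /* the private view sees the write */

            /* Read the file itself, bypassing the private mapping. */
            if (pread(fd, &in_file, 1, 0) == 1)
                result = (in_file != 'A');
            munmap(map, 1);
        }
    }
    close(fd);
    return result;
}
```

On a DAX file the same sequence applies; the question raised in the session is only where the kernel places the copied page.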

Time for the session ran out about then. Williams closed by saying that, perhaps, it is not necessary to implement all of those APIs for the DAX case until users scream about their absence. That would allow the DAX experiment to be declared done, more or less, for now.

Index entries for this article
Kernel	Memory management/Nonvolatile memory
Conference	Storage, Filesystem, and Memory-Management Summit/2019


The end of the DAX experiment

Posted May 2, 2019 17:06 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (10 responses)

DAX doesn't exist anyway. You simply can't get DAX devices unless you are a big company that is ready to sign tons of NDAs and build a SKIF to work on the devices (might be exaggerating about SKIF a bit).

As a member of the public, the most you can get are "emulators" - simple DRAMs with a battery backup.

And it's been going on like this for _years_ now.

The end of the DAX experiment

Posted May 2, 2019 17:52 UTC (Thu) by brouhaha (subscriber, #1698) [Link] (5 responses)

Intel Optane DC persistent memory is supposed to be available next month, at around $900 for a 128G DIMM or $2700 for a 256G. Some retailers are taking preorders. No SCIF needed. :-)

The end of the DAX experiment

Posted May 2, 2019 17:57 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

I'll believe it when I see it.

Also, the quoted prices are actually _worse_ than the straight DRAM price.

The end of the DAX experiment

Posted May 2, 2019 21:48 UTC (Thu) by josh (subscriber, #17465) [Link] (3 responses)

Not according to https://www.tomshardware.com/news/intel-optane-dimm-prici...

What DRAM pricing are you basing that on?

The end of the DAX experiment

Posted May 2, 2019 21:54 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Tom's pricing states that the 128GB version is going to cost $695.

You can buy that much DRAM (one stick!) for around $500, though in the form of multiple sticks.

The end of the DAX experiment

Posted May 2, 2019 22:09 UTC (Thu) by josh (subscriber, #17465) [Link]

You can't get a single 128GB stick for anywhere close to that. Memory prices aren't linear by size.

The end of the DAX experiment

Posted May 2, 2019 22:12 UTC (Thu) by farnz (subscriber, #17727) [Link]

Multiple sticks doesn't help when you're DIMM slot limited, though - a single 128 GiB stick is still a lot more than the price of a single Optane 128 GiB stick.

Also, for the sort of systems that can take Optane, you'd be using RDIMMs; a quick check shows that 128 GiB of DDR4 RDIMM is around $800, and takes 4 slots of the typical 16 you have on a board. So Optane is cheaper than DRAM, but not by a huge amount.

The end of the DAX experiment

Posted May 2, 2019 20:57 UTC (Thu) by SEJeff (guest, #51588) [Link]

Did you mean SCIF, as in the place where TS/SCI documents are allowed? Come on, it isn't that secret. You just need a secret squirrel decoder ring, or just wait to buy some Optane persistent memory.

The end of the DAX experiment

Posted May 3, 2019 14:41 UTC (Fri) by luto (subscriber, #39314) [Link] (2 responses)

“Emulator?” A DIMM with a battery is an excellent DAX device. I suspect it’s considerably faster than Optane.

The end of the DAX experiment

Posted May 3, 2019 18:55 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

It's an emulator; emulators can be faster than the real hardware. I was hoping for much more from Optane, but for now it appears to be just slightly cheaper than DRAM while being much slower.

The end of the DAX experiment

Posted May 3, 2019 20:26 UTC (Fri) by luto (subscriber, #39314) [Link]

I think that it’s not an emulator in any respect. It’s not like Optane is somehow the real thing and battery-backed DRAM is pretending. The battery approach was also first.

In some sense, Optane is an emulator. The DIMM protocol was designed to model the way that DRAM works. I assume that Optane goes a bit out of its way to speak the same protocol.