
daxctl() — getting the other half of persistent-memory performance

By Jonathan Corbet
June 26, 2017
Persistent memory promises high-speed, byte-addressable access to storage, with consequent benefits for all kinds of applications. But realizing those benefits has turned out to present a number of challenges for the Linux kernel community. Persistent memory is neither ordinary memory nor ordinary storage, so traditional approaches to memory and storage are not always well suited to this new world. A proposal for a new daxctl() system call, along with the ensuing discussion, shows how hard it can be to get the most out of persistent memory.

The "DAX" mechanism allows an application to map a file in persistent-memory storage directly into its address space, bypassing the kernel's page cache. Thereafter, data in the file can be had via a pointer, with no need for I/O operations or copying the data through RAM. So far, so good, but there is a catch: this mode really only works for applications that are reading data from persistent memory. As soon as the time comes to do a write, things get more complicated. Writes can involve the allocation of blocks on the underlying storage device; they also create metadata updates that must be managed by the filesystem. If those metadata updates are not properly flushed out, the data cannot be considered properly written.

The end result is that applications performing writes to persistent memory must call fsync() to be sure that those writes will not be lost. Even if the developer remembers to make those calls in all the right places, fsync() can create an arbitrary amount of I/O and, thus, impose arbitrary latencies on the calling application. Developers who go to the trouble of using DAX are doing so for performance reasons; such developers tend to respond to ideas like "arbitrary latencies" with poor humor at best. So they have been asking for a better solution.
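To make the problem concrete, here is a minimal sketch of how an application uses DAX today; the path and sizes are hypothetical, error handling is omitted, and a filesystem mounted with the dax option is assumed. Even though the stores land directly in persistent memory, the fsync() call is still needed to make any associated metadata durable:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* A preallocated file on a filesystem mounted with -o dax (hypothetical path) */
        int fd = open("/mnt/pmem/data", O_RDWR);
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        memcpy(p, "hello", 5);    /* the store goes straight to persistent memory */

        /*
         * The store may still be sitting in the CPU caches, and any metadata
         * the write implied lives only in volatile memory; without this call
         * the data cannot be considered durable.
         */
        fsync(fd);

        munmap(p, len);
        close(fd);
        return 0;
    }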

daxctl()

That is why Dan Williams wrote in the introduction to this patch series that "the full promise of byte-addressable access to persistent memory has only been half realized via the filesystem-dax interface". Realizing the other half requires getting the filesystem out of the loop when it comes to write access. If, say, a file could be set up so that no metadata changes would be needed in response to writes, the problem would simply go away. Applications would be able to write to DAX-mapped memory and, as long as they ensured that their own writes were flushed to persistent store (which can be done in user space with a couple of special instructions), there should be no concerns about lost metadata.
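On x86, "a couple of special instructions" means flushing the affected cache lines with CLWB (or CLFLUSHOPT on older processors) and then issuing an SFENCE. A minimal user-space helper, assuming a CLWB-capable CPU and 64-byte cache lines (build with -mclwb), might look like this:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    #define CACHELINE 64    /* assumed cache-line size */

    /* Flush a range of stores out to the persistent-memory media */
    static void flush_to_pmem(const void *addr, size_t len)
    {
        uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHELINE)
            _mm_clwb((void *)p);    /* write the line back without evicting it */
        _mm_sfence();               /* order the flushes before any later stores */
    }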

Williams's proposal to implement this approach requires a couple of steps. The first is that the application needs to call fallocate() to ensure that the file of interest actually has blocks allocated in persistent memory. Then it has to tell the kernel that the file is to be accessed via DAX and that the existing block allocations cannot be changed under any circumstances. That is done with a new system call:

    int daxctl(char *path, int flags, int align);

Here, path indicates the file of interest, flags indicates the desired action, and align is a hint regarding the size of pages that the application would like to use. The DAXFILE_F_STATIC flag, if present, will put the file into the "no changes allowed mode"; if the flag is absent, the file becomes an ordinary file once again. While the static mode is active, any operation on the file that would force metadata changes (changing its length with truncate(), for example) will fail with an error code.
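There is, naturally, no C-library wrapper for a system call that exists only as a patch; an application would presumably invoke it with syscall() after preallocating the file, along these lines. The syscall number, file path, and sizes below are hypothetical; DAXFILE_F_STATIC is the flag from the patch set:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    #define __NR_daxctl       333          /* hypothetical syscall number */
    #define DAXFILE_F_STATIC  (1 << 0)     /* flag from the proposed patch set */

    int main(void)
    {
        int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0600);

        /* Step one: make sure blocks are actually allocated to the file */
        fallocate(fd, 0, 0, 1 << 20);

        /* Step two: pin the block map; metadata changes are now refused */
        syscall(__NR_daxctl, "/mnt/pmem/data", DAXFILE_F_STATIC, 2 << 20 /* 2MB hint */);

        /* ... mmap() the file and write through the mapping directly ... */

        close(fd);
        return 0;
    }

A second call with flags set to zero would return the file to ordinary operation.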

The implementation of this new mode would seem to require significant changes at the filesystem level, but it turns out that this functionality already exists. It is used by the swap subsystem which, when swapping to an ordinary file, needs to know where the blocks allocated to the file reside on disk. There are two pieces to this mechanism, the first of which is this address_space_operations method:

    /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
    sector_t (*bmap)(struct address_space *s, sector_t sector);

A call to bmap() will return the physical block number on which the given sector is located; the swap subsystem uses this information to swap pages directly to the underlying device without involving the filesystem. To ensure that the list of physical blocks corresponding to the swap file does not change, the swap subsystem sets the S_SWAPFILE inode flag on the file. Tests sprinkled throughout the virtual filesystem layer (and the filesystems themselves) will block any operation that would change the layout of a file marked with this flag.
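For illustration, the checks in question look roughly like this simplified sketch (the function itself is hypothetical, but IS_SWAPFILE() and the -ETXTBSY convention are the kernel's own):

    #include <linux/fs.h>

    /* Sketch of a layout-changing operation refusing to touch a pinned file */
    static int example_truncate(struct inode *inode, loff_t newsize)
    {
        if (IS_SWAPFILE(inode))
            return -ETXTBSY;    /* the block layout is pinned; refuse to change it */

        /* ... carry on with the normal truncate path ... */
        return 0;
    }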

This functionality is a close match to what DAX needs to make direct writes to persistent memory safe. So the daxctl() system call has simply repurposed this mechanism, putting the file into the no-metadata-changes mode while not actually swapping to it.

MAP_SYNC

Christoph Hellwig was not slow to register his opposition to this idea. He would rather not see the bmap() method used anywhere else in the kernel; it is, in his opinion, broken in a number of ways. Its use in swapping is also broken, he said, though "we manage to paper over the fact". He suggested that development should be focused instead on making DAX more stable before adding new features.

An alternative approach, proposed by Andy Lutomirski, has been seen before: it was raised (under the name MAP_SYNC) during the "I know what I'm doing" flag discussion in early 2016. The core idea here is to get the filesystem to transparently ensure that any needed metadata changes are always in place before an application is allowed to write to a page affected by those changes. That would be done by write-protecting the affected pages, then flushing any needed changes as part of the process of handling a write fault on one of those pages. In theory, this approach would allow for a lot of use cases blocked by the daxctl() technique, including changing the length of files, copy-on-write semantics, concurrent access, and more. It's a seemingly simple idea that hides a lot of complexity; implementing it would not be trivial.
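The attraction for applications is that all of the special handling would happen at mmap() time. Here is a sketch of how a write path might look under this scheme; the MAP_SYNC value is purely illustrative, since the flag is only a proposal at this point, and flush_to_pmem() is the helper sketched above:

    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_SYNC
    #define MAP_SYNC 0x80000    /* illustrative value; the flag is only a proposal here */
    #endif

    static void flush_to_pmem(const void *addr, size_t len);    /* sketched earlier */

    /* Write len bytes at the start of a DAX file mapped with MAP_SYNC semantics */
    static int pmem_write(int fd, const void *buf, size_t len)
    {
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED)
            return -1;

        /*
         * The filesystem guarantees that any metadata needed to make these
         * pages durable is on stable storage before the write fault that
         * mapped them completes, so no fsync() is needed; flushing the CPU
         * caches from user space is enough.
         */
        memcpy(p, buf, len);
        flush_to_pmem(p, len);

        munmap(p, len);
        return 0;
    }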

Beyond implementation complexity, MAP_SYNC has another problem: it runs counter to the original low-latency goal. Flushing out the metadata changes to a filesystem can be a lengthy and complex task, requiring substantial amounts of CPU time and I/O. Putting that work into the page-fault handler means that page faults can take an arbitrarily long amount of time. As Dave Chinner put it:

Prediction for the MAP_SYNC future: frequent bug reports about huge, unpredictable page fault latencies on DAX files because every so often a page fault is required to sync tens of thousands of unrelated dirty objects because of filesystem journal ordering constraints.

There was some discussion about how the impact of doing metadata updates in the page-fault handler could be reduced, but nobody has come forth with an idea that would reduce it to zero. Those (such as Hellwig) who support the MAP_SYNC approach acknowledge that cost, but see it as being preferable to adding a special-purpose interface that brings its own management difficulties.

On the other hand, this work could lead to improvements to the swap subsystem as well, making it more robust and more compatible with filesystems (like Btrfs) whose copy-on-write semantics work poorly with the "no metadata changes" idea. There is another use case for this functionality: high-speed DMA directly to persistent memory also requires that the filesystem not make any unexpected changes to how the file is mapped. That, and the relative simplicity of Williams's patch, may help to push the daxctl() mechanism through, even though it is not universally popular.

Arguably, the real lesson from this discussion is that persistent memory is not a perfect match to the semantics provided by the Unix API and current filesystems. It may eventually become clear that a different type of interface is needed, at least for applications that want to get maximum performance from this technology. Nobody really knows what that interface should look like yet, though, so the current approach of trying to retrofit new mechanisms onto what we have now would appear to be the best way forward.

Index entries for this article:
Kernel: DAX
Kernel: Memory management/Nonvolatile memory



daxctl() — getting the other half of persistent-memory performance

Posted Jun 27, 2017 0:43 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (14 responses)

Why not create a specialized daxfs that is specifically designed to be friendly to persistent memory?

daxfs

Posted Jun 27, 2017 3:37 UTC (Tue) by corbet (editor, #1) [Link] (9 responses)

People are working on such things; NOVA was mentioned in the discussion, for example. But new filesystems take a long time even when the problem domain is fully understood, and some of the problems are likely to be hard to work around even when starting from scratch. I expect people will get there, but it will take a fair while yet.

daxfs

Posted Jun 27, 2017 3:45 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

Are there any actual user-accessible NVDIMMs? All I can find are battery-backed DIMMs with NAND for long-term storage, with rather poor capacity (8-16Gb).

It looks like there are at least a couple of years in which to make a polished FS. And it should be easier than usual, since you literally have non-volatile memory for all the important structures: simple atomic operations instead of barriers, no mucking with block devices, etc.

daxfs

Posted Jun 27, 2017 13:09 UTC (Tue) by Paf (subscriber, #91811) [Link] (2 responses)

Yes, the Intel/Micron Crystal Ridge persistent memory or whatever they're calling it these days.

It's not at RAM latencies yet, but it can be configured to be byte addressable. (Or at least OEMs can do that - don't remember if the direct to consumer devices from Intel can.)

3DXP

Posted Jun 27, 2017 15:22 UTC (Tue) by willy (subscriber, #9762) [Link] (1 responses)

The DIMMs are not yet for sale. The NVMe devices that are available do not allow for byte-addressable storage.

3DXP

Posted Jun 27, 2017 16:21 UTC (Tue) by Paf (subscriber, #91811) [Link]

Ah, thanks for the clarification.

daxfs

Posted Jul 27, 2017 13:29 UTC (Thu) by kh (guest, #19413) [Link]

Micron & Intel's 3D XPoint:
https://www.micron.com/about/our-innovation/3d-xpoint-tec...
This seems like it is poised to change a lot of things.

daxfs

Posted Jun 28, 2017 5:01 UTC (Wed) by nhippi (subscriber, #34640) [Link] (3 responses)

Alternatively, why not work within the MTD subsystem, which already has support for non-volatile, addressable memories (NOR flash), and already has a bunch of filesystems designed for them too...

daxfs

Posted Jun 29, 2017 11:09 UTC (Thu) by hkario (subscriber, #94864) [Link] (2 responses)

Flash is not byte addressable and has huge erase-block sizes (a "small" one is 16KiB; a typical SSD's is 512KiB).

daxfs

Posted Jun 29, 2017 13:27 UTC (Thu) by nhippi (subscriber, #34640) [Link] (1 responses)

You are thinking of NAND flash; there is also NOR flash, which can be byte addressable.

daxfs

Posted Jun 30, 2017 19:52 UTC (Fri) by alonz (subscriber, #815) [Link]

NOR flash also has large-to-huge erase blocks. It is only byte addressable for reads, not for writes.
Note also that NOR flash appears to be mostly dead outside the deeply-embedded market: the largest NOR-flash device you can purchase right now from Micron is 1 Gb (that is 1 giga-bit, or 128 megabytes) while NAND-flash devices are now at 6 Tb (6 tera-bits, or 750 gigabytes).

daxctl() — getting the other half of persistent-memory performance

Posted Jun 30, 2017 0:28 UTC (Fri) by neilbrown (subscriber, #359) [Link] (3 responses)

Because the underlying need is broader than just DAX.

The underlying need is the ability to access a file using the storage address, without reference to the inode.
- swap-to-file uses this (though as we proved with NFS, it can be done differently)
- DMA-to-storage could use this. (The article mentions "high-speed DMA directly to persistent memory", but the email thread says: "I have this high speed data aquisition hardware and we'd like to DMA data direct to the storage because it's far too slow pushing it through memory and then the OS to get it to storage. How do I map the storage blocks and guarantee the mapping won't change while we are transferring data direct from the hardware?", which isn't necessarily about persistent memory.)

And DAX could benefit from this.

It really makes sense for the functionality to be included in existing filesystems where possible.

daxctl() — getting the other half of persistent-memory performance

Posted Jun 30, 2017 20:07 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Sure. But it looks like the current attempt to shoehorn this into existing filesystems is not a good fit.

After all, they were designed to ship data back and forth between memory and persistent block storage. So a lot of functionality is simply superfluous when you can directly access your storage. And some functionality of filesystems is clearly incompatible with direct access: encryption and data compression, for example. Journals and snapshots have to be designed differently and so on.

With the current DAX efforts it looks like a case of leaking abstractions all the way down. And use cases like DMA to NVRAM only underscore that - you really need guarantees from the filesystem that don't exist right now.

daxctl() — getting the other half of persistent-memory performance

Posted Jul 1, 2017 0:17 UTC (Sat) by neilbrown (subscriber, #359) [Link]

> you really need guarantees from the filesystem that don't exist right now.

True, but they are guarantees that could once have been assumed. It is only for these new-fangled filesystems, which try to be clever, that the requirement isn't trivial. Even for them it is just a case of not being too clever for those files, and of carefully identifying all the places where cleverness might be a problem. So no relocation of blocks, no sharing of blocks, no compression, etc.

From my quick glance at the email thread, I felt that filesystem developers were generally happy to provide the functionality, but needed to agree on details of interface and specific functionality. I particularly liked an idea from Dave Chinner to add a new flag to fallocate(). With this flag the space would be allocated and initialized, the metadata would be persisted, and the inode would be marked so that the storage allocations were immutable. Then the file could be safely used with DAX or swap or whatever. This requires metadata changes to the filesystem which the original patch from Dan appeared to try to avoid, but it is probably a change worth making. If we defined the immutability as lasting only until the file was truncated, then old filesystems could support the new fallocate() flag with no metadata change (I think Dave wanted the flag to imply that truncation was not allowed, which is probably the better choice if you don't care about old filesystems).

daxctl() — getting the other half of persistent-memory performance

Posted Jul 1, 2017 1:23 UTC (Sat) by neilbrown (subscriber, #359) [Link]

I meant to also say: having to have multiple different filesystems for different uses is a royal pain.
Being able to have multiple different filesystems is liberating.
Being required to have them is quite the reverse.
It is bad enough that we have /proc and /sys and /dev/pts and /dev/mqueue and /dev/hugepages and /sys/kernel/debug and /sys/fs/cgroup and .... instead of just one virtual filesystem exporting kernel stuff. Don't make us have to have multiple different filesystems for storing different sorts of data too.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds