
DMA-BUF cache handling: Off the DMA API map (part 2)

June 11, 2020

This article was contributed by John Stultz

Part 1 of this series covered some background on ION, DMA-BUF heaps, the DMA API, and the concept of "ownership" when it comes to handling CPU-cache maintenance, finally ending on a conventional DMA API view of how DMA-BUF cache handling should be done. The article concluded with a discussion of why the traditional DMA APIs can perform poorly on contemporary systems. This article completes the series with an exploration of some of the approaches that DMA-BUF exporters can use to avoid unnecessary cache operations, along with some rough proposals for how we might improve things.

The first part left off with the DMA API's viewpoint that the ownership of a DMA-BUF is transferred to the device domain with a call to dma_buf_map_attachment() and transfers back to the CPU with dma_buf_unmap_attachment(); cache-maintenance operations are done on each call. This sequence provides correctness with regard to CPU-cache handling but, for buffer pipelines that involve many devices where the CPU doesn't actually touch the buffer, these cache operations on each map and unmap call add up and can cause dramatic performance problems.
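To make that sequence concrete, here is a minimal sketch of an importing driver bracketing a single device transfer with those calls; my_device_do_dma() is a hypothetical stand-in for the driver's real DMA-submission path:

    #include <linux/dma-buf.h>
    #include <linux/dma-direction.h>
    #include <linux/err.h>
    #include <linux/scatterlist.h>

    static int my_driver_use_dmabuf(struct device *dev, struct dma_buf *dmabuf)
    {
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
        int ret;

        attach = dma_buf_attach(dmabuf, dev);
        if (IS_ERR(attach))
            return PTR_ERR(attach);

        /* Ownership conceptually moves to the device here; the exporter
           may perform cache maintenance as part of this call. */
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt)) {
            ret = PTR_ERR(sgt);
            goto detach;
        }

        ret = my_device_do_dma(dev, sgt);    /* hypothetical helper */

        /* ... and moves back toward the CPU here, possibly with another
           round of cache maintenance. */
        dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
    detach:
        dma_buf_detach(dmabuf, attach);
        return ret;
    }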

Who owns the buffer?

To avoid these extraneous cache operations, the DMA-BUF interface allows one aspect of the DMA-API rules to be turned upside down. The DMA API, remember, assumes that the CPU is the natural owner of all memory and that it is only during DMA transactions — where buffer ownership has been clearly transferred to the device — that careful rules are needed. The DMA-BUF interfaces, instead, require the CPU to call dma_buf_begin_cpu_access() before accessing a DMA-BUF. After it is done, the CPU calls dma_buf_end_cpu_access(). If the buffer is accessed from user space, these transitions are effected using the DMA_BUF_IOCTL_SYNC ioctl() command instead.

Specifically:

dma_buf_begin_cpu_access()
Allows the exporter to ensure that the memory is actually available for CPU access; the exporter might need to allocate or swap in and pin the backing storage. The exporter also needs to ensure that CPU access is coherent for the requested access direction.

dma_buf_end_cpu_access()
Called when the importer is done accessing the buffer from the CPU. The exporter can use this to flush caches and unpin any resources pinned in dma_buf_begin_cpu_access().
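From user space, the same bracketing is done with the DMA_BUF_IOCTL_SYNC ioctl() mentioned above. A minimal sketch, with error handling omitted and assuming dmabuf_fd and map were set up elsewhere with the heap allocation and mmap():

    #include <linux/dma-buf.h>
    #include <stddef.h>
    #include <sys/ioctl.h>

    static void fill_buffer(int dmabuf_fd, unsigned char *map, size_t len)
    {
        struct dma_buf_sync sync = { 0 };
        size_t i;

        /* Begin CPU access: the exporter can invalidate stale cache lines. */
        sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);

        for (i = 0; i < len; i++)
            map[i] = i & 0xff;

        /* End CPU access: the exporter can flush the CPU's writes. */
        sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    }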

This approach allows the DMA-BUF memory to be viewed as not being owned by the CPU; instead, it can be considered to be in the device domain by default. CPU-cache handling can thus be done primarily in those calls, ensuring that the CPU gets a consistent view of the DMA-BUF while allowing costly cache operations to be avoided in cases where the DMA-BUF is only passed, mapped, and accessed between multiple devices.

However, this inconsistency with the DMA API causes confusion, and not all DMA-BUF exporters take the same approach. Some exporters try to abide by the DMA API, flushing and invalidating the CPU cache on every map and unmap operation; some leave those cache operations to their begin and end CPU-access hooks; and others do both.

While DMA-BUFs were designed to share buffers between user space and multiple devices, the first DMA-BUF exporters were tied to specific drivers, often superseding custom, driver-specific buffer-allocation code. Consider, for example, a GPU driver that allocates a buffer, performs some rendering into it, and provides a handle back to user space. The user-space process could then pass that buffer, and others along with it, back to the GPU to compose the web-browser window with the other windows that were open in a desktop environment. Switching to a DMA-BUF provided a more generic handle type, so it made sense to use one even if the buffers weren't really being shared between multiple devices.

But, knowing that the buffer was shared between just the CPU and the device, the DMA-BUF exporter could optimize some of the cache operations. For instance, some DMA-BUF exporters cache the scatter/gather table resulting from the first DMA mapping operation and, as long as the dma_buf_map_attachment() calls are done in the same direction, reuse that table. In this way, they can avoid expensive cache operations on each dma_buf_map_attachment() and dma_buf_unmap_attachment() call, finally releasing the mapping in dma_buf_detach(). These optimizations work because the exporters are tied to the device, so the buffers aren't really being shared, or the devices the buffers are shared with are cache coherent, so the cache maintenance is unnecessary.
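A simplified sketch of that caching trick, loosely modeled on what such exporters do; the my_attachment bookkeeping and the my_duplicate_sg_table() helper are hypothetical, and locking and error paths are elided:

    #include <linux/dma-buf.h>
    #include <linux/dma-mapping.h>
    #include <linux/err.h>
    #include <linux/scatterlist.h>

    struct my_attachment {
        struct sg_table *cached_sgt;
        enum dma_data_direction cached_dir;
    };

    static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
                                           enum dma_data_direction dir)
    {
        struct my_attachment *a = attach->priv;
        struct sg_table *sgt;
        int nents;

        /* Reuse the first mapping (and skip its cache maintenance) as
           long as the direction hasn't changed. */
        if (a->cached_sgt && a->cached_dir == dir)
            return a->cached_sgt;

        sgt = my_duplicate_sg_table(attach->dmabuf->priv);  /* hypothetical */
        nents = dma_map_sg(attach->dev, sgt->sgl, sgt->orig_nents, dir);
        if (nents <= 0)
            return ERR_PTR(-ENOMEM);
        sgt->nents = nents;

        a->cached_sgt = sgt;
        a->cached_dir = dir;
        return sgt;
    }

    /* The matching unmap_dma_buf() hook becomes a no-op; the mapping is
       finally torn down with dma_unmap_sg() from the detach() hook. */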

This scheme is efficient, but it has resulted in the dozen or so DMA-BUF exporters upstream having different cache-handling and usage semantics. So when we start to look at how to implement generic DMA-BUF exporters that support multiple-device pipelines in a performant way, the rules of the road are not clearly established.

Handling ownership with multiple mappings

While the DMA API provides good guidance for specifying ownership using the map and unmap calls, getting good performance on mobile devices often requires that multiple devices and the CPU all have active mappings to a buffer at the same time. That makes the concept of buffer ownership quite a bit more subtle. For instance, it's common in graphics to have a buffer mapped both by the GPU and a display at the same time. To do this, the system has to be able to share a frame buffer between multiple devices before the frame is finished drawing. This allows the GPU to write to the buffer directly and then signal the display driver, which then can immediately show the buffer.

For this specific use case, DMA-BUF fences built on the explicit fence infrastructure were added, providing a mechanism for a driver (or user space) to wait on a fence that is specific to a buffer. Another driver will eventually signal that fence, initiating the transfer of ownership. However, supporting these parallel mappings requires careful cache handling. Usually this is left to the drivers to do explicitly using the DMA-API synchronization calls. When one is working on a single integrated device with a custom vendor kernel, it is possible to know which driver passes buffers to which and, thus, to be able to add the optimal and correct cache handling. But outside of that controlled environment, it's more complicated.
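The hand-off itself boils down to one driver waiting on, and another signaling, a struct dma_fence associated with the buffer. A rough sketch, in which the my_* helpers are hypothetical placeholders for real driver code:

    #include <linux/dma-fence.h>

    /* Display side: before scanning out the buffer, wait for the
       producer's fence instead of re-mapping the buffer. */
    static int my_display_flip(struct dma_fence *render_done)
    {
        long ret = dma_fence_wait(render_done, true);  /* interruptible */

        if (ret < 0)
            return ret;

        return my_display_program_scanout();   /* hypothetical */
    }

    /* GPU side: when the render job completes, signal the fence,
       effectively handing ownership to the next user of the buffer. */
    static void my_gpu_job_done(struct dma_fence *render_done)
    {
        dma_fence_signal(render_done);
    }

Notice that the exporter does not appear anywhere in this hand-off.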

So we are already seeing two different styles of ownership tracking being handled here. Implicit handling means that the ownership of a DMA-BUF is transferred when the buffer is mapped to (or unmapped from) a device. Explicit handling, instead, is where buffers are already mapped to two (or more) devices and ownership is effectively transferred by DMA-BUF fences.

DMA-BUF exporters are normally responsible for handling the cache operations for buffers as the ownership of the buffer is passed around. They can do so properly in the implicit context of dma_buf_map_attachment() and dma_buf_unmap_attachment() calls or, alternatively, in the dma_buf_begin_cpu_access() and dma_buf_end_cpu_access() calls. However, in the explicit case, the DMA-BUF exporter has no hooks for DMA-BUF fence signals, so the exporter cannot do any cache management in response to the transfer of ownership. This creates a difficult situation, where the responsibility for the buffer's cache management is split between the DMA-BUF exporter and the drivers using the buffer. To do it correctly, each driver must understand its place in the buffer pipeline in order to know the coherency of the device that comes after it.

More problematically, even if the DMA-BUF exporter did have a hook for the DMA fence signals, it has no way of knowing which of the multiple styles of ownership tracking is being used. Do we perform cache operations on map and unmap calls, assuming implicit usage with default CPU ownership? Or on dma_buf_begin_cpu_access() and dma_buf_end_cpu_access(), assuming default device ownership? Or do we avoid that extra overhead and assume drivers will do explicit fence signaling to transfer ownership? These choices may leave us with an implementation that is either too slow to use or potentially incompatible with some drivers. This is quite contrary to the goal of DMA-BUFs functioning as a generic interchange mechanism.

So for someone trying to write a DMA-BUF exporter, this all starts to feel like a level of 10 ("Read the documentation and you'll get it wrong") or 11 ("Follow common convention and you'll get it wrong") on Rusty Russell's classic API scale, especially if you care about performance. This poses a large problem for the goal of having generic DMA-BUF heaps that can be shared between vendors.

Potential solutions

I feel like this situation could be improved, and have a few ideas we might consider. Since the DMA-BUF interface already strays from the DMA API, I think we should establish some explicit conventions for how DMA-BUFs should be used. Better documentation could help, so that both DMA-BUF exporter authors and DMA-BUF users have a solid sense of the model to follow. In particular, we should try to:

  • Create a formal sense of ownership for DMA-BUF objects outside of the implicit map/unmap method that the DMA API specifies.
  • Work to provide some mechanism to formally track that ownership. Such tracking hooks could be added to the dma_buf_ops structure so that the exporter can be informed of ownership changes (see the illustrative sketch after this list).
  • Deprecate implicit handling and move drivers to use this mechanism to mark the explicit ownership transfers.
  • Add some state tracking for DMA-BUFs so that we know their cache state and we can correctly perform cache operations only on transitions of ownership that make them necessary.
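To illustrate what the tracking and state-keeping items above could look like, here is a purely hypothetical sketch; none of this exists in today's dma_buf_ops, and the hook name, the enum, and the state fields are all invented for illustration only:

    #include <linux/dma-buf.h>

    /* Hypothetical ownership states and per-buffer state tracking,
       invented here purely to illustrate the proposal. */
    enum dma_buf_owner {
        DMA_BUF_OWNER_DEVICE,   /* the proposed default */
        DMA_BUF_OWNER_CPU,
    };

    struct my_buffer_state {
        enum dma_buf_owner owner;
        bool cpu_dirty;      /* CPU wrote since the last flush */
        bool device_dirty;   /* a device wrote since the last invalidate */
    };

    /* Hypothetical dma_buf_ops hook, called whenever ownership changes
       hands (including on a fence hand-off), so cache maintenance is
       done exactly once per transition that needs it. */
    static int my_transfer_ownership(struct dma_buf *dmabuf,
                                     enum dma_buf_owner new_owner)
    {
        struct my_buffer_state *state = dmabuf->priv;

        if (state->owner == new_owner)
            return 0;   /* no transition, no cache work */

        if (new_owner == DMA_BUF_OWNER_CPU && state->device_dirty)
            my_invalidate_cpu_caches(dmabuf);   /* hypothetical */
        else if (new_owner == DMA_BUF_OWNER_DEVICE && state->cpu_dirty)
            my_flush_cpu_caches(dmabuf);        /* hypothetical */

        state->owner = new_owner;
        state->cpu_dirty = false;
        state->device_dirty = false;
        return 0;
    }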

It may be that most of the above can be achieved with documentation and consolidating the current DMA-BUF exporter implementation semantics. The dma_buf_begin_cpu_access() and dma_buf_end_cpu_access() calls are sufficient to handle device-to-CPU and CPU-to-device transitions. But their proper use needs to be explicitly defined as such and consistently implemented by DMA-BUF exporters, formalizing the concept that buffers are device-owned by default. This would allow for safely implementing pre-flushed buffers and skipping unnecessary synchronization operations.
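As a minimal sketch of what that consolidated convention could look like in an exporter (the my_buffer bookkeeping is hypothetical, and locking of the attachment list is elided for brevity), cache maintenance happens only when the CPU enters and leaves the picture, so purely device-to-device hops need no cache work at all:

    #include <linux/dma-buf.h>
    #include <linux/dma-mapping.h>
    #include <linux/list.h>

    /* Hypothetical exporter-private bookkeeping. */
    struct my_buffer {
        struct sg_table *sgt;   /* pages backing the buffer */
    };

    static int my_begin_cpu_access(struct dma_buf *dmabuf,
                                   enum dma_data_direction dir)
    {
        struct my_buffer *buf = dmabuf->priv;
        struct dma_buf_attachment *a;

        /* The buffer is device-owned by default; only now does the CPU
           need a coherent view, so sync for the CPU here (and only here). */
        list_for_each_entry(a, &dmabuf->attachments, node)
            dma_sync_sg_for_cpu(a->dev, buf->sgt->sgl,
                                buf->sgt->orig_nents, dir);
        return 0;
    }

    static int my_end_cpu_access(struct dma_buf *dmabuf,
                                 enum dma_data_direction dir)
    {
        struct my_buffer *buf = dmabuf->priv;
        struct dma_buf_attachment *a;

        /* Hand the buffer back to the device domain: flush CPU writes. */
        list_for_each_entry(a, &dmabuf->attachments, node)
            dma_sync_sg_for_device(a->dev, buf->sgt->sgl,
                                   buf->sgt->orig_nents, dir);
        return 0;
    }

    static const struct dma_buf_ops my_dma_buf_ops = {
        /* .map_dma_buf() and .unmap_dma_buf() reuse a cached mapping and
           do no per-call cache maintenance, as described earlier. */
        .begin_cpu_access = my_begin_cpu_access,
        .end_cpu_access   = my_end_cpu_access,
    };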

However, a drawback with this approach alone is that, for a series of CPU accesses in a row, there would be an unnecessary flush on each call. Additionally, there is some concern that, with a mix of CPU-coherent and non-coherent devices, we may need to do CPU-cache handling when transferring ownership between such devices. Both of these situations might make it useful to have some device-usage bracketing calls along with some state tracking, so that ownership transfer (and not just usage) could be determined.

This concept of ownership will also need to consider future work to support partial cache flushes, allowing both the CPU and device to be carefully working on the same buffer at the same time. Thus, ownership (and the respective cache operations) would be managed on the granularity of a single cache line, rather than the entire buffer, possibly looking more like advisory range locks on a file.

The DMA-BUF heaps interface (along with ION that came before it) concedes that, in some cases, user space knows more about how a buffer will be used than the kernel does. Thus, it can be optimal to let user space choose the allocation type for a given pipeline. The DMA-BUF design, which allows the rules and policy for a buffer to be left to the exporter implementations, provides a lot of useful flexibility, which I don't want to eliminate. However, I do think that, as vendors start their ION migration efforts, having clear, established conventions that don't have large pits to fall into will be important to avoid a collection of unnecessarily incompatible heaps and users. Hopefully this stirs some further discussion.

Thanks

Thanks so much to Rob Clark, Robert Foss, Sumit Semwal, Azam Sadiq Pasha Kapatrala Syed, Daniel Vetter, and Linus Walleij for their early reviews and feedback on this series.

Index entries for this article
Kernel: Device drivers/Support APIs
Kernel: Direct memory access
GuestArticles: Stultz, John



DMA-BUF cache handling: Off the DMA API map (part 2)

Posted Jun 12, 2020 6:27 UTC (Fri) by blackwood (guest, #44174) [Link]

Maybe some lingo clarification: dma-fence wasn't created for explicit fencing; it predates that by quite a while, having first been used for implicit fencing. dma-fence is simply the kernel-internal construct for pipelined (see below) device synchronization, and explicit vs. implicit fencing are two flavours of how this looks to userspace and how userspace controls access and ordering when passing a dma-buf around.

The other bit that's new or untraditional lingo in the article, at least for GPU heads: "explicit handling" using dma-fence is usually called "pipelined access", since it's used to queue up an entire processing pipeline across multiple components or engines and then let it all run asynchronously to completion, with dma-fence providing hand-off and ordering. "Implicit handling" is usually called synchronous handling/access, since the CPU synchronously hands the buffer to each device and waits for that device to finish before going on to the next one. Generally, code in drivers/gpu uses the pipelined model, and code everywhere else (mostly drivers/media) uses the synchronous model for dma-buf access.

DMA-BUF cache handling: Off the DMA API map (part 2)

Posted Jun 12, 2020 9:58 UTC (Fri) by Karellen (subscriber, #67644) [Link]

Really interesting article, but thanks so much for the "Classic API Scale" link. I'd not seen that before.

It was even better to go back to the first slide in that run and play them all forwards.

DMA-BUF cache handling: Off the DMA API map (part 2)

Posted Jun 15, 2020 4:20 UTC (Mon) by alison (subscriber, #63752) [Link]

Isn't the main cause of difficulty that the CPU and the device need so much detailed information about each other? Don't most subsystems solve this problem with an intermediary "host" library that abstracts these details away? Perhaps the required speed for cache operations won't allow the luxury of an additional abstraction layer?

Great article, by the way.

DMA-BUF cache handling: Off the DMA API map (part 2)

Posted Jun 21, 2020 13:44 UTC (Sun) by hexiaolong2008 (guest, #108652) [Link]

Hi John,

Thanks for these two articles! They are really great, and I've translated them into Chinese on my blog (all rights reserved):

Part 1: https://blog.csdn.net/hexiaolong2009/article/details/1067...
Part 2: https://blog.csdn.net/hexiaolong2009/article/details/1068...

But I still have some questions about the following paragraph:

> "But, knowing that the buffer was shared between just the CPU and the device, the DMA-BUF exporter could optimize some of the cache operations. For instance, some DMA-BUF exporters cache the scatter/gather table resulting from the first DMA mapping operation and, as long as the dma_buf_map_attachment() calls are done in the same direction, reuse that table. In this way, they can avoid expensive cache operations on each dma_buf_map_attachment() and dma_buf_unmap_attachment() call, finally releasing the mapping in dma_buf_detach()."

I'm confused by the first sentence. I think the optimization example you give in this paragraph is only used for multi-device buffer sharing with no CPU access, so we can avoid the cache operations on each device. But how should I understand the words "knowing that the buffer was shared between just the CPU and the device"? I can't find any relationship between that sentence and the example that follows it. Do you mean that the cache operations are only necessary when the buffer is shared between just the CPU and the device, so we can optimize them away for multi-device buffer sharing that has no CPU access?

> "These optimizations work because the exporters are tied to the device, so the buffers aren't really being shared, or the devices the buffers are shared with are cache coherent, so the cache maintenance is unnecessary."

How should I understand the words "so the buffers aren't really being shared"? I think the buffer is still shared with other devices. Or do you mean that only a buffer shared just between the CPU and a device can be called "shared", while a buffer shared between devices without CPU access cannot?

Maybe my English is too poor for me to understand this paragraph. But I really want to know what these two sentences mean exactly. Waiting for your reply. Thanks!

