HK1172108B - Shared virtual memory
- Publication number
- HK1172108B
- Authority
- HK
- Hong Kong
- Prior art keywords
- gpu
- cpu
- page
- shared
- pci
- Prior art date
- 2008-11-13
Description
Background
This disclosure relates generally to shared virtual memory implementations.
The computing industry is moving toward a heterogeneous platform architecture consisting of a general-purpose CPU together with a programmable GPU attached as a discrete or integrated device. These GPUs are connected over coherent or non-coherent interconnects, may have different instruction set architectures (ISAs), and may run their own operating systems.
Computing platforms consisting of a combination of general-purpose processors (CPUs) and graphics processors (GPUs) are ubiquitous, particularly in the client computing space. Today, almost all desktop and notebook platforms ship with one or more CPUs and an integrated or discrete GPU. For example, some platforms pair a processor with an integrated graphics chipset, while others use a discrete graphics processor connected through an interface such as PCI-Express. Some platforms ship as a combination of a CPU and a GPU; for example, some are more tightly integrated CPU-GPU platforms, while others add a discrete graphics processor to supplement the integrated GPU offering.
These CPU-GPU platforms may provide significant performance gains on non-graphics workloads in image processing, medical imaging, data mining, and other domains. The large number of data-parallel GPU cores may be used to achieve high throughput on the highly parallel portions of code. Heterogeneous CPU-GPU platforms, however, have a number of unique architectural constraints, such as the following:
● GPUs may be attached in an integrated or a discrete manner. For example, some graphics processors are integrated with the chipset, while other current GPUs are attached in a discrete manner over an interface such as PCI-Express. While hardware may provide cache coherence between a CPU and an integrated graphics processor, it is difficult to do so for a discrete GPU. A system may also have a hybrid configuration in which a low-power, lower-performance GPU is integrated with the CPU alongside a higher-performance discrete GPU. Finally, a platform may also have multiple GPU cards.
● the CPU and GPU may run different operating systems. For example, each processor may have its own operating system kernel. This means that the virtual memory translation mechanism may differ between the CPU and the GPU: the same virtual address may be mapped simultaneously to two different physical addresses by two different page tables on the CPU and the GPU. It also means that the system environment (loader, linker, etc.) may differ between the CPU and the GPU. For example, the loader may load an application at different base addresses on the CPU and the GPU.
● the CPU and GPU may have different ISAs, and thus the same code may not be able to run on both processors.
Brief Description of Drawings
FIG. 1 is a diagram of a CPU-GPU memory model, according to one embodiment.
FIG. 2 is a flow diagram of one embodiment of a shared memory model for adding ownership rights.
FIG. 3 is a flow diagram of one embodiment of a shared memory model.
FIG. 4 is a flow diagram of one embodiment of a shared memory model utilizing a PCI aperture.
FIG. 5 is a flow diagram of one embodiment of a shared memory model utilizing a PCI aperture.
FIG. 6 is a flow diagram of one embodiment of a shared memory model in operation.
Detailed Description
Various embodiments of the present invention provide a programming model for CPU-GPU platforms. In particular, embodiments of the present invention provide a unified programming model for both integrated and discrete devices. The model may also work uniformly for multi-GPU-card and hybrid GPU systems (discrete plus integrated). This allows a software vendor to write a single application stack and target it to all of the different platforms. Furthermore, embodiments of the present invention provide a shared memory model between the CPU and the GPU. Instead of sharing the entire virtual address space, only a portion of the virtual address space needs to be shared. This allows an efficient implementation in both the discrete and integrated settings. Further, language annotations may be used to demarcate code that must run on the GPU. Language support may be extended to include features such as function pointers.
Embodiments of the shared memory model provide a novel programming paradigm. In particular, data structures may be seamlessly shared between the CPU and the GPU, and pointers may be passed from one side to the other without requiring any marshalling. For example, in one embodiment, a game engine includes physics, artificial intelligence (AI), and rendering components. The physics and AI code is best executed on the CPU, while the rendering is best executed on the GPU. Data structures such as the scene graph may need to be shared between the CPU and the GPU. Such an execution model may not be feasible in some current programming environments because the scene graph would have to be serialized (marshalled) back and forth. However, in embodiments of the shared memory model, the scene graph may simply reside in shared memory and be accessed by both the CPU and the GPU.
In one embodiment, a full programming environment, including language and runtime support, is implemented. A number of highly parallel non-graphics workloads may be ported to this environment. The implementation may work on heterogeneous operating systems, i.e., with different operating systems running on the CPU and the GPU. Moreover, user-level communication may be allowed between the CPU and the GPU. This may make the application stack more efficient, since the overhead of the OS driver stack in CPU-GPU communication may be eliminated. The programming environment may be ported to two different heterogeneous CPU-GPU platform simulators: one simulating the GPU attached to the CPU as a discrete device, and the other simulating an integrated CPU-GPU platform.
In summary, embodiments of a programming model for a CPU-GPU platform may:
● provide a unified programming model for discrete, integrated, multi-GPU-card, and hybrid GPU configurations.
● provide shared memory semantics between the CPU and the GPU, allowing pointers to be passed freely and data structures to be shared between the CPU and the GPU.
● be implemented in a CPU-GPU platform with heterogeneity, i.e., with different ISAs and different operating systems on the CPU and the GPU.
● enable user-level communication between the CPU and the GPU, thus making the application stack more efficient.
Memory model
FIG. 1 is an illustration of a CPU-GPU memory model according to one embodiment. In one embodiment, the memory model 100 provides a shared virtual address window 130 between the CPU 110 and the GPU 120, as in a partitioned global address space (PGAS) language. Any data structure that is shared between the CPU 110 and the GPU 120 typically must be allocated by the programmer in this space 130. The system may provide a special memory allocation (malloc) function that allocates data in this space 130. Static variables may be annotated with a type qualifier so that they are allocated in the shared window 130. However, unlike PGAS languages, there is no notion of affinity in the shared window. This is because data in the shared space 130 migrates between the CPU and GPU caches as it is used by each processor. Also unlike PGAS implementations, the representation of pointers does not change between the shared and private spaces. The remaining virtual address space is private to the CPU 110 and the GPU 120. By default, data is allocated in this private space and is not visible to the other side. This partitioned address space approach may cut down on the amount of memory that needs to be kept coherent and enables a more efficient implementation for discrete devices.
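For instance, allocation into the shared window might look like the following minimal sketch, which uses the sharedMalloc/privateMalloc runtime calls and the shared type qualifier introduced later in this description; the variable names are illustrative only:
shared int frameCount;  // static variable annotated so that it is allocated in the shared window 130
shared float*samples=(shared float*)sharedMalloc(1024*sizeof(float));  // heap data placed in the shared window
float*scratch=(float*)privateMalloc(1024*sizeof(float));  // default case: private data, not visible to the other side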
Embodiments of the memory model can be extended to multiple GPUs and hybrid configurations. In particular, the shared virtual address window may extend across all devices. Any data structures allocated in the shared address window 130 may be visible to all agents and pointers in this space may be swapped freely. In addition, each agent has its own private memory.
Release consistency in the shared address space may be used for several reasons. First, the system need only remember all the writes between consecutive release points, rather than the sequence of individual writes. This makes it easier to do bulk transfers at release points (such as several pages at a time), which is important in a discrete configuration. Second, it allows memory updates to be kept entirely local until the release point, which is also important in a discrete configuration. Third, the release consistency model is a good match for the programming model of a CPU-GPU platform because there are natural release and acquire points. For example, a call from the CPU to the GPU is one such point: all CPU updates made before the call need to be visible to the GPU, and it does not make sense to impose any ordering on how the CPU updates become visible, as long as they are all visible before the GPU begins execution. Further, the proposed C/C++ memory model can be mapped easily onto the shared memory space. In general, race-free programs are not affected by the weaker consistency model of the shared memory space. The implementation need not be constrained to provide stronger guarantees for racy programs. However, different embodiments may choose to provide different consistency models for the shared space.
FIG. 2 is a flow diagram of one embodiment of a shared memory model with added ownership rights. Sequence 200 may be implemented in firmware, software, or hardware. Software embodiments may be stored on a computer-readable medium such as an optical disk, a magnetic disk, or a semiconductor memory. In particular, ownership rights may be added to embodiments of the shared memory model to enable further coherence optimizations. Within the shared virtual address window, the CPU or the GPU may specify that it owns a particular chunk of addresses (block 210). If an address range in the shared window is owned by the CPU (block 220), then the CPU knows that the GPU cannot access those addresses and therefore does not need to maintain coherence of those addresses with the GPU (block 230). It may, for example, avoid sending any snoops or other coherence information to the GPU. The same holds for addresses owned by the GPU. If the GPU accesses a CPU-owned address, the address becomes un-owned (with symmetric behavior for GPU-owned addresses). Alternatively, an access by the GPU (CPU) to an address owned by the CPU (GPU) may trigger an error condition.
Embodiments of the invention may provide these ownership rights to leverage a common CPU-GPU usage model. For example, the CPU first accesses some data (such as initializing a data structure), then hands it over to the GPU (such as to compute on the data structure in a data-parallel manner), and the CPU then analyzes the results of the computation, and so on. The ownership rights allow an application to inform the system of this temporal locality and optimize the coherence implementation. Note that these ownership rights are optimization hints only, and the system may legally ignore them.
Privatization and globalization
In one embodiment, shared data may be privatized by copying it from the shared space to the private space. Pointer-free data structures may be privatized simply by copying the memory contents. When copying a pointer-containing data structure, however, pointers into shared data must be converted into pointers into private data.
Private data may be globalized by copying it from the private space to the shared space, making it visible to other computations. Pointer-free data structures may be globalized simply by copying the memory contents. When copying a pointer-containing data structure, pointers into private data must be converted into pointers into shared data (the converse of the privatization case).
For example, in one embodiment, consider a linked list of nodes in private and shared spaces. The type definition for the private linked list is standard:
typedef struct {
    int val;     // simply an int field
    Node* next;  // pointer to the next node in the private space
} Node;
The type definition for the shared linked list is as follows. Note that the pointer to the next node is defined to be placed in the shared space. The user must explicitly declare the private and shared versions of the type.
typedef struct {
    shared int val;
    shared Node* shared next;  // the next pointer itself is placed in the shared space
} shared Node;
The user can now explicitly copy the private linked list to the shared space by:
…
myNode=(shared Node*)sharedMalloc(..);
// head is a pointer to the first node of the private linked list
myNode->val=head->val;
myNode->next=(shared Node*)sharedMalloc(..);
…
The runtime APIs used by the compiler are shown below:
// allocate and free memory in the private address space;
// these map onto the regular malloc and free
void* privateMalloc(int size);
void privateFree(void* ptr);
// allocate and free memory in the shared address space
shared void* sharedMalloc(size_t size);
void sharedFree(shared void* ptr);
// memory consistency operations for the shared memory region
void sharedAcquire();
void sharedRelease();
Finally, the runtime also provides APIs for mutexes and barriers to allow applications to perform explicit synchronization. These constructs are typically allocated in the shared region.
The language provides natural acquire and release points. For example, a call from the CPU to the GPU is a release point on the CPU followed by an acquire point on the GPU. Similarly, a return from the GPU is a release point on the GPU and an acquire point on the CPU. Taking ownership of a mutex and releasing a mutex are, respectively, an acquire point and a release point for the processor performing the mutex operation, while arriving at a barrier and leaving a barrier are, likewise, a release point and an acquire point.
In one embodiment, the runtime system may provide API calls for ownership acquisition and release. For example, sharedMemoryAcquire() and sharedMemoryRelease() may acquire and release ownership of the entire memory range. Alternatively, the system may provide sharedMemoryAcquire(addr, len) and sharedMemoryRelease(addr, len) to acquire ownership within a particular address range.
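As a sketch of the usage model described earlier — sharedMalloc, sharedMemoryAcquire, and sharedMemoryRelease are the runtime calls named in this description, while Grid, initGrid, gpuCompute, and analyze are hypothetical application-level names used only for illustration:
shared Grid*g=(shared Grid*)sharedMalloc(sizeof(Grid));
sharedMemoryAcquire(g, sizeof(Grid));  // the CPU takes ownership of the block; no coherence traffic to the GPU is needed
initGrid(g);                           // the CPU initializes the data structure
sharedMemoryRelease(g, sizeof(Grid));  // hand the block over before calling the GPU
gpuCompute(g);                         // call to the GPU: a release point on the CPU and an acquire point on the GPU
analyze(g);                            // after the return from the GPU, the CPU analyzes the results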
Implementation
In one embodiment, the compiler generates two binaries: one for execution on the GPU and one for execution on the CPU. Two different executables are generated because the two operating systems may have different executable formats. The GPU binary contains the code that will execute on the GPU, while the CPU binary contains the CPU functions. The runtime library has CPU and GPU components that are linked with the CPU and GPU application binaries to create the CPU and GPU executables. When the CPU binary starts executing, it calls a runtime function that loads the GPU executable. Both the CPU and GPU binaries create a daemon thread used for CPU-GPU communication.
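As an illustration only — loadGpuExecutable and gpuDaemonLoop below are hypothetical names, since the description states that such a runtime function and daemon thread exist but does not name them — the CPU-side startup might be sketched as:
#include <pthread.h>
// hypothetical runtime entry points
extern void loadGpuExecutable(const char*path);  // loads the GPU executable when the CPU binary starts
extern void*gpuDaemonLoop(void*arg);             // daemon thread servicing CPU-GPU communication
int main(int argc, char**argv){
    loadGpuExecutable("app.gpu");                // runtime call made at CPU startup
    pthread_t daemon;
    pthread_create(&daemon, NULL, gpuDaemonLoop, NULL);
    /* ... application work, including calls into GPU code ... */
    return 0;
}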
Implementing CPU-GPU shared memory
FIG. 3 is a flow diagram of one embodiment of a shared memory model. Sequence 300 may be implemented in firmware, software, or hardware. In one embodiment, the CPU and the GPU may have different page tables and different virtual-to-physical memory translations (block 310). Thus, to synchronize the contents of a virtual address V between the CPU and the GPU (such as at a release point), the contents of the different physical addresses (such as P1 on the CPU and P2 on the GPU) must be synchronized (block 320). However, the CPU may not have access to the GPU's page tables (and therefore does not know P2), and the GPU may not have access to the CPU's page tables (and therefore does not know P1).
This problem may be solved by using the PCI aperture in a novel way. FIG. 4 is a flow diagram of one embodiment of a shared memory model utilizing a PCI aperture. Sequence 400 may be implemented in firmware, software, or hardware. At initialization, a portion of the PCI aperture space may be mapped into the user space of the application and instantiated with a task queue, a message queue, and copy buffers (block 410). When a page needs to be copied (e.g., from the CPU to the GPU) (block 420), the runtime copies the page into a PCI aperture copy buffer and tags the buffer with the virtual address and the process identifier (block 430). On the GPU side, the daemon thread copies the contents of the buffer into its address space by using the virtual address tag (block 440). Thus, the copy is performed as a two-step process: the CPU copies from its address space into a common buffer (the PCI aperture) accessible to both the CPU and the GPU, and the GPU then picks the page up from the common buffer into its own address space. GPU-to-CPU copies are done in a similar way. Because the aperture is pinned memory, its contents are not lost if the CPU or GPU process is context-switched out. This allows the two processors to execute asynchronously, which is critical because the two processors may have different operating systems and hence their context switches may not be synchronized. Further, the aperture space may be mapped into the user space of the application, enabling user-level CPU-GPU communication. This makes the application stack far more efficient than going through the OS driver stack.
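A minimal sketch of the copy buffer and the two-step copy, assuming a 4 KB page size; the structure and function names (CopyBuffer, cpu_send_page, gpu_receive_page) are assumptions for illustration, since the description only fixes the virtual-address and process-identifier tagging:
#include <stdint.h>
#include <string.h>
#define PAGE_SIZE 4096  // assumed page size
typedef struct {
    uint64_t virt_addr;       // virtual address tag of the page being transferred
    uint32_t pid;             // process identifier tag
    uint8_t data[PAGE_SIZE];  // page contents
} CopyBuffer;                 // one copy buffer residing in the PCI aperture
// step 1 (CPU side): copy the page from the CPU address space into the aperture buffer
void cpu_send_page(CopyBuffer*buf, const void*page, uint64_t va, uint32_t pid){
    buf->virt_addr=va;
    buf->pid=pid;
    memcpy(buf->data, page, PAGE_SIZE);
}
// step 2 (GPU side): the daemon thread uses the virtual address tag to place the
// contents at the same virtual address in its own address space
void gpu_receive_page(const CopyBuffer*buf){
    void*dest=(void*)(uintptr_t)buf->virt_addr;  // same VA, translated by the GPU page tables
    memcpy(dest, buf->data, PAGE_SIZE);
}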
Embodiments of the invention may also exploit another difference between traditional software distributed shared memory (DSM) and CPU-GPU platforms. Conventional DSMs are designed to scale to medium or large clusters. In contrast, a CPU-GPU system is a very small cluster: it is unlikely that more than a handful of GPU cards and CPU sockets will be used in the foreseeable future. In addition, the PCI aperture provides a convenient physical memory space that can be shared between the different processors.
Embodiments of the present invention are able to centralize a number of data structures and make the implementation more efficient. FIG. 5 is a flow diagram of one embodiment of a shared memory model utilizing a PCI aperture. Sequence 500 may be implemented in firmware, software, or hardware. Referring to block 510, a directory containing metadata about the pages in the shared address region is placed in the PCI aperture. The metadata indicates whether the CPU or the GPU holds the home (authoritative) copy of a page, and includes a version number that tracks the number of updates to the page, a mutex that is acquired before updating the page, and miscellaneous other metadata. The directory may be indexed by the virtual address of a page (block 520). Both the CPU and the GPU runtime systems maintain a similar private structure that contains the local access permissions for the pages and the local version numbers of the pages.
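The exact directory layout is not spelled out; as a hedged sketch, one entry per shared page might carry the metadata described above (the names DirEntry, LocalPageState, and the field set are assumptions):
#include <stdint.h>
typedef enum { HOME_CPU, HOME_GPU } HomeLocation;
// one directory entry per page in the shared region, kept in the PCI aperture
// and indexed by the page's virtual address
typedef struct {
    uint64_t page_va;    // virtual address of the page (index key)
    HomeLocation home;   // which side currently holds the home copy of the page
    uint32_t version;    // directory version number, tracks the number of updates to the page
    uint32_t lock;       // lock word acquired before the page or its metadata is updated
} DirEntry;
// private per-runtime shadow maintained by both the CPU and the GPU
typedef struct {
    uint64_t page_va;
    uint32_t local_version;  // last version observed by this side
    int access;              // local access permission: none, read-only, or read-write
} LocalPageState;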
FIG. 6 is a flow diagram of one embodiment of a shared memory model in operation. In one embodiment, sequence 600 may be implemented in firmware, software, or hardware. When the GPU performs an acquire operation (block 610), the corresponding pages may be set to no-access on the GPU (block 620). On a subsequent read operation, if the page has been updated and released by the CPU since the last GPU acquire (block 630), the page fault handler on the GPU copies the page in from the CPU (block 640). The directory and the private version numbers may be used to determine this. The page is then set to read-only (block 650). On a subsequent write operation, the page fault handler creates a backup copy of the page, marks the page as read-write, and increments the local version number of the page (block 660). At a release point, a diff is performed against the backup copy of the page and the changes are sent to the home location, while the directory version number is incremented (block 670). The diff operation computes the differences between the memory contents of the two pages (i.e., the page and its backup) to find the changes that have been made. The CPU operates in a symmetric fashion. Thus, between acquire and release points, the GPU and the CPU operate entirely out of their local memory and caches, and they communicate with each other only at explicit synchronization points.
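A minimal sketch of the diff step at a release point, assuming a 4 KB page and a hypothetical send_to_home() transport over the PCI aperture; a real implementation would presumably batch larger ranges:
#include <stddef.h>
#include <stdint.h>
#define PAGE_SIZE 4096
// hypothetical transport: send len changed bytes starting at offset within the page to the home location
extern void send_to_home(uint64_t page_va, size_t offset, const uint8_t*bytes, size_t len);
// compare the working page with its backup copy and transmit only the changed runs
void diff_and_send(uint64_t page_va, const uint8_t*page, const uint8_t*backup){
    size_t i=0;
    while(i<PAGE_SIZE){
        if(page[i]!=backup[i]){
            size_t start=i;
            while(i<PAGE_SIZE && page[i]!=backup[i]) i++;  // extend the run of changed bytes
            send_to_home(page_va, start, page+start, i-start);
        } else {
            i++;
        }
    }
}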
At startup, the present embodiment determines the virtual address range to be shared between the CPU and the GPU and ensures that this address range always remains mapped (such as by using mmap on Linux). The address range may grow dynamically and need not be contiguous, although in a 64-bit address space the runtime system may initially reserve a large contiguous chunk.
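On Linux, for example, the reservation could be a single anonymous mmap that stays mapped for the lifetime of the process; the 4 GB size below is an illustrative assumption, not a figure from this description:
#include <sys/mman.h>
#include <stdio.h>
#define SHARED_WINDOW_SIZE (1ULL<<32)  // assume a 4 GB shared window for illustration
int main(void){
    // reserve a contiguous block of virtual address space for the shared window
    void*shared_base=mmap(NULL, SHARED_WINDOW_SIZE, PROT_READ|PROT_WRITE,
                          MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    if(shared_base==MAP_FAILED){
        perror("mmap");
        return 1;
    }
    printf("shared window reserved at %p\n", shared_base);
    return 0;
}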
Embodiments of the invention may be implemented in a processor-based system, which in one embodiment may include a general-purpose processor coupled to a chipset. A chipset may be coupled to the system memory and the graphics processor. A graphics processor may be coupled to a frame buffer and, in turn, to a display. In one embodiment, the embodiments of the invention illustrated in FIGS. 1-6 may be implemented as software stored in a computer-readable medium (such as system memory). However, embodiments of the invention may also be implemented as hardware or firmware.
Conclusion
Embodiments of the present programming model provide a shared memory model for CPU-GPU platforms that enables fine-grained coherence between the CPU and the GPU. The unified programming model may be implemented for both discrete and integrated configurations, as well as for multi-GPU and hybrid configurations. User annotations are used to demarcate code for CPU and GPU execution. User-level communication may be provided between the CPU and the GPU, thus eliminating the overhead of OS driver calls. A complete software stack, including compiler and runtime support, may be implemented for the programming model.
Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation encompassed within the present invention. Thus, appearances of the phrases "one embodiment" or "in an embodiment" are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be implemented in suitable forms other than the particular embodiment illustrated, and all such forms may be encompassed within the claims of the present application.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present invention.
Claims (23)
1. A method for operating a processing device having a CPU and a GPU, comprising:
sharing memory semantics between the CPU and the GPU, including allowing pointers to be passed between the CPU and the GPU and sharing data structures;
allowing data structures stored in the shared memory to be accessed by both the CPU and the GPU;
determining whether the CPU and GPU have different page tables and different virtual-to-physical memory translations; and
synchronizing the contents of a virtual address, which is mapped to different physical addresses on the CPU and the GPU, in response to the different page tables and different virtual-to-physical memory translations, comprising:
upon initialization, mapping a portion of a PCI aperture into user space of an application and instantiating it with a task queue, a message queue, and a copy buffer, the portion of the PCI aperture being accessible to both the CPU and the GPU;
for page copying from the CPU to the GPU, copying the page from the CPU address space into the PCI aperture and having the GPU copy the page from the PCI aperture into its own address space; and
for page copying from the GPU to the CPU, copying the page from the GPU address space into the PCI aperture and having the CPU copy the page from the PCI aperture into its own address space.
2. The method of claim 1, further comprising:
sharing addresses between the CPU and the GPU includes allocating memory space to a data structure shared between the CPU and the GPU.
3. The method of claim 1, further comprising: the virtual addresses are shared between the CPU and the GPU, but are mapped to different physical addresses on the CPU and the GPU.
4. The method of claim 3, wherein allocating a data structure shared between the CPU and the GPU to the memory space further comprises: a memory allocation function is used that allocates data in the memory space.
5. The method of claim 4, wherein allocating memory space to the data structure shared between the CPU and the GPU further comprises programmer annotations of static variables that cause the variables to be allocated in the shared memory space.
6. The method of claim 1, further comprising: when data shared in the memory space is used by the CPU or GPU, the data shared in the memory space is migrated between the CPU and GPU memory as needed.
7. The method of claim 1, further comprising: dividing the address space into a shared address space between the CPU and the GPU and a residual address space private to the CPU or the GPU; and
allocating data to the private space by default.
8. The method of claim 7, wherein the representation of the pointer does not change between the shared and private spaces.
9. The method of claim 7, further comprising:
the CPU or GPU specifies that it owns a particular block of addresses in the shared virtual address space.
10. The method of claim 7, wherein the CPU or GPU designating that it owns a particular block of addresses in the shared virtual address space further comprises:
when the address range in the shared virtual address space is owned by the CPU, the CPU knows that the GPU cannot access those addresses, and does not need to maintain the consistency of those addresses with the GPU; and
when a CPU-owned address is accessed by the GPU, the address becomes non-owned.
11. The method of claim 1, wherein the PCI aperture can be mapped to user space of an application, thereby enabling user-level CPU and GPU communications.
12. The method of claim 11, further comprising:
placing in the PCI aperture a directory that includes metadata about pages in the shared address region, the metadata indicating whether the CPU or the GPU holds the home copy of a page and including a version number that tracks the number of updates to the page and a mutex that is acquired prior to updating the page.
13. The method of claim 12, wherein the directory is indexed by a virtual address of a page.
14. The method of claim 1, wherein the GPU comprises a discrete device or an integrated device or a combination of multiple GPUs in different configurations.
15. The method of claim 1, further comprising: memory semantics are uniformly shared for multiple graphics cards and hybrid graphics systems.
16. A shared memory in which data structures are shared between a CPU and a GPU and pointers can be passed from one side to the other without requiring any marshalling, such that the same pointers in the shared memory can be accessed by each of the CPU and the GPU,
wherein data structures stored in the shared memory are allowed to be accessed by both the CPU and the GPU;
the shared memory is adapted to:
determining whether the CPU and GPU have different page tables and different virtual-to-physical memory translations; and
synchronizing the contents of a virtual address, which is mapped to different physical addresses on the CPU and the GPU, in response to the different page tables and different virtual-to-physical memory translations, comprising:
upon initialization, mapping a portion of a PCI aperture into user space of an application and instantiating it with a task queue, a message queue, and a copy buffer, the portion of the PCI aperture being accessible to both the CPU and the GPU;
for page copying from the CPU to the GPU, copying the page from the CPU address space into the PCI aperture and having the GPU copy the page from the PCI aperture into its own address space; and
for page copying from the GPU to the CPU, copying the page from the GPU address space into the PCI aperture and having the CPU copy the page from the PCI aperture into its own address space.
17. The shared memory of claim 16, wherein a scene graph is located in the shared memory and is accessible by the CPU and the GPU.
18. The shared memory as in claim 16, wherein the shared memory is implemented with different operating systems running on the CPU and GPU.
19. The shared memory as in claim 16, wherein the shared memory is implemented with a GPU attached to a CPU as a separate device.
20. The shared memory of claim 16, wherein the shared memory is implemented in an integrated CPU-GPU platform.
21. An apparatus for operating a processing device having a CPU and a GPU, comprising:
means for sharing memory semantics between the CPU and the GPU, including allowing pointers to be passed between the CPU and the GPU and sharing data structures;
means for allowing data structures stored in the shared memory to be accessed by both the CPU and the GPU;
means for determining whether the CPU and GPU have different page tables and different virtual-to-physical memory translations; and
means for synchronizing the contents of a virtual address, which is mapped to different physical addresses on the CPU and the GPU, in response to the different page tables and different virtual-to-physical memory translations, comprising:
upon initialization, mapping a portion of a PCI aperture into user space of an application and instantiating it with a task queue, a message queue, and a copy buffer, the portion of the PCI aperture being accessible to both the CPU and the GPU;
for page copying from the CPU to the GPU, copying the page from the CPU address space into the PCI aperture and having the GPU copy the page from the PCI aperture into its own address space; and
for page copying from the GPU to the CPU, copying the page from the GPU address space into the PCI aperture and having the CPU copy the page from the PCI aperture into its own address space.
22. The apparatus of claim 21, further comprising:
means for sharing addresses between the CPU and the GPU, including allocating memory space with a data structure shared between the CPU and the GPU.
23. The apparatus of claim 22, further comprising:
means for sharing virtual addresses between the CPU and the GPU and causing the virtual addresses to be mapped to different physical addresses on the CPU and the GPU.
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US19909508P | 2008-11-13 | 2008-11-13 | |
US61/199,095 | 2008-11-13 | ||
US12/317,853 | 2008-12-30 | ||
US12/317,853 US8531471B2 (en) | 2008-11-13 | 2008-12-30 | Shared virtual memory |
PCT/US2009/063368 WO2010056587A2 (en) | 2008-11-13 | 2009-11-05 | Shared virtual memory |
Publications (2)
Publication Number | Publication Date |
---|---|
HK1172108A1 HK1172108A1 (en) | 2013-04-12 |
HK1172108B true HK1172108B (en) | 2016-05-20 |