CN117971349A - Computing device, method of configuring virtual registers for a computing device, control device, computer-readable storage medium, and computer program product

Info

Publication number: CN117971349A
Application number: CN202410382891.5A
Authority: CN
Inventors: Name withheld on request
Original assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Current assignee: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Priority date: 2024-03-29
Filing date: 2024-03-29
Publication date: 2024-05-03
Anticipated expiration: 2044-03-29
Also published as: CN117971349B

Abstract

The present disclosure provides a computing device, a method of configuring virtual registers for a computing device, a control device, a computer-readable storage medium, and a computer program product. The computing device includes: a plurality of computing units, each configured to run one thread of a warp; a plurality of thread-local registers dedicated to each computing unit, for holding data associated with the thread run by that computing unit; and a shared buffer for the plurality of computing units, wherein a portion of the shared buffer is configured as virtual registers for the plurality of computing units, and at least one of the thread-local registers of each computing unit is configured as a transit register through which the thread run by that computing unit accesses the virtual registers.

Description

Computing device, method for configuring virtual registers for computing device, control device, computer readable storage medium and computer program product

Technical Field

The present disclosure relates generally to the field of processors and, more particularly, to a computing device, a method of configuring virtual registers for a computing device, a control device, a computer-readable storage medium, and a computer program product.

Background

With the development of artificial intelligence (AI) technology, AI chips often need to process large amounts of data quickly, so data should sit as close as possible to the computing unit that accesses it. Moreover, accessing data in registers costs less than accessing cache or memory and improves operator performance, so operator development usually prefers to keep data in registers. However, the number of local registers fixedly configured for a computing unit is usually limited. When an operator needs more registers than the configured hardware registers, the compiler stores part of the data in memory or cache so that the operator can still execute correctly. Accesses to memory or cache are expensive and cause a large loss in operator performance.

Summary of the Invention

In view of the above problem, the present disclosure provides a solution in which virtual registers are configured, either for the plurality of computing units as a whole or for each computing unit, within the shared buffer used for data sharing among the computing units, so as to speed up data access by the computing units.

According to one aspect of the present disclosure, a computing device is provided. The computing device includes: a plurality of computing units, each configured to run one thread of a warp; a plurality of thread-local registers dedicated to each computing unit, for holding data associated with the thread run by that computing unit; and a shared buffer for the plurality of computing units, wherein a portion of the shared buffer is configured as virtual registers for the plurality of computing units, and at least one of the thread-local registers of each computing unit is configured as a transit register through which the thread run by that computing unit accesses the virtual registers.

In some implementations, the virtual registers include one or more thread virtual registers dedicated respectively to the thread of each computing unit, and each computing unit is configured to access its one or more thread virtual registers through the transit register of that computing unit when running the thread.

In some implementations, the computing unit is configured to write data to the transit register upon determining that the thread is to write the data to the one or more thread virtual registers, and the transit register is configured to write the data to the one or more thread virtual registers of the computing unit.

In some implementations, the computing unit is configured to send a read request to the transit register upon determining that the thread is to read data from the one or more thread virtual registers, and the transit register is configured to read the data from the one or more thread virtual registers of the computing unit in response to the read request, for reading by the thread run by the computing unit.

In some implementations, the virtual registers include one or more warp virtual registers shared by the warp run on the plurality of computing units, and each computing unit is configured to access the one or more warp virtual registers through the transit register of that computing unit when running the warp.

In some implementations, upon determining that the warp is to write data to the one or more warp virtual registers, one of the plurality of computing units is configured to write the data to the transit register of that computing unit, and the transit register is configured to write the data to the one or more warp virtual registers.

In some implementations, upon determining that the warp is to read data from the warp virtual registers, each of the plurality of computing units sends a read request to its own transit register, and each transit register is configured to read the data from the warp virtual registers in response to the read request, for reading by the thread run by the corresponding computing unit.

In some implementations, upon determining that the warp is to read data from the warp virtual registers, each of the plurality of computing units sends a read request to the transit register of that computing unit, and the transit register of one of the plurality of computing units is configured to read the data from the warp virtual registers in response to the read request and to broadcast the data to the plurality of computing units.

According to another aspect of the present disclosure, a method of configuring virtual registers for a computing device is provided, wherein the computing device includes: a plurality of computing units, each configured to run one thread of a warp; a plurality of thread-local registers dedicated to each computing unit, for holding data associated with the thread run by that computing unit; and a shared buffer for the plurality of computing units. The method includes: configuring a portion of the shared buffer as virtual registers for the plurality of computing units; and configuring at least one of the thread-local registers of each computing unit as a transit register through which the thread run by that computing unit accesses the virtual registers.

In some implementations, configuring a portion of the shared buffer as virtual registers for the plurality of computing units includes: configuring one or more thread virtual registers dedicated respectively to the thread of each computing unit, and the method further includes: configuring each computing unit to access the one or more thread virtual registers through the transit register of that computing unit when running the thread.

In some implementations, configuring each computing unit to access the one or more thread virtual registers through the transit register of that computing unit when running the thread includes: configuring the computing unit to write data to the transit register upon determining that the thread is to write the data to the one or more thread virtual registers, and configuring the transit register to write the data to the one or more thread virtual registers of the computing unit.

In some implementations, configuring each computing unit to access the one or more thread virtual registers through the transit register of that computing unit when running the thread includes: configuring the computing unit to send a read request to the transit register upon determining that the thread is to read data from the one or more thread virtual registers, and configuring the transit register to read the data from the one or more thread virtual registers of the computing unit in response to the read request, for reading by the thread run by the computing unit.

In some implementations, configuring a portion of the shared buffer as virtual registers for the plurality of computing units includes: configuring one or more warp virtual registers shared by the warp run on the plurality of computing units, and the method further includes: configuring each computing unit to access the one or more warp virtual registers through the transit register of that computing unit when running the warp.

In some implementations, configuring each computing unit to access the one or more warp virtual registers through the transit register of that computing unit when running the warp includes: configuring one of the plurality of computing units to write data to the transit register of that computing unit upon determining that the warp is to write the data to the one or more warp virtual registers, and configuring the transit register to write the data to the one or more warp virtual registers.

In some implementations, configuring each computing unit to access the one or more warp virtual registers through the transit register of that computing unit when running the warp includes: configuring each of the plurality of computing units to send a read request to its own transit register upon determining that the warp is to read data from the warp virtual registers, and configuring each transit register to read the data from the warp virtual registers in response to the read request, for reading by the thread run by the corresponding computing unit.

In some implementations, configuring each computing unit to access the one or more warp virtual registers through the transit register of that computing unit when running the warp includes: configuring each of the plurality of computing units to send a read request to the transit register of that computing unit upon determining that the warp is to read data from the warp virtual registers, and configuring the transit register of one of the plurality of computing units to read the data from the warp virtual registers in response to the read request and to broadcast the data to the plurality of computing units.

According to a further aspect of the present disclosure, a control device is provided, including: at least one processor; and at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, wherein the instructions, when executed by the at least one processor, cause the control device to perform the steps of the method described above.

According to yet another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program code is stored, wherein the computer program code, when run, performs the method described above.

According to still another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a machine, performs the method described above.

Brief Description of the Drawings

The present disclosure will be better understood, and its other objects, details, features, and advantages will become more apparent, from the following description of specific embodiments given with reference to the accompanying drawings.

FIG. 1 is a schematic structural diagram of a computing device.

FIG. 2 is a schematic structural diagram of a computing device according to an embodiment of the present invention.

FIG. 3A is a schematic diagram of a computing device for one type of thread operation according to an embodiment of the present invention.

FIG. 3B is a schematic diagram of a computing device for another type of thread operation according to an embodiment of the present invention.

FIG. 4 is an exemplary flowchart of a method of configuring virtual registers for a computing device according to an embodiment of the present invention.

Detailed Description

Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure is thorough and complete and fully conveys its scope to those skilled in the art.

As used herein, the term "including" and its variations denote open-ended inclusion, i.e., "including but not limited to". Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one embodiment" and "some embodiments" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first", "second", and the like may refer to different or identical objects.

FIG. 1 is a schematic structural diagram of a computing device 100. As shown in FIG. 1, the computing device 100 may include a plurality of computing units 110 and a plurality of thread-local registers 120 dedicated to each computing unit 110. FIG. 1 shows, by way of example, eight computing units 110-1, 110-2 ... 110-8 and, for each computing unit 110, four thread-local registers 120-1, 120-2 ... 120-4. Note that the numbers of computing units 110 and thread-local registers 120 shown in FIG. 1 are merely exemplary and are not intended to limit the scope of the present invention. In addition, for ease of description the thread-local registers 120 of different computing units 110 carry the same reference numerals, but those skilled in the art will understand that they are in fact different physical registers. For example, the thread-local register 120-1 of the computing unit 110-1 and the thread-local register 120-1 of the computing unit 110-2 are different physical registers.

When the computing device 100 runs a warp, each computing unit 110 may run one thread of that warp. The thread-local registers 120 of each computing unit 110 hold data associated with the thread run by that computing unit 110, for example data the thread needs in order to run or data the thread produces. The thread-local registers 120 are located close to the computing unit 110, for example on the same chip module as the computing unit 110, and can be accessed directly by the thread running on that computing unit 110.

In one example, the computing device 100 may be a general-purpose computing unit (CU) in an AI chip, a GPU (graphics processing unit), or a general-purpose GPU (GPGPU), and the computing unit 110 may be, for example, an execution unit (EU) included in the CU. When the CU runs a warp, each EU runs one thread of the warp according to the task scheduling.

In addition, the computing device 100 may further include a shared buffer 130 for the plurality of computing units 110. The shared buffer 130 is on-chip memory of the computing device 100 and is used to exchange data between the threads of the plurality of computing units 110 of the computing device 100.

At present, a thread running on a computing unit 110 can access only the thread-local registers 120 of that computing unit 110. When the thread needs more registers than the computing unit 110 has thread-local registers 120, the registers overflow.

One solution is to store the overflowed data in off-chip memory of the computing device 100, such as thread local memory (TLM), so that the thread can still execute correctly. In this implementation, however, the accesses to off-chip memory that the compiler introduces are expensive and cause a large loss in computing performance.

To address this problem, in the solution disclosed in the present invention, a portion of the shared buffer 130 of the computing device is logically configured as dedicated space for the warp or for each thread, and is used as virtual registers that function similarly to the thread-local registers 120.

FIG. 2 is a schematic structural diagram of a computing device 200 according to an embodiment of the present invention. The computing device 200 shown in FIG. 2 differs from the computing device 100 shown in FIG. 1 in that a portion of the shared buffer 130 is configured as virtual registers 132 for the plurality of computing units 110 of the computing device 200.

Because a thread running on a computing unit 110 can directly access only the thread-local registers 120 of that computing unit 110, at least one of the thread-local registers 120 of the computing unit 110 is configured as a transit register through which the thread running on the computing unit 110 accesses the virtual registers 132. In this case, the transit register serves only to relay data between the thread on the computing unit 110 and the virtual registers 132 in the shared buffer 130, and does not hold data of the computing unit 110 itself.

The role of transit register is usually played by the last one or more thread-local registers 120 of each computing unit 110. For example, in FIG. 2 the thread-local register 120-4 may serve as the transit register of the computing unit 110. Note that a single transit register is used as an example herein, but those skilled in the art will understand that the number of transit registers of each computing unit 110 may be configured as one or more as needed. In addition, the transit register may be fixed in advance in hardware, or configured flexibly as needed by software (for example, software of a control device of the computing device 200).
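As a rough illustration of this layout, the following Python sketch models the register resources of one computing unit, with the trailing thread-local register(s) reserved as transit registers. The class name `RegisterLayout`, its fields, and the three-tier `placement` helper are hypothetical names introduced only for this example; the disclosure does not prescribe any software interface, and the "off-chip" tier simply stands for the fallback described for FIG. 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterLayout:
    """Hypothetical model of one computing unit's register resources."""
    n_local: int        # thread-local registers per computing unit (4 in FIG. 2)
    n_virtual: int      # virtual registers reachable through the transit register
    n_transit: int = 1  # trailing local registers reserved as transit registers

    def __post_init__(self):
        if not 1 <= self.n_transit <= self.n_local:
            raise ValueError("at least one local register must act as transit register")

    @property
    def data_indices(self):
        """Local registers that still hold the thread's own data."""
        return list(range(self.n_local - self.n_transit))

    @property
    def transit_indices(self):
        """Local registers used only to relay data to and from the shared buffer."""
        return list(range(self.n_local - self.n_transit, self.n_local))

    def placement(self, value_index):
        """Return the tier that would hold the value_index-th live value of a thread."""
        n_data = len(self.data_indices)
        if value_index < n_data:
            return "local"
        if value_index < n_data + self.n_virtual:
            return "virtual"
        return "off-chip"


layout = RegisterLayout(n_local=4, n_virtual=8)
print(layout.transit_indices)                      # index 3 plays the role of register 120-4
print([layout.placement(i) for i in (0, 3, 11)])   # one value per tier
```

With four local registers and one transit register, only three registers remain for ordinary data, which is the price paid for reaching the larger virtual register space.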

A warp operation of the computing device 200 may involve two different types of thread operation. In the first type, every thread in the warp performs the same operation on different data; in this case, different virtual registers need to be configured for each thread for use by that thread's operation, as described below with reference to FIG. 3A. In the second type, every thread in the warp performs the same operation on the same data; in this case, virtual registers can be configured uniformly for the whole warp for use by the operations of all threads, as described below with reference to FIG. 3B.
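The distinction between the two types of thread operation can be sketched in Python as follows. This is only an illustration: the function name `virtual_register_kind` and the idea of inspecting operand values at run time are assumptions made for this example, and the disclosure does not state how or when the choice is made.

```python
def virtual_register_kind(operands_per_thread):
    """Pick a virtual-register arrangement for one warp-wide operation.

    operands_per_thread holds the (hashable) operand each thread of the warp works on.
    """
    if not operands_per_thread:
        raise ValueError("a warp has at least one thread")
    if len(set(operands_per_thread)) == 1:
        return "warp"    # same data in every thread: one shared set of virtual registers
    return "thread"      # different data per thread: a private set for each thread


print(virtual_register_kind([5, 5, 5, 5]))   # all threads use the same data
print(virtual_register_kind([1, 2, 3, 4]))   # each thread uses its own data
```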

FIG. 3A is a schematic diagram of the computing device 200 for the first type of thread operation according to an embodiment of the present invention. As shown in FIG. 3A, the virtual registers 132 include one or more thread virtual registers 1322 dedicated respectively to the thread of each computing unit 110. In this case, each computing unit 110 is configured to access its one or more thread virtual registers 1322 through the transit register of that computing unit 110 (such as the thread-local register 120-4) when running the thread. Here, the number of thread virtual registers 1322 of each computing unit 110 may be preconfigured in hardware, or configured flexibly as needed by software (for example, software of a control device of the computing device 200). In some examples, the number of thread virtual registers 1322 of each computing unit 110 may be set to between 1/2 and 2 times the number of thread-local registers 120 of that computing unit 110.
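One way to picture the per-thread partitioning is as an index calculation over the virtual register region of the shared buffer. The Python sketch below assumes a contiguous, unit-major layout in which the registers of computing unit 0 come first; that layout, the function names, and the rounding for odd register counts are assumptions for illustration only, since the disclosure does not fix them.

```python
def suggested_vreg_count_range(n_local):
    """Range of thread virtual registers per computing unit: 1/2 to 2 times n_local.

    Rounding down for odd n_local (and never below 1) is an assumption of this sketch.
    """
    return max(1, n_local // 2), 2 * n_local


def thread_vreg_slot(unit_id, vreg_index, vregs_per_thread):
    """Map (computing unit, virtual register index) to a slot in the virtual register region."""
    if unit_id < 0:
        raise IndexError("unit_id must be non-negative")
    if not 0 <= vreg_index < vregs_per_thread:
        raise IndexError("virtual register index out of range")
    return unit_id * vregs_per_thread + vreg_index


low, high = suggested_vreg_count_range(4)   # four local registers, as in FIG. 2
print(low, high)
print(thread_vreg_slot(unit_id=1, vreg_index=0, vregs_per_thread=high))
```

Because each computing unit owns a disjoint range of slots, two threads can never touch each other's thread virtual registers under this layout.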

Accesses by the thread of a computing unit 110 to the thread virtual registers 1322 include write operations and read operations.

For a write to the thread virtual registers 1322, when the computing unit 110 determines that its thread is to write data to one or more of its thread virtual registers 1322, the computing unit 110 writes the data to its transit register (such as the thread-local register 120-4), and the transit register writes the data to the thread virtual register 1322 of that computing unit 110.

More specifically, for example, when running the thread, the computing unit 110 may write the data it produces into its thread-local registers 120 in sequence. When data reaches the transit register among the thread-local registers 120 (such as the thread-local register 120-4), the transit register may forward the written data directly to a thread virtual register 1322 of that computing unit 110, and so on until all thread virtual registers 1322 are full.

For a read from the thread virtual registers 1322, when the computing unit 110 determines that its thread is to read data from a thread virtual register 1322, the computing unit 110 sends a read request to its transit register (such as the thread-local register 120-4), and the transit register, in response to the read request, reads the data from the thread virtual register 1322 of that computing unit 110 for reading by the thread run by the computing unit 110.

More specifically, for example, when running the thread, the computing unit 110 needs to read the data the thread uses from its thread-local registers 120 in sequence. When the read reaches the transit register among the thread-local registers 120 (such as the thread-local register 120-4), the transit register may read data from the thread virtual register 1322 of that computing unit 110, and the computing unit 110 is enabled, for example through software code or hardware configuration, to read the required data from the transit register.
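The write and read paths for thread virtual registers can be mimicked in software. The following Python sketch is a behavioral model only, under several assumptions: the class `ComputeUnit` and its method names are hypothetical, the shared buffer is a plain list, the last local register is the transit register, and each computing unit owns a contiguous block of slots. It shows that every access to a thread virtual register passes through the transit register.

```python
class ComputeUnit:
    """Behavioral model of one computing unit with a single transit register."""

    def __init__(self, unit_id, shared_buffer, n_local=4, vregs_per_thread=8):
        self.unit_id = unit_id
        self.local = [0] * n_local        # thread-local registers 120-1 ... 120-4
        self.transit = n_local - 1        # index of the transit register (120-4)
        self.shared = shared_buffer       # virtual register region, shared by all units
        self.vregs_per_thread = vregs_per_thread

    def _slot(self, vreg_index):
        if not 0 <= vreg_index < self.vregs_per_thread:
            raise IndexError("virtual register index out of range")
        return self.unit_id * self.vregs_per_thread + vreg_index

    def write_virtual(self, vreg_index, value):
        self.local[self.transit] = value                               # thread writes the transit register
        self.shared[self._slot(vreg_index)] = self.local[self.transit] # transit register forwards the data

    def read_virtual(self, vreg_index):
        self.local[self.transit] = self.shared[self._slot(vreg_index)] # transit register fetches the data
        return self.local[self.transit]                                # thread reads the transit register


shared = [0] * (8 * 8)                      # 8 computing units x 8 thread virtual registers
units = [ComputeUnit(i, shared) for i in range(8)]
units[0].write_virtual(2, 11)
units[1].write_virtual(2, 22)               # same register index, different private slot
print(units[0].read_virtual(2), units[1].read_virtual(2))
```

The two writes use the same virtual register index but land in different slots, so each thread reads back its own value.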

FIG. 3B is a schematic diagram of the computing device 200 for the second type of thread operation according to an embodiment of the present invention. As shown in FIG. 3B, the virtual registers 132 include one or more warp virtual registers 1324 shared by the warp run on the plurality of computing units 110. In this case, each computing unit 110 is configured to access the one or more warp virtual registers 1324 through the transit register of that computing unit 110 (such as the thread-local register 120-4) when running the warp. Similarly, the number of warp virtual registers 1324 may be preconfigured in hardware, or configured flexibly as needed by software (for example, software of a control device of the computing device 200). In some examples, the number of warp virtual registers 1324 may be set to tens of times the number of thread-local registers 120 of each computing unit 110.

Similarly, accesses to the warp virtual registers 1324 by the warp run on the plurality of computing units 110 of the computing device 200 include write operations and read operations.

For a write to the warp virtual registers 1324, when it is determined that the warp run on the plurality of computing units 110 is to write data to one or more warp virtual registers 1324, one computing unit 110 among the plurality of computing units 110 (for example, the computing unit 110-1) writes the data to its transit register (such as the thread-local register 120-4), and the transit register writes the data to the warp virtual register 1324.

More specifically, for example, when the plurality of computing units 110 run the warp, each computing unit 110 may write the data it produces into its thread-local registers 120 in sequence. When data reaches the transit register among the thread-local registers 120 (such as the thread-local register 120-4), the transit register may forward the written data directly to a warp virtual register 1324 shared by the plurality of computing units 110, and so on until all warp virtual registers 1324 are full. Here, because the threads of the plurality of computing units 110 perform the same operation on the same data, every computing unit 110 produces the same data. In this case, either only a designated computing unit 110 writes to the warp virtual registers 1324, or each of the plurality of computing units 110 writes to the warp virtual registers 1324 separately. In the latter case, the data written by a later computing unit overwrites the earlier data.
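The two write policies for warp virtual registers can be compared with a small Python model. The function `warp_write` and its parameters are hypothetical and exist only for this sketch; the warp virtual registers are modeled as a plain list, and each write passes through a variable that stands in for the writing unit's transit register.

```python
def warp_write(warp_vregs, vreg_index, values_per_unit, designated_unit=None):
    """Write one warp-wide value to a warp virtual register.

    values_per_unit[u] is the value produced by computing unit u.
    If designated_unit is given, only that unit writes; otherwise every unit writes
    in turn and later writes overwrite earlier ones.
    Returns the number of writes issued to the shared buffer.
    """
    if designated_unit is not None:
        writers = [designated_unit]
    else:
        writers = range(len(values_per_unit))
    writes = 0
    for unit in writers:
        transit = values_per_unit[unit]   # the unit places its value in its transit register
        warp_vregs[vreg_index] = transit  # the transit register forwards it to the shared buffer
        writes += 1
    return writes


warp_vregs = [0] * 4
values = [7] * 8                          # all 8 threads computed the same value
print(warp_write(warp_vregs, 1, values, designated_unit=0), warp_vregs[1])
print(warp_write(warp_vregs, 2, values), warp_vregs[2])
```

Both policies leave the same value in the register because every thread produced identical data; in this model the designated-writer policy simply issues fewer writes to the shared buffer.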

For a read operation on a warp virtual register 1324, when it is determined that the warp run by the plurality of computing units 110 is to read data from the warp virtual register 1324, each of the computing units 110 sends a read request to its own transfer register (such as thread local register 120-4), and each transfer register, in response to the read request, reads the data from the warp virtual register 1324 so that the thread run by the corresponding computing unit 110 can read it.

More specifically, for example, while the warp is running, each computing unit 110 reads the data the warp needs from its thread local registers 120 in sequence. When the read reaches the transfer register among those thread local registers 120 (such as thread local register 120-4), the transfer register may read the data from the warp virtual register 1324, so that the computing unit 110 can read the required data from its own transfer register.

Alternatively, where the shared buffer 130 supports broadcast, when it is determined that the warp run by the plurality of computing units 110 is to read data from the warp virtual register 1324, every computing unit 110 may send a read request to its corresponding transfer register, while the transfer register of only one computing unit 110 reads the data from the warp virtual register 1324 and broadcasts it to the plurality of computing units 110.
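The two warp read variants (every transfer register reads on its own, or one reads and broadcasts) can be compared in a short model. This is a sketch only: `SharedBuffer`, its `can_broadcast` flag, the `read_count` counter, and the function `warp_read` are hypothetical names introduced for illustration, and the transfer registers are modeled as a plain dictionary keyed by unit number.

```python
class SharedBuffer:
    """Model of shared buffer 130 holding warp virtual registers 1324."""

    def __init__(self, warp_vregs, can_broadcast):
        self.warp_vregs = list(warp_vregs)
        self.can_broadcast = can_broadcast  # assumption: broadcast support is a simple flag
        self.read_count = 0                 # number of reads served by the shared buffer

    def read(self, index):
        self.read_count += 1
        return self.warp_vregs[index]


def warp_read(transfer_regs, shared, index):
    """Fill every unit's transfer register with warp virtual register `index`."""
    if shared.can_broadcast:
        # Only one transfer register reads; the value is then broadcast to all units.
        value = shared.read(index)
        for unit_id in transfer_regs:
            transfer_regs[unit_id] = value
    else:
        # Each transfer register answers its own unit's request with its own read.
        for unit_id in transfer_regs:
            transfer_regs[unit_id] = shared.read(index)
    return transfer_regs


plain = SharedBuffer([10, 20], can_broadcast=False)
bcast = SharedBuffer([10, 20], can_broadcast=True)
regs_plain = warp_read({u: None for u in range(4)}, plain, 1)
regs_bcast = warp_read({u: None for u in range(4)}, bcast, 1)
```

In this model both variants deliver the same data to every unit; the broadcast variant only reduces the number of reads issued to the shared buffer.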

Herein, where the transfer register is configured by software, one or more thread local registers 120 of a computing unit 110 may be configured as transfer registers according to the warp currently running. For example, if the currently running warp requires fast data reads and writes, multiple transfer registers (for example, two) may be configured, so that two registers can be read or written at a time.
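The effect of configuring more than one transfer register can be estimated with simple arithmetic. The sketch below assumes, purely for illustration, that each transfer register moves one virtual register per step; the function name `transfer_steps` and this cost model are hypothetical and are not part of the disclosure.

```python
import math


def transfer_steps(num_virtual_regs, num_transfer_regs):
    """Steps needed to move `num_virtual_regs` values through the transfer registers.

    Assumption: every transfer register handles one virtual register per step.
    """
    if num_transfer_regs < 1:
        raise ValueError("at least one transfer register is required")
    return math.ceil(num_virtual_regs / num_transfer_regs)


steps_single = transfer_steps(8, 1)  # one transfer register
steps_double = transfer_steps(8, 2)  # two transfer registers, two values per step
```

Under this assumption, doubling the number of transfer registers halves the number of steps for an even register count.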

In addition, where the thread virtual registers 1322 are configured by software, the required number of thread virtual registers 1322 may be configured in the shared buffer 130 by software instructions before the warp runs, according to the warp to be run.

Similarly, where the warp virtual registers 1324 are configured by software, the required number of warp virtual registers 1324 may be configured in the shared buffer 130 by software instructions before the warp runs, according to the warp to be run.

Note that although the thread virtual registers 1322 and the warp virtual registers 1324 are shown and described separately above with reference to FIG. 3A and FIG. 3B, in some embodiments the shared buffer 130 may include both thread virtual registers 1322 dedicated to individual computing units 110 and warp virtual registers 1324 shared by the plurality of computing units 110.

FIG. 4 shows an exemplary flowchart of a method 400 for configuring a virtual register 132 for a computing device 200 according to an embodiment of the present invention. The computing device 200 is, for example, as described above with reference to FIG. 2, FIG. 3A, and FIG. 3B. The method 400 may be executed by code built into the computing device 200, or its execution may be controlled through software by a control device of the computing device 200 (for example, where the computing device 200 is a general-purpose compute unit (CU) in a GPGPU, the control device may be a central processing unit (CPU) that controls the computing device 200).

As shown in FIG. 4, the method 400 includes block 410, in which a portion of the shared buffer 130 is configured as a virtual register 132 for the plurality of computing units 110.

At block 420, at least one thread local register 120 of the plurality of thread local registers 120 of each computing unit 110 may be configured as a transfer register through which the thread run by that computing unit 110 accesses the virtual register 132.
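Blocks 410 and 420 can be sketched as a single configuration routine. This is an illustrative model only: the function `configure_virtual_registers`, its parameters, the slot-index representation of the shared buffer, and the choice of the highest-numbered local registers as transfer registers are all assumptions made for the example.

```python
def configure_virtual_registers(shared_buffer_slots, num_virtual_regs,
                                num_units, local_regs_per_unit,
                                num_transfer_regs=1):
    """Return a configuration describing virtual and transfer registers.

    Assumptions: the shared buffer is a flat array of register-sized slots, and
    the highest-numbered local registers of each unit become transfer registers.
    """
    if num_virtual_regs > shared_buffer_slots:
        raise ValueError("not enough shared buffer slots for the virtual registers")
    if not 1 <= num_transfer_regs <= local_regs_per_unit:
        raise ValueError("invalid number of transfer registers")

    # Block 410: reserve part of the shared buffer as virtual registers.
    virtual_slots = list(range(num_virtual_regs))

    # Block 420: designate at least one local register per unit as transfer register.
    first = local_regs_per_unit - num_transfer_regs
    transfer_regs = {unit: list(range(first, local_regs_per_unit))
                     for unit in range(num_units)}

    return {"virtual_slots": virtual_slots, "transfer_regs": transfer_regs}


config = configure_virtual_registers(shared_buffer_slots=16, num_virtual_regs=4,
                                     num_units=4, local_regs_per_unit=4)
```

With four local registers per unit and one transfer register, index 3 corresponds to thread local register 120-4 in the description above.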

In some embodiments, block 410 may include configuring one or more thread virtual registers 1322 that are respectively dedicated to the thread of each computing unit 110. In this case, the method 400 further includes configuring each computing unit 110 to access the one or more thread virtual registers 1322 through the transfer register of that computing unit 110 when running the thread.

Specifically, for a write operation to a thread virtual register 1322, the computing unit 110 may be configured to write data into the transfer register upon determining that the thread is to write the data to one or more thread virtual registers 1322, and the transfer register may be configured to write the data into the one or more thread virtual registers 1322 of that computing unit 110.

Specifically, for a read operation on a thread virtual register 1322, the computing unit 110 may be configured to send a read request to the transfer register upon determining that the thread is to read data from one or more thread virtual registers 1322, and the transfer register may be configured to read, in response to the read request, the data from the one or more thread virtual registers 1322 of that computing unit so that the thread run by the computing unit 110 can read it.
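The per-thread case differs from the warp case in that each computing unit only touches its own thread virtual registers. The following sketch models that isolation; the class names `ThreadVirtualRegisters` and `ComputeUnit`, the contiguous per-unit slot layout, and the single transfer register per unit are hypothetical choices made for the example.

```python
class ThreadVirtualRegisters:
    """Model of thread virtual registers 1322 carved out of shared buffer 130."""

    def __init__(self, num_units, regs_per_unit):
        self.regs_per_unit = regs_per_unit
        self.slots = [0] * (num_units * regs_per_unit)

    def slot(self, unit_id, index):
        # Assumption: each unit owns a contiguous range of slots.
        if not 0 <= index < self.regs_per_unit:
            raise IndexError("virtual register index out of range")
        return unit_id * self.regs_per_unit + index


class ComputeUnit:
    """Model of a computing unit 110 with one transfer register."""

    def __init__(self, unit_id, vregs):
        self.unit_id = unit_id
        self.vregs = vregs
        self.transfer_reg = None

    def write_vreg(self, index, value):
        # Write path: thread -> transfer register -> thread virtual register.
        self.transfer_reg = value
        self.vregs.slots[self.vregs.slot(self.unit_id, index)] = self.transfer_reg

    def read_vreg(self, index):
        # Read path: thread virtual register -> transfer register -> thread.
        self.transfer_reg = self.vregs.slots[self.vregs.slot(self.unit_id, index)]
        return self.transfer_reg


vregs = ThreadVirtualRegisters(num_units=2, regs_per_unit=2)
unit0, unit1 = ComputeUnit(0, vregs), ComputeUnit(1, vregs)
unit0.write_vreg(0, 111)
unit1.write_vreg(0, 222)  # same register index, different unit, different slot
value0 = unit0.read_vreg(0)
value1 = unit1.read_vreg(0)
```

Because the slot index is derived from the unit number, two units using the same virtual register index never overwrite each other in this model.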

In some embodiments, block 410 may further include configuring one or more warp virtual registers 1324 shared by the warp of the plurality of computing units 110. In this case, the method 400 further includes configuring each computing unit 110 to access the one or more warp virtual registers 1324 through the transfer register of that computing unit 110 when running the warp.

Specifically, for a write operation to a warp virtual register 1324, one computing unit 110 of the plurality of computing units 110 may be configured to write data into the transfer register of that computing unit 110 upon determining that the warp is to write the data to one or more warp virtual registers 1324, and the transfer register may be configured to write the data into the one or more warp virtual registers 1324.

Specifically, for a read operation on a warp virtual register 1324, each of the plurality of computing units 110 may be configured to send a read request to its own transfer register upon determining that the warp is to read data from the warp virtual register 1324, and each transfer register may be configured to read, in response to the read request, the data from the warp virtual register 1324 so that the thread run by the corresponding computing unit 110 can read it.

Alternatively, for a read operation on a warp virtual register 1324, each of the plurality of computing units 110 may be configured to send a read request to the transfer register of that computing unit 110 upon determining that the warp is to read data from the warp virtual register 1324, while the transfer register of only one of the computing units 110 reads the data from the warp virtual register 1324 in response to the read request and broadcasts it to the plurality of computing units 110.

With the solution of the present invention, a portion of the shared buffer of the computing device is logically configured as virtual registers for the warp or for each thread, and one or more of the existing thread local registers are configured as transfer registers that relay data between a thread or warp and the virtual registers. This expands the number of thread local registers effectively available to the computing device when running a warp, which avoids the performance degradation caused by register spilling. The larger number of registers also gives warps running on the computing device more room to operate.

Those skilled in the art will appreciate that the computing device 200 shown in the figures is merely illustrative and may include more or fewer components.

The computing device 200 and the method 400 for configuring virtual registers according to the present disclosure have been described above with reference to the accompanying drawings. Those skilled in the art will appreciate, however, that the computing device 200 need not include all of the components shown in the drawings: it may include only those components needed to perform the functions described in the present disclosure, or additional components. The way these components are connected is likewise not limited to the form shown in the drawings, and the method 400 may include further steps not shown in the drawings.

The present invention may be implemented as a method, a computing device, a control device of the computing device, a computer-readable storage medium, and/or a computer program product. The computer-readable storage medium stores computer program code that, when run, performs the method of the present disclosure. The computer program product includes a computer program that, when run, performs the method of the present disclosure. The computing device and/or the control device may include at least one processor and at least one memory coupled to the at least one processor, and the memory may store instructions for execution by the at least one processor. When the instructions are executed by the at least one processor, the computing device and/or the control device may perform the method described above.

In one or more exemplary designs, the functions described in the present disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on a computer-readable medium as one or more instructions or code, or transmitted as one or more instructions or code on a computer-readable medium.

Those of ordinary skill in the art should also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments of the present disclosure may be implemented as electronic hardware, computer software, or a combination of the two.

The above description of the present disclosure is provided to enable any person of ordinary skill in the art to make or use the present disclosure. Various modifications to the present disclosure will be readily apparent to those of ordinary skill in the art, and the general principles defined herein may be applied to other variations without departing from the spirit and scope of the present disclosure. The present disclosure is therefore not limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A computing device, comprising:

a plurality of computing units, each configured to run one thread of a thread bundle;

a plurality of thread local registers dedicated to each computing unit, for storing data associated with the thread run by the computing unit; and

a shared buffer for the plurality of computing units,

wherein a portion of the shared buffer is configured as a virtual register for the plurality of computing units, and at least one of the plurality of thread local registers of each computing unit is configured as a staging register through which the thread run by the computing unit accesses the virtual register.

2. The computing device of claim 1, wherein the virtual registers comprise one or more thread virtual registers that are respectively dedicated to threads of each computing unit, and each computing unit is configured to access the one or more thread virtual registers through a staging register of the computing unit when running the thread.

3. The computing device of claim 2, wherein

the computing unit is configured to, upon determining that the thread is to write data to the one or more thread virtual registers, write the data to the staging register, and

the staging register is configured to write the data to the one or more thread virtual registers of the computing unit.

4. The computing device of claim 2, wherein

the computing unit is configured to send a read request to the staging register upon determining that the thread is to read data from the one or more thread virtual registers, and

the staging register is configured to read, in response to the read request, the data from the one or more thread virtual registers of the computing unit so that the thread run by the computing unit can read it.

5. The computing device of claim 1, wherein the virtual registers comprise one or more thread bundle virtual registers that are shared by thread bundles of the plurality of computing units, and each computing unit is configured to access the one or more thread bundle virtual registers through a staging register of the computing unit when running the thread bundle.

6. The computing device of claim 5, wherein

one of the plurality of computing units is configured to, upon determining that the thread bundle is to write data to the one or more thread bundle virtual registers, write the data to the staging register of that computing unit, and

the staging register is configured to write the data to the one or more thread bundle virtual registers.

7. The computing device of claim 5, wherein

each of the plurality of computing units is configured to send a read request to its respective staging register upon determining that the thread bundle is to read data from the thread bundle virtual register, and

each staging register is configured to read, in response to the read request, the data from the thread bundle virtual register so that the thread run by the corresponding computing unit can read it.

8. The computing device of claim 5, wherein

each of the plurality of computing units is configured to send a read request to the staging register of that computing unit upon determining that the thread bundle is to read data from the thread bundle virtual register, and

the staging register of one of the plurality of computing units is configured to read the data from the thread bundle virtual register in response to the read request and to broadcast the data to the plurality of computing units.

9. A method of configuring virtual registers for a computing device, wherein the computing device comprises:

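a plurality of computing units, each configured to run one thread of a thread bundle;

a plurality of thread local registers dedicated to each computing unit, for storing data associated with the thread run by the computing unit; and

a shared buffer for the plurality of computing units,

the method comprising:

configuring a portion of the shared buffer as a virtual register for the plurality of computing units; and

configuring at least one of the plurality of thread local registers of each computing unit as a staging register through which the thread run by the computing unit accesses the virtual register.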

10. The method of claim 9, wherein configuring a portion of the shared buffer as a virtual register for the plurality of computing units comprises:

configuring one or more thread virtual registers that are respectively dedicated to the thread of each computing unit, and the method further comprises:

configuring each computing unit to access the one or more thread virtual registers through the staging register of the computing unit while running the thread.

11. The method of claim 10, wherein configuring each computing unit to access the one or more thread virtual registers through a staging register of the computing unit while running the thread comprises:

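configuring the computing unit to write data to the staging register upon determining that the thread is to write the data to the one or more thread virtual registers, and

configuring the staging register to write the data to the one or more thread virtual registers of the computing unit.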

12. The method of claim 10, wherein configuring each computing unit to access the one or more thread virtual registers through a staging register of the computing unit while running the thread comprises:

configuring the computing unit to send a read request to the staging register upon determining that the thread is to read data from the one or more thread virtual registers, and

configuring the staging register to read, in response to the read request, the data from the one or more thread virtual registers of the computing unit so that the thread run by the computing unit can read it.

13. The method of claim 9, wherein configuring a portion of the shared buffer as a virtual register for the plurality of computing units comprises:

configuring one or more thread bundle virtual registers shared by the thread bundle of the plurality of computing units, and the method further comprises:

configuring each computing unit to access the one or more thread bundle virtual registers through the staging register of the computing unit while running the thread bundle.

14. The method of claim 13, wherein configuring each computing unit to access the one or more thread bundle virtual registers through a staging register of the computing unit while running the thread bundle comprises:

configuring one of the plurality of computing units to write data to the staging register of that computing unit upon determining that the thread bundle is to write the data to the one or more thread bundle virtual registers, and

configuring the staging register to write the data to the one or more thread bundle virtual registers.

15. The method of claim 13, wherein configuring each computing unit to access the one or more thread bundle virtual registers through a staging register of the computing unit while running the thread bundle comprises:

configuring each of the plurality of computing units to send a read request to its respective staging register upon determining that the thread bundle is to read data from the thread bundle virtual register, and

configuring each staging register to read, in response to the read request, the data from the thread bundle virtual register so that the thread run by the corresponding computing unit can read it.

16. The method of claim 13, wherein configuring each computing unit to access the one or more thread bundle virtual registers through a staging register of the computing unit while running the thread bundle comprises:

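configuring each of the plurality of computing units to send a read request to the staging register of that computing unit upon determining that the thread bundle is to read data from the thread bundle virtual register, and

configuring the staging register of one of the plurality of computing units to read the data from the thread bundle virtual register in response to the read request and to broadcast the data to the plurality of computing units.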

17. A control device, comprising:

at least one processor; and

at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, wherein the instructions, when executed by the at least one processor, cause the control device to perform the steps of the method according to any one of claims 9 to 16.

18. A computer readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 9 to 16.

19. A computer program product comprising a computer program which, when executed by a machine, performs the method of any of claims 9 to 16.