JP2007200288A

JP2007200288A - System and method for grouping execution threads

Info

Publication number: JP2007200288A
Application number: JP2006338917A
Authority: JP
Inventors: Brett W. Coon; John E. Lindholm
Original assignee: Nvidia Corp
Current assignee: Nvidia Corp
Priority date: 2005-12-16
Filing date: 2006-12-15
Publication date: 2007-08-09
Anticipated expiration: 2026-12-15
Also published as: TW200745953A; JP4292198B2; CN1983196A; CN1983196B; TWI338861B; US20070143582A1

Abstract

PROBLEM TO BE SOLVED: To provide a method for grouping execution threads so that execution hardware is utilized more efficiently.

SOLUTION: A plurality of threads are divided into buddy groups of two or more threads, so that each thread has one or more buddy threads assigned to it. Only one thread in each buddy group actively executes instructions, which allows buddy threads to share hardware resources such as registers. When an active thread encounters a swap event, such as a swap instruction, the active thread suspends execution and one of its buddy threads begins execution using that thread's private hardware resources and the buddy group's shared hardware resources. As a result, the thread count can be increased without replicating all of the per-thread hardware resources.

COPYRIGHT: (C)2007,JPO&INPIT

Description

Field of Invention

[0001] Embodiments of the present invention relate generally to multi-threaded processing, and more particularly to systems and methods for grouping execution threads to achieve improved hardware utilization.

Explanation of related technology

[0002] In general, computer instructions require multiple clock cycles to execute. For this reason, a multi-threaded processor executes parallel threads of instructions in succession so that the hardware for executing the instructions is kept as busy as possible. For example, when executing a thread of instructions having the characteristics shown below, a multi-threaded processor can schedule four parallel threads in succession. By scheduling the threads in this way, the multi-threaded processor can complete execution of the four threads after 23 clock cycles: the first thread executes during clock cycles 1-20, the second during clock cycles 2-21, the third during clock cycles 3-22, and the fourth during clock cycles 4-23. By comparison, if the processor did not schedule a thread until the thread in process had completed execution, it would take 80 clock cycles to complete execution of the four threads: the first thread executes during clock cycles 1-20, the second during clock cycles 21-40, the third during clock cycles 41-60, and the fourth during clock cycles 61-80.

Instruction   Latency           Resources required
1             4 clock cycles    3 registers
2             4 clock cycles    4 registers
3             4 clock cycles    3 registers
4             4 clock cycles    5 registers
5             4 clock cycles    3 registers

[0003] However, the parallel processing described above requires a large amount of hardware resources, e.g., a large number of registers. In the example described above, parallel processing requires 20 registers, compared with 5 for non-parallel processing.
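The cycle and register counts quoted in [0002] and [0003] can be reproduced with a short calculation. The following Python sketch is illustrative only and is not part of the patent text; the function names are hypothetical, and it assumes that each live thread reserves its own worst-case register need and that successive threads start one clock cycle apart.

```python
# (latency in clock cycles, registers required) for instructions 1-5 of the example thread.
INSTRUCTIONS = [(4, 3), (4, 4), (4, 3), (4, 5), (4, 3)]


def thread_latency(instructions):
    """Total clock cycles one thread needs when its instructions run back to back."""
    return sum(latency for latency, _ in instructions)


def interleaved_cycles(num_threads, instructions):
    """Threads start on successive clock cycles, so the last one starts num_threads - 1 cycles late."""
    return (num_threads - 1) + thread_latency(instructions)


def serial_cycles(num_threads, instructions):
    """Each thread waits for the previous thread to finish completely."""
    return num_threads * thread_latency(instructions)


def registers_needed(live_threads, instructions):
    """Assumption: every live thread holds its own peak register allocation."""
    peak = max(registers for _, registers in instructions)
    return live_threads * peak


print(interleaved_cycles(4, INSTRUCTIONS), serial_cycles(4, INSTRUCTIONS))
print(registers_needed(4, INSTRUCTIONS), registers_needed(1, INSTRUCTIONS))
```

With four threads the sketch gives 23 versus 80 clock cycles and 20 versus 5 registers, matching the figures in the text.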

[0004] In many cases, execution latency is not uniform. In graphics processing, for example, a thread of instructions typically includes math operations, which often have latencies of less than 10 clock cycles, and memory access operations, which have latencies of 100 clock cycles or more. In such cases, scheduling the execution of parallel threads in succession does not work very well. If too few parallel threads are executed in succession, much of the execution hardware is underutilized because of the long latency of the memory access operations. If, on the other hand, the number of parallel threads executed in succession is made large enough to cover the long latency of the memory access operations, the number of registers required to support the live threads increases significantly.

Summary of the Invention

[0005] The present invention provides a method for grouping execution threads so that execution hardware is utilized more efficiently. The present invention also provides a computer system comprising a memory unit configured to group execution threads so that execution hardware is utilized more efficiently.

[0006] According to one embodiment of the present invention, a plurality of threads are divided into buddy groups of two or more threads, so that each thread is assigned one or more buddy threads. Only one thread in each buddy group actively executes instructions. When an active thread encounters a swap event, such as a swap instruction, the active thread suspends execution and one of its buddy threads begins executing.

[0007] A swap instruction typically appears after a long-latency instruction and causes the currently active thread to be swapped with one of its buddy threads in the active execution list. Execution of the buddy thread continues until the buddy thread encounters a swap instruction, which causes it to be swapped with one of its own buddy threads in the active execution list. If there are only two buddy threads in the group, the buddy thread is swapped with the original thread in the active execution list and execution of the original thread resumes. If there are more than two buddy threads in the group, the buddy thread is swapped with the next buddy thread in the group according to some predetermined order.

[0008] To conserve register file usage, each buddy thread divides its register allocation into two groups: private and shared. Only the registers belonging to the private group retain their values across swaps. The shared registers are always owned by the currently active thread of the buddy group.

[0009] Buddy groups are organized using a table that is populated with threads when a program is loaded for execution. This table may be maintained in on-chip registers. The table has a plurality of rows and is configured according to the number of threads in each buddy group. For example, if each buddy group has two threads, the table is configured with two columns. If each buddy group has three threads, the table is configured with three columns.

[0010] According to an embodiment of the present invention, a computer system stores the above-described table in memory and includes a processing unit configured with first and second execution pipelines. The first execution pipeline is used to perform math operations, and the second execution pipeline is used to perform memory operations.

[0011] So that the above-described features of the invention can be understood in detail, the invention summarized above is described in more detail with reference to embodiments, some of which are illustrated in the accompanying drawings. The accompanying drawings illustrate only typical embodiments of the invention and therefore do not limit its scope, since the invention may admit other equally effective embodiments.

Detailed description

[0019] FIG. 1 is a simplified block diagram of a computer system 100 implementing a graphics processing unit (GPU) 120 having a plurality of processing units in which the present invention may be implemented. The GPU 120 includes an interface unit 122 coupled to a plurality of processing units 124-1, 124-2, ..., 124-N, where N is an integer greater than 1. The processing units 124 can access a local graphics memory 130 via a memory controller 126. The GPU 120 and the local graphics memory 130 form a graphics subsystem that is accessed by a central processing unit (CPU) 110 of the computer system 100 using a driver stored in a system memory 112.

[0020] FIG. 2 shows one of the processing units 124 in more detail. The processing unit shown in FIG. 2 is referred to herein by reference numeral 200 and represents any one of the processing units 124 shown in FIG. 1. The processing unit 200 includes an instruction dispatch unit 212 for issuing instructions to be executed by the processing unit 200, a register file 214 that stores the operands used in executing the instructions, and a pair of execution pipelines 222 and 224. The first execution pipeline 222 is configured to perform math operations, and the second execution pipeline 224 is configured to perform memory access operations. In general, the latency of instructions executed in the second execution pipeline 224 is significantly longer than the latency of instructions executed in the first execution pipeline 222. When the instruction dispatch unit 212 issues an instruction, it sends a pipeline configuration signal to one of the two execution pipelines 222 and 224. If the instruction is of the math type, the pipeline configuration signal is sent to the first execution pipeline 222. If the instruction is of the memory access type, the pipeline configuration signal is sent to the second execution pipeline 224. The execution results of the two execution pipelines 222 and 224 are written back to the register file 214.

[0021] FIG. 3 is a functional block diagram of the instruction dispatch unit 212. The instruction dispatch unit 212 includes an instruction buffer 310 having a plurality of slots. The number of slots in this embodiment is 12, and each slot can hold up to two instructions. If any one of the slots has space for another instruction, a fetch 312 is made from the thread pool 305 into the instruction cache 314. The thread pool 305 is populated with threads when a program is loaded for execution. Instructions stored in the instruction cache 314 undergo a decode 316 before they are added to a scoreboard 322, which tracks the instructions that are in flight (i.e., instructions that have been issued but have not completed), and placed in the empty space of the instruction buffer 310.

[0022] The instruction dispatch unit 212 further includes issue logic 320. The issue logic 320 examines the scoreboard 322 and issues from the instruction buffer 310 an instruction that does not depend on any of the instructions in flight. In conjunction with the issue from the instruction buffer 310, the issue logic 320 sends a pipeline configuration signal to the appropriate execution pipeline.
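The behavior of the issue logic described in [0020] to [0022] can be pictured with the following Python sketch. It is a simplified illustration under stated assumptions, not the patented hardware: the `Instr` record, the representation of the scoreboard as a set of (slot, register) pairs, and the "first eligible slot wins" policy are all hypothetical choices made for the example. Only the reference numerals 222 (math pipeline) and 224 (memory access pipeline) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    kind: str          # "math" or "mem" (hypothetical encoding of the instruction type)
    dest: str          # destination register name
    srcs: tuple = ()   # source register names


def pick_issuable(buffer_slots, scoreboard):
    """Return (slot, pipeline) for the first buffered instruction that does not
    depend on an in-flight instruction, or None if every slot must wait.

    scoreboard: set of (slot, register) pairs still being written by
    instructions that have issued but not completed (an assumed representation).
    """
    for slot, instr in enumerate(buffer_slots):
        if instr is None:
            continue  # empty slot
        registers = (instr.dest,) + tuple(instr.srcs)
        if any((slot, reg) in scoreboard for reg in registers):
            continue  # depends on an instruction in flight
        pipeline = 222 if instr.kind == "math" else 224
        return slot, pipeline
    return None


buffer_slots = [Instr("math", "r", ("r", "w")), Instr("mem", "s", ("t",)), None]
print(pick_issuable(buffer_slots, {(0, "r")}))
```

In this example slot 0 is blocked because register r of that thread is still in flight, so the memory access instruction in slot 1 is selected and routed to pipeline 224.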

[0023] FIG. 4 shows the configuration of the thread pool 305 according to a first embodiment of the present invention. The thread pool 305 is configured as a table with 12 rows and 2 columns. Each cell of the table represents a memory slot that stores a thread. Each row of the table represents a buddy group. Thus, the thread in cell 0A of the table is a buddy thread of the thread in cell 0B of the table. According to embodiments of the present invention, only one thread of a buddy group is active at a time. During an instruction fetch, an instruction from the active thread is fetched. The fetched instruction is then decoded and stored in the corresponding slot of the instruction buffer 310. In the embodiment illustrated herein, an instruction fetched from either cell 0A or cell 0B of the thread pool 305 is stored in slot 0 of the instruction buffer 310, an instruction fetched from either cell 1A or cell 1B of the thread pool 305 is stored in slot 1 of the instruction buffer 310, and so on. The instructions stored in the instruction buffer 310 are issued in successive clock cycles according to the issue logic 320. In the simple example shown in FIG. 6, the instructions stored in the instruction buffer 310 are issued in successive clock cycles beginning with the instruction in row 0, then the instruction in row 1, and so on.

[0024] FIG. 5 shows the configuration of the thread pool 305 according to a second embodiment of the present invention. The thread pool 305 is configured as a table with 8 rows and 3 columns. Each cell of the table represents a memory slot that stores a thread. Each row of the table represents a buddy group. Thus, the threads in cells 0A, 0B, and 0C of the table are considered buddy threads. According to embodiments of the present invention, only one thread of a buddy group is active at a time. During an instruction fetch, an instruction from the active thread is fetched. The fetched instruction is then decoded and stored in the corresponding slot of the instruction buffer 310. In the embodiment illustrated herein, an instruction fetched from cell 0A, cell 0B, or cell 0C of the thread pool 305 is stored in slot 0 of the instruction buffer 310, an instruction fetched from cell 1A, cell 1B, or cell 1C of the thread pool 305 is stored in slot 1 of the instruction buffer 310, and so on. The instructions stored in the instruction buffer 310 are issued in successive clock cycles according to the issue logic 320.

[0025] When the thread pool 305 is populated with threads, it is loaded in column-major order. Cell 0A is loaded first, followed by cell 1A, cell 2A, and so on, until column A is filled. Cell 0B is then loaded, followed by cell 1B, cell 2B, and so on, until column B is filled. If the thread pool 305 is configured with additional columns, this thread loading process continues in the same manner until all columns are filled. By loading the thread pool 305 in column-major order, buddy threads can be separated in time from one another as far as possible. In addition, each row of buddy threads is independent of the other rows, so that the order between rows is only minimally enforced by the issue logic 320 when instructions are issued from the instruction buffer 310.
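The column-major loading order of [0025] can be sketched as follows. This Python fragment is illustrative only; the function name `load_thread_pool` and the use of integer thread identifiers are assumptions introduced for the example, and the pool is modeled as a list of rows in which column index 0 corresponds to column A, 1 to column B, and so on.

```python
def load_thread_pool(thread_ids, rows, cols):
    """Fill a rows x cols thread pool in column-major order.

    pool[r][c] holds the thread in row r (buddy group r), column c.
    Cells stay None when there are fewer threads than cells.
    """
    pool = [[None] * cols for _ in range(rows)]
    for n, thread_id in enumerate(thread_ids[: rows * cols]):
        col, row = divmod(n, rows)  # fill column A top to bottom, then column B, ...
        pool[row][col] = thread_id
    return pool


# First embodiment: 12 rows x 2 columns, threads numbered in load order.
pool = load_thread_pool(list(range(24)), rows=12, cols=2)
print(pool[0], pool[1], pool[11])
```

With threads numbered in load order, row 0 holds threads 0 and 12, so the two buddies of a group are loaded twelve positions apart, which is the separation in time that the paragraph describes.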

[0026] FIG. 6 is a timing diagram illustrating the swapping of active execution threads when there are two buddy threads per group. The solid arrows correspond to sequences of instructions executed for the active thread. The timing diagram shows that the thread in cell 0A of the thread pool 305 is started first and that the sequence of instructions from that thread is executed until a swap instruction is issued from that thread. When the swap instruction is issued, the thread in cell 0A of the thread pool 305 goes to sleep (i.e., is made inactive) and its buddy thread, i.e., the thread in cell 0B of the thread pool 305, is made active. Thereafter, the sequence of instructions from the thread in cell 0B of the thread pool 305 is executed until a swap instruction is issued from that thread. When this swap instruction is issued, the thread in cell 0B of the thread pool 305 goes to sleep and its buddy thread, i.e., the thread in cell 0A of the thread pool 305, is made active. This continues until both threads have completed their execution. A swap to the buddy thread is also made when a thread has completed execution but its buddy thread has not.

[0027] As shown in FIG. 6, the other active threads of the thread pool 305 are started in succession after the thread in cell 0A. Like the thread in cell 0A, each of the other active threads is executed until a swap instruction is issued from that thread, at which point the thread goes to sleep and its buddy thread is made active. Active execution then alternates between the buddy threads until both threads have completed their execution.

[0028] FIG. 7 is a flowchart illustrating the process steps carried out by a processing unit when executing the threads of a buddy group (or, in short, buddy threads). In step 710, hardware resources, in particular registers, are allocated for the buddy threads. The allocated registers include private registers for each of the buddy threads and shared registers to be shared by the buddy threads. The allocation of shared registers conserves register usage. For example, if there are two buddy threads and each of the buddy threads requires 24 registers, a total of 48 registers is needed to carry out a conventional multi-processing method. In embodiments of the present invention, however, shared registers are allocated. These correspond to registers that are needed while a thread is active but not while the thread is inactive, e.g., while the thread is waiting for a long-latency operation to complete. Private registers are allocated to store any information that must be preserved between swaps. In the example in which each of the two buddy threads requires 24 registers, if 16 of these registers can be allocated as shared registers, a total of only 32 registers is needed to execute both buddy threads. The savings are even greater if there are three buddy threads per buddy group. In this example, the present invention requires a total of 40 registers, whereas the conventional multi-processing method requires a total of 72 registers.
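The register totals in [0028] follow from simple arithmetic: each buddy thread keeps only its private registers for itself, and one set of shared registers serves the whole group. The Python sketch below is illustrative only; the function names are hypothetical.

```python
def conventional_registers(threads, regs_per_thread):
    """Every thread receives a full, separate register allocation."""
    return threads * regs_per_thread


def buddy_registers(threads, regs_per_thread, shared):
    """Each thread keeps (regs_per_thread - shared) private registers;
    a single set of `shared` registers is owned by whichever thread is active."""
    private = regs_per_thread - shared
    return threads * private + shared


for n in (2, 3):
    print(n, conventional_registers(n, 24), buddy_registers(n, 24, 16))
```

For 24 registers per thread with 16 of them shared, this reproduces the figures in the text: 48 versus 32 registers for two buddy threads, and 72 versus 40 for three.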

[0029] One of the buddy threads starts as the active thread, and an instruction from this thread is retrieved for execution (step 712). In step 714, execution of the instruction retrieved in step 712 is initiated. Then, in step 716, the retrieved instruction is examined to see whether it is a swap instruction. If it is a swap instruction, the currently active thread is made inactive and one of the other threads in the buddy group is made active (step 717). If it is not a swap instruction, a check is made as to whether the execution initiated in step 714 has completed (step 718). When this execution completes, the currently active thread is examined to see whether any instructions remain to be executed (step 720). If so, the process flow returns to step 712, where the next instruction to be executed is retrieved from the currently active thread. If not, a check is made to see whether all buddy threads have completed execution (step 722). If they have, the process ends. If they have not, the process flow returns to step 717, where a swap is made to a buddy thread that has not completed.
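The control flow of FIG. 7 can be illustrated with a small software model. The Python sketch below is a simplified illustration, not the hardware process itself: instructions are modeled as strings, the literal string "Swap" stands for the swap instruction, execution is treated as instantaneous (so the completion check of step 718 is implicit), and a swap is assumed to select the next unfinished thread in round-robin order, which is one possible "predetermined order". The function name `run_buddy_group` is hypothetical.

```python
def run_buddy_group(threads):
    """Model of steps 712-722 for one buddy group.

    threads: list of instruction lists, one per buddy thread.
    Returns the execution order as a list of (thread_index, instruction).
    """
    pcs = [0] * len(threads)  # next instruction index for each thread
    trace = []

    def next_unfinished(current):
        # Round-robin search starting after `current`; may return `current`
        # itself if it is the only thread with instructions left.
        n = len(threads)
        for step in range(1, n + 1):
            candidate = (current + step) % n
            if pcs[candidate] < len(threads[candidate]):
                return candidate
        return None

    active = next_unfinished(-1)
    while active is not None:
        instr = threads[active][pcs[active]]         # step 712: retrieve
        pcs[active] += 1
        trace.append((active, instr))                # step 714: execute
        if instr == "Swap":                          # step 716: swap instruction?
            active = next_unfinished(active)         # step 717: activate a buddy
        elif pcs[active] == len(threads[active]):    # step 720: nothing left
            active = next_unfinished(active)         # step 722, then 717 or end
    return trace


trace = run_buddy_group([["a0", "Swap", "a1"], ["b0", "Swap", "b1", "b2"]])
print(trace)
```

In this run the first thread executes until its swap instruction, the second thread then runs until its own swap instruction, and execution alternates until both threads finish, including the final swap to the second thread after the first has completed.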

[0030] In the embodiments of the present invention described above, swap instructions are inserted when the program is compiled. A swap instruction is typically inserted immediately after a long-latency instruction, preferably at points in the program where a large number of shared registers can be allocated relative to the number of private registers. In graphics processing, for example, a swap instruction is inserted immediately after a texture instruction. In other embodiments of the present invention, the swap event need not be a swap instruction and may instead be some event recognized by the hardware. For example, the hardware may be configured to recognize long latencies in instruction execution. When it does so, it can cause the thread that issued the long-latency instruction to become inactive and another thread in the same buddy group to become active. The swap event may also be some recognizable event during a long-latency operation, e.g., the first scoreboard stall that occurs during the long-latency operation.

[0031] The following instruction sequence illustrates where in a shader program a swap instruction might be inserted by the compiler.

Inst_00: Interpolate iw
Inst_01: Reciprocal w
Inst_02: Interpolate s, w
Inst_03: Interpolate t, w
Inst_04: Texture s, t // Texture returns r, g, b, a values
Inst_05: Swap
Inst_06: Multiply r, r, w
Inst_07: Multiply g, g, w

The swap instruction (Inst_05) is inserted by the compiler immediately after the long-latency texture instruction (Inst_04). In this way, the swap to the buddy thread can be made while the long-latency texture instruction (Inst_04) is executing. Inserting the swap instruction after the multiply instruction (Inst_06) is much less desirable, because the multiply instruction (Inst_06) depends on the result of the texture instruction (Inst_04), so the swap to the buddy thread could not be made until after the long-latency texture instruction (Inst_04) has completed its execution.
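A compile-time pass of the kind described in [0030] and [0031] might look like the following Python sketch. It is a hedged illustration only: the patent does not specify the compiler's implementation, the function name `insert_swaps` is hypothetical, the program is modeled as a list of text lines whose first word is the opcode, and the set of long-latency opcodes (here only `Texture`) is an assumption. The sketch also ignores the stated preference for points with a high ratio of shared to private registers.

```python
LONG_LATENCY_OPCODES = {"Texture"}  # assumed set; a real compiler would know its own opcodes


def opcode_of(line):
    """First word of an instruction line, or '' for a blank line."""
    parts = line.split()
    return parts[0] if parts else ""


def insert_swaps(program, long_latency=LONG_LATENCY_OPCODES):
    """Return a copy of `program` with 'Swap' placed directly after each
    long-latency instruction that is not already followed by a Swap."""
    out = []
    for i, line in enumerate(program):
        out.append(line)
        followed_by_swap = i + 1 < len(program) and opcode_of(program[i + 1]) == "Swap"
        if opcode_of(line) in long_latency and not followed_by_swap:
            out.append("Swap")
    return out


shader = [
    "Interpolate iw",
    "Reciprocal w",
    "Interpolate s, w",
    "Interpolate t, w",
    "Texture s, t",
    "Multiply r, r, w",
    "Multiply g, g, w",
]
for line in insert_swaps(shader):
    print(line)
```

Applied to the example shader without its swap instruction, the pass places `Swap` between the `Texture` and the first `Multiply`, which is the position of Inst_05 in the listing. Running the pass a second time leaves the program unchanged.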

[0032] For simplicity of illustration, the threads used in the above description of the embodiments of the present invention are single threads of instructions. However, the present invention is also applicable to embodiments in which similar threads are grouped together and the same instruction from this group, also referred to as a convoy, is processed through multiple parallel data paths using a single instruction multiple data (SIMD) processor.

[0033] While embodiments of the present invention have been described above, other and further embodiments may be devised without departing from the basic scope of the invention. The scope of the invention is determined by the claims.

FIG. 1 is a simplified block diagram of a computer system implementing a GPU having a plurality of processing units in which the present invention may be implemented.
FIG. 2 shows the processing unit of FIG. 1 in more detail.
FIG. 3 is a functional block diagram of the instruction dispatch unit shown in FIG. 2.
FIG. 4 is a conceptual diagram showing the thread pool and the instruction buffer according to the first embodiment of the present invention.
FIG. 5 is a conceptual diagram showing the thread pool and the instruction buffer according to the second embodiment of the present invention.
FIG. 6 is a timing diagram showing the swapping of active execution threads between buddy threads.
FIG. 7 is a flowchart showing the process steps carried out by a processing unit when executing buddy threads.

Explanation of symbols

100 ... computer system, 110 ... central processing unit (CPU), 112 ... system memory, 120 ... graphics processing unit (GPU), 122 ... interface unit, 124 ... processing unit, 126 ... memory controller, 130 ... local graphics memory, 200 ... processing unit, 212 ... instruction dispatch unit, 214 ... register file, 222, 224 ... execution pipelines, 305 ... thread pool, 310 ... instruction buffer, 314 ... instruction cache, 320 ... issue logic, 322 ... scoreboard.

Claims

1. A method of executing instructions of a plurality of threads in a processing unit, comprising:
assigning a first set, a second set, and a shared set of hardware resources of the processing unit to instructions of a first thread and instructions of a second thread;
executing instructions of the first thread using the first set and the shared set of hardware resources until a predetermined event occurs; and
in response to the occurrence of the predetermined event, suspending execution of the instructions of the first thread and executing instructions of the second thread using the second set and the shared set of hardware resources.

2. The method of claim 1, wherein the instructions of the second thread are executed until another predetermined event occurs, and, in response to the occurrence of the other predetermined event, execution of the instructions of the second thread is suspended and execution of the instructions of the first thread is resumed.

3. The method of claim 2, wherein the instructions of the first thread include a swap instruction and the predetermined event occurs when the swap instruction in the first thread is executed, and wherein the instructions of the second thread include a swap instruction and the other predetermined event occurs when the swap instruction in the second thread is executed.

4. The method of claim 1, further comprising assigning a third set of hardware resources and the shared set of hardware resources to instructions of a third thread, wherein the instructions of the second thread are executed until another predetermined event occurs, and, in response to the occurrence of the other predetermined event, execution of the instructions of the second thread is suspended and the instructions of the third thread are executed.

5. The method of claim 1, wherein the predetermined event occurs when a long-latency instruction among the instructions of the first thread is executed.

6. The method of claim 5, wherein the long-latency instruction comprises a memory access instruction.

7. The method of claim 1, wherein the hardware resources include registers.

8. The method of claim 7, wherein the hardware resources further include an instruction buffer.

9. The method of claim 1, further comprising:
assigning a third set, a fourth set, and a fifth set of hardware resources of the processing unit to instructions of a third thread and instructions of a fourth thread;
executing instructions of the third thread using the third set and the fifth set of hardware resources until a swap event for the third thread occurs; and
in response to the occurrence of the swap event for the third thread, suspending execution of the instructions of the third thread and executing instructions of the fourth thread using the fourth set and the fifth set of hardware resources.

10. The method of claim 9, wherein the instructions of the fourth thread are executed until a swap event for the fourth thread occurs, and, in response to the occurrence of the swap event for the fourth thread, execution of the instructions of the fourth thread is suspended and execution of the instructions of the third thread is resumed.