TW201342228A

TW201342228A - Compute thread array granularity execution preemption

Info

Publication number: TW201342228A
Application number: TW101141781A
Authority: TW
Inventors: Lacky V Shah; Gregory Scott Palmer; Gernot Schaufler; Samuel H Duncan; Philip Browning Johnson; Shirish Gadre; Robert Ohannessian; Nicholas Wang; Christopher Lamb; Philip Alexander Cuadra; Timothy John Purcell
Original assignee: Nvidia Corp
Priority date: 2011-11-10
Filing date: 2012-11-09
Publication date: 2013-10-16
Also published as: TWI457828B; CN103197917A; DE102012220365A1

Abstract

One embodiment of the present invention sets forth a technique instruction level and compute thread array granularity execution preemption. Preempting at the instruction level does not require any draining of the processing pipeline. No new instructions are issued and the context state is unloaded from the processing pipeline. When preemption is performed at a compute thread array boundary, the amount of context state to be stored is reduced because execution units within the processing pipeline complete execution of in-flight instructions and become idle. If, the amount of time needed to complete execution of the in-flight instructions exceeds a threshold, then the preemption may dynamically change to be performed at the instruction level instead of at compute thread array granularity.

Description

Privilege array granulation execution priority calculation

本發明概略關於程式執行優先權，尤指執行緒陣列粒化執行的優先權計算。 The present invention is generally directed to program execution priorities, and more particularly to priority calculations for thread array granulation execution.

優先權為一種在多個不同應用程式之間將一處理器做時間分割的機制。當有多個不同應用程式需要同時使用該處理器時，要對所有該等應用程式達到提前進度的一種方式為針對該處理器上一短的時間片來運作每一應用程式。以往時間分割需要該處理器管線被完全地排除，且當該處理器為閒置時，切換一不同的應用程式進入來由該處理器管線做執行。此種時間分割的機制已被稱之為「等待閒置」(wait for idle)優先權，而該機制在當該處理器耗用很長時間來排除正在該處理器管線上運作的該工作時，即無法良好地工作。例如，考慮到一種需要非常長時間運作的圖形遮影器程式，或在最差情況下即為具有一無限迴圈的遮影器程式。為了能夠在不同應用程式之間進行時間分割，對於每一應用程式之閒置執行所需要的時間長度必須受限，所以長運作時間的應用程式無法有效地減少針對其它應用程式可使用的該時間片。 Priority is a mechanism for time division of a processor between multiple different applications. When there are multiple different applications that need to use the processor at the same time, one way to achieve advancement for all of these applications is to run each application for a short time slice on the processor. In the past, time division required that the processor pipeline be completely eliminated, and when the processor is idle, a different application is switched in to be executed by the processor pipeline. This time splitting mechanism has been referred to as "wait for idle" priority, and when the processor takes a long time to eliminate the work being performed on the processor pipeline, That is, it does not work well. For example, consider a graphics shader program that takes a very long time to operate, or in the worst case, a shader program with an infinite loop. In order to be able to time split between different applications, the length of time required for idle execution of each application must be limited, so long-running applications cannot effectively reduce the time slice available for other applications. .

另一種已被考慮來實作優先權的機制為停止或凍結該處理器，然後儲存該處理器內所有該等暫存器和管線正反器的該等內容，而在稍後恢復該處理器內所有該等暫存器和管線正反器的該等內容。儲存和恢復所有該等暫存器和管線正反器的該等內容基本上會造成非常大量的狀態要被儲存和恢復。儲存和恢復該狀態的時間會減少在該等時間片期間該等應用程式之每一者可用來執行的時間。 Another mechanism that has been considered to implement priority is to stop or freeze the processor, then store all of the contents of the registers and pipeline flip-flops in the processor, and restore the processor later. The contents of all such registers and pipeline flip-flops. Storing and restoring all of these registers and pipeline flip-flops essentially results in a very large number of states being stored and restored. The time to store and restore this state reduces the time available to each of the applications during the time slices.

因此，本技術中需要一種用於執行優先權的系統和方法，其在當該應用程式被佔先時不需要儲存一應用程式的整個狀態，也不需要等待一處理管線成為閒置來優先化該應用程式。 Therefore, there is a need in the art for a system and method for performing priority that does not require the storage of an application's entire state when the application is preempted. State, there is no need to wait for a processing pipeline to become idle to prioritize the application.

本發明為一種用於計算執行緒陣列粒化執行優先權的系統和方法。當優先權被啟動時，該內容狀態自該處理管線卸載。當於一計算執行緒陣列邊界處執行優先權時，要被儲存的內容狀態量可減少，其係因為在該處理管線內的執行單元完成進行中指令的執行並成為閒置。如果完成該等進行中指令之執行所需要的時間長度超過一臨界值，則該優先權可以動態地改為於該指令層級處來進行，而非在計算執行緒陣列粒化時進行。 The present invention is a system and method for computing thread array granulation execution priorities. When the priority is initiated, the content status is unloaded from the processing pipeline. When the priority is executed at a computational thread array boundary, the amount of content state to be stored can be reduced because the execution unit within the processing pipeline completes the execution of the in-progress instruction and becomes idle. If the length of time required to complete the execution of the in-progress instructions exceeds a threshold, then the priority can be dynamically changed to the level of the instruction rather than the computational thread array granulation.

本發明之一種在一多執行緒系統中優先化程式指令之執行的方法之多種具體實施例包括執行在該多執行緒系統內一處理管線中使用一第一內容的程式指令。使用該第一內容的執行於一計算執行緒陣列層級處被佔先來執行在該多執行緒系統中使用一第二內容的不同程式指令。使用該第一內容的該等程式指令之執行被佔先的指示被儲存，且該等不同程式指令在該處理管線中使用該第二內容來執行。 Various embodiments of the method of prioritizing the execution of program instructions in a multi-threaded system of the present invention include executing program instructions for using a first content in a processing pipeline within the multi-threaded system. Execution of the first content is performed at a compute thread array level to execute different program instructions that use a second content in the multi-threaded system. Execution of the execution of the program instructions using the first content is stored, and the different program instructions are executed using the second content in the processing pipeline.

本發明之多種具體實施例包括用於優先化程式指令之執行的一多執行緒系統。該多執行緒系統包含一記憶體、一主控介面和一處理管線。該記憶體設置成儲存對應於一第一內容的程式指令和對應於一第二內容的不同程式指令。該主控介面耦合至該處理管線，並設置成於一計算執行緒陣列層級處優先化使用該第一內容之程式指令的執行來執行使用一第二內容之不同的程式指令。該處理管線設置成執行使用該第一內容的該等程式指令、優先化使用該第一內容的該等程式指令之執行來執行使用該第二內容的該等不同程式指令、儲存使用該第一內容的該等程式指令之執行被佔先的一指示、以及執行使用該第二內容的該等不同程式指令。 Various embodiments of the present invention include a multi-threaded system for prioritizing the execution of program instructions. The multi-thread system includes a memory, a master interface, and a processing pipeline. The memory is configured to store program instructions corresponding to a first content and different program instructions corresponding to a second content. The master interface is coupled to the processing pipeline and is configured to perform execution of a program instruction using the first content at a compute thread array level to execute a different program instruction using a second content. The processing pipeline is configured to execute the program instructions using the first content, prioritize execution of the program instructions using the first content, execute the different program instructions using the second content, and store the first use The execution of the program instructions of the content is preceded by an instruction and execution of the different program instructions that use the second content.

該優先權機制可最小化當一應用程式被佔先時所要儲存以及當該應用程式恢復執行時所要恢復的該狀態數量。此外，長運作時間的應用程式可在非常短的時間內被佔先。 This priority mechanism minimizes the number of states to be restored when an application is preempted and when the application resumes execution. In addition, long-running applications can be taken up in a very short time.

在以下的說明中，許多特定細節係被提出來提供對於本發明之更為完整的瞭解。但是本技術專業人士將可瞭解到本發明可不利用一或多個這些特定細節來實施。在其它實例中，並未說明熟知的特徵，藉以避免混淆本發明。 In the following description, numerous specific details are set forth to provide a more complete understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features are not described in order to avoid obscuring the invention.

系統概述System Overview

第一圖係例示設置成實作本發明一或多種態樣之一電腦系統100的方塊圖。電腦系統100包括一中央處理單元(CPU)102與一系統記憶體104，其經由包括一記憶體橋接器105的互連接路徑進行通訊。記憶體橋接器105可為一北橋晶片，其經由一匯流排或其它通訊路徑106(例如HyperTransport聯結)連接到一I/O(輸入/輸出)橋接器107。I/O橋接器107可為一南橋晶片，其接收來自一或多個使用者輸入裝置108(例如鍵盤、滑鼠)的使用者輸入，並經由路徑106及記憶體橋接器105轉送該輸入到CPU 102。一平行處理子系統112經由一匯流排或其它通訊路徑113(例如PCI Express，加速繪圖埠、或HyperTransport聯結)耦合至記憶體橋接器105；在一具體實施例中，平行處理子系統112為一繪圖子系統，其傳遞像素到一顯示器110(例如一習用CRT或LCD式的監視器)。一系統碟114亦連接至I/O橋接器107。一交換器116提供I/O橋接器107與其它像是網路轉接器118與多種嵌入卡120,121之其它組件之間的連接。其它組件(未明確顯示)，包括有USB或其它埠連接、CD驅動器、DVD驅動器、薄膜記錄裝置及類似者，其亦可連接至I/O橋接器107。互連接第一圖中該等多種組件的通訊路徑可使用任何適當的協定來實作，例如PCI(周邊組件互連，Peripheral Component Interconnect)、PCI Express(PCI快速，PCI-E)、AGP(加速圖形通訊埠，Accelerated Graphics Port)、HyperTransport(超輸送)、或任何其它匯流排或點對點通訊協定、及不同裝置之間的連接，皆可使用如本技術中所知的不同協定。 The first figure illustrates a block diagram of a computer system 100 that is configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 that communicate via an interconnect path including a memory bridge 105. The memory bridge 105 can be a north bridge wafer that is connected to an I/O (input/output) bridge 107 via a bus or other communication path 106 (e.g., a HyperTransport link). The I/O bridge 107 can be a south bridge chip that receives user input from one or more user input devices 108 (eg, a keyboard, mouse) and forwards the input via path 106 and memory bridge 105 to CPU 102. A parallel processing subsystem 112 is coupled to the memory bridge 105 via a bus or other communication path 113 (e.g., PCI Express, accelerated graphics, or HyperTransport coupling); in one embodiment, the parallel processing subsystem 112 is a A graphics subsystem that passes pixels to a display 110 (eg, a conventional CRT or LCD type monitor). A system disk 114 is also coupled to the I/O bridge 107. A switch 116 provides a connection between the I/O bridge 107 and other components such as the network adapter 118 and various embedded cards 120, 121. Other components (not explicitly shown), including USB or other port connections, CD drives, DVD drives, thin film recording devices, and the like, may also be coupled to I/O bridge 107. Interconnecting the communication paths of the various components in the first figure can be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect, Peripheral) Component Interconnect), PCI Express (PCI Express, PCI-E), AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point protocol, and connections between different devices Different protocols as known in the art can be used.

在一具體實施例中，平行處理子系統112加入可針對圖形及視訊處理最佳化的電路，其包括例如視訊輸出電路，且構成一圖形處理單元(GPU)。在另一具體實施例中，平行處理子系統112加入可針對一般性目的處理最佳化的電路，而可保留底層的運算架構，在此處會有更為詳細的說明。在又另一具體實施例中，平行處理子系統112可被整合於一或多個其它系統元件，例如記憶體橋接器105、CPU 102、及I/O橋接器107而形成一系統上晶片(SoC,“System on chip”)。 In one embodiment, parallel processing subsystem 112 incorporates circuitry that can be optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU). In another embodiment, the parallel processing subsystem 112 incorporates circuitry that can be optimized for general purpose processing while retaining the underlying computing architecture, as will be described in more detail herein. In yet another embodiment, the parallel processing subsystem 112 can be integrated into one or more other system components, such as the memory bridge 105, the CPU 102, and the I/O bridge 107 to form a system-on-chip ( SoC, "System on chip").

將可瞭解到此處所示的系統僅為例示性，其有可能有多種變化及修正。該連接拓樸，包括橋接器的數目與配置，CPU 102的數目及平行處理子系統112的數目皆可視需要修改。例如，在一些具體實施例中，系統記憶體104直接連接至CPU 102而非透過一橋接器耦接，而其它裝置透過記憶體橋接器105及CPU 102與系統記憶體104進行通訊。在其它可替代的拓樸中，平行處理子系統112連接至I/O橋接器107或直接連接至CPU 102，而非連接至記憶體橋接器105。在又其它具體實施例中，I/O橋接器107及記憶體橋接器105可被整合到一單一晶片當中。大型具體實施例可包括兩個或更多的CPU 102，及兩個或更多的平行處理子系統112。此處所示的該等特定組件皆為選擇性的；例如其可支援任何數目的嵌入卡或周邊裝置。在一些具體實施例中，交換器116被省略，且網路轉接器118及嵌入卡120、121直接連接至I/O橋接器107。 It will be appreciated that the systems shown herein are merely illustrative and that many variations and modifications are possible. The connection topology, including the number and configuration of bridges, the number of CPUs 102, and the number of parallel processing subsystems 112 can all be modified as needed. For example, in some embodiments, system memory 104 is directly coupled to CPU 102 rather than through a bridge, while other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is coupled to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 can be integrated into a single wafer. Large specific embodiments may include two or more CPUs 102, and two or more parallel processing subsystems 112. The particular components shown herein are all optional; for example, they can support any number of embedded cards or peripheral devices. In some embodiments, switch 116 is omitted and network adapter 118 and embedded cards 120, 121 are directly connected to I/O bridge 107.

第二圖例示根據本發明一具體實施例之一平行處理子系統112。如所示，平行處理子系統112包括一或多個平行處理單元(PPU,“Parallel processing unit”)202，其每一者耦合於一局部平行處理(PP,“Parallel processing”)記憶體204。概言之，一平行處理子系統包括數目為U的PPU，其中U≧1。(在此處類似物件的多個實例標示為辨識該物件之參考編號，而括號中的數目辨識所需要的實例)。PPU 202及平行處理記憶體204可以使用一或多個積體電路裝置來實作，例如可程式化處理器，特殊應用積體電路(ASIC,“Application specific integrated circuits”)，或記憶體裝置，或以任何其它技術上可行的方式來實作。 The second figure illustrates a parallel processing subsystem 112 in accordance with one embodiment of the present invention. As shown, parallel processing subsystem 112 includes one or more parallel processing orders Elements (PPUs, "Parallel processing units") 202, each coupled to a partial parallel processing (PP, "Parallel processing") memory 204. In summary, a parallel processing subsystem includes a number U of PPUs, where U ≧ 1. (Several instances of the analog component are labeled here to identify the reference number of the object, and the number in parentheses identifies the desired instance). The PPU 202 and the parallel processing memory 204 can be implemented using one or more integrated circuit devices, such as a programmable processor, an application specific integrated circuit (ASIC), or a memory device. Or in any other technically feasible way.

請再次參照第一圖，在一些具體實施例中，平行處理子系統112中部份或所有的PPU 202為圖形處理器，其具有顯像管線，其能夠設置成執行關於自CPU 102及/或系統記憶體104經由記憶體橋接器105及匯流排113所供應的圖形資料產生像素資料的多種作業，與本地平行處理記憶體204進行互動(其能夠做為圖形記憶體，其包括例如一習用像框緩衝器)，以儲存及更新像素資料，傳遞像素資料到顯示裝置110及類似者。在一些具體實施例中，平行處理子系統112可以包括可操作為圖形處理器的一或多個PPU 202，及用於通用型運算的一或多個其它PPU 202。該等PPU可為相同或不同，且每個PPU可以具有其本身專屬的平行處理記憶體裝置或並無專屬的平行處理記憶體裝置。一或多個PPU 202可以輸出資料到顯示裝置110，或每個PPU 202可以輸出資料到一或多個顯示裝置110。 Referring again to the first figure, in some embodiments, some or all of the PPUs 202 in the parallel processing subsystem 112 are graphics processors having a visualization pipeline that can be configured to execute with respect to the self-CPU 102 and/or system. The memory 104 generates a plurality of operations of the pixel data via the graphic data supplied from the memory bridge 105 and the bus bar 113, and interacts with the local parallel processing memory 204 (which can be used as a graphic memory, which includes, for example, a conventional image frame buffer. To store and update pixel data, pass pixel data to display device 110 and the like. In some embodiments, parallel processing subsystem 112 can include one or more PPUs 202 that are operable as graphics processors, and one or more other PPUs 202 for general purpose operations. The PPUs may be the same or different, and each PPU may have its own proprietary parallel processing memory device or no proprietary parallel processing memory device. One or more PPUs 202 may output data to display device 110, or each PPU 202 may output data to one or more display devices 110.

在作業中，CPU 102為電腦系統100的主控處理器，其控制及協調其它系統組件的作業。特別是CPU 102發出控制PPU 202之作業的命令。在一些具體實施例中，CPU 102對每一PPU 202寫入一命令串流至一資料結構(未明確示於第一圖或第二圖中)，其可位於系統記憶體104、平行處理記憶體204或可同時由CPU 102與PPU 202存取的其它儲存位置。指向至每一資料結構的一指標被寫入至一推入緩衝器來啟始在該資料結構中該命令串流之處理。PPU 202自一或多個推入緩衝器讀取命令串流，然後相對於CPU 102的該作業非同步地執行命令。執行優先性可針對每一推入緩衝器來指定，以控制該等不同推入緩衝器的排程。 In operation, CPU 102 is the master processor of computer system 100 that controls and coordinates the operation of other system components. In particular, the CPU 102 issues commands to control the operation of the PPU 202. In some embodiments, CPU 102 writes a command stream to each PPU 202 to a data structure (not explicitly shown in the first or second figure), which may be located in system memory 104, parallel processing memory. Body 204 may be other storage locations that are simultaneously accessible by CPU 102 and PPU 202. An indicator directed to each data structure is written to a push-in buffer to initiate processing of the command stream in the data structure. PPU 202 reads from one or more push buffers The stream is streamed, and then the command is executed asynchronously with respect to the job of the CPU 102. Execution priority can be specified for each push-in buffer to control the scheduling of the different push-in buffers.

現在請回頭參照第二B圖，每個PPU 202包括一I/O(輸入/輸出)單元205，其經由通訊路徑113與電腦系統100的其它部份進行通訊，其連接至記憶體橋接器105(或在一替代性具體實施例中直接連接至CPU 102)。PPU 202與電腦系統100的其餘部份之連接亦可改變。在一些具體實施例中，平行處理子系統112係實作成一嵌入卡，其可被插入到電腦系統100的一擴充槽中。在其它具體實施例中，PPU 202可利用一匯流排橋接器整合在一單一晶片上，例如記憶體橋接器105或I/O橋接器107。在又其它的具體實施例中，PPU 202之部份或所有元件可與CPU 102整合在一單一晶片上。 Referring now back to FIG. 2B, each PPU 202 includes an I/O (input/output) unit 205 that communicates with other portions of computer system 100 via communication path 113, which is coupled to memory bridge 105. (Or directly connected to the CPU 102 in an alternative embodiment). The connection of the PPU 202 to the rest of the computer system 100 can also vary. In some embodiments, parallel processing subsystem 112 is implemented as an embedded card that can be inserted into an expansion slot of computer system 100. In other embodiments, PPU 202 can be integrated on a single wafer, such as memory bridge 105 or I/O bridge 107, using a bus bridge. In still other embodiments, some or all of the components of PPU 202 may be integrated with CPU 102 on a single wafer.

在一具體實施例中，通訊路徑113為一PCI-EXPRESS鏈路，其中如本技術中所熟知具有專屬的線路會分配給每個PPU 202。其亦可使用其它通訊路徑。一I/O單元205產生封包(或其它信號)在通訊路徑113上傳輸，且亦自通訊路徑113接收所有進入的封包(或其它信號)，導引該等進入封包到PPU 202的適當組件。例如，關於處理工作的命令可被導引到一主控介面206，而關於記憶體作業的命令(例如自平行處理記憶體204讀取或寫入其中)可被導引到一記憶體交叉開關單元210。主控介面206讀取每個推入緩衝器，並輸出儲存在該推入緩衝器中的該命令串流至一前端212。 In one embodiment, communication path 113 is a PCI-EXPRESS link in which a dedicated line is known to be assigned to each PPU 202 as is known in the art. It can also use other communication paths. An I/O unit 205 generates a packet (or other signal) for transmission over the communication path 113 and also receives all incoming packets (or other signals) from the communication path 113, directing the incoming packets to the appropriate components of the PPU 202. For example, commands regarding processing operations can be directed to a host interface 206, and commands for memory jobs (eg, read or written from parallel processing memory 204) can be directed to a memory crossbar switch. Unit 210. The master interface 206 reads each push-in buffer and outputs the command stream stored in the push-in buffer to a front end 212.

每一PPU 202較佳地是實作一高度平行的處理架構。如詳細所示，PPU 202(0)包括一處理叢集陣列230，其包括數目為C的通用處理叢集(GPC,“General processing clusters”)208，其中C≧1。每一GPC 208能夠同時執行大量(例如數百或數千)的執行緒，其中每個執行緒為一程式的一實例。在多種應用中，不同的GPC 208可分配來處理不同種類的程式，或執行不同種類的運算。GPC 208的分配可根據每種程式或運算所提升的工作負荷而改變。 Each PPU 202 is preferably implemented as a highly parallel processing architecture. As shown in detail, PPU 202(0) includes a processing cluster array 230 that includes a number C of general processing clusters (GPC) 208, where C ≧ 1. Each GPC 208 can execute a large number (e.g., hundreds or thousands) of threads simultaneously, each of which is an instance of a program. In a variety of applications, different GPCs 208 can be assigned to handle different kinds of programs, or perform different types. The operation. The allocation of GPC 208 can vary depending on the workload of each program or operation.

GPC 208由一任務/工作單元207內的一工作分配單元接收要被執行的處理任務。該工作分配單元接收指標來運算被編碼成任務中介資料(TMD)且儲存在記憶體中的處理任務。該等指向至TMD的指標被包括在儲存成一推入緩衝器且由前端單元212自主控介面206接收的該命令串流中。可被編碼成TMD的處理任務包括要被處理之資料的索引，以及定義了該資料要如何被處理的狀態參數和命令(例如那一個程式要被執行)。任務/工作單元207自前端212接收任務，並確保GPC 208在由該等TMD之每一者所指定的該處理啟始之前被設置成一有效狀態。一優先性可針對用於排程該處理任務之執行的每一TMD來指定。 The GPC 208 receives processing tasks to be performed by a work distribution unit within a task/work unit 207. The work distribution unit receives the metrics to compute processing tasks that are encoded into task mediation data (TMD) and stored in memory. The metrics directed to the TMD are included in the command stream stored as a push-in buffer and received by the front end unit 212 autonomous interface 206. A processing task that can be encoded into a TMD includes an index of the material to be processed, and state parameters and commands that define how the material is to be processed (eg, which program is to be executed). Task/work unit 207 receives tasks from front end 212 and ensures that GPC 208 is set to an active state prior to the start of the process specified by each of the TMDs. A priority may be specified for each TMD used to schedule the execution of the processing task.

記憶體介面214包括數目為D的區隔單元215，其每一者被直接耦合至平行處理記憶體204的一部份，其中D≧1。如所示，區隔單元215的該數目大致上等於DRAM 220的數目。在其它具體實施例中，區隔單元215的數目可能不等於記憶體裝置的數目。本技術專業人士將可瞭解到DRAM 220可由其它適當儲存裝置取代，並可為一般的習用設計。因此可省略詳細說明。顯像目標，例如圖框緩衝器或紋路地圖，其可儲存在不同DRAM 220中，其允許區隔單元215平行地寫入每個顯像目標之不同部份而有效率地使用平行處理記憶體204之可使用頻寬。 The memory interface 214 includes a number D of partitioning units 215, each of which is directly coupled to a portion of the parallel processing memory 204, where D≧1. As shown, the number of compartments 215 is substantially equal to the number of DRAMs 220. In other embodiments, the number of compartments 215 may not be equal to the number of memory devices. Those skilled in the art will appreciate that DRAM 220 can be replaced by other suitable storage devices and can be designed for general use. Therefore, the detailed description can be omitted. Development targets, such as frame buffers or texture maps, which may be stored in different DRAMs 220, allow the partition unit 215 to write different portions of each development target in parallel to efficiently use parallel processing memory. 204 can use the bandwidth.

GPC 208之任何一者可處理要被寫入到平行處理記憶體204內DRAM 220中任一者的資料。交叉開關單元210設置成導引每個GPC 208之輸出到任何區隔單元215的輸入或到另一個GPC 208做進一步處理。GPC 208經由交叉開關單元210與記憶體介面214進行通訊，以自多個外部記憶體裝置讀取或寫入其中。在一具體實施例中，交叉開關單元210具有到記憶體介面 214的一連接來與I/O單元205進行通訊，以及到局部平行處理記憶體204的一連接，藉此使得不同GPC 208內該等處理核心能夠與系統記憶體104或並非位在PPU 202局部之其它記憶體進行通訊。在第二圖所示的該具體實施例中，交叉開關單元210直接連接於I/O單元205。交叉開關單元210可使用虛擬通道來隔開GPC 208與區隔單元215之間的流量串流。 Any of the GPCs 208 can process data to be written to any of the DRAMs 220 in the parallel processing memory 204. The crossbar unit 210 is arranged to direct the output of each GPC 208 to the input of any of the segmentation units 215 or to another GPC 208 for further processing. The GPC 208 communicates with the memory interface 214 via the crossbar unit 210 to read or write from a plurality of external memory devices. In a specific embodiment, the crossbar unit 210 has a memory interface A connection of 214 communicates with I/O unit 205 and a connection to local parallel processing memory 204, thereby enabling the processing cores in different GPCs 208 to be associated with system memory 104 or not local to PPU 202. Other memory communicates. In the particular embodiment shown in the second figure, the crossbar unit 210 is directly coupled to the I/O unit 205. The crossbar unit 210 can use a virtual channel to separate the flow of traffic between the GPC 208 and the segmentation unit 215.

再次地，GPC 208可被程式化來執行關於許多種應用之處理工作，其中包括但不限於線性及非線性資料轉換、影片及/或聲音資料的過濾、模型化作業(例如應用物理定律來決定物體的位置、速度及其它屬性)、影像顯像作業(例如鑲嵌遮影器、頂點遮影器、幾何遮影器及/或像素遮影器程式)等等。PPU 202可將來自系統記憶體104及/或局部平行處理記憶體204的資料轉移到內部(晶片上)記憶體、處理該資料、及將結果資料寫回到系統記憶體104及/或局部平行處理記憶體204，其中這些資料可由其它系統組件存取，包括CPU 102或另一個平行處理子系統112。 Again, GPC 208 can be programmed to perform processing on a wide variety of applications, including but not limited to linear and non-linear data conversion, filtering of film and/or sound data, modeling operations (eg, applying physical laws to determine Object position, velocity, and other properties), image development jobs (such as mosaic shaders, vertex shaders, geometry shaders, and/or pixel shader programs). The PPU 202 can transfer data from the system memory 104 and/or the partially parallel processing memory 204 to internal (on-wafer) memory, process the data, and write the resulting data back to the system memory 104 and/or locally parallel. Memory 204 is processed where the data is accessed by other system components, including CPU 102 or another parallel processing subsystem 112.

一PPU 202可具有任何數量的局部平行處理記憶體204，並不包括局部記憶體，並可用任何的組合來使用局部記憶體及系統記憶體。例如，一PPU 202可為在一統一記憶體架構(UMA,“Unified memory architecture”)具體實施例中的一圖形處理器。在這些具體實施例中，將可提供少數或沒有專屬的圖形(平行處理)記憶體，且PPU 202將專有地或大致專有地使用系統記憶體。在UMA具體實施例中，一PPU 202可被整合到一橋接器晶片中或處理器晶片中，或提供成具有一高速鏈路(例如PCI-EXPRESS)之一分離的晶片，其經由一橋接器晶片或其它通訊手段連接PPU 202到系統記憶體。 A PPU 202 can have any number of locally parallel processing memories 204, does not include local memory, and can use local memory and system memory in any combination. For example, a PPU 202 can be a graphics processor in a unified memory architecture (UMA, "Unified Memory Architecture") embodiment. In these specific embodiments, a few or no proprietary graphics (parallel processing) memory will be provided, and the PPU 202 will use system memory exclusively or substantially exclusively. In a UMA embodiment, a PPU 202 can be integrated into a bridge die or processor chip, or provided as a separate wafer having a high speed link (eg, PCI-EXPRESS) via a bridge A wafer or other means of communication connects the PPU 202 to the system memory.

如上所述，任何數目的PPU 202可以包括在一平行處理子系統112中。例如，多個PPU 202可提供在一單一嵌入卡上，或多個嵌入卡可被連接至通訊路徑113，或一或多個PPU 202 可被整合到一橋接器晶片中。在一多PPU系統中PPU 202可彼此相同或彼此不相同。例如，不同的PPU 202可具有不同數目的處理核心、不同數量的局部平行處理記憶體等等。當存在有多個PPU 202時，那些PPU可平行地作業而以高於一單一PPU 202所可能的流量來處理資料。加入有一或多個PPU 202之系統可實作成多種組態及型式因子，其中包括桌上型、膝上型、或掌上型個人電腦、伺服器、工作站、遊戲主機、嵌入式系統及類似者。 As noted above, any number of PPUs 202 can be included in a parallel processing subsystem 112. For example, multiple PPUs 202 may be provided on a single embedded card, or multiple embedded cards may be connected to communication path 113, or one or more PPUs 202 Can be integrated into a bridge wafer. The PPUs 202 may be identical to each other or different from one another in a multi-PPU system. For example, different PPUs 202 can have different numbers of processing cores, different numbers of locally parallel processing memories, and the like. When there are multiple PPUs 202, those PPUs can operate in parallel to process data at a higher traffic than is possible with a single PPU 202. Systems incorporating one or more PPUs 202 can be implemented in a variety of configurations and style factors, including desktop, laptop, or palm-sized personal computers, servers, workstations, game consoles, embedded systems, and the like.

多並行任務排程Multiple parallel task scheduling

多個處理任務可在GPC 208上並行地執行，且一處理任務於執行期間可以產生一或多個「子」(child)處理任務。任務/工作單元207接收該等任務，並動態地排程該等處理任務和子處理任務來由GPC 208執行。 Multiple processing tasks can be executed in parallel on GPC 208, and a processing task can generate one or more "child" processing tasks during execution. Task/work unit 207 receives the tasks and dynamically schedules the processing tasks and sub-processing tasks for execution by GPC 208.

第三A圖係根據本發明一具體實施例中第二圖之任務/工作單元207的方塊圖。任務/工作單元207包括一任務管理單元300和工作分配單元340。任務管理單元300基於執行優先性程度組織要被排程的任務。對於每一優先性程度，任務管理單元300儲存指向至對應於在排程器表321中該等任務的TMD 333之指標的一鏈接串列。TMD 322可被儲存在PP記憶體204或系統記憶體104中。任務管理單元300接受任務並儲存該等任務在排程器表321中的速率與任務管理單元300排程任務做執行的速率相脫離，使得任務管理單元300可基於優先性資訊或使用其它技術來排程任務。 The third A is a block diagram of the task/work unit 207 of the second diagram in accordance with an embodiment of the present invention. The task/work unit 207 includes a task management unit 300 and a work distribution unit 340. The task management unit 300 organizes tasks to be scheduled based on the degree of execution priority. For each level of priority, the task management unit 300 stores a list of links to metrics corresponding to the TMDs 333 of the tasks in the scheduler table 321 . The TMD 322 can be stored in the PP memory 204 or the system memory 104. The task management unit 300 accepts the tasks and stores the rates of the tasks in the scheduler table 321 out of the rate at which the task management unit 300 schedules tasks to perform, such that the task management unit 300 can be based on prioritization information or using other techniques. Scheduled tasks.

工作分配單元340包括一任務表345，其具有位置，而每一位置可由將要被執行的一任務之TMD 322佔用。任務管理單元300在當任務表345中有空的位置時即可排程任務來執行。當沒有空位置時，不會佔用一位置的一較高優先性的任務可以逐出佔用一空位的一較低優先性的任務。當一任務被逐出時，該任務即停止，且如果該任務的執行並未完成，該任務被加入到排程器表321中一鏈接串列。當一子處理任務被產生時，該子處理任務被加入到排程器表321中一鏈接串列。當該任務被逐出時，一項任務自一位置中移除。 The work distribution unit 340 includes a task table 345, which has locations, and each location can be occupied by a TMD 322 of a task to be executed. The task management unit 300 can execute the scheduled task when there is an empty position in the task table 345. When there is no empty location, a higher priority task that does not occupy a location can evict a lower priority task that occupies a vacancy. When a task is evicted, the task is stopped, and if the execution of the task is not completed, the task is added to A list of links in the scheduler table 321 . When a sub-processing task is generated, the sub-processing task is added to a link string in the scheduler table 321. When the task is evicted, a task is removed from a location.

任務處理概述Task processing overview

第三B圖為根據本發明一具體實施例中第二圖之該等PPU 202中之一者內一GPC 208的方塊圖。每個GPC 208可設置成平行地執行大量的執行緒，其中術語「執行緒」(thread)代表對於一特定組合的輸入資料執行的一特定程式之實例。在一些具體實施例中，使用單一指令、多重資料(SIMD,“Single-instruction,multiple-data”)指令發行技術來支援大量執行緒之平行執行，而不需要提供多個獨立指令單元。在其它具體實施例中，單一指令多重執行緒(SIMT,“Single-instruction,multiple-thread”)技術係用來支援大量概略同步化執行緒的平行執行，其使用一共用指令單元設置成發出指令到GPU 208之每一者內一組處理引擎。不像是一SIMD執行方式，其中所有處理引擎基本上執行相同的指令，SIMT的執行係允許不同的執行緒經由一給定執行緒程式而更可立即地遵循相異的執行路徑。本技術專業人士將可瞭解到一SIMD處理規範代表一SIMT處理規範的一功能子集合。 FIG. 3B is a block diagram of a GPC 208 within one of the PPUs 202 of the second diagram in accordance with an embodiment of the present invention. Each GPC 208 can be arranged to execute a large number of threads in parallel, wherein the term "thread" refers to an instance of a particular program that is executed for a particular combination of input material. In some embodiments, a single instruction, multiple data (SIMD, "Single-instruction, multiple-data") instruction issuance technique is used to support parallel execution of a large number of threads without the need to provide multiple independent instruction units. In other embodiments, SIMT ("Single-instruction" (multiple-thread)" technology is used to support parallel execution of a large number of roughly synchronized threads, which are set to issue instructions using a common instruction unit. A set of processing engines into each of GPUs 208. Unlike a SIMD execution mode in which all processing engines basically execute the same instructions, SIMT's execution allows different threads to follow the different execution paths more immediately via a given thread. Those skilled in the art will appreciate that a SIMD processing specification represents a functional subset of a SIMT processing specification.

GPC 208的作業較佳地是經由一管線管理員305控制，其可分配處理任務至串流多處理器(SM,“Streaming multiprocessor”)310。管線管理員305亦可設置成藉由指定SM 310之已處理資料輸出的目的地來控制一工作分配交叉開關330。 The operation of GPC 208 is preferably controlled via a pipeline administrator 305, which can assign processing tasks to a Streaming Multiprocessor (SM) 310. The pipeline manager 305 can also be arranged to control a work distribution crossbar 330 by specifying the destination of the processed data output of the SM 310.

在一具體實施例中，每個GPC 208包括M個數目的SM 310，其中M≧1，每個SMU 310設置成處理一或多個執行緒群組。同時，每個SM 310較佳地是包括可被管線化的一相同組合的功能單元，允許在一先前指令已經完成之前發出一新指令，其為本技術中已知。其可提供任何組合的功能單元。在一具體實施例中，該等功能單元支援多種運算，其中包括整數及浮點數算術(例如加法及乘法)，比較運算，布林運算(AND,OR,XOR)、位元偏位，及多種代數函數的運算(例如平面內插、三角函數、指數、及對數函數等)；及相同的功能單元硬體可被利用來執行不同的運算。 In one embodiment, each GPC 208 includes M number of SMs 310, where M ≧ 1, each SMU 310 is configured to process one or more thread groups. At the same time, each SM 310 is preferably a functional unit that includes an identical combination that can be pipelined, allowing a new instruction to be issued before a previous instruction has been completed, as is known in the art. It can provide any combination of functional units. In a In a specific embodiment, the functional units support a variety of operations, including integer and floating point arithmetic (eg, addition and multiplication), comparison operations, Boolean operations (AND, OR, XOR), bit offsets, and multiple algebras. Functions of functions (such as plane interpolation, trigonometric functions, exponents, and logarithmic functions, etc.); and the same functional unit hardware can be utilized to perform different operations.

傳送到一特定GPC 208之該等系列的指令構成一執行緒，如先前此處所定義者，橫跨一SM 310內該等平行處理引擎(未示出)並行地執行某個數目之執行緒的集合在此稱之為「包繞」(warp)或「執行緒群組」(thread group)。如此處所使用者，一「執行緒群組」代表並行地對於不同輸入資料執行相同程式的一執行緒的群組，該群組的每一執行緒被指定給一SM 310內的一不同處理引擎。一執行緒群組可包括比SM 310內處理引擎的數目要少的執行緒，其中當該執行緒群組正在被處理的循環期間一些處理引擎將為閒置。一執行緒群組亦可包括比SM 310內處理引擎之數目要更多的執行緒，其中處理將發生在連續的時脈循環之上。因為每個SM 310可並行地支援最多到G個執行緒群組，因此在任何給定時間在GPC 208中最高可執行G * M個執行緒群組。 The series of instructions transmitted to a particular GPC 208 constitute a thread that, as previously defined herein, executes a certain number of threads in parallel across the parallel processing engine (not shown) within an SM 310. The collection is referred to herein as "warp" or "thread group". As used herein, a "thread group" represents a group of threads executing the same program in parallel for different input materials, each thread of the group being assigned to a different processing engine within an SM 310. . A thread group may include fewer threads than the number of processing engines within the SM 310, where some processing engines will be idle while the thread group is being processed. A thread group can also include more threads than the number of processing engines in the SM 310, where processing will occur over successive clock cycles. Because each SM 310 can support up to G thread groups in parallel, up to G*M thread groups can be executed in GPC 208 at any given time.

此外，在相同時間於一SM 310內可以啟動複數相關的執行緒群組(在不同的執行階段)。此執行緒群組的集合在此處稱之為「協同執行緒陣列」(CTA,“Cooperative thread array”)或「執行緒陣列」(thread array)。一特定CTA之大小等於m*k，其中k為在一執行緒群組中並行地執行的執行緒之數目，其基本上為SM 310內平行處理引擎數目之整數倍數，而m為在SM 310內同時啟動的執行緒群組之數目。一CTA的大小概略由程式師及該CTA可使用之硬體資源(例如記憶體或暫存器)的數量所決定。 In addition, complex related thread groups (at different stages of execution) can be initiated within an SM 310 at the same time. The collection of this thread group is referred to herein as a "Cooperative thread array" (CTA, "thread array") or "thread array". The size of a particular CTA is equal to m*k, where k is the number of threads executing in parallel in a thread group, which is essentially an integer multiple of the number of parallel processing engines in SM 310, and m is at SM 310 The number of thread groups that are simultaneously started within. The size of a CTA is roughly determined by the number of hardware resources (such as memory or scratchpads) that the programmer and the CTA can use.

每一SM 310包含一階(L1)快取，或使用在SM 310外部一相對應L1快取中用於執行負載與儲存作業的空間。每個SM 310亦可存取到所有GPC 208之間共用的二階(L2)快取，並可用於在執行緒之間傳送資料。最後，SM 310亦可存取到晶片外的「通用」記憶體，其可包括例如平行處理記憶體204及/或系統記憶體104。應瞭解到在PPU 202外部的任何記憶體皆可做為通用記憶體。此外，一1.5階(L1.5)快取335可包括在GPC 208之內，設置成由SM 310要求經由記憶體介面214接收及保持自記憶體提取的資料，其中包括指令、一致性資料與常數資料，並提供該要求的資料至SM 310。在GPC 208中具有多個SM 310的具體實施例較佳地是共用被快取在L1.5快取335中的共通指令和資料。 Each SM 310 includes a first order (L1) cache or a space for performing load and store operations in a corresponding L1 cache external to the SM 310. Each SM The 310 also has access to a second-order (L2) cache shared between all GPCs 208 and can be used to transfer data between threads. Finally, the SM 310 can also access "universal" memory external to the wafer, which can include, for example, parallel processing memory 204 and/or system memory 104. It should be understood that any memory external to the PPU 202 can be used as a general purpose memory. In addition, a 1.5th order (L1.5) cache 335 can be included within the GPC 208, configured to receive and maintain data extracted from the memory via the memory interface 214, including instructions, consistency data, and Constant data and provide the required information to SM 310. A particular embodiment having multiple SMs 310 in GPC 208 is preferably to share common instructions and data that are cached in L1.5 cache 335.

每一GPC 208可包括一記憶體管理單元(MMU,“Memory management unit”)328，其設置成將虛擬位址映射到實體位置。在其它具體實施例中，MMU 328可存在於記憶體介面214內。MMU 328包括一組頁表項(PTE,“Page table entries”)，用於將一虛擬位置映射到一瓷磚的一實體位址，或是一快取線索引。MMU 328可包括位址轉譯旁看緩衝器(TLB,“Translation lookaside buffer”)，或是可以存在於多處理器SM 310或L1快取或GPC 208內的快取。該實體位址被處理來分配表面資料存取局部性而允許交錯在區隔單元當中的有效率要求。該快取線索引可用於決定一快取線的一要求是否為達成或錯失。 Each GPC 208 can include a Memory Management Unit (MMU) 328 that is configured to map virtual addresses to physical locations. In other embodiments, MMU 328 may be present within memory interface 214. The MMU 328 includes a set of page table entries (PTE, "Page table entries") for mapping a virtual location to a physical address of a tile, or a cache line index. MMU 328 may include a Translating Lookaside Buffer (TLB), or a cache that may exist within multiprocessor SM 310 or L1 cache or GPC 208. The physical address is processed to allocate surface data access locality to allow for efficient requirements interleaved in the segmentation unit. The cache line index can be used to determine whether a request for a cache line is fulfilled or missed.

在圖形和運算應用中，一GPC 208可設置成使得每個SM 310耦合於一紋路單元315，用於執行紋路映射作業，例如決定紋路樣本位置、讀取紋路資料及過濾該紋路資料。紋路資料自一內部紋路L1快取(未示出)讀取，或是在一些具體實施例中自SM 310內的L1快取讀取，且視需要自一L2快取、平行處理記憶體204或系統記憶體104提取。每一SM 310輸出已處理的任務至工作分配交叉開關330，藉以提供該已處理的任務至另一GPC 208進行進一步處理，或是將該已處理的任務經由交叉開關單元310儲存在一L2快取、平行處理記憶體204或系統記憶體104中。一preROP(預先掃描場化作業)325設置成自SM 310接收資料、導引資料到隔間單元215內的ROP單元、並進行色彩混合的最佳化、組織像素色彩資料、並執行位址轉譯。 In graphics and computing applications, a GPC 208 can be configured such that each SM 310 is coupled to a texture unit 315 for performing a texture mapping operation, such as determining a texture sample position, reading texture data, and filtering the texture data. The texture data is read from an internal texture L1 cache (not shown) or, in some embodiments, from the L1 cache in SM 310, and processed from a L2 cache, parallel processing memory 204 as needed. Or system memory 104 is extracted. Each SM 310 outputs the processed task to the work distribution crossbar 330 to provide the processed task to another GPC 208 for further processing, or to store the processed task via the crossbar unit 310 in an L2 fast. Take or parallelize memory 204 or system memory In body 104. A preROP (pre-scan fielding job) 325 is arranged to receive data from the SM 310, direct the data to the ROP unit in the compartment unit 215, optimize color mixing, organize pixel color data, and perform address translation. .

將可瞭解到此處所示的核心架構僅為例示性，其有可能有多種變化及修正。在一GPC 208內可包括任何數目的處理單元，例如SM 310或紋路單元315、preROP 325。再者，雖僅顯示一個GPC 208，一PPU 202可以包括任何數目的GPC 208，其較佳地是在功能上彼此類似，所以執行行為並不會根據是那一個GPC 208接收一特定處理任務而決定。再者，每個GPC 208較佳地是與其它GPC 208獨立地運作，其使用獨立及不同的處理單元、L1快取等等。 It will be appreciated that the core architecture shown herein is merely illustrative and that there are many variations and modifications possible. Any number of processing units, such as SM 310 or routing unit 315, preROP 325, may be included within a GPC 208. Moreover, although only one GPC 208 is shown, a PPU 202 can include any number of GPCs 208 that are preferably functionally similar to each other, so the execution behavior does not depend on which GPC 208 receives a particular processing task. Decide. Moreover, each GPC 208 preferably operates independently of other GPCs 208, using separate and distinct processing units, L1 caches, and the like.

本技術專業人士將可瞭解到在第一、二、三A和三B圖中所述之該架構並未以任何方式限制本發明之範圍，而此處所教示的技術可以實作在任何適當設置的處理單元上，其包括但不限於一或多個CPU、一或多個多核心CPU、一或多個PPU 202、一或多個GPC 208、一或多個圖形或特殊目的處理單元或類似者，其皆不背離本發明之範圍。 It will be appreciated by those skilled in the art that the architecture described in the first, second, third and third B diagrams does not limit the scope of the invention in any way, and the techniques taught herein can be implemented in any suitable setting. Processing unit, including but not limited to one or more CPUs, one or more multi-core CPUs, one or more PPUs 202, one or more GPCs 208, one or more graphics or special purpose processing units or the like They do not depart from the scope of the invention.

在本發明之具體實施例中，需要使用PPU 202或一運算系統的其它處理器來使用執行緒陣列執行一般性運算。在該執行緒陣列中每一執行緒被指定一唯一執行緒識別(thread ID)，其可在該執行緒的執行期間由該執行緒存取。可被定義成一維或多維度數值的執行緒ID控制該執行緒的處理行為之多種態樣。例如，一執行緒ID可用於決定一執行緒要做處理的是該輸入資料集的那一部份，及/或決定一執行緒要產生或寫入的是在一輸出資料集的那一部份。 In a particular embodiment of the invention, the PPU 202 or other processor of an arithmetic system is required to perform general operations using a thread array. Each thread in the thread array is assigned a unique thread ID that can be accessed by the thread during execution of the thread. The thread ID, which can be defined as a one-dimensional or multi-dimensional value, controls various aspects of the processing behavior of the thread. For example, a thread ID can be used to determine which portion of the input data set to be processed by a thread, and/or to determine which thread to generate or write in an output data set. Share.

每個執行緒指令的一序列可以包括至少一指令來定義該代表性執行緒和該執行緒陣列的一或多個其它執行緒之間的一協同行為。例如，每個執行緒的該指令序列可以包括一指令來在該序列中一特定點處中止該代表性執行緒之作業的執行，直到當該等其它執行緒中一或多者到達該特定點為止，該代表性執行緒的一指令係儲存資料在該等其它執行緒中一或多者可存取的一共用記憶體中，該代表性執行緒的一指令係基於它們的執行緒ID原子性地讀取和更新儲存在該等其它執行緒中一或多者可存取的一共用記憶體中的資料，或類似者。該CTA程式亦可包括一指令來運算資料在該共用記憶體中要被讀取的一位址，利用該位址為執行緒ID的函數。藉由定義適當的函數和提供同步化技術，資料可藉由一CTA的一執行緒被寫入到共用記憶體中一給定的位置，並以一可預測的方式由該相同CTA的一不同執行緒自該位置讀取。因此，即可支援可在執行緒當中共用任何需要的資料型式，且在一CTA中任何執行緒能夠與該相同CTA中任何其它執行緒共用資料。如果有的話，在一CTA的執行緒當中資料共用的程度係由該CTA程式決定；因此，應瞭解到在使用CTA的一特定應用中，根據該CTA程式，一CTA的該等執行緒可以或不需要實際地彼此共用資料，該等術語"CTA”和「執行緒陣列」在此處為同義地使用。 A sequence of each thread instruction can include at least one instruction to define a cooperative behavior between the representative thread and one or more other threads of the thread array. For example, the sequence of instructions for each thread may include an instruction to suspend execution of the representative thread at a particular point in the sequence. In a row, until one or more of the other threads arrive at the particular point, an instruction of the representative thread stores a shared memory that the data is accessible to one or more of the other threads. An instruction of the representative thread is based on their thread ID to atomically read and update data stored in a shared memory accessible by one or more of the other threads, or the like. By. The CTA program may also include an instruction to calculate an address of the data to be read in the shared memory, using the address as a function of the thread ID. By defining appropriate functions and providing synchronization techniques, data can be written to a given location in shared memory by a thread of a CTA, and by a different prediction of the same CTA in a predictable manner. The thread reads from this location. Therefore, it is possible to support sharing any required data patterns in the thread, and any thread in a CTA can share data with any other thread in the same CTA. If so, the degree of data sharing in a CTA's thread is determined by the CTA program; therefore, it should be understood that in a particular application using CTA, according to the CTA program, a CTA's threads can Or do not need to actually share material with each other, the terms "CTA" and "thread array" are used synonymously herein.

程式執行和優先權Program execution and priority

優先權可用於在多個不同應用程式之間時間分割一處理器，所以該等不同應用程式被序列化，且每一者在該處理器上執行一短的時間片。優先權亦可用於為了其它目的來卸載該目前正在執行的內容。例如，主控介面206可在當CPU 102啟始一通道佔先或一運行清單佔先時優先化一內容，其中一通道為指向至處理工作的指標之一集合，而一應用程式可包含一或多個通道。一通道佔先藉由清除在一通道隨機存取記憶體項目中一有效位元及寫入要被佔先的該通道之一通道識別至一佔先暫存器來進行。然後該經指定的通道自PPU 202卸載而同時離開主機和該引擎。 Priority can be used to time divide a processor between multiple different applications, so the different applications are serialized and each executes a short time slice on the processor. Priority can also be used to offload the content currently being executed for other purposes. For example, the master interface 206 can prioritize a content when the CPU 102 initiates a channel preemption or a run list preemption, wherein one channel is a set of indicators pointing to processing work, and an application can include one or more Channels. A channel preemption is performed by clearing a valid bit in a random access memory item and writing to a preemptive register of one of the channels to be preempted. The designated channel is then unloaded from the PPU 202 while leaving the host and the engine.

一運行清單佔先藉由寫入一指標至該運行清單暫存器來進行。該指標可指向至一新的運行清單，或可指向至目前啟用的該運行清單。運行清單佔先使得正在一PPU 202中運作的被卸載。然後主控介面206開始處理關聯於該指標的該運行清單上的第一個項目，並搜尋具有等待中工作的該第一有效的項目。在具有等待中工作的該運行清單上該第一通道被載入到PPU 202中。 A run list is preempted by writing an indicator to the run list register. This metric can point to a new run list or can point to the run list currently enabled. The run list preempts the unloading that is operating in a PPU 202. The master interface 206 then begins processing the first item on the run list associated with the indicator and searches for the first valid item with the pending job. The first channel is loaded into the PPU 202 on the run list with the pending work.

主控介面206亦可在當該內容用完方法(即程式)而另一內容正在等待執行時，於一時間片過期之前優先化正在執行的一內容。在一具體實施例中，該等時間片並非相等的時間長度，而是基於每一內容的方法串流，所以具有一密集方法串流的一內容相較於具有一鬆散方法串流的一不同內容被分配了一較大的時間片。主控介面206設置成指明給前端212何時主控介面206針對一正在執行內容不具有任何方法。但是，主控介面206針對該正在執行內容不會啟始一內容切換，直到分配給該內容的該時間片已經過期或是該處理管線為閒置且沒有任何方法為止。 The master interface 206 can also prioritize a content being executed before a time slice expires when the content is exhausted (ie, the program) and another content is waiting to be executed. In a specific embodiment, the time slices are not equal in length, but are based on a method stream of each content, so a content having a dense method stream is different from a stream having a loose method stream. The content was assigned a larger time slice. The master interface 206 is arranged to indicate to the front end 212 when the master interface 206 does not have any method for an ongoing content. However, the master interface 206 does not initiate a content switch for the content being executed until the time slice assigned to the content has expired or the processing pipeline is idle and there is no method.

第四圖係根據本發明一具體實施例中主控介面206和開始於任務/工作單元207通過GPC 208的該處理管線。該優先權程序具有由前端212控制的五個階段。一第一階段(階段1)停止在該目前內容中的該處理。針對CTA層級的優先權，此代表停止在一CTA任務邊界處的工作。針對指令層級優先權，此代表停止在一SM 310指令邊界處的工作。如果於啟始優先權之後和階段1期間發生一中斷或錯誤，前端212即在進行到階段2之前等待該等待中的中斷或錯誤被清除。 The fourth diagram is the processing pipeline of the master interface 206 and the task/work unit 207 through the GPC 208 in accordance with an embodiment of the present invention. The priority procedure has five phases controlled by the front end 212. A first phase (Phase 1) stops the processing in the current content. For the priority of the CTA level, this represents stopping work at the boundary of a CTA task. For instruction level priority, this represents stopping work at the boundary of an SM 310 instruction. If an interrupt or error occurs after the start priority and during phase 1, the front end 212 waits for the pending interrupt or the error is cleared before proceeding to stage 2.

一旦該內容並停止(及任何中斷或錯誤被清除)，階段2將該目前內容的狀態儲存在記憶體中。階段3在階段4載入一新內容的狀態到該機器上之前重置該引擎。階段5重新開始在先前的階段1中被佔先的任何工作之處理。當優先化一內容時，主控介面206自該運行清單選擇一新的內容來執行，及指示前端212開始內容優先權。前端212藉由完成該優先權程序的該等五個階段來設置該處理管線以執行該新內容。在該優先權程序的該等五個階段完成之後，前端212傳送一確認(ACK)至主控介面206。在一具體實施例中，一獨立的圖形處理管線(未示於第四圖)進行圖形特定的作業，而前端212亦等待該圖形處理管線成為閒置。基本上，該等圖形處理方法相較於運算處理方法係以較短的時間執行，所以等待該圖形處理管線成為閒置可於該處理管線完成該優先權程序的該第一階段時完成。同時，被維持在一圖形處理管線中的狀態資訊量基本上遠大於被維持在該(運算)處理管線中該內容狀態的量。等待該圖形處理管線成為閒置可顯著地減少補捉該內容狀態所需要的儲存量。 Once the content is stopped (and any interruptions or errors are cleared), Phase 2 stores the state of the current content in memory. Phase 3 resets the engine before loading the state of a new content on stage 4 to the machine. Phase 5 resumes the processing of any work that was preempted in the previous Phase 1. When prioritizing a content, the main The control interface 206 selects a new content from the run list to execute, and instructs the front end 212 to begin content priority. The front end 212 sets the processing pipeline to perform the new content by completing the five stages of the priority procedure. After the five phases of the priority procedure are completed, the front end 212 transmits an acknowledgment (ACK) to the master interface 206. In one embodiment, a separate graphics processing pipeline (not shown in the fourth diagram) performs graphics-specific operations, and the front end 212 also waits for the graphics processing pipeline to become idle. Basically, the graphics processing methods are executed in a shorter time than the arithmetic processing methods, so waiting for the graphics processing pipeline to become idle can be completed when the processing pipeline completes the first phase of the priority procedure. At the same time, the amount of status information maintained in a graphics processing pipeline is substantially greater than the amount of state that is maintained in the (processing) processing pipeline. Waiting for the graphics processing pipeline to become idle can significantly reduce the amount of storage required to capture the state of the content.

在進行優先權之前，要針對一特定內容儲存該CTA層級(及指令層級)內容狀態的一內容緩衝器由在CPU 102上執行的一程式所分配。被分配的該內容緩衝器之大小可以基於PPU 202組態和SM 310的數目而定。 Prior to prioritization, a content buffer that stores the CTA level (and instruction level) content state for a particular content is allocated by a program executing on CPU 102. The size of the content buffer that is allocated may be based on the number of PPU 202 configurations and the number of SMs 310.

為了完成該優先權程序的該第一階段，前端212停止自主控介面206接受新的方法，並輸出一佔先命令至任務/工作單元207。當該佔先命令被一處理單元接收時，該處理單元停止輸出工作至一下游單元。前端212等待所有下游單元來停止輸出工作，然後確定一內容凍結信號而成為該優先權程序的該第二階段。該內容凍結信號的確定可確保該處理管線基於用於儲存該內容狀態的該等交易而不會進行任何作業。前端212亦決定一正在被處理的等待閒置命令是否需要前端212等待該處理管線成為閒置，如果是，前端212即中斷該等待閒置作業，及儲存內容狀態資訊以指明一等待閒置命令正針對該內容在執行中。當該內容被恢復時，該等待閒置執行將由前端212重新開始。 To complete this first phase of the priority procedure, the front end 212 stops the master interface 206 from accepting the new method and outputs a preemptive command to the task/work unit 207. When the preemptive command is received by a processing unit, the processing unit stops outputting the operation to a downstream unit. The front end 212 waits for all downstream units to stop the output operation and then determines a content freeze signal to become the second stage of the priority procedure. The determination of the content freeze signal ensures that the processing pipeline does not perform any jobs based on the transactions for storing the content status. The front end 212 also determines whether a pending idle command being processed requires the front end 212 to wait for the processing pipeline to become idle. If so, the front end 212 interrupts the waiting for idle operation and stores content status information to indicate that a wait idle command is being directed to the content. In execution. When the content is restored, the wait for idle execution will be restarted by the front end 212.

當任務/工作單元207收到該佔先命令時，任務/工作單元207停止啟始新的工作。最後，任務/工作單元207決定該優先權程序的該等前兩個階段已經完成，並通知前端212該處理管線為閒置。然後前端212將會在重置該處理管線之前儲存被維持在任務/工作單元207內的該內容狀態，以完成該優先權程序的該第三階段。當使用指令層級優先權時，被維持在GPC 208內的該內容狀態由GPC 208它們本身儲存。當使用該CTA層級優先權時，GPU 208被排除，所以被儲存的內容狀態數量可減少。 When the task/work unit 207 receives the preemption command, the task/work unit 207 stops starting a new job. Finally, task/work unit 207 determines that the first two phases of the priority procedure have been completed and notifies front end 212 that the processing pipeline is idle. The front end 212 will then store the content state maintained within the task/work unit 207 prior to resetting the processing pipeline to complete the third phase of the priority procedure. When the instruction level priority is used, the content status maintained within GPC 208 is stored by GPC 208 itself. When the CTA level priority is used, GPU 208 is excluded, so the number of stored content states can be reduced.

甚至在任務/工作單元207停止啟始工作之後，任務/工作單元207可以在先前指令的執行期間接收可能由GPC 208產生的額外工作。任務/工作單元207緩衝化要由前端212儲存的該額外工作做為任務/工作單元207的該內容狀態之一部份。 Even after the task/work unit 207 stops starting work, the task/work unit 207 can receive additional work that may be generated by the GPC 208 during execution of the previous instruction. The task/work unit 207 buffers the additional work to be stored by the front end 212 as part of the content state of the task/work unit 207.

當收到該佔先命令時，工作分配單元340停止啟始CTA。當進行CTA層級優先權時，在工作分配單元340下游的該處理管線中該等處理單元，例如GPC 208，即被排除，所以沒有內容狀態留在那些下游處理單元中。因此，相較於指令層級優先權，當進行CTA層級優先權時，即可減少內容狀態的數量，其係因為指令層級優先權並不需要排除該等下游處理單元。 When the preemption command is received, the work distribution unit 340 stops starting the CTA. When the CTA level priority is made, the processing units, such as GPC 208, are excluded from the processing pipeline downstream of the work distribution unit 340, so no content status remains in those downstream processing units. Therefore, the number of content states can be reduced when the CTA level priority is made compared to the instruction level priority, which is because the instruction level priority does not need to exclude the downstream processing units.

工作分配單元340基於由任務管理單元300產生的資訊決定那一個GPC 208將要執行收到的工作。因為GPC 208被管線化時，一單一GPC 208可並行地執行多個任務。任務管理單元300排程每一處理任務做為一網格或佇列來執行。工作分配單元340將每一CTA關聯於一特定網格或佇列用於一或多個任務的並行執行。屬於一網格的CTA具有隱式x,y,z參數以指明該網格內個別CTA的位置。工作分配單元340追蹤可使用的GPC 208，並當GPC 208可使用時啟始該等CTA。 The work distribution unit 340 determines which one of the GPCs 208 is about to perform the received work based on the information generated by the task management unit 300. Because GPC 208 is pipelined, a single GPC 208 can perform multiple tasks in parallel. The task management unit 300 schedules each processing task to be executed as a grid or queue. Work assignment unit 340 associates each CTA with a particular grid or queue for parallel execution of one or more tasks. CTAs belonging to a grid have implicit x, y, z parameters to indicate the location of individual CTAs within the grid. Work distribution unit 340 tracks the GPCs 208 that are available and initiates the CTAs when GPC 208 is available.

在指令層級優先權期間，工作分配單元340傳送該佔先命令至GPC 208中的管線管理員305。管線管理員305可包括每一 SM 310的一控制器。在收到該佔先命令時，SM 310停止發出指令，並進入一補捉處理器。SM 310亦等待關聯於先前發出的指令之所有記憶體交易來完成，即針對所有待決的記憶體要求來完成。當一讀取要求尚未返回時及當對於用於一確認被明確地要求的一寫入要求尚未自MMU 328收到一確認時，記憶體要求被視為待決中。管線管理員305維持關於CTA及執行緒群組的資訊，並追蹤每一CTA是那些執行緒群組被佔先。 During the instruction level priority, the work distribution unit 340 transmits the preemption command to the pipeline manager 305 in the GPC 208. Pipeline manager 305 can include each A controller of the SM 310. Upon receipt of the preemptive command, the SM 310 stops issuing instructions and enters a capture processor. SM 310 also waits for all memory transactions associated with previously issued instructions to complete, ie, for all pending memory requests. A memory request is considered pending when a read request has not been returned and when a write request explicitly requested for an acknowledgment has not been received from the MMU 328. The pipeline manager 305 maintains information about the CTA and the thread group and tracks which CTAs are preempted by those thread groups.

一旦在GPC 208中的SM 310已經停止發出指令且每一SM 310成為閒置之後，該補捉處理器卸載在GPC 208上運作的該等CTA之該內容狀態，且一或多個該補捉處理器、管線管理員305及前端312的一組合儲存該內容狀態。被卸載及儲存的該內容狀態包括SM 310內的暫存器、管線管理員305內的暫存器、GPC 208內的暫存器、共用的記憶體等等，其皆被儲存到圖形記憶體中的一預先定義的緩衝器。同時，由GPC 208內的該等快取(例如L1.5快取335)寫入到記憶體者皆被迫離開到記憶體，且該等快取被失效。一旦所有該內容狀態已經被卸載和儲存，該補捉處理器將離開所有啟動的執行緒，藉此閒置SM 310和GPC 208。 Once the SM 310 in the GPC 208 has stopped issuing instructions and each SM 310 becomes idle, the capture processor unloads the content state of the CTAs operating on the GPC 208, and one or more of the capture processes A combination of the pipeline manager 305 and the front end 312 stores the content status. The status of the content that is unloaded and stored includes a scratchpad in the SM 310, a scratchpad in the pipeline manager 305, a scratchpad in the GPC 208, a shared memory, etc., all of which are stored in the graphics memory. A pre-defined buffer in . At the same time, those caches written by the cache (eg, L1.5 cache 335) within the GPC 208 are forced to leave the memory and the caches are disabled. Once all of the content status has been unloaded and stored, the capture processor will leave all initiated threads, thereby vacating SM 310 and GPC 208.

然後該補捉處理器控制來自SM 310的一信號至管線管理員305，以指明該優先權程序的該等前兩個階段已經由GPC 208完成，及GPC 208為閒置。管線管理員305回報給工作分配單元340，確認(ACK)該佔先命令來指明該優先權程序的該等前兩個階段已經完成。此ACK被傳送到工作分配單元340的上游至任務管理單元300，及最終傳送上到前端212。 The capture processor then controls a signal from the SM 310 to the pipeline manager 305 to indicate that the first two phases of the priority procedure have been completed by the GPC 208 and that the GPC 208 is idle. The pipeline manager 305 reports to the work distribution unit 340, acknowledging (ACK) the preemption command to indicate that the first two phases of the priority procedure have been completed. This ACK is transmitted upstream of the work distribution unit 340 to the task management unit 300, and finally to the front end 212.

管線管理員305在當該佔先命令由工作分配單元340輸出時，針對GPC 208內正在執行的每一執行緒群組保持狀態資訊。該狀態資訊指明一執行緒群組是否在完成執行之後離開，或是是否該執行緒群組被佔先。該狀態資訊由管線管理員305儲存，且可由管線管理員305使用來僅恢復被佔先的那些執行緒群組。當在一執行緒群組中所有該等執行緒在管線管理員305收到該佔先命令之後且在該補捉處理器被進入來儲存該狀態資訊之前離開時，狀態資訊並未針對該執行緒群組儲存，且該執行緒群組並未被恢復。在該等GPC 208閒置之後，該等GPC可被重置來完成該優先權程序的該第三階段。 The pipeline manager 305 maintains status information for each thread group being executed within the GPC 208 when the preemption command is output by the work distribution unit 340. The status information indicates whether a thread group has left after completion of execution, or whether the thread group is preempted. This status information is stored by the pipeline manager 305 and can be used by the pipeline manager 305 to restore only those that are preempted. Group. When all of the threads in a thread group are removed after the pipeline administrator 305 receives the preemption command and before the trapping processor is entered to store the status information, the status information is not directed to the thread. The group is saved and the thread group is not restored. After the GPCs 208 are idle, the GPCs can be reset to complete the third phase of the priority procedure.

然後前端212藉由寫出由前端212維持的該內容狀態來完成該優先權程序的該第二階段。前端212儲存所有暫存器，且隨機記憶體串鏈出到該經佔先內容的該內狀態緩衝器。為了完成該優先權程序的該第三階段，前端212確定由該處理管線收到的一內容重置信號，例如任務/工作單元207及GPC 208。 The front end 212 then completes the second phase of the priority procedure by writing the content state maintained by the front end 212. The front end 212 stores all of the scratchpads, and the random memory string is chained out to the internal state buffer of the preempted content. To complete this third phase of the priority procedure, the front end 212 determines a content reset signal, such as task/work unit 207 and GPC 208, received by the processing pipeline.

當一內容被選擇來執行時，主控介面206需要決定該經選擇的內容是否為先前被佔先的一內容。指明一內容是否被佔先的一內容重新載入(ctx_reload)旗標由主控介面206維持。當主控介面206認定該經選擇的內容被佔先時，該先前被卸載和儲存的內容狀態於該經選擇的內容之執行恢復之前被重新載入。已經被佔先的一內容將會被重新載入，即使當針對該經選擇的內容已沒有方法時，因為在該等方法的執行期間可能有由SM 310產生的工作，並儲存成該內容狀態的一部份。 When a content is selected for execution, the master interface 206 needs to determine if the selected content is a previously pre-empted content. A content reload (ctx_reload) flag indicating whether a content is preempted is maintained by the master interface 206. When the master interface 206 determines that the selected content is preempted, the previously unloaded and stored content state is reloaded prior to execution recovery of the selected content. A content that has been preempted will be reloaded even when there is no way for the selected content, because there may be work generated by the SM 310 during execution of the methods and stored in the content state. a part.

當主控介面206啟始該優先權時，前端212發信給主控介面206是否該內容為閒置。如果該內容為閒置，即該處理管線為閒置且並無待決的記憶體要求，該經佔先的內容即不需要在該內容的執行恢復之前被重新載入。如果該內容並未閒置，主控介面206在當該通道被重新載入時儲存要被處理的該內容重新載入狀態。 When the master interface 206 initiates the priority, the front end 212 sends a message to the master interface 206 whether the content is idle. If the content is idle, that is, the processing pipeline is idle and has no pending memory requirements, the preempted content does not need to be reloaded before the execution of the content is restored. If the content is not idle, the master interface 206 stores the content reload status to be processed when the channel is reloaded.

亦有案例為當前端212自主控介面206接收該佔先命令時，該處理管線已經為閒置。當該處理管線已經為閒置時，前端212並不傳送一佔先命令至任務/工作單元207，而是繼續該優先權程序的該第二階段。因此，任務/工作單元207和GPC 208的該閒置狀態必須使得那些單元來接收一新的內容狀態或恢復一內容狀態。例如，任務/工作單元207必須在一狀態中使得並無任務正在運作。管線管理員305必須僅恢復經佔先的執行緒群組或CTA，且必須不恢復離開的執行緒群組。 There are also cases where the processing pipeline is already idle when the current terminal 212 master interface 206 receives the preemption command. When the processing pipeline is already idle, the front end 212 does not transmit a preemptive command to the task/work unit 207, but continues the second phase of the priority procedure. Therefore, the idle state of task/work unit 207 and GPC 208 must cause those units to receive a new content state or restore Repeat the content status. For example, task/work unit 207 must be in a state such that no tasks are running. The pipeline manager 305 must only recover the preempted thread group or CTA and must not resume the leaving thread group.

當前端212完成該優先權程序的該第四階段時，該經選擇的內容狀態由一內容緩衝器讀取，且被載入到該等暫存器和隨機處理記憶體串鏈當中。該內容凍結信號自該第二階段的開始時由前端212確認，直到該優先權程序的該第四階段結束為止。該內容凍結信號的確定可確保該處理管線基於由前端212所使用的該等交易而不會進行任何作業來儲存和恢復該內容狀態。 When the front end 212 completes the fourth phase of the priority program, the selected content state is read by a content buffer and loaded into the scratchpad and random processing memory chain. The content freeze signal is acknowledged by the front end 212 from the beginning of the second phase until the end of the fourth phase of the priority procedure. The determination of the content freeze signal ensures that the processing pipeline does not perform any jobs to store and restore the content state based on the transactions used by the front end 212.

前端212藉由輸出一佔先恢復命令至任務/工作單元207來啟始該優先權程序的該第五階段(階段5)。在任務/工作單元207接收該佔先恢復命令之後，任務/工作單元207並不確定一預備好信號至前端212，所以並無新的工作可由前端212傳送至任務/工作單元207，直到該優先權程序完成為止。任務/工作單元207內的工作分配單元340分別接收該佔先恢復命命，且恢復該經選擇的內容狀態、重播該等經恢復的任務到GPC 208中、及恢復經佔先的CTA及執行緒群組回到管線管理員305及SM 310當中。 The front end 212 initiates the fifth phase (stage 5) of the priority program by outputting a preemptive resume command to the task/work unit 207. After the task/work unit 207 receives the preemptive resume command, the task/work unit 207 does not determine a ready signal to the front end 212, so no new work can be transferred from the front end 212 to the task/work unit 207 until the priority The program is completed. The work distribution unit 340 in the task/work unit 207 receives the preemptive life, respectively, and restores the selected content state, replays the resumed tasks into the GPC 208, and restores the preempted CTA and the thread group. The group is returned to the pipeline manager 305 and the SM 310.

例如，一管線管理員305輸出該佔先恢復命令來設置一個別的SM 310來進入「優先權-恢復-開始」模式。然後，管線管理員305傳送該等經佔先的CTA和執行緒群組至SM 310。在管線管理員305已經恢復所有佔先的執行緒群組之後，管線管理員305輸出一命令至SM 310以指明必須離開該「優先權-恢復-開始」模式。當使用該CTA層級優先權時，GPC 308並不具有任何儲存的內容狀態來重新載入，以及沒有任何執行緒群組狀態來恢復。 For example, a pipeline manager 305 outputs the preemptive resume command to set up another SM 310 to enter the "priority-recovery-start" mode. The pipeline manager 305 then transmits the preemptive CTAs and thread groups to the SM 310. After the pipeline manager 305 has restored all of the preempted thread groups, the pipeline manager 305 outputs a command to the SM 310 to indicate that the "priority-recovery-start" mode must be left. When using this CTA level priority, GPC 308 does not have any stored content status to reload, and does not have any thread group status to recover.

當使用指令層級優先權來恢復一經選擇的內容時，GPC 308自一內容緩衝器讀取該經選擇的內容之內容狀態，並載入該等暫存器和共用記憶體。管線管理員305藉由以該等CTA被回報被佔先的順序傳送該等CTA至每一CTA正在其上執行的個別SM 310來重新開始被佔先的所有該等CTA。此技術可確保每一CTA在一SM 310中以當該內容被佔先時該CTA佔用的相同實體CTA位置上被啟始。執行緒群組以相同的實體執行緒群組ID被啟始。在優先權之後在該相同位置上重新開始該等執行緒群組的好處係因為該等執行緒群組和CTA可保證不會超過該記憶體及在個別SM 310中可使用的其它資源。每一SM 310針對每一執行緒群組及類似者恢復暫存器數值、一程式運數器、堆疊指標、有效遮罩。 When the instruction level priority is used to recover a selected content, the GPC 308 reads the content status of the selected content from a content buffer and loads The registers and shared memory. The pipeline manager 305 restarts all of the preempted CTAs by transmitting the CTAs to the individual SMs 310 on which each CTA is executing in the order in which the CTAs are reported to be preempted. This technique ensures that each CTA is initiated in an SM 310 at the same physical CTA location occupied by the CTA when the content is preempted. The thread group is started with the same entity thread group ID. The benefit of restarting the thread group at the same location after the priority is because the thread group and CTA can guarantee that the memory and other resources available in the individual SM 310 will not be exceeded. Each SM 310 restores a scratchpad value, a program number, a stacking indicator, a valid mask for each thread group and the like.

最後，前端212確認(ACK)該原始佔先命令至主控介面206。該ACK指明該優先權程序已完成，且該經選擇的內容之執行已經啟始。任何先前被佔先的CTA在任務/工作單元207和GPC 208中已經恢復執行。當使用指令層級優先權時，任何先前被佔先的執行緒已經恢復在SM 310上執行。主控介面206現在開始傳送新的工作到該圖形管線當中。 Finally, the front end 212 acknowledges (ACK) the original preemption command to the master interface 206. The ACK indicates that the priority procedure has been completed and execution of the selected content has begun. Any previously preempted CTA has resumed execution in task/work unit 207 and GPC 208. When the instruction level priority is used, any previously preempted threads have resumed execution on the SM 310. The master interface 206 now begins to transfer new work to the graphics pipeline.

在一具體實施例中，前端212在輸出該佔先恢復命令到任務/工作單元207之後確認(ACK)該原始佔先命令，且任務/工作單元207緩衝化在該佔先恢復命令之後收到的任何新工作直到完成階段5為止。任務/工作單元207並不啟始任何新的(未恢復的)CTA，直到完成該優先權程序為止。因此前端212並不知道何時完成該第五階段。如果任務/工作單元207無法緩衝化所有的新工作，任務/工作單元207否定該預備好信號到前端212。但是，前端212無法區別該預備好信號是在該優先權程序期間或完成之後被否定。 In a specific embodiment, the front end 212 acknowledges (ACK) the original preemption command after outputting the preemptive resume command to the task/work unit 207, and the task/work unit 207 buffers any new received after the preemptive resume command. Work until the completion of stage 5. Task/work unit 207 does not initiate any new (unrecovered) CTA until the priority procedure is completed. Therefore, the front end 212 does not know when to complete the fifth stage. If task/work unit 207 is unable to buffer all new work, task/work unit 207 negates the ready signal to front end 212. However, the front end 212 cannot distinguish that the ready signal is negated during or after the priority procedure.

第五A圖例示根據本發明一具體實施例中當一程序被佔先時用於卸載內容狀態的一種卸載方法500。雖然該等方法步驟係配合第一、二、三A、三B和四圖之該等系統做說明，本技術專業人士將可瞭解到設置成以任何順序執行該等方法步驟的任何系統皆在本發明之範圍內。 Figure 5A illustrates an unloading method 500 for offloading a content state when a program is preempted in accordance with an embodiment of the present invention. Although the method steps are described in conjunction with the systems of the first, second, third, third, and fourth figures, Those skilled in the art will appreciate that any system configured to perform the method steps in any order is within the scope of the present invention.

在步驟505，主控介面206輸出一指令層級佔先命令至前端212，並啟始該目前內容的卸載。在步驟510，前端212決定該處理管線是否為閒置，如果是，前端212直接進行到步驟545來儲存由前端212所維持的該內容狀態。 At step 505, the master interface 206 outputs an instruction level preemption command to the front end 212 and initiates the offloading of the current content. At step 510, the front end 212 determines if the processing pipeline is idle, and if so, the front end 212 proceeds directly to step 545 to store the content state maintained by the front end 212.

如果在步驟510前端212決定該處理管線並非閒置，則在步驟515前端212停止啟始針對該目前內容的新工作。在步驟520，前端212輸出一佔先命令至任務/工作單元207。在步驟525，任務/工作單元207內任務管理單元300停止發出任務至工作分配單元340，並輸出該佔先命令至工作分配單元340。在步驟525，工作分配單元340亦停止啟始CTA，並輸出該佔先命令至管線管理員305。管線管理員305輸出該指令層級佔先命令至SM 310。 If the front end 212 determines at step 510 that the processing pipeline is not idle, then at step 515 the front end 212 stops starting a new job for the current content. At step 520, the front end 212 outputs a preemptive command to the task/work unit 207. At step 525, task management unit 300 within task/work unit 207 stops issuing tasks to work distribution unit 340 and outputs the preemption command to work distribution unit 340. At step 525, the work distribution unit 340 also stops starting the CTA and outputs the preemption command to the pipeline manager 305. The pipeline manager 305 outputs the command level preemption command to the SM 310.

在步驟525，SM 310停止執行指令，且在步驟530 SM 310等待任何待決的記憶體交易來完成。每一SM 310重複步驟530直到所有該等記憶體交易完成。SM 310指明給管線管理員305每一執行緒群組是否離開或被佔先。當所有該等待決的記憶體交易完成時，在步驟535，在SM 310中維持的該內容狀態被儲存到一內容緩衝器中，且被維持在管線管理員305中的該內容狀態亦儲存到該內容緩衝器中。 At step 525, SM 310 stops executing the instruction, and at step 530 SM 310 waits for any pending memory transactions to complete. Step 530 is repeated for each SM 310 until all of the memory transactions are completed. SM 310 indicates to pipeline manager 305 whether each thread group is left or preempted. When all of the pending memory transactions are completed, in step 535, the content state maintained in SM 310 is stored in a content buffer, and the content state maintained in pipeline manager 305 is also stored to The content buffer.

在步驟540，管線管理員305回報給工作分配單元340該處理管線的該指令層級部份(例如SM 310和GPC 208)為閒置，然後工作分配單元340針對該目前內容儲存被維持在工作分配單元340中的該CTA層級狀態。工作分配單元340回報給任務管理單元300其已經完成此階段的優先權。然後任務管理單元30儲存被維持在任務管理單元300中的該任務層級狀態。任務管理單元300回報給前端212何時該目前狀態已經被儲存，及在步驟545前端212儲存由前端212針對該目前內容所維持的該內容狀態到該內容緩衝器。在步驟550，前端212即儲存該經儲存的內容狀態係針對一被佔先的內容之一指示，並重置該處理管線。 At step 540, the pipeline manager 305 reports to the work distribution unit 340 that the instruction level portion of the processing pipeline (e.g., SM 310 and GPC 208) is idle, and then the work distribution unit 340 is maintained in the work distribution unit for the current content store. The CTA level state in 340. The work distribution unit 340 reports to the task management unit 300 that it has completed the priority of this phase. The task management unit 30 then stores the task hierarchy state maintained in the task management unit 300. The task management unit 300 reports back to the front end 212 when the current state has been stored, and at step 545 the front end 212 stores the content maintained by the front end 212 for the current content. Status to the content buffer. At step 550, the front end 212 stores the stored content status for one of the preempted content indications and resets the processing pipeline.

第五B圖例示根據本發明一具體實施例中當一在該指令層級處被佔先的程序被恢復時用於恢復內容狀態的一種恢復方法560。雖然該等方法步驟係配合第一、二、三A、三B和四圖之該等系統做說明，本技術專業人士將可瞭解到設置成以任何順序執行該等方法步驟的任何系統皆在本發明之範圍內。 FIG. 5B illustrates a recovery method 560 for restoring a state of content when a program preempted at the instruction level is restored in accordance with an embodiment of the present invention. Although the method steps are described in conjunction with the systems of the first, second, third, third, and fourth figures, those skilled in the art will appreciate that any system configured to perform the method steps in any order is Within the scope of the invention.

在步驟565，前端212啟始針對由主控介面206所選擇的一內容之一被儲存內容的恢復。在步驟570，前端212確定該內容凍結信號來確保該處理管線不會基於由前端212所使用的該等交易來進行任何作業以恢復該內容狀態。在步驟575，該經選擇的內容狀態由前端212和任務/工作單元207自一內容緩衝器讀取，並於該任務和CTA層級處被恢復。 At step 565, the front end 212 initiates recovery of the stored content for one of the content selected by the host interface 206. At step 570, the front end 212 determines the content freeze signal to ensure that the processing pipeline does not perform any jobs based on the transactions used by the front end 212 to restore the content state. At step 575, the selected content state is read from a content buffer by the front end 212 and the task/work unit 207 and restored at the task and CTA level.

在步驟580，每一管線管理員305向下輸出一命令來設置個別的SM 310進入「佔先-恢復-開始」模式，藉此設置SM 310進入一暫停狀態。在步驟580，管線管理員305傳送經佔先的CTA和執行緒群組到SM 310，及GPC 208針對在步驟535被儲存的該經選擇的內容來恢復被維持在SM 310中的該指令層及內容狀態(參見第五A圖)。在恢復該CTA和指令層級狀態之後，管線管理員305輸出一命令至個別的SM 310以指明必須離開該「佔先-恢復-開始」模式，及在步驟582前端212否定該內容凍結信號。步驟580和528可以同時進行。在步驟585，該等CTA以該被佔先的順序來啟始，而在步驟590使用該經選擇內容的該已恢復的內容狀態來恢復執行。在步驟590，前端212亦確認(ACK)給主控介面206該指令層級佔先命令已經完成執行的信號。現在主控介面206可以開始由該推入緩衝器傳送更多的工作至前端212。在一具體實施例中，任務/工作單元207確定和否定該內容凍結，而在該內容凍結於步驟570中被確定之後進行步驟590(由前端212)。任務/工作單元緩衝化來自該推入緩衝器的該新的工作，直到該指令層級佔先命令已經完成執行為止。該新的工作直到該等CTA於步驟585被啟始之後才由該任務/工作單元輸出。 At step 580, each pipeline manager 305 outputs a command to set the individual SM 310 to enter the "preemptive-recover-start" mode, thereby setting the SM 310 to a suspend state. At step 580, the pipeline manager 305 transmits the preempted CTA and thread group to the SM 310, and the GPC 208 recovers the command layer maintained in the SM 310 for the selected content stored at step 535 and Content status (see Figure 5A). After recovering the CTA and instruction level status, the pipeline manager 305 outputs a command to the individual SM 310 to indicate that the "preemption-recover-start" mode must be left, and the front end 212 denies the content freeze signal at step 582. Steps 580 and 528 can be performed simultaneously. At step 585, the CTAs are initiated in the preempted order, and at step 590, the resumed content state of the selected content is used to resume execution. At step 590, the front end 212 also acknowledges (ACK) the signal to the master interface 206 that the instruction level preemption command has completed execution. The master interface 206 can now begin to transfer more work to the front end 212 by the push-in buffer. In a specific embodiment, task/work unit 207 determines and denies the content freeze, and is determined in the content freeze in step 570. Step 590 is then performed (by front end 212). The task/work unit buffers the new work from the push-in buffer until the instruction level preemption command has completed execution. This new work is not output by the task/work unit until the CTAs are initiated at step 585.

如前所述，被儲存和恢復的該內容狀態可用可能較長的潛時為代價來減少，以用於在該CTA層級佔先而非在該指令層級佔先來停止該運行中內容。當一內容於該CTA層級被佔先時，SM 310完成任何已啟使的CTA之執行，所以需要被儲存的CTA狀態並未維持在管線管理員305和GPC 208之內。但是，啟始至少一額外的CTA來完成該任務之執行所需要的任務層級狀態針對該經佔先的內容而儲存。 As previously mentioned, the state of the content being stored and restored may be reduced at the expense of potentially longer latency for preempting at the CTA level rather than preempting the in-run content at the instruction level. When a content is preempted at the CTA level, SM 310 completes execution of any enabled CTA, so the CTA state that needs to be stored is not maintained within pipeline manager 305 and GPC 208. However, the task level state required to initiate at least one additional CTA to perform the execution of the task is stored for the preempted content.

在一具體實施例中，該內容於該任務層級處被佔先，而任務/工作單元207完成任何具有至少一CTA被啟始的任務之執行，所以任務狀態不需要被儲存。於該任務層級處的佔先可能需要啟始一或多個額外的CTA來在該前端狀態被儲存之前完成該任務的執行。當進行任務層級佔先時，針對任務或CTA皆不儲存狀態。 In a specific embodiment, the content is preempted at the task level, and the task/work unit 207 performs any execution of the task with at least one CTA initiated, so the task status does not need to be stored. Preemption at the task level may require initiation of one or more additional CTAs to complete execution of the task before the front end state is stored. When the task level is preempted, no state is stored for the task or CTA.

第六A圖例示根據本發明一具體實施例中當一程序於一CTA層級處被佔先時用於卸載內容狀態的一種卸載方法600。雖然該等方法步驟係配合第一、二、三A、三B和四圖之該等系統做說明，本技術專業人士將可瞭解到設置成以任何順序執行該等方法步驟的任何系統皆在本發明之範圍內。 Figure 6A illustrates an unloading method 600 for offloading a content state when a program is preempted at a CTA level in accordance with an embodiment of the present invention. Although the method steps are described in conjunction with the systems of the first, second, third, third, and fourth figures, those skilled in the art will appreciate that any system configured to perform the method steps in any order is Within the scope of the invention.

在步驟605，主控介面206輸出一CTA層級佔先命令至前端212，並啟始該目前內容的卸載。在步驟610，前端212決定該處理管線是否為閒置，如果是，前端212直接進行到步驟645來儲存由前端212所維持的該內容狀態。 At step 605, the master interface 206 outputs a CTA level preemption command to the front end 212 and initiates the offloading of the current content. At step 610, the front end 212 determines if the processing pipeline is idle, and if so, the front end 212 proceeds directly to step 645 to store the content state maintained by the front end 212.

如果在步驟610前端212決定該處理管線並非閒置，則在步驟615前端212停止啟始針對該目前內容的新工作。在步驟620，前端212輸出一佔先命令至任務/工作單元207。在步驟 625，任務/工作單元207內任務管理單元300停止發出任務至工作分配單元340，並輸出該佔先命令至工作分配單元340。工作分配單元340停止啟始CTA，且在步驟630工作分配單元340等帶GPC 208成為閒置。 If the front end 212 determines at step 610 that the processing pipeline is not idle, then at step 615 the front end 212 stops starting a new job for the current content. At step 620, front end 212 outputs a preemptive command to task/work unit 207. In the steps 625. The task management unit 300 in the task/work unit 207 stops issuing the task to the work distribution unit 340, and outputs the preemption command to the work distribution unit 340. The work distribution unit 340 stops starting the CTA, and at step 630, the work distribution unit 340 or the like with the GPC 208 becomes idle.

如果在步驟630工作分配單元340決定GPC 208並非閒置，則在步驟635工作分配單元340決定一計時器是否逾時。該計時器限制工作分配單元340將等待該等GPC成為閒置的時脈循環數目。該等時脈循環的數目可為一程式化的數值，而在一具體實施例中，當超過該數值時，工作分配單元340於該指令層級而非在該CTA層級處執行優先權。如果在步驟635工作分配單元340決定該計時器尚未逾時，則工作分配單元340回到步驟630。否則，當該計時器已經逾時時，則工作分配單元340進行到第五A圖的步驟520來於該指令層級處執行優先權。 If the work distribution unit 340 determines at step 630 that the GPC 208 is not idle, then at step 635 the work distribution unit 340 determines if a timer has expired. The timer limits the number of clock cycles that the work distribution unit 340 will wait for the GPCs to become idle. The number of such clock cycles may be a stylized value, and in a particular embodiment, when the value is exceeded, the work distribution unit 340 performs priority at the instruction level rather than at the CTA level. If the work distribution unit 340 determines in step 635 that the timer has not expired, the work distribution unit 340 returns to step 630. Otherwise, when the timer has expired, the work distribution unit 340 proceeds to step 520 of the fifth A diagram to perform the priority at the instruction level.

當在步驟630該等GPC為閒置時，在步驟640工作分配單元340針對該目前內容儲存被維持在工作分配單元340中的該CTA層級狀態。工作分配單元340回報給任務管理單員300該目前狀態已被儲存。然後任務管理單元300儲存被維持在任務管理單元300中的該任務層級狀態。任務管理單元300回報給前端212何時該目前狀態已經被儲存，及在步驟645前端212儲存由前端212針對該目前內容所維持的該內容狀態到該內容緩衝器。在步驟650，前端212即儲存該經儲存的內容狀態係針對一被佔先的內容之一指示，並重置該處理管線。 When the GPCs are idle at step 630, the work distribution unit 340 stores the CTA level status maintained in the work distribution unit 340 for the current content at step 640. The work distribution unit 340 reports back to the task management member 300 that the current state has been stored. The task management unit 300 then stores the task hierarchy state maintained in the task management unit 300. The task management unit 300 reports back to the front end 212 when the current state has been stored, and at step 645 the front end 212 stores the content state maintained by the front end 212 for the current content to the content buffer. At step 650, the front end 212 stores the stored content status for one of the preempted content indications and resets the processing pipeline.

第六B圖例示根據本發明一具體實施例中當一在該CTA層級處被佔先的程序被恢復時用於恢復內容狀態的一種恢復方法660。雖然該等方法步驟係配合第一、二、三A、三B和四圖之該等系統做說明，本技術專業人士將可瞭解到設置成以任何順序執行該等方法步驟的任何系統皆在本發明之範圍內。 Figure 6B illustrates a recovery method 660 for restoring content status when a program preempted at the CTA level is restored in accordance with an embodiment of the present invention. Although the method steps are described in conjunction with the systems of the first, second, third, third, and fourth figures, those skilled in the art will appreciate that any system configured to perform the method steps in any order is Within the scope of the invention.

在步驟665，前端212啟始先前於該CTA層級處被佔先的一內容之恢復。在步驟670，前端212確定該內容凍結信號來確保該處理管線基於由前端212所使用的該等交易而不會進行任何作業來恢復該內容狀態。在步驟675，該經選擇的內容狀態由前端212和任務/工作單元207自一內容緩衝器讀取，並於該任務和CTA層級處被恢復。在步驟682，該內容凍結信號被解除確定。 At step 665, the front end 212 initiates recovery of a content that was previously preempted at the CTA level. At step 670, the front end 212 determines the content freeze signal to confirm The processing pipeline is based on the transactions used by the front end 212 and does not perform any jobs to restore the content state. At step 675, the selected content state is read from a content buffer by the front end 212 and the task/work unit 207 and restored at the task and CTA level. At step 682, the content freeze signal is de-asserted.

在步驟685，上次此內容正在運行時被佔先的該等CTA由任務/工作單元207重新啟始到GPC 208中。在步驟690，前端212確認(ACK)主控介面206該CTA層級佔先命令已經完成執行的信號。現在主控介面206可以開始由該推入緩衝器傳送更多的工作至前端212。在一具體實施例中，任務/工作單元207確定和否定該內容凍結，而在該內容凍結於步驟670中被確定之後進行步驟690(由前端212)。任務/工作單元緩衝化來自該推入緩衝器的該新的工作，直到該指令層級佔先命令已經完成執行為止。該新的工作直到該等CTA於步驟685被重新啟始之後才由該任務/工作單元輸出。 At step 685, the CTAs that were preempted when this content was last run are restarted by the task/work unit 207 into the GPC 208. At step 690, the front end 212 acknowledges (ACK) the signal that the master interface 206 has completed the execution of the CTA level preemption command. The master interface 206 can now begin to transfer more work to the front end 212 by the push-in buffer. In a specific embodiment, task/work unit 207 determines and denies the content freeze, and proceeds to step 690 (by front end 212) after the content freeze is determined in step 670. The task/work unit buffers the new work from the push-in buffer until the instruction level preemption command has completed execution. This new work is not output by the task/work unit until the CTAs are restarted in step 685.

要在該指令層級或在該CTA層級處佔先一內容的能力可針對每一特定內容來指定。一長運行時間的內容可於該指令層級處被佔先，以避免當該佔先被啟始時到當該佔先被完成時之間有長的延遲。並不需要長運行時間但維持大量狀態的一內容可於該CTA層級被佔先來最小化被儲存的內容狀態數量。 The ability to preempt a content at the instruction level or at the CTA level can be specified for each particular content. A long run time of content can be preempted at the instruction level to avoid a long delay between when the preemption is initiated and when the preemption is completed. A content that does not require long runtime but maintains a large number of states can be preempted at the CTA level to minimize the amount of stored state of the content.

本發明一具體實施例可以實作成由一電腦系統使用的一程式產品。該程式產品的程式定義該等具體實施例的功能(包括此處所述的方法)，並可包含在多種電腦可讀取儲存媒體上。例示性的電腦可讀取儲存媒體包括但不限於：(i)不可寫入儲存媒體(例如在一電腦內唯讀記憶體裝置，例如可由CD-ROM讀取的CD-ROM碟片，快閃記憶體，ROM晶片，或任何其它種類的固態非揮發性半導體記憶體)，其上可永久儲存資訊；及(ii)可寫入儲存媒體(例如在一磁碟機內的軟碟片、或硬碟機、或任何種類的固態隨機存取半導體記憶體)，其上可儲存可改變的資訊。 An embodiment of the invention can be implemented as a program product for use by a computer system. The program of the program product defines the functions of the specific embodiments (including the methods described herein) and can be included on a variety of computer readable storage media. Exemplary computer readable storage media include, but are not limited to: (i) non-writable storage media (eg, a read-only memory device in a computer, such as a CD-ROM disc that can be read by a CD-ROM, flashing Memory, ROM chip, or any other kind of solid non-volatile semiconductor memory) on which information can be stored permanently; and (ii) writeable to a storage medium (such as a floppy disk in a disk drive) A slice, or a hard disk drive, or any kind of solid state random access semiconductor memory, on which information that can be changed can be stored.

本發明已經參照特定具體實施例在以上進行說明。但是本技術專業人士將可瞭解到在不背離附屬申請專利範圍所提出之本發明的廣義精神與範圍之下可對其進行多種修正與改變。因此前述的說明及圖面係在以例示性而非限制性的角度來看待。 The invention has been described above with reference to specific embodiments. It will be apparent to those skilled in the art that various modifications and changes can be made without departing from the spirit and scope of the invention. Accordingly, the foregoing description and drawings are to be regarded as illustrative

100‧‧‧電腦系統 100‧‧‧ computer system

102‧‧‧中央處理單元 102‧‧‧Central Processing Unit

103‧‧‧裝置驅動器 103‧‧‧Device Driver

104‧‧‧系統記憶體 104‧‧‧System Memory

105‧‧‧記憶體橋接器 105‧‧‧Memory Bridge

106‧‧‧通訊路徑 106‧‧‧Communication path

107‧‧‧輸入/輸出橋接器 107‧‧‧Input/Output Bridge

108‧‧‧輸入裝置 108‧‧‧ Input device

110‧‧‧顯示器 110‧‧‧ display

112‧‧‧圖形處理單元 112‧‧‧Graphic Processing Unit

113‧‧‧通訊路徑 113‧‧‧Communication path

114‧‧‧系統碟 114‧‧‧System Disc

116‧‧‧交換器 116‧‧‧Switch

118‧‧‧網路轉接器 118‧‧‧Network Adapter

120,121‧‧‧嵌入卡 120,121‧‧‧ embedded card

202‧‧‧平行處理單元 202‧‧‧Parallel processing unit

204‧‧‧平行處理記憶體 204‧‧‧Parallel processing of memory

205‧‧‧輸入/輸出單元 205‧‧‧Input/output unit

206‧‧‧主控介面 206‧‧‧Master interface

207‧‧‧任務/工作單元 207‧‧‧Tasks/Working Units

208‧‧‧通用處理叢集 208‧‧‧General Processing Cluster

210‧‧‧交叉開關單元 210‧‧‧cross switch unit

212‧‧‧前端 212‧‧‧ front end

214‧‧‧記憶體介面 214‧‧‧ memory interface

215‧‧‧區隔單元 215‧‧‧ segment unit

220‧‧‧動態隨機存取記憶體 220‧‧‧Dynamic random access memory

230‧‧‧處理叢集陣列 230‧‧‧Process cluster array

300‧‧‧任務管理單元 300‧‧‧Task Management Unit

305‧‧‧管線管理員 305‧‧‧Pipeline Administrator

310‧‧‧串流多處理器 310‧‧‧Streaming multiprocessor

315‧‧‧紋路單元 315‧‧‧Line unit

321‧‧‧排程器表 321‧‧‧ Scheduler

322‧‧‧任務中介資料 322‧‧‧Mission agency information

325‧‧‧預先掃描場化作業 325‧‧‧Pre-scanning field operations

328‧‧‧記憶體管理單元 328‧‧‧Memory Management Unit

330‧‧‧工作分配交叉開關 330‧‧‧Work distribution crossbar

335‧‧‧L1.5快取 335‧‧‧L1.5 cache

340‧‧‧工作分配單元 340‧‧‧Works allocation unit

345‧‧‧任務表 345‧‧‧Task Schedule

所以，可以詳細瞭解本發明上述特徵之方式當中，本發明之一更為特定的說明簡述如上，其可藉由參照具體實施例來進行，其中一些例示於所附圖式中。但是應要注意到，該等附屬圖式僅例示本發明的典型具體實施例，因此其並非要做為本發明之範圍的限制，其可允許其它同等有效的具體實施例。 Therefore, a more detailed description of one of the embodiments of the present invention can be understood by reference to the specific embodiments, which are illustrated in the accompanying drawings. It is to be noted, however, that the appended drawings are merely illustrative of exemplary embodiments of the invention, and are not intended to limit the scope of the invention.

第一圖例示配置成實作本發明一或多種態樣之電腦系統的方塊圖；第二圖係根據本發明一具體實施例中第一圖之該電腦系統的一平行處理子系統的方塊圖；第三A圖係根據本發明一具體實施例中第二圖之任務/工作單元的方塊圖；第三B圖為根據本發明一具體實施例中第二圖之該等平行處理單元中之一者內一通用處理叢集的方塊圖；第四圖為根據本發明一具體實施例該處理管線的方塊圖；第五A圖例示根據本發明一具體實施例中當一程序被佔先時用於卸載內容狀態的一種卸載方法；第五B圖例示根據本發明一具體實施例中當一被佔先的程序被恢復時用於恢復內容狀態的一種恢復方法；第六A圖例示根據本發明一具體實施例中當一程序被佔先時用於卸載內容狀態的另一種卸載方法；以及第六B圖例示根據本發明一具體實施例中當一被佔先的程序被恢復時用於恢復內容狀態的另一種恢復方法。 The first diagram illustrates a block diagram of a computer system configured to implement one or more aspects of the present invention; and the second diagram is a block diagram of a parallel processing subsystem of the computer system in accordance with the first embodiment of the present invention. 3 is a block diagram of a task/work unit according to a second diagram of an embodiment of the present invention; and FIG. 3B is a diagram of the parallel processing unit of the second diagram according to an embodiment of the present invention; A block diagram of a general processing cluster within one; a fourth block diagram of the processing pipeline in accordance with an embodiment of the present invention; and a fifth diagram illustrating an application when a program is preempted in accordance with an embodiment of the present invention An unloading method for uninstalling a content state; FIG. 5B illustrates a recovery method for restoring a content state when a preempted program is restored according to an embodiment of the present invention; FIG. 6A illustrates a specific method according to the present invention Another method of uninstalling the state of the content when a program is preempted in an embodiment; Figure 6B illustrates another recovery method for restoring the state of a content when a preempted program is restored, in accordance with an embodiment of the present invention.

Claims

A method for prioritizing execution of a program instruction in a multi-threaded system, the method comprising: executing a program instruction using a first content in a processing pipeline within the multi-thread system; using the first content, The instruction level is preferentially executed to execute a different program instruction in the multi-thread system using a second content; storing an indication that the first content is executed using the first content; and in the processing pipeline Using the second content to execute the different program instructions.

The method of claim 1, further comprising storing a portion of the first content state maintained in the processing pipeline during execution of the program instructions using the first content prior to executing the different program instructions Share.

The method of claim 1, wherein using the first content priority execution further comprises storing each thread group storage being performed in a streaming multiprocessor when preemptively occurring at the instruction level The first content state.

The method of claim 1, wherein the preempting of the execution of the first content further comprises determining that the streaming multiprocessor configured to execute the program instructions using the first content is idle.

The method of claim 1, further comprising: determining that the processing pipeline is idle before executing the different program instructions; and resetting the processing pipeline without maintaining the processing of the first content in the processing pipeline The status of the content.

A method for prioritizing execution of program instructions in a multi-threaded system, the method comprising: performing a first internal use in a processing pipeline within the multi-threaded system a program instruction; prioritizing execution of the first content at a computing thread array level to execute different program instructions using a second content in the multi-thread system; storing the first content using the first content The execution of the program instructions is preceded by an indication; and the different program instructions that use the second content are executed in the processing pipeline.

The method of claim 6, further comprising performing execution of all of the computational thread arrays that have been initiated in the processing pipeline prior to executing the different program instructions, and storing is maintained to initiate an additional A thread content array is calculated and a first content state of execution of the program instructions that use the first content is completed.

The method of claim 6, wherein the performing the preemption of the first content further comprises: performing execution of all computing thread arrays that have been initiated to execute in the processing pipeline; initiating at least one additional Executing a thread array to perform execution of the program instructions using the first content; and performing the execution of the additional computing thread array by the processing pipeline.

The method of claim 6, wherein the preempting of the execution of the first content further comprises determining that the streaming multiprocessor configured to execute the program instructions using the first content is idle.

The method of claim 6, further comprising: determining that the processing pipeline is idle before executing the different program instructions; and resetting the processing pipeline without being maintained in the processing pipeline for the first content storage The status of the content.