TWI327434B

TWI327434B - Video processing

Info

Publication number: TWI327434B
Application number: TW94140179A
Authority: TW
Inventors: Shirish Gadre; Ashish Karandikar; Stephen D Lew; Christopher T Cheng
Original assignee: Nvidia Corp
Priority date: 2004-11-15
Filing date: 2005-11-15
Publication date: 2010-07-11
Also published as: TW201010410A; TWI327436B

Description


IX. Description of the Invention

[Technical Field]

The field of the present invention pertains to digital electronic computer systems. More particularly, the present invention relates to a system for efficiently processing video information on a computer system. In one aspect, a latency-tolerant system for executing video processing operations is disclosed. In another aspect, stream processing in a video processor is disclosed. Further disclosed is multidimensional datapath processing in a video processor. Also disclosed is a video processor having scalar and vector components.

[Prior Art]

The display of images and full-motion video is an area of the electronics industry that has advanced greatly in recent years. The display and rendering of high-quality video, particularly high-definition digital video, is a primary goal of modern video technology applications and devices. Video technology is used in a wide variety of products, including cellular phones, personal video recorders, digital video projectors, high-definition televisions, and the like. The emergence and continuing development of devices capable of generating and displaying high-definition video is an area of the electronics industry experiencing a great deal of innovation and advancement.

The video technology deployed in many consumer electronics and professional-level devices relies on one or more video processors to format and/or enhance video signals for display. This is especially true for digital video applications. For example, one or more video processors are incorporated into a typical set-top box and are used to convert HDTV broadcast signals into video signals usable by the display. Such conversion involves, for example, scaling, where the video signal is converted from a non-16x9 video image for proper display on a true 16x9 (e.g., widescreen) display. One or more video processors can be used to perform scan conversion, where a video signal is converted from an interlaced format (in which the odd and even scan lines are displayed separately) into a progressive format, in which an entire frame is drawn in a single sweep.

Additional examples of video processor applications include signal decompression, where video signals are received in a compressed format such as MPEG-2 and are decompressed and formatted for a display. Another example is re-interlacing scan conversion, which involves converting an incoming digital video signal from a DVI (Digital Visual Interface) format into a composite video format compatible with the large number of older television displays installed in the market.

More sophisticated users require more sophisticated video processor functions, such as in-loop/out-of-loop deblocking filters, advanced adaptive de-interlacing, input noise filtering for encoding operations, polyphase scaling/re-sampling, sub-picture compositing, and processor-amplifier operations such as color space conversion, adjustments, pixel point operations (e.g., sharpening, histogram adjustment, etc.), and various video surface format conversion support operations.

The problem with providing such sophisticated video processor functionality is that a video processor with an architecture powerful enough to implement these functions can be too expensive to incorporate into many types of devices. The more sophisticated the video processing functions, the more expensive the integrated circuit device required to implement them, in terms of silicon die area, transistor count, memory speed requirements, and so on.

Accordingly, prior art system designers were forced to make trade-offs between video processor performance and cost. Prior art video processors widely regarded as having an acceptable cost/performance ratio have often been barely sufficient in terms of latency constraints (e.g., avoiding stuttering video or otherwise stalled video processing applications) and compute density (e.g., the number of processor operations per square millimeter of die). Furthermore, prior art video processors are generally not suited to linearly scaling performance requirements, such as where a video device is expected to handle multiple video streams (e.g., simultaneously handling multiple incoming streams and outgoing display streams).

What is needed, therefore, is a new video processor system that overcomes the limitations of the prior art. The new video processor system should be scalable and have a high compute density in order to handle the sophisticated video processor functions expected by increasingly sophisticated users.

[Summary of the Invention]

Embodiments of the present invention provide a new video processor system that supports sophisticated video processing functions while making efficient use of integrated circuit silicon die area, transistor count, memory speed requirements, and the like. Embodiments of the present invention maintain high compute density and are readily scalable to handle multiple video streams.

In one embodiment, a latency-tolerant system for executing video processing operations in a video processor is implemented. The system includes a host interface for implementing communication between the video processor and a host CPU; a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations; and a vector execution unit coupled to the host interface and configured to execute vector video processing operations. A command FIFO is included to enable the vector execution unit to operate on a demand-driven basis by accessing the memory command FIFO. A memory interface is included to implement communication between the video processor and a frame buffer memory. A DMA engine is built into the memory interface to implement DMA transfers between a plurality of different memory locations and to load a data store memory and an instruction cache with data and instructions for the vector execution unit.

In one embodiment, the vector execution unit is configured to operate asynchronously with respect to the scalar execution unit by accessing the command FIFO, and thus to operate on a demand-driven basis. The demand-driven basis can be configured to hide the latency of data transfers from different memory locations (e.g., frame buffer memory, system memory, cache memory, etc.) to the command FIFO of the vector execution unit. The command FIFO can be a pipelined FIFO to prevent stalls of the vector execution unit.
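The decoupling described above can be pictured with a small software model. The following is only a sketch under stated assumptions, not the hardware design: the names `CommandFifo` and `simulate`, the tick-based timing, and the `scalar_period`/`vector_period` parameters are hypothetical and chosen purely for illustration. It shows a scalar producer and a vector consumer that run at different rates and meet only at a bounded FIFO.

```python
from collections import deque


class CommandFifo:
    """Bounded FIFO of commands (hypothetical software stand-in for the command FIFO)."""

    def __init__(self, depth):
        self.depth = depth
        self._slots = deque()

    def full(self):
        return len(self._slots) >= self.depth

    def empty(self):
        return not self._slots

    def push(self, command):
        if self.full():
            raise RuntimeError("command FIFO is full")
        self._slots.append(command)

    def pop(self):
        return self._slots.popleft()


def simulate(commands, fifo_depth, scalar_period, vector_period, cycles):
    """Run both units for `cycles` ticks; return the commands the vector unit completed.

    The scalar unit tries to push one command every `scalar_period` ticks and
    the vector unit tries to pop one every `vector_period` ticks. Neither unit
    waits on the other directly; they interact only through the FIFO.
    """
    fifo = CommandFifo(fifo_depth)
    pending = deque(commands)
    completed = []
    for tick in range(cycles):
        # Scalar side: queue work ahead of the vector unit whenever there is room.
        if tick % scalar_period == 0 and pending and not fifo.full():
            fifo.push(pending.popleft())
        # Vector side: demand-driven, it pulls the next command when it is ready.
        if tick % vector_period == 0 and not fifo.empty():
            completed.append(fifo.pop())
    return completed
```

In this model the scalar side can run ahead by up to `fifo_depth` commands, so a slower or faster vector side never has to be in lockstep with it; commands are still completed in the order they were issued.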

In one embodiment, the present invention is implemented as a video processor for executing video processing operations. The video processor includes a host interface for implementing communication between the video processor and a host CPU. The video processor includes a memory interface for implementing communication between the video processor and a frame buffer memory. A scalar execution unit is coupled to the host interface and the memory interface and is configured to execute scalar video processing operations. A vector execution unit is coupled to the host interface and the memory interface and is configured to execute vector video processing operations. The video processor can be a stand-alone video processor integrated circuit or can be a component integrated into a GPU integrated circuit.

In one embodiment, the scalar execution unit functions as the controller of the video processor and controls the operation of the vector execution unit. The scalar execution unit can be configured to execute the flow control algorithms of an application, and the vector execution unit can be configured to execute the pixel processing operations of the application. The video processor can include a vector interface unit for interfacing the scalar execution unit with the vector execution unit. In one embodiment, the scalar execution unit and the vector execution unit are configured to operate asynchronously. The scalar execution unit can execute at a first clock frequency, and the vector execution unit can execute at a different clock frequency (e.g., faster, slower, etc.). The vector execution unit can operate on a demand-driven basis under the control of the scalar execution unit.

In one embodiment, the present invention is implemented as a multidimensional datapath processing system for a video processor for executing video processing operations. The video processor includes a scalar execution unit configured to execute scalar video processing operations and a vector execution unit configured to execute vector video processing operations. A data store memory is included for storing data for the vector execution unit. The data store memory includes a plurality of tiles comprising symmetrical bank data structures arranged in an array. The bank data structures are configured to support accesses to different tiles of each bank.
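One way to picture a banked tile store is as a grid of tiles split across several identical banks. The sketch below is illustrative only and makes assumptions that the text does not fix: a 2x2 array of banks, a 4x4 pattern of tiles per bank, and the hypothetical helper names `locate`, `make_store`, and `read_span`. It maps a global tile coordinate to a bank and a position inside that bank, then gathers a run of tiles along a row or a column, which may draw different tiles from neighbouring banks.

```python
BANK_ROWS, BANK_COLS = 2, 2   # assumed: banks arranged as a 2x2 array
TILES_PER_SIDE = 4            # assumed: each bank holds a 4x4 pattern of tiles


def locate(tile_row, tile_col):
    """Map a global tile coordinate to (bank index, row in bank, column in bank)."""
    bank_row, local_row = divmod(tile_row, TILES_PER_SIDE)
    bank_col, local_col = divmod(tile_col, TILES_PER_SIDE)
    if not (0 <= bank_row < BANK_ROWS and 0 <= bank_col < BANK_COLS):
        raise IndexError("tile coordinate outside the data store")
    return bank_row * BANK_COLS + bank_col, local_row, local_col


def make_store():
    """Build the store; each tile is tagged with its global coordinate."""
    store = [[[None] * TILES_PER_SIDE for _ in range(TILES_PER_SIDE)]
             for _ in range(BANK_ROWS * BANK_COLS)]
    for r in range(BANK_ROWS * TILES_PER_SIDE):
        for c in range(BANK_COLS * TILES_PER_SIDE):
            bank, lr, lc = locate(r, c)
            store[bank][lr][lc] = (r, c)
    return store


def read_span(store, tile_row, tile_col, count, direction):
    """Gather `count` tiles along a row or a column starting at (tile_row, tile_col).

    Returns the tiles in order plus the sorted list of banks that supplied them.
    """
    step = {"row": (0, 1), "column": (1, 0)}[direction]
    tiles, banks = [], set()
    for i in range(count):
        bank, lr, lc = locate(tile_row + i * step[0], tile_col + i * step[1])
        tiles.append(store[bank][lr][lc])
        banks.add(bank)
    return tiles, sorted(banks)
```

Because every bank has the same shape, the same address arithmetic works for rows and columns alike; a span that crosses a bank boundary simply takes its first tiles from one bank and the rest from its neighbour.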

Depending on the requirements of a particular configuration, each bank data structure can comprise a plurality of tiles (e.g., 4x4, 8x8, 8x16, 16x24, or the like). In one embodiment, the banks are configured to support accesses to different tiles of each bank. This allows a single access to retrieve a row or column of tiles from two adjacent banks. In one embodiment, a crossbar is used to select a configuration for accessing tiles of the plurality of bank data structures (e.g., row, column, block, etc.). A collector can be included to receive the tiles of the banks accessed by the crossbar and to provide the tiles to the front end of a vector datapath on a per-clock basis.

In one embodiment, the present invention is implemented as a stream-based memory access system for a video processor. The video processor includes a scalar execution unit configured to execute scalar video processing operations and a vector execution unit configured to execute vector video processing operations. A frame buffer memory is included for storing data for the scalar execution unit and the vector execution unit. A memory interface is included for implementing communication between the scalar execution unit and the vector execution unit and the frame buffer memory. The frame buffer memory comprises a plurality of tiles. The memory interface implements a first stream comprising a first sequential access of tiles for the scalar execution unit and implements a second stream comprising a second sequential access of tiles for the vector execution unit.

In one embodiment, the first stream and the second stream comprise a sequential series of prefetched tiles, prefetched in such a manner as to hide the access latency of the originating memory location (e.g., frame buffer memory, system memory, etc.). In one embodiment, the memory interface is configured to manage a plurality of different streams from a plurality of different originating locations to a plurality of different terminating locations. In one embodiment, a DMA engine built into the memory interface is used to implement a plurality of memory reads and a plurality of memory writes to support the multiple streams.

Broadly, this disclosure describes at least the following four methods.

(A) A method broadly taught in this specification is a multidimensional datapath processing method for executing video processing operations in a video processor, comprising: executing scalar video processing operations by using a scalar execution unit; executing vector video processing operations by using a vector execution unit; and storing data for the vector execution unit by using a data store memory, wherein the data store memory comprises a plurality of tiles comprising symmetrical bank data structures arranged in an array, and wherein the bank data structures are configured to support accesses to different tiles of each bank. Method A further covers bank data structures that each include a plurality of tiles arranged in a 4x4 pattern, as well as bank data structures that each include a plurality of tiles arranged in an 8x8, 8x16, or 16x24 pattern. Method A further covers bank data structures configured to support accesses to different tiles of each bank data structure, wherein at least one access is to two adjacent bank data structures and comprises a row of tiles of the two bank data structures. Method A likewise covers bank data structures configured to support accesses to different tiles of each bank data structure, wherein at least one access is to two adjacent bank data structures and comprises a column of tiles of the two bank data structures. In addition, method A comprises selecting a configuration for accessing tiles of the plurality of bank data structures by using a crossbar coupled to the data store. In this selecting step, the crossbar accesses the tiles of the plurality of bank data structures to supply data to a vector datapath on a per-clock basis. Method A also involves using a collector to receive the tiles of the plurality of banks accessed by the crossbar and to provide the tiles to the front end of a vector datapath on a per-clock basis.

(B) A method broadly taught in this specification is also a method for executing video processing operations, implemented using a video processor of a computer system executing computer-readable code, the method comprising: establishing communication between the video processor and a host CPU by using a host interface; establishing communication between the video processor and a frame buffer memory by using a memory interface; executing scalar video processing operations by using a scalar execution unit coupled to the host interface and the memory interface; and executing vector video processing operations by using a vector execution unit coupled to the host interface and the memory interface. In method B, the scalar execution unit functions as the controller of the video processor and controls the operation of the vector execution unit. Method B also comprises a vector interface unit for interfacing the scalar execution unit with the vector execution unit. Method B also covers a scalar execution unit and a vector execution unit configured to operate asynchronously. Also, the scalar execution unit executes at a first clock frequency and the vector execution unit executes at a second clock frequency. Method B also covers a scalar execution unit configured to execute the flow control algorithms of an application and a vector execution unit configured to execute the pixel processing operations of the application. Further, the vector execution unit can be configured to operate on a demand-driven basis under the control of the scalar execution unit. In addition, the scalar execution unit is configured to send function calls to the vector execution unit using a memory command FIFO, and the vector execution unit operates on a demand-driven basis by accessing the memory command FIFO. The asynchronous operation of the vector execution unit is also configured to support separate, independent updates of a vector subroutine or a scalar subroutine of the application. Finally, method B covers a scalar execution unit configured to operate using VLIW (very long instruction word) code.

(C) This specification also broadly teaches a method for stream-based memory access in a video processor executing video processing operations, the method comprising: executing scalar video processing operations by using a scalar execution unit; executing vector video processing operations by using a vector execution unit; storing data for the scalar execution unit and the vector execution unit by using a frame buffer memory; and implementing communication between the scalar execution unit and the vector execution unit and the frame buffer memory by using a memory interface, wherein the frame buffer memory comprises a plurality of tiles, and wherein the memory interface implements a first stream comprising a first sequential access of tiles and implements a second stream comprising a second sequential access of tiles for the vector execution unit or the scalar execution unit. In method C, the first stream and the second stream include at least one prefetched tile. Method C further covers a first stream originating from a first location in the frame buffer memory and a second stream originating from a second location in the frame buffer memory. Method C also covers a memory interface configured to manage a plurality of streams from a plurality of different originating locations to a plurality of different terminating locations. In this respect, at least one originating location or at least one terminating location is in a system memory. Method C also comprises implementing a plurality of memory reads to support the first stream and the second stream, and a plurality of memory writes to support the first stream and the second stream, by using a DMA engine built into the memory interface. In addition, method C covers the first stream experiencing a higher amount of latency than the second stream, wherein the first stream incorporates a larger number of buffers for storing tiles than the second stream. Method C also covers a memory interface configured to prefetch an adjustable number of tiles of the first stream or the second stream to compensate for the latency of the first stream or the second stream.

(D) The methods described herein also broadly include a method for latency-tolerant video processing operations, the method comprising: implementing communication between a video processor and a host CPU by using a host interface; executing scalar video processing operations by using a scalar execution unit coupled to the host interface; executing vector video processing operations by using a vector execution unit coupled to the host interface; enabling the vector execution unit to operate on a demand-driven basis by accessing a memory command FIFO; implementing communication between the video processor and a frame buffer memory by using a memory interface; and implementing DMA transfers between a plurality of different memory locations by using a DMA engine that is built into the memory interface and configured to load a data store memory and an instruction cache with data and instructions for the vector execution unit. Method D further covers a vector execution unit configured to operate asynchronously with respect to the scalar execution unit by accessing the command FIFO, so as to operate on a demand-driven basis. Method D also covers a demand-driven basis configured to hide the latency of data transfers from different memory locations to the command FIFO of the vector execution unit. In addition, method D covers a scalar execution unit configured to implement algorithm flow control processing, wherein the vector execution unit is configured to implement the majority of the video processing workload. Here, the scalar execution unit is configured to pre-compute work parameters for the vector execution unit in order to hide data transfer latency. Method D further covers a vector execution unit configured to schedule a memory read via the DMA engine in order to prefetch commands for subsequent execution of a vector subroutine. Here, the memory read is scheduled ahead of the scalar execution unit's call to the vector subroutine, so that the commands for executing the vector subroutine are prefetched.
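Methods (C) and (D) both rely on fetching ahead of need so that memory latency is hidden from the consumer. The following is a minimal sketch of that idea, not the memory interface itself: the function `run_stream`, its cycle-based timing, and the one-request-per-cycle issue rule are assumptions made for illustration. It models a single sequential stream of tiles with an adjustable prefetch depth and counts how many cycles the consumer spends waiting.

```python
from collections import deque


def run_stream(num_tiles, latency, prefetch_depth):
    """Simulate one sequential stream and return the consumer's stall cycles.

    Each tile request takes `latency` cycles to return. At most
    `prefetch_depth` tiles may be requested or buffered ahead of the consumer
    at any time. The consumer wants one tile per cycle, in order.
    """
    in_flight = deque()   # arrival cycle of each outstanding request, in order
    buffered = 0          # tiles that have arrived but are not yet consumed
    requested = consumed = stalls = cycle = 0
    while consumed < num_tiles:
        # Tiles whose latency has elapsed land in the stream's buffer.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
            buffered += 1
        # The consumer takes a tile if one is ready; otherwise it stalls.
        if buffered:
            buffered -= 1
            consumed += 1
        else:
            stalls += 1
        # Issue one new request per cycle while a buffer slot is free.
        if requested < num_tiles and len(in_flight) + buffered < prefetch_depth:
            in_flight.append(cycle + latency)
            requested += 1
        cycle += 1
    return stalls
```

In this model, once the prefetch depth reaches the latency, the only stalls left are the cycles spent waiting for the very first tile. A stream with a longer latency therefore needs a deeper prefetch (more buffers) to reach the same steady rate, which is the trade-off method (C) describes.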

[Embodiments]

Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the preferred embodiments, it should be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications, and equivalents that fall within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of embodiments of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, one of ordinary skill in the art will recognize that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure aspects of the embodiments of the present invention.

Notation and nomenclature:

Some portions of the detailed descriptions that follow are presented in terms of procedures, steps, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, computer-executed step, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise in the following discussions, discussions throughout the present invention that use terms such as "processing", "accessing", "executing", "storing", "rendering", or the like refer to the actions and processes of a computer system (e.g., computer system 100 of Figure 1), or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices.

Computer system platform:

Figure 1 shows a computer system 100 in accordance with one embodiment of the present invention. Computer system 100 depicts the components of a basic computer system in accordance with one embodiment of the present invention, providing the execution platform for certain hardware-based and software-based functionality. In general, computer system 100 comprises at least one CPU 101, a system memory 115, and at least one graphics processor unit (GPU) 110 and one video processor unit (VPU) 111. The CPU 101 can be coupled to the system memory 115 via the bridge component 105 or can be directly coupled to the system memory 115 via a memory controller (not shown) internal to the CPU 101. The bridge component 105 (e.g., Northbridge) can support expansion buses that connect various I/O devices (e.g., one or more hard disk drives, Ethernet adapters, CD ROM, DVD, etc.). The GPU 110 and the video processor unit 111 are coupled to a display 112. One or more additional GPUs can additionally be coupled to system 100 to further increase its computational power. The GPU 110 and the video processor unit 111 are coupled to the CPU 101 and the system memory 115 via the bridge component 105. System 100 can be implemented as, for example, a desktop computer system or server computer system having a powerful general-purpose CPU 101 coupled to a dedicated graphics rendering GPU 110. In such an embodiment, components can be included that add peripheral buses, specialized graphics memory and system memory, IO devices, and the like. Similarly, system 100 can be implemented as a handheld device (e.g., a cell phone, etc.) or a video game console device such as the Xbox®, available from Microsoft Corporation of Redmond, Washington, or the PlayStation3®, available from Sony Computer Entertainment Corporation of Tokyo, Japan.

It should be appreciated that the GPU 110 can be implemented as a discrete component, as a discrete graphics card designed to couple to the computer system 100 via a connector (e.g., AGP slot, PCI-Express slot, etc.), as a discrete integrated circuit die (e.g., mounted directly on the motherboard), or as an integrated GPU included within the integrated circuit die of a computer system chipset component (e.g., integrated within the bridge chip 105). Additionally, a local graphics memory can be included for the GPU 110 for high-bandwidth graphics data storage. It should also be appreciated that the GPU 110 and the video processor unit 111 can be integrated onto the same integrated circuit die (e.g., as component 120) or can be separate discrete integrated circuit components connected to, or mounted on, the motherboard of computer system 100.

Embodiments of the present invention

FIG. 2 shows a diagram depicting the internal components of the video processor unit 111 in accordance with one embodiment of the present invention. As illustrated in FIG. 2, the video processor unit 111 includes a scalar execution unit 201, a vector execution unit 202, a memory interface 203, and a host interface 204. In the FIG. 2 embodiment, the video processor unit (hereafter simply "video processor") 111 includes functional components for executing video processing operations. The video processor 111 uses the host interface 204 to establish communication between the video processor 111 and the host CPU 101 via the bridge 105. The video processor 111 uses the memory interface 203 to establish communication between the video processor 111 and a frame buffer memory 205 (e.g., for the coupled display 112, not shown). The scalar execution unit 201 is coupled to the host interface 204 and the memory interface 203 and is configured to execute scalar video processing operations. A vector execution unit is coupled to the host interface 204 and the memory interface 203 and is configured to execute vector video processing operations.

The FIG. 2 embodiment illustrates the manner in which the video processor 111 partitions its execution functionality into scalar operations and vector operations. The scalar operations are implemented by the scalar execution unit 201. The vector operations are implemented by the vector execution unit 202. In one embodiment, the vector execution unit 202 is configured to function as a slave co-processor to the scalar execution unit 201. In such an embodiment, the scalar execution unit manages the workload of the vector execution unit 202 by feeding control streams to the vector execution unit 202 and managing the data input/output for the vector execution unit 202. The control streams typically comprise functional parameters, subroutine arguments, and the like. In a typical video processing application, the control flow of the application's processing algorithm is executed on the scalar execution unit 201, whereas the actual pixel/data processing operations are executed on the vector execution unit 202.

Referring still to FIG. 2, the scalar execution unit 201 can be implemented as a RISC-type scalar execution unit incorporating RISC-based execution technologies. The vector execution unit 202 can be implemented as a SIMD machine having, for example, one or more SIMD pipelines. In a two-SIMD-pipeline embodiment, for example, each SIMD pipeline can be implemented with a 16-pixel-wide datapath (or wider), thus providing the vector execution unit 202 with the raw computing power to produce up to 32 pixels of resulting data output per clock. In one embodiment, the scalar execution unit 201 includes hardware configured to operate using VLIW (very long instruction word) software code to optimize the parallel execution of scalar operations on a per-clock basis.

In the FIG. 2 embodiment, the scalar execution unit 201 includes an instruction cache 211 and a data cache 212 coupled to a scalar processor 210. The caches 211-212 interface with the memory interface 203 for access to external memory, such as the frame buffer 205. The scalar execution unit 201 further includes a vector interface unit 213 to establish communication with the vector execution unit 202. In one embodiment, the vector interface unit 213 can include one or more synchronous mailboxes 214 configured to enable asynchronous communication between the scalar execution unit 201 and the vector interface unit 213. In the FIG. 2 embodiment, the vector execution unit 202 includes a vector control unit 220 configured to control the operation of a vector execution datapath (vector datapath 221). The vector control unit 220 includes a command FIFO 225 to receive instructions and data from the scalar execution unit 201. An instruction cache 222 is coupled to provide instructions to the vector control unit 220. A datastore memory 223 is coupled to provide input data to the vector datapath 221 and to receive resulting data from the vector datapath 221. The datastore 223 functions as an instruction cache and a data RAM for the vector datapath 221.

The instruction cache 222 and the datastore 223 are coupled to the memory interface 203 for accessing external memory, such as the frame buffer 205. The FIG. 2 embodiment also shows a second vector datapath 231 and a respective second datastore 233 (e.g., the dotted outlines). It should be understood that the second vector datapath 231 and the second datastore 233 are shown to illustrate the case where the vector execution unit 202 has two vector execution pipelines (e.g., a dual SIMD pipeline configuration). Embodiments of the present invention are suited to vector control units having a larger number of vector execution pipelines (e.g., four, eight, sixteen, etc.). The scalar execution unit 201 provides the data and commands for the vector control unit 220. In one embodiment, the scalar execution unit 201 sends function calls to the vector execution unit 202 using a memory-mapped command FIFO 225. Commands for the vector execution unit 202 are queued in this command FIFO 225.

The use of the command FIFO 225 effectively decouples the scalar execution unit 201 from the vector execution unit 202. The scalar execution unit 201 can function on its own respective clock, operating at its own respective clock frequency, which can be distinct from, and separately controlled from, the clock frequency of the vector execution unit 202. The command FIFO 225 enables the vector execution unit 202 to operate as a demand-driven unit. For example, work can be handed off from the scalar execution unit 201 to the command FIFO 225 and then accessed by the vector execution unit 202 for processing in a decoupled, asynchronous manner. The vector execution unit 202 thus processes its workload as needed, or as demanded, by the scalar execution unit 201. Such functionality allows the vector execution unit 202 to conserve power (e.g., by reducing or stopping one or more internal clocks) when maximum performance is not required.
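The decoupling described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patented hardware: the class `CommandFifo`, the function `simulate`, the FIFO capacity, and the tick periods are all hypothetical names and values chosen for illustration. It shows a producer (standing in for the scalar execution unit) and a consumer (standing in for the vector execution unit) running at different rates, with the consumer doing work only when a command is queued.

```python
from collections import deque


class CommandFifo:
    """Bounded FIFO standing in for command FIFO 225 (capacity is an assumed value)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.items = deque()

    def full(self):
        return len(self.items) >= self.capacity

    def push(self, cmd):
        """Queue a command; return False if the FIFO is full."""
        if self.full():
            return False
        self.items.append(cmd)
        return True

    def pop(self):
        """Return the oldest command, or None if nothing is queued."""
        return self.items.popleft() if self.items else None


def simulate(ticks, scalar_period, vector_period, work_items, capacity=4):
    """Run both units on independent 'clocks', modelled as tick periods."""
    fifo = CommandFifo(capacity)
    issued, done, idle_ticks = 0, [], 0
    for t in range(ticks):
        # Scalar side: hand off work whenever its clock fires and there is room.
        if t % scalar_period == 0 and issued < work_items:
            if fifo.push(issued):
                issued += 1
        # Vector side: demand driven, it only works when a command is queued.
        if t % vector_period == 0:
            cmd = fifo.pop()
            if cmd is None:
                idle_ticks += 1  # a cycle in which clocks could be reduced or stopped
            else:
                done.append(cmd)
    return done, idle_ticks


done, idle = simulate(ticks=40, scalar_period=1, vector_period=3, work_items=8)
print(done, idle)
```

In this sketch the idle ticks are the cycles where the consumer finds the FIFO empty, which is where a demand-driven unit would have the opportunity to save power.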

The partitioning of video processing functions into a scalar portion (e.g., for execution by the scalar execution unit 201) and a vector portion (e.g., for execution by the vector execution unit 202) allows video processing programs built for the video processor 111 to be compiled into separate scalar software code and vector software code. The scalar software code and the vector software code can be compiled separately and subsequently linked together to form a coherent application.

The partitioning allows vector software code functions to be written separately and distinct from the scalar software code functions. For example, the vector functions can be written separately (e.g., at a different time, by a different team of engineers, etc.) and can be provided as one or more subroutines or library functions for use by, or with, the scalar functions (e.g., scalar threads, processes, etc.). This allows the scalar software code and the vector software code to be updated independently of each other. For example, a vector subroutine can be updated independently of a scalar subroutine (e.g., through an update of a previously distributed program, or a new feature added to increase the functionality of the distributed program), or vice versa. The partitioning is facilitated by the separate respective caches of the scalar processor 210 (e.g., caches 211-212) and of the vector control unit 220 and vector datapath 221 (e.g., caches 222-223). As described above, the scalar execution unit 201 and the vector execution unit 202 communicate via the command FIFO 225. FIG. 3 shows a diagram of an exemplary software program 300 for the video processor 111 in accordance with one embodiment of the present invention. As depicted in FIG. 3, the software program 300 illustrates attributes of a programming model for the video processor 111, whereby a scalar control thread 301 is executed by the video processor 111 in conjunction with a vector data thread 302.

The software program 300 example of the FIG. 3 embodiment illustrates a programming model for the video processor 111, whereby a scalar control program (e.g., scalar control thread 301) on the scalar execution unit 201 executes subroutine calls (e.g., vector data thread 302) on the vector execution unit 202. The software program 300 example shows a case where a compiler or software programmer has decomposed a video processing application into a scalar portion (e.g., a first thread) and a vector portion (e.g., a second thread). As shown in FIG. 3, the scalar control thread 301 running on the scalar execution unit 201 computes work parameters ahead of time and feeds these parameters to the vector execution unit 202, which performs the majority of the processing work. As described above, the software code for the two threads 301 and 302 can be written and compiled separately. The scalar thread is responsible for the following:

1. Interfacing with the host unit 204 and implementing a class interface;
2. Initialization, setup, and configuration of the vector execution unit 202; and
3. Execution of the algorithm in work units, chunks, or working sets in a loop, such that in each iteration:
a. the parameters for the current working set are computed;
b. the transfer of the input data into the vector execution unit is initiated; and
c. the transfer of the output data from the vector execution unit is initiated.

The typical execution model of the scalar thread is "fire-and-forget". The term "fire-and-forget" refers to the attribute whereby, in a typical model for a video baseband processing application, commands and data are sent from the scalar execution unit 201 to the vector execution unit 202 (e.g., via the command FIFO 225), and no data is returned from the vector execution unit 202 until the algorithm completes.

In the program 300 example of FIG. 3, the scalar execution unit 201 keeps scheduling work for the vector execution unit 202 until there is no longer any space in the command FIFO 225 (e.g., !end_of_alg & !cmd_fifo_full). The work scheduled by the scalar execution unit 201 computes parameters and sends these parameters to the vector subroutine, and subsequently calls the vector subroutine to perform the work. The execution of the subroutine (e.g., vector_funcB) by the vector execution unit 202 is delayed in time, mainly to hide the latency from main memory (e.g., system memory 115). Thus, the architecture of the video processor 111 provides a latency compensation mechanism on the vector execution unit 202 side for both instruction and data traffic. These latency compensation mechanisms are described in greater detail below.

It should be noted that the software program 300 example would be more complex in those cases where there are two or more vector execution pipelines (e.g., the vector datapath 221 and the second vector datapath 231 of FIG. 2). Similarly, the software program 300 example would be more complex in those situations where the program 300 is written for a system having two vector execution pipelines but still retains the ability to execute on a system having a single vector execution pipeline.

Thus, as described above in the discussion of FIG. 2 and FIG. 3, the scalar execution unit 201 is responsible for initiating computation on the vector execution unit 202. In one embodiment, the commands passed from the scalar execution unit 201 to the vector execution unit 202 are of the following main types:

1. Read commands (e.g., memRd), initiated by the scalar execution unit 201 to transfer the current working set data from memory to the data RAMs of the vector execution unit 202;
2. Parameters, passed from the scalar execution unit 201 to the vector execution unit 202;
3. Execute commands, in the form of the PC (e.g., program counter) of the vector subroutine to be executed; and
4. Write commands (e.g., memWr), initiated by the scalar execution unit 201 to copy the results of the vector computation into memory.
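The control loop and the four command types described above can be sketched as follows. This is a minimal illustration under assumptions, not the actual firmware interface: the `Command` record, the `schedule_work` function, the dictionary keys, the FIFO capacity, and the program counter value are hypothetical. In particular, queuing the four commands of a working set as one all-or-nothing batch is a simplification made here, not something the text specifies.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Command:
    kind: str        # "memRd", "param", "execute", or "memWr"
    payload: object  # address, parameter block, or subroutine PC (illustrative)


def schedule_work(working_sets, cmd_fifo, capacity, vector_func_pc):
    """Queue commands for as many working sets as fit; return how many were queued."""
    queued = 0
    for ws in working_sets:  # running out of working sets plays the role of end_of_alg
        batch = [
            Command("memRd", ws["src"]),         # start moving input data in
            Command("param", ws["params"]),      # pass precomputed parameters
            Command("execute", vector_func_pc),  # call the vector subroutine by PC
            Command("memWr", ws["dst"]),         # start moving results out
        ]
        # Stands in for the cmd_fifo_full check: stop issuing when there is no room.
        if len(cmd_fifo) + len(batch) > capacity:
            break
        cmd_fifo.extend(batch)  # fire-and-forget: nothing is returned to the caller
        queued += 1
    return queued


fifo = deque()
sets = [{"src": f"tile{i}", "params": {"alpha": i}, "dst": f"out{i}"} for i in range(3)]
n = schedule_work(sets, fifo, capacity=10, vector_func_pc=0x40)
print(n, [c.kind for c in fifo])
```

With a capacity of 10 commands, only two of the three working sets are queued; the scalar side would retry the third once the vector side has drained some commands.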

In one embodiment, upon receiving these commands, the vector execution unit 202 immediately schedules the memRd commands to the memory interface 203 (e.g., to read the requested data from the frame buffer 205). The vector execution unit 202 also examines the execute commands and prefetches the vector subroutine to be executed (if it is not already present in the cache 222).

The objective of the vector execution unit 202 in this situation is to schedule ahead the instruction and data streams of the next few executes while the vector execution unit 202 is working on the current execute. The schedule-ahead feature effectively hides the latency involved in fetching instructions and data from their memory locations. In order to make these read requests ahead of time, the vector execution unit 202, the datastore (e.g., datastore 223), and the instruction cache (e.g., cache 222) are implemented using high-speed optimized hardware. As described above, the datastore (e.g., datastore 223) functions as the working RAM of the vector execution unit 202. The scalar execution unit 201 perceives and interacts with the datastore as if it were a collection of FIFOs. The FIFOs comprise the "streams" with which the video processor 111 operates. In one embodiment, streams are generally input/output FIFOs into which the scalar execution unit 201 initiates transfers (e.g., to the vector execution unit 202). As described above, the operations of the scalar execution unit 201 and the vector execution unit 202 are decoupled. Once the input/output streams are full, a DMA engine within the vector control unit 220 stops processing the command FIFO 225. This quickly leads to the command FIFO 225 being full. When the command FIFO 225 is full, the scalar execution unit 201 stops issuing additional work to the vector execution unit 202.
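The back-pressure chain in the last few sentences (full streams stall the DMA engine, which fills the command FIFO, which stalls the scalar unit) can be modelled in a few lines. This is a rough sketch under assumptions: the function `simulate_backpressure`, the one-command-per-tick issue rate, and all sizes are hypothetical and chosen only to make the effect visible.

```python
from collections import deque


def simulate_backpressure(ticks, stream_slots, fifo_capacity, consume_every):
    """Return (commands issued, ticks on which the scalar side had to stall)."""
    cmd_fifo = deque()
    tiles_in_stream = 0
    issued = 0
    scalar_stalls = 0
    for t in range(ticks):
        # Scalar side: tries to issue one memRd per tick unless the FIFO is full.
        if len(cmd_fifo) < fifo_capacity:
            cmd_fifo.append(("memRd", issued))
            issued += 1
        else:
            scalar_stalls += 1
        # DMA engine: processes the head command only if the stream has a free tile.
        if cmd_fifo and tiles_in_stream < stream_slots:
            cmd_fifo.popleft()
            tiles_in_stream += 1
        # Vector datapath: consumes one tile every `consume_every` ticks.
        if t % consume_every == consume_every - 1 and tiles_in_stream > 0:
            tiles_in_stream -= 1
    return issued, scalar_stalls


slow = simulate_backpressure(ticks=20, stream_slots=2, fifo_capacity=3, consume_every=4)
fast = simulate_backpressure(ticks=20, stream_slots=2, fifo_capacity=3, consume_every=1)
print(slow, fast)
```

When the consumer is slow, the stream fills, the command FIFO backs up, and the producer stalls; when the consumer keeps pace, the producer never stalls. No explicit handshake between the two units is needed, which is the point of the FIFO-based design.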

In one embodiment, the vector execution unit 202 may need intermediate streams in addition to the input and output streams. Thus, the entire datastore 223 can be seen as a collection of streams with respect to the interaction with the scalar execution unit 201. FIG. 4 shows an example of sub-picture blending with video using a video processor in accordance with one embodiment of the present invention. FIG. 4 shows an exemplary case where a video surface is blended with a sub-picture and then converted to an ARGB surface. The data comprising the surfaces is resident in the frame buffer memory 205 as the Luma parameters 412 and the Chroma parameters 413. The sub-picture pixel elements 414 are also resident in the frame buffer memory 205, as shown. The vector subroutine instructions and parameters 411 are instantiated in memory 205, as shown.

In one embodiment, each stream comprises a FIFO of working 2D chunks of data called "tiles". In such an embodiment, the vector execution unit 202 maintains a read tile pointer and a write tile pointer for each stream. For example, for input streams, when a vector subroutine is executed, the vector subroutine can consume, or read from, a current (read) tile. In the background, data is transferred to the current (write) tile by memRd commands. The vector execution unit can also produce output tiles for output streams. These tiles are then moved to memory by the memWr() commands that follow the execute commands. This effectively prefetches tiles and has them ready to be operated on, effectively hiding the latency.

In the FIG. 4 sub-picture blending example, the vector datapath 221 is configured by the instantiated instance of the vector subroutine instructions and parameters 411 (e.g., &v_subp_blend). This is shown by line 421. The scalar execution unit 201 reads in chunks (e.g., tiles) of the surfaces and loads them into the datastore 223 using the DMA engine 401 (e.g., within the memory interface 203). The load operation is shown by line 422, line 423, and line 424.

Referring still to FIG. 4, since there are multiple input surfaces, multiple input streams need to be maintained. Each stream has a corresponding FIFO. Each stream can have a different number of tiles. The FIG. 4 example shows a case where the sub-picture surface is in system memory 115 (e.g., sub-picture pixel elements 414) and hence has additional buffering (e.g., n, n+1, n+2, n+3, etc.), whereas the video stream (e.g., Luma 412, Chroma 413, etc.) can have a smaller number of tiles. The number of buffers/FIFOs used can be adjusted in accordance with the degree of latency experienced by the stream.
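A stream of tiles with separate read and write tile pointers can be sketched as a small ring buffer. This is only an illustration under assumptions: the class name `TileStream`, its methods, and the chosen depths (2 tiles for a video stream, 4 for a sub-picture stream coming from slower memory) are hypothetical and are not taken from the source beyond the idea that higher-latency streams get more buffering.

```python
class TileStream:
    """FIFO of tiles with a read tile pointer and a write tile pointer."""

    def __init__(self, depth):
        self.slots = [None] * depth  # each slot holds one 2D tile
        self.read_ptr = 0            # tile the vector subroutine consumes next
        self.write_ptr = 0           # tile a background memRd fills next
        self.count = 0               # number of tiles currently buffered

    def write_tile(self, tile):
        """Background transfer (memRd) lands a tile; False means the stream is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.write_ptr] = tile
        self.write_ptr = (self.write_ptr + 1) % len(self.slots)
        self.count += 1
        return True

    def read_tile(self):
        """Vector subroutine consumes the current read tile; None means nothing is ready."""
        if self.count == 0:
            return None
        tile = self.slots[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.slots)
        self.count -= 1
        return tile


video = TileStream(depth=2)       # lower-latency source, fewer tiles (assumed depth)
subpicture = TileStream(depth=4)  # higher-latency source, deeper prefetch (assumed depth)

# Prefetch ahead of consumption: the third write to the shallow stream is refused.
results = [video.write_tile(f"luma{i}") for i in range(3)]
first = video.read_tile()
retry = video.write_tile("luma2")
print(results, first, retry)
```

Because reads and writes advance independent pointers, a tile can be filled in the background while an earlier tile is being consumed, which is how the prefetch hides memory latency in this model.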

As described above, the datastore 223 utilizes a look-ahead prefetch method to hide latency. Because of this, a stream can have data in two or more tiles, as the data is prefetched for the appropriate vector datapath execution hardware (e.g., depicted as FIFO n, n+1, n+2, etc.). Once the datastore is loaded, the FIFOs are accessed by the vector datapath hardware 221 and operated upon by the vector subroutine (e.g., subroutine 430). The results of the vector datapath operation comprise an output stream 403. This output stream is copied by the scalar execution unit 201, via the DMA engine 401, back into the frame buffer memory 205 (e.g., ARGB_OUT 415). This is shown by line 425.

Thus, embodiments of the present invention utilize an important aspect of stream processing, namely that data storage and memory are abstracted as a plurality of tiles. Hence, a stream can be viewed as a sequentially accessed collection of tiles. Streams are used to prefetch data. This data is in the form of tiles. The tiles are prefetched to hide latency from the particular memory source the data originates from (e.g., system memory, frame buffer memory, or the like). Similarly, the streams can be destined for different locations (e.g., caches for the vector execution unit, caches for the scalar execution unit, frame buffer memory, system memory, etc.). Another characteristic of streams is that they generally access tiles in a look-ahead prefetching mode. As described above, the higher the latency, the deeper the prefetching and the more buffering is used per stream (e.g., as depicted in FIG. 4).

FIG. 5 shows a diagram depicting the internal components of a vector execution unit in accordance with one embodiment of the present invention. The diagram of FIG. 5 shows an arrangement of the various functional units and register/SRAM resources of the vector execution unit 202 from a programming point of view. In the FIG. 5 embodiment, the vector execution unit 202 comprises a VLIW digital signal processor optimized for the performance of video baseband processing and the execution of various codecs (compression-decompression algorithms). Accordingly, the vector execution unit 202 has a number of attributes directed toward increasing the efficiency of video processing and codec execution. In the FIG. 5 embodiment, the attributes comprise:

1. Scalable performance, by providing the option of incorporating multiple vector execution pipelines;
2. The allocation of 2 data address generators (DAGs) per pipe;
3. Memory/register operands;
4. 2D (x, y) pointers/iterators;
5. Deep pipeline (e.g., 11 to 12 stages);
6. Scalar (integer)/branch units;
7. Variable instruction widths (long/short instructions);
8. Data aligners for operand extraction;
9. A 2D datapath (4x4) shape for typical operands and results; and
10. A slave vector execution unit to the scalar execution unit, executing remote procedure calls.

Generally, a programmer's view of the vector execution unit 202 is that of a SIMD datapath with 2 DAGs 503. Instructions are issued in a VLIW manner (e.g., instructions are issued for the vector datapath 504 and the address generators 503 simultaneously) and are decoded and dispatched to the appropriate execution unit by the instruction decoder 501. The instructions are of variable length, with the most commonly used instructions encoded in short form. The full instruction set is available in the long form, as VLIW-type instructions.

The legend 502 shows three clock cycles having three such VLIW instructions. In accordance with the legend 510, the uppermost of the VLIW instructions 502 comprises two address instructions (e.g., for the 2 DAGs 503) and one instruction for the vector datapath 504. The middle VLIW instruction comprises one integer instruction (e.g., for the integer unit 505), one address instruction, and one vector instruction. The lowermost VLIW instruction comprises a branch instruction (e.g., for the branch unit 506), one address instruction, and one vector instruction. The vector execution unit can be configured to have a single data pipe or multiple data pipes. Each data pipe consists of local RAM (e.g., a datastore 511), a crossbar 516, 2 DAGs 503, and a SIMD execution unit (e.g., the vector datapath 504). FIG. 5 shows a basic configuration for explanatory purposes, in which only 1 data pipe is instantiated. When 2 data pipes are instantiated, they can run as independent threads or as cooperative threads.

Six different ports (e.g., 4 read and 2 write) can be accessed via an address register file unit 515. These registers receive parameters from the scalar execution unit or results from the integer unit 505 or the address unit 503. The DAGs 503 also function as a collection controller and manage the distribution of the registers to address the contents of the datastore 511 (e.g., RA0, RA1, RA2, RA3, WA0, and WA1). A crossbar 516 is coupled to allocate the output data ports R0, R1, R2, and R3, in any order or combination, into the vector datapath 504 to implement a given instruction. The output of the vector datapath 504 can be fed back into the datastore 511, as indicated. A constant RAM 517 is used to provide frequently used operands from the integer unit 505 to the vector datapath 504 and the datastore 511.

FIG. 6 shows a diagram depicting a plurality of banks 601 through 604 of a memory 600 and a layout of a datastore 610 having a symmetrical array of tiles, in accordance with one embodiment of the present invention. As depicted in FIG. 6, only a portion of the datastore 610 is shown, for explanatory purposes. The datastore 610 logically comprises an array (or arrays) of tiles. Each tile is an array of sub-tiles of 4x4 shape. Physically, as shown by the memory 600, the datastore 610 is stored in an array of "N" physical banks of memory (e.g., banks 601 through 604). Additionally, the datastore 610 visually depicts a logical tile in a stream. In the FIG. 6 embodiment, this tile is 16 bytes high and 16 bytes wide. This tile is an array of sub-tiles (4x4 in this example). Each sub-tile is stored in a physical bank. This is shown in FIG. 6 by the number within each 4x4 sub-tile, for a case where there are 8 banks of physical memory (e.g., banks 0 through 7). The sub-tiles are organized in the banks such that no bank is shared within any 2x2 arrangement of sub-tiles. This makes any unaligned access (e.g., in both the x and y directions) possible without any bank collision.

The banks 601 through 604 are configured to support accesses to different tiles of each bank. For example, in one case, the crossbar 516 can access a 2x4 set of tiles from bank 601 (e.g., the first two rows of bank 601). In another case, the crossbar 516 can access a 1x8 set of tiles from two adjacent banks. Similarly, in another case, the crossbar 516 can access an 8x1 set of tiles from two adjacent banks. In each case, the DAGs/collector 503 can receive the tiles as the banks are accessed by the crossbar 516 and provide those tiles to the front end of the vector datapath 504 on a per-clock basis.

In this manner, embodiments of the present invention provide a new video processor architecture that supports sophisticated video processing functions while making efficient use of integrated circuit silicon die area, transistor count, memory speed requirements, and the like. Embodiments of the present invention maintain high compute density and are readily scalable to handle multiple video streams. Embodiments of the present invention can provide a number of sophisticated video processing operations, such as MPEG-2/WMV9/H.264 encode assist (e.g., an in-loop decoder), MPEG-2/WMV9/H.264 decode (e.g., post entropy decoding), and in-loop/out-of-loop deblocking filters. Additional video processing operations provided by embodiments of the present invention include, for example, advanced motion-adaptive deinterlacing, input noise filtering for encoding, polyphase scaling/resampling, and sub-picture compositing. The video processor of the present invention can also be used for certain video processor-amplifier (procamp) applications, such as color space conversion, color space adjustments, pixel point operations (such as sharpening, histogram adjustment, etc.), and various video surface format conversions.

Broadly, and without limitation, this writing has disclosed the following. A latency-tolerant system for executing video processing operations is described. The system includes a host interface for implementing communication between the video processor and a host CPU; a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations; and a vector execution unit coupled to the host interface and configured to execute vector video processing operations. A command FIFO is included for enabling the vector execution unit to operate on a demand-driven basis by accessing the memory command FIFO. A memory interface is included for implementing communication between the video processor and a frame buffer memory. A DMA engine is built into the memory interface for implementing DMA transfers between a plurality of different memory locations and for loading the command FIFO with data and instructions for the vector execution unit. A video processor for executing video processing operations is described. The video processor includes a host interface for implementing communication between the video processor and a host CPU. A memory interface is included for implementing communication between the video processor and a frame buffer memory. A scalar execution unit is coupled to the host interface and the memory interface and is configured to execute scalar video processing operations. A vector execution unit is coupled to the host interface and the memory interface and is configured to execute vector video processing operations. A multidimensional datapath processing system for a video processor for executing video processing operations is described. The video processor includes

純量視訊處理操作之純量執行單元，及一配置以執行向視訊處理操作之向量執行單元。一資料儲存記憶體係包以儲存向量執行單元之資料。該資料儲存記憶體包括複個具有配置成陣列之對稱記憶庫資料結構的分塊。該記庫資料結構係配置以支援存取至各記憶庫之不同分塊。揭示一種用於執行視訊操作之視訊處理器的基於流之記體存取系統。該視訊處理器包括一配置以執行純量視訊理操作之純量執行單元，及一配置以執行向量視訊處理作之向量執行單元。所包括的一圖框緩衝記憶體係用以存純量執行單元和向量執行單元之資料。一記憶體介面包括以建立在純量執行單元和向量執行單元及圖框緩衝憶體間之通信。該圖框緩衝記憶體至少包含複數個分塊記憶體介面實施分塊之第一順序存取，且實施一至少包一用於向量執行單元或純量執行單元之分塊的第二順序取之第二流。本發明之特定具體實施例的前揭說明已就示範和說目的加以呈現。其等並非意於毫無遺漏或以所揭精確形限制本發明，且根據以上教示可有許多修改和變化。具實施例被選定且說明以最佳地解說本發明之原理及其實應用，因而使熟習此項技術人士得以最佳地利用本發明且具有各種修改之各種具體實施例係適於所涵蓋之特定途。本發明之範疇預期係由隨附申請專利範圍及其同等界定。量括數憶已憶處操儲係記 0 含存明式體際 > 用者 33 1327434 【圖式簡單說明】本發明係藉由附圖之圖式舉例說明而不受其限制，而其中相同參考元件號碼指類似元件，且其中： * 第1圖顯示說明依據本發明之一具體實施例的電腦系統之基本組件之示意圖。第2圖顯示描述依據本發明之一具體實施例的視訊處理單元之内部組件的圖式。第3圖顯示用於依據本發明之一具體實施例的視訊處理器 • 之示範性軟體程式的圖式。第4圖顯示使用視訊處理器且依據本發明之一具體實施例與視訊混合之子圖片的實例。第5圖顯示描述依據本發明之一具體實施例的向量執行之内部組件的圖式。第6圖顯示一依據本發明之一具體實施例具有對稱的分塊陣列之資料儲存記憶體的布局之圖式。A scalar execution unit of a scalar video processing operation, and a vector execution unit configured to perform a video processing operation. A data storage memory system package stores the data of the vector execution unit. The data storage memory includes a plurality of blocks having a symmetric memory data structure configured in an array. The library data structure is configured to support access to different partitions of each bank. A stream-based document access system for a video processor for performing video operations is disclosed. The video processor includes a scalar execution unit configured to perform a scalar video operation and a vector execution unit configured to perform vector video processing. A frame buffer memory system is included for storing data of scalar execution units and vector execution units. A memory interface includes communication between the scalar execution unit and the vector execution unit and the frame buffer memory. 
The frame buffer memory includes at least a plurality of block memory interfaces to perform a first sequential access of the block, and a second sequence of at least one block for the vector execution unit or the scalar execution unit is implemented. Second stream. The foregoing description of the specific embodiments of the invention has been presented The invention is not intended to be exhaustive or to limit the invention. The embodiment was chosen and described in order to best explain the principles of the invention way. The scope of the invention is intended to be defined by the scope of the appended claims and their equivalents. Quantitative Recollections have been recalled in the Department of Operational Memory 0. Included in the context of the syllabus> User 33 1327434 [Simplified illustration of the drawings] The present invention is exemplified by the drawings of the drawings without being limited thereto, and wherein The same reference element numbers refer to like elements, and wherein: * Figure 1 shows a schematic diagram illustrating the basic components of a computer system in accordance with an embodiment of the present invention. Figure 2 is a diagram showing the internal components of a video processing unit in accordance with an embodiment of the present invention. Figure 3 shows a diagram of an exemplary software program for a video processor in accordance with an embodiment of the present invention. Figure 4 shows an example of a sub-picture mixed with video using a video processor in accordance with an embodiment of the present invention. Figure 5 shows a diagram depicting the internal components of vector execution in accordance with an embodiment of the present invention. Figure 6 is a diagram showing the layout of a data storage memory having a symmetric block array in accordance with an embodiment of the present invention.

[Description of Main Element Symbols]

100 computer system
101 CPU
105 bridge component
110 graphics processor unit (GPU)
111 video processor unit (VPU)
112 display
115 system memory
120 component
201 scalar execution unit
202 vector execution unit
203 memory interface
204 host interface
205 frame buffer memory
210 scalar processor
211 instruction cache
212 data cache
213 vector interface unit
214 synchronization mailbox
220 vector control unit
221 vector datapath
222 instruction cache
223 data store memory
225 command FIFO
231 second vector datapath
233 second data store
300 software program
301 scalar control thread
302 vector data thread
401 DMA engine
403 stream
412 video surface/Luma parameters
413 video surface/Chro
414 sub-picture pixel elements
415 ARGBOUT
421 line
422 line
423 line
424 line
425 line
430 subroutine
501 instruction decoder
502 VLIW instruction
503 address generator
504 vector datapath
505 integer unit
506 branch unit
510 legend
511 data store
515 address register file unit
516 crossbar
517 constant RAM
600 memory
601-604 banks
610 data store

Claims

1. A system for executing video processing operations, the system comprising:
a scalar execution unit configured to execute scalar video processing operations;
a vector execution unit configured to execute vector video processing operations; and
a data store memory configured to store data for the vector execution unit, wherein the data store memory comprises a plurality of tiles, the tiles comprising bank data structures arranged in an array, and wherein the bank data structures are configured to support accesses to different tiles of each bank.

2. The system of claim 1, wherein the system is a multidimensional datapath processing system for executing video processing operations.

3. A system for multidimensional datapath processing to support video processing operations, comprising:
a motherboard;
a host CPU coupled to the motherboard; and
a video processor coupled to the motherboard and coupled to the CPU, and including the system of claim 1.

4. The system of claim 1, 2 or 3, wherein each of the bank data structures comprises a plurality of tiles arranged in a 4x4 pattern.

5. The system of claim 1, 2 or 3, wherein each of the bank data structures comprises a plurality of tiles arranged in an 8x8, 8x16 or 16x24 pattern.

6. The system of claim 1, 2 or 3, wherein the bank data structures are configured to support accesses to different tiles of each bank data structure, and wherein at least one access is to two adjacent bank data structures and comprises a row of tiles of the two adjacent bank data structures.

7. The system of claim 1, wherein the tiles are configured to support accesses to different tiles of each bank data structure, and wherein at least one access is to two adjacent bank data structures and comprises a column of tiles of the two adjacent bank data structures.

8. The system of claim 1, 2 or 3, further comprising:
a crossbar coupled to the data store memory for selecting a configuration for accessing tiles of the plurality of bank data structures.

9. The system of claim 8, wherein the crossbar accesses the tiles of the plurality of bank data structures to supply data to a vector datapath on a per clock basis.

10. The system of claim 9, further comprising a collector for receiving the tiles of the plurality of bank data structures accessed by the crossbar and providing the tiles to a front end of the vector datapath on a per clock basis.

11. A video processor for executing video processing operations, comprising:
a host interface for implementing communication between the video processor and a host CPU;
a memory interface for implementing communication between the video processor and a frame buffer memory;
a scalar execution unit coupled to the host interface and the memory interface and configured to execute scalar video processing operations; and
a vector execution unit coupled to the host interface and the memory interface and configured to execute vector video processing operations.

12. The video processor of claim 11, wherein the scalar execution unit functions as a controller of the video processor and controls the operation of the vector execution unit.

13. The video processor of claim 11, further comprising a vector interface unit for interfacing the scalar execution unit with the vector execution unit.

14. The video processor of claim 11, wherein the scalar execution unit and the vector execution unit are configured to operate asynchronously.

15. The video processor of claim 14, wherein the scalar execution unit executes at a first clock frequency and the vector execution unit executes at a second clock frequency.

16. The video processor of claim 11, wherein the scalar interface unit is configured to execute flow control algorithms of an application and the vector execution unit is configured to execute pixel processing operations of the application.

17. The video processor of claim 16, wherein the vector execution unit is configured to operate on a demand driven basis under the control of the scalar execution unit.

18. The video processor of claim 16, wherein the scalar execution unit is configured to transfer function calls to the vector execution unit using a command FIFO, and wherein the vector execution unit operates on a demand driven basis by accessing the command FIFO.

19. The video processor of claim 16, wherein the asynchronous operation of the video processor is configured to support a separate or independent update of a vector subroutine or a scalar subroutine of the application.

20. The video processor of claim 11, wherein the scalar execution unit is configured to operate using VLIW (very long instruction word) code.

21. A stream based memory access system for a video processor for executing video processing operations, the system comprising:
a scalar execution unit configured to execute scalar video processing operations;
a vector execution unit configured to execute vector video processing operations;
a frame buffer memory for storing data for the scalar execution unit and the vector execution unit; and
a memory interface for establishing communication between the scalar execution unit, the vector execution unit, and the frame buffer memory, wherein the frame buffer memory comprises a plurality of tiles, and wherein the memory interface implements a first stream comprising a first sequential access of tiles and implements a second stream comprising a second sequential access of tiles for the vector execution unit or the scalar execution unit.

22. A system for executing stream based memory accesses to support video processing operations, comprising:
a motherboard;
a host CPU coupled to the motherboard; and
a video processor coupled to the motherboard and coupled to the CPU, the video processor comprising:
a host interface for establishing communication between the video processor and the host CPU;
a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations;
a vector execution unit coupled to the host interface and configured to execute vector video processing operations; and
a memory interface coupled to the scalar execution unit and the vector execution unit and configured to establish stream based communication between the scalar execution unit, the vector execution unit, and a frame buffer memory, wherein the frame buffer memory comprises a plurality of tiles, and wherein the memory interface implements a first stream comprising a first sequential access of tiles and implements a second stream comprising a second sequential access of tiles for the vector execution unit or the scalar execution unit.

23. The system of claim 2, wherein the first stream and the second stream include at least one prefetched tile.

24. The system of claim 2, wherein the first stream originates from a first location in the frame buffer memory, and the second stream originates from a second location in the frame buffer memory.

25. The system of claim 21, wherein the memory interface is configured to manage a plurality of streams from a plurality of different originating locations and to a plurality of different terminating locations.

26. The system of claim 25, wherein at least one of the originating locations or at least one of the terminating locations is in a system memory.

27. The system of claim 21 or 22, further comprising:
a DMA engine built into the memory interface and configured to implement a plurality of memory reads to support the first stream and the second stream, and to implement a plurality of memory writes to support the first stream and the second stream.

28. The system of claim 21 or 22, wherein the first stream experiences a higher amount of latency than the second stream, and wherein the first stream incorporates a larger number of buffers for storing tiles than the second stream.

29. The system of claim 2, wherein the memory interface is configured to prefetch an adjustable number of tiles of the first stream or the second stream to compensate for latency of the first stream or the second stream.

30. A system for executing video processing operations, the system comprising:
a host interface for implementing communication between the video processor and a host CPU;
a scalar execution unit coupled to the host interface and configured to execute scalar video processing operations;
a vector execution unit coupled to the host interface and configured to execute vector video processing operations;
a command FIFO for enabling the vector execution unit to operate on a demand driven basis by accessing the command FIFO;
a memory interface for implementing communication between the video processor and a frame buffer memory; and
a DMA engine for implementing DMA transfers between a plurality of different memory locations and for loading a data store memory and an instruction cache with data and instructions for the vector execution unit.

31. The system of claim 30, wherein the system is a latency tolerant system for executing video processing operations.

32. The system of claim 3, further comprising:
a motherboard;
a host CPU coupled to the motherboard; and
a video processor coupled to the motherboard and coupled to the CPU.

33. The system of claim 30, 31 or 32, wherein the vector execution unit is configured to operate asynchronously with respect to the scalar execution unit by accessing the command FIFO to operate on the demand driven basis.

34. The system of claim 30, 31 or 32, wherein the demand driven basis is configured to hide a latency of a data transfer from the different memory locations to the command FIFO of the vector execution unit.

35. The system of claim 30, 31 or 32, wherein the scalar execution unit is configured to implement algorithm flow control processing, and wherein the vector execution unit is configured to implement a majority of the video processing workload.

36. The system of claim 35, wherein the scalar execution unit is configured to pre-compute work parameters for the vector execution unit to hide a data transfer latency.

37. The system of claim 30, wherein the vector execution unit is configured to schedule a memory read via the DMA engine to prefetch commands for subsequent execution of a vector subroutine.

38. The system of claim 37, wherein the memory read is scheduled to prefetch the commands for the execution of the vector subroutine prior to calls to the vector subroutine by the scalar execution unit.

39. The system of claim 32, wherein the vector execution unit is configured to schedule a memory read via the DMA engine to prefetch commands for subsequent execution of a vector subroutine, and wherein the memory read is scheduled to prefetch the commands for the execution of the vector subroutine prior to calls to the vector subroutine by the scalar execution unit.

40. A system for executing video processing operations, comprising:
a motherboard;
a host CPU coupled to the motherboard; and
the video processor of claim 11, coupled to the motherboard and coupled to the CPU.

41. The system of claim 40, wherein the scalar execution unit executes at a first clock frequency and the vector execution unit executes at a second clock frequency.

42. The system of claim 40, wherein the scalar interface unit is configured to execute flow control algorithms of an application and the vector execution unit is configured to execute pixel processing operations of the application.

43. The system of claim 4, wherein the scalar execution unit is configured to transfer function calls to the vector execution unit using a command FIFO, and wherein the vector execution unit operates on a demand driven basis by accessing the command FIFO.

44. The system of claim 42, wherein the asynchronous operation of the video processor is configured to support a vector subroutine of the application

Figure 3: software program 300, scalar control thread 301

parse_methods();
setup_vector_engine();
while (!end_of_alg & !cmd_fifo_full) {
    compute_param_for_tile(i);
}
send_params_to_vector();
// mem RD
fetch_data_to_vector();
// initiate RPC call on vector
exe_vec_func(&vector_funcB);
// mem write
write_data_from_vector();

302 Vector data thread
vector_funcA { ... }
vector_funcB { ... }
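The scalar control thread of Figure 3 keeps preparing work while the command FIFO has room, and the vector execution unit is described as operating on a demand driven basis by reading that FIFO. The following is a minimal C sketch of such a queue, not the patented hardware: the depth of 8, the 32-bit command word, and every identifier (CommandFifo, fifo_push, fifo_pop, and so on) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum { FIFO_DEPTH = 8 };  /* assumed depth; the source does not give one */

/* Hypothetical software model of the command FIFO (element 225). */
typedef struct {
    uint32_t entries[FIFO_DEPTH];  /* assumed 32-bit command words */
    unsigned head;                 /* index of the oldest queued command */
    unsigned count;                /* number of commands currently queued */
} CommandFifo;

bool fifo_full(const CommandFifo *f)  { return f->count == FIFO_DEPTH; }
bool fifo_empty(const CommandFifo *f) { return f->count == 0; }

/* Producer side (scalar thread): refuses the command when the FIFO is
 * full, mirroring the !cmd_fifo_full test in the Figure 3 loop. */
bool fifo_push(CommandFifo *f, uint32_t cmd) {
    if (fifo_full(f)) {
        return false;
    }
    f->entries[(f->head + f->count) % FIFO_DEPTH] = cmd;
    f->count++;
    return true;
}

/* Consumer side (vector thread): takes the oldest command only when one
 * is available, which is the demand driven behaviour being modelled. */
bool fifo_pop(CommandFifo *f, uint32_t *cmd) {
    if (fifo_empty(f)) {
        return false;
    }
    *cmd = f->entries[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

In this model the producer never blocks the consumer and vice versa: each side only checks the queue state, which is one simple way two units running at different clock rates can stay loosely coupled.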

Figure 6: data store 610. Bank number of each sub-tile, row by row:
0 1 4 5
2 3 6 7
4 5 0 1
6 7 2 3
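The description states that no bank is shared within any 2x2 arrangement of sub-tiles, so an unaligned access never hits the same bank twice. The C sketch below checks that property for the sixteen bank numbers listed for Figure 6. It assumes those numbers read row by row as a 4x4 grid and that the same pattern repeats in neighbouring tiles; both are assumptions, and all identifiers are hypothetical.

```c
#include <stdbool.h>

enum { TILE_DIM = 4, NUM_BANKS = 8 };

/* Bank number of each sub-tile, taken from the Figure 6 numbers and
 * assumed to be laid out row by row. */
static const int kBankOfSubtile[TILE_DIM][TILE_DIM] = {
    {0, 1, 4, 5},
    {2, 3, 6, 7},
    {4, 5, 0, 1},
    {6, 7, 2, 3},
};

/* Bank holding the sub-tile at (row, col), for non-negative coordinates.
 * The modulo models the assumed repetition across adjacent tiles. */
int bank_of(int row, int col) {
    return kBankOfSubtile[row % TILE_DIM][col % TILE_DIM];
}

/* True when the 2x2 window of sub-tiles whose top-left corner is
 * (row, col) touches four different banks. */
bool window_is_conflict_free(int row, int col) {
    bool used[NUM_BANKS] = {false};
    for (int dr = 0; dr < 2; ++dr) {
        for (int dc = 0; dc < 2; ++dc) {
            int bank = bank_of(row + dr, col + dc);
            if (used[bank]) {
                return false;  /* two sub-tiles of the window share a bank */
            }
            used[bank] = true;
        }
    }
    return true;
}

/* Checks every window position, including those that wrap into the
 * neighbouring tile. */
bool layout_is_conflict_free(void) {
    for (int row = 0; row < TILE_DIM; ++row) {
        for (int col = 0; col < TILE_DIM; ++col) {
            if (!window_is_conflict_free(row, col)) {
                return false;
            }
        }
    }
    return true;
}
```

An access the size of one sub-tile that is misaligned in x and y overlaps at most a 2x2 group of sub-tiles, which is why checking every 2x2 window is sufficient for this sketch.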