TW200803528A

TW200803528A - Video encoders and graphic processing units

Info

Publication number: TW200803528A
Application number: TW096122002A
Authority: TW
Inventors: Zahid Hussain; Kiumars Sabeti
Original assignee: Via Tech Inc
Priority date: 2006-06-16
Filing date: 2007-06-15
Publication date: 2008-01-01
Also published as: TWI444047B; CN101068365A; CN101068364B; CN101068364A; TWI348654B; TWI395488B; CN101083763A; CN101072351A; TWI350109B; CN101083764B; TW200816820A; CN101072351B; TW200803527A; TW200816082A; CN101083763B; TW200821986A; TWI383683B; CN101083764A; CN101068353A; CN101068353B

Abstract

An exemplary graphics processing unit (GPU) comprises an instruction decoder and a video processing unit (VPU). The instruction decoder is configured to decode a plurality of deblocking filter acceleration instructions. The deblocking filter acceleration instructions are associated with a deblocking filter used by a particular video decoder. The VPU is configured to receive parameters encoded by the deblocking filter acceleration instructions. The VPU is further configured to determine one of a plurality of first pixel data sources from the received parameters, and to determine one of a plurality of second pixel data sources from the received parameters. The VPU is further configured to load a first block of pixel data from the determined first pixel data source, and to load a second block of pixel data from the determined second pixel data source.

Description


IX. Description of the Invention

[Technical Field]

The present invention relates to image compression and decompression, and more particularly to a graphics processing unit having image compression and decompression features.

[Prior Art]

Personal computers and consumer electronics products are used for a variety of entertainment purposes. These uses can be roughly divided into two categories: those that use computer-generated graphics, such as computer games; and those that use compressed video streams, such as programs pre-recorded onto digital video discs (DVDs), or digital programming delivered to a set-top box by a cable television or satellite operator.
The second category also includes the encoding of analog video streams, for example as performed by a digital video recorder (DVR).

Computer-generated graphics are usually produced by a graphics processing unit (GPU). A GPU is a specialized microprocessor found in computer game consoles and some personal computers. A GPU is optimized to quickly render three-dimensional primitive objects such as triangles, quadrilaterals, and the like. These primitives are described by a plurality of vertices, each of which has attributes (e.g., color), and textures can be applied to the primitives. The result of rendering is a two-dimensional array of pixels, shown on a computer display or monitor.

The encoding and decoding of video streams involves different kinds of computation, for example the discrete cosine transform, motion estimation, motion compensation, and deblocking filtering. These computations are typically handled by a general-purpose central processing unit (CPU) in combination with specialized hardware logic such as application-specific integrated circuits (ASICs). Consumers therefore need multiple computing platforms to meet their entertainment needs. There is thus a need for a single computing platform that can handle both computer graphics and video encoding/decoding.

[Summary of the Invention]

Embodiments disclosed herein provide systems and methods for video compression deblocking. An exemplary graphics processing unit (GPU) comprises an instruction decoder and a video processing unit. The instruction decoder is configured to decode a plurality of deblocking filter acceleration instructions. The deblocking filter acceleration instructions are associated with a deblocking filter used by a particular video decoder. The video processing unit is configured to receive parameters encoded by the deblocking filter acceleration instructions.
The video processing unit is further configured to determine one of a plurality of first pixel data sources from the received parameters. The video processing unit is further configured to determine one of a plurality of second pixel data sources from the received parameters. The video processing unit is further configured to load a first block of pixel data from the determined first pixel data source. The video processing unit is further configured to load a second block of pixel data from the determined second pixel data source.

[Embodiments]

Computing Platform for Video Coding

FIG. 1 is a block diagram of an exemplary computing platform for graphics and video encoding and/or decoding. System 100 includes a general-purpose CPU 110 (hereinafter the host processor), a graphics processing unit (GPU) 120, memory 130, and a bus 140. GPU 120 includes a video acceleration unit (VPU) 150, which accelerates video encoding and/or decoding, as described later. The video acceleration functions of GPU 120 are available as instructions that execute on GPU 120.

A software decoder 160 and a video acceleration driver 170 reside in memory 130, and at least part of decoder 160 and video acceleration driver 170 executes on host processor 110. Through a host processor interface 180 provided by video acceleration driver 170, decoder 160 can also issue video acceleration instructions to GPU 120. In this way, system 100 performs video encoding and/or decoding through host processor software that issues video acceleration instructions to GPU 120, and GPU 120 responds to these instructions by accelerating portions of decoder 160.
In some embodiments, only a small portion of decoder 160 executes on host processor 110, while most of decoder 160 is executed by GPU 120 with very little driver overhead. In this way, frequently executed, computationally intensive blocks are offloaded to GPU 120, while the more complex operations are performed by host processor 110.

In some embodiments, one computationally intensive function implemented by VPU 150 within GPU 120 is in-loop deblocking filter (IDF) hardware acceleration logic 400, also referred to as in-loop deblocking filter 400 or deblocking filter 400. Some embodiments of VPU 150 include multiple instances of in-loop deblocking filter hardware acceleration logic, for example filters conforming to different coding standards such as VC-1 and H.264. One example is the embodiment shown in FIG. 1, in which VPU 150 includes H.264 in-loop deblocking filter hardware acceleration logic 170 and VC-1 in-loop deblocking filter hardware acceleration logic 400 (described later in connection with FIG. 4). Another example of a computationally intensive function is determining the boundary strength (BS) of each filter.

The architecture described above therefore allows flexibility in operation: decoder 160 can execute on host processor 110, with certain special functions (such as deblocking or computing boundary strength) performed by running a shader program on macroblocks; or most of decoder 160 can execute on GPU 120, taking advantage of pipelining and parallelization.
In some embodiments in which decoder 160 executes on GPU 120, the deblocking process is a thread that is synchronized with the other aspects of decoder 160.

FIG. 1 omits a number of conventional components, well known to those skilled in the art, that are not necessary to explain the video acceleration features of GPU 120.

Video Decoder

FIG. 2 is a block diagram of the video decoder 160 of FIG. 1. In the particular embodiment shown in FIG. 2, decoder 160 implements the ITU H.264 video compression specification. However, those skilled in the art will appreciate that the decoder 160 of FIG. 2 is a basic representation of a video decoder, and that it also illustrates the operation of other decoders similar to H.264, such as SMPTE VC-1 and other specifications. Furthermore, although decoder 160 is shown as part of GPU 120, those skilled in the art will also appreciate that portions of the decoder 160 disclosed herein can be implemented outside a GPU, for example as standalone logic, as part of an application-specific integrated circuit (ASIC), and so on.

The incoming bitstream 205 is first processed by an entropy decoder 210. Entropy coding takes advantage of statistical redundancy: some patterns occur more often than others, so the more common patterns are represented by shorter codes. Entropy coding includes Huffman coding and run-length encoding. After entropy decoding, the data is processed by a spatial decoder 215, which takes advantage of the fact that neighboring pixels in a picture are often identical or related, so that only the differences need to be encoded. In this exemplary embodiment, spatial decoder 215 includes an inverse quantizer 220 and an inverse discrete cosine transform (IDCT) function 230.
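As a small illustration of the run-length encoding mentioned above in connection with entropy decoder 210, the following Python sketch expands (value, run length) pairs into a flat sequence. The pair representation and the function name `run_length_decode` are assumptions chosen for illustration only; they are not the bitstream syntax of H.264 or VC-1, whose entropy coding is considerably more involved.

```python
from typing import Iterable, List, Tuple


def run_length_decode(pairs: Iterable[Tuple[int, int]]) -> List[int]:
    """Expand (value, run_length) pairs into a flat list of values.

    Hypothetical helper: long runs of a repeated value are stored as a
    single pair, which is why run-length coding shortens repetitive data.
    """
    out: List[int] = []
    for value, run in pairs:
        if run < 0:
            raise ValueError("run length must be non-negative")
        out.extend([value] * run)
    return out
```

For example, three pairs such as `(0, 3)`, `(7, 1)`, `(0, 2)` stand in for six decoded values.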
The output of IDCT function 230 can be regarded as a picture (235) composed of pixels. Picture 235 is processed in smaller sub-units called macroblocks. The H.264 video compression specification uses a macroblock size of 16x16 pixels, while other compression specifications may use other sizes. Macroblocks within picture 235 are combined with information from previously decoded pictures, a process called inter prediction, or with information from other macroblocks of picture 235, a process called intra prediction. The incoming bitstream 205 is decoded by entropy decoder 210, and inter or intra prediction is applied according to the type of each picture.

When inter prediction is applied, entropy decoder 210 produces a motion vector 245 as output. Motion vector 245 is used for temporal coding, which takes advantage of the fact that many pixels often keep the same value across a sequence of pictures. The change from one picture to another is encoded as motion vector 245. Motion compensation block 250 combines one or more previously decoded pictures 255 with motion vector 245 to produce a predicted picture (265). When intra prediction is applied, spatial compensation block 270 combines information from neighboring macroblocks with the macroblocks within picture 235 to produce a predicted picture (275).

Combiner 280 adds picture 235 to the output of mode selector 285. Mode selector 285 uses the entropy-decoded bitstream to determine whether combiner 280 uses the predicted picture (265) produced by motion compensation block 250 or the predicted picture (275) produced by spatial compensation block 270.

The encoding process introduces artifacts such as discontinuities along macroblock edges and along the edges of sub-blocks within macroblocks.
The result is that "edges" appear in the decoded frame where none originally existed. Deblocking filter 290 is applied to the combined picture output by combiner 280 to remove these edge artifacts. The decoded picture 295 produced by the deblocking filter is stored and used to decode subsequent pictures.

As discussed in connection with FIG. 1, part of decoder 160 executes on host processor 110, while decoder 160 also takes advantage of the video acceleration instructions provided by GPU 120. In particular, in some embodiments, deblocking filter 290 uses one or more instructions provided by GPU 120 to implement filtering at relatively low computational cost.

Deblocking filter 290 is a multi-tap filter that adjusts the pixel values at sub-block edges based on neighboring pixel values. Different embodiments of deblocking filter 290 can be used depending on the compression specification implemented by decoder 160. Each specification uses different filter parameters, for example the size of the sub-blocks, the number of pixels updated by the filtering operation, and how often the filter is applied (e.g., every N rows or every M columns). In addition, each specification uses a different filter length structure. Those skilled in the art will understand multi-tap filters, so the structure of specific taps is not discussed here.

Deblocking Filter

An embodiment of the deblocking filter specified by the VC-1 specification will be described in connection with FIG. 4. First, the sub-block pixel arrangement for the VC-1 filter is described in connection with FIG. 3. FIG. 3 shows two adjacent 4x4 sub-blocks (310, 320), defined by rows R1-R4 and columns C1-C8.
The vertical boundary 330 between these two sub-blocks lies along columns C4 and C5. The VC-1 filter operates on each 4x4 sub-block. For the leftmost sub-block, the VC-1 filter examines a predetermined group of pixels (P1, P2, P3) in a predetermined row (R3). If the predetermined group of pixels meets a particular criterion, another pixel in the same predetermined row, P4, is updated. The criterion is determined by a particular set of calculations and comparisons on the pixels in the predetermined group. Those skilled in the art will appreciate that these calculations and comparisons can also be viewed as a set of taps; the detailed calculations and comparisons are discussed later in connection with FIG. 5. The updated value is also based on operations performed on the pixels in the predetermined group.

The VC-1 filter processes the rightmost sub-block in an analogous manner, determining whether pixels P6, P7, P8 meet a criterion and updating P5 if they do. In other words, the VC-1 filter computes values for a group of predetermined pixels in a predetermined row (R3), namely the edge pixels P4 and P5, based on the values of other predetermined pixels in the same row: the value of P4 depends on P1, P2, P3, and the value of P5 depends on P6, P7, P8.

The VC-1 filter conditionally updates the same predetermined pixels in the remaining rows, depending on the values computed for the predetermined pixels (edge pixels P4, P5) of the predetermined row (R3). Thus P4 in R1 is updated based on P1, P2, P3 in R1, but only if P4 and P5 in R3 have been updated. Similarly, P5 in R1 is updated based on P6, P7, P8 in R1, but only if P4 and P5 in R3 have been updated. Rows 2 and 4 are handled in a similar manner.

Viewed another way, some pixels in a predetermined third row are filtered, or updated, when other pixels in the third row meet a criterion.
The filter involves performing comparisons and calculations on these other pixels. If the other pixels in the third row meet the criterion, the corresponding pixels in the remaining rows are filtered in an analogous manner, as described above. Some embodiments of the deblocking filter 290 disclosed herein use an inventive technique that filters the third row first and then filters the other rows. These inventive techniques are described in more detail in connection with FIGS. 4, 5, and 6A-6D.

Although FIG. 3 illustrates row-by-row processing of a vertical edge, those skilled in the art will appreciate that the same figure rotated 90 degrees illustrates column-by-column processing of a horizontal edge. Those skilled in the art will also appreciate that, although VC-1 uses the third of four rows as the predetermined row that decides the conditional update of the other rows, the principles disclosed herein also apply to embodiments that use a different predetermined row (e.g., the first row, the second row, etc.), and to embodiments in which sub-blocks have a different number of rows. Similarly, those skilled in the art will appreciate that, although VC-1 examines the values of a neighboring group of pixels to set the value of the pixel to be updated, the principles disclosed herein also apply to embodiments in which other pixels are examined and other pixels are set. As one example, P2 and P3 could be examined to determine the updated value of P4. As another example, P3 could be set based on the values of P2 and P4.

Video acceleration unit 150 in GPU 120 implements hardware acceleration logic for an in-loop deblocking filter (IDF), such as the in-loop deblocking filter specified by VC-1.
A GPU instruction makes this hardware acceleration logic available, as described later.

The conventional approach to implementing the VC-1 in-loop deblocking filter processes the rows/columns in parallel, because the same pixel calculations are performed on each row/column of a sub-block. This conventional approach filters two adjacent 4x4 sub-blocks per cycle, but requires an increased gate count to do so. In contrast, the inventive approach used by VC-1 in-loop deblocking filter hardware acceleration logic 400 processes the third row/column of pixels first and, if those pixels meet the required criterion, then processes the remaining three rows/columns sequentially. This inventive approach uses fewer gates than the conventional approach, which replicates the functionality for each row/column. The sequential row processing of VC-1 IDF acceleration logic 400 filters two adjacent 4x4 sub-blocks every 4 cycles. This longer filtering time is consistent with the instruction cycle of GPU 120, whereas the conventional approach filters faster than is actually required and so wastes gates.

FIG. 4 is a listing of hardware description pseudocode for VC-1 in-loop deblocking filter hardware acceleration logic 400. Although pseudocode is used rather than an actual hardware description language (HDL) such as Verilog or VHDL, those skilled in the art will be familiar with such pseudocode. They will understand that, when expressed in an actual HDL, this code can be compiled and then synthesized into an arrangement of logic gates that makes up part of video acceleration unit 150.
They will also understand that these logic gates can be implemented in various technologies, such as an application-specific integrated circuit (ASIC), a programmable gate array (PGA), or a field-programmable gate array (FPGA).

Section 410 of the code is the module definition. VC-1 in-loop deblocking filter hardware acceleration logic 400 has a number of input parameters. The sub-block to be filtered is specified by the Block parameter. If the Vertical parameter is true, acceleration logic 400 treats the Block parameter as a 4x8 block (see FIG. 3) and performs vertical edge filtering. If the Vertical parameter is false, acceleration logic 400 treats the Block parameter as an 8x4 block (see FIG. 3) and performs horizontal edge filtering.

Section 420 of the code begins an iteration loop and sets the value of the loop parameter variable. On the first pass through the loop, the loop parameter is set to 3, so line 3 is processed first. Subsequent iterations set the loop parameter to 1, 2, and 4. Using these values, VC-1 in-loop deblocking filter hardware acceleration logic 400 iterates four times, processing 8 pixels each time, where a line can be a horizontal row or a vertical column. Each line is processed by line acceleration logic 500 (see FIG. 5). In some embodiments, line acceleration logic 500 is implemented as an HDL submodule, described in connection with FIG. 5.

Section 430 tests the Vertical parameter to determine whether to perform vertical or horizontal edge filtering. Depending on the result, the 8 elements of the line array variable are initialized from a row of the 4x8 input block or from a column of the 8x4 input block.
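The loop order of section 420 and the line extraction of section 430 might be modeled in software roughly as follows. This is an illustrative Python sketch, not the HDL pseudocode of FIG. 4; the names `LINE_ORDER` and `extract_line`, and the representation of a block as a list of rows, are assumptions made for the example.

```python
from typing import List

# Line 3 is processed first, then lines 1, 2, and 4 (1-based, as in the text).
LINE_ORDER = (3, 1, 2, 4)


def extract_line(block: List[List[int]], vertical: bool, line: int) -> List[int]:
    """Return the 8 pixels P1..P8 of one line that crosses the edge.

    vertical=True:  block has 4 rows x 8 columns; the line is a row.
    vertical=False: block has 8 rows x 4 columns; the line is a column.
    """
    i = line - 1  # convert the 1-based line number to a 0-based index
    if vertical:
        pixels = list(block[i])
    else:
        pixels = [row[i] for row in block]
    if len(pixels) != 8:
        raise ValueError("expected 8 pixels across the edge")
    return pixels
```

In both orientations the result is a flat list of eight pixels, which is what lets a single line filter serve vertical and horizontal edges alike.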
Section 440 determines whether line 3 is being processed by comparing the loop parameter (set in section 420) with 3. If the loop parameter is 3, two control variables, ProcessingPixel3 and FILTER_OTHER_3, are set to true. If the loop parameter is not 3, ProcessingPixel3 is set to false.

Section 450 instantiates another HDL module, VC1_IDC_Filter_Line, which applies the filter to the current line. (As described in connection with FIG. 3, the line filter updates edge pixel values based on neighboring pixel values.) The parameters supplied to the submodule include the control variables ProcessingPixel3 and FILTER_OTHER_3 and the loop parameter. In one embodiment, VC-1 in-loop deblocking filter hardware acceleration logic 400 has an additional input parameter, a quantization value, and this quantization parameter is also supplied to the submodule.

After the submodule processes the line, VC-1 in-loop deblocking filter hardware acceleration logic 400 continues the iteration loop at section 420 with an updated value of the loop parameter. In this way, the filter is applied to line 3 of the input block, followed by line 1, line 2, and line 4.

FIG. 5 is a listing of hardware description pseudocode for line acceleration logic 500, which implements the submodule described above. One section of the code is the module definition. Line acceleration logic 500 has a number of input parameters. The line to be filtered is defined by the Line input parameter. ProcessingPixel3 is an input parameter that higher-level logic sets to true if the line is the third row or third column. The parameter FILTER_OTHER_3 is initially set to true by higher-level logic and is adjusted by line acceleration logic 500 according to the pixel values.
Section 520 performs various pixel value computations as specified by VC-1. (Because these computations can be understood by reference to the VC-1 specification, they are not described in detail.)

Section 530 tests the ProcessingPixel3 parameter supplied by the higher-level VC-1 in-loop deblocking filter hardware acceleration logic circuit 400. If ProcessingPixel3 is true, section 530 initializes a control variable, DO_FILTER, to a default value of true. The results of the computations in section 520 are then used to determine whether the other three lines are to be processed. If the pixel computations indicate that the other three lines are not to be processed, DO_FILTER is set to false.

If ProcessingPixel3 is false, section 540 uses the input parameter FILTER_OTHER_3 (set by the higher-level VC-1 in-loop deblocking filter hardware acceleration logic circuit 400) to set the value of DO_FILTER.

Section 550 tests the DO_FILTER variable and, if it is true, updates the edge pixels P4 and P5 of the Line variable (see FIG. 3).

Section 560 tests the ProcessingPixel3 parameter and updates FILTER_OTHER_3. The FILTER_OTHER_3 variable carries state information between different instantiations of this module. If ProcessingPixel3 is true, section 560 updates the FILTER_OTHER_3 parameter with the value of DO_FILTER. With this technique, the higher-level module that instantiates this module (that is, VC1_InloopFilter) passes the FILTER_OTHER_3 value updated by one instance of the lower-level VC_1_INLOOPFILTER_LINE module to another instance of VC_1_INLOOPFILTER_LINE.

Those skilled in the art should understand that the pseudocode of FIG. 5 can be synthesized in various ways to produce an arrangement of logic gates that implements line acceleration logic circuit 500.
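The control flow of sections 530 through 560 can be sketched in Python as shown below. The sketch follows the text as written rather than the FIG. 5 listing itself, and the function names line_control and decisions_for_sub_block, as well as the boolean argument pixel_test_passed (standing in for the outcome of the section 520 computations), are hypothetical.

```python
def line_control(processing_pixel3, filter_other_3, pixel_test_passed):
    """Return (do_filter, filter_other_3) for one line of a sub-block."""
    if processing_pixel3:
        do_filter = True              # section 530: default value
        if not pixel_test_passed:
            do_filter = False         # section 530: pixel test says "do not filter"
    else:
        do_filter = filter_other_3    # section 540: reuse the decision made on line 3
    if processing_pixel3:
        filter_other_3 = do_filter    # section 560: publish the decision to later lines
    return do_filter, filter_other_3


def decisions_for_sub_block(line3_test_passed):
    """Run the four lines in the order 3, 1, 2, 4 and record each DO_FILTER."""
    filter_other_3 = True             # set to true by the higher-level logic
    decisions = {}
    for line_no in (3, 1, 2, 4):
        do_filter, filter_other_3 = line_control(
            line_no == 3, filter_other_3, line3_test_passed)
        decisions[line_no] = do_filter
    return decisions
```

Under these assumptions the decision taken on the third line is the only one that consults the pixel values, and the other three lines simply inherit it through FILTER_OTHER_3.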
One gate arrangement that implements line acceleration logic circuit 500 is illustrated in FIGS. 6A-D, which together form a block diagram of the circuit. Those skilled in the art should be familiar with the VC-1 in-loop deblocking filter algorithm and with logic circuit structures, so the components of FIGS. 6A-D are not described in detail. Instead, selected features of line acceleration logic circuit 500 are described.

Those skilled in the art should understand that the operations involved in the VC-1 in-loop deblocking filter include the following, where P1-P8 denote the pixel positions in the row or column being processed:

    A0 = (2*(P3 - P6) - 5*(P4 - P5) + 4) >> 3
    A1 = (2*(P1 - P4) - 5*(P2 - P3) + 4) >> 3
    A2 = (2*(P5 - P8) - 5*(P6 - P7) + 4) >> 3
    clip = (P4 - P5)/2

Each of the first three operations involves three subtractions, two multiplications, one addition, and one right shift. The portion of line acceleration logic circuit 500 shown in FIG. 6A computes A0, A1, and A2 sequentially on shared logic, rather than using a dedicated logic block for each of A0, A1, and A2. Avoiding duplicated logic blocks, and using multiplexers to process the inputs in sequence, reduces the gate count and/or the power consumption.

Multiplexers 605, 610, 615, and 620 select different inputs from pixel registers P1-P8 in different clock cycles, and these inputs are supplied to the shared logic blocks. Logic blocks 625 and 630 each perform a subtraction. Logic block 635 implements the multiplication by 2 as a left shift by 1 bit. The multiplication by 5 is implemented as a left shift by 2 bits followed by adder 645. Adder 650 sums the output of left shifter 635, the constant 4, and the negated output of adder 645. Finally, logic block 655 performs the right shift by 3 bits.

In the first clock cycle, an input T=1 is supplied to multiplexers 605, 610, and 615, and the value of A1 is computed and stored in register 660. In the second clock cycle, an input T=2 is supplied to multiplexers 605, 610, and 615, and the value of A2 is computed and stored in register 665. In the third clock cycle, an input T=3 is supplied to multiplexers 605, 610, and 615, and the value of A0 is computed and stored in register 670. The values A1, A2, and A0 stored in registers 660, 665, and 670 are used by the portion of line acceleration logic circuit 500 shown in FIG. 6B, which is described below. The output of the P4 register (671) and the output of the P5 register (673) are used by the portion of line acceleration logic circuit 500 shown in FIG. 6C, which is also described below.

Those skilled in the art should also understand that the VC-1 in-loop deblocking filter involves the following additional operations:

    D = 5*((sign(A0) * A3) - A0)/8
    if (CLIP > 0) {
        if (D < 0)
            D = 0
        if (D > CLIP)
            D = CLIP
    } else {
        if (D > 0)
            D = 0
        if (D < CLIP)
            D = CLIP
    }

The portion of line acceleration logic circuit 500 shown in FIG. 6B receives its inputs from the portion shown in FIG. 6A and computes D (675). Referring again to FIG. 6A, CLIP (677) is produced as follows: pixels P4 and P5 are subtracted by logic block 679, and the result is shifted right by logic block 680 (an integer divide by 2) to produce CLIP 677.
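The arithmetic for one line can be sketched in Python as follows, as a plain software model rather than a description of the multiplexed hardware of FIGS. 6A and 6B. Several points are assumptions that the text does not settle: A3 is not defined here, so the sketch uses A3 = min(|A1|, |A2|) as in the VC-1 specification; the "/" operator is taken to truncate toward zero; sign(0) is treated as +1; and the names trunc_div, vc1_line_terms, and apply_d are hypothetical.

```python
def trunc_div(a, b):
    """Integer division truncating toward zero (assumed meaning of '/')."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q


def vc1_line_terms(p):
    """Compute (A0, A1, A2, A3, clip, D) for pixels P1..P8 given as p[0]..p[7]."""
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    a0 = (2 * (p3 - p6) - 5 * (p4 - p5) + 4) >> 3
    a1 = (2 * (p1 - p4) - 5 * (p2 - p3) + 4) >> 3
    a2 = (2 * (p5 - p8) - 5 * (p6 - p7) + 4) >> 3
    clip = trunc_div(p4 - p5, 2)
    a3 = min(abs(a1), abs(a2))          # assumed definition of A3
    sign_a0 = -1 if a0 < 0 else 1       # assumed: sign(0) is +1
    d = trunc_div(5 * (sign_a0 * a3 - a0), 8)
    if clip > 0:
        d = min(max(d, 0), clip)        # if D < 0: D = 0; if D > CLIP: D = CLIP
    else:
        d = max(min(d, 0), clip)        # if D > 0: D = 0; if D < CLIP: D = CLIP
    return a0, a1, a2, a3, clip, d


def apply_d(p, d):
    """Return a copy of the line with P4 = P4 - D and P5 = P5 + D."""
    out = list(p)
    out[3] -= d
    out[4] += d
    return out
```

For a flat step edge such as 10, 10, 10, 10, 30, 30, 30, 30, the sketch moves P4 and P5 toward each other, which is the smoothing effect the filter is meant to have across a block boundary.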
Returning to FIG. 6B, A1 is available from register 660 in the first cycle, A2 is available from register 665 in the second cycle, and A0 is available from register 670 in the third cycle. Thus, in the fourth cycle, the portion of line acceleration logic circuit 500 shown in FIG. 6B computes D (675) according to the equations above. Line acceleration logic circuit 500 uses D (675) to update pixel positions P4 and P5. In particular, P4 = P4 - D and P5 = P5 + D.

Although FIGS. 6A and 6B have been described for a single row or column (that is, a single set of pixel positions P1-P8), the result for the third row or column of a sub-block affects the behavior for the other three rows or columns of that sub-block. Line acceleration logic circuit 500 implements this behavior with an inventive approach. The independent filter operations are completed up front, in parallel, as described in connection with FIGS. 6A and 6B, and the portion of line acceleration logic circuit 500 shown in FIGS. 6C and 6D then conditionally selects which positions to update. In other words, VC-1 in-loop deblocking filter hardware acceleration logic circuit 400 determines whether the original value or the new value is written back. In contrast, a conventional VC-1 in-loop deblocking filter uses a loop, so that the independent filter operations are executed conditionally.

As explained earlier, the pseudocode of FIG. 4 operates line acceleration logic circuit 500 within a loop: instantiation section 450 appears inside iteration section 420. In addition, the instantiation of line acceleration logic circuit 500 takes two parameters, ProcessingPixel3 and FILTER_OTHER_3. Line acceleration logic circuit 500 uses these parameters to perform the conditional update of pixels P4 and P5 as follows.
Referring to FIG. 6C, register P4 is written with the result of subtractor 681, which has one input of P4 (671) and another input that is either 0 or D (675), depending on the value of DO_FILTER (683). Similarly, register P5 is written with the result of adder 685, which has one input of P5 (673) and another input that is either 0 or D (675), depending on the value of DO_FILTER (683). Thus the updated value of P4 is either the original value of P4 (if DO_FILTER is false) or P4 - D. Likewise, the updated value of P5 is either the original value of P5 (if DO_FILTER is false) or P5 + D.

Those skilled in the art should understand that, when the third row of a sub-block is processed, the criterion for updating P4 with P4 - D is:

    (ABS(A0) < PQUANT) OR (A3 < ABS(A0)) OR (CLIP != 0)

DO_FILTER 683 is computed by the portion of line acceleration logic circuit 500 shown in FIG. 6D, which tests these conditions. Multiplexer 687 provides one input to OR gate 697, selecting a true output if ABS(A0) < PQUANT and a false output otherwise. Multiplexer 689 provides another input to OR gate 697, selecting a true output if A3 < ABS(A0) and a false output otherwise. Multiplexer 691 provides another input to OR gate 697, selecting a true output if CLIP != 0 and a false output otherwise. DO_FILTER 683 is provided by multiplexer 693, which uses the control input Processing_Pixel_3 (695) to select either the output of OR gate 697 or the input signal FILTER_OTHER_3 (699).

The inputs Processing_Pixel_3 (695) and FILTER_OTHER_3 (699) were described earlier in connection with FIG. 4 and the pseudocode of the higher-level VC-1 in-loop deblocking filter hardware acceleration logic circuit 400, which instantiates line acceleration logic circuit 500. Returning to FIG. 4, Processing_Pixel_3 (695) is set to true when the third row or column is processed (the first loop iteration) and to false otherwise. Based on the conditions on PQUANT, ABS(A0), and CLIP, the intermediate variable DO_FILTER records whether P4 and P5 are to be updated. Finally, the value of FILTER_OTHER_3 (699) is set from the intermediate variable DO_FILTER.

The result of the portions of line acceleration logic circuit 500 shown in FIGS. 6C and 6D is that, every four cycles, pixel positions P4 and P5 in four adjacent rows or columns are either set to their filtered values (according to A0-A3, PQUANT, CLIP, and so on) or rewritten with their original values.

VC-1 deblocking acceleration unit 400 inventively combines parallel and sequential processing, as described above. Parallel processing provides faster execution and reduces latency.
Although parallelization increases the gate count, the increase is offset by the sequential processing described above. A conventional approach that does not use this sequential processing only increases the gate count.

Deblocking Filter

One embodiment of the in-loop deblocking filter (IDF) specified by VC-1 has been described above. Some embodiments of graphics processing unit 120 include a hardware acceleration unit for H.264 deblocking. Those skilled in the art should be familiar with the H.264 in-loop deblocking filter, so only a brief overview of the filtering operation is given.

The H.264 in-loop deblocking filter is a conditional filter that is applied to all 4x4 block edges of a picture, unless Disable_Deblocking_Filter_IDC is defined for that edge. The filter is applied to all macroblocks sequentially, in order of increasing macroblock address. For each macroblock, the vertical edges are filtered first, from left to right, and the horizontal edges are then filtered from top to bottom (VC-1 applies the opposite order). Sample values from the macroblocks above and to the left of the current macroblock, which have already been filtered, are therefore used and may be filtered again.

Some inventive features of H.264 in-loop deblocking filter hardware acceleration unit 700 are described in connection with the hardware description pseudocode of FIG. 7. Although pseudocode is used rather than an actual hardware description language (HDL) such as Verilog or VHDL, those skilled in the art should be familiar with such pseudocode.
Those persons should appreciate that, when expressed in an actual HDL, this code can be compiled and then synthesized into an arrangement of logic gates that forms part of video acceleration unit 150. They should also appreciate that these logic gates can be implemented in various technologies, such as an application-specific integrated circuit (ASIC), a programmable gate array (PGA), or a field-programmable gate array (FPGA).

Section 710 of the code is the module definition. H.264 in-loop deblocking filter hardware acceleration logic circuit 700 has several input parameters. The sub-block to be filtered is specified by the Block parameter. If the Vertical parameter is true, acceleration logic circuit 700 treats the Block parameter as a 4x8 block and performs vertical edge filtering. If the Vertical parameter is false, acceleration logic circuit 700 treats the Block parameter as an 8x4 block (see FIG. 3) and performs horizontal edge filtering.

Section 720 of the code begins an iteration loop and sets the value of the loop parameter variable. Using these parameters, H.264 in-loop deblocking filter hardware acceleration logic circuit 700 iterates four times, processing 8 pixels each time, where a line can be either a horizontal row or a vertical column, depending on the Vertical parameter. As detailed below, each line is processed twice by line acceleration logic circuit 800 (see FIG. 8).

Section 730 tests the Vertical parameter to determine whether vertical or horizontal edge filtering is performed. Based on the result, the eight elements of the Line array variable are initialized from a row of the 4x8 input block or from a column of the 8x4 input block.
When instantiated, the code of section 730, together with the iteration code of section 720, becomes multiplexing and bit-positioning logic (sometimes called swizzling logic) which, as described by the code, moves bits from the input block in memory to the appropriate bit positions in the P registers. Note that the code in sections 720 and 730 is the same as the analogous code in FIG. 4 for VC-1 deblocking filter 400. As a result of this choice, a single multiplexing/swizzling logic block is generated and is used by both H.264 in-loop deblocking filter logic circuit 700 and VC-1 in-loop deblocking filter logic circuit 400.

Section 750 extracts the parameters used by the actual filter from information contained in the H.264 instruction provided by graphics processing unit 120. The bS (boundary strength) and chromaEdgeFlag parameters are used by the H.264 in-loop deblocking filter and are familiar to those skilled in the art. The parameters indexA and indexB correspond to the alpha and beta parameters used by the H.264 in-loop deblocking filter, which are also familiar to those skilled in the art.

One inventive feature of graphics processing unit 120 is that the indexA, indexB, and bS parameters are not computed by H.264 in-loop deblocking filter hardware acceleration logic circuit 700, but by an execution unit 940 within graphics processing unit 120 (described later in connection with FIG. 9). Using EU instructions to compute bS, indexA, and indexB makes the computing power and generality of graphics processing unit execution unit 940 available to augment in-loop deblocking filter hardware acceleration logic circuit 700.
This choice avoids additional, and possibly complex, logic in in-loop deblocking filter hardware acceleration logic circuit 700. In another embodiment, the parameters bS, indexA, and indexB are computed by code executing on host processor 110 (see FIG. 1).

Section 750 instantiates another HDL module, H264_Deblock_Filter_Line, which applies the filter to the current line. The parameters supplied to the sub-module include the control variables extracted from the EU instruction described above, and the LeftTop parameter. One inventive feature of logic circuit 700 is that the line filter is invoked twice, with each invocation updating only half of the pixels, where the half being updated is indicated by the LeftTop parameter. This design tradeoff saves gates but requires more clock cycles. Those skilled in the art should understand how to instantiate the filter line module with different parameter values to produce two different logic blocks, whose inputs are the two different halves of the pixel block.

After the sub-module processes the line, hardware acceleration logic circuit 700 continues the iteration loop in section 720 with an updated loop parameter value. In this way, H264_Deblock_Filter_Line is applied to lines 1-4 of the input block.

FIGS. 8A and 8B show hardware description pseudocode for line acceleration logic circuit 800, which implements the H264_Deblock_Filter_Line sub-module described above. As shown in FIG. 8A, line module 800 is divided into module definition section 810, parameter mapping section 820, and pixel computation section 830. Those skilled in the art should be able to understand module definition section 810 from the code of FIG. 8A, so it is not explained further. Parameter mapping section 820 calls two other subroutines (described in connection with FIG. 8B) to map the parameters IndexA and IndexB, supplied by H.264 in-loop deblocking filter hardware acceleration logic circuit 700, to the alpha and beta parameters. Alpha and beta, together with ChromaEdgeFlag, are then used by section 830 to actually apply the filter, by computing new pixel values based on alpha, beta, ChromaEdgeFlag, and the neighboring pixel values. The actual pseudocode of this section is not shown, because those skilled in the art should know how to implement the deblocking filter for a single line as described in the H.264 specification.

Inventive features of line acceleration logic circuit 800 are further shown in the logic sections getAlphaBeta 850 and getThreshold 870 of FIG. 8B. These logic sections correspond to the subroutines used by parameter mapping section 820 of FIG. 8A. As can be seen in the code of FIG. 8B, read-only memory (ROM) tables are used to map IndexA and IndexB to the corresponding alpha and beta values. Similarly, a ROM table is used to compute the threshold.

In some embodiments of graphics processing unit 120, the H.264 deblocking functions described above are implemented through graphics processing unit instructions. Graphics processing unit 120 is described in more detail in connection with FIG. 10, with emphasis on the particular choice of graphics processing unit instructions used to implement H.264 deblocking acceleration.

Principle of Multiple Graphics Processing Unit Deblocking Instructions

The instruction set of graphics processing unit 120 includes instructions that the portion of decoder 160 executing in software can use to accelerate a deblocking filter. An inventive technique described here provides not one but multiple graphics processing unit instructions to accelerate a particular deblocking filter.
In-loop deblocking filter 290 is inherently sequential, in that a particular filter must filter pixels in a certain order (for example, H.264 specifies left to right and then top to bottom). Previously filtered, or updated, pixels are therefore used as inputs when later pixels are filtered. A host processor operates on pixel values stored in conventional memory, which allows pixels to be read and written back-to-back. However, this sequential nature is not a good fit when in-loop deblocking filter 290 uses a graphics processing unit to accelerate part of the filtering process. A conventional graphics processing unit stores pixels in a texture cache, and the design of the graphics processing unit pipeline does not allow back-to-back reads and writes of the texture cache.

Some embodiments of graphics processing unit 120 disclosed here provide multiple graphics processing unit instructions that can be used together to accelerate a particular deblocking filter. Some of these instructions use the texture cache as the pixel data source, and some use the graphics processing unit execution units as the data source. In-loop deblocking filter 290 uses these different graphics processing unit instructions in an appropriate combination to achieve back-to-back pixel reads and writes. An overview of the data flow through graphics processing unit 120 is given next, followed by an explanation of the deblocking acceleration instructions provided by graphics processing unit 120 and of how in-loop deblocking filter 290 uses these instructions.
Graphics Processing Unit Flow

FIG. 9 is a diagram of the data flow of graphics processing unit 120, in which the instruction flow is shown by the arrows on the left side of FIG. 9 and the image or graphics flow is shown by the arrows on the right side. FIG. 9 omits a number of components that are well known to those skilled in the art and that are not necessary to explain the in-loop deblocking features of graphics processing unit 120.

An instruction stream processor 910 receives an instruction 920 from a system bus (not shown) and decodes the instruction, producing instruction data 930, such as vertex data. Graphics processing unit 120 supports conventional graphics processing instructions as well as instructions that accelerate video encoding and/or decoding.

Conventional graphics processing instructions involve tasks such as vertex shading, geometry shading, and pixel shading. Instruction data 930 is therefore supplied to a pool 940 of shader execution units. Execution units 940 use a texture filter unit (TFU) 950 as needed to apply a texture to a pixel. Texture data is cached in texture cache 960, which is backed by main memory (not shown). Some instructions are sent to video accelerator 150, whose operation is described below. The resulting data is then processed by post-packer 970, which compresses the data. After this post-processing, the data produced by the video acceleration unit is supplied to execution unit pool 940.

The execution of video encoding/decoding acceleration instructions, such as the deblocking filter instructions described above, differs in several respects from the execution of the conventional graphics instructions described above. First, video acceleration instructions are executed by video acceleration unit 150 rather than by the shader execution units.
Second, video acceleration instructions do not use texture data as such. However, the image data used by video acceleration instructions and the texture data used by graphics instructions are both two-dimensional arrays. Graphics processing unit 120 takes advantage of this similarity by using texture filter unit 950 to load image data for the video acceleration unit 150, which in turn lets texture cache 960 cache some of the image data operated on by the video acceleration unit 150. For this reason, as shown in Figure 9, the video acceleration unit 150 is located between texture filter unit 950 and post-packer 970.

Texture filter unit 950 examines the instruction data 930 extracted from instruction 920. The instruction data 930 also provides texture filter unit 950 with the coordinates of the desired image data within texture cache 960. In one embodiment, these coordinates are specified as U,V pairs, which should be familiar to those skilled in the art. When instruction 920 is a video acceleration instruction, the extracted instruction data further instructs texture filter unit 950 to bypass the texture filters (not shown) within texture filter unit 950. In this way, video acceleration instructions use texture filter unit 950 to load image data for the video acceleration unit 150.

The video acceleration unit 150 receives image data from texture filter unit 950 on the data path and instruction data 930 on the command path, and performs an operation on the image data according to the instruction data 930. The image data output by the video acceleration unit 150 is fed back to execution unit pool 940 after being processed by post-packer 970.
Deblocking Instructions

The embodiment of graphics processing unit 120 described herein provides hardware acceleration for the VC-1 deblocking filter and the H.264 deblocking filter. The VC-1 deblocking filter is accelerated by one graphics processing unit instruction ("IDF_VC-1"), while the H.264 deblocking filter is accelerated by three graphics processing unit instructions ("IDF_H264_0", "IDF_H264_1", and "IDF_H264_2").

As described earlier, each graphics processing unit instruction is decoded and parsed into instruction data 930, which can be viewed as a set of parameters specific to each instruction, as shown in Table 1. The IDF_H264_x instructions share some common parameters, while other parameters are unique to each instruction. Those skilled in the art should understand that these parameters can be encoded using a variety of opcodes and instruction formats, so these topics are not discussed here.

Table 1: IDF_H264_x instruction parameters
Parameter | Size | Operand | Description
FieldFlag (Input) | 1 bit | - | If FieldFlag == 1 then field picture, else frame picture
TopFieldFlag (Input) | 1 bit | - | If TopFieldFlag == 1 then top-field picture, else bottom-field picture (when FieldFlag is set)
PictureWidth (Input) | 16 bits | - | e.g., 1920 for HDTV
PictureHeight (Input) | 16 bits | - | e.g., 1080 for 30P HDTV
YC Flag | 1 bit | Control-2 | Y plane or chroma plane
Field Direction | 1 bit | Control-1 | -
CBCR Flag | 1 bit | Control-1 | Cb or Cr
BaseAddress (Input) | 32 bits, unsigned | - | For IDF_H264_0 and IDF_H264_1: base address of the sub-block in texture memory
BlockAddress (Input) | 13.3 format, fractional part omitted | SRC1[0:15], SRC1[31:16]-V | For IDF_H264_0: texture coordinates of the entire sub-block (relative to the base address). For IDF_H264_1: texture coordinates of the remaining sub-block (relative to the base address). Not used in IDF_H264_2
DataBlock1 | 4x4x8 bits | SRC2[127:0] | Not used in IDF_H264_0. For IDF_H264_1: the top or left half of the sub-block, according to the FilterDirection encoded in the Control 2 parameter. For IDF_H264_2: the first (even) register pair
DataBlock2 | 4x4x8 bits | SRC2[255:128] | Not used in IDF_H264_0 or IDF_H264_1. For IDF_H264_2: the second (odd) register pair
Sub-block (Output) | 128 bits | - | Deblocked 8x4x8-bit sub-block (128 bits)

Several of the input parameters are used in combination to determine the address of the 4x4 block to be fetched by texture filter unit 950. The BaseAddress parameter indicates the start of the texture data within the texture cache. The coordinates of the top-left block within this region are given with respect to the BaseAddress parameter. The PictureHeight and PictureWidth input parameters are used to determine the extent of the block, i.e., its lower-left coordinate. Finally, the video picture may be progressive or interlaced.
If the picture is interlaced, it consists of two fields (a top field and a bottom field). Texture filter unit 950 uses the FieldFlag and TopFieldFlag parameters to process interlaced pictures appropriately.

The deblocked 8x4x8-bit output is provided in a destination register and is also written back to execution unit pool 940. Writing the deblocked output back to execution unit pool 940 is a "modify in place" operation, which is necessary in some decoder implementations, such as H.264, where the pixel values of blocks to the right and below are computed from previous results. A VC-1 decoder, however, does not have this dependency as H.264 does. In VC-1, each 8x8 boundary is filtered (vertical first, then horizontal). All vertical edges can therefore be filtered substantially in parallel, with the 4x4 edges filtered afterward. Parallelization is possible because only two pixels (one on each side of an edge) are updated, and these pixels are not used to compute other edges.

Since the deblocked data is written back to execution unit pool 940 rather than to texture cache 960, different IDF_H264_x instructions are provided so that sub-blocks can be fetched from different locations. This can be seen in Table 1, in the descriptions of the BlockAddress, DataBlock1, and DataBlock2 parameters. The IDF_H264_0 instruction fetches the entire 8x4x8-bit sub-block from texture cache 960. The IDF_H264_1 instruction fetches half of the sub-block from texture cache 960 and half from execution unit pool 940. The use of the IDF_H264_x instructions by decoder 160 is described in more detail below.

Described next is the transformation that texture filter unit 950 and execution unit pool 940 apply to the fetched pixel data before it is supplied to the video acceleration unit 150.
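As a rough illustration of how the three H.264 instructions differ in where the two 4x4 halves of a sub-block come from, the sketch below models the two pixel sources as plain Python tuples. The names `Source`, `HALF_SOURCES`, and `fetch_subblock` are hypothetical and do not appear in the disclosure; only the choice of source per instruction follows the description above and Table 1, and the assumption that the first half is the top or left half is taken from the DataBlock1 row of that table.

```python
from enum import Enum


class Source(Enum):
    """The two pixel data sources named in the description."""
    TEXTURE_CACHE = "texture cache 960"
    EXECUTION_UNIT = "execution unit pool 940"


# Hypothetical lookup: for each instruction, the source of the first
# (top or left) half and of the remaining half of the 8x4 sub-block.
HALF_SOURCES = {
    "IDF_H264_0": (Source.TEXTURE_CACHE, Source.TEXTURE_CACHE),
    "IDF_H264_1": (Source.EXECUTION_UNIT, Source.TEXTURE_CACHE),
    "IDF_H264_2": (Source.EXECUTION_UNIT, Source.EXECUTION_UNIT),
}


def fetch_subblock(opcode, from_texture, from_registers):
    """Assemble the two halves of a sub-block for one instruction.

    from_texture and from_registers are 2-tuples holding the halves as
    stored in the texture cache (unfiltered) and in execution unit
    registers (possibly already updated by an earlier edge filter).
    """
    halves = []
    for index, source in enumerate(HALF_SOURCES[opcode]):
        pool = from_texture if source is Source.TEXTURE_CACHE else from_registers
        halves.append(pool[index])
    return tuple(halves)


# Show which copy of each half every instruction would pick up.
for name in HALF_SOURCES:
    print(name, fetch_subblock(name, ("tex-first", "tex-second"),
                               ("reg-first", "reg-second")))
```

Keeping the source selection in a table rather than in branching code mirrors the idea that the instruction itself, not extra control logic, decides where pixel data is loaded from.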
Transformation of Image Data

The instruction parameters described above provide texture filter unit 950 with the coordinates of the sub-block to be fetched from texture cache 960 or from execution unit pool 940. The image data contains luminance (Y) and chrominance (Cb, Cr) planes. The YC flag input parameter defines whether the Y plane or the CbCr plane is to be processed.

When luminance (Y) data is processed, as indicated by the YC flag parameter, texture filter unit 950 fetches the sub-block and provides the 128 bits as input to the VC-1 in-loop deblocking filter hardware acceleration logic circuit 400 (e.g., the block input parameter of the VC-1 accelerator example of Figure 4). The resulting data is written to the destination registers as a register quad (i.e., DST, DST+1, DST+2, DST+3).

When chrominance data is processed, as indicated by the YC flag parameter, the Cb and Cr blocks are processed consecutively by the VC-1 in-loop deblocking filter hardware acceleration logic circuit 400. The resulting data is written to texture cache 960. In some embodiments, this write occurs in each cycle, with 256 bits written per cycle.

Some video acceleration unit embodiments use interleaved CbCr planes, each stored at half width and half height. In these embodiments, texture filter unit 950 de-interleaves the CbCr sub-block data for the video acceleration unit into a buffer used for communication between texture filter unit 950 and the video acceleration unit 150. In particular, texture filter unit 950 writes two 4x4 Cb blocks into the buffer and then writes two 4x4 Cr blocks into the buffer. The 8x4 Cb block is processed first by the VC-1 in-loop deblocking filter hardware acceleration logic circuit 400, and the resulting data is written to texture cache 960.
Next, the 8x4 Cr block is processed by the VC-1 in-loop deblocking filter hardware acceleration logic circuit 400, and the resulting data is written to texture cache 960. The video acceleration unit 150 uses the CbCr flag parameter to manage this sequential processing.

Using the Deblocking Instructions

As described earlier in connection with Figure 1, decoder 160 executes on host processor 110 but also makes use of the video acceleration instructions provided by graphics processing unit 120. In particular, an embodiment of the H.264 in-loop deblocking filter 290 uses particular combinations of IDF_H264_x instructions to process edges in the order specified by H.264, fetching some sub-blocks from texture cache 960 and others from the execution unit pool. Combined appropriately, these IDF_H264_x instructions achieve back-to-back pixel reads and writes.

Figure 10 is a block diagram of a 16x16 macroblock as used in H.264. The macroblock is divided into sixteen 4x4 sub-blocks, each of which undergoes deblocking. The sub-blocks in Figure 10 can be identified by row and column (e.g., R1, C2). H.264 specifies that vertical edges are processed first and horizontal edges afterward, in the edge order (a-h) shown in Figure 10.
Therefore, the deblocking filter is applied to the edge between a pair of sub-blocks, and the sub-block pairs are filtered in this order:

edge a = [block to left of R1,C1] | [R1,C1]; [block to left of R2,C1] | [R2,C1]; [block to left of R3,C1] | [R3,C1]; [block to left of R4,C1] | [R4,C1]
edge b = [R1,C1] | [R1,C2]; [R2,C1] | [R2,C2]; [R3,C1] | [R3,C2]; [R4,C1] | [R4,C2]
edge c = [R1,C2] | [R1,C3]; [R2,C2] | [R2,C3]; [R3,C2] | [R3,C3]; [R4,C2] | [R4,C3]
edge d = [R1,C3] | [R1,C4]; [R2,C3] | [R2,C4]; [R3,C3] | [R3,C4]; [R4,C3] | [R4,C4]
edge e = [block to top of R1,C1] | [R1,C1]; [block to top of R1,C2] | [R1,C2]; [block to top of R1,C3] | [R1,C3]; [block to top of R1,C4] | [R1,C4]
edge f = [R1,C1] | [R2,C1]; [R1,C2] | [R2,C2]; [R1,C3] | [R2,C3]; [R1,C4] | [R2,C4]
edge g = [R2,C1] | [R3,C1]; [R2,C2] | [R3,C2]; [R2,C3] | [R3,C3]; [R2,C4] | [R3,C4]
edge h = [R3,C1] | [R4,C1]; [R3,C2] | [R4,C2]; [R3,C3] | [R4,C3]; [R3,C4] | [R4,C4]

For the first pair of sub-blocks, both are loaded from texture cache 960, because no pixels have yet been changed by applying the filter. Although filtering the first vertical edge (a) can change the pixel values of (R1,C1), the vertical edge in the second row does not share any pixels with the vertical edge in the first row. Therefore, the second pair of sub-blocks, in the second row, is also loaded from texture cache 960. Since the vertical edges of adjacent rows share no pixels, the same holds for the third and fourth pairs of sub-blocks. The particular IDF_H264_x instruction issued by the in-loop deblocking filter 290 determines the location from which pixel data is loaded.
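The edge order above is regular enough to be generated rather than listed. The following is a minimal sketch, assuming that every vertical edge pairs two sub-blocks in the same row and every horizontal edge pairs two sub-blocks in the same column; the function name `edge_pairs` and the string labels are illustrative only and are not part of the disclosure.

```python
def edge_pairs():
    """Yield (edge, first_block, second_block) in the a-h order of Figure 10.

    Sub-blocks are labelled "R<row>,C<column>" with rows and columns 1..4.
    Neighbours outside the macroblock are labelled "left of ..." or
    "top of ...", following the wording of the edge list.
    """
    labels = iter("abcdefgh")
    indices = range(1, 5)

    # Vertical edges a-d: one edge per column, visited row by row.
    for col in indices:
        edge = next(labels)
        for row in indices:
            block = f"R{row},C{col}"
            first = f"left of {block}" if col == 1 else f"R{row},C{col - 1}"
            yield edge, first, block

    # Horizontal edges e-h: one edge per row, visited column by column.
    for row in indices:
        edge = next(labels)
        for col in indices:
            block = f"R{row},C{col}"
            first = f"top of {block}" if row == 1 else f"R{row - 1},C{col}"
            yield edge, first, block


for edge, first, second in edge_pairs():
    print(f"edge {edge}: [{first}] | [{second}]")
```

Each of the eight edges contributes four pairs, so a full macroblock involves 32 filtered sub-block pairs.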
The in-loop deblocking filter 290 processes the first vertical edge (a) by issuing the following sequence of IDF_H264_x instructions:

IDF_H264_0 SRC1=address of (R1,C1);
IDF_H264_0 SRC1=address of (R2,C1);
IDF_H264_0 SRC1=address of (R3,C1);
IDF_H264_0 SRC1=address of (R4,C1);

Next, the in-loop deblocking filter 290 processes the second vertical edge (b), starting with (R1,C2). The leftmost four pixels of the 8x4 sub-block designated (R1,C2) overlap the rightmost pixels of the (R1,C1) sub-block. These overlapping pixels, which were processed and possibly updated by the vertical edge filter for (R1,C1), are therefore read from execution unit pool 940 rather than from texture cache 960. However, the rightmost four pixels of the (R1,C2) sub-block have not yet been filtered and are therefore read from texture cache 960. The same holds for sub-blocks (R2,C2) through (R4,C2). The in-loop deblocking filter 290 achieves this by issuing the following sequence of IDF_H264_x instructions to process the second vertical edge (b):

IDF_H264_1 SRC1=address of (R1,C2);
IDF_H264_1 SRC1=address of (R2,C2);
IDF_H264_1 SRC1=address of (R3,C2);
IDF_H264_1 SRC1=address of (R4,C2);

Processing of the third vertical edge (c) starts with (R1,C3). The leftmost four pixels of the (R1,C3) 8x4 sub-block overlap the rightmost pixels of the (R1,C2) sub-block and are therefore read from execution unit pool 940 rather than from texture cache 960. However, the rightmost four pixels of the (R1,C3) sub-block have not yet been filtered and are therefore read from texture cache 960. The same holds for sub-blocks (R2,C3) through (R4,C3). A similar situation arises for the last vertical edge (d).
Therefore, the in-loop deblocking filter 290 processes the remaining two vertical edges, c and d, by issuing the following sequence of IDF_H264_x instructions:

IDF_H264_1 SRC1=address of (R1,C3);
IDF_H264_1 SRC1=address of (R2,C3);
IDF_H264_1 SRC1=address of (R3,C3);
IDF_H264_1 SRC1=address of (R4,C3);
IDF_H264_1 SRC1=address of (R1,C4);
IDF_H264_1 SRC1=address of (R2,C4);
IDF_H264_1 SRC1=address of (R3,C4);
IDF_H264_1 SRC1=address of (R4,C4);

The horizontal edges (e-h) are processed next. At this point, the deblocking filter has been applied to every sub-block in the macroblock, so any pixel may have been updated. Therefore, each sub-block submitted for horizontal edge filtering is read from execution unit pool 940 rather than from texture cache 960.
Therefore, the in-loop deblocking filter 290 processes the horizontal edges by issuing the following sequence of IDF_H264_x instructions:

IDF_H264_2 SRC1=address of (R1,C1);
IDF_H264_2 SRC1=address of (R2,C1);
IDF_H264_2 SRC1=address of (R3,C1);
IDF_H264_2 SRC1=address of (R4,C1);
IDF_H264_2 SRC1=address of (R1,C2);
IDF_H264_2 SRC1=address of (R2,C2);
IDF_H264_2 SRC1=address of (R3,C2);
IDF_H264_2 SRC1=address of (R4,C2);
IDF_H264_2 SRC1=address of (R1,C3);

In this way, a complex filtering operation is implemented through the graphics processing unit instruction set. The entire deblocking filter operation is generally too complex to implement as a single instruction. For example, the H.264 filter is quite complex, involving both a horizontal pass and a vertical pass. In addition, the block size is fairly large. Therefore, rather than building hardware to manage the control aspects of the filter, individual instructions are combined in sequence (e.g., as a macro), and these instruction sequences are used to process 4x4 blocks. This makes use of execution unit resources that are already available and thereby minimizes the need for complex control structures in the in-loop deblocking filter, which in turn reduces the hardware and memory required in the in-loop deblocking filter unit. On the other hand, implementing these filter instructions in the deblocking filter 290, rather than as instructions executed on the execution units, is advantageous because the filtering involves a number of scalar operations (e.g., data shuffling, table lookup, conditional filtering) that are inefficient on vector-based execution units.
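The instruction sequences listed for edges a through h follow a simple pattern, which the sketch below reproduces for one macroblock. It is an illustration under stated assumptions, not the decoder's actual code: the function name `macroblock_sequence` is hypothetical, and because the horizontal-edge listing in the text stops at (R1,C3), the sketch assumes it continues in the same column-by-column order through (R4,C4).

```python
def macroblock_sequence():
    """Return (opcode, sub_block) pairs for one 16x16 macroblock.

    Vertical edges: column 1 uses IDF_H264_0 (both halves from the
    texture cache); columns 2-4 use IDF_H264_1 (left half from the
    execution unit pool, right half from the texture cache).
    Horizontal edges: every sub-block uses IDF_H264_2 (both halves
    from the execution unit pool).
    """
    sequence = []

    # Vertical edges a-d.
    for col in range(1, 5):
        opcode = "IDF_H264_0" if col == 1 else "IDF_H264_1"
        for row in range(1, 5):
            sequence.append((opcode, f"R{row},C{col}"))

    # Horizontal edges e-h (assumed to cover all sixteen sub-blocks).
    for col in range(1, 5):
        for row in range(1, 5):
            sequence.append(("IDF_H264_2", f"R{row},C{col}"))

    return sequence


for opcode, block in macroblock_sequence():
    print(f"{opcode} SRC1=address of ({block});")
```

Under these assumptions the macroblock needs 4 IDF_H264_0, 12 IDF_H264_1, and 16 IDF_H264_2 instructions, which shows how the host-side filter, rather than dedicated control hardware, carries the ordering logic.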
Any process descriptions or blocks in flowcharts should be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps in the process. Those skilled in the art of software development should understand that alternative implementations are also included within the scope of the disclosure. In these alternative implementations, functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.

The systems and methods disclosed herein can be implemented in software, hardware, or a combination thereof. In some embodiments, the system and/or method is implemented in software that is stored in memory and executed by a suitable processor located in a computing device (including but not limited to a microprocessor, microcontroller, network processor, reconfigurable processor, or extensible processor). In other embodiments, the system and/or method is implemented in logic circuits, including but not limited to a programmable logic device (PLD), programmable gate array (PGA), field programmable gate array (FPGA), or application-specific integrated circuit (ASIC). In still other embodiments, the logic described herein is implemented within a graphics processor or graphics processing unit (GPU).

The systems and methods disclosed herein can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device. The instruction execution system includes any computer-based system, processor-containing system, or other system that can fetch and execute the instructions from the instruction execution system. In the context of this disclosure, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, the instruction execution system. The computer-readable medium can be, for example but not limited to, a system or propagation medium based on electronic, magnetic, optical, electromagnetic, infrared, or semiconductor technology.

Specific examples of a computer-readable medium using electronic technology include (but are not limited to): an electrical (electronic) connection having one or more wires; a random access memory (RAM); a read-only memory (ROM); and an erasable programmable read-only memory (EPROM or flash memory). A specific example using magnetic technology includes (but is not limited to) a portable computer diskette. Specific examples using optical technology include (but are not limited to) an optical fiber and a portable compact disc read-only memory (CD-ROM).

Although the invention is illustrated and described herein with reference to one or more specific embodiments, it should not be limited to the details shown, since many different modifications and structurally equivalent changes can be made without departing from the spirit of the invention and within the scope and range of equivalents of the claims. Accordingly, the appended claims should be construed broadly and in a manner consistent with the scope of the invention, as set forth in the claims that follow.

Brief Description of the Drawings

Figure 1 is a block diagram of an exemplary computing platform for graphics and video encoding and/or decoding.
Figure 2 is a block diagram of the video decoder 160 of Figure 1.
Figure 3 illustrates the sub-block pixel arrangement for a VC-1 filter.
Figure 4 is a listing of hardware description pseudocode for the VC-1 in-loop deblocking filter hardware acceleration logic circuit 400 of Figure 1.
Figure 5 is a listing of hardware description language code for the line acceleration logic circuit 500 of Figure 4.
Figures 6A-D form a block diagram of the line acceleration logic circuit of Figures 4 and 5.
Figure 7 is hardware description pseudocode for the H.264 in-loop deblocking filter hardware acceleration unit 700.
Figures 8A and 8B show hardware description pseudocode for the line acceleration logic circuit 800.
Figure 9 is a data flow diagram of the graphics processing unit 120 of Figure 1.
Figure 10 is a block diagram of a 16x16 macroblock as used in H.264.

Description of Main Reference Numerals

100~system, 110~general-purpose CPU, 120~graphics processor (GPU), 130~memory, 140~bus, 150~video acceleration unit (VPU), 160~software decoder, 170~video acceleration driver.
205~input bitstream, 210~entropy decoder, 215~spatial decoder, 220~inverse quantizer, 230~inverse discrete cosine transform, 235~picture, 245~motion vector, 250~motion compensation, 255~previously decoded picture, 265~predicted picture, 270~spatial compensation, 280~adder, 290~deblocking filter, 295~decoded picture.
310, 320~two adjacent 4x4 sub-blocks, 330~vertical boundary.
400~in-loop deblocking filter hardware acceleration logic circuit, 410~module definition section, 420~iteration loop section, 430~test vertical parameter section, 440~compare loop parameter with 3 section, 450~instantiation section.
500~line acceleration logic circuit, 510~module definition section, 520~pixel value computation section, 530~compare loop parameter with 3 section, 540~test DCLFIuer section, 550~update state section.
605, 610, 615, 620~multiplexers, 625, 630, 679~subtractors, 635, 640, 655, 680~logic circuit blocks, 645, 650~adders, 660, 665, 670~registers, 671~P4 register output, 673~P5 register output, 681~subtractor, 685~adder, 687, 689, 691, 693~multiplexers, 697~OR gate.
700~H.264 in-loop deblocking filter hardware acceleration unit, 710~module definition section, 720~iteration loop section, 730~test vertical parameter section, 740~fetch parameter section, 750~instantiation section.
800~line acceleration logic circuit, 810~module definition section, 820~lookup table section, 830~pixel computation section.
910~instruction stream processor, 920~instruction, 930~instruction data, 940~execution unit pool, 950~texture filter unit, 960~texture cache, 970~post-packer.

Claims

1. A graphics processing unit, comprising: a decoder configured to decode a first and a second deblocking filter acceleration instruction, the first and second deblocking filter acceleration instructions each being associated with a deblocking filter used by a particular video decoder; and a video processing unit configured to receive a first parameter encoded by the first deblocking filter acceleration instruction and to determine a first memory source specified by the received first parameter, the first memory source being one of a plurality of memory sources located in the graphics processing unit, and to receive a second parameter encoded by the second deblocking filter acceleration instruction and to determine a second memory source specified by the received second parameter, the second memory source being one of the plurality of memory sources located in the graphics processing unit, wherein the video processing unit is further configured to load a first block of pixel data from the determined first memory source and apply the deblocking filter to the first block of pixel data, and to load a second block of pixel data from the determined second memory source and apply the deblocking filter to the second block of pixel data.
2. The graphics processing unit of claim 1, wherein the plurality of memory sources includes a texture cache and an execution unit within the graphics processing unit.
3. The graphics processing unit of claim 1, wherein the first memory source and the second memory source are used to achieve back-to-back pixel reads and writes.

4. The graphics processing unit of claim 1, wherein the first memory source and the second memory source are both a texture cache in the graphics processing unit.

5. The graphics processing unit of claim 1, wherein the first memory source is a texture cache in the graphics processing unit and the second memory source is an execution unit in the graphics processing unit.

6. The graphics processing unit of claim 1, wherein the first memory source and the second memory source are both an execution unit in the graphics processing unit.

7. The graphics processing unit of claim 1, wherein the deblocking filter acceleration instructions are associated with a filter used by an H.264 video decoder.

8. A graphics processing unit, comprising: a video processing unit configured to apply a deblocking filter associated with a particular video decoder; a decoder configured to decode a plurality of deblocking filter acceleration instructions associated with the deblocking filter; a texture filter unit configured to provide pixel data to the video processing unit for the deblocking filter; and an execution unit configured to perform a graphics processing function on the pixel data, wherein the video processing unit is further configured to receive parameters encoded by the deblocking filter acceleration instructions, to determine that a first memory source specified by the received parameters corresponds to the texture filter unit or the execution unit, and to determine that a second memory source specified by the received parameters corresponds to the texture filter unit or the execution unit; and wherein the video processing unit is further configured to load a first pixel data block from the first memory source, to load a second pixel data block from the second memory source, and to apply the deblocking filter to the first pixel data block and to the second pixel data block in accordance with the received parameters.

9. The graphics processing unit of claim 8, wherein the video processing unit is further configured to apply the deblocking filter according to at least one filter parameter, and the execution unit is further configured to calculate the at least one filter parameter based on the first pixel data block.

10. The graphics processing unit of claim 8, wherein the deblocking filter acceleration instructions are associated with a filter used by an H.264 video decoder.

11. The graphics processing unit of claim 8, wherein the first memory source specified by the received parameters corresponds to the texture filter unit, and the second memory source specified by the received parameters corresponds to the execution unit, so as to achieve one pixel read and write.

12. A video encoder, comprising: a plurality of execution unit instructions configured to calculate at least one deblocking filter configuration parameter associated with a pixel data block and a particular video coding specification, and further configured to execute on a shader execution unit within a graphics processing unit; and a plurality of deblocking filter acceleration instructions configured to apply a deblocking filter in accordance with the calculated filter configuration parameter, and further configured to execute on a video processing unit within the graphics processing unit.
13. The video encoder of claim 12, wherein the deblocking filter acceleration instructions are associated with a [illegible] filter.

14. [Illegible in source.]

15. [Partly illegible in source.] The deblocking filter acceleration instructions [...] a first memory source for one pixel data block and a second memory source for another pixel data block, the first and second memory sources being in the graphics processing unit.

16. The video encoder of claim 15, wherein the first and second memory sources are both a texture cache in the graphics processing unit.

17. The video encoder of claim 15, wherein the first memory source is a texture cache in the graphics processing unit, and the second memory source is an execution unit in the graphics processing unit.

18. The video encoder of claim 15, wherein the first and second memory sources are both an execution unit in the graphics processing unit.

19. The video encoder of claim 12, wherein the deblocking filter acceleration instructions are associated with an H.264 video decoder.
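
The source selection recited in claims 1 and 8 can be pictured with a small behavioral model. This is only an illustrative sketch and forms no part of the claims: the names `MemSource`, `decode_source`, `load_block` and `placeholder_filter`, the use of bit 0 as the source-select parameter, and the 16-sample block size are all assumptions made for the example, and the smoothing function is a stand-in rather than the H.264 deblocking filter.

```python
from enum import Enum


class MemSource(Enum):
    """The two on-GPU memory sources named in the claims (encoding assumed)."""
    TEXTURE_CACHE = 0
    EXECUTION_UNIT = 1


BLOCK_PIXELS = 16  # assumed block size: 4x4 samples stored as a flat list


def decode_source(instr: int) -> MemSource:
    """Assumed layout: bit 0 of the instruction word selects the memory source."""
    return MemSource(instr & 1)


def load_block(instr: int, sources: dict) -> list:
    """Load a pixel block from whichever source the instruction parameter names."""
    return list(sources[decode_source(instr)])


def placeholder_filter(block: list) -> list:
    """Stand-in filter: [1, 2, 1] / 4 smoothing of interior samples with rounding.

    This is NOT the codec's deblocking filter; it only marks where one is applied.
    """
    out = list(block)
    for i in range(1, len(block) - 1):
        out[i] = (block[i - 1] + 2 * block[i] + block[i + 1] + 2) >> 2
    return out


def run(first_instr: int, second_instr: int, sources: dict) -> tuple:
    """Two instructions, each naming its own source, each block filtered in turn."""
    first = placeholder_filter(load_block(first_instr, sources))
    second = placeholder_filter(load_block(second_instr, sources))
    return first, second


# Example: first block comes from the texture cache, second from the execution unit.
sources = {
    MemSource.TEXTURE_CACHE: [10] * BLOCK_PIXELS,
    MemSource.EXECUTION_UNIT: [200] * BLOCK_PIXELS,
}
first, second = run(0b0, 0b1, sources)
```

Passing the same select bit in both instructions models the cases where both blocks come from the texture cache or both from the execution unit; passing different bits models the mixed case.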