JP4000880B2

JP4000880B2 - Image processing apparatus and method

Info

Publication number: JP4000880B2
Application number: JP2002091032A
Authority: JP
Inventors: 仁佐藤
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2002-03-28
Filing date: 2002-03-28
Publication date: 2007-10-31
Anticipated expiration: 2022-03-28
Also published as: JP2003288610A

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の演算処理装置が処理データを共有して並列処理を行う画像処理装置およびその方法に関するものである。
【０００２】
【従来の技術】
近年、３次元コンピュータグラフィックス（３ＤＣｏｍｐｕｔｅｒＧｒａｐｈｉｃｓ）をハードウェアで高速に実行するグラフィックスＬＳＩの普及は著しく、特にゲーム機やパーソナルコンピュータ（ＰＣ）では、このグラフィックスＬＳＩを標準で搭載しているものが多い。
また、グラフィックスＬＳＩにおける技術的進歩は早く、「ＤｉｒｅｃｔＸ」で採用された「ＶｅｒｔｅｘＳｈａｄｅｒ」や「ＰｉｘｅｌＳｈａｄｅｒ」に代表される機能面での拡張が続けられているとともに、ＣＰＵを上回るペースで性能が向上している。
【０００３】
グラフィックスＬＳＩの性能を向上させるには、ＬＳＩの動作周波数を上げるだけではなく、並列処理の手法を利用することが有効である。並列処理の手法を大別すると以下のようになる。
第１は領域分割による並列処理法であり、第２はプリミティブレベルでの並列処理法であり、第３はピクセルレベルでの並列処理法である。
【０００４】
上記分類は並列処理の粒度に基づいており、領域分割並列処理の粒度が最もあらく、ピクセルレベル並列処理の粒度が最も細かい。それぞれの手法の概要を以下に述べる。
【０００５】
領域分割による並列処理
画面を複数の矩形領域に分割し、複数の処理ユニットそれぞれが担当する領域を割り当てながら並列処理する手法である。
【０００６】
プリミティブレベルでの並列処理
複数の処理ユニットに別々のプリミティブ（たとえば三角形）を与えて並列動作させる手法である。
プリミティブレベルでの並列化処理について概念的に示したものを図１に示す。
図１において、ＰＭ０〜ＰＭｎ−１がそれぞれ異なるプリミティブを示し、ＰＵ０〜ＰＵｎ−１が処理ユニット、ＭＭ０〜ＭＭｎ−１がメモリモジュールをそれぞれ示している。
各処理ユニットＰＵ０〜ＰＵｎ−１に比較的均等な大きさのプリミティブＰＭ０〜ＰＭｎ−１が与えられているときには、各処理ユニットＰＵ０〜ＰＵｎ−１に対する負荷のバランスがとれ、効率的並列処理が行える。
【０００７】
ピクセルレベルでの並列処理
最も粒度の細かい並列処理の手法である。
図２は、ピクセルレベルでの並列処理の手法に基づくプリミティブレベルでの並列化処理について概念的に示す図である。
図２のように、ピクセルレベルでの並列処理の手法では三角形をラスタライズする際に、２×８のマトリクス状に配列されたピクセルからなるピクセルスタンプ（ＰｉｘｅｌＳｔａｍｐ）ＰＳと呼ばれる矩形領域単位にピクセルが生成される。
図２の例では、ピクセルスタンプＰＳ０からピクセルスタンプＰＳ７までの合計８個のピクセルスタンプが生成されている。これらピクセルスタンプＰＳ０〜ＰＳ７に含まれる最大１６個のピクセルが同時に処理される。
この手法は、他の手法に比べ粒度が細かい分、並列処理の効率が良い。
【０００８】
【発明が解決しようとする課題】
しかしながら、上述した領域分割による並列処理の場合、各処理ユニットを効率良く並列動作させるためには、各領域に描画されるべきオブジェクトをあらかじめ分類する必要があり、シーンデータ解析の負荷が重い。
また、１フレーム分のシーンデータが全て揃った上で描画を開始するのではなく、オブジェクトデータが与えられると即描画を開始するいわゆるイミーディエートモードでの描画を行う際には並列性を引き出すことができない。
【０００９】
また、プリミティブレベルでの並列処理の場合、実際には、オブジェクトを構成するプリミティブＰＭ０〜ＰＭｎ−１の大きさにはバラツキがあることから、処理ユニットＰＵ０〜ＰＵｎ−１ごとに一つのプリミティブを処理する時間に差が生じる。この差が大きくなった際には、処理ユニットが描画する領域も大きく異なり、データのローカリティが失われるので、メモリモジュールを構成するたとえばＤＲＡＭのページミスが頻発し性能が低下する。
また、この手法の場合には、配線コストが高いという問題点もある。一般に、グラフィックス処理を行うハードウェアでは、メモリのバンド幅を広げるために、複数メモリモジュールを用いてメモリインターリーブを行う。
その際、図１に示すように、各処理ユニットＰＵ０〜ＰＵｎ−１と各内蔵メモリモジュールＭＭ０〜ＭＭｎ−１を全て結ぶ必要がある。
【００１０】
一方、ピクセルレベルでの並列処理の場合、上述したように、粒度が細かい分、並列処理の効率が良いという利点があり、実際のフィルタリングを含む処理としては図３に示すような手順で行われている。
【００１１】
すなわち、ＤＤＡ（ＤｉｇｉｔａｌＤｉｆｆｅｒｅｎｔｉａｌＡｎａｌｙｚｅｒ）パラメータ、たとえばラスタライゼーション（Ｒａｓｔｅｒｉｚａｔｉｏｎ）に必要な各種データ（Ｚ、テクスチャ座標、カラーなど）の傾き等のＤＤＡパラメータを算出する（ＳＴ１）。
次に、メモリからテクスチャデータを読み出し（ＳＴ２）、サブワード再配置処理を行った後（ＳＴ３）、クロスバー回路により各処理ユニットにグローバル分配する（ＳＴ４）。
次に、テクスチャフィルタリング（ＴｅｘｔｕｒｅＦｉｌｔｅｒｉｎｇ）を行う（ＳＴ５）。この場合、処理ユニットＰＵ０〜ＰＵ３は、読み出されたテクスチャデータと、（ｕ，ｖ）アドレスは算出時に得た小数部を使って４近傍補間などのフィルタリング処理を行う。
次に、ピクセルレベルの処理（Ｐｅｒ−ＰｉｘｅｌＯｐｅｒａｔｉｏｎ）、具体的には、フィルタリング後のテクスチャデータと、ラスタライズ後の各種データを用いて、ピクセル単位の演算を行う（ＳＴ５）。
そして、ピクセルレベルの処理における各種テストをパスしたピクセルデータを、メモリモジュールＭＭ０〜ＭＭ３上のフレームバッファおよびＺバッファに描画する（ＳＴ６）。
【００１２】
ところで、テクスチャリード系のメモリアクセスは、描画系のメモリアクセスとは異なるため、他のモジュールに属すメモリからの読み出しが必要となる。
したがって、テクスチャリード系のメモリアクセスに関しては、上述したようにクロスバー回路のような配線を必要とする。
【００１３】
しかしながら、従来の画像処理装置では、上述したように、クロスバー回路により各処理ユニットにグローバル分配し、その後、テクスチャフィルタリングを行っていることから、グローバル分配するデータ量が多く（たとえば４Ｔｂｐｓ）、グローバルバスとしてのクロスバー回路が大型化し、配線遅延の観点等から処理の高速化の妨げとなるという不利益がある。
【００１４】
本発明は、かかる事情に鑑みてなされたものであり、その目的、クロスバー回路の小型化を図れ、また処理の高速化を図れる画像処理装置およびその方法を提供することにある。
【００１５】
【課題を解決するための手段】
上記目的を達成するため、本発明の第１の観点は、複数の処理モジュールが処理データを共有して並列処理を行う画像処理装置であって、上記複数の処理モジュールは、少なくともフィルタリング処理に関するデータが記憶されるメモリモジュールと、フィルタリング処理用テクスチャデータを得るとともに、処理データに基づいてあらかじめ対応するメモリインターリーブで決められた担当する処理を行う処理回路と、自処理モジュールがピクセルを担当する場合に、上記処理回路で得られた担当する処理データおよび各処理モジュールにおける担当するテクセルのみのフィルタ値の部分演算データを受けて、各部分演算データを加算して最終的なフィルタリング後のテクスチャ値を求める演算処理を行う第１の演算器と、上記処理回路で得られた自処理モジュールが担当するフィルタリング処理用データおよび上記メモリモジュールに記憶されているフィルタリング処理に関するデータに基づいて担当するテクセルのみのフィルタ値の部分演算を行い、上記第１の演算器による演算処理データを受けて上記メモリモジュールに対して描画する複数の第２の演算器と、を有し、さらに、上記各処理モジュールの複数の第１の演算器と複数の第２の演算器間を相互に接続するグローバルバスであって、各処理モジュールにおいて上記処理回路で得られたフィルタリング処理用データを同一の処理モジュールの第２の演算器に供給し、各処理モジュールの第２の演算器によるフィルタリング後の上記部分演算データを、ピクセルを担当する処理モジュールの第１の演算器に集約し、当該第１の演算器による演算処理データを第２の演算器に供給するクロスバー回路を有する。
【００１６】
第１の観点では、フレームバッファおよびテクスチャバッファ共に、全メモリモジュールにインターリーブされている。
【００１７】
また、第１の観点では、プリミティブの頂点データに対する演算を行い、１プリミティブをセットアップして、各処理モジュールの処理回路にそれぞれ担当するデータを出力するセットアップ回路を有する。
【００１８】
また、好適には、上記セットアップ回路は、全テクスチャ情報を各処理モジュールの処理回路に分配する。
【００１９】
本発明の第２の観点は、複数の処理モジュールが処理データを共有して並列処理を行う画像処理方法であって、各処理モジュールにおいて、フィルタリング処理用テクスチャデータを得るとともに、処理データに基づいてあらかじめ対応するメモリインターリーブで決められた担当する処理を行い、各処理モジュールにおいて、自処理モジュールが担当するフィルタリング処理用データおよび上記メモリモジュールに記憶されているフィルタリング処理に関するデータに基づいて担当するテクセルのみのフィルタ値の部分演算を行い、各処理モジュールにおける上記部分演算データを、グローバルバスを介してピクセルを担当する処理モジュールに集約し、上記部分演算データが集約された処理モジュールにおいて、各部分演算データを加算して最終的なフィルタリング後のテクスチャ値を求め、当該最終的な演算処理データを上記メモリモジュールに対して描画する。
【００２０】
第２の観点では、フレームバッファおよびテクスチャバッファ共に、全メモリモジュールにインターリーブされる。
【００２１】
また、第２の観点では、全テクスチャ情報を各処理モジュールに分配する。
【００２２】
本発明によれば、たとえばセットアップ回路において、頂点データに対する演算が行われ、１プリミティブがセットアップされ、各処理モジュールにそれぞれ担当テクスチャ分のセットアップ情報が出力される。
各処理モジュールにおける処理回路では、セットアップ回路による情報に基づいてたとえばＤＤＡパラメータ、具体的には、ラスタライゼーションに必要な各種データ（Ｚ、テクスチャ座標、カラーなど）の傾き等のＤＤＡパラメータが算出される。
また、各処理回路では、パラメータデータに基づいて、たとえば三角形が自分が担当する領域であるか否かを判断し、担当領域である場合には、ラスタライゼーションが行われる。
さらに、各処理回路では、ＬＯＤ計算によるミップマップレベルの算出や、テクスチャアクセスのための（ｕ，ｖ）アドレス計算が行われる。
【００２３】
そして、各処理回路では、得られたテクスチャ座標や、テクスチャアクセスのためのアドレス情報等が第２の演算器に出力される。
一方、各処理回路では、得られたテクスチャ以外のカラー等の情報が第１の演算器にそれぞれ供給される。
そして、各処理モジュールの第２の演算器では、処理回路から供給されたテクスチャに関する座標データやアドレスデータを受けて、メモリモジュールからテクスチャデータが読み出されて、読み出されたテクスチャデータと、（ｕ，ｖ）アドレスは算出時に得た小数部を使って４近傍補間などのテクスチャフィルタリングが行われる。このとき、各処理モジュールの第２の演算器では、担当するテクセルのみのフィルタ値の部分演算が行われる。
この第２の演算器によるフィルタリング後の各部分演算データが、クロスバー回路を介して、ピクセルを担当する処理モジュールの第１の演算器に供給される。
この処理モジュールの第１の演算器では、各処理モジュールにおける担当するテクセルのみのフィルタ値の部分演算データを受けて、各部分演算データを加算して最終的なフィルタリング後のテクスチャ値を求める演算処理が行われ、その結果が第２の演算器に出力される。
そして、第２の演算器では、第１の演算器により供給されたピクセルレベルの処理結果を受けて、ピクセルレベルの処理における各種テストをパスしたピクセルデータをメモリモジュールに描画される。
以上の処理が各モジュールで並列的に行われる。
【００２４】
【発明の実施の形態】
図４は、本発明の係る画像処理装置の一実施形態を示すブロック構成図である。
【００２５】
本実施形態に係る画像処理装置１０は、図４に示すように、セットアップ回路１１、処理モジュール１２−０〜１２−３、およびクロスバー回路１３を有している。
【００２６】
本画像処理装置１０では、セットアップ回路１１に対して複数個、本実施形態では４個の処理モジュール１２−０〜１２−３が並列に接続されて、複数の処理モジュール１２−０〜１２−３で処理データを共有し並列に処理する。
そして、テクスチャリード系に関しては、他の処理モジュールに対するメモリアクセスを必要とするが、このアクセスにはグローバルアクセスバスとしてのクロスバー回路１３が用いられる。
【００２７】
以下に各構成要素の構成および機能について、図面に関連付けて順を追って説明する。
【００２８】
セットアップ回路１１は、ＣＰＵや外部メモリとのデータの授受、並びに各処理モジュール１２−０〜１２−３とのデータの授受を司るとともに、頂点データに対する演算を行い、１プリミティブをセットアップして、各処理モジュール１２−０〜１２−３にそれぞれ担当テクスチャ分のセットアップ情報を出力する。
具体的には、セットアップ回路１１は、データが入力されると、Ｐｅｒ−Ｖｅｒｔｅｘオペレーションを行う。
この処理においては、３次元座標、法線ベクトル、テクスチャ座標の各頂点データが入力されると、頂点データに対する演算が行われる。代表的な演算としては、物体の変形やスクリーンへの投影などを行う座標変換の演算処理、ライティング（Ｌｉｇｈｔｉｎｇ）の演算処理、クリッピング（Ｃｌｉｐｐｉｎｇ）の演算処理がある。
【００２９】
処理モジュール１２−０は、処理回路としてのＤＤＡ（ＤｉｇｉｔａｌＤｉｆｆｅｒｅｎｔｉａｌＡｎａｌｙｚｅｒ）回路１２１−０、第１の演算器（演算器１）１２２−０、第２の演算器（演算器２）１２３−０、およびたとえばＤＲＡＭからなるメモリモジュール（ＭＥＭ）１２４−０を有している。
【００３０】
同様に、処理モジュール１２−１は、処理回路としてのＤＤＡ回路１２１−１、第１の演算器（演算器１）１２２−１、第２の演算器（演算器２）１２３−１、およびたとえばＤＲＡＭからなるメモリモジュール（ＭＥＭ）１２４−１を有している。
処理モジュール１２−２は、処理回路としてのＤＤＡ回路１２１−２、第１の演算器（演算器１）１２２−２、第２の演算器（演算器２）１２３−２、およびたとえばＤＲＡＭからなるメモリモジュール（ＭＥＭ）１２４−２を有している。
処理モジュール１２−３は、処理回路としてのＤＤＡ回路１２１−３、第１の演算器（演算器１）１２２−３、第２の演算器（演算器２）１２３−３、およびたとえばＤＲＡＭからなるメモリモジュール（ＭＥＭ）１２４−３を有している。
【００３１】
そして、各処理モジュール１２−０〜１２−３の第１の演算器１２２−０〜１２２−３と第２の演算器１２３−０〜１２３−３が、後で詳述するように、クロスバー回路１３を介して相互に接続されている。
【００３２】
図５は、本実施形態に係る画像処理装置の基本的なアーキテクチャおよび処理フローを示す図である。なお、図５において、丸印を付した矢印はテクスチャに関するデータの流れを示し、丸印を付していない矢印はピクセルに関するデータの流れを示している。
【００３３】
本実施形態では、各処理モジュール１２−０〜１２−３は、メモリモジュール１２４−０〜１２４−３が所定の大きさ、たとえば４×４の矩形領域単位にインターリーブされている。
具体的には、図５に示すように、いわゆるフレームバッファおよびテクスチャバッファ共に、全メモリモジュールにインターリーブされている。
【００３４】
処理モジュール１２−０におけるＤＤＡ回路１２１−０は、セットアップ回路１１による情報に基づいてＤＤＡパラメータを計算する。
この処理では、ラスタライゼーション（Ｒａｓｔｅｒｉｚａｔｉｏｎ）に必要な各種データ（Ｚ、テクスチャ座標、カラーなど）の傾き等のＤＤＡパラメータを算出する。
また、ＤＤＡ回路１２１−０は、パラメータデータに基づいて、たとえば三角形が自分が担当する領域であるか否かを判断し、担当領域である場合には、ラスタライゼーション（Ｒａｓｔｅｒｉｚａｔｉｏｎ）を行う。
具体的には、その三角形が自分が担当する領域、たとえば４×４ピクセルの矩形領域単位でインターリーブされた領域に属しているか否かを判断し、属している場合には、各種データ（Ｚ、テクスチャ座標、カラーなど）をラスタライズする。この場合、生成単位は、１ローカルモジュール当たり１サイクルで２×２ピクセルである。
次に、ＤＤＡ回路１２１−０は、テクスチャ座標のパースペクティブコレクション（ＰｅｒｓｐｅｃｔｉｖｅＣｏｒｒｅｃｔｉｏｎ）を行う。また、この処理ステージにはＬＯＤ（ＬｅｖｅｌｏｆＤｅｔａｉｌ）計算によるミップマップ（ＭｉｐＭａｐ）レベルの算出や、テクスチャアクセスのための（ｕ，ｖ）アドレス計算も含まれる。
【００３５】
そして、ＤＤＡ回路１２１−０は、たとえば図６に示すように、テクスチャ系のＤＤＡ部１２１１によりテクスチャ座標や、テクスチャアクセスのためのアドレス情報等のテクスチャ用処理を行い、テクスチャに関する情報を第１の演算器１２２−０、クロスバー回路１３を介して第２の演算器１２３−０に出力する。
一方、ＤＤＡ回路１２１−０は、テクスチャ以外のカラー等の処理はその他のＤＤＡ部１２１２で行って第１の演算器１２２−０に出力する。
本実施形態においては、ＤＤＡ回路１２１（−０〜３）には、その他のＤＤＡ部１２１２のデータ入力側にのみＦＩＦＯ（Ｆｉｒｓｔ−ＩｎＦｉｒｓｔ−Ｏｕｔ）を設け、テクスチャ系のフィルタリング処理の時間を考慮した時間調整を行っている。
また、テクスチャ系のＤＤＡ部１２１１は全ピクセルに関する担当するテクスチャのデータを発生し、その他のＤＤＡ部１２１２はメモリインターリーブによる担当部分にのみ発生する。
【００３６】
第１の演算器１２２−０は、ＤＤＡ回路１２１−０により供給されたテクスチャ情報以外のデータ、および、クロスバー回路１３を介して受信する各処理モジュール１２−０〜１２−３の第２の演算器１２３−０〜１２３−３でテクスチャフィルタリング処理後の部分演算データを受けて、各部分演算データを加算して最終的なフィルタリング後のテクスチャ値を求める演算処理を行うピクセルレベルの処理（Ｐｅｒ−ＰｉｘｅｌＯｐｅｒａｔｉｏｎ）を行い、その結果をクロスバー回路１３を介して第２の演算器１２３−０に出力する。
このピクセルレベルの処理においては、フィルタリング後のテクスチャデータと、ラスタライズ後の各種データを用いて、ピクセル単位の演算が行われる。ここで行われる処理は、ピクセルレベルでのライティング（Ｐｅｒ−ＰｉｘｅｌＬｉｇｈｔｉｎｇ）などいわゆるＰｉｘｅｌＳｈａｄｅｒに相当する。
【００３７】
第２の演算器１２３−０は、ＤＤＡ回路１２１−０から供給されたテクスチャに関する座標データやアドレスデータを受けて、メモリモジュール１２４−０からテクスチャデータを読み出し、テクスチャフィルタリング（ＴｅｘｔｕｒｅＦｉｌｔｅｒｉｎｇ）を行う。
このとき第２の演算器１２３−０は、担当するテクセルのみのフィルタ値の部分演算のみを行い、フィルタリング後の部分演算データを、クロスバー回路１３を介して、スタンプに対応するフレームバッファを持つ処理モジュールの第１の演算器１２２−０〜１２２−３のいずれかに出力する。
この場合、第２の演算器１２３−０は、読み出されたテクスチャデータと、（ｕ，ｖ）アドレスは算出時に得た小数部を使って４近傍補間などのフィルタリング処理を行う。
また、第２の演算器１２３−０は、第１の演算器１２２−０により供給されたピクセルレベルの処理結果を受けて、ピクセルレベルの処理における各種テストをパスしたピクセルデータをメモリモジュール１２４−０に描画する。
【００３８】
処理モジュール１２−１におけるＤＤＡ回路１２１−１は、セットアップ回路１１による情報に基づいてＤＤＡパラメータ、具体的には、ラスタライゼーションに必要な各種データ（Ｚ、テクスチャ座標、カラーなど）の傾き等のＤＤＡパラメータを算出する。
また、ＤＤＡ回路１２１−１は、パラメータデータに基づいて、たとえば三角形が自分が担当する領域であるか否かを判断し、担当領域である場合には、ラスタライゼーションを行う。
具体的には、その三角形が自分が担当する領域、たとえば４×４ピクセルの矩形領域単位でインターリーブされた領域に属しているか否かを判断し、属している場合には、各種データ（Ｚ、テクスチャ座標、カラーなど）をラスタライズする。この場合、生成単位は、１ローカルモジュール当たり１サイクルで２×２ピクセルである。
次に、ＤＤＡ回路１２１−１は、テクスチャ座標のパースペクティブコレクション（ＰｅｒｓｐｅｃｔｉｖｅＣｏｒｒｅｃｔｉｏｎ）を行う。また、この処理ステージにはＬＯＤ計算によるミップマップレベルの算出や、テクスチャアクセスのための（ｕ，ｖ）アドレス計算も含まれる。
【００３９】
そして、ＤＤＡ回路１２１−１は、たとえば図６に示すように、テクスチャ系のＤＤＡ部１２１１によりテクスチャ座標や、テクスチャアクセスのためのアドレス情報等のテクスチャ用処理を行い、テクスチャに関する情報を第１の演算器１２２−１、クロスバー回路１３を介して第２の演算器１２３−１に出力する。
一方、ＤＤＡ回路１２１−１は、テクスチャ以外のカラー等の処理はその他のＤＤＡ部１２１２で行って第１の演算器１２２−１に出力する。
【００４０】
第１の演算器１２２−１は、ＤＤＡ回路１２１−１により供給されたテクスチャ情報以外のデータ、および、クロスバー回路１３を介して受信する各処理モジュール１２−０〜１２−３の第２の演算器１２３−０〜１２３−３でテクスチャフィルタリング処理後の部分演算データを受けて、各部分演算データを加算して最終的なフィルタリング後のテクスチャ値を求める演算処理を行うピクセルレベルの処理を行い、その結果をクロスバー回路１３を介して第２の演算器１２３−１に出力する。
このピクセルレベルの処理においては、フィルタリング後のテクスチャデータと、ラスタライズ後の各種データを用いて、ピクセル単位の演算が行われる。ここで行われる処理は、ピクセルレベルでのライティングなどいわゆるＰｉｘｅｌＳｈａｄｅｒに相当する。
【００４１】
第２の演算器１２３−１は、ＤＤＡ回路１２１−１から供給されたテクスチャに関する座標データやアドレスデータを受けて、メモリモジュール１２４−１からテクスチャデータを読み出し、テクスチャフィルタリング（ＴｅｘｔｕｒｅＦｉｌｔｅｒｉｎｇ）を行う。
このとき第２の演算器１２３−１は、担当するテクセルのみのフィルタ値の部分演算のみを行い、フィルタリング後の部分演算データを、クロスバー回路１３を介して、スタンプに対応するフレームバッファを持つ処理モジュールの第１の演算器１２２−０〜１２２−３のいずれかに出力する。
この場合、第２の演算器１２３−１は、読み出されたテクスチャデータと、（ｕ，ｖ）アドレスは算出時に得た小数部を使って４近傍補間などのフィルタリング処理を行う。
また、第２の演算器１２３−１は、第１の演算器１２２−１により供給されたピクセルレベルの処理結果を受けて、ピクセルレベルの処理における各種テストをパスしたピクセルデータをメモリモジュール１２４−１に描画する。
【００４２】
処理モジュール１２−２におけるＤＤＡ回路１２１−２は、セットアップ回路１１による情報に基づいてＤＤＡパラメータ、具体的には、ラスタライゼーションに必要な各種データ（Ｚ、テクスチャ座標、カラーなど）の傾き等のＤＤＡパラメータを算出する。
また、ＤＤＡ回路１２１−２は、パラメータデータに基づいて、たとえば三角形が自分が担当する領域であるか否かを判断し、担当領域である場合には、ラスタライゼーションを行う。
具体的には、その三角形が自分が担当する領域、たとえば４×４ピクセルの矩形領域単位でインターリーブされた領域に属しているか否かを判断し、属している場合には、各種データ（Ｚ、テクスチャ座標、カラーなど）をラスタライズする。この場合、生成単位は、１ローカルモジュール当たり１サイクルで２×２ピクセルである。
次に、ＤＤＡ回路１２１−２は、テクスチャ座標のパースペクティブコレクション（ＰｅｒｓｐｅｃｔｉｖｅＣｏｒｒｅｃｔｉｏｎ）を行う。また、この処理ステージにはＬＯＤ計算によるミップマップレベルの算出や、テクスチャアクセスのための（ｕ，ｖ）アドレス計算も含まれる。
【００４３】
そして、ＤＤＡ回路１２１−２は、たとえば図６に示すように、テクスチャ系のＤＤＡ部１２１１によりテクスチャ座標や、テクスチャアクセスのためのアドレス情報等のテクスチャ用処理を行い、テクスチャに関する情報を第１の演算器１２２−２、クロスバー回路１３を介して第２の演算器１２３−２に出力する。
一方、ＤＤＡ回路１２１−２は、テクスチャ以外のカラー等の処理はその他のＤＤＡ部１２１２で行って第１の演算器１２２−２に出力する。
【００４４】
第１の演算器１２２−２は、ＤＤＡ回路１２１−２により供給されたテクスチャ情報以外のデータ、および、クロスバー回路１３を介して受信する各処理モジュール１２−０〜１２−３の第２の演算器１２３−０〜１２３−３でテクスチャフィルタリング処理後の部分演算データを受けて、各部分演算データを加算して最終的なフィルタリング後のテクスチャ値を求める演算処理を行うピクセルレベルの処理を行い、その結果をクロスバー回路１３を介して第２の演算器１２３−２に出力する。
このピクセルレベルの処理においては、フィルタリング後のテクスチャデータと、ラスタライズ後の各種データを用いて、ピクセル単位の演算が行われる。ここで行われる処理は、ピクセルレベルでのライティングなどいわゆるＰｉｘｅｌＳｈａｄｅｒに相当する。
【００４５】
第２の演算器１２３−２は、ＤＤＡ回路１２１−２から供給されたテクスチャに関する座標データやアドレスデータを受けて、メモリモジュール１２４−２からテクスチャデータを読み出し、テクスチャフィルタリング（ＴｅｘｔｕｒｅＦｉｌｔｅｒｉｎｇ）を行う。
このとき第２の演算器１２３−２は、担当するテクセルのみのフィルタ値の部分演算のみを行い、フィルタリング後の部分演算データを、クロスバー回路１３を介して、スタンプに対応するフレームバッファを持つ処理モジュールの第１の演算器１２２−０〜１２２−３のいずれかに出力する。
この場合、第２の演算器１２３−２は、読み出されたテクスチャデータと、（ｕ，ｖ）アドレスは算出時に得た小数部を使って４近傍補間などのフィルタリング処理を行う。
また、第２の演算器１２３−２は、第１の演算器１２２−２により供給されたピクセルレベルの処理結果を受けて、ピクセルレベルの処理における各種テストをパスしたピクセルデータをメモリモジュール１２４−２に描画する。
【００４６】
処理モジュール１２−３におけるＤＤＡ回路１２１−３は、セットアップ回路１１による情報に基づいてＤＤＡパラメータ、具体的には、ラスタライゼーションに必要な各種データ（Ｚ、テクスチャ座標、カラーなど）の傾き等のＤＤＡパラメータを算出する。
また、ＤＤＡ回路１２１−３は、パラメータデータに基づいて、たとえば三角形が自分が担当する領域であるか否かを判断し、担当領域である場合には、ラスタライゼーションを行う。
具体的には、その三角形が自分が担当する領域、たとえば４×４ピクセルの矩形領域単位でインターリーブされた領域に属しているか否かを判断し、属している場合には、各種データ（Ｚ、テクスチャ座標、カラーなど）をラスタライズする。この場合、生成単位は、１ローカルモジュール当たり１サイクルで２×２ピクセルである。
次に、ＤＤＡ回路１２１−３は、テクスチャ座標のパースペクティブコレクション（ＰｅｒｓｐｅｃｔｉｖｅＣｏｒｒｅｃｔｉｏｎ）を行う。また、この処理ステージにはＬＯＤ計算によるミップマップレベルの算出や、テクスチャアクセスのための（ｕ，ｖ）アドレス計算も含まれる。
【００４７】
そして、ＤＤＡ回路１２１−３は、たとえば図６に示すように、テクスチャ系のＤＤＡ部１２１１によりテクスチャ座標や、テクスチャアクセスのためのアドレス情報等のテクスチャ用処理を行い、テクスチャに関する情報を第１の演算器１２２−３、クロスバー回路１３を介して第２の演算器１２３−３に出力する。
一方、ＤＤＡ回路１２１−３は、テクスチャ以外のカラー等の処理はその他のＤＤＡ部１２１２で行って第１の演算器１２２−３に出力する。
【００４８】
第１の演算器１２２−３は、ＤＤＡ回路１２１−３により供給されたテクスチャ情報以外のデータ、および、クロスバー回路１３を介して受信する各処理モジュール１２−０〜１２−３の第２の演算器１２３−０〜１２３−３でテクスチャフィルタリング処理後の部分演算データを受けて、各部分演算データを加算して最終的なフィルタリング後のテクスチャ値を求める演算処理を行うピクセルレベルの処理を行い、その結果をクロスバー回路１３を介して第２の演算器１２３−３に出力する。
このピクセルレベルの処理においては、フィルタリング後のテクスチャデータと、ラスタライズ後の各種データを用いて、ピクセル単位の演算が行われる。ここで行われる処理は、ピクセルレベルでのライティングなどいわゆるＰｉｘｅｌＳｈａｄｅｒに相当する。
【００４９】
第２の演算器１２３−３は、ＤＤＡ回路１２１−３から供給されたテクスチャに関する座標データやアドレスデータを受けて、メモリモジュール１２４−３からテクスチャデータを読み出し、テクスチャフィルタリング（ＴｅｘｔｕｒｅＦｉｌｔｅｒｉｎｇ）を行う。
このとき第２の演算器１２３−３は、担当するテクセルのみのフィルタ値の部分演算のみを行い、フィルタリング後の部分演算データを、クロスバー回路１３を介して、スタンプに対応するフレームバッファを持つ処理モジュールの第１の演算器１２２−０〜１２２−３のいずれかに出力する。
この場合、第２の演算器１２３−３は、読み出されたテクスチャデータと、（ｕ，ｖ）アドレスは算出時に得た小数部を使って４近傍補間などのフィルタリング処理を行う。
また、第２の演算器１２３−３は、第１の演算器１２２−３により供給されたピクセルレベルの処理結果を受けて、ピクセルレベルの処理における各種テストをパスしたピクセルデータをメモリモジュール１２４−３に描画する。
【００５０】
図７は、本実施形態に係るクロスバー回路のグローバルバス系統の一構成例を示す図である。
このクロスバー回路１３は、図７に示すように、４本のテクスチャラインを１グループとして、４グループの第１〜第４の配線群ＧＲＰ０〜ＧＲＰ３を有している。
第１の配線群ＧＲＰ０は４本の配線ｔｅｘ００〜ｔｅｘ０３を有し、第２の配線群ＧＲＰ１は４本の配線ｔｅｘ１０〜ｔｅｘ１３を有し、第３の配線群ＧＲＰ２は４本の配線ｔｅｘ２０〜ｔｅｘ２３を有し、第４の配線群ＧＲＰ３は４本の配線ｔｅｘ３０〜ｔｅｘ３３を有している。
そして、処理モジュール１２−０の第２の演算器１２３−０の端子が第１の配線群ＧＲＰ０の配線ｔｅｘ００、第２の配線群ＧＲＰ１の配線ｔｅｘ１０、第３の配線群ＧＲＰ２の配線ｔｅｘ２０、第４の配線群ＧＲＰ３の配線ｔｅｘ３０に接続されている。
同様に、処理モジュール１２−１の第２の演算器１２３−１の端子が第１の配線群ＧＲＰ０の配線ｔｅｘ０１、第２の配線群ＧＲＰ１の配線ｔｅｘ１１、第３の配線群ＧＲＰ２の配線ｔｅｘ２１、第４の配線群ＧＲＰ３の配線ｔｅｘ３１に接続されている。
処理モジュール１２−２の第２の演算器１２３−２の端子が第１の配線群ＧＲＰ０の配線ｔｅｘ０２、第２の配線群ＧＲＰ１の配線ｔｅｘ１２、第３の配線群ＧＲＰ２の配線ｔｅｘ２２、第４の配線群ＧＲＰ３の配線ｔｅｘ３２に接続されている。
処理モジュール１２−３の第２の演算器１２３−３の端子が第１の配線群ＧＲＰ０の配線ｔｅｘ０３、第２の配線群ＧＲＰ１の配線ｔｅｘ１３、第３の配線群ＧＲＰ２の配線ｔｅｘ２３、第４の配線群ＧＲＰ３の配線ｔｅｘ３３に接続されている。
【００５１】
そして、第１の配線群ＧＲＰ０の４本の配線ｔｅｘ００〜ｔｅｘ０３が処理モジュール１２−０の第１の演算器１２２−０の各端子に接続されている。
同様に、第２の配線群ＧＲＰ１の４本の配線ｔｅｘ１０〜ｔｅｘ１３が処理モジュール１２−１の第１の演算器１２２−１の各端子に接続されている。
第３の配線群ＧＲＰ２の４本の配線ｔｅｘ２０〜ｔｅｘ２３が処理モジュール１２−２の第１の演算器１２２−２の各端子に接続されている。
第４の配線群ＧＲＰ３の４本の配線ｔｅｘ３０〜ｔｅｘ３３が処理モジュール１２−３の第１の演算器１２２−３の各端子に接続されている。
【００５２】
次に、上記図４の構成による動作を、図５に関連付けて説明する。
【００５３】
まず、セットアップ回路１１において、頂点データに対する演算が行われ、１プリミティブがセットアップされ、各処理モジュール１２−０〜１２−３にそれぞれ担当テクスチャ分のセットアップ情報が出力される。
【００５４】
各処理モジュール１２−０〜１２−３におけるＤＤＡ回路１２１−０〜１２１−３では、セットアップ回路１１による情報に基づいてＤＤＡパラメータ、具体的には、ラスタライゼーションに必要な各種データ（Ｚ、テクスチャ座標、カラーなど）の傾き等のＤＤＡパラメータが算出される。
また、ＤＤＡ回路１２１−０〜１２１−３では、パラメータデータに基づいて、たとえば三角形が自分が担当する領域であるか否かを判断し、担当領域である場合には、ラスタライゼーションが行われる。
さらに、ＤＤＡ回路１２１−０〜１２１−３では、ＬＯＤ計算によるミップマップレベルの算出や、テクスチャアクセスのための（ｕ，ｖ）アドレス計算が行われる。
【００５５】
そして、ＤＤＡ回路１２１−０〜１２１−３では、テクスチャ系のＤＤＡ部１２１１により得られたテクスチャ座標や、テクスチャアクセスのためのアドレス情報等が第１の演算器１２２−０〜１２２−３、クロスバー回路１３を介して第２の演算器１２３−０〜１２３−３に出力される。
一方、ＤＤＡ回路１２１−０〜１２１−３では、その他のＤＤＡ部１２１２で得られたテクスチャ以外のカラー等の情報が第１の演算器１２２−０〜１２２−３にそれぞれ供給される。
【００５６】
そして、各処理モジュール１２−０〜１２−３の第２の演算器１２３−０〜１２３−３では、ＤＤＡ回路１２１−０〜１２１−３から供給されたテクスチャに関する座標データやアドレスデータを受けて、メモリモジュール１２４−０〜１２４−３からテクスチャデータが読み出されて、読み出されたテクスチャデータと、（ｕ，ｖ）アドレスは算出時に得た小数部を使って４近傍補間などの部分テクスチャフィルタリングが行われる。
すなわち、このとき、第２の演算器１２３−０〜１２３−３では、担当するテクセルのみのフィルタ値の部分演算が行われる。
この第２の演算器１２３−０〜１２３−３によるフィルタリング後の各部分演算データが、クロスバー回路１３を介して、スタンプに対応するフレームバッファを持つたとえば処理モジュール１２−１の第１の演算器１２２−１に供給される。
【００５７】
処理モジュール１２−１の第１の演算器１２２−１では、ＤＤＡ回路１２１−１により供給されたテクスチャ情報以外のデータ、および、クロスバー回路１３を介して受信する各処理モジュール１２−０〜１２−３の第２の演算器１２３−０〜１２３−３でテクスチャフィルタリング処理後の部分演算データを受けて、各部分演算データを加算して最終的なフィルタリング後のテクスチャ値を求める演算処理を行うピクセルレベルの処理が行われ、その結果がクロスバー回路１３を介して第２の演算器１２３−１に出力される。
【００５８】
そして、第２の演算器１２３−１では、第１の演算器１２２−１により供給されたピクセルレベルの処理結果を受けて、ピクセルレベルの処理における各種テストをパスしたピクセルデータがメモリモジュール１２４−１に描画される。
【００５９】
以上の処理が各モジュールで並列的に行われる。
【００６０】
以上説明したように、本実施形態によれば、図８に示すように、ＤＤＡ処理後（ＳＴ１１）、メモリからテクスチャデータを読み出し（ＳＴ１２）、サブワード再配置処理を行った後（ＳＴ１３）、テクセルについての部分フィルタリングを行い（ＳＴ１４）、その後、クロスバー回路１３により各処理モジュールの第１の演算器にグローバル分配し（ＳＴ１５）、ピクセルレベルの処理、具体的には、フィルタリング後の部分演算データを受けて、各部分演算データを加算して最終的なフィルタリング後のテクスチャ値を求める演算処理を行うピクセル単位の演算を行い（ＳＴ１６）、ピクセルレベルの処理における各種テストをパスしたピクセルデータを、メモリモジュール上のフレームバッファに描画する（ＳＴ１７）ことから、以下の効果を得ることができる。
【００６１】
すなわち、処理モジュール間のテクセル転送量を削減できることから、グローバルバスとしてのクロスバー回路１３を小型化できる。
また、メモリの配置がグローバルで良い。すなわち、フレームバッファと同じ配置にできることから、フレームバッファをテクスチャメモリとして用いることが可能である。また、サイズの違うマップも効率的に配置することができる。
また、特定のテクセルに対するキャッシュが１箇所だけになることから、キャッシュの利用効率が高い。
また、処理モジュール間で４×４単位のインターリーブとした場合、２×２の４近傍補間フィルタで１ピクセル当たり、２５／１６テクセルの転送で処理可能である。
なお、本実施形態によらない場合には、４８／１６テクセルの転送となる。リクエストマージを行っても３２／１６テクセルの転送である。
【００６２】
【発明の効果】
以上説明したように、本発明によれば、複数の処理装置が処理データを共有して並列処理する際に、クロスバー回路の配線本数を削減し、小型化することができる。その結果、設計が容易で、配線コスト、配線遅延を低減できる画像処理装置を実現できる利点がある。
【図面の簡単な説明】
【図１】プリミティブレベルでの並列化処理について概念的に示す図である。
【図２】ピクセルレベルでの並列処理の手法に基づくプリミティブレベルでの並列化処理について概念的に示す図である。
【図３】従来の画像処理装置のテクスチャフィルタリングを含む処理手順を説明するための図である。
【図４】本発明の係る画像処理装置の一実施形態を示すブロック構成図である。
【図５】本実施形態に係る画像処理装置の基本的なアーキテクチャおよび処理フローを示す図である。
【図６】本実施形態に係るＤＤＡ回路の要部の構成例を示す図である。
【図７】本実施形態に係るクロスバー回路の具体的な構成例を示す図である。
【図８】本実施形態に係る画像処理装置の概念的な処理フローを示す図である。
【符号の説明】
１０…画像処理装置、１１…セットアップ回路、１２−０〜１２−３…処理モジュール、１２１−０〜１２１−３…ＤＤＡ回路、１２２−０〜１２２−３…第１の演算器、１２３−０〜１２３−３…第２の演算器、１２４−０〜１２４−３…メモリモジュール、１３…クロスバー回路。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an image processing apparatus in which a plurality of arithmetic processing apparatuses share processing data and perform parallel processing, and a method thereof.
[0002]
[Prior art]
In recent years, graphics LSIs that execute three-dimensional computer graphics (3D Computer Graphics) at high speed in hardware have become widespread. In particular, many game machines and personal computers (PCs) include such a graphics LSI as standard equipment.
In addition, graphics LSI technology is advancing rapidly: functional extensions, typified by the “Vertex Shader” and “Pixel Shader” adopted in “DirectX”, continue to be made, and performance is improving at a pace exceeding that of CPUs.
[0003]
In order to improve the performance of the graphics LSI, it is effective not only to increase the operating frequency of the LSI but also to use a parallel processing technique. The parallel processing methods can be broadly classified as follows.
The first is a parallel processing method by area division, the second is a parallel processing method at a primitive level, and the third is a parallel processing method at a pixel level.
[0004]
The above classification is based on the granularity of parallel processing: region-division parallel processing has the coarsest granularity, and pixel-level parallel processing has the finest. The outline of each method is described below.
[0005]
Parallel processing by area division
This technique divides the screen into a plurality of rectangular areas and processes them in parallel while assigning each of a plurality of processing units the area it is responsible for.
[0006]
Parallel processing at the primitive level
This is a technique in which different primitives (for example, triangles) are given to a plurality of processing units to operate in parallel.
FIG. 1 conceptually shows the parallel processing at the primitive level.
In FIG. 1, PM0 to PMn-1 indicate different primitives, PU0 to PUn-1 indicate processing units, and MM0 to MMn-1 indicate memory modules, respectively.
When primitives PM0 to PMn-1 of relatively equal size are given to the processing units PU0 to PUn-1, the loads on the processing units PU0 to PUn-1 are balanced and efficient parallel processing can be performed.
[0007]
Parallel processing at the pixel level
This is the method of parallel processing with the finest granularity.
FIG. 2 is a diagram conceptually illustrating parallel processing at a primitive level based on a parallel processing technique at a pixel level.
As shown in FIG. 2, in the parallel processing technique at the pixel level, when a triangle is rasterized, pixels are generated in units of rectangular areas called pixel stamps (Pixel Stamp) PS, each consisting of pixels arranged in a 2 × 8 matrix.
In the example of FIG. 2, a total of eight pixel stamps from pixel stamp PS0 to pixel stamp PS7 are generated. A maximum of 16 pixels included in these pixel stamps PS0 to PS7 are processed simultaneously.
This method is more efficient in parallel processing because of its finer granularity than other methods.
[0008]
[Problems to be solved by the invention]
However, in the case of the parallel processing based on the region division described above, in order to efficiently operate each processing unit in parallel, it is necessary to classify objects to be drawn in each region in advance, and the load of scene data analysis is heavy.
In addition, parallelism cannot be exploited when drawing in the so-called immediate mode, in which drawing starts as soon as object data is given rather than after all the scene data for one frame has been assembled.
[0009]
In the case of parallel processing at the primitive level, the primitives PM0 to PMn-1 constituting an object actually vary in size, so the time needed to process one primitive differs among the processing units PU0 to PUn-1. When this difference becomes large, the areas drawn by the processing units also differ greatly and the locality of data is lost, so page misses occur frequently in, for example, the DRAM constituting the memory modules, and performance deteriorates.
In addition, this method has a problem that the wiring cost is high. Generally, hardware that performs graphics processing performs memory interleaving using a plurality of memory modules in order to widen the memory bandwidth.
At that time, as shown in FIG. 1, it is necessary to connect all the processing units PU0 to PUn-1 and all the built-in memory modules MM0 to MMn-1.
[0010]
On the other hand, parallel processing at the pixel level has the advantage, as described above, that its fine granularity makes parallel processing efficient. In practice, processing that includes filtering is performed according to the procedure shown in FIG. 3.
[0011]
That is, DDA (Digital Differential Analyzer) parameters, for example, DDA parameters such as slopes of various data (Z, texture coordinates, color, etc.) necessary for rasterization are calculated (ST1).
Next, the texture data is read from the memory (ST2), the subword rearrangement process is performed (ST3), and then globally distributed to each processing unit by the crossbar circuit (ST4).
Next, texture filtering is performed (ST5). In this case, the processing units PU0 to PU3 perform filtering processing such as 4-neighbor interpolation using the read texture data and the fractional part obtained when the (u, v) address was calculated.
Next, pixel-level processing (Per-Pixel Operation), specifically, pixel-based computation is performed using filtered texture data and various data after rasterization (ST5).
Then, the pixel data that passes the various tests in the pixel level processing is drawn in the frame buffer and the Z buffer on the memory modules MM0 to MM3 (ST6).
[0012]
By the way, since the memory access of the texture read system is different from the memory access of the drawing system, it is necessary to read from the memory belonging to another module.
Therefore, memory access of the texture read system requires wiring such as a crossbar circuit, as described above.
[0013]
However, in the conventional image processing apparatus, as described above, the data is globally distributed to each processing unit by the crossbar circuit and texture filtering is performed only afterwards. The amount of data to be globally distributed is therefore large (for example, 4 Tbps), the crossbar circuit serving as the global bus becomes large, and this hinders faster processing from the viewpoint of wiring delay and the like.
[0014]
The present invention has been made in view of such circumstances, and an object thereof is to provide an image processing apparatus and method for reducing the size of a crossbar circuit and increasing the processing speed.
[0015]
[Means for Solving the Problems]
In order to achieve the above object, a first aspect of the present invention is an image processing apparatus in which a plurality of processing modules share processing data and perform parallel processing, wherein each of the plurality of processing modules includes: a memory module that stores at least data related to filtering processing; a processing circuit that obtains texture data for filtering processing and performs, based on the processing data, the processing assigned to it in advance by the corresponding memory interleaving; a first arithmetic unit that, when its own processing module is in charge of a pixel, receives the assigned processing data obtained by the processing circuit and the partial calculation data of the filter values for only the texels each processing module is in charge of, and adds the partial calculation data to obtain the final filtered texture value; and a second arithmetic unit that performs a partial calculation of the filter value for only the texels it is in charge of, based on the filtering processing data assigned to its own processing module obtained by the processing circuit and the data related to filtering processing stored in the memory module, and that receives the arithmetic processing data from the first arithmetic unit and draws it to the memory module. The apparatus further includes a crossbar circuit serving as a global bus that interconnects the first arithmetic units and the second arithmetic units of the processing modules. The crossbar circuit supplies the filtering processing data obtained by the processing circuit in each processing module to the second arithmetic unit of the same processing module, aggregates the filtered partial calculation data from the second arithmetic units of the processing modules in the first arithmetic unit of the processing module in charge of the pixel, and supplies the arithmetic processing data from that first arithmetic unit to the second arithmetic unit.
[0016]
In the first aspect, both the frame buffer and the texture buffer are interleaved in all memory modules.
[0017]
In addition, the first aspect includes a setup circuit that performs operations on the vertex data of a primitive, sets up one primitive, and outputs to the processing circuit of each processing module the data that module is in charge of.
[0018]
Preferably, the setup circuit distributes all texture information to the processing circuits of the processing modules.
[0019]
A second aspect of the present invention is an image processing method in which a plurality of processing modules share processing data and perform parallel processing, the method comprising: in each processing module, obtaining texture data for filtering processing and performing, based on the processing data, the processing assigned in advance by the corresponding memory interleaving; in each processing module, performing a partial calculation of the filter value for only the texels that module is in charge of, based on the filtering processing data assigned to that processing module and the data related to filtering processing stored in the memory module; aggregating the partial calculation data from the processing modules, via a global bus, in the processing module in charge of the pixel; and, in the processing module in which the partial calculation data has been aggregated, adding the partial calculation data to obtain the final filtered texture value and drawing the final arithmetic processing data to the memory module.
[0020]
In the second aspect, both the frame buffer and the texture buffer are interleaved in all memory modules.
[0021]
In the second aspect, all texture information is distributed to each processing module.
[0022]
According to the present invention, for example, in a setup circuit, an operation is performed on vertex data, one primitive is set up, and setup information for the assigned texture is output to each processing module.
In the processing circuit in each processing module, DDA parameters, specifically DDA parameters such as the slopes of the various data (Z, texture coordinates, color, etc.) necessary for rasterization, are calculated based on, for example, the information from the setup circuit.
Based on the parameter data, each processing circuit also determines, for example, whether a triangle falls within the area the circuit is in charge of, and performs rasterization if it does.
Further, each processing circuit performs calculation of a mipmap level by LOD calculation and (u, v) address calculation for texture access.
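As an illustration of what "DDA parameters such as slopes" can mean, the following minimal sketch computes the screen-space gradients of one per-vertex attribute (for example Z or a texture coordinate) over a triangle and then steps the attribute along a scanline by repeated addition. The function names and the plane-equation formulation are assumptions made for illustration; the description does not specify how the DDA circuit derives its parameters.

```python
def dda_gradients(v0, v1, v2):
    """Return (da/dx, da/dy) for an attribute 'a' given three (x, y, a) vertices."""
    x0, y0, a0 = v0
    x1, y1, a1 = v1
    x2, y2, a2 = v2
    # Twice the signed area of the triangle; zero means it is degenerate.
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("degenerate triangle")
    # Solve the plane equation a = a0 + dadx*(x - x0) + dady*(y - y0).
    dadx = ((a1 - a0) * (y2 - y0) - (a2 - a0) * (y1 - y0)) / det
    dady = ((a2 - a0) * (x1 - x0) - (a1 - a0) * (x2 - x0)) / det
    return dadx, dady


def dda_span(a_start, dadx, count):
    """Step an attribute along a scanline by repeated addition (the DDA idea)."""
    values = []
    a = a_start
    for _ in range(count):
        values.append(a)
        a += dadx
    return values
```

Because the gradients are constant over the triangle, each additional pixel costs only one addition per attribute, which is what makes this style of interpolation attractive in hardware.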
[0023]
In each processing circuit, the obtained texture coordinates, address information for texture access, and the like are output to the second arithmetic unit.
Meanwhile, each processing circuit supplies the obtained information other than texture, such as color, to the first arithmetic unit.
The second arithmetic unit of each processing module receives the texture coordinate data and address data supplied from the processing circuit, reads texture data from the memory module, and performs texture filtering such as 4-neighbor interpolation using the read texture data and the fractional part obtained when the (u, v) address was calculated. At this time, the second arithmetic unit of each processing module performs a partial calculation of the filter value for only the texels it is in charge of.
The partial calculation data filtered by each second arithmetic unit is supplied, via the crossbar circuit, to the first arithmetic unit of the processing module in charge of the pixel.
The first arithmetic unit of this processing module receives the partial calculation data of the filter values for only the texels each processing module is in charge of, adds the partial calculation data to obtain the final filtered texture value, and outputs the result to the second arithmetic unit.
Then, the second arithmetic unit receives the pixel level processing result supplied from the first arithmetic unit, and draws pixel data that has passed various tests in the pixel level processing in the memory module.
The above processing is performed in parallel in each module.
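The division of work described above can be sketched as follows: each module multiplies only the texels held in its own memory by their 4-neighbor (bilinear) weights, and the module in charge of the pixel adds the partial results. This is a minimal sketch only. The ownership rule (4 × 4 blocks assigned to four modules in a 2 × 2 pattern) and all function names are assumptions made for illustration and are not taken from the description.

```python
import math

NUM_MODULES = 4
BLOCK = 4  # assumed interleave unit: 4x4 texels


def owner(x, y):
    """Assumed interleave: 4x4 blocks assigned to modules in a 2x2 pattern."""
    return ((y // BLOCK) % 2) * 2 + ((x // BLOCK) % 2)


def bilinear_taps(u, v):
    """Return the four (x, y, weight) taps of a 4-neighbor (bilinear) filter."""
    x0, y0 = math.floor(u), math.floor(v)
    fu, fv = u - x0, v - y0  # fractional parts of the (u, v) address
    return [
        (x0, y0, (1 - fu) * (1 - fv)),
        (x0 + 1, y0, fu * (1 - fv)),
        (x0, y0 + 1, (1 - fu) * fv),
        (x0 + 1, y0 + 1, fu * fv),
    ]


def partial_filter(module, texture, u, v):
    """Second arithmetic unit: weighted sum over the texels this module owns."""
    return sum(w * texture[y][x]
               for x, y, w in bilinear_taps(u, v) if owner(x, y) == module)


def filtered_texel(texture, u, v):
    """First arithmetic unit: add the partial results from every module."""
    return sum(partial_filter(m, texture, u, v) for m in range(NUM_MODULES))
```

In this sketch only one partial value per module crosses the global bus for each pixel, instead of the raw texels themselves, which is the source of the reduction in transferred data that the description attributes to filtering before global distribution.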
[0024]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 4 is a block diagram showing an embodiment of the image processing apparatus according to the present invention.
[0025]
As illustrated in FIG. 4, the image processing apparatus 10 according to the present embodiment includes a setup circuit 11, processing modules 12-0 to 12-3, and a crossbar circuit 13.
[0026]
In the image processing apparatus 10, a plurality of processing modules (four in the present embodiment, 12-0 to 12-3) are connected in parallel to the setup circuit 11, and the processing modules 12-0 to 12-3 share the processing data and process it in parallel.
With respect to the texture read system, memory access to other processing modules is required. For this access, a crossbar circuit 13 as a global access bus is used.
[0027]
The configuration and function of each component will be described below in order with reference to the drawings.
[0028]
The setup circuit 11 handles data exchange with the CPU and external memory and with each of the processing modules 12-0 to 12-3, performs operations on vertex data, sets up one primitive, and outputs setup information for the assigned texture to each of the processing modules 12-0 to 12-3.
Specifically, the setup circuit 11 performs a Per-Vertex operation when data is input.
In this process, when vertex data consisting of three-dimensional coordinates, normal vectors, and texture coordinates is input, operations are performed on the vertex data. Typical operations include coordinate transformation, which handles deformation of an object and projection onto the screen, as well as lighting and clipping.
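To make the coordinate transformation step concrete, the following minimal sketch applies a 4 × 4 matrix to a vertex in homogeneous coordinates and then projects the result to screen pixels. The row-major matrix layout, the [-1, 1] normalized range, the downward-growing screen y axis, and the function names are all assumptions made for illustration; the description does not define these conventions.

```python
def transform_vertex(matrix, position):
    """Multiply a 4x4 row-major matrix by the homogeneous point (x, y, z, 1)."""
    x, y, z = position
    vec = (x, y, z, 1.0)
    return tuple(sum(matrix[r][c] * vec[c] for c in range(4)) for r in range(4))


def project_to_screen(clip, width, height):
    """Perspective divide, then map [-1, 1] normalized coordinates to pixels."""
    cx, cy, cz, cw = clip
    if cw == 0:
        raise ValueError("w == 0: vertex must be clipped before projection")
    ndc_x, ndc_y = cx / cw, cy / cw
    sx = (ndc_x + 1.0) * 0.5 * width
    sy = (1.0 - ndc_y) * 0.5 * height  # assumed: screen y grows downward
    return sx, sy, cz / cw
```

A vertex whose w component is zero cannot be divided, which is one reason clipping is listed alongside transformation among the per-vertex operations.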
[0029]
The processing module 12-0 includes a DDA (Digital Differential Analyzer) circuit 121-0 as a processing circuit, a first arithmetic unit (arithmetic unit 1) 122-0, a second arithmetic unit (arithmetic unit 2) 123-0, and a memory module (MEM) 124-0 made of, for example, a DRAM.
[0030]
Similarly, the processing module 12-1 includes a DDA circuit 121-1 as a processing circuit, a first arithmetic unit (arithmetic unit 1) 122-1, a second arithmetic unit (arithmetic unit 2) 123-1, and a memory module (MEM) 124-1 made of, for example, a DRAM.
The processing module 12-2 includes a DDA circuit 121-2 as a processing circuit, a first arithmetic unit (arithmetic unit 1) 122-2, a second arithmetic unit (arithmetic unit 2) 123-2, and a memory module (MEM) 124-2 made of, for example, a DRAM.
The processing module 12-3 includes a DDA circuit 121-3 as a processing circuit, a first arithmetic unit (arithmetic unit 1) 122-3, a second arithmetic unit (arithmetic unit 2) 123-3, and a memory module (MEM) 124-3 made of, for example, a DRAM.
[0031]
As will be described in detail later, the first arithmetic units 122-0 to 122-3 and the second arithmetic units 123-0 to 123-3 of the processing modules 12-0 to 12-3 are connected to each other through the crossbar circuit 13.
[0032]
FIG. 5 is a diagram showing a basic architecture and a processing flow of the image processing apparatus according to the present embodiment. In FIG. 5, arrows with circles indicate the flow of data regarding texture, and arrows without circles indicate the flow of data regarding pixels.
[0033]
In the present embodiment, in each of the processing modules 12-0 to 12-3, the memory modules 124-0 to 124-3 are interleaved in units of a predetermined size, for example, a 4 × 4 rectangular area.
Specifically, as shown in FIG. 5, both the so-called frame buffer and texture buffer are interleaved in all memory modules.
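The interleaving can be pictured with the following minimal sketch, which maps a pixel or texel coordinate to the module that owns it. The 4 × 4 unit comes from the text; the 2 × 2 arrangement of the four modules and the name `owner_module` are assumptions for illustration, since the exact interleave pattern is not given here.

```python
BLOCK = 4        # interleave unit: 4 x 4 pixels (from the text)
MODULES_X = 2    # assumed 2 x 2 arrangement of the four modules
MODULES_Y = 2


def owner_module(x, y):
    """Return the index (0..3) of the module assumed to own coordinate (x, y)."""
    bx = (x // BLOCK) % MODULES_X
    by = (y // BLOCK) % MODULES_Y
    return by * MODULES_X + bx
```

Because the frame buffer and the texture buffer are interleaved in the same way, the same mapping would serve for both pixel and texel coordinates under this assumption.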
[0034]
The DDA circuit 121-0 in the processing module 12-0 calculates DDA parameters based on information from the setup circuit 11.
In this process, DDA parameters such as inclinations of various data (Z, texture coordinates, color, etc.) necessary for rasterization are calculated.
Further, the DDA circuit 121-0 determines, based on the parameter data, whether, for example, a triangle falls within the area that the module is in charge of, and performs rasterization if it does.
Specifically, it determines whether the triangle belongs to the area that the module itself is in charge of, for example an area interleaved in units of 4 × 4-pixel rectangles, and if so, rasterizes the various data (Z, texture coordinates, color, etc.). In this case, the generation unit is 2 × 2 pixels per cycle per local module.
Next, the DDA circuit 121-0 performs perspective correction (Perspective Correction) of the texture coordinates. This processing stage also includes the calculation of the mipmap (MipMap) level by LOD (Level of Detail) calculation and the (u, v) address calculation for texture access.
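The three calculations named in this stage can be sketched as follows. This is a minimal illustration using common textbook formulas, not the circuit's actual arithmetic: the function names, the use of the longer footprint edge for the LOD, and the truncation of the LOD to an integer level are all assumptions.

```python
import math


def perspective_correct(u_over_w, v_over_w, one_over_w):
    """Recover (u, v) from quantities interpolated linearly in screen space."""
    return u_over_w / one_over_w, v_over_w / one_over_w


def mip_level(du_dx, dv_dx, du_dy, dv_dy, num_levels):
    """Pick a mipmap level from the texel footprint of a one-pixel step."""
    rho = max(math.hypot(du_dx, dv_dx), math.hypot(du_dy, dv_dy))
    lod = math.log2(rho) if rho > 0.0 else 0.0
    # Truncate to an integer level and clamp to the available range.
    return min(max(int(lod), 0), num_levels - 1)


def texel_address(u, v):
    """Split (u, v) into an integer texel address and its fractional part."""
    iu, iv = math.floor(u), math.floor(v)
    return (iu, iv), (u - iu, v - iv)
```

The fractional part returned by `texel_address` is the quantity that a later filtering step would use as interpolation weights.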
[0035]
Then, for example, as shown in FIG. 6, the DDA circuit 121-0 processes the texture-related data, such as texture coordinates and address information for texture access, in the texture DDA unit 1211, and outputs this texture information to the second arithmetic unit 123-0 via the first arithmetic unit 122-0 and the crossbar circuit 13.
On the other hand, the DDA circuit 121-0 processes data other than the texture, such as color, in the other DDA unit 1212 and outputs the result to the first arithmetic unit 122-0.
In this embodiment, each of the DDA circuits 121-0 to 121-3 is provided with a FIFO (First-In First-Out) only on the data input side of the other DDA unit 1212, so that the timing is adjusted to allow for the time taken by the texture filtering processing.
Also, the texture DDA unit 1211 generates texture data for all the pixels, whereas the other DDA unit 1212 generates data only for the portion assigned to the module by memory interleaving.
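The role of the FIFO can be pictured as a fixed delay on the non-texture path, so that color data arrives at the first arithmetic unit at about the same time as the filtered texture. The sketch below is only a software analogy; the class name `DelayFifo` and the depth of three cycles are assumptions, and the actual FIFO depth is not given in the text.

```python
from collections import deque


class DelayFifo:
    """Fixed-depth FIFO that releases each item `depth` pushes after it entered."""

    def __init__(self, depth):
        self.slots = deque([None] * depth)

    def push(self, item):
        # One item enters and the oldest slot leaves on every cycle.
        self.slots.append(item)
        return self.slots.popleft()


fifo = DelayFifo(depth=3)  # 3 cycles is an arbitrary illustrative latency
outputs = [fifo.push(color) for color in ["c0", "c1", "c2", "c3", "c4"]]
```

Here `outputs` holds empty slots for the first three cycles, after which the colors emerge in their original order.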
[0036]
The first arithmetic unit 122-0 receives the data other than the texture information supplied from the DDA circuit 121-0, together with the partial calculation data after texture filtering by the second arithmetic units 123-0 to 123-3 of the processing modules 12-0 to 12-3 received via the crossbar circuit 13. It adds the partial calculation data to obtain the final filtered texture value, performs pixel level processing (Per-Pixel Operation), and outputs the result to the second arithmetic unit 123-0 via the crossbar circuit 13.
In the pixel level processing, calculation in units of pixels is performed using the texture data after filtering and the various data after rasterization. The processing performed here corresponds to a so-called Pixel Shader, such as lighting at the pixel level (Per-Pixel Lighting).
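As one example of such a per-pixel calculation, the following minimal sketch modulates a filtered texel by an interpolated color and a diffuse lighting term. The choice of a diffuse (N·L) term and the name `shade_pixel` are assumptions for illustration; the text only says that the processing corresponds to a pixel shader.

```python
def shade_pixel(texel_rgb, normal, light_dir, vertex_rgb):
    """Modulate the filtered texel by the interpolated color and a diffuse term."""
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    return tuple(t * c * n_dot_l for t, c in zip(texel_rgb, vertex_rgb))


# Hypothetical inputs: a surface facing the light directly.
lit = shade_pixel((0.5, 0.5, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (1.0, 0.5, 0.5))
```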
[0037]
The second computing unit 123-0 receives coordinate data and address data related to the texture supplied from the DDA circuit 121-0, reads the texture data from the memory module 124-0, and performs texture filtering.
At this time, the second arithmetic unit 123-0 performs the partial calculation of the filter value for only the texels in its charge, and outputs the filtered partial calculation data via the crossbar circuit 13 to whichever of the first arithmetic units 122-0 to 122-3 belongs to the processing module holding the frame buffer that corresponds to the stamp.
In this case, the second arithmetic unit 123-0 performs filtering such as 4-neighbor interpolation using the read texture data and the fractional part obtained when the (u, v) address was calculated.
Further, the second arithmetic unit 123-0 receives the pixel level processing result supplied from the first arithmetic unit 122-0, and draws the pixel data that has passed the various tests in the pixel level processing into the memory module 124-0.
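The division of labor between the second and first arithmetic units can be sketched as follows: each module weights and sums only the texels it owns, and the module in charge of the pixel adds the partial results. This is a minimal sketch assuming a 2 × 2 bilinear (4-neighbor) filter, a scalar texture, and a 2 × 2 arrangement of 4 × 4 interleave blocks; `owner_module`, `partial_filter`, and `final_filter` are hypothetical names.

```python
import math

BLOCK, MODULES_X, MODULES_Y = 4, 2, 2  # 4x4 interleave; 2x2 module layout is assumed


def owner_module(x, y):
    """Module assumed to hold texel (x, y) under block interleaving."""
    return ((y // BLOCK) % MODULES_Y) * MODULES_X + (x // BLOCK) % MODULES_X


def bilinear_taps(u, v):
    """Return the four (texel_x, texel_y, weight) taps of a 2x2 bilinear filter."""
    iu, iv = math.floor(u), math.floor(v)
    fu, fv = u - iu, v - iv
    return [(iu, iv, (1 - fu) * (1 - fv)), (iu + 1, iv, fu * (1 - fv)),
            (iu, iv + 1, (1 - fu) * fv), (iu + 1, iv + 1, fu * fv)]


def partial_filter(module, texture, u, v):
    """Second arithmetic unit: weight and sum only the texels this module owns."""
    return sum(w * texture[(x, y)] for x, y, w in bilinear_taps(u, v)
               if owner_module(x, y) == module)


def final_filter(texture, u, v):
    """First arithmetic unit: add the partial results received from all modules."""
    return sum(partial_filter(m, texture, u, v) for m in range(MODULES_X * MODULES_Y))


# Hypothetical 8x8 scalar texture; the sample point straddles all four modules.
texture = {(x, y): float(x + 10 * y) for x in range(8) for y in range(8)}
partials = [partial_filter(m, texture, 3.5, 3.25) for m in range(4)]
```

In this sample each module contributes one weighted texel, so only four partial sums need to cross between modules rather than the texels themselves.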
[0038]
The DDA circuit 121-1 in the processing module 12-1 calculates DDA parameters based on information from the setup circuit 11, specifically DDA parameters such as the slopes of the various data (Z, texture coordinates, color, etc.) necessary for rasterization.
Further, the DDA circuit 121-1 determines, based on the parameter data, whether, for example, a triangle falls within the area that the module is in charge of, and performs rasterization if it does.
Specifically, it determines whether the triangle belongs to the area that the module itself is in charge of, for example an area interleaved in units of 4 × 4-pixel rectangles, and if so, rasterizes the various data (Z, texture coordinates, color, etc.). In this case, the generation unit is 2 × 2 pixels per cycle per local module.
Next, the DDA circuit 121-1 performs perspective correction (Perspective Correction) of the texture coordinates. This processing stage also includes the calculation of the mipmap level by LOD calculation and the (u, v) address calculation for texture access.
[0039]
Then, for example, as shown in FIG. 6, the DDA circuit 121-1 processes the texture-related data, such as texture coordinates and address information for texture access, in the texture DDA unit 1211, and outputs this texture information to the second arithmetic unit 123-1 via the first arithmetic unit 122-1 and the crossbar circuit 13.
On the other hand, the DDA circuit 121-1 processes data other than the texture, such as color, in the other DDA unit 1212 and outputs the result to the first arithmetic unit 122-1.
[0040]
The first arithmetic unit 122-1 receives the data other than the texture information supplied from the DDA circuit 121-1, together with the partial calculation data after texture filtering by the second arithmetic units 123-0 to 123-3 of the processing modules 12-0 to 12-3 received via the crossbar circuit 13. It adds the partial calculation data to obtain the final filtered texture value, performs pixel level processing, and outputs the result to the second arithmetic unit 123-1 via the crossbar circuit 13.
In the pixel level processing, calculation in units of pixels is performed using the texture data after filtering and the various data after rasterization. The processing performed here corresponds to a so-called Pixel Shader, such as lighting at the pixel level.
[0041]
The second arithmetic unit 123-1 receives coordinate data and address data related to the texture supplied from the DDA circuit 121-1, reads the texture data from the memory module 124-1, and performs texture filtering.
At this time, the second arithmetic unit 123-1 performs the partial calculation of the filter value for only the texels in its charge, and outputs the filtered partial calculation data via the crossbar circuit 13 to whichever of the first arithmetic units 122-0 to 122-3 belongs to the processing module holding the frame buffer that corresponds to the stamp.
In this case, the second arithmetic unit 123-1 performs filtering such as 4-neighbor interpolation using the read texture data and the fractional part obtained when the (u, v) address was calculated.
In addition, the second arithmetic unit 123-1 receives the pixel level processing result supplied from the first arithmetic unit 122-1, and draws the pixel data that has passed the various tests in the pixel level processing into the memory module 124-1.
[0042]
The DDA circuit 121-2 in the processing module 12-2 calculates DDA parameters based on information from the setup circuit 11, specifically DDA parameters such as the slopes of the various data (Z, texture coordinates, color, etc.) necessary for rasterization.
Further, the DDA circuit 121-2 determines, based on the parameter data, whether, for example, a triangle falls within the area that the module is in charge of, and performs rasterization if it does.
Specifically, it determines whether the triangle belongs to the area that the module itself is in charge of, for example an area interleaved in units of 4 × 4-pixel rectangles, and if so, rasterizes the various data (Z, texture coordinates, color, etc.). In this case, the generation unit is 2 × 2 pixels per cycle per local module.
Next, the DDA circuit 121-2 performs perspective correction (Perspective Correction) of the texture coordinates. This processing stage also includes the calculation of the mipmap level by LOD calculation and the (u, v) address calculation for texture access.
[0043]
Then, for example, as shown in FIG. 6, the DDA circuit 121-2 processes the texture-related data, such as texture coordinates and address information for texture access, in the texture DDA unit 1211, and outputs this texture information to the second arithmetic unit 123-2 via the first arithmetic unit 122-2 and the crossbar circuit 13.
On the other hand, the DDA circuit 121-2 processes data other than the texture, such as color, in the other DDA unit 1212 and outputs the result to the first arithmetic unit 122-2.
[0044]
The first arithmetic unit 122-2 receives the data other than the texture information supplied from the DDA circuit 121-2, together with the partial calculation data after texture filtering by the second arithmetic units 123-0 to 123-3 of the processing modules 12-0 to 12-3 received via the crossbar circuit 13. It adds the partial calculation data to obtain the final filtered texture value, performs pixel level processing, and outputs the result to the second arithmetic unit 123-2 via the crossbar circuit 13.
In the pixel level processing, calculation in units of pixels is performed using the texture data after filtering and the various data after rasterization. The processing performed here corresponds to a so-called Pixel Shader, such as lighting at the pixel level.
[0045]
The second computing unit 123-2 receives coordinate data and address data related to the texture supplied from the DDA circuit 121-2, reads the texture data from the memory module 124-2, and performs texture filtering.
At this time, the second arithmetic unit 123-2 performs the partial calculation of the filter value for only the texels in its charge, and outputs the filtered partial calculation data via the crossbar circuit 13 to whichever of the first arithmetic units 122-0 to 122-3 belongs to the processing module holding the frame buffer that corresponds to the stamp.
In this case, the second arithmetic unit 123-2 performs filtering such as 4-neighbor interpolation using the read texture data and the fractional part obtained when the (u, v) address was calculated.
Further, the second arithmetic unit 123-2 receives the pixel level processing result supplied from the first arithmetic unit 122-2, and draws the pixel data that has passed the various tests in the pixel level processing into the memory module 124-2.
[0046]
The DDA circuit 121-3 in the processing module 12-3 calculates DDA parameters based on information from the setup circuit 11, specifically DDA parameters such as the slopes of the various data (Z, texture coordinates, color, etc.) necessary for rasterization.
Further, the DDA circuit 121-3 determines, based on the parameter data, whether, for example, a triangle falls within the area that the module is in charge of, and performs rasterization if it does.
Specifically, it determines whether the triangle belongs to the area that the module itself is in charge of, for example an area interleaved in units of 4 × 4-pixel rectangles, and if so, rasterizes the various data (Z, texture coordinates, color, etc.). In this case, the generation unit is 2 × 2 pixels per cycle per local module.
Next, the DDA circuit 121-3 performs perspective correction (Perspective Correction) of the texture coordinates. This processing stage also includes the calculation of the mipmap level by LOD calculation and the (u, v) address calculation for texture access.
[0047]
Then, for example, as shown in FIG. 6, the DDA circuit 121-3 processes the texture-related data, such as texture coordinates and address information for texture access, in the texture DDA unit 1211, and outputs this texture information to the second arithmetic unit 123-3 via the first arithmetic unit 122-3 and the crossbar circuit 13.
On the other hand, the DDA circuit 121-3 processes data other than the texture, such as color, in the other DDA unit 1212 and outputs the result to the first arithmetic unit 122-3.
[0048]
The first arithmetic unit 122-3 receives the data other than the texture information supplied from the DDA circuit 121-3, together with the partial calculation data after texture filtering by the second arithmetic units 123-0 to 123-3 of the processing modules 12-0 to 12-3 received via the crossbar circuit 13. It adds the partial calculation data to obtain the final filtered texture value, performs pixel level processing, and outputs the result to the second arithmetic unit 123-3 via the crossbar circuit 13.
In the pixel level processing, calculation in units of pixels is performed using the texture data after filtering and the various data after rasterization. The processing performed here corresponds to a so-called Pixel Shader, such as lighting at the pixel level.
[0049]
The second arithmetic unit 123-3 receives coordinate data and address data related to the texture supplied from the DDA circuit 121-3, reads the texture data from the memory module 124-3, and performs texture filtering.
At this time, the second arithmetic unit 123-3 performs the partial calculation of the filter value for only the texels in its charge, and outputs the filtered partial calculation data via the crossbar circuit 13 to whichever of the first arithmetic units 122-0 to 122-3 belongs to the processing module holding the frame buffer that corresponds to the stamp.
In this case, the second arithmetic unit 123-3 performs filtering such as 4-neighbor interpolation using the read texture data and the fractional part obtained when the (u, v) address was calculated.
Further, the second arithmetic unit 123-3 receives the pixel level processing result supplied from the first arithmetic unit 122-3, and draws the pixel data that has passed the various tests in the pixel level processing into the memory module 124-3.
[0050]
FIG. 7 is a diagram illustrating a configuration example of the global bus system of the crossbar circuit according to the present embodiment.
As shown in FIG. 7, the crossbar circuit 13 has four wiring groups, the first to fourth wiring groups GRP0 to GRP3, each consisting of four texture lines.
The first wiring group GRP0 has four wirings tex00 to tex03, the second wiring group GRP1 has four wirings tex10 to tex13, and the third wiring group GRP2 has four wirings tex20 to tex23. The fourth wiring group GRP3 includes four wirings tex30 to tex33.
The terminals of the second arithmetic unit 123-0 of the processing module 12-0 are connected to the wiring tex00 of the first wiring group GRP0, the wiring tex10 of the second wiring group GRP1, the wiring tex20 of the third wiring group GRP2, and the wiring tex30 of the fourth wiring group GRP3.
Similarly, the terminals of the second arithmetic unit 123-1 of the processing module 12-1 are connected to the wiring tex01 of the first wiring group GRP0, the wiring tex11 of the second wiring group GRP1, the wiring tex21 of the third wiring group GRP2, and the wiring tex31 of the fourth wiring group GRP3.
The terminals of the second arithmetic unit 123-2 of the processing module 12-2 are connected to the wiring tex02 of the first wiring group GRP0, the wiring tex12 of the second wiring group GRP1, the wiring tex22 of the third wiring group GRP2, and the wiring tex32 of the fourth wiring group GRP3.
The terminals of the second arithmetic unit 123-3 of the processing module 12-3 are connected to the wiring tex03 of the first wiring group GRP0, the wiring tex13 of the second wiring group GRP1, the wiring tex23 of the third wiring group GRP2, and the wiring tex33 of the fourth wiring group GRP3.
[0051]
The four wirings tex00 to tex03 of the first wiring group GRP0 are connected to the respective terminals of the first computing unit 122-0 of the processing module 12-0.
Similarly, the four wirings tex10 to tex13 of the second wiring group GRP1 are connected to the respective terminals of the first computing unit 122-1 of the processing module 12-1.
Four wirings tex20 to tex23 of the third wiring group GRP2 are connected to the respective terminals of the first computing unit 122-2 of the processing module 12-2.
Four wirings tex30 to tex33 of the fourth wiring group GRP3 are connected to each terminal of the first computing unit 122-3 of the processing module 12-3.
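The wiring pattern described for FIG. 7 is regular enough to be generated programmatically. In the minimal sketch below, the wiring named tex with group digit i and line digit j joins the second arithmetic unit 123-j to the first arithmetic unit 122-i. The direction (second unit as source, first unit as destination) is inferred from the flow of partial calculation data and is an assumption, as is the function name `build_crossbar`.

```python
NUM_MODULES = 4


def build_crossbar(n=NUM_MODULES):
    """Map each wiring name to (source second unit, destination first unit)."""
    wiring = {}
    for group in range(n):      # wiring group GRP<group> feeds first unit 122-<group>
        for src in range(n):    # line index selects second unit 123-<src>
            wiring[f"tex{group}{src}"] = (f"123-{src}", f"122-{group}")
    return wiring


crossbar = build_crossbar()
```

With four modules this yields the 16 texture lines tex00 to tex33 listed in the text.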
[0052]
Next, the operation of the configuration of FIG. 4 will be described with reference to FIG.
[0053]
First, the setup circuit 11 performs an operation on the vertex data, sets up one primitive, and outputs setup information for the assigned texture to each processing module 12-0 to 12-3.
[0054]
The DDA circuits 121-0 to 121-3 in the processing modules 12-0 to 12-3 calculate DDA parameters based on information from the setup circuit 11, specifically DDA parameters such as the slopes of the various data (Z, texture coordinates, color, etc.) necessary for rasterization.
In addition, the DDA circuits 121-0 to 121-3 determine, based on the parameter data, whether, for example, a triangle falls within the area that the module is in charge of, and perform rasterization if it does.
Further, the DDA circuits 121-0 to 121-3 perform calculation of mipmap levels by LOD calculation and (u, v) address calculation for texture access.
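The slope calculation and the incremental stepping that give a DDA its name can be sketched as follows. This is a minimal illustration based on the plane equation through three vertices; the function names and the treatment of a single scalar attribute are assumptions, and the circuit's actual arithmetic is not described in the text.

```python
def attribute_slopes(p0, p1, p2):
    """Return (d/dx, d/dy) of an attribute given three (x, y, value) vertices."""
    (x0, y0, a0), (x1, y1, a1), (x2, y2, a2) = p0, p1, p2
    area2 = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if area2 == 0:
        raise ValueError("degenerate triangle")
    ddx = ((a1 - a0) * (y2 - y0) - (a2 - a0) * (y1 - y0)) / area2
    ddy = ((a2 - a0) * (x1 - x0) - (a1 - a0) * (x2 - x0)) / area2
    return ddx, ddy


def step_span(start_value, ddx, count):
    """DDA stepping: produce `count` values by repeated addition of the slope."""
    values, value = [], start_value
    for _ in range(count):
        values.append(value)
        value += ddx
    return values
```

The same slopes would be computed once per triangle for Z, each texture coordinate, and each color channel, and then stepped across the pixels of the area in charge.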
[0055]
In the DDA circuits 121-0 to 121-3, the texture coordinates, the address information for texture access, and the like obtained by the texture DDA unit 1211 are output to the second arithmetic units 123-0 to 123-3 via the first arithmetic units 122-0 to 122-3 and the crossbar circuit 13.
On the other hand, in the DDA circuits 121-0 to 121-3, information such as colors other than the texture obtained by the other DDA units 1212 is supplied to the first calculators 122-0 to 122-3, respectively.
[0056]
Then, the second arithmetic units 123-0 to 123-3 of the processing modules 12-0 to 12-3 receive the coordinate data and address data related to the texture supplied from the DDA circuits 121-0 to 121-3, read the texture data from the memory modules 124-0 to 124-3, and perform partial texture filtering, such as 4-neighbor interpolation, using the read texture data and the fractional part obtained when the (u, v) address was calculated.
That is, at this time, the second arithmetic units 123-0 to 123-3 perform the partial calculation of the filter value for only the texels in their charge.
The partial calculation data after filtering by the second arithmetic units 123-0 to 123-3 is supplied via the crossbar circuit 13 to the first arithmetic unit of the processing module holding the frame buffer that corresponds to the stamp, for example, the first arithmetic unit 122-1 of the processing module 12-1.
[0057]
The first arithmetic unit 122-1 of the processing module 12-1 receives the data other than the texture information supplied from the DDA circuit 121-1, together with the partial calculation data after texture filtering by the second arithmetic units 123-0 to 123-3 of the processing modules 12-0 to 12-3 received via the crossbar circuit 13. It adds the partial calculation data to obtain the final filtered texture value, performs pixel level processing, and outputs the result to the second arithmetic unit 123-1 via the crossbar circuit 13.
[0058]
Then, the second arithmetic unit 123-1 receives the pixel level processing result supplied from the first arithmetic unit 122-1, and the pixel data that has passed the various tests in the pixel level processing is drawn into the memory module 124-1.
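The final drawing step can be pictured with the following minimal sketch, in which a pixel is written only if it passes a test. The text speaks only of "various tests"; the less-than depth test used here, the dictionary-based buffers, and the name `draw_pixel` are assumptions chosen for illustration.

```python
def draw_pixel(frame_buffer, depth_buffer, x, y, color, z):
    """Write the pixel only if it is nearer than the stored depth (less-than test)."""
    if z < depth_buffer.get((x, y), float("inf")):
        depth_buffer[(x, y)] = z
        frame_buffer[(x, y)] = color
        return True
    return False


frame, depth = {}, {}
first = draw_pixel(frame, depth, 2, 3, "red", 0.5)
second = draw_pixel(frame, depth, 2, 3, "blue", 0.8)  # farther away, so rejected
```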
[0059]
The above processing is performed in parallel in each module.
[0060]
As described above, according to the present embodiment, as shown in FIG. 8, after DDA processing (ST11), texture data is read from the memory (ST12), subword rearrangement processing is performed (ST13), partial texture filtering is performed (ST14), and the results are then globally distributed to the first arithmetic units of the respective processing modules by the crossbar circuit 13 (ST15). Pixel level processing is then performed; specifically, the partial calculation data after filtering is received, the partial calculation data is added to obtain the final filtered texture value, and per-pixel calculation is performed (ST16). Finally, the pixel data that has passed the various tests in the pixel level processing is drawn into the frame buffer on the memory module (ST17). The following effects can therefore be obtained.
[0061]
That is, since the amount of texel transfer between processing modules can be reduced, the crossbar circuit 13 as a global bus can be reduced in size.
In addition, the memory arrangement may be global. That is, since it can be arranged in the same manner as the frame buffer, the frame buffer can be used as a texture memory. Also, maps of different sizes can be arranged efficiently.
Further, since there is only one cache for a specific texel, the cache utilization efficiency is high.
In addition, when interleaving is performed in units of 4 × 4 between processing modules, processing can be performed by transferring 25/16 texels per pixel with a 2 × 2 4-neighbor interpolation filter.
In the case where the present embodiment is not used, 48/16 texels are transferred. Even if request merging is performed, 32/16 texels are transferred.
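One plausible reading of the 25/16 figure is that a 4 × 4 tile of pixels filtered with a 2 × 2 kernel touches a 5 × 5 set of distinct texels. The sketch below counts that footprint under the assumption of a one-to-one texel-to-pixel scale; it does not attempt to derive the 48/16 and 32/16 figures, whose derivation the text does not give, and the name `footprint_texels` is hypothetical.

```python
def footprint_texels(block=4, filter_size=2):
    """Count distinct texels touched by a block x block pixel tile at 1:1 scale."""
    touched = {(px + dx, py + dy)
               for px in range(block) for py in range(block)
               for dx in range(filter_size) for dy in range(filter_size)}
    return len(touched)


texels = footprint_texels()       # distinct texels for one 4 x 4 tile
per_pixel = texels / (4 * 4)      # ratio of texels to pixels in the tile
```

Under this reading each texel is fetched once per tile, which is consistent with the statement that there is only one cache for a specific texel.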
[0062]
[Effect of the invention]
As described above, according to the present invention, when a plurality of processing devices share processing data and perform parallel processing, the number of wires of the crossbar circuit can be reduced and the size can be reduced. As a result, there is an advantage that an image processing apparatus that can be easily designed and can reduce wiring cost and wiring delay can be realized.
[Brief description of the drawings]
FIG. 1 is a diagram conceptually illustrating parallel processing at a primitive level.
FIG. 2 is a diagram conceptually illustrating parallel processing at the primitive level based on the pixel-level parallel processing technique.
FIG. 3 is a diagram for explaining a processing procedure including texture filtering of a conventional image processing apparatus.
FIG. 4 is a block configuration diagram showing an embodiment of an image processing apparatus according to the present invention.
FIG. 5 is a diagram illustrating a basic architecture and a processing flow of an image processing apparatus according to the present embodiment.
FIG. 6 is a diagram illustrating a configuration example of a main part of a DDA circuit according to the present embodiment.
FIG. 7 is a diagram showing a specific configuration example of a crossbar circuit according to the present embodiment.
FIG. 8 is a diagram illustrating a conceptual processing flow of the image processing apparatus according to the present embodiment.
[Explanation of symbols]
10 ... Image processing apparatus, 11 ... Setup circuit, 12-0 to 12-3 ... Processing module, 121-0 to 121-3 ... DDA circuit, 122-0 to 122-3 ... First computing unit, 123-0 to 123-3 ... Second computing unit, 124-0 to 124-3 ... Memory module, 13 ... Crossbar circuit.

Claims

1. An image processing apparatus in which a plurality of processing modules share processing data and perform parallel processing, wherein
each of the plurality of processing modules has:
a memory module storing at least data related to filtering processing;
a processing circuit that obtains texture data for filtering processing and performs, based on the processing data, the processing in its charge determined in advance by the corresponding memory interleaving;
a first arithmetic unit that, when its own processing module is in charge of a pixel, receives the processing data in charge obtained by the processing circuit and the partial calculation data of the filter values computed in the respective processing modules for only the texels in their charge, and adds the partial calculation data to obtain the final filtered texture value; and
a second arithmetic unit that performs a partial calculation of the filter value for only the texels in its charge, based on the filtering processing data handled by its own processing module obtained by the processing circuit and the data related to filtering processing stored in the memory module, and that receives the arithmetic processing data from the first arithmetic unit and draws it into the memory module,
the image processing apparatus further comprising a crossbar circuit that is a global bus interconnecting the plurality of first arithmetic units and the plurality of second arithmetic units of the processing modules, and that supplies the filtering processing data obtained by the processing circuit in each processing module to the second arithmetic unit of the same processing module, supplies the partial calculation data after filtering by the second arithmetic unit of each processing module to the first arithmetic unit of the processing module in charge of the pixel, and supplies the arithmetic processing data of the first arithmetic unit to the second arithmetic unit.

2. The image processing apparatus according to claim 1, wherein both the frame buffer and the texture buffer are interleaved in all memory modules.

3. The image processing apparatus according to claim 1, further comprising: a setup circuit that performs an operation on the vertex data of the primitive, sets up one primitive, and outputs data assigned to the processing circuit of each processing module.

4. The image processing apparatus according to claim 3, wherein the setup circuit distributes all texture information to the processing circuit of each processing module.

5. An image processing method in which a plurality of processing modules share processing data and perform parallel processing, wherein
in each processing module, texture data for filtering processing is obtained and, based on the processing data, the processing in charge determined in advance by the corresponding memory interleaving is performed,
in each processing module, a partial calculation of the filter value for only the texels in charge is performed based on the filtering processing data handled by that processing module and the data related to filtering processing stored in the memory module,
the partial calculation data of the processing modules is aggregated, via a global bus, in the processing module in charge of the pixel,
in the processing module in which the partial calculation data is aggregated, the partial calculation data is added to obtain the final filtered texture value, and
the final arithmetic processing data is drawn into the memory module.

6. The image processing method according to claim 5, wherein both the frame buffer and the texture buffer are interleaved in all memory modules.

7. The image processing method according to claim 5, wherein all the texture information is distributed to each processing module.