JPH07129546A

JPH07129546A - Matrix operation algorithm and circuit

Info

Publication number: JPH07129546A
Application number: JP26924393A
Authority: JP
Inventors: Kao Jin-Nan; カオジン−ナン
Original assignee: Industrial Technology Research Institute ITRI
Current assignee: Industrial Technology Research Institute ITRI
Priority date: 1993-10-27
Filing date: 1993-10-27
Publication date: 1995-05-19

Abstract

(57)【要約】【目的】本発明は、修正ブースアルゴリズムを利用し
て各クロック周期毎に少なくとも２ビットを処理するマ
トリクス演算アルゴリズムと回路設計を提供して回路の
占有領域を縮小することを目的とする。【構成】本発明は第１のマトリクスの列及び第２のマ
トリクスの対応する行の数値の複数の組を受信して記憶
する記憶手段を各々が含む複数のプロセッサ素子（Ｐ
Ｅ）ユニットを各々が含む複数のプロセッサ素子（Ｐ
Ｅ）列を含む乗算回路から構成される。ＰＥ列の他のＰ
Ｅユニットに接続されるＰＥユニットの各々は、数値の
組を時系列的に記憶手段間で順次に伝送する。ＰＥユニ
ットの各々は、記憶手段からの数値の組の複数ビットを
受信する乗算手段を更に有し、乗算手段は部分処理結果
を発生するよう複数ビットを処理する。部分処理結果を
利用する少なくとも一の該乗算手段が該数値の組の積を
発生し、これにより、ＰＥ列の各々は第２のマトリクス
の列のマトリクス乗算を実行する。 (57) [Summary] [Object] The present invention provides a matrix operation algorithm and a circuit design for processing at least 2 bits for each clock cycle by using a modified Booth algorithm to reduce the occupied area of the circuit. To aim. The present invention is directed to a plurality of processor elements (P) each including storage means for receiving and storing a plurality of sets of numerical values in columns of a first matrix and corresponding rows of a second matrix.
E) a plurality of processor elements (P
E) It is composed of a multiplication circuit including a column. Other P in PE row
Each of the PE units connected to the E unit sequentially transmits a set of numerical values between the storage means in time series. Each of the PE units further comprises multiplication means for receiving a plurality of bits of the set of numbers from the storage means, the multiplication means processing the plurality of bits to produce a partial processing result. At least one of the multiplying means utilizing the partial processing result produces a product of the set of numerical values, whereby each PE column performs a matrix multiplication of a column of the second matrix.

Description

Detailed Description of the Invention

【０００１】[0001]

【産業上の利用分野】本発明は、概括的にはマトリクス
演算を実施するアルゴリズムと回路実装に関する。特
に、本発明は、マトリクス乗算アルゴリズムと、ＩＣチ
ップ上の占有領域が削減され、しかも、より高速な乗算
演算を実行するように集積回路（ＩＣ）チップに実装さ
れる回路設計に関する。FIELD OF THE INVENTION The present invention relates generally to algorithms and circuit implementations for performing matrix operations. In particular, the present invention relates to matrix multiplication algorithms and circuit designs that are implemented on integrated circuit (IC) chips to reduce the occupied area on the IC chip and to perform faster multiplication operations.

【０００２】[0002]

【従来の技術】マトリクス演算は応用毎に種々の状況下
で通常実行される。乗算を能率的に実行するために、一
般に乗算アルゴリズムは特殊ＩＣ回路設計を利用してハ
ードウェア的に実装される。より小さな電子部品を絶え
ず製造する最近の小型化傾向は、ＩＣチップ上の回路に
より占められる領域を最小限度にするよう多大な要求を
回路設計技術にもたらしている。チップ上の回路サイズ
が縮小されたＩＣ回路設計により実装しうる特徴を有す
るアルゴリズムは、電子部品のサイズを縮小するために
広範囲の応用に適用され得る。BACKGROUND OF THE INVENTION Matrix operations are usually performed under different circumstances depending on the application. In order to perform the multiplication efficiently, the multiplication algorithm is generally implemented in hardware using a special IC circuit design. The recent trend toward miniaturization of smaller electronic components has placed great demands on circuit design technology to minimize the area occupied by circuitry on an IC chip. The algorithms, which have the feature that they can be implemented by IC circuit designs with reduced circuit size on chip, can be applied to a wide range of applications to reduce the size of electronic components.

【０００３】マトリクス乗算の一つの典型的な応用は、
２次元離散余弦変換（ＤＣＴ）が行われるビデオ画像デ
ータ圧縮の分野にある。Ｎ×Ｎデータマトリクスに対し
て、２次元離散余弦変換が次のように定義される：One typical application of matrix multiplication is
It is in the field of video image data compression where two-dimensional discrete cosine transform (DCT) is performed. For an NxN data matrix, the two-dimensional discrete cosine transform is defined as:

【０００４】[0004]

【数１】 [Equation 1]

【０００５】逆変換も同様にして次のように実行され
る：The inverse transform is similarly performed as follows:

【０００６】[0006]

【数２】 [Equation 2]

【０００７】ここで、Ｋ＝０の場合には、Ｃ（ｘ）＝１／２^**（０．５）Ｋ≠０の場合には、Ｃ（ｘ）＝１である。Ｘ（ｎ₁，ｎ₂）は入力データマトリクスであ
って、Ｚ（Ｋ₁，Ｋ₂）は、変換係数マトリクスであ
る。Here, in the case of K = 0, C (x) = 1/2 ^** (0.5) and in the case of K ≠ 0, C (x) = 1. X (n ₁ , n ₂ ) is the input data matrix and Z (K ₁ , K ₂ ) is the transform coefficient matrix.

【０００８】式（１ａ）をマトリクス形式で表すと、Ｚ＝ＣＸＣ ^t （２）であって、Ｃは余弦係数マトリクスを表し、Ｃ ^tはＣの
転置を表す。更に詳細な形式では、式（２）は次のよう
に表すことができる：Expressing equation (1a) in matrix form, Z = CXC ^t (2), where C represents the cosine coefficient matrix and C ^t represents the transpose of C. In a more detailed form, equation (2) can be expressed as:

【０００９】[0009]

【数３】 [Equation 3]

【００１０】ここで、Ｃ^t _ij＝Ｃ_ijである。互いに乗算
する３つのＮ×Ｎマトリクスがある。従って、マトリク
スＹをＹ＝ＸＣ ^t、又は、Here, C ^t _ij = C _ij . There are three N × N matrices that multiply together. Thus, Matrix <br/> scan Y to Y = XC ^t, or,

【００１１】[0011]

【数４】 [Equation 4]

【００１２】により定義すると、マトリクスは次のよう
に表される：Defined by, the matrix is represented as:

【００１３】[0013]

【数５】 [Equation 5]

【００１４】２行２列の例によると、式（４）は、According to the 2-row, 2-column example, equation (4) is

【００１５】[0015]

【数６】 [Equation 6]

【００１６】のように変形され、式（５）は、## EQU3 ## Equation (5) is transformed into

【００１７】[0017]

【数７】 [Equation 7]

【００１８】のように変形される。ＤＣＴマトリクス乗
算に関する従来技術には幾つかの制約がある。行−列分
解方法を利用する場合、転置されたメモリが必要とされ
る。多量のパイプライン型データフローによる解法は、
実際の回路設計に実装することが極めて困難となる。直
接２次元ＤＣＴが実装される状況下では、意図されたハ
ードウェア実装から潜在的に得られる利点と比較して、
ハードウェア設計と製造のコストは高すぎる。その上、
提案された技術は、一般にモジュール化特性に欠け、高
度に入り組んだ複雑な回路構造を必要とするので、実際
の実装には時間を膨大に浪費し、効率が良くない。It is transformed as follows. The prior art regarding DCT matrix multiplication has some limitations. When using the row-column decomposition method, transposed memory is required. The solution with a large amount of pipelined data flow is
It becomes extremely difficult to implement in the actual circuit design. In situations where a two-dimensional DCT is directly implemented, compared to the potential benefits of the intended hardware implementation,
Hardware design and manufacturing costs are too high. Moreover,
Since the proposed technology generally lacks modularization characteristics and requires a highly intricate and complicated circuit structure, it is very time-consuming and inefficient for actual implementation.

【００１９】マトリクス乗算を実装するようにモジュー
ル化された回路設計が、本発明の発明者による従来特許
により提供される（「高速パイプライン方式マトリクス
乗算器」と名付けられた特許出願第０７／８３６，０７
５号）。特許出願第０７／８３６，０７５号ではパイプ
ライン構造が利用され、ここで、複数のレジスタとマル
チプレクサが複数のビット乗算素子、即ち、プロセシン
グ素子（ＰＥｓ）を形成するよう統括的なアーキテクチ
ャで配列される。乗算が全ての各プロセシング素子で１
ビット毎に時系列的に実行される。統括的な構成に加え
て、乗算が並列的に実行される、即ち、乗算されるマト
リクスの列と対応する行の各要素は多数の統括的なプロ
セシング行の一つで並列的に同時に処理される、という
利点を乗算器は更に有する。設計と、パイプライン型で
あって並列的なプロセシングアーキテクチャとの単純さ
故に、乗算器は非常に経済的に設計され製造される。ハ
ードウェア実装のコストは、従来技術のそれと比べて大
きく削減される。モジュール構造により、主要な回路設
計の労を要することなく、多段マトリクス乗算の実行に
便宜的に利用することが更に可能となる。A circuit design modularized to implement matrix multiplication is provided by a prior patent by the inventor of the present invention (Patent Application No. 07/836, entitled "Fast Pipelined Matrix Multiplier"). , 07
No. 5). In patent application No. 07 / 836,075, a pipeline structure is used, in which a plurality of registers and multiplexers are arranged in a general architecture to form a plurality of bit multiplication elements, or processing elements (PEs). It Multiply is 1 for every processing element
It is executed in time series for each bit. In addition to the overall configuration, the multiplications are carried out in parallel, i.e. each element of the row corresponding to the column of the matrix to be multiplied is simultaneously processed in parallel in one of a number of overall processing rows. The multiplier further has the advantage that Due to the simplicity of the design and the pipelined, parallel processing architecture, the multipliers are very economically designed and manufactured. The cost of hardware implementation is greatly reduced compared to that of the prior art. The modular structure further allows it to be conveniently used to perform multi-stage matrix multiplications without major circuit design effort.

【００２０】しかしながら、このパイプライン乗算器
は、ある特定の状況下で種々の潜在的なハードウェア制
約をうけることがある。複数のビット乗算素子、即ち、
プロセッシング素子（ＰＥｓ）の各々は１クロック周期
当たり１ビットを処理するので、個々のマトリクス素子
のビット数及びマトリクスの行と列の数が増加するにつ
れて、レジスタとマルチプレクサの数が多数に増加する
ことがある。電子部品の小型化の要求は、ＩＣチップ上
の電子回路部品数を制約する。多数のレジスタとマルチ
プレクサと関連した相互結合線が、ＩＣチップ上の領域
を大きく占有し過ぎることがあり、これにより、ハード
ウェア実装が製造仕様と両立しない場合もある。However, this pipeline multiplier may be subject to various potential hardware constraints under certain circumstances. Multiple bit multiplication elements, ie
Since each of the processing elements (PEs) processes one bit per clock period, the number of registers and multiplexers will increase as the number of bits of individual matrix elements and the number of rows and columns of the matrix increase. There is. The demand for miniaturization of electronic components limits the number of electronic circuit components on the IC chip. The interconnection lines associated with the large number of registers and multiplexers may occupy too much area on the IC chip, which may cause the hardware implementation to be incompatible with manufacturing specifications.

【００２１】したがって、従来技術においてはパイプラ
イン乗算器の回路部品数を削減する必要性が依然として
あり、これにより、より広い応用分野で実装されるよう
にパイプライン／並列乗算構成が更に改善され得る。Therefore, there is still a need in the prior art to reduce the number of circuit components in a pipeline multiplier, which can further improve the pipeline / parallel multiplication configuration for implementation in a wider range of applications. .

【００２２】[0022]

【発明が解決しようとする課題】したがって、本発明の
目的は、修正ブースアルゴリズム（Modified Booth Alg
orithm）を利用して各クロック周期毎に少なくとも２ビ
ットを処理し、これにより、回路の占有領域が削減され
るマトリクス演算アルゴリズムと回路設計を提供するこ
とである。Accordingly, it is an object of the present invention to provide a Modified Booth Algorithm.
orithm) to process at least two bits every clock cycle, thereby providing a matrix operation algorithm and circuit design that reduces the occupied area of the circuit.

【００２３】本発明の別の目的は、ＩＣチップ上の回路
領域が縮小されパイプラインと並列処理アーキテクチャ
を利用するＩＤＣＴに対するＣＣＩＴＴ（国際電信電話
諮問委員会）提案規格に基づく精度ベンチマーク要求を
満足するマトリクス演算アルゴリズムと回路設計を提供
することである。本発明の別の目的は、中間テスト機能
を可能とし、これにより、乗算精度が更に容易に検査さ
れ得るマトリクス演算アルゴリズムと回路設計を提供す
ることである。Another object of the present invention is to meet the accuracy benchmark requirement based on CCITT (International Telegraph and Telephone Advisory Committee) proposed standard for IDCT which uses a pipeline and parallel processing architecture in which the circuit area on the IC chip is reduced. It is to provide a matrix operation algorithm and a circuit design. Another object of the present invention is to provide a matrix operation algorithm and circuit design that enables an intermediate test function, whereby the multiplication accuracy can be more easily checked.

【００２４】[0024]

【課題を解決するための手段】簡単に述べると、望まし
い一実施例により、本発明は第１のマトリクスの行及び
第２のマトリクスの対応する列の数値の複数の組を受信
して記憶する記憶手段を各々が含む複数のプロセッサ素
子（ＰＥ）ユニットを各々が含む複数のプロセッサ素子
（ＰＥ）列を含む乗算回路から構成される。ＰＥユニッ
トの各々は、ＰＥ列の他のＰＥユニットに接続される数
値の組を時系列的に記憶手段間で順次に伝送する。ＰＥ
ユニットの各々は、記憶手段からの数値の組の複数ビッ
トを受信する乗算手段を更に有し、乗算手段は部分処理
結果を発生するよう複数ビットを処理する。部分処理結
果を利用する少なくとも一の該乗算手段は該数値の組の
積を発生し、これにより、ＰＥ列の各々は第２のマトリ
クスの列のマトリクス乗算を実行し、マトリクス回路は
マトリクス乗算を実行する。Briefly, in accordance with a preferred embodiment, the present invention receives and stores multiple sets of numeric values in rows of a first matrix and corresponding columns of a second matrix. A multiplication circuit including a plurality of processor element (PE) columns each including a plurality of processor element (PE) units each including a storage unit. Each of the PE units sequentially transmits a set of numerical values connected to the other PE units in the PE column between the storage means in time series. PE
Each of the units further comprises multiplication means for receiving the plurality of bits of the set of numbers from the storage means, the multiplication means processing the plurality of bits to produce a partial processing result. At least one of the multiplying means utilizing the partial processing result produces a product of the set of numerical values, whereby each PE column performs a matrix multiplication of the columns of the second matrix and the matrix circuit performs the matrix multiplication. Run.

【００２５】[0025]

【作用】本発明の利点は、修正ブースアルゴリズムを利
用して各クロック周期毎に少なくとも２ビットを処理
し、これにより回路の占有領域が縮小されるマトリクス
演算アルゴリズムと回路設計を提供することである。本
発明の別の利点は、ＩＣチップ上の回路領域が縮小され
パイプラインと並列処理アーキテクチャを利用するＩＤ
ＣＴに対するＣＣＩＴＴ提案規格に基づく精度ベンチマ
ーク要求を満足するマトリクス演算アルゴリズムと回路
設計を提供することである。An advantage of the present invention is that it provides a matrix operation algorithm and circuit design that utilizes a modified Booth algorithm to process at least two bits every clock cycle, thereby reducing the occupied area of the circuit. . Another advantage of the present invention is an ID that reduces the circuit area on the IC chip and utilizes a pipeline and parallel processing architecture.
It is an object of the present invention to provide a matrix operation algorithm and a circuit design that satisfy the accuracy benchmark requirement based on the CCITT proposed standard for CT.

【００２６】本発明の別の利点は、中間テスト機能を可
能とし、これにより、乗算精度が更に容易に検査され得
るマトリクス演算アルゴリズムと回路設計を提供するこ
とである。本発明の以上及び他の目的と利点は、通常当
業者においては種々の図面に記載される望ましい実施例
の以下の詳細な説明を読むことにより疑いなく明らかと
なる。Another advantage of the present invention is that it provides an intermediate test function, thereby providing a matrix operation algorithm and circuit design in which multiplication accuracy can be more easily checked. These and other objects and advantages of the present invention will no doubt become apparent to those skilled in the art upon reading the following detailed description of the preferred embodiments, which is set forth in various drawings.

【００２７】[0027]

【実施例】図１は本発明によるマトリクス乗算回路１０
を示す。乗算回路１０は２行２列マトリクス乗算、即ち
Ａ×Ｂ＝Ｃを実行するため利用され、ここでＡ、Ｂ及び
Ｃは各々が２行と２列を有する３個のマトリクスを表
す。乗算器１０は、２つの同一であって並列な列、即ち
列１６と列１８とを含み、ここに、列１６はマトリクス
Ａの要素とマトリクスＢの第１列との乗算を実行し、列
１８はマトリクスＡの要素とＢの第２列の要素との乗算
を実行する。列１６及び１８の演算は並列に実行されう
る。特許出願第０７／８３６，０７５号に開示されたパ
イプライン乗算器により実行される乗算と同様に、列１
６及び１８の構造は、統括的に構成され、ここに、乗算
は全て連続して実行される。1 shows a matrix multiplication circuit 10 according to the present invention.
Indicates. The multiplication circuit 10 is a 2 × 2 matrix multiplication, that is,
Is used to implement A × B = C , where A , B and
C represents three matrices each having two rows and two columns. The multiplier 10 comprises two identical and parallel columns, column 16 and column 18, where column 16 is a matrix.
Column A performs the multiplication of the elements of A with the first column of matrix B , and column 18 performs the multiplication of the elements of matrix A with the elements of the second column of B. The operations in columns 16 and 18 can be performed in parallel. Column 1 as well as the multiplication performed by the pipeline multiplier disclosed in patent application Ser. No. 07 / 836,075
The structure of 6 and 18 is organized as a whole, where the multiplications are all performed sequentially.

【００２８】各プロセシング素子列、即ち、列１６及び
１８は、夫々ＰＥ１（１６−１）、ＰＥ２（１６−
２）、ＰＥ３（１６−４）、及びＰＥ１（１８−１）、
ＰＥ２（１８−２）、ＰＥ３（１８−４）と記述される
複数のプロセシング素子（ＰＥ）ユニットから成る。Ｐ
Ｅ１及びＰＥ２ユニットは、連続した方法で乗算を実行
するために利用され、ここで、各プロセシングユニット
は、あるビット数の乗算をパイプライン形式で実行す
る。各クロック周期の最初に、１レベルのプロセシング
ユニットの内容が、更なる処理のために次のレベルに転
送され、そこで次のビットの組の乗算が実行される。し
たがって、ＰＥ１及びＰＥ２ユニットの総数は、乗算さ
れるマトリクス要素のビット数と各プロセシングユニッ
トで処理されるビット数とに依存する。例えば、各プロ
セシングユニットにより処理されるビット数を２とする
と、プロセシングユニットの総数、つまり、ＰＥ１及び
ＰＥ２の全数は、ＮをマトリクスＡの最大ビット数とし
てＮが偶数の場合にＮ／２であって、Ｎが奇数の場合に
（Ｎ＋１）／２である。Each processing element row, that is, rows 16 and 18 are PE1 (16-1) and PE2 (16-), respectively.
2), PE3 (16-4), and PE1 (18-1),
It is composed of a plurality of processing element (PE) units described as PE2 (18-2) and PE3 (18-4). P
The E1 and PE2 units are utilized to perform the multiplications in a sequential manner, where each processing unit performs a certain number of multiplications in pipeline form. At the beginning of each clock cycle, the contents of one level of processing unit are transferred to the next level for further processing, where the multiplication of the next set of bits is performed. Therefore, the total number of PE1 and PE2 units depends on the number of bits of the matrix element to be multiplied and the number of bits processed by each processing unit. For example, assuming that the number of bits processed by each processing unit is 2, the total number of processing units, that is, the total number of PE1 and PE2 is N / 2 when N is an even number of the matrix A and N is an even number. Then, when N is an odd number, it is (N + 1) / 2.

【００２９】乗算器１０は、複数のレジスタから成るレ
ジスタ列２０を更に含む。プロセッサ素子列、即ち列１
６及び１８の演算は、以下に説明するように、レジスタ
列２０のレジスタと共に実行される。レジスタ列１８の
演算はレジスタ列１６の演算と同一であるので、レジス
タ列１６に関する以下の説明は、レジスタ列１８に対し
ても適用できる。２行２列マトリクスＡのマトリクス要
素ａ₁₁、ａ₁₂、ａ₂₁、及びａ₂₂が、一度にひとつずつ、
レジスタ列２０にライン２０−１から入力される。同時
に、マトリクスＢの一つの列のマトリクス要素ｂ₁₁、ｂ
₁₂が第１の列１６の第１のプロセッサ列ＰＥ（１６−
１）にライン３４−１から入力される。並列処理は、マ
トリクスＢの各列の乗算が列１６と同様に各列により同
時に行われるような乗算器のアーキテクチャを構成する
ことにより実行される。マトリクス乗算においてマトリ
クスＢの第１の列がマトリクスＡの第１及び第２の行に
乗算されるので、各マトリクス要素は、ライン３４−１
から２度入力される。The multiplier 10 further includes a register string 20 including a plurality of registers. Processor element row, ie row 1
The operations of 6 and 18 are performed with the registers of register column 20, as described below. Since the calculation of the register train 18 is the same as the calculation of the register train 16, the following description regarding the register train 16 is also applicable to the register train 18. The matrix elements a ₁₁ , a ₁₂ , a ₂₁ , and a ₂₂ of the 2-row 2-column matrix A , one at a time,
It is input to the register string 20 from the line 20-1. At the same time, the matrix elements of one row of the matrix B b _11, b
₁₂ is the first processor row PE (16-
1) is input from the line 34-1. Parallel processing is carried out by configuring the architecture of the multiplier such that the multiplication of each column of matrix B is performed by each column simultaneously as in column 16. Since in matrix multiplication the first column of matrix B is multiplied by the first and second rows of matrix A , each matrix element is associated with line 34-1.
Is input twice from.

【００３０】マトリクスＡの第１のマトリクス要素ａ₁₁
とマトリクスＢの第１のマトリクス要素ｂ₁₁が夫々ライ
ン２０−１及びライン３４−１を介して乗算器１０に同
じクロック周期中に入力される。要素ａ₁₁の第１及び第
２のビットを第１のプロセッサ素子ＰＥ１（１６−１）
に供給するためのタップライン３８−１がライン２０−
１の第１及び第２ビットの入力ラインに接続される。ａ
₁₁の第１及び第 2のビットがマルチプレクサ３６−１の
セレクタ制御入力に転送される。マルチプレクサ３６−
１は、データ入力としてライン３４−１から要素ｂ₁₁も
また受信する。ａ₁₁の第１及び第２のビットの乗算を実
行するよう修正ブースアルゴリズムがマルチプレクサ３
６−１において適用される。表１に３ビット修正ブース
テーブルを示し、ここで、各時点で２つの新しいビット
が処理される。これら２つの新しいビットの中、下位ビ
ットがビットｂ（ｉ）で、上位ビットがビットｂ（ｉ＋
１）として示され、一方、前回の２ビットの中の上位ビ
ットも利用されｂ（ｉ−１）と示される。最初の２ビッ
トの演算のために、ｂ（ｉ−１）の値は零に指定され
る。First matrix element a ₁₁ of matrix A
And the first matrix element b ₁₁ of the matrix B is input to the multiplier 10 during the same clock cycle via line 20-1 and line 34-1, respectively. The first and second bits of the element a ₁₁ are set to the first processor element PE1 (16-1).
The tap line 38-1 for supplying to
1 to the input lines of the first and second bits. a
_{The eleventh} first and second bits are transferred to the selector control input of multiplexer 36-1. Multiplexer 36-
1 also receives element b ₁₁ from line 34-1 as a data input. The modified Booth algorithm implements the multiplexer 3 to perform the multiplication of the first and second bits of a ₁₁ .
6-1 is applied. Table 1 shows a 3-bit modified Booth table, where two new bits are processed at each time. Of these two new bits, the lower bit is bit b (i) and the upper bit is bit b (i +
1), while the upper bits of the previous 2 bits are also used and denoted as b (i-1). For the first 2-bit operation, the value of b (i-1) is designated as zero.

【００３１】[0031]

【表１】 [Table 1]

【００３２】ＰＥ１で実行される最初の２ビットに対す
る修正ブーステーブルの演算が図２に示される。これら
２ビットからは２つの制御値だけが選択される。値０を
ビットｂ（ｉ−１）に割り当てることにより、修正ブー
ステーブルに従って、マルチプレクサは、演算＊０、＊
１、＊（−２）、及び＊（−１）を実行する。マルチプ
レクサの出力は、データ伝送ライン２８−１を介して、
クロック周期の立ち下がりエッジによって記憶されるレ
ジスタ（Ｒ）２４−１に転送される。その値は、最初の
２入力マトリクス要素ａ₁₁及びｂ₁₁の部分積を構成す
る。一方、クロック周期の最後に、これらの２つのマト
リクス要素、即ち、ａ₁₁及びｂ₁₁はライン２０−１及び
３２−１を介して転送されて、夫々レジスタ２１−１及
び３０−１に記憶される。ＰＥ１の演算は、最初の２入
力要素に対して終了する。The modified Booth table operation for the first two bits performed in PE1 is shown in FIG. Only two control values are selected from these two bits. By assigning the value 0 to bit b (i−1), the multiplexer operates according to the modified Booth table to compute * 0, *
Perform 1, * (-2), and * (-1). The output of the multiplexer is via the data transmission line 28-1.
It is transferred to the register (R) 24-1 stored at the falling edge of the clock cycle. Its value constitutes the partial product of the first two-input matrix elements a ₁₁ and b ₁₁ . On the other hand, at the end of the clock period, these two matrix elements, a ₁₁ and b _11, are transferred via lines 20-1 and 32-1 and stored in registers 21-1 and 30-1, respectively. It The operation of PE1 ends for the first two input elements.

【００３３】次のクロック周期の先頭で、次の２入力ａ
₁₂及びｂ₂₁がＰＥ１（１６−１）により受信され、ａ₁₁
及びｂ₁₁に関して実行されたのと同じ方法で処理され
る。一方、ａ₁₁の次の２ビットとマトリクス要素ｂ₁₁の
乗算が次のプロセッサ素子ＰＥ２（１６−２）で実行さ
れる。タップライン３８−２は、ａ₁₁の第２、第３及び
第４ビットをＰＥ２ユニット（１６−２）のマルチプレ
クサ３６−２のセレクタ制御入力に転送し、マトリクス
要素ｂ₁₁もライン３４−２を介してマルチプレクサ３６
−２に転送される。再度修正ブーステーブル、即ち、表
１が乗算を実行するようマルチプレクサ３６−２で利用
される。ＰＥ２ユニットで利用されるマルチプレクサが
示される図３を参照する。入力要素ａ₁₁の最初の３ビッ
トがセレクタ制御入力として利用される。３ビットの値
に依存して、５通りの演算、即ち、＊０、＊１、＊２、
＊（−１）及び＊（−２）が実行される。At the beginning of the next clock cycle, the next two inputs a
₁₂ and b ₂₁ are received by PE1 (16-1) and a ₁₁
And b ₁₁ are processed in the same manner as performed. On the other hand, the multiplication of the next 2 bits of a ₁₁ and the matrix element b ₁₁ is executed by the next processor element PE2 (16-2). Tap line 38-2 transfers the second, third and fourth bits of a ₁₁ to the selector control input of multiplexer 36-2 of PE2 unit (16-2) and matrix element b ₁₁ also connects line 34-2. Through multiplexer 36
-2. Again, the modified Booth table, i.e. Table 1, is utilized in multiplexer 36-2 to perform the multiplication. Reference is made to FIG. 3 where the multiplexer utilized in the PE2 unit is shown. The first 3 bits of the input element a ₁₁ are used as the selector control input. Depending on the 3-bit value, there are 5 types of operations, namely * 0, * 1, * 2,
* (-1) and * (-2) are executed.

【００３４】マルチプレクサ３６−２の出力が、ライン
２６−１を介してＰＥ１ユニット１６−１のレジスタ２
４−１に記憶された部分積をも受信するキャリ保存加算
器（ＣＳＡ）２４−２にライン２８−２を介して転送さ
れる。ＰＥ２ユニットのマルチプレクサ３６−２の出力
データは、データａ₁₁の第２、第３及び第４ビットに関
して実行された演算の結果得られた積を表し、したがっ
て、マルチプレクサ３６−２の出力は、まず最初に、マ
ルチプレクサ３６−１からの出力に関して２ビット左に
シフトされなければならない。標準の加算器の代わり
に、ＣＳＡ２４−２は、プロセッサ素子ユニットのスル
ープットを増加させるよう「キャリ算出及び加算演算」
を利用する。「キャリ及び加算」演算は、桁上げなしに
「和」を計算するよう「排他的ＯＲ」演算が各ビットに
対して実行され、「キャリ」を計算するよう各ビットに
ＡＮＤ演算を適用する演算に基づいていて、「和」と
「キャリ」とを加算することにより本当の和が得られ
る。The output of the multiplexer 36-2 is sent to the register 2 of the PE1 unit 16-1 via the line 26-1.
It is transferred via line 28-2 to a carry save adder (CSA) 24-2 which also receives the partial product stored in 4-1. The output data of the multiplexer 36-2 of the PE2 unit represents the product resulting from the operation performed on the second, third and fourth bits of the data a ₁₁ , so that the output of the multiplexer 36-2 is initially First, it must be shifted two bits to the left with respect to the output from multiplexer 36-1. Instead of a standard adder, the CSA 24-2 "carries and adds" to increase the throughput of the processor element unit.
To use. The "carry and add" operation is an operation in which an "exclusive OR" operation is performed on each bit to calculate "sum" without carry, and an AND operation is applied to each bit to calculate "carry". , The true sum is obtained by adding the "sum" and the "carry".

【００３５】第２のクロックの立ち下がりで、「キャ
リ」と「和」の新しい組が加算レジスタ２４−２に記憶
され、入力ａ₁₁は、記憶用レジスタ列２０のライン２０
−２を介してレジスタ２１−２に転送される。同様に、
マトリクス要素ｂ₁₁は、ライン３２−２を介してＰＥ２
ユニット１６−２のレジスタ３０−２に転送され、更に
処理されるようそこに記憶される。一方、マトリクス要
素の第２の対ａ₁₂及びｂ ₂₁と、キャリと和は、レジスタ
列のレジスタ２１−１と、レジスタ３０−１と２４−１
に夫々記憶される。At the falling edge of the second clock,
A new set of "ri" and "sum" is stored in the addition register 24-2.
And input a₁₁Is the line 20 of the storage register column 20.
-2 to the register 21-2. Similarly,
Matrix element b₁₁PE2 through line 32-2
Transferred to the register 30-2 of the unit 16-2,
It is stored there for processing. On the other hand, matrix required
The second pair of prime a₁₂And b _{twenty one}And carry and sum are registers
Column register 21-1, registers 30-1 and 24-1
Are memorized respectively.

【００３６】各クロック周期の先頭で、データは種々の
段の乗算を実行する各プロセッサ素子ユニットを用いて
全部にわたってパイプライン方式で処理される。ＰＥ３
ユニット（１６−３）がプロセシング素子列１６の最後
に設けられる。ＰＥ２と同様に、修正ブースアルゴリズ
ムがマルチプレクサ３６−３に実装される。マルチプレ
クサ３６−３の詳細を図４に示す。At the beginning of each clock cycle, the data is pipelined throughout with each processor element unit performing the various stages of multiplication. PE3
The unit (16-3) is provided at the end of the processing element array 16. Similar to PE2, the modified Booth algorithm is implemented in multiplexer 36-3. Details of the multiplexer 36-3 are shown in FIG.

【００３７】プロセシング素子列１６の演算を続ける
と、ａ₁₁の全ビットが処理されたとき、マトリクス要素
の最初の対ａ₁₁及びｂ₁₁の「和」と「キャリ」の最後の
組が補助レジスタ２４−３から得られる。ｂ₁₁が乗算さ
れたａ₁₁の値は、レジスタ２４−３に記憶された「和」
と「キャリ」の最後の組を加算レジスタ２４−４に加算
することにより得られる。Continuing the operation of processing element array 16, when all bits of a ₁₁ have been processed, the last pair of "sum" and "carry" of the first pair of matrix elements a ₁₁ and b ₁₁ is the auxiliary register. 24-3. The value of a ₁₁ multiplied by b ₁₁ is the “sum” stored in the register 24-3.
And the last set of "carries" are added to the addition register 24-4.

【００３８】積ａ₁₁ｂ₁₁は、ライン２６−３を介して累
算器（ＡＣＣ）４０に転送される。累算器４０は、帰還
経路４６を有し、その結果、第１の積ａ₁₁ｂ₁₁が累算器
４０にループバックする。積ａ₁₁ｂ₁₁終了の１クロック
周期後、累算器４０はライン２６−３を介して第２の積
ａ₁₂ｂ₂₁を受信する。これらの２つの積が累算器４０で
加算され、記憶される。The products a ₁₁ b ₁₁ are transferred to accumulator (ACC) 40 via line 26-3. The accumulator 40 has a feedback path 46 so that the first product a ₁₁ b ₁₁ loops back to the accumulator 40. One clock period after the end of the product a ₁₁ b ₁₁ , accumulator 40 receives the second product a ₁₂ b ₂₁ via line 26-3. These two products are added in the accumulator 40 and stored.

【００３９】次のクロック周期で、第３の積ａ₂₁ｂ₁₁が
ライン２６−３を介して累算器４０に到達する。累算器
４０の内容、即ち、ａ₁₁ｂ₁₁とａ₁₂ｂ₂₁との和ｃ₁₁がラ
イン４２を介してレジスタ（Ｒｈ）４４に転送される。
レジスタ４４は、次に、出力ポートか或いは更なる処理
のための別の処理回路のいずれかに接続される。ｃ₁₁が
レジスタ４４に到達した２クロック周期後、マトリクス
Ｃの別のマトリクス要素ｃ₂₁もライン４２を介して、出
力或いは更なる処理の準備が整ったレジスタ４４に転送
される。At the next clock cycle, the third product a ₂₁ b ₁₁ reaches the accumulator 40 via line 26-3. The contents of the accumulator 40, that is, the sum c _{11 of} a ₁₁ b ₁₁ and a ₁₂ b ₂₁ is transferred to the register (Rh) 44 via the line 42.
Register 44 is then connected either to the output port or to another processing circuit for further processing. Two clock cycles after c ₁₁ reaches register 44, the matrix
Another matrix element c ₂₁ of C is also transferred via line 42 to a register 44 ready for output or further processing.

【００４０】マトリクスＣの第１のマトリクス要素、即
ち、ｃ₁₁は、Ａの第１行とＢの第１列との積の和である
ため、累算器４０とこれに関連したレジスタ４４はｃ₁₁
の最終値を計算するために利用される。累算器４０とレ
ジスタ４４におけるデータ処理の詳細は、参照のために
ここに引用した特許出願第０７／８３６，０７５号に関
するのと同じである。The first matrix elements of the matrix C, that is, c ₁₁ are the sum of the product of the first column of the first row and the B of A, the register 44 associated therewith and accumulator 40 c ₁₁
Used to calculate the final value of. The details of the data processing in accumulator 40 and register 44 are the same as for patent application Ser. No. 07 / 836,075, incorporated herein by reference.

【００４１】かくして、プロセシング素子列１６は入力
データの複数の対の一連の積を順々に連続して発生し、
これらの積の和を計算することができる。入力データの
対が適当に配置されると、プロセシング素子列１６は、
第１のマトリクスの行と第２のマトリクスの列との乗算
の結果としてマトリクス要素を発生することができる。Thus, the processing element array 16 sequentially produces a series of products of a plurality of pairs of input data,
The sum of these products can be calculated. When the input data pairs are properly arranged, the processing element array 16 is
Matrix elements can be generated as a result of the multiplication of the rows of the first matrix and the columns of the second matrix.

【００４２】同様の過程により、プロセシング素子列１
８は、マトリクスＣの第２列のマトリクス要素を発生す
るようマトリクスＢの第２行のマトリクス要素を順々に
受信する。プロセシング素子列１６及び１８は、同期さ
せる方法により乗算タスクを同時に実行することができ
る。列１６及び１８の出力は、２×２離散余弦変換（Ｄ
ＣＴ）計算に更に利用することができる。乗算器１０に
類似した第２段の乗算器が別のマトリクス乗算を実行す
るために利用される。ＤＣＴ乗算回路の詳細は、８×８
ＤＣＴ回路により以下に説明する。By the same process, the processing element array 1
8 in turn receives the matrix elements of the second row of matrix B to generate the matrix elements of the second column of matrix C. The processing element trains 16 and 18 can perform multiplication tasks simultaneously in a synchronized manner. The outputs of columns 16 and 18 are the 2x2 discrete cosine transform (D
More available for CT) calculation. A second stage multiplier similar to multiplier 10 is utilized to perform another matrix multiplication. For details of the DCT multiplication circuit, see 8x8.
The DCT circuit will be described below.

【００４３】図５は、８×８ＤＣＴ処理を実行するモジ
ュール回路１００を示す。ＤＣＴ回路１００は、２つの
段、つまりＹ＝ＸＣ ^tを計算する第１段１０６とＺ＝Ｃ
Ｙを計算する第２段１０８から成る。第１段は、入力デ
ータ、即ち、マトリクス要素を時系列的に上から下へ１
レジスタ当たり１クロック周期で順々に転送するよう一
列に接続された複数のレジスタを有するレジスタ列１０
２を含む。第１段は１０６は、８つのプロセシング素子
列１０４−１乃至１０４−８から更に構成される。各プ
ロセシング素子列１０４−１乃至１０４−８は、図１に
示した回路１０のＰＥ１ユニット１６−１及び１８−１
と一致するＰＥ１ユニット１４０を有する。第１段は、
図１に示した回路１０のＰＥ２及びＰＥ３ユニットと同
じく一致するＰＥ２及びＰＥ３ユニットから更に構成さ
れる。FIG. 5 shows a module circuit 100 for performing 8 × 8 DCT processing. The DCT circuit 100 has two stages, the first stage 106 for calculating Y = XC ^t and Z = C.
It consists of a second stage 108 for calculating Y. The first stage is to input data, that is, the matrix elements from the top to the bottom in chronological order.
Register row 10 having a plurality of registers connected in a row so as to sequentially transfer one clock cycle per register
Including 2. The first stage 106 is further composed of eight processing element arrays 104-1 to 104-8. Each of the processing element arrays 104-1 to 104-8 corresponds to the PE1 units 16-1 and 18-1 of the circuit 10 shown in FIG.
With a PE1 unit 140 that matches The first stage is
It further comprises PE2 and PE3 units which also match the PE2 and PE3 units of the circuit 10 shown in FIG.

【００４４】乗算過程の実行において、修正ブースアル
ゴリズムが各ＰＥユニット内の２つの新しいビットを処
理するために利用される。乗算器１０に関して説明した
ように、ＰＥ１及びＰＥ２ユニットの総数は、Ｎが偶数
の場合にはＮ／２に等しく、Ｎが奇数の場合には（Ｎ＋
１）／２に等しく、ここでＮは、レジスタ列１０２に記
憶された入力要素のビット数である。In performing the multiplication process, a modified Booth algorithm is used to process the two new bits in each PE unit. As described with respect to the multiplier 10, the total number of PE1 and PE2 units is equal to N / 2 when N is even and (N +
1) / 2, where N is the number of bits of the input element stored in register series 102.

【００４５】図１の２×２乗算回路１０と同様に、列１
０４−１乃至１０４−８の各プロセス素子は、２つの入
力を受信する。第１データマトリクスＸの要素は、１周
期当たり１要素ずつ、レジスタ列１０２に順々に入力さ
れる。これらのマトリクス要素は、レジスタ列１０２の
次の下位レジスタに１周期当たり１レジスタずつ転送さ
れる。各クロック周期において、レジスタ列１０２の各
々のレジスタからの２つの新しいビットが、各プロセシ
ング素子ユニット１０４−１乃至１０４−８内の対応す
るＰＥユニットに転送される。部分積の過程が次に、各
行と対応する列に関して同時に各ＰＥユニットにおいて
パイプライン方式で実行される。各プロセシング素子列
１０４−１乃至１０４−８は中間マトリクスＹの要素の
一の列を発生する。各プロセシング素子列の最後で、中
間マトリクスＹのマトリクス要素を計算するために８つ
の積が加算される。各々のマトリクス要素に対する計算
の終了後、ライン１３２−１乃至１３２−８を介してマ
トリクス要素が転送され、第２段乗算器１０８により更
に処理される。Similar to the 2 × 2 multiplier circuit 10 of FIG.
Each of the process elements 04-1 to 104-8 receives two inputs. The elements of the first data matrix X are sequentially input to the register array 102, one element per cycle. These matrix elements are transferred to the next lower register of the register array 102, one register per cycle. In each clock cycle, two new bits from each register of register train 102 are transferred to the corresponding PE unit in each processing element unit 104-1 to 104-8. The partial product process is then pipelined in each PE unit simultaneously for each row and corresponding column. Each processing element array 104-1 to 104-8 produces one array of elements of the intermediate matrix Y. At the end of each processing element row, eight products are added to calculate the matrix elements of the intermediate matrix Y. After completion of the calculations for each matrix element, the matrix elements are transferred via lines 132-1 through 132-8 for further processing by the second stage multiplier 108.

【００４６】第１段１０６と同様に、第２段１０８も、
レジスタ列１１０と８つのプロセシング素子列１２４−
１乃至１２４−８を含み、ここに、各列もマトリクス計
算Ｚ＝ＣＹを実行するようＰＥ１、ＰＥ２及びＰＥ３ユ
ニットを有する。修正ブースアルゴリズムを利用するこ
とにより、ＰＥ１及びＰＥ２ユニットの総数は、レジス
タ列１１０に記憶されたマトリクス要素のビット数の約
半数である。演算の順序は、第１段１０６のそれと同じ
である。各プロセシング素子列の最終の積がライン１３
０−１乃至１３０−８を介して８つの加算器（ＡＤＤ）
１６２−１乃至１６２−８に転送され、積の最終値を得
るためのキャリと和とが加算される。Like the first stage 106, the second stage 108 also
Register array 110 and eight processing element arrays 124-
1 to 124-8, where each column also has PE1, PE2 and PE3 units to perform the matrix calculation Z = CY . By utilizing the modified Booth algorithm, the total number of PE1 and PE2 units is approximately half the number of bits of the matrix elements stored in the register column 110. The order of operations is the same as that of the first stage 106. The final product of each processing element array is line 13
8 adders (ADD) through 0-1 to 130-8
162-1 to 162-8, and carry and sum for obtaining the final value of the product are added.

【００４７】加算器１６２−１乃至１６２−８において
計算された積は、累算器（ＡＣＣ）１６４−１乃至１６
４−８の配列に転送され更に処理される。マトリクスＣ
とマトリクスＹのマトリクス乗算を考察すると、マトリ
クスＺの各要素は、マトリクスＣの１行とマトリクスＹ
の１列との積の和である。しかしながら、第１段１０６
の出力からは、マトリクスＹの各行に対してマトリクス
要素のシーケンスが発生される。先の特許出願第０７／
８３６，０７５号において、全６４要素を正確に累算
し、最終的な２次元ＤＣＴ要素を計算するために２つの
スイッチ配列ボックスと６４個の累算器の累算器配列が
使用される。図５は、８×８ＤＣＴ回路１００により占
有されるＩＣチップ領域の更なる節約を行うためより複
雑でない回路設計を利用する簡略化された回路を示す。
一時記憶レジスタ１８０−１乃至１８０−８の配列及び
出力レジスタ１９０−１乃至１９０−８の配列が利用さ
れ、ここで、一時記憶レジスタ及び出力レジスタ各々
は、８個のシフトレジスタを含む。一時記憶レジスタの
詳細は各シフトレジスタが２６ビットを有するので図６
にｆｆ＿２６として示され、出力レジスタのそれは各シ
フトレジスタが１８ビットを有するので図７にｆｆ＿１
８として示される。The products calculated by the adders 162-1 to 162-8 are stored in accumulators (ACC) 164-1 to 16-16.
It is transferred to a 4-8 array for further processing. Matrix C
And matrix multiplication of matrix Y , each element of matrix Z is one row of matrix C and matrix Y.
Is the sum of products with one column of. However, the first stage 106
From the output of, a sequence of matrix elements is generated for each row of the matrix Y. Previous patent application No. 07 /
In 836,075, two switch array boxes and an accumulator array of 64 accumulators are used to accurately accumulate all 64 elements and compute the final 2D DCT element. FIG. 5 shows a simplified circuit that utilizes a less complex circuit design to further save the IC chip area occupied by the 8 × 8 DCT circuit 100.
An array of temporary storage registers 180-1 through 180-8 and an array of output registers 190-1 through 190-8 are utilized, where each temporary storage register and output register includes eight shift registers. Details of the temporary storage register are shown in FIG. 6 because each shift register has 26 bits.
Ff_26 in FIG. 7 and each of the shift registers has 18 bits.
Shown as 8.

【００４８】累算器１６４−１乃至１６４−８における
加算演算は、８クロック周期毎にのみ実行され、累算器
１６４−１乃至１６４−８において得られる和は、一時
記憶レジスタ１８０−１乃至１８０−８に最初に転送さ
れる。一時記憶レジスタ１８０の第１のシフトレジスタ
に記憶されたデータは、９番目の各クロック周期の先頭
でループバックされ、累算器１６４−１乃至１６４−８
の値に加算される。一方、各クロック周期の先頭で、累
算器１６４−１乃至１６４−８により新しく計算された
値が第１のシフトレジスタに記憶される時、一時記憶レ
ジスタ１８０−１乃至１８０−８におけるデータが隣の
シフトレジスタへ右にシフトされる。第５７番目のクロ
ック周期の先頭でマトリクスＺの第１行の計算の終了時
に、全マトリクス要素が第１の出力レジスタ１９０−１
に並列に転送される。同様に、マトリクスＺの第２行の
終了時に、マトリクス要素が出力レジスタ１９０−２に
並列に転送され、以下同様である。これらのデータは、
高解像度テレビ（ＨＤＴＶ）システム又はこれ以外の型
の応用に利用されるように最終的な処理をするラウンド
回路２００に直列出力ポートを介して直列に更に転送さ
れる。The addition operation in the accumulators 164-1 to 164-8 is executed only every 8 clock cycles, and the sum obtained in the accumulators 164-1 to 164-8 is the temporary storage registers 180-1 to 180-1. First transferred to 180-8. The data stored in the first shift register of the temporary storage register 180 is looped back at the beginning of each ninth clock cycle, and the accumulators 164-1 to 164-8 are added.
Is added to the value of. On the other hand, when the value newly calculated by the accumulators 164-1 to 164-8 is stored in the first shift register at the beginning of each clock cycle, the data in the temporary storage registers 180-1 to 180-8 are It is shifted right to the next shift register. At the beginning of the 57th clock cycle, at the end of the calculation of the first row of the matrix Z , all matrix elements have the first output register 190-1.
Are transferred in parallel to. Similarly, at the end of the second row of matrix Z , the matrix elements are transferred in parallel to output register 190-2, and so on. These data are
It is further serially transferred via the serial output port to the rounding circuit 200 for final processing for use in high definition television (HDTV) systems or other types of applications.

【００４９】ＤＣＴ回路１００は、入力ＤＣＴマトリク
ス係数の読み込みと中間テストの実行の２重の目的のた
めの保持レジスタ１２６を更に有する。図８は保持レジ
スタ１２６の回路を示す。乗算演算の終了時、積の値が
累算器１２５−１乃至１２５−８に記憶される。望まし
い一実施例では、図に示すように、保持レジスタは２つ
の直列双方向ポート、つまり、１０ビット直列ポートと
１２ビット直列ポートを有する。これらのポートは、第
２段のＤＣＴ計算を実行するためのＤＣＴマトリクス係
数を読み込むために利用される。また同時に、テストの
ために、テスト用の積の高位ビット、即ちビット１乃至
ビット１２が１２ビット直列ポートから読み込まれ、テ
スト用の積の下位ビット、即ちビット１３乃至ビット２
２が１０ビット直列ポートから読み込まれる。７クロッ
ク周期があるので、８×８ＤＣＴ計算において加算演算
が実行される前に、２つのマトリクス要素の積が一時的
に記憶される。ＤＣＴタスクの通常の演算とマトリクス
係数の入力とを干渉することなく所定のクロック周期に
おいてデータが読み込まれる。データ入力ポートを双方
向直列ポートとすることにより、回路１００は、回路設
計にコストを追加することなく中間テストを実行する付
加的な能力を提供する。The DCT circuit 100 further comprises a holding register 126 for the dual purpose of reading the input DCT matrix coefficients and performing the intermediate test. FIG. 8 shows a circuit of the holding register 126. At the end of the multiplication operation, the product values are stored in accumulators 125-1 through 125-8. In a preferred embodiment, the holding register has two serial bidirectional ports, a 10-bit serial port and a 12-bit serial port, as shown. These ports are used to read the DCT matrix coefficients to perform the second stage DCT calculation. At the same time, for testing purposes, the high order bits of the product for testing, ie bits 1 through 12 are read from the 12-bit serial port and the low order bits of the product for testing, ie bits 13 through 2 are read.
2 is read from the 10-bit serial port. Since there are 7 clock periods, the product of two matrix elements is temporarily stored before the addition operation is performed in the 8x8 DCT calculation. Data is read at a predetermined clock cycle without interfering with the normal operation of the DCT task and the input of matrix coefficients. By making the data input port a bidirectional serial port, the circuit 100 provides the additional ability to perform intermediate tests without adding cost to the circuit design.

【００５０】上記の回路素子に加えて、２段の８×８Ｄ
ＣＴ回路１００は、クロック発生器２１０とＤＣＴ係数
を供給する係数供給手段（図示されない）をも有する。
これらの素子及びキャリ保存加算器（ＣＳＡ）の詳細
は、本発明と共通の発明人による先の特許出願第０７／
８３６，０７５号に記載されたものと同様である。先の
特許出願と比較して、プロセッサ素子ユニットの数、即
ち、ＰＥ１及びＰＥ２の総数は、本発明において約半数
に削減されている。In addition to the above circuit elements, two stages of 8 × 8D
The CT circuit 100 also has a clock generator 210 and coefficient supply means (not shown) for supplying DCT coefficients.
Details of these elements and carry-save adders (CSA) can be found in the previous patent application No. 07 /
Similar to that described in No. 836,075. Compared to the previous patent application, the number of processor element units, ie the total number of PE1 and PE2, has been reduced to about half in the present invention.

【００５１】特許出願第０７／８３６，０７５号と同様
に、時系列的なパイプライン方式でＤＣＴ変換を実行す
るために２段回路が利用され、ここに、第２マトリクス
の各行の乗算が並列に処理される。しかしながら、本発
明はプロセッサ素子ユニットの数、即ち、ＰＥ１及びＰ
Ｅ２の数が殆どの場合において約５０％に削減されると
いう利点を有する。これは、各プロセッサ素子ユニット
が１ビットに代わって新しい２ビットを処理することを
可能とする修正ブースアルゴリズムを利用したことによ
り得られ、その結果、同数のデータを処理するのに必要
とされる回路を削減する。プロセッサ素子ユニットＰＥ
１及びＰＥ２においてクロック周期毎に２ビットが処理
されること、及び、ＣＳＡが使用されて、ここで、通常
の加算器においては屡々生ずる桁上げによるリップル効
果を伴わずに簡単な排他的ＯＲ及びＡＮＤ論理演算が実
行されるので演算がより効率的になることにより、デー
タ処理スループットも増加する。Similar to the patent application No. 07 / 836,075, a two-stage circuit is used to perform the DCT transform in a time-series pipeline system, in which the multiplication of each row of the second matrix is performed in parallel. Is processed. However, the present invention does not limit the number of processor element units, namely PE1 and P1.
It has the advantage that the number of E2s is reduced to about 50% in most cases. This is obtained by utilizing a modified Booth algorithm that allows each processor element unit to process a new 2 bits instead of 1 bit, so that it is required to process the same number of data. Reduce the circuit. Processor element unit PE
1 and PE2 process 2 bits per clock period, and CSA is used where a simple exclusive OR and a simple adder and OR without the carry ripple effect that often occurs in conventional adders. Data processing throughput is also increased by making AND operations more efficient because they are performed.

【００５２】マトリクスの列及び行数が増加され、処理
されるデータのビット数が大きくなるにつれて、ＩＣチ
ップ上の「リアルエステート」の節約が一層重要にな
る。単一のチップに複数の機能を集積するために、１種
類の回路、例えばＤＣＴ変換用の回路が利用しうるＩＣ
チップ上の有効な領域がかくしてさらに制約される。本
発明は、従来技術における潜在的な制約を克服しうるパ
イプライン乗算回路を教示する。As the number of columns and rows in the matrix increases and the number of bits of data processed increases, the savings of "real estate" on the IC chip becomes more important. An IC that can be used by one type of circuit, for example, a circuit for DCT conversion, to integrate multiple functions on a single chip
The effective area on the chip is thus further constrained. The present invention teaches a pipelined multiplication circuit that can overcome the potential limitations of the prior art.

【００５３】本発明を現在望ましい実施例に関して説明
しているが、かかる開示を限定としては解釈しないこと
が理解されるべきである。上記の説明を読むことによ
り、当業者にとっては、種々の置き換え及び変更が疑い
なく自明である。したがって、追加された請求項は、本
発明の真の精神と目的の範囲において、全ての置換と変
更に及ぶと解釈されることが意図されている。While the present invention has been described in terms of its presently preferred embodiments, it should be understood that such disclosure is not to be construed as limiting. Various substitutions and modifications will no doubt become apparent to those skilled in the art upon reading the above description. Therefore, it is intended that the appended claims cover all substitutions and modifications within the true spirit and scope of the invention.

[Brief description of drawings]

【図１】本発明による２行２列マトリクス乗算回路を示
す図である。FIG. 1 is a diagram showing a 2 × 2 matrix multiplication circuit according to the present invention.

【図２】プロセス素子１（ＰＥ１）に修正ブースアルゴ
リズムを実装するためのマルチプレクサを示す図であ
る。FIG. 2 shows a multiplexer for implementing the modified Booth algorithm in process element 1 (PE1).

【図３】プロセス素子２（ＰＥ２）に修正ブースアルゴ
リズムを実装するためのマルチプレクサを示す図であ
る。FIG. 3 shows a multiplexer for implementing the modified Booth algorithm in process element 2 (PE2).

【図４】プロセス素子３（ＰＥ３）に修正ブースアルゴ
リズムを実装するためのマルチプレクサを示す図であ
る。FIG. 4 shows a multiplexer for implementing the modified Booth algorithm in process element 3 (PE3).

【図５】８行８列ＤＣＴ処理を実行するモジュール回路
を示す図である。FIG. 5 is a diagram showing a module circuit that executes 8-row by 8-column DCT processing.

【図６】一時記憶レジスタの配列ｆｆ＿２６を示す図で
ある。FIG. 6 is a diagram showing an array ff_26 of temporary storage registers.

【図７】出力レジスタの配列ｆｆ＿１８を示す図であ
る。FIG. 7 is a diagram showing an array ff_18 of output registers.

【図８】中間テストを実行する保持レジスタの配列を示
す図である。FIG. 8 is a diagram showing an array of holding registers for executing an intermediate test.

[Explanation of symbols]

１０マトリクス乗算回路１６、１８プロセシング素子列１６−１、１６−２、１６−３、１６−４、１８−１、
１８−２、１８−４、１０４−１、．．、１０４−８、
１２４−１、．．、１２４−８プロセシング素子ユ
ニット２０、１０２、１１０レジスタ列２０−１、２０−２、２６−１、２６−３、３２−１、
３２−２、３４−１、３４−２、４２、１３０−
１、．．、１３０−８、１３２−１、．．、１３２−８
ライン２１−１、２１−２、２４−１、３０−１、３０−２、
４４レジスタ２４−２キャリ保存加算器２４−３補助レジスタ２４−４加算レジスタ２８−１、２８−２データ伝送ライン３６−１、３６−２、３６−３マルチプレクサ３８−１、３８−２タップライン４０、１２５−１、．．、１２５−８、１６４−
１、．．、１６４−８累算器４６帰還経路１００モジュール回路１０６第１段乗算器１０８第２段乗算器１２６保持レジスタ１４０ＰＥ１ユニット１６２−１、．．．、１６２−８加算器１８０−１、．．．、１８０−８一時記憶レジスタ１９０−１、．．．、１９０−８出力レジスタ２００ラウンド回路２１０クロック発生器10 matrix multiplication circuit 16, 18 processing element array 16-1, 16-2, 16-3, 16-4, 18-1,
18-2, 18-4, 104-1 ,. ． , 104-8,
124-1. ． , 124-8 Processing element unit 20, 102, 110 Register sequence 20-1, 20-2, 26-1, 26-3, 32-1,
32-2, 34-1, 34-2, 42, 130-
1 ,. ． , 130-8, 132-1 ,. ． , 132-8
Lines 21-1, 21-2, 24-1, 30-1, 30-2,
44 register 24-2 carry save adder 24-3 auxiliary register 24-4 addition register 28-1, 28-2 data transmission line 36-1, 36-2, 36-3 multiplexer 38-1, 38-2 tap line 40, 125-1 ,. ． , 125-8, 164-
1 ,. ． , 164-8 Accumulator 46 Feedback path 100 Module circuit 106 First stage multiplier 108 Second stage multiplier 126 Holding register 140 PE1 unit 162-1 ,. ．． , 162-8 adders 180-1 ,. ．． , 180-8 Temporary storage registers 190-1 ,. ．． , 190-8 Output register 200 Round circuit 210 Clock generator

───────────────────────────────────────────────────── フロントページの続き (51)Int.Cl.⁶ 識別記号庁内整理番号ＦＩ技術表示箇所Ｈ０４Ｎ 7/30 Ｈ０４Ｎ 7/133 Ｚ ─────────────────────────────────────────────────── ─── Continuation of the front page (51) Int.Cl. ⁶ Identification code Office reference number FI Technical display location H04N 7/30 H04N 7/133 Z

Claims

[Claims]

1. A plurality of processor elements (P) each having storage means for storing at least two numbers to be multiplied.
E) unit, each of the PE units being connected to another PE unit that transmits the numerical value between the storage means in a predetermined order, each PE unit having a plurality of bits of the numerical value from the storage means. A multiplication circuit further comprising multiplication means for receiving and processing the plurality of bits to generate a partial processing result, wherein at least one of the PE units uses the partial processing result to generate a product of the numerical values.

2. The multiplication circuit according to claim 1, wherein the PE units are sequentially connected so that the numerical value is transmitted from one PE unit to an adjacent PE unit in time series.

3. The multiplication circuit according to claim 2, wherein the multiplication means includes a multiplexer.

4. The multiplier circuit of claim 3, wherein the multiplexer utilizes a modified Booth algorithm.

5. The multiplication means further comprises a carry save adder (CSA) for generating a sum and a carry for the partial processing result, the at least one PE unit and the carry so as to generate the product. The multiplying circuit according to claim 4, further comprising an adding unit that adds the carry.

6. The multiplication circuit according to claim 5, wherein the storage means further includes at least two registers for storing the carry and the sum.

7. A plurality of processor elements (P) each having storage means for receiving and storing a plurality of sets of numbers to be multiplied.
E) unit, each of the PE units being connected to another PE unit in which the set of numerical values is sequentially transmitted from one of the storage means to an adjacent storage means in time sequence according to a predetermined order, Each of the PE units further comprises multiplication means including a multiplexer for receiving the plurality of bits of the set of numbers from the storage means and processing the plurality of bits utilizing a modified Booth algorithm to generate a partial processing result. The multiplication means further includes a carry save adder (CSA) for generating a sum and a carry for the partial processing result, and at least one of the PE units adds the sum and the carry to generate a product. A multiplying circuit including adding means for performing.

8. The multiplier circuit of claim 7, wherein the multiplexer in each of the PE units utilizes a modified Booth algorithm to process 2 bits each time.

9. The set of numbers comprises a plurality of matrix elements of two matrices, thereby performing a multiplication of the matrix elements to generate a plurality of matrix elements of a third matrix, wherein 9. The multiplication circuit according to claim 8, wherein the third matrix is a product of the two matrices.

10. A method for pipeline-wise multiplying a plurality of sets of a first numerical value and a second numerical value, comprising: (a) first multiplying the first numerical value and the second numerical value by Receiving and storing by a storage means in a processing element (PE) unit of said (b) processing said plurality of bits of said first and second numerical values to obtain a partial product of said first and second numerical values; A multiplying means in one processing element (PE) unit, thus generating a carry and a sum, and (c) the carry and the partial product containing the sum and the first and second
Is transmitted from the first PE unit to a neighboring PE unit in a predetermined order and step (b) is repeated until all bits of the first and second numbers have been processed. Processing the further bits of the two numbers, (d) adding the carry and the sum to produce a product of the first and second numbers, and (e) each set of the first and second numbers. About step (a)
A multiplication method comprising the steps of repeating (d) to (d).

11. A plurality of processor elements (PEs) each including storage means for receiving and storing a plurality of sets of numerical values in columns of a first matrix and corresponding rows of a second matrix.
A multiplication circuit comprising a plurality of processor element (PE) sequences each including a unit, wherein each of the PE units is configured to time-sequentially output the set of numerical values connected to another PE unit of the PE sequence. Sequentially transmitting between the storage means, each of the PE units further comprising multiplication means for receiving the plurality of bits of the set of numbers from the storage means and processing the plurality of bits to generate a partial processing result. , Said at least one multiplication means utilizing said partial processing result produces a product of said set of numerical values, each said column of said second matrix having a corresponding PE column, said PE column being said product , Whereby each of the PE columns performs a matrix multiplication of the columns of the second matrix to matrix multiply the first matrix with the second matrix.

12. A plurality of processor element (PE) units each including a storage means for receiving and storing a plurality of sets of numerical values in a row of the first matrix and corresponding columns of the second matrix. A plurality of processor element (PE) columns, each of said PE columns having a corresponding column of said second matrix, each of said PE units being connected to another PE unit of said PE column; Transmit the sets of numbers sequentially between the storage means in time series, each of the PE units receiving a plurality of bits of the sets of numbers from the storage means and generating a partial processing result. Further comprising multiplying means for processing the plurality of bits, wherein at least one of the multiplying means utilizes the partial processing result,
Generating a product of the set of numbers, thus multiplying a corresponding element of a column of the second matrix by a corresponding element of a column of the second matrix; A plurality of processor element (PE) columns each including a plurality of processor element (PE) units each including storage means for receiving and storing matrix elements of corresponding columns of the three matrices. The product is the multiplication of the element of the row of the first matrix with the corresponding element of the column of the second matrix, each PE unit being connected to another PE unit of the PE column. , Sequentially transmitting the intermediate product from each first stage and the matrix element set of the corresponding column of the third matrix in chronological order between the storage means, each of the PE units Receiving a plurality of bits of the set of numbers from the storage means Further comprising multiplication means for processing the plurality of bits to generate a partial processing result, the at least one multiplication means utilizing the partial processing result being the intermediate product from the first stage and the third matrix. A second stage for producing a product of the matrix elements of the corresponding column of each of the PE columns of the second stage having accumulating means for adding the products, Causes the first stage and the second stage to move to the second stage.
A matrix circuit for performing matrix multiplication of the third matrix in parallel with the third matrix in a pipeline manner.

13. The accumulating means in each of the PE columns of the second stage further comprises a plurality of storage shift registers, wherein one of the storage shift registers is the intermediate product from the first stage. The accumulating means in each of the PE columns of the second stage corresponds to the column of one of the third matrices for temporarily storing the product of the corresponding column matrix element of the third matrix. 13. The multiplier circuit of claim 12, further comprising an accumulation register for sequentially adding the products received from the storage shift register.