WO1999008204A1

WO1999008204A1 - Device and method for processing data

Info

Publication number: WO1999008204A1
Application number: PCT/JP1997/002708
Authority: WO
Inventors: Masahiro Kainaga; Koji Yamada; Hiroyuki Ono
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-08-05
Filing date: 1997-08-05
Publication date: 1999-02-18
Anticipated expiration: 2000-02-05
Also published as: TW379301B

Abstract

As a basic mechanism for speeding up the two-dimensional discrete cosine transform and the two-dimensional inverse discrete cosine transform, the device provides an arithmetic circuit having as many multiply-accumulate units as there are vector elements, for linearly transforming four-element vectors; first register files for storing the 4x4 matrices of the linear transformation; second register files for storing the four-element vectors to be transformed; and third register files for storing the results of the linear transformation. Two kinds of matrix operation instructions, class 1 and class 2, are provided for operating on the matrices of the first register files and the vectors of the second register files. The arithmetic circuit is controlled so that values are read from the first register files in the row direction when a class-1 matrix operation instruction is executed and in the column direction when a class-2 matrix operation instruction is executed.

Description

Data processing device and data processing method

Technical Field

The present invention relates to a data processing device such as a microprocessor or a microcomputer, and more particularly to a data processing device suitable for executing image processing application programs that include the two-dimensional discrete cosine transform and the two-dimensional inverse discrete cosine transform.

Background Art

Unprocessed image data is enormous in volume, which causes problems such as requiring a large-capacity memory to store it and a long transmission time to transfer it. Countermeasures are therefore taken, such as compressing image data before storing it in memory and decompressing it immediately before use, or compressing it before transmission and decompressing it after reception.

The compression and decompression of image data are described below, taking the encoding of still images according to the JPEG standard as an example. In the JPEG standard, compression is a combination of the following two methods.

(1) Two-dimensional discrete cosine transform (DCT)

(2) Huffman coding

Decompression, on the other hand, is a combination of the following two methods.

(3) Huffman decoding

(4) Two-dimensional inverse discrete cosine transform

The two-dimensional discrete cosine transform (1) is performed on the values of a two-dimensional block of 8 x 8 pixels. Specifically, the values of the 8 x 8 pixel block are multiplied by a matrix called the DCT basis. The result of the transform (called the DCT coefficients) is therefore also a two-dimensional block of 8 x 8 values. The DCT coefficients are then quantized (a process that replaces every value within an interval by a representative value of that interval) using a quantization table that holds a different value for each coefficient position. In the two-dimensional discrete cosine transform of image data, the lower-right part of the transformed block usually contains many values close to 0, and most of them become 0 after quantization.
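The transform and quantization steps described above can be sketched as follows. This is a minimal illustration, not the method of the invention: it assumes the orthonormal DCT-II basis, computes the transform as the matrix product C·X·Cᵀ, and uses a hypothetical flat quantization table (every entry 16). The names `dct_basis`, `dct2`, and `quantize` are introduced here for illustration only.

```python
import math

N = 8  # block size used by JPEG


def dct_basis(n=N):
    """Orthonormal DCT-II basis matrix C (assumed normalisation)."""
    basis = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        basis.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                      for i in range(n)])
    return basis


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(block):
    """Two-dimensional DCT of a square block: C * X * C^T."""
    c = dct_basis(len(block))
    return matmul(matmul(c, block), transpose(c))


def quantize(coeffs, table):
    """Divide each coefficient by its table entry and round to an integer."""
    return [[round(coeffs[i][j] / table[i][j]) for j in range(len(coeffs[0]))]
            for i in range(len(coeffs))]


# A flat grey block: all of its energy ends up in the top-left coefficient.
flat_block = [[128] * N for _ in range(N)]
flat_table = [[16] * N for _ in range(N)]  # hypothetical quantization table
coefficients = dct2(flat_block)
quantized = quantize(coefficients, flat_table)
```

For the flat block, only the top-left (lowest-frequency) coefficient is non-zero, which is the extreme case of the concentration of energy that the text describes.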

The Huffman coding (2) converts the quantized values of the 8 x 8 pixel block into a bit stream. The coding exploits the fact that many of the values in the block are 0. That is, it is a variable-length coding that assigns short bit strings to the signal values that occur with high probability. As a result, the number of bytes in the encoded bit stream is about 1/10 of the number of bytes in the data before conversion.
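As a rough illustration of why frequent values receive short codes, the sketch below computes Huffman code lengths for a hypothetical distribution of 64 quantized values. It is a textbook Huffman construction, not the JPEG entropy coder (which also uses run lengths and predefined tables); the function name and the frequency table are assumptions made for this example.

```python
import heapq


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a frequency table."""
    if len(freqs) == 1:
        return {symbol: 1 for symbol in freqs}
    # Heap entries: (weight, tie-breaker, symbols in this subtree).
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = {symbol: 0 for symbol in freqs}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol they contain.
        for symbol in group1 + group2:
            lengths[symbol] += 1
        heapq.heappush(heap, (w1 + w2, tie, group1 + group2))
        tie += 1
    return lengths


# Hypothetical block: 50 zeros and a few small non-zero values (64 in total).
frequencies = {0: 50, 1: 8, -1: 4, 2: 2}
lengths = huffman_code_lengths(frequencies)
total_bits = sum(frequencies[s] * lengths[s] for s in frequencies)
```

With this table the value 0 gets a 1-bit code, and the block needs 84 bits instead of the 128 bits a fixed 2-bit code would use.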

The Huffman decoding (3) is the inverse of the Huffman coding (2); that is, it restores the bit stream to the values of an 8 x 8 pixel block. The two-dimensional inverse discrete cosine transform (4) is the inverse of the two-dimensional discrete cosine transform (1); that is, it applies the inverse of transform (1) to the values of an 8 x 8 block and restores the values of the original 8 x 8 pixel block. Specifically, the image data is restored by multiplying the 8 x 8 block of values (DCT coefficients) produced by Huffman decoding by the DCT basis.

Because of the quantization performed as part of the two-dimensional discrete cosine transform (1), the values of an 8 x 8 pixel block before the two-dimensional discrete cosine transform do not exactly match the values of the block after the two-dimensional inverse discrete cosine transform (4). In other words, the compression and decompression are lossy. However, unless the quantization is extremely coarse, the difference between the original image and the restored image can hardly be discerned by the human eye, so this causes no practical problem.

The advantage described so far is that image compression reduces the number of bytes required for image data to about 1/10, which improves the efficiency of storage and transfer by a factor of about 10. On the other hand, compression and decompression add processing effort and time. For example, if image data is stored uncompressed in a storage device, it can be used as soon as it is read out. If compressed image data is stored, however, it can be used only after it has been read out and decompressed to restore the original image data.

When image data is compressed and decompressed on a general-purpose microprocessor, as studied by the present inventors prior to the present invention, the Huffman coding (2) and the Huffman decoding (3) each require about 1,000 executed instructions per 8 x 8 pixel block, and the discrete cosine transform (1) and the inverse discrete cosine transform (4) each require about 1,000 to 2,000. Encoding by (1) and (2), or decoding by (3) and (4), therefore requires about 2,000 to 3,000 instructions per 8 x 8 pixel block.

Consequently, a simple conversion to the number of instructions required per 640 x 480 pixel color image shows that the above processing must be executed 4,800 times for each color (a 640 x 480 image contains 80 x 60 = 4,800 blocks of 8 x 8 pixels) and 14,400 times in total. In other words, a conventional microprocessor must execute about 28.8M to 43.2M instructions (M meaning one million) per image to encode or decode the image data. If one instruction takes one clock cycle on average, a microprocessor operating at 100 MHz needs a processing time of 288 to 432 ms per image. At this speed, still images can be processed at a rate of only 2 to 4 per second, which makes it difficult to produce a moving-picture effect by displaying still images in succession.
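The estimate above can be reproduced with a few lines of arithmetic. The sketch below only restates the figures given in the text; the function name and parameter names are introduced here for illustration, and the value of three color components is inferred from the ratio 14,400 / 4,800.

```python
def jpeg_cost_estimate(width=640, height=480, colors=3,
                       instr_per_block=(2000, 3000),
                       clock_hz=100_000_000, cycles_per_instr=1):
    """Return (blocks per color, total blocks, instruction range, time range in ms)."""
    blocks_per_color = (width // 8) * (height // 8)
    total_blocks = blocks_per_color * colors
    instructions = tuple(total_blocks * n for n in instr_per_block)
    milliseconds = tuple(i * cycles_per_instr * 1000 / clock_hz
                         for i in instructions)
    return blocks_per_color, total_blocks, instructions, milliseconds


blocks_per_color, total_blocks, instructions, milliseconds = jpeg_cost_estimate()
# Frames per second at the slow and fast ends of the range.
frames_per_second = tuple(1000 / ms for ms in reversed(milliseconds))
```

The resulting frame rate lies between roughly 2.3 and 3.5 images per second, consistent with the range of 2 to 4 stated above.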

Measures are therefore needed, such as reducing the screen size, providing special hardware for compression and decompression, reducing the number of instructions by introducing clever algorithms, or adopting a machine-language execution scheme that can process two or more instructions per clock cycle (superscalar).

Note that the statement above, that processes (1) and (4) require 1,000 to 2,000 instructions per 8 x 8 pixel block, means about 2,000 instructions for a naive algorithm with no particular optimization and about 1,000 instructions for a carefully devised one.

An object of the present invention is to provide a data processing device capable of executing the two-dimensional discrete cosine transform and the two-dimensional inverse discrete cosine transform at high speed.

Another object of the present invention is to provide the basic mechanism (the minimum necessary hardware) of a data processing device required to execute the two-dimensional discrete cosine transform and the two-dimensional inverse discrete cosine transform at high speed, an instruction form for making effective use of that mechanism, and a method of controlling the basic mechanism by those instructions.

The above and other objects and novel features of the present invention will become apparent from the description in this specification and the accompanying drawings.

Disclosure of the Invention

A brief outline of representative aspects of the invention disclosed in this application is as follows.

In the present invention, as a basic mechanism for speeding up the two-dimensional discrete cosine transform and the two-dimensional inverse discrete cosine transform, an arithmetic circuit is provided that has as many multiply-accumulators as there are vector elements, for linearly transforming a four-element vector. A first register file is provided for storing the 4 x 4 matrix of the linear transformation, a second register file for storing the four-element vector to be transformed, and a third register file for storing the result of the linear transformation. Furthermore, two kinds of instructions, a first-class matrix operation instruction and a second-class matrix operation instruction, are provided for operating on the matrix in the first register file and the vector in the second register file. The arithmetic circuit is controlled so that the values are read from the first register file in the row direction when the first-class matrix operation instruction is executed and in the column direction when the second-class matrix operation instruction is executed.
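The difference between the two instruction classes can be sketched in software as follows. This is an illustrative model under stated assumptions, not the circuit itself: the 16 matrix registers are assumed to be held as a flat list in which index 4*i + k is row i, column k (the actual register assignment is defined by the hardware), and the function names are hypothetical. Each of the four multiply-accumulators starts from zero and accumulates four products.

```python
def multiply_accumulate(coefficients, vector):
    """One multiply-accumulator: clear the accumulator, then add four products."""
    accumulator = 0
    for c, v in zip(coefficients, vector):
        accumulator += c * v
    return accumulator


def class1_matrix_op(matrix_regs, vector):
    """Read the matrix registers in the row direction (assumed layout)."""
    return [multiply_accumulate([matrix_regs[4 * i + k] for k in range(4)], vector)
            for i in range(4)]


def class2_matrix_op(matrix_regs, vector):
    """Read the same registers in the column direction (transposed matrix)."""
    return [multiply_accumulate([matrix_regs[4 * k + i] for k in range(4)], vector)
            for i in range(4)]


matrix_regs = list(range(16))   # hypothetical matrix contents 0..15
unit_vector = [1, 0, 0, 0]
row_result = class1_matrix_op(matrix_regs, unit_vector)      # first column of the matrix
column_result = class2_matrix_op(matrix_regs, unit_vector)   # first row of the matrix
```

Because both functions read the same register contents, a product with the matrix and a product with its transpose are available without rewriting the matrix registers, which is the property that a two-dimensional transform of the row-then-column kind relies on.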

With only two register files, or with only one kind of matrix operation instruction, data would have to be stored again in a register file partway through a two-dimensional discrete cosine transform or a two-dimensional inverse discrete cosine transform. Applying the present invention makes such re-storing unnecessary, so that the two-dimensional discrete cosine transform and the two-dimensional inverse discrete cosine transform can be executed at high speed.

Brief Description of the Drawings

FIG. 1 is a block diagram showing an embodiment of a microprocessor to which the present invention is suitably applied.

FIG. 2 is a block diagram showing a specific embodiment of the central processing unit (CPU) constituting the microprocessor.

FIG. 3 is a diagram showing an embodiment of the arithmetic circuit constituting a coprocessor (FPU) suitable for efficiently executing the TRV and TRVT instructions.

FIG. 4 is a diagram showing an embodiment of a multiply-accumulator in the arithmetic circuit of the coprocessor.

FIG. 5 is a diagram showing an embodiment of the register section of the coprocessor.

FIG. 6 is a diagram showing the matrix involved in the inverse discrete cosine transform and its decomposition into submatrices.

FIG. 7 is a diagram showing the matrix to be inverse discrete cosine transformed and its decomposition into submatrices.

FIG. 8 is a diagram showing the defining equation of the two-dimensional inverse discrete cosine transform and its expansion after decomposition into submatrices.

FIG. 9 is a diagram showing the procedure for calculating (Equation 3-10) in FIG. 8.

FIG. 10 is a diagram showing the concept of the register files.

FIG. 11 is a diagram showing the instruction formats of the TRV and TRVT instructions.

FIG. 12 is a diagram showing arithmetic expressions that explain the functions of the TRV and TRVT instructions.

FIG. 13 is a diagram showing the arrangement, as seen from the transform program, of the data stored in memory when the inverse discrete cosine transform is performed.

FIG. 14 is a diagram showing the defining equation of the two-dimensional discrete cosine transform and its expansion after decomposition into submatrices.

FIG. 15 is a diagram showing another configuration example of the register section of a coprocessor capable of executing the TRV and TRVT instructions.

Best Mode for Carrying Out the Invention

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows a block diagram of a microprocessor to which the present invention is suitably applied. In FIG. 1, reference numeral 1 denotes a central processing unit (hereinafter, CPU); 2 denotes a coprocessor (hereinafter, FPU) that performs operations such as matrix products and floating-point operations on behalf of the CPU 1; 3 denotes an interrupt control circuit that receives interrupt requests from the peripheral circuits 11, 12, and 13 and exception handling request signals from the MMU 4 described below, determines their priority, and outputs an interrupt signal IRQ to the CPU 1; 4 denotes a memory management unit (MMU) that translates the address signals output from the CPU 1 onto the bus 8a and manages virtual memory; and 5 denotes an address translation circuit, consisting of an address translation table and the like, that translates logical addresses into physical addresses.

Reference numeral 6 denotes a high-speed cache memory that stores programs and data frequently used by the CPU 1. Reference numeral 7 denotes a cache controller that monitors the address signals output from the CPU 1 onto the bus and, according to a predetermined replacement algorithm, transfers data in the external main memory (a hard disk storage device or the like, not shown) to the cache memory 6 in units of a predetermined block, discards data in the cache memory 6 that is no longer needed, and stores data written to the cache memory 6 in the main memory by the copy-back or write-through method. The cache memory 6 and the external main memory are accessed by the physical address signal obtained after translation by the address translation table 5.

In the single-chip microprocessor of this embodiment, separately from the logical address bus 8a and the data bus 9a that carry the logical address signals and data signals output from the CPU 1, a physical address bus 8b is provided for carrying the physical address signals translated by the address translation table 5, together with a data bus 9b for transferring data between the cache memory 6 and the external main memory. An external bus interface circuit 10 is also provided to interface signals between the internal buses 8b and 9b and the external bus.

Furthermore, in this embodiment, separately from the logical-address-side buses 8a and 9a and the physical-address-side buses 8b and 9b, a peripheral address bus 8c and a peripheral data bus 9c are provided. Peripheral circuits are connected to these buses, such as a serial communication interface circuit 11 for serial communication, a real-time clock circuit 12 with functions such as keeping the current time and a calendar, and a timer circuit 13 that gives the CPU 1 a timer function.

Further, in FIG. 1, reference numeral 14 denotes a bus controller that controls the bus state of the physical-address-side buses 8b and 9b and the peripheral buses 8c and 9c; 15 denotes a clock generation circuit that uses a PLL (phase-locked loop) circuit to generate the clock signals required for the operation of the CPU 1 and each circuit block inside the chip; 16 denotes a watchdog timer for detecting hardware abnormalities; 17 denotes an I/O port that enables data input and output between the peripheral buses 8c and 9c and the external bus via the external bus interface circuit 10; and 18 denotes a break controller that provides a function for stopping program execution at an arbitrary point (instruction or address) to support system debugging during user system development.

The CPU 1, the circuit blocks (2 to 7, 10 to 18, and SPF), and the buses (8a to 8c, 9a to 9c) shown in FIG. 1 are formed on a single semiconductor chip 100 such as a single-crystal silicon substrate. Although not particularly limited to this arrangement, in this embodiment a refresh controller is built into the external bus interface circuit 10 to perform the refresh operation when the external main memory is constituted by DRAM (dynamic random access memory).

FIG. 2 shows a specific configuration example of the CPU 1. In FIG. 2, reference numeral 20 denotes a program counter indicating the address of the instruction to be executed; 21 denotes an instruction register of, for example, 32 bits that holds an instruction code fetched from the cache memory 6 or the external main memory via the data bus 9a; 22 denotes an instruction decoder that decodes the instruction code held in the instruction register 21 and generates control signals; and 23 denotes an instruction execution circuit consisting of general-purpose registers REG1 to REGn that hold data before and after operations, an adder/subtractor ALU that performs address calculation, data addition and subtraction, and logical operations, a barrel shifter SFT that performs bit shifts on data, an address output register ADR, a data input/output register DTR, and so on.

Arithmetic buses BUS1, BUS2, and BUS3 are provided in the instruction execution circuit 23, and through them the registers REG1 to REGn, ADR, and DTR, the adder/subtractor ALU, and the barrel shifter SFT can be connected to one another. Gates GT1 to GTm provided between each register or arithmetic unit and the buses are controlled sequentially by the control signals CS1 to CSi output from the instruction decoder 22, whereby the data processing corresponding to the instruction is executed. However, when the CPU 1 determines that the instruction fetched into the instruction register 21 is an instruction dedicated to the FPU (coprocessor) 2, it leaves the execution of that instruction to the FPU 2 and itself enters a standby state or proceeds to execute the next instruction.

The CPU 1 also contains a control register 24 consisting of registers such as a status register SR that reflects the internal control state and the like, a status save register SSR to which the contents of the status register SR are saved when an exception occurs, a PC save register SPC to which the contents of the program counter 20 are saved when an exception occurs, a base address register GBR that stores the base address used in indirect addressing mode, and a vector address register VBR that stores the vector addresses for exception handling and interrupt handling. The state of each bit is read and written according to the output of the instruction decoder 22, and the execution of instructions is controlled according to the state of predetermined bits in the control register 24.

FIG. 3 shows a specific configuration example of the FPU 2. As shown in FIG. 3, the FPU 2 consists of an arithmetic unit 900 and an arithmetic control unit 990 that controls the arithmetic unit 900 according to instructions. The arithmetic unit 900 comprises a register section 901, multiply-accumulators 910, 911, 912, and 913 capable of a 4 x 4 matrix product, four latch circuits 920, 921, 922, and 923 corresponding to the respective multiply-accumulators, and a latch circuit 924 common to the multiply-accumulators. Although not shown, the arithmetic control unit 990 consists of an instruction register and an instruction decoder similar to those of the CPU 1. When it determines that the instruction fetched into the instruction register is one of its own dedicated instructions (the first matrix operation instruction, the second matrix operation instruction, and so on), it generates control signals for the arithmetic unit 900 so that the corresponding operation is executed. The instruction register and instruction decoder of the FPU 2 may also be shared with those of the CPU 1.

FIG. 4 shows a configuration example of the multiply-accumulators 910 to 913. Each multiply-accumulator consists of a multiplier 960, an adder 961, and a temporary register 962. The multiplier 960 multiplies the two 16-bit data values supplied from the signal lines 940 and 944. The adder 961 adds the result of the multiplier 960 to the data held in the temporary register 962 and stores the sum in the temporary register 962, updating its contents. Because the temporary register 962 must be set to 0 before a multiply-accumulate operation, this embodiment provides a signal line 937 that supplies the initial value "0" and a selector 963; before the operation starts, a control signal supplied via the signal line 935 causes the selector 963 to select the initial value "0" and store it in the temporary register 962.

FIG. 5 shows a specific configuration of the register section 901. The register section 901 consists of four register files 500, 501, 502, and 503 and selectors 550, 551, 552, 553, and 554 corresponding to the latch circuits 920, 921, 922, 923, and 924. Each of the register files 500, 501, 502, and 503 contains 16 registers, which are divided into four subfiles. For example, the register file 500 contains subfiles 510, 511, 512, and 513, each consisting of four registers.

Each subfile contains four registers, and a selector 516 is provided for specifying the subfile. Registers are assigned so that subfile 511 holds registers 0, 4, 8, and 12, subfile 512 holds registers 1, 5, 9, and 13, and so on. The 16 registers, numbered 0 to 15, can therefore be identified by a register number code consisting of a 4-bit binary number. Reference numeral 936 denotes a signal line that supplies a control signal granting write permission to the register file, and 950 denotes a common signal line that supplies write data to the register file.

Each subfile is configured so that data can be read from two registers and written to one register at the same time. For this purpose, the upper 2 bits of the 4-bit register number code that specifies a register are input to the subfiles via the signal lines 930, 931, and 932. In the case of subfile 510, for example, the data read from the register specified by the selection signal supplied via the signal line 930 is sent to the selector 550, where the register data corresponding to the register file selection signal supplied via the signal line 934 is selected.
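The register numbering just described implies a simple split of the 4-bit register number code: the upper 2 bits select a register inside a subfile and the lower 2 bits select the subfile. The sketch below models only that split. The subfile index 0 to 3 used here is a hypothetical label for illustration and is not meant to correspond to particular reference numerals in FIG. 5.

```python
def decode_register(number):
    """Split a 4-bit register number into (subfile index, index within subfile)."""
    if not 0 <= number <= 15:
        raise ValueError("register number must be in 0..15")
    within_subfile = (number >> 2) & 0b11   # upper 2 bits of the code
    subfile = number & 0b11                 # lower 2 bits of the code
    return subfile, within_subfile


def subfile_members(subfile):
    """List the register numbers that share the given subfile index."""
    return [n for n in range(16) if decode_register(n)[0] == subfile]


group_a = subfile_members(0)
group_b = subfile_members(1)
```

Registers whose numbers differ by a multiple of 4 land in the same subfile, which reproduces the groupings 0, 4, 8, 12 and 1, 5, 9, 13 given in the text.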

Similarly, the data read from the register corresponding to the selection signal supplied via the signal line 931 (the upper 2 bits of the register number code) is sent to the selector 516, where the register data corresponding to the selection signal supplied via the signal line 933 (the lower 2 bits of the register number code) is selected. Write data sent via the signal line 950 is written to the register corresponding to the selection signal supplied via the signal line 932, provided that the signal on the signal line 936 permits writing.

Properties of matrix products

Before describing the two-dimensional discrete cosine transform and the two-dimensional inverse discrete cosine transform performed by the microprocessor of the present invention, the properties of the matrix product that explain why the transform method of the present invention is effective for speeding up are described first.

Assume that an 8×8 matrix M is defined by (Equation 1-1) in FIG. 6. This matrix M relates to the inverse discrete cosine transform and is obtained by permuting some rows and columns of the discrete cosine transform matrix; the present invention is described below using this matrix as it is. Assume further that 4×4 matrices A and C are defined by (Equation 1-2) and (Equation 1-3) in FIG. 6. The matrix M can then be expressed as in (Equation 1-4) in FIG. 6.

Assume that an 8×8 matrix X is defined by (Equation 2-1) in FIG. 7. This matrix X is divided into four, and the submatrices X1, X2, X3, and X4 are defined by (Equation 2-2), (Equation 2-3), (Equation 2-4), and (Equation 2-5) in FIG. 7. The matrix X can then be expressed by (Equation 2-6).

FIG. 8 shows the defining equation of the discrete cosine transform and how it is calculated when decomposed into 4×4 submatrices. (Equation 3-1) is the defining equation of the discrete cosine transform. (Equation 3-2) expresses the 8×8 matrices appearing in (Equation 3-1) decomposed into 4×4 submatrices: the matrix M in (Equation 3-1) is replaced by the right-hand side of (Equation 1-4), and the matrix X in (Equation 3-1) by the right-hand side of (Equation 2-6). Expanding (Equation 3-2) yields (Equation 3-3), (Equation 3-4), and (Equation 3-5) in turn.
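The expansion relies on the general fact that an 8×8 product can be evaluated entirely with 4×4 products and sums. The following sketch checks that fact on arbitrary integer matrices. It does not reproduce the matrices of FIG. 6 or FIG. 7, and the quadrant layout used by `block` is an assumption made for illustration.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def block(m, r, c):
    """4x4 block of an 8x8 matrix at block-row r, block-column c."""
    return [row[4 * c:4 * c + 4] for row in m[4 * r:4 * r + 4]]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def blockwise_product(m, x):
    """Compute the 8x8 product m x using only 4x4 products and sums."""
    out = [[0] * 8 for _ in range(8)]
    for r in range(2):
        for c in range(2):
            part = add(matmul(block(m, r, 0), block(x, 0, c)),
                       matmul(block(m, r, 1), block(x, 1, c)))
            for i in range(4):
                out[4 * r + i][4 * c:4 * c + 4] = part[i]
    return out


# Arbitrary integer test matrices (hypothetical data, not the patent's M and X).
M = [[(3 * i + 5 * j) % 7 - 3 for j in range(8)] for i in range(8)]
X = [[(i * j + i + 2 * j) % 11 for j in range(8)] for i in range(8)]
agrees = blockwise_product(M, X) == matmul(M, X)
```

Each output quadrant needs two 4×4 products, which is the granularity at which a 4×4 matrix instruction can be applied.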

The four kinds of terms appearing in (Equation 3-5) are labeled as in (Equation 3-6) to (Equation 3-9). The inverse discrete cosine transform can then be calculated by (Equation 3-10). (Equation 3-10) is now reworked slightly; FIG. 9 illustrates this. First, the individual elements of the 4×4 matrices T1, T2, T3, and T4 appearing in (Equation 3-10) in FIG. 8 are labeled according to (Equation 4-1) to (Equation 4-4). Using these labels, a 4×16 matrix T is defined by (Equation 4-5), and a 4×4 constant matrix B is defined by (Equation 4-6). The product TB (a 4×16 matrix) of T and B is defined as the matrix S, that is, S = TB.

Next, each of the 64 elements of the matrix S is labeled according to (Equation 4-7), and the 8×8 matrix written with these labels as in (Equation 4-8) in FIG. 9 is defined as Y. The Y defined by (Equation 4-8) is then identical to the Y defined by (Equation 3-10) in FIG. 8. For example, the upper-left element of the 8×8 matrix Y defined by (Equation 3-10) in FIG. 8 is the sum of the upper-left elements t10, t20, t30, and t40 of the 4×4 matrices T1, T2, T3, and T4 defined by (Equation 4-1) to (Equation 4-4). On the other hand, since S = TB, the upper-left element of the 8×8 matrix Y defined by (Equation 4-8) in FIG. 9 is the inner product of the first row (1 1 1 1) on the right-hand side of (Equation 4-6), which gives B, and the first row (t10 t20 t30 t40) on the right-hand side of (Equation 4-5), which gives T. Concretely:

1×t10 + 1×t20 + 1×t30 + 1×t40 = t10 + t20 + t30 + t40

The same holds for the other elements. The Y defined by (Equation 4-8) in FIG. 9 is therefore identical to the Y defined by (Equation 3-10) in FIG. 8.

Register file and instruction format

Of the hardware configuration required for the present invention, the configuration of the register files and the instruction format are described next. A processor to which the present invention is applied has at least three register files, each consisting of 16 registers. FIG. 10 shows the case of four register files RFL0, RFL1, RFL2, and RFL3. The register numbers noted on the left side of FIG. 10 are also written inside subfiles 510 to 513 in FIG. 5 and indicate the correspondence between the registers of the register files in FIG. 5 and FIG. 10. The processor also has two instructions that multiply a 4×4 matrix by a four-element vector. The instructions are written as follows.

TRV m, n, s, d (class-1 matrix arithmetic instruction)

TRVT m, n, s, d (class-2 matrix arithmetic instruction)

As shown in FIG. 11, for example, the instruction format consists of an instruction code field ICF, which holds the instruction code, and four fields OPF1, OPF2, OPF3, and OPF4, which hold the operands m, s, d, and n. Of these operands, m, s, and d are numbers designating register files. The operand n is a number designating a register within a register file and is one of the multiples of 4 (0, 4, 8, 12). That is, n is a 4-bit code whose lower 2 bits designate a subfile in FIG. 5 and whose upper 2 bits designate a register within the subfile (in the TRV instruction the lower 2 bits are always 00).

The function of the class-1 matrix arithmetic instruction TRV is described first. It is defined by (Equation 5-1) in FIG. 12: the 16 register values in register file m are regarded as a 4×4 matrix, the values of registers n, n+1, n+2, and n+3 in register file s are regarded as a four-element vector, the matrix and the vector are multiplied, and the result is stored in registers n, n+1, n+2, and n+3 of register file d. In other words, the operand vector is taken one element from each of the four subfiles in FIG. 5, and the result is stored across the four subfiles. In terms of the register file of FIG. 10, the vector is read from four consecutive registers and the result is stored in the corresponding four registers. Accordingly, for the following instruction,

TRV 0, 0, 1, 2

the operation of (Equation 5-2) in FIG. 12 is performed. Likewise, for the following instruction sequence,

TRV 0, 0, 1, 2

TRV 0, 4, 1, 2

TRV 0, 8, 1, 2

TRV 0, 12, 1, 2

the operation of (Equation 5-3) in FIG. 12 is performed: the two 4×4 matrices on the right-hand side are multiplied to give the 4×4 matrix on the left-hand side.

The function of the class-2 matrix arithmetic instruction TRVT is described next. It is defined by (Equation 5-4) in FIG. 12. The value t used in (Equation 5-4) is the 4-bit operand n shifted right by 2 bits, so the upper 2 bits of t are 00. The class-2 matrix arithmetic instruction TRVT m, n, s, d regards the 16 registers in register file m as a 4×4 matrix and registers t, 4+t, 8+t, and 12+t in register file s as a four-element vector, multiplies the matrix by the vector, and stores the result in registers n, n+1, n+2, and n+3 of register file d. In other words, the operand vector is taken from the four registers of one of the four subfiles in FIG. 5, and the result is distributed to the corresponding registers of the four subfiles. In terms of the register file of FIG. 10, every fourth register out of the 16 is read as the operand vector, and the result is stored in four consecutive registers of the designated register file (starting at a register whose number is a multiple of 4).

Accordingly, for the following instruction,

TRVT 0, 0, 1, 2

the operation of (Equation 5-5) in FIG. 12 is performed. Likewise, for the following instruction sequence,

TRVT 0, 0, 1, 2

TRVT 0, 4, 1, 2

TRVT 0, 8, 1, 2

TRVT 0, 12, 1, 2

the operation of (Equation 5-6) in FIG. 12 is performed: a 4×4 matrix is multiplied by a 4×4 matrix to give a 4×4 result. The point to note is that the first matrix on the right-hand side is transposed between (Equation 5-3) and (Equation 5-6): the values along the rows of the first matrix in (Equation 5-3) are arranged as the values along the columns of the first matrix in (Equation 5-6), and the values along the columns of the first matrix in (Equation 5-3) are arranged as the values along the rows of the first matrix in (Equation 5-6).

This means that, by using the two instructions TRV and TRVT, the operations of (Equation 5-3) and (Equation 5-6) can both be executed without changing the arrangement of the values stored in the register file, that is, without reloading them from memory into the register file. As a result, the inverse discrete cosine transform of image data according to (Equation 3-10) can be computed with far fewer instructions than when the TRV and TRVT instructions are not used.
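The difference between the two instructions can be made concrete with a small functional model. This is a sketch under stated assumptions, not the circuit itself. Each register file is modeled as a list of 16 values. The accumulation rule (output i sums register 4k+i of file m times the k-th vector element) is taken from the step-by-step procedure the description gives for the sum-of-product units. Reading a register file row-major as a 4×4 matrix is an assumption chosen to agree with the remark that the first factor is transposed; (Equation 5-3) and (Equation 5-6) themselves are not reproduced here. All function names are hypothetical.

```python
def _mac(files, m, s, vec_regs, d, n):
    mat, src = files[m], files[s]
    # Output i accumulates mat[4k + i] * (k-th vector element) for k = 0..3.
    result = [sum(mat[4 * k + i] * src[vec_regs[k]] for k in range(4)) for i in range(4)]
    files[d][n:n + 4] = result  # written only after all four sums are complete


def trv(files, m, n, s, d):
    """Class-1: the vector is registers n, n+1, n+2, n+3 of file s."""
    _mac(files, m, s, [n + k for k in range(4)], d, n)


def trvt(files, m, n, s, d):
    """Class-2: the vector is registers t, 4+t, 8+t, 12+t of file s, with t = n >> 2."""
    t = n >> 2
    _mac(files, m, s, [4 * k + t for k in range(4)], d, n)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def transpose(a):
    return [list(col) for col in zip(*a)]


def flatten(mat):
    return [v for row in mat for v in row]


def rows(regs):
    return [regs[4 * r:4 * r + 4] for r in range(4)]


# Hypothetical test data: M goes to file 0, S to file 1, results to file 2.
M = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
S = [[1, 0, 2, 0], [0, 1, 0, 3], [4, 0, 1, 0], [0, 5, 0, 1]]


def run(op):
    files = [flatten(M), flatten(S), [0] * 16]
    for n in (0, 4, 8, 12):
        op(files, 0, n, 1, 2)
    return rows(files[2])


trv_result = run(trv)    # S times M under the row-major reading
trvt_result = run(trvt)  # transpose(S) times M, from the same register contents
```

Both runs start from identical register contents; only the instruction changes, which is the property the text attributes to the TRV and TRVT pair.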

Executing the inverse discrete cosine transform according to (Equation 3-10) also requires instructions that load data from main memory into a register file and store data back. The processor is assumed to provide the following instructions, which are also used in conventional microprocessors.

LD4 b+disp, d, n

ST s, n, b+disp

Here, LD4 is an instruction that loads four data items from the address separated from base address b by the displacement value disp (in the embodiment, an address in main memory) into registers n, n+1, n+2, and n+3 of register file d. ST is an instruction that stores register n of register file s to the address separated from base address b by the displacement value disp (in the embodiment, an address in main memory).

Instruction execution procedure

Next, the procedure by which the control unit 990 in FIG. 3 executes the class-1 matrix arithmetic instruction TRV m, n, s, d is described.

In the first step, the control unit 990 first sends "00" as a 2-bit binary code via signal line 930 to subfiles 510 to 513 of each of register files 500 to 503. In response, subfiles 510 to 513 send the contents of register 0 to selector 550, register 1 to selector 551, register 2 to selector 552, and register 3 to selector 553, respectively. The control unit 990 then sends the number m designating a register file to signal line 934. The data designated by m passes through selectors 550 to 553 and is latched in latches 920, 921, 922, and 923 (that is, the contents of registers 0, 1, 2, and 3 of register file m are latched in latches 920, 921, 922, and 923).

The control unit 990 also sends the upper 2 bits of the register number n, given as a 4-bit code, to each register file via signal line 931. In response, each register file sends the contents of the registers corresponding to the upper 2 bits of n to selector 516. The control unit 990 further sends the lower 2 bits of n to selector 516 via signal line 933, and the selected data passes through selector 516 and is sent to selector 554. The control unit 990 then sends the number s designating a register file to signal line 935. The data designated by s passes through selector 554 and is latched in latch 924. At the same time, the control unit 990 controls selector 963 via signal line 935 so that the initial value "0" is set in temporary register 962.

In the second step, the contents of latches 920, 921, 922, and 923 are sent to the corresponding sum-of-product units 910, 911, 912, and 913, and the contents of latch 924 are sent to all four sum-of-product units 910, 911, 912, and 913. Each of the sum-of-product units 910 to 913 performs its first multiply-accumulate, and the result is set in temporary register 962 and the like. At the same time, the binary code "01" is sent to each of register files 500 to 503 via signal line 930. In response, each register file sends the contents of register 4 to selector 550, register 5 to selector 551, register 6 to selector 552, and register 7 to selector 553. The control unit 990 then sends the number m designating a register file to signal line 934. The data designated by m passes through selectors 550 to 553 and is latched in latches 920, 921, 922, and 923. The control unit 990 also sends the upper 2 bits of the value n+1, obtained by incrementing the 4-bit register number n by 1, to each of register files 500 to 503 via signal line 931.

In response, each register file sends the contents of the registers corresponding to the upper 2 bits of n+1 to selector 516. The control unit 990 further sends the lower 2 bits of n+1 to selector 516 via signal line 933, and the selected data passes through selector 516 and is sent to selector 554. The control unit 990 then sends the number s designating a register file to signal line 935. The data designated by s passes through selector 554 and is latched in latch 924.

The third step operates in the same way as the second step, except that register numbers 4, 5, 6, and 7 become 8, 9, 10, and 11, and the 4-bit register number n+1 becomes n+2.

The fourth step also operates in the same way as the second step, except that register numbers 4, 5, 6, and 7 become 12, 13, 14, and 15, and the 4-bit register number n+1 becomes n+3.

The fifth step writes the four values latched in the sum-of-product units 910, 911, 912, and 913 back to the register section 901. The values latched in sum-of-product unit 910 and the others are sent to the register section 901 via signal line 920 and the like. The control unit 990 sends the upper 2 bits of the 4-bit register number n to each subfile via signal line 932, and permits writing to the register file corresponding to operand d via signal line 936. The four values are then set in the four subfiles of the register file designated by d. With the above operations, the operation defined by (Equation 5-1) has been performed.

Next, the case in which the control unit 990 in FIG. 3 executes the class-2 matrix arithmetic instruction TRVT m, n, s, d is described.
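Before the TRVT steps are walked through, the five TRV steps above can be condensed into a short sketch. It is a simplification under stated assumptions: the fetch of one step and the multiply-accumulate of the next, which the description overlaps in time, are collapsed into one loop iteration, and register files are modeled as flat lists of 16 values. The function and variable names are hypothetical.

```python
def trv_stepwise(files, m, n, s, d):
    """Walk through the TRV procedure; returns the accumulator contents after each fetch."""
    acc = [0, 0, 0, 0]        # temporary registers of the four units, cleared in step 1
    trace = []
    for step in range(4):     # steps 1 to 4: one 2-bit code and one vector element each
        code = step           # code on signal line 930: 00, 01, 10, 11
        # Four registers of file m, one per sum-of-product unit (latches 920 to 923).
        latches = [files[m][4 * code + i] for i in range(4)]
        reg = n + step
        slot = reg >> 2       # upper 2 bits, sent on signal line 931
        subfile = reg & 0b11  # lower 2 bits, sent on signal line 933
        shared = files[s][4 * slot + subfile]   # single value broadcast from latch 924
        acc = [a + x * shared for a, x in zip(acc, latches)]
        trace.append(list(acc))
    # Step 5: write back, one value per subfile, at the slot given by the upper bits of n.
    slot = n >> 2
    for subfile in range(4):
        files[d][4 * slot + subfile] = acc[subfile]
    return trace
```

The loop makes visible that each of the four units sees a different register of file m but the same element of the vector, so four multiply-accumulates proceed in parallel per step.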

In the first step, the control unit 990 first sends the binary code "00" to each register file via signal line 930. In response, each register file sends the contents of register 0 to selector 550, register 1 to selector 551, register 2 to selector 552, and register 3 to selector 553. The control unit 990 then sends the number m designating a register file to signal line 934. The data designated by m passes through selectors 550 to 553 and is latched in latches 920, 921, 922, and 923. The control unit 990 also sends the lower 2 bits of the 4-bit register number n to each register file via signal line 931 (note that the TRV instruction sends the upper 2 bits here). In response, each register file sends the contents of the registers corresponding to the lower 2 bits of n to selector 516.

The control unit 990 further sends the upper 2 bits of register number n to selector 516 via signal line 933, and the selected data passes through selector 516 and is sent to selector 554. The control unit 990 then sends the number s designating a register file to signal line 935. The data designated by s passes through selector 554 and is latched in latch 924. At the same time, the control unit 990 controls selector 963 via signal line 935 so that the initial value "0" is set in temporary register 962.

In the second step, the contents of latches 920, 921, 922, and 923 are sent to sum-of-product units 910, 911, 912, and 913, respectively, and the contents of latch 924 are sent to all four sum-of-product units 910, 911, 912, and 913. Each sum-of-product unit performs its first multiply-accumulate, and the result is set in temporary register 962 and the like. At the same time, the binary code "01" is sent to each register file via signal line 930. In response, each register file sends the contents of register 4 to selector 550, register 5 to selector 551, register 6 to selector 552, and register 7 to selector 553. The control unit 990 then sends the number m designating a register file to signal line 934. The data designated by m passes through selectors 550 to 553 and is latched in latches 920, 921, 922, and 923. In addition, the lower 2 bits of register number n+1 are sent to each register file via signal line 931. In response, each register file sends the contents of the registers corresponding to the lower 2 bits of register number n+1 to selector 516. The upper 2 bits of register number n+1 are then sent to selector 516 via signal line 933, and the selected data passes through selector 516 and is sent to selector 554. The control unit 990 then sends the number s designating a register file to signal line 935. The data designated by s passes through selector 554 and is latched in latch 924.

The third step operates in the same way as the second step, except that register numbers 4, 5, 6, and 7 become 8, 9, 10, and 11, and the 4-bit register number n+1 becomes n+2.

The fourth step also operates in the same way as the second step, except that register numbers 4, 5, 6, and 7 become 12, 13, 14, and 15, and the 4-bit register number n+1 becomes n+3.

The fifth step writes the four values latched in the sum-of-product units 910, 911, 912, and 913 back to the register section 901. The values latched in sum-of-product unit 910 and the others are sent to the register section 901 via signal line 920 and the like. The control unit 990 sends the upper 2 bits of register number n to each subfile via signal line 932, and permits writing to the register file corresponding to the number d via signal line 936. The four values are then set in the four subfiles of register file d. With the above operations, the operation defined by (Equation 5-4) has been performed.

Data required for the inverse discrete cosine transform

Next, the data required for the inverse discrete cosine transform is summarized. As seen from the transform program, this data is stored in the external main memory MEM in the arrangement shown in FIG. 13.

First, the data to be transformed (the DCT coefficients) is required; it is stored in the storage areas labeled X1T, X2T, X3T, and X4T in the main memory of FIG. 13. The constant matrices required for the transform (corresponding to the DCT basis) are those of (Equation 1-2), (Equation 1-3), and (Equation 4-6), stored in the storage areas labeled AT, CT, and B in FIG. 13, respectively. The transform result is stored in the storage area labeled Y in FIG. 13.

Inverse discrete cosine transform

The inverse discrete cosine transform is carried out by executing (Equation 3-6), (Equation 3-7), (Equation 3-8), (Equation 3-9), and (Equation 3-10) in FIG. 8 in sequence.

To execute (Equation 3-6), X1T in FIG. 13 is loaded into register file 0 with four LD4 instructions, AT in FIG. 13 is loaded into register file 1 with four LD4 instructions, four TRVT instructions produce the result corresponding to A X1^t in (Equation 3-6) in register file 2, four TRV instructions produce the result corresponding to the right-hand side of (Equation 3-6) in register file 2, and finally 16 ST instructions store the result to T in FIG. 13. This is done by the following instruction sequence.

LD4 X1T+0, 0, 0

LD4 X1T+4, 0, 4

LD4 X1T+8, 0, 8

LD4 X1T+12, 0, 12

LD4 (0)AT, 1, 0

LD4 (4)AT, 1, 4

LD4 (8)AT, 1, 8

LD4 (12)AT, 1, 12

TRVT 0, 0, 1, 2

TRVT 0, 4, 1, 2

TRVT 0, 8, 1, 2

TRVT 0, 12, 1, 2

TRV 1, 0, 2, 2

TRV 1, 4, 2, 2

TRV 1, 8, 2, 2

TRV 1, 12, 2, 2

ST 2, 0, T+0

ST 2, 1, T+4

· · ·

ST 2, 15, T+60
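The sequence above can be traced with a small simulation. This is a sketch under assumptions that are not confirmed by the figures: addresses count data items rather than bytes, register files are read row-major as 4×4 matrices, the X1T and AT areas hold the transposes of X1 and A, and the matrices A and X1 below are made-up integers. The helper names and the memory layout are hypothetical. Under these assumptions register file 2 ends up holding A X1^t A^t, with the ST instructions writing it at a stride of 4.

```python
def trv(files, m, n, s, d):
    vec = [files[s][n + k] for k in range(4)]           # registers n..n+3 of file s
    files[d][n:n + 4] = [sum(files[m][4 * k + i] * vec[k] for k in range(4))
                         for i in range(4)]


def trvt(files, m, n, s, d):
    t = n >> 2
    vec = [files[s][4 * k + t] for k in range(4)]       # registers t, 4+t, 8+t, 12+t
    files[d][n:n + 4] = [sum(files[m][4 * k + i] * vec[k] for k in range(4))
                         for i in range(4)]


def ld4(mem, files, addr, d, n):
    files[d][n:n + 4] = mem[addr:addr + 4]              # four items into registers n..n+3


def st(mem, files, s, n, addr):
    mem[addr] = files[s][n]                             # one register to one address


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def transpose(a):
    return [list(col) for col in zip(*a)]


def flatten(mat):
    return [v for row in mat for v in row]


# Hypothetical data and memory layout (item-addressed): X1T at 0, AT at 16, T at 32.
A = [[1, 2, 0, 1], [0, 1, 3, 0], [2, 0, 1, 1], [1, 1, 0, 2]]
X1 = [[1, 0, 2, 1], [3, 1, 0, 0], [0, 2, 1, 1], [1, 0, 0, 2]]
X1T, AT, T = 0, 16, 32
mem = [0] * 96
mem[X1T:X1T + 16] = flatten(transpose(X1))
mem[AT:AT + 16] = flatten(transpose(A))
files = [[0] * 16 for _ in range(3)]

for k in (0, 4, 8, 12):
    ld4(mem, files, X1T + k, 0, k)
for k in (0, 4, 8, 12):
    ld4(mem, files, AT + k, 1, k)
for n in (0, 4, 8, 12):
    trvt(files, 0, n, 1, 2)      # file 2 = A X1^t
for n in (0, 4, 8, 12):
    trv(files, 1, n, 2, 2)       # file 2 = (A X1^t) A^t, updated in place
for r in range(16):
    st(mem, files, 2, r, T + 4 * r)

stored = [mem[T + 4 * r] for r in range(16)]
expected = flatten(matmul(matmul(A, transpose(X1)), transpose(A)))
```

The stride of 4 in the stores leaves three free slots after each item, consistent with interleaving the elements of T1 to T4 as the rows (t1k t2k t3k t4k) of the matrix T.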

To execute (Equation 3-7), X2T in FIG. 13 is loaded into register file 0 with four LD4 instructions, four TRVT instructions produce the result corresponding to A X1^t in (Equation 3-6) in register file 2, four TRV instructions produce the result corresponding to the right-hand side of (Equation 3-6) in register file 2, and finally 16 ST instructions store the result to T in FIG. 13. Since this can be executed by an instruction sequence similar to that for (Equation 3-6), a concrete listing is omitted. Note that the load of AT is omitted here.

A concrete description of how to execute (Equation 3-8) and (Equation 3-9) is omitted. It should be pointed out, however, that calculating (Equation 3-9) first is more efficient because it reduces the number of times AT and CT are loaded.

To execute (Equation 3-10), B in FIG. 13 is first loaded into register file 0 with four LD4 instructions. Next, four rows of data from T in FIG. 13 are loaded into register file 1 with four LD4 instructions, four TRV instructions execute one quarter of the operation on the right-hand side of (Equation 3-10), and the result is stored to Y in FIG. 13 with 16 ST instructions. The same is then repeated three more times. The first quarter of the instruction sequence is as follows.

LD4 Β+0, 0, 0 LD4 Β + 0, 0, 0

LD4 Β+4, 0, 4 LD4 Β + 4, 0, 4

LD4 Β+8, 0, 8 LD4 Β + 8, 0, 8

LD4 B+12, 0, 12 LD4 B + 12, 0, 12

LD4 Τ+0, 0, 0 LD4 Τ + 0, 0, 0

LD4 Τ+4, 0, 4 LD4 Τ + 4, 0, 4

LD4 Τ+8, 0, 8 LD4 Τ + 8, 0, 8

LD4 TI12, 0, 12 LD4 TI12, 0, 12

TRV 0, 0, 1, 1 TRV 0, 0, 1, 1

TRV 0, 4, 1, 1 TRV 0, 4, 1, 1

TRV 0, 8, 1, 1 TRV 0, 8, 1, 1

TRV 0, 12, 1, 1 TRV 0, 12, 1, 1

ST 1, 0, Υ+0 ST 1, 0, Υ + 0

ST 1, 0, Υ+4 ST 1, 0, Υ + 4

ST 1, 2, Υ+32 ST 1, 2, Υ + 32

ST 1, 3, Υ+36 ST 1, 3, Υ + 36

ST 1, 4, Y+1 ST 1, 4, Y + 1

· · · · · · · ·

ST 1, 15, Y+39

The remaining three quarters of the instruction sequence are omitted.

Effect of the Invention

The inverse discrete cosine transform can be executed as described above. Executing (Equation 3-6) through (Equation 3-9) takes 112 instructions and executing (Equation 3-10) takes 100 instructions, for a total of roughly 200 instructions. A conventional inverse discrete cosine transform takes 1000 to 2000 instructions, whereas applying the present invention reduces this to about 200, so the inverse discrete cosine transform can be made considerably more efficient.
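The figure of 100 instructions for (Equation 3-10) can be checked against the steps described for that equation: B is loaded once, and the load, compute, and store step is then performed four times. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative only, and the figure of 112 instructions is taken from the text rather than derived.

```python
# Instruction counts per step, taken from the description of (Equation 3-10).
LD4_PER_BLOCK = 4   # four LD4 instructions load one block of 16 registers
TRV_PER_BLOCK = 4   # four TRV instructions per quarter of the computation
ST_PER_BLOCK = 16   # sixteen ST instructions store one quarter of the results
QUARTERS = 4        # the load/compute/store step is performed four times


def count_eq_3_10() -> int:
    """Count instructions for (Equation 3-10): load B once, then four quarters."""
    load_b = LD4_PER_BLOCK
    per_quarter = LD4_PER_BLOCK + TRV_PER_BLOCK + ST_PER_BLOCK
    return load_b + QUARTERS * per_quarter


eq_3_10 = count_eq_3_10()
eq_3_6_to_3_9 = 112  # stated in the text, not derived in this sketch
total = eq_3_6_to_3_9 + eq_3_10
print(eq_3_10, total)  # prints "100 212"
```

The exact total of 212 is consistent with the "roughly 200 instructions" stated above.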

The number of instructions could be reduced this much because the instructions TRV and TRVT were provided for computing matrix products, and the arithmetic circuit of the coprocessor was configured and controlled so that they execute efficiently. The transpose-related function of the TRVT instruction is also noteworthy. Computing (Equation 3-1) naively would require computing the expression M X inside the parentheses and then forming its transpose. The TRVT instruction, however, performs the transposition and the matrix product at the same time, which improves performance further. Note also that At, once loaded into a register file, is used twice by means of the TRVT instruction. In the above embodiment, (Equation 3-1) was not computed directly; instead (Equation 3-6), (Equation 3-7), (Equation 3-9), (Equation 3-8), and (Equation 3-10) were computed in that order, but other orders of computation are also possible.
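The benefit of fusing the transposition with the matrix product can be illustrated with a small software model. This is only a sketch under stated assumptions: it assumes that TRV reads the 4x4 matrix register file in the row direction and TRVT reads it in the column direction, as the abstract describes for the two classes of matrix instruction. The function names `trv`, `trvt`, and `transpose`, and the flat 16-element list standing in for a register file, are hypothetical and do not model the operand encoding.

```python
from typing import List

Vec = List[float]


def trv(matrix_rf: Vec, vec: Vec) -> Vec:
    """Row-direction read: out[i] = sum_j M[i][j] * vec[j]."""
    return [sum(matrix_rf[4 * i + j] * vec[j] for j in range(4)) for i in range(4)]


def trvt(matrix_rf: Vec, vec: Vec) -> Vec:
    """Column-direction read: out[i] = sum_j M[j][i] * vec[j], i.e. M^T * vec."""
    return [sum(matrix_rf[4 * j + i] * vec[j] for j in range(4)) for i in range(4)]


def transpose(matrix_rf: Vec) -> Vec:
    """Explicit transposition of a 4x4 matrix stored row-major in 16 registers."""
    return [matrix_rf[4 * j + i] for i in range(4) for j in range(4)]


m = [float(k) for k in range(16)]  # 4x4 matrix, row-major
v = [1.0, 2.0, 3.0, 4.0]

fused = trvt(m, v)                # one step: transposed read plus product
explicit = trv(transpose(m), v)   # two steps: transpose first, then product
print(fused)                      # prints "[80.0, 90.0, 100.0, 110.0]"
```

In this model the column-direction read gives the same result as transposing first and then multiplying, so the separate transposition pass is not needed.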

Although the above description mainly used the inverse discrete cosine transform as an example, the present invention can also be applied to the discrete cosine transform. Fig. 14 shows the defining equation of the discrete cosine transform, its decomposition into submatrices, and the expanded form of the defining equation. In the end, it suffices to compute (Equation 6-6) through (Equation 6-10) in order. Here too, effective use of the TRV and TRVT instructions described in the above embodiment greatly reduces the number of instructions executed.
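The reason matrix-product instructions serve both transforms is that the two-dimensional discrete cosine transform and its inverse can each be written as a pair of matrix products. The sketch below shows this for an 8x8 block in plain Python. It assumes the orthonormal DCT-II normalization, which may differ from the definition in Fig. 14 (not reproduced in this text), and it does not use the submatrix decomposition of (Equation 6-6) through (Equation 6-10). All function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_matrix(n: int) -> Matrix:
    """Orthonormal DCT-II basis matrix C (assumed normalization): C * C^T = I."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square or rectangular matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def dct2(x: Matrix) -> Matrix:
    """Forward 2-D DCT as two matrix products: Y = C X C^T."""
    c = dct_matrix(len(x))
    return matmul(matmul(c, x), transpose(c))


def idct2(y: Matrix) -> Matrix:
    """Inverse 2-D DCT as two matrix products: X = C^T Y C."""
    c = dct_matrix(len(y))
    return matmul(matmul(transpose(c), y), c)


# Round trip on an arbitrary 8x8 block of sample values.
block = [[float((3 * r + 5 * c) % 17) for c in range(8)] for r in range(8)]
restored = idct2(dct2(block))
max_error = max(abs(restored[r][c] - block[r][c])
                for r in range(8) for c in range(8))
```

Both directions consist only of products with C and its transpose, which is the kind of work the TRV and TRVT instructions are described as accelerating.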

The above embodiment was described with four register files, but the TRV and TRVT instructions according to the present invention can be executed on a coprocessor that has a register section consisting of at least three register files 500, 501, and 502, as shown in Fig. 15, and an arithmetic circuit such as that shown in Fig. 3.

Furthermore, the above embodiment described the discrete cosine transform and inverse discrete cosine transform performed by the coprocessor, but a coprocessor having an arithmetic circuit configured as shown in Fig. 3 can also perform floating-point operations in addition to matrix operations. Also, the microprocessor of the embodiment was described as having a coprocessor (second processor) 2, which performs matrix and floating-point operations, separate from the central processing unit (first processor) 1, but the functions of these two processors can also be implemented in a single processor.

Industrial Applicability

The above description mainly concerned the application of the invention made by the present inventors to a general-purpose microprocessor, but the invention is not limited to this; it can be widely used in processors in general that compress and decompress image data, and in other data processing devices that compute matrix products.

Claims

1. A data processing device characterized by having an instruction composed of an instruction code, first, second, and third operands each specifying a register file, and a fourth operand specifying a register within a register file.

2. The data processing device according to claim 1, comprising an instruction that computes the matrix product of an N×N matrix in the first register file specified by the first operand and an input vector of N elements held in the registers specified by the fourth operand in the second register file specified by the second operand, and stores the resulting output vector in the third register file specified by the third operand.

3. The data processing device according to claim 1, comprising an instruction in which the register group selected by the register number n in the fourth operand for designating a vector differs between the input vector and the output vector.

4. The data processing device according to claim 2, wherein the matrix operation instruction includes a first matrix operation instruction in which the register group selected by the register number n in the fourth operand for designating the input vector lies in the row direction of the matrix, and a second matrix operation instruction in which the register group selected by the register number n lies in the column direction of the matrix.

5. A data processing device that performs at least inverse discrete cosine transform using the instruction according to claim 1, 2, 3, or 4.

6. A data processing device comprising: control means for forming a control signal corresponding to a fetched instruction; and computing means that has an arithmetic unit and at least three register files each consisting of a predetermined number of registers, and that can execute, under the control signal from the control means, processing corresponding to each of a plurality of instructions including a matrix operation instruction for executing a matrix product, characterized in that the matrix operation instruction is composed of an instruction code, first, second, and third operands each specifying a register file, and a fourth operand specifying a register within a register file.

7. The data processing device according to claim 6, comprising an instruction that computes the matrix product of an N×N matrix in the first register file specified by the first operand and an input vector of N elements held in the registers specified by the fourth operand in the second register file specified by the second operand, and stores the resulting output vector in the third register file specified by the third operand.

8. A data processing device comprising an instruction in which the register group selected by the register number n in the fourth operand for designating a vector differs between the input vector and the output vector.

9. The data processing device according to claim 7, wherein the matrix operation instruction includes a first matrix operation instruction in which the register group selected by the register number n in the fourth operand for designating the input vector lies in the row direction of the matrix, and a second matrix operation instruction in which the register group selected by the register number n lies in the column direction of the matrix.

10. The data processing device according to claim 7, wherein the computing means comprises: at least three register files each consisting of 16 registers; four multipliers that take the products of four selected data items in the first register file specified by the first operand and the data in the register specified by the fourth operand in the second register file specified by the second operand; four adders that compute the sums of the operation results of these multipliers and the previous addition results; and four temporary registers that hold the operation results of these adders.

11. A data processing device comprising: a first processor unit having control means for forming a control signal corresponding to a fetched instruction, and an instruction execution unit that has registers and an arithmetic unit and can execute, under the control signal from the control means, processing corresponding to each instruction; and a second processor unit having control means for forming a control signal corresponding to a fetched instruction, and computing means that has an arithmetic unit and at least three register files each consisting of a predetermined number of registers, and that can execute, under the control signal from the control means, processing corresponding to each of a plurality of instructions including a matrix operation instruction for executing a matrix product, characterized in that the matrix operation instruction is supplied to both the first and second processor units and is executed by the second processor unit.

12. A data processing method in which an instruction is read into control means to form a control signal, and the control signal is supplied to execution means to execute processing corresponding to the instruction, characterized in that two kinds of matrix operation instructions are provided, each composed of an instruction code, first, second, and third operands each specifying a register file, and a fourth operand specifying a register within a register file, and each computing the matrix product of an N×N matrix in the first register file specified by the first operand and an input vector of N elements held in the registers specified by the fourth operand in the second register file specified by the second operand, and storing the resulting output vector in the third register file specified by the third operand; and in that, in one matrix operation instruction, the register group selected by the register number n in the fourth operand for designating the input vector lies in the row direction of the matrix, and, in the other matrix operation instruction, the register group selected by the register number n lies in the column direction of the matrix.

13. The data processing method according to claim 12, wherein, in the other matrix operation instruction, the register group selected by the register number n in the fourth operand for designating a vector differs between the input vector and the output vector.