JP2008033490A

JP2008033490A - Multithreaded processor

Info

Publication number: JP2008033490A
Application number: JP2006204345A
Authority: JP
Inventors: Yasunari Suzuki; 保成鈴木
Original assignee: Victor Company of Japan Ltd
Current assignee: Victor Company of Japan Ltd
Priority date: 2006-07-27
Filing date: 2006-07-27
Publication date: 2008-02-14

Abstract

PROBLEM TO BE SOLVED: To provide a multithreaded processor that adaptively performs parallel execution and serial execution regardless of the type of program, and that performs fine-grained parallel execution with high efficiency even when the degree of parallelism is increased.

SOLUTION: Each of thread units 2a-2h has a program generation part 23 and an instruction input part 24. When the execution instruction to be generated includes a dependent computation, in which a computation executed in any one of computing units 5a-5d uses a result computed in another computing unit, the program generation part 23 identifies the thread unit that generates the depended-upon execution instruction and writes semaphore information into a storage area 21. When the execution instruction to be generated is itself a depended-upon execution instruction, the program generation part 23 generates an execution instruction to which an instruction for clearing the semaphore information after execution is appended. The instruction input part 24 feeds the execution instruction of the dependent computation to the computing unit via an instruction crossbar switch 4 after the semaphore information has been cleared.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a multithreaded microprocessor that divides an application into a plurality of processing units called threads and executes those threads in sequence in a pipeline, and more particularly to a multithreaded processor that executes many operations with high efficiency and low power consumption.

Conventionally, architectures such as RISC (Reduced Instruction Set Computer) and VLIW (Very Long Instruction Word) have been used in single sequential processors to increase the processing capability of information processing apparatuses. However, because of the limits of semiconductor process miniaturization and the increase in power consumption that accompanies a growing processing load, it has become difficult to raise processing speed by raising the operating frequency.
Meanwhile, information processing apparatuses often run a multitasking or multithreaded OS (Operating System), in which case a plurality of processes can be performed in parallel. Even if the processing capability of a single processor is small, overall capability can be increased by mounting a plurality of processor cores on one LSI, and such multi-core technology has been put into practical use. As techniques that parallelize processors beyond multi-core, there are parallel computers that operate many processors in parallel over a bus or a network, such as SMP (Symmetric Multi Processor), NUMA (Non-Uniform Memory Access), PC clusters, and grid computers. These are hardware technologies.

Highly parallel algorithms have also been developed in software. In image processing and three-dimensional CG in particular, they are used for computation-heavy operations such as ray tracing, which exploits the independence of individual light rays.
In a conventional parallel computer, processing capability is increased by arranging a plurality of processors suited to sequential processing. On the other hand, raising the degree of parallelism increases installation space and power consumption. Moreover, if only the degree of parallelism is raised, some of the individual processors end up with low utilization of their arithmetic units. Development of a processor architecture that raises the degree of parallelism while keeping utilization high is therefore desired.

Patent Document 1 discloses a multithreaded processor that combines a hardware scheduling function and a software scheduling function. The thread execution unit has machine language instructions that allow a thread being processed to create a new thread and to terminate itself. The processor further has a function for assigning a new thread directly in hardware, a function for specifying in software whether hardware assignment is to be performed, a function for holding the register context of a thread created when hardware assignment is not performed, and a function for restoring the held register context onto the thread execution unit when the thread being processed terminates. As needed, the thread execution unit itself processes a new thread created by the thread being processed, after that thread has terminated.
Patent Document 1: JP 2000-207233 A

However, the multithreaded processor disclosed in Patent Document 1 tries to increase processing capability by mixing hardware-based execution, in which a plurality of threads are serialized, with execution under software scheduling. For programs suited to such scheduling, execution can be performed efficiently. During execution, however, the processor must track the processing status of the threads, decide whether to perform software scheduling or hardware scheduling, and perform additional processing to switch to one or the other. As a result, it has not been possible to realize a multithreaded processor that executes all types of programs in parallel at fine granularity.

The present invention has been made to solve the above problems. Its object is to provide a multithreaded processor that adaptively performs parallel execution and serial execution regardless of the type of application program, that enables fine-grained parallel execution even when the degree of parallelism is increased, and that performs pipeline operations with high utilization at low power.

A first aspect of the present invention provides a multithreaded processor comprising: a plurality of thread units, each having a program memory that stores an application program described by a plurality of execution instruction groups, and each extracting some execution instructions from the plurality of execution instruction groups in the application program and outputting them sequentially; a plurality of operation execution units that perform arithmetic processing corresponding to the execution instructions sequentially output from the plurality of thread units; an instruction crossbar switch that selects the operation execution unit corresponding to each execution instruction sequentially output from the plurality of thread units; an operation result crossbar control unit that, when a next operation is to be executed in the plurality of thread units, supplies the result of the operation performed by the operation execution unit selected by the instruction crossbar switch to a desired one of the plurality of thread units as an operation result for execution of the next operation; and a semaphore storage unit that stores semaphore information indicating that the operation result for execution has not yet been obtained,
wherein each of the plurality of thread units comprises:
an instruction input unit that,
when the execution instruction is not a dependent execution instruction, which performs an operation using the operation result for execution, and is a depended-upon execution instruction, which produces the operation result for the next execution, stores the semaphore information in the semaphore storage unit and outputs to the instruction crossbar switch the execution instruction with attribute information attached, the attribute information causing the operation execution unit that executed the instruction to output, after the depended-upon execution instruction has been executed, result acquisition information indicating that the operation result for execution has been obtained;
when the execution instruction is a dependent execution instruction and is not a depended-upon execution instruction, searches the semaphore storage unit, holds back output of the dependent execution instruction to the instruction crossbar switch when the semaphore information is detected, and outputs the dependent execution instruction to the instruction crossbar switch when the semaphore information is not detected;
when the execution instruction is both a dependent execution instruction and a depended-upon execution instruction, searches the semaphore storage unit, holds back output of the execution instruction to the instruction crossbar switch when the semaphore information is detected, and, when the semaphore information is not detected, stores semaphore information for execution of the next operation in the semaphore storage unit and outputs to the instruction crossbar switch the execution instruction with attribute information attached, the attribute information causing the operation execution unit that executed the instruction to output the result acquisition information after the instruction has been executed; and
when the execution instruction is neither a dependent execution instruction nor a depended-upon execution instruction, outputs the execution instruction to the instruction crossbar switch; and
a semaphore information control unit that acquires the result acquisition information output from the operation result crossbar control unit and erases the semaphore information stored in the semaphore storage unit for that operation result.
A second aspect provides the multithreaded processor according to claim 1, wherein the number of thread units is larger than the number of operation execution units.
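The four cases handled by the instruction input unit can be sketched as a small decision function. This is only an illustrative Python sketch, not the claimed hardware: the names `Instruction`, `SemaphoreStore`, `Action`, and `dispatch` are hypothetical, the cases are numbered 1 to 4 in the order the text lists them, and the store is simplified so that any pending semaphore blocks a dependent instruction.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ISSUE = "issue"                          # output to the instruction crossbar switch
    ISSUE_WITH_NOTIFY = "issue_with_notify"  # output with the result-notification attribute
    WAIT = "wait"                            # hold the instruction back


@dataclass
class Instruction:
    name: str
    is_dependent: bool       # uses an operation result produced elsewhere
    is_depended_upon: bool   # produces a result that a later operation uses


@dataclass
class SemaphoreStore:
    # Hypothetical simplification: one entry per result not yet obtained.
    pending: set = field(default_factory=set)

    def has_pending(self) -> bool:
        return bool(self.pending)

    def raise_semaphore(self, key: str) -> None:
        self.pending.add(key)

    def clear_semaphore(self, key: str) -> None:
        self.pending.discard(key)


def dispatch(instr: Instruction, store: SemaphoreStore) -> Action:
    # Dependent instructions (cases 2 and 3) wait while a semaphore is found.
    if instr.is_dependent and store.has_pending():
        return Action.WAIT
    # Depended-upon instructions (cases 1 and 3) raise a semaphore and ask the
    # operation execution unit to report when the result is available.
    if instr.is_depended_upon:
        store.raise_semaphore(instr.name)
        return Action.ISSUE_WITH_NOTIFY
    # Case 4, or case 2 with no semaphore found: issue directly.
    return Action.ISSUE
```

In this sketch the role of the semaphore information control unit corresponds to calling `clear_semaphore` when result acquisition information arrives, after which a waiting dependent instruction can be dispatched again.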

According to the present invention, each of the plurality of thread units includes an instruction input unit and a semaphore information control unit. The instruction input unit operates as follows. When an execution instruction is not a dependent execution instruction, which performs an operation using an operation result for execution, but is a depended-upon execution instruction, which produces the operation result for a next execution, the unit stores semaphore information in the semaphore storage unit and outputs the instruction to the instruction crossbar switch with attribute information attached; the attribute information causes the operation execution unit that executes the instruction to output, after execution, result acquisition information indicating that the operation result has been obtained. When the execution instruction is a dependent execution instruction and not a depended-upon execution instruction, the unit searches the semaphore storage unit, holds the instruction back when semaphore information is detected, and outputs it to the instruction crossbar switch when none is detected. When the execution instruction is both a dependent and a depended-upon execution instruction, the unit searches the semaphore storage unit, holds the instruction back when semaphore information is detected, and, when none is detected, stores semaphore information for execution of the next operation in the semaphore storage unit and outputs the instruction to the instruction crossbar switch with the same attribute information attached. When the execution instruction is neither, the unit outputs it to the instruction crossbar switch as is. The semaphore information control unit acquires the result acquisition information output from the operation result crossbar control unit and erases the corresponding semaphore information stored in the semaphore storage unit. With this distinctive configuration, it is possible to realize a multithreaded processor that adaptively performs parallel execution and serial execution regardless of the type of application program, that enables fine-grained parallel execution even when the degree of parallelism is increased, and that performs pipeline operations with high utilization at low power.
When the number of thread units is larger than the number of operation execution units, it is possible to realize a multithreaded processor that enables fine-grained parallel execution even when the degree of parallelism is increased further, and that performs pipeline operations with high utilization at low power.

A multithreaded processor according to an embodiment of the present invention is described below with reference to FIGS. 1 to 8. FIG. 1 is a block diagram showing a configuration example of the multithreaded processor according to the embodiment. FIG. 2 shows a configuration example of a thread unit, which is a main part of the multithreaded processor. FIG. 3 shows a signal description example (part 1) for a main part of the multithreaded processor. FIG. 4 shows a configuration example of an operation execution unit of the multithreaded processor. FIG. 5 shows a signal description example (part 2) for a main part of the multithreaded processor. FIG. 6 shows a configuration example of the load/store unit of the multithreaded processor. FIG. 7 shows an operation example (part 1) of the multithreaded processor. FIG. 8 shows an operation example (part 2) of the multithreaded processor.

The object of realizing a multithreaded processor that adaptively performs parallel execution and serial execution regardless of the type of application program, that enables fine-grained parallel execution even when the degree of parallelism is increased, and that performs pipeline operations with high utilization at low power is achieved by providing each of the plurality of thread units with an instruction input unit and a semaphore information control unit. When an execution instruction is not a dependent execution instruction, which performs an operation using an operation result for execution, but is a depended-upon execution instruction, which produces the operation result for a next execution, the instruction input unit stores semaphore information in the semaphore storage unit and outputs the instruction to the instruction crossbar switch with attribute information attached; the attribute information causes the operation execution unit that executes the instruction to output, after execution, result acquisition information indicating that the operation result has been obtained. When the execution instruction is a dependent execution instruction and not a depended-upon execution instruction, the instruction input unit searches the semaphore storage unit, holds the instruction back when semaphore information is detected, and outputs it to the instruction crossbar switch when none is detected. When the execution instruction is both a dependent and a depended-upon execution instruction, the instruction input unit searches the semaphore storage unit, holds the instruction back when semaphore information is detected, and, when none is detected, stores semaphore information for execution of the next operation in the semaphore storage unit and outputs the instruction to the instruction crossbar switch with the same attribute information attached. When the execution instruction is neither, the instruction input unit outputs it to the instruction crossbar switch as is. The semaphore information control unit acquires the result acquisition information output from the operation result crossbar control unit and erases the corresponding semaphore information stored in the semaphore storage unit.

The configuration of the multithreaded processor is described first.
The multithreaded processor 10 shown in FIG. 1 comprises an operation crossbar switch (operation result exchange transmission path) 1, thread units 2a to 2h, an instruction crossbar control unit 3, an instruction crossbar switch (instruction exchange transmission path) 4, operation execution units 5a to 5d, a load/store unit 6, an operation result crossbar control unit 7, and an external interface (IF) unit 8.
The thread unit 2 shown in FIG. 2, which is a main part of the multithreaded processor 10, comprises a register group 21 consisting of 16 registers (R₀ to R₁₅) 21a to 21p, a program counter (PC) 21r, and a stack pointer (SP) 21s; n instruction semaphore units 22; a program memory 23; an instruction input unit 24; and an instruction FIFO (first in, first out) unit 25. Each of the thread units 2a to 2h has the same configuration as the thread unit 2.
The operation execution unit 5 shown in FIG. 4 comprises an instruction decode unit 51, an ALU (Arithmetic and Logic Unit) 52, an FPU (Floating point number Processing Unit) 53, and an operation result generation unit 54.
The load/store unit 6 shown in FIG. 6 comprises an instruction decode unit 61 and a load/store result generation unit 62.
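The composition described above can be summarized as plain data structures. The following Python sketch is only a reading aid under stated assumptions: the class and field names are hypothetical, the number n of instruction semaphore units is not given in the text and is left as an empty list, and the instruction input unit 24 is omitted because it is control logic rather than state.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ThreadUnit:
    # Register group 21: R0..R15 (21a-21p), PC (21r), SP (21s).
    registers: list = field(default_factory=lambda: [0] * 16)
    program_counter: int = 0
    stack_pointer: int = 0
    # Instruction semaphore units 22; the count n is not specified in the text.
    semaphores: list = field(default_factory=list)
    # Program memory 23, modeled here as a list of instructions.
    program_memory: list = field(default_factory=list)
    # Instruction FIFO unit 25.
    instruction_fifo: deque = field(default_factory=deque)


@dataclass
class MultiThreadProcessor:
    # Eight identically configured thread units 2a-2h.
    thread_units: dict = field(
        default_factory=lambda: {f"2{c}": ThreadUnit() for c in "abcdefgh"})
    # Four operation execution units 5a-5d, represented by name only.
    execution_units: tuple = ("5a", "5b", "5c", "5d")


processor = MultiThreadProcessor()
```

Each thread unit owns its own register group and FIFO, so writing to `processor.thread_units["2a"].registers` leaves the other seven units unchanged.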

The operation of the multithreaded processor is described next.
The load/store unit 6 of the multithreaded processor 10 shown in FIG. 1 is controlled by a control signal from the instruction crossbar control unit 3 supplied via the instruction crossbar switch 4, and obtains, via the external interface unit 8, the application program on which the multithreaded processor 10 is to operate. Alternatively, a predetermined program group is selected by the program counter 21r from among the programs stored in the program memory 23 and acquired. The operation result crossbar control unit 7 divides the obtained application program into small programs (execution instruction groups), one per processing unit (thread). The operation crossbar switch 1 assigns the small programs generated by the operation result crossbar control unit 7 to the thread units 2a to 2h in order. Each of the thread units 2a to 2h obtains, from the register group described later, execution instructions written in machine language for carrying out the operations of its assigned small program. The instruction crossbar control unit 3 generates, based on the requests obtained from the thread units 2a to 2h, a control signal that determines how the instruction crossbar switch 4 is connected, and supplies it to the instruction crossbar switch (instruction exchange transmission path) 4. The instruction crossbar switch 4 supplies the execution instruction generated by each of the thread units 2a to 2h to whichever of the operation execution units 5a to 5d is waiting for an operation, for example the operation execution unit 5a. The execution instructions include, for example, an instruction that raises a semaphore (flag) immediately before an operation starts. The semaphore is a flag used, when a plurality of operation execution units share a memory area, to prevent the stored contents from being corrupted or made inconsistent by simultaneous accesses to that area. The execution instructions also include an instruction that lowers the semaphore immediately after the operation has completed and its result has been written back.
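The routing step, in which requests from thread units are connected to operation execution units that are waiting for work, can be sketched as a simple pairing. This Python sketch assumes a fixed priority order (requests are served in list order) and unique unit names; the text does not state the arbitration policy, and the function name `connect` is hypothetical.

```python
def connect(requests, idle_units):
    """Pair requesting thread units with idle operation execution units.

    requests:   names of thread units that want to issue, in priority order
    idle_units: names of operation execution units waiting for an operation
    Returns the connections made and the requests left waiting.
    """
    # zip stops at the shorter list, so no unit is assigned twice.
    connections = dict(zip(requests, idle_units))
    waiting = requests[len(connections):]
    return connections, waiting


connections, waiting = connect(["2a", "2c", "2h"], ["5a", "5d"])
```

Here `connections` maps 2a to 5a and 2c to 5d, and 2h stays in `waiting` until an operation execution unit becomes idle.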

The operation execution unit 5a performs arithmetic processing according to the execution instruction supplied via the instruction crossbar switch 4 and supplies the result to the operation crossbar switch (operation result exchange transmission path) 1. When the operation has finished, it also supplies an operation end signal indicating completion to the operation result crossbar control unit 7. The operation result crossbar control unit 7 creates a new small program according to how far the operations have completed and supplies it to the operation crossbar switch 1. The operations of a new small program can start only when all the operation results they use have been obtained. For this purpose, the new small program carries semaphore dependency information, which is used to check whether all the operation results needed have been obtained.

A thread unit supplied with a new small program from the operation crossbar switch 1, for example the thread unit 2b, checks, based on the semaphore dependency information attached to the new small program, whether all the operation results used in the new operation have been obtained. If not all of them have been obtained, it does not output the execution instruction for the new operation. The instruction crossbar switch 4 sequentially supplies the execution instructions output from the thread units 2a to 2h to those of the operation execution units 5a to 5d that are able to perform an operation. The operation execution units 5a to 5d execute the execution instructions supplied one after another and store the results in a predetermined memory area. The thread unit 2b, which is holding back its execution instruction, compares semaphore set information with semaphore clear information. The semaphore set information associates each operation constant described in the semaphore dependency information with the thread that must compute that constant beforehand. The semaphore clear information is cleared when the operation execution units 5a to 5d execute the operation instructions of those threads. When the two match, the semaphore information that was set is cleared. Once all raised semaphore information has been cleared, the waiting thread unit 2b detects that all the operation results needed have been obtained, and outputs the execution instruction for the new operation.

Thereafter, in the same way, the thread units 2a to 2h generate execution instructions to be executed by the operation execution units 5a to 5d, confirm that there is no hazard by comparing the semaphore dependency information with the semaphores actually raised, and output an execution instruction only after confirming that no hazard exists. The operation execution units 5a to 5d perform operations according to the execution instructions supplied.
Because the multithreaded processor 10 has eight thread units 2a to 2h for the four operation execution units 5a to 5d, even when four of the thread units are not outputting execution instructions, the other four still output them. The four operation execution units 5a to 5d therefore do not incur data hazards or control hazards caused by waiting for operations.

Next, the configuration is described in detail.
With reference to FIG. 2, the thread unit 2, which is the main part of the multi-thread processor 10, will be described. The thread unit 2 shown in the figure has the same configuration and performs the same operations as each of the thread units 2a to 2h.
First, the register group 21 stores, in the registers 21a to 21p, each item of operation data described in the small program supplied from the operation result crossbar control unit 7 via the operation crossbar switch 1. The program counter 21r stores the main-memory address at which the instruction to be executed next is stored. The stack pointer 21s stores the stack pointer address value described in the small program.

With reference to the main-memory address stored in the program counter 21r, an execution instruction written in machine language is read out of the program memory 23. The instruction input unit 24 temporarily holds the read execution instruction and loads into the instruction semaphore unit 22 the semaphore set information generated from the semaphore dependency information described in the execution instruction. The semaphore set information tracks, as an 8-bit value, the progress of the execution instructions issued by, for example, the eight thread units 2a to 2h. For example, when an execution instruction in the thread unit 2h computes with the operation results of the thread units 2a, 2b, and 2d, the semaphore set information is "11010000", and "11010000" is set in the instruction semaphore unit 22. The instruction semaphore unit 22 obtains progress information for the operations in the thread units 2a, 2b, and 2d from signals that it requests from the operation crossbar switch 1. When the operation for the execution instruction output from the thread unit 2a has been performed by an operation execution unit and its semaphore is lowered, the first bit of the semaphore set information is cleared, changing it to "01010000". When the semaphores of the thread units 2b and 2d are lowered as well, the semaphore set information becomes "00000000". All bits being 0 means that every operation result needed for the new operation has been obtained. The instruction input unit 24 then outputs the temporarily held execution instruction, which is supplied via the instruction FIFO unit 25 to the instruction crossbar control unit 3 and the instruction crossbar switch 4. The operation execution unit executes the supplied execution instruction.
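The 8-bit bookkeeping described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `mask_for`, `clear`, and `show` are hypothetical, and the leftmost bit is assumed to stand for the thread unit 2a, which is what the "11010000" example for 2a, 2b, and 2d implies.

```python
THREAD_UNITS = "abcdefgh"  # thread units 2a..2h; leftmost bit is assumed to be 2a


def mask_for(units: str) -> int:
    """Build an 8-bit semaphore mask with one bit per producing thread unit."""
    mask = 0
    for unit in units:
        mask |= 0x80 >> THREAD_UNITS.index(unit)
    return mask


def clear(mask: int, unit: str) -> int:
    """Clear the bit of a thread unit whose operation result has arrived."""
    return mask & ~(0x80 >> THREAD_UNITS.index(unit)) & 0xFF


def show(mask: int) -> str:
    """Render the mask as the 8-character bit string used in the text."""
    return format(mask, "08b")


pending = mask_for("abd")                    # instruction waits on 2a, 2b, 2d
print(show(pending))                         # 11010000
pending = clear(pending, "a")                # result of 2a arrives
print(show(pending))                         # 01010000
pending = clear(clear(pending, "b"), "d")    # results of 2b and 2d arrive
print(show(pending))                         # 00000000
print(pending == 0)                          # True: the instruction may be issued
```

Representing the semaphores as a single integer keeps the "all results available" check to one comparison against zero, which mirrors the all-bits-0 condition in the description.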

With reference to FIG. 3, the description formats of the execution instruction loaded into the program memory 23 of the thread unit 2 and of the instruction output from the instruction input unit 24 will be described.
FIG. 3(A) shows the machine-language instruction as it is written in the program memory. Its fields are, in order: semaphore dependency information, semaphore set information, operation code (Opcode), source 1 register (source_1 register) number, source 2 register (source_2 register) number, and operation result write register (destination register) number. The instruction input unit 24 converts the input machine-language instruction into the format that is input to the operation execution units.
FIG. 3(B) shows the format output by the instruction input unit 24 after conversion. The semaphore dependency information is replaced with the thread unit's own thread number. The semaphore set information is left unchanged. For the Opcode, part of the instruction is decoded to determine whether the operation result is to be output via the load/store unit 6 or is to be passed on via the operation crossbar switch 1 for further operations. Depending on the destination of the operation result, a request describing the destination, shown in FIG. 3(C), is generated.
The source 1 data (source_1 data) field in FIG. 3(B) is filled with the data obtained from the register group 21 by looking up the source 1 register number. The source 2 data field is filled in the same way. The operation result write register (destination register) number is copied unchanged from (A).
The output signal of the instruction input unit 24, in the format shown in FIG. 3(B), is output when the AND operation over the individual bits of the semaphore set information in the instruction semaphore unit 22 finds that all bits are 0.
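The conversion from the FIG. 3(A) format to the FIG. 3(B) format can be illustrated as follows. This is a minimal sketch, not the patent's encoding: the class names, field names, and the use of a Python list for the register group 21 are assumptions made for illustration, and field widths are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MachineInstruction:
    """Fields of FIG. 3(A), in the order given in the text."""
    sem_dependency: int
    sem_set: int
    opcode: str
    src1_reg: int
    src2_reg: int
    dest_reg: int


@dataclass(frozen=True)
class IssuedInstruction:
    """Fields of FIG. 3(B), as output by the instruction input unit."""
    thread_number: int
    sem_set: int
    opcode: str
    src1_data: int
    src2_data: int
    dest_reg: int


def convert(instr: MachineInstruction, thread_number: int,
            registers: List[int]) -> IssuedInstruction:
    """Replace the dependency field with the thread number and the
    register numbers with the register contents; copy the rest."""
    return IssuedInstruction(
        thread_number=thread_number,
        sem_set=instr.sem_set,
        opcode=instr.opcode,
        src1_data=registers[instr.src1_reg],
        src2_data=registers[instr.src2_reg],
        dest_reg=instr.dest_reg,
    )


registers = [0] * 16            # stands in for registers 21a to 21p
registers[1], registers[2] = 7, 35
issued = convert(MachineInstruction(0b11000000, 0b00100000, "add", 1, 2, 3),
                 thread_number=7, registers=registers)
print(issued)
```

Resolving the operands at issue time means the operation execution unit receives data rather than register numbers, so it needs no access to the thread unit's register group.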

The operation execution unit 5 will be described with reference to FIG. 4. The operation execution unit 5 shown in the figure has the same configuration and performs the same operations as each of the operation execution units 5a to 5d.
First, the instruction decode unit 51 decodes the Opcode of the instruction, in the instruction input unit output format shown in FIG. 3(B), that is supplied from one of the thread units 2a to 2h via the instruction crossbar switch 4. If the decoded instruction is an integer or logical operation, the ALU 52 operates on the source 1 data and source 2 data. If the instruction is a floating-point operation, the FPU 53 performs the floating-point operation on the source 1 data and source 2 data. The semaphore set information and the semaphore information are passed unchanged to the operation result generation unit 54. The operation result generation unit 54 generates a signal that carries the supplied operation results and information in a predetermined format.
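The decode-and-dispatch step can be sketched as a lookup on the opcode. The opcode mnemonics below (`add`, `fmul`, and so on) are hypothetical, since the source does not list an instruction set; only the split between integer/logical operations for the ALU and floating-point operations for the FPU comes from the text.

```python
import operator

# Hypothetical opcode tables; the patent does not enumerate opcodes.
INT_OPS = {"add": operator.add, "sub": operator.sub,
           "and": operator.and_, "or": operator.or_}
FP_OPS = {"fadd": operator.add, "fsub": operator.sub, "fmul": operator.mul}


def execute(opcode, src1_data, src2_data):
    """Route the operands to the ALU or the FPU according to the opcode
    and return (unit name, operation result)."""
    if opcode in INT_OPS:
        return "ALU", INT_OPS[opcode](int(src1_data), int(src2_data))
    if opcode in FP_OPS:
        return "FPU", FP_OPS[opcode](float(src1_data), float(src2_data))
    raise ValueError(f"unknown opcode: {opcode}")


print(execute("add", 2, 3))        # ('ALU', 5)
print(execute("fmul", 1.5, 2.0))   # ('FPU', 3.0)
```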

The description format of the operation result output signal will be described with reference to FIG. 5.
FIG. 5(A) shows the format. Its fields are, in order: semaphore clear information, operation result data, and operation result write register (destination register) number. The semaphore clear information indicates the position of the semaphore to be cleared: when an operation result has been obtained, it is used to clear the corresponding semaphore among those set by the semaphore set information. The operation result data field holds the data obtained by the operation. The operation result write register number field holds the operation result write register number shown in FIG. 3(B), unchanged.
FIG. 5(B) shows the format of the request that designates a thread number to the operation result crossbar control unit 7 in order to request and obtain information on the progress of that thread's operation. The operation result crossbar control unit 7 refers to the request information and transfers the operation result data of (A) to the requested thread. When a load or store instruction is requested of the instruction crossbar control unit 3 by any of the thread units 2a to 2h, the destination of the request is the load/store unit 6.

The load/store unit 6 will be described with reference to FIG. 6.
The instruction decode unit 61 decodes the instruction input from the instruction crossbar switch 4 and controls whether data supplied via the external interface unit 8 is loaded or data is supplied via the external interface unit 8 to be stored. For a load instruction, the load/store unit 6 uses the source 1 data field shown in FIG. 3(B) as the address and outputs a read signal to the external interface unit 8. The load/store result generation unit 62 inserts the signal input from the external interface unit 8 into the operation result data field of the operation result signal format shown in FIG. 5(A) and outputs it to the operation crossbar switch 1. The corresponding information is written in the semaphore clear information and operation result write register number fields. For a store instruction, the load/store unit 6 uses the source 1 data field shown in FIG. 3(B) as the address and the source 2 data field as the data, and outputs a Write signal to the external interface unit 8. Because the operation result has already been obtained, the load/store unit 6 does not supply an operation result to the operation crossbar switch 1.
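The load and store paths can be summarized in a short sketch. Everything named here is an assumption for illustration: a Python dictionary stands in for the memory behind the external interface unit 8, the opcodes are spelled `load` and `store`, and the returned dictionary mimics the three fields of the FIG. 5(A) result format.

```python
def load_store(opcode, src1_data, src2_data, sem_clear, dest_reg, memory):
    """Model the load/store unit: a load returns a FIG. 5(A)-style result,
    a store writes to memory and returns no operation result."""
    if opcode == "load":
        # Source 1 data is the address; the value read becomes the result data.
        return {"sem_clear": sem_clear,
                "data": memory[src1_data],
                "dest_reg": dest_reg}
    if opcode == "store":
        # Source 1 data is the address, source 2 data is the value to write.
        memory[src1_data] = src2_data
        return None
    raise ValueError(f"not a load/store opcode: {opcode}")


memory = {0x10: 99}
result = load_store("load", 0x10, 0, 0b10000000, 4, memory)
print(result)     # {'sem_clear': 128, 'data': 99, 'dest_reg': 4}
print(load_store("store", 0x20, 1234, 0, 0, memory))   # None
print(memory[0x20])                                     # 1234
```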

As necessary, the load/store result generation unit 62 outputs to the operation crossbar switch 1, for example, dummy data indicating that no operation result is supplied to the operation crossbar switch 1. The semaphore set information and the operation result write register number are copied unchanged from the contents of FIG. 3(A) and output. The thread-number request shown in FIG. 5(B) is generated and output by the load/store result generation unit 62 as a request to be sent via the operation crossbar switch 1 in response to a request issued by any of the thread units 2a to 2h. A thread unit that receives, via the operation crossbar switch 1, an operation result described in the formats of FIGS. 4(A) and 4(B) uses the semaphore clear information of FIG. 5(A) to clear the semaphore bit that it set according to the semaphore set information of the instruction input unit output of FIG. 3(A). This ends the waiting state of a thread unit that had been waiting because an operation result needed for a new operation had not yet been output.

The instruction execution of the multi-thread processor 10 will be described with reference to FIGS. 7 and 8.
First, the operation result crossbar control unit 7 analyzes the instruction execution order of an application program written in a language such as C or Fortran and obtained via the external interface unit 8 and the load/store unit 6, together with the dependencies among the data used in its operations, and generates execution instructions written in machine language. For example, of the instructions 1 to 5 shown in FIG. 7(A), instructions 1 to 4 have no dependencies on one another, but instruction 5 is performed only after the operation results of instructions 1 and 2 have been obtained. Accordingly, the semaphore clear information of instruction 1 is "10000000", with the first bit set to 1, and that of instruction 2 is "01000000", with the second bit set to 1. The semaphore set information of instruction 5 is "11000000", with the first and second bits set to 1. Instructions 3 and 4 can be executed regardless of instructions 1 and 2, so both their semaphore set information and their semaphore clear information are "00000000".

FIG. 7(A) shows the case in which instructions 1 to 4 start executing simultaneously. When, for example, instruction 1 finishes, the first bit of instruction 5's semaphore set information "11000000" is cleared by instruction 1's semaphore clear information "10000000", giving "01000000". Because a 1 remains in the semaphore set information, instruction 5 continues to wait. Next, when the operation result of instruction 2 is obtained, the second bit of instruction 5's semaphore set information "01000000" is cleared by instruction 2's semaphore clear information "01000000", and the semaphore set information of instruction 5 becomes "00000000". Instruction 5 is released from waiting and starts its operation using the results obtained by instructions 1 and 2.
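The walk-through for instructions 1 to 5 can be replayed with the bit patterns quoted in the text. The function names and the idea of passing a completion order as a list are assumptions for illustration; the semaphore values themselves are the ones given for FIG. 7(A).

```python
# Semaphore clear information of instructions 1 to 4, as given in the text.
SEM_CLEAR = {1: 0b10000000, 2: 0b01000000, 3: 0b00000000, 4: 0b00000000}
# Semaphore set information of instruction 5 (waits on instructions 1 and 2).
SEM_SET_5 = 0b11000000


def trace_instruction_5(completion_order):
    """Return instruction 5's semaphore set information after each completion."""
    pending = SEM_SET_5
    states = []
    for instr in completion_order:
        pending &= ~SEM_CLEAR[instr] & 0xFF   # clear the bits named by the result
        states.append((instr, format(pending, "08b")))
    return states


def release_point(completion_order):
    """Return the instruction whose completion releases instruction 5, if any."""
    for instr, bits in trace_instruction_5(completion_order):
        if bits == "00000000":
            return instr
    return None


print(trace_instruction_5([1, 2]))      # [(1, '01000000'), (2, '00000000')]
print(release_point([1, 2, 3, 4]))      # 2
print(release_point([2, 3, 4, 1]))      # 1
print(release_point([3, 4]))            # None: instruction 5 keeps waiting
```

The second and third calls show that the release depends only on which results have arrived, not on the order in which instructions 1 and 2 complete.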

FIG. 7(B) shows an example in which instruction 2 starts executing first, for instance because the operation execution units 5a to 5d are busy with other operations and computing capacity is insufficient. Instruction 1 is executed after instructions 3 and 4. In this case too, as in (A), instruction 5 is executed after the operations of both instruction 1 and instruction 2 have finished.
FIG. 8(A) shows an example in which instructions 3 and 4, which have no mutual dependencies, are executed after instruction 5. Here too, the semaphore set information and semaphore clear information are set and cleared as described for FIG. 7(A).
Now suppose that an instruction 6 (not shown) is executed using the result of instruction 5. Instruction 5 is then a dependency execution instruction, which computes using the results of instructions 1 and 2, and at the same time a depended-on execution instruction with respect to instruction 6, which needs its result. When instructions 1 and 2 have been executed and their semaphore information has been cleared, instruction 5 sets semaphore information for itself. Later, when instruction 5 has been executed and its result obtained, the semaphore information that instruction 5 set is cleared. Once instruction 5's semaphore is cleared, instruction 6 finds no semaphore set when it refers to the semaphore information. The execution instruction for instruction 6 is output from the thread unit and executed by an operation execution unit that is in the standby state.

Here, there are four operation execution units 5a to 5d and eight thread units 2a to 2h, so the thread units outnumber the operation execution units. As a result, the thread units always have many small programs generated and waiting to be computed by the operation execution units. As soon as an operation execution unit finishes the operation it was given, it is supplied with an executable small program for which no semaphore is set. The operation execution units never sit idle waiting for operations, and their utilization is kept high.
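The effect of having more thread units than operation execution units can be illustrated with a toy model. The model is an assumption, not something taken from the source: every operation takes one cycle, a fixed number of thread units is stalled on semaphores in each cycle, and the stalled set rotates from cycle to cycle.

```python
def utilization(num_threads, num_exec, stalled_per_cycle, cycles=100):
    """Fraction of execution-unit slots that receive an instruction."""
    issued = 0
    for cycle in range(cycles):
        # Rotating window of thread units that are waiting on a semaphore.
        stalled = {(cycle + k) % num_threads for k in range(stalled_per_cycle)}
        ready = [t for t in range(num_threads) if t not in stalled]
        # Each execution unit accepts at most one instruction per cycle.
        issued += min(len(ready), num_exec)
    return issued / (num_exec * cycles)


# Eight thread units, four execution units, four thread units stalled:
print(utilization(8, 4, 4))   # 1.0
# Four thread units, four execution units, two thread units stalled:
print(utilization(4, 4, 2))   # 0.5
```

Under these assumptions, the eight-to-four arrangement keeps all four operation execution units busy even when half of the thread units are waiting, whereas a one-to-one arrangement leaves execution units idle whenever a thread unit stalls.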

Waiting because an operation result needed for an operation has not yet been obtained takes place in the thread units. Because a thread unit has a simpler circuit configuration than an operation execution unit, it occupies a smaller area than an operation execution unit when the multi-thread processor 10 is implemented in semiconductor. Furthermore, a thread unit consumes less power in the waiting state than an operation execution unit does. By making the number of operation execution units smaller than the number of thread units and preventing waiting states from occurring in the operation execution units, the multi-thread processor 10 can be realized as an LSI with a small chip area and low power consumption during execution.

Instead of storing the semaphore information in the register group, which consists of the general-purpose registers of the CPU, a storage area that does not need to be accessed directly from the executing program may be used as the semaphore storage unit. The semaphore information is then held as information internal to the CPU hardware. In that case, operations can be executed without using instructions such as a semaphore-set instruction or a semaphore-clear instruction. A single instruction therefore both executes the operation and manages the semaphore information.

As described above, the multi-thread processor of this embodiment is realized with a plurality of thread units, each comprising an instruction input unit 24 and a semaphore information control unit 22. The instruction input unit 24 handles an execution instruction in one of four ways. When the execution instruction is not a dependency execution instruction (one that computes using an execution operation result) but is a depended-on execution instruction (one that produces the next execution operation result), the instruction input unit 24 stores semaphore information in the semaphore storage unit 21 and outputs the execution instruction to the instruction crossbar switch 4 with attribute information attached; the attribute information causes the operation execution unit that executes the instruction to output result acquisition information indicating that the execution operation result has been obtained. When the execution instruction is a dependency execution instruction and not a depended-on execution instruction, the instruction input unit 24 searches the semaphore storage unit 21; if semaphore information is detected, it withholds the instruction from the instruction crossbar switch 4, and if none is detected, it outputs the instruction to the instruction crossbar switch 4. When the execution instruction is both a dependency execution instruction and a depended-on execution instruction, the instruction input unit 24 searches the semaphore storage unit; if semaphore information is detected, it withholds the instruction from the instruction crossbar switch 4, and if none is detected, it stores semaphore information for the next operation in the semaphore storage unit and outputs the instruction to the instruction crossbar switch with the same attribute information attached. When the execution instruction is neither a dependency execution instruction nor a depended-on execution instruction, the instruction input unit 24 outputs it to the instruction crossbar switch as it is. The semaphore information control unit 22 acquires the result acquisition information output from the operation result crossbar control unit and erases the semaphore information stored in the semaphore storage unit for the corresponding execution operation result.
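The four cases handled by the instruction input unit reduce to two questions: does the instruction use an earlier result, and does a later instruction need its result? The sketch below encodes that decision; the function name, parameter names, and the returned dictionary are hypothetical and only summarize the behavior described in the text.

```python
def decide(uses_earlier_result, feeds_later_instruction, semaphore_pending):
    """Decide what the instruction input unit does with one execution instruction.

    uses_earlier_result:     the instruction computes with an earlier result
    feeds_later_instruction: a later instruction needs this instruction's result
    semaphore_pending:       semaphore information was found in the storage unit
    """
    if uses_earlier_result and semaphore_pending:
        # A needed result is still missing: hold the instruction back.
        return {"issue": False, "set_semaphore": False, "report_result": False}
    # Otherwise issue; if a later instruction depends on the result, store a
    # semaphore and attach the attribute that makes the execution unit
    # report when the result has been obtained.
    return {"issue": True,
            "set_semaphore": feeds_later_instruction,
            "report_result": feeds_later_instruction}


print(decide(False, True, False))   # producer only: issue, set semaphore
print(decide(True, False, True))    # consumer, result missing: wait
print(decide(True, True, False))    # consumer and producer, results ready
print(decide(False, False, False))  # independent: issue as is
```

Note that an instruction that uses no earlier result is issued regardless of the stored semaphores, which matches the first and fourth cases of the description.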

FIG. 1 is a block diagram showing a configuration example of a multi-thread processor according to an embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of the thread unit that is the main part of the multi-thread processor according to the embodiment of the present invention.
FIG. 3 is a diagram showing a signal description example (part 1) of the main part of the multi-thread processor according to the embodiment of the present invention.
FIG. 4 is a diagram showing a configuration example of the operation execution unit of the multi-thread processor according to the embodiment of the present invention.
FIG. 5 is a diagram showing a signal description example (part 2) of the main part of the multi-thread processor according to the embodiment of the present invention.
FIG. 6 is a diagram showing a configuration example of the load/store unit of the multi-thread processor according to the embodiment of the present invention.
FIG. 7 is a diagram showing an operation example (part 1) of the multi-thread processor according to the embodiment of the present invention.
FIG. 8 is a diagram showing an operation example (part 2) of the multi-thread processor according to the embodiment of the present invention.

Explanation of symbols

1 Operation crossbar switch (operation result exchange transmission path)
2, 2a to 2h Thread units
3 Instruction crossbar control unit
4 Instruction crossbar switch (instruction exchange transmission path)
5, 5a to 5d Operation execution units
6 Load/store unit
7 Operation result crossbar control unit
8 External interface unit
10 Multi-thread processor
21 Register group
21a to 21p Registers
21r Program counter
21s Stack pointer
22 Instruction semaphore unit
23 Program memory
24 Instruction input unit
25 Instruction FIFO unit
51 Instruction decode unit
52 ALU
53 FPU
54 Operation result generation unit
61 Instruction decode unit
62 Load/store result generation unit

Claims

A multi-thread processor comprising: a plurality of thread units, each of which acquires some of a plurality of execution instruction groups from a program memory storing an application program described by the plurality of execution instruction groups, and sequentially outputs some of the plurality of execution instructions described in the acquired execution instruction groups; an instruction exchange transmission path that outputs the execution instructions sequentially output from the plurality of thread units to a desired output terminal among a plurality of output terminals; a plurality of arithmetic units that are connected to the respective output terminals of the instruction exchange transmission path and that store, in storage means, the operation results obtained by executing the execution instructions supplied to the connected output terminals; and an operation result exchange transmission path that supplies the operation results to a desired thread unit as execution operation results,
the instruction exchange transmission path outputting the plurality of execution instructions output from the plurality of thread units to an arithmetic unit in a standby state among the plurality of arithmetic units,
wherein each of the plurality of thread units comprises:
an instruction input unit that, when the execution instruction to be output includes a depended-on execution instruction for obtaining the execution operation result and does not include a dependency execution instruction that performs an operation using the execution operation result, stores in a semaphore storage unit semaphore information indicating that the operation result of the execution instruction has not been obtained, and outputs to the instruction exchange transmission path an execution instruction with attribute information that causes result acquisition information, indicating that the execution operation result has been obtained after execution of the execution instruction, to be output to the operation result exchange transmission path,
that, when the execution instruction to be output includes the dependency execution instruction and does not include the depended-on execution instruction, refers to the semaphore information stored in the semaphore storage unit, withholds output of the dependency execution instruction to the instruction exchange transmission path if stored semaphore information is detected, and outputs the dependency execution instruction to the instruction exchange transmission path if no stored semaphore information is detected,
that, when the execution instruction to be output includes an execution instruction that is both the dependency execution instruction and the depended-on execution instruction, refers to the semaphore information stored in the semaphore storage unit, withholds output of the execution instruction to the instruction exchange transmission path if stored semaphore information is detected, and, if no stored semaphore information is detected, stores the semaphore information in the semaphore storage unit and outputs to the instruction exchange transmission path an execution instruction with attribute information that causes the result acquisition information to be output to the operation result exchange transmission path,
and that, when the execution instruction to be output includes neither the dependency execution instruction nor the depended-on execution instruction, outputs the execution instruction to the instruction exchange transmission path; and
a semaphore information control unit that acquires the result acquisition information supplied to the operation result exchange transmission path and deletes the semaphore information stored in the semaphore storage unit for the corresponding execution operation result.

The multi-thread processor according to claim 1,
wherein the number of the plurality of thread units is larger than the number of the plurality of arithmetic units.