JP2006039824A

JP2006039824A - Method for controlling multiprocessor-mounted system LSI

Info

Publication number: JP2006039824A
Application number: JP2004217175A
Authority: JP
Inventors: Mamoru Tanaka; 守田中; Kazuki Murakami; 和希村上
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2004-07-26
Filing date: 2004-07-26
Publication date: 2006-02-09

Abstract

Problem to be solved: To generalize the optimization and speed-up of multitask control on a multiprocessor.
Solution: The hardware consists of a group of processors, each with a data cache, inside a system LSI, together with a large-capacity external shared memory.
Multiple candidate combinations of tasks are generated, and the same task is executed redundantly and speculatively on many processors. Communication between processors is treated as a penalty, and the result of the processor-task combination that best avoids this penalty is adopted. This uses the most efficient processing procedure while keeping the data caches and the shared memory consistent across the whole system.
Selected drawing: Figure 1

Description

The present invention relates to a method of configuring an LSI, commonly called a system LSI or system-on-chip (SOC), that uses multiple CPUs (a multiprocessor) integrated into a single LSI together with dedicated hardware circuits. It also relates to a method of executing so-called multitasking on this LSI, in which different programs run simultaneously and independently and are controlled by communicating with one another.

In recent years, as semiconductor manufacturing processes have advanced and integration density has increased, LSIs that combine a CPU with dedicated hardware circuits on a single chip, so-called system LSIs or systems-on-chip (SOCs), have come into wide use. Higher integration density has also made it relatively easy, in terms of the chip area a CPU occupies, to mount multiple CPUs, and multiprocessor SOCs carrying several CPUs to improve performance are gradually coming into use. As integration density improves further, LSI cost is expected to be constrained more by the number of external pins than by area, and an era is expected in which a very large number of CPUs, from tens to hundreds, can be mounted on an SOC at very low cost in terms of area.

Meanwhile, the software used in digital devices has become relatively complex, and it is increasingly common to use a multitasking operating system (OS), which runs programs simultaneously and independently and controls them through communication between programs.

Furthermore, given the physical limits of semiconductors, power consumption, and heat generation, it is considered difficult to keep raising operating frequencies at the pace seen so far. For this reason as well, and not only for SOCs, multiprocessor application software and multitasking OSs for multiprocessors have long been developed and used to speed up processing. Examples include large-scale parallel numerical computation using thousands of processors and multitasking multiprocessor OSs using two to four processors.
JP-A-3-211656

However, the conventional methods have the following problems.

1) Application software for large numbers of processors is effective where specific parallel operations can be performed, such as image processing or numerical computation that solves field equations. In control-oriented processing that cannot easily be split into parallel parts, however, automatic parallelization of the algorithm is difficult; for example, the dependencies among the individual tasks of a multitasking application differ from one application to another. There is no general solution for generating parallel algorithms for such control processing, and automation is difficult.

2) When a shared memory is used, in which multiple processors access one memory region simultaneously, bus traffic to the shared memory grows and access speed falls as the number of processors increases. Speed must therefore be improved by copying data into local high-speed memory accessible to each processor, such as a cache, but it then becomes difficult to keep each processor's cache consistent with the shared memory. Placing only distributed memory next to each processor is also conceivable, but passing data between distributed memories is then difficult, and it is hard to keep their contents consistent under a general control algorithm.

3) When multiple tasks are assigned to multiple processors, it is difficult to choose an appropriate method for improving the data processing efficiency of the system as a whole.

As an example, suppose tasks Ta, Tb, and Tc run on processors Px and Py. One method gives task Ta a processor to itself, Px: (Ta), Py: (Tb, Tc); another assigns the two tasks Ta and Tb to one processor, Px: (Ta, Tb), Py: (Tc). For task Ta, the former method clearly runs faster in the sense that no task transition occurs within the processor, so no swapping of register data (context switch) is needed. On the other hand, if tasks Ta and Tb operate by exchanging messages, they must synchronize and communicate across different processors, and when that overhead is large the former method becomes the slower one. This inter-processor synchronization overhead varies with the hardware configuration (multiprocessor, memory, and so on) and with the control structure of the multitasking software. As a result, once the numbers of tasks and processors become large, which of the two methods processes data more efficiently depends on the case. Selecting a method for assigning tasks to a multiprocessor is therefore difficult.
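The trade-off for task Ta can be made concrete with a toy cost model. The sketch below is illustrative only: the function name, the cost parameters, and all numbers are assumptions chosen for the example, not values from this publication.

```python
def task_overhead(task, assignment, messages, switches, switch_cost, sync_cost):
    """Toy overhead estimate for one task under one assignment.

    assignment: dict mapping processor name -> tuple of task names
    messages:   dict mapping (task_a, task_b) -> number of messages exchanged
    switches:   number of context switches the task suffers when it shares a processor
    All cost figures are hypothetical units, not measured values.
    """
    home = {t: proc for proc, tasks in assignment.items() for t in tasks}
    shares_processor = len(assignment[home[task]]) > 1
    # Context-switch cost only applies when the task shares its processor.
    switching = switches * switch_cost if shares_processor else 0
    # Synchronization cost only applies to messages that cross processors.
    syncing = sum(
        count * sync_cost
        for (a, b), count in messages.items()
        if task in (a, b) and home[a] != home[b]
    )
    return switching + syncing


former = {"Px": ("Ta",), "Py": ("Tb", "Tc")}
latter = {"Px": ("Ta", "Tb"), "Py": ("Tc",)}
traffic = {("Ta", "Tb"): 10}  # Ta and Tb exchange ten messages

# (former, latter) overhead for Ta with cheap and with expensive synchronization.
cheap = tuple(task_overhead("Ta", a, traffic, 4, 5, 1) for a in (former, latter))
costly = tuple(task_overhead("Ta", a, traffic, 4, 5, 20) for a in (former, latter))
```

With these made-up figures, `cheap` is `(10, 20)`, so the former assignment is better for Ta, while `costly` is `(200, 20)`, so the latter is better. The winner flips with the synchronization cost alone, which is the case-by-case behavior described above.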

What has been desired is an architecture and algorithm that solve the above problems: in an environment where a very large number of processors is available cheaply, they should execute multitasking software at high speed while keeping the data in many distributed memories consistent, without the data transfer rate to the shared memory becoming a bottleneck.

To solve the above problems, the present invention is characterized by the following configuration.

A multitask control method for a multiprocessor-mounted system LSI, the system LSI containing at least: a multiprocessor; a data cache memory and an instruction cache memory connected to each of the processors P1, P2, P3, ...; a bus shared so that it is accessible from all processors; and a memory controller through which an external shared memory can be connected via the bus. Each processor has means by which the user can control the timing of delayed writing (write-back) and refill of the data cache. For multitask execution, the method has: means for preparing multiple candidate methods of distributing tasks T1, T2, T3, ... to the processors, as distribution method tables D1, D2, D3, ...; means for assigning processors to tasks using these tables so that the same task is executed redundantly on multiple processors; means for detecting the progress of transitions between tasks for each distribution method; and means for writing back to the shared memory, based on the detection means, the contents of the distributed memory of the processors used by the distribution method whose processing has progressed furthest.

As described above, according to the present invention, when multiple tasks are assigned to a multiprocessor, multiple candidate distribution methods are proposed and applied to the processors, and the result of the most efficient task assignment is kept. This yields the following effects.

1) The scheduling of multitask assignment to a multiprocessor, which is difficult to predict, is optimized.

2) Because the method does not predict later optimal assignments from a history (log), a dynamically optimal assignment for that moment is made even for state transitions that are poorly reproducible and hard to predict statistically.

3) The same algorithm scales as the number of processors or the number of tasks increases.

The method is especially effective with a system LSI in which the package pin count stays small even when the number of processors is very large. It is particularly useful with highly integrated configurations in which placing many more processors than tasks does not, by processor count alone, significantly affect cost, and with small processor cores.

<Embodiment 1>
The principle of the present invention is explained below using one example of a preferred embodiment. The description illustrates an embodiment based on the principle of the invention and is given to aid understanding; the scope of the invention is not necessarily limited by the details of this embodiment.

FIG. 1 is a diagram showing the basic concept of the present invention.

FIG. 1(a) shows the hardware configuration of the system LSI, and FIG. 1(b) shows the control method, that is, how multiple tasks are assigned on the multiprocessor.

As shown in FIG. 1(a), within the system LSI, high-speed distributed memories M1, M2, M3, ... and processors P1, P2, P3, ... are connected to bus 100 in corresponding pairs. A shared memory 200 is connected externally to bus 100 through a memory controller (not shown). Reference numeral 500 denotes an arbitration processor that receives signals from each processor Pn and performs control. Reference numeral 400 denotes a DMA controller that can transfer data between the shared memory and the high-speed distributed memories. Reference numeral 300 denotes the portion implemented as the system LSI. Tasks T1, T2, T3, ..., which are software programs executed on the processors, are also provided; FIG. 1(b) shows some of them.

Each processor is equipped with a data cache D$ and an instruction cache I$. The data cache D$ supports delayed writing (write-back). In addition, in the present invention the instruction set available to the user provides two functions: holding the data cache in a state in which write-back is suspended; and, when a store instruction misses the cache while all lines are valid, suspending cache write-back and refill and issuing a notification signal to the outside of the processor while remaining in a wait state.

This section describes an example with seven or more processors and three tasks, T1, T2, and T3. The possible ways of distributing the tasks are listed below as distribution method tables D1 to D5. Here, (task, ...) denotes the tasks assigned to a single processor; for example, (T1, T2) means that tasks T1 and T2 are assigned to one processor. {(task, ...), ...} denotes a distribution method, that is, a complete assignment of all tasks to processors built from such groups.

For example, D2 = {(T1, T2), (T3)} denotes distribution method D2, in which all tasks are executed by assigning tasks T1 and T2 to one processor and task T3 to another.

D1 = {(T1), (T2), (T3)}
D2 = {(T1, T2), (T3)}
D3 = {(T2, T3), (T1)}
D4 = {(T3, T1), (T2)}
D5 = {(T1, T2, T3)}
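The five tables are exactly the ways of splitting three tasks into non-empty groups. As a minimal sketch, assuming the tables are meant to be all set partitions of the task list (the function name `set_partitions` is hypothetical), they could be enumerated as follows:

```python
def set_partitions(tasks):
    """Yield every way to split `tasks` into non-empty groups (one group per processor)."""
    if not tasks:
        yield []
        return
    first, rest = tasks[0], tasks[1:]
    for partition in set_partitions(rest):
        # Put `first` on a processor of its own ...
        yield [(first,)] + partition
        # ... or add it to one of the existing groups.
        for i, group in enumerate(partition):
            yield partition[:i] + [(first,) + group] + partition[i + 1:]


tables = list(set_partitions(["T1", "T2", "T3"]))
```

For three tasks this yields five partitions, matching D1 to D5 up to ordering. The count grows as the Bell numbers (15 for four tasks, 52 for five), so for larger task sets the number of tables would need to be capped.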
In this example, tasks are assigned to the processors Pn as follows, for example.

P1: (T1)
P2: (T2)
P3: (T3)
P4: (T1, T2)
P5: (T2, T3)
P6: (T3, T1)
P7: (T1, T2, T3)
As shown above, the same task is assigned to more than one processor.

Accordingly, processors are assigned to the distribution method tables Dn as follows.

D1 = {P1, P2, P3}
D2 = {P4, P3}
D3 = {P5, P1}
D4 = {P6, P2}
D5 = {P7}
As shown above, the same processor Pn is assigned to different distribution method tables Dn. For example, processor P1 is used by both D1 and D3, so the number of processors actually needed is smaller than the sum of the processors required by each table.
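The sharing of processors between tables can be sketched as giving each distinct task group one processor and letting every table that contains that group reuse it. The helper name `allocate_processors` and the first-seen numbering are assumptions for illustration; with the table order used here, that numbering happens to reproduce the listing above.

```python
def allocate_processors(tables):
    """Give each distinct task group one processor and map every table onto them.

    tables: dict mapping table name -> list of task groups (tuples of task names)
    Returns (group_to_processor, table_to_processors).
    """
    group_to_processor = {}
    table_to_processors = {}
    for name, groups in tables.items():
        procs = []
        for group in groups:
            key = frozenset(group)  # (T3, T1) and (T1, T3) are the same group
            if key not in group_to_processor:
                group_to_processor[key] = f"P{len(group_to_processor) + 1}"
            procs.append(group_to_processor[key])
        table_to_processors[name] = procs
    return group_to_processor, table_to_processors


tables = {
    "D1": [("T1",), ("T2",), ("T3",)],
    "D2": [("T1", "T2"), ("T3",)],
    "D3": [("T2", "T3"), ("T1",)],
    "D4": [("T3", "T1"), ("T2",)],
    "D5": [("T1", "T2", "T3")],
}
groups, mapping = allocate_processors(tables)
```

Here the five tables contain ten group slots in total, but only seven distinct groups, so seven processors suffice.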

First, as the processing start sequence, the data cache of each processor marks all of its cache lines invalid. Once the processors begin operating, on any processor that has been assigned two or more tasks the kernel switches tasks on a timer interrupt (not shown), cycling repeatedly through the assigned tasks.

While this continues, each processor notifies the arbitration processor and stops when it meets an appropriate condition for halting its task cycle. One such condition is, for example, that a task requests a message from a task other than those running on the current processor.

This operation is explained using the case above in which tasks T1 and T2 are assigned to processor P4. The OS on P4 knows from the start of execution that the tasks assigned to it are T1 and T2. So as soon as it determines that a message requested by task T1 must be obtained from task T3, P4 reports its stopped state to the arbitration processor through the OS and stops.

In this way, each processor notifies the arbitration processor and stops when, for example, it needs information from a task other than those running on it. The processors thus stop one after another.

There is also another stop condition. Because all cache lines of each processor's data cache are marked invalid in the initial state, lines become valid and fill up as reads and writes proceed, while cache write-back remains suspended. After every data cache line holds valid data, the processor continues to run until every line is dirty, that is, until the processor has written to every cache line. Once all lines are dirty, the first cache miss caused by a store instruction makes the processor stop without performing write-back and refill, and report its stopped state to the arbitration processor. (Operation continues when a data cache hit overwrites a dirty line again, or when a line is replaced while a valid but non-dirty cache line still remains.)
When the contents of a data cache must thus be written back to the shared memory, the arbitration processor is told the number of the processor whose data cache lines have all become dirty, and all processors are stopped.
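The cache-full stop condition can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware described here: the class name `SpeculativeCache` is hypothetical, the cache is modeled as fully associative, and a load miss with every line dirty is treated the same as a store miss, although the text only specifies the store case.

```python
class SpeculativeCache:
    """Toy data cache with write-back suspended (fully associative, assumed)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = {}  # line address -> dirty flag; an absent address is invalid

    def access(self, line_addr, is_store):
        """Return True if the processor may continue, False if it must stop."""
        if line_addr in self.lines:
            # Hit: overwriting an already dirty line is allowed.
            self.lines[line_addr] = self.lines[line_addr] or is_store
            return True
        if len(self.lines) < self.num_lines:
            # Miss, but an invalid line is still free.
            self.lines[line_addr] = is_store
            return True
        clean = [addr for addr, dirty in self.lines.items() if not dirty]
        if clean:
            # Miss with a valid but clean line available: replace it, no write-back needed.
            del self.lines[clean[0]]
            self.lines[line_addr] = is_store
            return True
        # Miss with every line dirty: stop and notify the arbitration processor.
        return False


cache = SpeculativeCache(num_lines=2)
results = [
    cache.access(0, is_store=True),   # fills a free line, dirty
    cache.access(1, is_store=False),  # fills a free line, clean
    cache.access(2, is_store=True),   # replaces the clean line
    cache.access(0, is_store=True),   # hit on a dirty line
    cache.access(3, is_store=True),   # miss with all lines dirty
]
```

In this run `results` is `[True, True, True, True, False]`: only the last access, a store miss with every line dirty, forces a stop.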

Note, as stated earlier under the problems to be solved by the invention, that assigning one task per processor, so that no task transitions occur, does not necessarily give the highest amount of data processed by each task before stopping. In this configuration, messages between tasks are held in the high-speed cache; if tasks running on the same processor can exchange messages within that cache, processing may be more efficient than message communication between tasks running on different processors.

For example, suppose processors Pn and tasks Tn are assigned as follows.

P1: (T1)
P2: (T2)
P4: (T1, T2)
Here P1 and P2 are fast in that they have no task transitions. But because they are attached to different caches, when message communication occurs between their tasks the data must be synchronized across the different processors and caches, which takes processing time. Moreover, when the number of processors is very large and the communication between tasks is complex, synchronizing the caches becomes complicated. On P4, by contrast, there are multiple tasks and task transitions occur, but for communication between the tasks on this processor the data is already consistent within the single cache, so no processing time is needed for data synchronization. Which combination is more efficient therefore depends on the case, including the operation being executed.

Each time a processor stops, the arbitration processor invalidates those distribution method tables Dn that include the stopped processor. The tasks and cache contents on the processors of the table that survives to the end are treated as valid; that is, at this point the task-control data in the surviving caches becomes the valid data. Then, once the processors have stopped, the valid cache contents are written back to the shared memory. If more than one table is still valid at the time of stopping, one may be chosen by any method, preferably the table that uses the fewest processors.
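The arbitration rule can be sketched as a simple elimination loop. This is one possible reading, with hypothetical names: the function `select_table` and the handling of the final stop (the tables still valid just before the last one would be eliminated are treated as the survivors) are assumptions, since the text does not spell out that corner case.

```python
def select_table(tables, stop_order):
    """Pick the distribution table that survives longest.

    tables:     dict mapping table name -> set of processor names
    stop_order: processor names in the order they reported a stop
    """
    alive = dict(tables)
    for proc in stop_order:
        remaining = {name: procs for name, procs in alive.items() if proc not in procs}
        if not remaining:
            break  # the tables still in `alive` survived the longest
        alive = remaining
    # Tie-break: prefer the table that uses the fewest processors.
    return min(alive, key=lambda name: len(alive[name]))


tables = {
    "D1": {"P1", "P2", "P3"},
    "D2": {"P4", "P3"},
    "D3": {"P5", "P1"},
    "D4": {"P6", "P2"},
    "D5": {"P7"},
}
winner = select_table(tables, ["P1", "P4", "P7", "P6"])
tie_winner = select_table(tables, ["P1", "P4"])
```

In the first call, P1 eliminates D1 and D3, P4 eliminates D2, and P7 eliminates D5, leaving D4 as `winner`. In the second call D4 and D5 are both still valid, and the tie-break selects D5 because it uses a single processor.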

On the other hand, as described above, when the data cache of some processor becomes entirely dirty and all processors are stopped, the distribution method tables that include that processor are treated as valid. For example, if the data cache of processor P3 becomes entirely dirty, P3 notifies the arbitration processor, and all processors stop, then the distribution method tables D1 and D2
D1 = {P1, P2, P3}
D2 = {P4, P3}
include processor P3, so the data caches of the processors of D1 or D2 are written back to the shared memory. In this embodiment, the data of processors P4 and P3 of distribution method table D2 is written back.
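For the cache-full case, the candidate tables are simply those that contain the reporting processor. A minimal sketch, with the hypothetical helper name `tables_for_full_cache`:

```python
def tables_for_full_cache(tables, full_processor):
    """Return the names of the tables that include the processor whose cache filled up."""
    return [name for name, procs in tables.items() if full_processor in procs]


tables = {
    "D1": ["P1", "P2", "P3"],
    "D2": ["P4", "P3"],
    "D3": ["P5", "P1"],
    "D4": ["P6", "P2"],
    "D5": ["P7"],
}
candidates = tables_for_full_cache(tables, "P3")
```

For P3 the candidates are D1 and D2. The text leaves the choice between them open; the embodiment picks D2 and writes back the caches of P4 and P3.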

After this, the procedure returns to the processing start sequence, marks all cache lines invalid, and repeats the same processing.

At this point the cache data of the multiple surviving processors is written out to the shared memory. Whether the data caches of those processors are mutually consistent is considered next. For example, suppose that for processors Pn and tasks Tn,
P3: (T3)
P4: (T1, T2)
remained valid to the end. The data caches of P3 and P4 are then written back to the shared memory. If cache lines in the two caches refer to the same address, more than one value is to be written to the shared memory without any exclusive control, and an inconsistency arises. When this happens, however, independent tasks (in this example, (T1, T2) and (T3)) have written to a common address while ignoring the communication procedure, so the access is either invalid or one for which no unique result is expected. For this case a mechanism (not shown) is connected that reports an error to the arbitration processor if, during write-back, the caches of different processors contain a line for the same address, so that the user can determine whether a programming error has occurred. If the user intends to write data to a common area, the tasks on different processors must exchange messages beforehand, in which case the control procedure of the present invention stops the processors normally. (As an exception, the following problem, unintended by the user, can also occur. Suppose 1-byte data items are used as global variables and the cache line is 32 bytes. Two different 1-byte global variables may then be placed in the same 32-byte cache line. If each is accessed independently by a different task, the cache-line data of the processors becomes inconsistent and the above problem can occur, even though the tasks treat the variables as independent data areas. To avoid this, global variables must be set at compile time to be aligned to the cache line size, for example 32 bytes.)
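The write-back check and the alignment remedy can be sketched in software. This is an illustration only: the error-reporting mechanism here is hardware that is not shown in the figure, and the function names and addresses below are hypothetical.

```python
LINE_SIZE = 32  # bytes per cache line, as in the example in the text


def line_of(address, line_size=LINE_SIZE):
    """Return the base address of the cache line containing `address`."""
    return address - (address % line_size)


def writeback_conflicts(dirty_addresses):
    """Find cache lines that are dirty in more than one processor's cache.

    dirty_addresses: dict mapping processor name -> iterable of written byte addresses
    Returns a dict mapping line base address -> sorted list of processors.
    """
    owners = {}
    for proc, addresses in dirty_addresses.items():
        for address in addresses:
            owners.setdefault(line_of(address), set()).add(proc)
    return {line: sorted(procs) for line, procs in owners.items() if len(procs) > 1}


def align_up(address, line_size=LINE_SIZE):
    """Round `address` up to the next cache-line boundary."""
    return (address + line_size - 1) // line_size * line_size


# Two 1-byte globals packed next to each other share one 32-byte line.
packed = writeback_conflicts({"P3": [0x1000], "P4": [0x1001]})
# Placing each global on its own line boundary removes the conflict.
aligned = writeback_conflicts({"P3": [align_up(0x1000)], "P4": [align_up(0x1001)]})
```

Here `packed` reports line `0x1000` as dirty in both P3 and P4 even though the two tasks wrote different bytes, while `aligned` is empty because the second variable has moved to `0x1020`.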
In the configuration example of the present invention, part of the shared memory is used as a ring buffer for input/output data, and the arbitration processor controls the ring buffer and the I/O controller to maintain data consistency.

In the embodiment above, distribution method tables were generated for all combinations in the three-task case, but the number of tables may be limited to an appropriate number according to the numbers of processors and tasks.

System LSI hardware configuration and multitask control method

Explanation of symbols

100 Bus
200 Shared memory
300 Portion implemented as the system LSI
400 DMA controller
500 Arbitration processor

Claims

1. A multitask control method for a multiprocessor-mounted system LSI, the system LSI having at least, internally, a multiprocessor, a data cache memory and an instruction cache memory connected to each of the processors P1, P2, P3, ..., a bus shared so as to be accessible from all processors, and a memory controller through which an external shared memory can be connected via the bus,
wherein each processor has means by which a user can control the timing of delayed writing and refill of the data cache, and wherein the method comprises, for multitask execution: means for preparing a plurality of candidate methods of distributing tasks T1, T2, T3, ... to the processors as distribution method tables D1, D2, D3, ...;
means for speculatively executing the same task on a plurality of processors by assigning processors to tasks using the tables; means for detecting the progress of transitions between tasks for each distribution method; and
means for writing back to the shared memory, based on the detection means, the contents of the distributed memory of the processors used for the processing of the distribution method that has progressed furthest.

2. The multitask control method for a multiprocessor-mounted system LSI according to claim 1, wherein, among the distribution method tables D1, D2, D3, ..., a processor is shared between distribution method tables when the tasks assigned to that processor are the same.