JPH08227405A

JPH08227405A - Parallel iterative execution method

Info

Publication number: JPH08227405A
Application number: JP3209995A
Authority: JP
Inventors: Teruo Tanaka; 輝雄田中; Yusaku Yamamoto; 有作山本; Yoshiaki Hojo; 芳章北城; Yoshiko Tamaoki; 由子玉置; Tadayuki Sakakibara; 忠幸榊原; Machiko Asaya; 真知子朝家; Yoshiko Yasuda; 淑子保田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-02-21
Filing date: 1995-02-21
Publication date: 1996-09-03

Abstract

(57)【要約】【目的】ＬＵ分解処理などの反復処理を並列計算機でよ
り短時間に実行可能にする。【構成】Ｋ回目の反復処理が、並列化不可能な処理Ｐ
（Ｋ）と並列化可能な処理Ｑ（Ｋ）とから構成され、処
理Ｑ（Ｋ）は、処理Ｐ（Ｋ）の結果を利用し、かつ、Ｋ
＋１回目の処理Ｐ（Ｋ＋１）はＫ回目の処理Ｑ1（Ｋ）
中の一部の処理Ｑs（Ｋ）の結果を利用する時、処理Ｑ
（Ｋ）を、プロセッサの台数 Pと同数の処理Ｑ1（Ｋ）
からＱP（Ｋ）に分解し、かつ、処理Ｑ1（Ｋ）には、処
理ＱS（Ｋ）を含ませる。処理Ｑ1（Ｋ）と処理Ｐ（Ｋ＋
Ｍ）を同一のプロセッサで順次実行し、これと並行し
て、他の処理Ｑ2（Ｋ）からＱn（Ｋ）を他のP-1台のプ
ロセッサで実行する。以後、全プロセッサの処理が終了
する毎に以上の処理を繰り返す。 (57) [Summary] [Purpose] To enable iterative processing such as LU decomposition processing to be executed in a shorter time on a parallel computer. [Structure] The K-th iteration process is a process P that cannot be parallelized.
(K) and a process Q (K) that can be parallelized, the process Q (K) uses the result of the process P (K), and
+ 1st processing P (K + 1) is Kth processing Q1 (K)
When using the results of some of the processing Qs (K), the processing Q
(K) is the same number of processes as the number P of processors Q1 (K)
To QP (K), and the processing Q1 (K) includes the processing QS (K). Process Q1 (K) and process P (K +
M) is sequentially executed by the same processor, and in parallel with this, other processes Q2 (K) to Qn (K) are executed by other P-1 processors. Thereafter, the above processing is repeated every time the processing of all the processors is completed.

Description

Detailed Description of the Invention

【０００１】[0001]

【産業上の利用分野】本発明は、Ｎ次元連立一次方程式
の係数行列を下三角行列と上三角行列に分解するＬＵ分
解処理のように、同じ処理を異なるデータに対して反復
して実行する反復処理を、複数のプロセッサを持つ並列
計算機で実行させる方法に関する。BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention repeatedly executes the same processing on different data, such as LU decomposition processing that decomposes a coefficient matrix of N-dimensional simultaneous linear equations into a lower triangular matrix and an upper triangular matrix. The present invention relates to a method of executing repetitive processing on a parallel computer having a plurality of processors.

【０００２】[0002]

【従来の技術】各種の工業用製品の開発に数値シミュレ
ーションが実用に供されている。このような数値シミュ
レーションでよく使用される処理は、Ｎ次元連立方程式
ＡＸ＝Ｂをガウスの消去法で解く処理である。ここで、
ＡはＮ行Ｎ列の係数行列、ＸとＢはＮ列のベクトルであ
る。このガウスの消去法により上記方程式を解く処理の
主要な部分は、係数行列Ａを下三角行列Ｌと上三角行列
Ｕの積である行列ＬＵに分解するＬＵ分解処理である。2. Description of the Related Art Numerical simulation has been put to practical use in the development of various industrial products. A process often used in such a numerical simulation is a process of solving the N-dimensional simultaneous equations AX = B by the Gaussian elimination method. here,
A is a coefficient matrix with N rows and N columns, and X and B are vectors with N columns. The main part of the process of solving the above equation by the Gaussian elimination method is the LU decomposition process that decomposes the coefficient matrix A into a matrix LU that is the product of the lower triangular matrix L and the upper triangular matrix U.

【０００３】大規模行列のＬＵ分解処理の実用化は進ん
でおり、特にスーパコンピュータ向けには、多段同時展
開法が一般的である。これらの手法は、文献：村田、小
国、唐木著スーパコンピュータ、丸善、９１〜１０３
頁、１９８５や文献：ブロッキングＬＵ分解法のＶＰ２
０００シリーズ向けチューニングについて、情報処理学
会第４２回（平成３年前期）全国大会予稿集第１分冊７
１〜７２頁、特開平４−２５９０６４および特開平６−
７５９８８などに開示されている。Practical application of LU decomposition processing of large-scale matrices is progressing, and the multistage simultaneous expansion method is generally used especially for supercomputers. These methods are described in literatures: Murata, Oguni, Karaki, Supercomputer, Maruzen, 91-103.
Page, 1985 and literature: Blocking LU decomposition method VP2
About tuning for the 000 series, Proceedings of the 42nd Annual Conference of the Information Processing Society of Japan (the first half of 1991), Volume 1 7
1-72, JP-A-4-259064 and JP-A-6-
75988 and the like.

【０００４】とくに最近は、並列計算機の進歩に伴い、
並列計算機を数値シミュレーションに応用する研究が盛
んである。Particularly recently, with the progress of parallel computers,
A lot of research is being conducted to apply parallel computers to numerical simulation.

【０００５】近年実用化されている並列計算機は、一般
的に次のような３種類の形態をとる。図１は、複数台の
プロセッサ２ないし４が主記憶１を共有する主記憶共有
型並列計算機である。これは、現在のベクトル型スーパ
コンピュータの主流であり、スーパコンピュータの場
合、各プロセッサ２ないし４は多数のデータをベクトル
データとしてまとめ、そのベクトルデータに同一の処理
をパイプライン演算器を使用して連続的に施すベクトル
プロセッサとなる。A parallel computer which has been put to practical use in recent years generally takes the following three types of forms. FIG. 1 is a main memory shared parallel computer in which a plurality of processors 2 to 4 share a main memory 1. This is the mainstream of current vector type supercomputers. In the case of supercomputers, each processor 2 to 4 collects a large number of data as vector data, and the same processing is performed on the vector data by using a pipeline arithmetic unit. It is a vector processor that is continuously applied.

【０００６】図２は、各々が演算処理装置１０と専用の
主記憶９を持つ独立したを計算機６ないし８である。こ
こで、計算機６ないし８をプロセッサと呼ぶ。これらの
プロセッサ６ないし８を相互結合網５で接続する。現在
は、各プロセッサはスーパスカラなどの高性能マイクロ
プロセッサとメモリで構成されることが多い。FIG. 2 shows independent computers 6 to 8 each having an arithmetic processing unit 10 and a dedicated main memory 9. Here, the computers 6 to 8 are called processors. These processors 6 to 8 are connected by an interconnection network 5. Currently, each processor is often composed of a high-performance microprocessor such as superscalar and a memory.

【０００７】図３は、主記憶１５および複数台のプロセ
ッサ１６ないし１７から構成される主記憶共有型並列計
算機１２ないし１４（図１と同様）をさらに相互結合網
１１で接続した構成を持つ。これは、たとえば、特開平
６−９６０３２で開示されている。図３の構成における
各々の主記憶共有型計算機１２ないし１４をクラスタと
呼ぶ。FIG. 3 has a configuration in which main memory shared parallel computers 12 to 14 (similar to FIG. 1) each composed of a main memory 15 and a plurality of processors 16 to 17 are further connected by an interconnection network 11. This is disclosed in, for example, Japanese Patent Laid-Open No. 6-96032. Each of the main memory shared computers 12 to 14 in the configuration of FIG. 3 is called a cluster.

【０００８】図２の形式の並列計算機でＬＵ分解処理を
実行させる方法の例は、文献：星野著ＰＡＸコンピュー
タ、オーム社、１９８５や特開平３−１０５６９４など
に開示されている。An example of a method for executing LU decomposition processing in a parallel computer of the type shown in FIG. 2 is disclosed in Hoshino PAX Computer, Ohmsha Co., Ltd., 1985, JP-A-3-105694 and the like.

【０００９】ＬＵ分解処理の概要および図１に示した並
列計算機でＬＵ分解処理を実行させる従来の方法を図
４、図５を用いて説明する。An outline of LU decomposition processing and a conventional method for executing LU decomposition processing by the parallel computer shown in FIG. 1 will be described with reference to FIGS. 4 and 5.

【００１０】図４は、行列Ａに対するＬＵ分解処理のＫ
回目のループ処理を説明するための図である。すなわ
ち、行列Ａの各要素ａ（ｉ，ｊ）（すなわち、第ｉ行第
ｊ列番目の要素）を、異なる処理を受ける要素群に分け
て、それらの要素群を領域の形で示している。図中、Ｎ
は行列サイズである。ここでは、第１行から第Ｋ−１行
および第１列から第Ｋ−１列までのＬＵ分解処理は終了
している。各反復回では、高速化のために、一度にＭ列
をまとめてＬＵ分解処理を行なういわゆる多段同時消去
方法が使用される。このためには、Ｍは複数であるが、
原理的にはＭは１でもよい。FIG. 4 shows K of the LU decomposition processing for the matrix A.
It is a figure for demonstrating the loop processing of the time. That is, each element a (i, j) of the matrix A (that is, the element at the i-th row and the j-th column) is divided into an element group subjected to different processing, and the element group is shown in the form of a region. . N in the figure
Is the matrix size. Here, the LU decomposition processing from the first row to the (K−1) th row and from the first column to the (K−1) th column is completed. In each iteration, a so-called multi-stage simultaneous erasing method in which M columns are grouped at once and LU decomposition processing is performed is used for speeding up. For this, M is plural,
In principle, M may be 1.

【００１１】以下の説明のため、行列Ａの第Ｋ列から第
Ｎ列ないし第Ｋ行から第Ｎ行の領域を５つに分け、それ
ぞれに次のような名称を付ける。For the following description, the area from the Kth column to the Nth column to the Kth row to the Nth row of the matrix A is divided into five areas, and the areas are given the following names.

【００１２】（１）Ｌ１（Ｋ）２２：第Ｋ列から第Ｋ＋
Ｍ−１列および第Ｋ行から第Ｋ＋Ｍ−１行で構成される
左下三角領域（２）Ｕ１（Ｋ）２１：第Ｋ列から第Ｋ＋Ｍ−１列およ
び第Ｋ行から第Ｋ＋Ｍ−１行で構成される右上三角領域（３）Ｌ２（Ｋ）２３：第Ｋ列から第Ｋ＋Ｍ−１列およ
び第Ｋ＋Ｍ行から第Ｎ行で構成される矩形領域（４）Ｕ２（Ｋ）２４：第Ｋ＋Ｍ列から第Ｎ列および第
Ｋ行から第Ｋ＋Ｍ−１行で構成される矩形領域（５）Ａ（Ｋ）２５：第Ｋ＋Ｍ列から第Ｎ列および第
Ｋ＋Ｍ行から第Ｎ行構成される矩形領域図５に、図１の並列計算機で実行可能な、従来技術によ
るＬＵ分解処理プログラムを示す。この図において、Ｌ
Ｕ分解プログラムは、２重のＤＯループで構成されてい
る。(1) L1 (K) 22: from the Kth column to the Kth +
Lower left triangular area composed of M-1 column and Kth to K + M-1th rows (2) U1 (K) 21: from Kth column to K + M-1th column and from Kth row to K + M-1th row Upper right triangular area configured (3) L2 (K) 23: rectangular area composed of Kth column to K + M-1th column and K + Mth row to Nth row (4) U2 (K) 24: Kth + Mth column To Nth column and Kth to K + M-1th rows (5) A (K) 25: K + Mth to Nth columns and K + Mth to Nth rows rectangular area diagram FIG. 5 shows an LU decomposition processing program according to the related art, which can be executed by the parallel computer of FIG. In this figure, L
The U decomposition program is composed of double DO loops.

【００１３】外側のループ２０は、Ｍ列づつまとめて処
理するために、行列のサイズＮをＭで割ったＭ／Ｎ回だ
け繰り返す。The outer loop 20 repeats M / N times, which is the size N of the matrix divided by M, in order to process M columns at a time.

【００１４】内側は２つＤＯループで構成され、それら
の処理は以下のとおりである。なお、外側ループ２０の
繰返し回数はＫとする。The inside is composed of two DO loops, and their processing is as follows. The number of times the outer loop 20 is repeated is K.

【００１５】まず、最初の内側ループ（ＤＯ１）の処
理では、第Ｋ列から第Ｋ＋Ｍ−１列番目まで、つまりＵ
１（Ｋ）２１、Ｌ１（Ｋ）２２およびＬ２（Ｋ）２３の
領域に対して、各列ごとに、部分軸選択（いわゆるピボ
ット処理）およびそれに対する消去更新処理を繰り返し
て行なう。これらの処理の概要は以下のとおりである。
最初の内側ループ（ＤＯ１）のループ繰返し回数はＫ
Ｍとする。但し、ＫＭはＫ、Ｍとは関係のない値であ
る。First, in the processing of the first inner loop (DO 1), from the Kth column to the K + M-1th column, that is, U
The partial axis selection (so-called pivot processing) and the erase update processing for the partial axis selection are repeated for each column in the areas of 1 (K) 21, L1 (K) 22 and L2 (K) 23. The outline of these processes is as follows.
The number of loop iterations of the first inner loop (DO 1) is K
Let M. However, KM is a value unrelated to K and M.

【００１６】（１）ピボット処理：ＫＭ列の第ＫＭ行か
ら第Ｎ行までの要素から最大値を検索し、検索した最大
値を含む行と第ＫＭ行との行交換を行なう。すなわち、
それらの行の第ＫＭ列から第Ｋ＋Ｍ−１列番目の要素を
交換する。(1) Pivot processing: The maximum value is retrieved from the elements from the KMth row to the Nth row of the KM column, and the row containing the retrieved maximum value and the KMth row are exchanged. That is,
Swap the elements from the KMth column to the K + M-1th column of those rows.

【００１７】（２）消去更新処理：第ＫＭ＋１列から第
Ｋ＋Ｍ−１列および第ＫＭ＋１行から第Ｎ行により構成
される矩形領域内の各要素ａ（ｉ,ｊ）に対して、消去
更新処理ａ（ｉ,ｊ）＝ａ（ｉ,ｊ）−ａ（ＫＭ,ｊ）×
ａ（ｉ,ＫＭ）を行なう。(2) Erase / update processing: Erase / update processing for each element a (i, j) in the rectangular area constituted by the KM + 1th column to the K + M−1th column and the KM + 1th row to the Nth row. a (i, j) = a (i, j) −a (KM, j) ×
a (i, KM).

【００１８】このＤＯループの処理を、以下では、処理
Ｐ（Ｋ）と呼ぶ。このＤＯループの処理では、最大値検
索という逐次処理を繰り返して行なう必要があるため、
このループ自体を並列化してもあまり性能が向上しな
い。Hereinafter, the processing of this DO loop will be referred to as processing P (K). In the processing of this DO loop, since it is necessary to repeat the sequential processing of maximum value search,
Even if the loop itself is parallelized, the performance does not improve so much.

【００１９】次に、第２の内側ＤＯループ（ＤＯ２）
では、領域Ｕ２（Ｋ）および領域Ａ（Ｋ）に対して、次
の処理を行なう。Next, the second inner DO loop (DO 2)
Then, the following processing is performed on the area U2 (K) and the area A (K).

【００２０】（１）領域Ｕ２（Ｋ）および領域Ａ（Ｋ）
の行交換：前述のピボット処理において行交換処理を施
したＭ組の行の第Ｋ＋Ｍ列から第Ｎ列の要素に対して、
行交換処理を行なう。(1) Area U2 (K) and area A (K)
Row exchange: For the elements from the K + Mth column to the Nth column of the M sets of rows subjected to the row exchange processing in the pivot processing described above,
Perform line exchange processing.

【００２１】（２）領域Ｕ１（Ｋ）、領域Ｌ１（Ｋ）を
用いた領域Ｕ２（Ｋ）の消去更新処理：この領域Ｕ２
（Ｋ）内の各要素ａ（ｉ,ｊ）に対して、消去更新処理
ａ（ｉ,ｊ）＝ａ（ｉ,ｊ）−ａ（ｋ,ｊ）×ａ（ｉ,ｋ）
を行なう。ここで、ｋをＫからＫ＋Ｍ−１に順次変化さ
せて、この計算を合計Ｍ回行なう。なお、ｉは、Ｋ＋１
からＫ＋Ｍ−１の一つを１取り、ｊはＫ＋ＭからＮの一
つの値を取る。したがって、この計算は、領域Ｕ１
（Ｋ）、Ｌ１（Ｋ）内の要素を使用して、領域Ｕ２
（Ｋ）内の要素を更新している。(2) Erase / update processing of area U2 (K) using area U1 (K) and area L1 (K): this area U2
For each element a (i, j) in (K), erase update processing a (i, j) = a (i, j) −a (k, j) × a (i, k)
Perform Here, k is sequentially changed from K to K + M-1, and this calculation is performed M times in total. Note that i is K + 1
To K + M−1, and j takes a value from K + M to N. Therefore, this calculation is based on
(K), L1 (K) using elements in region U2
The element in (K) is updated.

【００２２】（３）領域Ｕ２（Ｋ）、Ｌ２（Ｋ）を用い
た領域Ａ（Ｋ）内の要素の消去更新処理：消去更新処理
ａ（ｉ,ｊ）＝ａ（ｉ,ｊ）−ａ（ｋ,ｊ）×ａ（ｉ,ｋ）
を行なう。ここで、ｋをＫからＫ＋Ｍに順次変えて、こ
の計算を合計Ｍ回行なう。ここで、ｉはｋ＋ＭからＮの
一つであり、ｊはＫ＋ＭからＮの一つである。(3) Erase update process of elements in the region A (K) using the regions U2 (K) and L2 (K): erase update process a (i, j) = a (i, j) -a (K, j) × a (i, k)
Perform Here, k is sequentially changed from K to K + M, and this calculation is performed a total of M times. Here, i is one of k + M to N and j is one of K + M to N.

【００２３】この第２の内側ループの処理を、以下で
は、処理Ｑ（Ｋ）と呼ぶ。The process of the second inner loop will be referred to as process Q (K) below.

【００２４】これらの計算（１）から（３）は、領域Ｕ
２（Ｋ）、Ａ（Ｋ）の各列ごとに独立に計算可能であ
る。すなわち、各列毎に、上記処理（１）から（３）を
実行する処理が基本の処理であり、処理Ｑ（ｋ）は、こ
の単位処理を複数の列に対して集めたものとなってい
る。このため、このような各列毎の処理は、処理Ｑ
（Ｋ）を構成する、互いに並列に実行可能な部分処理と
言える。したがって、Ｑ（Ｋ）２８を列を単位に均等に
ｐ個にグループで分け、それぞれをＱ1（Ｋ）、Ｑ2
（Ｋ）、、、ＱP（Ｋ）とし、ｐ台のプロセッサに分割
して処理を行なうことにより、ｐ倍に近い台数効果を得
ることができる。These calculations (1) to (3) are for the area U
It is possible to calculate independently for each column of 2 (K) and A (K). That is, the process of executing the above processes (1) to (3) for each column is a basic process, and the process Q (k) is a collection of this unit process for a plurality of columns. There is. For this reason, the processing for each column is performed by the processing Q.
It can be said that the partial processes constituting (K) can be executed in parallel with each other. Therefore, the Q (K) 28 is evenly divided into p groups in units of columns, and each is divided into Q1 (K) and Q2.
(K), ..., QP (K), and dividing into p processors to perform the processing makes it possible to obtain a p-number effect close to p times.

【００２５】このようなＬＵ分解処理のフローチャート
は、図７の通りである。図７では、パラメタＫの初期値
設定３５に続き、処理Ｐ（Ｋ）３６の実行後、並列化可
能な処理Ｑ（Ｋ）の並列化３７を行ない、ｐ個の処理Ｑ
1（Ｋ）３８ないし処理Ｑp（Ｋ）４０に分け、ｐ台のプ
ロセッサでそれぞれの処理を分担し、並列に実行する。
このとき、並列化された処理Ｑ（Ｋ）のあとに、処理を
逐次にもどして、Ｋの更新処理４１と収束判定４２を行
ない、次のＫ＋Ｍ回目の反復ループである処理Ｐ（Ｋ＋
Ｍ）、さらに処理Ｑ（Ｋ＋Ｍ）を実行する。A flow chart of such LU decomposition processing is as shown in FIG. In FIG. 7, following the initial value setting 35 of the parameter K, after the process P (K) 36 is executed, the process Q (K) that can be parallelized is parallelized 37, and p processes Q are performed.
The processing is divided into 1 (K) 38 to processing Qp (K) 40, each processing is shared by p processors, and executed in parallel.
At this time, after the parallelized process Q (K), the process is sequentially returned to perform the K update process 41 and the convergence determination 42, and the process P (K +) that is the next K + Mth iteration loop.
M), and further process Q (K + M) is executed.

【００２６】このように処理Ｑ（Ｋ）を複数の並列実行
可能な処理に分けて実行するプログラムを得るには、最
適化コンパイラの自動並列化機構を使用してソースプロ
グラムの内の並列実行可能な部分を検出し、かつ、それ
らを自動的に並列実行可能なオブジェクトプログラムに
変換する方法あるいはソースプログラムに並列指示行を
追加し、これをコンパイラにより並列実行可能なオブジ
ェクトプログラムに変換する方法が取られる。図５に示
したプログラムは、ＤＯＡＬＬタイプの並列化指示行
２９を有する例である。このプログラムの場合、処理Ｑ
（Ｋ）２８の並列化は並列指示行２９により各反復回ご
とに毎回行なわれる。In order to obtain a program which divides the process Q (K) into a plurality of processes which can be executed in parallel, the automatic parallelization mechanism of the optimizing compiler can be used to execute parallel execution in the source program. The method is to detect these parts and automatically convert them into an object program that can be executed in parallel, or add a parallel directive line to the source program and convert this into an object program that can be executed in parallel by the compiler. To be The program shown in FIG. 5 is an example having a DO ALL type parallelization instruction line 29. In the case of this program, process Q
The parallelization of (K) 28 is performed by the parallel instruction line 29 every time each iteration.

【００２７】以上の説明から明かな通り、ＬＵ分解処理
の基本的な処理の流れは、図６に示すように、パラメタ
Ｋの初期値設定３０およびそのパラメータの更新処理３
３により制御し、処理Ｐ（Ｋ）３１と、処理Ｑ（Ｋ）３
２を指定した条件３４（収束条件あるいは反復回数な
ど）が成立するまで繰返し実行することである。図５の
プログラムは、この処理を、図７に示す手順にしたがっ
て図１の形式の並列計算機で実行可能にし、それでもっ
てＬＵ分解処理の高速化を図っている。As is clear from the above description, the basic processing flow of the LU decomposition processing is, as shown in FIG. 6, the initial value setting 30 of the parameter K and the updating processing 3 of the parameter.
3 and controls process P (K) 31 and process Q (K) 3.
It is repeatedly executed until the condition 34 (convergence condition or the number of iterations) specifying 2 is satisfied. The program of FIG. 5 enables this processing to be executed by the parallel computer of the format of FIG. 1 in accordance with the procedure shown in FIG. 7, and thereby speeds up the LU decomposition processing.

【００２８】数値シミュレーションでは、ＬＵ分解処理
に限らず、ほぼ同一の処理を、図６に示す手順にしたが
って繰り返す反復処理が多く見受けられる。その場合
は、Ｍは１でもよい。このような他の反復処理でも、処
理Ｐが本来逐次的にしか実行できないとしても、図７に
示す処理手順にしたがって処理Ｑを並列に実行すること
により、このような他の反復処理も高速に実行できる。In the numerical simulation, not only the LU decomposition process, but also many iterative processes in which almost the same process is repeated according to the procedure shown in FIG. In that case, M may be 1. Even in such another iterative process, even if the process P can be originally executed only sequentially, by performing the process Q in parallel according to the process procedure shown in FIG. 7, such other iterative process can be performed at high speed. I can do it.

【００２９】[0029]

【発明が解決しようとする課題】ＬＵ分解処理を図７に
示した手順で処理した場合、処理Ｑ（Ｋ）が並列計算機
を構成する複数のプロセッサにより並列に実行されるた
め、その処理の実行時間は、プロセッサ数が多いほど短
縮される。しかし、各反復回ごとに処理Ｐ（Ｋ）は並列
計算機を構成する一つのプロセッサにより実施されるた
め、この処理Ｐ（Ｋ）の実行時間が、プロセッサ数を多
くしても、ＬＵ分解処理全体の処理時間を短縮するのを
妨げている。When the LU decomposition process is processed by the procedure shown in FIG. 7, the process Q (K) is executed in parallel by a plurality of processors constituting the parallel computer, and therefore the process is executed. The time is reduced as the number of processors is increased. However, since the process P (K) is executed by one processor constituting a parallel computer at each iteration, the execution time of this process P (K) is large even if the number of processors is large. It prevents the processing time of the.

【００３０】いま仮に、対象とする行列のサイズＮを１
０００としたとき、上記処理Ｑ（Ｋ）と処理Ｐ（Ｋ）
は、それぞれＬＵ分解処理の総演算量の処理のおおよそ
９８％および２％程度の演算量を占める。しかし、処理
Ｐ（Ｋ）、処理Ｑ（Ｋ）の実行時の演算器の使用率が大
幅に異なる。Suppose now that the size N of the target matrix is 1
000, the above process Q (K) and process P (K)
Respectively occupy about 98% and 2% of the total calculation amount of the LU decomposition process. However, the usage rates of the arithmetic units during execution of the processing P (K) and the processing Q (K) are significantly different.

【００３１】たとえば、図１の並列計算機が４台のプロ
セッサからなり、各プロセッサがベクトルプロセッサの
場合、この処理Ｑ（Ｋ）は並列処理可能であり、この処
理のために異なるベクトルプロセッサ内のパイプライン
演算器も使用される上に、処理Ｑ（Ｋ）の計算処理は積
和計算のため、各ベクトルプロセッサ内のパイプライン
演算器の使用率が高い。一方、処理Ｐ（Ｋ）は、一台の
ベクトルプロセッサ内の演算器しか使用しない上に、こ
の処理の主たる部分は最大値検索処理であり、必要とす
るデータ量に対して相対的に演算量が少なく、この一つ
のベクトルプロセッサ内の一部のパイプライン演算器し
か使用せず、演算器の使用率は高くない。このため、上
述の４台のベクトルプロセッサからなる並列計算機に対
する推定例では、処理Ｑ（Ｋ）に対する処理Ｐ（Ｋ）の
演算器利用率は約１／１０程度と低い。このため、処理
Ｐ（Ｋ）と処理Ｑ（Ｋ）の処理時間の比は、約１０：１２（＝２：９８／（１０×４））となり、わずか２％の演算量の処理Ｐ（Ｋ）がＬＵ分解
処理全体の処理時間の約半分を占める。したがって、４
台のプロセッサを使用しているにもかかわらず最大２台
分の性能向上しか得られない。For example, when the parallel computer of FIG. 1 is composed of four processors and each processor is a vector processor, this processing Q (K) can be processed in parallel, and pipes in different vector processors are used for this processing. A line arithmetic unit is also used, and since the calculation processing of the process Q (K) is a product-sum calculation, the usage rate of the pipeline arithmetic unit in each vector processor is high. On the other hand, the process P (K) uses only the arithmetic unit in one vector processor, and the main part of this process is the maximum value search process, and the amount of calculation is relatively large with respect to the required data amount. However, only some of the pipeline arithmetic units in this one vector processor are used, and the utilization rate of the arithmetic units is not high. Therefore, in the estimation example for the parallel computer including the four vector processors described above, the operation unit utilization rate of the process P (K) with respect to the process Q (K) is as low as about 1/10. Therefore, the ratio of the processing times of the processing P (K) and the processing Q (K) is about 10:12 (= 2: 98 / (10 × 4)), and the processing P (K ) Occupies about half of the processing time of the entire LU decomposition processing. Therefore, 4
Despite the use of one processor, the maximum performance improvement is only two.

【００３２】また、図１のプロセッサ１等が、スーパー
スカラプロセッサからなり、かつプロセッサの総数が１
００台程度である場合の推定例では、処理Ｑ（Ｋ）に対
する処理Ｐ（Ｋ）の演算器利用率は約１／２である。こ
のため、処理Ｐ（Ｋ）と処理Ｑ（Ｋ）の処理時間の比
は、約４：１（＝２：９８／（２×１００））となり、わずか２％の演算量の処理Ｐ（Ｋ）が演算時間
の約８０％を占めることがわかる。したがって、１００
台のプロセッサを使用しているにもかかわらず１０台強
の性能向上しか得られない。The processor 1 shown in FIG. 1 is a superscalar processor and the total number of processors is 1.
In the estimation example in the case of about 00 units, the operation unit utilization rate of the process P (K) with respect to the process Q (K) is about 1/2. Therefore, the ratio of the processing times of the processing P (K) and the processing Q (K) is about 4: 1 (= 2: 98 / (2 × 100)), and the processing P (K ) Occupies about 80% of the calculation time. Therefore, 100
Despite using 10 processors, only 10 or more performance improvement can be obtained.

【００３３】したがって、処理Ｐ（Ｋ）の実行時間が、
ＬＵ分解処理の全体の処理時間に及ぼす影響をさらに減
じることが望まれる。Therefore, the execution time of the process P (K) is
It is desired to further reduce the effect of LU decomposition processing on the overall processing time.

【００３４】同じような問題は、ＬＵ分解処理に限ら
ず、処理Ｐ（Ｋ）で例示される第１の処理と、その処理
の実行後に実行すべき、処理Ｑ（Ｋ）で例示される並列
実行可能な第２の処理とを繰返するとともに、この繰返
しに当たっては、第２の処理の実行後に第１の処理を実
行するように第１、第２のの処理を実行するときに生じ
る。The same problem is not limited to the LU decomposition process, but the first process exemplified by the process P (K) and the parallel process exemplified by the process Q (K) to be executed after the execution of the process. The second process that can be executed is repeated, and this repetition occurs when the first and second processes are executed so that the first process is executed after the execution of the second process.

【００３５】言い替えると、従来の方法では、複数のプ
ロセッサの内、一部のプロセッサは処理をしないで、他
のプロセッサでの処理待ちの状態にある時間が長いこと
になる。In other words, in the conventional method, some of the plurality of processors do not perform processing, and the other processors wait for a long time.

【００３６】したがって、本発明の目的は、本来は逐次
処理されるべき第１の処理とその後に実行されるべき、
並列実行可能な第２の処理とを有し、第２の処理の完了
を待って再度第１、第２の処理を実行するように、第
１、第２の処理を反復して実行すべき反復処理を、並列
計算機で実行させる場合に、この反復処理全体の処理時
間を軽減することが可能な反復処理の並列実行方法を提
供することである。Therefore, the object of the present invention is to perform the first processing which should be sequentially processed and the subsequent processing,
And a second process that can be executed in parallel, and the first and second processes should be repeatedly executed so that the first and second processes are executed again after waiting for the completion of the second process. It is an object of the present invention to provide a parallel execution method of iterative processing capable of reducing the processing time of the entire iterative processing when the iterative processing is executed by a parallel computer.

【００３７】本発明の他の目的は、このような処理の実
行時に、プロセッサが待ち状態にある時間を軽減するこ
とである。Another object of the present invention is to reduce the time during which the processor is in a waiting state when executing such processing.

【００３８】本発明のより具体的な目的は、ＬＵ分解処
理を並列計算機で実行させる場合に、この処理全体の処
理時間を軽減することが可能なＬＵ分解処理の並列実行
方法を提供することである。A more specific object of the present invention is to provide a parallel execution method of LU decomposition processing which can reduce the processing time of the entire LU decomposition processing when it is executed by a parallel computer. is there.

【００３９】本発明の他のより具体的な目的は、このＬ
Ｕ分解処理を実行しているときのプロセッサが待ち状態
にある時間を減少させることである。Another more specific object of the present invention is this L
The goal is to reduce the amount of time the processor is in a wait state when performing U decomposition processing.

【００４０】[0040]

【課題を解決するためにの手段】上に説明したＬＵ分解
処理では、外側ループの繰返し回数Ｋのときに、処理Ｐ
（Ｋ）の結果を使用して実行される処理Ｑ（Ｋ）では、
領域Ｕ２（Ｋ）と領域Ａ（Ｋ）の要素の更新後の値が計
算される。これらの要素のうち、外側ループの次の繰返
し時に実行される処理Ｐ（Ｋ＋Ｍ）の計算に使用する要
素は、このＫ＋１回目における領域Ｕ１（Ｋ）、Ｌ１
（Ｋ）、Ｌ２（Ｋ）に属すべき要素のみである。すなわ
ち、図４において、第Ｋ＋Ｍ行から第Ｎ行と第Ｋ＋Ｍ列
から第Ｋ＋２＊Ｍ−１列により構成される領域２６内の
要素である。したがって、処理Ｐ（Ｋ＋Ｍ）は、この領
域２６内の要素に関する処理Ｑ（Ｋ）（これをＱS
（Ｋ）と呼ぶことにする）が終了後に実行を開始しなけ
ればならない。しかし、処理Ｐ（Ｋ＋Ｍ）は、領域Ａ
（Ｋ）の内、この領域２６以外の領域内の要素に対する
処理Ｑ（Ｋ）とは、並列に実行可能である。In the LU decomposition process described above, the process P is executed when the outer loop repeat count K is reached.
In the processing Q (K) executed using the result of (K),
The updated values of the elements of the area U2 (K) and the area A (K) are calculated. Among these elements, the elements used to calculate the process P (K + M) executed at the next iteration of the outer loop are the regions U1 (K) and L1 at the K + 1th time.
(K) and L2 (K) are the only elements that should belong. That is, in FIG. 4, it is an element in the region 26 constituted by the K + Mth row to the Nth row and the K + Mth column to the K + 2 * M−1th column. Therefore, the process P (K + M) is the process Q (K) (this is QS
(Let's call it (K)) must start execution after. However, the process P (K + M) is the area A
The processing Q (K) for the elements in the area other than the area 26 in (K) can be executed in parallel.

【００４１】本発明では、このことを利用して、本来は
逐次処理されるべき第１の処理（例えば、Ｐ（Ｋ）、以
下括弧内は例示である）とその後に実行されるべき、並
列実行可能な第２の処理（Ｑ（Ｋ））とを有し、第２の
処理の完了を待って再度第１、第２の処理を実行するよ
うに、第１、第２の処理を反復して実行すべき反復処理
を、並列計算機で実行させる場合に、ある繰返し（Ｋ回
目）で実行される第２の処理（Ｑ（Ｋ））の内、次の繰
返し（Ｋ＋Ｍ回目）で実行されるべき第１の処理（Ｐ
（Ｋ＋Ｍ））が使用する処理結果を生成する部分処理
（Ｑs（Ｋ））と、次の繰返し時の第１の処理（Ｐ（Ｋ
＋Ｍ））とを同じプロセッサに割り当て、この第１の処
理（Ｐ（Ｋ＋Ｍ））を、この部分処理（ＱS（Ｋ））に
続いてそのプロセッサで実行させる。第２の処理（Ｑ
（Ｋ））の内、この部分処理（ＱS（Ｋ））以外の部分
を、残りのプロセッサの数P−１に等しい数の、並列に
実行可能な処理（Ｑ2（Ｋ）からＱP（Ｋ））に分割し、
それらの残りのプロセッサにより並列に実行させる。In the present invention, by utilizing this fact, the first process (for example, P (K), which is shown below in parentheses is an example) which should originally be processed sequentially, and the parallel process which should be executed after that. It has a second process (Q (K)) that can be executed, and repeats the first and second processes so that the first and second processes are executed again after waiting for the completion of the second process. When the parallel computer executes the iterative process to be executed in the same manner, the second process (Q (K)) executed in a certain iteration (Kth) is executed in the next iteration (K + Mth). First process to be performed (P
(K + M)) generates a processing result used by (Ks (K)) and the first processing (P (K) at the time of the next iteration).
+ M)) is assigned to the same processor, and this first process (P (K + M)) is executed in that processor following this partial process (QS (K)). Second process (Q
(K)), a part other than this partial process (QS (K)) can be executed in parallel by a number equal to the number P-1 of the remaining processors (Q2 (K) to QP (K)). ),
Run in parallel by those remaining processors.

【００４２】この際、上記部分処理（Ｑs（Ｋ））と第
１の処理（Ｐ（Ｋ＋Ｍ））とを実行するプロセッサの処
理時間が、他のプロセッサのそれより小さいときには、
前者の処理時間が他のプロセッサのそれとほぼ同じにな
るように、第２の処理（Ｑ（Ｋ））の内、部分処理（Ｑ
S（Ｋ））以外の一部もこの部分処理（ＱS（Ｋ））を割
り当てたプロセッサに割り当てることがより望ましい。
すなわち、第２の処理処理（Ｑ（Ｋ））を実施するのに
必要な第１の処理（Ｐ（Ｋ））を、いずれかのプロセッ
サにより予め実行した後、上記のごとくにして各プロセ
ッサに割り当てるべき第２の処理（Ｑ（Ｋ））の一部を
決定し、決定された処理の割り当てに従って、各プロセ
ッサで、部分処理（ＱS（Ｋ）もしくはＱi（Ｋ）（ｉ＝
２からｐ−１））を実行させる。At this time, when the processing time of the processor for executing the partial processing (Qs (K)) and the first processing (P (K + M)) is shorter than that of the other processor,
The partial processing (Q (K)) of the second processing (Q (K)) is performed so that the former processing time becomes almost the same as that of the other processors.
It is more desirable to allocate a part other than S (K)) to the processor to which this partial process (QS (K)) is allocated.
That is, after the first process (P (K)) necessary for carrying out the second process process (Q (K)) is executed in advance by one of the processors, each processor is processed as described above. A part of the second process (Q (K)) to be assigned is determined, and each processor follows the partial process (QS (K) or Qi (K) (i =
2 to p-1)) is executed.

【００４３】部分処理（ＱS（Ｋ））を割り当てられた
プロセッサでは、この処理が終了すると、第１の処理
（Ｐ（Ｋ＋Ｍ））を実行させる。この部分処理（ＱS
（Ｋ）と第１の処理（Ｐ（Ｋ＋Ｍ））および他の部分処
理（Ｑ2（Ｋ）からＱP（Ｋ））の実行の終了を待って、
以上の処理を繰り返す。In the processor to which the partial process (QS (K)) is assigned, when this process is completed, the first process (P (K + M)) is executed. This partial processing (QS
(K) and the first process (P (K + M)) and other partial processes (Q2 (K) to QP (K)) are completed,
The above process is repeated.

【００４４】なお、本発明は、図２、図３の形式の並列
計算機でも実施可能である。The present invention can also be implemented in parallel computers of the types shown in FIGS.

【００４５】[0045]

【作用】次の反復回（（Ｋ＋Ｍ））の第１の処理（Ｐ
（Ｋ＋Ｍ））は、第２の処理（Ｑ（Ｋ））の内、その第
１の処理（Ｐ（Ｋ＋Ｍ））に先立って実行する必要があ
る部分処理（ＱS（Ｋ））以外の処理と並列に実行され
るので、並列計算機内のプロセッサが待ち状態にある時
間を軽減することが出来、計算機資源の有効な利用を図
ることが出来る。それにより、この反復処理全体の処理
時間を軽減することが可能になる。又、従来のごとく、
第１の処理（Ｐ（Ｋ＋Ｍ））を第２の処理（Ｑ（Ｋ））
と逐次に実行する従来の場合に比べて、この第１の処理
（Ｐ（Ｋ＋Ｍ））の実行時間が、繰返し処理全体に及ぼ
す影響を減少できる。なお、第１の処理（Ｐ（Ｋ＋
Ｍ））は、部分処理（ＱS（Ｋ））を割り当てられたプ
ロセッサ内で、この処理の後に実行されるので、正常な
結果を出力する。The first process (P) of the next iteration ((K + M))
(K + M)) is a process other than the partial process (QS (K)) that needs to be executed prior to the first process (P (K + M)) of the second process (Q (K)). Since the processes are executed in parallel, the time during which the processors in the parallel computer are in the waiting state can be reduced, and the effective use of computer resources can be achieved. This makes it possible to reduce the processing time of the entire iterative process. Also, as in the past,
The first process (P (K + M)) is replaced with the second process (Q (K))
As compared with the conventional case of sequentially executing the above, the influence of the execution time of the first process (P (K + M)) on the entire iterative process can be reduced. The first process (P (K +
M)) is executed after this processing in the processor to which the partial processing (QS (K)) is assigned, and thus outputs a normal result.

【００４６】[0046]

【実施例】以下、本発明に係る反復処理の並列実行方法
を図面に示したいくつかの実施例を参照してさらに詳細
に説明する。なお、以下においては、同じ参照番号は同
じものもしくは類似のものを表わすものとする。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS A parallel execution method of iterative processing according to the present invention will be described below in more detail with reference to some embodiments shown in the drawings. In the following, the same reference numbers represent the same or similar items.

【００４７】＜実施例１＞本実施例では、図１の主記憶
共有型の並列計算機を使用した、ＬＵ分解処理の並列実
行方法を示す。図８は本実施例で使用するプログラムの
例であり、図９はそのプログラムを図１の装置で実行す
る手順を示すフローチャートである。<Embodiment 1> This embodiment shows a parallel execution method of LU decomposition processing using the main memory sharing type parallel computer of FIG. FIG. 8 is an example of a program used in this embodiment, and FIG. 9 is a flowchart showing a procedure for executing the program in the apparatus of FIG.

【００４８】以下、本実施例を、図９のフローチャート
に基づいて、従来例で示した図７と異なる部分について
主に説明する。In the following, the present embodiment will be described mainly on the basis of the flow chart of FIG. 9 regarding the parts different from FIG. 7 shown in the conventional example.

【００４９】図５に関して説明した通り、ＬＵ分解処理
での次反復回の処理Ｐ（Ｋ＋Ｍ）は、図４に示す行列Ａ
において、点線で囲った領域２６に対して処理を実行す
る。したがって、前反復回の処理Ｑ（Ｋ）のうち、図４
の行列Ａ内の領域２６である第Ｋ＋Ｍ列から第Ｋ＋２＊
Ｍ−１列の更新処理ＱS（Ｋ）を処理Ｑ（Ｋ）から除い
た残りの処理（Ｑ（Ｋ）−ＱS（Ｋ））と処理Ｐ（Ｋ＋
Ｍ）は並列に実行可能である。処理Ｐ（Ｋ＋Ｍ）は処理
ＱS（Ｋ）を終了すればいつでも実行可能となる。した
がって、図９において、処理Ｑ（Ｋ）を並列化する処理
６３を実行し、処理ＱS（Ｋ）を含む処理Ｑ1´（Ｋ）６
４、処理Ｑ2´（Ｋ）６５ないし処理ＱP´（Ｋ）６６の
ｐ個に分割し、そのうち、処理Ｑ1´（Ｋ）６４を処理
Ｐ（Ｋ＋Ｍ）６７と同じプロセッサで、その処理より前
に実行することにより、処理Ｑ2´（Ｋ）６５ないし処
理ＱP´（Ｋ）６６と処理Ｐ（Ｋ＋Ｍ）とを並列に実行
することができるようになる。As described with reference to FIG. 5, the process P (K + M) in the next iteration of the LU decomposition process is the matrix A shown in FIG.
In, the processing is executed on the area 26 surrounded by the dotted line. Therefore, among the processing Q (K) of the previous iteration, FIG.
From the K + Mth column, which is the region 26 in the matrix A of K + 2 *
The remaining process (Q (K) -QS (K)) obtained by removing the update process QS (K) of the M-1 column from the process Q (K) and the process P (K +)
M) can be executed in parallel. The process P (K + M) can be executed any time after the process QS (K) is completed. Therefore, in FIG. 9, the process 63 for parallelizing the process Q (K) is executed, and the process Q1 ′ (K) 6 including the process QS (K) is performed.
4. The process Q2 '(K) 65 to the process QP' (K) 66 is divided into p parts, of which the process Q1 '(K) 64 is the same processor as the process P (K + M) 67, and before that process. By executing the process, the process Q2 '(K) 65 to the process QP' (K) 66 and the process P (K + M) can be executed in parallel.

【００５０】処理Ｐ（Ｋ＋Ｍ）６７をＫ回目の反復回で
実行するために、６１、６８および６９で制御される反
復ループに入る前に、Ｋ＝１のときの処理Ｐ（１）６０
を１度だけ実行する必要がある。In order to execute the process P (K + M) 67 in the Kth iteration, before entering the iteration loop controlled by 61, 68 and 69, the process P (1) 60 when K = 1.
Need only be executed once.

【００５１】処理Ｑ1´（Ｋ）６４、処理Ｑ2´（Ｋ）６
５ないし処理ＱP´（Ｋ）６６の分割は、処理Ｐ（Ｋ＋
Ｍ）６７を含めて、全体の処理が均等になるように分割
する必要がある。処理Ｑ（Ｋ）の分割は、各プロセッサ
に割り付ける列の数を変更することにより自由に扱うこ
とができる。ここでは、処理Ｑ1´（Ｋ）６４で扱う列
をＫ＋ＭからＫ１とし、処理Ｑ2´（Ｋ）６５で扱う列
をＫ１＋１からＫ２とし、処理ＱP´（Ｋ）６６で扱う
列をＫ（p-1）からＮとする。Process Q1 '(K) 64, process Q2' (K) 6
5 to processing QP '(K) 66 is divided into processing P (K +
It is necessary to divide so that the entire process including M) 67 is even. The division of the process Q (K) can be freely handled by changing the number of columns assigned to each processor. Here, the columns handled by the process Q1 ′ (K) 64 are K + M to K1, the columns handled by the process Q2 ′ (K) 65 are K1 + 1 to K2, and the columns handled by the process QP ′ (K) 66 are K (p- 1) to N.

【００５２】この処理の均等分割によるＫ１、Ｋ
２、、、Ｋ（p-1）の決定６２の手順を説明する。K1, K by the equal division of this processing
A procedure of determining 62 of 2, ..., K (p-1) will be described.

【００５３】まず、処理Ｐ（Ｋ＋Ｍ）に先だって実行す
る必要のある処理ＱS（Ｋ）は、最低、処理Ｑ1´（Ｋ）
６４に含テまれている必要がある。したがって、Ｋ１は
Ｋ＋２＊Ｍ−１以上となる（条件１）。今仮に、処理Ｐ
（Ｋ＋Ｍ）６７の演算量を処理Ｑ（Ｋ）における列の演
算量に換算した時の列数をProwとする。処理Ｑ1´
（Ｋ）６４と処理Ｐ（Ｋ＋Ｍ）６７の演算量の和を全体
の１／ｐにし、かつ条件１を満たすためには、Ｋ１の値
は次の式、First, the process QS (K) that needs to be executed prior to the process P (K + M) is at least the process Q1 '(K).
It must be included in 64. Therefore, K1 becomes K + 2 * M-1 or more (condition 1). Temporarily, process P
The number of columns when the calculation amount of (K + M) 67 is converted into the calculation amount of columns in the process Q (K) is Prow. Processing Q1 '
In order to make the sum of the operation amounts of (K) 64 and the process P (K + M) 67 1 / p of the whole and to satisfy the condition 1, the value of K1 is

【００５４】[0054]

【数１】Ｋ１=MAX（Ｋ+２Ｍ−１、Ｋ＋Ｍ＋（Ｎ−（Ｋ＋Ｍ）＋１−Prow）/p） (1) であらわすことができる。ここで、MAXは最大値を求め
る関数である。## EQU1 ## K1 = MAX (K + 2M-1, K + M + (N- (K + M) + 1-Prow) / p) (1) can be expressed. Here, MAX is a function for obtaining the maximum value.

【００５５】さらに、Ｋ２ないしＫ（p-1）は処理Ｑ
（Ｋ）から処理Ｑ1´（Ｋ）６４を除いた処理を、残り
のｐ−１台のプロセッサで均等に分割すれば良い。した
がって、Ｋ２ないしＫ（p-1）は次の式、Further, K2 to K (p-1) are processed Q
The process obtained by removing the process Q1 '(K) 64 from (K) may be equally divided by the remaining p-1 processors. Therefore, K2 to K (p-1) are given by

【００５６】[0056]

【数２】Ｋｉ=Ｋ１＋（ｉ−１）／（p−１）*（Ｎ-Ｋ１）＋１ (2) であらわすことができる。ここで、ＫｉはＫ２ないしＫ
（p-1）をあらわす。## EQU2 ## It can be expressed by Ki = K1 + (i-1) / (p-1) * (N-K1) +1 (2). Where Ki is K2 or K
Indicates (p-1).

【００５７】処理の均等分割によるＫ１、Ｋ２、、、Ｋ
（p-1）の決定処理６２は、各反復回ごとに行なえば効
率が良いが、対象とする行列は徐々に小さくなるので、
数回の反復回ごとに値を更新するようにすることもでき
る。K1, K2, ... K by equal division of processing
The determination process 62 of (p-1) is efficient if it is performed for each iteration, but the target matrix gradually becomes smaller.
It is also possible to update the value every few iterations.

【００５８】図８に示すプログラムは、プロセッサ台数
Pが４の場合である。The program shown in FIG. 8 is the number of processors
This is the case where P is 4.

【００５９】まず、Ｋ＝１のときの処理Ｐ（１）４５を
実行し、ＤＯ９ループ４６で指示された反復処理を実行
する。次に前述した式（式１）および（式２）により、
処理を均等に分割するためにＫ１、Ｋ２、Ｋ３を決定４
７する。First, the process P (1) 45 when K = 1 is executed, and the iterative process instructed by the DO9 loop 46 is executed. Next, according to the equations (Equation 1) and (Equation 2) described above,
Determine K1, K2, K3 to divide the process evenly 4
7

【００６０】プログラムの並列化記述は、ひとつのプロ
セッサのみに処理Ｐ（Ｋ＋Ｍ）５８を割り付けるので、
ＳＥＣＴＩＯＮタイプの並列化指示行４８ないし５３を
用いる。各プロセッサの処理は、各SECTON（）文で区切
られた範囲を実行する。In the parallel description of the program, since the process P (K + M) 58 is assigned to only one processor,
SECTION type parallelization instruction lines 48 to 53 are used. The processing of each processor executes the range delimited by each SECTON () statement.

【００６１】ＳＥＣＴＩＯＮ（1）４９では、処理Ｑ1´
（Ｋ）５４と処理Ｐ（Ｋ＋Ｍ）５８をこの順番に実行す
る。ここで、処理Ｑ1´（Ｋ）５４は行列Ａの第Ｋ＋Ｍ
列から第Ｋ１列までを担当する。ここで、Ｋ１は（式
１）を用いて求める。（式１）のＫ１の定義により、処
理Ｑ1´（Ｋ）５４は前述した処理ＱS（Ｋ）を含むこと
を保障することができる。In SECTION (1) 49, the process Q1 '
The (K) 54 and the process P (K + M) 58 are executed in this order. Here, the process Q1 ′ (K) 54 is the K + Mth matrix A
Responsible from column to column K1. Here, K1 is calculated using (Equation 1). The definition of K1 in (Equation 1) can guarantee that the processing Q1 '(K) 54 includes the processing QS (K) described above.

【００６２】ＳＥＣＴＩＯＮ（2）５０では、処理Ｑ2´
（Ｋ）５５を実行する。処理Ｑ2´（Ｋ）５５は行列Ａ
の第（Ｋ１＋１）列から第Ｋ２列までを担当する。ここ
で、Ｋ２は（式２）を用いて求める。In SECTION (2) 50, processing Q2 '
(K) Execute 55. Process Q2 '(K) 55 is matrix A
(K1 + 1) to K2 columns. Here, K2 is calculated using (Equation 2).

【００６３】ＳＥＣＴＩＯＮ（3）５１では、処理Ｑ3´
（Ｋ）５６を実行する。処理Ｑ3´（Ｋ）５６は行列Ａ
の第（Ｋ２＋１）列から第Ｋ３列までを担当する。ここ
で、Ｋ３も（式２）を用いて求める。In SECTION (3) 51, processing Q3 '
(K) 56 is executed. Process Q3 '(K) 56 is matrix A
(K2 + 1) to K3 columns. Here, K3 is also obtained using (Equation 2).

【００６４】ＳＥＣＴＩＯＮ（4）５２では、処理Ｑ4´
（Ｋ）５７を実行する。処理Ｑ4´（Ｋ）５７は行列Ａ
の第（Ｋ３＋１）列から第Ｎ列までを担当する。ここ
で、Ｋ３は（式２）を用いて求める。In SECTION (4) 52, processing Q4 '
(K) Execute 57. Process Q4 '(K) 57 is matrix A
Are in charge of the (K3 + 1) th column to the Nth column. Here, K3 is obtained using (Equation 2).

【００６５】図８に示したプログラムをコンパイルする
ことにより、図９に示した手順で実行されるオブジェク
トプログラムが生成される。このコンパイルにより、図
８に示す各部分に対応するオブジェクト部分が生成され
るとともに、これらのオブジェクト部分の実行を制御す
るために、並列化処理６３と同期処理６９を実行するコ
ードがこのオブジェクトプログラムに追加される。この
ような目的のためのコンパイラはすでに知られている。By compiling the program shown in FIG. 8, an object program executed in the procedure shown in FIG. 9 is generated. By this compilation, object parts corresponding to the parts shown in FIG. 8 are generated, and in order to control the execution of these object parts, the code for executing the parallelization process 63 and the synchronization process 69 is stored in this object program. Is added. Compilers for such purposes are already known.

【００６６】この実施例により、処理Ｐを実質上、並列
処理の中に組み込むことが可能となるために、処理Ｐの
実行時間が実質的に（１−１／ｐ）に短縮される。この
結果ＬＵ分解処理全体の処理時間が短縮される。例え
ば、図１の形式の並列計算機が４台のプロセッサを有
し、かつ、各プロセッサがベクトルプロセッサの場合、
行列のサイズＮを１０００で、処理Ｐと処理Ｑの実行時
間がほぼ同程度の場合は、本実施例を用いることによ
り、ＬＵ分解処理の実行時間を最大３５％程度低減する
ことができる。According to this embodiment, since the process P can be incorporated into the parallel process substantially, the execution time of the process P is substantially shortened to (1-1 / p). As a result, the processing time of the entire LU decomposition processing is shortened. For example, if the parallel computer of the type shown in FIG. 1 has four processors and each processor is a vector processor,
When the size N of the matrix is 1000 and the execution times of the process P and the process Q are almost the same, the execution time of the LU decomposition process can be reduced by up to 35% by using this embodiment.

【００６７】＜実施例２＞本実施例は、図１の主記憶共
有型の並列計算機を使用した、、ＬＵ分解処理の並列実
行方法の他の例を提供する。図１０は本実施例で使用す
るプログラムの例であり、図１１はこのプログラムを図
１の装置で実行させるときの実行手順を示すフローチャ
ートである。<Embodiment 2> This embodiment provides another example of a parallel execution method of LU decomposition processing using the main memory sharing type parallel computer of FIG. FIG. 10 is an example of a program used in this embodiment, and FIG. 11 is a flowchart showing an execution procedure when the apparatus of FIG. 1 executes this program.

【００６８】実施例１においては、図９で示したごと
く、各反復ごとに毎回並列化処理６３を行なうことによ
り、処理Ｐ（Ｋ＋Ｍ）と次の反復回の処理Ｑ1´（Ｋ＋
Ｍ）ないし処理ＱP´（Ｋ＋Ｍ）との順序関係を保証し
ている。しかし、毎回行なう並列化処理６３はすべての
プロセッサ間の同期を取るために性能低下の要因となる
可能性がある。本実施例では、きめ細かくＰｏｓｔ処理
やＷａｉｔ処理を用いた同期処理を用いて順序保障処理
を行ない、並列化処理を制御変数Ｋの反復ループの外側
に移動させる方法を説明する。ここで用いるＰｏｓｔ処
理やＷａｉｔ処理は、ＯＳの発行するＰｏｓｔ／Ｗａｉ
ｔマクロではなく、コンパイラの並列制御ライブラリの
ひとつであるＰｏｓｔ／Ｗａｉｔライブラリであり、コ
ストが小さい処理である。In the first embodiment, as shown in FIG. 9, the parallelization processing 63 is performed every iteration so that the processing P (K + M) and the processing Q1 '(K +) of the next iteration are performed.
M) or processing QP ′ (K + M) is guaranteed to be in order. However, the parallelization process 63 that is performed each time may cause performance degradation because all processors are synchronized. In the present embodiment, a method will be described in which order synchronization processing is finely performed using synchronous processing using Post processing or Wait processing and the parallelization processing is moved to the outside of the iterative loop of the control variable K. The Post process and Wait process used here are Post / Wai issued by the OS.
It is not a t macro but a Post / Wait library which is one of the parallel control libraries of the compiler, and is a process with a low cost.

【００６９】この処理手順を図１１を用いて図９と比較
して説明する。まず、図９と同様に、処理Ｐ（１）１０
１を実行する。次に、制御変数Ｋにより指示される反復
ループの外側に並列化処理１０２を移動する。このた
め、処理の均等分割処理１０３ないし１０５、反復ルー
プの制御変数Ｋの更新処理１１７ないし１１９および反
復ループの判定処理１２０ないし１２２は、並列化され
た後、各プロセッサごとに実行させる。さらに、他のプ
ロセッサの次の反復回の処理との順序関係を保障するた
めに、プロセッサ間でＰｏｓｔ処理およびＷａｉｔ処理
を実行する必要がある。This processing procedure will be described with reference to FIG. 11 in comparison with FIG. First, as in FIG. 9, process P (1) 10
Execute 1. Next, the parallelization process 102 is moved outside the iterative loop indicated by the control variable K. For this reason, the equal division processes 103 to 105, the iterative loop control variable K updating processes 117 to 119, and the iterative loop determination processes 120 to 122 are executed in parallel for each processor. Further, in order to guarantee the order relation with the processing of the next iteration of another processor, it is necessary to execute the Post processing and the Wait processing between the processors.

【００７０】Ｐｏｓｔ処理およびＷａｉｔ処理の設定す
べき位置について説明する。基本的な方針はＰｏｓｔ処
理をできるかぎり早く実行し、Ｗａｉｔ処理をできるか
ぎり遅く実行することである。このことにより、各プロ
セッサの実行の独立性を高め、動的な要因（たとえば、
主記憶アクセス競合など）によるＷａｉｔ処理時の不要
な同期待ちを軽減することができる。The positions to be set in the Post process and Wait process will be described. The basic policy is to execute the Post process as soon as possible and the Wait process as late as possible. This increases the independence of execution of each processor and allows dynamic factors (eg,
It is possible to reduce unnecessary synchronization waiting at the time of Wait processing due to main memory access competition.

【００７１】まず、処理Ｐ（Ｋ＋Ｍ）１０７を担当する
プロセッサの実行順序を説明する。このプロセッサで実
行する消去更新処理Ｑ1´（Ｋ）を処理ＱS（Ｋ）１０６
と処理（Ｑ1´（Ｋ）−ＱS（Ｋ））１０８の２つに分解
する。ここで、処理ＱS（Ｋ）１０６は、処理Ｐ（Ｋ＋
Ｍ）１０７に先だって行なう行列Ａ内の第Ｋ＋Ｍ行から
第Ｋ＋２＊Ｍ−１行の更新処理である。処理（Ｑ1´
（Ｋ）−ＱS（Ｋ））１０８は、このプロセッサが担当
する処理Ｑ（Ｋ）に関する残りの処理である。処理Ｐ
（Ｋ＋Ｍ）１０７を実行することにより、他のプロセッ
サの次の反復ループの処理Ｑ2´（Ｋ＋Ｍ）１０９ない
し処理ＱP´（Ｋ＋Ｍ）１１０を実行することができる
ので、ここで、Ｐｏｓｔ処理１１１およびＷａｉｔ処理
１１５ないし１１６を実行する。ここで、Ｐｏｓｔ処理
１１１は他のｐ−１台のプロセッサのＷａｉｔ処理と対
応することとする。次に、処理（Ｑ1´（Ｋ）−Ｑ1S
（Ｋ））１０８を実行する。次反復回の処理Ｐ（Ｋ＋２
＊Ｍ）を実行するためには、その前に他のプロセッサの
処理Ｑ2´（Ｋ）１０９ないし処理ＱP´（Ｋ）１１０を
実行する必要があるので、ここで、Ｐｏｓｔ処理１１３
ないし１１４およびＷａｉｔ処理１１２を実行する。こ
こで、Ｗａｉｔ処理１１２はｐ−１台のプロセッサから
のＰｏｓｔ処理をすべて受けて成立することとする。First, the execution order of the processor in charge of the process P (K + M) 107 will be described. The erase update process Q1 '(K) executed by this processor is processed QS (K) 106.
And processing (Q1 '(K) -QS (K)) 108. Here, the process QS (K) 106 is the process P (K +
M) 107, which is an updating process from the K + Mth row to the K + 2 * M−1th row in the matrix A, which is performed before the M) 107. Processing (Q1 '
(K) -QS (K)) 108 is the remaining processing relating to the processing Q (K) that this processor is in charge of. Process P
By executing (K + M) 107, it is possible to execute the processing Q2 '(K + M) 109 to the processing QP' (K + M) 110 of the next iterative loop of the other processor. Therefore, here, the Post processing 111 and the Wait processing are executed. The processes 115 to 116 are executed. Here, the Post process 111 corresponds to the Wait process of the other p-1 processors. Next, process (Q1 '(K) -Q1S
(K)) 108 is executed. Processing of the next iteration P (K + 2
In order to execute * M), it is necessary to execute the processing Q2 ′ (K) 109 to the processing QP ′ (K) 110 of another processor before that. Therefore, here, the Post processing 113 is performed.
Through 114 and Wait process 112. Here, the Wait processing 112 is established by receiving all Post processing from p-1 processors.

【００７２】このように、処理Ｑ1´（Ｋ）を処理ＱS
（Ｋ）１０６と処理（Ｑ1´（Ｋ）−Ｑ1S（Ｋ））１０
８の２つに分解することにより、Ｐｏｓｔ処理１１１は
処理Ｐ（Ｋ＋Ｍ）１０７の直後に実行可能となり、処理
（Ｑ1´（Ｋ）−Ｑ1S（Ｋ））１０８の時間だけｐｏｓ
ｔ処理１１１を他のプロセッサのＷａｉｔ処理１１５な
いし１１６より早く実行することが可能となる。In this way, the processing Q1 '(K) is processed QS
(K) 106 and processing (Q1 '(K) -Q1S (K)) 10
By decomposing it into two, the Post process 111 can be executed immediately after the process P (K + M) 107, and the post process 111 is pos only for the time of the process (Q1 ′ (K) −Q1S (K))
The t process 111 can be executed earlier than the Wait processes 115 to 116 of other processors.

【００７３】図１０に示すプログラムでは、ＳＥＣＴＩ
ＯＮタイプの並列化７０ないし７５は１回だけ行なう。
各ＳＥＣＴＩＯＮ（）文とＳＥＣＴＩＯＮ（）文の間に
各プロセッサで各々実行する制御変数Ｋの反復ループ７
６ないし７９がある。それぞれの反復ループの中に各演
算処理８０ないし８９がある。複数のＰｏｓｔ処理とＷ
ａｉｔ処理を対応させるために、ここでは、イベント変
数を引数とする。複数のイベント変数を持つＰｏｓｔ処
理９０は、設定した数のイベントを発行し、Ｗａｉｔ処
理９５ないし９７と対応する。複数のイベント変数を持
つＷａｉｔ処理９４は、設定した数のイベントを対応す
るＰｏｓｔ処理９１ないし９３から受けることにより成
立する。In the program shown in FIG. 10, the SECTI
The ON type parallelization 70 to 75 is performed only once.
Iterative loop 7 of control variable K executed by each processor between each SECTION () statement and each SECTION () statement
There are 6 to 79. Within each iterative loop is each operation 80-89. Multiple Post processes and W
In order to correspond to the ait process, an event variable is used as an argument here. The Post process 90 having a plurality of event variables issues the set number of events and corresponds to the Wait processes 95 to 97. The Wait process 94 having a plurality of event variables is established by receiving the set number of events from the corresponding Post processes 91 to 93.

【００７４】図１０に示したプログラムをコンパイルす
ることにより、図１１に示した手順で実行されるオブジ
ェクトプログラムが生成される。このことは、実施例１
の場合と同様である。By compiling the program shown in FIG. 10, an object program executed in the procedure shown in FIG. 11 is generated. This is the same as Example 1
Is the same as

【００７５】＜実施例３＞本実施例は、図２の主記憶分
散型の並列計算機を使用した、ＬＵ分解処理の並列実行
方法を提供する。図１３は本実施例で使用するプログラ
ムの例であり、図１４はこのプログラムを図２の装置で
実行させるときの実行手順を示すフローチャートであ
る。<Embodiment 3> This embodiment provides a parallel execution method of LU decomposition processing using the main memory distributed parallel computer of FIG. FIG. 13 is an example of a program used in this embodiment, and FIG. 14 is a flow chart showing an execution procedure when the apparatus of FIG. 2 executes this program.

【００７６】このような並列計算機では、図２に示すよ
うに、相互結合網５を介して、プロセッサ６ないし８間
でのデータの送受信処理が必要となる。このデータの送
受信処理を用いて順序保証を行なう。以下では、メッセ
ージ交換型の並列計算機を例に説明する。In such a parallel computer, as shown in FIG. 2, data transmission / reception processing is required between the processors 6 to 8 via the mutual connection network 5. The order is assured by using this data transmission / reception process. In the following, a message exchange type parallel computer will be described as an example.

【００７７】分散記憶型の並列計算機では、まず、デー
タを分割配置する必要がある。ここでは、図１２に示す
ように、行列Ａを多段展開の幅Ｍごとに行を分割して、
順次各プロセッサに分割する。このような分割では、各
プロセッサが、順番にピボット処理Ｐを担当することに
なる。In a distributed storage type parallel computer, it is necessary to divide and arrange data first. Here, as shown in FIG. 12, the matrix A is divided into rows for each width M of multistage expansion,
Divide into each processor sequentially. In such division, each processor sequentially takes charge of the pivot process P.

【００７８】この処理手順を図１４を用いて説明する。
図１１と比較して、図１４で異なっている部分は、
（１）同期処理のＰｏｓｔ／Ｗａｉｔ処理の代わりに、
データの送信／受信処理を伴うＳｅｎｄ処理およびＲｅ
ｃｅｉｖｅ処理を用いること、（２）図１４の実施例で
は、つねに一つのプロセッサがピボット処理Ｐを担当し
ていたのに対して、実施例３では、負荷をできるだけ均
等にするために、各プロセッサが交互にピボット処理を
担当することである。図１４では、２台のプロセッサ
（ＰＥａとＰＥｂ）で処理を実行する場合を示す。もち
ろん、３台以上のプロセッサの場合も同様である。処理
Ｑ（Ｋ）は、プロセッサ数である２つに均等に分割され
る。それぞれのプロセッサの担当する処理を処理Ｑa
（Ｋ）および処理Ｑb（Ｋ）とする。This processing procedure will be described with reference to FIG.
14 differs from FIG. 11 in that
(1) Instead of the Post / Wait process of the synchronous process,
Send processing and Re with data transmission / reception processing
(2) In the embodiment of FIG. 14, one processor is always in charge of the pivot processing P, whereas in the third embodiment, in order to make the load as uniform as possible, Alternate to take charge of the pivoting process. FIG. 14 shows a case where processing is executed by two processors (PEa and PEb). Of course, the same applies to the case of three or more processors. The process Q (K) is evenly divided into two, which is the number of processors. Process Qa performed by each processor
(K) and processing Qb (K).

【００７９】このとき、もし、データを送信するＳｅｎ
ｄ処理とデータを受信するＲｅｃｅｉｖｅ処理が非同期
に実行可能な場合は、Ｓｅｎｄ処理を可能なかぎり先行
させて、Ｒｅｃｅｉｖｅ処理を同様に可能なかぎり遅ら
せて実行する必要がある。これは、図１１のＰｏｓｔ／
Ｗａｉｔ処理を用いた時のスケジューリング方針と同様
であり、処理Ｑa（Ｋ）を処理Ｑs（Ｋ）１７０と処理
（Ｑa（Ｋ）−Ｑs（Ｋ））１７２とに分解することによ
り、実現可能である。At this time, if the data to be sent is Sen
When the d process and the Receive process for receiving the data can be executed asynchronously, it is necessary to precede the Send process as much as possible and similarly execute the Receive process as late as possible. This is Post / in FIG.
It is similar to the scheduling policy when the Wait process is used, and can be realized by decomposing the process Qa (K) into the process Qs (K) 170 and the process (Qa (K) -Qs (K)) 172. is there.

【００８０】まず、それぞれのプロセッサでは、１６
０、１６１において自プロセッサがどちらのプロセッサ
であるかを判断する。これは、一般にたとえばＵＮＩＸ
コマンドのWHOAMIなどのような判別ライブラリを事前に
用意しておく。ここでは、それぞれ自プロセッサである
と判断した１６２および１６３とする。First, in each processor, 16
At 0 and 161, it is determined which processor the own processor is. This is generally for example UNIX
Prepare a discrimination library such as WHOAMI for commands in advance. Here, it is assumed that the respective processors are 162 and 163 which are determined to be their own processors.

【００８１】次に、プロセッサＰＥａで、処理Ｐ（１）
１６６を行ない、生成した領域Ｌ２（図４）のデータを
プロセッサＰＥｂに送信するためにＳｅｎｄ処理１６７
を行なう。Next, in the processor PEa, the process P (1)
166, and Send processing 167 is performed to transmit the data in the generated area L2 (FIG. 4) to the processor PEb.
Perform

【００８２】一方、プロセッサＰＥｂは、このデータを
受信するＲｅｃｅｉｖｅ処理１６８を実行する。以下、
各プロセッサはそれぞれ制御変数Ｋの反復ループに入
る。On the other hand, the processor PEb executes the Receive process 168 for receiving this data. Less than,
Each processor enters an iterative loop of control variables K, respectively.

【００８３】まず、プロセッサＰＥｂは、処理ＱS
（Ｋ）１７３を実行し、引き続き、処理Ｐ（Ｋ＋Ｍ）１
７４を実行する。これにより、Ｋ＋Ｍ回目の消去処理に
用いる領域Ｌ２の生成が可能となるために、ただちにプ
ロセッサＰＥａにＳrｅｎｄ処理１７８を実行する。そ
の後、残りの処理（Ｑb（Ｋ）−Ｑs（Ｋ））１７５を実
行し、次反復回であるＫ＋Ｍステップに手順を進める。First, the processor PEb processes QS
(K) 173 is executed, and then process P (K + M) 1
Execute 74. As a result, the region L2 used for the K + Mth erasing process can be generated, so the Send process 178 is immediately executed on the processor PEa. After that, the remaining processing (Qb (K) -Qs (K)) 175 is executed, and the procedure is advanced to the K + M step which is the next iteration.

【００８４】一方、プロセッサＰＥａでは、まず処理Ｑ
a（Ｋ）１６９で担当するＱの領域の更新処理を行な
い、その後に、Ｋ＋Ｍステップに必要なＬ２のデータを
Ｒｅｃｅｉｖｅ処理１７７で受け取る。このとき、プロ
セッサＰＥｂ側のＳｅｎｄ処理１７８は可能なかぎり早
く実行されているため、Ｒｅｃｅｉｖｅ処理１７７での
受信待ちはあまりなく、すぐにデータ受信が可能とな
る。On the other hand, in the processor PEa, first the processing Q
The a (K) 169 updates the area of Q in charge, and then receives the L2 data necessary for the K + M step in the Receive processing 177. At this time, since the Send processing 178 on the processor PEb side is executed as soon as possible, there is not much waiting for reception in the Receive processing 177, and data can be received immediately.

【００８５】次にＫ＋Ｍステップに進む。今度は、プロ
セッサＰＥａ側がピボット処理Ｐ（Ｋ＋２＊Ｍ）１７１
を担当する。以下の処理は、プロセッサＰＥｂのＫステ
ップの手順と同等である。他方のプロセッサＰＥｂでも
Ｋ＋Ｍステップの処理は、プロセッサＰＥａのＫステッ
プの処理と同様に行なう。Next, the process proceeds to the K + M step. This time, the processor PEa side is the pivot processing P (K + 2 * M) 171
In charge of. The following processing is equivalent to the procedure of the K step of the processor PEb. The processing of K + M steps is also performed in the other processor PEb in the same manner as the processing of K steps in the processor PEa.

【００８６】まず、処理ＱS（Ｋ＋Ｍ）１７０を実行
し、引き続き、処理Ｐ（Ｋ＋２＊Ｍ）１７１を実行す
る。これにより、Ｋ＋２＊Ｍ回目の消去処理に用いる領
域Ｌ２の生成が可能となるために、ただちにプロセッサ
ＰＥｂにＳｅｎｄ処理１７９を実行する。その後、残り
の処理（Ｑa（Ｋ＋Ｍ）−Ｑs（Ｋ＋Ｍ））１７２を実行
し、次反復回であるＫ＋２＊Ｍステップに手順を進める
ために、制御変数Ｋの更新処理１８１、判定処理１８３
を行なう。First, the process QS (K + M) 170 is executed, and subsequently the process P (K + 2 * M) 171 is executed. As a result, the area L2 used for the K + 2 * Mth erase processing can be generated, and therefore the Send processing 179 is immediately executed on the processor PEb. After that, the remaining processing (Qa (K + M) -Qs (K + M)) 172 is executed, and in order to advance the procedure to the next iteration, K + 2 * M steps, the control variable K update processing 181 and the determination processing 183.
Perform

【００８７】一方、プロセッサＰＥｂでは、まず処理Ｑ
b（Ｋ）１７６で担当するＱの領域の更新処理を行な
い、その後に、Ｋ＋２＊Ｍステップに必要なＬ２のデー
タをＲｅｃｅｉｖｅ処理１８０で受け取る。このとき、
プロセッサＰＥａ側のＳｅｎｄ処理１７９は可能なかぎ
り早く実行されているため、Ｒｅｃｅｉｖｅ処理１８０
での受信待ちはあまりなく、すぐにデータ受信が可能と
なる。その後、制御変数Ｋの更新処理１８２、判定処理
１８４を行なう。On the other hand, in the processor PEb, first the processing Q
The area of Q in charge is updated by the b (K) 176, and then the Receive processing 180 receives the L2 data necessary for K + 2 * M steps. At this time,
Since the Send processing 179 on the processor PEa side is executed as soon as possible, the Receive processing 180
There is not much waiting for reception, and data can be received immediately. After that, the control variable K update processing 182 and the determination processing 184 are performed.

【００８８】図１３に示すプログラムでは、まず１３０
では自プロセッサがどちらのプロセッサであるかを判断
する。これは、一般にたとえばＵＮＩＸコマンドのWHOA
MIなどのような判別ライブラリを事前に用意しておく。
それぞれのプロセッサの処理は、条件文１３１ないし１
３３で判別される。図１３においては、１３６以下１３
２の直前までがプロセッサＰＥａの処理で、１４８以下
１３３の直前までがプロセッサＰＥｂの処理である。In the program shown in FIG. 13, first, 130
Then, it determines which processor is its own processor. This is typically the case with the UNIX command WHOA, for example.
Prepare a discrimination library such as MI in advance.
The processing of each processor is performed by conditional statements 131 to 1
It is determined at 33. In FIG. 13, 136 or less 13
The process up to immediately before 2 is the process of the processor PEa, and the process up to just before 148 and 133 is the process of the processor PEb.

【００８９】各プロセッサで交互にピボット処理を担当
するために、まず、１回だけ、プロセッサＰＥａで、処
理Ｐ（１）１３６を行ない、その結果をＳＥＮＤ処理１
４５で送信し、プロセッサＰＥｂにおいてＲＥＣＥＩＶ
Ｅ処理１４８で受信する。その後、各プロセッサは、反
復ループに入る。各反復ループＤＯ９１およびＤＯ９２
は制御変数Ｋに関して、２回分の処理を行なっている。
そのために、ＤＯ９１およびＤＯ９２ループの間隔の指
定が、図１３の１３４、１３５に示すように、第１、２
の実施例に比べて２倍となっている。In order to alternately take charge of the pivot process in each processor, first, the process P (1) 136 is performed only once in the processor PEa, and the result is sent to the SEND process 1.
45, and the processor PEb receives RECEIV.
It is received in the E processing 148. Then, each processor enters an iterative loop. Each iteration loop DO91 and DO92
Performs the process for the control variable K twice.
Therefore, the designation of the intervals of the DO91 and DO92 loops is performed as shown in 134 and 135 of FIG.
It is twice as large as that of the example.

【００９０】Ｋ回目のステップでは、プロセッサＰＥｂ
において、処理ＱS（Ｋ）１４１、処理Ｐ（Ｋ＋Ｍ）１
４２および処理（Ｑb（Ｋ）−Ｑs（Ｋ））１４３を行な
い、ＳＥＮＤ処理１４９でプロセッサＰＥａにピボット
値を送信する。プロセッサＰＥａでは、平行して処理Ｑ
a（Ｋ）１３７を行ない、ＲＥＣＥＩＶＥ処理１４６で
ピボット値を受信する。In the Kth step, the processor PEb
In, processing QS (K) 141, processing P (K + M) 1
42 and the processing (Qb (K) -Qs (K)) 143 are performed, and the pivot value is transmitted to the processor PEa in the SEND processing 149. The processor PEa processes Q in parallel.
a (K) 137 is performed, and the pivot value is received in the RECEIVE processing 146.

【００９１】Ｋ＋Ｍ回目のステップでは、プロセッサＰ
Ｅａにおいて、処理ＱS（Ｋ＋Ｍ）１３８、処理Ｐ（Ｋ
＋２＊Ｍ）１３９および処理（Ｑa（Ｋ＋Ｍ）−Ｑs（Ｋ
＋Ｍ））１４０の演算処理を行ない、ＳＥＮＤ処理１４
７でプロセッサＰＥｂにピボット値を送信する。プロセ
ッサＰＥｂでは、平行して処理Ｑb（Ｋ）１４４を行な
い、ＲＥＣＥＩＶＥ処理１５０でピボット値を受信す
る。In the K + Mth step, the processor P
In Ea, processing QS (K + M) 138, processing P (K
+ 2 * M) 139 and processing (Qa (K + M) -Qs (K
+ M)) 140 arithmetic processing, and SEND processing 14
At 7, the pivot value is transmitted to the processor PEb. The processor PEb performs the process Qb (K) 144 in parallel, and receives the pivot value in the RECEIVE process 150.

【００９２】図１３に示したプログラムをコンパイルす
ることにより、図１４に示した手順で実行されるオブジ
ェクトプログラムが生成される。このことは、実施例１
の場合と同様である。By compiling the program shown in FIG. 13, an object program executed in the procedure shown in FIG. 14 is generated. This is the same as Example 1
Is the same as

【００９３】＜実施例４＞本実施例は、図３に示した、
複数のクラスタにより構成された主記憶分散型の並列計
算機を使用した、ＬＵ分解処理の並列実行方法を提供す
る。図１５は本実施例で使用するプログラムの例であ
り、図１６はこのプログラムを図３の装置で実行させる
ときの実行手順を示すフローチャートである。<Embodiment 4> This embodiment is shown in FIG.
Provided is a parallel execution method of LU decomposition processing using a main memory distributed parallel computer configured by a plurality of clusters. FIG. 15 is an example of a program used in this embodiment, and FIG. 16 is a flowchart showing an execution procedure when this program is executed by the apparatus of FIG.

【００９４】このような並列計算機では、実施例３のよ
うに、データの送受信処理をもとに順序保証を行ない、
さらに、各反復回ごとに、クラスタ内のプロセッサ間で
第１、第２の実施例のような並列処理を実施する。In such a parallel computer, the order is guaranteed based on the data transmission / reception processing as in the third embodiment.
Further, the parallel processing as in the first and second embodiments is performed between the processors in the cluster for each iteration.

【００９５】実施例４の実施例３と異なる点は、各ステ
ップ実行内で、クラスタ内のプロセッサ間の並列処理を
実行することである。このときの手順は、ここでは図９
を用いて説明した実施例１で示した方法を用いる。もち
ろん、実施例２で示した図１１の手順も利用することが
可能である。The difference of the fourth embodiment from the third embodiment is that the parallel processing between the processors in the cluster is executed within each step execution. The procedure at this time is shown in FIG.
The method shown in Example 1 described above is used. Of course, it is also possible to use the procedure of FIG. 11 shown in the second embodiment.

【００９６】本実施例では、２つのクラスタＣＬａとＣ
Ｌｂ、および各クラスタ内のプロセッサ数がｐ台と仮定
する。以下、本実施例での処理手順を図１６を用いて説
明する。In this embodiment, two clusters CLa and C are used.
It is assumed that Lb and the number of processors in each cluster are p. The processing procedure in this embodiment will be described below with reference to FIG.

【００９７】図１４と比較して、図１６で異なる点は、
（１）図１４におけるプロセッサＰＥａおよびＰＥｂが
それぞれ図１６のクラスタＣＬａおよびＣＬｂに対応す
ること、（２）各クラスタ内の各ステップの処理におい
て、クラスタ内のｐ台のプロセッサにおいて、並列処理
が行なわれることの２点である。図１６では、２台のク
ラスタ（ＣＬａとＣＬｂ）で処理を実行する場合を示
す。もちろん、３台以上のクラスタの場合も同様であ
る。16 is different from FIG. 14 in that
(1) The processors PEa and PEb in FIG. 14 correspond to the clusters CLa and CLb in FIG. 16, respectively. (2) In the processing of each step in each cluster, parallel processing is performed in p processors in the cluster. There are two things to do. FIG. 16 shows a case where the process is executed by two clusters (CLa and CLb). Of course, the same applies to the case of three or more clusters.

【００９８】このとき、もし、データを送信する送信Ｓ
ｅｎｄ処理とデータを受信するＲｅｃｅｉｖｅ処理が非
同期実行可能な場合は、Ｓｅｎｄ処理を可能なかぎり先
行させて、Ｒｅｃｅｉｖｅ処理を同様に可能なかぎり遅
らせて実行する必要がある。これは、図１４におけるＳ
ｅｎｄ処理／Ｒｅｃｅｉｖｅ処理を用いた時のスケジュ
ーリング方針と同様である。At this time, if the transmission S for transmitting data
When the end process and the Receive process for receiving data can be executed asynchronously, it is necessary to execute the Send process as early as possible and similarly execute the Receive process as late as possible. This is S in FIG.
It is the same as the scheduling policy when the end process / Receive process is used.

【００９９】最初に、クラスタＣＬａでピボット処理Ｐ
（１）２５２を行ない、生成した領域Ｌ２（図４参照）
のデータをクラスタＣＬｂに送信するためにＳｅｎｄ処
理２５３を実行する。First, in the cluster CLa, the pivot process P
(1) The area L2 generated by performing 252 (see FIG. 4)
Send process 253 is executed to transmit the data of the above to the cluster CLb.

【０１００】一方、クラスタＣＬｂはこのデータを受信
するＲｅｃｅｉｖｅ処理２５４を実行する。以下、各ク
ラスタは、それぞれ制御変数Ｋの反復ループに入る。制
御変数Ｋの制御は、２７９ないし２８２で行なう。処理
Ｑ（Ｋ）は、クラスタ数である２つに均等に分割され
る。それぞれのクラスタの担当する処理を処理Ｑa
（Ｋ）および処理Ｑb（Ｋ）とする。On the other hand, the cluster CLb executes the Receive process 254 for receiving this data. Hereinafter, each cluster enters an iterative loop of the control variable K. The control of the control variable K is performed by 279 to 282. The process Q (K) is evenly divided into two, which is the number of clusters. Process Qa for each cluster
(K) and processing Qb (K).

【０１０１】まず、クラスタＣＬｂは、Ｋ回目の反復処
理において、クラスタ内のｐ台のプロセッサを用いて並
列処理２５６を行なう。このときの並列処理の分割の方
法は、実施例１から実施例３にしてしたのと同一の方法
で可能である。First, the cluster CLb performs the parallel processing 256 by using the p processors in the cluster in the Kth iteration processing. At this time, the method of dividing the parallel processing can be the same method as that of the first to third embodiments.

【０１０２】ピボット処理Ｐ（Ｋ＋Ｍ）２６３を担当す
るプロセッサでは、処理ＱS（Ｋ）２６２を実行し、引
き続き、ピボット処理Ｐ（Ｋ＋Ｍ）２６３を実行する。
これにより、Ｋ＋Ｍ回目の消去処理に用いる領域Ｌ２の
生成が可能となるために、ただちにクラスタＣＬａにＳ
ｅｎｄ処理２６８を実行する。その後、残りの処理（Ｑ
b（Ｋ）−Ｑs（Ｋ））２６４を実行する。クラスタＣＬ
ｂ内の他のプロセッサは、並列にそれぞれ処理Ｑb2
（Ｋ）２６５および処理Ｑbp（Ｋ）２６６を実行する。
Ｋステップ内でクラスタ内でのプロセッサ間の同期は、
並列化処理２５６で行なわれる。その後、Ｋ＋Ｍステッ
プに手順を進める。The processor in charge of the pivot process P (K + M) 263 executes the process QS (K) 262, and subsequently executes the pivot process P (K + M) 263.
As a result, the area L2 used for the K + Mth erasing process can be generated, and the cluster CLa is immediately converted into S.
The end process 268 is executed. After that, the remaining processing (Q
b (K) -Qs (K)) 264 is executed. Cluster CL
The other processors in b respectively process in parallel Qb2.
(K) 265 and the processing Qbp (K) 266 are executed.
Synchronization between processors within a cluster within K steps is
The parallelization processing 256 is performed. Then, the procedure proceeds to K + M steps.

【０１０３】一方、クラスタＣＬａでは、まず処理Ｑa1
（Ｋ）２５９で担当するＱの領域の更セ新処理を行な
い、その後に、Ｋ＋Ｍステップに必要なＬ２のデータを
Ｒｅｃｅｉｖｅ処理２６７で受け取る。このとき、クラ
スタＣＬｂ側のＳｅｎｄ処理２６８は可能なかぎり早く
実行されているため、Ｒｅｃｅｉｖｅ処理２６７での受
信待ちはあまりなく、すぐにデータ受信が可能となる。
クラスタＣＬａ内の他のプロセッサは、並列にそれぞれ
処理Ｑa2（Ｋ）２６０および処理Ｑap（Ｋ）２６１を実
行する。Ｋステップ内でクラスタ内でのプロセッサ間の
同期は、並列化処理２５５で行なわれる。その後、Ｋ＋
Ｍステップに手順を進める。On the other hand, in the cluster CLa, first the processing Qa1
(K) The area of Q in charge is updated in (K) 259, and then the Receive processing 267 receives the L2 data required for the K + M step. At this time, the Send process 268 on the side of the cluster CLb is executed as soon as possible, and therefore there is not much waiting for reception in the Receive process 267, and data can be received immediately.
The other processors in the cluster CLa execute the process Qa2 (K) 260 and the process Qap (K) 261 respectively in parallel. The synchronization between the processors in the cluster within K steps is performed in the parallelization processing 255. Then K +
Proceed to M steps.

【０１０４】次にＫ＋Ｍステップに進む。今度は、クラ
スタＣＬａ側がピボット処理Ｐ（Ｋ＋２＊Ｍ）２７０を
担当する。以下の処理は、クラスタＣＬｂのＫステップ
の手順と同等である。Next, the process proceeds to the K + M step. This time, the cluster CLa side is in charge of the pivot process P (K + 2 * M) 270. The following processing is equivalent to the procedure of the K step of the cluster CLb.

【０１０５】処理ＱS（Ｋ＋Ｍ）２６９を実行し、引き
続き、ピボット処理Ｐ（Ｋ＋２＊Ｍ）２７０を実行す
る。これにより、Ｋ＋２＊Ｍ回目の消去処理に用いる領
域Ｌ２の生成が可能となるために、ただちにクラスタＣ
ＬｂにＳｅｎｄ処理２７７を実行する。その後、残りの
処理（Ｑa1（Ｋ＋Ｍ）−Ｑs（Ｋ＋Ｍ））２７１を実行
する。クラスタＣＬａ内の他のプロセッサは、並列にそ
れぞれ処理Ｑa2（Ｋ）２７２および処理Ｑap（Ｋ）２７
３を実行する。Ｋ＋Ｍステップ内でクラスタ内でのプロ
セッサ間の同期は、並列化処理２５７で行なわれる。そ
の後、Ｋ＋２＊Ｍステップに手順を進める。The process QS (K + M) 269 is executed, and then the pivot process P (K + 2 * M) 270 is executed. As a result, the area L2 used for the K + 2 * Mth erasing processing can be generated, so that the cluster C
Send processing 277 is executed for Lb. After that, the remaining processing (Qa1 (K + M) -Qs (K + M)) 271 is executed. The other processors in the cluster CLa respectively perform the processing Qa2 (K) 272 and the processing Qap (K) 27 in parallel.
Execute 3. The synchronization between the processors in the cluster within K + M steps is performed by the parallelization processing 257. After that, the procedure proceeds to K + 2 * M steps.

【０１０６】一方、クラスタＣＬｂでは、まず処理Ｑb1
（Ｋ＋Ｍ）２７４において、担当する領域の更新処理を
行ない、その後に、Ｋ＋２＊Ｍステップに必要なＬ２の
データをＲｅｃｅｉｖｅ処理２７８で受け取る。このと
き、クラスタＣＬａ側のＳｅｎｄ処理２７７は可能なか
ぎり早く実行されているため、Ｒｅｃｅｉｖｅ処理２７
８での受信待ちはあまりなく、すぐにデータ受信が可能
となる。クラスタＣＬｂ内の他のプロセッサは、並列に
それぞれ処理Ｑb2（Ｋ）２７５および処理Ｑbp（Ｋ）２
７６を実行する。Ｋ＋Ｍステップ内でクラスタ内でのプ
ロセッサ間の同期は、並列化処理２５８で行なわれる。
その後、Ｋ＋２＊Ｍステップに手順を進める。On the other hand, in the cluster CLb, first, the processing Qb1
In (K + M) 274, the area in charge is updated, and then the L2 data necessary for K + 2 * M steps is received by the Receive processing 278. At this time, the Send process 277 on the side of the cluster CLa is executed as soon as possible, and therefore the Receive process 27 is executed.
There is not much waiting for reception at 8, and data can be received immediately. The other processors in the cluster CLb respectively perform the processing Qb2 (K) 275 and the processing Qbp (K) 2 in parallel.
Execute 76. The synchronization between the processors in the cluster within K + M steps is performed by the parallelization processing 258.
After that, the procedure proceeds to K + 2 * M steps.

【０１０７】図１５に示すプログラムでは、２台のクラ
スタ（ＣＬａとＣＬｂ）でかつ各クラスタ内のプロセッ
ササ数がそれぞれ２台の場合を示す。もちろん、３台以
上のクラスタの場合、あるいは、クラスタ内のプロセッ
サ台数を３台以上にすることに可能である。図１５中、
２０６以下２０２の直前までがクラスタＣＬａの処理
で、２０８以下２０３の直前までがクラスタＣＬｂの処
理である。ここでは、まず２００では自クラスタがどち
らのクラスタであるかを判断する。これは、一般にたと
えばＵＮＩＸコマンドのWHOAMIなどのような判別ライブ
ラリを事前に用意しておく。The program shown in FIG. 15 shows a case where there are two clusters (CLa and CLb) and the number of processors in each cluster is two. Of course, in the case of a cluster of three or more, or the number of processors in the cluster can be three or more. In FIG.
The process from 206 to 202 immediately before 202 is the processing of the cluster CLa, and the process from 208 to 203 immediately before the 203 is processing of the cluster CLb. Here, first, in 200, it is determined which cluster is the own cluster. In general, a discrimination library such as WHOAMI of UNIX command is prepared in advance.

【０１０８】各クラスタで交互にピボット処理を行なう
ために、ＤＯ９１ループ２０４およびＤＯ９２ループ２
０５は制御変数Ｋに関して、２回分の処理を行なってい
る。そのために、実施例３と同様に、ＤＯ９１およびＤ
Ｏ９２ループの間隔の指定が、実施例１、２に比べて２
倍となっている。In order to alternately perform the pivot processing in each cluster, DO91 loop 204 and DO92 loop 2
Reference numeral 05 is processing the control variable K twice. Therefore, as in the third embodiment, DO91 and D
The designation of the O92 loop interval is 2 as compared with the first and second embodiments.
Has doubled.

【０１０９】各クラスタ内では、各反復ステップＫおよ
びＫ＋Ｍごとに、並列処理指示文２０９ないし２２４に
より、並列化の指示を行なう。Within each cluster, parallelization is instructed by parallel processing directives 209 to 224 for each iterative step K and K + M.

【０１１０】図１５の例では、クラスタＣＬａにおいて
は、ステップＫに対しては、２０９から２１２の間の並
列指示行により並列処理が規定されており、演算処理２
２５および２２６、ＲＥＣＥＩＶＥ処理２３７を行な
う。ステップＫ＋Ｍでは、２１３から２１６の間の並列
指示行により並列処理が規定されており、演算処理２２
７ないし２２９および２３０、ＳＥＮＤ処理２３９を行
なう。In the example of FIG. 15, in the cluster CLa, the parallel processing is defined for the step K by the parallel instruction lines between 209 and 212.
25 and 226, RECEIIVE processing 237. In step K + M, parallel processing is defined by the parallel instruction lines between 213 and 216.
7 to 229 and 230, and SEND processing 239.

【０１１１】クラスタＣＬｂにおいては、ステップＫに
対しても、同様に２１７から２２０の間の並列指示行に
より並列処理が規定されており、演算処理２３１ないし
２３４、ＳＥＮＤ処理２３８を行なう。ステップＫ＋Ｍ
では、２２１から２２４の間の並列指示行により並列処
理が規定されており、演算処理２３５および２３６、Ｒ
ＥＣＥＩＶＥ処理２４０を行なう。In the cluster CLb, also for step K, parallel processing is similarly defined by the parallel instruction lines between 217 and 220, and arithmetic processing 231 to 234 and SEND processing 238 are performed. Step K + M
In the above, parallel processing is defined by the parallel instruction lines between 221 and 224, and the arithmetic processing 235 and 236, R
ECEIVE processing 240 is performed.

【０１１２】なお、第３および実施例４では、この処理
では、データの送受信のためのＳｅｎｄ処理やＲｅｃｅ
ｉｖｅ処理を他の演算処理と並列に実行することが必要
となる。もちろん、このＳｅｎｄ処理やＲｅｃｅｉｖｅ
処理が専用の通信プロセッサで動作可能である場合は、
それらの通信に要する処理時間を除いたのちに、演算を
担当するプロセッサ内で演算量の処理を均等にする必要
がある。In the third and fourth embodiments, in this processing, Send processing for receiving and transmitting data and Rec.
It is necessary to execute the ive processing in parallel with other arithmetic processing. Of course, this Send processing and Receive
If the process can run on a dedicated communications processor,
After removing the processing time required for these communications, it is necessary to equalize the processing of the calculation amount within the processor in charge of the calculation.

【０１１３】図１５に示したプログラムをコンパイルす
ることにより、図１６に示した処理手順が実行されるオ
ブジェクトプログラムが生成されることは、実施例１の
場合と同様である。Similar to the case of the first embodiment, an object program for executing the processing procedure shown in FIG. 16 is generated by compiling the program shown in FIG.

【０１１４】[0114]

【発明の効果】本来は逐次処理されるべき第１の処理と
その後に実行されるべき、並列実行可能な第２の処理と
を有し、第２の処理の完了を待って再度第１、第２の処
理を実行するように、第１、第２の処理を反復して実行
すべき反復処理を、並列計算機で実行させる場合に、プ
ロセッサが待ち状態にある時間を軽減することが出来、
計算機資源の有効な利用を図ることが出来る。それによ
り、この反復処理全体の処理時間を軽減することが可能
になる。The present invention has a first process that should be sequentially processed and a second process that can be executed in parallel and that can be executed in parallel. After waiting for the completion of the second process, the first process is executed again. When the parallel computer executes the iterative process that should be executed by repeating the first and second processes so as to execute the second process, the time during which the processor is in the waiting state can be reduced.
Effective use of computer resources can be promoted. This makes it possible to reduce the processing time of the entire iterative process.

[Brief description of drawings]

【図１】従来の主記憶共有型並列計算機の概略構造を示
す図。FIG. 1 is a diagram showing a schematic structure of a conventional main memory sharing type parallel computer.

【図２】従来の主記憶分散型並列計算機の概略構造を示
す図。FIG. 2 is a diagram showing a schematic structure of a conventional main memory distributed parallel computer.

【図３】従来のクラスタ結合型並列計算機の概略構造を
示す図。FIG. 3 is a diagram showing a schematic structure of a conventional cluster-coupled parallel computer.

【図４】従来の多段同時消去ＬＵ分解法の概要を示す
図。FIG. 4 is a diagram showing an outline of a conventional multistage simultaneous erasure LU decomposition method.

【図５】従来の多段同時消去ＬＵ分解法のプログラム例
を示す図。FIG. 5 is a diagram showing a program example of a conventional multi-stage simultaneous erase LU decomposition method.

【図６】従来の反復計算処理のフローチャート。FIG. 6 is a flowchart of conventional iterative calculation processing.

【図７】並列化された反復計算処理のフローチャート。FIG. 7 is a flowchart of parallelized iterative calculation processing.

【図８】本発明の実施例１で使用するプログラム例を示
す図。FIG. 8 is a diagram showing an example of a program used in the first embodiment of the present invention.

【図９】実施例１による反復計算処理のフローチャー
ト。FIG. 9 is a flowchart of iterative calculation processing according to the first embodiment.

【図１０】実施例２で使用するプログラム例を示す図。FIG. 10 is a diagram showing an example of a program used in the second embodiment.

【図１１】本発明の実施例２による反復計算処理のフロ
ーチャート。FIG. 11 is a flowchart of iterative calculation processing according to the second embodiment of the present invention.

【図１２】分散記憶型計算機向けのデータの分割を示す
図。FIG. 12 is a diagram showing division of data for a distributed storage computer.

【図１３】本発明の実施例３で使用するプログラム例を
示す図。FIG. 13 is a diagram showing an example of a program used in Example 3 of the present invention.

【図１４】実施例３による反復計算処理のフローチャー
ト。FIG. 14 is a flowchart of iterative calculation processing according to the third embodiment.

【図１５】本発明の実施例４で使用するプログラム例を
示す図。FIG. 15 is a diagram showing an example of a program used in Example 4 of the present invention.

【図１６】実施例４による反復計算処理のフローチャー
ト。FIG. 16 is a flowchart of iterative calculation processing according to the fourth embodiment.

[Explanation of symbols]

２０、３４、４２…ループ制御、２７、３１、３６…逐
次的処理Ｐ（Ｋ）、２８、３２…並列化可能処理Ｑ
（Ｋ）、２９…並列指示、３８ないし４０…並列化可能
処理Ｑ（Ｋ）を分割した処理20, 34, 42 ... Loop control, 27, 31, 36 ... Sequential processing P (K), 28, 32 ... Parallelizable processing Q
(K), 29 ... Parallel instruction, 38 to 40 ... Process capable of parallelization Q (K) is divided

フロントページの続き (72)発明者玉置由子東京都国分寺市東恋ケ窪１丁目280番地株式会社日立製作所中央研究所内 (72)発明者榊原忠幸東京都国分寺市東恋ケ窪１丁目280番地株式会社日立製作所中央研究所内 (72)発明者朝家真知子東京都国分寺市東恋ケ窪１丁目280番地株式会社日立製作所中央研究所内 (72)発明者保田淑子東京都国分寺市東恋ケ窪１丁目280番地株式会社日立製作所中央研究所内Front page continuation (72) Yuko Tamaki, 1-280, Higashi Koikekubo, Kokubunji, Tokyo, Central Research Laboratory, Hitachi, Ltd. (72) Inventor, Tadayuki Sakakibara 1-280, Higashi Koikeku, Kokubunji, Tokyo Hitachi Research Center, Ltd. (72) Inventor Machiko Asaya 1-280, Higashi Koigokubo, Kokubunji, Tokyo Inside Central Research Laboratory, Hitachi, Ltd. (72) Yoshiko Yasuda 1-280, Higashi Koikeku, Kokubunji, Tokyo Inside Central Research Laboratory, Hitachi, Ltd.

Claims

[Claims]

1. A repetitive process for repetitively executing a first process and a second process using the execution result, the second process comprising:
Processing consists of multiple sub-processes that can be executed in parallel,
The first process is the second process during a certain iteration of the iterative process.
A method of causing a parallel computer to execute the execution result of the process of step 1) used in the next iteration of the iterative process, comprising: (a) performing the iterative process by any one of a plurality of processors constituting the parallel computer. (B) One processing group uses the first processing in the next iteration among the plurality of partial processings included in the second processing. To generate at least some of the partial processing and the first processing, and the other processing group includes the second processing.
Included in the first process and the second process so as to include a part of a plurality of partial processes other than the plurality of partial processes included in the process. The plurality of the plurality of partial processes are divided into a plurality of processing groups whose number is equal to the number of the plurality of processors, and (c) the execution result of the step (a) is obtained for the first iteration. Using the result of this execution of the plurality of sub-processes for the next iteration, the first process being performed The one processing group including at least a part of a plurality of partial processings and the first processing is executed by any one of the plurality of processors, and (d) each of the plurality of other processors Using one of the plurality of processing groups for the first iteration using the execution result of the step (a) And (e) after the execution of step (c) or (d) by each processor is performed in parallel with the above execution by step (c) by the one processor, e)
In the iteration, (e1) the second processing is executed for the iterations after the next iteration, and the first processing is performed for the next iteration of the iterations after the next iteration. The execution result of the first process, which is obtained as a result of the execution of step (c) before executing the steps (c) to (d) and (e2) above, as A parallel execution method of the iterative process used in the second process in step (d) executed after repeating the above process.

2. A parallel execution method for iterative processing according to claim 1, wherein at the time of repeating by step (e), step (b) is also executed and then repeated.

3. The step (b) is such that a plurality of other partial processings to be executed by the one processor in addition to the plurality of partial processings belong to the one processing group. The parallel execution method for iterative processing according to claim 1, further comprising the step of executing the division.

4. The step (b) is required for execution of each of the plurality of partial processes and a first processing time required for executing the first process, and each of the plurality of other partial processes. The parallel execution method for iterative processing according to claim 1, further comprising the step of performing the division so that the execution times of the plurality of processing groups are substantially equal based on the second processing time.

5. The iterative process is a process for LU decomposition of a coefficient matrix of N-dimensional simultaneous linear equations, the first process includes a pivot process, and the second process is an erase / update process of a coefficient matrix. A parallel execution method for iterative processing according to any one of claims 1 to 4, further comprising:

6. The step (e), after detecting the end of the processing in each processor, repeats the step (c) by the one processor and the step (d) by each of the other plurality of processors. 2. The parallel execution method for iterative processing according to claim 1, further comprising the step of starting simultaneously.

7. The step (e) comprises the step of determining, by each processor, the end of the process by each of the other processors after the end of the process by that processor, and by each processor. 2. The parallel execution method for iterative processing according to claim 1, further comprising the step of activating the repetition of step (c) or (d) in the processor when the end of the processing is detected.

8. The parallel computer further comprises a main memory shared by the plurality of processors, the method comprising: (f) when each processor finishes executing the step (c) or (d). The method further comprises the step of writing information indicating the end by the processor to the main memory, wherein the step (e) is based on the information written by the processor to the main memory and indicating the end of the processing by the processor. The parallel execution method of the iterative process according to claim 7, which is performed by the method.

9. In step (f), after the execution of step (c), information indicating the end of processing by the one processor is written in the main memory by the one processor, and the information of the plurality of other processors is written. Step (d) by each
After the execution of any of the sub-processing groups by, each processor writes information indicating the end of the processing in that processor to the main memory, and step (e) is performed by the one processor. After writing the information indicating the end, the first processor monitors whether or not the information indicating the end is written in the main memory by each of the other plurality of processors, and the plurality of other processors are monitored. After the writing of the information indicating the end by each of the processors, whether or not the information indicating the end by each of the processors other than the plurality of other processors is written in the main memory is determined. By monitoring by each of the processors, execution of the process at step (c) or (d) by the other processor is completed by each of the plurality of processors. Step (e) is performed for each processor based on the determination result of the end of another processor for each processor in Step (e). 9. The parallel execution method for iterative processing according to claim 8, further comprising the step of determining the start time of the iteration of (d).

10. An iterative process for iteratively executing a first process and a second process using the execution result, wherein the second process comprises a plurality of partial processes that can be executed in parallel. The first process is a method of causing a parallel computer to execute the execution result of the second process in a certain iteration of the iterative process, which is used in the next iteration of the iterative process. Any one of the plurality of processors constituting the parallel computer executes the first process for the first iteration of the iterative process, and (b) one process group is included in the second process. Among the plurality of partial processes, the first process includes at least a part of the partial processes that generate data to be used in the next iteration and the first process, and the other process group includes , Second
Of the plurality of partial processes included in the process, the plurality of the plurality of partial processes included in the second process are included so as to include a part of a plurality of partial processes other than the plurality of partial processes of the part. The partial processing and the first processing are divided into a plurality of processing groups whose number is equal to the number of the plurality of processors, and (c) the execution result of the step (a) is used for the first iteration. To execute the first process using the execution result of the plurality of partial processes for the next iteration. The one processing group including at least the plurality of partial processings and the first processing is executed by any one of the plurality of processors, and (d) the plurality of processings are performed by each of the plurality of other processors. One of the processing groups, using the execution result of step (a) for the first iteration, And (e) in each processor, when the execution of step (c) or (d) by the processor is completed, the processor is executed in parallel with the execution of step (c) by the one processor. By repeating (c) or (d) above,
In the repetition, (e1) step (c) or (d) is executed so that another processor group different from the processing group executed in step (c) or (d) is executed in that processor. (E2) When the other processing group includes the partial processing group of the second processing, the partial processing group is executed for the iterations after the next iteration, and the other processing group When it includes the processing of 1, the first
Is executed for the next iteration of the iterations after the next iteration, and (e3) when the processing group executed by the processor in step (c) or (d) does not include the first processing, A parallel execution method of repetitive processing, comprising the step of starting the above-mentioned repetition after confirming the notification of the execution result of the first processing notified from another processor.

11. The parallel execution method for iterative processing according to claim 10, wherein the iteration in step (e) is performed so that each processor sequentially executes a different processing group for each iteration in step (e).

12. When any one of the processors executes the one processing group for any of the repetitions in step (c), the execution result is notified to a plurality of other processors, and the other plurality of processors are notified. Each of the processors executes the other one process group of the plurality of process groups for the repetition after confirming that the execution result is notified from the one processor. Executing the one processing group for the next iteration of the above iteration, notifying another processor of the execution result of the other processing group by each of the other plurality of processors, and In the processor, after the reception of the execution result of the plurality of processing groups for the repetition from the plurality of processors other than the processor is confirmed by the other processor, Relative return, claim 10 further comprising the step of initiating said one processing
A parallel execution method of the described iterative processing.

13. A distributed computer having a set work for connecting the plurality of processors to each other, and a local memory provided corresponding to each processor for holding a program and data executed by the processor. 13. A parallel execution method for repetitive processing according to claim 12, wherein the method is a memory parallel computer, and the notification of the execution result of the one processing and the notification of the execution result of the other processing group are performed via this network.

14. A repetitive process for repetitively executing a first process and a second process using the execution result, wherein the second process comprises a plurality of partial processes that can be executed in parallel. In the first process, the execution result of the second process in one iteration of the iterative process is used in the next iteration of the iterative process and shared by a plurality of processors. A plurality of clusters having local memory for holding programs and data;
A method of executing in a distributed storage parallel computer comprising a network connecting them, comprising: (a) repeating by one of a plurality of processors included in any one of the plurality of clusters. The first process is executed for the first iteration of the process, and (b) one process group is included in the second process, and when the first process is the next iteration, At least a part of a plurality of partial processes for generating data to be used and the first process are included, and the other process group includes a second process.
Included in the first process and the second process so as to include a part of a plurality of partial processes other than the plurality of partial processes included in the process. The plurality of the plurality of partial processings are divided into a plurality of processing groups whose number is equal to the number of the plurality of clusters, and (c) one of the plurality of processing groups is divided by each of the plurality of clusters. (C1) The plurality of partial processes and the first partial process are executed in one processing group when executing one processing group by each cluster. When processing and is included,
One partial process group including at least a part of the plurality of partial processes and the first process, and other than the some of the plurality of partial processes included in the second process. Of the plurality of partial processes, the process of the one process group is divided into a plurality of other partial process groups, and the one part of the plurality of partial processes and the first partial process When the processing is not included, among the plurality of partial processings included in the second processing, a plurality of other partial processings each including a part of a plurality of partial processings other than the plurality of partial processings The processing of the one processing group is divided into processing groups, and (c2) a plurality of processors in the cluster divides the processing into the step (c
One of a plurality of partial processing groups obtained by the division according to 1),
When the first processing is included in the one processing group executed in the cluster in parallel with the execution of another partial processing group by another processor in the cluster, (c3) One processor that has executed one partial process group to which one process belongs notifies other clusters of the execution result of the first process. (D) In each cluster, the step (c1 (C) is repeated when the execution of () is completed, and in the repetition, (d1) another processing group different from the processing group executed in step (c) by the cluster among the plurality of processing groups. Repeat step (c) so that
(D2) If the first process is not included in the process group executed by the cluster in step (c), the execution result of the first process notified by another cluster is notified. After confirming, the execution of the other processing group is started, and (d
3) The other processing group executed in the repetition is the second
When the sub-process group belonging to the process is included, the sub-process group is executed for the iterations after the next iteration, and the other process group includes the first process. Sometimes, the parallel execution method of the iterative processing that executes the other processing group so that the first processing is executed for the next iteration of the iteration after the next iteration.