JP2004078617A

JP2004078617A - Parallel processing method for application

Info

Publication number: JP2004078617A
Application number: JP2002238552A
Authority: JP
Inventors: Masaru Hashimoto; 橋本　賢
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-08-19
Filing date: 2002-08-19
Publication date: 2004-03-11

Abstract

PROBLEM TO BE SOLVED: To provide a parallel processing method for an application that prevents the progress of threads from falling out of step.

SOLUTION: For the entry address a of a parallelized loop and a lower address b, pairs of the inter-address distance (b-a) and the required time t are recorded one after another. A certain thread is the first to reach the start address a of loop_n. At that moment, for each other thread i, the difference between the address b that thread i is executing and address a is entered in the inter-address distance (b-a) column; the time at that moment is denoted t_1. When processing in the same thread i progresses and reaches address a, that time is denoted t_2, and the difference t_2 - t_1 is entered in the required time t column corresponding to the previously recorded b-a entry. Based on this table, on a shared-memory computer having a plurality of CPUs, the execution schedule of a parallelized loop is dynamically optimized when programs placed in memory are executed in parallel.

COPYRIGHT: (C)2004,JPO

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a parallel processing method for an application, and more particularly to a parallel processing method for an application involving thread processing, in which an OS divides the processing of one application into a plurality of parts and executes them.
[0002]
[Prior art]
Conventionally, a parallel processing method for an application is applied to, for example, processing on a shared-memory computer having a plurality of CPUs.
In this method, the scheduling scheme for executing a loop in parallel is determined at compile time, when source code written in a programming language is converted in one batch into machine language that the computer can execute directly. Consequently, even when the times at which the threads arrive at the loop entry differ greatly at run time, processing proceeds according to the scheduling scheme already determined.
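The drawback of a schedule fixed at compile time can be illustrated with a small sketch. It assumes, purely for illustration, that the loop body is split into equal shares and that each thread starts its share as soon as it arrives; the function name `static_finish_times` and all numeric values are hypothetical and do not come from the publication.

```python
def static_finish_times(t_s, arrival_delays):
    """Finish time of each thread under an equal split fixed in advance.

    t_s            -- serial execution time of the loop body (assumed units)
    arrival_delays -- how long after the first thread each thread arrives
    """
    m = len(arrival_delays)
    share = t_s / m  # every thread gets the same share, whatever its delay
    return [delay + share for delay in arrival_delays]


# Three threads; the second and third arrive 0.5 and 1.0 time units late.
finish = static_finish_times(12.0, [0.0, 0.5, 1.0])
spread = max(finish) - min(finish)
print(finish)  # [4.0, 4.5, 5.0]
print(spread)  # 1.0
```

Under this assumption the arrival skew is carried unchanged to the loop exit, which is the situation the invention sets out to avoid.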
[0003]
Prior application example 1, which is in a technical field similar to that of the present invention, is the "Compilation Method" of Japanese Patent Application Laid-Open No. H04-293150. Prior application example 1 generates the object code that processes a source program fastest, aiming to bring out the maximum performance of a vector computer system. In other words, it describes a method of optimizing program parallelization on a shared-memory computer.
[0004]
Prior application example 2, the "Dynamic Load Distribution Method and Dynamic Load Distribution Apparatus" of Japanese Patent Application Laid-Open No. 2002-049603, aims to distribute the load of each processor efficiently and obtain better computational efficiency when, for example, an application program using an iterative method, such as one that solves a linear or nonlinear optimization problem, is executed on a parallel computer.
[0005]
Prior application example 3, the "Process Parallel Execution Method and Apparatus" of Japanese Patent No. 2818016 (Japanese Patent Application Laid-Open No. H04-98323), determines the number of parallel processes to generate based on an execution cost value reflecting the amount of computation in the loop, when loop processing that can be executed in parallel is executed by parallel processes. The document states that this achieves efficient parallel execution of loops that can be executed in parallel.
[0006]
[Problems to be solved by the invention]
However, the conventional technology described above has the problem that the threads, by which the OS (operating system) divides the processing of one application into a plurality of parts for execution, may proceed to subsequent processing while their progress remains out of step. As a result, subsequent processing may become tangled and may not be performed correctly. This is because information on the progress of each thread when executing a loop (execution location, required time) is not used in subsequent execution.
[0007]
An object of the present invention is to provide a parallel processing method for an application that does not cause the operation of threads to fall out of step.
[0008]
[Means for Solving the Problems]
To achieve this object, the parallel processing method for an application according to the present invention relates to thread processing in which the processing of one application is divided into a plurality of parts and executed in parallel, and is executed in shared memory. The method comprises: a work allocation step A of allocating more work to thread A, which reached address a, the entry address of a parallelized loop, earlier; a work allocation step B of allocating less work to thread B, which reached it later; and a dynamic scheduling step of dynamically performing scheduling for executing the processing based on at least work allocation step A and work allocation step B. The method is characterized in that the progress times of thread A and thread B at the exit of the loop are made uniform in their parallel processing.
[0009]
Further, in the dynamic scheduling step, the relationship between the loop name of each parallelized loop and the execution time of each loop is preferably looked up. This lookup preferably pairs the time at which each thread arrived at the loop entry with the address it was executing, and estimates from the accumulated data the amount of loop execution work each thread should take on.
[0010]
Further, in the lookup of the relationship with the execution time of each loop, an address lower than the entry address a of the parallelized loop is defined as address b, and the time required from when execution of a predetermined program passes address b until it reaches address a is defined as time t. The correspondence between the inter-address distance (b-a) and the required time (t) is looked up, and a table prepared in memory before execution of the predetermined program is configured. The table shows a "loop name" and an "execution time", giving the correspondence between each parallel loop in the program and the execution time of the loop body (when not parallelized). The table is created dynamically when the program is executed, and if the "execution time" is not known in advance, the value may be entered during execution of the program.
[0011]
BEST MODE FOR CARRYING OUT THE INVENTION
Next, an embodiment of the parallel processing method for an application according to the present invention will be described in detail with reference to the accompanying drawings. FIGS. 1 to 3 show one embodiment of the parallel processing method for an application of the present invention.
[0012]
The present invention dynamically optimizes an execution schedule method of a parallelized loop when a program arranged in a memory is executed in parallel in a computer of a shared memory model having a plurality of CPUs.
[0013]
In this method, the time at which each thread arrived at the loop entry is paired with the address it was executing, and the amount of loop execution work each thread should take on is estimated from the accumulated data. The table that accumulates the data need not be kept per loop; a single table covering the whole program suffices, which is efficient. The configuration of the present invention is described in detail below.
[0014]
(Configuration example)
FIG. 1 shows a configuration example of a memory map around one parallelization loop in a program. Here, the address of the entrance of the parallelization loop is assumed to be address a. The address b is an address lower than the address a at the entrance of the parallelization loop. The time required from when the program execution passes address b until it reaches address a is defined as time t.
[0015]
FIG. 2 shows a configuration example of a table prepared in memory before the program is executed. This table has "loop name" and "execution time" columns and gives the correspondence between each parallel loop in the program and the execution time of the loop body (when not parallelized). If an execution time is not known in advance, the value may be entered during execution of the program.
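A minimal sketch of how the FIG. 2 table might be held in memory follows. The dictionary layout, the use of `None` for an unknown value, the helper name `record_execution_time`, and the rule that a known value is never overwritten are all assumptions made for illustration; the publication only specifies the two columns and that missing values may be filled in at run time.

```python
from typing import Dict, Optional

# Hypothetical in-memory form of the FIG. 2 table:
# loop name -> execution time of the loop body when not parallelized.
# None marks a value that is not known before the program starts.
loop_table: Dict[str, Optional[float]] = {
    "loop_1": 12.0,  # illustrative value, known in advance
    "loop_n": None,  # unknown; to be entered during execution
}


def record_execution_time(table, loop_name, measured):
    """Enter a measured time only if the table has no value yet (assumed policy)."""
    if table.get(loop_name) is None:
        table[loop_name] = measured
    return table[loop_name]


record_execution_time(loop_table, "loop_n", 8.5)   # fills the missing entry
record_execution_time(loop_table, "loop_1", 99.0)  # ignored: already known
print(loop_table)  # {'loop_1': 12.0, 'loop_n': 8.5}
```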
[0016]
FIG. 3 shows a configuration example of a table created during execution of the program. This table is a correspondence table between the inter-address distance (b-a) and the required time (t).
The table is created dynamically while the program runs. It is obtained by recording, one after another, pairs of the difference (b-a) between address b and address a (called the inter-address distance) and the required time (t).
[0017]
(Operation example)
An operation example of this embodiment is described using a certain loop, loop_n, in a program. At run time, some thread is the first to reach the start address a of loop_n. At that moment, the following processing is performed for each of the other threads i.
Namely, the difference between the address b that thread i is executing and address a is entered in the inter-address distance (b-a) column of the table in FIG. 3. The time at that moment is denoted t_1.
[0018]
Then, as processing in the same thread i progresses, the time at which it reaches address a is denoted t_2, and the time difference t_2 - t_1 is entered in the required time (t) column corresponding to the b-a entry recorded earlier in FIG. 3. This is done for each thread i with 0 ≤ i < m, where m is the number of threads.
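The recording procedure just described can be sketched as follows. This is an illustration only: the class name `DistanceTimeTable`, its method names, the way current addresses and timestamps are supplied by the caller, and the choice to store the distance as the magnitude a - b are assumptions, not details given in the publication.

```python
class DistanceTimeTable:
    """Sketch of the FIG. 3 table: rows of (inter-address distance, required time)."""

    def __init__(self):
        self.rows = []      # completed (distance, required_time) pairs
        self._t1 = None     # time at which the first thread reached address a
        self._pending = {}  # thread id -> distance recorded at time t_1

    def first_arrival(self, entry_addr, current_addrs, now):
        """Call when the first thread reaches address a (time t_1).

        current_addrs maps each other thread i to the address b it is executing.
        """
        self._t1 = now
        # Stored as a magnitude (a - b), since b is lower than a (assumption).
        self._pending = {i: entry_addr - b for i, b in current_addrs.items()}

    def later_arrival(self, thread_id, now):
        """Call when thread i reaches address a (time t_2); records t_2 - t_1."""
        distance = self._pending.pop(thread_id)
        self.rows.append((distance, now - self._t1))


table = DistanceTimeTable()
table.first_arrival(entry_addr=0x1000, current_addrs={1: 0x0F00, 2: 0x0E00}, now=10.0)
table.later_arrival(1, now=10.5)
table.later_arrival(2, now=11.0)
print(table.rows)  # [(256, 0.5), (512, 1.0)]
```

How a real runtime would sample another thread's current address is outside the scope of this sketch.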
[0019]
As pairs of inter-address distance (b-a) and required time (t) are collected in this way, the two tend to be proportional. A linear function that computes the required time from the inter-address distance is therefore obtained by the least-squares method, and this is called function f(t).
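As a rough sketch of this step, an ordinary least-squares line can be fitted to the collected pairs in a few lines of code. The function name `fit_line`, the sample values, and the treatment of the distance as a non-negative magnitude are assumptions for illustration.

```python
def fit_line(samples):
    """Ordinary least-squares fit of time = slope * distance + intercept.

    samples -- list of (distance, required_time) pairs from the recorded table
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two distinct distances to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative samples that lie exactly on a line through the origin.
samples = [(256, 0.5), (512, 1.0), (768, 1.5)]
slope, intercept = fit_line(samples)


def f(distance):
    """Estimated time for a thread to cover the given inter-address distance."""
    return slope * distance + intercept


print(f(1024))  # 2.0
```

With real measurements the intercept will generally not be exactly zero, so the estimate for distance 0 may differ slightly from 0.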
[0020]
When execution of the program progresses further and the same loop, loop_n, is executed again, the optimal amount of loop work to assign to each thread is calculated as follows.
Among the m threads performing parallel processing, at the moment any one of them reaches address a, the inter-thread distance (b-a) of each thread i (0 ≤ i < m) is obtained and denoted d_i. Here d_i = 0 for the thread that reached address a first, and d_i > 0 for the other threads. The execution time of the loop is obtained from FIG. 2 and denoted t_s.
[0021]
Consider the value $g_i = \bigl(t_s + (f(d_0) + f(d_1) + \dots + f(d_{m-1})) - m\,f(d_i)\bigr)/m$ and take it as the amount of work assigned to thread i. Here $g_0 + g_1 + \dots + g_{m-1} = t_s$ holds. With this assignment, a thread that reached address a earlier is assigned more work and a thread that reached it later is assigned less work, so the progress of the threads can be aligned at the exit of the loop.
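The allocation formula can be checked numerically with a short sketch. The function name `allocate_work`, the example values, and the linear estimate `f` used here are assumptions for illustration; the sketch also does not handle the case where a share would come out negative, which the publication does not discuss.

```python
def allocate_work(t_s, distances, f):
    """Work share g_i = (t_s + sum_j f(d_j) - m * f(d_i)) / m for each thread i.

    t_s       -- execution time of the loop body when not parallelized
    distances -- d_i for each thread (0 for the thread that arrived first)
    f         -- function estimating required time from inter-address distance
    """
    m = len(distances)
    delays = [f(d) for d in distances]
    total = sum(delays)
    return [(t_s + total - m * fd) / m for fd in delays]


def f(distance):
    """Hypothetical fitted estimate: 1 time unit per 512 address units."""
    return distance / 512


shares = allocate_work(12.0, [0, 256, 512], f)
finish = [f(d) + g for d, g in zip([0, 256, 512], shares)]
print(shares)       # [4.5, 4.0, 3.5]
print(sum(shares))  # 12.0
print(finish)       # [4.5, 4.5, 4.5]
```

The last line reflects why the exits line up: for every thread, f(d_i) + g_i equals (t_s + f(d_0) + ... + f(d_{m-1}))/m, which does not depend on i.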
[0022]
(Effects)
During program execution, optimization of the loop execution allocation is dynamically performed according to the progress of each thread when the thread reaches the entrance of the loop. For this reason, the flexibility is higher than the method of performing scheduling at the time of compilation.
[0023]
Further, aligning the progress of the threads does not require synchronization at the loop entry and exit, which is advantageous for execution performance.
[0024]
(Other embodiments)
In the method described in the operation example of the above embodiment, the loop used to generate function f(t) and the loop to be optimized are the same.
[0025]
However, the relationship between the inter-address distance (b-a) and the required time (t) is determined only by the characteristics of the hardware and is independent of the kind of loop. For this reason, the loop used to generate function f(t) and the loop to be optimized may be different. That is, the data added to the table of FIG. 2 does not depend on a specific loop and can be used in common for all loops in the program.
[0026]
The above embodiment is an example of a preferred embodiment of the present invention. However, the present invention is not limited to this, and various modifications can be made without departing from the scope of the present invention.
[0027]
[Effects of the Invention]
As is clear from the above description, the parallel processing method for an application according to the present invention relates to thread processing in which the processing of one application is divided into a plurality of parts and executed in parallel. It allocates more work to thread A, which reached address a, the entry address of a parallelized loop, earlier, allocates less work to thread B, which reached it later, and dynamically performs scheduling for executing the processing. This makes it possible to make the progress times of thread A and thread B uniform at the exit of the loop in their parallel processing.
[Brief description of the drawings]
FIG. 1 shows a configuration example of a memory map around one parallelization loop in a program to which an embodiment of an application parallel processing method of the present invention is applied.
FIG. 2 shows a configuration example of a table prepared in a memory before execution of a program.
FIG. 3 shows a configuration example of a table created during execution of a program.
[Explanation of symbols]
a  Address of the entry of the parallelized loop
b  Address lower than address a
t  Required time

Claims

1. A parallel processing method for an application, relating to thread processing in which the processing of one application is divided into a plurality of parts and executed in parallel, the method comprising:
a work allocation step A of allocating more work to thread A, which reached address a, the entry address of a parallelized loop, earlier;
a work allocation step B of allocating less work to thread B, which reached it later; and
a dynamic scheduling step of dynamically performing scheduling for executing the processing based on at least the work allocation step A and the work allocation step B,
wherein the progress times of the thread A and the thread B at the exit of the loop are made uniform in their parallel processing.

2. The parallel processing method for an application according to claim 1, wherein in the dynamic scheduling step, the relationship between the loop name of each parallelized loop and the execution time of each loop is looked up.

3. The parallel processing method for an application according to claim 2, wherein the lookup of the relationship with the execution time of each loop pairs the time at which each thread arrived at the loop entry with the address it was executing, and estimates from the accumulated data the amount of loop execution work each thread should take on.

4. The parallel processing method for an application according to claim 3, wherein, in the lookup of the relationship with the execution time of each loop,
an address lower than the entry address a of the parallelized loop is defined as address b,
the time required from when execution of a predetermined program passes address b until it reaches address a is defined as time t,
the correspondence between the inter-address distance (b-a) and the required time (t) is looked up, and
a table prepared in memory before execution of the predetermined program is configured.

5. The parallel processing method for an application according to claim 4, wherein the table shows a "loop name" and an "execution time", giving the correspondence between each parallel loop in the program and the execution time of the loop body (when not parallelized).

6. The parallel processing method for an application according to claim 5, wherein the table is created dynamically when the program is executed, and if the "execution time" is not known in advance, the value may be entered during execution of the program.