JP2677482B2

JP2677482B2 - Computer language processing method

Info

Publication number: JP2677482B2
Application number: JP4061984A
Authority: JP
Inventors: 直司中平
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1992-03-18
Filing date: 1992-03-18
Publication date: 1997-11-17
Anticipated expiration: 2012-11-17
Also published as: JPH05265770A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【産業上の利用分野】本発明は，階層型メモリを有する
計算機上で動作するプログラムについて，オブジェクト
プログラムへの翻訳時に最適化処理を行う計算機言語処
理方法に関する。BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a computer language processing method for optimizing a program running on a computer having a hierarchical memory when translating it into an object program.

【０００２】最近のハードウェアは，主記憶の他に，よ
り高速なメモリ（以下，キャシュメモリという）を実装
した階層型メモリ構造を持つものが一般的である。アク
セス頻度の高いデータは，キャシュメモリに格納され高
速にアクセスされる。一般に，キャシュメモリと主記憶
のアクセスとの差は数〜数十マシンサイクルと言われて
おり，主記憶の方が圧倒的にメモリアクセスのコストが
大きい。Recent hardware generally has a hierarchical memory structure in which a faster memory (hereinafter referred to as cache memory) is mounted in addition to the main memory. Data that is frequently accessed is stored in the cache memory and is accessed at high speed. Generally, the difference between the cache memory and the access to the main memory is said to be several to several tens machine cycles, and the memory access cost of the main memory is overwhelmingly higher.

【０００３】従って，高級言語で記述されたプログラム
をコンパイルする際に，できるだけキャッシュメモリが
有効利用されるように最適化を行えば，プログラムの実
行性能が向上する。そのための適切な技術手段が必要と
される。Therefore, when compiling a program written in a high-level language, if the cache memory is optimized so as to be used as effectively as possible, the execution performance of the program is improved. Appropriate technical means for that purpose are required.

【０００４】[0004]

【従来の技術】従来の最適化コンパイラでは，ユーザが
記述したソースプログラムに対して，命令数の削減／命
令の並べ換えにより，メモリアクセスの縮小化／パイプ
ラインにおけるディレイの縮小化を実施し，処理の向上
を図ってきた。2. Description of the Related Art In a conventional optimizing compiler, a source program written by a user is processed by reducing the number of instructions / rearranging instructions to reduce memory access / reduce pipeline delay. Has been improved.

【０００５】しかし，頻繁にメモリアクセスが必要な配
列データ等のデータをキャシュメモリに保持する方法
は，ほとんどの場合，ハードウェアのキャシュメモリの
管理方式に処理が委ねられている。ハードウェアで実現
されているキャシュメモリの制御方式では，ソースプロ
グラム上でキャシュメモリのミスヒットが認識できる場
合でも，キャシュメモリへのデータの配置・更新が一意
的に処理される。このため，ソフトウェアにより，ユー
ザプログラムの特徴に合わせてデータをキャシュメモリ
上へ配置することが，性能向上の観点から重要な問題と
なる。However, in most cases, the method of holding data such as array data, which requires frequent memory access, in the cache memory is left to the management method of the cache memory of hardware. In the cache memory control method realized by hardware, even if the cache memory mishit can be recognized in the source program, the data allocation / update to the cache memory is uniquely processed. Therefore, it is an important problem from the viewpoint of improving the performance to arrange the data on the cache memory according to the characteristics of the user program by the software.

【０００６】上記の問題点をソフトウェアで解決する手
段として，ブロッキングと呼ばれる最適化技術が存在す
る。ブロッキング最適化の目的は，多重のループで構成
されるループ構造に対して，その最内ループで使用され
る配列をキャシュサイズに収まるように分割して，配列
データが常にキャシュメモリ上に保持されるようにルー
プの変形を行うことにある。As a means for solving the above problems by software, there is an optimization technique called blocking. The purpose of blocking optimization is to divide the array used in the innermost loop into a cache size for a loop structure composed of multiple loops so that the array data is always retained in the cache memory. To transform the loop like this.

【０００７】図１０の（イ）にブロッキングの最適化が
適用可能なループの例を示す。このループでは，配列ｂ
（ｋ，ｊ）は，インデックスｉが更新される毎に，同一
の配列要素をアクセスする。従って，最内ループにおけ
る配列ｂに対するメモリアクセスの軌跡は，図１０の
（ロ）に示すように，同じ軌跡が繰り返される。FIG. 10A shows an example of a loop to which blocking optimization can be applied. In this loop, the array b
(K, j) accesses the same array element every time the index i is updated. Therefore, as the locus of memory access to the array b in the innermost loop, the same locus is repeated as shown in (b) of FIG.

【０００８】もし，最内ループの配列ｂのサイズがキャ
シュサイズを超えているならば，一定間隔ごとに，キャ
シュメモリ上でミスヒットが生じることになる。キャシ
ュメモリを有効に利用するためには，配列ｂをキャシュ
サイズに格納できる大きさに分割し，分割を指示するル
ープを，最外ループの外に生成することが必要となる。If the size of the array b of the innermost loop exceeds the cache size, mishits will occur on the cache memory at regular intervals. In order to effectively use the cache memory, it is necessary to divide the array b into a size that can be stored in the cache size and generate a loop instructing the division outside the outermost loop.

【０００９】現状のブロッキングによる最適化では，以
下の問題点があった。 (1) 通常のユーザプログラムでは，ループ変形を実施せ
ずに，ブロッキングの最適化を実施できることは非常に
稀である。そのため，本最適化機能をより広範囲に適用
するためには，インデックス交換／ループ分割といっ
た，各種のループ変形の最適化との融合が必要になる。
この適用順序によって，ブロッキングが適用できるルー
プが認識できる場合とできない場合がある。しかしなが
ら，従来技術では，これらを考慮する処理が行われてい
ないので，広範囲に最適化を適用することはできなかっ
た。In the current optimization by blocking, there are the following problems. (1) In a normal user program, it is very rare that blocking optimization can be performed without performing loop transformation. Therefore, in order to apply this optimization function to a wider range, it is necessary to combine it with optimization of various loop transformations such as index exchange / loop division.
Depending on this order of application, there may or may not be a loop to which blocking can be applied. However, in the conventional technology, since the processing considering these is not performed, the optimization cannot be applied to a wide range.

【００１０】(2) ブロッキング可能なデータが検出でき
た場合，どのインデックスで配列を分割するかの決定が
性能向上に大きな影響を及ぼす。どのインデックスで配
列データを分割するかは，配列データの大きさ，対象候
補の個数およびループの実行順序関係に依存するが，従
来の処理系では，発見的な方法を取っているため，適切
な分割がなされないことがあった。(2) When blockingable data is detected, the determination of which index the array is divided into has a great influence on the performance improvement. The index at which the array data is divided depends on the size of the array data, the number of target candidates, and the loop execution order relationship. However, in conventional processing systems, a heuristic method is used, so it is appropriate. Sometimes the division was not done.

【００１１】(3) ハードウェアに実装されているキャシ
ュサイズとブロッキング対象となった配列データとか
ら，最適な分割値（またはブロック長）を決定する必要
がある。しかしながら，複数の配列データが分割対象と
なった場合を含めて最適な分割値の決定方式が確立され
ておらず，適切な分割値が得られないことがあった。(3) It is necessary to determine the optimum division value (or block length) from the cache size mounted on the hardware and the array data to be blocked. However, there is a case where an optimal division value cannot be obtained because the optimal division value determination method has not been established, including the case where a plurality of array data are divided.

【００１２】[0012]

【発明が解決しようとする課題】本発明は上記問題点の
解決を図り，翻訳対象プログラムについて，多重ループ
中のデータを高速にアクセスすることができる最適化を
広範囲に実施し，かつ適切な分割を可能とすることによ
り，プログラムの実行性能を向上させることを目的とす
る。DISCLOSURE OF THE INVENTION The present invention solves the above problems, extensively optimizes a program to be translated for high-speed access to data in multiple loops, and appropriately divides it. The purpose is to improve the execution performance of the program by enabling.

【００１３】[0013]

【課題を解決するための手段】図１は，本発明の原理説
明図である。図１において，１０はコンパイル対象とな
るソースプログラム，１１はコンパイラが動作するＣＰ
Ｕおよびメモリなどからなる処理装置，１２はソースプ
ログラム１０を入力し，構文解析および意味解析を行う
構文・意味解析部，１３は構文・意味解析部１２の処理
結果による中間テキストについて最適化を行う最適化処
理部，１４は命令スケジューリングやレジスタ割り付け
を行い，オブジェクトコードを生成して出力するオブジ
ェクト生成出力部，１５はコンパイル結果のオブジェク
トプログラムを表す。FIG. 1 is a diagram for explaining the principle of the present invention. In FIG. 1, 10 is a source program to be compiled, and 11 is a CP on which the compiler operates.
A processing device including U and a memory, 12 is a syntax / semantic analysis unit that inputs the source program 10 and performs syntax analysis and semantic analysis, and 13 is an optimization of an intermediate text according to a processing result of the syntax / semantic analysis unit 12. An optimization processing unit, 14 performs instruction scheduling and register allocation, and generates and outputs an object code. An object generation and output unit, 15 represents an object program as a compilation result.

【００１４】本発明では，主記憶とキャッシュメモリの
階層型メモリを有する計算機をターゲットとするソース
プログラム１０をコンパイルする際に，最適化処理部１
３において以下の処理を行う。In the present invention, when compiling a source program 10 targeting a computer having a hierarchical memory of a main memory and a cache memory, the optimization processing unit 1
In 3, the following processing is performed.

【００１５】タイトな構造を持つループを検出し，
ブロッキング可能なループを認識する。タイトな構造と
は，ループ制御変数以外のループ回帰変数がないこと，
ループ外への飛び出しがないこと，手続き呼出しが存在
しないことである。Detecting a loop having a tight structure,
Recognize blocking loops. A tight structure means that there are no loop regression variables other than loop control variables.
There is no jump out of the loop and there are no procedure calls.

【００１６】検出したループがタイトな構造を持た
ない場合，処理を行う。データの重なり解析を利用することにより，インデ
ックス交換が可能なループの範囲を識別し，最内ループ
における配列データのアクセス距離が短くなるように外
側ループと最内ループとを入れ換えて，インデックス交
換を実施する。If the detected loop does not have a tight structure, processing is performed. By using the overlap analysis of data, the range of the loop where the index exchange is possible is identified, and the outer loop and the innermost loop are exchanged so that the access distance of the array data in the innermost loop is shortened, and the index exchange is performed. carry out.

【００１７】ループの深さとインデックスとの関
係，最内ループの配列をアクセスするインデックスとル
ープとの関係により，ブロッキング可能な候補配列を抽
出し，候補配列の中から分割対象配列と分割対象インデ
ックスを決定する。Based on the relationship between the loop depth and the index and the relationship between the index that accesses the innermost loop array and the loop, a blocking candidate array is extracted, and the division target array and the division target index are selected from the candidate arrays. decide.

【００１８】分割後の配列がキャッシュメモリのサ
イズに収まるように，キャッシュメモリのサイズおよび
ループ回転数に基づいて，分割のブロック長を決定す
る。分割すべき配列データ，分割すべきインデックスお
よび分割のブロック長をもとに，分割数を指示するルー
プを元のループの外に生成し，ループの変形を行う。The block length for division is determined based on the size of the cache memory and the number of loop rotations so that the array after division fits within the size of the cache memory. Based on the array data to be divided, the index to be divided, and the block length of division, a loop indicating the number of divisions is generated outside the original loop, and the loop is transformed.

【００１９】検出したループがタイトな構造でない
場合には，ループ分割によってタイトなループ構造に再
構成する。タイトなループ構造への再構成が成功したならば，
処理以下を実行する。失敗したならば，そのループの
最適化を諦め，次のループの検出を行う。If the detected loop is not a tight structure, it is reconstructed into a tight loop structure by loop division. If the reconstruction to the tight loop structure is successful,
Process Perform the following. If it fails, give up the optimization of the loop and detect the next loop.

【００２０】処理によるループ変形によって，キャッ
シュメモリにおけるミスヒットを減少させ，実行性能を
向上させることができる。By modifying the loop due to the processing, mishits in the cache memory can be reduced and the execution performance can be improved.

【００２１】[0021]

【作用】本発明は，以下の方式を採用することにより，
従来技術によるブロッキングの問題点を改善するもので
ある。The present invention, by adopting the following method,
The problem of blocking by the prior art is improved.

【００２２】(1) ブロッキング可能な配列を認識する場
合に，ループ分割，インデックス交換等のループ変形の
最適化を利用して，ブロッキングの適用範囲を広げる。 (2) 最適な分割値（またはブロック長）を決定する。(1) When recognizing a blockable array, optimization of loop transformation such as loop division and index exchange is used to widen the applicable range of blocking. (2) Determine the optimal division value (or block length).

【００２３】ブロッキング可能な配列データを分割する
場合には，ループの回転数で割り切れて，かつキャシュ
メモリに十分に収まる大きさとなるように分割値を決定
する。また，ブロッキング可能な配列データが複数存在
する場合においても，複数の配列データがキャシュサイ
ズに十分収まるように，最適な分割値を決定する。When dividing the array data which can be blocked, the division value is determined so that it can be divided by the number of rotations of the loop and can be sufficiently accommodated in the cache memory. In addition, even when there are a plurality of block data that can be blocked, the optimum division value is determined so that the plurality of array data can fit within the cache size.

【００２４】これによって，主記憶へのアクセスを少な
くした効率のよいオブジェクトプログラムを生成するこ
とが可能になる。This makes it possible to generate an efficient object program with less access to the main memory.

【００２５】[0025]

【実施例】以下，図１に示す最適化処理部１３における
本発明に関係する部分の実施例の処理を，詳細に説明す
る。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS The processing of the embodiment of the part related to the present invention in the optimization processing section 13 shown in FIG. 1 will be described in detail below.

【００２６】１）ブロッキング可能なループの認識ブロッキングにより，実行結果が変わらない条件は，最
内ループに対して，(1) ループ制御変数以外のループ回
帰変数がない，(2) ループ外への飛び出しがない，(3)
手続き呼出しが存在しない，という３つの要件を満たす
タイトなループ構造を持つことである。また，この条件
に付け加えて，ブロッキングの最適化処理が実施可能で
あるためには，多重ループにおいて，最内ループの配列
の演算が最外ループのインデックスに依存せず，最外ル
ープのインデックスが変化しても，最内ループでは，常
にデータアクセスの法則性が変化しないことが必要であ
る。1) Recognizing a loop that can be blocked The condition that the execution result does not change due to blocking is that (1) there is no loop regression variable other than the loop control variable, and (2) outside the loop. No jumping out, (3)
It has a tight loop structure that satisfies the three requirements that there are no procedure calls. In addition to this condition, in order to be able to perform blocking optimization processing, in multiple loops, the operation of the array of the innermost loop does not depend on the index of the outermost loop, and the index of the outermost loop is Even if it changes, in the innermost loop, it is necessary that the law of data access does not always change.

【００２７】ブロッキングの対象ループは，上述したよ
うに，タイトなループに対してのみ実施されるため，対
象となる多重ループがタイトなループでない場合，タイ
トなループ構造への再構成を実施する。この処理によ
り，ブロッキング可能なループの適用範囲を広げること
ができる。As described above, the target loop for blocking is executed only for a tight loop. Therefore, when the target multiple loop is not a tight loop, reconfiguration to a tight loop structure is executed. By this processing, the applicable range of the loop capable of blocking can be expanded.

【００２８】図２は，そのループ分割によるループ再構
成の例を示している。ループ分割とは，図２の（イ）に
示すように，１つのループをそれぞれ同じループ制御変
数を持つ複数のループに分ける処理である。本発明で
は，タイトでない多重ループを複数のタイトなループに
分割することにより，タイトなループの最適化を可能と
する。タイトな多重ループとは，ループが直列の入れ子
をなし，最内のループのみに，実行文が出現する多重ル
ープのことを言う。FIG. 2 shows an example of loop reconstruction by the loop division. Loop division is a process of dividing one loop into a plurality of loops each having the same loop control variable, as shown in FIG. In the present invention, a tight loop can be optimized by dividing a non-tight multiple loop into a plurality of tight loops. A tight multiple loop is a multiple loop in which loops are nested in series and an executable statement appears only in the innermost loop.

【００２９】ループ分割は，以下の手順で行う。 (1) タイトなループの検出最内ループから最外ループへとループを検出し，最内ル
ープにしか実行文が存在しないループをタイトなループ
として認識する。それ以外のループは，タイトなループ
として検出できないので，タイトでない多重ループを複
数のタイトなループに分割する必要がある。The loop division is performed by the following procedure. (1) Tight loop detection Loops are detected from the innermost loop to the outermost loop, and a loop in which an executable statement exists only in the innermost loop is recognized as a tight loop. Other loops cannot be detected as tight loops, so it is necessary to divide the non-tight multiple loops into multiple tight loops.

【００３０】(2) タイトでない多重ループの検索と分割
対象ループの検索最内ループから外側ループへと検索し，以下のいずれか
の条件を満足するループをループ分割の対象とする。(2) Retrieval of non-tight multiple loops and retrieval of division target loops Retrieval is performed from the innermost loop to the outer loop, and a loop satisfying any of the following conditions is set as an object of loop division.

【００３１】−最内ループと外側ループとの間に実行文
（ループ制御のための命令は除く）が存在する。 −同じ深さのタイトな兄弟ループが存在する。-Execution statements (excluding instructions for loop control) exist between the innermost loop and the outer loop. -There are tight sibling loops of the same depth.

【００３２】−ループ内からの飛び出しが存在しない。 (3) ループ分割例えば，図２（ロ）の(a) に示す多重ループは，最内ル
ープと外側ループとの間に実行文Ａが存在する。のル
ープから検索すると，とのループの間に，実行文Ａ
が存在することがわかるので，がループ分割の対象と
なる。このループ分割により，ループは図２（ロ）の
(b) に示すようになる。There is no jump out of the loop. (3) Loop division For example, in the multiple loop shown in (a) of Fig. 2 (b), the executable statement A exists between the innermost loop and the outer loop. If you search from the loop of
Since it is known that there exists, is the target of loop division. Due to this loop division, the loop is
(b).

【００３３】図２（ハ）の(a) に示す多重ループは，同
じ深さのタイトな兄弟ループが存在する例である。の
ループから検索する。，のループは分割の対象とは
ならない。双方を検索した後，とを比較すると，同
じ深さのタイトな兄弟ループであることがわかるので，
図２（ハ）の(b) に示すようにループ分割を行う。The multiple loop shown in (a) of FIG. 2C is an example in which there are tight sibling loops having the same depth. Search from the loop. , Loops are not subject to splitting. After searching for both, and comparing with, you can see that they are tight sibling loops of the same depth.
Loop division is performed as shown in (b) of FIG.

【００３４】２）ループ入れ換え処理 2.1) インデックス交換可能ループの判定例えば，多次元配列がタイトな多重ループ内で，最内ル
ープの制御変数が一次元目以外の次元を引用している場
合，一次元目の配列要素を引用している外側ループが最
内ループになるようにループを入れ換える処理を，ルー
プのインデックス交換という。インデックス交換を実施
することにより，メモリ上の連続領域をアクセスするこ
とになり，メモリアクセス効率が向上する。2) Loop replacement processing 2.1) Judgment of index exchangeable loop For example, when the control variable of the innermost loop refers to a dimension other than the first dimension in a multiple loop where the multidimensional array is tight, The process of swapping loops so that the outer loop that refers to the original array element becomes the innermost loop is called loop index exchange. By executing the index exchange, the continuous area on the memory is accessed, and the memory access efficiency is improved.

【００３５】例えば図３の（イ）に示すループでは，
（ロ）に示す，，，…の順番で配列Ａの各要素に
アクセスすることになる。従って，不連続領域へのアク
セスとなる。配列Ｂも同様である。なお，この飛び飛び
にアクセスするデータの間隔を，配列データのアクセス
距離という。For example, in the loop shown in FIG.
Each element of the array A is accessed in the order of (b) ,. Therefore, the access is to the discontinuous area. The same applies to array B. The data access interval is called the array data access distance.

【００３６】インデックス交換により，図３の（ハ）に
示すように，最内ループと外側ループとを入れ換える
と，アクセスの順番は，（ニ）に示す，，，…の
ようになり，最内ループの配列Ａ（Ｉ，Ｊ），Ｂ（Ｉ，
Ｊ）は，連続領域のアクセスとなる。When the innermost loop and the outer loop are exchanged by the index exchange as shown in FIG. 3C, the access order is as shown in (D), ... Loop sequences A (I, J), B (I,
J) is a continuous area access.

【００３７】インデックス交換の目的は，このように配
列データのアクセス距離が最小となるようにすることで
ある。インデックス交換が可能であるかどうかは，ルー
プ中の配列要素の依存関係による。そこで，“データの
重なり解析”により，インデックス交換が可能なループ
の範囲を識別し，インデックス交換を実施する。The purpose of the index exchange is to minimize the access distance of array data in this way. Whether index exchange is possible depends on the dependency of array elements in the loop. Therefore, the range of loops in which the index can be exchanged is identified by "data overlap analysis", and the index is exchanged.

【００３８】データの重なり解析とは，次のような処理
である。一般に，多重ループのインデックス交換やルー
プ分割等のループの構造を大幅に変換する最適化では，
そのループ中の配列要素の依存関係を認識する必要があ
る。このとき，ループ内の配列要素の振る舞いを，添字
と制御変数の関係から単純化した形にすることで，デー
タの重なり（依存）関係を調べて，上記で述べた最適化
処理が実施できるかどうかを判断する。データの重なり
解析は，このようなときに行う処理であり，具体的には
次の情報を解析する処理である。The data overlap analysis is the following processing. In general, optimizations that greatly transform the structure of a loop, such as index exchange of multiple loops and loop division,
You need to be aware of the array element dependencies in the loop. At this time, if the behavior of the array elements in the loop is simplified from the relationship between the subscript and the control variable, the data overlap (dependency) relationship can be examined and the optimization processing described above can be performed. Determine whether The data overlap analysis is a process performed in such a case, and specifically, a process of analyzing the following information.

【００３９】(1) 配列要素のインデックス空間上での定
義・参照の順序関係。この情報解析では，個々の配列要素の定義・参照情報の
他に，最内ループで演算を実施する異なる配列要素同士
が，互いに同一領域をアクセスするかの情報も収集す
る。(1) Order relation of definition / reference in the index space of array elements. In this information analysis, in addition to the definition / reference information of each array element, information about whether different array elements that perform operations in the innermost loop access the same area with each other is also collected.

【００４０】(2) 単純変数同士の演算における先行順序
関係。以上のデータの重なり解析処理は，ベクトル化コンパイ
ラで採用されているデータの依存関係の解析と同様でよ
く，よく知られている処理であるので，ここでの説明は
この程度にとどめる。(2) Preceding order relation in the operation of simple variables. The above-described data overlap analysis process may be similar to the data dependency analysis adopted by the vectorizing compiler, and is well known, so the description here will be limited to this.

【００４１】もし，インデックス交換が不可能なループ
であれば，現状のインデックスの並びで，ブロッキング
可能かどうかの判定を行う。本方式を以下のような順序
に拡張することにより，ブロッキングを実施する配列が
決定した後に，インデックス交換を実施することが可能
である。If the loop is such that the index cannot be exchanged, it is judged whether or not blocking is possible with the current arrangement of indexes. By extending this method in the following order, it is possible to perform index exchange after the sequence for blocking is determined.

【００４２】(1) インデックス交換をする部分を，交換
できるインデックス空間に置き換える。 (2) “ブロッキング可能な配列データ”の候補集合を求
める。(1) The index exchange part is replaced with an exchangeable index space. (2) Obtain a candidate set of “blocking sequence data”.

【００４３】(3) “交換できるインデックス空間”でイ
ンデックス交換を実施する。 (1) 〜(3) の操作を繰り返し実施することにより，より
きめこまかなブロッキングの処理を実現できる。(3) The index is exchanged in the "exchangeable index space". By performing steps (1) to (3) repeatedly, more detailed blocking processing can be realized.

【００４４】３）ブロッキング可能な配列データの検出 3.1) 配列データ再利用の解析ループの深さとインデックスとの関係，最内ループの配
列をアクセスするインデックスとループとの関係によ
り，ブロッキング可能な配列を決定する。3) Detection of Blockable Sequence Data 3.1) Analysis of Sequence Data Reuse A blockable sequence is determined by the relationship between the loop depth and the index, and the index that accesses the array of the innermost loop and the loop. decide.

【００４５】本解析方法は，“データの重なり解析”の
ように，最内ループで主記憶アクセスされる定義と参照
のデータが重なっているか等の情報は一切収集しない。
ブロッキングに必要なデータの依存関係の解析は，解析
中のインデックスＸに対して，最内ループで，Ｘの値が
変化した時，同一順序の配列のアクセスが存在するかど
うかだけを調べればよい。This analysis method does not collect any information such as whether the definition and reference data accessed in the main memory in the innermost loop overlap, as in "data overlap analysis".
For the analysis of the data dependency required for blocking, it suffices to check whether or not there is an access to the array in the same order when the value of X changes in the innermost loop with respect to the index X being analyzed. .

【００４６】この配列データ再利用の解析処理について
の詳しい説明は，本実施例の説明の後に〔補足説明１〕
として述べる。 3.2) 分割対象配列および分割対象インデックスの決定ブロッキングをどの配列に対して実施するか，またどの
インデックスで実施するかの決定プロセスを図４に示
す。For a detailed description of the analysis processing of this sequence data reuse, [supplementary explanation 1] is given after the explanation of the present embodiment.
It is described as. 3.2) Determining the array to be divided and the index to be divided Figure 4 shows the process of deciding which array to perform blocking and which index to perform.

【００４７】(a) 分割対象インデックスの決定では，ま
ずブロッキング可能な候補配列およびインデックスの候
補の抽出を行う。 (b) 次に，キャッシュメモリのサイズと，ブロッキング
対象データのサイズとを比較する。(A) In determining the index to be divided, first, a candidate sequence that can be blocked and a candidate index are extracted. (b) Next, the size of the cache memory is compared with the size of the blocking target data.

【００４８】(c) 配列データのアクセス距離を計算す
る。 (d) 以上の結果から，配列の次元数を考慮し，分割イン
デックスを決定する。以上の処理(a) 〜(d) を繰り返
す。このときに注意すべき点は以下のとおりである。(C) The access distance of array data is calculated. (d) From the above results, the partition index is determined by considering the dimensionality of the array. The above processes (a) to (d) are repeated. The points to be noted at this time are as follows.

【００４９】(1) もしループの回転数が翻訳時に確定し
ないならば，配列の宣言を参照して，要素すべてをアク
セスするとみなして計算する。 (2) ブロッキング対象の配列データの形状が翻訳時に確
定しない場合，例えば“ｂ（ｉ＊８，ｋ＊８）”や“ｂ
（ｃ（ｉ，ｋ））”の場合，翻訳時に正確な分割値を求
めることができない。この場合には，ブロッキングを実
施しない。(1) If the number of rotations of the loop is not fixed at the time of translation, reference is made to the array declaration and all elements are considered to be accessed for calculation. (2) When the shape of the sequence data to be blocked is not fixed at the time of translation, for example, “b (i * 8, k * 8)” or “b
In the case of (c (i, k)) ", an accurate division value cannot be obtained during translation. In this case, blocking is not performed.

【００５０】(3) 次元数の考慮は，１次元配列，２次元
配列，Ｎ次元（Ｎ＞＝３）配列と分離して考える。な
お，この次元数の考慮については，後述する〔補足説明
２〕で詳しく説明する。(3) Considering the number of dimensions, the one-dimensional array, the two-dimensional array, and the N-dimensional (N> = 3) array are considered separately. The consideration of the number of dimensions will be described in detail later in [Supplementary Explanation 2].

【００５１】４）ブロック長の決定ブロッキングを行うにあたり，性能を大きく左右する要
因となるのが，“ブロッキング対象の配列をいくつに分
割するか”である。分割のブロック長をＳＴＲＩＤＥと
呼び，最適なＳＴＲＩＤＥを求めることが，ブロッキン
グ最適化の鍵となる。このＳＴＲＩＤＥを決定する方法
を，〔補足説明３〕として後述する。4) Determining Block Length In blocking, the factor that greatly affects the performance is "how many blocks to be divided into blocks". The block length of division is called STRIDE, and finding the optimum STRIDE is the key to blocking optimization. A method of determining this STRIDE will be described later as [Supplementary Explanation 3].

【００５２】ブロック長を決定する場合の最大の要因
は，ハードウェアが持つデータキャシュの大きさであ
る。データキャシュが大きい場合，より多量のデータを
データキャシュ上に保持することが可能（ＳＴＲＩＤＥ
を大きくとることが可能）であり，反対に少量のデータ
しかキャシュ上に保持することができないときには，Ｓ
ＴＲＩＤＥを小さくすることが必要となる。以下に，Ｓ
ＴＲＩＤＥの決定要因を示す。The largest factor in determining the block length is the size of the data cache of the hardware. If the data cache is large, a larger amount of data can be retained on the data cache (STRIDE
Can be made large), and conversely, when only a small amount of data can be held on the cache, S
It is necessary to reduce TRIDE. Below, S
The determinants of TRIDE are shown.

【００５３】(1) ループの回転数が翻訳時に確定してい
る場合分割後の配列がキャシュサイズに収まり，かつループの
回転数で割り切れるＳＴＲＩＤＥを選択する。ブロッキ
ング対象ループが２つあり，相互の回転数が違う場合に
は，最内ループの回転数に合わせて，ＳＴＲＩＤＥを決
定する。また回転数が素数の場合には，配列要素が収ま
る最大値をＳＴＲＩＤＥとする。(1) When the number of rotations of the loop is fixed at the time of translation: Select STRIDE that allows the array after division to fit within the cache size and is divisible by the number of rotations of the loop. When there are two blocking target loops and the rotation speeds are different from each other, STRIDE is determined according to the rotation speed of the innermost loop. If the rotation number is a prime number, STRIDE is set to the maximum value that the array elements can fit.

【００５４】(2) ループの回転数が翻訳時に確定してい
ない場合回転数が翻訳時に確定しない場合にもブロッキングを実
施することができる。このとき注意すべき点は，“ＳＴ
ＲＩＤＥの合計が回転数を超える場合”を常に考慮しな
ければならないことである。この場合には，以下の手順
でＳＴＲＩＤＥを決定する。(2) When the number of rotations of the loop is not fixed at the time of translation Blocking can be performed even when the number of rotations is not fixed at the time of translation. The point to be noted at this time is "ST
It is always necessary to consider "when the sum of RIDE exceeds the number of revolutions." In this case, STRIDE is determined by the following procedure.

【００５５】ｉ) 回転数が翻訳時に確定しない場合に
は，配列の要素すべてをアクセスするとみなし，キャシ
ュサイズに格納可能なＳＴＲＩＤＥを決定する。方法は
回転数が既知の場合と同様である。I) When the number of rotations is not fixed at the time of translation, it is considered that all the elements of the array are accessed, and STRIDE that can be stored in the cache size is determined. The method is the same as when the rotation speed is known.

【００５６】ii) ＳＴＲＩＤＥが決定した後，新たにル
ープを生成する。ループの回転数が不明であるため，Ｓ
ＴＲＩＤＥと回転数との比較のためのコードを生成する
必要がある。このとき，条件付転送命令を持つハードウ
ェアでは，ＳＴＲＩＤＥと回転数との比較を条件付転送
命令に置き換える。この置き換えにより，分岐のオーバ
ヘッドを縮小することが可能となる。Ii) After STRIDE is determined, a new loop is generated. Since the number of rotations of the loop is unknown, S
Code needs to be generated for the comparison of TRIDE with the speed. At this time, in the hardware having the conditional transfer instruction, the comparison between STRIDE and the rotation speed is replaced with the conditional transfer instruction. By this replacement, the branch overhead can be reduced.

【００５７】図５に，ＳＴＲＩＤＥの和とループの回転
数（Ｎ）の関係を示す。この図から明らかなように，ル
ープの回転数が不明である場合には，比較判断の命令が
必要になる。FIG. 5 shows the relationship between the sum of STRIDE and the rotation speed (N) of the loop. As is clear from this figure, if the number of rotations of the loop is unknown, an instruction for comparison and judgment is required.

【００５８】図６は，そのためのループの終了条件を判
定する命令の例を示している。例えば図６に示す“do i
=ii,min(ii+istride-1, n)”のように，翻訳時にループ
の回転数（変数ｎ）が確定しない場合には，ＳＴＲＩＤ
Ｅ（変数ｉｓｔｒｉｄｅ）のそれまでの合計値と，回転
数（ｎ）のうち小さいほうを選び，それをループの終了
条件とする。５）ループ変形分割すべき配列データ，分割すべきインデックス，およ
びブロック長（ＳＴＲＩＤＥ）が決定した後，分割数を
指示するループをオリジナルループの外に生成する。ブ
ロッキングを実施した場合，オリジナルループと比較し
て，ループが深くなるが，最内ループのブロッキング対
象となったデータがキャシュにミスヒットしないことを
考慮すると，新たに生成したループのオーバヘッドは無
視できる。〔補足説明１〕配列データの再利用解析配列データの再利用解析では，“最内ループで使用され
る配列データが，どのインデックス空間で使用されるの
か”のデータを収集する。このデータをインデックス・
ディペンデンス・ベクトル(Index dependence vector)
と呼ぶ。FIG. 6 shows an example of an instruction for determining a loop end condition therefor. For example, "do i
= ii, min (ii + istride-1, n) ”, if the loop rotation speed (variable n) is not fixed during translation, STRID
The smaller one of the total value of E (variable isride) and the number of revolutions (n) is selected, and the smaller one is used as the end condition of the loop. 5) Loop transformation After the array data to be divided, the index to be divided, and the block length (STRIDE) are determined, a loop indicating the division number is generated outside the original loop. When blocking is performed, the loop becomes deeper than the original loop, but the overhead of the newly created loop can be ignored considering that the data that was the target of blocking in the innermost loop does not hit the cache. . [Supplemental Explanation 1] Reuse Analysis of Sequence Data In the reuse analysis of sequence data, data of “in which index space the sequence data used in the innermost loop is used” is collected. Index this data
Index dependence vector
Call.

【００５９】例えば，図７の（イ）に示す例で説明する
と，Index dependence vector は，以下のようになる。
最内ループの配列データのインデックスｊ，ｋ，ｉと，
ループの深さとの関係は，図７の（ロ）に示すようにな
る。この各インデックスに対応したループの深さが，In
dex vectorの並びになる。このケースでは，Index vect
orは｛ｊ，ｋ，ｉ｝の順番になる。ここで，インデック
スが関与するものを“１”，関与しないものを“０”と
し，Index dependence vector を定める。For example, explaining with the example shown in FIG. 7A, the Index dependence vector is as follows.
The indices j, k, i of the array data of the innermost loop,
The relationship with the loop depth is as shown in FIG. The loop depth corresponding to each index is In
It is a sequence of dex vectors. In this case Index vect
or is in the order of {j, k, i}. Here, the index dependence vector is defined by defining “1” as the index-related one and “0” as the index-independent one.

【００６０】その結果，Index dependence vector は，
図７の（ハ）に示すように， c(i,j) ＝｛1,0,1 ｝： a(i,k) ＝｛0,1,1 ｝： b(k,
j) ＝｛1,1,0 ｝となる。As a result, the Index dependence vector is
As shown in FIG. 7C, c (i, j) = {1,0,1}: a (i, k) = {0,1,1}: b (k,
j) = {1,1,0}.

【００６１】これをもとに，最外ループに対して再利用
(Reuse)可能な配列データを求める。各ループのインデ
ックスのIndex vectorを定義し，Index dependence vec
torとの減算を実施する。減算した結果は，結果Vector
に格納される。Based on this, reuse for the outermost loop
(Reuse) Find possible array data. Index index of each loop is defined, and Index dependence vec
Perform subtraction with tor. The result of subtraction is the result Vector
Is stored in

【００６２】Result(k1,・・・,kn) ＝ Idv(j1,・・・,jn)
− Iv(i1,・・・,in) （ｎ：ループの深さ）各インデックスに対して減算の結果を求める。Result
(1) ＝｛r1,・・・,rn ｝，インデックスIn(1＜in＜n)で再
利用可能なデータは，該当するIndex Vectorの欄が('−
1'）の場合である。それ以外の数値の場合('０')は，ア
クセスする配列が調査対象のインデックスを含んでいる
ため，最外ループに対して再利用不可能となる。Result (k1, ..., kn) = Idv (j1, ..., jn)
-Iv (i1, ..., in) (n: depth of loop) Obtain the result of subtraction for each index. Result
(1) = {r1, ..., rn}, data that can be reused with index In (1 <in <n) has the corresponding Index Vector column ('-
1 '). If the value is any other value ('0'), the array to be accessed contains the index to be checked, and cannot be reused for the outermost loop.

【００６３】図７の例で説明すると，まず， Index Dependence Vector ： c(i,j) ＝ (1,0,1) a
(i,k) ＝ (0,1,1) b(k,j) ＝ (1,1,0) を求め，次に，Index ｊに対して再利用可能なデータを
収集する。インデックス'ｊ' のIndex Vectorは， Iv
(j)＝ (1,0,0)である。Explaining with the example of FIG. 7, first, Index Dependence Vector: c (i, j) = (1,0,1) a
Find (i, k) = (0,1,1) b (k, j) = (1,1,0) and then collect reusable data for Index j. Index Vector of index'j 'is Iv
(j) = (1,0,0).

【００６４】Index Dependence Vector とIndex Vector
の減算を行う。 c(i,j) : (1,0,1)− (1,0,0)＝ ( 0,0,1) a(i,k) : (0,1,1)− (1,0,0)＝ (-1,1,1) ★再利用可能 b(k,j) : (1,1,0)− (1,0,0)＝ ( 0,1,0) このとき，インデックス 'ｊ' に対して，配列ａ（ｉ，
ｋ）は再利用があることを意味し，インデックス 'ｉ',
'ｋ' に対してブロッキング可能となる。Index Dependence Vector and Index Vector
Is subtracted. c (i, j): (1,0,1) − (1,0,0) ＝ (0,0,1) a (i, k): (0,1,1) − (1,0,1) 0) ＝ (-1,1,1) ★ Reusable b (k, j): (1,1,0) − (1,0,0) ＝ (0,1,0) At this time, the index ' For j ', the array a (i,
k) means that there is reuse, index'i ',
Blocking is possible for'k '.

【００６５】以上のように簡単なベクトルの減算によ
り，再利用可能な配列データと，対象インデックスを求
めることができる。〔補足説明２〕ブロッキング対象データの次元数の考
慮［１次元配列の場合］分割対象の配列が１次元のとき，
分割対象となるインデックスは１つしかない。再利用可
能な配列データのインデックスが最内ループのインデッ
クスに依存していないならば，ブロッキングは実施しな
い。最内ループのインデックスに依存しているときのみ
分割する。As described above, the reusable array data and the target index can be obtained by the simple vector subtraction. [Supplementary explanation 2] Considering the number of dimensions of blocking target data [In the case of one-dimensional array] When the array to be divided is one-dimensional,
There is only one index to be divided. If the index of the reusable array data does not depend on the index of the innermost loop, blocking will not be performed. Split only when it depends on the index of the innermost loop.

【００６６】［２次元配列の場合］対象となる配列デー
タの各要素が，最内とそのすぐ外側のループのインデッ
クスでアクセスされるならば，両方のインデックスに対
して，ブロッキングを実施する。もし最内ループしか含
まないとき，最内ループのインデックスのみで分割す
る。両方のインデックスを含まないとき，ブロッキング
は実施しない。[In the case of a two-dimensional array] If each element of the target array data is accessed by the index of the innermost loop and the loop immediately outside thereof, blocking is performed for both indexes. If only the innermost loop is included, split only by the innermost loop index. If both indexes are not included, blocking is not performed.

【００６７】［３次元以上配列の場合］ −最内ループに着目してブロッキングを実施する。外側
ループに着目したブロッキングは翻訳コストの面から実
施しない。[In the case of an array of three or more dimensions] Blocking is performed by focusing on the innermost loop. Blocking focusing on the outer loop is not implemented in terms of translation cost.

【００６８】−ブロッキングを行うためにインデックス
交換が必要なときは，ブロッキングは実施しない。最内
ループのインデックスでアクセスされる配列のみブロッ
キングの対象とする。〔補足説明３〕ＳＴＲＩＤＥ（分割長）の決定ループの回転数が翻訳時に確定している場合，分割後の
配列がキャシュサイズに収まり，かつループの回転数で
割り切れるＳＴＲＩＤＥ（分割長）を選択する。-When the index exchange is necessary for blocking, blocking is not performed. Only the array accessed by the index of the innermost loop is subject to blocking. [Supplementary explanation 3] Determination of STRIDE (division length) If the rotation number of the loop is fixed at the time of translation, select the STRIDE (division length) that the array after division fits into the cache size and is divisible by the rotation number of the loop. .

【００６９】(1) 正方形に分割する場合キャシュサイズをＣ（Ｂｙｔｅ），ブロッキング対象の
配列データがＮ個存在するものとし，配列の１要素の大
きさをＥ（Ｂｙｔｅ）とすると，１つの配列に分配する
ことができるキャシュサイズはＳＴＲＩＤＥ＝ＳＱＲＴ((Ｃ／Ｅ) ／Ｎ) の式で求められる。なお，ＳＱＲＴは，平方根を求める
関数である。(1) Dividing into squares If the cache size is C (Byte), N pieces of array data to be blocked exist, and the size of one element of the array is E (Byte), one array The cache size that can be distributed in the following manner is obtained by the formula STRIDE = SQRT ((C / E) / N). Note that SQRT is a function for obtaining a square root.

【００７０】回転数は翻訳時に確定しているので，分割
値に近似していて，かつループの回転数で割り切れる数
値が最適なＳＴＲＩＤＥになる。ブロッキング対象ループが２つあり，相互の回転数
が違う場合最内ループの回転数をベースとして，最適なＳＴＲＩＤ
Ｅ値を決定する。外側のループの回転数の決定では，回
転数を超えるかどうかの判定を生成する。Since the rotation speed is fixed at the time of translation, the optimum STRIDE value is close to the division value and divisible by the rotation speed of the loop. If there are two blocking target loops and their rotation speeds are different, the optimal STRID is based on the rotation speed of the innermost loop.
Determine the E value. The determination of the number of rotations of the outer loop generates a determination whether the number of rotations is exceeded.

【００７１】回転数が素数のとき上記の式で求めたＳＴＲＩＤＥで分割する。ＳＴＲＩＤ
Ｅ値が回転数を超えたかの判定を生成する。When the rotation number is a prime number, division is performed by STRIDE obtained by the above formula. STRID
A determination is made as to whether the E value exceeds the rotational speed.

【００７２】(2) 最内ループのみブロッキングを実施す
る場合最内ループの最適なＳＴＲＩＤＥ値をＸとして，その他
のＮ−１個のループの回転数を Ic(i)｛１≦ｉ≦Ｎ−
１｝，対象配列の１要素の大きさをＢとすると，一つの
配列要素に分配できるキャシュサイズは，ＳＴＲＩＤＥ＝Ｃ／Ｂ＊ (Π Ic(Ｎ−１)) の式で求めることができる。これが，キャシュに収まる
ＳＴＲＩＤＥ長の最大値となる。以下，正方形に分割す
る方式と同様にして，最適なＳＴＲＩＤＥ長を決定す
る。次に，本発明の具体的な適用例について，図８およ
び図９に従って説明する。(2) Blocking only the innermost loop Letting X be the optimum STRIDE value of the innermost loop, the rotation speeds of the other N-1 loops are Ic (i) {1≤i≤N-
1}, where B is the size of one element of the target array, the cache size that can be distributed to one array element can be obtained by the formula STRIDE = C / B * (Π Ic (N-1)). This is the maximum STRIDE length that fits in the cache. Hereinafter, the optimum STRIDE length is determined in the same manner as the method of dividing into squares. Next, a specific application example of the present invention will be described with reference to FIGS.

【００７３】図８の（イ）に示すＦＯＲＴＲＡＮソ
ースプログラムについて，本発明による最適化を実施す
るものとする。なお，実際には，ソースプログラムを中
間テキストに変換したものについて最適化を行うが，こ
こでは，説明をわかりやすくするため，ソースプログラ
ム形式で表記する。The FORTRAN source program shown in FIG. 8A is to be optimized according to the present invention. Actually, the source program is converted into an intermediate text and optimization is performed, but here, in order to make the explanation easy to understand, the source program format is used.

【００７４】このプログラムのループを検出し，タイト
なループ構造であることがわかったならば，データの依
存関係の情報収集を実施した後に，最内ループの配列要
素のアクセスができるだけ連続となるように，インデッ
クス交換を実施する。If the loop of this program is detected and it is found that it has a tight loop structure, after collecting the information of the data dependency, the array elements of the innermost loop are accessed as continuously as possible. Then, exchange the index.

【００７５】インデックスＩが最内ループ，インデ
ックスＪが最外ループのインデックスとなるようにルー
プを変形する。これにより，図８の（ロ）に示すように
変形される。The loop is transformed so that the index I becomes the index of the innermost loop and the index J becomes the index of the outermost loop. As a result, it is transformed as shown in FIG.

【００７６】個々の配列要素に対して，Index depe
ndence vector を求める。インデックスとループの深さ
との関係から, Index dependence vector は，図８の
（ハ）に示すようになる。Index depe for each array element
Find the ndence vector. From the relationship between the index and the depth of the loop, the Index dependence vector is as shown in (c) of FIG.

【００７７】最外ループに対して，常に同じデータ
アクセスの軌跡を描く配列を求める。図９の（イ）に示
すように，Ｊの Index vector は（１，０，０）であ
る。これをもとに，各配列のIndex dependence vector
と，Ｊの Index vector との減算を行う。この結果，Ｊ
のインデックス部が“−１”の配列要素，すなわちＡ
（Ｉ，Ｋ）が，インデックスＪに対して，常に同じデー
タのアクセスの軌跡を描くデータであることがわかる。For the outermost loop, an array that always draws the same data access locus is obtained. As shown in FIG. 9A, the index vector of J is (1, 0, 0). Based on this, the Index dependence vector of each array
And the J's Index vector. As a result, J
Array element whose index part is "-1", that is, A
It can be seen that (I, K) is data that always draws the same data access locus with respect to the index J.

【００７８】図９の（ロ）に示す処理を行う。すな
わち，配列Ａ（Ｉ，Ｋ）とキャッシュサイズ（Ｃとす
る）を比較し，Ａ（Ｉ，Ｋ）＜Ｃのとき，本最適化は必
要ないので，ループ変形は実施しない。Ａ（Ｉ，Ｋ）＞
Ｃのとき，Ａ（Ｉ，Ｋ）＜Ｃとなるように，Ｉ，Ｋを分
割する。The process shown in FIG. 9B is performed. That is, the array A (I, K) is compared with the cache size (denoted by C), and when A (I, K) <C, the main optimization is not necessary, so the loop transformation is not executed. A (I, K)>
When C, I and K are divided so that A (I, K) <C.

【００７９】ここで，Ａ（Ｉ，Ｋ）＞Ｃであるとする。
前述した方法により，分割のサイズを計算し，それぞれ
ＩＳＴＲＩＤＥ，ＫＳＴＲＩＤＥとする。この分割長ＩＳＴＲＩＤＥ，ＫＳＴＲＩＤＥに従っ
て，ループ変形を実施する。すなわち，分割数を指示す
るループ“DO KK=1,R,KSTRIDE", “DO II=1,N,ISTRIDE
”を，元のループの外に生成する。また，Ｋ，Ｊの回
転数が不明であることから，Ｋの回転数Ｒと分割数の合
計，Ｉの回転数Ｎと分割数の合計の比較によるループ終
了条件を定める命令を生成する。この結果，ループは，
図９の（ハ）に示すように変形される。図９の（ハ）に
示す最適化結果に従って，オブジェクトプログラムを生
成すれば，キャッシュメモリのミスヒットが少ない性能
のよいオブジェクトプログラムが得られることになる。Here, it is assumed that A (I, K)> C.
The size of the division is calculated by the above-mentioned method, and is set as ISTRIDE and KSTRIDE, respectively. The loop transformation is performed according to the division lengths ISTRIDE and KSTRIDE. That is, the loop that indicates the number of divisions "DO KK = 1, R, KSTRIDE", "DO II = 1, N, ISTRIDE
Is generated outside the original loop. Also, since the rotation speeds of K and J are unknown, the rotation speed R of K and the total number of divisions, and the rotation speed N of I and the total number of divisions are compared. Generates an instruction that determines the loop end condition by.
It is transformed as shown in FIG. If an object program is generated in accordance with the optimization result shown in FIG. 9C, an object program with good performance with few cache memory miss hits can be obtained.

【００８０】[0080]

【発明の効果】以上説明したように，本発明によれば，
ブロッキングを行う前に，ループ分割，インデックス交
換等のループ変形を実施することにより，ブロッキング
の適用範囲を広げることが可能となる。また，ブロッキ
ングの効果により，一番実行コストが高い最内ループで
のキャシュメモリへのミスヒットを減少させ，主記憶参
照アクセスのオーバヘッドを削減することができるよう
になる。As described above, according to the present invention,
By implementing loop transformation such as loop division and index exchange before blocking, it is possible to broaden the range of application of blocking. In addition, due to the blocking effect, it is possible to reduce the miss hit to the cache memory in the innermost loop, which has the highest execution cost, and to reduce the overhead of main memory reference access.

[Brief description of the drawings]

【図１】本発明の原理説明図である。FIG. 1 is a diagram illustrating the principle of the present invention.

【図２】本発明の実施例に係るループ分割によるループ
再構成の例を示す図である。FIG. 2 is a diagram showing an example of loop reconstruction by loop division according to an embodiment of the present invention.

【図３】本発明の実施例に係るインデックス交換の説明
図である。FIG. 3 is an explanatory diagram of index exchange according to the embodiment of the present invention.

【図４】本発明の実施例に係る分割対象インデックス決
定プロセス説明図である。FIG. 4 is an explanatory diagram of a division target index determination process according to the embodiment of the present invention.

【図５】本発明の実施例に係るＳＴＲＩＤＥの合計と回
転数の比較判断説明図である。FIG. 5 is an explanatory diagram for comparing and determining the total STRIDE and the rotation speed according to the embodiment of the present invention.

【図６】本発明の実施例に係るループの終了条件の判定
説明図である。FIG. 6 is an explanatory diagram of determination of a loop end condition according to the embodiment of the present invention.

【図７】本発明の実施例に係るループの深さ／インデッ
クス／配列アクセスの関係説明図である。FIG. 7 is a diagram illustrating the relationship between loop depth / index / array access according to an embodiment of the present invention.

【図８】本発明の適用例説明図である。FIG. 8 is an explanatory diagram of an application example of the present invention.

【図９】本発明の適用例説明図である。FIG. 9 is an explanatory diagram of an application example of the present invention.

【図１０】本発明に関係する最内ループのメモリアクセ
スの軌跡説明図である。FIG. 10 is an explanatory diagram of a locus of memory access of the innermost loop related to the present invention.

[Explanation of symbols]

１０ソースプログラム１１処理装置１２構文・意味解析部１３最適化処理部１４オブジェクト生成出力部１５オブジェクトプログラム 10 Source Program 11 Processing Device 12 Syntax / Semantic Analysis Unit 13 Optimization Processing Unit 14 Object Generation Output Unit 15 Object Program

───────────────────────────────────────────────────── フロントページの続き (56)参考文献「情報処理学会第44回（平成４年前期）全国大会講演論文集」Ｐ．５−93〜５−94（1992−２) 「ＰＲＯＣＥＥＤＩＮＧＯＦ 16ＴＨＡＮＮＵＡＬＩＮＴＥＲＮＡＴＩＯＮＡＬＳＹＭＰＯＳＩＵＭＯＮＣＯＭＰＵＴＥＲＡＲＣＨＩＴＥＣＴＵＲＥ」（1989）Ｐ．242−251 ─────────────────────────────────────────────────── ─── Continuation of the front page (56) Bibliography "Proceedings of the 44th National Conference of the Information Processing Society of Japan" 5-93 to 5-94 (1992-2), "PROCEEDING OF 16T H ANNNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITEC TURE" (1989) P. 242-251

Claims

(57) [Claims]

1. A source program including a multiple loop (10)
In a computer language processing method for converting a program into an object program (15) that runs on a computer with hierarchical memory, a process (,) that detects loops with a tight structure and recognizes blocking loops, By using the overlap analysis of data, the range of the loop where the index exchange is possible is identified, and the outer loop and the innermost loop are exchanged so that the access distance of the array data in the innermost loop is shortened, and the index exchange is performed. Blockable candidate sequences are extracted based on the relationship between the processing step () to be performed, the loop depth and the index, and the index to access the innermost loop array and the loop. And the process of determining the partition target index (), and the array after partition is the size of the cache memory Based on the process () to determine the block length of the partition based on the size of the cache memory and the number of rotations of the loop, and the array data to be partitioned, the index to be partitioned, and the block length of the partition so that A computer language processing method characterized in that a loop for instructing the number of divisions is generated outside the original loop, and a processing step () for deforming the loop is performed, and optimization processing of the generated object program is performed.

2. The computer language processing method according to claim 1, further comprising a processing step () for reconstructing a tight loop structure by loop division when the blocking target loop is not a tight loop structure. Computer language processing method.