JP2024069960A

JP2024069960A - RESOURCE RECONFIGURATION PROGRAM, RESOURCE RECONFIGURATION METHOD, AND INFORMATION PRO

Info

Publication number: JP2024069960A
Application number: JP2022180279A
Authority: JP
Inventors: 真弘三輪; Masahiro Miwa
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2022-11-10
Filing date: 2022-11-10
Publication date: 2024-05-22

Abstract

To make it possible to appropriately determine whether or not to perform resource reconfiguration. SOLUTION: A processing unit 12 calculates a first time required from reconfiguration to job completion when a job is allocated to a first node, which requires reconfiguration of resources to execute the job, on the basis of a reconfiguration time required for reconfiguring the first node, the job execution time after the reconfiguration, and a first coefficient representing the influence on the execution time of communication contention due to communication when the job is executed by the first node. The processing unit 12 calculates a second time required until job completion when the job is allocated to a second node, which does not require resource reconfiguration, on the basis of the execution time and a second coefficient representing the influence on the execution time of communication contention due to communication when the job is executed by the second node. The processing unit 12 performs the resource reconfiguration in the first node when the first time is shorter than the second time, and does not perform the resource reconfiguration in the first node when the first time is equal to or longer than the second time. SELECTED DRAWING: Figure 1

Description

The present invention relates to a resource reconfiguration program, a resource reconfiguration method, and an information processing system.

Systems are currently in use that pool hardware resources and change the resource configuration of a node by switching the connections between nodes and resources via a switch. Such a system is called a disaggregated system. Examples of resources that can be pooled include GPUs (Graphics Processing Units), FPGAs (Field Programmable Gate Arrays), and SSDs (Solid State Drives).

A management device has also been proposed that considers not only the case where a service is provided by one VM (Virtual Machine) but also the case where it is provided by multiple VMs, thereby reducing the hardware resources used for the service within a range that still satisfies the SLA (Service Level Agreement).

JP 2018-116556 A

When a job is assigned to a node and that node lacks the resources the job requires, the node's resources must be reconfigured. Resource reconfiguration takes time and is a cause of delay in job completion.

On the other hand, if there is a node on which the resources required by the job are already configured, assigning the job to that node makes resource reconfiguration unnecessary. However, communication performed while that node executes the job may contend with communication between other nodes. Such communication contention is also a cause of delay in job completion.

For this reason, simply assigning a job to a node that does not require resource reconfiguration does not necessarily complete the job sooner than performing reconfiguration would.
In one aspect, the present invention aims to make it possible to appropriately determine whether or not to reconfigure resources.

In one aspect, a resource reconfiguration program is provided. This resource reconfiguration program causes a computer to execute the following process. The computer calculates a first time, required from reconfiguration to completion of a job when the job is assigned to a first node, based on a reconfiguration time required for reconfiguring the first node, which is a candidate for job allocation and requires resource reconfiguration to execute the job, the execution time of the job after reconfiguration, and a first coefficient representing the effect on the execution time of communication contention accompanying communication during execution of the job by the first node. The computer calculates a second time, required to complete the job when the job is assigned to a second node without reconfiguring resources, based on the execution time and a second coefficient representing the effect on the execution time of communication contention accompanying communication during execution of the job by the second node, which is a candidate for job allocation and does not require resource reconfiguration. The computer compares the first time with the second time. If the first time is shorter than the second time, it reconfigures the resources in the first node; if the first time is equal to or longer than the second time, it does not reconfigure the resources in the first node.

In one aspect, a resource reconfiguration method executed by a computer is provided. In another aspect, an information processing system having a storage unit and a processing unit is provided.

In one aspect, it is possible to appropriately determine whether or not to reconfigure resources.

FIG. 1 is a diagram illustrating an information processing system according to a first embodiment.
FIG. 2 is a diagram illustrating an example of an information processing system according to a second embodiment.
FIG. 3 is a diagram illustrating an example of hardware of a management device.
FIG. 4 is a diagram illustrating an example of connections between nodes.
FIG. 5 is a diagram illustrating an example of connections between nodes in a rack and a resource pool.
FIG. 6 is a diagram illustrating an example of free nodes that are candidates for job allocation.
FIG. 7 is a diagram illustrating an example in which communication contention occurs.
FIG. 8 is a diagram illustrating an example of reconfiguration of node resources.
FIG. 9 is a diagram illustrating an example in which communication contention is avoided.
FIG. 10 is a diagram illustrating an example of functions of the management device.
FIG. 11 is a diagram illustrating an example of a job management table.
FIG. 12 is a diagram illustrating an example of a reference communication time table.
FIG. 13 is a diagram illustrating an example of a communication performance table.
FIG. 14 is a diagram illustrating an example of an evaluation value table.
FIG. 15 is a flowchart illustrating an example of processing by a job scheduler.
FIG. 16 is a flowchart illustrating an example of a node selection process.
FIG. 17 is a flowchart illustrating an example of a process for extracting node combinations.
FIG. 18 is a flowchart illustrating an example of a process for evaluating node combinations.
FIG. 19 is a diagram illustrating an example of differences in the total time related to the execution of a job.

The present embodiment will be described below with reference to the drawings.
[First embodiment]
A first embodiment will be described.

FIG. 1 is a diagram illustrating an information processing system according to a first embodiment.
The information processing system 1 has an information processing device 10 and nodes 20, 20a, 20b, 20c, 20d, .... The information processing device 10 and the nodes 20, 20a, ... are connected to a management network 30. The nodes 20, 20a, ... are connected to an inter-node network 40. The management network 30 is used for communication between the information processing device 10 and the nodes 20, 20a, .... The inter-node network 40 is used for communication between the nodes.

The information processing device 10 has a storage unit 11 and a processing unit 12. The storage unit 11 may be a volatile semiconductor memory such as a RAM (Random Access Memory), or non-volatile storage such as an HDD (Hard Disk Drive) or flash memory. The processing unit 12 is, for example, a processor such as a CPU (Central Processing Unit), a GPU, or a DSP (Digital Signal Processor). However, the processing unit 12 may include a special-purpose electronic circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA. The processor executes a program stored in a memory such as a RAM (which may be the storage unit 11). A collection of multiple processors is sometimes called a "multiprocessor" or simply a "processor."

The nodes 20, 20a, ... have the same kind of hardware as the information processing device 10. The functions of the information processing device 10 may be provided by any of the nodes 20, 20a, .... In other words, any of the nodes 20, 20a, ... may operate as the information processing device 10.

The information processing system 1 is a disaggregated system. That is, the information processing system 1 can change the resource configuration of each of the nodes 20, 20a, .... The information processing device 10 manages the resource configuration of each of the nodes 20, 20a, ... and controls the reconfiguration of the resources of each of the nodes 20, 20a, .... The resources include, for example, GPUs, FPGAs, and SSDs.

The processing unit 12 accepts a job execution request. The job execution request includes the number of nodes required for execution, information on the resources required to execute the job, and information on the job execution time. The resource information includes the type and amount of each resource required per node, such as the number of GPUs or the SSD capacity. The job execution time is the time taken to execute the job. The processing unit 12 assigns the job to the requested number of nodes among the nodes 20, 20a, ..., and causes the assigned nodes to execute the job. The assigned nodes are nodes that have the requested resources.

The resources connected to the nodes 20, 20a, ... are pooled and managed in, for example, a resource pool provided for each group of the nodes 20, 20a, .... A resource pool is a device that aggregates multiple resources. A group is, for example, the set of nodes mounted in the same rack. For example, each node belonging to a group and each resource of the resource pool corresponding to that group are connected via a connection device corresponding to the group, such as a PCIe (Peripheral Component Interconnect-Express) switch. In this case, the resources of a node are reconfigured by having the connection device change the connection relationship between nodes and resources within the group. Note that the resource pools and connection devices are not shown in FIG. 1.

When a job is assigned to a node that does not have the requested resources, the resources of that node must be reconfigured. Resource reconfiguration involves a configuration change such as increasing the number of GPUs available to the node. Resource reconfiguration takes, for example, from several seconds to several minutes. Specifically, if a node supports hot plug/remove, resource reconfiguration takes on the order of several seconds. If a node does not support hot plug/remove and must be rebooted, resource reconfiguration takes several minutes.

Information indicating the time required for resource reconfiguration is stored in advance in the storage unit 11, according to whether the nodes of the information processing system 1 support hot plug/remove. The job execution time information included in the job execution request is also stored in the storage unit 11.

When the nodes 20, 20a, ... include a first node and a second node that are both candidates for the assignment of a job, the processing unit 12 selects the node to which the job is assigned from the first node and the second node as follows. The first node is a node that requires resource reconfiguration to execute the job. The second node is a node on which the resources required by the job are already configured and which does not require resource reconfiguration to execute the job.
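The distinction between the two kinds of candidate can be sketched in Python as follows. This is only an illustrative sketch: the names `JobRequest` and `needs_reconfiguration`, the resource keys such as `"gpu"`, and the representation of a node's resources as a dictionary are assumptions made here and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical model of a job execution request."""
    num_nodes: int                      # number of nodes required for execution
    resources_per_node: Dict[str, int]  # resource type -> amount required per node
    execution_time: float               # job execution time T_job, in seconds


def needs_reconfiguration(node_resources: Dict[str, int], request: JobRequest) -> bool:
    """Return True if the node lacks any resource the job requires per node.

    A node for which this returns True plays the role of a "first node";
    a node for which it returns False plays the role of a "second node".
    """
    return any(
        node_resources.get(kind, 0) < amount
        for kind, amount in request.resources_per_node.items()
    )
```

For example, with a request for 4 GPUs per node, a node currently configured with 2 GPUs would be a first node, and a node with 4 GPUs would be a second node.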

The processing unit 12 calculates a first time required from reconfiguration to completion of the job when the job is assigned to the first node. The first time is the total time required to reconfigure the resources of the first node and then to run the job on the reconfigured first node from start to completion. The processing unit 12 calculates the first time based on the reconfiguration time, the execution time of the job after the reconfiguration, and a first coefficient that represents the effect on the execution time of communication contention associated with communication during job execution by the first node.

Let the first time be T_{Total1}, the reconfiguration time be T_{reconf}, the job execution time be T_{job}, and the first coefficient be a1, where a1 is a real number equal to or greater than 1. The first time T_{Total1} is calculated, for example, by equation (1).

T_{Total1} = T_{reconf} + a1 × T_{job} ... (1)
The processing unit 12 also calculates a second time required to complete the job when the job is assigned to the second node. The second time is the total time required from the start of job execution on the second node to the completion of execution. The second time differs from the first time in that it does not include a resource reconfiguration time. The processing unit 12 calculates the second time based on the execution time of the job and a second coefficient representing the effect on the execution time of communication contention associated with communication during job execution by the second node.

Let the second time be T_{Total2} and the second coefficient be a2, where a2 is a real number equal to or greater than 1. The second time T_{Total2} is calculated, for example, by equation (2).
T_{Total2} = a2 × T_{job} ... (2)
The processing unit 12 compares the first time with the second time. If the first time is shorter than the second time (T_{Total1} < T_{Total2}), the processing unit 12 reconfigures resources in the first node. If the first time is equal to or longer than the second time (T_{Total1} ≧ T_{Total2}), the processing unit 12 does not reconfigure resources in the first node.
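Equations (1) and (2) and the comparison between them can be sketched in Python as follows. This is a minimal sketch rather than the implementation described in the publication: the `Candidate` class and the function names are hypothetical, and the second node is modelled simply as a candidate whose reconfiguration time is zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one candidate node."""
    name: str
    contention_coeff: float   # a1 or a2; a real number >= 1
    reconf_time: float = 0.0  # T_reconf in seconds; 0 when no reconfiguration is needed


def total_time(candidate: Candidate, job_time: float) -> float:
    """T_total = T_reconf + a * T_job (equation (1); equation (2) when T_reconf = 0)."""
    if candidate.contention_coeff < 1.0:
        raise ValueError("contention coefficient must be >= 1")
    return candidate.reconf_time + candidate.contention_coeff * job_time


def should_reconfigure(first: Candidate, second: Candidate, job_time: float) -> bool:
    """Reconfigure the first node only if T_Total1 is strictly shorter than T_Total2."""
    return total_time(first, job_time) < total_time(second, job_time)
```

Because the comparison is strict, a tie (T_Total1 equal to T_Total2) leaves the first node unreconfigured, matching the rule stated above.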

The processing unit 12 may measure, using a predetermined communication benchmark program, a first communication time of communication between the nodes belonging to a first group of nodes, which is a candidate for job allocation and includes the first node, and calculate the first coefficient a1 based on the measurement result. More specifically, the processing unit 12 may obtain in advance a reference communication time between nodes in the absence of communication contention, using the same number of nodes as belong to the first group, and calculate the first coefficient a1 based on the ratio between the first communication time and the reference communication time. Similarly, the processing unit 12 may calculate the second coefficient a2 based on the measurement result of a second communication time of communication between the nodes belonging to a second group of nodes, which is a candidate for job allocation and includes the second node.
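One way to turn the two measurements into a coefficient is sketched below in Python. The source only states that the coefficient is calculated based on the ratio of the measured communication time to the reference communication time; using the ratio directly and clamping it to a minimum of 1 (so that the coefficient stays a real number of at least 1) are assumptions of this sketch, and the function name is hypothetical.

```python
def contention_coefficient(measured_time: float, reference_time: float) -> float:
    """Estimate a contention coefficient from benchmark measurements.

    measured_time:  communication time measured among the candidate nodes.
    reference_time: communication time for the same number of nodes
                    when there is no communication contention.

    Assumption of this sketch: the coefficient is the plain ratio,
    clamped so that it is never below 1.
    """
    if reference_time <= 0.0:
        raise ValueError("reference_time must be positive")
    if measured_time < 0.0:
        raise ValueError("measured_time must be non-negative")
    return max(1.0, measured_time / reference_time)
```

With a reference time of 2.0 and a measured time of 3.0, the sketch yields a coefficient of 1.5; a measured time at or below the reference yields 1.0, which corresponds to no effect of contention on the execution time.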

The processing unit 12 may also calculate the first coefficient a1 and the second coefficient a2 based on the amount of resources required per node to execute the job. The larger the amount of resources required per node, the larger the amount of computation at each node and the larger the amount of communication between nodes are presumed to be. For this reason, the processing unit 12 may make the first coefficient a1 and the second coefficient a2 larger as the amount of resources increases.

When reconfiguring resources in the first node, the processing unit 12 instructs the first node to reconfigure its resources, causes the first node to execute the reconfiguration, and sets the first node as the node to which the requested job is assigned. When not reconfiguring resources in the first node, the processing unit 12 sets the second node, not the first node, as the node to which the requested job is assigned. The processing unit 12 then causes the assigned node to execute the job.

FIG. 1 shows examples of candidate nodes for allocation. Assume that the number of nodes required for the job is two. For example, the first candidate #1 is the combination of nodes 20a and 20b. The second candidate #2 is the combination of nodes 20a and 20c. For simplicity of explanation, it is assumed that node 20a does not require resource reconfiguration to execute the job and has already been fixed as an allocation destination for the job.

Node 20b requires resource reconfiguration to execute the job. In other words, node 20b corresponds to the first node. Node 20c does not require resource reconfiguration to execute the job. In other words, node 20c corresponds to the second node.

When node 20b is selected as the job allocation destination, it is assumed, for example, that no communication contention arises from sharing a communication path with existing communication between other nodes in the inter-node network 40. As an example, assume that nodes 20 and 20d are executing another job and communicating over a communication path 41 that is part of the inter-node network 40. A communication path 42, which is the part of the inter-node network 40 used for communication between nodes 20a and 20b, is separate from the communication path 41. Therefore, communication between nodes 20 and 20d does not contend with communication between nodes 20a and 20b. In this case, the first coefficient a1 is a1 = 1. The first coefficient a1 = 1 corresponds to the case where communication contention has no effect on the job execution time.

On the other hand, when node 20c is selected as the job allocation destination, it is assumed that communication contention arises from sharing a communication path with existing communication between other nodes in the inter-node network 40. As an example, nodes 20a and 20c would communicate over the communication path 41 of the inter-node network 40, so communication contention occurs on the communication path 41 with the existing communication between nodes 20 and 20d. In this case, the second coefficient a2 is a2 > 1.

When a1 = 1 or a1 is close to 1, the first time T_{Total1} is sensitive to the reconfiguration time T_{reconf}, whereas the second time T_{Total2} is sensitive to the second coefficient a2. Therefore, when the reconfiguration time T_{reconf} is relatively short and the second coefficient a2 is relatively large, T_{Total1} < T_{Total2}. In this case, reconfiguring node 20b and assigning the job to node 20b reduces the delay until the job completes. Conversely, when the reconfiguration time T_{reconf} is relatively long and the second coefficient a2 is relatively small, T_{Total2} ≦ T_{Total1}. In this case, assigning the job to node 20c without reconfiguring node 20b reduces the delay until the job completes.
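The trade-off described here follows directly from equations (1) and (2). Assuming T_{job} > 0, the condition for reconfiguration can be rearranged into a threshold on the reconfiguration time; the derivation below is a worked restatement of the two equations, not an additional formula from the source.

```latex
\begin{align*}
T_{Total1} < T_{Total2}
  &\iff T_{reconf} + a_1 T_{job} < a_2 T_{job} \\
  &\iff T_{reconf} < (a_2 - a_1)\, T_{job}
\end{align*}
```

As a purely hypothetical illustration, with a_1 = 1, a_2 = 1.2, and T_{job} = 600 s, the threshold is (1.2 - 1) × 600 s = 120 s. A reconfiguration that takes a few seconds would then be worthwhile, whereas one that takes 180 s would not.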

Although the above example uses a1 = 1, even when a1 > 1 the processing unit 12 can evaluate the total time required for job execution using T_{reconf}, a1, and a2, and determine whether or not to reconfigure resources based on the evaluation result.

As described above, according to the information processing device 10, the first time T_{Total1} is calculated based on the reconfiguration time T_{reconf} required for reconfiguring the resources of the first node, the job execution time T_{job}, and the first coefficient a1. The first coefficient a1 represents the effect on the execution time T_{job} of communication contention associated with communication during execution of the job by the first node. In addition, the second time T_{Total2} is calculated based on the second coefficient a2 and the job execution time T_{job}. The second coefficient a2 represents the effect on the execution time T_{job} of communication contention associated with communication during execution of the job by the second node. Then, the first time T_{Total1} and the second time T_{Total2} are compared. If the first time T_{Total1} is shorter than the second time T_{Total2}, the resources in the first node are reconfigured. If the first time T_{Total1} is equal to or longer than the second time T_{Total2}, the resources in the first node are not reconfigured.

This allows the information processing device 10 to appropriately determine whether or not to reconfigure resources, taking the effect of communication contention into account. In addition, by considering both the resource reconfiguration time and the effect of communication contention, the information processing device 10 can select the node to which a job is assigned so as to shorten the time required to complete the job.

[Second embodiment]
Next, a second embodiment will be described.
FIG. 2 is a diagram illustrating an example of an information processing system according to the second embodiment.

The information processing system 2 has a management device 100 and nodes 200, 200a, 200b, .... The information processing system 2 is a disaggregated system that can change the hardware resource configuration of each of the nodes 200, 200a, 200b, .... The management device 100 and the nodes 200, 200a, 200b, ... are connected to a management network 50. The nodes 200, 200a, 200b, ... are connected to an inter-node network 60. The management network 50 is, for example, an Ethernet (registered trademark) network. The inter-node network 60 is, for example, an InfiniBand network. However, the inter-node network 60 may be another type of network, such as Ethernet.

The management device 100 is a server computer that accepts a job execution request and controls the allocation of jobs to the nodes 200, 200a, 200b, ... based on the execution request. The job execution request includes the number of nodes required to execute the job, the amount of hardware resources required per node, and the execution time of the job. The management device 100 controls the reconfiguration of resources of the nodes 200, 200a, 200b, ... as needed for job allocation. The management device 100 is an example of the information processing device 10 of the first embodiment.

The nodes 200, 200a, 200b, ... are server computers that execute assigned jobs. The nodes 200, 200a, 200b, ... can communicate with each other via the inter-node network 60. For example, when a job is executed using two nodes, the two nodes execute the job while communicating with each other via the inter-node network 60.

FIG. 3 is a diagram illustrating an example of hardware of the management device.
The management device 100 has a CPU 101, a RAM 102, an HDD 103, a GPU 104, an input interface 105, a media reader 106, and a communication interface 107. These units are connected to a bus inside the management device 100. The CPU 101 corresponds to the processing unit 12 of the first embodiment. The RAM 102 or the HDD 103 corresponds to the storage unit 11 of the first embodiment.

The CPU 101 is a processor that executes program instructions. The CPU 101 loads at least part of the programs and data stored in the HDD 103 into the RAM 102 and executes the programs. The CPU 101 may include multiple processor cores. The management device 100 may also have multiple processors. The processes described below may be executed in parallel using multiple processors or processor cores. A collection of multiple processors is sometimes called a "multiprocessor" or simply a "processor."

The RAM 102 is a volatile semiconductor memory that temporarily stores programs executed by the CPU 101 and data used by the CPU 101 for computation. Note that the management device 100 may include a type of memory other than RAM, and may include multiple memories.

ＨＤＤ１０３は、ＯＳ（Operating System）やミドルウェアやアプリケーションソフトウェアなどのソフトウェアのプログラム、および、データを記憶する不揮発性の記憶装置である。なお、管理装置１００は、フラッシュメモリやＳＳＤなどの他の種類の記憶装置を備えてもよく、複数の不揮発性の記憶装置を備えてもよい。 The HDD 103 is a non-volatile storage device that stores software programs such as the OS (Operating System), middleware, and application software, as well as data. Note that the management device 100 may also include other types of storage devices, such as flash memory and SSDs, and may also include multiple non-volatile storage devices.

ＧＰＵ１０４は、ＣＰＵ１０１からの命令に従って、管理装置１００に接続されたディスプレイ１１１に画像を出力する。ディスプレイ１１１としては、ＣＲＴ（Cathode Ray Tube）ディスプレイ、液晶ディスプレイ（ＬＣＤ：Liquid Crystal Display）、プラズマディスプレイ、有機ＥＬ（ＯＥＬ：Organic Electro-Luminescence）ディスプレイなど、任意の種類のディスプレイを用いることができる。 The GPU 104 outputs an image to a display 111 connected to the management device 100 in accordance with an instruction from the CPU 101. The display 111 may be any type of display, such as a CRT (Cathode Ray Tube) display, a liquid crystal display (LCD), a plasma display, or an organic electro-luminescence (OEL) display.

入力インタフェース１０５は、管理装置１００に接続された入力デバイス１１２から入力信号を取得し、ＣＰＵ１０１に出力する。入力デバイス１１２としては、マウス、タッチパネル、タッチパッド、トラックボールなどのポインティングデバイス、キーボード、リモートコントローラ、ボタンスイッチなどを用いることができる。また、管理装置１００に、複数の種類の入力デバイスが接続されていてもよい。 The input interface 105 acquires an input signal from an input device 112 connected to the management device 100 and outputs the signal to the CPU 101. The input device 112 may be a pointing device such as a mouse, a touch panel, a touch pad, or a trackball, a keyboard, a remote controller, or a button switch. In addition, multiple types of input devices may be connected to the management device 100.

媒体リーダ１０６は、記録媒体１１３に記録されたプログラムやデータを読み取る読み取り装置である。記録媒体１１３として、例えば、磁気ディスク、光ディスク、光磁気ディスク（ＭＯ：Magneto-Optical disk）、半導体メモリなどを使用できる。磁気ディスクには、フレキシブルディスク（ＦＤ：Flexible Disk）やＨＤＤが含まれる。光ディスクには、ＣＤ（Compact Disc）やＤＶＤ（Digital Versatile Disc）が含まれる。 The media reader 106 is a reading device that reads programs and data recorded on the recording medium 113. For example, a magnetic disk, an optical disk, a magneto-optical disk (MO: Magneto-Optical disk), a semiconductor memory, etc. can be used as the recording medium 113. Magnetic disks include flexible disks (FD: Flexible Disks) and HDDs. Optical disks include compact discs (CDs) and digital versatile discs (DVDs).

媒体リーダ１０６は、例えば、記録媒体１１３から読み取ったプログラムやデータを、ＲＡＭ１０２やＨＤＤ１０３などの他の記録媒体にコピーする。読み取られたプログラムは、例えば、ＣＰＵ１０１によって実行される。なお、記録媒体１１３は可搬型記録媒体であってもよく、プログラムやデータの配布に用いられることがある。また、記録媒体１１３やＨＤＤ１０３を、コンピュータ読み取り可能な記録媒体と言うことがある。 For example, the media reader 106 copies the program or data read from the recording medium 113 to another recording medium such as the RAM 102 or the HDD 103. The read program is executed by the CPU 101, for example. Note that the recording medium 113 may be a portable recording medium, and may be used to distribute programs and data. The recording medium 113 and the HDD 103 may also be referred to as computer-readable recording media.

通信インタフェース１０７は、管理ネットワーク５０に接続され、管理ネットワーク５０を介してノード２００，２００ａ，２００ｂ，…を含む他の情報処理装置と通信する。通信インタフェース１０７は、スイッチやルータなどの有線通信装置に接続される有線通信インタフェースでもよいし、基地局やアクセスポイントなどの無線通信装置に接続される無線通信インタフェースでもよい。 The communication interface 107 is connected to the management network 50 and communicates with other information processing devices including the nodes 200, 200a, 200b, ... via the management network 50. The communication interface 107 may be a wired communication interface connected to a wired communication device such as a switch or a router, or a wireless communication interface connected to a wireless communication device such as a base station or an access point.

ノード２００，２００ａ，２００ｂ，…も、管理装置１００と同様のハードウェアにより実現される。ノード２００，２００ａ，２００ｂ，…は、管理ネットワーク５０だけでなくノード間ネットワーク６０にも接続されるため、ノード間ネットワーク６０に接続する通信インタフェースも有する。 The nodes 200, 200a, 200b, ... are also realized by the same hardware as the management device 100. The nodes 200, 200a, 200b, ... are connected not only to the management network 50 but also to the inter-node network 60, and therefore also have a communication interface that connects to the inter-node network 60.

図４は、ノード間の接続例を示す図である。
各ノードは、ラックＲ１，Ｒ２，Ｒ３，Ｒ４の何れかに搭載される。例えば、ラックＲ１には、全ノードのうちのノード２００，２００ａを含む一部のノードが搭載される。ラックＲ２には、全ノードのうちのノード２００ｍ，２００ｎを含む一部のノードが搭載される。図示を省略しているが、ラックＲ３，Ｒ４にも全ノードのうちの一部のノードが搭載される。 FIG. 4 is a diagram showing an example of connections between nodes.
Each node is mounted on one of racks R1, R2, R3, and R4. For example, some of all the nodes, including nodes 200 and 200a, are mounted on rack R1. Some of all the nodes, including nodes 200m and 200n, are mounted on rack R2. Although not shown in the figure, some of all the nodes are also mounted on racks R3 and R4.

また、ラックＲ１，Ｒ２，Ｒ３，Ｒ４それぞれは、リソースプールを有する。リソースプールは、ハードウェアのリソースを集約して備える装置である。第２の実施の形態では、当該リソースとしてＧＰＵを例示する。ただし、リソースプールにプール化されるリソースは、ＦＰＧＡやＳＳＤなどの他の種類のハードウェアを含んでもよい。例えば、ラックＲ１はリソースプール３００を有する。ラックＲ２はリソースプール３００ａを有する。図示を省略しているが、ラックＲ３，Ｒ４にもリソースプールが搭載される。 Each of the racks R1, R2, R3, and R4 also has a resource pool. A resource pool is a device that aggregates hardware resources. In the second embodiment, a GPU is exemplified as the resource. However, the resources pooled in the resource pool may include other types of hardware such as FPGAs and SSDs. For example, the rack R1 has a resource pool 300. The rack R2 has a resource pool 300a. Although not shown in the figure, the racks R3 and R4 are also equipped with resource pools.

同一ラック内のノード間は、当該ラックに搭載されたノード間接続スイッチにより接続される。また、同一ラック内のノードとリソースプールとは、当該ラックに搭載されたＰＣＩｅスイッチにより接続される。例えば、ラックＲ１は、ノード間接続スイッチ６１およびＰＣＩｅスイッチ７１を有する。ノード間接続スイッチ６１は、ラックＲ１に搭載されたノード２００，２００ａ，…を接続する。ＰＣＩｅスイッチ７１は、ラックＲ１に搭載されたノード２００，２００ａ，…とリソースプール３００とを接続する。また、ラックＲ２は、ノード間接続スイッチ６２およびＰＣＩｅスイッチ７２を有する。ノード間接続スイッチ６２は、ラックＲ２に搭載されたノード２００ｍ，２００ｎ，…を接続する。また、ＰＣＩｅスイッチ７２は、ラックＲ２に搭載されたノード２００ｍ，２００ｎ，…とリソースプール３００ａとを接続する。図示を省略しているが、ラックＲ３，Ｒ４にもノード間接続スイッチおよびＰＣＩｅスイッチが搭載される。 Nodes in the same rack are connected by an inter-node connection switch mounted on the rack. Also, nodes in the same rack and a resource pool are connected by a PCIe switch mounted on the rack. For example, rack R1 has an inter-node connection switch 61 and a PCIe switch 71. The inter-node connection switch 61 connects the nodes 200, 200a, ... mounted on rack R1. The PCIe switch 71 connects the nodes 200, 200a, ... mounted on rack R1 and the resource pool 300. Also, rack R2 has an inter-node connection switch 62 and a PCIe switch 72. The inter-node connection switch 62 connects the nodes 200m, 200n, ... mounted on rack R2. Also, the PCIe switch 72 connects the nodes 200m, 200n, ... mounted on rack R2 and the resource pool 300a. Although not shown in the figure, racks R3 and R4 are also equipped with inter-node connection switches and PCIe switches.

ノード間接続スイッチ６１，６２を含む、各ラックのノード接続スイッチは、上位スイッチ６５，６６，６７，６８に接続される。上位スイッチ６５，６６，６７，６８によりラック間を跨ぐノード間の通信が可能となる。ノード間接続スイッチ６１，６２を含む、各ラックのノード接続スイッチ、および、上位スイッチ６５，６６，６７，６８は、何れもＩｎｆｉｎｉＢａｎｄのスイッチであり、ノード間ネットワーク６０を形成する。 The node connection switches of each rack, including the inter-node connection switches 61 and 62, are connected to the upper switches 65, 66, 67, and 68. The upper switches 65, 66, 67, and 68 enable communication between nodes across racks. The node connection switches of each rack, including the inter-node connection switches 61 and 62, and the upper switches 65, 66, 67, and 68 are all InfiniBand switches, and form the inter-node network 60.

ノード間ネットワーク６０のトポロジは、例えばＦａｔＴｒｅｅである。具体的には、各ノード間接続スイッチにおけるノードに接続するリンクの数と当該ノード間接続スイッチの上位スイッチ側のリンクの数が等しい構成となっている。あるノードから他ラックの通信相手ノードへのパケットの送信に用いる上位スイッチ側のリンクを通信相手ノードに応じて使い分けることで、各リンクの使用量をバランスする。ただし、ノード間ネットワーク６０のトポロジは、ＦａｔＴｒｅｅ以外のトポロジでもよい。ここで、リンクは、ノード間ネットワーク６０に含まれる通信路である。 The topology of the inter-node network 60 is, for example, FatTree. Specifically, the number of links connecting to the nodes in each inter-node connection switch is equal to the number of links on the upper switch side of the inter-node connection switch. The usage of each link is balanced by using different links on the upper switch side used to send packets from a node to a communication partner node in another rack depending on the communication partner node. However, the topology of the inter-node network 60 may be a topology other than FatTree. Here, a link is a communication path included in the inter-node network 60.

図５は、ラック内のノードとリソースプールとの接続例を示す図である。
ノード２００は、ＣＰＵ２０１、ＲＡＭ２０２、ＨＤＤ２０３およびＮＩＣ（Network Interface Card）２０４を有する。ノード２００ａは、ＣＰＵ２０１ａ、ＲＡＭ２０２ａ、ＨＤＤ２０３ａおよびＮＩＣ２０４ａを有する。ＮＩＣ２０４，２０４ａは、ノード間接続スイッチ６１と接続する通信インタフェースである。ラックＲ１における他のノードもノード２００，２００ａと同様のハードウェアを有する。また、ＣＰＵ２０１，２０１ａを含む、ラックＲ１の各ノードのＣＰＵは、ＰＣＩｅスイッチ７１に接続される。ＰＣＩｅスイッチ７１は、ＣＰＵ２０１，２０１ａを含む、ラックＲ１の各ノードのＣＰＵをルートコンプレックスとして認識し、リソースプール３００に含まれる、エンドポイントのデバイスであるＧＰＵ３０１，３０２，…と接続する。ＰＣＩｅスイッチ７１は、ノードからの指示に応じて、ラックＲ１の各ノードのＣＰＵと、リソースプール３００のＧＰＵ３０１，３０２，…との接続関係を変更する。 FIG. 5 is a diagram illustrating an example of a connection between nodes in a rack and a resource pool.
The node 200 includes a CPU 201, a RAM 202, an HDD 203, and a NIC (Network Interface Card) 204. The node 200a includes a CPU 201a, a RAM 202a, an HDD 203a, and a NIC 204a. The NICs 204 and 204a are communication interfaces that connect to the inter-node connection switch 61. The other nodes in the rack R1 have hardware similar to that of the nodes 200 and 200a. The CPU of each node in the rack R1, including the CPUs 201 and 201a, is connected to the PCIe switch 71. The PCIe switch 71 recognizes the CPU of each node in the rack R1, including the CPUs 201 and 201a, as a root complex, and connects it to the GPUs 301, 302, ..., which are endpoint devices included in the resource pool 300. In response to an instruction from a node, the PCIe switch 71 changes the connection relationship between the CPU of each node in the rack R1 and the GPUs 301, 302, ... of the resource pool 300.

ここで、管理装置１００は、ジョブの割り当て要求を受け付けると、現在ジョブを未割り当てである空きノードをジョブの割り当て候補とする。
図６は、ジョブの割り当て候補となる空きノードの例を示す図である。 Here, when the management device 100 receives a job allocation request, it sets free nodes to which no job is currently assigned as candidates for job allocation.
FIG. 6 is a diagram showing an example of free nodes that are candidates for job allocation.

各ノードは、ノード番号により識別される。例えば、ノード２００のノード番号は「０」であり、ノード２００ａのノード番号は「１」である。以下では、ノード番号ｎのノードを、ノード「ｎ」のように表記する。１つのラックに搭載されるノードの数は４とする。すなわち、１つのノード間接続スイッチには４つのノードが接続される。また、ラックの総数は４であり、ノードの総数は１６とする。 Each node is identified by a node number. For example, node 200 has a node number of "0", and node 200a has a node number of "1". Below, a node with node number n will be written as node "n". The number of nodes mounted on one rack is assumed to be four. In other words, four nodes are connected to one inter-node connection switch. The total number of racks is assumed to be four, and the total number of nodes is assumed to be 16.

この場合、ノード間ネットワーク６０におけるＦａｔＴｒｅｅのトポロジでは、１つのノード間接続スイッチに対して、上位スイッチ６５，６６，６７，６８それぞれと接続する４つのリンクが存在する。ここで、ノード間接続スイッチ６３はラックＲ３に搭載される。ノード間接続スイッチ６４はラックＲ４に搭載される。 In this case, in the FatTree topology of the inter-node network 60, there are four links for one inter-node connection switch, connecting it to upper switches 65, 66, 67, and 68, respectively. Here, the inter-node connection switch 63 is mounted on rack R3. The inter-node connection switch 64 is mounted on rack R4.

ノード間接続スイッチ６１には、ノード「０」～「３」が接続される。ノード間接続スイッチ６２には、ノード「４」～「７」が接続される。ノード間接続スイッチ６３には、ノード「８」～「１１」が接続される。ノード間接続スイッチ６４には、ノード「１２」～「１５」が接続される。 Nodes "0" to "3" are connected to inter-node connection switch 61. Nodes "4" to "7" are connected to inter-node connection switch 62. Nodes "8" to "11" are connected to inter-node connection switch 63. Nodes "12" to "15" are connected to inter-node connection switch 64.

図中、黒丸で示されるノード「３」、「４」、「７」、「８」～「１５」は、ジョブ割り当て済のノードである。ジョブ割り当て済のノードは、先行するジョブを割り当て済で、当該ジョブを実行中であり、新たなジョブの割り当て候補からは除外される。白丸で示されるノード「０」～「２」、「５」、「６」は、ジョブ未割り当てのノードである。ジョブ未割り当てのノードは、新たなジョブの割り当て候補となる。更に、図中、ノードに接続された、「Ｇ」の文字が付された長方形は、当該ノードに接続されているＧＰＵを示す。 In the figure, nodes "3," "4," "7," "8" to "15," indicated by black circles, are nodes to which jobs have already been assigned. Nodes to which jobs have already been assigned have been assigned a previous job and are currently executing that job, and are excluded from being candidates for allocation of new jobs. Nodes "0" to "2," "5," and "6," indicated by white circles, are nodes to which jobs have not yet been assigned. Nodes to which jobs have not yet been assigned are candidates for allocation of new jobs. Furthermore, in the figure, the rectangles marked with the letter "G" connected to the nodes indicate the GPUs connected to those nodes.

例えば、４つのノード、および、１ノード当たりＧＰＵ１個を要求するジョブへのノード割り当てを行う場合に、管理装置１００は、ノード「０」、「１」、「２」、「５」、「６」から４つを選択するノードの組合せを、ジョブの割り当て先のノードの組合せ候補とする。なお、ノードの組合せは、ノード群と言われてもよい。 For example, when allocating nodes to a job that requests four nodes and one GPU per node, the management device 100 treats each combination of four nodes selected from nodes "0", "1", "2", "5", and "6" as a candidate combination of nodes to which the job may be allocated. A combination of nodes may also be referred to as a node group.
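The candidate enumeration described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `free_nodes`, `requested_nodes`, and `candidates` are hypothetical, and the sketch ignores which GPUs are currently attached to each node.

```python
from itertools import combinations

# Free (unassigned) nodes taken from the FIG. 6 example in the text.
free_nodes = [0, 1, 2, 5, 6]

# The job in the example requests four nodes.
requested_nodes = 4

# Every way of choosing four of the five free nodes is a candidate
# combination (node group).
candidates = list(combinations(free_nodes, requested_nodes))

for group in candidates:
    print(group)
```

With five free nodes and four requested nodes, this yields five candidate combinations, each of which would then be evaluated separately.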

このように、管理装置１００は、ラックを跨いでジョブの割り当て先の候補のノードを選択できる。このとき、他のラックにおける割り当て先のノードの選択によっては、ジョブ実行時にノード間ネットワーク６０の一部のリンクで通信競合が発生することがある。 In this way, the management device 100 can select candidate nodes to which a job is assigned across racks. In this case, depending on the selection of the node to which the job is assigned in another rack, communication contention may occur in some links of the inter-node network 60 when the job is executed.

図７は、通信競合が発生する例を示す図である。
図７では、ノード間接続スイッチ６１，６２，６３，６４と、上位スイッチ６５，６６，６７，６８とを結ぶ線によって、スイッチ間のリンクが示されている。リンクに記載されている、例えば「４，８，１２」などの数字は通信相手のノード番号であり、ノード間接続スイッチ６１側から、当該ノード番号のノードと通信する場合に選択されるリンクを表す。 FIG. 7 illustrates an example in which communication contention occurs.
In FIG. 7, the links between switches are shown by lines connecting the inter-node connection switches 61, 62, 63, and 64 with the upper switches 65, 66, 67, and 68. The numbers written on a link, for example "4, 8, 12", are the node numbers of communication partners, and indicate the link that is selected on the inter-node connection switch 61 side when communicating with the node of that number.

例えば、ノード「１」がノード「５」と通信し、ノード「３」がノード「９」と通信する場合、ノード間接続スイッチ６１と上位スイッチ６６とを結ぶ同一のリンクが使用される。当該リンクの帯域が共有して使用されることで、通信競合による性能低下が発生する場合がある。この問題は、ＦａｔＴｒｅｅ以外のトポロジでも発生し得る。 For example, when node "1" communicates with node "5" and node "3" communicates with node "9", both communications use the same link connecting the inter-node connection switch 61 and the upper switch 66. Because the bandwidth of that link is shared, performance may degrade due to communication contention. This problem can also occur in topologies other than FatTree.
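The link-sharing example can be reproduced with a small sketch. The routing rule used here (the upper switch is chosen by the destination node number modulo 4) is an assumption: it is consistent with the examples in the text (partners "4, 8, 12" share one link; nodes "5" and "9" both map to upper switch 66; node "6" maps to upper switch 67), but the source does not state the rule itself. All function names are hypothetical.

```python
from collections import Counter

NODES_PER_RACK = 4                    # four nodes per rack, as in the text
LEAF_SWITCHES = [61, 62, 63, 64]      # inter-node connection switches
UPPER_SWITCHES = [65, 66, 67, 68]     # upper switches

def leaf_switch(node):
    """Inter-node connection switch that the node is attached to."""
    return LEAF_SWITCHES[node // NODES_PER_RACK]

def uplink(src, dst):
    """Return the (leaf switch, upper switch) link used for src -> dst.

    Returns None when both nodes are in the same rack, because the
    traffic then stays inside the leaf switch.
    ASSUMPTION: the upper switch is selected by dst modulo 4.
    """
    if src // NODES_PER_RACK == dst // NODES_PER_RACK:
        return None
    return (leaf_switch(src), UPPER_SWITCHES[dst % len(UPPER_SWITCHES)])

def contended_links(flows):
    """Links that are used by more than one (src, dst) flow."""
    counts = Counter()
    for src, dst in flows:
        link = uplink(src, dst)
        if link is not None:
            counts[link] += 1
    return {link for link, n in counts.items() if n > 1}

print(contended_links([(1, 5), (3, 9)]))  # both flows share link 61-66
print(contended_links([(1, 6), (3, 9)]))  # no shared link
```

Under this assumed rule, replacing node "5" with node "6" as the partner of node "1" moves that flow to a different uplink, which is the effect the reconfiguration discussed later aims for.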

ところで、ノードに対するＧＰＵの接続構成は変更可能である。
図８は、ノードのリソースの再構成の例を示す図である。
図８（Ａ）は、ノード「５」に接続されるＧＰＵを、ノード「６」に接続し直す再構成の例を示す。図８（Ｂ）は、リソースプール３００ａにおいて使用されていないＧＰＵをノード「６」に接続する再構成の例を示す。 Incidentally, the connection configuration of the GPUs to the nodes can be changed.
FIG. 8 is a diagram illustrating an example of reconfiguration of node resources.
FIG. 8(A) shows an example of reconfiguration in which a GPU connected to node "5" is reconnected to node "6". FIG. 8(B) shows an example of reconfiguration in which an unused GPU in the resource pool 300a is connected to node "6".

例えば、各リソースプールにおけるリソースの管理では、ＧＰＵなどのリソースの使用後はリソースプールに返却する方法や、構成変更が必要となるまでそのままとする方法がある。前者の場合、ジョブの割り当て先のノードに対して、要求されたＧＰＵ数に応じてリソースプールのＧＰＵを接続すればよい。後者の場合、図８（Ａ）のように、必要に応じてノード「５」に接続されたＧＰＵを、ノード「６」に接続し直してもよい。あるいは、図８（Ｂ）のように、リソースプール３００ａに空きＧＰＵがある場合、ノード「５」へのＧＰＵの既存の接続を維持して、当該空きＧＰＵをノード「６」に接続してもよい。 For example, in resource management in each resource pool, resources such as GPUs can be returned to the resource pool after use, or left as is until a configuration change is required. In the former case, GPUs from the resource pool can be connected to the node to which the job is assigned according to the requested number of GPUs. In the latter case, as shown in FIG. 8(A), a GPU connected to node "5" can be reconnected to node "6" as necessary. Alternatively, as shown in FIG. 8(B), if there is a free GPU in resource pool 300a, the existing connection of the GPU to node "5" can be maintained and the free GPU can be connected to node "6."

このように、情報処理システム２では、ノードにおけるＧＰＵの再構成によりジョブに要求されるＧＰＵを備えたノードを用意することが可能である。当該再構成により図７で例示される通信競合を回避できることがある。 In this way, in the information processing system 2, it is possible to prepare a node equipped with a GPU required for a job by reconfiguring the GPU in the node. This reconfiguration can sometimes avoid the communication contention illustrated in FIG. 7.

図９は、通信競合が回避される例を示す図である。
図８（Ａ）および図８（Ｂ）の何れかの再構成によりノード「６」にＧＰＵを接続することで、管理装置１００は、ノード「６」をジョブの割り当て先とすることができる。ノード「５」ではなく、ノード「６」をジョブ実行に使用することで、例えばノード「１」とノード「６」との通信ではノード間接続スイッチ６１と上位スイッチ６７とを結ぶリンクが使用される。ノード間接続スイッチ６１と上位スイッチ６７とを結ぶリンクは、ノード「３」がノード「９」と通信するときに用いられる、ノード間接続スイッチ６１と上位スイッチ６６とを結ぶリンクとは異なり、通信競合が回避される。 FIG. 9 is a diagram illustrating an example in which communication contention is avoided.
By connecting a GPU to node "6" through either of the reconfigurations in FIG. 8(A) and FIG. 8(B), the management device 100 can make node "6" an allocation destination of the job. When node "6" is used for job execution instead of node "5", communication between node "1" and node "6", for example, uses the link connecting the inter-node connection switch 61 and the upper switch 67. That link differs from the link connecting the inter-node connection switch 61 and the upper switch 66, which is used when node "3" communicates with node "9", so communication contention is avoided.

通信競合は、ジョブの実行完了までの遅延に影響する。一方、通信競合を回避するためにリソースの再構成を行うとしても、当該再構成には時間を要し、ジョブの実行完了までの遅延に影響する。再構成に要する時間、すなわち、再構成時間は、例えば、ノードがホットプラグ／リムーブに対応しているか否かにより予め定められる。ホットプラグ／リムーブは、ノードの再起動を行わずに、ＧＰＵなどのリソースの接続を変更可能にする機能である。例えば、ノードがホットプラグ／リムーブに対応している場合、ＧＰＵの再構成は数秒程度となる。一方、ノードがホットプラグ／リムーブに非対応でありノードの再起動を行う場合は、ＧＰＵの再構成は数分を要する。 Communication contention affects the delay until the job execution is completed. On the other hand, even if resources are reconfigured to avoid communication contention, the reconfiguration takes time and affects the delay until the job execution is completed. The time required for reconfiguration, i.e., the reconfiguration time, is determined in advance, for example, depending on whether or not the node supports hot plug/remove. Hot plug/remove is a function that makes it possible to change the connection of resources such as GPUs without rebooting the node. For example, if the node supports hot plug/remove, the reconfiguration of the GPU takes about a few seconds. On the other hand, if the node does not support hot plug/remove and the node is rebooted, the reconfiguration of the GPU takes several minutes.

そこで、管理装置１００は、通信競合を考慮して、リソースの再構成を行うか否かを決定し、ジョブに対するノードの割り当てを行うことで、ジョブを効率的に実行可能にする機能を提供する。 Therefore, the management device 100 provides a function that allows jobs to be executed efficiently by determining whether or not to reconfigure resources while taking communication contention into account and allocating nodes to jobs.

図１０は、管理装置の機能例を示す図である。
管理装置１００は、記憶部１２０およびジョブスケジューラ１３０を有する。記憶部１２０には、ＲＡＭ１０２やＨＤＤ１０３の記憶領域が用いられる。ジョブスケジューラ１３０は、ＲＡＭ１０２に記憶されたプログラムがＣＰＵ１０１により実行されることで実現される。 FIG. 10 illustrates an example of functions of the management device.
The management device 100 includes a storage unit 120 and a job scheduler 130. The storage unit 120 uses storage areas of the RAM 102 and the HDD 103. The job scheduler 130 is realized by the CPU 101 executing a program stored in the RAM 102.

記憶部１２０は、ジョブスケジューラ１３０の処理に用いられる情報を記憶する。記憶部１２０に記憶される情報は、ジョブ管理テーブル１２１、基準通信時間テーブル１２２、通信性能テーブル１２３、および、評価値テーブル１２４を含む。 The storage unit 120 stores information used in the processing of the job scheduler 130. The information stored in the storage unit 120 includes a job management table 121, a reference communication time table 122, a communication performance table 123, and an evaluation value table 124.

ジョブ管理テーブル１２１は、ジョブに要求されるノード数、ノード当たりＧＰＵ数およびジョブの実行時間などのジョブ情報を保持するテーブルである。
基準通信時間テーブル１２２は、通信ベンチマーク測定による通信競合の有無の判定に用いられる基準通信時間を保持するテーブルである。基準通信時間は、システムの運用開始前に予め計測され、基準通信時間テーブル１２２に登録される。通信ベンチマーク測定は、所定の通信ベンチマークプログラムを、対象の各ノードに短時間だけ実行させて、ノード間の通信時間を計測することで行われる。 The job management table 121 is a table that holds job information such as the number of nodes required for a job, the number of GPUs per node, and the execution time of the job.
The reference communication time table 122 is a table that holds a reference communication time used to determine the presence or absence of communication contention by communication benchmark measurement. The reference communication time is measured in advance before the operation of the system starts, and is registered in the reference communication time table 122. The communication benchmark measurement is performed by having each target node execute a predetermined communication benchmark program for a short period of time and measuring the communication time between the nodes.

通信性能テーブル１２３は、ジョブを割り当てるノードの組合せ候補ごとの通信性能の測定結果の情報を保持するテーブルである。ノードの組合せ候補ごとの通信性能は、当該組合せ候補に対する通信ベンチマーク測定により取得される。 The communication performance table 123 is a table that holds information on the measurement results of the communication performance for each candidate combination of nodes to which a job is assigned. The communication performance for each candidate combination of nodes is obtained by performing a communication benchmark measurement for that candidate combination.

評価値テーブル１２４は、ジョブを割り当てるノードの組合せ候補ごとのジョブ完了までのトータル時間の評価値を保持するテーブルである。トータル時間の評価には、通信性能テーブル１２３の情報に加え、ノードでＧＰＵの再構成を行う場合には再構成時間も考慮される。 The evaluation value table 124 is a table that holds evaluation values of the total time until a job is completed for each candidate combination of nodes to which the job is assigned. In addition to the information in the communication performance table 123, the evaluation of the total time also takes into account the reconfiguration time if the GPU is reconfigured on the node.

なお、記憶部１２０は、上記の情報に加えて、各ノードが搭載されているラックや、各ノードに対するジョブの割り当て状況や、各ノードに接続されているＧＰＵの数などの情報を保持する。 In addition to the above information, the storage unit 120 also holds information such as the rack in which each node is mounted, the job allocation status of each node, and the number of GPUs connected to each node.

ジョブスケジューラ１３０は、ジョブに対するノード２００，２００ａ，２００ｂ，…の割り当てを行い、割り当てたノードに当該ジョブを実行させる。ジョブスケジューラ１３０は、ジョブ情報取得部１３１、ノード割り当て部１３２およびノード選択部１３３を有する。 The job scheduler 130 assigns nodes 200, 200a, 200b, ... to a job and has the assigned node execute the job. The job scheduler 130 has a job information acquisition unit 131, a node assignment unit 132, and a node selection unit 133.

ジョブ情報取得部１３１は、ジョブの実行要求の入力を受け付ける。ジョブの実行要求は、例えば管理ネットワーク５０に接続されたクライアント装置から管理装置１００に入力される。ジョブ情報取得部１３１は、実行要求に含まれる、ジョブに要求されるノード数、ノード当たりＧＰＵ数およびジョブの実行時間などのジョブ情報を取得し、ジョブ管理テーブル１２１に登録する。 The job information acquisition unit 131 accepts input of a job execution request. The job execution request is input to the management device 100 from, for example, a client device connected to the management network 50. The job information acquisition unit 131 acquires job information included in the execution request, such as the number of nodes required for the job, the number of GPUs per node, and the job execution time, and registers the information in the job management table 121.

ノード割り当て部１３２は、ジョブ情報取得部１３１がジョブ情報を受け付けると、ジョブ情報に基づいて、ジョブに割り当てるノードの選択をノード選択部１３３に依頼する。ノード割り当て部１３２は、割り当てるノードの選択結果をノード選択部１３３から取得し、割り当て先のノードに当該ジョブを割り当てる。すなわち、ノード割り当て部１３２は、割り当て先のノードに当該ジョブの実行を指示する。 When the job information acquisition unit 131 accepts job information, the node allocation unit 132 requests the node selection unit 133 to select a node to be assigned to the job based on the job information. The node allocation unit 132 acquires the selection result of the node to be assigned from the node selection unit 133, and assigns the job to the assigned node. In other words, the node allocation unit 132 instructs the assigned node to execute the job.

ノード選択部１３３は、ノード割り当て部１３２によるノードの選択の依頼に応じて、当該ジョブに割り当てるノードを選択する。ノード選択部１３３は、ノード組合せ抽出部１３３ａおよび評価部１３３ｂを有する。 The node selection unit 133 selects nodes to be assigned to the job in response to a node selection request from the node assignment unit 132. The node selection unit 133 has a node combination extraction unit 133a and an evaluation unit 133b.

ノード組合せ抽出部１３３ａは、ジョブに割り当てるノードの組合せを抽出する。具体的には、まず、ノード組合せ抽出部１３３ａは、空きノードの中からジョブに割り当てるノードの組合せ候補を抽出する。ノード組合せ抽出部１３３ａは、抽出した組合せ候補に対するジョブの実行完了までのトータル時間の評価を、評価部１３３ｂに依頼する。 The node combination extraction unit 133a extracts a combination of nodes to be assigned to a job. Specifically, the node combination extraction unit 133a first extracts candidate combinations of nodes to be assigned to a job from among available nodes. The node combination extraction unit 133a requests the evaluation unit 133b to evaluate the total time until the job execution is completed for the extracted candidate combinations.

ここで、トータル時間は、ノードにおけるＧＰＵの再構成開始からジョブの実行完了までの時間である。トータル時間は、ノードにおけるＧＰＵの再構成時間と、再構成後のジョブの実行時間との合計となる。ただし、ＧＰＵの再構成が不要な場合、再構成時間＝０となる。 Here, the total time is the time from the start of GPU reconfiguration on the node to the completion of job execution. The total time is the sum of the GPU reconfiguration time on the node and the job execution time after reconfiguration. However, if GPU reconfiguration is not required, the reconfiguration time = 0.

そして、ノード組合せ抽出部１３３ａは、評価部１３３ｂによる評価で得られたトータル時間の評価値が最も良い組合せ候補を、ジョブに割り当てるノードの組合せとして決定する。ノード組合せ抽出部１３３ａは、決定したノードの組合せを、ノード割り当て部１３２に応答する。 Then, the node combination extraction unit 133a determines the combination candidate with the best evaluation value of the total time obtained by the evaluation unit 133b as the combination of nodes to be assigned to the job. The node combination extraction unit 133a responds to the node assignment unit 132 with the determined combination of nodes.
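The final selection step can be sketched as picking the candidate with the smallest total time. This assumes that the "best" evaluation value means the shortest total time; the function name `best_candidate` and the numeric values below are hypothetical and are not taken from the source.

```python
def best_candidate(evaluations):
    """Return the node combination whose total time is smallest.

    `evaluations` maps a candidate combination (a tuple of node
    numbers) to its total-time evaluation value in seconds.
    """
    if not evaluations:
        raise ValueError("no candidate combinations to choose from")
    return min(evaluations, key=evaluations.get)

# Hypothetical evaluation values for three candidate combinations.
evaluations = {
    (0, 1, 2, 5): 8100.0,
    (0, 1, 2, 6): 3900.0,
    (0, 1, 5, 6): 8100.0,
}

print(best_candidate(evaluations))
```

If several candidates tie, `min` returns the first one encountered; a real scheduler might apply an additional tie-breaking rule, which the source does not describe here.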

評価部１３３ｂは、ノード組合せ抽出部１３３ａの依頼に応じて、ノードの組合せ候補に対するジョブのトータル時間の評価を行う。トータル時間の評価では、評価部１３３ｂは、組合せ候補ごとのジョブ実行に係るトータル時間を算出する。ジョブの実行時間には、通信競合の影響が考慮される。通信競合の影響の有無は、比較的短時間で実行される所定の通信ベンチマーク測定により判定される。 The evaluation unit 133b evaluates the total time of jobs for node combination candidates in response to a request from the node combination extraction unit 133a. In evaluating the total time, the evaluation unit 133b calculates the total time for job execution for each combination candidate. The effect of communication contention is taken into account in the job execution time. The presence or absence of the effect of communication contention is determined by a specified communication benchmark measurement that is performed in a relatively short time.

評価部１３３ｂは、通信ベンチマーク測定の結果を、通信性能テーブル１２３に登録する。また、評価部１３３ｂは、基準通信時間テーブル１２２および通信性能テーブル１２３に基づいて、ノードの組合せ候補ごとのトータル時間を算出し、評価値テーブル１２４に登録する。 The evaluation unit 133b registers the results of the communication benchmark measurement in the communication performance table 123. The evaluation unit 133b also calculates the total time for each candidate node combination based on the reference communication time table 122 and the communication performance table 123, and registers the calculated total time in the evaluation value table 124.

ここで、評価部１３３ｂはトータル時間を次の式（３）により算出する。
Ｔ_{Ｔｏｔａｌ}＝Ｔ_{ｒｅｃｏｎｆ}＋α×β×Ｔ_ｊｏｂ・・・（３）
Ｔ_{Ｔｏｔａｌ}は、トータル時間である。Ｔ_{Ｔｏｔａｌ}は、トータル時間の評価値と言われてもよい。Ｔ_{ｒｅｃｏｎｆ}は、ノードにおけるＧＰＵの再構成時間である。Ｔ_{ｒｅｃｏｎｆ}は、ノードのホットプラグ／リムーブの対応状況に応じて評価部１３３ｂに予め与えられる。Ｔ_ｊｏｂは、ジョブ実行時間の本体である。再構成不要の場合、Ｔ_{ｒｅｃｏｎｆ}＝０である。Ｔ_ｊｏｂには、ユーザにより入力されるジョブ実行時のジョブ実行時間上限値が使用される。なお、Ｔ_{Ｔｏｔａｌ}、Ｔ_{ｒｅｃｏｎｆ}およびＴ_ｊｏｂの単位は、例えば秒である。 Here, the evaluation unit 133b calculates the total time by the following formula (3).
T_{Total} = T_{reconf} + α × β × T_{job} ... (3)
T_{Total} is the total time, and may also be called the evaluation value of the total time. T_{reconf} is the GPU reconfiguration time at the node, and is given to the evaluation unit 133b in advance according to whether the node supports hot plug/remove. If reconfiguration is not required, T_{reconf} = 0. T_{job} is the job execution time itself; the upper limit of the job execution time entered by the user for the job is used as T_{job}. The units of T_{Total}, T_{reconf}, and T_{job} are, for example, seconds.

αは、ジョブ実行時間Ｔ_ｊｏｂに対する通信競合の影響を表す係数である。αは、短時間で実行が完了する通信ベンチマークプログラムによる通信時間の測定結果の、基準通信時間（通信競合の影響がないときの値）に対する倍率となる。基準通信時間は、各ノードにジョブ割り当てがされていない状態（通信競合がない状態）で当該通信ベンチマークプログラムを用いて事前に取得された基準の通信時間である。具体的には、α＝（通信ベンチマークにより測定した通信時間）／基準通信時間である。なお、実際に実行されるジョブは通信ベンチマークプログラムとは異なり、通信以外の演算処理を含む。このため、αは、基準通信時間に対する倍率そのものでなくてもよく、基準通信時間に対する倍率を調整した値でもよい。例えば、当該倍率を更に０．５倍した値をαとするなど、影響を小さくする調整方法が考えられる。 α is a coefficient representing the influence of communication contention on the job execution time T_{job}. α is the ratio of the communication time measured by a communication benchmark program that completes in a short time to the reference communication time (the value when there is no influence of communication contention). The reference communication time is a baseline communication time obtained in advance with the same communication benchmark program while no job is assigned to any node (that is, while there is no communication contention). Specifically, α = (communication time measured by the communication benchmark) / (reference communication time). Note that, unlike the communication benchmark program, a job that is actually executed includes computation other than communication. Therefore, α does not have to be the ratio to the reference communication time itself, and may be a value obtained by adjusting that ratio. For example, the influence can be reduced by using the ratio further multiplied by 0.5 as α.
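As a worked illustration of the unadjusted ratio, the sketch below computes α from a measured communication time and a reference communication time. The function name `alpha_ratio` and the measured value of 9.40 msec are hypothetical; the optional adjustment of the ratio mentioned in the text is deliberately left out.

```python
def alpha_ratio(measured_ms, reference_ms):
    """Ratio of the benchmark-measured communication time to the
    reference (contention-free) communication time."""
    if reference_ms <= 0:
        raise ValueError("reference communication time must be positive")
    return measured_ms / reference_ms

# Hypothetical measurement: the benchmark takes twice as long as the
# contention-free reference.
alpha = alpha_ratio(measured_ms=9.40, reference_ms=4.70)
print(alpha)
```

A ratio of 2 would indicate that the benchmark communication took twice as long as in the contention-free case.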

βは、ジョブ実行時間Ｔ_ｊｏｂに対する通信競合の影響を表す係数である。βは、実行するジョブのノード当たりＧＰＵ数に応じた通信競合の影響を示す。ジョブのノード当たりＧＰＵ数が多いほど、ノード間通信が多く発生し、ジョブ実行時間への影響も大きいと考えられるためである。 β is a coefficient that represents the effect of communication contention on the job execution time T _job . β indicates the effect of communication contention according to the number of GPUs per node of the job to be executed. This is because it is considered that the more GPUs per node of the job, the more inter-node communications occur, and the greater the effect on the job execution time.

例えば、ノード当たりＧＰＵ数をＮ_Ｇとすると、β＝１＋（１／８）×Ｎ_Ｇである。
この場合、ノード当たりＧＰＵ数＝１では、β＝１．１となる。ノード当たりＧＰＵ数＝４では、β＝１．５となる。Ｎ_Ｇに乗じる係数（１／８）の分母の定数は、例えば、ノード当たりに構成可能な最大のＧＰＵ数としてもよい。当該分母の定数は、事前に決定される。なお、（通信ベンチマーク測定で計測した通信時間）／基準通信時間≦１の場合は、通信競合の影響がない場合であり、評価部１３３ｂは、α＝１、β＝１とする。 For example, if the number of GPUs per node is N_G, then β = 1 + (1/8) × N_G.
In this case, when the number of GPUs per node is 1, β = 1.125 (about 1.1). When the number of GPUs per node is 4, β = 1.5. The constant in the denominator of the coefficient (1/8) by which N_G is multiplied may be, for example, the maximum number of GPUs that can be configured per node. This constant is determined in advance. When (communication time measured by the communication benchmark measurement) / (reference communication time) ≦ 1, there is no influence of communication contention, and the evaluation unit 133b sets α = 1 and β = 1.
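Formula (3), together with the rule for β and the no-contention case, can be sketched as below. The function names, the 300-second reconfiguration time, and the communication ratios in the example are hypothetical values chosen for illustration; only the formulas themselves come from the text.

```python
def beta(gpus_per_node, max_gpus_per_node=8):
    """beta = 1 + N_G / 8, where 8 is the example denominator constant."""
    return 1.0 + gpus_per_node / max_gpus_per_node

def total_time(t_reconf_s, t_job_s, comm_ratio, gpus_per_node):
    """Formula (3): T_Total = T_reconf + alpha * beta * T_job.

    `comm_ratio` is (measured communication time) / (reference
    communication time). When it is <= 1 there is no contention,
    so alpha = beta = 1 as stated in the text.
    """
    if comm_ratio <= 1.0:
        a, b = 1.0, 1.0
    else:
        a, b = comm_ratio, beta(gpus_per_node)
    return t_reconf_s + a * b * t_job_s

# Candidate that needs no reconfiguration but suffers contention
# (hypothetical ratio of 2.0), for a one-hour job with 1 GPU per node.
no_reconf = total_time(t_reconf_s=0, t_job_s=3600,
                       comm_ratio=2.0, gpus_per_node=1)

# Candidate that needs a reconfiguration with a node reboot
# (hypothetical 300 s) but then runs without contention.
with_reconf = total_time(t_reconf_s=300, t_job_s=3600,
                         comm_ratio=1.0, gpus_per_node=1)

print(no_reconf, with_reconf)
print("reconfigure" if with_reconf < no_reconf else "do not reconfigure")
```

With these assumed numbers, the reconfiguration cost is small compared with the slowdown caused by contention, so the reconfigured candidate has the shorter total time. With a short job or a mild slowdown, the comparison could go the other way.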

式（３）のα×βで表される係数は、第１の実施の形態の第１係数ａ１および第２係数ａ２に相当する。
図１１は、ジョブ管理テーブルの例を示す図である。 The coefficient expressed by α×β in equation (3) corresponds to the first coefficient a1 and the second coefficient a2 in the first embodiment.
FIG. 11 is a diagram illustrating an example of a job management table.

ジョブ管理テーブル１２１は、ジョブ番号、プログラム名、ノード数、ノード当たりＧＰＵ数および実行時間の項目を含む。ジョブ番号の項目には、ジョブの識別番号であるジョブ番号が登録される。プログラム名の項目には、ジョブのプログラム名が登録される。ノード数の項目には、ジョブの実行に要求されるノード数が登録される。ノード当たりＧＰＵ数の項目には、ジョブの実行に要求されるノード当たりＧＰＵ数が登録される。実行時間の項目には、ユーザにより実行要求で指定されたジョブの実行時間の上限値が登録される。 The job management table 121 includes fields for job number, program name, number of nodes, number of GPUs per node, and execution time. The job number field registers the job number, which is the identification number of the job. The program name field registers the program name of the job. The number of nodes field registers the number of nodes required to execute the job. The number of GPUs per node field registers the number of GPUs per node required to execute the job. The execution time field registers the upper limit of the execution time of the job specified by the user in the execution request.

例えば、ジョブ管理テーブル１２１は、ジョブ番号「１」、プログラム名「Ａ」、ノード数「２」、ノード当たりＧＰＵ数「１」、実行時間「１：００：００」のレコードを有する。当該レコードはジョブ番号「１」のジョブのプログラム名が「Ａ」、要求されるノード数が「２」、要求されるノード当たりＧＰＵ数が「１」、ジョブの実行時間の上限が「１：００：００」（１時間０分０秒）であることを示す。なお、ジョブの実行要求で指定されるジョブの実行時間は、通信競合の影響が考慮されていない。ジョブの実行時間は、実際のジョブ実行時の通信競合の影響により延びることがある。 For example, job management table 121 has a record with job number "1", program name "A", number of nodes "2", number of GPUs per node "1", and execution time "1:00:00". This record indicates that the program name of the job with job number "1" is "A", the requested number of nodes is "2", the requested number of GPUs per node is "1", and the upper limit of the job execution time is "1:00:00" (1 hour, 0 minutes, 0 seconds). Note that the job execution time specified in the job execution request does not take into account the effects of communication contention. The job execution time may be extended due to the effects of communication contention during actual job execution.
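One record of the job management table could be modeled as shown below. This is a sketch only: the class name `JobRecord`, its field names, and the `H:MM:SS` parsing helper are hypothetical, and the source does not specify how the table is stored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobRecord:
    job_number: int
    program_name: str
    num_nodes: int          # number of nodes requested by the job
    gpus_per_node: int      # number of GPUs requested per node
    time_limit: str         # upper limit of execution time, "H:MM:SS"

    def time_limit_seconds(self):
        """Convert the "H:MM:SS" upper limit into seconds."""
        hours, minutes, seconds = (int(p) for p in self.time_limit.split(":"))
        return hours * 3600 + minutes * 60 + seconds

# The example record from the text: job 1, program "A", two nodes,
# one GPU per node, upper limit of one hour.
job1 = JobRecord(job_number=1, program_name="A", num_nodes=2,
                 gpus_per_node=1, time_limit="1:00:00")

print(job1.time_limit_seconds())
```

The value in seconds is the form in which the upper limit would enter a total-time calculation whose unit is seconds.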

Records for other job numbers are also registered in the job management table 121.
FIG. 12 is a diagram illustrating an example of a reference communication time table.
The reference communication time table 122 includes fields for the number of nodes and the reference communication time (msec). The number of nodes field holds the number of nodes to which a job is assigned. The reference communication time field holds the reference communication time, which is the threshold used to determine whether communication contention occurs in communication between nodes. The reference communication time for each number of nodes is obtained in advance by a communication benchmark measurement before the system starts operating, and is registered in the reference communication time table 122. The unit of the reference communication time is msec (milliseconds).

For example, suppose a job is assigned to two nodes. If the communication time obtained by a communication benchmark measurement between those nodes is less than or equal to the reference communication time, it is determined that there is no communication contention. If the measured communication time is longer than the reference communication time, it is determined that there is communication contention.

For example, the reference communication time table 122 has a record with number of nodes "2" and reference communication time "4.70". This record indicates that when a job is executed on two nodes, the reference communication time for those two nodes is 4.70 msec.
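The contention rule described above reduces to a single comparison. The sketch below assumes a hypothetical function name `has_contention`; the 4.70 msec value is the two-node reference time from the example record, and the measured values are made up for illustration.

```python
def has_contention(measured_msec: float, reference_msec: float) -> bool:
    """Return True when the benchmarked time is longer than the reference time."""
    return measured_msec > reference_msec


REFERENCE_2_NODES_MSEC = 4.70  # from the example record for two nodes

# Hypothetical measurements: one at the reference value, one above it.
print(has_contention(4.70, REFERENCE_2_NODES_MSEC))
print(has_contention(5.10, REFERENCE_2_NODES_MSEC))
```

A measurement exactly equal to the reference time counts as no contention, matching the "less than or equal to" wording.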

Records for other numbers of nodes, such as "3", are also registered in the reference communication time table 122. Note that when three or more nodes are involved, the inter-node communication time (including the reference communication time) may be the average, the maximum, or the minimum of the pairwise communication times obtained in a communication benchmark measurement using those nodes.
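The three aggregation options for groups of three or more nodes can be sketched as follows. The pairwise times and the function name `group_communication_time` are hypothetical; the specification only states that the average, maximum, or minimum may be used.

```python
from statistics import mean

# Hypothetical pairwise benchmark results (msec) among three nodes.
pair_times = {(0, 1): 5.0, (0, 2): 5.5, (1, 2): 6.0}


def group_communication_time(pair_times_msec: dict, mode: str = "mean") -> float:
    """Reduce pairwise communication times to one value for the node group."""
    reducers = {"mean": mean, "max": max, "min": min}
    values = list(pair_times_msec.values())
    return reducers[mode](values)


for mode in ("mean", "max", "min"):
    print(mode, group_communication_time(pair_times, mode))
```

Whichever reduction is chosen, the same one would need to be applied to both the reference time and the measured time for the ratio between them to be meaningful.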

FIG. 13 is a diagram illustrating an example of a communication performance table.
The communication performance table 123 includes fields for item number, combination, communication time (msec), and ratio to reference. The item number field holds the item number, which is the identification number of the record. The combination field holds a candidate combination of nodes to be assigned to a job. The communication time field holds the communication time obtained in a communication benchmark measurement for that node combination. The unit of communication time is msec. The ratio to reference field holds the ratio of the communication time to the reference communication time (= communication time ÷ reference communication time). This ratio corresponds to α in equation (3).

The example of the communication performance table 123 shows the communication performance measured by the evaluation unit 133b for the candidate combinations obtained by selecting four nodes from the nodes "0", "1", "2", "5", and "6" illustrated in FIG. 6.

For example, the communication performance table 123 has a record with item number "1", combination "0,1,2,5", communication time "5.30", and ratio to reference "1.08". This record indicates that the inter-node communication time obtained in a communication benchmark measurement on the candidate combination of nodes "0", "1", "2", and "5" is 5.30 msec, and that the ratio α of this communication time to the reference communication time for 4 nodes is 1.08. The ratio is calculated from the reference communication time of 4.90 msec registered for 4 nodes in the reference communication time table 122: α = 5.30 ÷ 4.90 = 1.08.
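The ratio in this record can be reproduced with a one-line calculation. This is a minimal sketch; the function name `ratio_to_reference` is an assumption, and rounding to two decimal places is inferred from the tabulated value.

```python
def ratio_to_reference(measured_msec: float, reference_msec: float) -> float:
    """Ratio of the measured communication time to the reference time (alpha)."""
    return measured_msec / reference_msec


# Item 1: nodes 0, 1, 2, 5 measured at 5.30 msec; 4-node reference is 4.90 msec.
alpha = ratio_to_reference(5.30, 4.90)
print(round(alpha, 2))
```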

Records for other combination candidates are also registered in the communication performance table 123.
FIG. 14 is a diagram illustrating an example of an evaluation value table.
The evaluation value table 124 includes fields for item number, combination, ratio to reference, and evaluation value. The item number field holds the item number, which is the identification number of the record. The combination field holds a candidate combination of nodes to be assigned to a job. The ratio to reference field holds the ratio α of the communication time to the reference communication time. The evaluation value field holds T_Total as calculated by the evaluation unit 133b using equation (3).

The evaluation value table 124 shows the calculated T_Total for a job that uses four nodes (one GPU per node), as illustrated in FIG. 6. As an example, T_job = 3600 seconds and T_reconf = 150 seconds. The combination of nodes "0", "1", "2", and "5" requires no GPU reconfiguration.

For example, the evaluation value table 124 has a record with item number "1", combination "0,1,2,5", ratio to reference "1.08", and evaluation value "4276.80". This record indicates that the ratio α for the candidate combination of nodes "0", "1", "2", and "5" is 1.08, and that T_Total calculated using α and β is 4276.80. In this example, β = 1.1.

That is, for the combination "0,1,2,5", T_Total = 1.1 * 1.08 * 3600 = 4276.80.
The evaluation value table 124 also holds T_Total for the other combination candidates.

For the combination "0,1,2,6", T_Total = 150 + 1.1 * 1.02 * 3600 = 4189.20.
For the combination "0,1,5,6", T_Total = 150 + 1.1 * 1.12 * 3600 = 4585.20.

For the combination "0,2,5,6", T_Total = 150 + 1.1 * 1.10 * 3600 = 4506.00.
For the combination "1,2,5,6", T_Total = 150 + 1.1 * 1.06 * 3600 = 4347.60.

In the example of the evaluation value table 124, the record with item number "1" corresponds to the case where the GPU is not reconfigured. The records with item numbers "2" to "5" correspond to the case where the GPU is reconfigured.

In the example of the evaluation value table 124, the candidate combination of nodes "0", "1", "2", and "6" has the smallest T_Total. The node allocation unit 132 therefore obtains nodes "0", "1", "2", and "6" from the node selection unit 133 as the combination of nodes to which the job is to be assigned, and assigns those nodes to the job. Node "6" requires GPU reconfiguration, so the GPU of node "6" is reconfigured in order to assign the job.
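The five evaluation values can be recomputed from the quantities given in the text. The sketch below assumes that the evaluation takes the form T_Total = T_reconf + α × β × T_job, with T_reconf counted only for combinations that need reconfiguration, which is the form the worked calculations follow. The names `total_time` and `CANDIDATES` are hypothetical.

```python
T_JOB_SEC = 3600.0     # T_job from the example
T_RECONF_SEC = 150.0   # T_reconf from the example
BETA = 1.1             # beta from the example

# (combination, alpha, needs GPU reconfiguration), as used in the worked values.
CANDIDATES = [
    ((0, 1, 2, 5), 1.08, False),
    ((0, 1, 2, 6), 1.02, True),
    ((0, 1, 5, 6), 1.12, True),
    ((0, 2, 5, 6), 1.10, True),
    ((1, 2, 5, 6), 1.06, True),
]


def total_time(alpha: float, beta: float, t_job: float,
               t_reconf: float, needs_reconf: bool) -> float:
    """Reconfiguration overhead (if any) plus contention-scaled execution time."""
    overhead = t_reconf if needs_reconf else 0.0
    return overhead + alpha * beta * t_job


evaluation = {
    combo: round(total_time(alpha, BETA, T_JOB_SEC, T_RECONF_SEC, reconf), 2)
    for combo, alpha, reconf in CANDIDATES
}
best = min(evaluation, key=evaluation.get)

for combo, value in evaluation.items():
    print(combo, f"{value:.2f}")
print("selected:", best)
```

Here the 150-second reconfiguration overhead of combination "0,1,2,6" is outweighed by its lower contention ratio, which is why it is selected over the combination that needs no reconfiguration.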

Note that when several combinations share the smallest T_Total, and some of them include a node that requires reconfiguration while others do not, the node combination extraction unit 133a may select any one of the combinations that do not include a node requiring reconfiguration. This avoids unnecessary reconfiguration.
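One way to express this tie-breaking rule is to sort on a pair of keys. The following sketch uses made-up evaluation values and a hypothetical function name `choose_combination`; it assumes tied values compare exactly equal (for example after rounding).

```python
def choose_combination(rows):
    """Pick the row with the smallest T_Total, preferring no reconfiguration on ties.

    rows: iterable of (combination, t_total, needs_reconf) tuples.
    False sorts before True, so a tie goes to the combination without reconfiguration.
    """
    return min(rows, key=lambda row: (row[1], row[2]))[0]


# Hypothetical values: the first two candidates tie on T_Total.
rows = [
    ((0, 1, 2, 6), 4200.0, True),
    ((0, 1, 2, 5), 4200.0, False),
    ((1, 2, 5, 6), 4300.0, True),
]
print(choose_combination(rows))
```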

Next, the processing procedure of the management device 100 will be described.
FIG. 15 is a flowchart showing an example of processing by the job scheduler.
(S10) When the job information acquisition unit 131 receives a job execution request, it acquires job information from the request and registers it in the job management table 121. The node allocation unit 132 acquires the job information registered in the job management table 121 and determines whether the number of free nodes needed for the configuration required to execute the job is available. If it is, the process proceeds to step S11. If it is not, the process ends. In that case, for example, the node allocation unit 132 makes the same determination for the job again after a predetermined time has elapsed.

(S11) The node allocation unit 132 determines whether the job can be allocated within one rack, that is, whether all nodes to which the job is to be assigned can be selected from the same rack. If so, the process proceeds to step S12. If not, the process proceeds to step S13.

(S12) The node allocation unit 132 assigns the job to nodes in that rack. The process then proceeds to step S14.
(S13) The node allocation unit 132 requests node selection from the node selection unit 133. The node selection unit 133 performs node selection and returns the result to the node allocation unit 132. The node allocation unit 132 assigns the job to the nodes selected by the node selection unit 133. Details of node selection by the node selection unit 133 are described later.

(S14) The node allocation unit 132 instructs the nodes assigned in step S12 or step S13 to execute the job. The nodes that receive the instruction execute the job. The processing of the job scheduler 130 then ends.
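The flow of steps S10 to S13 can be sketched as a single function. This is a simplified illustration under stated assumptions: free nodes are grouped by rack in a plain dictionary, the per-node configuration check is reduced to a node count, and the node selection of step S13 is passed in as a callable. The names `schedule`, `free_by_rack`, and `select_nodes` are hypothetical.

```python
from typing import Callable, Optional


def schedule(num_nodes: int,
             free_by_rack: dict[str, list[int]],
             select_nodes: Callable[[list[int], int], list[int]]) -> Optional[list[int]]:
    """Return the nodes chosen for a job, or None when it cannot be placed yet."""
    all_free = [node for nodes in free_by_rack.values() for node in nodes]

    # S10: not enough free nodes, so give up for now (the caller may retry later).
    if len(all_free) < num_nodes:
        return None

    # S11/S12: if one rack can hold the whole job, allocate inside that rack.
    for nodes in free_by_rack.values():
        if len(nodes) >= num_nodes:
            return nodes[:num_nodes]

    # S13: otherwise delegate to node selection across racks.
    return select_nodes(all_free, num_nodes)


# Hypothetical rack layout and a placeholder selector for illustration only.
racks = {"rack0": [0, 1, 2], "rack1": [5, 6]}
placeholder_selector = lambda free, k: free[-k:]

print(schedule(2, racks, placeholder_selector))
print(schedule(4, racks, placeholder_selector))
print(schedule(6, racks, placeholder_selector))
```

Step S14 (instructing the chosen nodes to run the job) is left to the caller, which receives the list of assigned nodes.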

FIG. 16 is a flowchart illustrating an example of the node selection process.
The node selection process corresponds to step S13.
(S20) The node combination extraction unit 133a extracts node combinations. Details of the extraction are described later.

(S21) The evaluation unit 133b evaluates the node combinations. Details of the evaluation are described later. The evaluation produces the evaluation value table 124.
(S22) The node combination extraction unit 133a selects the combination of nodes to which the job is to be assigned based on the evaluation values (T_Total) in the evaluation value table 124. Specifically, the node combination extraction unit 133a compares the evaluation values T_Total in the evaluation value table 124 and selects the combination with the smallest T_Total. The node combination extraction unit 133a returns the nodes belonging to the selected combination to the node allocation unit 132. The node selection process then ends.

FIG. 17 is a flowchart illustrating an example of the node combination extraction process.
The node combination extraction process corresponds to step S20.
(S30) Based on the job information in the job management table 121, the node combination extraction unit 133a extracts combinations of nodes that are already configured with the configuration required to execute the job. For example, the storage unit 120 holds information on the number of GPUs connected to each free node, and the node combination extraction unit 133a can extract the combinations of configured nodes based on that information.

For example, in the example of FIG. 6, among the combinations of four nodes selected from the free nodes "0", "1", "2", "5", and "6", the combination of nodes "0", "1", "2", and "5" is a combination of nodes already configured as required to execute the job.

(S31) The node combination extraction unit 133a extracts combinations of nodes that require a configuration change. For example, the node combination extraction unit 133a can extract these combinations based on the information, stored in the storage unit 120, on the number of GPUs connected to each free node. In the example of FIG. 6, among the combinations of four nodes selected from the free nodes "0", "1", "2", "5", and "6", every combination other than nodes "0", "1", "2", and "5" requires a configuration change, because node "6" requires GPU reconfiguration. The node combination extraction process then ends.
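Steps S30 and S31 amount to enumerating the candidate combinations and splitting them by whether any member needs reconfiguration. The sketch below is illustrative only: the per-node GPU counts are hypothetical (the text states only that node "6" needs GPU reconfiguration), and the function name `extract_candidates` is an assumption.

```python
from itertools import combinations

# Hypothetical number of GPUs currently connected to each free node.
gpus_attached = {0: 1, 1: 1, 2: 1, 5: 1, 6: 0}


def extract_candidates(gpus_attached: dict, num_nodes: int, gpus_per_node: int):
    """Split all node combinations into already-configured and needs-change groups."""
    configured, needs_change = [], []
    for combo in combinations(sorted(gpus_attached), num_nodes):
        if all(gpus_attached[node] == gpus_per_node for node in combo):
            configured.append(combo)      # S30: no reconfiguration needed
        else:
            needs_change.append(combo)    # S31: at least one node must be reconfigured
    return configured, needs_change


configured, needs_change = extract_candidates(gpus_attached, num_nodes=4, gpus_per_node=1)
print("configured:", configured)
print("needs change:", needs_change)
```

With five free nodes and four requested, there are five combinations in total, which matches the five records of the evaluation value table.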

The node combinations extracted in steps S30 and S31 become the candidate combinations of nodes to which the job may be assigned.
FIG. 18 is a flowchart illustrating an example of the node combination evaluation process.

The node combination evaluation process corresponds to step S21.
(S40) The evaluation unit 133b evaluates the total time T_Total for each candidate node combination using equation (3). To do so, the evaluation unit 133b runs a predetermined communication benchmark measurement for a relatively short time, measures the communication time for each candidate combination, registers it in the communication performance table 123, and obtains α in equation (3). The communication benchmark measurement is performed by having each node in the candidate combination execute a communication benchmark program for a relatively short time. If (communication time measured by the communication benchmark) / (reference communication time) ≤ 1, the evaluation unit 133b determines that communication contention has no effect and sets α = 1 and β = 1 in equation (3). If (communication time measured by the communication benchmark) / (reference communication time) > 1, the evaluation unit 133b determines that communication contention has an effect and sets α = (communication time measured by the communication benchmark) / (reference communication time) and β = 1 + (1/8) × N_G.
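The rule for choosing α and β in step S40 can be written as a small function. This is a sketch under the assumption that N_G denotes the number of GPUs per node requested by the job; the function name `contention_coefficients` is hypothetical.

```python
def contention_coefficients(measured_msec: float, reference_msec: float,
                            gpus_per_node: int) -> tuple[float, float]:
    """Return (alpha, beta) following the rule of step S40."""
    ratio = measured_msec / reference_msec
    if ratio <= 1:
        # No effect of communication contention.
        return 1.0, 1.0
    # Contention present: alpha is the measured ratio, beta grows with N_G.
    return ratio, 1 + gpus_per_node / 8


print(contention_coefficients(4.50, 4.90, 1))
print(contention_coefficients(5.30, 4.90, 1))
```

With one GPU per node the formula gives β = 1.125, whereas the evaluation value table example uses β = 1.1, which appears to be a rounded value.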

(S41) The evaluation unit 133b records the T_Total calculated in step S40 in the evaluation value table 124 for each combination candidate. The node combination evaluation process then ends.

FIG. 19 is a diagram showing an example of differences in the total time for job execution.
Time charts 80, 81, and 82 each show an example of the total time from time t0 until job execution completes. In FIG. 19, time advances from left to right.

Time chart 80 shows the total time T_TotalA when the GPU configuration is not changed and the job is affected by communication contention.
Time chart 81 shows the total time T_TotalB when a configuration change removes the influence of communication contention and improves the execution time of the job itself. T_TotalB < T_TotalA. Thus, even with the initial overhead of the configuration change (reconfiguration), the total time can improve because the execution time of the job itself improves.

Time chart 82 shows the total time T_TotalC when no configuration change is made, the effect of communication contention is small, and the resulting increase in execution time is small. T_TotalC < T_TotalB. Thus, when the effect of communication contention is small, the increase in execution time is also small, and it can be better not to change the configuration.

The management device 100 therefore calculates the total time for each candidate node combination, taking into account both the resource reconfiguration time and the effect of inter-node communication contention during job execution, and selects the candidate combination with the shortest total time as the combination of nodes to which the job is assigned.

This allows the management device 100 to appropriately decide whether to reconfigure resources such as GPUs in a node. It also allows the management device 100 to select the nodes to which a job is assigned so that the total time until the job completes is shortened.

The functions of the management device 100 may be implemented by any of the nodes 200, 200a, .... In that case, the information processing system 2 need not include the management device 100 or the management network 50.

As described above, the management device 100 executes the following processing.
The job scheduler 130 calculates a first time, which is the time required from reconfiguration to completion of a job when the job is assigned to a first node that is a candidate for assignment of the job and requires resource reconfiguration to execute the job. The job scheduler 130 calculates the first time based on the reconfiguration time required for the resource reconfiguration, the execution time of the job after the reconfiguration, and a first coefficient representing the effect, on the execution time, of communication contention accompanying communication while the first node executes the job. The job scheduler 130 also calculates a second time, which is the time required to complete the job when the job is assigned, without resource reconfiguration, to a second node that is a candidate for assignment of the job and does not require resource reconfiguration. The job scheduler 130 calculates the second time based on the execution time and a second coefficient representing the effect, on the execution time, of communication contention accompanying communication while the second node executes the job. The job scheduler 130 compares the first time with the second time. If the first time is shorter than the second time, it reconfigures the resources of the first node; if the first time is equal to or longer than the second time, it does not reconfigure the resources of the first node.

This allows the management device 100 to appropriately determine whether to reconfigure resources, taking the effect of communication contention into account. The coefficient α×β described above is an example of the first coefficient and the second coefficient.
The job scheduler 130 may calculate the first time for a first node group that is a candidate for assignment of the job and includes the first node, and may calculate the second time for a second node group that is a candidate for assignment of the job and does not include the first node.

This allows the management device 100 to appropriately determine whether to reconfigure resources, taking the effect of communication contention into account. It also allows the management device 100 to determine the node group to which the job is assigned, that is, the combination of nodes, so that the time until the job completes is shortened.

For example, the job scheduler 130 calculates the first coefficient based on a measurement of a first communication time for communication between nodes belonging to the first node group. The job scheduler 130 also calculates the second coefficient based on a measurement of a second communication time for communication between nodes belonging to the second node group.

This allows the management device 100 to appropriately determine the first coefficient and the second coefficient. For example, the first coefficient can be calculated based on the ratio of the first communication time to a predetermined reference communication time, and the second coefficient can be calculated based on the ratio of the second communication time to the reference communication time.

In this case, the job scheduler 130 determines, based on the first communication time and the predetermined reference communication time, whether there is communication contention accompanying communication while the first node executes the job, and sets the first coefficient to 1 if there is none. Likewise, the job scheduler 130 determines, based on the second communication time and the reference communication time, whether there is communication contention accompanying communication while the second node executes the job, and sets the second coefficient to 1 if there is none.

This allows the management device 100 to appropriately determine the first coefficient and the second coefficient. For example, if the first communication time is equal to or less than the reference communication time, the job scheduler 130 determines that there is no communication contention and sets the first coefficient to 1. If the first communication time is longer than the reference communication time, the job scheduler 130 determines that there is communication contention and sets the first coefficient based on the ratio of the first communication time to the reference communication time. Similarly, if the second communication time is equal to or less than the reference communication time, the job scheduler 130 determines that there is no communication contention and sets the second coefficient to 1. If the second communication time is longer than the reference communication time, the job scheduler 130 determines that there is communication contention and sets the second coefficient based on the ratio of the second communication time to the reference communication time.

The job scheduler 130 also calculates the first coefficient and the second coefficient based on the amount of resources per node requested for the nodes used to execute the job.

The larger the amount of resources per node used to execute a job, the larger the communication volume during job execution tends to be. It is therefore estimated that the larger the amount of resources per node, the more susceptible the job execution time is to communication contention. By determining the first coefficient and the second coefficient based on the amount of resources requested per node, the management device 100 can determine them appropriately. The number of GPUs and the SSD capacity requested per node for job execution are examples of the amount of resources.

The job scheduler 130 also calculates the first time as the sum of the resource reconfiguration time and the product of the job execution time and the first coefficient, and calculates the second time as the product of the execution time and the second coefficient.
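The comparison between the first time and the second time can be sketched as follows. The function names are hypothetical; the numeric values reuse the example above (T_job = 3600 seconds, T_reconf = 150 seconds, and α×β as the coefficient).

```python
def first_time(t_job: float, first_coeff: float, t_reconf: float) -> float:
    """Time from reconfiguration to job completion on a node needing reconfiguration."""
    return t_job * first_coeff + t_reconf


def second_time(t_job: float, second_coeff: float) -> float:
    """Time to job completion on a node that needs no reconfiguration."""
    return t_job * second_coeff


def should_reconfigure(t_job: float, first_coeff: float, second_coeff: float,
                       t_reconf: float) -> bool:
    """Reconfigure only when the first time is strictly shorter than the second."""
    return first_time(t_job, first_coeff, t_reconf) < second_time(t_job, second_coeff)


# Coefficients from the example: 1.1 * 1.02 with reconfiguration, 1.1 * 1.08 without.
print(should_reconfigure(3600.0, 1.1 * 1.02, 1.1 * 1.08, 150.0))
# Equal coefficients: the reconfiguration overhead makes the first time longer.
print(should_reconfigure(3600.0, 1.1, 1.1, 150.0))
```

The strict inequality means that when the two times are equal, no reconfiguration is performed.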

This allows the management device 100 to appropriately calculate the first time and the second time.
Furthermore, when the job scheduler 130 reconfigures the resources of the first node, it assigns the job to the first node. When it does not reconfigure the resources of the first node, it assigns the job to the second node.

This enables the management device 100 to shorten the total time until execution of the job completes.
The information processing of the first embodiment can be implemented by causing the processing unit 12 to execute a program. The information processing of the second embodiment can be implemented by causing the CPU 101 to execute a program. The program can be recorded on a computer-readable recording medium 113.

For example, the program can be distributed by distributing the recording medium 113 on which it is recorded. The program may also be stored on another computer and distributed via a network. A computer may, for example, store (install) the program recorded on the recording medium 113, or a program received from another computer, in a storage device such as the RAM 102 or the HDD 103, and then read the program from that storage device and execute it.

Reference Signs List
1 Information processing system
10 Information processing device
11 Storage unit
12 Processing unit
20, 20a, 20b, 20c, 20d, ... Node
30 Management network
40 Inter-node network
41, 42 Communication path

Claims

1. A resource reconfiguration program accompanying job execution, the program causing a computer to execute a process comprising:
calculating a first time required from reconfiguration to completion of a job when the job is allocated to a first node, based on a reconfiguration time required for the reconfiguration of the first node, which is a candidate for allocation of the job and requires reconfiguration of resources to execute the job, an execution time of the job after the reconfiguration, and a first coefficient that represents an influence, on the execution time, of communication contention accompanying communication during execution of the job by the first node;
calculating a second time required for the completion of the job when the job is allocated to a second node without the reconfiguration of the resources, based on the execution time and a second coefficient that represents an influence, on the execution time, of communication contention accompanying communication during execution of the job by the second node, which is a candidate for allocation of the job and does not require the reconfiguration of the resources; and
comparing the first time with the second time, performing the reconfiguration of the resources in the first node if the first time is shorter than the second time, and not performing the reconfiguration of the resources in the first node if the first time is equal to or longer than the second time.

2. The resource reconfiguration program according to claim 1, which causes the computer to execute a process of calculating the first time for a first node group that is an allocation candidate for the job and includes the first node, and calculating the second time for a second node group that is an allocation candidate for the job and does not include the first node.

3. The resource reconfiguration program according to claim 2, which causes the computer to execute a process of calculating the first coefficient based on a measurement result of a first communication time of communication between nodes belonging to the first node group, and calculating the second coefficient based on a measurement result of a second communication time of communication between nodes belonging to the second node group.

4. The resource reconfiguration program according to claim 3, which causes the computer to execute a process of:
determining, based on the first communication time and a predetermined reference communication time, whether communication contention occurs in communication by the first node when the job is executed, and setting the first coefficient to 1 if the communication contention does not occur; and
determining, based on the second communication time and the reference communication time, whether communication contention occurs in communication by the second node when the job is executed, and setting the second coefficient to 1 if the communication contention does not occur.
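
For illustration only, the coefficient rule recited just above can be sketched in Python. The claim fixes only one point: the coefficient is 1 when no communication contention occurs. Everything else here is an assumption introduced for the sketch and is not taken from the claims: the function name `contention_coefficient`, the test "measured time exceeds the reference time" as the contention criterion, and the use of the ratio of measured to reference communication time as the coefficient when contention does occur.

```python
def contention_coefficient(measured_comm_time: float,
                           reference_comm_time: float) -> float:
    """Return a coefficient expressing how contention stretches execution time.

    Hypothetical helper: only the "no contention -> 1" branch is stated in
    the claim; the criterion and the ratio below are assumptions.
    """
    if reference_comm_time <= 0:
        raise ValueError("reference_comm_time must be positive")

    # Assumption: contention is deemed absent when the measured inter-node
    # communication time does not exceed the predetermined reference time.
    if measured_comm_time <= reference_comm_time:
        return 1.0  # stated in the claim: coefficient is 1 without contention

    # Assumption: with contention, scale by how much slower communication is.
    return measured_comm_time / reference_comm_time
```

Under these assumptions, a node group whose measured communication time is 3.0 against a reference of 2.0 would receive a coefficient of 1.5, while any group at or below the reference would receive exactly 1.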

5. The resource reconfiguration program according to claim 1, which causes the computer to execute a process of calculating the first coefficient and the second coefficient based on an amount of the resource per node required for a node used to execute the job.

6. The resource reconfiguration program according to claim 1, which causes the computer to execute a process of calculating, as the first time, the sum of the reconfiguration time and the product of the execution time and the first coefficient, and calculating, as the second time, the product of the execution time and the second coefficient.

7. The resource reconfiguration program according to claim 1, which causes the computer to execute a process of assigning the job to the first node when the reconfiguration of the resources in the first node is to be performed, and assigning the job to the second node when the reconfiguration of the resources in the first node is not to be performed.
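
As a reading aid, the decision recited in the program claims above can be sketched in Python: the first time is the execution time multiplied by the first coefficient plus the reconfiguration time, the second time is the execution time multiplied by the second coefficient, and reconfiguration is performed only when the first time is strictly shorter. The names `Decision` and `decide`, the node labels, and the numeric values in the usage note are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    first_time: float    # time to completion with reconfiguration
    second_time: float   # time to completion without reconfiguration
    reconfigure: bool    # True if the first node is to be reconfigured
    assigned_node: str   # node that receives the job


def decide(execution_time: float,
           reconfiguration_time: float,
           first_coefficient: float,
           second_coefficient: float,
           first_node: str = "first_node",
           second_node: str = "second_node") -> Decision:
    """Compare both candidate allocations and pick the one to use."""
    # First time: contention-scaled execution time plus reconfiguration time.
    first_time = execution_time * first_coefficient + reconfiguration_time
    # Second time: contention-scaled execution time only.
    second_time = execution_time * second_coefficient
    # Reconfigure only if strictly shorter; a tie means no reconfiguration.
    reconfigure = first_time < second_time
    assigned = first_node if reconfigure else second_node
    return Decision(first_time, second_time, reconfigure, assigned)
```

With illustrative values `decide(100.0, 30.0, 1.0, 1.5)`, the first time is 130.0 and the second time is 150.0, so the first node is reconfigured and receives the job. Raising the reconfiguration time to 50.0 makes both times 150.0, and because the first time is then equal to the second, no reconfiguration is performed and the job goes to the second node.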

8. A resource reconfiguration method accompanying job execution, in which a computer executes a process comprising:
calculating a first time, required from reconfiguration to completion of a job when the job is allocated to a first node, based on a reconfiguration time required for reconfiguring the first node, which is an allocation candidate for the job and requires resource reconfiguration to execute the job, an execution time of the job after the reconfiguration, and a first coefficient representing the influence on the execution time of communication contention arising from communication performed while the first node executes the job;
calculating a second time, required until completion of the job when the job is allocated to a second node without reconfiguring resources, based on the execution time and a second coefficient representing the influence on the execution time of communication contention arising from communication performed while the second node, which is an allocation candidate for the job and does not require resource reconfiguration, executes the job; and
comparing the first time with the second time, performing the resource reconfiguration in the first node if the first time is shorter than the second time, and not performing the resource reconfiguration in the first node if the first time is equal to or longer than the second time.

9. An information processing system comprising:
a storage unit that stores a reconfiguration time required for reconfiguring a first node, which is an allocation candidate for a job and requires resource reconfiguration to execute the job, and an execution time of the job after the reconfiguration; and
a processing unit that calculates a first time, required from the reconfiguration to completion of the job when the job is allocated to the first node, based on the reconfiguration time, the execution time, and a first coefficient representing the influence on the execution time of communication contention arising from communication performed while the first node executes the job, calculates a second time, required until completion of the job when the job is allocated to a second node without reconfiguring resources, based on the execution time and a second coefficient representing the influence on the execution time of communication contention arising from communication performed while the second node, which is an allocation candidate for the job and does not require resource reconfiguration, executes the job, compares the first time with the second time, performs the resource reconfiguration in the first node if the first time is shorter than the second time, and does not perform the resource reconfiguration in the first node if the first time is equal to or longer than the second time.