JP2008225686A

JP2008225686A - Data arrangement management device and method in distributed data processing platform, and system and program

Info

Publication number: JP2008225686A
Application number: JP2007060741A
Authority: JP
Inventors: Satoshi Yamakawa; 聡山川
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-03-09
Filing date: 2007-03-09
Publication date: 2008-09-25

Abstract

PROBLEM TO BE SOLVED: To enable optimum data arrangement by eliminating problems that arise when data processing is executed as parallel distributed processing, such as the vast time spent creating replica data and the need to recreate replica data to reflect the latest data classification status.

SOLUTION: The system includes: a data arrangement management device 1 that manages the attribute information, address information, and data size of data and, in response to an inquiry based on attribute information, provides the inquiry source with the data size information and address information of the data group associated with that attribute information; an annotator 4 that generates attribute information for the data stored in a data processing server 2 and updates the attribute information data of the data arrangement management device; and a data creation client 5 that, when storing newly created data in the data processing server, transmits a new-creation request and the data size to the data arrangement management device, acquires the address information of the data arrangement destination, and stores the data in the data processing server.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to information processing technology, and in particular to a data arrangement technique suited to executing data processing as parallel distributed processing.

In general, when data processing such as data analysis or data readout is performed on a large amount of data stored in a data storage system, reading the data from the disk storage in the data storage device becomes the bottleneck and lowers the performance of the data processing as a whole. This is because access to disk storage is generally slower than access to the other components of a data processing system, such as the CPU and memory.

The larger the amount of data handled, the worse the balance between processing in the disk storage and processing in the other components, and the more prominently the slowness of reading data from disk storage appears as the bottleneck of the entire system.

To address this problem, a commonly used technique is to distribute the data group used for data processing across a plurality of data storage devices and to read the data from the disk storages in parallel.

With this technique, the data group is divided evenly in advance and distributed across a plurality of data storage devices, so that the read processing is executed in parallel on the individual data storage devices. This raises the performance of the data processing as a whole. The technique is used in particular to speed up data processing in which there is no correlation among the data distributed across the data storage devices.

On the other hand, executing data processing such as the data analysis and data readout described above presupposes that the data group used for the processing has itself already been created.

In a system intended for a single kind of data processing, if that processing is executed without depending on the type of data, a data storage system that parallelizes data readout can easily be realized by distributing the data generated at creation time so that capacity is even across the plurality of data storage devices.

However, consider, for example, a case in which
- a plurality of data processing operations coexist, and
- each data processing operation does not use all of the data stored in the data storage system.
In such a case, simply distributing the data at creation time so that capacity is even cannot achieve a distributed arrangement and parallel processing that speed up all of the data processing operations.

To solve this problem, one conceivable approach is to extract, from the data group stored in the data storage system, only the data group used by a particular data processing operation, create a replica of the extracted data group in another data storage system, and distribute the replica across a plurality of data storage devices so that capacity is even, thereby speeding up the data processing.

As prior art related to data arrangement that takes the correspondence between data and attributes into account, Patent Document 1, for example, discloses a configuration in which geographical information having a spatial distribution, such as satellite image data (presentation target data), is associated with attribute information accompanying the satellite image data, such as quality accuracy and green coverage ratio, and the presentation target data group is arranged and made visible in a data space whose number of dimensions is based on the attribute information of the presentation target data group (a three-dimensional space of longitude, latitude, and the like, and time).

Various systems have been proposed that distribute a data group across a plurality of data storage devices and equalize the amount of data. Patent Document 2, for example, discloses a distributed storage system in which a monitor agent monitors, via a network, information on the total amount of data stored in each storage device or in all storage devices, and an arrangement means, whose purpose is to move data so as to achieve system leveling (leveling of used capacity, remaining capacity, or capacity utilization), distributes or redistributes logical storage areas over the physical storage areas on the storage devices via the network and a control module.

As a configuration that distributes and rearranges data, Patent Document 3, for example, discloses a disk array device that records the physical device number and address at which each data file is written in a map in a relocation means, monitors with a file monitoring means the names of data files that are requested to be read together and the number of times each combination occurs, and, based on the data recorded in the relocation means and in the file monitoring means, distributes data files concentrated on one physical device across a plurality of physical devices.

However, as will become clear from the description below, neither the problem to be solved by the present invention nor the means for solving it is described in any of Patent Documents 1 to 3.

Patent Document 1: JP 2000-076294 A
Patent Document 2: JP 2005-050303 A
Patent Document 3: JP H08-202503 A

Consider an environment in which a plurality of different data processing systems coexist, each executing data processing on data stored in its corresponding data storage system, and in which processing is performed by creating replicas of the data and distributing them to other data storage systems. In such an environment, speeding up all of the data processing requires that the different data processing systems execute their processing on the same data group. Otherwise, for each data processing operation, the data used only by that operation must be extracted from the data storage system, and a replica of the extracted data must be created in another data storage system so that the data can be distributed evenly across the plurality of data processing systems.

However, data processing that uses such a distributed arrangement normally presupposes that the data group used by each data processing system is large enough to be worth processing in parallel. Creating a replica of the data group for each data processing operation therefore consumes a large amount of storage capacity, and creating the replica takes an enormous amount of time.

Furthermore, in a system environment where a plurality of data classification schemes, such as the content, meaning, and positioning of data, are used as criteria for identifying the data group used by a data processing operation, the content, meaning, or positioning of data may change over time. When it does, cases such as the following arise:
- data that did not fall under the data processing operation's classification when the replica was created is newly added as a processing target;
- data drops out of the processing target because its definition has changed; or
- data created after the replica was created is newly added as a processing target.

To cope with such cases and execute data processing under the latest data classification status, the replica data must be recreated every time the data processing is executed.

In any case, for an environment in which a large amount of data is stored in a plurality of data processing systems and on which data processing such as analysis and data readout is executed, it is desirable to realize a system that enables optimum data arrangement by eliminating problems that arise when the data processing is executed as parallel distributed processing, such as the following:
- an enormous amount of time is spent creating replica data; and
- replica data must be recreated to reflect the latest data classification status.

Related to the above problems, when data capacity is leveled among a plurality of data processing systems each equipped with storage, optimizing the data arrangement with respect to one piece of attribute information may widen the deviation in stored data capacity among the storages for data related to another piece of attribute information. Data processing based on that other attribute information then becomes slower, and data processing performance deteriorates overall.

At the time of filing, no prior art document was found that discloses a technique related to the recognition of the problems described above. An object of the present invention is to provide an entirely new device, system, method, and computer program that solve these problems.

To solve the above problems, the invention disclosed in the present application is generally configured as follows.

In an environment where attribute information changes dynamically and the perspective from which data is processed changes, the present invention uses the state of multifaceted attribute information to extract the data to be rearranged and rearranges that data at the optimum location, so that a plurality of data processing operations can be executed in parallel.

In an environment where a plurality of different data processing tasks are executed on a data group stored in a data storage system, the present invention does not create, for each data processing task, a replica of the target data group and store it in another data storage system. Instead, it uses only a single data storage system and, by introducing a data arrangement management device, rearranges the data group to the optimum location based on the data classification status, thereby parallelizing the plurality of different data processing operations.

In the present invention, each data group is given a plurality of pieces of attribute information that are used to classify the data used by each data processing operation.

In the present invention, the data arrangement management device includes:
- means for managing the attribute information given to a data group, the address information of the data storage device storing the data group, and the data size;
- means for acquiring the attribute information designated as the data processing target in each data processing task;
- means for calculating, from the attribute information, the distribution of data capacity of the data group associated with that attribute information and the distribution of data capacity ratios among a plurality of data storage devices; and
- means for rearranging data, based on the latest state of the data capacity distribution and the data capacity ratio distribution, so that for the data group associated with given attribute information the stored data capacity is even among the plurality of data storage devices within a certain range.
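A minimal sketch of the two quantities named above, the per-device data capacity distribution for one attribute and the corresponding capacity ratio, may make the idea concrete. The `Record` type, the function names, and the `tolerance` parameter are illustrative assumptions; the source does not specify how the "certain range" is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One managed entry (hypothetical layout)."""
    data_id: str
    server: str            # address information, reduced here to a server name
    size: int              # data size in bytes
    attributes: frozenset  # attribute information given to the data


def capacity_distribution(records, attribute, servers):
    """Bytes stored on each server for data tagged with `attribute`."""
    dist = {s: 0 for s in servers}
    for r in records:
        if attribute in r.attributes:
            dist[r.server] += r.size
    return dist


def capacity_ratio(dist):
    """Share of the attribute's total bytes held by each server."""
    total = sum(dist.values())
    if total == 0:
        return {s: 0.0 for s in dist}
    return {s: v / total for s, v in dist.items()}


def is_balanced(dist, tolerance=0.1):
    """Assumed criterion: every share lies within `tolerance` of the even share."""
    if sum(dist.values()) == 0:
        return True  # nothing stored for this attribute, nothing to balance
    even = 1.0 / len(dist)
    return all(abs(r - even) <= tolerance for r in capacity_ratio(dist).values())
```

Under this assumed criterion, a rearrangement would be considered for an attribute whenever `is_balanced` returns `False` for its distribution.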

A device according to another aspect of the present invention includes:
- information storage means that, for a plurality of data items each given a plurality of pieces of attribute information and for a plurality of data storage devices, stores and manages, in association with one another, address information indicating the storage location of each data item, size information of the data item, and the attribute information given to the data item;
- means for referring to the information storage means and acquiring the distribution of data capacity, across the plurality of data storage devices, of the data to which attribute information is given;
- means for deriving, when the data capacity distribution acquired for at least one of the plurality of pieces of attribute information is judged to meet a predetermined data rearrangement condition, a rearrangement plan under which, once the data is rearranged, the distribution of data capacity among the plurality of data storage devices satisfies a predetermined uniformity condition not only for that one piece of attribute information but also for at least one other piece of attribute information; and
- means for rearranging data among the plurality of data storage devices in accordance with the derived rearrangement plan.
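The plan derivation can be illustrated with a small greedy search. This is only a sketch under stated assumptions, not the method of the embodiment: imbalance is measured per attribute as (max - min) / total bytes, a single `limit` stands in for the predetermined uniformity condition, and every name (`spread`, `derive_plan`, the dictionary layout of `items`) is hypothetical.

```python
def spread(items, attribute, servers):
    """(max - min) / total of the bytes tagged with `attribute` across servers."""
    load = {s: 0 for s in servers}
    for it in items.values():
        if attribute in it["attrs"]:
            load[it["server"]] += it["size"]
    total = sum(load.values())
    if total == 0:
        return 0.0
    return (max(load.values()) - min(load.values())) / total


def derive_plan(items, trigger, others, servers, limit=0.2, max_moves=100):
    """Greedily pick moves that balance `trigger` without unbalancing `others`.

    Returns (plan, ok): plan is a list of (data_id, source, destination) and
    ok says whether every listed attribute ends up within `limit`.
    """
    state = {k: dict(v) for k, v in items.items()}  # work on a copy
    plan = []
    for _ in range(max_moves):
        current = spread(state, trigger, servers)
        if current <= limit:
            break
        best = None
        for data_id, it in state.items():
            if trigger not in it["attrs"]:
                continue
            src = it["server"]
            for dst in servers:
                if dst == src:
                    continue
                it["server"] = dst  # try the move
                t = spread(state, trigger, servers)
                o = max((spread(state, a, servers) for a in others), default=0.0)
                it["server"] = src  # undo the trial move
                # Admissible only if the other attributes stay within the limit.
                if o <= limit and (best is None or t < best[0]):
                    best = (t, data_id, src, dst)
        if best is None or best[0] >= current:
            break  # no admissible move improves the trigger attribute
        _, data_id, src, dst = best
        state[data_id]["server"] = dst
        plan.append((data_id, src, dst))
    ok = all(spread(state, a, servers) <= limit for a in [trigger, *others])
    return plan, ok
```

The point of the sketch is the admissibility test: a move that would even out the triggering attribute is rejected if it pushes another attribute outside the limit, which is the situation the text above is designed to avoid.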

A computer program according to the present invention causes a computer to execute:
- a process of storing in a storage unit, for a plurality of data items each given a plurality of pieces of attribute information and for a plurality of data storage devices, in association with one another, address information indicating the storage location of each data item, size information of the data item, and the attribute information given to the data item;
- a process of referring to the storage unit and acquiring the distribution of data capacity, across the plurality of data storage devices, of the data to which attribute information is given;
- a process of evaluating data rearrangement plans, when the data capacity distribution acquired for at least one of the plurality of pieces of attribute information is judged to meet a predetermined data rearrangement condition, and deriving a rearrangement plan under which the distribution of data capacity among the plurality of data storage devices satisfies a predetermined uniformity condition not only for that one piece of attribute information but also for at least one other piece of attribute information; and
- a process of rearranging data among the plurality of data storage devices in accordance with the derived rearrangement plan.

A method according to the present invention includes:
- a step of storing and holding in a storage unit, for a plurality of data items each given a plurality of pieces of attribute information and for a plurality of data storage devices, in association with one another, address information indicating the storage location of each data item, size information of the data item, and the attribute information given to the data item;
- a step of referring to the storage unit and acquiring the distribution of data capacity, across the plurality of data storage devices, of the data to which attribute information is given;
- a step of deriving, when the data capacity distribution acquired for at least one of the plurality of pieces of attribute information is judged to meet a predetermined data rearrangement condition, a rearrangement plan under which, once the data is rearranged, the distribution of data capacity among the plurality of data storage devices satisfies a predetermined uniformity condition not only for that one piece of attribute information but also for at least one other piece of attribute information; and
- a step of rearranging data among the plurality of data storage devices in accordance with the derived rearrangement plan.

A system according to the present invention includes: means for deriving, for data each having a plurality of pieces of attribute information, the optimum data arrangement achievable among the plurality of data storage devices that store the data, such that, for each of the plurality of pieces of attribute information, the distribution of data capacity among the data storage devices approaches a predetermined distribution; and means for rearranging the data in accordance with the derived data arrangement.

According to the present invention, when a plurality of different data processing tasks based on attribute information are executed on data that is stored in a data storage system and for which a plurality of pieces of attribute information are defined, parallel distributed processing using a plurality of data processing devices can be executed using only that data storage system, without creating a replica of the data group in another data storage system for each data processing operation, even if the data group targeted by each operation is enormous.

In other words, because no replica data is created for each data processing operation, the present invention reduces the time needed to create replica data and the storage capacity needed to hold it.

Furthermore, in a system that levels stored data capacity among a plurality of storages, the present invention can suppress the situation in which leveling the data with respect to one piece of attribute information instead widens the deviation in stored data capacity among the storages for data related to another piece of attribute information and thereby degrades overall data processing performance.

The present invention is described in more detail below with reference to the accompanying drawings. FIG. 1 is a diagram showing an example of the system configuration of an embodiment of the present invention. Referring to FIG. 1, the present embodiment includes:
- at least one task broker 3;
- at least two data processing servers 2;
- at least one annotator 4;
- at least one data creation client 5;
- at least one data processing task client 8; and
- a data arrangement management device 1.
These constitute a system whose components can communicate via a network 6. For ease of explanation, FIG. 1 shows the data processing task client 8 connected so that it gives data processing tasks directly to the task broker 3, but the data processing task client 8 may of course give data processing tasks to the task broker 3 via the network 6.

The task broker 3 receives a data processing task from the data processing task client 8 and transfers it via the network 6 to all of the data processing servers 2 constituting the system. The task broker 3 also receives and integrates the data processing results from the data processing servers 2 to complete the task, and returns a task completion response to the data processing task client 8. The task broker 3 keeps a list of the tasks scheduled for execution in an internal storage unit (not shown).
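The scatter and gather role of the task broker can be sketched as follows. The class and function names, the dictionary layout of a task, and the `merge` callable are assumptions made for illustration; the source states only that the task is forwarded to every data processing server and that the results are integrated.

```python
from concurrent.futures import ThreadPoolExecutor


class DataProcessingServer:
    """Stand-in for a data processing server with its own local storage."""

    def __init__(self, name, local_data):
        self.name = name
        self.local_data = local_data  # list of (attribute set, payload) pairs

    def execute(self, task):
        # Select only the local data tagged with the task's attribute.
        selected = [p for attrs, p in self.local_data if task["attribute"] in attrs]
        return task["operation"](selected)


def run_task(task, servers):
    """Forward `task` to every server in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partials = list(pool.map(lambda s: s.execute(task), servers))
    return task["merge"](partials)
```

For example, a task of the form `{"attribute": "log", "operation": sum, "merge": sum}` would total the payloads tagged `"log"` on each server and then add the per-server totals. How evenly the tagged data is spread across servers determines how well this parallel step scales, which is why the arrangement matters.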

A data processing task includes at least:
- attribute information for identifying the data group on which the data processing is to be executed; and
- the content of the data processing to be executed, such as data analysis or data readout.

Each of the plurality of data processing servers 2 includes a disk storage 7 as a data storage device that stores the data used for data processing.

Based on the data processing task transferred from the task broker 3, the data processing server 2 executes the task using the data stored in its disk storage 7 and returns the execution result to the task broker 3.

The data arrangement management device 1 stores and manages, in association with one another:
- the attribute information associated with each data item;
- address information indicating the storage location of the data item; and
- the data size of the data item.

In response to an inquiry based on attribute information from an external device, the data arrangement management device 1 provides the inquiry source with:
- the address information of the data group associated with that attribute information; and
- the data size information.
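The managed information and the attribute-based inquiry can be sketched as a small in-memory catalog. The class name `PlacementCatalog`, its methods, and the `(server, path)` form of an address are hypothetical; the source does not define a concrete interface or address format.

```python
class PlacementCatalog:
    """Keeps address, size, and attribute information together per data item."""

    def __init__(self):
        self._records = {}  # data_id -> {"address", "size", "attrs"}

    def register(self, data_id, address, size, attrs=()):
        """Record where a data item is stored, how large it is, and its attributes."""
        self._records[data_id] = {
            "address": address,
            "size": size,
            "attrs": set(attrs),
        }

    def annotate(self, data_id, add=(), remove=()):
        """Add or remove attribute information for an existing data item."""
        attrs = self._records[data_id]["attrs"]
        attrs.update(add)
        attrs.difference_update(remove)

    def query(self, attribute):
        """Return (address, size) for every data item tagged with `attribute`."""
        return [
            (r["address"], r["size"])
            for r in self._records.values()
            if attribute in r["attrs"]
        ]
```

Because attributes are stored per data item rather than baked into the storage layout, a change in classification only updates the catalog; the data itself does not have to be copied to answer a new kind of inquiry.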

FIG. 2 is a diagram showing an example of the configuration of the data arrangement management device 1 in the present embodiment. Referring to FIG. 2, the data arrangement management device 1 includes a data arrangement/attribute information storage unit 100, a data management unit 101, a task analysis unit 102, a data arrangement status analysis unit 103, a data arrangement control unit 104, and a rearrangement execution unit 105.

The data arrangement/attribute information storage unit 100 stores:
- the address information of the data stored in the disk storages 7 of the data processing servers 2;
- the attribute information associated with the data; and
- the data size of the data.

The data management unit 101 updates the data stored in the data arrangement/attribute information storage unit 100 upon receiving:
- a new data creation request from the data creation client 5; or
- a request from the annotator 4 to add or change attribute information.

The task analysis unit 102 analyzes the content (the processing to be executed) of the data processing tasks that the task broker 3 has executed or is scheduled to execute.

The data arrangement status analysis unit 103 uses the information stored in the data arrangement/attribute information storage unit 100 (the storage addresses and attribute information of the data) to analyze how the data group keyed by a specific piece of attribute information is distributed across the plurality of data processing servers 2. That is, it acquires information on how the data given that attribute information is distributed across the disk storages 7 of the plurality of data processing servers 2.

The data arrangement control unit 104 determines the data to be rearranged and its destination using:
- the content of the tasks (data processing tasks) obtained by the task analysis unit 102; and
- the data distribution status acquired by the data arrangement status analysis unit 103.

再配置実行部１０５は、データ配置制御部１０４で決定されたデータ配置対象のデータを、決定された配置場所に基づき、データ処理サーバ２のデータを再配置する。 The rearrangement execution unit 105 rearranges the data of the data processing server 2 based on the determined placement location, based on the data placement target data determined by the data placement control unit 104.

Note that the data management unit 101, the task analysis unit 102, and the rearrangement execution unit 105 each include means for communicating, via the network 6, with the other devices constituting the system.

The processing and functions of the data management unit 101, the task analysis unit 102, the data arrangement status analysis unit 103, the data arrangement control unit 104, and the rearrangement execution unit 105 in FIG. 2 may be realized by programs executed on a computer.

Referring again to FIG. 1, the annotator 4 has functions such as:
・analyzing the content and positioning of the data stored in the disk storage 7 of a data processing server 2 to generate attribute information, and
・updating the attribute information data in the data arrangement / attribute information storage unit 100 via the data management unit 101 (see FIG. 2) of the data arrangement management device 1.

When storing newly created data in a data processing server 2, the data creation client 5 sends the following to the data management unit 101 (see FIG. 2) of the data arrangement management device 1:
・a new data creation request, and
・the data size.
It thereby
・obtains the address information of the data placement destination, and
・stores the data at the specified location in the disk storage 7 of the data processing server 2.

<Data processing task execution procedure>
Next, the basic procedure for executing a data processing task using the system of FIG. 1 is described.

A data processing task registered in the task broker 3 is transferred via the network 6, as a data processing instruction containing the data processing target and the data processing content, to all data processing servers 2 constituting the system.

Each data processing server 2
・extracts from the received data processing instruction the attribute information used to specify the data processing target, and
・sends the data arrangement management device 1 a request for a list of the data that
　・carries that attribute information, and
　・is stored in the server's own disk storage 7.

On receiving the data list acquisition request, the data management unit 101 of the data arrangement management device 1 uses the given attribute information to obtain, from the data arrangement / attribute information storage unit 100, a list of the data that
・carries that attribute information, and
・is stored in the disk storage 7 of the data processing server 2 that issued the data list acquisition request
(this is called the "data list acquisition operation"), and transfers the list to the requesting data processing server 2. The data list acquisition operation is executed for every data processing server 2 that received the data processing instruction.

Of the data stored in its own disk storage 7, the data processing server 2 processes every item included in the data list received from the data arrangement management device 1, according to the data processing content contained in the data processing instruction.

After it finishes executing the given data processing instruction, the data processing server 2 sends the execution result to the task broker 3.

After the task broker 3 has obtained the execution results of the data processing instruction from all data processing servers 2 to which it transferred the instruction, it combines them into a single execution result according to the data processing content, returns that result to the data processing task client 8, and ends the data processing task.
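The execution procedure above can be sketched as a scatter and gather loop. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the catalog layout, the names `list_data` and `run_task`, and the sample records are all hypothetical, and the servers are visited sequentially here although the described system runs them in parallel.

```python
# Hypothetical in-memory stand-in for the data arrangement / attribute
# information storage unit 100: one dict per stored data item.
CATALOG = [
    {"id": "d1", "node": 1, "size_gb": 10, "attrs": {"color": "red"}},
    {"id": "d2", "node": 2, "size_gb": 20, "attrs": {"color": "red"}},
    {"id": "d3", "node": 2, "size_gb": 5, "attrs": {"color": "blue"}},
    {"id": "d4", "node": 3, "size_gb": 7, "attrs": {"color": "red"}},
]


def list_data(catalog, node, key, value):
    """Data list acquisition: items stored on `node` that carry key=value."""
    return [d for d in catalog
            if d["node"] == node and d["attrs"].get(key) == value]


def run_task(catalog, nodes, key, value, process, merge):
    """Broker side: hand the instruction to every node, then merge results."""
    partial_results = []
    for node in nodes:  # sequential here; parallel in the described system
        items = list_data(catalog, node, key, value)
        partial_results.append(process(items))
    return merge(partial_results)


# Example task: total size of all "red" data across nodes 1 to 3.
total_red_gb = run_task(
    CATALOG, [1, 2, 3], "color", "red",
    process=lambda items: sum(d["size_gb"] for d in items),
    merge=sum,
)
```

Each node only touches the items that the catalog reports as stored locally, which is why an uneven distribution of a data group across nodes directly limits the parallel speed-up.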

<Attribute information associated with data stored in the data processing servers>
In the present embodiment, the attribute information stored in the data arrangement / attribute information storage unit 100 of the data arrangement management device 1 has, for example, the following data structure.

<Key information>
Key information is data representing the category name of attribute information. For example, if the data stored in the disk storage 7 of a data processing server 2 includes color-related data, the category name "color" is stored as Key information.

<Value information>
Value information is data representing a specific value of the Key information. For example, if the category name "color" is defined as Key information, data representing category values such as red, blue, and yellow is stored as Value information.

One piece of Key information combined with one piece of Value information is defined as one piece of attribute information.

A plurality of pieces of attribute information can be assigned to the data stored in the disk storage 7 of a data processing server 2. However, a plurality of pieces of Value information cannot be assigned to one piece of Key information: each piece of attribute information assigned to data consists of one Key with exactly one Value.
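The Key/Value rule above maps naturally onto a dictionary, which holds at most one value per key. The following is a minimal sketch under that assumption; the class name `DataRecord`, its fields, and the address format are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class DataRecord:
    """One stored data item as the storage unit 100 might track it (hypothetical)."""
    address: str                 # e.g. "node1:/vol0/item42" (made-up format)
    size_gb: float
    attributes: dict = field(default_factory=dict)  # Key -> single Value

    def set_attribute(self, key, value):
        # Assigning again to the same Key replaces the old Value,
        # so a Key can never hold two Values at once.
        self.attributes[key] = value

    def has_attribute(self, key, value):
        return self.attributes.get(key) == value


record = DataRecord(address="node1:/vol0/item42", size_gb=1.5)
record.set_attribute("color", "red")
record.set_attribute("shape", "round")   # several Keys per item are allowed
record.set_attribute("color", "blue")    # replaces "red" for Key "color"
```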

In the present embodiment, attribute information is used to identify the data processing target of a data processing task transferred from the task broker 3.

Because a plurality of pieces of attribute information can be assigned to data in the present embodiment, the same data may be a processing target of several data processing tasks even when those tasks specify their targets with different attribute information. For example, the same data may be processed by a first data processing task on the basis of first attribute information and by a second data processing task on the basis of second attribute information.

<Relocation procedure for data stored in the data processing servers>
The procedure by which the data arrangement management device 1 relocates data among the data processing servers 2 in the present embodiment is described with reference to the flowchart of FIG. 3 and to FIGS. 4 to 6.

The description below uses as an example an environment in which the system shown in FIG. 1 has three data processing servers 2 with node numbers 1 to 3 and, as shown in FIG. 4, two types of Key information, each with two types of Value information, are assigned as attribute information to all data; that is, an environment in which a plurality of data processing tasks are executed on data groups defined by a total of four data classifications.

The data arrangement control unit 104 of the data arrangement management device 1 instructs the task analysis unit 102 to obtain and analyze the group of data processing task instructions that are registered in the task broker 3 and scheduled for future execution.

On the basis of the instruction received from the data arrangement control unit 104, the task analysis unit 102 obtains the group of data processing task instructions from the task broker 3 and extracts from it the attribute information that identifies the data groups targeted for data processing.

The task analysis unit 102 then extracts the Key information and Value information of all the attribute information obtained and transfers them to the data arrangement control unit 104 (step 200).

The data arrangement control unit 104 transfers to the data arrangement status analysis unit 103 the list of Key information and Value information obtained from the task analysis unit 102, together with an instruction to create a data capacity ratio distribution table showing, for each piece of attribute information, how the data carrying it is distributed over the disk storages 7 of the plurality of data processing servers 2.

The data arrangement status analysis unit 103 extracts from the data arrangement / attribute information storage unit 100 the address information and data size of all data matching the given Key information and Value information, creates a data capacity ratio distribution table (106 in FIG. 5) that summarizes, for each piece of attribute information, the capacity ratio among the data processing servers 2 of the data carrying that attribute information, and transfers the table to the data arrangement control unit 104 (step 201).

As described above, the data arrangement / attribute information storage unit 100 stores, in association with one another, the address information of the data stored in the disk storages 7 of the data processing servers 2, the attribute information associated with the data, and the data size. The data arrangement status analysis unit 103 can therefore create the data capacity ratio distribution table across the disk storages 7 of the data processing servers 2 simply by referring to the information stored in the data arrangement / attribute information storage unit 100.

The data capacity ratio distribution table 106 of FIG. 5 shows, for example, that data whose Key1 Value is A is split 30%, 40%, and 30% among nodes 1, 2, and 3, and that data whose Key1 Value is B is split 50%, 30%, and 20%.
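As a rough illustration of how such a table could be derived from stored addresses, sizes, and attributes, the sketch below aggregates sizes per attribute and node and converts them to percentages. The function name `capacity_ratio_table` and the record layout are assumptions, and the sizes are hypothetical values chosen only so that the Key1 rows reproduce the percentages quoted from FIG. 5.

```python
from collections import defaultdict

# (node, size_gb, attributes). Sizes are hypothetical; only the resulting
# percentages for Key1 = A and Key1 = B come from the text.
RECORDS = [
    (1, 180, {"Key1": "A"}), (2, 240, {"Key1": "A"}), (3, 180, {"Key1": "A"}),
    (1, 300, {"Key1": "B"}), (2, 180, {"Key1": "B"}), (3, 120, {"Key1": "B"}),
]


def capacity_ratio_table(records):
    """Return {(key, value): {node: percent of that group's capacity}}."""
    sizes = defaultdict(lambda: defaultdict(float))
    for node, size_gb, attrs in records:
        for key_value in attrs.items():
            sizes[key_value][node] += size_gb
    table = {}
    for key_value, per_node in sizes.items():
        total = sum(per_node.values())
        table[key_value] = {node: 100.0 * gb / total
                            for node, gb in per_node.items()}
    return table


TABLE = capacity_ratio_table(RECORDS)
```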

On the basis of the data capacity ratio distribution table 106 created by the data arrangement status analysis unit 103, the data arrangement control unit 104 determines whether any of the data groups associated with each piece of attribute information meets the condition for data relocation set in advance by a system administrator or the like (step 202).

The condition is, for example, a threshold based on the difference in data capacity ratio among the data processing servers 2; a data group that exceeds the threshold is judged to be a target for data relocation. Although not limited to this, the threshold in the example used to describe this procedure is whether the maximum ratio of stored data capacity exceeds twice the minimum ratio.

If any data group meets the condition for data relocation, the data arrangement control unit 104 begins examining relocation for one such data group.

In the data capacity ratio distribution table 106 shown in FIG. 5, for the data group carrying the attribute information Key1 = B, the data capacity stored on node 1 is more than twice (2.5 times) that stored on node 3, so this data group becomes the subject of the relocation examination.
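The trigger condition used in this example can be written as a one-line check. This is a sketch only; the function name `needs_relocation` and the `factor` parameter are hypothetical, while the percentages are the Key1 rows quoted from FIG. 5.

```python
def needs_relocation(ratios, factor=2.0):
    """True when the largest per-node share exceeds `factor` times the smallest."""
    return max(ratios.values()) > factor * min(ratios.values())


key1_a = {1: 30, 2: 40, 3: 30}   # Key1 = A, percent per node
key1_b = {1: 50, 2: 30, 3: 20}   # Key1 = B, percent per node

a_flagged = needs_relocation(key1_a)   # 40 is not more than 2 x 30
b_flagged = needs_relocation(key1_b)   # 50 is more than 2 x 20
```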

The data arrangement control unit 104 transfers an instruction to create a data capacity distribution table to the data arrangement status analysis unit 103.

Using the data stored in the data arrangement / attribute information storage unit 100, the data arrangement status analysis unit 103 creates a data capacity distribution table 107 such as that shown in FIG. 6 and transfers it to the data arrangement control unit 104 (step 203).

The data capacity distribution table is created for each data group associated with each combination of the attribute information obtained from the task broker 3. Letting X be the number of pieces of Key information obtained from the task broker 3 and A(i) be the number of Value values belonging to the i-th piece of Key information, the number of combinations is given by the following, where Π is the product operator:

$$\prod_{i=1}^{X} A(i)$$

Further, if the number of data processing servers is N, the data capacity distribution table can be divided into

$$N \times \prod_{i=1}^{X} A(i)$$

data groups.

In the example of attribute information shown in FIG. 4, there are 2 × 2 = 4 combinations, namely
A and C, A and D, B and C, B and D,
and the data in these four combinations is distributed over three data processing servers.

At most, therefore, the data can be divided into 2 × 2 × 3 = 12 data groups, and a data capacity distribution table 107 such as that shown in FIG. 6 can be created.
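The count for the FIG. 4 example can be checked by enumerating the combinations directly. This is a small sketch; the variable names are hypothetical, and the Key and Value labels are those used in the text.

```python
from itertools import product
from math import prod

VALUES_PER_KEY = {"Key1": ["A", "B"], "Key2": ["C", "D"]}
N_SERVERS = 3

# Every combination of one Value per Key.
combinations = list(product(*VALUES_PER_KEY.values()))

# Product of the number of Values per Key, then multiplied by the server count.
n_combinations = prod(len(values) for values in VALUES_PER_KEY.values())
n_data_groups = N_SERVERS * n_combinations
```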

Using the data capacity distribution table 107 obtained from the data arrangement status analysis unit 103 and the data capacity ratio distribution table 106 used in step 202, the data arrangement control unit 104 narrows down the data group to be relocated and its relocation destination.

To narrow down the data group to be relocated and its relocation destination, the evaluations (1) to (3) below are performed to determine the data to relocate and the relocation-source and relocation-destination data processing servers 2 (step 204).

(1) <Evaluating the capacity ratio to relocate, and the relocation source and destination, using the data capacity ratio distribution table>
For the data carrying the attribute information selected for relocation, the user or system administrator sets in advance a threshold for the data capacity ratio to be achieved after relocation, and the capacity ratio of the data group to relocate is determined so that this data capacity ratio is reached.

The relocation source and destination are determined as follows: for the data capacity needed to satisfy the above data capacity ratio, the data processing server holding the largest ratio is the relocation source and the data processing server holding the smallest ratio is the relocation destination.

This makes it possible to determine what percentage of the data group targeted for relocation should be relocated from which data processing server to which data processing server.

In the example used to describe this procedure, suppose the user or system administrator has set the post-relocation capacity ratio difference so that the maximum data capacity ratio is at most 1.5 times the minimum. From the data capacity ratio distribution table 106 of FIG. 5, it follows that, of all the data carrying the attribute information Key1 = B, data amounting to 8% of the total capacity should be relocated from the data processing server 2 with node number 1 to the data processing server 2 with node number 3, which changes the split from 50%, 30%, 20% to 42%, 30%, 28%.

(2) <Evaluating the relocation capacity using the data capacity distribution table, based on evaluation (1)>
The total capacity of the data carrying the attribute information targeted for relocation is calculated from the capacity data in the data capacity distribution table, and the data capacity that must actually be relocated is calculated from the data capacity ratio obtained in evaluation (1).

In addition, the data groups matching the relocation target, relocation source, and relocation destination found in evaluation (1) are extracted.

In the example of the data capacity distribution table 107 shown in FIG. 6, the total capacity of the data carrying the attribute information Key1 = B is 600 GB, so 600 × 0.08 = 48 GB of data should be relocated from the data processing server 2 with node number 1 to the data processing server 2 with node number 3.
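The 8% and 48 GB figures of evaluations (1) and (2) can be reproduced with a short calculation. The sketch below assumes the target is read as "largest share at most 1.5 times the smallest share" and that the remaining node stays between the source and destination shares after the move, which holds for the FIG. 5 numbers; the function name `share_to_move` is hypothetical.

```python
def share_to_move(src_pct, dst_pct, target=1.5):
    """Smallest share x (in percentage points) that satisfies
    (src_pct - x) <= target * (dst_pct + x)."""
    return max(0.0, (src_pct - target * dst_pct) / (1.0 + target))


# Key1 = B: node 1 holds 50% and node 3 holds 20% (FIG. 5).
share_pct = share_to_move(50, 20)

# Total capacity of the Key1 = B data group is 600 GB (FIG. 6).
amount_gb = 600 * share_pct / 100

# Resulting split across nodes 1, 2, 3 after the move.
after = {1: 50 - share_pct, 2: 30, 3: 20 + share_pct}
```

Solving the inequality for x gives x ≥ (50 − 1.5 × 20) / 2.5, so the smallest move that meets the target is 8 percentage points.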

Because the data group carrying Key1 = B additionally carries either Key2 = C or Key2 = D, the two data groups marked candidate 1 and candidate 2 in FIG. 6 can be extracted as relocation candidates.

(3) <Evaluating the relocation target, source, and destination using the data capacity ratio distribution table and the data capacity distribution table, based on evaluation (2)>
For each relocation candidate data group extracted in evaluation (2), the data capacity ratio distribution table is consulted for the capacity ratios of the other attribute information that the candidate carries, and a candidate is selected whose relocation does not widen the difference between the minimum and maximum data capacity ratios.

If relocation would widen the difference between the minimum and maximum data capacity ratios for every relocation candidate, the data capacity distribution table 107 is used to calculate the actual capacity ratio among the nodes after relocation, and the candidate is selected for which the difference stays below the threshold on the difference between maximum and minimum capacity ratios used in evaluation (1) and for which the difference widens least.

If relocation would not widen the difference between the minimum and maximum data capacity ratios for the relocation candidates, the data capacity distribution table 107 is used to calculate the actual capacity ratio among the nodes after relocation, and the candidate that yields the smallest difference in capacity ratio is selected.

In the example used to describe this procedure, relocating the data group carrying Key1 = B and Key2 = C (candidate 1) from the data processing server 2 with node number 1 to the one with node number 3 would, according to the data capacity ratio distribution table 106 of FIG. 5, widen the difference between the minimum and maximum capacity ratios, so it is dropped from the relocation candidates.

By contrast, relocating the data group carrying Key1 = B and Key2 = D (candidate 2) from the data processing server 2 with node number 1 to the one with node number 3 narrows the difference between the minimum and maximum capacity ratios according to the data capacity ratio distribution table 106 of FIG. 5, so it is chosen as the final relocation target. The above processing thus yields the following: as the extraction condition for the data to relocate, that Key1 = B and Key2 = D are assigned as attribute information; as the relocation data capacity, 48 GB; as the relocation-source data processing server, the server with node number 1; and as the relocation-destination data processing server, the server with node number 3.
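The candidate selection in evaluation (3) can be sketched as follows. All capacities are hypothetical: the per-cell values of FIG. 6 and the Key2 rows of FIG. 5 are not reproduced in the text, so the numbers were chosen only to match the quoted Key1 rows, the 600 GB total for Key1 = B, and the stated outcome (candidate 1 widens the gap, candidate 2 narrows it). The helper names are also assumptions, and the fallback for the case where every candidate widens the gap is simplified to "smallest resulting gap" without the threshold check.

```python
# CAPACITY[(key1_value, key2_value)][node] = GB. Hypothetical figures.
CAPACITY = {
    ("A", "C"): {1: 90, 2: 120, 3: 60},
    ("A", "D"): {1: 90, 2: 120, 3: 120},
    ("B", "C"): {1: 100, 2: 90, 3: 110},   # candidate 1
    ("B", "D"): {1: 200, 2: 90, 3: 10},    # candidate 2
}


def key2_totals(capacity, key2_value):
    """Per-node GB of all data carrying the given Key2 value."""
    totals = {}
    for (_, k2), per_node in capacity.items():
        if k2 == key2_value:
            for node, gb in per_node.items():
                totals[node] = totals.get(node, 0) + gb
    return totals


def spread(per_node):
    """Gap between largest and smallest share, as a fraction of the total."""
    values = per_node.values()
    return (max(values) - min(values)) / sum(values)


def spread_change(capacity, candidate, src, dst, amount_gb):
    """(before, after) spread of the candidate's Key2 group if it is moved."""
    before = key2_totals(capacity, candidate[1])
    after = dict(before)
    after[src] -= amount_gb
    after[dst] += amount_gb
    return spread(before), spread(after)


def choose_candidate(capacity, candidates, src, dst, amount_gb):
    scored = {c: spread_change(capacity, c, src, dst, amount_gb)
              for c in candidates}
    not_widening = {c: s for c, s in scored.items() if s[1] <= s[0]}
    pool = not_widening or scored          # simplified fallback
    return min(pool, key=lambda c: pool[c][1])


chosen = choose_candidate(CAPACITY, [("B", "C"), ("B", "D")],
                          src=1, dst=3, amount_gb=48)
```

With these figures, moving 48 GB of candidate 1 changes the Key2 = C totals from 190/210/170 GB to 142/210/218 GB (wider gap), while moving candidate 2 changes the Key2 = D totals from 290/210/130 GB to 242/210/178 GB (narrower gap).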

On the basis of the following information extracted by evaluations (1) to (3) above:
・the extraction condition for the data to relocate,
・the relocation data capacity,
・the relocation source, and
・the relocation destination,
the data arrangement control unit 104 adds, to the relocation-target data stored in the data arrangement / attribute information storage unit 100, a flag indicating that the data is a relocation target, together with the relocation source and destination information (step 205).

The data arrangement control unit 104 then issues a request to the data arrangement status analysis unit 103 to create the data capacity ratio distribution table 106. Using the data stored in the data arrangement / attribute information storage unit 100, the table is created for the data placement that would result from the relocations planned so far (step 201), and the data arrangement control unit 104 checks whether any other data group has a difference between the minimum and maximum data capacity ratios at or above the threshold (step 202).

If a data group exists whose difference between the minimum and maximum data capacity ratios is at or above the threshold, the operations of steps 203 to 205 are executed again for that data group.

The data capacity distribution table 107 and the data capacity ratio distribution table 106 created in these subsequent operations are based on the post-relocation address information determined by the relocation evaluations of the other data groups.

In the present embodiment, when relocation is evaluated for a plurality of data groups and data selected for relocation during the evaluation of one data group is selected again during the evaluation of a later data group, the data in the data arrangement / attribute information storage unit 100 is updated to the relocation source and destination given by the latest evaluation result.

Accordingly, if data planned for relocation from, say, node 1 to node 3 during the evaluation of one data group is later planned for relocation from node 3 to node 1 during the evaluation of another data group, so that the data ends up planned to stay at its current location, the flag indicating that the data is a relocation target is cleared in the data arrangement / attribute information storage unit 100.
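The bookkeeping described here, keeping only the latest planned destination and clearing the flag when a plan returns the data to where it already is, can be sketched with a dictionary of pending moves. The names `plan_move`, `plan`, and `current_node` and the data IDs are hypothetical.

```python
def plan_move(plan, current_node, data_id, dst):
    """Record the latest planned destination for `data_id`.

    `plan` maps data_id -> planned destination node and plays the role of
    the relocation flag: an entry exists only while a move is pending.
    """
    if dst == current_node[data_id]:
        plan.pop(data_id, None)     # net effect is "stay put": clear the flag
    else:
        plan[data_id] = dst         # latest evaluation result wins


current_node = {"d1": 1, "d2": 1}   # where each item is stored right now
plan = {}

plan_move(plan, current_node, "d1", 3)   # first evaluation: node 1 -> node 3
plan_move(plan, current_node, "d2", 3)
plan_after_first = dict(plan)

plan_move(plan, current_node, "d1", 1)   # later evaluation: node 3 -> node 1
```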

If no data group has a difference between the minimum and maximum data capacity ratios at or above the threshold (that is, no data group remains to be examined for relocation; the No branch of step 202 in FIG. 3), the data arrangement control unit 104 transfers a relocation execution instruction to the rearrangement execution unit 105.
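The overall loop of steps 201 to 205 (recompute the ratio table, look for a data group over the threshold, plan one move, repeat until none remains) can be outlined as below. This is a deliberately simplified sketch: it works on percentages only, treats each data group independently, and omits the cross-attribute check of evaluation (3). The function name, the `trigger` and `target` parameters, and the `max_rounds` guard are assumptions.

```python
def plan_relocations(ratios, trigger=2.0, target=1.5, max_rounds=100):
    """Plan moves until no group has max share > trigger x min share.

    ratios: {group: {node: percent}}. Returns (moves, planned_ratios),
    where each move is (group, src_node, dst_node, percent_moved).
    """
    ratios = {g: dict(r) for g, r in ratios.items()}   # work on a copy
    moves = []
    for _ in range(max_rounds):
        over = [g for g, r in ratios.items()
                if max(r.values()) > trigger * min(r.values())]
        if not over:
            break                     # "No" branch of step 202: go execute
        group = over[0]
        r = ratios[group]
        src = max(r, key=r.get)       # largest share: relocation source
        dst = min(r, key=r.get)       # smallest share: relocation destination
        x = (r[src] - target * r[dst]) / (1.0 + target)
        r[src] -= x
        r[dst] += x
        moves.append((group, src, dst, x))
    return moves, ratios


FIG5_KEY1 = {
    ("Key1", "A"): {1: 30, 2: 40, 3: 30},
    ("Key1", "B"): {1: 50, 2: 30, 3: 20},
}
moves, planned = plan_relocations(FIG5_KEY1)
```

Only after the loop ends would the planned moves be handed to the execution step, which mirrors the way the flags are accumulated first and relocation is executed once at the end.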

The rearrangement execution unit 105 extracts the address information and relocation destination information of all data flagged as relocation targets in the data arrangement / attribute information storage unit 100, and relocates the data on the data processing servers 2 (step 206).

According to the present embodiment, in an environment where
・a plurality of pieces of attribute information are assigned to data,
・a plurality of data processing tasks coexist that determine the data to process on the basis of that attribute information, and
・a plurality of data processing servers divide the data processing and execute it in parallel, each on the data stored in its own disk storage,
the group of data processing servers can execute a plurality of different data processing tasks in parallel without creating, for each data processing task, a replica of the latest data and a dedicated data processing server for parallel processing. That is, the present embodiment requires neither the cost of creating replicas nor extra data processing servers.

In other words, the present embodiment makes it possible to realize on a single platform the platform for parallel data processing and the data storage platform that were conventionally built separately, so a substantial cost reduction can be expected.

Furthermore, according to the present embodiment, even in an environment where the attribute information assigned to data changes dynamically, parallel data processing can always be executed on data in its latest state, in a form that exploits parallelism as far as possible.

According to the present embodiment, in a system that levels the data storage capacity among a plurality of storages, it is possible to suppress the situation in which leveling the data associated with one piece of attribute information instead widens the deviation in data storage capacity among the storages for data associated with another piece of attribute information and thereby degrades overall data processing performance.

In the embodiment shown in FIG. 1, the data processing server 2 extracts, from the data processing instruction received from the task broker 3, the attribute information used to specify the data processing target; sends the data arrangement management device 1 a request for a list of the data that carries that attribute information and is stored in the disk storage 7; and executes the processing specified by the data processing instruction on the data stored in the disk storage 7 that is included in the data list received from the data arrangement management device 1. The present invention is of course not limited to this configuration. For example, the task broker 3 and the data arrangement management device 1 may be interconnected or integrated. In that configuration, the task broker 3 extracts the attribute information used to specify the data processing target and passes it to the data arrangement management device 1, which refers to the data arrangement / attribute information storage unit 100, obtains for each data processing server 2 a list of the data associated with the attribute information that is stored in that server's disk storage 7, and then sends the data processing instruction and the data list to the corresponding data processing server 2.

The present embodiment may also include, as a variation, a configuration in which each data processing server 2 individually stores and manages the storage location (address information), data size, and attribute information of its data, and the data arrangement management device 1 aggregates, as appropriate, the associations among address information, data size, and attribute information held by each data processing server 2, updates the data arrangement/attribute information storage unit 100, and creates the data capacity distribution table, ratio table, and the like. In this case, the data processing server 2 may receive the attribute information created by the annotator 4 and manage it in association with the data.
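As a rough illustration of the aggregation in this variation, the following Python sketch collects per-server records and builds a capacity distribution table and a ratio table. The record layout (`server`, `address`, `size`, `attrs`), the sample values, and the function names are assumptions made for illustration; the embodiment does not prescribe a data format.

```python
from collections import defaultdict


def capacity_table(records):
    """Sum data sizes per (key, value) attribute and per server."""
    table = defaultdict(lambda: defaultdict(int))
    for rec in records:
        for key, value in rec["attrs"].items():
            table[(key, value)][rec["server"]] += rec["size"]
    return table


def ratio_table(table, servers):
    """Convert capacities into per-server ratios for each attribute."""
    ratios = {}
    for attr, per_server in table.items():
        total = sum(per_server.values())
        ratios[attr] = {
            s: (per_server.get(s, 0) / total if total else 0.0) for s in servers
        }
    return ratios


# Hypothetical records as they might be reported by two data processing servers.
records = [
    {"server": "node1", "address": "/d/a", "size": 60,
     "attrs": {"type": "log", "year": "2006"}},
    {"server": "node2", "address": "/d/b", "size": 40,
     "attrs": {"type": "log", "year": "2007"}},
    {"server": "node2", "address": "/d/c", "size": 100,
     "attrs": {"type": "image", "year": "2007"}},
]
table = capacity_table(records)
ratios = ratio_table(table, ["node1", "node2"])
print(ratios[("type", "log")])
```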

Data relocation may be performed in a batch at regular intervals (for example, at midnight every day), or may be performed in a distributed (partial) manner during periods of low access frequency, based on data access monitoring results or the like. In that case, relocation may be performed preferentially starting from the storages with low access frequency.
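The scheduling options above might be sketched as follows. The access counters, the threshold `limit`, and both function names are hypothetical; how access is actually monitored is not specified in the embodiment.

```python
def quiet_hours(hourly_access, limit):
    """Hours of the day whose monitored access count is at or below limit."""
    return [hour for hour, count in enumerate(hourly_access) if count <= limit]


def relocation_order(access_by_storage):
    """Storage names ordered from least to most frequently accessed."""
    return sorted(access_by_storage, key=lambda name: access_by_storage[name])


# Hypothetical monitoring results.
print(quiet_hours([5, 0, 1, 40], limit=1))
print(relocation_order({"storage1": 90, "storage2": 3, "storage3": 12}))
```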

In the present embodiment, data processing tasks perform processing such as data search, data reading, and computation on the read data, while data writing (updating) is performed by the data creation client 5. In the configuration shown in FIG. 1, the data processing task client 8 and the data creation client 5 are separate, but they may be a single device. In this case, the annotator 4 creates attribute information when it determines that the content of the data processing is data creation processing.

In the present embodiment, the method by which the data arrangement control unit 104 evaluates data relocation is not limited to the method described above; any other optimization method may be applied to optimize the distribution, across a plurality of data storage devices, of a plurality of data with respect to a plurality of attribute information (that is, to bring the distribution closer to one suited to data processing). For example, suppose the first and second attribute information are taken as the X and Y orthogonal coordinate axes, the node numbers of the data processing servers (discrete values 1, 2, 3, ...) are expressed as the coordinate values on each axis, and the data capacity (or ratio) at a position on the X-Y plane is plotted on the Z axis in a three-dimensional representation. Optimizing the data arrangement can then be regarded as a process of evening out (flattening) the unevenness of the Z values at the discrete grid points of the (X, Y) plane. In general, n-dimensional attribute information and its distribution form an (n+1)-dimensional model. When the number of attribute information items or the number of data processing servers grows, mathematical programming or the like is used to solve this kind of problem. In a relocation plan, the relocation source and the relocation destination are not each limited to a single node: data may be relocated from one source to a plurality of destinations, or from a plurality of sources to one destination, and the data to be relocated may be divided and distributed with weighting.
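A minimal sketch of the flattening view and of a weighted one-to-many relocation follows. It looks at the capacities of a single attribute across nodes, uses the sum of squared deviations from the mean as an assumed unevenness measure, and splits the surplus of one source among under-filled nodes in proportion to their deficits. This is a simple heuristic chosen for illustration, not the mathematical programming formulation mentioned above, and all names and numbers are hypothetical.

```python
def unevenness(capacity):
    """Sum of squared deviations from the mean capacity (0 when perfectly flat)."""
    mean = sum(capacity.values()) / len(capacity)
    return sum((c - mean) ** 2 for c in capacity.values())


def split_surplus(capacity, source):
    """Plan moves from source to nodes below the mean, weighted by each deficit."""
    mean = sum(capacity.values()) / len(capacity)
    surplus = capacity[source] - mean
    deficits = {n: mean - c for n, c in capacity.items() if c < mean}
    total_deficit = sum(deficits.values())
    if surplus <= 0 or total_deficit == 0:
        return {}
    return {n: surplus * d / total_deficit for n, d in deficits.items()}


def apply_plan(capacity, source, plan):
    """Return the capacities that would result from executing the plan."""
    result = dict(capacity)
    for dest, amount in plan.items():
        result[source] -= amount
        result[dest] += amount
    return result


before = {"node1": 90, "node2": 20, "node3": 40}  # hypothetical capacities
plan = split_surplus(before, "node1")
after = apply_plan(before, "node1", plan)
print(plan, unevenness(before), unevenness(after))
```

With several attributes, the same measure could be evaluated for each attribute's capacities and a candidate plan accepted only if none of them becomes more uneven.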

The present invention has been described above with reference to the foregoing embodiment, but the present invention is not limited to the configuration of that embodiment and naturally includes the various variations and modifications that a person skilled in the art could make within the scope of the present invention.

A diagram showing an example of the system configuration of one embodiment of the present invention.
A diagram showing an example of the configuration of the data arrangement management device of one embodiment of the present invention.
A flowchart for explaining the data relocation procedure of one embodiment of the present invention.
A diagram showing an example of the attribute information assigned to data in one embodiment of the present invention.
A diagram showing an example of a data capacity ratio distribution table classified from the viewpoint of a single piece of attribute information in one embodiment of the present invention.
An example of a data capacity distribution table classified from the viewpoint of a plurality of attribute information in one embodiment of the present invention.

Explanation of symbols

1 Data arrangement management device
2 Data processing server
3 Task broker
4 Annotator
5 Data creation client
6 Network
7 Disk storage
8 Data processing task client
100 Data arrangement/attribute information storage unit
101 Data management unit
102 Task analysis unit
103 Data arrangement status analysis unit
104 Data arrangement control unit
105 Relocation execution unit
106 Data capacity ratio distribution table
107 Data capacity distribution table

Claims

A data arrangement management device comprising:
information storage means for storing and managing, for a plurality of data each assigned a plurality of attribute information and for a plurality of data storage devices, address information indicating the storage location of the data, size information of the data, and the attribute information assigned to the data, in association with one another;
means for referring to the information storage means and obtaining the distribution of data capacity, across the plurality of data storage devices, of the data to which the attribute information is assigned;
means for deriving, when the distribution of data capacity obtained for at least one of the plurality of attribute information is determined to meet a predetermined data relocation condition, a relocation plan such that, once the data is relocated, not only does the distribution of data capacity across the plurality of data storage devices for the one attribute information satisfy a predetermined condition on equalization, but the distribution of data capacity across the plurality of data storage devices for at least one other of the plurality of attribute information also satisfies a predetermined condition on equalization; and
means for relocating data among the plurality of data storage devices in accordance with the derived relocation plan.

A data arrangement management device comprising:
information storage means for storing and managing, for a plurality of data each assigned a plurality of attribute information and for a plurality of data storage devices, the attribute information assigned to the data, address information indicating the location where the data is stored, and size information of the data, in association with one another;
means for referring to the information storage means and calculating, from attribute information designated as a data processing target, the ratio distribution of data capacity across the plurality of data storage devices of the data associated with that attribute information; and
means for relocating data, based on the latest state of the data capacity ratio distribution across the plurality of data storage devices, so that, for the plurality of attribute information, the distribution across the plurality of data storage devices of the data capacity of the data associated with each attribute information approaches a condition for equalization.

The data arrangement management device according to claim 1 or 2, further comprising a plurality of data processing devices corresponding respectively to the plurality of data storage devices, wherein data processing related to attribute information designated as a data processing target, which each data processing device performs by individually accessing its corresponding data storage device, is executed in parallel across the plurality of data processing devices.

A data arrangement management device for managing the data arrangement of a plurality of data processing servers, comprising:
a data arrangement/attribute information storage unit that stores, in association with one another, address information of the data stored in the data storage device of each of the plurality of data processing servers, attribute information associated with the data, and the data size;
a data management unit that receives requests to create new data and requests to add or change attribute information, and updates the information stored in the data arrangement/attribute information storage unit;
a task analysis unit that analyzes the content of a data processing task;
a data arrangement status analysis unit that refers to the data arrangement/attribute information storage unit and obtains information on how the data associated with a plurality of attribute information designated as data processing targets is distributed across the plurality of data processing servers;
a data arrangement control unit that uses the data distribution status to determine the data to be relocated and its placement location so that, for the plurality of designated attribute information, the distribution across the plurality of data storage devices of the data capacity of the data associated with each attribute information approaches a condition for equalization; and
a relocation execution unit that relocates the data of the data processing servers based on the relocation target data and the placement location determined by the data arrangement control unit.

The data arrangement management device according to claim 4, wherein the data arrangement status analysis unit creates, as the data distribution status, data capacity ratio distribution information across the plurality of data processing servers for the keys and values that constitute the attribute information of the data.

The data arrangement management device according to claim 4, wherein the data arrangement control unit uses a threshold on the difference in data capacity ratio between the data processing servers and determines that data of attribute information exceeding the threshold is a data relocation target.

The data arrangement management device, wherein the data arrangement status analysis unit creates a data capacity distribution table, which is a list of data processing servers and data capacities for all combinations of the assigned attribute information.

The data arrangement management device according to claim 5, wherein the data arrangement control unit determines the capacity ratio of the data group to be relocated so that the data capacity ratio achieved after data relocation satisfies a predetermined threshold, and, when the data capacity ratio in the distribution obtained by the data arrangement status analysis unit satisfies a predetermined condition, sets the data processing server with the largest ratio as the relocation source and the data processing server with the smallest ratio as the relocation destination.

The data arrangement management device according to claim 7, wherein the data arrangement control unit:
determines the capacity ratio of the data group to be relocated so that the data capacity ratio achieved after data relocation satisfies a predetermined threshold;
calculates, from the capacity data in the data capacity distribution table, the total capacity of the data to which the attribute information targeted for relocation is assigned;
calculates, from the capacity ratio of the data group to be relocated, the data capacity that must actually be relocated;
when the capacity ratio in the distribution of data capacity ratios obtained by the data arrangement status analysis unit satisfies a predetermined condition, sets the data processing server with the largest ratio as the relocation source and the data processing server with the smallest ratio as the relocation destination; and
extracts the data groups that match the relocation target, the relocation source, and the relocation destination.

The data arrangement management device according to claim 9, wherein the data arrangement control unit:
for the relocation candidate data groups extracted by evaluating the relocation target, the relocation source, and the relocation destination, refers, using the data capacity ratio distribution table, to the capacity ratios for other related attribute information, and selects relocation candidates for which executing the data relocation does not widen the difference between the minimum and maximum data capacity ratios;
for relocation candidates for which the difference between the minimum and maximum data capacity ratios widens, calculates the actual capacity ratio across the nodes after data relocation using the data capacity distribution table, and selects the candidate for which the difference between the maximum and minimum of the calculated capacity ratios does not reach or exceed a predetermined threshold and widens the least; and
for relocation candidates for which the difference between the minimum and maximum data capacity ratios does not widen, calculates the actual capacity ratio across the nodes after data relocation using the data capacity distribution table, and selects the candidate with the smallest difference in capacity ratio.

A distributed data processing system comprising:
a task broker that forwards a data processing task to data processing servers via a network, integrates the results of the data processing executed on the data processing servers, and completes the task;
a data processing server that, based on the data processing task from the task broker, executes the data processing task using stored data and returns the execution result to the task broker;
means for managing the attribute information assigned to data, the address information of the data storage device storing the data, and the data size;
means for obtaining the attribute information designated as the data processing target in the data processing task;
the data arrangement management device according to any one of claims 4 to 10;
annotator means for generating attribute information for the data stored in the storage of the data processing server and updating the attribute information data of the data arrangement management device; and
a data creation client that, when storing newly created data in the data processing server, sends a new data creation request and the data size to the data management unit of the data arrangement management device, obtains the address information of the data placement destination, and stores the data at the designated location in the storage of the data processing server.

A program that causes a computer to execute:
a process of storing in a storage unit, in association with one another, for a plurality of data each assigned a plurality of attribute information and for a plurality of data storage devices, address information indicating the storage location of the data, size information of the data, and the attribute information assigned to the data;
a process of referring to the storage unit and obtaining the distribution of data capacity, across the plurality of data storage devices, of the data to which the attribute information is assigned;
a process of deriving, when the distribution of data capacity obtained for at least one of the plurality of attribute information is determined to meet a predetermined data relocation condition, a relocation plan such that, once the data is relocated, not only does the distribution of data capacity across the plurality of data storage devices for the one attribute information satisfy a predetermined condition on equalization, but the distribution of data capacity across the plurality of data storage devices for at least one other of the plurality of attribute information also satisfies a predetermined condition on equalization; and
a process of relocating data among the plurality of data storage devices in accordance with the derived relocation plan.

A program that causes a computer that manages the data arrangement of a plurality of data processing servers to execute:
a process of storing in a storage unit, in association with one another, address information of the data stored in the data storage device of each of the plurality of data processing servers, attribute information associated with the data, and the data size;
a data management process of receiving requests to create new data and requests to add or change attribute information, and updating the information stored in the data arrangement/attribute information storage unit;
a task analysis process of analyzing the content of a data processing task;
a data arrangement status analysis process of referring to the storage unit and obtaining information on how the data associated with a plurality of attribute information designated as data processing targets is distributed across the plurality of data processing servers;
a data arrangement control process of using the data distribution status to determine the data to be relocated and its placement location so that, for the plurality of designated attribute information, the distribution across the plurality of data storage devices of the data capacity of the data associated with each attribute information approaches a condition for equalization; and
a relocation execution process of relocating the data of the data processing servers based on the relocation target data and the placement location determined in the data arrangement control process.

The program according to claim 13, wherein the data arrangement status analysis process creates, as the data distribution status, data capacity ratio distribution information across the plurality of data processing servers for the keys and values that constitute the attribute information of the data.

The program according to claim 13, wherein the data arrangement control process uses a threshold on the difference in data capacity ratio between the data processing servers and determines that data of attribute information exceeding the threshold is a data relocation target.

The program according to claim 14, wherein the data arrangement status analysis process creates a data capacity distribution table, which is a list of data processing servers and data capacities for all combinations of the assigned attribute information.

The program according to claim 14, wherein the data arrangement control process determines the capacity ratio of the data group to be relocated so that the data capacity ratio achieved after data relocation satisfies a predetermined threshold, and, when the data capacity ratio in the distribution obtained by the data arrangement status analysis process satisfies a predetermined condition, sets the data processing server with the largest ratio as the relocation source and the data processing server with the smallest ratio as the relocation destination.

The program according to claim 16, wherein the data arrangement control process:
determines the capacity ratio of the data group to be relocated so that the data capacity ratio achieved after data relocation satisfies a predetermined threshold;
calculates, from the capacity data in the data capacity distribution table, the total capacity of the data to which the attribute information targeted for relocation is assigned;
calculates, from the capacity ratio of the data group to be relocated, the data capacity that must actually be relocated;
when the capacity ratio in the distribution of data capacity ratios obtained by the data arrangement status analysis process satisfies a predetermined condition, sets the data processing server with the largest ratio as the relocation source and the data processing server with the smallest ratio as the relocation destination; and
extracts the data groups that match the relocation target, the relocation source, and the relocation destination.

The program according to claim 18, wherein the data arrangement control process:
for the relocation candidate data groups extracted by evaluating the relocation target, the relocation source, and the relocation destination, refers, using the data capacity ratio distribution table, to the capacity ratios for other related attribute information, and selects relocation candidates for which executing the data relocation does not widen the difference between the minimum and maximum data capacity ratios;
for relocation candidates for which the difference between the minimum and maximum data capacity ratios widens, calculates the actual capacity ratio across the nodes after data relocation using the data capacity distribution table, and selects the candidate for which the difference between the maximum and minimum of the calculated capacity ratios does not reach or exceed a predetermined threshold and widens the least; and
for relocation candidates for which the difference between the minimum and maximum data capacity ratios does not widen, calculates the actual capacity ratio across the nodes after data relocation using the data capacity distribution table, and selects the candidate with the smallest difference in capacity ratio.

A data arrangement management method comprising:
a step of storing in a storage unit, in association with one another, for a plurality of data each assigned a plurality of attribute information and for a plurality of data storage devices, address information indicating the storage location of the data, size information of the data, and the attribute information assigned to the data;
a step of referring to the storage unit and obtaining the distribution of data capacity, across the plurality of data storage devices, of the data to which the attribute information is assigned;
a step of deriving, when the distribution of data capacity obtained for at least one of the plurality of attribute information is determined to meet a predetermined data relocation condition, a relocation plan such that, once the data is relocated, not only does the distribution of data capacity across the plurality of data storage devices for the one attribute information satisfy a predetermined condition on equalization, but the distribution of data capacity across the plurality of data storage devices for at least one other of the plurality of attribute information also satisfies a predetermined condition on equalization; and
a step of relocating data among the plurality of data storage devices in accordance with the derived relocation plan.

A data arrangement management system comprising:
means for deriving, for data each having a plurality of attribute information, the optimal data arrangement attainable across a plurality of data storage devices that store the data, such that the data capacity distributions across the data storage devices for each of the plurality of attribute information approach a predetermined distribution state; and
means for relocating data in accordance with the derived data arrangement.

The data arrangement management system according to claim 21, further comprising a plurality of data processing devices corresponding respectively to the plurality of data storage devices, wherein data processing related to attribute information designated as a data processing target, which each data processing device performs by individually accessing its corresponding data storage device, is executed in parallel across the plurality of data processing devices.