JP2012181605A

JP2012181605A - Data analysis support device and program

Info

Publication number: JP2012181605A
Application number: JP2011042687A
Authority: JP
Inventors: Seiji Egawa; 誠二江川; Rumi Hayakawa; ルミ早川
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2011-02-28
Filing date: 2011-02-28
Publication date: 2012-09-20
Anticipated expiration: 2031-02-28
Also published as: JP5526057B2

Abstract

[Problem] To provide a data analysis support device and program capable of appropriately associating attributes between arbitrary data tables.
[Solution] A first word extraction means extracts first words from the attribute values of a first character string type attribute constituting a first data table. A second word extraction means extracts second words from the attribute values of a second character string type attribute constituting a second data table. A similarity calculation means calculates the similarity between the first character string type attribute and the second character string type attribute based on the first words extracted by the first word extraction means and the second words extracted by the second word extraction means. A similar attribute candidate extraction means extracts the first character string type attribute and the second character string type attribute as similar attribute candidates based on the similarity calculated by the similarity calculation means.
[Selected drawing] FIG. 2

Description

Embodiments of the present invention relate to a data analysis support device and program for associating attributes that represent the same content across a plurality of data tables.

For example, to analyze operational errors at a plurality of banks, it is necessary to refer to the separate data tables in which each bank accumulates data (information) on such errors, and to compare the attributes that represent the same content across those data tables.

However, these data tables are generally defined differently at each bank; for example, attribute names or attribute values are often written differently.

It is therefore difficult to associate attributes that represent the same content across a plurality of differently defined data tables.

In this connection, a technique (hereinafter, the first technique) is known that extracts attributes common to two data tables sharing a primary key: rows with the same primary key value are compared between the two tables, and a match rate is calculated for each attribute other than the primary key.
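The match-rate idea behind the first technique can be sketched as follows. This is only an illustration under stated assumptions: the function name `match_rate`, the dictionary-based table layout, and the sample rows are hypothetical and are not taken from the cited publication.

```python
from typing import Dict, Hashable

Row = Dict[str, str]

def match_rate(table_a: Dict[Hashable, Row], table_b: Dict[Hashable, Row],
               attr_a: str, attr_b: str) -> float:
    """Fraction of rows with a shared primary key whose two attribute values are equal."""
    shared_keys = table_a.keys() & table_b.keys()
    if not shared_keys:
        return 0.0  # no common primary key values: the comparison cannot be made
    matches = sum(1 for key in shared_keys
                  if table_a[key][attr_a] == table_b[key][attr_b])
    return matches / len(shared_keys)

# Hypothetical tables, each keyed by a shared primary key (an incident ID).
table_a = {1: {"cause": "lack of experience"}, 2: {"cause": "careless mistake"}}
table_b = {1: {"reason": "lack of experience"},
           2: {"reason": "complex work"},
           3: {"reason": "accident"}}
rate = match_rate(table_a, table_b, "cause", "reason")
```

The sketch makes the limitation visible: when the two tables have no primary key values in common, there is nothing to compare.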

A technique (hereinafter, the second technique) is also known that extracts corresponding attributes between data tables by calculating the similarity of attribute features, such as the attribute name, the distribution of attribute values, the distribution of graphemes in the attribute values, and the distribution of attribute value string lengths.

Japanese Patent Laid-Open No. 2004-86782
Japanese Patent Laid-Open No. 2003-271656

However, the first technique requires, for example, that the two data tables share a primary key. Unless the primary keys are common, the first technique cannot extract similar attributes between arbitrary data tables.

The second technique, by contrast, uses only the features of the attributes being compared and can therefore extract similar attributes between arbitrary data tables. However, for a character string type attribute (that is, an attribute whose attribute values contain character strings), the second technique does not consider the meaning of the attribute values, so it may fail to associate attributes appropriately.

The problem to be solved by the present invention is therefore to provide a data analysis support device and program capable of appropriately associating attributes between arbitrary data tables.

The data analysis support device according to the embodiment comprises a data table storage means, a first word extraction means, a second word extraction means, a similarity calculation means, and a similar attribute candidate extraction means.

The data table storage means stores in advance a first data table composed of first attributes, including a first character string type attribute whose attribute values contain character strings, and a second data table composed of second attributes, including a second character string type attribute whose attribute values contain character strings.

The first word extraction means extracts first words constituting the character strings contained in the attribute values of the first character string type attribute included in the first attributes constituting the first data table stored in the data table storage means.

The second word extraction means extracts second words constituting the character strings contained in the attribute values of the second character string type attribute included in the second attributes constituting the second data table stored in the data table storage means.

The similarity calculation means calculates, based on the first words extracted by the first word extraction means and the second words extracted by the second word extraction means, the similarity between the first character string type attribute included in the first attributes constituting the first data table and the second character string type attribute included in the second attributes constituting the second data table.

The similar attribute candidate extraction means extracts, based on the calculated similarity, the first character string type attribute included in the first attributes constituting the first data table and the second character string type attribute included in the second attributes constituting the second data table as similar attribute candidates.

A block diagram showing the hardware configuration of the data analysis support device according to the first embodiment.
A block diagram mainly showing the functional configuration of the data analysis support device 30 shown in FIG. 1.
A block diagram showing the functional configuration of the numeric type attribute processing unit 32 shown in FIG. 2.
A block diagram showing the functional configuration of the character string type attribute processing unit 33 shown in FIG. 2.
A diagram showing an example of the data structure of the bank A data table stored in the data table storage unit 22.
A diagram showing an example of the data structure of the bank B data table stored in the data table storage unit 22.
A diagram showing an example of the data structure of the bank C data table stored in the data table storage unit 22.
A flowchart showing the processing procedure of the data analysis support device 30 according to the present embodiment.
A flowchart showing the procedure of the numeric type attribute similarity calculation process included in the similarity calculation process.
A diagram for specifically explaining the similarity between first and second numeric type attributes calculated by the numeric type attribute similarity calculation unit 323.
A diagram for specifically explaining the similarity between first and second numeric type attributes calculated by the numeric type attribute similarity calculation unit 323.
A diagram showing an example of the similarity list created in the numeric type attribute similarity calculation process.
A flowchart showing the procedure of the character string type attribute similarity calculation process included in the similarity calculation process.
A diagram for specifically explaining the word set of a target character string type attribute created by the attribute word extraction unit 331.
A diagram for specifically explaining the similarity between first and second character string type attributes calculated by the character string type attribute similarity calculation unit 333.
A diagram showing an example of the similarity list created in the character string type attribute similarity calculation process.
A flowchart showing the procedure of the similar attribute candidate extraction process.
A diagram for specifically explaining the similarity, calculated by the similar attribute candidate extraction unit 35, between the attribute name of a target attribute and the attribute name of a corresponding attribute.
A diagram for specifically explaining the similarity, calculated by the similar attribute candidate extraction unit 35, between the attribute name of a target attribute and the attribute name of a corresponding attribute.
A diagram showing an example of the data structure of the similar attribute candidate storage unit 27.
A block diagram showing the functional configuration of the character string type attribute processing unit 33 included in the data analysis support device 30 according to the second embodiment.
A diagram for specifically explaining the similarity between first and second character string type attributes calculated by the character string type attribute similarity calculation unit 335.

Hereinafter, each embodiment will be described with reference to the drawings.

(First embodiment)
FIG. 1 is a block diagram showing the hardware configuration of the data analysis support device according to the first embodiment. As shown in FIG. 1, a computer 10 is connected to an external storage device 20 such as a hard disk drive (HDD). The external storage device 20 stores a program 21 executed by the computer 10. The computer 10 and the external storage device 20 constitute a data analysis support device 30.

The data analysis support device 30 is used, for example when analyzing data, to associate attributes across a plurality of different data tables (data tables that are defined differently).

FIG. 2 is a block diagram mainly showing the functional configuration of the data analysis support device 30 shown in FIG. 1.

As shown in FIG. 2, the data analysis support device 30 includes an attribute type classification unit 31, a numeric type attribute processing unit 32, a character string type attribute processing unit 33, a threshold input unit 34, and a similar attribute candidate extraction unit 35. In the present embodiment, these units 31 to 35 are assumed to be realized by the computer 10 shown in FIG. 1 executing the program 21 stored in the external storage device 20. The program 21 can be stored in advance on a computer-readable storage medium and distributed. The program 21 may also be downloaded to the computer 10 via a network, for example.

The data analysis support device 30 also includes a data table storage unit 22, a numeric type attribute storage unit 23, a character string type attribute storage unit 24, a numeric type attribute similarity storage unit 25, a character string type attribute similarity storage unit 26, and a similar attribute candidate storage unit 27. In the present embodiment, these units 22 to 27 are stored in, for example, the external storage device 20.

The data table storage unit 22 stores a plurality of different data tables (first and second data tables) to be analyzed. Each of the data tables stored in the data table storage unit 22 is composed of attributes. The attributes constituting a data table include, for example, numeric type attributes and character string type attributes. A numeric type attribute is an attribute whose attribute values contain numeric values (that is, it can take numeric values as attribute values). A character string type attribute is an attribute whose attribute values contain character strings (that is, it can take character strings as attribute values). The character strings contained in the attribute values of a character string type attribute are composed of, for example, words.

The attribute type classification unit 31 classifies each attribute constituting each of the data tables stored in the data table storage unit 22 as either a numeric type attribute or a character string type attribute.
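One way the classification step might look is sketched below. The text does not state how the attribute type classification unit 31 decides the type, so the rule used here (an attribute is numeric when every non-empty value parses as a number) is an assumption, and the names `is_numeric`, `classify_attributes`, and the sample table are hypothetical. Dates, which the embodiment also treats as numeric, would need additional parsing that this sketch omits.

```python
from typing import Dict, List

def is_numeric(value: str) -> bool:
    """Return True if the value parses as a number (commas allowed as separators)."""
    try:
        float(value.replace(",", ""))
        return True
    except ValueError:
        return False

def classify_attributes(table: Dict[str, List[str]]) -> Dict[str, str]:
    """Label each attribute 'numeric' if all non-empty values are numbers, else 'string'."""
    result = {}
    for name, values in table.items():
        non_empty = [v for v in values if v.strip()]
        all_numbers = bool(non_empty) and all(is_numeric(v) for v in non_empty)
        result[name] = "numeric" if all_numbers else "string"
    return result

# Hypothetical table: attribute name -> column of raw attribute values.
sample_table = {
    "loss_amount_yen": ["94500", "300000", "1500000", "0"],
    "cause": ["lack of experience", "accident caused by a third party"],
}
types = classify_attributes(sample_table)
```

The two resulting groups correspond to what would be written to the numeric type attribute storage unit 23 and the character string type attribute storage unit 24.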

The numeric type attribute storage unit 23 and the character string type attribute storage unit 24 store the classification results produced by the attribute type classification unit 31. Specifically, for each data table stored in the data table storage unit 22, the numeric type attribute storage unit 23 stores the numeric type attributes (attribute names and attribute values) among the attributes constituting that data table. Likewise, for each data table stored in the data table storage unit 22, the character string type attribute storage unit 24 stores the character string type attributes (attribute names and attribute values) among the attributes constituting that data table.

The numeric type attribute processing unit 32 calculates the similarity between two numeric type attributes (first and second numeric type attributes) that belong to different data tables and are stored in the numeric type attribute storage unit 23, based on the attribute values (the numeric values they contain) of the two attributes. The numeric type attribute processing unit 32 calculates the similarity for every combination of two numeric type attributes belonging to different data tables stored in the numeric type attribute storage unit 23.

The numeric type attribute similarity storage unit 25 stores the similarity calculated by the numeric type attribute processing unit 32 for each combination of two numeric type attributes belonging to different data tables.

The character string type attribute processing unit 33 calculates the similarity between two character string type attributes (first and second character string type attributes) that belong to different data tables and are stored in the character string type attribute storage unit 24, based on the attribute values (the character strings they contain) of the two attributes. The character string type attribute processing unit 33 calculates the similarity for every combination of two character string type attributes belonging to different data tables stored in the character string type attribute storage unit 24.

The character string type attribute similarity storage unit 26 stores the similarity calculated by the character string type attribute processing unit 33 for each combination of two character string type attributes belonging to different data tables.

The threshold input unit 34 inputs the threshold used when extracting candidates for attributes that represent the same content across different data tables (hereinafter, similar attribute candidates). The threshold input by the threshold input unit 34 is specified, for example, by the user.

The similar attribute candidate extraction unit 35 extracts two numeric type attributes as similar attribute candidates based on the similarity stored in the numeric type attribute similarity storage unit 25 for each combination of two numeric type attributes belonging to different data tables and on the threshold input by the threshold input unit 34. Similarly, the similar attribute candidate extraction unit 35 extracts two character string type attributes as similar attribute candidates based on the similarity stored in the character string type attribute similarity storage unit 26 for each combination of two character string type attributes belonging to different data tables and on the threshold input by the threshold input unit 34.
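The threshold-based extraction described above can be sketched roughly as follows. The passage says only that candidates are extracted based on the similarity and the threshold, so the comparison used here (similarity greater than or equal to the threshold) and the descending sort are assumptions; `extract_similar_candidates` and the sample similarity values are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (attribute in the first table, attribute in the second table)

def extract_similar_candidates(similarities: Dict[Pair, float],
                               threshold: float) -> List[Tuple[str, str, float]]:
    """Keep attribute pairs whose similarity reaches the threshold, best first."""
    hits = [(a, b, s) for (a, b), s in similarities.items() if s >= threshold]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

# Hypothetical similarities for pairs of attributes from two different tables.
similarities = {
    ("A.cause", "B.cause"): 0.5,
    ("A.cause", "B.summary"): 0.1,
    ("A.position", "B.position"): 0.8,
}
candidates = extract_similar_candidates(similarities, threshold=0.4)
```

A lower threshold admits more candidate pairs for the user to review, while a higher one keeps only the strongest matches.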

The similar attribute candidate storage unit 27 stores the similar attribute candidates extracted by the similar attribute candidate extraction unit 35.

FIG. 3 is a block diagram showing the functional configuration of the numeric type attribute processing unit 32 shown in FIG. 2. As shown in FIG. 3, the numeric type attribute processing unit 32 includes an attribute value numeric range specifying unit 321, an attribute value numeric range storage unit 322, and a numeric type attribute similarity calculation unit 323.

The attribute value numeric range specifying unit 321 specifies the range of the numeric values contained in the attribute values of a numeric type attribute stored in the numeric type attribute storage unit 23 (hereinafter, the numeric range of the attribute values). The attribute value numeric range specifying unit 321 specifies the numeric range of the attribute values for every numeric type attribute stored in the numeric type attribute storage unit 23.

The attribute value numeric range storage unit 322 stores the numeric range of the attribute values of each numeric type attribute specified by the attribute value numeric range specifying unit 321.

The numeric type attribute similarity calculation unit 323 calculates the similarity between two numeric type attributes belonging to different data tables, based on the numeric ranges of the attribute values of the numeric type attributes stored in the attribute value numeric range storage unit 322. Specifically, the numeric type attribute similarity calculation unit 323 calculates the similarity between the two numeric type attributes based on the extent to which the numeric ranges of their attribute values overlap. As described above, the numeric type attribute similarity calculation unit 323 calculates the similarity for every combination of two numeric type attributes belonging to different data tables. The similarity calculated in this way is stored in the numeric type attribute similarity storage unit 25.
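A minimal sketch of a range-overlap similarity follows. The passage states only that the similarity is based on the overlapping part of the two numeric ranges; the specific ratio used here (overlap length divided by the length of the combined span) is an assumption chosen for illustration, and `value_range`, `range_overlap_similarity`, and the sample values are hypothetical.

```python
from typing import Sequence, Tuple

Range = Tuple[float, float]

def value_range(values: Sequence[float]) -> Range:
    """Numeric range (minimum, maximum) of an attribute's values."""
    return (min(values), max(values))

def range_overlap_similarity(a: Range, b: Range) -> float:
    """Overlap length divided by the combined span of both ranges (assumed formula)."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    span = max(a[1], b[1]) - min(a[0], b[0])
    if span == 0:
        return 1.0  # both ranges are the same single point
    return overlap / span

# Hypothetical loss columns from two banks, already in the same unit.
range_a = value_range([94500, 300000, 1500000, 0])
range_b = value_range([0, 400000, 0, 0])
similarity = range_overlap_similarity(range_a, range_b)
```

Because only the ranges are compared, columns recorded in different units (for example yen versus thousands of yen) would need to be normalized first for the overlap to be meaningful; that normalization is outside this sketch.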

FIG. 4 is a block diagram showing the functional configuration of the character string type attribute processing unit 33 shown in FIG. 2. As shown in FIG. 4, the character string type attribute processing unit 33 includes an attribute value word extraction unit 331, an attribute value word set storage unit 332, and a character string type attribute similarity calculation unit 333.

The attribute value word extraction unit 331 performs morphological analysis on the character strings contained in the attribute values of a character string type attribute stored in the character string type attribute storage unit 24. Based on the result of the morphological analysis, the attribute value word extraction unit 331 extracts the words constituting those character strings. The attribute value word extraction unit 331 thereby creates the set of words constituting the character strings contained in the attribute values of the character string type attribute (hereinafter, simply the word set of the attribute values). The attribute value word extraction unit 331 creates a word set of attribute values for every character string type attribute stored in the character string type attribute storage unit 24.
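The construction of a word set can be sketched as below. The embodiment uses morphological analysis, which for Japanese text requires a dedicated analyzer; this sketch substitutes a simple regular-expression tokenizer on English strings as a stand-in, so it illustrates only how a word set is collected per attribute. The function name `extract_word_set` and the sample values are hypothetical.

```python
import re
from typing import Iterable, Set

def extract_word_set(attribute_values: Iterable[str]) -> Set[str]:
    """Collect the distinct words appearing in all values of one attribute.

    A regex tokenizer stands in for the morphological analyzer of the embodiment.
    """
    words: Set[str] = set()
    for value in attribute_values:
        words.update(re.findall(r"\w+", value.lower()))
    return words

# Hypothetical attribute values of one character string type attribute.
cause_values = ["lack of experience", "lack of instruction / education"]
word_set = extract_word_set(cause_values)
```

Using a set means each word counts once per attribute regardless of how many rows contain it, which matches the notion of a word set but is a simplification of whatever weighting a real implementation might apply.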

The attribute value word set storage unit 332 stores the word set of the attribute values of each character string type attribute created by the attribute value word extraction unit 331.

The character string type attribute similarity calculation unit 333 calculates the similarity between two character string type attributes belonging to different data tables, based on the word sets of the attribute values of the character string type attributes stored in the attribute value word set storage unit 332. Specifically, the character string type attribute similarity calculation unit 333 calculates the similarity between the two character string type attributes based on the number of words that match between the word sets of their attribute values. As described above, the character string type attribute similarity calculation unit 333 calculates the similarity for every combination of two character string type attributes belonging to different data tables. The similarity calculated in this way is stored in the character string type attribute similarity storage unit 26.
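A similarity based on the number of matching words can be sketched as follows. The passage does not give the formula at this point, so the Jaccard coefficient (matching words divided by all distinct words in either set) is used here as an assumed normalization; `word_set_similarity` and the sample word sets are hypothetical.

```python
from typing import Set

def word_set_similarity(words_a: Set[str], words_b: Set[str]) -> float:
    """Jaccard coefficient of two word sets (assumed normalization of the match count)."""
    if not words_a and not words_b:
        return 0.0  # nothing to compare
    matching = words_a & words_b
    return len(matching) / len(words_a | words_b)

# Hypothetical word sets for a "cause" attribute in two banks' tables.
bank_a_words = {"experience", "instruction", "education", "accident"}
bank_b_words = {"knowledge", "experience", "education", "accident", "customer"}
similarity = word_set_similarity(bank_a_words, bank_b_words)  # 3 matches / 6 distinct words
```

Comparing vocabularies rather than whole values lets two attributes match even when individual values are phrased differently, which is the gap in the second technique that the embodiment targets.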

The plurality of different data tables stored in the data table storage unit 22 shown in FIG. 2 will now be described with reference to FIGS. 5 to 7.

The present embodiment assumes, for example, support for analyzing (data on) operational errors at a plurality of banks. Here, the data tables stored in the data table storage unit 22 are assumed to be per-bank data tables (that is, the data tables of banks A to C) in which data (information) on errors that occurred in the daily operations of banks A to C (incorrect fees, incorrectly specified account numbers, and so on) is accumulated.

FIG. 5 shows an example of the data structure of the bank A data table among the data tables stored in the data table storage unit 22.

As shown in FIG. 5, the bank A data table 221 is composed of a plurality of attributes whose attribute names are “版” (version), “発生日” (occurrence date), “発見日” (discovery date), “発生原因／発生者” (cause of occurrence / originator), “現象／発生者” (phenomenon / originator), “発生業務” (operation in which the error occurred), “発生者職位” (originator's job position), “損失金額（円）” (loss amount (yen)), “発生店番号” (branch number of occurrence), and “発見店番号” (branch number of discovery). In the following description, an attribute whose attribute name is “版”, for example, is simply called the “版” attribute. The other attributes are written in the same way.

Of the attributes constituting the bank A data table 221 shown in FIG. 5, the “損失金額（円）” attribute, for example, has attribute values containing numeric values such as “94500”, “300000”, “1500000”, and “0”. The “損失金額（円）” attribute is therefore a numeric type attribute. Among the attributes constituting the bank A data table 221, the “版”, “発生日”, “発見日”, “損失金額（円）”, “発生店番号”, and “発見店番号” attributes are numeric type attributes.

Of the attributes constituting the bank A data table 221 shown in FIG. 5, the “発生原因／発生者” attribute, for example, has attribute values containing character strings such as “経験不足” (lack of experience), “指導・教育不足” (lack of guidance or training), “第三者による事故” (accident caused by a third party), and “お客様の依頼ミス・記入誤り” (customer request error or entry error). The “発生原因／発生者” attribute is therefore a character string type attribute. Among the attributes constituting the bank A data table 221, the “発生原因／発生者”, “現象／発生者”, “発生業務”, and “発生者職位” attributes are character string type attributes.

図６は、データテーブル格納部２２に格納されている複数のデータテーブルのうちのＢ銀行のデータテーブルのデータ構造の一例を示す。 FIG. 6 shows an example of the data structure of the B bank data table among the plurality of data tables stored in the data table storage unit 22.

図６に示すように、Ｂ銀行のデータテーブル２２２は、「発生日」属性、「バージョン」属性、「判明日」属性、「発生原因」属性、「概要」属性、「職位／発生者」属性、「職位／検証者」属性、「直接損失額（千円）」属性、「間接損失額（千円）」属性、「業務」属性および「発生店舗」属性から構成されている。なお、Ｂ銀行のデータテーブル２２２は、上述したＡ銀行のデータテーブル２２１と異なる定義がされているため、当該Ａ銀行のデータテーブル２２１を構成する各属性と比較して属性名および属性値の表記が異なる。 As shown in FIG. 6, the data table 222 of the bank B includes an “occurrence date” attribute, a “version” attribute, a “found date” attribute, an “occurrence cause” attribute, an “overview” attribute, and a “job title / occurrence” attribute. , “Position / verifier” attribute, “direct loss (thousand yen)” attribute, “indirect loss (thousand yen)” attribute, “business” attribute, and “occurring store” attribute. Since the bank B data table 222 is defined differently from the bank A data table 221 described above, the attribute name and the attribute value notation are compared with the attributes constituting the bank A data table 221. Is different.

Here, among the attributes constituting the Bank B data table 222 shown in FIG. 6, the “indirect loss amount (thousand yen)” attribute, for example, has attribute values containing numeric values such as “0”, “400”, “0”, and “0”. The “indirect loss amount (thousand yen)” attribute is therefore a numeric type attribute. Of the attributes constituting the Bank B data table 222, the “occurrence date” attribute, the “version” attribute, the “discovery date” attribute, the “direct loss amount (thousand yen)” attribute, the “indirect loss amount (thousand yen)” attribute, and the “occurrence branch” attribute are numeric type attributes.

Further, among the attributes constituting the Bank B data table 222 shown in FIG. 6, the “cause” attribute, for example, has attribute values containing character strings such as “lack of knowledge / experience / training”, “accident caused by a customer”, “complex work content”, and “careless mistake”. The “cause” attribute is therefore a character string type attribute. Of the attributes constituting the Bank B data table 222, the “cause” attribute, the “summary” attribute, the “position / originator” attribute, the “position / verifier” attribute, and the “business” attribute are character string type attributes.

FIG. 7 shows an example of the data structure of the Bank C data table, one of the data tables stored in the data table storage unit 22.

As shown in FIG. 7, the Bank C data table 223 consists of a “discovery date” attribute, an “occurrence date” attribute, an “edition” attribute, an “occurrence branch number” attribute, a “cause / originator” attribute, a “phenomenon” attribute, a “business of occurrence” attribute, an “originator position” attribute, a “loss amount (thousand yen)” attribute, and a “risk evaluation” attribute. Because the Bank C data table 223 is defined differently from the Bank A data table 221 and the Bank B data table 222 described above, its attribute names and attribute value notation differ from those of the attributes constituting the Bank A data table 221 and the Bank B data table 222.

Here, among the attributes constituting the Bank C data table 223 shown in FIG. 7, the “risk evaluation” attribute, for example, has attribute values containing numeric values such as “1”, “0”, “3”, and “0”. The “risk evaluation” attribute is therefore a numeric type attribute. Of the attributes constituting the Bank C data table 223, the “discovery date” attribute, the “occurrence date” attribute, the “edition” attribute, the “occurrence branch number” attribute, the “loss amount (thousand yen)” attribute, and the “risk evaluation” attribute are numeric type attributes.

Further, among the attributes constituting the Bank C data table 223 shown in FIG. 7, the “originator position” attribute, for example, has attribute values containing character strings such as “general staff”, “senior staff”, “general staff”, and “part-timer”. The “originator position” attribute is therefore a character string type attribute. Of the attributes constituting the Bank C data table 223, the “cause / originator” attribute, the “phenomenon” attribute, the “business of occurrence” attribute, and the “originator position” attribute are character string type attributes.

Next, the processing procedure of the data analysis support device 30 according to the present embodiment will be described with reference to the flowchart of FIG. 8.

First, the attribute type classification unit 31 acquires the plurality of data tables stored in the data table storage unit 22 (step S1).

Next, for each acquired data table, the attribute type classification unit 31 classifies each attribute constituting the data table as either a numeric type attribute or a character string type attribute (step S2). The attribute type classification unit 31 further classifies each attribute classified as a numeric type attribute into a numeric type attribute subclass. The numeric type attribute subclasses include, for example, an integer type, a floating-point type, and a date type.

The attribute type classification unit 31 classifies each attribute by referring to the attribute values of the attributes constituting the acquired data table. If the acquired data table holds information specifying the type of each of its attributes (type information), the classification may instead be performed by referring to that information.
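The value-based classification of steps S1 and S2 can be sketched as follows. This is a minimal illustration only: the function name `classify_attribute`, the assumption that attribute values arrive as text, the order of the checks, and the `%Y-%m-%d` date format are all hypothetical and are not specified in this description.

```python
from datetime import datetime


def classify_attribute(values):
    """Classify a column of text values as 'integer', 'float', 'date' or 'string'.

    Hypothetical sketch: an attribute is treated as numeric only when
    every one of its values parses with the same parser.
    """
    def all_parse(parser):
        # True when the parser accepts every value in the column.
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if not values:
        return "string"  # assumption: an empty column is left as a string
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    # Assumed date format; real tables may use other notations.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"


print(classify_attribute(["0", "400", "0", "0"]))        # integer
print(classify_attribute(["careless mistake", "other"]))  # string
```

Under this sketch, "integer", "float" and "date" correspond to the numeric type attribute subclasses, and "string" to the character string type attribute.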

The attribute type classification unit 31 stores the classification results, for each data table, in the numeric type attribute storage unit 23 and the character string type attribute storage unit 24 (step S3).

Here, assume that the data tables stored in the data table storage unit 22 are the data tables 221 to 223 of Banks A to C described above.

In this case, the numeric type attribute storage unit 23 stores the numeric type attributes for each of the data tables 221 to 223 of Banks A to C. Specifically, the numeric type attribute storage unit 23 stores the numeric type attributes among the attributes constituting the Bank A data table 221 (the numeric type attributes of the Bank A data table 221), the numeric type attributes among the attributes constituting the Bank B data table 222 (the numeric type attributes of the Bank B data table 222), and the numeric type attributes among the attributes constituting the Bank C data table 223 (the numeric type attributes of the Bank C data table 223).

In the numeric type attribute storage unit 23, the numeric type attributes of the Bank A data table 221 are further classified into the integer type, the floating-point type, and the date type, as described above. The same applies to the numeric type attributes of the Bank B data table 222 and of the Bank C data table 223.

Each numeric type attribute of the data tables 221 to 223 of Banks A to C stored in the numeric type attribute storage unit 23 includes the attribute values that the numeric type attribute has in those data tables and the attribute name of the numeric type attribute.

Meanwhile, the character string type attribute storage unit 24 stores the character string type attributes for each of the data tables 221 to 223 of Banks A to C. Specifically, the character string type attribute storage unit 24 stores the character string type attributes among the attributes constituting the Bank A data table 221 (the character string type attributes of the Bank A data table 221), the character string type attributes among the attributes constituting the Bank B data table 222 (the character string type attributes of the Bank B data table 222), and the character string type attributes among the attributes constituting the Bank C data table 223 (the character string type attributes of the Bank C data table 223).

Each character string type attribute of the data tables 221 to 223 of Banks A to C stored in the character string type attribute storage unit 24 includes the attribute values that the character string type attribute has in those data tables and the attribute name of the character string type attribute.

Next, similarity calculation processing is executed with reference to the numeric type attribute storage unit 23 and the character string type attribute storage unit 24 (step S4). As described in detail later, this similarity calculation processing includes the similarity calculation processing for numeric type attributes, executed by the numeric type attribute processing unit 32, and the similarity calculation processing for character string type attributes, executed by the character string type attribute processing unit 33.

In the similarity calculation processing for numeric type attributes, the numeric type attribute processing unit 32 calculates the similarity between two numeric type attributes belonging to different data tables stored in the numeric type attribute storage unit 23, based on the attribute values (the numeric values they contain) of those two numeric type attributes. The similarity is calculated only between two numeric type attributes that share the same numeric type attribute subclass (that is, the same classification destination). The processing calculates the similarity for every combination of two numeric type attributes that belong to different data tables stored in the numeric type attribute storage unit 23 and that share the same numeric type attribute subclass.

In the similarity calculation processing for character string type attributes, the character string type attribute processing unit 33 calculates the similarity between two character string type attributes belonging to different data tables stored in the character string type attribute storage unit 24, based on the attribute values (the character strings they contain) of those two character string type attributes. The processing calculates the similarity for every combination of two character string type attributes belonging to different data tables stored in the character string type attribute storage unit 24.

When the processing of step S4 has been executed, a similarity list containing the similarities calculated by the numeric type attribute processing unit 32 (the similarity for each combination of two numeric type attributes belonging to different data tables) is stored in the numeric type attribute similarity storage unit 25, and a similarity list containing the similarities calculated by the character string type attribute processing unit 33 (the similarity for each combination of two character string type attributes belonging to different data tables) is stored in the character string type attribute similarity storage unit 26 (step S5). The similarity lists stored in the numeric type attribute similarity storage unit 25 and the character string type attribute similarity storage unit 26 are described in detail later.

The threshold input unit 34 then inputs a threshold specified, for example, by the user (step S6). Although the threshold is described here as being specified by the user, it may instead be set in advance inside the data analysis support device 30, or be determined dynamically according to, for example, the contents of the data tables.

Next, the similar attribute candidate extraction unit 35 executes processing for extracting similar attribute candidates (hereinafter, similar attribute candidate extraction processing) based on the similarity list stored in the numeric type attribute similarity storage unit 25, the similarity list stored in the character string type attribute similarity storage unit 26, and the threshold input by the threshold input unit 34 (step S7). The similar attribute candidates extracted by the similar attribute candidate extraction unit 35 in this processing include combinations of two numeric type attributes belonging to different data tables (numeric type attribute pairs) and combinations of two character string type attributes belonging to different data tables (character string type attribute pairs). The similar attribute candidate extraction processing is described in detail later.
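Because the extraction processing is only described in detail later, the following is a deliberately simple sketch of how a threshold could be applied to a similarity list in step S7. The function name `extract_similar_candidates`, the dictionary layout, the "greater than or equal to" comparison, and the sample values are all assumptions made for illustration.

```python
def extract_similar_candidates(similarity_list, threshold):
    """Return attribute pairs whose similarity reaches the threshold.

    similarity_list maps (attribute in table 1, attribute in table 2)
    to a similarity in [0, 1]. Using >= rather than > is an assumption.
    """
    selected = [pair for pair, sim in similarity_list.items() if sim >= threshold]
    # Most similar pairs first, so the strongest candidates are reviewed first.
    return sorted(selected, key=lambda pair: similarity_list[pair], reverse=True)


# Illustrative similarity list for one pair of data tables.
sample = {
    ("edition", "version"): 0.8,
    ("loss amount (yen)", "direct loss amount (thousand yen)"): 0.6,
    ("loss amount (yen)", "occurrence branch"): 0.0,
}
print(extract_similar_candidates(sample, threshold=0.5))
```

The same function would apply unchanged to the similarity lists for character string type attributes, since only the pair keys and similarity values are used.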

When the processing of step S7 has been executed, the similar attribute candidates extracted by the similar attribute candidate extraction unit 35 are stored in the similar attribute candidate storage unit 27 (step S8). Because the similar attribute candidates stored in the similar attribute candidate storage unit 27 are candidates for attributes that represent the same content across the different data tables stored in the data table storage unit 22, they can be used, for example, when comparing those data tables.

Next, the procedure of the similarity calculation processing for numeric type attributes, which is part of the similarity calculation processing described above (step S4 in FIG. 8), will be described with reference to the flowchart of FIG. 9. This processing is executed by the numeric type attribute processing unit 32.

First, the numeric type attribute processing unit 32 executes the following steps S11 and S12 for each numeric type attribute stored in the numeric type attribute storage unit 23. The numeric type attribute being processed is referred to here as the target numeric type attribute.

The attribute value numeric range specifying unit 321 included in the numeric type attribute processing unit 32 specifies the numeric range of the attribute values of the target numeric type attribute stored in the numeric type attribute storage unit 23 (step S11). To do so, the attribute value numeric range specifying unit 321 identifies the maximum and minimum of the attribute values (the numeric values they contain) of the target numeric type attribute, and takes the range from the minimum value to the maximum value as the numeric range of the attribute values of the target numeric type attribute.

The attribute value numeric range specifying unit 321 stores the specified numeric range of the attribute values of the target numeric type attribute in the attribute value numeric range storage unit 322 (step S12).

It is then determined whether steps S11 and S12 have been executed for all the numeric type attributes stored in the numeric type attribute storage unit 23 (step S13).

If it is determined that the processing has not yet been executed for all the numeric type attributes stored in the numeric type attribute storage unit 23 (NO in step S13), the procedure returns to step S11 and the processing is repeated, with a numeric type attribute for which steps S11 and S12 have not yet been executed as the target numeric type attribute. By executing steps S11 and S12 for all the numeric type attributes stored in the numeric type attribute storage unit 23 in this way, the numeric range of the attribute values specified for each numeric type attribute is stored in the attribute value numeric range storage unit 322. Hereinafter, the numeric range of the attribute values of a numeric type attribute stored in the attribute value numeric range storage unit 322 is referred to simply as the numeric range of that numeric type attribute.
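The loop of steps S11 to S13 amounts to recording a (minimum, maximum) pair for every numeric type attribute. A minimal sketch, assuming the attributes are held in a dictionary keyed by a hypothetical (table name, attribute name) pair:

```python
def numeric_ranges(numeric_attributes):
    """Map each numeric attribute to its (minimum, maximum) value range.

    numeric_attributes maps (table name, attribute name) to a non-empty
    list of numeric attribute values; this layout is an assumption.
    """
    ranges = {}
    for key, values in numeric_attributes.items():
        # Step S11: identify the minimum and maximum; step S12: store the range.
        ranges[key] = (min(values), max(values))
    return ranges


attributes = {
    ("Bank B", "indirect loss amount (thousand yen)"): [0, 400, 0, 0],
    ("Bank C", "risk evaluation"): [1, 0, 3, 0],
}
print(numeric_ranges(attributes))
```

The returned dictionary plays the role of the attribute value numeric range storage unit 322 in this sketch.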

On the other hand, if it is determined that the processing has been executed for all the numeric type attributes stored in the numeric type attribute storage unit 23 (YES in step S13), the numeric type attribute similarity calculation unit 323 executes the following steps S14 and S15 for each combination of two numeric type attributes belonging to different data tables stored in the numeric type attribute storage unit 23 (each numeric type attribute pair). The numeric type attribute pair being processed is referred to here as the target numeric type attribute pair. One of the numeric type attributes in the target numeric type attribute pair is referred to as the first numeric type attribute and the other as the second numeric type attribute. The first and second numeric type attributes are assumed to share the same numeric type attribute subclass (that is, the same classification destination).

First, the numeric type attribute similarity calculation unit 323 acquires the numeric ranges of the first and second numeric type attributes included in the target numeric type attribute pair from the attribute value numeric range storage unit 322.

Next, based on the acquired numeric ranges of the first and second numeric type attributes, the numeric type attribute similarity calculation unit 323 calculates the similarity between the first and second numeric type attributes (the similarity of the target numeric type attribute pair) (step S14).

Suppose that the numeric range of the first numeric type attribute acquired by the numeric type attribute similarity calculation unit 323 is wider than, or equal in width to, the numeric range of the second numeric type attribute. In this case, the numeric type attribute similarity calculation unit 323 calculates, as the similarity between the first and second numeric type attributes, the ratio of the range in which the numeric ranges of the first and second numeric type attributes overlap to the numeric range of the first numeric type attribute.

Conversely, if the numeric range of the first numeric type attribute is narrower than that of the second numeric type attribute, the numeric type attribute similarity calculation unit 323 calculates, as the similarity between the first and second numeric type attributes, the ratio of the range in which the numeric ranges of the first and second numeric type attributes overlap to the numeric range of the second numeric type attribute.
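The two cases above can be written as a single expression. The notation is introduced here for illustration only: let the numeric ranges of the first and second numeric type attributes be the intervals $[a_1, b_1]$ and $[a_2, b_2]$. The case in which both ranges have zero width is not addressed in this description and would need separate handling.

```latex
\mathrm{sim}(R_1, R_2)
  = \frac{\max\bigl(0,\ \min(b_1, b_2) - \max(a_1, a_2)\bigr)}
         {\max\bigl(b_1 - a_1,\ b_2 - a_2\bigr)},
\qquad R_1 = [a_1, b_1],\quad R_2 = [a_2, b_2]
```

The numerator is the width of the overlapping range (zero when the ranges do not overlap), and the denominator is the width of the wider of the two ranges, so the similarity lies between 0 and 1.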

The similarity between the first and second numeric type attributes calculated by the numeric type attribute similarity calculation unit 323 will now be described concretely with reference to FIG. 10 and FIG. 11.

First, in FIG. 10, the first numeric type attribute is the “loss amount (yen)” attribute of the Bank A data table 221, and the second numeric type attribute is the “direct loss amount (thousand yen)” attribute of the Bank B data table 222. The numeric range of the first numeric type attribute (the “loss amount (yen)” attribute) is 0 to 1500000, and the numeric range of the second numeric type attribute (the “direct loss amount (thousand yen)” attribute) is 0 to 1000000. Although the units of the first and second numeric type attributes differ (“yen” and “thousand yen”), it is assumed that the values have been corrected to the same unit, for example when the numeric ranges were specified as described above.

Here, the numeric range of the first numeric type attribute is wider than that of the second numeric type attribute. The similarity between the first and second numeric type attributes is therefore calculated as the ratio of the range in which the two numeric ranges overlap (here, 0 to 1000000) to the numeric range of the first numeric type attribute (here, 0 to 1500000), that is, 1000000 / 1500000 ≈ 0.667.

In FIG. 11, on the other hand, the first numeric type attribute is the “loss amount (yen)” attribute of the Bank A data table 221, and the second numeric type attribute is the “occurrence branch” attribute of the Bank B data table 222. The numeric range of the first numeric type attribute (the “loss amount (yen)” attribute) is 0 to 1500000, and the numeric range of the second numeric type attribute (the “occurrence branch” attribute) is 1 to 145.

Here, the numeric range of the first numeric type attribute is wider than that of the second numeric type attribute. The similarity between the first and second numeric type attributes is therefore calculated as the ratio of the range in which the two numeric ranges overlap (here, 1 to 145) to the numeric range of the first numeric type attribute (here, 0 to 1500000), that is, 144 / 1500000 ≈ 0.000.
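The two worked examples of FIG. 10 and FIG. 11 can be reproduced with a short sketch of the range-overlap similarity. The function name `range_similarity` and the handling of two zero-width ranges are assumptions; the description does not cover that degenerate case.

```python
def range_similarity(range1, range2):
    """Overlap of two (min, max) ranges divided by the width of the wider one."""
    low1, high1 = range1
    low2, high2 = range2
    # Width of the overlapping part; zero when the ranges are disjoint.
    overlap = max(0, min(high1, high2) - max(low1, low2))
    wider = max(high1 - low1, high2 - low2)
    if wider == 0:
        # Assumption: two single-point ranges match only if identical.
        return 1.0 if range1 == range2 else 0.0
    return overlap / wider


# FIG. 10: loss amount (yen) vs. direct loss amount, after unit correction.
print(round(range_similarity((0, 1500000), (0, 1000000)), 3))  # 0.667
# FIG. 11: loss amount (yen) vs. occurrence branch.
print(round(range_similarity((0, 1500000), (1, 145)), 3))      # 0.0
```

Dividing by the wider range makes the measure symmetric, so the result does not depend on which attribute of the pair is treated as the first.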

Returning to FIG. 9, the numeric type attribute similarity calculation unit 323 stores the calculated similarity between the first and second numeric type attributes in a similarity list (step S15). The similarity is stored in the similarity list that is prepared for the combination of the data table containing the first numeric type attribute and the data table containing the second numeric type attribute, and for the numeric type attribute subclass of the first and second numeric type attributes. Specifically, the similarity between an integer-type first numeric type attribute of the Bank A data table 221 and an integer-type second numeric type attribute of the Bank B data table 222 is stored in the integer-type similarity list prepared for the combination of the Bank A data table 221 and the Bank B data table 222.

A similarity list is prepared for each combination of the two different data tables that contain the two numeric type attributes whose similarity is calculated (that is, the first and second numeric type attributes). The similarity list prepared for each combination of two different data tables is further prepared separately for each numeric type attribute subclass.

For example, if the number of data tables stored in the data table storage unit 22 is n and the number of numeric type attribute subclasses into which numeric type attributes are classified is m, then n × (n − 1) × m / 2 similarity lists are prepared. Specifically, if three data tables 221 to 223 are stored in the data table storage unit 22 and there are three numeric type attribute subclasses (for example, the integer type, the floating-point type, and the date type), nine similarity lists are prepared.
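The count follows from taking every unordered pair of data tables once for each subclass. As a quick check, the following sketch enumerates the similarity lists for the three-bank example; the function name `similarity_list_keys` and the key layout are hypothetical.

```python
from itertools import combinations


def similarity_list_keys(tables, subclasses):
    """One similarity list per unordered pair of tables and per subclass."""
    return [
        (table1, table2, subclass)
        for table1, table2 in combinations(tables, 2)
        for subclass in subclasses
    ]


tables = ["Bank A", "Bank B", "Bank C"]
subclasses = ["integer", "float", "date"]
keys = similarity_list_keys(tables, subclasses)
print(len(keys))  # 9, i.e. 3 * (3 - 1) * 3 / 2
```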

Next, it is determined whether steps S14 and S15 have been executed for every combination of two numeric type attributes belonging to different data tables (that is, for every numeric type attribute pair) (step S16).

If it is determined that the processing has not yet been executed for all the numeric type attribute pairs (NO in step S16), the procedure returns to step S14 and the processing is repeated, with a numeric type attribute pair for which steps S14 and S15 have not yet been executed as the target numeric type attribute pair.

On the other hand, if it is determined that the processing has been executed for all the numeric type attribute pairs (YES in step S16), the similarity calculation processing for numeric type attributes ends.

When the similarity calculation processing for numeric type attributes has been executed as described above, all the prepared similarity lists (the similarity lists for numeric type attributes) have been created. The similarity lists created in this processing are stored in the numeric type attribute similarity storage unit 25, as described above.

The method of calculating the similarity between two numeric type attributes belonging to different data tables is not limited to the one described above. Other methods may be used, such as comparing basic statistics (for example, the mean or the variance), or creating graphs and using the overlap between the graphs as the similarity.

The similarity list created in the similarity calculation processing for numeric type attributes will now be described concretely with reference to FIG. 12. FIG. 12 shows an example of a similarity list, here the integer-type list, prepared for the combination of the Bank A data table 221 and the Bank B data table 222.

As shown in FIG. 12, the similarity list 251 shows the numeric type attributes of the data table 221 of bank A and the numeric type attributes of the data table 222 of bank B. The numeric type attributes of the data table 221 of bank A are the "edition" (版) attribute, the "loss amount (yen)" attribute, the "occurrence branch number" attribute, and the "discovery branch number" attribute, all of which are integer type numeric attributes. The numeric type attributes of the data table 222 of bank B are the "version" (バージョン) attribute, the "direct loss amount (thousand yen)" attribute, the "indirect loss amount (thousand yen)" attribute, and the "occurrence branch" attribute, which are likewise integer type numeric attributes.

In the example shown in FIG. 12, the similarity list 251 stores, for example, 0.805 in association with the "edition" (版) attribute and the "version" (バージョン) attribute. This indicates that the similarity between the "edition" attribute and the "version" attribute, which belong to different data tables (here, the data table 221 of bank A and the data table 222 of bank B), is 0.805.

The similarity list 251 shown in FIG. 12 also stores, in the same way, the similarities between pairs of numeric type attributes other than the "edition" and "version" pair. In other words, the similarity list 251 stores the similarity for every combination of an integer type numeric attribute of the data table 221 of bank A and an integer type numeric attribute of the data table 222 of bank B.

The integer type similarity list prepared for the combination of the data table 221 of bank A and the data table 222 of bank B has been described here. The other similarity lists are structured in the same way, so a detailed description of them is omitted.

Next, the procedure of the character string type attribute similarity calculation process, which is part of the similarity calculation process described above (step S4 shown in FIG. 8), will be described with reference to the flowchart of FIG. 13. This process is executed by the character string type attribute processing unit 33.

First, the character string type attribute processing unit 33 executes steps S21 and S22 below for each character string type attribute stored in the character string type attribute storage unit 24. The character string type attribute being processed is referred to here as the target character string type attribute.

The attribute value word extraction unit 331 included in the character string type attribute processing unit 33 performs morphological analysis on the character strings contained in the attribute values of the target character string type attribute stored in the character string type attribute storage unit 24. The attribute value word extraction unit 331 thereby extracts the words that make up those character strings and creates a word set containing the extracted words (hereinafter referred to as the word set of the target character string type attribute) (step S21).

The word set of the target character string type attribute created by the attribute value word extraction unit 331 will now be described concretely with reference to FIG. 14. Here, the target character string type attribute is assumed to be the "occurrence cause/originator" (発生原因／発生者) attribute of the data table 221 of bank A shown in FIG. 5.

First, the set of all attribute values that the target character string type attribute has in the data table containing it (hereinafter referred to as the attribute value set of the target character string type attribute) is acquired. In the example shown in FIG. 14, the attribute value set of the target character string type attribute contains the attribute values "lack of experience" (経験不足), "lack of guidance/education" (指導・教育不足), "accident caused by a third party" (第三者による事故), "lack of guidance/education", "customer request mistake/entry error" (お客様の依頼ミス・記入誤り), "lack of experience", "lack of guidance/education", "lack of guidance/education", and "accident caused by a third party".

Next, identical attribute values in the attribute value set of the target character string type attribute are collapsed into one (that is, duplicates are removed) to create the unique attribute value set of the target character string type attribute (step S31). In the example shown in FIG. 14, the unique attribute value set contains the attribute values "lack of experience" (経験不足), "lack of guidance/education" (指導・教育不足), "accident caused by a third party" (第三者による事故), and "customer request mistake/entry error" (お客様の依頼ミス・記入誤り). The attribute values "lack of experience", "lack of guidance/education", and "accident caused by a third party" each occur more than once in the attribute value set, so each of them appears only once in the unique attribute value set.

Next, the attribute values in the unique attribute value set of the target character string type attribute (that is, the character strings they contain) are morphologically analyzed, which splits each character string into words. Morphological analysis here means splitting a character string into words and assigning a part of speech to each resulting word. For example, the attribute value 「お客様の依頼ミス・記入誤り」 ("customer request mistake/entry error") is split into 「お (prefix) / 客 (noun) / 様 (suffix) / の (particle) / 依頼 (noun) / ミス (noun) / ・ (symbol) / 記入 (noun) / 誤り (noun)」. The words whose part of speech is noun are extracted from the result of the morphological analysis, and a word set containing those words is created (step S32). In the example shown in FIG. 14, the word set created by the attribute value word extraction unit 331 contains the words "experience" (経験), "lack" (不足), "guidance" (指導), "education" (教育), "lack" (不足), "third party" (第三者), "accident" (事故), "customer" (客), "request" (依頼), "mistake" (ミス), "entry" (記入), and "error" (誤り).

Note that in step S32, words whose part of speech is verb, unknown words, and the like may also be extracted in addition to nouns. An unknown word is, for example, a word that is not registered in the dictionary used for morphological analysis. In general, proper nouns and technical terms are likely to be unknown words.

Next, identical words in the created word set are collapsed into one (that is, duplicates are removed) to create the unique word set (attribute value word set) of the target character string type attribute (step S33). In the example shown in FIG. 14, the unique word set of the target character string type attribute contains the words "experience" (経験), "lack" (不足), "guidance" (指導), "education" (教育), "third party" (第三者), "accident" (事故), "customer" (客), "request" (依頼), "mistake" (ミス), "entry" (記入), and "error" (誤り). The word "lack" occurs more than once in the word set created in step S32, so it appears only once in the unique word set.
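Steps S31 to S33 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function `build_word_set`, the `analyze` callback, and the hard-coded `fake_analyze` table are all hypothetical stand-ins for a real morphological analyzer. Only the segmentation of 「お客様の依頼ミス・記入誤り」 is taken from the description; the segmentations of the other three values are assumptions.

```python
from typing import Callable, Iterable, List, Set, Tuple

# A token is a (surface form, part of speech) pair.
Token = Tuple[str, str]


def build_word_set(
    attribute_values: Iterable[str],
    analyze: Callable[[str], List[Token]],
    keep_pos: Tuple[str, ...] = ("noun",),
) -> Set[str]:
    """Return the unique word set of one character string type attribute."""
    # Step S31: collapse identical attribute values into one.
    unique_values = set(attribute_values)
    # Step S32: analyze each value and keep words with a wanted part of speech.
    words = [
        surface
        for value in unique_values
        for surface, pos in analyze(value)
        if pos in keep_pos
    ]
    # Step S33: collapse identical words into one.
    return set(words)


# Hypothetical analyzer output, hard-coded for this example only.
_FAKE_ANALYSIS = {
    "経験不足": [("経験", "noun"), ("不足", "noun")],
    "指導・教育不足": [
        ("指導", "noun"), ("・", "symbol"), ("教育", "noun"), ("不足", "noun"),
    ],
    "第三者による事故": [("第三者", "noun"), ("による", "particle"), ("事故", "noun")],
    "お客様の依頼ミス・記入誤り": [
        ("お", "prefix"), ("客", "noun"), ("様", "suffix"), ("の", "particle"),
        ("依頼", "noun"), ("ミス", "noun"), ("・", "symbol"),
        ("記入", "noun"), ("誤り", "noun"),
    ],
}


def fake_analyze(value: str) -> List[Token]:
    """Stand-in for a morphological analyzer (assumption, not a real library)."""
    return _FAKE_ANALYSIS[value]


# The nine attribute values of the FIG. 14 example, duplicates included.
values = [
    "経験不足", "指導・教育不足", "第三者による事故", "指導・教育不足",
    "お客様の依頼ミス・記入誤り", "経験不足", "指導・教育不足",
    "指導・教育不足", "第三者による事故",
]
word_set = build_word_set(values, fake_analyze)
print(len(word_set))  # 11
```

Passing the analyzer in as a callback keeps the sketch independent of any particular tokenizer. Extracting verbs or unknown words as well would only require a different `keep_pos` tuple.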

Returning to FIG. 13, the attribute value word extraction unit 331 stores the word set of the target character string type attribute created as described above (the unique word set of the target character string type attribute) in the attribute value word set storage unit 332 (step S22).

It is then determined whether steps S21 and S22 described above have been executed for all character string type attributes stored in the character string type attribute storage unit 24 (step S23).

If it is determined that processing has not yet been executed for all character string type attributes stored in the character string type attribute storage unit 24 (NO in step S23), the process returns to step S21 described above and is repeated. In this case, a character string type attribute for which steps S21 and S22 have not yet been executed is processed as the target character string type attribute. By executing steps S21 and S22 for all character string type attributes stored in the character string type attribute storage unit 24 in this way, a word set for each character string type attribute is stored in the attribute value word set storage unit 332.

On the other hand, if it is determined that processing has been executed for all character string type attributes stored in the character string type attribute storage unit 24 (YES in step S23), the character string type attribute similarity calculation unit 333 executes steps S24 to S26 below for each combination of two character string type attributes that are stored in the character string type attribute storage unit 24 and belong to different data tables (each character string type attribute pair). The character string type attribute pair being processed is referred to here as the target character string type attribute pair. One character string type attribute of the target pair is referred to as the first character string type attribute, and the other as the second character string type attribute.

First, the character string type attribute similarity calculation unit 333 acquires the word sets of the first and second character string type attributes of the target character string type attribute pair from the attribute value word set storage unit 332.

Next, the character string type attribute similarity calculation unit 333 refers to the acquired word sets and determines the number of words that match between the word set of the first character string type attribute and that of the second character string type attribute (step S24). Specifically, it determines the number of words in the word set of the first character string type attribute that match a word in the word set of the second character string type attribute (hereinafter referred to as the match count of the first character string type attribute). It likewise determines the number of words in the word set of the second character string type attribute that match a word in the word set of the first character string type attribute (hereinafter referred to as the match count of the second character string type attribute).

The character string type attribute similarity calculation unit 333 calculates the similarity between the first and second character string type attributes based on the match counts determined for them (step S25). To do so, it uses the match rate of the words in the word set of the first character string type attribute (hereinafter referred to as the word match rate of the first character string type attribute) and the match rate of the words in the word set of the second character string type attribute (hereinafter referred to as the word match rate of the second character string type attribute).

The word match rate of the first character string type attribute is the ratio of the match count of the first character string type attribute, determined in step S24, to the number of words in the word set of the first character string type attribute. Similarly, the word match rate of the second character string type attribute is the ratio of the match count of the second character string type attribute, determined in step S24, to the number of words in the word set of the second character string type attribute.

The character string type attribute similarity calculation unit 333 calculates the average of the word match rates of the first and second character string type attributes as the similarity between the two attributes.

The similarity between the first and second character string type attributes calculated by the character string type attribute similarity calculation unit 333 will now be described concretely with reference to FIG. 15.

Here, the first character string type attribute is assumed to be the "occurrence cause/originator" (発生原因／発生者) attribute of the data table 221 of bank A, and the second character string type attribute is assumed to be the "occurrence cause" (発生原因) attribute of the data table 222 of bank B.

As shown in FIG. 15, the word set of the first character string type attribute (the "occurrence cause/originator" attribute) is assumed to contain 11 words: "experience" (経験), "lack" (不足), "guidance" (指導), "education" (教育), "third party" (第三者), "accident" (事故), "customer" (客), "request" (依頼), "mistake" (ミス), "entry" (記入), and "error" (誤り). The word set of the second character string type attribute (the "occurrence cause" attribute) is assumed to contain 10 words: "knowledge" (知識), "experience" (経験), "education" (教育), "lack" (不足), "client" (顧客), "accident" (事故), "complexity" (複雑), "work" (作業), "content" (内容), and "careless mistake" (ケアレスミス). Note that 客 and 顧客 are different words, as are ミス and ケアレスミス, so neither pair counts as a match.

The words in the word set of the first character string type attribute that match a word in the word set of the second character string type attribute are "experience", "lack", "education", and "accident", so the match count of the first character string type attribute is 4. Conversely, the words in the word set of the second character string type attribute that match a word in the word set of the first character string type attribute are "experience", "education", "lack", and "accident", so the match count of the second character string type attribute is also 4.

As stated above, the word set of the first character string type attribute contains 11 words, so the word match rate of the first character string type attribute is 4/11. The word set of the second character string type attribute contains 10 words, so the word match rate of the second character string type attribute is 4/10.

The similarity between the first and second character string type attributes is therefore the average of 4/11 and 4/10, that is, (4/11 + 4/10)/2 ≈ 0.382.
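The calculation in steps S24 and S25 can be reproduced with plain set operations. The sketch below is illustrative only: the function names are hypothetical, the handling of an empty word set is an assumption the description does not cover, and the English labels are glosses of the Japanese words (客 is written "customer" and 顧客 "client" so that the two stay distinct).

```python
from typing import Set


def word_match_rate(own: Set[str], other: Set[str]) -> float:
    """Share of the words in `own` that also appear in `other`."""
    if not own:
        return 0.0  # assumption: an empty word set matches nothing
    return len(own & other) / len(own)


def string_attribute_similarity(first: Set[str], second: Set[str]) -> float:
    """Average of the two word match rates (step S25)."""
    return (word_match_rate(first, second) + word_match_rate(second, first)) / 2


# Word sets of the FIG. 15 example (English glosses of the Japanese words).
first = {
    "experience", "lack", "guidance", "education", "third party", "accident",
    "customer", "request", "mistake", "entry", "error",
}
second = {
    "knowledge", "experience", "education", "lack", "client", "accident",
    "complexity", "work", "content", "careless mistake",
}

similarity = string_attribute_similarity(first, second)
print(round(similarity, 3))  # 0.382
```

Because both word sets are already free of duplicates, the two match counts are always equal to the size of the intersection. The two match rates differ only through their denominators, which is why the average is symmetric in the two attributes.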

Returning again to FIG. 13, the character string type attribute similarity calculation unit 333 stores the calculated similarity between the first and second character string type attributes in a similarity list (step S26). The similarity is stored in the similarity list prepared for the combination of the data table containing the first character string type attribute and the data table containing the second character string type attribute. For example, the similarity between a first character string type attribute of the data table 221 of bank A and a second character string type attribute of the data table 222 of bank B is stored in the similarity list prepared for the combination of the data table 221 of bank A and the data table 222 of bank B.

Note that a similarity list is prepared for each combination of two different data tables, namely the data tables that respectively contain the two character string type attributes whose similarity is calculated (the first and second character string type attributes).

For example, if n data tables are stored in the data table storage unit 22, n(n-1)/2 similarity lists are prepared. Specifically, if three data tables 221 to 223 are stored in the data table storage unit 22, three similarity lists are prepared.
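The count n(n-1)/2 is simply the number of unordered pairs of data tables. As a small check, assuming the tables are identified by their reference numerals (the variable names are illustrative):

```python
from itertools import combinations

# Reference numerals of the three data tables in the example.
tables = ["221", "222", "223"]

# One similarity list is prepared per unordered pair of different tables.
table_pairs = list(combinations(tables, 2))

n = len(tables)
print(table_pairs)       # [('221', '222'), ('221', '223'), ('222', '223')]
print(n * (n - 1) // 2)  # 3
```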

Next, it is determined whether steps S24 to S26 described above have been executed for all combinations of two character string type attributes belonging to different data tables (that is, for all character string type attribute pairs) (step S27).

If it is determined that processing has not yet been executed for all character string type attribute pairs (NO in step S27), the process returns to step S24 described above and is repeated. In this case, a character string type attribute pair for which steps S24 to S26 have not yet been executed is processed as the target character string type attribute pair.

On the other hand, if it is determined that processing has been executed for all character string type attribute pairs (YES in step S27), the similarity calculation process for character string type attributes ends.

When the similarity calculation process for character string type attributes is executed as described above, all of the prepared similarity lists (similarity lists for character string type attributes) are created. The similarity lists created in this process are stored in the character string type attribute similarity storage unit 26, as described above.

The similarity list created in the similarity calculation process for character string type attributes will now be described concretely with reference to FIG. 16. FIG. 16 shows an example of a similarity list prepared for the combination of the data table 221 of bank A and the data table 222 of bank B.

As shown in FIG. 16, the similarity list 261 shows the character string type attributes of the data table 221 of bank A and the character string type attributes of the data table 222 of bank B. The character string type attributes of the data table 221 of bank A are the "occurrence cause/originator" (発生原因／発生者) attribute, the "phenomenon/originator" (現象／発生者) attribute, the "occurrence operation" (発生業務) attribute, and the "originator position" (発生者職位) attribute. The character string type attributes of the data table 222 of bank B are the "occurrence cause" (発生原因) attribute, the "summary" (概要) attribute, the "position/originator" (職位／発生者) attribute, and the "position/verifier" (職位／検証者) attribute.

In the example shown in FIG. 16, the similarity list 261 stores, for example, 0.382 in association with the "occurrence cause/originator" attribute and the "occurrence cause" attribute. This indicates that the similarity between the "occurrence cause/originator" attribute and the "occurrence cause" attribute, which belong to different data tables (here, the data table 221 of bank A and the data table 222 of bank B), is 0.382.

The similarity list 261 shown in FIG. 16 also stores, in the same way, the similarities between pairs of character string type attributes other than the "occurrence cause/originator" and "occurrence cause" pair. In other words, the similarity list 261 stores the similarity for every combination of a character string type attribute of the data table 221 of bank A and a character string type attribute of the data table 222 of bank B.

The similarity list prepared for the combination of the data table 221 of bank A and the data table 222 of bank B has been described here. The other similarity lists are structured in the same way, so a detailed description of them is omitted.

Next, the procedure of the similar attribute candidate extraction process described above (step S7 shown in FIG. 8) will be described with reference to the flowchart of FIG. 17. This process is common to numeric type attributes and character string type attributes, and is executed by the similar attribute candidate extraction unit 35.

The similar attribute candidate extraction unit 35 executes steps S41 to S49 below for each similarity list stored in the numeric type attribute similarity storage unit 25 and the character string type attribute similarity storage unit 26. The similarity list being processed is referred to here as the target similarity list.

First, the similar attribute candidate extraction unit 35 retrieves the target similarity list from the numeric type attribute similarity storage unit 25 or the character string type attribute similarity storage unit 26 (step S41).

Next, the similar attribute candidate extraction unit 35 executes steps S42 to S48 below for each attribute in the retrieved target similarity list. The attribute being processed is referred to here as the target attribute.

The similar attribute candidate extraction unit 35 takes a target attribute from the target similarity list (step S42).

In the target similarity list, the similar attribute candidate extraction unit 35 searches for attributes (attributes belonging to a data table different from that of the target attribute) whose similarity to the target attribute is equal to or greater than the threshold input via the threshold input unit 34 (the threshold input in step S6 shown in FIG. 8). This threshold is a value representing a similarity high enough for a pair of attributes to be extracted as similar attribute candidates.

In the following description, an attribute whose similarity to the target attribute is equal to or greater than the threshold is referred to as a corresponding attribute.
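The search for corresponding attributes can be sketched as a simple filter over a similarity list. In this illustration the similarity list is modeled as a dictionary keyed by (bank A attribute, bank B attribute) pairs, which is an assumption about the data layout. The function name is hypothetical, the threshold of 0.8 is chosen for illustration, and every similarity value other than the 0.805 given for FIG. 12 is made up.

```python
from typing import Dict, List, Tuple

# Assumed layout: (attribute of table A, attribute of table B) -> similarity.
SimilarityList = Dict[Tuple[str, str], float]


def corresponding_attributes(
    similarity_list: SimilarityList,
    target: str,
    side: int,
    threshold: float,
) -> List[str]:
    """Attributes of the other table whose similarity to `target` meets the threshold.

    `side` is 0 if `target` belongs to the first table of each key, 1 otherwise.
    """
    other = 1 - side
    return [
        pair[other]
        for pair, score in similarity_list.items()
        if pair[side] == target and score >= threshold
    ]


similarity_list_251: SimilarityList = {
    ("edition", "version"): 0.805,  # value given for FIG. 12
    ("edition", "occurrence branch"): 0.120,  # made-up value
    ("loss amount (yen)", "version"): 0.050,  # made-up value
    ("loss amount (yen)", "direct loss amount (thousand yen)"): 0.400,  # made-up value
}

hits = corresponding_attributes(similarity_list_251, "edition", side=0, threshold=0.8)
print(hits)  # ['version']
```

The length of the returned list is what the subsequent branch on the number of corresponding attributes (two or more, exactly one, or none) would inspect.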

The similar attribute candidate extraction unit 35 determines whether two or more corresponding attributes were found (step S43).

If it is determined that fewer than two corresponding attributes were found (NO in step S43), the similar attribute candidate extraction unit 35 determines whether exactly one corresponding attribute was found (step S44).

If it is determined that the number of corresponding attributes is not one (that is, no corresponding attribute exists) (NO in step S44), the similar attribute candidate extraction unit 35 determines whether any attribute in the target similarity list that belongs to a data table different from that of the target attribute satisfies a predetermined condition (step S45). The predetermined condition includes the existence of an attribute that is mutually most similar to the target attribute, that is, an attribute that has the highest similarity to the target attribute and for which the target attribute in turn has the highest similarity.

The process of determining whether an attribute that is mutually most similar to the target attribute exists (that is, step S45) will now be described concretely. Here, the target similarity list is assumed to be the similarity list 261 shown in FIG. 16 described above, and the threshold input via the threshold input unit 34 is assumed to be 0.8.

First, assume that the target attribute is the "phenomenon/originator" (現象／発生者) attribute of the data table 221 of bank A in the similarity list 261. In the similarity list 261, no attribute has a similarity to the target "phenomenon/originator" attribute that is equal to or greater than the threshold (0.8), so no corresponding attribute exists and step S45 is executed.

Referring to the similarity list 261 shown in FIG. 16, the attribute with the highest similarity to the target "phenomenon/originator" attribute is the "summary" (概要) attribute.

Conversely, in the similarity list 261, the attribute with the highest similarity to the "summary" attribute is the target "phenomenon/originator" attribute.

When, as here, the attribute most similar to the target "phenomenon/originator" attribute is the "summary" attribute, and the attribute most similar to the "summary" attribute is the "phenomenon/originator" attribute (the target attribute), it is determined that an attribute that is mutually most similar to the target attribute exists (here, the "summary" attribute).

次に、対象属性は、類似度一覧表２６１中のＡ銀行のデータテーブル２２１を構成する「発生業務」属性であるものとする。この場合、類似度一覧表２６１において、対象属性である「発生業務」属性との類似度が閾値（０．８）以上である属性（該当属性）は存在しないため、ステップＳ４５の処理が実行される。 Next, assume that the target attribute is the “occurring business” attribute of the data table 221 of bank A in the similarity list 261. In this case, the similarity list 261 contains no attribute (corresponding attribute) whose similarity with the target “occurring business” attribute is greater than or equal to the threshold (0.8), so the process of step S45 is executed.

図１６に示す類似度一覧表２６１を参照すると、当該類似度一覧表２６１において、対象属性である「発生業務」属性との類似度が最大の属性は、「発生原因」属性である。 Referring to the similarity list 261 shown in FIG. 16, the attribute with the highest similarity to the target “occurring business” attribute is the “occurrence cause” attribute.

これに対して、類似度一覧表２６１において、「発生原因」属性との類似度が最大の属性は、「発生原因／発生者」属性であり、対象属性である「発生業務」属性ではない。 In contrast, in the similarity list 261, the attribute with the highest similarity to the “occurrence cause” attribute is the “occurrence cause / occurrence” attribute, not the target “occurring business” attribute.

このように対象属性である「発生業務」属性と類似度が最大となる属性が「発生原因」属性であるが、当該「発生原因」属性との類似度が最大となる属性が「発生業務」属性（つまり、対象属性）でない場合には、当該対象属性と互いに類似度が最大となる属性が存在しないと判定される。 Thus, when the attribute most similar to the target “occurring business” attribute is the “occurrence cause” attribute, but the attribute most similar to that “occurrence cause” attribute is not the “occurring business” attribute (that is, the target attribute), it is determined that no attribute is mutually most similar to the target attribute.

上記したようにステップＳ４５においては、対象属性と類似度が最大となる属性と類似度が最大となる属性が当該対象属性である場合には当該対象属性と互いに類似度が最大となる属性が存在すると判定され、対象属性と類似度が最大となる属性と類似度が最大となる属性が当該対象属性でない場合には当該対象属性と互いに類似度が最大となる属性が存在しない（つまり、類似属性候補はない）と判定される。 As described above, in step S45, if the attribute that is most similar to the target attribute's own most similar attribute is the target attribute itself, it is determined that an attribute mutually most similar to the target attribute exists; otherwise, it is determined that no such attribute exists (that is, there is no similar attribute candidate).

対象属性と互いに類似度が最大となる属性が存在する、つまり、予め定められた条件を満たす属性があると判定された場合（ステップＳ４５のＹＥＳ）、当該属性および対象属性（の組み合わせ）を類似属性候補として抽出する（ステップＳ４６）。 When it is determined that an attribute mutually most similar to the target attribute exists, that is, that an attribute satisfying the predetermined condition exists (YES in step S45), that attribute and the target attribute (as a combination) are extracted as a similar attribute candidate (step S46).

一方、対象属性と互いに類似度が最大となる属性が存在しない、つまり、予め定められた条件を満たす属性がないと判定された場合（ステップＳ４５のＮＯ）、ステップＳ４６の処理は実行されない。 On the other hand, when it is determined that there is no attribute having the maximum similarity with the target attribute, that is, there is no attribute satisfying a predetermined condition (NO in step S45), the process in step S46 is not executed.
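The mutual-maximum check of step S45 can be sketched as follows. This is a minimal illustration, not the patented implementation: the nested-dictionary layout, the function name `mutual_best_match`, and every similarity value are assumptions (the actual values of FIG. 16 are not reproduced here), and ties are resolved by taking the first maximum, which the text does not specify.

```python
from typing import Dict, Optional

# Assumed layout: rows are attributes of one data table, columns are
# attributes of another; values are hypothetical similarities.
SimilarityList = Dict[str, Dict[str, float]]


def mutual_best_match(similarity: SimilarityList, target: str) -> Optional[str]:
    """Return the attribute that is mutually most similar to `target`, if any."""
    row = similarity.get(target, {})
    if not row:
        return None
    # Attribute of the other table that is most similar to the target.
    partner = max(row, key=row.get)
    # Attribute of the target's table that is most similar to that partner.
    column = {attr: scores[partner] for attr, scores in similarity.items()
              if partner in scores}
    back = max(column, key=column.get)
    return partner if back == target else None


# Hypothetical values chosen to reproduce the two cases described in the text.
similarity_list = {
    "現象／発生者": {"概要": 0.6, "発生原因": 0.3},
    "発生業務": {"概要": 0.2, "発生原因": 0.4},
    "発生原因／発生者": {"概要": 0.1, "発生原因": 0.9},
}

first_case = mutual_best_match(similarity_list, "現象／発生者")
second_case = mutual_best_match(similarity_list, "発生業務")
```

With these assumed values, the first call finds the “概要” attribute, while the second finds nothing because “発生原因” is closer to “発生原因／発生者” than to “発生業務”.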

また、上記したステップＳ４３において該当属性の数が２つ以上であると判定された場合、類似属性候補抽出部３５は、当該該当属性毎に、対象属性の属性名および当該該当属性の属性名間の類似度を算出する（ステップＳ４７）。なお、対象属性の属性名および該当属性の属性名間の類似度は、上述した２つの文字列型属性間の類似度と同様に算出される。 When it is determined in step S43 described above that the number of corresponding attributes is two or more, the similar attribute candidate extraction unit 35 calculates, for each corresponding attribute, the similarity between the attribute name of the target attribute and the attribute name of that corresponding attribute (step S47). This attribute-name similarity is calculated in the same manner as the similarity between two character string type attributes described above.

ここで、図１８および図１９を参照して、類似属性候補抽出部３５によって算出される対象属性の属性名および該当属性の属性名間の類似度について具体的に説明する。ここでは、対象類似度一覧表は、上述した図１６に示す類似度一覧表２６１であるものとする。また、対象属性は、類似度一覧表２６１中のＡ銀行のデータテーブル２２１を構成する「発生者職位」属性であるものとする。なお、上記したように閾値入力部３４によって入力された閾値は０．８であるものとする。 Here, with reference to FIG. 18 and FIG. 19, the similarity between the attribute name of the target attribute and the attribute name of a corresponding attribute, as calculated by the similar attribute candidate extraction unit 35, will be described in detail. It is assumed that the target similarity list is the similarity list 261 shown in FIG. 16 described above, and that the target attribute is the “occurrence position” attribute of the data table 221 of bank A in the similarity list 261. As noted above, the threshold value input by the threshold value input unit 34 is 0.8.

この場合、類似度一覧表２６１において、対象属性である「発生者職位」属性との類似度が閾値（０．８）以上である属性（該当属性）は、「職位／発生者」属性および「職位／検証者」属性の２つである。なお、「発生者職位」属性の属性名は「発生者職位」であり、「職位／発生者」属性の属性名は「職位／発生者」であり、「職位／検証者」属性の属性名は「職位／検証者」である。 In this case, in the similarity list 261, two attributes (corresponding attributes) have a similarity with the target “occurrence position” attribute greater than or equal to the threshold (0.8): the “position / occurrence” attribute and the “position / verifier” attribute. The attribute name of the “occurrence position” attribute is “occurrence position”, that of the “position / occurrence” attribute is “position / occurrence”, and that of the “position / verifier” attribute is “position / verifier”.

まず、図１８を参照して、対象属性である「発生者職位」属性の属性名および該当属性である「職位／発生者」属性の属性名間の類似度について説明する。 First, with reference to FIG. 18, the similarity between the attribute name of the “occurrence position” attribute that is the target attribute and the attribute name of the “position / occurrence” attribute that is the corresponding attribute will be described.

この場合、上述した２つの文字列型属性間の類似度と同様に、対象属性である「発生者職位」属性の属性名および該当属性である「職位／発生者」属性の属性名が形態素解析処理されることにより、当該「発生者職位」属性の属性名の単語集合および当該「職位／発生者」属性の属性名の単語集合が作成される。ここで作成される単語集合には、例えば品詞が名詞の単語が含まれる。 In this case, similarly to the similarity between the two character string type attributes described above, the attribute name of the “occurrence position” attribute that is the target attribute and the attribute name of the “position / occurrence” attribute that is the corresponding attribute are morphologically analyzed. As a result of the processing, a word set of attribute names of the “occurrence position” attribute and a word set of attribute names of the “position / occurrence” attribute are created. The word set created here includes words whose part of speech is a noun, for example.

具体的には、「発生者職位」属性の属性名の単語集合には、単語「発生者」および「職位」が含まれる。また、「職位／発生者」属性の属性名の単語集合には、単語「職位」および「発生者」が含まれる。 Specifically, the word set of the attribute name of the “occurrence position” attribute includes the words “occurrence” and “position”. Likewise, the word set of the attribute name of the “position / occurrence” attribute includes the words “position” and “occurrence”.

ここで、「発生者職位」属性の属性名の単語集合に含まれる単語のうち、「職位／発生者」属性の属性名の単語集合に含まれる単語と一致する単語の数は２（単語「発生者」および「職位」）である。また、「職位／発生者」属性の属性名の単語集合に含まれる単語のうち、「発生者職位」属性の属性名の単語集合に含まれる単語と一致する単語の数は２（単語「職位」および「発生者」）である。 Here, of the words included in the word set of the attribute name of the “occurrence position” attribute, two (the words “occurrence” and “position”) match a word included in the word set of the attribute name of the “position / occurrence” attribute. Likewise, of the words included in the word set of the attribute name of the “position / occurrence” attribute, two (the words “position” and “occurrence”) match a word included in the word set of the attribute name of the “occurrence position” attribute.

また、上記したように「発生者職位」属性の属性名の単語集合に含まれる単語の数は２であるため、当該「発生者職位」属性の属性名の単語集合に含まれる単語の数に対する上記した「職位／発生者」属性の属性名の単語集合に含まれる単語と一致する単語の数の割合は２／２である。また、「職位／発生者」属性の属性名の単語集合に含まれる単語の数は２であるため、当該「職位／発生者」属性の属性名の単語集合に含まれる単語の数に対する上記した「発生者職位」属性の属性名の単語集合に含まれる単語と一致する単語の数の割合は２／２である。 As described above, the word set of the attribute name of the “occurrence position” attribute contains two words, so the proportion of those words that match a word in the word set of the attribute name of the “position / occurrence” attribute is 2/2. Similarly, the word set of the attribute name of the “position / occurrence” attribute contains two words, so the proportion of those words that match a word in the word set of the attribute name of the “occurrence position” attribute is 2/2.

これにより、対象属性である「発生者職位」属性の属性名および該当属性である「職位／発生者」属性の属性名間の類似度は、２／２と２／２との平均値、つまり、（２／２＋２／２）／２＝１と算出される。 Accordingly, the similarity between the attribute name of the “occurrence position” attribute that is the target attribute and the attribute name of the “position / occurrence” attribute that is the corresponding attribute is an average value of 2/2 and 2/2, that is, , (2/2 + 2/2) / 2 = 1.

次に、図１９を参照して、対象属性である「発生者職位」属性の属性名および該当属性である「職位／検証者」属性の属性名間の類似度について説明する。 Next, with reference to FIG. 19, the similarity between the attribute name of the “occurrence position” attribute that is the target attribute and the attribute name of the “position / verifier” attribute that is the corresponding attribute will be described.

この場合、対象属性である「発生者職位」属性の属性名および該当属性である「職位／検証者」属性の属性名が形態素解析処理されることにより、当該「発生者職位」属性の属性名の単語集合および当該「職位／検証者」属性の属性名の単語集合が作成される。ここで作成される単語集合には、例えば品詞が名詞の単語が含まれる。 In this case, the attribute name of the “occurrence position” attribute is obtained by performing morphological analysis processing on the attribute name of the “occurrence position” attribute that is the target attribute and the attribute name of the “position / verifier” attribute that is the corresponding attribute. And a word set of attribute names of the “position / verifier” attribute are created. The word set created here includes words whose part of speech is a noun, for example.

具体的には、「発生者職位」属性の属性名の単語集合には、上記したように単語「発生者」および「職位」が含まれる。また、「職位／検証者」属性の属性名の単語集合には、単語「職位」および「検証者」が含まれる。 Specifically, as described above, the word set of the attribute name of the “occurrence position” attribute includes the words “occurrence” and “position”. The word set of the attribute name of the “position / verifier” attribute includes the words “position” and “verifier”.

ここで、「発生者職位」属性の属性名の単語集合に含まれる単語のうち、「職位／検証者」属性の属性名の単語集合に含まれる単語と一致する単語の数は１（単語「職位」）である。また、「職位／検証者」属性の属性名の単語集合に含まれる単語のうち、「発生者職位」属性の属性名の単語集合に含まれる単語と一致する単語の数は１（単語「職位」）である。 Here, of the words included in the word set of the attribute name of the “occurrence position” attribute, one (the word “position”) matches a word included in the word set of the attribute name of the “position / verifier” attribute. Likewise, of the words included in the word set of the attribute name of the “position / verifier” attribute, one (the word “position”) matches a word included in the word set of the attribute name of the “occurrence position” attribute.

また、上記したように「発生者職位」属性の属性名の単語集合に含まれる単語の数は２であるため、当該「発生者職位」属性の属性名の単語集合に含まれる単語の数に対する上記した「職位／検証者」属性の属性名の単語集合に含まれる単語と一致する単語の数の割合は１／２である。また、「職位／検証者」属性の属性名の単語集合に含まれる単語の数は２であるため、当該「職位／検証者」属性の属性名の単語集合に含まれる単語の数に対する上記した「発生者職位」属性の属性名の単語集合に含まれる単語と一致する単語の数の割合は１／２である。 As described above, the word set of the attribute name of the “occurrence position” attribute contains two words, so the proportion of those words that match a word in the word set of the attribute name of the “position / verifier” attribute is 1/2. Similarly, the word set of the attribute name of the “position / verifier” attribute contains two words, so the proportion of those words that match a word in the word set of the attribute name of the “occurrence position” attribute is 1/2.

これにより、対象属性である「発生者職位」属性の属性名および該当属性である「職位／検証者」属性の属性名間の類似度は、１／２と１／２との平均値、つまり、（１／２＋１／２）／２＝１／２と算出される。 Accordingly, the similarity between the attribute name of the “occurrence position” attribute that is the target attribute and the attribute name of the “position / verifier” attribute that is the corresponding attribute is an average value of 1/2 and 1/2, that is, , (1/2 + 1/2) / 2 = 1/2.
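The two worked examples above (FIG. 18 and FIG. 19) can be reproduced with a short sketch. It assumes that morphological analysis has already produced the noun word sets, so no tokenizer is invoked; the function names `match_ratio` and `word_set_similarity` are hypothetical, and returning 0 for an empty word set is an assumption the text does not address.

```python
from typing import Set


def match_ratio(words: Set[str], other: Set[str]) -> float:
    """Share of `words` that also appear in `other` (assumed 0.0 if `words` is empty)."""
    if not words:
        return 0.0
    return sum(1 for word in words if word in other) / len(words)


def word_set_similarity(first: Set[str], second: Set[str]) -> float:
    """Average of the two directional match ratios, as in the worked examples."""
    return (match_ratio(first, second) + match_ratio(second, first)) / 2


# Word sets taken from the text; assumed to be the output of morphological analysis.
target_name = {"発生者", "職位"}            # 「発生者職位」
position_occurrence = {"職位", "発生者"}    # 「職位／発生者」
position_verifier = {"職位", "検証者"}      # 「職位／検証者」

similarity_fig18 = word_set_similarity(target_name, position_occurrence)  # (2/2 + 2/2) / 2
similarity_fig19 = word_set_similarity(target_name, position_verifier)    # (1/2 + 1/2) / 2
```

Averaging the two directional ratios keeps the measure symmetric, so the result does not depend on which attribute is treated as the target.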

再び図１７に戻ると、類似属性候補抽出部３５は、該当属性毎に算出された類似度（対象属性の属性名および当該該当属性の属性名間の類似度）に基づいて、当該該当属性に対して順位づけを行う（ステップＳ４８）。具体的には、類似属性候補抽出部３５は、算出された類似度が高い該当属性の優先順位を高くするような順位づけを行う。上記したように「発生者職位」属性が対象属性であり、「職位／発生者」属性および「職位／検証者」属性が該当属性である場合には、「発生者職位」属性の属性名および「職位／検証者」属性の属性名間の類似度より「発生者職位」属性の属性名および「職位／発生者」属性の属性名間の類似度の方が高いため、「職位／発生者」属性により高い優先順位が付与される。 Returning to FIG. 17, the similar attribute candidate extraction unit 35 ranks the corresponding attributes based on the similarity calculated for each of them (the similarity between the attribute name of the target attribute and the attribute name of the corresponding attribute) (step S48). Specifically, the similar attribute candidate extraction unit 35 assigns a higher priority to a corresponding attribute with a higher calculated similarity. As described above, when the “occurrence position” attribute is the target attribute and the “position / occurrence” and “position / verifier” attributes are the corresponding attributes, the attribute-name similarity between “occurrence position” and “position / occurrence” is higher than that between “occurrence position” and “position / verifier”, so the “position / occurrence” attribute is given the higher priority.

次に、類似属性候補抽出部３５は、ステップＳ４６において類似属性候補を抽出する。この場合、類似属性候補抽出部３５は、対象属性および該当属性の各々（の組み合わせ）を類似属性候補として抽出する。 Next, the similar attribute candidate extraction unit 35 extracts similar attribute candidates in step S46. In this case, the similar attribute candidate extraction unit 35 extracts each (combination) of the target attribute and the corresponding attribute as a similar attribute candidate.

一方、上記したステップＳ４４において該当属性の数が１つであると判定された場合、類似属性候補抽出部３５は、ステップＳ４６において類似属性候補を抽出する。この場合、類似属性候補抽出部３５は、対象属性および該当属性（の組み合わせ）を類似属性候補として抽出する。 On the other hand, when it is determined in step S44 that the number of corresponding attributes is one, the similar attribute candidate extraction unit 35 extracts similar attribute candidates in step S46. In this case, the similar attribute candidate extraction unit 35 extracts the target attribute and the corresponding attribute (a combination thereof) as similar attribute candidates.

上記したステップＳ４５において対象属性と互いに類似度が最大となる属性が存在しない、つまり、予め定められた条件を満たす属性がないと判定された場合、またはステップＳ４６の処理が実行されると、対象類似度一覧表中の全ての属性について上記したステップＳ４２〜Ｓ４８の処理が実行されたか否かが判定される（ステップＳ４９）。 When it is determined in step S45 that there is no attribute having the maximum similarity with the target attribute, that is, when there is no attribute that satisfies a predetermined condition, or when the process of step S46 is executed, It is determined whether or not the above-described steps S42 to S48 have been executed for all attributes in the similarity list (step S49).

対象類似度一覧表中の全ての属性について処理が実行されていないと判定された場合（ステップＳ４９のＮＯ）、上記したステップＳ４２に戻って処理が繰り返される。この場合、ステップＳ４２〜Ｓ４８の処理が実行されていない属性を対象属性として処理が実行される。 If it is determined that processing has not been executed for all attributes in the target similarity list (NO in step S49), the process returns to step S42 described above and is repeated. In this case, the process is executed with an attribute for which the process of steps S42 to S48 has not been executed as a target attribute.

一方、対象類似度一覧表中の全てについて処理が実行されたと判定された場合（ステップＳ４９のＹＥＳ）、数値型属性類似度格納部２５および文字列型属性類似度格納部２６に格納された全ての類似度一覧表について上記したステップＳ４１〜Ｓ４９の処理が実行されたか否かが判定される（ステップＳ５０）。 On the other hand, if it is determined that the processing has been executed for all of the target similarity list (YES in step S49), all the values stored in the numerical attribute similarity storage unit 25 and the character string attribute similarity storage unit 26 are stored. It is determined whether or not the above-described steps S41 to S49 have been performed for the similarity list (step S50).

全ての類似度一覧表について処理が実行されていないと判定された場合（ステップＳ５０のＮＯ）、上記したステップＳ４１に戻って処理が繰り返される。この場合、ステップＳ４１〜Ｓ４９の処理が実行されていない類似度一覧表を対象類似度一覧表として処理が実行される。 When it is determined that the process is not executed for all the similarity list (NO in step S50), the process returns to the above step S41 and is repeated. In this case, the process is executed using the similarity list that has not been subjected to the processes of steps S41 to S49 as the target similarity list.

一方、全ての類似度一覧表について処理が実行されたと判定された場合（ステップＳ５０のＹＥＳ）、類似属性候補抽出処理は終了される。 On the other hand, when it is determined that the process has been executed for all the similarity list (YES in step S50), the similar attribute candidate extraction process is terminated.
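Taken together, steps S42 to S49 for a single similarity list can be sketched as the loop below. This is an illustrative reading of the flowchart under several assumptions: only the row attributes of the list are treated as target attributes, the similarity values are hypothetical, and the names `extract_candidates`, `mutual_best`, and `name_similarity` do not come from the source.

```python
from typing import Callable, Dict, List, Optional

SimilarityList = Dict[str, Dict[str, float]]  # row attribute -> {column attribute: similarity}


def mutual_best(similarity: SimilarityList, target: str) -> Optional[str]:
    """Step S45: return the partner only if each side is the other's best match."""
    row = similarity[target]
    if not row:
        return None
    partner = max(row, key=row.get)
    back = max(similarity, key=lambda attr: similarity[attr].get(partner, 0.0))
    return partner if back == target else None


def extract_candidates(similarity: SimilarityList, threshold: float,
                       name_similarity: Callable[[str, str], float]) -> Dict[str, List[str]]:
    candidates: Dict[str, List[str]] = {}
    for target, row in similarity.items():                      # steps S42 and S49
        hits = [attr for attr, score in row.items() if score >= threshold]
        if len(hits) >= 2:                                      # steps S47 and S48
            hits.sort(key=lambda attr: name_similarity(target, attr), reverse=True)
        elif not hits:                                          # step S45
            partner = mutual_best(similarity, target)
            hits = [] if partner is None else [partner]
        if hits:                                                # step S46
            candidates[target] = hits
    return candidates


# Hypothetical similarity values and pre-tokenized attribute names.
SIMILARITY = {
    "発生者職位": {"職位／検証者": 0.9, "職位／発生者": 0.85, "概要": 0.1, "発生原因": 0.1},
    "現象／発生者": {"職位／検証者": 0.1, "職位／発生者": 0.1, "概要": 0.6, "発生原因": 0.3},
    "発生業務": {"職位／検証者": 0.2, "職位／発生者": 0.3, "概要": 0.2, "発生原因": 0.4},
    "発生原因／発生者": {"職位／検証者": 0.1, "職位／発生者": 0.2, "概要": 0.1, "発生原因": 0.9},
}
NAME_WORDS = {
    "発生者職位": {"発生者", "職位"},
    "職位／発生者": {"職位", "発生者"},
    "職位／検証者": {"職位", "検証者"},
}


def name_similarity(first: str, second: str) -> float:
    a, b = NAME_WORDS[first], NAME_WORDS[second]
    shared = len(a & b)
    return (shared / len(a) + shared / len(b)) / 2


RESULT = extract_candidates(SIMILARITY, 0.8, name_similarity)
```

In this sketch the “発生者職位” attribute has two corresponding attributes, which are reordered by attribute-name similarity; “現象／発生者” falls back to the mutual-maximum check; and “発生業務” yields no candidate.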

上記したように類似属性候補抽出処理が実行されると、当該類似属性候補抽出処理において抽出された類似属性候補が類似属性候補格納部２７に格納される。 When the similar attribute candidate extraction process is executed as described above, the similar attribute candidates extracted in the similar attribute candidate extraction process are stored in the similar attribute candidate storage unit 27.

ここで、図２０は、類似属性候補格納部２７のデータ構造の一例を示す。類似属性候補格納部２７には、上記したように類似属性候補として抽出された異なるデータテーブルを構成する２つの属性（の属性名）が対応づけて格納されている。 Here, FIG. 20 shows an example of the data structure of the similar attribute candidate storage unit 27. The similar attribute candidate storage unit 27 stores two attributes (attribute names) constituting different data tables extracted as similar attribute candidates as described above in association with each other.

図２０に示す例では、類似属性候補格納部２７には、例えばＡ銀行のデータテーブル２２１を構成する「発生原因／発生者」属性に対応づけてＢ銀行のデータテーブル２２２を構成する「発生原因」属性が格納されている。また、類似属性候補格納部２７には、例えばＡ銀行のデータテーブル２２１を構成する「発生原因／発生者」属性に対応づけてＣ銀行のデータテーブル２２３を構成する「発生原因／発生者」属性が格納されている。 In the example shown in FIG. 20, the similar attribute candidate storage unit 27 stores, for example, the “occurrence cause” attribute of the data table 222 of bank B in association with the “occurrence cause / occurrence” attribute of the data table 221 of bank A. It also stores, for example, the “occurrence cause / occurrence” attribute of the data table 223 of bank C in association with the “occurrence cause / occurrence” attribute of the data table 221 of bank A.

これによれば、Ａ銀行のデータテーブル２２１を構成する「発生原因／発生者」属性およびＢ銀行のデータテーブル２２２を構成する「発生原因」属性が同一の内容を表す属性の候補（つまり、類似属性候補）であることが示されている。同様に、Ａ銀行のデータテーブル２２１を構成する「発生原因／発生者」属性およびＣ銀行のデータテーブル２２３を構成する「発生原因／発生者」属性が同一の内容を表す属性の候補であることが示されている。 This indicates that the “occurrence cause / occurrence” attribute of the data table 221 of bank A and the “occurrence cause” attribute of the data table 222 of bank B are candidates for attributes representing the same content (that is, a similar attribute candidate). Similarly, it indicates that the “occurrence cause / occurrence” attribute of the data table 221 of bank A and the “occurrence cause / occurrence” attribute of the data table 223 of bank C are candidates for attributes representing the same content.

また、類似属性候補格納部２７には、例えばＡ銀行のデータテーブル２２１を構成する「発生者職位」属性に対応づけてＢ銀行のデータテーブル２２２を構成する「職位／発生者」属性および「職位／検証者」属性が格納されている。これによれば、Ａ銀行のデータテーブル２２１を構成する「発生者職位」属性およびＢ銀行のデータテーブル２２２を構成する「職位／発生者」属性が同一の内容を表す属性の候補であることが示されている。また、Ａ銀行のデータテーブル２２１を構成する「発生者職位」属性およびＢ銀行のデータテーブル２２２を構成する「職位／検証者」属性が同一の内容を表す属性の候補であることが示されている。 The similar attribute candidate storage unit 27 also stores, for example, the “position / occurrence” attribute and the “position / verifier” attribute of the data table 222 of bank B in association with the “occurrence position” attribute of the data table 221 of bank A. This indicates that the “occurrence position” attribute of the data table 221 of bank A and the “position / occurrence” attribute of the data table 222 of bank B are candidates for attributes representing the same content. It likewise indicates that the “occurrence position” attribute of the data table 221 of bank A and the “position / verifier” attribute of the data table 222 of bank B are candidates for attributes representing the same content.

なお、Ａ銀行のデータテーブル２２１を構成する「発生者職位」属性に対応づけて類似属性候補格納部２７に格納されている「職位／発生者」属性および「職位／検証者」は、上記した類似属性候補抽出処理において当該「職位／発生者」属性および「職位／検証者」属性（該当属性）に対して付与された優先順位の順番に並べられる。 Note that the “position / occurrence” attribute and the “position / verifier” stored in the similar attribute candidate storage unit 27 in association with the “occurrence position” attribute constituting the data table 221 of the bank A are as described above. In the similar attribute candidate extraction process, they are arranged in the order of priority assigned to the “position / occurrence” attribute and the “position / verifier” attribute (corresponding attribute).

図２０に示すように、類似属性候補格納部２７には、類似属性候補抽出処理において類似属性候補として抽出された２つの属性の組み合わせの全てが格納されている。 As shown in FIG. 20, the similar attribute candidate storage unit 27 stores all combinations of two attributes extracted as similar attribute candidates in the similar attribute candidate extraction process.
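One possible in-memory shape for the association described for FIG. 20 is sketched below. The class `CandidateStore`, the tuple type `Attribute`, and the table labels are hypothetical; the stored pairs and their order are the ones named in the text, with the partners of the “発生者職位” attribute kept in priority order.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Attribute(NamedTuple):
    table: str  # label of the data table the attribute belongs to
    name: str   # attribute name


class CandidateStore:
    """Maps a target attribute to its similar attribute candidates, in priority order."""

    def __init__(self) -> None:
        self._pairs: Dict[Attribute, List[Attribute]] = defaultdict(list)

    def add(self, target: Attribute, partner: Attribute) -> None:
        # Insertion order is preserved, so callers add partners by descending priority.
        if partner not in self._pairs[target]:
            self._pairs[target].append(partner)

    def partners(self, target: Attribute, table: Optional[str] = None) -> List[Attribute]:
        found = self._pairs.get(target, [])
        return [p for p in found if table is None or p.table == table]


store = CandidateStore()
cause = Attribute("bank A (221)", "発生原因／発生者")
position = Attribute("bank A (221)", "発生者職位")
store.add(cause, Attribute("bank B (222)", "発生原因"))
store.add(cause, Attribute("bank C (223)", "発生原因／発生者"))
store.add(position, Attribute("bank B (222)", "職位／発生者"))
store.add(position, Attribute("bank B (222)", "職位／検証者"))
```

Keeping the partners in a list rather than a set is a design choice that preserves the priority order assigned during extraction.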

なお、図２０に示す類似属性候補格納部２７に格納された類似属性候補は、上述したように例えばデータテーブル格納部２２に格納されたＡ〜Ｃ銀行のデータテーブル２２１〜２２３（に保持されるデータ）のデータ分析において当該Ａ〜Ｃ銀行のデータテーブル２２１〜２２３を比較する際に利用されることができる。 As described above, the similar attribute candidates stored in the similar attribute candidate storage unit 27 shown in FIG. 20 can be used, for example, to compare the data tables 221 to 223 of banks A to C when analyzing those tables (the data held in them), which are stored in the data table storage unit 22.

上記したように本実施形態においては、テーブル格納部２２に格納されている複数のテーブルを構成する文字列型属性が有する属性値に含まれる文字列を構成する単語を抽出し、当該抽出された単語に基づいて異なるテーブルを構成する２つの文字列型属性間の類似度を算出し、当該算出された類似度に基づいて当該２つの文字列型属性を類似属性候補として抽出する構成により、属性の特徴のみを利用し、また属性値の意味を考慮して任意のデータテーブル間において適切な属性の対応づけを行うことが可能となる。 As described above, in the present embodiment, the words constituting the character strings included in the attribute values of the character string type attributes of the plurality of tables stored in the table storage unit 22 are extracted, the similarity between two character string type attributes of different tables is calculated based on the extracted words, and the two character string type attributes are extracted as a similar attribute candidate based on the calculated similarity. With this configuration, appropriate attributes can be associated between arbitrary data tables using only the characteristics of the attributes while taking the meaning of the attribute values into account.

また、本実施形態においては、テーブル格納部２２に格納されている複数のテーブルを構成する数値型属性が有する属性値に含まれる数値の範囲に基づいて、異なるテーブルを構成する２つの数値型属性間の類似度を算出し、当該算出された類似度に基づいて当該２つの数値型属性を類似属性候補として抽出する構成により、数値型属性についても任意のデータテーブル間において適切な対応づけを行うことができる。 Further, in the present embodiment, two numerical type attributes that configure different tables based on the range of numerical values included in the attribute values of the numerical type attributes that configure the plurality of tables stored in the table storage unit 22. By calculating the similarity between the two, and extracting the two numeric type attributes as similar attribute candidates based on the calculated similarity, the numeric type attributes are also appropriately associated between arbitrary data tables. be able to.

更に、本実施形態においては、複数のテーブルを構成する数値型属性を数値型属性小分類に分類することにより、例えば数値範囲を比較することができない２つの数値型属性間の類似度を算出することを回避し、数値型属性の適切な対応づけを行うことができる。 Furthermore, in this embodiment, by classifying the numerical type attributes constituting a plurality of tables into numerical type attribute subcategories, for example, the similarity between two numerical type attributes that cannot be compared in numerical range is calculated. This can be avoided and appropriate association of numeric type attributes can be performed.

（第２の実施形態） (Second Embodiment)
次に、第２の実施形態について説明する。本実施形態に係るデータ分析支援装置のハードウェア構成および機能構成は、前述した第１の実施形態と同様であるため、適宜、図１および図２を用いて説明する。 Next, a second embodiment will be described. The hardware configuration and functional configuration of the data analysis support apparatus according to this embodiment are the same as those of the first embodiment described above, and will be described with reference to FIGS. 1 and 2 as appropriate.

なお、本実施形態においては、データ分析支援装置３０に含まれる文字列型属性処理部３３が２つの文字列型属性の類似度を算出する際に類義語辞書を用いる点が、前述した第１の実施形態とは異なる。 In the present embodiment, the point that the character string type attribute processing unit 33 included in the data analysis support device 30 uses the synonym dictionary when calculating the similarity between the two character string type attributes is the first described above. Different from the embodiment.

ここで、図２１は、本実施形態に係るデータ分析支援装置３０に含まれる文字列型属性処理部３３の機能構成を示すブロック図である。 Here, FIG. 21 is a block diagram illustrating a functional configuration of the character string type attribute processing unit 33 included in the data analysis support device 30 according to the present embodiment.

文字列型属性処理部３３は、類義語辞書格納部３３４および文字列型属性類似度算出部３３５を含む。本実施形態において、類義語辞書格納部３３４は、例えば図１に示す外部記憶装置２０に格納される。 The character string type attribute processing unit 33 includes a synonym dictionary storage unit 334 and a character string type attribute similarity calculation unit 335. In the present embodiment, the synonym dictionary storage unit 334 is stored, for example, in the external storage device 20 shown in FIG.

類義語辞書格納部３３４には、意味が類似する（つまり、意味の似かよった）複数の単語が類義語として登録された類義語辞書が予め格納されている。 The synonym dictionary storage unit 334 stores in advance a synonym dictionary in which a plurality of words having similar meanings (that is, having similar meanings) are registered as synonyms.

文字列型属性類似度算出部３３５は、属性値単語集合格納部３３１によって格納された各文字列型属性の単語集合および類義語辞書格納部３３４に格納されている類義語辞書に基づいて、異なるデータテーブルを構成する２つの文字列型属性間の類似度を算出する。この場合、文字列型属性類似度算出部３３５は、前述した第１の実施形態における文字列型属性類似度算出部３３３と同様に、２つの文字列型属性（第１および第２の文字列型属性）の単語集合間で一致する単語の数を特定する。このとき、文字列型属性類似度算出部３３５によって特定される２つの文字列型属性の単語集合間で一致する単語には、完全に一致した単語だけではなく、類義語辞書格納部３３４に格納されている類義語辞書に登録されている意味が類似する単語が含まれる。つまり、本実施形態における文字列型属性類似度算出部３３５においては、意味が類似する単語についても一致したものとみなされる。 The character string type attribute similarity calculation unit 335 generates a different data table based on the word set of each character string type attribute stored by the attribute value word set storage unit 331 and the synonym dictionary stored in the synonym dictionary storage unit 334. The degree of similarity between the two character string type attributes that constitutes is calculated. In this case, the character string type attribute similarity calculation unit 335 has two character string type attributes (first and second character strings) in the same manner as the character string type attribute similarity calculation unit 333 in the first embodiment described above. The number of words that match between word sets of type attribute) is specified. At this time, the words that match between the word sets of the two character string type attributes specified by the character string type attribute similarity calculation unit 335 are stored in the synonym dictionary storage unit 334 as well as the completely matched words. Words having similar meanings registered in the synonym dictionary. That is, in the character string type attribute similarity calculation unit 335 in the present embodiment, words having similar meanings are also considered to match.

なお、文字列型属性類似度算出部３３５は、この点以外については、前述した第１の実施形態における文字列型属性類似度算出部３３３と同様の機能を有する。 The character string type attribute similarity calculation unit 335 has the same function as the character string type attribute similarity calculation unit 333 in the first embodiment except for this point.

ここで、本実施形態に係るデータ分析支援装置３０の動作について説明する。なお、本実施形態に係るデータ分析支援装置３０において実行される処理のうち類似度算出処理（前述した図８に示すステップＳ４の処理）に含まれる文字列型属性の類似度算出処理以外の処理については前述した第１の実施形態と同様であるため、その詳しい説明を省略する。 Here, the operation of the data analysis support device 30 according to the present embodiment will be described. Of the processes executed in the data analysis support device 30 according to the present embodiment, processes other than the character string type attribute similarity calculation process included in the similarity calculation process (the process of step S4 shown in FIG. 8 described above). Since is the same as that of the first embodiment described above, detailed description thereof is omitted.

以下、本実施形態における文字列型属性の類似度算出処理の処理手順について説明する。ここでは、便宜的に、図１３のフローチャートを参照して説明する。 Hereinafter, the processing procedure of the similarity calculation processing for character string type attributes in this embodiment will be described. For convenience, the description refers to the flowchart of FIG. 13.

まず、図１３に示すステップＳ２１〜Ｓ２３の処理が実行される。このステップＳ２１〜Ｓ２３の処理については、前述した第１の実施形態において説明した通りであるため、その詳しい説明を省略する。 First, the processes of steps S21 to S23 shown in FIG. 13 are executed. Since the processes in steps S21 to S23 are the same as those described in the first embodiment, detailed description thereof is omitted.

ステップＳ２３において文字列型属性格納部２４に格納された全ての文字列型属性について処理が実行されたと判定された場合、文字列型属性類似度算出部３３５は、例えば文字列型属性格納部２４に格納された異なるデータテーブルを構成する２つの文字列型属性の組み合わせ（文字列型属性ペア）の各々に対して以下のステップＳ２４〜Ｓ２６の処理を実行する。ここでは、この処理の対象となる文字列型属性ペアを対象文字列型属性ペアと称する。また、対象文字列型属性ペアに含まれる一方の文字列型属性を第１の文字列型属性、他方の文字列型属性を第２の文字列型属性と称する。 When it is determined in step S23 that processing has been executed for all the character string type attributes stored in the character string type attribute storage unit 24, the character string type attribute similarity calculation unit 335 executes the processes of steps S24 to S26 below for, for example, each combination of two character string type attributes of different data tables stored in the character string type attribute storage unit 24 (each character string type attribute pair). Here, the character string type attribute pair being processed is referred to as the target character string type attribute pair. One character string type attribute in the target pair is referred to as the first character string type attribute, and the other as the second character string type attribute.

この場合、文字列型属性類似度算出部３３５は、対象文字列型属性ペアに含まれる第１および第２の文字列型属性の単語集合を属性値単語集合格納部３３２から取得する。 In this case, the character string type attribute similarity calculation unit 335 acquires the word set of the first and second character string type attributes included in the target character string type attribute pair from the attribute value word set storage unit 332.

文字列型属性類似度算出部３３５は、取得された第１および第２の文字列型属性の単語集合と類義語辞書格納部３３４に格納されている類義語辞書を参照して、当該第１および第２の文字列型属性の単語集合間で一致する単語の数を特定する（ステップＳ２４）。この場合、文字列型属性類似度算出部３３５は、第１の文字列型属性の単語集合に含まれる単語のうち、第２の文字列型属性の単語集合に含まれる単語と一致する単語および類似する単語の数（第１の文字列型属性の一致数）を特定する。また、文字列型属性類似度算出部３３５は、第２の文字列型属性の単語集合に含まれる単語のうち、第１の文字列型属性の単語集合に含まれる単語と一致する単語および類似する単語の数（第２の文字列型属性の一致数）を特定する。 The character string type attribute similarity calculation unit 335 refers to the acquired word sets of the first and second character string type attributes and to the synonym dictionary stored in the synonym dictionary storage unit 334, and specifies the number of words that match between the two word sets (step S24). In this case, the character string type attribute similarity calculation unit 335 specifies, among the words in the word set of the first character string type attribute, the number of words that match or are similar to a word in the word set of the second character string type attribute (the match count of the first character string type attribute). It likewise specifies, among the words in the word set of the second character string type attribute, the number of words that match or are similar to a word in the word set of the first character string type attribute (the match count of the second character string type attribute).

The character string type attribute similarity calculation unit 335 calculates the similarity between the first and second character string type attributes based on the identified match counts of the first and second character string type attributes (step S25). In this case, as in the first embodiment described above, the character string type attribute similarity calculation unit 335 calculates the similarity using the match rate of the words included in the word set of the first character string type attribute (the word match rate of the first character string type attribute) and the match rate of the words included in the word set of the second character string type attribute (the word match rate of the second character string type attribute).

Here, the similarity between the first and second character string type attributes calculated by the character string type attribute similarity calculation unit 335 will be described concretely with reference to FIG. 22.

Here, it is assumed that the first character string type attribute is the "cause of occurrence / originator" (発生原因／発生者) attribute constituting the data table 221 of bank A shown in FIG. 5 described above, and that the second character string type attribute is the "cause of occurrence" (発生原因) attribute constituting the data table 222 of bank B shown in FIG. 6 described above.

As shown in FIG. 22, the word set of the first character string type attribute (that is, the "cause of occurrence / originator" attribute) is assumed to include 11 words, specifically the words "experience" (経験), "shortage" (不足), "guidance" (指導), "education" (教育), "third party" (第三者), "accident" (事故), "client" (客), "request" (依頼), "mistake" (ミス), "entry" (記入) and "error" (誤り). The word set of the second character string type attribute (that is, the "cause of occurrence" attribute) is assumed to include 10 words, specifically the words "knowledge" (知識), "experience" (経験), "education" (教育), "shortage" (不足), "customer" (顧客), "accident" (事故), "complexity" (複雑), "work" (作業), "content" (内容) and "careless mistake" (ケアレスミス).

It is assumed that, in the synonym dictionary stored in the synonym dictionary storage unit 334 included in the character string type attribute processing unit 33, the words "guidance" (指導) and "education" (教育) are registered as synonyms. It is also assumed that the words "client" (客) and "customer" (顧客) are registered as synonyms, and that the words "mistake" (ミス), "error" (誤り) and "careless mistake" (ケアレスミス) are registered as synonyms.

Here, among the words included in the word set of the first character string type attribute, the words that exactly match a word included in the word set of the second character string type attribute are "experience", "shortage", "education" and "accident". According to the synonym dictionary described above, the words in the word set of the first character string type attribute that are similar in meaning to a word in the word set of the second character string type attribute (that is, synonyms) are "guidance", which is similar in meaning to the word "education" in the second word set; "client", which is similar in meaning to the word "customer"; and "mistake" and "error", which are similar in meaning to the word "careless mistake". In this case, the match count of the first character string type attribute is 8 (the words "experience", "shortage", "guidance", "education", "accident", "client", "mistake" and "error").

On the other hand, among the words included in the word set of the second character string type attribute, the words that exactly match a word included in the word set of the first character string type attribute are "experience", "education", "shortage" and "accident". According to the synonym dictionary described above, the words in the word set of the second character string type attribute that are similar in meaning to a word in the word set of the first character string type attribute (that is, synonyms) are "education", which is similar in meaning to the word "guidance" in the first word set; "customer", which is similar in meaning to the word "client"; and "careless mistake", which is similar in meaning to the words "mistake" and "error". In this case, the match count of the second character string type attribute is 6 (the words "experience", "education", "shortage", "customer", "accident" and "careless mistake"). When, as with the words "education" and "careless mistake", a word has more than one exactly matching or similar word in the word set of the first character string type attribute, that word is counted only once.

In FIG. 22, two words that match exactly are connected by a solid line, and two words that are similar in meaning are connected by a broken line.

As described above, the word set of the first character string type attribute includes 11 words, so the word match rate of the first character string type attribute is 8/11. The word set of the second character string type attribute includes 10 words, so the word match rate of the second character string type attribute is 6/10.

Accordingly, the similarity between the first and second character string type attributes is calculated as the average of 8/11 and 6/10, that is, (8/11 + 6/10)/2 ≈ 0.664.
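The FIG. 22 example can be checked with a short script. This is a minimal sketch under stated assumptions, not the implementation disclosed in the specification: the synonym dictionary is modeled as a list of synonym groups, and the names `synonym_groups`, `has_counterpart` and `match_count` are hypothetical. The word sets and synonym entries are the ones given in the description, kept as the original Japanese tokens.

```python
# Word sets of the two character string type attributes (FIG. 22).
first_words = {"経験", "不足", "指導", "教育", "第三者", "事故",
               "客", "依頼", "ミス", "記入", "誤り"}            # 11 words
second_words = {"知識", "経験", "教育", "不足", "顧客", "事故",
                "複雑", "作業", "内容", "ケアレスミス"}          # 10 words

# Assumption: the synonym dictionary is represented as groups of synonyms.
synonym_groups = [
    {"指導", "教育"},                  # guidance / education
    {"客", "顧客"},                    # client / customer
    {"ミス", "誤り", "ケアレスミス"},    # mistake / error / careless mistake
]


def has_counterpart(word, other_words, groups):
    """True if `word` matches exactly, or has a synonym, in `other_words`."""
    if word in other_words:
        return True
    return any(word in group and bool((group - {word}) & other_words)
               for group in groups)


def match_count(words, other_words, groups):
    """Count words with a counterpart; each word is counted only once."""
    return sum(1 for word in words
               if has_counterpart(word, other_words, groups))


first_matches = match_count(first_words, second_words, synonym_groups)
second_matches = match_count(second_words, first_words, synonym_groups)

# Similarity is the average of the two word match rates.
similarity = (first_matches / len(first_words)
              + second_matches / len(second_words)) / 2

print(first_matches, second_matches, f"{similarity:.3f}")  # 8 6 0.664
```

Because each word set is iterated once per word, a word with several counterparts (for example 教育, which matches 教育 exactly and 指導 as a synonym) contributes only one to the match count, in line with the counting rule stated above.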

Returning again to FIG. 13, the processing of steps S25 to S27 is executed. This processing is as described in the first embodiment above, so a detailed description of it is omitted.

As described above, in this embodiment the synonym dictionary stored in the synonym dictionary storage unit 334 is used so that, when the similarity between two character string type attributes is calculated, words that are similar in meaning (synonyms) are included among the words identified as matching between the word sets of those attributes. A word that would not be treated as a matching word in the first embodiment is therefore treated in the same way as a matching word if its meaning is similar, so the similarity between the two character string type attributes can be calculated more appropriately.

In this embodiment, the synonym dictionary has been described as being used when calculating the similarity between two character string type attributes (that is, when identifying the number of words that match between the word sets). However, the synonym dictionary may also be used when calculating the similarity between attribute names described in the first embodiment.

According to at least one of the embodiments described above, it is possible to provide a data analysis support apparatus and program capable of appropriately associating attributes between arbitrary data tables.

The present invention is not limited to the above embodiments as they are, and at the implementation stage it can be embodied with the constituent elements modified without departing from the gist of the invention. Various inventions can be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiments. For example, some constituent elements may be deleted from all of the constituent elements shown in each embodiment. Furthermore, constituent elements of different embodiments may be combined as appropriate.

Description of reference signs: 10 ... computer, 20 ... external storage device, 22 ... data table storage unit, 23 ... numeric type attribute storage unit, 24 ... character string type attribute storage unit, 25 ... numeric type attribute similarity storage unit, 26 ... character string type attribute similarity storage unit, 27 ... similar attribute candidate storage unit, 31 ... attribute type classification unit, 32 ... numeric type attribute processing unit, 33 ... character string type attribute processing unit, 34 ... threshold input unit, 35 ... similar attribute candidate extraction unit, 321 ... attribute value numeric range identification unit, 322 ... attribute value numeric range storage unit, 323 ... numeric type attribute similarity calculation unit, 331 ... attribute value word extraction unit, 332 ... attribute value word set storage unit, 333 ... character string type attribute similarity calculation unit, 334 ... synonym dictionary storage unit, 335 ... character string type attribute similarity calculation unit.

Claims

A data analysis support apparatus comprising:
data table storage means for storing in advance a first data table composed of a first attribute including a first character string type attribute having an attribute value that includes a character string, and a second data table composed of a second attribute including a second character string type attribute having an attribute value that includes a character string;
first word extraction means for extracting a first word constituting the character string included in the attribute value of the first character string type attribute included in the first attribute constituting the first data table stored in the data table storage means;
second word extraction means for extracting a second word constituting the character string included in the attribute value of the second character string type attribute included in the second attribute constituting the second data table stored in the data table storage means;
similarity calculation means for calculating, based on the first word extracted by the first word extraction means and the second word extracted by the second word extraction means, the similarity between the first character string type attribute included in the first attribute constituting the first data table and the second character string type attribute included in the second attribute constituting the second data table; and
similar attribute candidate extraction means for extracting, based on the calculated similarity, the first character string type attribute included in the first attribute constituting the first data table and the second character string type attribute included in the second attribute constituting the second data table as similar attribute candidates.

The data analysis support apparatus according to claim 1, wherein the similarity calculation means calculates the similarity between the first character string type attribute and the second character string type attribute based on the ratio of the number of first words that match a second word to the number of first words, and the ratio of the number of second words that match a first word to the number of second words.

The data analysis support apparatus as described, wherein:
the first attribute constituting the first data table stored in the data table storage means includes a plurality of first character string type attributes;
the second attribute constituting the second data table stored in the data table storage means includes a plurality of second character string type attributes;
the first word extraction means extracts, for each first character string type attribute included in the first attribute constituting the first data table stored in the data table storage means, a first word constituting the character string included in the attribute value of that first character string type attribute;
the second word extraction means extracts, for each second character string type attribute included in the second attribute constituting the second data table stored in the data table storage means, a second word constituting the character string included in the attribute value of that second character string type attribute;
the similarity calculation means calculates, for each combination of a first character string type attribute and a second character string type attribute, the similarity between that first character string type attribute and that second character string type attribute, based on the first word extracted by the first word extraction means from the character string included in the attribute value of the first character string type attribute and the second word extracted by the second word extraction means from the character string included in the attribute value of the second character string type attribute; and
the similar attribute candidate extraction means extracts as similar attribute candidates, based on the similarity calculated for each combination of a first character string type attribute and a second character string type attribute, a first character string type attribute and a second character string type attribute whose similarity to each other is the maximum compared with their similarities to the other second and first character string type attributes.

The data analysis support apparatus according to claim 1, wherein:
the first attribute constituting the first data table stored in the data table storage means further includes a first numeric type attribute having an attribute value that includes a numeric value;
the second attribute constituting the second data table stored in the data table storage means further includes a second numeric type attribute having an attribute value that includes a numeric value;
the similarity calculation means calculates the similarity between the first numeric type attribute and the second numeric type attribute based on the range of the numeric values included in the attribute values of the first numeric type attribute included in the first attribute and the range of the numeric values included in the attribute values of the second numeric type attribute included in the second attribute; and
the similar attribute candidate extraction means extracts the first numeric type attribute and the second numeric type attribute as similar attribute candidates based on the calculated similarity between the first numeric type attribute and the second numeric type attribute.

The data analysis support apparatus according to claim 4, further comprising:
first classification means for classifying the first numeric type attribute according to the numeric value included in the attribute value of the first numeric type attribute; and
second classification means for classifying the second numeric type attribute according to the numeric value included in the attribute value of the second numeric type attribute,
wherein the similarity calculation means calculates the similarity between a first numeric type attribute and a second numeric type attribute that are assigned to the same classification by the first classification means and the second classification means.

A program executed by a computer in a data analysis support apparatus composed of an external storage device and the computer, which uses the external storage device, the external storage device having data table storage means for storing in advance a first data table composed of a first attribute including a first character string type attribute having an attribute value that includes a character string, and a second data table composed of a second attribute including a second character string type attribute having an attribute value that includes a character string, the program causing the computer to execute:
a step of extracting a first word constituting the character string included in the attribute value of the first character string type attribute included in the first attribute constituting the first data table stored in the data table storage means;
a step of extracting a second word constituting the character string included in the attribute value of the second character string type attribute included in the second attribute constituting the second data table stored in the data table storage means;
a step of calculating, based on the extracted first word and the extracted second word, the similarity between the first character string type attribute included in the first attribute constituting the first data table and the second character string type attribute included in the second attribute constituting the second data table; and
a step of extracting, based on the calculated similarity, the first character string type attribute included in the first attribute constituting the first data table and the second character string type attribute included in the second attribute constituting the second data table as similar attribute candidates.