JP2007058311A - Corpus addition apparatus and corpus addition method

Info

Publication number: JP2007058311A
Application number: JP2005239999A
Authority: JP
Inventors: Denoual Etienne; ドヌアールエティエンヌ
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-08-22
Filing date: 2005-08-22
Publication date: 2007-03-08

Abstract

[Problem] To create a corpus for better natural language processing.
[Solution] A corpus adder comprises: a first corpus storage unit 11 in which a first corpus is stored; a second corpus storage unit 12 in which a second corpus is stored; a first similarity coefficient storage unit 16 in which the similarity coefficients of subsets of the first corpus are stored; a second similarity coefficient storage unit 17 in which the similarity coefficients of subsets of the second corpus are stored; a subset extraction unit 18 that extracts, from the second corpus, a set of subsets of the second corpus having the same representative value as the representative value of the similarity coefficients in the first corpus; an addition unit 19 that adds the set of subsets of the second corpus extracted by the subset extraction unit 18 to the first corpus; and an output unit 20 that outputs the corpus produced by the addition unit 19.
[Selected drawing] Figure 1

Description

The present invention relates to a corpus adder and the like that adds at least a part of another corpus to one corpus.

In corpus-based natural language processing (for example, machine translation and summarization), research has been conducted on what kind of corpus should be used to obtain better processing. For example, it has been proposed that the corpus used in natural language processing be homogeneous rather than heterogeneous (see, for example, Non-Patent Document 1).

As a related technique, it has been proposed to use a coefficient based on cross-entropy as an index of the homogeneity of a corpus (see, for example, Non-Patent Document 2).
Non-Patent Document 1: Gabriela Cavaglia, "Measuring corpus homogeneity using a range of measures for inter-document distance", Proceedings of LREC, 2002, pp. 426-431
Non-Patent Document 2: Etienne Denoual, "A method to quantify corpus similarity and its application to quantifying the degree of literality in a document", Proceedings of the International Workshop on Human Language Technology, Hong Kong, 2004

As described above, there is a need to provide a corpus that enables better natural language processing.

The present invention has been made in view of the above circumstances. Its object is to provide a corpus adder and the like that creates a corpus for better natural language processing by adding at least a part of another corpus to one corpus.

To achieve the above object, a corpus adder according to the present invention comprises: a first corpus storage unit in which a first corpus is stored; a second corpus storage unit in which a second corpus is stored; a first similarity coefficient storage unit in which the similarity coefficients of subsets of the first corpus are stored; a second similarity coefficient storage unit in which the similarity coefficients of subsets of the second corpus are stored; a subset extraction unit that extracts, from the second corpus, a set of subsets of the second corpus having the same representative value as the representative value of the similarity coefficients of the subsets of the first corpus; an addition unit that adds the set of subsets of the second corpus extracted by the subset extraction unit to the first corpus; and an output unit that outputs the corpus produced by the addition unit.

With such a configuration, a corpus for better natural language processing can be created, and higher performance can be obtained by executing natural language processing with that corpus.

The corpus adder according to the present invention may further comprise: a first reference corpus storage unit in which a first reference corpus is stored; a second reference corpus storage unit in which a second reference corpus is stored; and a similarity coefficient calculation unit that calculates the similarity coefficient of each subset of the first corpus stored in the first corpus storage unit and of the second corpus stored in the second corpus storage unit, using the first reference corpus stored in the first reference corpus storage unit and the second reference corpus stored in the second reference corpus storage unit as reference corpora. In that case, the similarity coefficients of the subsets of the first corpus stored in the first similarity coefficient storage unit, and the similarity coefficients of the subsets of the second corpus stored in the second similarity coefficient storage unit, may be those calculated by the similarity coefficient calculation unit.
With such a configuration, the similarity coefficients can be calculated using the first reference corpus and the second reference corpus.

In the corpus adder according to the present invention, the subset extraction unit may extract, from the second corpus, the subsets to be added to the first corpus such that the shape of the distribution of similarity coefficients in the corpus resulting from the addition is the same as the shape of the distribution of similarity coefficients in the first corpus.

With such a configuration, a corpus can be created in which heterogeneous components are added to the first corpus while the representative value of the first corpus is maintained, and better natural language processing can be executed by using that corpus.

In the corpus adder according to the present invention, when the distribution obtained by multiplying the distribution of similarity coefficients in the first corpus by a predetermined value is contained in the distribution of similarity coefficients in the second corpus, the subset extraction unit may extract, from the second corpus, a set of subsets of the second corpus whose distribution is the same as the distribution of similarity coefficients in the first corpus multiplied by the predetermined value.

With such a configuration, at least a part of the second corpus can be added to the first corpus without changing the shape of the distribution of similarity coefficients in the first corpus.

In the corpus adder according to the present invention, the predetermined value may be the largest value for which the distribution obtained by multiplying the distribution of similarity coefficients in the first corpus by that value is contained in the distribution of similarity coefficients in the second corpus.

With such a configuration, as many subsets of the second corpus as possible can be added to the first corpus. As a result, a corpus with a large number of subsets, and hence a corpus for executing better natural language processing, can be created.
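The scaling idea above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of the embodiment: the distributions are approximated by equal-width histograms over [0, 1], the bin count `bins`, the function names `bin_indices` and `extract_scaled`, and the choice of simply taking the first subsets found in each bin are all hypothetical.

```python
import math
from collections import defaultdict


def bin_indices(coeffs, bins=10):
    """Group subset indices by the histogram bin of their similarity coefficient."""
    buckets = defaultdict(list)
    for idx, c in enumerate(coeffs):
        b = min(max(int(c * bins), 0), bins - 1)  # clamp into [0, bins - 1]
        buckets[b].append(idx)
    return buckets


def extract_scaled(first_coeffs, second_coeffs, bins=10):
    """Return (k, indices into the second corpus).

    k is the largest factor such that k times the first corpus's histogram
    still fits inside the second corpus's histogram in every occupied bin.
    """
    h1 = bin_indices(first_coeffs, bins)
    h2 = bin_indices(second_coeffs, bins)
    k = min(len(h2.get(b, [])) / len(members) for b, members in h1.items())
    chosen = []
    for b, members in h1.items():
        # Small epsilon guards against floating-point error just below an integer.
        take = math.floor(k * len(members) + 1e-9)
        chosen.extend(h2.get(b, [])[:take])  # arbitrary pick within the bin (assumption)
    return k, sorted(chosen)
```

Because every bin of the first corpus is scaled by the same factor, the extracted set keeps the shape of the first corpus's histogram up to the bin width and the rounding of per-bin counts.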

According to the corpus adder and the like of the present invention, a corpus that enables better natural language processing can be created.

Hereinafter, the corpus adder and the like according to the present invention will be described by way of embodiments. In the following embodiments, components and steps denoted by the same reference numerals are the same or equivalent, and repeated description of them may be omitted.

(Embodiment 1)
A corpus adder according to Embodiment 1 of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a corpus adder 1 according to this embodiment. In FIG. 1, the corpus adder 1 according to the present embodiment comprises a first corpus storage unit 11, a second corpus storage unit 12, a first reference corpus storage unit 13, a second reference corpus storage unit 14, a similarity coefficient calculation unit 15, a first similarity coefficient storage unit 16, a second similarity coefficient storage unit 17, a subset extraction unit 18, an addition unit 19, and an output unit 20.

The first corpus storage unit 11 stores the first corpus. Here, the first corpus is the corpus that is added in its entirety in the corpus adder 1. A corpus is an electronic database of language (documents). The process by which the first corpus comes to be stored in the first corpus storage unit 11 does not matter. For example, the first corpus may be stored in the first corpus storage unit 11 via a recording medium, a first corpus transmitted via a communication line or the like may be stored there, or a first corpus entered via an input device or the like may be stored there. The first corpus storage unit 11 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk). Storage in the first corpus storage unit 11 may be temporary storage, in RAM or the like, of the first corpus read from an external storage device or the like, or may be long-term storage on a hard disk or the like.

The second corpus storage unit 12 stores the second corpus. Here, the second corpus is the corpus of which at least a part is added in the corpus adder 1. The process by which the second corpus comes to be stored in the second corpus storage unit 12 does not matter. For example, the second corpus may be stored in the second corpus storage unit 12 via a recording medium, a second corpus transmitted via a communication line or the like may be stored there, or a second corpus entered via an input device or the like may be stored there. The second corpus storage unit 12 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk). Storage in the second corpus storage unit 12 may be temporary storage, in RAM or the like, of the second corpus read from an external storage device or the like, or may be long-term storage on a hard disk or the like.

The first reference corpus storage unit 13 stores the first reference corpus. Here, the first reference corpus is used as one of the reference corpora when the similarity coefficient calculation unit 15, described later, calculates the similarity coefficients of the first corpus and the like. A reference corpus is a corpus that serves as the basis for calculating a similarity coefficient. The process by which the first reference corpus comes to be stored in the first reference corpus storage unit 13 does not matter. For example, the first reference corpus may be stored in the first reference corpus storage unit 13 via a recording medium, a first reference corpus transmitted via a communication line or the like may be stored there, or a first reference corpus entered via an input device or the like may be stored there. The first reference corpus storage unit 13 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk). Storage in the first reference corpus storage unit 13 may be temporary storage, in RAM or the like, of the first reference corpus read from an external storage device or the like, or may be long-term storage on a hard disk or the like.

The second reference corpus storage unit 14 stores the second reference corpus. Here, the second reference corpus is used as one of the reference corpora when the similarity coefficient calculation unit 15, described later, calculates the similarity coefficients of the first corpus and the like. The first reference corpus and the second reference corpus are preferably as far apart as possible. Corpora being far apart means that they are not similar. For example, a corpus of spoken language may be used as the first reference corpus and a corpus of written (literary) language as the second reference corpus. Alternatively, a corpus of casual language may be used as the first reference corpus and a corpus of polite language as the second reference corpus. Alternatively, a corpus of new-style language may be used as the first reference corpus and a corpus of old-style language as the second reference corpus. In each of these examples, the roles of the first reference corpus and the second reference corpus may be swapped.
The process by which the second reference corpus comes to be stored in the second reference corpus storage unit 14 does not matter. For example, the second reference corpus may be stored in the second reference corpus storage unit 14 via a recording medium, a second reference corpus transmitted via a communication line or the like may be stored there, or a second reference corpus entered via an input device or the like may be stored there. The second reference corpus storage unit 14 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk). Storage in the second reference corpus storage unit 14 may be temporary storage, in RAM or the like, of the second reference corpus read from an external storage device or the like, or may be long-term storage on a hard disk or the like.

Any two or more of the first corpus storage unit 11, the second corpus storage unit 12, the first reference corpus storage unit 13, and the second reference corpus storage unit 14 may be realized by the same recording medium, or they may be realized by separate recording media. In the former case, for example, the area that stores the first corpus serves as the first corpus storage unit 11, and the area that stores the second corpus serves as the second corpus storage unit 12.

The similarity coefficient calculation unit 15 calculates the similarity coefficients of the subsets of the first corpus stored in the first corpus storage unit 11, using the first reference corpus stored in the first reference corpus storage unit 13 and the second reference corpus stored in the second reference corpus storage unit 14 as reference corpora. In the same way, using the same two reference corpora, the similarity coefficient calculation unit 15 calculates the similarity coefficients of the subsets of the second corpus stored in the second corpus storage unit 12. Here, a similarity coefficient is an index indicating which of two reference corpora a given corpus (including a subset of a corpus) is closer to. For example, when the similarity coefficient I(A) is calculated using cross-entropy, it is given as follows.

From the above equation, I(T1) = 0 and I(T2) = 1. The value of the similarity coefficient I(A), between 0 and 1, therefore indicates whether the test corpus A is closer to the corpus T1 or to the corpus T2. The cross-entropy H_T(A) of an N-gram model p, built from a training corpus T, on a test corpus A is given as follows, where the test corpus A consists of Q sentences, A = {s_1, ..., s_Q}.

Each sentence s_i consists of |s_i| characters, as follows.
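The formulas referred to in this passage are not reproduced in the text. As a hedged reconstruction, assuming the usual per-character N-gram cross-entropy and one normalization that satisfies the stated conditions I(T1) = 0 and I(T2) = 1, they might be written as below. The symbols c_j^i (the j-th character of sentence s_i) and p_j^i (its N-gram probability) are notation introduced here, and the form of I(A) is an assumption rather than the formula of the original.

```latex
\begin{align}
I(A) &= \frac{H_{T_1}(A) - H_{T_1}(T_1)}
             {\bigl(H_{T_1}(A) - H_{T_1}(T_1)\bigr) + \bigl(H_{T_2}(A) - H_{T_2}(T_2)\bigr)} \\
H_T(A) &= -\frac{1}{\sum_{i=1}^{Q} |s_i|} \sum_{i=1}^{Q} \sum_{j=1}^{|s_i|} \log p_j^i \\
s_i &= \{ c_1^i, \ldots, c_{|s_i|}^i \}, \qquad
p_j^i = p\bigl(c_j^i \mid c_{j-N+1}^i \cdots c_{j-1}^i\bigr)
\end{align}
```

With this form, substituting A = T1 makes the numerator zero, and substituting A = T2 makes the second term of the denominator zero, which yields the two boundary values given in the text.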

The similarity coefficient calculation unit 15 calculates the similarity coefficient of a subset of the first corpus by taking the first reference corpus as the training corpus T1, the second reference corpus as the training corpus T2, and the subset of the first corpus as the test corpus A. The similarity coefficient of a subset of the second corpus is calculated in the same way. Here, a subset of a corpus is a part of that corpus of arbitrary size. A subset may be, for example, a single sentence, a set of two or more sentences, or a set of phrases. In this embodiment, the case where each subset is one sentence is described. Although this embodiment describes the case where cross-entropy is used to calculate the similarity coefficient, the similarity coefficient may be calculated by other methods. For example, a χ² test or the like may be used instead of cross-entropy. The similarity coefficient calculation unit 15 may also calculate the similarity coefficient with the first reference corpus as the training corpus T2 and the second reference corpus as the training corpus T1.
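The calculation performed by the similarity coefficient calculation unit 15 can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the character trigram order, the add-one smoothing, the padding character `PAD`, the function names, and above all the normalization used in `similarity_coefficient`, which is merely one form that gives 0 for T1 and 1 for T2, since the text does not reproduce the original formula.

```python
import math
from collections import Counter

PAD = "\x02"  # hypothetical start-of-sentence padding character


def train_char_ngram(sentences, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for s in sentences:
        padded = PAD * (n - 1) + s
        vocab.update(s)
        for j in range(n - 1, len(padded)):
            ctx = padded[j - n + 1:j]
            ngrams[ctx + padded[j]] += 1
            contexts[ctx] += 1
    # One extra vocabulary slot is kept for characters unseen in training.
    return ngrams, contexts, len(vocab) + 1


def cross_entropy(model, sentences, n=3):
    """Per-character cross-entropy (in bits) of `sentences` under `model`."""
    ngrams, contexts, v = model
    log_prob, chars = 0.0, 0
    for s in sentences:
        padded = PAD * (n - 1) + s
        for j in range(n - 1, len(padded)):
            ctx = padded[j - n + 1:j]
            # Add-one smoothing (an assumption; the source names no smoothing).
            p = (ngrams[ctx + padded[j]] + 1) / (contexts[ctx] + v)
            log_prob += math.log2(p)
        chars += len(s)
    if chars == 0:
        raise ValueError("test corpus has no characters")
    return -log_prob / chars


def similarity_coefficient(subset, t1, t2, n=3):
    """Assumed normalization: 0 when subset is T1 itself, 1 when it is T2."""
    m1, m2 = train_char_ngram(t1, n), train_char_ngram(t2, n)
    d1 = cross_entropy(m1, subset, n) - cross_entropy(m1, t1, n)
    d2 = cross_entropy(m2, subset, n) - cross_entropy(m2, t2, n)
    return d1 / (d1 + d2)
```

For a one-sentence subset, `similarity_coefficient([sentence], t1, t2)` would be called once per sentence. With this particular normalization a short sentence can fall slightly outside the interval from 0 to 1, so a practical variant might clip the result.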

The first similarity coefficient storage unit 16 stores the similarity coefficients of the subsets of the first corpus. These similarity coefficients are those calculated by the similarity coefficient calculation unit 15. The first similarity coefficient storage unit 16 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk).

The second similarity coefficient storage unit 17 stores the similarity coefficients of the subsets of the second corpus. These similarity coefficients are those calculated by the similarity coefficient calculation unit 15. The second similarity coefficient storage unit 17 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk).

The first similarity coefficient storage unit 16 and the second similarity coefficient storage unit 17 may be realized by the same recording medium as the first corpus storage unit 11, the second corpus storage unit 12, and the like, or by separate recording media. In the former case, for example, each similarity coefficient may be stored in association with the corresponding subset of each corpus. In the latter case, for example, the first corpus storage unit 11 or the like may store each subset of the corpus in association with a subset identifier that identifies it, and the first similarity coefficient storage unit 16 or the like may store, in association with each subset identifier, the similarity coefficient of the subset identified by that identifier.

The subset extraction unit 18 extracts, from the second corpus stored in the second corpus storage unit 12, the set of subsets that the addition unit 19 adds to the first corpus. The set of subsets of the second corpus extracted by the subset extraction unit 18 has the same representative value as the representative value of the similarity coefficients in the first corpus. Here, a representative value is a value that represents a distribution (group) of similarity coefficients; it may be, for example, the mean or the median. It suffices that, as a result, the set of subsets of the second corpus extracted by the subset extraction unit 18 has the same representative value as the representative value of the similarity coefficients in the first corpus, so the subset extraction unit 18 may or may not actually calculate representative values for the first corpus and the second corpus. When the subset extraction unit 18 does calculate a representative value of the first corpus or the like, it may do so based on the similarity coefficients stored in the first similarity coefficient storage unit 16 and the second similarity coefficient storage unit 17. In this embodiment, the case where the mean is used as the representative value is described. The method by which the subset extraction unit 18 extracts a set of subsets of the second corpus having the same mean as the mean of the similarity coefficients in the first corpus will be described later.
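The notion of a representative value can be made concrete with a small Python sketch. It only computes and compares representative values; it is not the extraction method itself. The function names and the tolerance `tol` (needed because floating-point means are rarely bit-identical) are assumptions introduced for illustration.

```python
from statistics import mean, median


def representative(coeffs, kind="mean"):
    """Representative value of a list of similarity coefficients."""
    if kind == "mean":
        return mean(coeffs)
    if kind == "median":
        return median(coeffs)
    raise ValueError(f"unsupported representative value: {kind}")


def same_representative(first_coeffs, extracted_coeffs, kind="mean", tol=1e-6):
    """True if the two sets share the representative value within `tol`."""
    gap = representative(first_coeffs, kind) - representative(extracted_coeffs, kind)
    return abs(gap) <= tol
```

A check such as `same_representative(first_coeffs, [second_coeffs[i] for i in chosen])` could be used to confirm that a candidate set of extracted subsets satisfies the condition described above.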

The addition unit 19 adds the set of subsets of the second corpus extracted by the subset extraction unit 18 to the first corpus stored in the first corpus storage unit 11.
The output unit 20 outputs the corpus produced by the addition unit 19. This output may be, for example, display on a display device (for example, a CRT or a liquid crystal display), transmission to a predetermined device via a communication line, or storage on a recording medium. In this embodiment, the output unit 20 stores the corpus produced by the addition unit 19 on a recording medium 21. The recording medium 21 may or may not be the same recording medium as the first corpus storage unit 11 or the like. The output unit 20 may or may not include the device that performs the output (for example, a display device). The output unit 20 may be realized by hardware, or by software such as a driver that drives such a device.

The first corpus is a corpus used in natural language processing; it may or may not have been tuned in advance to the target of that processing. The second corpus may be any corpus, provided that the distribution of its similarity coefficients is not a Dirac-delta-like distribution but has some spread.

Next, the operation of the corpus adder 1 according to this embodiment will be described with reference to the flowchart of FIG. 2. FIG. 2 is a flowchart showing the processing performed after the process of adding the first corpus and at least a part of the second corpus has been started.

(Step S101) The similarity coefficient calculation unit 15 reads the first corpus from the first corpus storage unit 11 and calculates the similarity coefficients of the subsets of the first corpus. For this calculation, the first reference corpus stored in the first reference corpus storage unit 13 and the second reference corpus stored in the second reference corpus storage unit 14 are used as the reference corpora. The calculated similarity coefficients are stored in the first similarity coefficient storage unit 16.

(Step S102) The similarity coefficient calculation unit 15 reads the second corpus from the second corpus storage unit 12 and calculates the similarity coefficients of the subsets of the second corpus. For this calculation, the first reference corpus stored in the first reference corpus storage unit 13 and the second reference corpus stored in the second reference corpus storage unit 14 are used as the reference corpora. The calculated similarity coefficients are stored in the second similarity coefficient storage unit 17.

(Step S103) Using the similarity coefficient of each subset of the first corpus stored in the first similarity coefficient storage unit 16 and the similarity coefficient of each subset of the second corpus stored in the second similarity coefficient storage unit 17, the subset extraction unit 18 determines the set of subsets of the second corpus that the addition unit 19 is to add to the first corpus. The details of this processing will be described later.

(Step S104) The subset extraction unit 18 extracts, from the second corpus storage unit 12, the set of subsets of the second corpus determined in step S103.

(Step S105) The addition unit 19 adds the set of subsets of the second corpus extracted by the subset extraction unit 18 in step S104 to the first corpus stored in the first corpus storage unit 11.

(Step S106) The output unit 20 stores the corpus produced by the addition unit 19 on the recording medium 21. The processing then ends.

Next, the operation of the corpus addition apparatus 1 according to this embodiment is described using specific examples. Two cases are covered: extracting the set of subsets of the second corpus using the shape of the distribution of the similarity coefficients of the subsets in the first corpus (Specific Example 1), and extracting a set of subsets of the second corpus whose mean similarity coefficient equals the mean similarity coefficient of the subsets in the first corpus, regardless of the shape of that distribution (Specific Example 2). Explanations common to both examples may be omitted from Specific Example 2.

[Specific Example 1]
In this specific example, Japanese corpora are used. The SLDB (Spontaneous Speech Database) corpus is used as the first reference corpus, and a corpus of Nikkei newspaper articles is used as the second reference corpus.

In this specific example, as stated above, the set of subsets is extracted using the shape of the distribution of the similarity coefficients. Here, the distribution of similarity coefficients is the distribution obtained by taking the similarity coefficient on the horizontal axis and the number of subsets on the vertical axis.

The subset extraction unit 18 extracts from the second corpus the subsets to be added to the first corpus so that the shape of the similarity coefficient distribution of the resulting combined corpus is the same as the shape of the similarity coefficient distribution of the first corpus calculated by the similarity coefficient calculation unit 15. Here, two similarity coefficient distributions have the same shape when multiplying the subset counts of one distribution by a constant factor yields the other distribution. In other words, all distributions obtained by scaling a similarity coefficient distribution along the vertical axis by a constant factor have the same shape.
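The "same shape" condition can be sketched as a proportionality check between two histograms. This is a minimal illustration, not the apparatus itself: the function name `same_shape`, the dictionary representation (similarity coefficient value mapped to subset count), and the sample counts are all assumptions made for the example.

```python
def same_shape(hist_a, hist_b):
    """Return True if hist_b equals hist_a scaled by one positive constant.

    Each histogram maps a similarity coefficient value to a subset count
    (a hypothetical representation chosen for this sketch).
    """
    keys = sorted(set(hist_a) | set(hist_b))
    a = [hist_a.get(x, 0) for x in keys]
    b = [hist_b.get(x, 0) for x in keys]
    # Pick a reference bin that is populated in hist_a.
    ref = next((i for i, n in enumerate(a) if n > 0), None)
    if ref is None or b[ref] <= 0:
        return False
    # b[i] / b[ref] == a[i] / a[ref], checked by cross-multiplication so that
    # integer counts are compared exactly, without floating-point division.
    return all(a[i] * b[ref] == a[ref] * b[i] for i in range(len(keys)))


first = {0.4: 100, 0.5: 200, 0.6: 100}    # illustrative counts
scaled = {0.4: 134, 0.5: 268, 0.6: 134}   # first scaled along the vertical axis
skewed = {0.4: 134, 0.5: 200, 0.6: 134}   # not a constant multiple of first
print(same_shape(first, scaled), same_shape(first, skewed))  # True False
```

Cross-multiplying keeps the comparison exact for integer counts; with interpolated (non-integer) distributions a tolerance would be needed instead.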

More specifically, the subset extraction unit 18 extracts the set of subsets as follows. When the distribution obtained by multiplying the similarity coefficient distribution of the first corpus by a predetermined value (that is, the distribution of the first corpus stretched or shrunk along the vertical axis) is contained in the similarity coefficient distribution of the second corpus, the subset extraction unit 18 extracts from the second corpus a set of subsets having that scaled distribution. In this specific example, the predetermined value is the largest value for which the scaled distribution is still contained in the similarity coefficient distribution of the second corpus.

When the corpus addition process starts, the similarity coefficient calculation unit 15 first reads the first corpus, the first reference corpus, and the second reference corpus from the first corpus storage unit 11, the first reference corpus storage unit 13, and the second reference corpus storage unit 14, respectively, and calculates the similarity coefficients of the subsets in the first corpus using the formula given earlier (step S101). Each subset of the first corpus is used as the test corpus in that formula. The similarity coefficients of the first corpus calculated by the similarity coefficient calculation unit 15 are stored in the first similarity coefficient storage unit 16.

FIG. 3 shows an example of the similarity coefficients of the subsets in the first corpus stored in the first similarity coefficient storage unit 16. As FIG. 3 shows, each subset identifier is associated with a similarity coefficient. As noted earlier, the similarity coefficients may of course instead be stored in direct association with the subsets of the first corpus in the first corpus storage unit 11. FIG. 4 shows the distribution of the similarity coefficients of the first corpus stored in the first similarity coefficient storage unit 16. In the distribution shown in FIG. 4, the mean I_0 is 0.45. Although the similarity coefficients take discrete values, FIG. 4 shows the distribution with those values interpolated. The same applies to the other similarity coefficient distributions.

Next, the similarity coefficient calculation unit 15 reads the second corpus, the first reference corpus, and the second reference corpus from the second corpus storage unit 12, the first reference corpus storage unit 13, and the second reference corpus storage unit 14, respectively, and calculates the similarity coefficients of the subsets in the second corpus using the formula given earlier (step S102). Each subset of the second corpus is used as the test corpus in that formula. The similarity coefficients of the second corpus calculated by the similarity coefficient calculation unit 15 are stored in the second similarity coefficient storage unit 17. As in FIG. 3, the second similarity coefficient storage unit 17 stores each similarity coefficient in association with a subset identifier. FIG. 5 shows the distribution of the similarity coefficients of the second corpus stored in the second similarity coefficient storage unit 17.

The subset extraction unit 18 refers to the first similarity coefficient storage unit 16 and the second similarity coefficient storage unit 17 and obtains the similarity coefficient distributions shown in FIGS. 4 and 5. It then finds the largest value by which the distribution of the first corpus can be multiplied along the vertical axis while remaining contained in the distribution of the second corpus. In this specific example, that value is assumed to be 0.34. That is, as FIG. 5 shows, the distribution of the first corpus multiplied by 0.34 is contained in the distribution of the second corpus. The subset extraction unit 18 therefore decides to extract from the second corpus a set of subsets having the distribution shown in FIG. 5, namely the distribution of the first corpus multiplied by 0.34 (step S103). Because the set of subsets to extract is determined in this way, the similarity coefficient distribution of the corpus obtained by adding the extracted set to the first corpus has the same shape as the similarity coefficient distribution of the first corpus.
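The search for the largest scaling value in step S103 can be sketched as a minimum over per-coefficient ratios. This is a minimal sketch under stated assumptions: the name `max_scale_factor`, the dictionary histograms, and the counts are hypothetical, chosen so that the result matches the 0.34 of the example; the patent does not specify how the value is computed.

```python
def max_scale_factor(hist1, hist2):
    """Largest k such that k * hist1[x] <= hist2[x] for every coefficient x.

    hist1 and hist2 map a similarity coefficient value to a subset count
    for the first and second corpus respectively (assumed representation).
    """
    ratios = [hist2.get(x, 0) / n for x, n in hist1.items() if n > 0]
    if not ratios:
        raise ValueError("first-corpus histogram is empty")
    # The tightest bin limits how far the first distribution can be scaled.
    return min(ratios)


first = {0.4: 100, 0.5: 200, 0.6: 100}           # illustrative counts
second = {0.4: 50, 0.5: 68, 0.6: 90, 0.7: 10}    # illustrative counts
k = max_scale_factor(first, second)
print(k)  # 0.34 for these illustrative counts
```

If a coefficient value that occurs in the first corpus is absent from the second corpus, the factor is 0 and nothing can be extracted with this shape-preserving rule.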

The subset extraction unit 18 extracts the set of subsets it has decided to extract from the second corpus storage unit 12. For example, if, in the distribution of FIG. 5 obtained by multiplying the distribution of the first corpus by 0.34, the number of subsets corresponding to a similarity coefficient of 0.5 is 190, the subset extraction unit 18 randomly selects, from the second similarity coefficient storage unit 17, 190 subset identifiers of subsets whose similarity coefficient is 0.5, and extracts the subsets identified by those identifiers from the second corpus storage unit 12. This extraction is performed over the entire range of subsets determined for extraction (step S104).
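The random selection in step S104 can be sketched as per-coefficient sampling of subset identifiers. The function name `select_subset_ids`, the identifier strings, and the counts below are hypothetical; rounding the quota to the nearest integer is also an assumption, since the description does not say how fractional counts are handled.

```python
import random


def select_subset_ids(hist1, ids_by_coeff2, k, rng=None):
    """Randomly pick second-corpus subset identifiers, bin by bin.

    hist1: coefficient value -> subset count in the first corpus.
    ids_by_coeff2: coefficient value -> list of second-corpus subset ids.
    k: scaling value applied to the first-corpus distribution.
    """
    rng = rng or random.Random()
    selected = []
    for x, n in hist1.items():
        candidates = ids_by_coeff2.get(x, [])
        # Quota for this coefficient value, capped by what is available.
        quota = min(round(k * n), len(candidates))
        selected.extend(rng.sample(candidates, quota))
    return selected


first = {0.4: 100, 0.5: 200, 0.6: 100}   # illustrative counts
ids_by_coeff2 = {
    0.4: [f"s{i}" for i in range(0, 50)],
    0.5: [f"s{i}" for i in range(100, 168)],
    0.6: [f"s{i}" for i in range(200, 290)],
}
chosen = select_subset_ids(first, ids_by_coeff2, 0.34, random.Random(0))
print(len(chosen))  # 136 = 34 + 68 + 34
```

Passing a seeded `random.Random` makes the selection reproducible, which is convenient when the combined corpus has to be rebuilt.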

The addition unit 19 adds the set of subsets of the second corpus extracted by the subset extraction unit 18 to the first corpus read from the first corpus storage unit 11 (step S105). As FIG. 6 shows, the similarity coefficient distribution of the corpus after addition is the distribution of the first corpus in FIG. 4 multiplied by 1.34. The output unit 20 stores the corpus produced by the addition unit 19 in the recording medium 21 (step S106). By using the corpus stored in the recording medium 21, a user can obtain better processing results than with the first corpus alone. This is explained later.
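The factor 1.34 follows directly from the extraction rule. As a short worked step, with notation introduced here only for illustration (h_1(x) is the number of first-corpus subsets with similarity coefficient x, h_e(x) the number of extracted second-corpus subsets, and h_+(x) the count in the combined corpus):

```latex
h_e(x) = 0.34\, h_1(x)
\quad\Longrightarrow\quad
h_{+}(x) = h_1(x) + h_e(x) = (1 + 0.34)\, h_1(x) = 1.34\, h_1(x)
```

Since h_+ is a constant multiple of h_1, the combined corpus has the same distribution shape as the first corpus.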

In this specific example, the subset extraction unit 18 extracts from the second corpus a set of subsets whose distribution equals the distribution of the first corpus multiplied by the largest value for which the scaled distribution is contained in the distribution of the second corpus. However, the subset extraction unit 18 may instead use a value smaller than that maximum, as long as the scaled distribution is contained in the similarity coefficient distribution of the second corpus.

[Specific Example 2]
This specific example is the same as Specific Example 1 except for the method of determining the set of subsets to extract from the second corpus. Description of the processing that is the same as in Specific Example 1 is omitted.

Assume that the similarity coefficient calculation unit 15 has calculated the similarity coefficients of the subsets in the first corpus and the second corpus and that they have been stored in the first similarity coefficient storage unit 16 and the second similarity coefficient storage unit 17. The subset extraction unit 18 then obtains the distribution of the similarity coefficients of the subsets in the second corpus in the same way as in Specific Example 1. As before, the mean similarity coefficient of the subsets in the first corpus is assumed to be 0.45. As FIG. 7 shows, the subset extraction unit 18 determines, within the similarity coefficient distribution of the second corpus, the distribution of the set of subsets to extract so that it is symmetric about the mean of the first corpus (that is, left-right symmetric in the coordinate system of FIG. 7) and is as large as possible. The subsequent processing, in which the subset extraction unit 18 extracts a set of subsets having the determined distribution from the second corpus and the extracted set is added to the first corpus, is the same as in Specific Example 1.
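One way to obtain the largest distribution that is symmetric about the first-corpus mean is to pair each coefficient value with its mirror image and keep the smaller of the two counts. This is a sketch of one possible reading of FIG. 7, not the patent's stated procedure: the name `symmetric_extraction_histogram`, the two-decimal binning, and the counts are assumptions.

```python
def symmetric_extraction_histogram(hist2, mean, ndigits=2):
    """Largest sub-histogram of hist2 that is symmetric about `mean`.

    hist2 maps a similarity coefficient (already rounded to `ndigits`
    decimals) to the number of second-corpus subsets with that value.
    """
    target = {}
    for x, n in hist2.items():
        # Coefficient value at the same distance on the other side of the mean.
        mirror = round(2 * mean - x, ndigits)
        count = min(n, hist2.get(mirror, 0))
        if count > 0:
            target[x] = count
    return target


second = {0.25: 5, 0.35: 40, 0.45: 80, 0.55: 30, 0.65: 20, 0.75: 7}
target = symmetric_extraction_histogram(second, mean=0.45)
total = sum(target.values())
extracted_mean = sum(x * n for x, n in target.items()) / total
print(target)          # counts mirrored around 0.45
print(extracted_mean)  # approximately 0.45
```

Because every kept count has an equal count at the mirrored coefficient, the mean of the extracted set equals the chosen mean by construction.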

Because the set of subsets extracted from the second corpus has the same mean as the mean similarity coefficient of the subsets in the first corpus, the mean similarity coefficient of the subsets in the combined corpus is also equal to the mean similarity coefficient of the subsets in the first corpus.
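This can be checked with a one-line calculation. The symbols are introduced here only for illustration: n_1 is the number of subsets in the first corpus, n_e the number of extracted subsets, and \mu the common mean similarity coefficient.

```latex
\mu_{+} = \frac{n_1 \mu + n_e \mu}{n_1 + n_e} = \frac{(n_1 + n_e)\,\mu}{n_1 + n_e} = \mu
```

The result holds for any sizes n_1 and n_e, so the amount of data added does not shift the mean.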

Needless to say, the method of extracting from the second corpus a set of subsets having the same mean as the mean similarity coefficient of the first corpus is not limited to the one described in this specific example. For example, such a set can easily be extracted by determining the distribution of the set of subsets so that, within the similarity coefficient distribution of the second corpus, it is symmetric about the mean similarity coefficient of the first corpus. Specifically, the distribution of the set of subsets to extract from the second corpus may be determined as shown in FIG. 8. Moreover, as long as a set of subsets having the same mean as the mean similarity coefficient of the first corpus can be extracted from the second corpus, the distribution of that set need not be left-right symmetric as in FIGS. 7 and 8.

The two specific examples above describe cases in which the subset extraction unit 18 extracts a set of subsets of the second corpus having the same representative value as the representative value of the similarity coefficients in the first corpus. Needless to say, as long as the set of subsets of the second corpus extracted by the subset extraction unit 18 ends up having the same representative value as that of the first corpus, the extraction method is not limited to those two examples.

In each of the specific examples above, the similarity coefficients of the subsets in the first corpus are used only to obtain the similarity coefficient distribution or its representative value. The first similarity coefficient storage unit 16 may therefore simply store the similarity coefficients of the subsets, without associating them with subset identifiers as in FIG. 3.

[Description of effects]
The following explains that the corpus obtained by addition as described in this embodiment performs better in natural language processing than the first corpus.

Here, the effect of the corpus addition apparatus 1 according to this embodiment is explained by evaluating the performance of a probabilistic language model and the performance of Japanese-to-English example-based machine translation in an EBMT system.

SLDB is used as the first reference corpus. SLDB exists as both an English corpus and a Japanese corpus. As the second reference corpus, the Calgary Corpus is used for English and a corpus of Nikkei newspaper articles is used for Japanese. The corpus whose similarity coefficients are calculated using those reference corpora is BTEC (Basic Traveler's Expression Corpus), a corpus of basic travel conversation in Japanese and English. The task (the data actually processed for evaluation) is a set of 510 Japanese BTEC sentences that are not included in the above corpus. The similarity coefficient of this task (here, the similarity coefficient of the task as a whole) is 0.331. The similarity coefficient of each sentence in the task is also close to the similarity coefficient of the task. Each subset of the corpus is a sentence.

FIG. 9 shows the distribution of the similarity coefficients of the subsets (here, sentences) in the Japanese BTEC. The mean of the distribution in FIG. 9 is 0.315 and the standard deviation is 0.118. FIG. 10 shows the distribution of the similarity coefficients of the subsets (here, sentences) in the English BTEC. The mean of the distribution in FIG. 10 is 0.313 and the standard deviation is 0.156.

First, perplexity is considered. Perplexity on the task described above was calculated using data randomly selected from the Japanese BTEC. The random selection was performed for amounts ranging from 0.5% to 100% of BTEC. The results are shown by the solid line in FIG. 11. In addition, data was selected from the Japanese BTEC in order, starting from the sentences whose similarity coefficient is nearest the task's similarity coefficient of 0.331, and perplexity on the task was calculated. In this case, because the similarity coefficient of the task and that of the selected BTEC data are nearly the same, perplexity is calculated from homogeneous data. The homogeneous selection was likewise performed from 0.5% to 100% of BTEC. The results are shown by the dash-dot line in FIG. 11. When 100% of the Japanese BTEC is selected, the random and homogeneous selections contain the same data, so the perplexity is the same.
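The two selection strategies compared in this evaluation can be sketched as follows. This is only an illustration of the selection step: the function names `select_homogeneous` and `select_random`, the toy sentences, and their coefficients are hypothetical, and the perplexity computation itself is not shown.

```python
import random


def select_homogeneous(sentences, coeffs, task_coeff, fraction):
    """Take the given fraction of sentences, nearest to task_coeff first."""
    n = round(len(sentences) * fraction)
    order = sorted(range(len(sentences)),
                   key=lambda i: abs(coeffs[i] - task_coeff))
    return [sentences[i] for i in order[:n]]


def select_random(sentences, fraction, rng=None):
    """Take the given fraction of sentences uniformly at random."""
    rng = rng or random.Random()
    return rng.sample(sentences, round(len(sentences) * fraction))


sentences = ["a", "b", "c", "d"]       # stand-ins for BTEC sentences
coeffs = [0.10, 0.33, 0.50, 0.30]      # illustrative similarity coefficients
print(select_homogeneous(sentences, coeffs, 0.331, 0.5))  # ['b', 'd']
```

At a fraction of 1.0 both functions return the whole corpus (in different orders), which is why the two curves meet at 100%.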

In general, the lower the perplexity, the better the performance of the probabilistic model. As FIG. 11 shows, when the proportion of the corpus (the Japanese BTEC) used for the perplexity calculation is small, up to about 15%, the homogeneous data gives the lower perplexity. Above that proportion, the homogeneous data gives higher perplexity (that is, lower performance) than the randomly selected data. Thus, with respect to perplexity, except in the region where little data is used, randomly selected data performs better than homogeneous data and is more useful.

Applying this result to the corpus addition apparatus 1 according to this embodiment, a corpus that contains more subsets heterogeneous to the task gives lower perplexity and therefore higher performance. The corpus obtained by adding subsets to the first corpus contains more heterogeneous subsets than the first corpus alone, so it gives lower perplexity and higher performance in natural language processing.

Next, Japanese-to-English machine translation in the EBMT system is examined. As in the perplexity study above, machine translation is run on the 510-sentence task using the Japanese and English BTEC. As with perplexity, for both random data selection and homogeneous data selection, the BLEU score shown in FIG. 12, the NIST score shown in FIG. 13, and the mWER score shown in FIG. 14 were calculated. In each graph of FIGS. 12 to 14, the results for randomly selected data are shown by a solid line and the results for homogeneous data by a dash-dot line. The BLEU, NIST, and mWER scores are described in References 1, 2, and 3 below, respectively.

(Reference 1) Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation.

(Reference 2) George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. Proceedings of Human Lang. Technol. Conf. (HLT-02), pp. 138-145.

(Reference 3) Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. Proceedings of ACL 2003, pp. 160-167.

In FIG. 12, the BLEU score takes values from 0 to 1, and a higher score indicates a better translation. In FIG. 13, the NIST score takes values of 0 or more, and a higher score indicates a better translation. In FIG. 14, the mWER score takes values from 0 to 1, and a lower score indicates a better translation. These results show that the quality of machine translation improves as the amount of BTEC used increases. They also show that, except for the range up to 3% of BTEC for the BLEU score, up to 18% of BTEC for the NIST score, and up to 2% of BTEC for the mWER score, randomly selected data gives translation quality equal to or higher than homogeneous data under all three evaluation methods.

Applying this result to the corpus addition apparatus 1 according to this embodiment, a corpus that contains more subsets heterogeneous to the task gives higher machine translation quality and therefore higher performance. The corpus obtained by adding subsets to the first corpus contains more heterogeneous subsets than the first corpus alone, so it gives higher translation quality and higher performance in natural language processing.

Thus, contrary to Non-Patent Document 1 mentioned above, a corpus that includes heterogeneous material enables better natural language processing. Although only perplexity and machine translation were checked here, similar results can be expected for other natural language processing, such as summarization.

The examples above describe a case in which the similarity coefficient of the task and the representative value of the similarity coefficient distribution of the first corpus are close, that is, a case in which the first corpus has already been adjusted to the task. However, as stated earlier, the first corpus used by the corpus addition apparatus 1 according to this embodiment may or may not be a corpus adjusted to the task.

As described above, the corpus addition apparatus 1 according to this embodiment yields a corpus with higher performance, and natural language processing performed with the resulting combined corpus gives better results.

If the distribution of the similarity coefficients of the subsets in a corpus is like those shown in FIGS. 9 and 10, the distribution may be smoothed as in FIG. 4 and the like, or it may be used as is. In the former case, an optimal smooth curve may be determined using the least squares method or the like.

The distribution of the similarity coefficients of the subsets in a corpus is often Gaussian in shape, as in FIGS. 4 and 9. Needless to say, however, the similarity coefficient distribution of the subsets of a corpus used in this embodiment need not be Gaussian.

The specific examples above use BTEC as the corpus and SLDB, the Calgary Corpus, Nikkei newspaper articles, and the like as the reference corpora, but needless to say the corpora are not limited to these.

This embodiment describes the case in which the similarity coefficient calculation unit 15 calculates the similarity coefficients of the subsets in the first corpus and of the subsets in the second corpus. If the similarity coefficients are already stored in the first similarity coefficient storage unit 16 and the second similarity coefficient storage unit 17, the corpus addition apparatus 1 need not include the similarity coefficient calculation unit 15. In that case, the corpus addition apparatus 1 also need not include the first reference corpus storage unit 13 or the second reference corpus storage unit 14, and it does not matter how the similarity coefficients came to be stored in the first similarity coefficient storage unit 16 and the second similarity coefficient storage unit 17. For example, the similarity coefficients may be stored in the first similarity coefficient storage unit 16 and the like via a recording medium, after being transmitted over a communication line or the like, or after being entered through an input device.

また、本実施の形態では、加算部１９が第１のコーパスと、第２のコーパスの少なくとも一部とを実際に加算する場合について説明したが、加算部１９による加算は、論理的なものであってもよい。すなわち、加算部１９は、第１のコーパスと、第２のコーパスの少なくとも一部とを１個のコーパスとして用いると論理的に決定するだけであってもよい。この場合には、出力部２０は、加算後のコーパスを特定することができる情報を出力してもよい。例えば、出力部２０は、第１コーパス記憶部１１で記憶されている第１のコーパス及び第２コーパス記憶部１２で記憶されている第２のコーパスにおいて、加算後のコーパスに含まれるサブセットに対して、所定のフラグを立てる処理を行ってもよい。 In the present embodiment, the case where the adder 19 actually adds the first corpus and at least a part of the second corpus has been described, but the addition by the adder 19 may be logical. That is, the adder 19 may merely decide logically that the first corpus and at least a part of the second corpus are to be used as a single corpus. In this case, the output unit 20 may output information that identifies the corpus after the addition. For example, among the subsets of the first corpus stored in the first corpus storage unit 11 and of the second corpus stored in the second corpus storage unit 12, the output unit 20 may set a predetermined flag on each subset that is included in the corpus after the addition.
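The logical addition described above can be illustrated with a minimal Python sketch. The `Subset` class, the `in_added_corpus` flag, and the `add_logically` function are hypothetical names introduced only for this illustration; the patent does not specify a data layout. No data is copied: the combined corpus is identified solely by the flags.

```python
from dataclasses import dataclass


@dataclass
class Subset:
    """One subset of a corpus (hypothetical layout, not from the patent)."""
    subset_id: str
    text: str
    in_added_corpus: bool = False  # the "predetermined flag" (assumed name)


def add_logically(first, second, selected_ids):
    """Mark the subsets that belong to the corpus after addition.

    Every subset of the first corpus is flagged; a subset of the second
    corpus is flagged only if its identifier is in selected_ids.
    Returns the identifiers that make up the combined corpus.
    """
    for subset in first:
        subset.in_added_corpus = True
    for subset in second:
        subset.in_added_corpus = subset.subset_id in selected_ids
    return [s.subset_id for s in first + second if s.in_added_corpus]
```

A downstream process would then read only the flagged subsets from the two storage units instead of reading a physically merged corpus.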

また、本実施の形態では、類似度係数の分布の代表値として平均値を用いる場合について説明したが、前述のように、類似度係数の分布の代表値として中央値を用いてもよいことは言うまでもない。 In the present embodiment, the case where the mean is used as the representative value of the similarity coefficient distribution has been described. As noted above, it goes without saying that the median may be used as the representative value instead.
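As a small illustration of this choice, the sketch below computes the representative value of a list of similarity coefficients as either the mean or the median. The function name `representative_value` and the `kind` parameter are assumptions made for this example only.

```python
from statistics import mean, median


def representative_value(coefficients, kind="mean"):
    """Return the representative value of a similarity coefficient distribution.

    kind selects the statistic: "mean" (as in the embodiment) or "median"
    (the alternative mentioned in the text).
    """
    if not coefficients:
        raise ValueError("at least one similarity coefficient is required")
    if kind == "mean":
        return mean(coefficients)
    if kind == "median":
        return median(coefficients)
    raise ValueError(f"unsupported representative value: {kind}")
```

The median is less sensitive to a few subsets with extreme coefficients, which may matter when the distribution is skewed.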

また、上記各実施の形態において、各処理または各機能は、単一の装置または単一のシステムによって集中処理されることによって実現されてもよく、あるいは、複数の装置または複数のシステムによって分散処理されることによって実現されてもよい。 In each of the above embodiments, each process or function may be realized by centralized processing on a single device or a single system, or by distributed processing across a plurality of devices or a plurality of systems.

また、上記各実施の形態において、各構成要素は専用のハードウェアにより構成されてもよく、あるいは、ソフトウェアにより実現可能な構成要素については、プログラムを実行することによって実現されてもよい。例えば、ハードディスクや半導体メモリ等の記録媒体に記録されたソフトウェア・プログラムをＣＰＵ等のプログラム実行部が読み出して実行することによって、各構成要素が実現され得る。なお、上記各実施の形態における情報処理装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータに、第１類似度係数記憶部で記憶されている、第１のコーパスにおけるサブセットの類似度係数と、第２類似度係数記憶部で記憶されている、第２のコーパスにおけるサブセットの類似度係数とを用いて、第２のコーパスのサブセットの集合であって、第１のコーパスにおけるサブセットの類似度係数の代表値と同じ代表値を有するサブセットの集合を、第２のコーパスから抽出する抽出ステップと、前記抽出ステップで抽出した前記第２のコーパスのサブセットの集合と、前記第１のコーパスとを加算する加算ステップと、前記加算ステップで加算したコーパスを出力する出力ステップと、を実行させるためのものである。 In each of the above embodiments, each component may be configured by dedicated hardware, or a component that can be realized by software may be realized by executing a program. For example, each component can be realized by a program execution unit such as a CPU reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. The software that realizes the information processing apparatus in each of the above embodiments is the following program. That is, this program causes a computer to execute: an extraction step of extracting from the second corpus, using the similarity coefficients of the subsets in the first corpus stored in the first similarity coefficient storage unit and the similarity coefficients of the subsets in the second corpus stored in the second similarity coefficient storage unit, a set of subsets of the second corpus whose representative value is the same as the representative value of the similarity coefficients of the subsets in the first corpus; an addition step of adding the set of subsets of the second corpus extracted in the extraction step to the first corpus; and an output step of outputting the corpus obtained in the addition step.
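The extraction and addition steps of this program can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the embodiment: the mean is used as the representative value, "the same" is interpreted as "within a tolerance `tol`", and the greedy prefix selection is one possible strategy chosen here for simplicity. The names `extract_subsets`, `add_corpora`, and `tol` are hypothetical, and subset identifiers are assumed to be unique across both corpora.

```python
from statistics import mean


def extract_subsets(first_coeffs, second_coeffs, tol=0.01):
    """Extraction step (assumed strategy).

    first_coeffs and second_coeffs map subset identifiers to similarity
    coefficients. Second-corpus subsets are ranked by distance from the
    first corpus's mean coefficient, and the longest ranked prefix whose
    own mean lies within tol of that target is returned.
    """
    target = mean(first_coeffs.values())
    ranked = sorted(second_coeffs, key=lambda sid: abs(second_coeffs[sid] - target))
    best = 0
    total = 0.0
    for k, sid in enumerate(ranked, start=1):
        total += second_coeffs[sid]
        if abs(total / k - target) <= tol:
            best = k  # remember the longest prefix that still matches
    return ranked[:best]


def add_corpora(first_corpus, second_corpus, selected_ids):
    """Addition step: first corpus plus the extracted second-corpus subsets.

    Both corpora map subset identifiers to subset text.
    """
    merged = dict(first_corpus)
    merged.update({sid: second_corpus[sid] for sid in selected_ids})
    return merged
```

The output step would then write the dictionary returned by `add_corpora` to a file, a display, or another device; it is omitted here because the text leaves the output destination open.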

また、このプログラムは、コンピュータに、第１参照コーパス記憶部が記憶している第１の参照コーパスと、第２参照コーパス記憶部が記憶している第２の参照コーパスとを参照コーパスとして、前記第１のコーパス、及び前記第２のコーパスにおけるそれぞれのサブセットの類似度係数を算出する類似度係数算出ステップをさらに実行させ、前記第１類似度係数記憶部が記憶している、前記第１のコーパスにおけるサブセットの類似度係数、及び前記第２類似度係数記憶部が記憶している、前記第２のコーパスにおけるサブセットの類似度係数は、前記類似度係数算出ステップで算出したものであってもよい。 The program may further cause the computer to execute a similarity coefficient calculation step of calculating the similarity coefficient of each subset in the first corpus and in the second corpus, using as reference corpora the first reference corpus stored in the first reference corpus storage unit and the second reference corpus stored in the second reference corpus storage unit. In that case, the similarity coefficients of the subsets in the first corpus stored in the first similarity coefficient storage unit and the similarity coefficients of the subsets in the second corpus stored in the second similarity coefficient storage unit may be those calculated in the similarity coefficient calculation step.

なお、上記プログラムにおいて、情報を出力する出力ステップなどでは、ハードウェアでしか行われない処理、例えば、出力ステップにおけるモデムやインターフェースカードなどで行われる処理は少なくとも含まれない。 In the above program, steps such as the output step of outputting information do not include, at the least, processing that can be performed only by hardware, for example processing performed by a modem or an interface card in the output step.

また、このプログラムは、サーバなどからダウンロードされることによって実行されてもよく、所定の記録媒体（例えば、ＣＤ−ＲＯＭなどの光ディスクや磁気ディスク、半導体メモリなど）に記録されたプログラムが読み出されることによって実行されてもよい。 This program may be executed by being downloaded from a server or the like, or by reading out the program recorded on a predetermined recording medium (for example, an optical disk such as a CD-ROM, a magnetic disk, or a semiconductor memory).

また、このプログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the computer that executes this program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

図１５は、上記プログラムを実行して、上記実施の形態によるコーパス加算装置１を実現するコンピュータの外観の一例を示す模式図である。上記実施の形態は、コンピュータハードウェア及びその上で実行されるコンピュータプログラムによって実現されうる。 FIG. 15 is a schematic diagram showing an example of the external appearance of a computer that executes the program and realizes the corpus adder 1 according to the embodiment. The above-described embodiment can be realized by computer hardware and a computer program executed on the computer hardware.

図１５において、コンピュータシステム１００は、ＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓｋＲｅａｄＯｎｌｙＭｅｍｏｒｙ）ドライブ１０５、ＦＤ（ＦｌｅｘｉｂｌｅＤｉｓｋ）ドライブ１０６を含むコンピュータ１０１と、キーボード１０２と、マウス１０３と、モニタ１０４とを備える。 In FIG. 15, a computer system 100 includes a computer 101, which contains a CD-ROM (Compact Disk Read Only Memory) drive 105 and an FD (Flexible Disk) drive 106, together with a keyboard 102, a mouse 103, and a monitor 104.

図１６は、コンピュータシステムを示す図である。図１６において、コンピュータ１０１は、ＣＤ−ＲＯＭドライブ１０５、ＦＤドライブ１０６に加えて、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）１１１と、ブートアッププログラム等のプログラムを記憶するためのＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）１１２と、ＣＰＵ１１１に接続され、アプリケーションプログラムの命令を一時的に記憶すると共に、一時記憶空間を提供するＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）１１３と、アプリケーションプログラム、システムプログラム、及びデータを記憶するハードディスク１１４と、ＣＰＵ１１１、ＲＯＭ１１２等を相互に接続するバス１１５とを備える。なお、コンピュータ１０１は、ＬＡＮへの接続を提供する図示しないネットワークカードを含んでもよい。 FIG. 16 is a diagram illustrating the computer system. In FIG. 16, the computer 101 includes, in addition to the CD-ROM drive 105 and the FD drive 106, a CPU (Central Processing Unit) 111; a ROM (Read Only Memory) 112 for storing programs such as a boot-up program; a RAM (Random Access Memory) 113 that is connected to the CPU 111, temporarily stores the instructions of application programs, and provides a temporary storage space; a hard disk 114 that stores application programs, system programs, and data; and a bus 115 that interconnects the CPU 111, the ROM 112, and the like. The computer 101 may also include a network card (not shown) that provides a connection to a LAN.

コンピュータシステム１００に、上記実施の形態によるコーパス加算装置１の機能を実行させるプログラムは、ＣＤ−ＲＯＭ１２１、またはＦＤ１２２に記憶されて、ＣＤ−ＲＯＭドライブ１０５、またはＦＤドライブ１０６に挿入され、ハードディスク１１４に転送されてもよい。これに代えて、そのプログラムは、図示しないネットワークを介してコンピュータ１０１に送信され、ハードディスク１１４に記憶されてもよい。プログラムは実行の際にＲＡＭ１１３にロードされる。なお、プログラムは、ＣＤ−ＲＯＭ１２１やＦＤ１２２、またはネットワークから直接、ロードされてもよい。 A program that causes the computer system 100 to execute the functions of the corpus adder 1 according to the above embodiment may be stored on a CD-ROM 121 or an FD 122, which is inserted into the CD-ROM drive 105 or the FD drive 106, and then transferred to the hard disk 114. Alternatively, the program may be transmitted to the computer 101 via a network (not shown) and stored on the hard disk 114. The program is loaded into the RAM 113 at execution time. The program may also be loaded directly from the CD-ROM 121, the FD 122, or the network.

プログラムは、コンピュータ１０１に、上記実施の形態によるコーパス加算装置１の機能を実行させるオペレーティングシステム（ＯＳ）、またはサードパーティプログラム等を必ずしも含まなくてもよい。プログラムは、制御された態様で適切な機能（モジュール）を呼び出し、所望の結果が得られるようにする命令の部分のみを含んでいてもよい。コンピュータシステム１００がどのように動作するのかについては周知であり、詳細な説明を省略する。 The program need not include an operating system (OS), a third-party program, or the like that causes the computer 101 to execute the functions of the corpus adder 1 according to the above embodiment. The program may include only those instructions that call appropriate functions (modules) in a controlled manner so that the desired result is obtained. How the computer system 100 operates is well known, and a detailed description is omitted.

また、本発明は、以上の実施の形態に限定されることなく、種々の変更が可能であり、それらも本発明の範囲内に包含されるものであることは言うまでもない。 The present invention is not limited to the above embodiments; various modifications are possible, and it goes without saying that these are also included within the scope of the present invention.

以上より、本発明によるコーパス加算装置等は、よりよい処理を実行できるコーパスを作成することができ、コーパスを処理する装置等として有用である。 As described above, the corpus adder according to the present invention can create a corpus capable of executing better processing, and is useful as an apparatus for processing a corpus.

FIG. 1: 本発明の実施の形態１によるコーパス加算装置の構成を示すブロック図 Block diagram showing the configuration of the corpus adder according to Embodiment 1 of the present invention
FIG. 2: 同実施の形態によるコーパス加算装置の動作を示すフローチャート Flowchart showing the operation of the corpus adder according to the embodiment
FIG. 3: 同実施の形態における類似度係数とサブセット識別子との対応の一例を示す図 Diagram showing an example of the correspondence between similarity coefficients and subset identifiers in the embodiment
FIGS. 4 to 10: 同実施の形態におけるコーパスの類似度係数の分布の一例を示す図 Diagrams each showing an example of the distribution of corpus similarity coefficients in the embodiment
FIG. 11: 同実施の形態におけるパープレキシティーの一例を示す図 Diagram showing an example of perplexity in the embodiment
FIG. 12: 同実施の形態におけるＢＬＥＵスコアの一例を示す図 Diagram showing an example of BLEU scores in the embodiment
FIG. 13: 同実施の形態におけるＮＩＳＴスコアの一例を示す図 Diagram showing an example of NIST scores in the embodiment
FIG. 14: 同実施の形態におけるｍＷＥＲスコアの一例を示す図 Diagram showing an example of mWER scores in the embodiment
FIG. 15: 同実施の形態におけるコンピュータシステムの外観一例を示す模式図 Schematic diagram showing an example of the external appearance of the computer system in the embodiment
FIG. 16: 同実施の形態におけるコンピュータシステムの構成の一例を示す図 Diagram showing an example of the configuration of the computer system in the embodiment

Explanation of symbols

1 コーパス加算装置 Corpus adder
11 第１コーパス記憶部 First corpus storage unit
12 第２コーパス記憶部 Second corpus storage unit
13 第１参照コーパス記憶部 First reference corpus storage unit
14 第２参照コーパス記憶部 Second reference corpus storage unit
15 類似度係数算出部 Similarity coefficient calculation unit
16 第１類似度係数記憶部 First similarity coefficient storage unit
17 第２類似度係数記憶部 Second similarity coefficient storage unit
18 サブセット抽出部 Subset extractor
19 加算部 Adder
20 出力部 Output unit
21 記録媒体 Recording medium

Claims

A corpus adder comprising:
a first corpus storage unit in which a first corpus is stored;
a second corpus storage unit in which a second corpus is stored;
a first similarity coefficient storage unit in which similarity coefficients of subsets in the first corpus are stored;
a second similarity coefficient storage unit in which similarity coefficients of subsets in the second corpus are stored;
a subset extractor that extracts from the second corpus a set of subsets of the second corpus whose representative value is the same as the representative value of the similarity coefficients of the subsets in the first corpus;
an adder that adds the set of subsets of the second corpus extracted by the subset extractor to the first corpus; and
an output unit that outputs the corpus obtained by the adder.

The corpus adder according to claim 1, further comprising:
a first reference corpus storage unit in which a first reference corpus is stored;
a second reference corpus storage unit in which a second reference corpus is stored; and
a similarity coefficient calculation unit that calculates the similarity coefficient of each subset in the first corpus stored in the first corpus storage unit and in the second corpus stored in the second corpus storage unit, using as reference corpora the first reference corpus stored in the first reference corpus storage unit and the second reference corpus stored in the second reference corpus storage unit,
wherein the similarity coefficients of the subsets in the first corpus stored in the first similarity coefficient storage unit and the similarity coefficients of the subsets in the second corpus stored in the second similarity coefficient storage unit are those calculated by the similarity coefficient calculation unit.

The corpus adder according to claim 1, wherein the subset extractor extracts from the second corpus the subsets to be added to the first corpus such that the shape of the similarity coefficient distribution of the corpus resulting from the addition is the same as the shape of the similarity coefficient distribution of the first corpus.

A corpus addition method performed using a first corpus storage unit in which a first corpus is stored, a second corpus storage unit in which a second corpus is stored, a first similarity coefficient storage unit in which similarity coefficients of subsets in the first corpus are stored, a second similarity coefficient storage unit in which similarity coefficients of subsets in the second corpus are stored, a subset extractor, an adder, and an output unit, the method comprising:
an extraction step in which the subset extractor extracts from the second corpus a set of subsets of the second corpus whose representative value is the same as the representative value of the similarity coefficients of the subsets in the first corpus;
an addition step in which the adder adds the set of subsets of the second corpus extracted in the extraction step to the first corpus; and
an output step in which the output unit outputs the corpus obtained in the addition step.

A program for causing a computer to execute:
an extraction step of extracting from a second corpus, using similarity coefficients of subsets in a first corpus stored in a first similarity coefficient storage unit and similarity coefficients of subsets in the second corpus stored in a second similarity coefficient storage unit, a set of subsets of the second corpus whose representative value is the same as the representative value of the similarity coefficients of the subsets in the first corpus;
an addition step of adding the set of subsets of the second corpus extracted in the extraction step to the first corpus; and
an output step of outputting the corpus obtained in the addition step.