JP2018055670A

JP2018055670A - Similar sentence generation method, similar sentence generation program, similar sentence generation apparatus, and similar sentence generation system

Info

Publication number: JP2018055670A
Application number: JP2017096570A
Authority: JP
Inventors: Maki Yamauchi (山内 真樹); Nanami Fujiwara (藤原 菜々美); Masahiro Imaide (今出 昌宏)
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2016-09-27
Filing date: 2017-05-15
Publication date: 2018-04-05
Anticipated expiration: 2037-05-15
Also published as: JP6817556B2

Abstract

[Problem] To provide a similar sentence generation method that can reduce the cost of searching a language model database and can identify similar sentences with high accuracy. [Solution] The similar sentence generation method receives a first sentence; extracts, from a first database, one or more second phrases having the same meaning as a first phrase among the plurality of phrases constituting the first sentence; calculates an N-gram value based on context-dependent values that correspond to the one or more second phrases and are obtained from a second database; extracts, from one or more second sentences in which the first phrase of the first sentence has been replaced with the one or more second phrases, one or more third phrases, each a run of consecutive phrases that includes the second phrase and whose length corresponds to the N-gram value; calculates the appearance frequency of the one or more third phrases in a third database; determines whether the calculated appearance frequency is equal to or greater than a threshold; and, when it is, adopts the one or more second sentences as similar sentences of the first sentence and outputs them to an external device. [Selected drawing] FIG. 5

Description

The present disclosure relates to a similar sentence generation method, a similar sentence generation program, and a similar sentence generation apparatus that generate similar sentences from an original sentence, and to a similar sentence generation system including the similar sentence generation apparatus.

In recent years, machine translation, which translates a sentence in a first language into a sentence in a second language different from the first language, has been researched and developed. Improving the performance of such machine translation requires a bilingual corpus that collects a large number of example sentences usable for translation. For this reason, one or more similar sentences (paraphrases) are generated from a single original sentence.

For example, Patent Document 1 discloses a unified language conversion processing system that transforms a sentence according to a predetermined pattern, calculates an evaluation value with an evaluation function to determine whether the transformation is appropriate, and selects the expression with the highest evaluation value.

Patent Document 2 discloses a natural language processing method that assigns activity-related points to morphemes, increases or decreases those points, and extracts information from text based on the adjusted points.

Patent Document 3 discloses a document processing apparatus that generates a new post-paraphrase example based on a pre-paraphrase example and a post-paraphrase example specified by the user, and outputs a paraphrase sentence created by applying the difference to an analyzed sentence.

Patent Document 1: Japanese Patent No. 3932350
Patent Document 2: JP 2005-339043 A
Patent Document 3: Japanese Patent No. 5060539

However, the more example sentences are available for translation, the better machine translation performs, and further improvement has been needed in generating similar sentences that can be used as example sentences.

The present disclosure solves the above conventional problem, and aims to provide a similar sentence generation method, a similar sentence generation program, a similar sentence generation apparatus, and a similar sentence generation system that can reduce the cost of searching a language model database and can identify similar sentences with high accuracy.

A method according to one aspect of the present disclosure is a method of generating a similar sentence from an original sentence. The method receives a first sentence; extracts, from a first database, one or more second phrases having the same meaning as a first phrase among the plurality of phrases constituting the first sentence, the first database associating phrases with their synonyms; calculates an N-gram value based on context-dependent values that correspond to the one or more second phrases and are obtained from a second database, the second database associating phrases with their context-dependent values, each context-dependent value indicating the degree to which the meaning of the phrase depends on context; extracts, from one or more second sentences in which the first phrase of the first sentence has been replaced with the one or more second phrases, one or more third phrases, each a run of consecutive phrases that includes the second phrase and whose length corresponds to the N-gram value; calculates the appearance frequency of the one or more third phrases in a third database, the third database associating phrases with their appearance frequencies in the third database; determines whether the calculated appearance frequency is equal to or greater than a threshold; and, when it is, adopts the one or more second sentences as similar sentences of the first sentence and outputs them to an external device.

According to the present disclosure, the cost of searching the language model database can be reduced, and similar sentences can be identified with high accuracy.

FIG. 1 is a block diagram showing an example of the configuration of the similar sentence generation apparatus according to Embodiment 1 of the present disclosure.
FIG. 2 is a diagram showing an example of the data structure of the replacement candidate dictionary shown in FIG. 1.
FIG. 3 is a diagram showing an example of the data structure of the context dependency rate dictionary shown in FIG. 1.
FIG. 4 is a diagram showing an example of the data structure of the language model database shown in FIG. 1.
FIG. 5 is a flowchart showing an example of similar sentence generation processing by the similar sentence generation apparatus shown in FIG. 1.
FIG. 6 is a block diagram showing an example of the configuration of the similar sentence generation system according to Embodiment 2 of the present disclosure.
FIG. 7 is a flowchart showing an example of similar sentence generation processing, including feedback data update processing, of the similar sentence generation system shown in FIG. 6.

(Findings underlying the present disclosure)
As described above, the more example sentences are available for translation, the better machine translation performs, and there is demand for automatically expanding the amount of text from a small bilingual corpus by generating similar sentences through phrase replacement. When similar sentences are generated by phrase replacement, whether a replacement is acceptable sometimes depends on context when selecting among similar candidate sentences that contain the expression (phrase) to be replaced.

It is therefore desirable to select replacement rules dynamically based on a language model, so that cases can be learned and reflected while context dependency is taken into account. To do this efficiently, how the similar candidate sentences are selected is the key question.

For example, consider expanding a bilingual corpus and generating similar candidate sentences by replacement (paraphrasing), with a paraphrase rule under which "話せない" ("cannot speak") may be replaced with any of (1) "話せません" (polite form of "cannot speak"), (2) "喋れない" ("cannot talk"), or (3) "秘密です" ("it is a secret"). Applying this rule to the sentence "英語は話せない" ("I cannot speak English") generates three similar candidate sentences: "英語は話せません", "英語は喋れない", and "英語は秘密です" ("English is a secret").

In this case, given the context, "英語は話せません" and "英語は喋れない" can be adopted as similar sentences, but "英語は秘密です" is not an appropriate Japanese expression, so it cannot be adopted and is rejected. Thus, even when the same paraphrase rule is applied, a similar candidate sentence may or may not be adoptable as a similar sentence depending on the context.
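The candidate generation step in this example can be sketched as follows. This is a minimal illustration, not the implementation described in the disclosure: the rule table reuses the phrases from the example, while the function name `generate_candidates` and the plain string replacement are assumptions made for the sketch.

```python
# Hypothetical paraphrase rule table, populated with the example from the text.
PARAPHRASE_RULES = {
    "話せない": ["話せません", "喋れない", "秘密です"],
}


def generate_candidates(sentence, rules):
    """Return every sentence obtained by replacing one rule phrase with one alternative."""
    candidates = []
    for phrase, alternatives in rules.items():
        if phrase in sentence:
            for alternative in alternatives:
                # Replace only the first occurrence so each candidate differs by one phrase.
                candidates.append(sentence.replace(phrase, alternative, 1))
    return candidates


candidates = generate_candidates("英語は話せない", PARAPHRASE_RULES)
print(candidates)
```

All three candidates are produced at this stage, including "英語は秘密です"; rejecting the inappropriate one is the job of the later identification step.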

Conventional methods for distinguishing accepted sentences, which can be adopted as similar sentences, from rejected sentences, which cannot, have judged candidates by their similarity in a distributed representation model using word vectors or sentence vectors, or by their appearance frequency in a language model (for example, an N-gram language model). Specifically, the identification target region (search range) of the language model is enlarged (for example, by increasing N of the N-gram) to judge whether the candidate exists as an expression, and adoption or rejection of context-dependent paraphrase rules (replacement rules) is decided accordingly.

Evaluation by modeling sentence fluency with a language model has also been performed. One such technique uses an N-gram language model and gives a high score to translations and phrases that use expressions occurring frequently in the N-gram language model database, and a low score to those that occur rarely. By applying this technique, a score is calculated for each similar candidate sentence, and threshold processing identifies it as a "good sentence" (an accepted sentence that can be adopted as a similar sentence) or a "bad sentence" (a rejected sentence that cannot).
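As a rough illustration of this conventional scoring approach, the sketch below counts N-grams in a toy corpus and thresholds a fluency score. The corpus, the pre-tokenized input, the averaging formula in `fluency_score`, and the threshold value are all assumptions chosen for the example; the text does not specify them.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_counts(corpus, n):
    """Count every n-gram in a corpus of pre-tokenized sentences."""
    counts = Counter()
    for tokens in corpus:
        counts.update(ngrams(tokens, n))
    return counts


def fluency_score(tokens, counts, n):
    """Average n-gram count of a sentence (an assumed, simplistic score)."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return sum(counts[g] for g in grams) / len(grams)


# Toy corpus, invented for illustration only.
toy_corpus = [
    ["英語", "は", "話せません"],
    ["日本語", "は", "話せません"],
    ["英語", "は", "喋れない"],
    ["それ", "は", "秘密です"],
]
bigram_counts = build_counts(toy_corpus, 2)
THRESHOLD = 2.0  # assumed value

good = fluency_score(["英語", "は", "話せません"], bigram_counts, 2)
bad = fluency_score(["英語", "は", "秘密です"], bigram_counts, 2)
print(good >= THRESHOLD, bad >= THRESHOLD)
```

Note that with 2-grams the bad candidate still earns a non-zero score, because "は 秘密です" occurs in the corpus; separating such cases reliably is what pushes conventional methods toward larger N.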

However, enlarging the identification target region increases the amount of data and computation and makes the data distribution sparse, so searching a large identification target region for every replacement candidate is costly in both data and computation. For example, a 2-gram model has about 80 million entries, whereas a 5-gram model has about 800 million. Increasing N of the N-gram thus causes the amount of data and computation to grow dramatically.

To solve this problem, the present disclosure provides, for example, a context-dependent value storage unit that stores a plurality of replacement candidate character strings, each in association with a context-dependent value representing the degree to which that string depends on context. For similar sentences in which the acceptability of a replacement varies with context, whether to consult a language model that includes the words before and after the replacement candidate character string is decided according to whether the acceptability of the replacement varies with context.

That is, the search range (identification target region) for the language model database is determined according to the context-dependent value, and the database is searched using that range. Only replacement candidate character strings regarded as having a high context-dependent value are identified using a larger search region, while those with a low context-dependent value are identified using a smaller one, which balances search cost against identification accuracy.
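One way to picture this mapping from context-dependent value to search range is the sketch below. It assumes the context-dependent value lies in [0, 1] and maps it linearly onto N-gram orders between 2 and 5; the range, the linear rule, and the example values are assumptions for illustration, since this passage does not fix a formula.

```python
def choose_n(context_dependence, min_n=2, max_n=5):
    """Map a context-dependent value in [0, 1] to an N-gram order (assumed linear rule)."""
    if not 0.0 <= context_dependence <= 1.0:
        raise ValueError("context-dependent value must be in [0, 1]")
    # Round half up so the mapping is easy to predict.
    return min_n + int(context_dependence * (max_n - min_n) + 0.5)


# Hypothetical context-dependent values for the replacement candidates in the example.
context_values = {"話せません": 0.1, "喋れない": 0.1, "秘密です": 0.9}
orders = {phrase: choose_n(value) for phrase, value in context_values.items()}
print(orders)
```

Under these assumed values, only "秘密です" triggers the expensive 5-gram lookup, while the two low-dependence candidates are checked against the much smaller 2-gram table.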

In addition, with conventional similar sentence generation methods, expressions that are not contained in the distributed representation or the language model cannot be identified in the first place and are therefore rejected. For example, if the training data contains no sentence including the phrase "それは秘密です" ("that is a secret"), a similar candidate sentence containing "それは秘密です" cannot be identified and is rejected.

To solve this problem, in the present disclosure, for example, when a context-dependent replacement candidate character string is supplied by external input (for example, feedback from a user or a given device), the language model database, the context-dependent value storage unit, and the like are updated. When a new sentence expression is input, the context-dependent value of the relevant word in the context-dependent value storage unit is changed according to that expression, and N-grams or the like containing the new expression are partially constructed so that the expression is reflected in the language model. By adding correct data in this way, the appearance frequencies and the like in the language model covering the words before and after the replacement character string are adjusted, and the context-dependent value storage unit itself is also updated in response to external input.
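The update described here could look roughly like the following sketch. It is an assumption-laden illustration: the function `apply_feedback`, the caller-supplied `delta`, and the clamping to [0, 1] are not taken from the text, which does not state by how much or in which direction the context-dependent value changes.

```python
from collections import Counter


def apply_feedback(tokens, phrase, counts, context_values, n, delta):
    """Reflect a new sentence expression in the n-gram counts and the context-dependent value.

    tokens: the new expression, pre-tokenized
    phrase: the replacement candidate the feedback concerns
    delta:  signed adjustment to the context-dependent value (assumed interface)
    """
    # Partially construct n-grams: add only the windows that contain the phrase.
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if phrase in gram:
            counts[gram] += 1
    # Adjust the stored context-dependent value and keep it within [0, 1].
    current = context_values.get(phrase, 0.0)
    context_values[phrase] = min(1.0, max(0.0, current + delta))


counts = Counter()
context_values = {"秘密です": 0.5}  # hypothetical starting value
apply_feedback(["それ", "は", "秘密です"], "秘密です", counts, context_values, n=3, delta=0.25)
print(counts, context_values)
```

After this call the 3-gram "それ は 秘密です" exists in the counts, so a later candidate containing it is no longer rejected merely because it was missing from the training data.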

As described above, feeding back external knowledge and new knowledge to update the language model database and the like improves identification accuracy. As a result, similar candidate sentences can be identified accurately at low cost, and because the databases can be updated to handle expressions not present in the N-gram model database, identification of similar candidate sentences becomes highly efficient and autonomous.

Based on the above findings, the inventors studied intensively how similar sentences should be generated from an original sentence, and thereby completed the present disclosure.

A method according to one aspect of the present disclosure is a method of generating a similar sentence from an original sentence. The method receives a first sentence; extracts, from a first database, one or more second phrases having the same meaning as a first phrase among the plurality of phrases constituting the first sentence, the first database associating phrases with their synonyms; calculates an N-gram value based on context-dependent values that correspond to the one or more second phrases and are obtained from a second database, the second database associating phrases with their context-dependent values, each context-dependent value indicating the degree to which the meaning of the phrase depends on context; extracts, from one or more second sentences in which the first phrase of the first sentence has been replaced with the one or more second phrases, one or more third phrases, each a run of consecutive phrases that includes the second phrase and whose length corresponds to the N-gram value; calculates the appearance frequency of the one or more third phrases in a third database, the third database associating phrases with their appearance frequencies in the third database; determines whether the calculated appearance frequency is equal to or greater than a threshold; and, when it is, adopts the one or more second sentences as similar sentences of the first sentence and outputs them to an external device.

With this configuration, a first sentence is received; one or more second phrases having the same meaning as a first phrase among the plurality of phrases constituting the first sentence are extracted from a first database that associates phrases with their synonyms; an N-gram value is calculated based on context-dependent values that correspond to the one or more second phrases and are obtained from a second database, which associates phrases with context-dependent values indicating the degree to which the meaning of each phrase depends on context; one or more third phrases, each a run of consecutive phrases that includes the second phrase and whose length corresponds to the N-gram value, are extracted from one or more second sentences in which the first phrase of the first sentence has been replaced with the one or more second phrases; the appearance frequency of the one or more third phrases is calculated in a third database that associates phrases with their appearance frequencies; whether the calculated appearance frequency is equal to or greater than a threshold is determined; and, when it is, the one or more second sentences are adopted as similar sentences of the first sentence and output to an external device. Consequently, only second phrases with high context-dependent values are identified using a large search region, while second phrases with low context-dependent values are identified using a small one. This reduces the cost of searching the third database, which is the language model database, and allows similar sentences to be identified with high accuracy.
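The overall flow can be sketched end to end as below. This is a simplified illustration under stated assumptions, not the claimed implementation: the three dictionaries stand in for the first, second, and third databases and hold invented values; sentences are pre-tokenized; `n_for` uses an assumed two-level rule; and when several windows contain the replaced phrase, the sketch takes the maximum count, an aggregation the text does not specify.

```python
from collections import Counter

# Stand-ins for the three databases (all values are hypothetical).
SYNONYMS = {"話せない": ["話せません", "喋れない", "秘密です"]}        # first database
CONTEXT_VALUES = {"話せません": 0.0, "喋れない": 0.0, "秘密です": 1.0}  # second database
NGRAM_COUNTS = Counter({                                             # third database
    ("は", "話せません"): 120,
    ("は", "喋れない"): 45,
    ("は", "秘密です"): 80,
    ("それ", "は", "秘密です"): 30,
})
THRESHOLD = 10  # assumed value


def n_for(context_value):
    """Assumed rule: context-dependent phrases get a 3-gram check, others a 2-gram check."""
    return 3 if context_value >= 0.5 else 2


def windows_containing(tokens, index, n):
    """All runs of n consecutive tokens that include the token at position index."""
    first_start = max(0, index - n + 1)
    last_start = min(index, len(tokens) - n)
    return [tuple(tokens[s:s + n]) for s in range(first_start, last_start + 1)]


def similar_sentences(tokens, first_phrase):
    """Return candidate sentences whose n-gram around the replacement is frequent enough."""
    index = tokens.index(first_phrase)
    accepted = []
    for second_phrase in SYNONYMS.get(first_phrase, []):
        candidate = tokens[:index] + [second_phrase] + tokens[index + 1:]
        n = n_for(CONTEXT_VALUES.get(second_phrase, 0.0))
        third_phrases = windows_containing(candidate, index, n)
        frequency = max((NGRAM_COUNTS[g] for g in third_phrases), default=0)
        if frequency >= THRESHOLD:
            accepted.append("".join(candidate))
    return accepted


result = similar_sentences(["英語", "は", "話せない"], "話せない")
print(result)
```

With these toy values, "英語は秘密です" would pass a 2-gram check, since "は 秘密です" has a count of 80, but it fails the 3-gram check applied because of its high context-dependent value. The two low-dependence candidates never touch the 3-gram entries at all.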

The first sentence may be written in a first language and included in a bilingual corpus that contains multiple pairs of a sentence written in the first language and its translation written in a second language. When the calculated appearance frequency is determined to be equal to or greater than the threshold, the one or more second sentences may be added to the bilingual corpus as similar sentences of the first sentence.

With this configuration, similar sentences can be added to the bilingual corpus.
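A minimal sketch of this corpus extension follows. It assumes the corpus is a list of (source, translation) pairs and that each adopted similar sentence is paired with the translation already attached to the first sentence; the function name `extend_corpus` and the English translation in the sample data are invented for illustration.

```python
def extend_corpus(corpus, first_sentence, adopted_sentences):
    """Return a new corpus where each adopted sentence shares the first sentence's translation."""
    translations = [target for source, target in corpus if source == first_sentence]
    if not translations:
        raise KeyError(first_sentence)
    existing = set(corpus)
    extended = list(corpus)
    for sentence in adopted_sentences:
        pair = (sentence, translations[0])
        if pair not in existing:  # avoid adding the same pair twice
            extended.append(pair)
            existing.add(pair)
    return extended


# Hypothetical one-pair corpus.
corpus = [("英語は話せない", "I cannot speak English.")]
extended = extend_corpus(corpus, "英語は話せない", ["英語は話せません", "英語は喋れない"])
print(extended)
```

The function returns a new list and leaves the input corpus untouched, which makes it easy to compare corpus sizes before and after expansion.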

The third database may include an N-gram language model database. N of the N-gram language model may be set to i (a positive integer) according to the context-dependent value, the appearance frequency of an i-gram containing the second phrase may be obtained by consulting the third database, and whether to adopt the one or more second sentences as similar sentences of the first sentence may be determined based on the appearance frequency of that i-gram.

With this configuration, N of the N-gram language model is set to i (a positive integer) according to the context-dependent value, the appearance frequency of an i-gram containing the second phrase is obtained by consulting the N-gram language model database, and whether to adopt the one or more second sentences as similar sentences of the first sentence is determined based on that frequency. By setting i larger for larger context-dependent values and smaller for smaller ones, a wide identification target region is used for second phrases with high context dependency, so the appearance frequency of i-grams containing such phrases can be obtained with high accuracy. A narrow identification target region is used for second phrases with low context dependency, so the appearance frequency of i-grams containing such phrases can be obtained at low cost and with high accuracy. Similar sentences can therefore be identified efficiently and accurately.

A translation result sentence may be created by translating a given translation target sentence using a translation model generated from the one or more second sentences determined to be adopted as similar sentences of the first sentence and from a translation, in a second language, of the first sentence from which those second sentences were generated. The translation result sentence may be evaluated, and, based on the evaluation result, feedback information may be generated that includes language information on the language of the translation target sentence and/or the language of the translation result sentence, together with evaluation information for that language information.

With this configuration, a translation result sentence is created by translating a given translation target sentence using a translation model generated from the one or more second sentences determined to be adopted and from the second-language translation of the first sentence from which they were generated; the created translation result sentence is evaluated; and, based on the evaluation result, feedback information is generated that includes language information on the language of the translation target sentence and/or the language of the translation result sentence, together with evaluation information for that language information. Feedback information for learning and reflecting cases that take context dependency into account can therefore be generated autonomously.

At least one of the first database, the second database, and the third database may be updated using the feedback information.

With this configuration, at least one of the first database, the second database, and the third database is updated using feedback information that includes language information and evaluation information. Cases that take context dependency into account can therefore be reflected in at least one of these databases, and similar sentences can be identified efficiently and autonomously, even for expressions that did not exist in the first, second, and third databases before the update.

When the feedback information includes a second phrase having context dependency, the second database and the third database may be updated.

With this configuration, the second database and the third database are updated when the feedback information includes a second phrase having context dependency. Cases that take context dependency into account can therefore be reflected in the second and third databases, and similar sentences can be identified efficiently and autonomously with context dependency taken into account.

When the feedback information includes a new sentence expression, the context-dependent value in the second database may be changed according to that sentence expression.

With this configuration, the context-dependent value in the second database is changed according to the new sentence expression when the feedback information includes one. Similar sentences can therefore be identified efficiently and autonomously, even for new sentence expressions.

When the feedback information includes a new sentence expression, the third database may be updated so as to include that sentence expression.

With this configuration, the third database is updated so as to include the new sentence expression when the feedback information includes one. Similar sentences can therefore be identified efficiently and autonomously, even for new sentence expressions that did not exist in the third database before the update.

The present disclosure can be realized not only as a similar sentence generation method that executes the characteristic processing described above, but also as a computer program that causes a computer to execute the characteristic processing included in such a method. It can also be realized as a similar sentence generation apparatus or the like having characteristic components corresponding to the characteristic processing executed by the method. The other aspects below therefore provide the same effects as the similar sentence generation method described above.

A program according to another aspect of the present disclosure is a program for causing a computer to function as an apparatus that generates a similar sentence from an original sentence. The program causes the computer to execute processing of: receiving a first sentence as input; extracting, from a first database, one or more second phrases having the same meaning as a first phrase among a plurality of phrases constituting the first sentence, the first database associating phrases with synonyms of the phrases included in the first database; calculating an N-gram value based on a context-dependent value that corresponds to the one or more second phrases and is obtained from a second database, the second database associating phrases with the context-dependent values corresponding to the phrases included in the second database, and the context-dependent value indicating the degree to which the meaning of a phrase included in the second database depends on context; extracting, from one or more second sentences in which the first phrase in the first sentence has been replaced with the one or more second phrases, one or more third phrases, each being a contiguous sequence that includes the second phrase and spans the number of phrases corresponding to the N-gram value; calculating, for the one or more third phrases, an appearance frequency in a third database, the third database associating phrases with the appearance frequencies, in the third database, of the phrases included in the third database; determining whether the calculated appearance frequency is equal to or higher than a threshold; and, when the calculated appearance frequency is determined to be equal to or higher than the threshold, adopting the one or more second sentences as similar sentences of the first sentence and outputting them to an external device.

An apparatus according to another aspect of the present disclosure is an apparatus that generates a similar sentence from an original sentence, and includes: an input unit that receives a first sentence; a second phrase extraction unit that extracts, from a first database, one or more second phrases having the same meaning as a first phrase among a plurality of phrases constituting the first sentence, the first database associating phrases with synonyms of the phrases included in the first database; a calculation unit that calculates an N-gram value based on a context-dependent value that corresponds to the one or more second phrases and is obtained from a second database, the second database associating phrases with the context-dependent values corresponding to the phrases included in the second database, and the context-dependent value indicating the degree to which the meaning of a phrase included in the second database depends on context; a third phrase extraction unit that extracts, from one or more second sentences in which the first phrase in the first sentence has been replaced with the one or more second phrases, one or more third phrases, each being a contiguous sequence that includes the second phrase and spans the number of phrases corresponding to the N-gram value; a calculation unit that calculates, for the one or more third phrases, an appearance frequency in a third database, the third database associating phrases with the appearance frequencies, in the third database, of the phrases included in the third database; a determination unit that determines whether the calculated appearance frequency is equal to or higher than a threshold; and an output unit that, when the calculated appearance frequency is determined to be equal to or higher than the threshold, adopts the one or more second sentences as similar sentences of the first sentence and outputs them to an external device.

A system according to another aspect of the present disclosure is a system that generates a similar sentence from an original sentence, and includes: the apparatus described above; a translation unit that translates a predetermined translation target sentence to create a translation result sentence, using a translation model generated from the one or more second sentences that the apparatus has determined to adopt as similar sentences of the first sentence and from a translated sentence obtained by translating, into a second language, the first sentence from which the one or more second sentences were generated; an evaluation unit that evaluates the translation result sentence created by the translation unit; and a generation unit that generates, based on the evaluation result of the evaluation unit, feedback information including language information about the language of the translation target sentence and/or the language of the translation result sentence, and evaluation information for the language information.

This configuration achieves the same effects as the similar sentence generation method described above. In addition, a predetermined translation target sentence is translated into a translation result sentence using a translation model generated from the one or more second sentences determined to be adopted as similar sentences of the first sentence and from the translated sentence obtained by translating, into the second language, the first sentence from which those second sentences were generated. The translation result sentence is then evaluated and, based on the evaluation result, feedback information is generated that includes language information about the language of the translation target sentence and/or the language of the translation result sentence, and evaluation information for that language information. A similar sentence generation system can therefore be realized that autonomously generates feedback information for learning and reflecting cases in which context dependency is taken into account, and that autonomously learns and reflects such cases.

Needless to say, such a computer program can be distributed on a computer-readable non-transitory recording medium such as a CD-ROM, or via a communication network such as the Internet.

The similar sentence generation apparatus or similar sentence generation system according to an embodiment of the present disclosure may also be configured as a system in which some of its components and the remaining components are distributed across a plurality of computers.

Each of the embodiments described below shows one specific example of the present disclosure. The numerical values, shapes, components, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present disclosure. Among the components in the following embodiments, components that are not recited in the independent claims, which express the broadest concept, are described as optional components. The contents of all the embodiments may also be combined with one another.

Hereinafter, embodiments of the present disclosure will be described with reference to the drawings.

(Embodiment 1)
FIG. 1 is a block diagram showing an example of the configuration of a similar sentence generation apparatus according to Embodiment 1 of the present disclosure. The similar sentence generation apparatus 1 shown in FIG. 1 generates a similar sentence from a replacement target sentence (original sentence). The similar sentence generation apparatus 1 includes a replacement target sentence input unit 10, a replacement candidate extraction unit 11, a context dependency rate collation unit 12, a context dependency determination unit 13, a language model collation unit 14, a replacement determination unit 15, a replacement result output unit 16, a replacement candidate dictionary 21, a context dependency rate dictionary 22, and a language model database 23.

The replacement target sentence input unit 10 accepts a predetermined operation input by the user and outputs the replacement target sentence (first sentence) entered by the user to the replacement candidate extraction unit 11. For example, the replacement target sentence "僕は英語が話せないので日本語でお願いします" ("I can't speak English, so please use Japanese") is input to the replacement target sentence input unit 10. The language of the similar sentences generated by the similar sentence generation apparatus 1 is not limited to Japanese and may be another language such as English, Chinese, Korean, French, German, Italian, or Portuguese.

The replacement candidate dictionary 21 is a replacement candidate storage unit that stores, as a dictionary, replacement examples at the level of phrases (bunsetsu), words, morphemes, and the like. It stores in advance one or more replacement candidate character strings that are candidates for replacing a replacement target part of the replacement target sentence. The replacement candidate dictionary 21 is an example of the first database, which associates phrases with synonyms of the phrases included in the replacement candidate dictionary 21.

FIG. 2 is a diagram showing an example of the data structure of the replacement candidate dictionary 21 shown in FIG. 1. As shown in FIG. 2, the replacement candidate dictionary 21 stores replacement target parts (phrases) in association with replacement candidate character strings (synonyms of the phrases). For example, replacement candidate character strings such as "これです" and "これでございます" (polite forms of "this is it") are stored in association with the replacement target part "これだ" ("this is it"), and replacement candidate character strings such as "話せません" ("cannot speak", polite), "しゃべれない" ("cannot talk") and "秘密です" ("it is a secret") are stored in association with the replacement target part "話せない" ("cannot speak").
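
The data structure of FIG. 2 can be pictured as a mapping from a replacement target part to its candidate strings. The following is only a minimal sketch: the in-memory dict and the names REPLACEMENT_CANDIDATES and extract_candidates are hypothetical and stand in for the replacement candidate dictionary 21, whose actual storage format is not specified here.

```python
# Hypothetical in-memory stand-in for the replacement candidate dictionary 21.
# Keys are replacement target parts; values are replacement candidate strings.
REPLACEMENT_CANDIDATES: dict[str, list[str]] = {
    "これだ": ["これです", "これでございます"],
    "話せない": ["話せません", "しゃべれない", "秘密です"],
}


def extract_candidates(target_part: str) -> list[str]:
    """Return the candidate strings stored for a replacement target part.

    An unknown target part yields an empty list (an assumption of this sketch).
    """
    return list(REPLACEMENT_CANDIDATES.get(target_part, []))


if __name__ == "__main__":
    print(extract_candidates("話せない"))
```

Note that a candidate such as "秘密です" is stored even though it only fits some contexts; filtering such candidates is the job of the later context-dependency and language-model steps.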

The replacement candidate extraction unit 11 extracts, from the replacement candidate dictionary 21, replacement candidate character strings (one or more second phrases) having the same meaning as a replacement target part (first phrase) among the plurality of phrases constituting the replacement target sentence (first sentence). Specifically, the replacement candidate extraction unit 11 divides the replacement target sentence received from the replacement target sentence input unit 10 into units such as phrases, words, or morphemes, determines the replacement target part from the divided units, searches the replacement candidate dictionary 21 for the replaceable character strings (replacement candidate character strings) stored in association with the replacement target part, extracts one or more replacement candidate character strings, and outputs them together with the replacement target sentence to the context dependency rate collation unit 12. For example, when the replacement target part is "話せない", the replacement candidate extraction unit 11 extracts replacement candidate character strings such as "話せません", "しゃべれない", and "秘密です" from the replacement candidate dictionary 21. The method of dividing the replacement target sentence is not limited to the above example, and various known methods can be used.

The context dependency rate dictionary 22 is a context-dependent value storage unit that stores, as a dictionary of pairs of a phrase, word, morpheme, or the like and a numerical value, context-dependent values indicating the applicability (context dependency) of a replacement by that phrase, word, morpheme, or the like. Specifically, the context dependency rate dictionary 22 stores in advance a plurality of data pairs, each associating a replacement candidate character string with a context dependency rate pc that represents the degree to which the replacement candidate character string depends on context. The context dependency rate dictionary 22 is an example of the second database, which associates phrases with the context-dependent values corresponding to the phrases included in the context dependency rate dictionary 22; the context-dependent value indicates the degree to which the meaning of a phrase included in the context dependency rate dictionary 22 depends on context.

FIG. 3 is a diagram showing an example of the data structure of the context dependency rate dictionary 22 shown in FIG. 1. As shown in FIG. 3, the context dependency rate dictionary 22 stores in advance, for example, pc = 0.35 for the replacement candidate character string "です", pc = 0.05 for "ですが", pc = 0.25 for "話せません", pc = 0.01 for "しゃべれない", and pc = 0.75 for "秘密です".

Here, the context dependency rate pc is, for example, the probability, expressed in the range 0 to 1, that a similar candidate sentence using the replacement candidate character string is rejected because the replacement candidate character string depends on context. The context-dependent value is not limited to the context dependency rate pc and may be modified in various ways. For example, another numerical value representing the degree to which the replacement candidate character string depends on context may be used, or the degree of context dependency may be divided into classes (for example, high, medium, and low) and the class to which each string belongs may be stored.

The context dependency rate collation unit 12 searches the context dependency rate dictionary 22 for the replacement candidate character string, extracts the context dependency rate pc stored in association with it, and outputs the extracted context dependency rate pc together with the replacement target sentence to the context dependency determination unit 13. For example, the extracted context dependency rate pc is 0.25 when the replacement candidate character string is "話せません", 0.01 when it is "しゃべれない", and 0.75 when it is "秘密です".
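
The lookup performed by the context dependency rate collation unit 12 can be sketched as a plain dictionary access over the values listed for FIG. 3. The names CONTEXT_DEPENDENCY_RATE and lookup_pc are hypothetical, and returning None for an unregistered string is an assumption of this sketch, since the behavior for missing entries is not described.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the context dependency rate dictionary 22,
# populated with the example values given for FIG. 3.
CONTEXT_DEPENDENCY_RATE: dict[str, float] = {
    "です": 0.35,
    "ですが": 0.05,
    "話せません": 0.25,
    "しゃべれない": 0.01,
    "秘密です": 0.75,
}


def lookup_pc(candidate: str) -> Optional[float]:
    """Return the context dependency rate pc for a candidate string.

    Returns None when the candidate is not registered (assumption).
    """
    return CONTEXT_DEPENDENCY_RATE.get(candidate)


if __name__ == "__main__":
    for candidate in ("話せません", "しゃべれない", "秘密です"):
        print(candidate, lookup_pc(candidate))
```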

The context dependency determination unit 13 calculates the N-gram value based on the context-dependent value that corresponds to the replacement candidate character string (one or more second phrases) and is obtained from the context dependency rate dictionary 22. Specifically, from the value of the context dependency rate pc, the context dependency determination unit 13 determines the identification target region of the language model database 23 to be referred to when judging a similar candidate sentence that contains the replacement candidate character string, and outputs the determination result together with the replacement target sentence to the language model collation unit 14.

In the present embodiment, a database of an N-gram language model is used as the language model database 23, and the language model database 23 stores data in a table format in which language information is associated with its appearance frequency. The language model database 23 is an example of the third database, which associates phrases with the appearance frequencies, in the language model database 23, of the phrases included in the language model database 23.

FIG. 4 is a diagram showing an example of the data structure of the language model database 23 shown in FIG. 1. As shown in FIG. 4, the language model database 23 stores in advance, in table form, language information in association with its appearance frequency: for example, "234,567,890" for "英語" ("English"), "12,345,670" for "英語は" ("English" followed by the topic particle), "22,222,220" for "英語が" ("English" followed by the subject particle), and "999,001" for "英語が好き" ("like English"). An appearance probability, for example, can also be derived from these appearance frequencies.
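
The table of FIG. 4 and one possible way of deriving a probability from it can be sketched as follows. The text does not say how the appearance probability is computed; the relative-frequency (maximum-likelihood) estimate count(n-gram) / count(history) used here is a common choice and is an assumption of this sketch, as are the names NGRAM_COUNTS, frequency, and conditional_probability.

```python
# Hypothetical in-memory stand-in for the language model database 23,
# populated with the example counts given for FIG. 4.
NGRAM_COUNTS: dict[str, int] = {
    "英語": 234_567_890,
    "英語は": 12_345_670,
    "英語が": 22_222_220,
    "英語が好き": 999_001,
}


def frequency(ngram: str) -> int:
    """Return the stored appearance frequency, or 0 if the n-gram is absent."""
    return NGRAM_COUNTS.get(ngram, 0)


def conditional_probability(ngram: str, history: str) -> float:
    """Relative-frequency estimate count(ngram) / count(history).

    This estimator is an assumption; the source only states that a
    probability can be obtained from the appearance frequencies.
    """
    denominator = frequency(history)
    if denominator == 0:
        return 0.0
    return frequency(ngram) / denominator


if __name__ == "__main__":
    # Estimated probability that "が" follows "英語" under this sketch.
    print(conditional_probability("英語が", "英語"))
```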

The information stored in the language model database 23 is not limited to the above example and may have any content, as long as it is a table in which language information is associated with a value that depends on its appearance frequency or the like. The language model of the language model database 23 is likewise not limited to the N-gram language model, and other language models may be used.

When the language model database 23 is a database of an N-gram language model, the context dependency determination unit 13 sets N (the N-gram value) of the N-gram language model of the language model database 23 to a positive integer i according to the context dependency rate pc. Specifically, for example, the context dependency determination unit 13 classifies the context dependency rate pc into four classes, with 0 ≤ pc ≤ 0.25 as class 1, 0.25 < pc ≤ 0.5 as class 2, 0.5 < pc ≤ 0.75 as class 3, and 0.75 < pc ≤ 1 as class 4, and determines N (a positive integer) of the N-gram as N = 4 for class 1, N = 5 for class 2, N = 6 for class 3, and N = 7 for class 4.
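
The four-class mapping from pc to N described above can be sketched as a small function. The function name ngram_order and the decision to raise an error for values outside the range 0 to 1 are assumptions of this sketch; the class boundaries and the resulting N values are those stated in the text.

```python
def ngram_order(pc: float) -> int:
    """Map a context dependency rate pc in [0, 1] to the N-gram value N.

    Class 1: 0    <= pc <= 0.25 -> N = 4
    Class 2: 0.25 <  pc <= 0.5  -> N = 5
    Class 3: 0.5  <  pc <= 0.75 -> N = 6
    Class 4: 0.75 <  pc <= 1    -> N = 7
    """
    if not 0.0 <= pc <= 1.0:
        # Rejecting out-of-range input is an assumption of this sketch.
        raise ValueError(f"pc must be in [0, 1], got {pc}")
    if pc <= 0.25:
        return 4
    if pc <= 0.5:
        return 5
    if pc <= 0.75:
        return 6
    return 7


if __name__ == "__main__":
    for pc in (0.01, 0.25, 0.35, 0.75):
        print(pc, ngram_order(pc))
```

A more context-dependent candidate is thus checked against longer N-grams, while a weakly context-dependent one needs only a short window, which keeps the number of database lookups small.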

For example, for the replacement candidate character string "話せません", the context dependency rate pc is 0.25, which belongs to class 1, so the context dependency determination unit 13 determines the N-gram corresponding to class 1, that is, N = 4, as the identification target region of the language model database 23. The criterion for determining the identification target region is not limited to the above example and may be modified in various ways; for instance, the identification target region may be expressed directly as a formula in the context dependency rate pc. For example, with N = floor(k − log2(pc)), where k is a constant, the replacement candidate character string "話せません" has a context dependency rate pc of 0.25, and with the constant k = 6 this gives N = 4.
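
The closed-form alternative can be checked numerically with a short sketch. Since log2(0.25) = −2, the expression floor(k − log2(pc)) taken literally evaluates to 8 for k = 6 and pc = 0.25, whereas the stated result N = 4 is what floor(k + log2(pc)) yields. The sketch therefore exposes the sign of the logarithm term as a parameter rather than assuming which form was intended; the function name ngram_order_from_formula and the sign parameter are hypothetical.

```python
import math


def ngram_order_from_formula(pc: float, k: float, sign: int = -1) -> int:
    """Evaluate floor(k + sign * log2(pc)) for 0 < pc <= 1.

    sign = -1 reproduces the expression as written, floor(k - log2(pc)).
    sign = +1 gives floor(k + log2(pc)), which grows with pc.
    The sign parameter is an illustrative assumption, not part of the source.
    """
    if not 0.0 < pc <= 1.0:
        raise ValueError(f"pc must be in (0, 1], got {pc}")
    if sign not in (-1, 1):
        raise ValueError("sign must be -1 or +1")
    return math.floor(k + sign * math.log2(pc))


if __name__ == "__main__":
    print(ngram_order_from_formula(0.25, k=6, sign=-1))  # expression as written
    print(ngram_order_from_formula(0.25, k=6, sign=+1))  # logarithm term added
```

With the logarithm term added, N increases with pc, which matches the direction of the four-class example (larger pc, larger N).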

The language model database 23 is not limited to an N-gram language model and may be a database based on other language resources. For example, it may be a language model described with distributed representations such as real values or vectors, and it can be built by combining any existing methods and existing data. In either case, the range of the database to be searched as the identification target region can be defined by an arbitrary variable, and that variable can be determined according to the context dependency rate pc.

The language model collation unit 14 takes a replacement sentence (one or more second sentences), in which the replacement target part (first phrase) of the replacement target sentence (first sentence) has been replaced with the replacement candidate character string (one or more second phrases), extracts from it the contiguous N-grams (one or more third phrases) that include the replacement candidate character string (second phrase) and whose length corresponds to the N-gram value, and calculates the appearance frequency of those N-grams (one or more third phrases) in the language model database 23.
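
Extracting the N-grams that contain the replaced phrase amounts to sliding a window of length N over the tokenized replacement sentence and keeping only the windows that cover the replaced position. The following is a minimal sketch; the function name ngrams_containing and the tokenization of the example sentence are hypothetical, since the segmentation method is left open.

```python
def ngrams_containing(tokens: list[str], index: int, n: int) -> list[tuple[str, ...]]:
    """Return every contiguous n-gram of `tokens` that includes tokens[index].

    Returns an empty list when the sentence is shorter than n.
    """
    if not 0 <= index < len(tokens):
        raise IndexError("index out of range")
    if n <= 0:
        raise ValueError("n must be positive")
    first_start = max(0, index - n + 1)       # leftmost window still covering index
    last_start = min(index, len(tokens) - n)  # rightmost window that fits the sentence
    return [tuple(tokens[s:s + n]) for s in range(first_start, last_start + 1)]


if __name__ == "__main__":
    # Hypothetical segmentation of the replacement sentence in which
    # "話せない" has been replaced with "話せません".
    tokens = ["僕", "は", "英語", "が", "話せません", "ので", "日本", "語", "で", "お願いします"]
    for gram in ngrams_containing(tokens, tokens.index("話せません"), 4):
        print("".join(gram))
```

Only windows overlapping the replaced phrase are looked up, so at most N lookups are needed per candidate regardless of sentence length.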

That is, the language model collation unit 14 collates the replacement candidate character string against the language model database 23 by searching for and extracting the identification target data that corresponds to the identification target region determined by the context dependency determination unit 13. It generates pair data consisting of language information (made up of phrases, words, morphemes, or the like) related to the replacement candidate character string and a value corresponding to the appearance frequency or appearance probability of that language information, and outputs the pair data together with the replacement target sentence to the replacement determination unit 15.

Specifically, the language model collation unit 14 uses the value of N given by the context dependency determination unit 13 as the size of the identification target region to be referred to, acquires from the language model database 23 the appearance frequency or appearance probability of the N-grams (for example, 4-grams when the replacement candidate character string belongs to class 1), and outputs the collated replacement candidate character string and the acquired appearance frequency or appearance probability to the replacement determination unit 15.

The replacement determination unit 15 uses the pair data obtained from the language model collation unit 14, consisting of language information made up of phrases, words, morphemes, or the like and a value corresponding to the appearance frequency or appearance probability of that language information, to decide whether to apply the corresponding replacement candidate character string to the replacement target sentence or to reject it, and outputs this replacement result together with the replacement target sentence to the replacement result output unit 16.

As one example of this decision method, the replacement determination unit 15 determines whether the calculated appearance frequency is equal to or higher than a threshold. Specifically, let nj be the appearance frequency of the j-th piece of language information (j being an arbitrary integer) and Th be a predetermined threshold. If nj > Th for all j, the replacement determination unit 15 decides to apply the replacement candidate character string to the replacement target sentence; otherwise it decides to reject it.

For example, suppose 4-grams are used as the N-grams and, for the replacement candidate character string "話せません", the following language information and appearance frequencies are acquired: "51,550" for "は英語が話せません", "1,720" for "英語が話せませんので", "530" for "が話せませんので日本", and "3,220" for "話せませんので日本語". With Th = 500, the appearance frequency is equal to or higher than the threshold Th for all of j = 1 to 4, so the replacement candidate character string "話せません" is judged to be applicable.
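
The worked example above can be reproduced with a short check. This sketch uses the "equal to or higher than the threshold" comparison; the function name should_apply and the choice to reject a candidate for which no N-gram could be extracted are assumptions of this sketch.

```python
def should_apply(frequencies: list[int], threshold: int) -> bool:
    """Apply the candidate only if every n-gram frequency reaches the threshold.

    An empty frequency list is rejected (assumption of this sketch).
    """
    if not frequencies:
        return False
    return all(n_j >= threshold for n_j in frequencies)


if __name__ == "__main__":
    # 4-gram frequencies quoted in the text for the candidate "話せません".
    observed = {
        "は英語が話せません": 51_550,
        "英語が話せませんので": 1_720,
        "が話せませんので日本": 530,
        "話せませんので日本語": 3_220,
    }
    print(should_apply(list(observed.values()), threshold=500))
```

Requiring every window to pass means a single implausible N-gram is enough to reject the candidate, which is what makes a strongly context-dependent string such as "秘密です" unlikely to survive in this sentence.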

The method of deciding whether to apply the replacement candidate character string to the replacement target sentence or to reject it is not limited to the above example and may be modified in various ways. For example, the decision may be made according to the distribution of nj (for instance, rejecting the lowest 3% of the 4-gram appearance frequencies), according to whether there is any j for which nj = 0, or according to a value calculated from an arbitrary expression using nj.

When the calculated appearance frequency is determined to be equal to or higher than the threshold, the replacement result output unit 16 adopts the replacement sentence (one or more second sentences) generated with the replacement candidate character string judged applicable as a similar sentence of the replacement target sentence (first sentence), and outputs it to an external device. Specifically, based on the replacement result, the replacement result output unit 16 replaces the replacement target part of the replacement target sentence with the replacement candidate character string judged applicable by the replacement determination unit 15, adopts the resulting replacement sentence (the sentence after replacement) as a similar sentence, and outputs the generated similar sentence to an external device (not shown) or the like.

The replacement target sentence (first sentence) may be written in a first language (for example, Japanese) and included in a parallel corpus that contains a plurality of pairs of a sentence written in the first language and its translation written in a second language (for example, English). In that case, when the calculated appearance frequency is determined to be equal to or higher than the threshold, the replacement result output unit 16 may add the replacement sentence (one or more second sentences) generated with the replacement candidate character string judged applicable to the parallel corpus as a similar sentence of the replacement target sentence (first sentence).
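
Adding adopted similar sentences to a parallel corpus can be sketched as follows. The text states only that the similar sentence is added to the corpus; pairing it with the translation of the original sentence is an assumption of this sketch, as are the function name augment_corpus, the tuple representation of sentence pairs, and the English example translation.

```python
SentencePair = tuple[str, str]  # (first-language sentence, second-language translation)


def augment_corpus(
    corpus: list[SentencePair],
    original: str,
    similar_sentences: list[str],
) -> list[SentencePair]:
    """Return a new corpus in which each similar sentence is paired with
    every translation already stored for `original`.

    Reusing the original's translation is an assumption of this sketch.
    Pairs that already exist are not added again.
    """
    translations = [target for source, target in corpus if source == original]
    existing = set(corpus)
    added: list[SentencePair] = []
    for sentence in similar_sentences:
        for target in translations:
            pair = (sentence, target)
            if pair not in existing:
                existing.add(pair)
                added.append(pair)
    return corpus + added


if __name__ == "__main__":
    original = "僕は英語が話せないので日本語でお願いします"
    corpus = [(original, "I can't speak English, so please use Japanese.")]
    for pair in augment_corpus(corpus, original, ["僕は英語が話せませんので日本語でお願いします"]):
        print(pair)
```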

The configuration of the similar sentence generation apparatus 1 is not limited to the above example, in which each function is implemented by dedicated hardware. One or more computers or servers (information processing apparatuses) equipped with a CPU (Central Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), an auxiliary storage device, and the like may instead install a similar sentence generation program for executing the above processing and function as the similar sentence generation apparatus. Likewise, the replacement candidate dictionary 21, the context dependency rate dictionary 22, and the language model database 23 need not be provided inside the similar sentence generation apparatus 1; they may be provided on an external server or the like, with the similar sentence generation apparatus 1 acquiring the necessary information via a predetermined network. The same applies to the other embodiments.

Next, the similar sentence generation processing performed by the similar sentence generation device 1 configured as described above will be described in detail. FIG. 5 is a flowchart illustrating an example of the similar sentence generation processing by the similar sentence generation device 1 illustrated in FIG. 1. In the following processing, whether to apply or reject a replacement candidate character string is determined using the appearance frequency, but the processing is not limited to this example. For example, an appearance probability may be used instead. The same applies to the other embodiments.

First, in step S11, the replacement target sentence input unit 10 receives a replacement target sentence (original sentence) input by the user and outputs it to the replacement candidate extraction unit 11.

Next, in step S12, the replacement candidate extraction unit 11 divides the replacement target sentence into units such as phrases (bunsetsu), words, or morphemes, determines a replacement target part from among the divided units, extracts from the replacement candidate dictionary 21 the replacement candidate character string stored in association with the replacement target part, and outputs it together with the replacement target sentence to the context dependency rate collation unit 12.

Next, in step S13, the context dependency rate collation unit 12 consults the context dependency rate dictionary 22, extracts the context dependency rate pc of the replacement candidate character string, and outputs it together with the replacement target sentence to the context dependency determination unit 13.

Next, in step S14, the context dependency determination unit 13 determines N of the N-gram in the language model database 23 from the value of the context dependency rate pc of the replacement candidate character string. In other words, it determines the language model length to be referenced from the context dependency, and outputs the determined value of N together with the replacement target sentence to the language model collation unit 14.

For example, when the replacement target sentence is 「僕は英語が話せないので日本語でお願いします」 ("I cannot speak English, so Japanese please"), the replacement candidate character string is 「話せません」 (a polite form of "cannot speak"), and the replacement candidate sentence is 「僕は英語が話せませんので日本語でお願いします」, the context dependency determination unit 13 determines N = 4 as N of the N-gram in the language model database 23.
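
The choice of N in step S14 can be illustrated with a minimal sketch. This passage does not state how pc is mapped to N, only that a larger pc gives a larger N. The function name `choose_ngram_order`, the assumption that pc lies in [0, 1], the bounds `n_min` and `n_max`, and the linear mapping are all hypothetical choices made for illustration.

```python
def choose_ngram_order(pc: float, n_min: int = 2, n_max: int = 7) -> int:
    """Map a context dependency rate pc to an N-gram order N.

    Assumption (not from the source): pc lies in [0, 1] and N grows
    linearly with pc between the hypothetical bounds n_min and n_max.
    """
    if not 0.0 <= pc <= 1.0:
        raise ValueError("pc must lie in [0, 1]")
    # round() keeps the mapping monotonic: a larger pc never yields a smaller N.
    return n_min + round(pc * (n_max - n_min))
```

Any monotonically non-decreasing mapping would serve the same purpose. A lookup table of pc ranges is an equally plausible realization.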

Next, in step S15, the language model collation unit 14 uses the value of N given by the context dependency determination unit 13 as the size of the identification target region to be referenced, acquires the appearance frequencies of the N-grams from the language model database 23, and outputs the collated replacement candidate character string and the acquired appearance frequencies together with the replacement target sentence to the replacement determination unit 15.

For example, when 「話せない」 in the above replacement target sentence is replaced with 「話せません」, the language model collation unit 14 generates the surrounding 4-grams that contain the replaced phrase 「話せません」 (for example, 「は英語が話せません」, 「英語が話せませんので」, 「が話せませんので日本」, and 「話せませんので日本語」), collates them against the language model database 23, and obtains the appearance frequency of each 4-gram (for example, 51,550 for 「は英語が話せません」, 1,720 for 「英語が話せませんので」, 530 for 「が話せませんので日本」, and 3,220 for 「話せませんので日本語」).
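
The 4-gram example above can be reproduced with a short sketch. The token segmentation below is an assumption chosen so that the windows match the four strings quoted in the text. The function name `ngrams_containing` is hypothetical.

```python
from typing import List


def ngrams_containing(tokens: List[str], index: int, n: int) -> List[List[str]]:
    """Return every window of n consecutive tokens that includes tokens[index]."""
    first_start = max(0, index - n + 1)          # leftmost window still covering index
    last_start = min(index, len(tokens) - n)     # rightmost window that fits in the sentence
    return [tokens[s:s + n] for s in range(first_start, last_start + 1)]


# Assumed segmentation of the replacement candidate sentence.
tokens = ["僕", "は", "英語", "が", "話せません", "ので", "日本", "語", "で", "お願いします"]
four_grams = ["".join(g) for g in ngrams_containing(tokens, tokens.index("話せません"), 4)]
# four_grams == ["は英語が話せません", "英語が話せませんので",
#                "が話せませんので日本", "話せませんので日本語"]
```

At most N windows contain the replaced unit, so the number of lookups against the language model database grows with N. This is why a smaller N for weakly context-dependent candidates reduces the search cost.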

Next, in step S16, the replacement determination unit 15 acquires from the language model collation unit 14 the N-grams containing the replacement candidate character string together with their appearance frequencies, and calculates a score for the replacement candidate character string.

Next, in step S17, the replacement determination unit 15 determines whether the score (appearance frequency) of the replacement candidate character string is equal to or greater than a predetermined threshold Th, thereby determining whether to apply the replacement candidate character string to the replacement target sentence or to reject it, and outputs the determination result together with the replacement target sentence to the replacement result output unit 16.

When it is determined in step S17 that the score (appearance frequency) of the replacement candidate character string is less than the predetermined threshold Th, the replacement result output unit 16 rejects the replacement candidate character string in step S20 and ends the processing.

On the other hand, when it is determined in step S17 that the score (appearance frequency) of the replacement candidate character string is equal to or greater than the predetermined threshold Th, in step S18 the replacement result output unit 16 applies the replacement candidate character string to the replacement target part of the replacement target sentence, creating a replacement sentence in which the replacement target part has been replaced with the replacement candidate character string.

Next, in step S19, the replacement result output unit 16 outputs, as a similar sentence, the replacement sentence generated with the replacement candidate character string determined to be applied, and ends the processing.
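
Steps S16 to S20 can be sketched as follows. This passage does not specify how the individual N-gram frequencies are combined into a single score, so the arithmetic mean used by default here is an assumption, as are the function name `decide_replacement` and the threshold value in the usage note.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def decide_replacement(
    frequencies: Sequence[float],
    threshold: float,
    aggregate: Callable[[Sequence[float]], float] = mean,  # assumed scoring rule
) -> Tuple[bool, float]:
    """Return (apply?, score) for one replacement candidate character string.

    S16: combine the N-gram appearance frequencies into a score.
    S17: apply the candidate if score >= threshold (S18, S19), else reject (S20).
    """
    score = aggregate(frequencies) if frequencies else 0
    return score >= threshold, score
```

With the four frequencies from the example (51,550, 1,720, 530, and 3,220) and a hypothetical threshold Th = 1,000, the mean score is 14,255 and the candidate is applied. Passing `aggregate=min` gives a stricter rule under which the same candidate scores 530 and is rejected.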

Through the above processing, in the present embodiment, N of the N-gram language model is determined according to the context dependency rate pc: the larger the context dependency rate pc, the larger N is set, and the smaller the context dependency rate pc, the smaller N is set. The determined N is then used to collate the language model database 23 and obtain the appearance frequency of the N-grams containing the replacement candidate character string, and whether to adopt the replacement sentence generated with the replacement candidate character string as a similar sentence is determined based on the obtained appearance frequency. Consequently, for a replacement candidate character string with a large context dependency rate pc, the appearance frequency of the N-grams containing it can be obtained with high accuracy using a wide identification target region, and for a replacement candidate character string with a small context dependency rate pc, the appearance frequency of the N-grams containing it can be obtained at low cost and with high accuracy using a narrow identification target region. As a result, the search cost for the language model database 23 can be reduced, and similar sentences can be identified with high accuracy.

(Embodiment 2)
FIG. 6 is a block diagram illustrating an example of the configuration of a similar sentence generation system according to Embodiment 2 of the present disclosure. The similar sentence generation system shown in FIG. 6 includes a similar sentence generation device 1a and a translation device 2.

The similar sentence generation device 1a includes a replacement target sentence input unit 10a, a replacement candidate extraction unit 11, a context dependency rate collation unit 12, a context dependency determination unit 13, a language model collation unit 14, a replacement determination unit 15, a replacement result output unit 16, a data update unit 17, a replacement candidate dictionary 21, a context dependency rate dictionary 22, and a language model database 23. The translation device 2 includes a parallel corpus generation unit 31, a translation model generation unit 32, a translated sentence input unit 33, a machine translation unit 34, a translation result sentence output unit 35, a translation result evaluation unit 36, and a feedback data generation unit 37.

The similar sentence generation device 1a generates similar sentences from a replacement target sentence (original sentence) and outputs the similar sentences it has determined to adopt, and related data, to the translation device 2. The translation device 2 uses a translation model, generated from the similar sentences that the similar sentence generation device 1a determined to adopt and from a translation into a predetermined language of the original sentence from which those similar sentences were generated, to translate an arbitrary translation target sentence and create a translation result sentence. Based on an evaluation of the created translation result sentence, the translation device 2 generates feedback information containing language information on at least one of the language of the translation target sentence and the language of the translation result sentence, together with evaluation information for that language information, and feeds it back to the similar sentence generation device 1a. Based on the feedback information, the similar sentence generation device 1a updates the data of at least one of the replacement candidate dictionary 21, the context dependency rate dictionary 22, and the language model database 23.

The similar sentence generation device 1a shown in FIG. 6 differs from the similar sentence generation device 1 shown in FIG. 1 in two respects: a data update unit 17 that updates the data of the replacement candidate dictionary 21, the context dependency rate dictionary 22, and the language model database 23 is added, and the replacement target sentence input unit 10a, in addition to accepting the replacement target sentence, outputs a translation of the input replacement target sentence (original sentence) to the translation device 2. The other points are the same, so identical parts are given identical reference numerals and their detailed description is omitted.

The replacement target sentence input unit 10a receives a predetermined operation input by the user and outputs the replacement target sentence input by the user to the replacement candidate extraction unit 11. The subsequent processing of the replacement target sentence, from the replacement candidate extraction unit 11 through the replacement result output unit 16, is the same as the processing from the replacement candidate extraction unit 11 through the replacement result output unit 16 shown in FIG. 1. The replacement result output unit 16 outputs the replacement sentence (similar sentence), generated with the replacement candidate character string that the replacement determination unit 15 determined to be applied, to the parallel corpus generation unit 31.

The replacement target sentence input unit 10a also receives a predetermined operation input by the user and outputs to the parallel corpus generation unit 31 a translation, input by the user, of the original sentence from which the replacement sentence is generated into a predetermined language, that is, the translation of the replacement target sentence (the parallel translation corresponding to the original sentence). For example, when the replacement sentence is written in Japanese (source language sentence) and the translation device 2 performs Japanese-to-English translation, the translation is written in English (target language sentence). The source language and the target language are not limited to this example. When the similar sentence generation device 1a generates English similar sentences, English may be the source language and Japanese the target language, and other languages such as Chinese, Korean, French, German, Italian, and Portuguese may also be used.

The parallel corpus generation unit 31 associates the replacement sentence output from the replacement result output unit 16 with the translation of the replacement target sentence output from the replacement target sentence input unit 10a, generates a new parallel corpus, and outputs it to the translation model generation unit 32. The method of generating the parallel corpus is not limited to this example. A new parallel corpus may be added to an already created parallel corpus, and various known methods can be used.
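
The pairing performed by the parallel corpus generation unit 31 can be sketched as follows, for the variant in which new pairs are added to an existing corpus. The corpus representation as a list of (source, target) tuples, the function name `extend_parallel_corpus`, and the English sentence in the usage lines are hypothetical and serve only as illustration.

```python
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]  # (source language sentence, target language sentence)


def extend_parallel_corpus(
    corpus: Sequence[SentencePair],
    similar_sentences: Sequence[str],
    translation_of_original: str,
) -> List[SentencePair]:
    """Pair each adopted similar sentence with the translation of the original sentence."""
    result = list(corpus)
    for sentence in similar_sentences:
        pair = (sentence, translation_of_original)
        if pair not in result:  # avoid adding the same pair twice
            result.append(pair)
    return result


# Hypothetical usage: the English side is an illustrative translation only.
english = "I cannot speak English, so Japanese please."
corpus = [("僕は英語が話せないので日本語でお願いします", english)]
corpus = extend_parallel_corpus(corpus, ["僕は英語が話せませんので日本語でお願いします"], english)
```

Because every similar sentence shares the translation of its original, the corpus grows on the source side without requiring any new human translation.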

The translation model generation unit 32 uses the new parallel corpus generated by the parallel corpus generation unit 31 to generate a translation model by predetermined learning, and outputs the translation model to the machine translation unit 34. Various known methods can be used to generate the translation model, so a detailed description is omitted.

The translated sentence input unit 33 receives a predetermined operation input by the user and outputs the translation target sentence (source language sentence) input by the user to the machine translation unit 34. The machine translation unit 34 translates the translation target sentence using the translation model generated by the translation model generation unit 32, and outputs the translation result sentence (target language sentence) together with the translation target sentence to the translation result sentence output unit 35. The translation result sentence output unit 35 outputs, as the translation result, the translation result sentence together with the translation target sentence to the translation result evaluation unit 36.

The translation result evaluation unit 36 evaluates the translation accuracy and quality of the translation result sentence (target language sentence) output from the translation result sentence output unit 35. The evaluation may be performed using a mechanical numerical index, or a manual evaluation result may be input to the translation result evaluation unit 36. As the evaluation result, the translation result evaluation unit 36 outputs evaluation information, such as an evaluation value or an evaluation category, to the feedback data generation unit 37 in association with the translation result sentence (target language sentence) and/or the translation target sentence (source language sentence).

Based on the evaluation result output from the translation result evaluation unit 36, the feedback data generation unit 37 generates, as feedback information, feedback data to be fed back to the similar sentence generation device 1a, and outputs it to the data update unit 17. Here, the feedback data is pair data consisting of arbitrary language information on the source language side and/or the target language side and evaluation information on a value or state relating to that language information. Various kinds of data can be used as the feedback data, for example the following.

For example, when the translation result is poor, the user or a predetermined translation result sentence correction device may correct the translation result sentence (target language sentence) and input a better translation. The feedback data may then be pair data consisting of the language information of the pair of the input translation and the original translation target sentence (source language sentence), and the evaluation information of the translation result state (bad).

Alternatively, when the user or a predetermined translation target sentence correction device corrects the translation target sentence (source language sentence), inputs a translation target sentence with the same meaning but a different expression, and thereby obtains a better translation result sentence, the feedback data may be pair data consisting of the language information of the pair of the original translation target sentence (source language sentence) and the translation target sentence (source language sentence) that gave the good translation result, and the evaluation information of the translation result states (a binary good/bad value).

Alternatively, one or more sentences close to the translation target sentence (source language sentence) may be extracted from the parallel corpus, and the user or a predetermined translated sentence evaluation device may determine an evaluation value (for example, a binary good/bad value) indicating whether each extracted sentence is well-formed in the source language. The evaluation value is assigned to each extracted sentence close to the source language sentence, and the feedback data may be pair data consisting of this evaluation value and the language information indicating the sentence close to the source language sentence.

Alternatively, the machine translation unit 34 may create a plurality of translation result sentences, and the user or a predetermined translated sentence evaluation device may select the more appropriate translation result sentence from among them. The feedback data may then be pair data consisting of the language information of the pair of the selected translation result sentence and the translation result sentences that were not selected, and the evaluation information indicating the selection result for these translation result sentences.
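
The feedback variants listed above all share one shape: language information paired with evaluation information. A minimal sketch of such a record follows. The class name `Feedback`, its field names, and the string labels are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Feedback:
    """One unit of feedback data: language information plus its evaluation."""

    side: str                       # "source" or "target" (assumed labels)
    sentences: Tuple[str, ...]      # language information, e.g. (original, revised)
    evaluation: Tuple[str, ...]     # one label per sentence, e.g. ("bad", "good")

    def __post_init__(self) -> None:
        if self.side not in ("source", "target"):
            raise ValueError("side must be 'source' or 'target'")
        if len(self.sentences) != len(self.evaluation):
            raise ValueError("each sentence needs exactly one evaluation label")


# Second variant above: an original source sentence and a rephrased one
# that translated better, with a binary state for each.
fb = Feedback(
    side="source",
    sentences=("僕は英語が話せないので日本語でお願いします",
               "僕は英語が話せませんので日本語でお願いします"),
    evaluation=("bad", "good"),
)
```

Keeping one evaluation label per sentence lets the same record express a single judged sentence, a bad/good pair, or a selected/unselected set of translation results.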

Based on the feedback data generated by the feedback data generation unit 37 (pair data of language information and evaluation information on a value or state relating to that language information), the data update unit 17 updates the contents of at least one of the replacement candidate dictionary 21, the context dependency rate dictionary 22, and the language model database 23.

When the feedback data includes a replacement candidate character string having context dependency, the data update unit 17 updates the context dependency rate dictionary 22 and the language model database 23. When the feedback data includes a new sentence expression, the data update unit 17 changes the context dependency rate value in the context dependency rate dictionary 22 according to that sentence expression, partially rebuilds the N-grams of the language model database 23 so that they include the new sentence expression, and thereby updates the language model database 23.

Further, when the language information includes information on the source language side and that language information contains information registered in the replacement candidate dictionary 21, the context dependency rate dictionary 22, or the language model database 23, the data update unit 17 updates, adds, or deletes the corresponding information in the replacement candidate dictionary 21, the context dependency rate dictionary 22, or the language model database 23 according to the value or state evaluation information of the corresponding feedback data.

For example, when source-language-side language information with positive value or state evaluation information is fed back, the data update unit 17 changes the values in the language model database 23 that involve that language information in the positive direction, for example by adding a predetermined weight to the appearance frequency to increase its value. On the other hand, when source-language-side language information with negative value or state evaluation information is fed back, the data update unit 17 changes the values in the context dependency rate dictionary 22 that involve that language information in the negative direction, for example by updating the context dependency rate so that the degree of context dependence becomes higher.
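
The two update directions described above can be sketched as follows. The dictionary-based stores, the function name `apply_feedback`, the default `weight` and `rate_step` values, and the cap of the rate at 1.0 are assumptions made for illustration. The source only states that positive feedback raises the appearance frequency and negative feedback raises the context dependency rate.

```python
from typing import Dict


def apply_feedback(
    lm_counts: Dict[str, float],      # stand-in for language model database 23
    context_rates: Dict[str, float],  # stand-in for context dependency rate dictionary 22
    ngram: str,
    phrase: str,
    positive: bool,
    weight: float = 1.0,              # hypothetical "predetermined weight"
    rate_step: float = 0.1,           # hypothetical increment of pc
) -> None:
    """Update the stores in place according to one piece of feedback."""
    if positive:
        # Positive feedback: raise the appearance frequency of the N-gram.
        lm_counts[ngram] = lm_counts.get(ngram, 0.0) + weight
    else:
        # Negative feedback: make the phrase more context dependent,
        # which leads to a larger N (a wider window) on later lookups.
        context_rates[phrase] = min(1.0, context_rates.get(phrase, 0.0) + rate_step)
```

Raising pc after negative feedback does not forbid the replacement outright. It only requires the candidate to be supported by longer, and therefore rarer, N-grams before it is applied again.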

Further, suppose pair data are fed back that consist of the language information of an original translation target sentence (source language sentence) with a poor translation result and of a translation target sentence (source language sentence) with a good translation result, together with the evaluation information of their respective translation result states (bad/good). If the difference of the good-state translation target sentence relative to the bad-state original translation target sentence is not registered in the replacement candidate dictionary 21, the data update unit 17 registers the difference corresponding to the good state in the replacement candidate dictionary 21.

Further, when pair data consisting of the language information of a translation target sentence (source language sentence) with a poor translation result and the evaluation information of the translation result state (bad) are fed back, the data update unit 17 deletes the replacement candidate character string of that poorly translated translation target sentence from the replacement candidate dictionary 21.
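
Registration and deletion in the replacement candidate dictionary 21 can be sketched as follows. The dictionary layout (a mapping from a replacement target part to a set of candidate strings), the token-level diff using the standard library `difflib.SequenceMatcher`, and the function names are assumptions. The source does not state how the difference between the two sentences is computed.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set

CandidateDict = Dict[str, Set[str]]  # replacement target part -> candidate strings


def register_difference(dictionary: CandidateDict,
                        bad_tokens: List[str],
                        good_tokens: List[str]) -> None:
    """Register each span that differs between the bad and the good sentence."""
    matcher = SequenceMatcher(a=bad_tokens, b=good_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            source = "".join(bad_tokens[i1:i2])
            target = "".join(good_tokens[j1:j2])
            # set.add is a no-op when the candidate is already registered.
            dictionary.setdefault(source, set()).add(target)


def delete_candidate(dictionary: CandidateDict, source: str, candidate: str) -> None:
    """Remove a candidate that led to a poor translation result, if present."""
    candidates = dictionary.get(source)
    if candidates is not None:
        candidates.discard(candidate)
        if not candidates:
            del dictionary[source]
```

A token-level diff keeps the registered entries aligned with the units used in step S12. A character-level diff would also work but could register fragments smaller than a word.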

Note that the configurations of the similar sentence generation device 1a and the translation device 2 are not limited to the above example, in which each function is implemented by dedicated hardware. One or more computers or servers (information processing devices) equipped with a CPU, a ROM, a RAM, an auxiliary storage device, and the like may install a program for executing the above processing and thereby function as the similar sentence generation device or the translation device.

Next, the similar sentence generation processing, including the feedback data update processing, performed by the similar sentence generation system configured as described above will be described in detail. FIG. 7 is a flowchart illustrating an example of the similar sentence generation processing, including the feedback data update processing, of the similar sentence generation system shown in FIG. 6. Among the steps shown in FIG. 7, steps identical to those shown in FIG. 5 are given identical reference numerals, and their detailed description is omitted.

First, as the similar sentence generation processing by the similar sentence generation device 1a, in step S11a the replacement target sentence input unit 10a receives the replacement target sentence and the parallel translation corresponding to the original sentence input by the user, outputs the parallel translation to the parallel corpus generation unit 31, and outputs the replacement target sentence to the replacement candidate extraction unit 11. The timing at which the parallel translation is output to the parallel corpus generation unit 31 is not limited to this example. The replacement target sentence input unit 10a may instead output the parallel translation to the parallel corpus generation unit 31 during the processing of step S17.

Next, in steps S12 to S17, the same processing as in steps S12 to S17 shown in FIG. 5 is executed. When it is determined in step S17 that the score (appearance frequency) of the replacement candidate character string is less than the predetermined threshold Th, the replacement result output unit 16 rejects the replacement candidate character string in step S20 and ends the processing.

On the other hand, when it is determined in step S17 that the score (appearance frequency) of the replacement candidate character string is equal to or greater than the predetermined threshold Th, the same processing as in step S18 shown in FIG. 5 is executed in step S18. Then, in step S19, the replacement result output unit 16 outputs the replacement sentence, generated with the replacement candidate character string that the replacement determination unit 15 determined to be applied, to the parallel corpus generation unit 31, and the similar sentence generation processing by the similar sentence generation device 1a ends.

Next, as the feedback data update processing by the translation device 2 and the similar sentence generation device 1a, in step S21 the parallel corpus generation unit 31 associates the replacement sentence output from the replacement result output unit 16 with the parallel translation output from the replacement target sentence input unit 10a, generates a new parallel corpus, and outputs it to the translation model generation unit 32.

Next, in step S22, the translation model generation unit 32 uses the new parallel corpus generated by the parallel corpus generation unit 31 to generate a translation model by learning, and outputs it to the machine translation unit 34.

Next, in step S23, the translated sentence input unit 33 receives a translation target sentence input by the user and outputs the arbitrary translation target sentence that the user wishes to have translated to the machine translation unit 34.

Next, in step S24, the machine translation unit 34 translates the translation target sentence into a translation result sentence using the translation model generated by the translation model generation unit 32, and outputs the translation result sentence together with the translation target sentence to the translation result sentence output unit 35.

Next, in step S25, the translation result sentence output unit 35 outputs the translation result sentence together with the translation target sentence to the translation result evaluation unit 36.

Next, in step S26, the translation result evaluation unit 36 evaluates the translation accuracy and quality of the translation result sentence output from the translation result sentence output unit 35, and outputs the evaluation result, such as an evaluation value or an evaluation category, to the feedback data generation unit 37 in association with the translation result sentence.

Next, in step S27, the feedback data generation unit 37 generates feedback data from the evaluation result output from the translation result evaluation unit 36, and outputs it to the data update unit 17.

Finally, in step S28, the data update unit 17 updates the contents of at least one of the replacement candidate dictionary 21, the context dependency rate dictionary 22, and the language model database 23 based on the feedback data generated by the feedback data generation unit 37, and the feedback data update process ends.
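One way the update in step S28 might look is sketched below. The description here does not fix a concrete update rule, so everything in this sketch is an assumption made for illustration: the `Feedback` and `Databases` structures, the binary evaluation flag, and the rule that a positive evaluation increments an N-gram count while a negative evaluation raises the context dependency rate of the phrase.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Feedback:
    phrase: str             # replacement phrase the evaluation refers to
    ngram: Tuple[str, ...]  # word sequence around the phrase in the evaluated sentence
    is_good: bool           # evaluation category (assumed to be binary here)


@dataclass
class Databases:
    # Stand-ins for dictionary 21, dictionary 22 and database 23.
    replacement_candidates: Dict[str, List[str]] = field(default_factory=dict)
    context_dependency: Dict[str, float] = field(default_factory=dict)
    language_model: Dict[Tuple[str, ...], int] = field(default_factory=dict)


def apply_feedback(db: Databases, fb: Feedback, step: float = 0.1) -> None:
    """Update at least one database from one piece of feedback (assumed rule)."""
    if fb.is_good:
        # A well-evaluated expression is recorded in the language model.
        db.language_model[fb.ngram] = db.language_model.get(fb.ngram, 0) + 1
    else:
        # A badly evaluated expression makes the phrase count as more
        # context dependent, capped at 1.0.
        rate = db.context_dependency.get(fb.phrase, 0.0)
        db.context_dependency[fb.phrase] = min(1.0, rate + step)


db = Databases(context_dependency={"grab": 0.5})
apply_feedback(db, Feedback("shoot", ("shoot", "photos"), is_good=True))
apply_feedback(db, Feedback("grab", ("grab", "photos"), is_good=False), step=0.25)
```

In this sketch the replacement candidate dictionary is left untouched; an implementation could equally add or remove synonym entries in response to feedback.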

Through the above processing, in the present embodiment, a predetermined translation target sentence is translated using a translation model generated from the replacement sentence determined to be adopted and the parallel translation sentence for the original sentence, and the resulting translation result sentence is evaluated. Based on the evaluation result, feedback information is generated that includes language information about the language of the translation target sentence and/or the language of the translation result sentence, together with evaluation information for that language information. Feedback data for having the similar sentence generation device 1a learn and reflect cases that take context dependency into account can therefore be generated autonomously.

In addition, in the present embodiment, the replacement candidate dictionary 21, the context dependency rate dictionary 22, and the language model database 23 are updated using feedback data that includes language information and evaluation information. Cases that take context dependency into account can therefore be reflected in these three databases, and similar sentences can be identified efficiently and autonomously, even for new sentence expressions that did not exist in the databases before the update.

The present disclosure can reduce the search cost for a language model database and can identify similar sentences with high accuracy. It is therefore useful for a similar sentence generation method, a similar sentence generation program, and a similar sentence generation device that generate similar sentences from an original sentence, and for a similar sentence generation system including the similar sentence generation device.

DESCRIPTION OF SYMBOLS
1, 1a Similar sentence generation device
2 Translation device
10, 10a Replacement target sentence input unit
11 Replacement candidate extraction unit
12 Context dependency rate collation unit
13 Context dependency determination unit
14 Language model collation unit
15 Replacement determination unit
16 Replacement result output unit
17 Data update unit
21 Replacement candidate dictionary
22 Context dependency rate dictionary
23 Language model database
31 Bilingual corpus generation unit
32 Translation model generation unit
33 Translated sentence input unit
34 Machine translation unit
35 Translation result sentence output unit
36 Translation result evaluation unit
37 Feedback data generation unit

Claims

A method for generating a similar sentence from an original sentence, the method comprising:
inputting a first sentence;
extracting, from a first database, one or more second phrases having the same meaning as a first phrase among a plurality of phrases constituting the first sentence, wherein the first database associates phrases with synonyms of the phrases included in the first database;
calculating an N-gram value based on a context-dependent value that corresponds to the one or more second phrases and is obtained from a second database, wherein the second database associates phrases with context-dependent values corresponding to the phrases included in the second database, and the context-dependent value indicates the degree to which the meaning of a phrase included in the second database depends on context;
extracting, from one or more second sentences in which the first phrase in the first sentence is replaced with the one or more second phrases, one or more third phrases each consisting of a number of consecutive phrases corresponding to the N-gram value and including the second phrase;
calculating an appearance frequency in a third database for the one or more third phrases, wherein the third database associates phrases with the appearance frequencies, in the third database, of the phrases included in the third database;
determining whether the calculated appearance frequency is equal to or greater than a threshold; and
when the calculated appearance frequency is determined to be equal to or greater than the threshold, adopting the one or more second sentences as similar sentences of the first sentence and outputting them to an external device.
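The steps recited in this claim can be read as the following pipeline. This is a rough sketch under stated assumptions, not the claimed implementation: phrases are treated as single whitespace-separated words, the mapping from context-dependent value to N in `ngram_size` is invented for illustration, and the appearance frequencies of the extracted N-grams are simply summed before the threshold comparison. All function names, cut-off values, and sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Ngram = Tuple[str, ...]


def ngram_size(context_dependency: float) -> int:
    """Assumed mapping: a more context-dependent phrase gets a longer window."""
    if context_dependency < 0.3:
        return 2
    if context_dependency < 0.7:
        return 3
    return 4


def windows_containing(words: List[str], index: int, n: int) -> List[Ngram]:
    """All runs of n consecutive words that include the word at `index`."""
    first = max(0, index - n + 1)
    last = min(index, len(words) - n)
    return [tuple(words[s:s + n]) for s in range(first, last + 1)]


def generate_similar_sentences(
    words: List[str],
    target_index: int,
    synonyms: Dict[str, List[str]],        # first database
    context_dependency: Dict[str, float],  # second database
    ngram_counts: Dict[Ngram, int],        # third database
    threshold: int,
) -> List[str]:
    adopted = []
    for candidate in synonyms.get(words[target_index], []):
        n = ngram_size(context_dependency.get(candidate, 0.0))
        replaced = words[:target_index] + [candidate] + words[target_index + 1:]
        windows = windows_containing(replaced, target_index, n)
        # Assumed scoring: sum of the frequencies of every extracted N-gram.
        score = sum(ngram_counts.get(w, 0) for w in windows)
        if windows and score >= threshold:
            adopted.append(" ".join(replaced))
    return adopted


result = generate_similar_sentences(
    words="my hobby is to take photos".split(),
    target_index=4,
    synonyms={"take": ["shoot", "grab"]},
    context_dependency={"shoot": 0.2, "grab": 0.8},
    ngram_counts={("to", "shoot"): 5, ("shoot", "photos"): 7},
    threshold=10,
)
```

With this hypothetical data, "shoot" is checked against 2-grams only, while "grab" requires 4-gram lookups, which shows how the context-dependent value limits the N-gram order that has to be searched for each candidate.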

The method of claim 1, wherein:
the first sentence is written in a first language;
the first sentence is included in a bilingual corpus, and the bilingual corpus includes a plurality of pairs each consisting of a sentence written in the first language and a parallel translation sentence written in a second language; and
when the calculated appearance frequency is determined to be equal to or greater than the threshold, the one or more second sentences are added to the bilingual corpus as similar sentences of the first sentence.

The method according to claim 1 or 2, wherein:
the third database includes an N-gram language model database;
N of the N-gram language model is determined to be i (a positive integer) according to the context-dependent value;
the appearance frequency of an i-gram including the second phrase is obtained by checking the third database; and
whether to adopt the one or more second sentences as similar sentences of the first sentence is determined based on the appearance frequency of the i-gram including the second phrase.

The method according to claim 1, further comprising:
creating a translation result sentence by translating a predetermined translation target sentence using a translation model generated based on the one or more second sentences determined to be adopted as similar sentences of the first sentence and a parallel translation sentence obtained by translating, into a second language, the first sentence from which the one or more second sentences were generated;
evaluating the translation result sentence; and
generating, based on the evaluation result of the translation result sentence, feedback information including language information about the language of the translation target sentence and/or the language of the translation result sentence, and evaluation information for the language information.

The method of claim 4, further comprising updating at least one of the first database, the second database, and the third database using the feedback information.

The method of claim 4, wherein the second database and the third database are updated when the feedback information includes a second phrase having context dependency.

The method of claim 4, wherein, when the feedback information includes a new sentence expression, the context-dependent value in the second database is changed according to the sentence expression.

The method of claim 4, wherein, when the feedback information includes a new sentence expression, the third database is updated to include the sentence expression.

A program for causing a computer to function as a device that generates a similar sentence from an original sentence, the program causing the computer to execute processing comprising:
inputting a first sentence;
extracting, from a first database, one or more second phrases having the same meaning as a first phrase among a plurality of phrases constituting the first sentence, wherein the first database associates phrases with synonyms of the phrases included in the first database;
calculating an N-gram value based on a context-dependent value that corresponds to the one or more second phrases and is obtained from a second database, wherein the second database associates phrases with context-dependent values corresponding to the phrases included in the second database, and the context-dependent value indicates the degree to which the meaning of a phrase included in the second database depends on context;
extracting, from one or more second sentences in which the first phrase in the first sentence is replaced with the one or more second phrases, one or more third phrases each consisting of a number of consecutive phrases corresponding to the N-gram value and including the second phrase;
calculating an appearance frequency in a third database for the one or more third phrases, wherein the third database associates phrases with the appearance frequencies, in the third database, of the phrases included in the third database;
determining whether the calculated appearance frequency is equal to or greater than a threshold; and
when the calculated appearance frequency is determined to be equal to or greater than the threshold, adopting the one or more second sentences as similar sentences of the first sentence and outputting them to an external device.

An apparatus for generating a similar sentence from an original sentence, the apparatus comprising:
an input unit that inputs a first sentence;
a second phrase extraction unit that extracts, from a first database, one or more second phrases having the same meaning as a first phrase among a plurality of phrases constituting the first sentence, wherein the first database associates phrases with synonyms of the phrases included in the first database;
a calculation unit that calculates an N-gram value based on a context-dependent value that corresponds to the one or more second phrases and is obtained from a second database, wherein the second database associates phrases with context-dependent values corresponding to the phrases included in the second database, and the context-dependent value indicates the degree to which the meaning of a phrase included in the second database depends on context;
a third phrase extraction unit that extracts, from one or more second sentences in which the first phrase in the first sentence is replaced with the one or more second phrases, one or more third phrases each consisting of a number of consecutive phrases corresponding to the N-gram value and including the second phrase;
a calculation unit that calculates an appearance frequency in a third database for the one or more third phrases, wherein the third database associates phrases with the appearance frequencies, in the third database, of the phrases included in the third database;
a determination unit that determines whether the calculated appearance frequency is equal to or greater than a threshold; and
an output unit that, when the calculated appearance frequency is determined to be equal to or greater than the threshold, adopts the one or more second sentences as similar sentences of the first sentence and outputs them to an external device.

A system for generating a similar sentence from an original sentence, the system comprising:
the apparatus according to claim 10;
a translation unit that creates a translation result sentence by translating a predetermined translation target sentence using a translation model generated based on the one or more second sentences determined by the apparatus to be adopted as similar sentences of the first sentence and a parallel translation sentence obtained by translating, into a second language, the first sentence from which the one or more second sentences were generated;
an evaluation unit that evaluates the translation result sentence created by the translation unit; and
a generation unit that generates, based on the evaluation result of the evaluation unit, feedback information including language information about the language of the translation target sentence and/or the language of the translation result sentence, and evaluation information for the language information.