JP2020008951A - Terminology fluctuation extraction device and method

Info

Publication number: JP2020008951A
Application number: JP2018127063A
Authority: JP
Inventors: 天瑶李; Tianyao Li; 真澄川上; Masumi Kawakami; 健二北川; Kenji Kitagawa; 敬志大島; Takashi Oshima; 遼曾我; Ryo Soga; 愛利國; Ai Toshikuni
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2018-07-03
Filing date: 2018-07-03
Publication date: 2020-01-16

Abstract

[Problem]
To make it possible to determine, with high accuracy, the similarity between a notation-variation candidate (a compound word) and an OK word (a compound word), and to register the result.
[Solution]
A notation variation extraction device has a storage unit that stores a pre-correction document and a post-correction document belonging to one domain, and a processing unit that extracts compound words from the pre-correction document and the post-correction document and treats each compound word appearing only in the pre-correction document as a notation-variation candidate for a domain term. The processing unit registers each combination of a notation-variation candidate and a domain term in the storage unit as a candidate / domain-term pair, splits the candidate and the domain term of each registered pair into general terms, calculates the maximum similarity between each general term of the split candidate and the general terms of the split domain term, and calculates the average similarity of the pair from the calculated maximum similarities.
[Selected drawing] FIG. 4

Description

The present invention relates to a notation variation extraction device and method.

In recent years, devices have been developed that analyze documents written in natural language and extract the "notation variations" they contain. Here, a "notation variation" refers to multiple (different) notations that denote the same concept within a single document or a series of documents.

Patent Document 1 discloses an example of a technique for extracting similar expressions, which is a technique related to notation variation. It describes compound word extraction means for extracting, from a morpheme sequence, compound words that are sequences of predetermined parts of speech, and determination input means for inputting a human judgment as to whether one compound word in a specific set of compound words is a notation variation of another. Entries are registered in a notation variation dictionary based on that judgment.

JP-A-2016-139164

Moreover, with Patent Document 1, notation variations can be extracted appropriately only when human judgments are available. Otherwise they may not be extracted appropriately, so sufficient recall may not be guaranteed.

An object of the present invention is to provide a notation variation extraction device and method that can determine, with high accuracy, the similarity between a notation-variation candidate (compound word) and an OK word (compound word) and register it.

One aspect of the present invention that solves the above problem has a storage unit that stores a pre-correction document and a post-correction document belonging to one domain, and a processing unit that extracts compound words from the pre-correction document and the post-correction document and treats each compound word appearing only in the pre-correction document as a notation-variation candidate for a domain term. The processing unit registers each combination of a notation-variation candidate and a domain term in the storage unit as a candidate / domain-term pair, splits the candidate and the domain term of each registered pair into general terms, calculates the maximum similarity between each general term of the split candidate and the general terms of the split domain term, and calculates the average similarity of the pair from the calculated maximum similarities.

According to the present invention, correct terms and the notation variations that are their erroneous forms can be extracted from the pre-correction and post-correction documents with high precision and high recall.

A diagram showing a configuration example of the notation variation extraction device of Embodiment 1.
A diagram showing the hardware configuration of the notation variation extraction device.
A diagram showing an example of the general word vectors 1221.
A flowchart explaining the processing procedure in the average similarity calculation unit.
A diagram showing an example of the output of notation-variation candidate / OK word pairs.
A diagram conceptually explaining the calculation of the "average similarity".
A diagram showing an example of the output of notation variation / OK word pairs.
A diagram showing a configuration example of the notation variation extraction device of Embodiment 2.
A diagram showing an example of compound word vectors.
A flowchart explaining the processing procedure in the average similarity calculation unit of Embodiment 2.
A diagram showing pairs of notation-variation candidates and OK words registered in the vocabulary information storage unit.
A flowchart explaining the processing procedure in the average similarity calculation unit of Embodiment 3.

Embodiments are described below with reference to the drawings. The embodiments described below do not limit the invention according to the claims, and not all of the elements and combinations described in the embodiments are necessarily essential to the solution of the invention.

The documents targeted by the present technology belong to a specific "domain". For example, the design documents of a particular information system belong to the "domain" of that information system. A "domain" is thus the field to which the subject of a text belongs, or the group of people who write the text. Documents written in natural language, such as specifications for a particular electronic device or banking system, often use compound words that are domain-specific terms.

Domain-specific notation variations are therefore likely to occur. For example, in a certain domain the compound word "deposit item" (預金種目) is a domain-specific term, while the word "type" (種別) is also permitted. In that case the notation variation "deposit type" (預金種別) can arise.

In particular, when a frequent word has an infrequent similar expression, the infrequent expression is hard to extract because its co-occurring expressions are often skewed. For example, suppose the compound word "deposit item" appears several hundred times, while the similar expression "deposit type" appears only once. The context of "deposit type" is "change of deposit type ...", so its co-occurring expression is "change". The co-occurring expression of "deposit item", on the other hand, is mostly "registration". In this situation, "deposit type" is unlikely to be extracted as a similar expression of "deposit item".

In the examples that follow, "deposit item" (預金種目) is the domain-specific term and "deposit type" (預金種別) is its notation variation.

FIG. 1 is a functional overview of the notation variation extraction device 10. The notation variation extraction device 10 includes a processing unit 11 and a storage unit 12.
The processing unit 11 includes an input/output unit 111, a compound word extraction unit 112, a vocabulary vector acquisition unit 113, an average similarity calculation unit 114, and a selection unit 115. The storage unit 12 includes a document storage unit 121 and a vocabulary information storage unit 122.

In Embodiment 1, before processing is executed, the user registers the pre-correction document and the post-correction document in the document storage unit 121 through the input/output unit 111. The pre-correction document may be a single document or multiple documents. The same applies to the post-correction document.

The notation variation extraction device 10 has a mechanism for mechanically distinguishing pre-correction documents from post-correction documents. For example, the document storage unit 121 has one folder for pre-correction documents and another for post-correction documents. Alternatively, the device may have a function that appends "_before correction" (＿修正前) to the file name of a pre-correction document (and likewise for post-correction documents).

The document registration function of the input/output unit 111 also registers whether a document is pre-correction or post-correction. For example, it may present the user with a check box for selecting whether the document is pre-correction or post-correction. On receiving the user's selection, it registers the document in the document storage unit 121 either by sorting it into the pre-correction or post-correction folder, or by appending "_before correction" or "_after correction" to its name.
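The file-name variant of this mechanism could look roughly like the sketch below. The suffix strings `_before` and `_after` and both function names are hypothetical stand-ins chosen for illustration; the source only says that a marker such as ＿修正前 is appended.

```python
from pathlib import PurePath

# Hypothetical English stand-ins for the markers "＿修正前" / "＿修正後".
BEFORE_SUFFIX = "_before"
AFTER_SUFFIX = "_after"


def tagged_name(filename: str, is_corrected: bool) -> str:
    """Return the file name with the pre-/post-correction marker appended."""
    path = PurePath(filename)
    marker = AFTER_SUFFIX if is_corrected else BEFORE_SUFFIX
    return f"{path.stem}{marker}{path.suffix}"


def is_corrected_document(filename: str) -> bool:
    """Tell post-correction documents apart by their marker."""
    return PurePath(filename).stem.endswith(AFTER_SUFFIX)


print(tagged_name("spec.txt", is_corrected=False))
print(is_corrected_document("spec_after.txt"))
```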

When the user presses the "Start" button on the input/output unit 111, the compound word extraction unit 112, the vocabulary vector acquisition unit 113, the average similarity calculation unit 114, and the selection unit 115 are started in that order. When all processing has finished, every extracted notation variation is displayed on the input/output unit 111.

Normally, all notation variations have been corrected in the post-correction document. The following situation is therefore used as an example. The pre-correction document contains both "deposit item" (the domain term) and "deposit type" (the notation variation). The post-correction document contains only "deposit item" (the domain term) and does not contain "deposit type" (the notation variation).

The compound word extraction unit 112 performs morphological analysis on the pre-correction document stored in the document storage unit 121, using the general word dictionary stored in the vocabulary information storage unit 122. Here, a "morpheme" is the smallest meaningful unit in a text. A morpheme carries a character string giving its notation and the part of speech to which it belongs. The compound word extraction unit 112 can be implemented using MeCab (an open-source morphological analysis engine) and TermExtract.

The morpheme sequence of the pre-correction document, obtained by the morphological analysis, is registered in the document storage unit 121. For example, suppose the pre-correction document stored in the document storage unit 121 contains the notation variation "deposit type". "Deposit type" is split into two morphemes (general words), "deposit" and "type".

The compound word extraction unit 112 extracts sequences of predetermined parts of speech as compound words from the morpheme sequence of the pre-correction document stored in the document storage unit 121. The extraction result is registered in the vocabulary information storage unit 122 as the pre-correction compound word dictionary. Because "deposit" and "type" form a sequence of nouns, "deposit type" is extracted as a compound word and registered in the pre-correction compound word dictionary of the vocabulary information storage unit 122.
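As a minimal sketch of this extraction step, the code below assumes that the "predetermined sequence of parts of speech" is simply a run of two or more consecutive nouns, and that the morphological analyzer has already produced (surface, part-of-speech) tuples. The tag names and the function `extract_compounds` are illustrative assumptions, not the output format of MeCab or TermExtract.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface form, part of speech)


def extract_compounds(morphemes: List[Morpheme]) -> List[str]:
    """Collect every run of two or more consecutive nouns as a compound word."""
    compounds: List[str] = []
    run: List[str] = []
    # A sentinel non-noun at the end flushes a run that reaches the last token.
    for surface, pos in morphemes + [("", "EOS")]:
        if pos == "noun":
            run.append(surface)
            continue
        if len(run) >= 2:
            compounds.append("".join(run))
        run = []
    return compounds


# "預金種別の変更": deposit + type + (particle) + change
tokens = [("預金", "noun"), ("種別", "noun"), ("の", "particle"), ("変更", "noun")]
print(extract_compounds(tokens))
```

With this rule the single noun 変更 is not reported, since a compound needs at least two elements.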

The compound word extraction unit 112 likewise performs morphological analysis on the post-correction document stored in the document storage unit 121, using the general word dictionary stored in the vocabulary information storage unit 122. The resulting morpheme sequence of the post-correction document is registered in the document storage unit 121. For example, suppose the post-correction document contains the domain-specific term "deposit item". "Deposit item" is split into two morphemes (general words), "deposit" and "item".

The compound word extraction unit 112 extracts sequences of predetermined parts of speech as compound words from the morpheme sequence of the post-correction document stored in the document storage unit 121. The extraction result is registered in the vocabulary information storage unit 122 as the post-correction compound word dictionary. Because "deposit" and "item" form a sequence of nouns, "deposit item" is extracted as a compound word and registered in the post-correction compound word dictionary. The post-correction document does not contain "deposit type" (the notation variation), so "deposit type" is not registered in the post-correction compound word dictionary.

As described above, all notation variations have normally been corrected in the post-correction document, which therefore contains only correct compound words. Every compound word extracted from the post-correction document, that is, every compound word in the post-correction compound word dictionary, is therefore correct. For this reason the post-correction compound word dictionary is called the "OK word dictionary", and a compound word belonging to it, that is, a compound word extracted from the post-correction document, is called an "OK word". For example, "deposit item" is an OK word.

The compound word extraction unit 112 takes the difference between the pre-correction compound word dictionary and the OK word dictionary stored in the vocabulary information storage unit 122. That is, it obtains the compound words that are registered only in the pre-correction compound word dictionary and not in the OK word dictionary, and registers them in the vocabulary information storage unit 122 as the notation-variation candidate dictionary.

"Deposit type" is registered in the pre-correction compound word dictionary but not in the OK word dictionary, so it is registered in the notation-variation candidate dictionary.
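The dictionary difference amounts to a set difference. In the sketch below the dictionaries are modelled as plain Python sets, and the entry 口座番号 (account number) is made-up sample data added only to show that shared compounds drop out.

```python
# Compound words extracted from the pre-correction document (sample data).
before_compounds = {"預金種目", "預金種別", "口座番号"}

# OK word dictionary: compound words extracted from the post-correction document.
ok_words = {"預金種目", "口座番号"}

# Candidates are the compounds that appear only before correction.
candidates = before_compounds - ok_words
print(candidates)
```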

The compound word extraction unit 112 registers the correspondences between all notation-variation candidates and all OK words stored in the vocabulary information storage unit 122 in that storage unit as "notation-variation candidate / OK word pairs" (in this specification, a pair of a notation-variation candidate and an OK word is written this way). If there are n notation-variation candidates and m OK words, there are n × m such pairs.

One example is the compound word pair "deposit type / deposit item". Because compound words registered in the OK word dictionary are excluded when the notation-variation candidate dictionary is built, n can be reduced by the number of OK words, which effectively reduces the number (n × m) of notation variation / OK word pairs (in this specification, a pair of a notation variation and an OK word is written "notation variation / OK word pair").
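The pairing of every candidate with every OK word is a Cartesian product, which the following sketch builds with `itertools.product`. The word lists are illustrative sample data; only 預金種別 and 預金種目 come from the source.

```python
from itertools import product

# n = 2 notation-variation candidates (the second one is made-up sample data).
candidates = ["預金種別", "口座番后"]
# m = 3 OK words (only the first one appears in the source).
ok_words = ["預金種目", "口座番号", "支店名"]

# Every candidate is paired with every OK word, giving n * m pairs.
pairs = list(product(candidates, ok_words))
print(len(pairs))
print(pairs[0])
```

Since the pair count grows as n × m, removing OK words from the candidate list before pairing keeps the later similarity loop smaller.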

The vocabulary vector acquisition unit 113 takes as input the morpheme sequences of the pre-correction and post-correction documents stored in the document storage unit 121 and, using a technique such as machine learning, calculates vector representations of general words from their co-occurring expressions. Word2Vec is one example of a method for calculating vector representations of vocabulary from co-occurrence. The resulting vector representations are registered in the vocabulary information storage unit 122 as the general word vectors 1221 (see FIG. 3). For example, the vectors of "deposit", "item", and "type" are each calculated from their co-occurring expressions. This embodiment uses the vectors of the general words "deposit", "item", and "type", and does not use vectors of compound words such as "deposit type" or "deposit item".
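To make the idea of "vectors from co-occurrence" concrete, the sketch below builds raw co-occurrence count vectors in plain Python. This is a deliberately simple stand-in and not Word2Vec: Word2Vec learns dense vectors with a neural model, whereas this code only counts neighbours within a window. The function name, the `window` parameter, and the sample sentences are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def cooccurrence_vectors(
    sentences: List[List[str]], window: int = 1
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return the vocabulary and one count vector per word.

    Component i of a word's vector is how often vocabulary word i
    appeared within `window` positions of that word.
    """
    vocab = sorted({word for sentence in sentences for word in sentence})
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sentence[j]] += 1
    vectors = {w: [counts[w][c] for c in vocab] for w in vocab}
    return vocab, vectors


# Morpheme sequences: "deposit item registration", "deposit type change".
sentences = [["預金", "種目", "登録"], ["預金", "種別", "変更"]]
vocab, vectors = cooccurrence_vectors(sentences, window=1)
print(vocab)
print(vectors["種別"])
```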

FIG. 2 shows the hardware configuration of the notation variation extraction device. A CPU 21 having arithmetic functions, a memory 22 that temporarily stores data, an interface 23 connected to a display device 25 such as a display and to an input device 26 such as a mouse or keyboard, and the storage unit 12 are connected via a bus 24. The storage unit 12 uses a storage device such as a hard disk or SSD. The storage unit 12 stores a program 13 that implements the functions of the input/output unit 111, compound word extraction unit 112, vocabulary vector acquisition unit 113, average similarity calculation unit 114, and selection unit 115 of FIG. 1. These programs are loaded into the memory 22 and executed by the CPU 21, which realizes the functions. The storage unit 12 also has the document storage unit 121, which stores the pre-correction and post-correction documents, and the vocabulary information storage unit 122, which stores the general word dictionary, pre-correction compound word dictionary, post-correction compound word dictionary, notation-variation candidate dictionary, notation-variation candidate / OK word pairs, general word vectors, average similarity U, notation variation / OK word pairs, and compound word vectors.

FIG. 3 shows an example of the general word vectors 1221. Each general word is assigned a vector of 50 to 200 numbers. The vectors hold the values calculated for "deposit", "item", and "type" from their respective co-occurring expressions.

FIG. 4 is a flowchart explaining the processing procedure in the average similarity calculation unit 114. The flow in FIG. 4 calculates the average similarity for every notation-variation candidate / OK word pair stored in the vocabulary information storage unit 122 and registers the results in the vocabulary information storage unit 122. Because the notation-variation candidates are limited to compound words that are not registered in the OK word dictionary, the number of candidate / OK word pairs has already been effectively narrowed down.

In step S301, one notation-variation candidate / OK word pair is acquired from the vocabulary information storage unit 122. For example, suppose "deposit type / deposit item" is acquired. "Deposit type" is the notation-variation candidate and "deposit item" is the OK word.

In step S302, the notation-variation candidate (compound word) is split into general words (elements). The OK word (compound word) is also split into general words (elements). Splitting here means morphological analysis using the general word dictionary stored in the vocabulary information storage unit 122, as in the compound word extraction unit 112. For example, "deposit type" (the candidate compound word) is split into two morphemes, that is, general words (elements): "deposit" and "type". Likewise, "deposit item" (the OK compound word) is split into the two morphemes "deposit" and "item". For convenience, the morphemes (general words) that make up a compound word are called the elements of that compound word.

In step S303, the average similarity U is initialized to 0.

In step S304, one element of the notation-variation candidate is acquired. For convenience, this element is called the "NG element". For example, suppose "type" (NG element) is acquired.

In step S305, the maximum similarity V is initialized to 0.

In step S306, one element of the OK word is acquired. For convenience, this element is called the "OK element". For example, suppose "deposit" (OK element) is acquired.

In step S307, the vector of the NG element and the vector of the OK element are acquired from the general word vectors 1221 (see FIG. 3) stored in the vocabulary information storage unit 122, and the similarity S between the NG element and the OK element is calculated. The similarity S is the cosine similarity between the NG element vector and the OK element vector. For example, let the similarity between "type" (NG element) and "deposit" (OK element) be S = 0.2.
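For reference, cosine similarity between two vectors can be computed as in the sketch below. The function name and the choice to return 0.0 for a zero vector are assumptions made for this illustration; the source does not say how a zero vector is handled.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for the degenerate case
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```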

In step S308, the similarity S calculated in step S307 is compared with the maximum similarity V. If S > V, the process proceeds to step S309; otherwise it proceeds to step S310. In the example, S = 0.2 > V = 0, so the process proceeds to step S309.

In step S309, the maximum similarity V is updated to the similarity S calculated in step S307. In the example, V is updated to S = 0.2.

In step S310, it is checked whether all elements of the OK word have been acquired. If so, the process proceeds to step S311; otherwise it returns to step S306. For example, the elements of "deposit item" (OK word) are "deposit" and "item", but so far only "deposit" has been acquired, so the process returns to step S306.

Continuing the example, the second pass from step S306 to step S310 proceeds as follows. In step S306, "item" (OK element) is acquired. In step S307, the vector of "type" (NG element) and the vector of "item" (OK element) are acquired from the general word vectors 1221 stored in the vocabulary information storage unit 122, and their cosine similarity is calculated. Suppose the cosine similarity is 0.8, that is, S = 0.8. In step S308, S = 0.8 is compared with V = 0.2. Since S = 0.8 > V = 0.2, the process proceeds to step S309 and V is updated to 0.8. In step S310, all elements of the OK word have now been acquired, so the process proceeds to step S311.

In step S311, the maximum similarity V is added to the average similarity U. In the example, V = 0.8 is added to U = 0, giving U = 0.8.

In step S312, it is checked whether all elements of the notation-variation candidate have been acquired. If so, the process proceeds to step S313; otherwise it returns to step S304. For example, the elements of the candidate "deposit type" are "deposit" and "type", but so far only "type" has been acquired, so the process returns to step S304.

Continuing the example, the second pass from step S304 to step S312 is described briefly. In step S304, the other element of the candidate "deposit type", namely "deposit" (NG element), is acquired. In steps S305 to S310, the similarity S = 1 between "deposit" (NG element) and the element "deposit" (OK element) of the OK word "deposit item" yields the maximum similarity V = 1. Finally, in step S311, V = 1 is added to U = 0.8, giving U = 1.8. Steps S305 to S310 are likewise repeated when the element "item" of the OK word is selected, but a detailed description is omitted.

In step S313, the average similarity U is divided by the number of elements of the spelling variation candidate. Since U = 1.8 and the candidate has two elements, the division gives the average similarity U = 0.9.
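The loop in steps S303 to S313 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `average_similarity`, the lookup table `SIM`, and the initial value 0 for the maximum similarity V are assumptions, and the value for ("deposit", "item") is a placeholder that the text does not give.

```python
from typing import Callable, Sequence


def average_similarity(
    ng_elements: Sequence[str],
    ok_elements: Sequence[str],
    sim: Callable[[str, str], float],
) -> float:
    """For each NG element take its best match among the OK elements, then average."""
    total = 0.0                          # U (S303)
    for ng in ng_elements:               # S304, repeated via S312
        best = 0.0                       # V; the initial value 0 is an assumption
        for ok in ok_elements:           # S306, repeated via S310
            s = sim(ng, ok)              # S307: element-level similarity
            if s > best:                 # S308
                best = s                 # S309
        total += best                    # S311
    return total / len(ng_elements)      # S313


# Similarities from the worked example; ("deposit", "item") is an assumed placeholder.
SIM = {
    ("type", "deposit"): 0.2,
    ("type", "item"): 0.8,
    ("deposit", "deposit"): 1.0,
    ("deposit", "item"): 0.2,
}

u = average_similarity(["type", "deposit"], ["deposit", "item"], lambda a, b: SIM[(a, b)])
print(round(u, 3))
```

In the device, `sim` would be the cosine similarity between general word vectors 1221; a table is used here only to keep the sketch self-contained.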

In step S314, the average similarity U of the spelling variation candidate / OK word pair is registered in the vocabulary information storage unit 122 as the similarity of that pair. FIG. 5 shows the state in which spelling variation candidate / OK word pairs and their average similarity U are registered in the vocabulary information storage unit 122.

In step S315, it is checked whether all spelling variation candidate / OK word pairs in the vocabulary information storage unit 122 have been acquired. If they have, the process ends; otherwise, it returns to step S301.

The above illustrates the case where the average similarity is calculated as the ordinary arithmetic mean of the maximum similarities. Alternatively, the average similarity may be calculated as the product of the maximum similarities. In that case, the average similarity is initialized to 1 in step S303, the average similarity is multiplied by the maximum similarity in step S311, and step S313 is unnecessary.

Methods other than multiplication are also possible, for example the harmonic mean; they are not described individually here. In short, the average similarity of a spelling variation candidate / OK word pair is obtained from, for each element, the combination of elements with the highest similarity.
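The three ways of combining the per-element maximum similarities mentioned above can be compared in a short sketch. The function name `combine` and its `method` argument are hypothetical; `math.prod` and `statistics.harmonic_mean` are standard-library functions (Python 3.8 or later).

```python
from math import prod
from statistics import harmonic_mean
from typing import Sequence


def combine(max_sims: Sequence[float], method: str = "mean") -> float:
    """Combine per-element maximum similarities into one pair-level score."""
    if method == "mean":        # arithmetic mean: sum (S311), then divide (S313)
        return sum(max_sims) / len(max_sims)
    if method == "product":     # start from 1 (S303), multiply (S311), no S313
        return prod(max_sims)
    if method == "harmonic":    # harmonic mean; requires non-negative values
        return harmonic_mean(max_sims)
    raise ValueError(f"unknown method: {method}")


values = [0.8, 1.0]             # maximum similarities from the worked example
for m in ("mean", "product", "harmonic"):
    print(m, round(combine(values, m), 3))
```

The product and the harmonic mean penalize a single poorly matching element more strongly than the arithmetic mean does, which may or may not be desirable depending on the domain.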

The calculation of the "average similarity" described above is explained conceptually with reference to FIG. 6. One spelling variation candidate / OK word pair, the compound word pair "deposit type / deposit item", is acquired from the vocabulary information storage unit 122. "Deposit type" is the spelling variation candidate (NG word candidate), and "deposit item" is the OK word.

The average similarity calculation unit 114 divides the OK word (compound word) "deposit item" into the elements "deposit" and "item" based on the general word dictionary registered in the vocabulary information storage unit 122. Similarly, it divides the NG word candidate (compound word) "deposit type" into the elements "deposit" and "type" based on the same general word dictionary.

The NG elements "deposit" and "type" of the spelling variation candidate are each acquired for similarity calculation. For each NG element, the similarity to each of the OK elements "deposit" and "item" is calculated, and the maximum similarity V is used to calculate the average similarity of the compound word pair "deposit type / deposit item". In FIG. 6, the maximum similarity V between the NG element "type" and the OK elements "deposit" and "item" is 0.8, obtained for the NG element "type" and the OK element "item".

Similarly, the maximum similarity V between the NG element "deposit" and the OK elements "deposit" and "item" is 1.0, obtained for the NG element "deposit" and the OK element "deposit". The sum of these maximum similarities between the NG elements and OK elements, which result from dividing the compound words into general terms, divided by the number of elements (2), is called the "average similarity". Unlike the similarity between the compound word vector of the spelling variation candidate and the compound word vector of the OK word, this measure uses the maximum similarity of each element of the compound word, so the accuracy of similarity judgment for compound words can be improved.

The selection unit 115 selects, from the spelling variation candidate / OK word pairs in the vocabulary information storage unit 122, the pairs that satisfy a condition, and treats those pairs as the final spelling variation / OK word pairs. The condition may be, for example, "the similarity is greater than a threshold". That is, with a threshold of 0.8, spelling variation candidate / OK word pairs with similarity > 0.8 become spelling variation / OK word pairs. In this case, there is no guarantee that every spelling variation candidate has at least one spelling variation / OK word pair. If it is desirable that every spelling variation candidate has at least one such pair, then, for example, for each spelling variation candidate the candidate / OK word pair with the highest similarity may be selected as the final spelling variation / OK word pair. The spelling variation / OK word pairs are registered in the vocabulary information storage unit 122.
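The two selection conditions described for the selection unit 115 can be sketched as follows. The function names and the sample pairs other than "deposit type / deposit item" are hypothetical, and the similarity values are made up for illustration.

```python
from typing import Dict, List, Tuple

Scored = Tuple[str, str, float]  # (spelling variation candidate, OK word, similarity)


def select_by_threshold(scored: List[Scored], threshold: float = 0.8) -> List[Scored]:
    """Keep the pairs whose similarity is strictly greater than the threshold."""
    return [p for p in scored if p[2] > threshold]


def select_best_per_candidate(scored: List[Scored]) -> List[Scored]:
    """Keep, for every candidate, the single pair with the highest similarity."""
    best: Dict[str, Scored] = {}
    for cand, ok, s in scored:
        if cand not in best or s > best[cand][2]:
            best[cand] = (cand, ok, s)
    return list(best.values())


# Illustrative data; only the first pair comes from the worked example.
scored = [
    ("deposit type", "deposit item", 0.9),
    ("deposit type", "account number", 0.3),
    ("customer no", "customer number", 0.7),
]
print(select_by_threshold(scored))
print(select_best_per_candidate(scored))
```

With the threshold condition, "customer no" ends up with no pair at all; the per-candidate maximum guarantees one pair per candidate at the cost of admitting weaker matches.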

The input/output unit 111 presents the spelling variation / OK word pairs stored in the vocabulary information storage unit 122 to the user. They may be presented as a simple list. When a single OK word has multiple spelling variations, the pairs may be grouped by OK word.

FIG. 7 shows an example of the output of spelling variation / OK word pairs. FIG. 7(a) shows an example of a screen that presents the spelling variation / OK word pairs stored in the vocabulary information storage unit 122 to the user, with the spelling variation labeled "spelling variation (error)" and the OK word labeled "domain term". FIG. 7(b) shows the case where the pairs are grouped by OK word because a single OK word has multiple spelling variations. Here, the example screen presents each pair in the order OK word, spelling variation, labeled "domain term" and "spelling variation (error)" respectively.
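The grouping shown in FIG. 7(b) amounts to collecting spelling variations under their OK word. A minimal sketch, assuming the pairs are available as (spelling variation, OK word) tuples; the function name and the sample data other than "deposit type / deposit item" are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_ok_word(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each OK word (domain term) to the list of its spelling variations."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for variation, ok_word in pairs:
        groups[ok_word].append(variation)
    return dict(groups)


pairs = [
    ("deposit type", "deposit item"),
    ("deposit kind", "deposit item"),      # made-up second variation
    ("customer no", "customer number"),    # made-up pair
]
for ok_word, variations in group_by_ok_word(pairs).items():
    print(ok_word, "<-", ", ".join(variations))
```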

As described above, in Example 1, compound words registered in the OK word dictionary are excluded when entries are registered in the spelling variation candidate dictionary, so the number n can be reduced by the number of OK words. This effectively narrows down the (n × m) spelling variation candidate / OK word pairs and speeds up the processing for obtaining the average similarity.

In addition, instead of using the similarity between the compound word vectors of the spelling variation candidate (compound word) and the OK word (compound word), the maximum similarity between the NG elements and OK elements obtained by dividing the candidate and the OK word into general terms is determined, and the sum of the per-element maximum similarities is divided by the number of elements (2 in the example) to calculate the "average similarity". The similarity of compound words can therefore be judged with high accuracy.

In Example 2, the processing procedure of the average similarity calculation unit 114 in Example 1 is changed to further reduce the processing time.

FIG. 8 shows a configuration example of the spelling variation extraction device 10 in Example 2. Components with the same reference numerals have the same functions and configurations as in Example 1.

In Example 2, a vocabulary appearance frequency acquisition unit 116 is added to the processing unit 11.

The compound word extraction unit 112 performs morphological analysis on the document before correction stored in the document storage unit 121, using the compound word dictionary before correction and the OK word dictionary stored in the vocabulary information storage unit 122. The result, a morpheme sequence of the document before correction that retains compound words, is registered in the document storage unit 121. For example, the sentence 「顧客記号番号検索画面にて画面検索ボタン押下で出力される確認ダイアログ」 ("the confirmation dialog output when the screen search button is pressed on the customer symbol number search screen") is segmented as 「顧客記号番号検索画面／にて／画面検索ボタン／押下／で／出力／される／確認ダイアログ」. That is, compound words such as 「顧客記号番号検索画面」 ("customer symbol number search screen") are not broken down into general words. The document after correction is then analyzed in the same way, and the resulting morpheme sequence of the document after correction, which also retains compound words, is registered in the document storage unit 121.

The vocabulary vector acquisition unit 113 takes as input the compound-word-retaining morpheme sequences of the document before correction and the document after correction stored in the document storage unit 121, and uses a technique such as machine learning to calculate vector representations of compound words from their co-occurrence expressions. The results are registered in the vocabulary information storage unit 122 as compound word vectors 1222. For example, vectors for "deposit item" and "deposit type" are calculated and registered in the vocabulary information storage unit 122.

FIG. 9 shows an example of the compound word vectors 1222 registered in the vocabulary information storage unit 122. Each compound word is assigned a vector of 50 to 200 numbers.

The vocabulary appearance frequency acquisition unit 116 acquires the OK word dictionary and the spelling variation candidate dictionary stored in the vocabulary information storage unit 122. From the compound-word-retaining morpheme sequences of the document before correction and the document after correction stored in the document storage unit 121, it measures the appearance frequencies of the OK words and the spelling variation candidates, and registers them in the vocabulary information storage unit 122.
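The frequency measurement can be sketched with a simple counter over the morpheme sequences. This sketch assumes that the sequences are already lists of tokens in which compound words are single tokens, and it counts candidates in the document before correction and OK words in the document after correction, following the example given for step S322. The function name and the token lists are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_frequencies(
    before_tokens: List[str],
    after_tokens: List[str],
    candidates: Iterable[str],
    ok_words: Iterable[str],
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Count candidates in the pre-correction text and OK words in the corrected text."""
    before = Counter(before_tokens)
    after = Counter(after_tokens)
    # Counter returns 0 for tokens that never occur, so missing words are safe.
    cand_freq = {w: before[w] for w in candidates}
    ok_freq = {w: after[w] for w in ok_words}
    return cand_freq, ok_freq


# Made-up token sequences for illustration.
before_doc = ["deposit type", "is", "shown", "deposit item", "is", "set"]
after_doc = ["deposit item", "is", "shown", "deposit item", "is", "set"]
cand_freq, ok_freq = count_frequencies(before_doc, after_doc, ["deposit type"], ["deposit item"])
print(cand_freq, ok_freq)
```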

FIG. 10 is a flowchart explaining the processing procedure of the average similarity calculation unit 114 in Example 2. For every spelling variation candidate / OK word pair stored in the vocabulary information storage unit 122, the flow branches on the appearance frequency difference, calculates either the average similarity, which takes into account the maximum similarity of each element of the compound word, or the cosine similarity of the compound words, and registers the result in the vocabulary information storage unit 122 as the similarity of the pair. That is, when the appearance frequency is low, the average similarity is used; when the appearance frequency is high and the cosine similarity of the compound words based on co-occurrence expressions is reliable, the cosine similarity of the compound words is used, which speeds up the processing.

In step S321, one spelling variation candidate / OK word pair is acquired from the vocabulary information storage unit 122.

In step S322, the appearance frequency of the spelling variation candidate and the appearance frequency of the OK word are acquired from the vocabulary information storage unit 122. The appearance frequency of the spelling variation candidate can be obtained, for example, by counting its occurrences in the document before correction. The appearance frequency of the OK word can likewise be obtained from the document after correction.

In step S323, the appearance frequency difference of the spelling variation candidate / OK word pair is calculated from the appearance frequencies of the candidate and the OK word. Various calculations are possible. For example, it may be the difference between the appearance frequencies of the OK word and the spelling variation candidate, or the quotient obtained by dividing the appearance frequency of the spelling variation candidate by that of the OK word.

In step S324, it is checked whether the appearance frequency difference of the spelling variation candidate / OK word pair is greater than a threshold. The threshold is set appropriately in advance according to the formula used for the appearance frequency difference in step S323. If the appearance frequency difference is greater than the threshold, the process proceeds to step S325; otherwise, it proceeds to step S327. When the appearance frequency difference is the quotient obtained by dividing the appearance frequency of the spelling variation candidate by that of the OK word, the threshold for the quotient may be, for example, 0.01. The reason for this branch is that, when the appearance frequency difference is at or above the threshold, the cosine similarity of the compound words cannot be expected to be accurate, so the judgment is based on the average similarity. In other words, when the difference in appearance frequency is large, the process moves to step S325; when it is small, the process moves to step S327.

In step S325, the average similarity of the spelling variation candidate / OK word pair is calculated, and the result is registered in the vocabulary information storage unit 122 as the similarity of the pair. The calculation follows steps S302 to S314 shown in FIG. 4.

In step S326, it is checked whether all spelling variation candidate / OK word pairs in the vocabulary information storage unit 122 have been acquired. If they have, the process ends; otherwise, it returns to step S321.

In step S327, the compound word vector of the spelling variation candidate and the compound word vector of the OK word are acquired from the compound word vectors 1222 stored in the vocabulary information storage unit 122.

In step S328, the cosine similarity between the compound word vector of the spelling variation candidate and the compound word vector of the OK word is calculated.

In step S329, the cosine similarity between the compound word vector of the spelling variation candidate and the compound word vector of the OK word is registered in the vocabulary information storage unit 122 as the similarity of the spelling variation candidate / OK word pair, and the process proceeds to step S326.
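The branch in steps S323 to S329 can be sketched as below. This is an illustration under assumptions: the appearance frequency difference is taken as the absolute difference of the two counts (one of the options named for step S323), the average similarity is supplied as a zero-argument callable so that it is only computed on the branch that needs it, and all names and numbers are hypothetical.

```python
import math
from typing import Callable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_similarity(
    freq_ng: int,
    freq_ok: int,
    threshold: float,
    average_sim: Callable[[], float],
    ng_vec: Sequence[float],
    ok_vec: Sequence[float],
) -> float:
    gap = abs(freq_ok - freq_ng)        # S323, "difference" option (assumed)
    if gap > threshold:                 # S324
        return average_sim()            # S325: element-wise average similarity
    return cosine(ng_vec, ok_vec)       # S327 to S328: compound word vectors


# Rare candidate versus frequent OK word: the average similarity is used.
print(pair_similarity(2, 500, 100, lambda: 0.9, [1.0, 0.0], [0.0, 1.0]))
# Comparable frequencies: the cosine similarity of the compound vectors is used.
print(pair_similarity(480, 500, 100, lambda: 0.9, [1.0, 0.0], [1.0, 0.0]))
```

If the quotient is used instead of the difference, both the comparison and the threshold would have to be chosen to match that formula, as the description of step S324 notes.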

FIG. 11 shows the spelling variation candidate / OK word pairs registered in the vocabulary information storage unit 122 in step S329, together with the appearance frequency difference and the cosine similarity between the candidate's vector and the OK word's vector.

According to Example 2, when the appearance frequency is low, using the average similarity allows spelling variation candidate / OK word pairs to be registered together with their similarity with high accuracy. When the appearance frequency is high and the cosine similarity of the compound words based on co-occurrence expressions is reliable, using the cosine similarity of the compound words allows spelling variation candidate / OK word pairs to be registered both accurately and quickly.

In Example 3, the processing procedure of the average similarity calculation unit 114 in Example 2 is changed to improve the precision and recall of spelling variation extraction.

The configuration of the device 10 in Example 3 is the same as in Example 2 (FIG. 8).

FIG. 12 is a flowchart explaining the processing procedure of the average similarity calculation unit 114. For every spelling variation candidate / OK word pair stored in the vocabulary information storage unit 122, the flow calculates the average similarity, the cosine similarity, and the appearance frequency difference, calculates a weighted average of the average similarity and the cosine similarity with weights determined by the appearance frequency difference, and registers the result in the vocabulary information storage unit 122 as the similarity of the pair.

Weighting the average similarity and the cosine similarity according to the appearance frequency improves the accuracy of the similarity of spelling variation candidate / OK word pairs.

In step S331, one spelling variation candidate / OK word pair is acquired from the vocabulary information storage unit 122.

In step S332, the average similarity of the spelling variation candidate / OK word pair is calculated. The calculation follows steps S302 to S313 shown in FIG. 4.

In step S333, the vector of the spelling variation candidate and the vector of the OK word are acquired from the compound word vectors 1222 stored in the vocabulary information storage unit 122.

In step S334, the cosine similarity between the vector of the spelling variation candidate and the vector of the OK word is calculated.

In step S335, the appearance frequency of the spelling variation candidate and the appearance frequency of the OK word are acquired from the vocabulary information storage unit 122.

In step S336, the appearance frequency difference of the spelling variation candidate / OK word pair is calculated from the appearance frequencies of the candidate and the OK word. Various formulas are possible. For example, it may be the difference between the appearance frequency of the OK word and that of the spelling variation candidate, or the quotient obtained by dividing the appearance frequency of the spelling variation candidate by that of the OK word.

In step S337, a weighted average of the average similarity and the cosine similarity is calculated, with weights determined by the appearance frequency difference of the spelling variation candidate / OK word pair. In general, when a frequently appearing OK word has a rarely appearing spelling variation, the co-occurrence expressions of the spelling variation are often skewed, so the average similarity is preferable to the cosine similarity between the compound word vectors of the candidate and the OK word. The method of calculating the weighted average is set appropriately in advance according to the formula used for the appearance frequency difference in step S336. When the appearance frequency difference is a difference, the weight of the average similarity may be, for example, (1 − 1/difference) and the weight of the cosine similarity (1/difference). When the appearance frequency difference is a quotient, the weight of the average similarity may be (1 − quotient) and the weight of the cosine similarity the quotient.
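The two weighting schemes given as examples for step S337 can be sketched as follows. The function name and the sample numbers are hypothetical. The text does not say what to do when the difference is 0 or the quotient exceeds 1, so the guards below (clamping the cosine weight to 1) are assumptions, and `freq_ok` is assumed to be positive.

```python
def weighted_similarity(
    avg_sim: float,
    cos_sim: float,
    freq_ng: int,
    freq_ok: int,
    mode: str = "quotient",
) -> float:
    """Blend average similarity and cosine similarity by appearance frequency."""
    if mode == "difference":
        diff = abs(freq_ok - freq_ng)
        # Assumed guard: for diff < 1 the weight 1/diff is undefined, so use 1.
        w_cos = 1.0 if diff < 1 else 1.0 / diff
    elif mode == "quotient":
        # Assumed guard: clamp so both weights stay within [0, 1].
        w_cos = min(freq_ng / freq_ok, 1.0)
    else:
        raise ValueError(f"unknown mode: {mode}")
    w_avg = 1.0 - w_cos                 # weights sum to 1
    return w_avg * avg_sim + w_cos * cos_sim


# Candidate seen 10 times, OK word 100 times: quotient 0.1, difference 90.
print(round(weighted_similarity(0.9, 0.5, 10, 100, "quotient"), 4))
print(round(weighted_similarity(0.9, 0.5, 10, 100, "difference"), 4))
```

In both schemes a rare candidate next to a frequent OK word shifts the weight toward the average similarity, which matches the rationale stated for step S337.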

In step S338, the weighted average is registered in the vocabulary information storage unit 122 as the similarity of the spelling variation candidate / OK word pair.

In step S339, it is checked whether all spelling variation candidate / OK word pairs in the vocabulary information storage unit 122 have been acquired. If they have, the process ends; otherwise, it returns to step S331.

According to Example 3, weighting the average similarity and the cosine similarity according to the appearance frequency improves the accuracy of the similarity of spelling variation candidate / OK word pairs.

10: the present device, 11: processing unit, 12: storage unit, 111: input/output unit, 112: compound word extraction unit, 113: vocabulary vector acquisition unit, 114: average similarity calculation unit, 115: selection unit, 116: vocabulary appearance frequency acquisition unit, 121: document storage unit, 122: vocabulary information storage unit, 1221: general word vector, 1222: compound word vector

Claims

A spelling variation extraction device comprising:
a storage unit that stores a document before correction and a document after correction in one domain; and
a processing unit that extracts compound words from the document before correction and the document after correction, and sets a compound word that appears only in the document before correction as a spelling variation candidate of a domain term,
wherein the processing unit:
registers the combination of the spelling variation candidate and the domain term in the storage unit as a spelling variation candidate / domain term pair; and
divides the spelling variation candidate and the domain term of the spelling variation candidate / domain term pair registered in the storage unit into general terms, calculates the maximum similarity between the general terms of the divided spelling variation candidate and the general terms of the divided domain term, and calculates the average similarity of the spelling variation candidate / domain term pair based on the calculated maximum similarity of the general terms.

The spelling variation extraction device according to claim 1, wherein
the storage unit has a vocabulary information storage unit, and
the processing unit has a compound word extraction unit that extracts compound words by morphological analysis from the document before correction stored in the storage unit and registers them in the vocabulary information storage unit as a compound word dictionary before correction, and that extracts compound words by morphological analysis from the document after correction stored in the storage unit and registers them in the storage unit as a compound word dictionary after correction.

The spelling variation extraction device according to claim 2, wherein the processing unit has a vocabulary vector acquisition unit that calculates vector values from co-occurrence expressions of the general terms.

The spelling variation extraction device according to claim 3, wherein the processing unit has an average similarity calculation unit that calculates the maximum similarity based on the vector values calculated by the vocabulary vector acquisition unit and calculates the average similarity of the spelling variation candidate / domain term pair.

The spelling variation extraction device according to claim 4, further comprising a selection unit that selects, based on the average similarity calculated by the average similarity calculation unit, the spelling variation candidate / domain term pairs that satisfy a condition, and stores them in the vocabulary information storage unit as spelling variation / domain term pairs.

The spelling variation extraction device according to claim 4, wherein
the processing unit has a vocabulary appearance frequency acquisition unit that acquires, for the compound words of the spelling variation candidate / domain term pair, the appearance frequency with which the spelling variation candidate appears in the document before correction and the appearance frequency with which the domain term appears in the document after correction, and
the average similarity calculation unit calculates the average similarity when the difference between the appearance frequency of the spelling variation candidate and the appearance frequency of the domain term in the document after correction is larger than a threshold, and, when the difference is smaller than the threshold, calculates the similarity of the spelling variation candidate / domain term pair based on the cosine similarity between the spelling variation candidate and the domain term that constitute the pair.

The spelling variation extraction device according to claim 4, wherein
the processing unit has a vocabulary appearance frequency acquisition unit that acquires, for the compound words of the spelling variation candidate / domain term pair, the appearance frequency with which the spelling variation candidate appears in the document before correction and the appearance frequency with which the domain term appears in the document after correction, and
the average similarity calculation unit calculates a weighted average of the average similarity of the spelling variation candidate / domain term pair and the cosine similarity between the spelling variation candidate and the domain term, according to the difference between the appearance frequency of the spelling variation candidate and the appearance frequency of the domain term.

A spelling variation extraction device that stores, as a pair, a domain term that is a correct compound word after correction and a spelling variation in the document before correction that corresponds to the domain term, obtained from a document before correction and a document after correction in one domain, the device comprising:
a storage unit that stores the document before correction and the document after correction in the one domain; and
a processing unit that extracts compound words from the document before correction and the document after correction, and sets a compound word that appears only in the document before correction as a spelling variation candidate of the domain term.

A spelling variation extraction method comprising:
storing a document before correction and a document after correction in one domain in a storage unit;
extracting, by a processing unit, compound words from the document before correction and the document after correction, and setting a compound word that appears only in the document before correction as a spelling variation candidate of a domain term;
registering the combination of the spelling variation candidate and the domain term in the storage unit as a spelling variation candidate / domain term pair; and
dividing the spelling variation candidate and the domain term of the spelling variation candidate / domain term pair registered in the storage unit into general terms, calculating the maximum similarity between the general terms of the divided spelling variation candidate and the general terms of the divided domain term, and calculating the average similarity of the spelling variation candidate / domain term pair based on the calculated maximum similarity of the general terms.

The method according to claim 9, wherein the processing unit extracts compound words by morphological analysis from the document before correction stored in the storage unit and registers them in the vocabulary information storage unit of the storage unit as a pre-correction compound word dictionary, and extracts compound words by morphological analysis from the corrected document stored in the storage unit and registers them in the storage unit as a post-correction compound word dictionary.

The method according to claim 10, wherein the processing unit calculates a vector value from a co-occurrence expression of the general term.
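
One common way to derive a vector from co-occurrence is to count neighboring tokens within a fixed window. The sketch below assumes that reading; the claim does not specify the window size or the counting scheme, and `cooccurrence_vectors` and the sample sentences are hypothetical.

```python
from collections import Counter


def cooccurrence_vectors(sentences, window=1):
    """Map each token to a count vector of tokens seen within `window` positions."""
    vocab = sorted({tok for sent in sentences for tok in sent})
    counts = {tok: Counter() for tok in vocab}
    for sent in sentences:
        for i, tok in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tok][sent[j]] += 1
    # Each vector follows the sorted vocabulary order.
    return vocab, {tok: [counts[tok][v] for v in vocab] for tok in vocab}


# Hypothetical, already-tokenized sentences.
sentences = [["open", "login", "screen"], ["open", "display", "screen"]]
vocab, vectors = cooccurrence_vectors(sentences)
print(vocab)
print(vectors["login"], vectors["display"])
```

In this toy corpus "login" and "display" occur in the same contexts, so they receive identical vectors, which is the property a similarity calculation on general terms relies on.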

The method according to claim 11, wherein the processing unit calculates the maximum similarity based on the vector values calculated by the vocabulary vector acquisition unit, and calculates the average similarity of the notation fluctuation candidate / domain term pair.

The method according to claim 12, wherein notation fluctuation candidate / domain term pairs that satisfy a condition are selected based on the calculated average similarity and stored in the vocabulary information storage unit as notation fluctuation / domain term pairs.

The method according to claim 12, wherein the processing unit,
For the compound words of the notation fluctuation candidate / domain term pair, obtains the difference between the appearance frequency with which the notation fluctuation candidate appears in the document before the correction and the appearance frequency with which the domain term appears in the corrected document, and
Calculates the similarity of the notation fluctuation candidate / domain term pair based on the average similarity when the difference is larger than a threshold, and based on the cosine similarity between the notation fluctuation candidate and the domain term constituting the pair when the difference is smaller than the threshold.
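
The threshold rule in this claim can be sketched as a simple switch. The function name `pair_similarity` and the numbers are hypothetical; using the absolute difference and treating a difference equal to the threshold like the "smaller" case are assumptions, since the claim does not cover them.

```python
def pair_similarity(freq_candidate, freq_domain, avg_sim, cos_sim, threshold):
    """Pick the similarity measure from the gap in appearance frequencies."""
    diff = abs(freq_candidate - freq_domain)
    # Assumption: absolute difference; diff == threshold falls to cosine similarity.
    return avg_sim if diff > threshold else cos_sim


# Hypothetical values: a large frequency gap selects the average similarity.
print(pair_similarity(50, 5, avg_sim=0.8, cos_sim=0.3, threshold=10))
# A small gap selects the cosine similarity of the whole compound words.
print(pair_similarity(12, 10, avg_sim=0.8, cos_sim=0.3, threshold=10))
```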

The method according to claim 12, wherein the processing unit,
For the compound words of the notation fluctuation candidate / domain term pair, obtains the appearance frequency with which the notation fluctuation candidate appears in the document before the correction and the appearance frequency with which the domain term appears in the corrected document, and
Calculates a weighted average of the average similarity and the cosine similarity between the notation fluctuation candidate and the domain term of the notation fluctuation candidate / domain term pair, according to the difference between the appearance frequency of the notation fluctuation candidate and the appearance frequency of the domain term.
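
A weighted average of this kind might look like the sketch below. The weighting function is purely an assumption: the claim says only that the weights depend on the frequency difference, not how. Here the weight on the average similarity grows with the difference, and `blended_similarity` and `scale` are hypothetical names.

```python
def blended_similarity(freq_candidate, freq_domain, avg_sim, cos_sim, scale=10.0):
    """Blend average and cosine similarity by the appearance-frequency gap."""
    diff = abs(freq_candidate - freq_domain)
    # Assumption: weight in [0, 1) that increases with the frequency difference.
    weight = diff / (diff + scale)
    return weight * avg_sim + (1.0 - weight) * cos_sim


# Equal frequencies: the result equals the cosine similarity.
print(blended_similarity(10, 10, avg_sim=0.8, cos_sim=0.4))
# A difference equal to `scale` weights both measures equally.
print(blended_similarity(20, 10, avg_sim=0.8, cos_sim=0.4))
```

Compared with a hard threshold, a continuous weight avoids an abrupt change in the score when the frequency difference crosses a single cut-off value.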