JP2019204214A - Learning device, learning method, program and estimation device

Info

Publication number: JP2019204214A
Application number: JP2018097926A
Authority: JP
Inventors: Naoyuki Ito (伊藤 直之); Kazuhisa Ono (大野 和久)
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2018-05-22
Filing date: 2018-05-22
Publication date: 2019-11-28
Anticipated expiration: 2038-05-22
Also published as: JP7163618B2

Abstract

PROBLEM TO BE SOLVED: To provide a learning device, a learning method, a program, and an estimation device capable of identifying whether a word is appropriate regardless of its position in a sentence. SOLUTION: A learning device 1 includes an acquisition unit 151 that acquires a sentence; a division unit 152 that divides the acquired sentence into a plurality of elements, each being a character or character string of a predetermined unit; a first learning unit 153 that learns, in the order of the acquired sentence, the element that appears next after each element; an order conversion unit 154 that rearranges the plurality of elements into reverse order, so that the sentence runs from its end to its beginning; and a second learning unit 155 that learns, in the order of the rearranged sentence, the element that appears next after each element. [Selected drawing] FIG. 15

Description

The present invention relates to a learning device, a learning method, a program, and an estimation device.

There are techniques that learn the co-occurrence relationships of phrases appearing in sentences in order to estimate the text that follows an input text, estimate errors in a text, and so on. For example, Patent Document 1 discloses a learning device and the like that constructs a recurrent neural network (hereinafter "RNN") which has learned the order in which words appear in a sentence and relationships such as the dependencies between words, uses the constructed RNN to estimate the text that follows an input text, and outputs a sentence with an appropriate structure.

[Patent Document 1] Japanese Unexamined Patent Application Publication No. 2018-45656

However, the invention of Patent Document 1 learns the order of words in a sentence starting from the beginning of the sentence. It is therefore difficult to estimate appropriately, for example, a word located at the beginning of a sentence or a word that follows a comma.

In one aspect, an object is to provide a learning device and the like that can identify whether a word is appropriate regardless of its position in a sentence.

In one aspect, a learning device includes: an acquisition unit that acquires a sentence; a division unit that divides the acquired sentence into a plurality of elements, each being a character or character string of a predetermined unit; a first learning unit that learns, in the order of the acquired sentence, the element that appears next after each element; an order conversion unit that rearranges the plurality of elements into reverse order, so that the sentence runs from its end to its beginning; and a second learning unit that learns, in the order of the rearranged sentence, the element that appears next after each element.

In one aspect, it is possible to identify whether a word is appropriate regardless of its position in a sentence.

FIG. 1 is a block diagram illustrating a configuration example of a learning device.
FIG. 2 is an explanatory diagram for explaining a language model.
FIG. 3 is an explanatory diagram for explaining a text learning process.
FIG. 4 is an explanatory diagram of an error estimation process based on text order in the forward direction only.
FIG. 5 is an explanatory diagram for explaining a reverse-order text learning process.
FIG. 6 is an explanatory diagram of an error estimation process.
FIG. 7 is a flowchart illustrating an example of a processing procedure of the text learning process.
FIG. 8 is a flowchart illustrating an example of a processing procedure of the error estimation process.
FIG. 9 is an explanatory diagram of an error estimation process according to Embodiment 2.
FIG. 10 is a flowchart illustrating an example of a processing procedure of the error estimation process according to Embodiment 2.
FIG. 11 is a block diagram illustrating a configuration example of a learning device according to Embodiment 3.
FIG. 12 is an explanatory diagram illustrating an example of a record layout of a phrase list.
FIG. 13 is an explanatory diagram for explaining a subword learning process.
FIG. 14 is a flowchart illustrating an example of a processing procedure of the subword learning process.
FIG. 15 is a functional block diagram illustrating the operation of the learning device of the embodiments described above.

Hereinafter, the present invention will be described in detail with reference to the drawings illustrating its embodiments.
(Embodiment 1)
FIG. 1 is a block diagram illustrating a configuration example of the learning device 1. This embodiment describes a learning device 1 that generates a language model 141 by learning the phrases appearing in a sentence according to their order in that sentence, and that uses the generated language model 141 to estimate error locations in a sentence.

The learning device 1 is an information processing device capable of various kinds of information processing, such as a server device, a personal computer, or a multifunction terminal. The learning device 1 divides a learning sentence into phrases (elements) of a predetermined unit and learns the divided phrases according to their order in the sentence. The learning device 1 thereby generates a language model 141 that predicts the appearance probability (occurrence probability) of the next phrase from the phrases up to that point.

FIG. 2 is an explanatory diagram for explaining the language model 141. The language model 141 models the probability that a natural-language sentence is generated. A typical language model 141 takes one or more phrases that appear in order from the beginning of a sentence, calculates the occurrence probability of each phrase that could follow them, and estimates the next phrase. The example of FIG. 2 shows the occurrence probability of the next phrase being calculated from the phrases "私" ("I"), "は" (topic particle), "学校" ("school"), and "に" ("to"), which appear in sequence from the beginning of the sentence. As shown in FIG. 2, a typical language model 141 receives phrases such as "私" as input, calculates the occurrence probabilities of candidates for the next phrase such as "行く" ("go") and "いる" ("be"), and predicts the candidate with the highest occurrence probability, "行く", as the next phrase. In this way, a typical language model 141 predicts the next phrase from the phrases up to that point.
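The next-phrase prediction described for FIG. 2 can be sketched as follows. This is a minimal illustration only: it substitutes simple bigram counts for the neural language model, and the corpus and the names `train_bigram` and `next_probs` are assumptions made for the example, not part of the publication.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each element follows the previous one (stand-in for the model)."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_probs(counts, prev):
    """Return the occurrence probability of each candidate that followed `prev`."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()} if total else {}

# Hypothetical, already-divided learning sentences.
corpus = [
    ["私", "は", "学校", "に", "行く"],
    ["私", "は", "学校", "に", "行く"],
    ["私", "は", "学校", "に", "いる"],
]
model = train_bigram(corpus)
probs = next_probs(model, "に")        # candidates after "に"
best = max(probs, key=probs.get)       # candidate with the highest probability
print(probs, best)
```

With this toy corpus, "行く" follows "に" in two of three sentences, so it is selected as the predicted next phrase.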

In this embodiment, the learning device 1 additionally reverses the sentence so that its end comes first, learns the order of the phrases in this reversed arrangement, and generates a language model 141 that predicts the immediately preceding phrase from one or more phrases that appear in order from the end of the sentence. That is, the learning device 1 learns both the forward order and the reverse order of the phrases in a sentence and generates two language models 141. Using these two language models 141, the learning device 1 estimates error locations, such as presumed typographical errors, in a sentence subject to error estimation.

Returning to FIG. 1, the description continues. The learning device 1 includes a control unit 11, a main storage unit 12, a communication unit 13, and an auxiliary storage unit 14.
The control unit 11 has one or more arithmetic processing units such as a CPU (Central Processing Unit), an MPU (Micro-Processing Unit), or a GPU (Graphics Processing Unit), and performs the various information processing, control processing, and the like of the learning device 1 by reading and executing a program P stored in the auxiliary storage unit 14. The main storage unit 12 is a temporary storage area such as an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or a flash memory, and temporarily stores data that the control unit 11 needs in order to execute arithmetic processing. The communication unit 13 includes a processing circuit and the like for performing communication-related processing, and transmits and receives information to and from external devices.

The auxiliary storage unit 14 is a large-capacity memory, a hard disk, or the like, and stores the program P and other data that the control unit 11 needs in order to execute processing. The auxiliary storage unit 14 also stores the language model 141 generated from the learning sentences as described above.

Note that the auxiliary storage unit 14 may be an external storage device connected to the learning device 1. The learning device 1 may also be a multi-computer system composed of a plurality of computers, or a virtual machine constructed virtually by software.

In this embodiment, the learning device 1 is not limited to the above configuration and may include, for example, a reading unit that reads information stored in a portable storage medium, an input unit that accepts operation input, and a display unit that displays images.

FIG. 3 is an explanatory diagram for explaining the text learning process. FIG. 3 shows the learning device 1 learning the phrases arranged in the forward direction in a learning sentence and constructing a language model 141 based on an LSTM (Long Short-Term Memory), which is a type of RNN. An outline of the processing executed by the learning device 1 is given below.

The learning device 1 acquires a plurality of learning sentences from outside, for example via the communication unit 13. The learning device 1 first performs natural language processing such as morphological analysis on each acquired learning sentence and divides it into phrases (elements), each being a character or character string of a predetermined unit. The unit of division is, for example, a word or a clause, but is not particularly limited. For example, the learning device 1 may store in advance a dictionary (not shown) containing a plurality of phrases and divide the sentence according to the phrases stored in that dictionary.
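One way the dictionary-based division could work is sketched below. The publication does not specify the matching strategy, so greedy longest match is an assumption made here, as are the function name `split_by_dictionary`, the dictionary contents, and the sample sentence; a real system would more likely use a morphological analyzer.

```python
def split_by_dictionary(text, dictionary):
    """Greedy longest-match split; characters not covered by the
    dictionary become single-character elements."""
    max_len = max(map(len, dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical dictionary and sentence ("Yesterday I rode the train").
dictionary = {"昨日", "は", "電車", "に", "乗っ", "た"}
tokens = split_by_dictionary("昨日は電車に乗った", dictionary)
print(tokens)
```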

Note that, as in Embodiment 3 described later, the learning device 1 may divide a sentence into units other than words or clauses. That is, it suffices that the learning device 1 can divide a learning sentence into elements that are characters or character strings of a predetermined unit; the element used as the unit of division is not limited to words, clauses, and the like.

The learning device 1 inputs the divided phrases to the input layer of the RNN and performs machine learning. FIG. 3 conceptually illustrates the configuration of the RNN. As shown in FIG. 3, the RNN has an input layer, an intermediate layer (hidden layer), and an output layer. The input layer has a plurality of neurons, each of which accepts one of the phrases that appear in order from the beginning of the sentence. The output layer has a plurality of neurons corresponding to the neurons of the input layer, each of which estimates and outputs the phrase that follows the phrase input to the corresponding input neuron. The intermediate layer has a plurality of neurons for computing the output values (phrases) of the output-layer neurons from the input values (phrases) of the input-layer neurons. Each neuron in the intermediate layer uses the intermediate-layer result computed for past input values (shown by the right-pointing arrows in FIG. 3) when computing on the next input value, and thereby estimates the next phrase from the phrases up to that point.

Note that the RNN configuration shown in FIG. 3 is an example, and this embodiment is not limited to it. For example, the intermediate layer is not limited to a single layer and may consist of two or more layers. The input layer and the output layer also need not have the same number of neurons; for example, the number of outputs may be smaller than the number of inputs.

In this embodiment the learning device 1 performs learning according to an RNN algorithm, but it may instead generate the language model 141 by learning according to another algorithm, such as another deep learning method, an N-gram model, an SVM (Support Vector Machine), a Bayesian network, or a decision tree.

The learning device 1 inputs the phrases of a learning sentence to the neurons of the input layer according to their order in the sentence and obtains an output value from each neuron of the output layer. In the example of FIG. 3, the learning device 1 inputs the phrases "昨日" ("yesterday"), "は" (topic particle), "電車" ("train"), "に" (particle), and so on, to the corresponding input-layer neurons in the order in which they appear in the sentence. The learning device 1 performs the computation at each output-layer neuron via the intermediate layer, calculates the occurrence probability of the phrase appearing at any given position in the sentence based on the phrases that appear before it, and estimates the phrase that appears next. In the example of FIG. 3, the learning device 1 calculates the occurrence probability of the second phrase from the first phrase "昨日" and makes its estimate. Likewise, it calculates the occurrence probability of the third phrase from the first and second phrases "昨日" and "は". The learning device 1 estimates each subsequent phrase in the same manner.

The learning device 1 compares each estimated phrase with the actual phrase (the correct value) and constructs the RNN by adjusting the parameters of the neurons so that the output value of each output-layer neuron approximates the correct value. For example, the learning device 1 adjusts the neuron weights and the like in each network layer so that the phrase estimated to follow "昨日" becomes the actual phrase "は". The learning device 1 thereby generates a language model 141 that has learned, in the forward direction of the learning sentences, which phrase appears next after the phrases that have appeared so far.
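The pairing of inputs and correct values used in forward learning can be sketched as follows. The helper name `next_element_pairs` is hypothetical; the sketch only shows which prefix is matched with which correct next phrase, and leaves out the RNN and its weight adjustment entirely.

```python
def next_element_pairs(tokens):
    """Pair every non-empty prefix of the sentence with the element
    that actually follows it (the correct value for that prefix)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_element_pairs(["昨日", "は", "電車", "に"])
for context, target in pairs:
    print(context, "->", target)
```

The first pair matches the example in the text: the context "昨日" has the correct value "は", and the context "昨日", "は" has the correct value "電車".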

FIG. 4 is an explanatory diagram of an error estimation process based on text order in the forward direction only. FIG. 4 conceptually illustrates the case where, in order to detect error locations in a sentence, the occurrence probability of each phrase is calculated based only on the forward language model 141.

The learning device 1 acquires, for example via the communication unit 13, a target sentence in which error locations are to be detected (estimated). As in learning, the learning device 1 divides the target sentence into phrases of a predetermined unit. The learning device 1 then refers to the generated language model 141 and calculates the occurrence probability of each phrase based on the phrases that appear before it. In the example shown in FIG. 4, to calculate the occurrence probability of the third phrase "昨日" ("yesterday"), the learning device 1 uses the RNN described above to calculate the occurrence probability of "昨日" from the first and second phrases "私" ("I") and "は". Based on the calculated occurrence probability, the learning device 1 determines whether the phrase is an error location. For example, the learning device 1 compares the occurrence probability with a predetermined threshold and determines that the phrase is an error location if the probability is at or below the threshold, that is, if the occurrence probability is low.

However, when error locations are estimated from the forward-only learning result, a phrase may be wrongly estimated to be an error depending on its position in the sentence. In Japanese text, for example, a wide variety of phrases can appear at the beginning of a sentence, immediately after a comma, or immediately after a particle, so the occurrence probability of many phrases in those positions is low. For example, although the sentence shown in FIG. 4 is correct, the phrase "私" ("I") at the beginning and the phrase "学校" ("school") immediately after the comma have low occurrence probabilities. In this case, the learning device 1 may estimate the correct phrases "私" and "学校" to be error locations.
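The forward-only threshold test, and the false positives it can produce, can be illustrated with a short sketch. The token sequence, the probability values, and the threshold are all invented for illustration; FIG. 4 itself is not reproduced here, and `flag_low_probability` is a hypothetical name.

```python
def flag_low_probability(tokens, probabilities, threshold):
    """Return the elements whose occurrence probability is at or below the threshold."""
    return [t for t, p in zip(tokens, probabilities) if p <= threshold]

# Hypothetical correct sentence and hypothetical forward probabilities.
tokens    = ["私", "は", "昨日", "、", "学校", "に", "行っ", "た"]
forward_p = [0.02, 0.70, 0.20, 0.30, 0.03, 0.60, 0.40, 0.80]

flagged = flag_low_probability(tokens, forward_p, threshold=0.05)
print(flagged)
```

Under these assumed values the correct phrases at the sentence start and after the comma are flagged, which is the failure mode the paragraph above describes.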

In this embodiment, therefore, the learning device 1 additionally learns the sentences with their phrases rearranged in reverse order to generate a second language model 141, and solves the above problem by estimating error locations from a combination of the forward and reverse learning results.

FIG. 5 is an explanatory diagram for explaining the reverse-order text learning process. FIG. 5 shows the phrases of a learning sentence being rearranged in reverse order and learned one by one starting from the end of the sentence.

For example, after completing the learning process for the forward direction, the learning device 1 reverses the learning sentence so that the phrase at the end comes first. That is, as shown in FIG. 5, the learning device 1 rearranges the sequence "昨日", "は", "電車", "に", ..., "行っ", "た" into the order "た", "行っ", "に", "京都", ..., "は", "昨日". The learning device 1 inputs the rearranged phrases to the input layer of an RNN for reverse-order learning, which has the same configuration as the RNN shown in FIG. 3. The learning device 1 then constructs an RNN that, based on one or more phrases appearing in reverse order from the end of the sentence, estimates the phrase that appears immediately before them in the original sentence. In other words, the learning device 1 constructs an RNN that estimates the phrase at any given position in a sentence from the phrases that follow it.
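The reverse-order rearrangement and the resulting learning pairs can be sketched as follows. The sentence is a shortened, hypothetical variant of the FIG. 5 example, and `reverse_order_pairs` is an illustrative name; the reverse-order RNN itself is not modeled.

```python
def reverse_order_pairs(tokens):
    """Reverse the sentence, then pair each prefix of the reversed sequence
    with the next element, i.e. the element that precedes that context
    in the original sentence."""
    rev = tokens[::-1]
    return rev, [(rev[:i], rev[i]) for i in range(1, len(rev))]

# Hypothetical shortened sentence ("Yesterday I went to Kyoto").
tokens = ["昨日", "は", "京都", "に", "行っ", "た"]
rev, pairs = reverse_order_pairs(tokens)
print(rev)
print(pairs[0], pairs[-1])
```

Each context here consists of the phrases that follow the target in the original sentence, so a model trained on these pairs estimates a phrase from its subsequent phrases.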

FIG. 6 is an explanatory diagram of the error estimation process. FIG. 6 shows the occurrence probability of each phrase being calculated based on the forward and reverse language models 141.
As described with reference to FIG. 4, the learning device 1 divides the target sentence for error estimation into phrases of a predetermined unit. The learning device 1 then inputs the phrases of the target sentence to the input layers of the forward and reverse RNNs and calculates the occurrence probability for each direction.

When calculating for the forward direction, the learning device 1 inputs the phrases to the input layer of the RNN trained on the forward direction and calculates, from one or more phrases that appear in order from the beginning, the occurrence probability of the phrase that follows them. That is, the learning device 1 calculates the occurrence probability of the next phrase from the phrases up to that point.

When calculating for the reverse direction, the learning device 1 first rearranges the phrases of the target sentence in reverse order. The learning device 1 then inputs the rearranged phrases, in that reverse order, to the RNN trained on the reverse order, and calculates, from one or more phrases that appear in order from the end (that is, from the subsequent phrases), the occurrence probability of the phrase that appears immediately before them. As a result, as shown in FIG. 6, the phrases "私" ("I") and "学校" ("school"), whose occurrence probabilities were low in the forward direction, have high occurrence probabilities in the reverse direction.

Based on the occurrence probabilities calculated for the forward and reverse directions, the learning device 1 determines whether each phrase is an error. For example, the learning device 1 estimates a phrase to be an error location when its forward and reverse occurrence probabilities are both at or below a predetermined threshold. The learning device 1 outputs the estimation result by outputting the target sentence with the error locations displayed differently from the rest of the text, for example by showing the phrases estimated to be errors in a different color.
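The two-direction decision rule can be sketched as below: a phrase is flagged only when both probabilities are at or below the threshold. The sentence (with a deliberately wrong particle "を" at index 5), the probability values, the threshold, and the name `estimate_error_positions` are all assumptions for illustration.

```python
def estimate_error_positions(forward_p, backward_p, threshold):
    """Return the positions whose forward AND reverse occurrence
    probabilities are both at or below the threshold."""
    return [
        i for i, (f, b) in enumerate(zip(forward_p, backward_p))
        if f <= threshold and b <= threshold
    ]

# Hypothetical target sentence and hypothetical probabilities, aligned by position.
tokens     = ["私", "は", "昨日", "、", "学校", "を", "行っ", "た"]
forward_p  = [0.02, 0.70, 0.20, 0.30, 0.03, 0.04, 0.40, 0.80]
backward_p = [0.60, 0.50, 0.30, 0.40, 0.45, 0.02, 0.35, 0.04]

errors = estimate_error_positions(forward_p, backward_p, threshold=0.05)
print([tokens[i] for i in errors])
```

In this toy data "私" and "学校" are low only in the forward direction and "た" is low only in the reverse direction, so none of them is flagged; only the phrase that is improbable from both sides is reported.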

Furthermore, when the learning device 1 detects an error location, it may estimate and present (output) a correct phrase to replace the phrase estimated to be an error. That is, the learning device 1 outputs a correction candidate for the phrase estimated to be an error location. As described with reference to FIG. 2, by using the language model 141 the learning device 1 can predict phrases with a high occurrence probability from the surrounding phrases. For example, the learning device 1 may refer to the language model 141 (the learning result) used for error estimation and output, as the correction candidate, the phrase with the highest occurrence probability at the position (order) of the phrase estimated to be an error. In this case, the learning device 1 may estimate correction candidates from both directions using the forward and reverse language models 141, or from only one of the two.
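A correction-candidate selection along these lines might look like the following sketch. The publication does not say how the two directions are combined, so multiplying the forward and reverse probabilities is purely an assumption here, as are the candidate distributions and the name `suggest_correction`.

```python
def suggest_correction(forward_cands, backward_cands=None):
    """Pick the candidate with the highest occurrence probability.
    With both directions given, score each candidate by the product of its
    forward and reverse probabilities (an assumed combination rule)."""
    if backward_cands is None:
        scores = dict(forward_cands)
    else:
        scores = {
            w: p * backward_cands.get(w, 0.0)
            for w, p in forward_cands.items()
        }
    return max(scores, key=scores.get)

# Hypothetical candidate distributions at the position of the flagged phrase.
forward_cands  = {"で": 0.5, "に": 0.4, "を": 0.1}
backward_cands = {"に": 0.7, "で": 0.1, "へ": 0.2}

one_direction = suggest_correction(forward_cands)
both_directions = suggest_correction(forward_cands, backward_cands)
print(one_direction, both_directions)
```

With these invented numbers the forward model alone would propose "で", while using both directions favors "に", which shows why the choice between one-direction and two-direction estimation can matter.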

As described above, by using language models 141 that have learned the phrases in both forward and reverse order, the learning device 1 can appropriately estimate error locations in a sentence.

Note that the learning device 1 may implement the language model 141 as a bidirectional RNN, learning the forward and reverse orders simultaneously and merging the two RNNs described above into one. However, because a unidirectional RNN is easier to tune, generating two separate RNNs as described above is preferable.

FIG. 7 is a flowchart illustrating an example of a processing procedure of the text learning process. The text learning process executed by the learning device 1 is described with reference to FIG. 7.
The control unit 11 of the learning device 1 acquires a plurality of learning sentences, for example via the communication unit 13 (step S11). The control unit 11 divides each acquired sentence into a plurality of phrases (elements), each being a character or character string of a predetermined unit (step S12). For example, the control unit 11 divides the sentence into units such as words or clauses, but the unit of the phrases (elements) is not particularly limited.

The control unit 11 learns, in order from the beginning to the end of the sentence, the phrase that appears next after each phrase, and generates the forward language model 141 (step S13). For example, as described above, the control unit 11 constructs an RNN that takes as input one or more phrases appearing in order from the beginning of the sentence and outputs the occurrence probability of the phrase that follows them.

The control unit 11 rearranges the phrases into reverse order, so that the sentence runs from its end to its beginning (step S14). The control unit 11 learns, in the order of the rearranged sentence, the phrase that appears next after each phrase, and generates the reverse language model 141 (step S15). For example, the control unit 11 constructs an RNN that takes as input one or more phrases appearing in order from the end of the sentence and outputs the occurrence probability of the phrase located immediately before them. The control unit 11 then ends the series of processes.
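Steps S11 to S15 can be sketched end to end as follows. This is not the RNN training described in the text: next-element counts stand in for the learned model, the sentences are pre-spaced so that `str.split` can act as the division step, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def learn_next_elements(sequences):
    """Count, for each element, which element appears next
    (a count-based stand-in for the RNN)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return counts

def train_both_directions(sentences, split):
    elements = [split(s) for s in sentences]             # S12: divide into elements
    forward = learn_next_elements(elements)              # S13: learn in forward order
    reversed_elements = [seq[::-1] for seq in elements]  # S14: rearrange in reverse order
    backward = learn_next_elements(reversed_elements)    # S15: learn in reverse order
    return forward, backward

# S11: acquired learning sentences (hypothetical, pre-spaced for simplicity).
sentences = ["私 は 学校 に 行く", "彼 は 学校 に いる"]
forward, backward = train_both_directions(sentences, str.split)
print(dict(forward["に"]), dict(backward["に"]))
```

In the forward counts "に" is followed by two different phrases, while in the reverse counts it is always followed by "学校", a small example of how the two directions capture different regularities.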

FIG. 8 is a flowchart illustrating an example of the processing procedure of the error estimation process. The error estimation process executed by the learning device 1 is described with reference to FIG. 8.
The control unit 11 of the learning device 1 acquires a target sentence to be checked for errors, for example via the communication unit 13 (step S31). The control unit 11 divides the acquired target sentence into a plurality of phrases (elements), each of which is a character or character string of a predetermined unit (step S32).

The control unit 11 refers to the forward language model 141 and calculates the occurrence probability of each phrase appearing in the target sentence (step S33). That is, the control unit 11 feeds one or more phrases, in order from the beginning of the target sentence, to the input layer of the RNN constructed as the language model 141, and calculates the occurrence probability of the phrase that follows them.

The control unit 11 reverses the order of the phrases in the target sentence (step S34). The control unit 11 then refers to the backward language model 141 and calculates the occurrence probability of each phrase appearing in the target sentence (step S35). That is, the control unit 11 feeds one or more phrases, in order from the end of the target sentence, to the input layer of the RNN constructed for the backward direction, and calculates the occurrence probability of the phrase that appears immediately before them.

Based on the occurrence probabilities calculated for the forward and backward directions, the control unit 11 estimates error locations in the target sentence (step S36). For example, the control unit 11 estimates that a phrase is an error location when its occurrence probabilities in both the forward and backward directions are equal to or less than a threshold. The control unit 11 outputs the estimation result (step S37). For example, the control unit 11 outputs the target sentence with the phrases estimated to be error locations displayed differently from the other phrases, for instance by color coding. In conjunction with step S37, the control unit 11 may also estimate and output a correction candidate for each phrase estimated to be an error location. For example, the control unit 11 refers to the forward and/or backward language model 141 and outputs, as a correction candidate, the phrase with the highest occurrence probability at the position of the phrase estimated to be an error location. The control unit 11 then ends the series of processes.
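The thresholding in step S36 can be sketched as follows, assuming the per-phrase occurrence probabilities have already been obtained from the two language models. The function name, the probability values, and the threshold are all hypothetical; the only point illustrated is that a phrase is flagged when both directions fall at or below the threshold, and that the backward probabilities, which are computed on the reversed sentence, must be flipped back to the original order before comparison.

```python
def estimate_error_positions(forward_probs, backward_probs_reversed, threshold):
    """Return indices (in original sentence order) flagged as error locations.

    forward_probs[i] is the forward model's probability for element i.
    backward_probs_reversed[j] is the backward model's probability for the
    j-th element of the reversed sentence, so it is flipped before comparison.
    """
    backward_probs = backward_probs_reversed[::-1]
    return [
        i
        for i, (pf, pb) in enumerate(zip(forward_probs, backward_probs))
        if pf <= threshold and pb <= threshold
    ]


# Hypothetical values for the phrases 骨 / 転移 / お / 認め / ない,
# where "お" stands in for a mistyped "を".
forward = [0.30, 0.42, 0.01, 0.08, 0.55]
backward_reversed = [0.50, 0.35, 0.02, 0.40, 0.25]
flagged = estimate_error_positions(forward, backward_reversed, threshold=0.05)
```

Requiring both directions to be low is the example criterion given in the text; other combinations of the two probabilities would need a different predicate.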

Although the above description assumes that the same learning device 1 performs both text learning and error estimation, the present embodiment is not limited to this. The learning device 1 may perform only text learning; the learning result (language model 141) may then be installed in another device to configure it as an estimation device, and that estimation device may perform error estimation.

As described above, according to the first embodiment, the order of the phrases in a sentence is learned not only in the forward direction but also in the backward direction. The learning device 1 thereby generates a language model 141 that can identify whether a word is appropriate regardless of its position in the sentence.

Further, according to the first embodiment, generating an RNN as the language model 141 yields a more accurate language model 141 than other learning algorithms such as an N-gram model.

Further, according to the first embodiment, calculating the occurrence probability of each phrase in the target sentence with both the forward and backward language models 141 makes it possible to point out error locations in the target sentence appropriately.

Further, according to the first embodiment, the language model 141 can also be used to output correction candidates for phrases estimated to be error locations, which improves convenience.

(Embodiment 2)
This embodiment describes a mode in which, in addition to the occurrence probability of each phrase, the character type and/or the number of characters of each phrase is used as a criterion for error estimation. Content that overlaps with the first embodiment is given the same reference numerals, and its description is omitted.
FIG. 9 is an explanatory diagram of the error estimation process according to the second embodiment. For each phrase in the target sentence, FIG. 9 shows the occurrence probabilities calculated from the forward and backward language models 141 together with the character type and the number of characters of the phrase.

In this embodiment, the learning device 1 estimates that a phrase is an error location when its occurrence probabilities in both the forward and backward directions are equal to or less than the threshold and its character type and/or number of characters satisfies a specific condition. As a condition on the character type, for example, the learning device 1 determines that a phrase consisting only of hiragana is an error location, because it is likely to be text that was left unconverted to kanji. As a condition on the number of characters, for example, the learning device 1 determines that a phrase with no more than a predetermined number of characters is an error location, because it is likely to be an input mistake.

In addition to the character type and the number of characters, the learning device 1 may add other conditions to the criteria for error estimation. For example, the learning device 1 may estimate that a location is an error when phrases whose occurrence probability is equal to or less than the threshold occur a predetermined number of times in succession. This makes it possible to appropriately identify locations where low-probability phrases appear consecutively, for example when a character string that should have been converted to kanji was mistakenly left unconverted.
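The additional criteria of this embodiment might be sketched as below. The helper names, the default threshold, the default character limit, and the minimum run length are all assumptions chosen for illustration. The text allows the character-type and character-count conditions to be combined as "and/or"; this sketch arbitrarily uses "or". The hiragana test relies on the Unicode range U+3041 to U+3096, which covers the hiragana letters.

```python
def is_hiragana_only(phrase):
    """True if every character is a hiragana letter (U+3041 to U+3096)."""
    return bool(phrase) and all("\u3041" <= ch <= "\u3096" for ch in phrase)


def is_suspect(phrase, forward_prob, backward_prob, threshold=0.05, max_chars=1):
    """Low probability in both directions AND a character-type or length condition."""
    low = forward_prob <= threshold and backward_prob <= threshold
    return low and (is_hiragana_only(phrase) or len(phrase) <= max_chars)


def low_probability_runs(low_flags, min_run=2):
    """Return (start, end) index ranges where low-probability phrases repeat."""
    runs, start = [], None
    for i, flag in enumerate(list(low_flags) + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs
```

For instance, `is_suspect("てんい", 0.01, 0.02)` holds because the phrase is low-probability and written only in hiragana, whereas the same probabilities for "転移" do not trigger the character-type or length condition under these defaults.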

FIG. 10 is a flowchart illustrating an example of the processing procedure of the error estimation process according to the second embodiment. The error estimation process according to this embodiment is described with reference to FIG. 10.
After executing step S35, the control unit 11 of the learning device 1 estimates error locations based on the character type and/or the number of characters of each phrase, in addition to the occurrence probabilities calculated for the forward and backward directions (step S201). For example, the control unit 11 determines whether the occurrence probability of a phrase is equal to or less than the threshold in both the forward and backward directions and whether the phrase has a specific character type or number of characters. As a condition on the character type, for example, the control unit 11 determines whether the phrase is written in hiragana. As a condition on the number of characters, for example, the control unit 11 determines whether the phrase has no more than a predetermined number of characters. In addition to the above conditions, the control unit 11 may also estimate as an error location any place where a predetermined number or more of consecutive phrases have forward and backward occurrence probabilities equal to or less than the threshold. The control unit 11 then moves the process to step S37.

As described above, according to the second embodiment, error locations can be estimated more appropriately by using as criteria not only the occurrence probability calculated with the language model 141 but also the character type and/or the number of characters of the phrase.

Further, according to the second embodiment, errors can be estimated still more appropriately by treating a place where phrases with occurrence probabilities equal to or less than the threshold appear consecutively as an error location.

(Embodiment 3)
This embodiment describes a mode that uses a unit called a subword as the unit for dividing sentences, so that text containing many unknown words not found in a general dictionary, such as a technical book, can still be handled appropriately.
FIG. 11 is a block diagram illustrating a configuration example of the learning device 1 according to the third embodiment. The auxiliary storage unit 14 of the learning device 1 according to this embodiment stores a phrase list 142. The phrase list 142 is a database that stores phrases, called subwords, extracted from predetermined sample documents. This embodiment assumes medical diagnostic reports as the documents (sentences) handled by the learning device 1 and describes a mode in which a report is divided into subwords for error estimation.

FIG. 12 is an explanatory diagram illustrating an example of the record layout of the phrase list 142. The phrase list 142 has a phrase column and a score column. The phrase column stores phrases (characters or character strings) that are subwords extracted (divided) from the sample documents. The score column stores, in association with each subword, a score (parameter) calculated from the frequency with which the subword appears in the sample documents.

FIG. 13 is an explanatory diagram of the subword learning process. The learning device 1 acquires a large number of predetermined sample documents, extracts subwords from each sample document, and generates the phrase list 142. FIG. 13 illustrates how subwords are extracted from a sample document and registered in the phrase list 142.

Unlike ordinary word segmentation, a subword (partial word) is a unit of text (a character or character string) obtained by segmenting sentences according to how frequently strings appear in them. The "word", which is generally used as the smallest constituent unit of a sentence, is a unit minimized from the standpoint of meaning, grammar, and the like; a subword, by contrast, is not based on meaning or grammar but is a unit minimized according to how frequently it is used in the text. Under the subword concept, a low-frequency phrase (character string) is represented by units shorter than the phrase itself, such as the characters or substrings that make it up.

The process of learning subwords from the sample documents is described below. In this embodiment, the learning device 1 extracts subwords from the sample documents using the BPE (Byte Pair Encoding) technique.

The learning device 1 first divides the sample documents into individual characters. In the example shown in the top row of FIG. 13, the learning device 1 divides the sentence 「腫大したリンパ節を認めない」 ("no enlarged lymph nodes are observed") into the characters 「腫」, 「大」, 「し」, 「た」, and so on.

As shown in the second row of FIG. 13, the learning device 1 registers every resulting character as a subword in the phrase list 142. At this point the learning device 1 calculates a score (parameter) for each subword (character) based on its frequency of appearance in the sample documents and registers the score in the score column of the phrase list 142. The score is calculated, for example, by normalizing the frequency of appearance. In the example of FIG. 13, 「大」 appears more frequently in the text than 「腫」, so the score of 「大」 is 0.05, higher than the score of 「腫」, which is 0.01.

The weights and other values used for normalization may be changed as appropriate. Although the following description assumes that the processing is based on scores (parameters) obtained by normalizing the frequency of appearance, the learning device 1 may instead use the raw, unnormalized frequency as the score; it suffices that the processing can be performed on a parameter that reflects the frequency of appearance.

Next, the learning device 1 registers in the phrase list 142 a two-character string formed by joining adjacent characters in the sample documents, chosen according to the frequency of appearance of that string. Specifically, the learning device 1 registers in the phrase list 142 the two-character string with the highest score in the text.

For example, the learning device 1 takes two-character strings one after another from the beginning to the end of the text and calculates the score of each. In the example of FIG. 13, the learning device 1 first calculates the score of 「腫大」, then that of 「大し」, then that of 「した」, and so on. The learning device 1 then registers in the phrase list 142 the two-character string with the highest score. In the example shown in the third row of FIG. 13, the string 「転移」 ("metastasis") had the highest score, so the learning device 1 registers 「転移」 in the phrase list 142 as a subword. The learning device 1 also registers in the phrase list 142 the score calculated from the frequency of appearance of that string.

The learning device 1 then searches the sample documents again and registers in the phrase list 142 the two-unit string with the highest score. In doing so, the learning device 1 treats any character string already registered in the phrase list 142 as a single character when searching for a new subword. In the above example, 「転移」 has already been registered in the phrase list 142, so the string 「転移」 is treated as one character. In this way, the learning device 1 uses the BPE technique to compress adjacent characters into a single piece of information (a character string). When calculating scores for the portions that straddle 「転移」, the learning device 1 treats 「骨転移」, which joins 「転移」 with the preceding 「骨」 ("bone"), and 「転移が」, which joins 「転移」 with the following 「が」, as two-character strings and calculates their scores.

In this way, the learning device 1 joins two subwords (characters or character strings) already registered in the phrase list 142 to identify a new subword (character string) and adds the new subword to the phrase list 142 according to its frequency of appearance. In the example shown in the fourth row of FIG. 13, the string 「骨転移」 ("bone metastasis"), which joins the one-character subword 「骨」 and the two-character subword 「転移」, had the highest score, so the learning device 1 adds the string 「骨転移」 to the phrase list 142 as a new subword.

In the same manner, the learning device 1 repeatedly joins two adjacent subwords (characters or character strings) in the sample documents and registers the resulting string in the phrase list 142 according to its frequency of appearance. The learning device 1 performs this process on a plurality of sample documents and repeats it until the number of subwords registered in the phrase list 142 reaches a predetermined maximum (for example, 8000). The learning device 1 thereby generates the phrase list 142 illustrated in FIG. 12. In this way, the learning device 1 learns the character string patterns (subwords) that tend to appear in the text.
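The merge loop described above can be sketched roughly as follows. This is a minimal BPE-style illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the score is taken to be the count divided by the total number of characters in the corpus (the text only says the frequency is normalized "for example"), and exactly one pair is merged per iteration.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace each adjacent (left, right) in seq with the joined string."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_subwords(sentences, max_subwords):
    """Return {subword: score}, grown by repeatedly merging the top pair."""
    corpus = [list(s) for s in sentences]                 # split into characters
    counts = Counter(ch for seq in corpus for ch in seq)
    total = sum(counts.values())
    # Assumed normalization: frequency divided by total character count.
    phrase_list = {ch: n / total for ch, n in counts.items()}
    while len(phrase_list) < max_subwords:
        pair_counts = Counter()
        for seq in corpus:
            pair_counts.update(zip(seq, seq[1:]))         # adjacent subword pairs
        if not pair_counts:
            break                                         # nothing left to merge
        (left, right), n = pair_counts.most_common(1)[0]
        phrase_list[left + right] = n / total             # register the new subword
        corpus = [merge_pair(seq, left, right) for seq in corpus]
    return phrase_list


# Hypothetical sample sentences.
vocab = learn_subwords(["骨転移を認めない", "骨転移あり", "転移なし"], max_subwords=13)
```

With these three sample sentences, 「転移」 is the most frequent pair and is merged first, after which 「骨」 followed by 「転移」 becomes the most frequent pair, mirroring the order of the example in the text.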

In the above description, the learning device 1 registers in the phrase list 142 the string with the highest frequency-based score in the text, but it may instead, for example, set a score threshold and register every string whose score is equal to or greater than the threshold as a subword. In other words, it suffices that the learning device 1 can register subwords according to their frequency of appearance; the criterion applied to the frequency is not particularly limited.

The learning device 1 refers to the phrase list 142 to divide the sentences for learning (existing diagnostic reports in the above example) into subwords and generates the language model 141. For example, the learning device 1 divides each sentence so that the sum of the scores of all the resulting subwords is maximized. Similarly, the learning device 1 divides the target sentence (diagnostic report) to be checked for errors into subwords and estimates error locations using the language model 141.
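One way to find the division that maximizes the sum of subword scores is dynamic programming over end positions, sketched below. The text does not say which search procedure is used, so this is only an assumed approach; the function name and the scores in the example are hypothetical.

```python
def segment(sentence, scores):
    """Split sentence into subwords from scores so the summed score is maximal."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)   # best[i]: max total score for sentence[:i]
    back = [0] * (n + 1)               # back[i]: start index of the last subword
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = sentence[start:end]
            if piece in scores and best[start] + scores[piece] > best[end]:
                best[end] = best[start] + scores[piece]
                back[end] = start
    if best[n] == float("-inf"):
        raise ValueError("sentence contains text missing from the phrase list")
    pieces, end = [], n
    while end > 0:                     # walk the back-pointers from the end
        pieces.append(sentence[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical phrase list scores.
scores = {"骨": 0.02, "転": 0.01, "移": 0.01, "転移": 0.03,
          "骨転移": 0.08, "あ": 0.02, "り": 0.02}
pieces = segment("骨転移あり", scores)
```

With these scores the three-way split 骨転移 / あ / り totals 0.12, which beats both 骨 / 転移 / あ / り (0.09) and the character-by-character split (0.08).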

As described above, the learning device 1 divides the target sentence not by general criteria such as meaning or grammar but by subwords delimited according to frequency of appearance, and detects (estimates) erroneous portions. This removes the need for a manually created dictionary and makes it possible to handle text containing many unknown words.

FIG. 14 is a flowchart illustrating an example of the processing procedure of the subword learning process. The subword learning process executed by the learning device 1 is described with reference to FIG. 14.
The control unit 11 of the learning device 1 acquires a plurality of sample documents for learning (step S301). The control unit 11 divides the acquired sample documents into individual characters (step S302). The control unit 11 registers all the resulting characters as subwords in the phrase list 142 (step S303).

For each character string formed by joining two subwords (characters or character strings) that are registered in the phrase list 142 and adjacent in the sample documents, the control unit 11 calculates a score (parameter) according to the frequency with which the string appears in the sample documents (step S304). For example, the control unit 11 takes pairs of adjacent subwords from the beginning to the end of the text, joins each pair into one character string, and calculates in turn a score based on the frequency of appearance of each string.

According to the scores calculated in step S304, the control unit 11 registers a character string formed by joining two subwords in the phrase list 142 as a new subword (step S305). Specifically, the control unit 11 registers in the phrase list 142 the string with the highest score among all the strings scored in step S304. The control unit 11 also registers in the phrase list 142 the frequency-based score calculated in step S304.

The control unit 11 determines whether the predetermined maximum number of subwords has been registered in the phrase list 142 (step S306). If it determines that the maximum number has not been reached (S306: NO), the control unit 11 returns the process to step S304. If it determines that the maximum number of subwords has been registered (S306: YES), the control unit 11 ends the series of processes.

As described above, according to the third embodiment, the elements (phrases) into which a sentence is divided are not words, clauses, or other units based on meaning or grammar, but subwords delimited according to frequency of appearance. This removes the need for a manually created dictionary and makes it possible to handle text containing many unknown words. In particular, because division into subwords splits low-frequency phrases into pieces with few characters, combining this embodiment with the second embodiment (error estimation based on character type and/or number of characters, and error estimation based on runs of phrases whose occurrence probability is equal to or less than the threshold) can be expected to improve accuracy further.

(Embodiment 4)
FIG. 15 is a functional block diagram illustrating the operation of the learning device 1 of the embodiments described above. When the control unit 11 executes the program P, the learning device 1 operates as follows.
The acquisition unit 151 acquires a plurality of sentences. The dividing unit 152 divides each acquired sentence into a plurality of elements, each of which is a character or character string of a predetermined unit. The first learning unit 153 learns, in the order of the acquired sentence, the element that appears next after each element. The order conversion unit 154 rearranges the plurality of elements into the reverse order, from the end of the sentence to its beginning. The second learning unit 155 learns, in the order of the rearranged sentence, the element that appears next after each element.

The fourth embodiment is as described above and is otherwise the same as the first to third embodiments; corresponding parts are therefore given the same reference numerals, and their detailed description is omitted.

The embodiments disclosed herein are illustrative in all respects and should not be considered restrictive. The scope of the present invention is indicated by the claims rather than by the description above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

Description of reference numerals
1 Learning device
11 Control unit
12 Main storage unit
13 Communication unit
14 Auxiliary storage unit
P Program
141 Language model
142 Phrase list

Claims

A learning device comprising:
an acquisition unit that acquires a sentence;
a dividing unit that divides the acquired sentence into a plurality of elements, each of which is a character or character string of a predetermined unit;
a first learning unit that learns, in the order of the acquired sentence, the element that appears next after each element;
an order conversion unit that rearranges the plurality of elements into the reverse order, from the end of the sentence to its beginning; and
a second learning unit that learns, in the order of the rearranged sentence, the element that appears next after each element.

The learning device according to claim 1, wherein the first and second learning units each generate a recurrent neural network.

The learning device according to claim 1, further comprising:
a document dividing unit that divides each of a plurality of documents into characters or character strings of a predetermined unit; and
a registration unit that registers the characters or character strings in a phrase list according to the frequency with which they appear in the documents,
wherein the dividing unit divides the sentence into the characters or character strings registered in the phrase list.

A learning method in which a computer executes a process comprising:
acquiring a plurality of sentences;
dividing each acquired sentence into a plurality of elements, each of which is a character or character string of a predetermined unit;
learning, in the order of the acquired sentence, the element that appears next after each element;
rearranging the plurality of elements into the reverse order, from the end of the sentence to its beginning; and
learning, in the order of the rearranged sentence, the element that appears next after each element.

A program that causes a computer to execute a process comprising:
acquiring a plurality of sentences;
dividing each acquired sentence into a plurality of elements, each of which is a character or character string of a predetermined unit;
learning, in the order of the acquired sentence, the element that appears next after each element;
rearranging the plurality of elements into the reverse order, from the end of the sentence to its beginning; and
learning, in the order of the rearranged sentence, the element that appears next after each element.

An estimation device comprising:
an acquisition unit that acquires a target sentence to be checked for errors;
a dividing unit that divides the acquired target sentence into a plurality of elements, each of which is a character or character string of a predetermined unit;
a first calculation unit that calculates the occurrence probability of each element in the target sentence by referring to a first language model that has learned, from sentences for learning and in the order of those sentences, the element that appears next after each element;
an order conversion unit that rearranges the plurality of elements of the target sentence into the reverse order, from the end of the target sentence to its beginning;
a second calculation unit that calculates the occurrence probability of each element of the rearranged target sentence by referring to a second language model that has learned, from the sentences for learning and in reverse order, the element that appears next after each element; and
an estimation unit that estimates that an element is an error location based on the occurrence probabilities of that element calculated by the first and second calculation units.

The estimation device according to claim 6, wherein the estimation unit estimates that the element is an error location based on the occurrence probability of the element and on the character type and/or the number of characters of the element.

The estimation device according to claim 6 or 7, wherein, when a predetermined number or more of elements whose occurrence probabilities are equal to or less than a threshold appear consecutively, the estimation unit estimates that those consecutive elements are error locations.

The estimation device according to any one of claims 6 to 8, further comprising an output unit that, with reference to the first or second language model, outputs as a correction candidate the element with the highest occurrence probability at the position of the element estimated to be an error location.