JP7032650B2

JP7032650B2 - Similar text search method, similar text search device and similar text search program

Info

Publication number: JP7032650B2
Application number: JP2018123365A
Authority: JP
Inventors: 謙介馬場
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2022-03-09
Anticipated expiration: 2038-06-28
Also published as: JP2020004107A

Description

The present invention relates to a similar text search method, a similar text search device, and a similar text search program.

In natural language processing by a computer, it is sometimes desirable to search the stored texts held in a database for a stored text similar to an input text. For example, one could build a dialogue system in which sample inquiry sentences and the response sentences corresponding to those samples are registered in a database; the system searches for a sample similar to an input inquiry sentence and outputs the response sentence corresponding to that sample.

One way to search for a stored text similar to an input text is to evaluate how many words the two texts have in common. For example, the following search method is conceivable. A plurality of hash functions (sometimes called Min-hash functions), each of which calculates one hash value from the set of words contained in a text, are defined. Each hash function has a correspondence that associates different values with different words, and outputs, as its hash value, the smallest of the values corresponding to the words in a given word set. For each stored text, a vector listing the hash values calculated with these hash functions is generated in advance. A vector is calculated in the same way from the word set of the input text and the same hash functions, and a stored text whose vector approximates it is selected.
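The Min-hash scheme described above can be sketched as follows. This is a minimal illustration, not the implementation of the patent: the function names `make_minhash_functions` and `minhash_vector`, the use of seeded random permutations as the word-to-value correspondences, and the toy vocabulary are all assumptions made for the example.

```python
import random


def make_minhash_functions(vocabulary, num_functions, seed=0):
    """Build one word->value table per hash function.

    Each table is a different random permutation of 0..len(vocabulary)-1,
    so different words map to different values within a table.
    (Assumption: the vocabulary is known in advance.)
    """
    rng = random.Random(seed)
    words = sorted(vocabulary)
    tables = []
    for _ in range(num_functions):
        values = list(range(len(words)))
        rng.shuffle(values)
        tables.append(dict(zip(words, values)))
    return tables


def minhash_vector(word_set, tables):
    """Return the vector of hash values for a word set.

    Each hash value is the smallest table value among the words in the set.
    Raises KeyError for a word that is not in the vocabulary.
    """
    return tuple(min(table[w] for w in word_set) for table in tables)


# Toy vocabulary and word set (hypothetical data).
vocabulary = {"cat", "food", "mail", "meeting", "eat"}
tables = make_minhash_functions(vocabulary, num_functions=4)
vector = minhash_vector({"cat", "food"}, tables)
```

Because each component depends only on the word set, the same set always produces the same vector, and sets that share many words tend to agree in many components.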

A question answering server that narrows down the candidate response sentences for an inquiry sentence has also been proposed. The proposed server extracts words from the inquiry sentence, searches a database for comment sentences containing the extracted words, and classifies the retrieved comment sentences into a plurality of topic groups. For each topic group, the server extracts from the database the candidate response sentences whose similarity to the comment sentences is equal to or greater than a threshold.

A terminal device that searches for a response sentence corresponding to an inquiry sentence by combining different search methods has also been proposed. The proposed terminal device searches for a response sentence by a first search method and plays back the retrieved first response sentence as speech. While the first response sentence is being played back, the terminal device searches for a response sentence by a second search method; if the retrieved second response sentence has low similarity to the first, the terminal device additionally plays back the second response sentence as speech after playback of the first has finished.

A hash function generation device that learns a hash function for calculating hash values of content has also been proposed. The proposed device calculates a feature amount from each of a plurality of content items. Based on the semantic labels assigned to the content items, it also calculates the correlation of the semantic labels for each pair of content items. The device then learns the hash function so that the higher the correlation between the semantic labels, the closer the hash values calculated from the feature amounts of the two content items.

Japanese Unexamined Patent Publication No. 2014-112316
Japanese Unexamined Patent Publication No. 2016-9091
Japanese Unexamined Patent Publication No. 2016-66012

A search method that evaluates the commonality of the words appearing in two texts has the following problem: when the texts use different expressions, the commonality of words is rated low even though the content is highly similar, so relevant texts may be missed. For example, when the input text and the stored texts are messages uttered by users, spoken sentences tend to be short and varied in expression, so the probability that the input text and a stored text share a word is low overall. As a result, the accuracy of searching for a stored text similar to the input text tends to be low.

In one aspect, an object of the present invention is to provide a similar text search method, a similar text search device, and a similar text search program that improve the accuracy of searching for similar texts.

In one aspect, a similar text search method executed by a computer is provided. Input of a first text is accepted. Two or more first words contained in the first text are extracted and, with reference to a related word dictionary indicating groups of related words, a first dummy word indicating a group is assigned to each first word that belongs to that group. First feature information is generated according to a first word set containing the two or more first words and the first dummy word. A second text similar to the first text is then searched for based on a comparison between the first feature information and second feature information stored for each of a plurality of second texts. The second feature information for a second text is generated according to a second word set that contains two or more second words contained in that second text and a second dummy word indicating the group to which any of the second words belongs.

Also, in one aspect, a similar text search device having a storage unit and a processing unit is provided. Also, in one aspect, a similar text search program is provided.

In one aspect, the accuracy of searching for similar texts is improved.

A diagram illustrating an example of the similar text search device of the first embodiment.
A block diagram showing an example of the hardware of the information processing apparatus.
A diagram showing an example of an utterance table.
A diagram showing an example of hash functions and vectors.
A diagram showing an example of a search tree.
A diagram showing an example of calculating the Jaccard coefficient.
A graph showing an example of the relationship between the similarity threshold and the number of hits of search target utterances.
A diagram showing an example of a related word table.
A diagram showing an example of calculating a vector from a word set.
A diagram showing an example of selecting a response utterance corresponding to an inquiry utterance.
A block diagram showing an example of the functions of the information processing apparatus.
A flowchart showing an example procedure for index generation.
A flowchart (continued) showing an example procedure for index generation.
A flowchart showing an example procedure for utterance search.

Hereinafter, the embodiments will be described with reference to the drawings.
[First Embodiment]
The first embodiment will be described.

FIG. 1 is a diagram illustrating an example of the similar text search device of the first embodiment.
The similar text search device 10 of the first embodiment searches a database for a text similar to an input text. The similar text search device 10 may also be called an information processing apparatus or a computer. The similar text search device 10 may be a client device operated by a user, or a server device accessed via a network. The similar text search device 10 may be used in a dialogue system that accepts an inquiry text from a user and outputs a response text.

The similar text search device 10 has a storage unit 11 and a processing unit 12. The storage unit 11 may be a volatile semiconductor memory such as a RAM (Random Access Memory), or non-volatile storage such as an HDD (Hard Disk Drive) or a flash memory. The processing unit 12 is, for example, a processor such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or a DSP (Digital Signal Processor). The processing unit 12 may, however, include a special-purpose electronic circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). A set of multiple processors is sometimes referred to as a "multiprocessor" or simply a "processor".

The storage unit 11 stores a related word dictionary 13. The related word dictionary 13 is data indicating groups of mutually related words. Related words are words with similar meanings. For example, the related word dictionary 13 indicates that 猫 ("cat") and にゃんこ ("kitty") belong to the same group G1, and that ご飯 ("meal") and えさ ("pet food") belong to the same group G2.

The storage unit 11 also stores a plurality of pieces of feature information (second feature information), including feature information 19a and 19b, corresponding to a plurality of texts (second texts), including texts 15a and 15b. The texts 15a and 15b are search target texts to be compared with a text 14 (first text) input to the similar text search device 10, and each is a character string containing two or more words. The feature information 19a and 19b is generated from the texts 15a and 15b, respectively, and is used as an index for searching. The storage unit 11 may or may not store the texts 15a and 15b themselves. The feature information 19a and 19b may be generated by the similar text search device 10 or by another information processing apparatus.

The method of generating the feature information 19a and 19b is described later. When the similar text search device 10 is used in a dialogue system, the storage unit 11 may store a plurality of response texts (third texts) in association with the plurality of texts including the texts 15a and 15b. For example, when the search target text most similar to an inquiry text from the user is selected, the response text associated with the selected search target text is output.

The processing unit 12 accepts input of the text 14. The text 14 is an inquiry text input by the user and is a character string containing two or more words. The text 14 may be input from an input device connected to the similar text search device 10, or may be received from another information processing apparatus via a network. Alternatively, a speech signal may be input to the similar text search device 10 and converted into the text 14 by speech recognition.

The processing unit 12 extracts two or more words contained in the text 14 and generates a word set 16 containing the extracted words. For example, the processing unit 12 extracts にゃんこ ("kitty") and えさ ("pet food") from the text 14 and generates a word set 16 containing these two words.

The processing unit 12 also refers to the related word dictionary 13 and, for each word extracted from the text 14 that belongs to one of the groups, assigns a dummy word (first dummy word) indicating that group. The processing unit 12 adds the assigned dummy words to the word set 16. A dummy word is a virtual word that is not used in the text 14 or the texts 15a and 15b. For example, because にゃんこ ("kitty") belongs to group G1, the dummy word "G1" indicating group G1 is added to the word set 16. Likewise, because えさ ("pet food") belongs to group G2, the dummy word "G2" indicating group G2 is added to the word set 16. The original words にゃんこ and えさ are not deleted; they remain in the word set 16.
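The dummy word assignment described above can be sketched as a dictionary lookup. The sketch below is illustrative only: the name `expand_with_dummy_words`, the representation of the related word dictionary as a plain word-to-group mapping, and the use of the strings "G1" and "G2" as dummy words are assumptions. It also assumes the text has already been split into words; Japanese word segmentation is outside the scope of the example.

```python
# Related word dictionary: word -> group label.
# The groups G1 and G2 follow the example in the text; the dict layout is an assumption.
RELATED_WORDS = {
    "猫": "G1",
    "にゃんこ": "G1",
    "ご飯": "G2",
    "えさ": "G2",
}


def expand_with_dummy_words(words, related_words=RELATED_WORDS):
    """Return the word set with a dummy word added for every grouped word.

    The original words are kept; words that belong to no group add nothing.
    Assumes the group labels never collide with real words.
    """
    word_set = set(words)
    dummy_words = {related_words[w] for w in word_set if w in related_words}
    return word_set | dummy_words


word_set_16 = expand_with_dummy_words(["にゃんこ", "えさ"])
```

Here `word_set_16` contains the two original words together with "G1" and "G2", matching the word set 16 of the example.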

From the word set 16, which contains the original words of the text 14 and the dummy words, the processing unit 12 generates feature information 18 (first feature information) that serves as the index of the text 14. The feature information 18 is information for evaluating the commonality of the words appearing in the inquiry text and a search target text. The feature information 18 is generated, for example, as follows.

The processing unit 12 prepares a plurality of different hash functions in advance. The processing unit 12 inputs the word set 16 to each of the hash functions and generates, as the feature information 18, a vector listing the hash values output by the hash functions. The hash functions may be Min-hash functions. For example, each hash function has a correspondence that associates a unique value with each word and each dummy word, and different hash functions have different correspondences. A hash function outputs, as its hash value, the smallest of the values corresponding to the words and dummy words contained in the word set. For example, a vector such as (0, 1, 2, 2, ...) is generated from the word set 16 containing にゃんこ ("kitty"), えさ ("pet food"), "G1", and "G2".

The feature information 19a and 19b can be generated from the texts 15a and 15b in the same way. Two or more words contained in the text 15a are extracted, dummy words (second dummy words) are assigned to the extracted words with reference to the related word dictionary 13, and a word set 17a containing the original words of the text 15a and the dummy words is generated. The feature information 19a is then generated from the word set 17a; for example, a vector listing the hash values calculated by inputting the word set 17a to the plurality of hash functions is generated. Similarly, two or more words contained in the text 15b are extracted, dummy words are assigned to the extracted words with reference to the related word dictionary 13, and a word set 17b containing the original words of the text 15b and the dummy words is generated. The feature information 19b is then generated from the word set 17b; for example, a vector listing the hash values calculated by inputting the word set 17b to the plurality of hash functions is generated.

For example, 猫 ("cat") and ご飯 ("meal") are extracted from the text 15a, the dummy word "G1" indicating group G1, to which 猫 belongs, is added, and the dummy word "G2" indicating group G2, to which ご飯 belongs, is added. A vector such as (0, 0, 2, 2, ...) is then generated from the word set 17a containing 猫, ご飯, "G1", and "G2". Likewise, メール ("e-mail") and 会議 ("meeting") are extracted from the text 15b, the dummy word "G3" indicating group G3, to which メール belongs, is added, and the dummy word "G4" indicating group G4, to which 会議 belongs, is added. A vector such as (4, 4, 5, 5, ...) is then generated from the word set 17b containing メール, 会議, "G3", and "G4".

Based on a comparison between the feature information 18 and the feature information 19a and 19b, the processing unit 12 searches the search target texts, such as the texts 15a and 15b, for a text similar to the text 14. For example, the processing unit 12 searches the feature information corresponding to the search target texts, such as the feature information 19a and 19b, for the feature information closest to the feature information 18. When the feature information 18 is a vector of hash values, the closest feature information is, for example, the vector with the largest number of matching hash values. The processing unit 12 may search for the closest feature information using a predetermined nearest neighbor search algorithm. For example, a search tree for binary search generated from the vectors corresponding to the search target texts may be stored in the storage unit 11 in advance, and the processing unit 12 may traverse the search tree to find the vector closest to the feature information 18.

For example, suppose that the feature information 18 is the vector (0, 1, 2, 2, ...), the feature information 19a is the vector (0, 0, 2, 2, ...), and the feature information 19b is the vector (4, 4, 5, 5, ...). In this case, the feature information 18 is closer to the feature information 19a than to the feature information 19b, so the text 14 is more similar to the text 15a than to the text 15b.
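The comparison in this example can be checked with a short sketch that counts matching hash values position by position. The function names and the linear scan are assumptions for illustration (the text also allows a search tree), and only the four vector components actually shown in the example are used; the elided components are not guessed.

```python
def matching_hash_count(vec_a, vec_b):
    """Count the positions at which two hash vectors hold the same value."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(vec_a, vec_b) if a == b)


def most_similar(query_vec, candidates):
    """Return the key of the candidate vector with the most matching hash values.

    A plain linear scan; ties are resolved in favor of the first candidate.
    """
    return max(candidates, key=lambda key: matching_hash_count(query_vec, candidates[key]))


# First four components of the example vectors (the rest is elided in the text).
feature_18 = (0, 1, 2, 2)
index = {
    "text 15a": (0, 0, 2, 2),  # feature information 19a
    "text 15b": (4, 4, 5, 5),  # feature information 19b
}

best = most_similar(feature_18, index)
```

On these four components the feature information 18 agrees with 19a in three positions and with 19b in none, so `best` is "text 15a".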

The processing unit 12 may output the text 15a, which is similar to the text 14. For example, the processing unit 12 may output the text 15a to an output device, such as by displaying it on a display connected to the similar text search device 10, or may transmit the text 15a to another information processing apparatus. The processing unit 12 may also output the response text associated with the text 15a to an output device, such as by displaying it on a display, or may transmit it to another information processing apparatus. The processing unit 12 may also convert the response text associated with the text 15a into a speech signal and play it back as speech.

Here, the text 15a contains related words whose meanings are close to those of the words in the text 14, so the text 14 and the text 15a can be said to be similar to some extent. However, if the commonality between the words in the text 14 and the words in the text 15a is evaluated naively, the two texts share no words, so the text 15a may be judged dissimilar to the text 14. Conversely, if related words are simply treated as the same word, the text 14 and the text 15a are treated the same as texts with identical wording even though they use different expressions, so their similarity may be overestimated.

In contrast, the similar text search device 10 adds to the word set 16, in addition to the original words contained in the text 14, dummy words indicating the groups of related words to which those original words belong, and generates the feature information 18 from the word set 16. Likewise, in addition to the original words contained in the text 15a, dummy words indicating the groups of related words to which those words belong are added to the word set 17a, and the feature information 19a is generated from the word set 17a.

Thus, even if the words in the text 14 and the words in the text 15a are not identical, their semantic closeness is reflected in the evaluation. Moreover, because the original words remain in the word sets 16 and 17a, differences in wording are also reflected. In other words, the feature information 18 and 19a can express the fact that the text 14 and the text 15a are similar to some extent. As a result, the accuracy of searching for similar texts improves. In addition, a similar text can be found by comparing the feature information 18 generated from the word set 16 with the feature information 19a generated from the word set 17a, which speeds up the similar text search.
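The three ways of treating related words discussed above can be compared numerically. The sketch below uses the Jaccard coefficient of the word sets directly, as an assumption for illustration; with random Min-hash functions, the fraction of matching hash values is generally understood to approximate this coefficient. The helper names and the convention that two empty sets have similarity 0.0 are likewise assumptions.

```python
# Related word groups from the example in the text (dict layout is an assumption).
RELATED_GROUPS = {"猫": "G1", "にゃんこ": "G1", "ご飯": "G2", "えさ": "G2"}


def jaccard(a, b):
    """Jaccard coefficient |A and B| / |A or B|; 0.0 for two empty sets (a convention)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_related(words):
    """Treat related words as the same word by replacing each with its group label."""
    return {RELATED_GROUPS.get(w, w) for w in words}


def add_dummy_words(words):
    """Keep the original words and add one dummy word per group they belong to."""
    words = set(words)
    return words | {RELATED_GROUPS[w] for w in words if w in RELATED_GROUPS}


text_14 = {"にゃんこ", "えさ"}
text_15a = {"猫", "ご飯"}

plain = jaccard(text_14, text_15a)                                    # no shared words
merged = jaccard(merge_related(text_14), merge_related(text_15a))     # wording ignored
augmented = jaccard(add_dummy_words(text_14), add_dummy_words(text_15a))
identical = jaccard(add_dummy_words(text_14), add_dummy_words(text_14))
```

For this pair, `plain` is 0.0 and `merged` is 1.0, while `augmented` is 2/6 (two shared dummy words out of six distinct entries). A text with identical wording still scores 1.0, so the dummy word approach places related but differently worded texts between the two extremes.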

[Second Embodiment]
Next, a second embodiment will be described.
The information processing apparatus 100 of the second embodiment is used in a dialogue system that accepts an inquiry utterance from a user and outputs an appropriate response utterance corresponding to the inquiry utterance. The information processing apparatus 100 may be a client device or a server device.

FIG. 2 is a block diagram showing an example of the hardware of the information processing apparatus.
The information processing apparatus 100 has a CPU 101, a RAM 102, an HDD 103, an image signal processing unit 104, an input signal processing unit 105, a medium reader 106, and a communication interface 107, all connected to a bus. The information processing apparatus 100 corresponds to the similar text search device 10 of the first embodiment. The CPU 101 corresponds to the processing unit 12 of the first embodiment. The RAM 102 or the HDD 103 corresponds to the storage unit 11 of the first embodiment.

The CPU 101 is a processor that executes program instructions. The CPU 101 loads at least part of the programs and data stored in the HDD 103 into the RAM 102 and executes the programs. The CPU 101 may have a plurality of processor cores, and the information processing apparatus 100 may have a plurality of processors. A set of multiple processors is sometimes referred to as a "multiprocessor" or simply a "processor".

The RAM 102 is a volatile semiconductor memory that temporarily stores programs executed by the CPU 101 and data used by the CPU 101 for computation. The information processing apparatus 100 may have a type of memory other than RAM, and may have a plurality of memories.

The HDD 103 is a non-volatile storage device that stores data and software programs such as an OS (Operating System), middleware, and application software. The information processing apparatus 100 may have other types of storage devices, such as a flash memory or an SSD (Solid State Drive), and may have a plurality of storage devices.

The image signal processing unit 104 outputs images to a display 104a connected to the information processing apparatus 100 in accordance with instructions from the CPU 101. Any type of display can be used as the display 104a, such as a CRT (Cathode Ray Tube) display, a liquid crystal display (LCD: Liquid Crystal Display), or an organic EL (OEL: Organic Electro-Luminescence) display.

The input signal processing unit 105 receives input signals from an input device 105a connected to the information processing apparatus 100. Any type of input device can be used as the input device 105a, such as a mouse, a touch panel, a touch pad, or a keyboard. A plurality of types of input devices may be connected to the information processing apparatus 100.

The medium reader 106 is a reading device that reads programs and data recorded on a recording medium 106a. Examples of the recording medium 106a include magnetic disks such as a flexible disk (FD: Flexible Disk) or an HDD, optical discs such as a CD (Compact Disc) or a DVD (Digital Versatile Disc), a magneto-optical disk (MO: Magneto-Optical disk), and a semiconductor memory. The medium reader 106 stores, for example, the programs and data read from the recording medium 106a in the RAM 102 or the HDD 103.

The communication interface 107 is an interface that is connected to a network 107a and communicates with other information processing apparatuses via the network 107a. The communication interface 107 may be a wired communication interface connected to a wired communication device such as a switch or a router, or a wireless communication interface connected to a base station or an access point.

Next, a method of determining a response utterance from an inquiry utterance will be described.
FIG. 3 is a diagram showing an example of an utterance table.
The information processing apparatus 100 holds an utterance table 141 as a database that accumulates a large number of sample inquiries and responses. The utterance table 141 includes the items utterance ID, search target utterance, and response utterance. The utterance ID is an identifier that identifies a search target utterance.

検索対象発話は、問い合わせ発話のサンプルである。検索対象発話は、ユーザが口頭で発することがある文を示す文字列であり、２以上の単語を含むテキストである。検索対象発話は、比較的短い１つの文または少数の文によって構成される。返答発話は、検索対象発話に対応付けられている。ある検索対象発話に類似する問い合わせ発話が入力されたとき、当該検索対象発話に対応付けられた返答発話が出力される。返答発話は、ユーザに対して伝達されることがある文を示す文字列であり、２以上の単語を含むテキストである。返答発話は、比較的短い１つの文または少数の文によって構成される。 The search target utterance is a sample of inquiry utterances. The search target utterance is a character string indicating a sentence that the user may verbally utter, and is a text containing two or more words. The utterance to be searched is composed of one relatively short sentence or a small number of sentences. The reply utterance is associated with the search target utterance. When an inquiry utterance similar to a certain search target utterance is input, the response utterance associated with the search target utterance is output. The reply utterance is a character string indicating a sentence that may be transmitted to the user, and is a text containing two or more words. Response utterances consist of one relatively short sentence or a small number of sentences.

例えば、情報処理装置１００は、ユーザが口頭で発した問い合わせ発話を示す音声信号を受信し、音声認識により音声信号をテキストの問い合わせ発話に変換する。情報処理装置１００は、テキストの問い合わせ発話に類似する検索対象発話を判定し、判定した検索対象発話に対応する返答発話を選択する。情報処理装置１００は、テキストの返答発話を音声信号に変換して音声として返答発話を再生する。 For example, the information processing apparatus 100 receives a voice signal indicating an inquiry utterance verbally uttered by the user, and converts the voice signal into a text inquiry utterance by voice recognition. The information processing apparatus 100 determines a search target utterance similar to a text inquiry utterance, and selects a response utterance corresponding to the determined search target utterance. The information processing apparatus 100 converts the text response utterance into a voice signal and reproduces the response utterance as voice.

次に、問い合わせ発話に類似する検索対象発話を探索する探索方法について説明する。情報処理装置１００は、Min-hash関数を用いて発話間の出現単語の共通度を評価する。 Next, a search method for finding a search target utterance similar to the inquiry utterance will be described. The information processing apparatus 100 evaluates the commonality of the words appearing in two utterances by using Min-hash functions.

図４は、ハッシュ関数およびベクトルの例を示す図である。 FIG. 4 is a diagram showing an example of hash functions and vectors.

一例として、情報処理装置１００は、問い合わせ発話３１を受け付ける。また、発話テーブル１４１に検索対象発話３２，３３が登録されている。また、情報処理装置１００は、Min-hash関数であるハッシュ関数３４～３６を保持している。 As an example, the information processing apparatus 100 receives an inquiry utterance 31. Further, the search target utterances 32 and 33 are registered in the utterance table 141. Further, the information processing apparatus 100 holds hash functions 34 to 36, which are Min-hash functions.

問い合わせ発話３１は、単語として「大谷」、「ＬＡ」、「盗塁」、「メジャー」、「ボール」および「本塁打」を含む。検索対象発話３２は、単語として「松井」、「ＮＹ」、「三振」、「メジャー」、「ボール」および「本塁打」を含む。検索対象発話３３は、単語として「大谷」、「三振」、「太刀」、「戦国時代」、「関ヶ原」および「切腹」を含む。情報処理装置１００は、問い合わせ発話３１を受け付けたとき、ハッシュ関数３４～３６を用いて問い合わせ発話３１からベクトル３７を生成する。また、情報処理装置１００は、ハッシュ関数３４～３６を用いて検索対象発話３２から予めベクトル３８を生成して保持しておく。また、情報処理装置１００は、ハッシュ関数３４～３６を用いて検索対象発話３３から予めベクトル３９を生成して保持しておく。 The inquiry utterance 31 includes the words "Otani", "LA", "stolen base", "major", "ball" and "home run". The search target utterance 32 includes the words "Matsui", "NY", "strikeout", "major", "ball" and "home run". The search target utterance 33 includes the words "Otani", "strikeout", "tachi", "Sengoku period", "Sekigahara", and "seppuku". When the information processing apparatus 100 receives the inquiry utterance 31, the information processing apparatus 100 generates the vector 37 from the inquiry utterance 31 by using the hash functions 34 to 36. Further, the information processing apparatus 100 generates and holds the vector 38 in advance from the search target utterance 32 by using the hash functions 34 to 36. Further, the information processing apparatus 100 generates and holds a vector 39 in advance from the search target utterance 33 by using the hash functions 34 to 36.

ハッシュ関数３４～３６は、１つの問い合わせ発話または１つの検索対象発話に含まれる１つの単語集合から１つのハッシュ値を算出するMin-hash関数である。ハッシュ関数３４～３６はそれぞれ、異なる単語に対して異なる整数を対応付けた対応関係をもつ。異なるハッシュ関数は異なる対応関係をもっている。例えば、情報処理装置１００は、検索対象発話３２，３３に出現し得る複数の単語をランダムに整列し、単語の列に対して０，１，２，…と連続する非負整数を割り当てることで、１つのハッシュ関数を生成する。ただし、複数の単語に対して不連続な整数を割り当てるようにしてもよい。 Hash functions 34 to 36 are Min-hash functions that each calculate one hash value from the one word set contained in one inquiry utterance or one search target utterance. Each of the hash functions 34 to 36 has a correspondence that associates different integers with different words. Different hash functions have different correspondences. For example, the information processing apparatus 100 generates one hash function by randomly ordering the words that may appear in the search target utterances 32 and 33 and assigning consecutive non-negative integers 0, 1, 2, ... to the resulting word sequence. However, non-consecutive integers may also be assigned to the words.

例えば、ハッシュ関数３４は、「メジャー」に整数０、「戦国時代」に整数１、「野球」に整数２を割り当てている。ハッシュ関数３５は、「切腹」に整数０、「歴史」に整数１、「本塁打」に整数２を割り当てている。ハッシュ関数３６は、「ＮＹ」に整数０、「ＬＡ」に整数１、「関ヶ原」に整数２を割り当てている。ハッシュ関数３４～３６はそれぞれ、単語集合に含まれる２以上の単語に対応する２以上の整数の中から、最小の整数を選択し、選択した最小の整数をハッシュ値として出力する。 For example, the hash function 34 assigns an integer 0 to "major", an integer 1 to "Sengoku period", and an integer 2 to "baseball". The hash function 35 assigns an integer 0 to "seppuku", an integer 1 to "history", and an integer 2 to "home runs". The hash function 36 assigns an integer 0 to "NY", an integer 1 to "LA", and an integer 2 to "Sekigahara". Each of the hash functions 34 to 36 selects the smallest integer from two or more integers corresponding to two or more words included in the word set, and outputs the selected minimum integer as a hash value.

ベクトル３７は、問い合わせ発話３１の単語集合をハッシュ関数３４～３６に入力することで算出された３つのハッシュ値を列挙した整数ベクトルである。問い合わせ発話３１に対して、ハッシュ関数３４は「メジャー」に対応する整数０を出力し、ハッシュ関数３５は「本塁打」に対応する整数２を出力し、ハッシュ関数３６は「ＬＡ」に対応する整数１を出力する。よって、ベクトル３７は（０，２，１）となる。 The vector 37 is an integer vector enumerating the three hash values calculated by inputting the word set of the inquiry utterance 31 into the hash functions 34 to 36. For the inquiry utterance 31, the hash function 34 outputs the integer 0 corresponding to "major", the hash function 35 outputs the integer 2 corresponding to "home run", and the hash function 36 outputs the integer 1 corresponding to "LA". Therefore, the vector 37 is (0, 2, 1).

ベクトル３８は、検索対象発話３２の単語集合をハッシュ関数３４～３６に入力することで算出された３つのハッシュ値を列挙した整数ベクトルである。検索対象発話３２に対して、ハッシュ関数３４は「メジャー」に対応する整数０を出力し、ハッシュ関数３５は「本塁打」に対応する整数２を出力し、ハッシュ関数３６は「ＮＹ」に対応する整数０を出力する。よって、ベクトル３８は（０，２，０）となる。 The vector 38 is an integer vector enumerating the three hash values calculated by inputting the word set of the search target utterance 32 into the hash functions 34 to 36. For the search target utterance 32, the hash function 34 outputs the integer 0 corresponding to "major", the hash function 35 outputs the integer 2 corresponding to "home run", and the hash function 36 outputs the integer 0 corresponding to "NY". Therefore, the vector 38 is (0, 2, 0).

ベクトル３９は、検索対象発話３３の単語集合をハッシュ関数３４～３６に入力することで算出された３つのハッシュ値を列挙した整数ベクトルである。検索対象発話３３に対して、ハッシュ関数３４は「戦国時代」に対応する整数１を出力し、ハッシュ関数３５は「切腹」に対応する整数０を出力し、ハッシュ関数３６は「関ヶ原」に対応する整数２を出力する。よって、ベクトル３９は（１，０，２）となる。 The vector 39 is an integer vector enumerating the three hash values calculated by inputting the word set of the search target utterance 33 into the hash functions 34 to 36. For the search target utterance 33, the hash function 34 outputs the integer 1 corresponding to "Sengoku period", the hash function 35 outputs the integer 0 corresponding to "seppuku", and the hash function 36 outputs the integer 2 corresponding to "Sekigahara". Therefore, the vector 39 is (1, 0, 2).

情報処理装置１００は、ベクトル３７とベクトル３８を比較することで、問い合わせ発話３１と検索対象発話３２の類似度を評価することができる。類似度は、ベクトル３７，３８の次元のうち整数が一致する次元の個数で表現される。ここでは、整数が一致しているか否かが重要であり、整数が異なる場合は整数の近さは重要でない。ベクトル３７とベクトル３８の間では、３次元のうち２次元で整数が一致している。同様に、情報処理装置１００は、ベクトル３７とベクトル３９を比較することで、問い合わせ発話３１と検索対象発話３３の類似度を評価することができる。ベクトル３７とベクトル３９の間では、３次元のうち１つの次元でも整数が一致していない。よって、問い合わせ発話３１は、検索対象発話３３よりも検索対象発話３２に類似していると判定できる。 The information processing apparatus 100 can evaluate the similarity between the inquiry utterance 31 and the search target utterance 32 by comparing the vector 37 and the vector 38. The similarity is expressed by the number of dimensions in which the integers match among the dimensions of the vectors 37 and 38. Here, it is important whether or not the integers match, and if the integers are different, the closeness of the integers is not important. Between the vector 37 and the vector 38, the integers match in two of the three dimensions. Similarly, the information processing apparatus 100 can evaluate the similarity between the inquiry utterance 31 and the search target utterance 33 by comparing the vector 37 and the vector 39. Between the vector 37 and the vector 39, the integers do not match even in one of the three dimensions. Therefore, it can be determined that the inquiry utterance 31 is more similar to the search target utterance 32 than the search target utterance 33.
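
The vector generation and comparison described for FIG. 4 can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: only the integers 0, 1 and 2 of each hash function come from the description, every other assignment is an arbitrary assumption, the English word labels follow the translation, and the names `make_table`, `min_hash`, `signature` and `matching_dimensions` are hypothetical.

```python
# Vocabulary used in the FIG. 4 example (English labels from the translation).
VOCAB = ["Otani", "LA", "stolen base", "major", "ball", "home run",
         "Matsui", "NY", "strikeout", "tachi", "Sengoku period",
         "Sekigahara", "seppuku", "baseball", "history"]


def make_table(first_three):
    """Assign 0, 1, 2 to the given words and 3, 4, ... to the rest.

    Only the first three assignments are stated in the description;
    the remaining order is an assumption made for this sketch.
    """
    rest = [w for w in VOCAB if w not in first_three]
    return {w: i for i, w in enumerate(list(first_three) + rest)}


HASH_34 = make_table(["major", "Sengoku period", "baseball"])
HASH_35 = make_table(["seppuku", "history", "home run"])
HASH_36 = make_table(["NY", "LA", "Sekigahara"])
HASHES = [HASH_34, HASH_35, HASH_36]


def min_hash(table, words):
    """Min-hash: the smallest integer assigned to any word in the set."""
    return min(table[w] for w in words)


def signature(words):
    """Vector of hash values, one dimension per hash function."""
    return tuple(min_hash(t, words) for t in HASHES)


def matching_dimensions(u, v):
    """Similarity: number of dimensions whose integers are equal."""
    return sum(1 for a, b in zip(u, v) if a == b)


UTTERANCE_31 = {"Otani", "LA", "stolen base", "major", "ball", "home run"}
UTTERANCE_32 = {"Matsui", "NY", "strikeout", "major", "ball", "home run"}
UTTERANCE_33 = {"Otani", "strikeout", "tachi", "Sengoku period",
                "Sekigahara", "seppuku"}

VECTOR_37 = signature(UTTERANCE_31)  # (0, 2, 1)
VECTOR_38 = signature(UTTERANCE_32)  # (0, 2, 0)
VECTOR_39 = signature(UTTERANCE_33)  # (1, 0, 2)
```

With these tables, `matching_dimensions(VECTOR_37, VECTOR_38)` is 2 and `matching_dimensions(VECTOR_37, VECTOR_39)` is 0, which reproduces the comparison in the text. Only equality of the integers is counted; their numeric distance is ignored.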

任意のハッシュ関数が１つの問い合わせ発話と１つの検索対象発話に対して同一のハッシュ値を出力する確率は、問い合わせ発話の単語集合と検索対象発話の単語集合の間で共通する単語が出現する割合に一致する。共通する単語の割合をジャッカール係数と言うことがある。よって、異なるハッシュ関数が算出した複数のハッシュ値のうち一致するハッシュ値の割合を、ジャッカール係数の近似値として用いることが可能である。多数の検索対象発話に対応するベクトルを予め算出しておけば、受け付けた問い合わせ発話に類似する検索対象発話を高速に検索することが可能である。 The probability that any given hash function outputs the same hash value for one inquiry utterance and one search target utterance equals the proportion of words that the word set of the inquiry utterance and the word set of the search target utterance have in common. This proportion of common words is sometimes called the Jaccard coefficient. Therefore, the proportion of matching hash values among the hash values calculated by different hash functions can be used as an approximation of the Jaccard coefficient. If the vectors corresponding to a large number of search target utterances are calculated in advance, search target utterances similar to a received inquiry utterance can be found at high speed.

例えば、問い合わせ発話３１と検索対象発話３２の間では、重複を除去した９個の単語（９種類の単語）のうち３個の単語が共通しているため、ジャッカール係数は０．３３となる。また、問い合わせ発話３１と検索対象発話３３の間では、重複を除去した１１個の単語のうち１個の単語が共通しているため、ジャッカール係数は０．０９となる。よって、ベクトル間で一致するハッシュ値の割合はジャッカール係数に近似している。 For example, the inquiry utterance 31 and the search target utterance 32 share three of the nine distinct words that remain after duplicates are removed, so the Jaccard coefficient is 0.33. The inquiry utterance 31 and the search target utterance 33 share one of eleven distinct words, so the Jaccard coefficient is 0.09. The proportion of hash values that match between the vectors thus approximates the Jaccard coefficient.
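
The two coefficients above, and the way the match rate of many Min-hash functions approaches them, can be checked with a short sketch. The function names `jaccard` and `estimate_jaccard`, the seed, and the use of random word orderings as hash functions are assumptions made for illustration; the English word labels follow the translation.

```python
import random


def jaccard(a, b):
    """Exact Jaccard coefficient of two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def estimate_jaccard(a, b, k, seed=0):
    """Fraction of k random Min-hash functions whose values agree.

    Each hash function is a random ordering of the words; only the
    relative order matters, so ordering the union of a and b suffices.
    """
    rng = random.Random(seed)
    vocab = sorted(set(a) | set(b))
    agree = 0
    for _ in range(k):
        order = vocab[:]
        rng.shuffle(order)
        rank = {w: i for i, w in enumerate(order)}
        if min(rank[w] for w in a) == min(rank[w] for w in b):
            agree += 1
    return agree / k


UTTERANCE_31 = {"Otani", "LA", "stolen base", "major", "ball", "home run"}
UTTERANCE_32 = {"Matsui", "NY", "strikeout", "major", "ball", "home run"}
UTTERANCE_33 = {"Otani", "strikeout", "tachi", "Sengoku period",
                "Sekigahara", "seppuku"}

J_31_32 = jaccard(UTTERANCE_31, UTTERANCE_32)  # 3 / 9
J_31_33 = jaccard(UTTERANCE_31, UTTERANCE_33)  # 1 / 11
ESTIMATE_31_32 = estimate_jaccard(UTTERANCE_31, UTTERANCE_32, k=2000)
```

With only three hash functions, as in FIG. 4, the match rate (2/3 and 0/3) is a coarse estimate; the estimate tightens as the number of hash functions grows.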

情報処理装置１００は、多数の検索対象発話に対応する多数のベクトルの中から、問い合わせ発話に対応するベクトルと最も一致度が高い１つのベクトルまたは一致度が高い少数のベクトルを検索できればよい。そこで、情報処理装置１００は、二分探索木を用いた近傍探索によって１つまたは少数のベクトルを探索する。 The information processing apparatus 100 may search for one vector having the highest degree of matching with the vector corresponding to the inquiry utterance or a small number of vectors having a high degree of matching from among a large number of vectors corresponding to a large number of search target utterances. Therefore, the information processing apparatus 100 searches for one or a small number of vectors by neighborhood search using a binary search tree.

図５は、探索木の例を示す図である。 FIG. 5 is a diagram showing an example of a search tree.

情報処理装置１００は、発話テーブル１４１に登録された検索対象発話から生成されたベクトルに基づいて、予め探索木１４２を生成して保持しておく。探索木１４２は、木構造に接続された複数のノードを含む。葉ノード以外の各ノードには２つの子ノードが接続されている。葉ノード以外の各ノードは、ベクトルの中の特定の次元に対する閾値をもつ。入力されたベクトルの中の特定の次元のハッシュ値が閾値以上である場合は右子ノードに進み、特定の次元のハッシュ値が閾値未満である場合は左子ノードに進む。このようにして、探索木１４２をルートノードから葉ノードに向かって辿る。 The information processing apparatus 100 generates and holds the search tree 142 in advance based on the vectors generated from the search target utterances registered in the utterance table 141. The search tree 142 includes a plurality of nodes connected in a tree structure. Two child nodes are connected to each node other than a leaf node. Each node other than a leaf node has a threshold for a specific dimension of the vector. If the hash value of that dimension in the input vector is greater than or equal to the threshold, the search proceeds to the right child node; if it is less than the threshold, the search proceeds to the left child node. In this way, the search tree 142 is traversed from the root node toward a leaf node.

例えば、図５の例では、ルートノードは１次元目に対する閾値として１０をもつ。このため、１次元目のハッシュ値が１０以上であるベクトルが入力された場合はルートノードから右子ノードに進むことになり、１次元目のハッシュ値が１０未満であるベクトルが入力された場合はルートノードから左子ノードに進むことになる。 For example, in the example of FIG. 5, the root node has 10 as the threshold for the first dimension. Therefore, when a vector whose first-dimension hash value is 10 or more is input, the search proceeds from the root node to the right child node, and when a vector whose first-dimension hash value is less than 10 is input, the search proceeds from the root node to the left child node.

探索木１４２の葉ノードは、検索対象発話を指し示す。例えば、葉ノードは、検索対象発話のベクトルと当該検索対象発話を識別する発話ＩＤとを含む。ただし、葉ノードがベクトルを含まなくてもよい。２以上の検索対象発話のベクトルが互いに近似している場合、１つの葉ノードに２以上の検索対象発話が対応付けられることもある。ただし、検索対象発話を効率的に絞り込むため、１つの葉ノードは１つまたは少数の検索対象発話を指し示すことが好ましい。なお、探索木１４２の葉ノードはルートノードからの深さが同一でなくてもよく、探索に全ての次元のハッシュ値が使用されなくてもよい。 The leaf node of the search tree 142 points to the utterance to be searched. For example, the leaf node contains a vector of search target utterances and an utterance ID that identifies the search target utterance. However, the leaf node does not have to contain the vector. When the vectors of two or more search target utterances are close to each other, one leaf node may be associated with two or more search target utterances. However, in order to efficiently narrow down the search target utterances, it is preferable that one leaf node points to one or a small number of search target utterances. The leaf nodes of the search tree 142 do not have to have the same depth from the root node, and hash values of all dimensions may not be used for the search.
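
The root-to-leaf descent described above can be sketched as follows. Only the root threshold of 10 on the first dimension comes from the description of FIG. 5; the second-level node, the utterance IDs, and the names `Node`, `Leaf` and `descend` are hypothetical. The sketch covers only the descent, not how the tree is built or how several nearby leaves might be examined.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    utterance_ids: List[int]  # IDs of the search target utterances


@dataclass
class Node:
    dim: int        # 0-based index of the dimension that is tested
    threshold: int  # values >= threshold go right, others go left
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def descend(tree, vector):
    """Follow the thresholds from the root down to a leaf."""
    node = tree
    while isinstance(node, Node):
        node = node.right if vector[node.dim] >= node.threshold else node.left
    return node.utterance_ids


# Root: threshold 10 on the first dimension (from FIG. 5).
# Everything below the root is an assumption made for this sketch.
TREE = Node(
    dim=0, threshold=10,
    left=Node(dim=1, threshold=5, left=Leaf([1]), right=Leaf([2, 3])),
    right=Leaf([4]),
)
```

In this sketch the leaves sit at different depths and the third dimension is never tested, which the description explicitly permits.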

次に、問い合わせ発話と検索対象発話が、同一の単語を含まないものの関連する単語を含んでいる場合の問題について説明する。問い合わせ発話および検索対象発話は、話し言葉を用いた短文であるため、同一または類似の事象を多様な単語で表現することができ、問い合わせ発話と検索対象発話の間に同一の単語が出現する確率が全体的に低い。よって、単純に単語の共通度を評価する方法では、関連語を使用する問い合わせ発話と検索対象発話の間の類似度が適切に評価されないという問題がある。 Next, the problem that arises when the inquiry utterance and the search target utterance do not contain the same words but contain related words will be described. Because inquiry utterances and search target utterances are short sentences in spoken language, the same or a similar matter can be expressed with a variety of words, and the probability that the same word appears in both an inquiry utterance and a search target utterance is low overall. A method that simply evaluates the commonality of words therefore fails to properly evaluate the similarity between an inquiry utterance and a search target utterance that use related words.

図６は、ジャッカール係数の算出例を示す図である。 FIG. 6 is a diagram showing examples of calculating the Jaccard coefficient.

（Ａ）問い合わせ発話４１は、単語として「猫」および「ご飯」を含む。検索対象発話４２は、単語として「にゃんこ」および「えさ」を含む。「猫」と「にゃんこ」は類似する意味をもつ関連語であり、「ご飯」と「えさ」は類似する意味をもつ関連語である。しかし、単純に単語の同一性に基づきジャッカール係数を算出すると、問い合わせ発話４１と検索対象発話４２は同一の単語を含まないため、ジャッカール係数が０．００となる。このため、問い合わせ発話４１と検索対象発話４２は類似する内容を表している可能性が高いにもかかわらず、類似度が非常に低く判定されてしまう。 (A) The inquiry utterance 41 includes the words "cat" and "rice". The search target utterance 42 includes the words "nyanko" and "food". "Cat" and "nyanko" are related words with similar meanings, and "rice" and "food" are related words with similar meanings. However, if the Jaccard coefficient is calculated simply on the basis of word identity, it is 0.00, because the inquiry utterance 41 and the search target utterance 42 contain no identical word. As a result, the similarity is judged to be very low even though the inquiry utterance 41 and the search target utterance 42 most likely express similar content.

（Ｂ）問い合わせ発話４３は、単語として「猫」および「ご飯」を含む。検索対象発話４４は、単語として「にゃんこ」および「えさ」を含む。ここで、検索対象発話４４の「にゃんこ」は「猫」の関連語であるため、「にゃんこ」を「猫」と同一の単語であるとみなすとする。また、検索対象発話４４の「えさ」は「ご飯」の関連語であるため、「えさ」を「ご飯」と同一の単語であるとみなすとする。すると、問い合わせ発話４３と検索対象発話４４の間のジャッカール係数が１．００と算出される。しかし、問い合わせ発話４３と検索対象発話４４は異なる表現を使用しているため、類似度が過大評価されており、問い合わせ発話４３と同一の表現を使用する他の検索対象発話との区別が困難となる。 (B) The inquiry utterance 43 includes the words "cat" and "rice". The search target utterance 44 includes the words "nyanko" and "food". Suppose that "nyanko" in the search target utterance 44 is regarded as the same word as "cat" because it is a related word of "cat", and that "food" in the search target utterance 44 is regarded as the same word as "rice" because it is a related word of "rice". The Jaccard coefficient between the inquiry utterance 43 and the search target utterance 44 is then calculated as 1.00. However, because the inquiry utterance 43 and the search target utterance 44 use different expressions, the similarity is overestimated, and it becomes difficult to distinguish the search target utterance 44 from other search target utterances that use exactly the same expressions as the inquiry utterance 43.

そこで、第２の実施の形態では以下のようにして類似度を評価する。 Therefore, in the second embodiment, the similarity is evaluated as follows.

（Ｃ）問い合わせ発話４５は、単語として「猫」および「ご飯」を含む。検索対象発話４６は、単語として「にゃんこ」および「えさ」を含む。「猫」と「にゃんこ」が同一の関連語グループに属し、「ご飯」と「えさ」が同一の関連語グループに属する。 (C) The inquiry utterance 45 includes the words "cat" and "rice". The search target utterance 46 includes the words "nyanko" and "food". "Cat" and "nyanko" belong to the same related word group, and "rice" and "food" belong to the same related word group.

すると、「猫」が属する関連語グループに対応する仮想語ｓ_１を問い合わせ発話４５の単語集合に追加し、「ご飯」が属する関連語グループに対応する仮想語ｓ_２を問い合わせ発話４５の単語集合に追加する。仮想語は問い合わせ発話や検索対象発話に出現しない仮想的な単語であり、ダミー語やラベルと言うこともできる。「猫」および「ご飯」を残したまま仮想語ｓ_１，ｓ_２が追加される。また、「にゃんこ」が属する関連語グループに対応する仮想語ｓ_１を検索対象発話４６の単語集合に追加し、「えさ」が属する関連語グループに対応する仮想語ｓ_２を検索対象発話４６の単語集合に追加する。「にゃんこ」および「えさ」を残したまま仮想語ｓ_１，ｓ_２が追加される。 In this case, the virtual word s₁ corresponding to the related word group to which "cat" belongs is added to the word set of the inquiry utterance 45, and the virtual word s₂ corresponding to the related word group to which "rice" belongs is added to the word set of the inquiry utterance 45. A virtual word is a hypothetical word that does not appear in any inquiry utterance or search target utterance, and can also be called a dummy word or a label. The virtual words s₁ and s₂ are added while "cat" and "rice" are kept. Likewise, the virtual word s₁ corresponding to the related word group to which "nyanko" belongs is added to the word set of the search target utterance 46, and the virtual word s₂ corresponding to the related word group to which "food" belongs is added to the word set of the search target utterance 46. The virtual words s₁ and s₂ are added while "nyanko" and "food" are kept.

仮想語が追加された問い合わせ発話４５の単語集合と、仮想語が追加された検索対象発話４６の単語集合とを比較すると、６種類の単語のうち２種類の単語が共通するため、ジャッカール係数は０．３３と算出される。このように算出したジャッカール係数を、第２の実施の形態では拡張ジャッカール係数と言うことがある。拡張ジャッカール係数は、単語間の関連性（意味の類似性）を無視する場合のジャッカール係数よりも大きくなる。また、単語集合に元の単語を残したまま仮想語を追加するため、拡張ジャッカール係数は、関連語を同一視する場合のジャッカール係数よりも小さくなる。よって、異なる表現を使用する問い合わせ発話と検索対象発話の間の類似性を適切に評価することができる。 When the word set of the inquiry utterance 45 with the virtual words added is compared with the word set of the search target utterance 46 with the virtual words added, two of the six distinct words are common, so the Jaccard coefficient is calculated as 0.33. In the second embodiment, the Jaccard coefficient calculated in this way is sometimes called the extended Jaccard coefficient. The extended Jaccard coefficient is larger than the Jaccard coefficient obtained when the relatedness (semantic similarity) between words is ignored. In addition, because the virtual words are added while the original words remain in the word set, the extended Jaccard coefficient is smaller than the Jaccard coefficient obtained when related words are treated as identical. The similarity between an inquiry utterance and a search target utterance that use different expressions can therefore be evaluated appropriately.
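
The three evaluations (A), (B) and (C) of FIG. 6 can be reproduced with a short sketch. The dictionary `RELATED_GROUP` and the function names are hypothetical; the words and the virtual words s_1 and s_2 are those of the example.

```python
# Related word groups of the example: word -> virtual word of its group.
RELATED_GROUP = {"猫": "s_1", "にゃんこ": "s_1", "ご飯": "s_2", "えさ": "s_2"}


def jaccard(a, b):
    """Jaccard coefficient of two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def identify_related(words):
    """(B) Replace each word by its group label, treating related words as identical."""
    return {RELATED_GROUP.get(w, w) for w in words}


def add_virtual_words(words):
    """(C) Keep the original words and add the virtual word of each group."""
    return set(words) | {RELATED_GROUP[w] for w in words if w in RELATED_GROUP}


QUERY = {"猫", "ご飯"}
TARGET = {"にゃんこ", "えさ"}

J_A = jaccard(QUERY, TARGET)                                        # 0 / 4
J_B = jaccard(identify_related(QUERY), identify_related(TARGET))    # 2 / 2
J_C = jaccard(add_virtual_words(QUERY), add_virtual_words(TARGET))  # 2 / 6
```

The extended value `J_C` lies strictly between `J_A` and `J_B`, which is the property the second embodiment relies on.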

情報処理装置１００は、仮想語も独立した単語として取り扱って複数のハッシュ関数を生成しておく。情報処理装置１００は、発話テーブル１４１に登録された検索対象発話から、仮想語を追加した単語集合を生成してハッシュ関数に入力し、仮想語の影響を反映したベクトルを算出する。情報処理装置１００は、仮想語の影響を反映したベクトルから探索木１４２を生成する。問い合わせ発話が入力されると、情報処理装置１００は、問い合わせ発話から、仮想語を追加した単語集合を生成してハッシュ関数に入力し、仮想語の影響を反映したベクトルを算出する。情報処理装置１００は、探索木１４２を用いて、問い合わせ発話のベクトルに近似する検索対象発話のベクトルを探す。このように、情報処理装置１００は、ハッシュ関数と単語集合を拡張することで、関連語を考慮しない探索方法を流用して高速に類似の検索対象発話を検索することができる。 The information processing apparatus 100 treats a virtual word as an independent word and generates a plurality of hash functions. The information processing apparatus 100 generates a word set to which a virtual word is added from the search target utterance registered in the utterance table 141, inputs the word set to the hash function, and calculates a vector reflecting the influence of the virtual word. The information processing apparatus 100 generates a search tree 142 from a vector that reflects the influence of a virtual word. When the inquiry utterance is input, the information processing apparatus 100 generates a word set to which a virtual word is added from the inquiry utterance, inputs the word set to the hash function, and calculates a vector reflecting the influence of the virtual word. The information processing apparatus 100 uses the search tree 142 to search for a vector of search target utterances that approximates the vector of inquiry utterances. In this way, the information processing apparatus 100 can search for similar search target utterances at high speed by diverting a search method that does not consider related words by extending the hash function and the word set.

図７は、類似度の閾値と検索対象発話のヒット数との関係例を示すグラフである。 FIG. 7 is a graph showing an example of the relationship between the similarity threshold and the number of search target utterances that are hit.

グラフ５０は、問い合わせ発話と検索対象発話の類似度の閾値と、類似度が閾値より大きい検索対象発話の数（ヒット数）との間の関係を示す。類似度は、問い合わせ発話のベクトルと検索対象発話のベクトルの間でハッシュ値が一致する次元の数である。 The graph 50 shows the relationship between the threshold for the similarity between the inquiry utterance and a search target utterance and the number of search target utterances whose similarity exceeds the threshold (the number of hits). The similarity is the number of dimensions in which the hash values match between the vector of the inquiry utterance and the vector of the search target utterance.

（Ａ）関連語を考慮せずにベクトルを算出する方法では、問い合わせ発話と各検索対象発話の類似度が全体として低く算出される。よって、類似度の閾値とヒット数との間の関係は曲線５１のようになる。すなわち、類似度の閾値を低く設定してもヒット数が少なくなり、類似する検索対象発話の検索漏れが多くなる。 (A) In the method of calculating the vector without considering the related words, the similarity between the inquiry utterance and each search target utterance is calculated to be low as a whole. Therefore, the relationship between the threshold of similarity and the number of hits is as shown in the curve 51. That is, even if the threshold value of the similarity is set low, the number of hits is small, and the search omission of similar search target utterances is large.

（Ｂ）関連語を同一視してベクトルを算出する方法では、問い合わせ発話と各検索対象発話の類似度が全体として高く算出される。よって、類似度の閾値とヒット数との間の関係は曲線５２のようになる。すなわち、類似度の閾値を高く設定してもヒット数が多くなり、類似する検索対象発話を効率的に絞り込むことが難しい。 (B) In the method of equating related words and calculating the vector, the similarity between the inquiry utterance and each search target utterance is calculated to be high as a whole. Therefore, the relationship between the threshold of similarity and the number of hits is as shown in the curve 52. That is, even if the threshold value of the similarity is set high, the number of hits increases, and it is difficult to efficiently narrow down similar search target utterances.

（Ｃ）元の単語を残しつつ関連語グループを示す仮想語を追加してベクトルを算出する方法では、問い合わせ発話と各検索対象発話の類似度が上記２つの方法の中間の値をとる。よって、類似度の閾値とヒット数との間の関係は曲線５３のようになる。仮想語が追加されることで、一致するハッシュ値の個数が若干増える。一方、元の単語が残っているため、一致するハッシュ値の個数が増え過ぎることを抑制できる。その結果、問い合わせ発話に類似する検索対象発話を効率的に絞り込むことができる。 (C) In the method of calculating the vector by adding a virtual word indicating a related word group while keeping the original word, the similarity between the inquiry utterance and each search target utterance takes an intermediate value between the above two methods. Therefore, the relationship between the threshold of similarity and the number of hits is as shown in curve 53. The addition of virtual words slightly increases the number of matching hash values. On the other hand, since the original word remains, it is possible to prevent the number of matching hash values from increasing too much. As a result, search target utterances similar to inquiry utterances can be efficiently narrowed down.

関連語グループは予め定義されている。 Related word groups are defined in advance.

図８は、関連語テーブルの例を示す図である。 FIG. 8 is a diagram showing an example of a related word table.

情報処理装置１００は、関連語テーブル１４３を保持している。関連語テーブル１４３は、関連語辞書と言うこともできる。関連語テーブル１４３は、グループＩＤおよび関連語の項目を含む。グループＩＤは、関連語グループを識別する識別子である。１つ関連語グループに対して１つの仮想語が割り当てられる。関連語の項目には、同一の関連語グループに属する２以上の単語が列挙される。同一の関連語グループに属する単語は、類似する意味として使用されることがある単語である。 The information processing apparatus 100 holds a related word table 143. The related word table 143 can also be called a related word dictionary. The related word table 143 includes the items group ID and related words. The group ID is an identifier that identifies a related word group. One virtual word is assigned to each related word group. The related words item lists two or more words that belong to the same related word group. Words that belong to the same related word group are words that may be used with similar meanings.

「猫」および「にゃんこ」は同一の関連語グループに属する。そこで、例えば、情報処理装置１００は、「猫」および「にゃんこ」に同一の仮想語ｓ_１を割り当てる。また、「ご飯」および「えさ」は同一の関連語グループに属する。そこで、例えば、情報処理装置１００は、「ご飯」および「えさ」に同一の仮想語ｓ_２を割り当てる。 "Cat" and "nyanko" belong to the same related word group. Therefore, for example, the information processing apparatus 100 assigns the same virtual word s₁ to "cat" and "nyanko". Likewise, "rice" and "food" belong to the same related word group. Therefore, for example, the information processing apparatus 100 assigns the same virtual word s₂ to "rice" and "food".

なお、関連語テーブル１４３に登録されていない単語は、何れの関連語グループにも属さず、関連する他の単語をもたない。後述するように、第２の実施の形態では、関連語が存在しない単語に対しても一意な仮想語を追加してハッシュ値を算出する。これは、関連語が存在しない単語集合からは同一のジャッカール係数と拡張ジャッカール係数とが算出されるようにし、ジャッカール係数と拡張ジャッカール係数の整合性を維持するためである。ただし、関連語が存在しない単語に対して仮想語を割り当てないことも可能である。その場合であっても、拡張ジャッカール係数は、関連語を考慮しない場合のジャッカール係数以上の値をとり、関連語を同一視する場合のジャッカール係数以下の値をとる。 A word that is not registered in the related word table 143 belongs to no related word group and has no related words. As described later, in the second embodiment a unique virtual word is added even for a word that has no related words when the hash values are calculated. This ensures that word sets containing no related words yield the same value for the Jaccard coefficient and the extended Jaccard coefficient, which keeps the two coefficients consistent. However, it is also possible not to assign virtual words to words that have no related words. Even in that case, the extended Jaccard coefficient is greater than or equal to the Jaccard coefficient obtained when related words are not considered, and less than or equal to the Jaccard coefficient obtained when related words are treated as identical.
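
The effect of giving every unregistered word its own virtual word can be checked with a small sketch. The table contents follow the example of FIG. 8; the label format `s_unique:<word>` for words outside the table, the function names, and the sample word sets are assumptions made for illustration.

```python
from fractions import Fraction

# Related word table: group ID -> words of the group (example of FIG. 8).
RELATED_WORD_TABLE = {1: ["猫", "にゃんこ"], 2: ["ご飯", "えさ"]}


def virtual_word(word):
    """Virtual word of the group, or a unique one for unregistered words.

    The label format for unregistered words is hypothetical.
    """
    for group_id, words in RELATED_WORD_TABLE.items():
        if word in words:
            return f"s_{group_id}"
    return f"s_unique:{word}"


def expand(words):
    """Add the virtual word of every word while keeping the word itself."""
    return set(words) | {virtual_word(w) for w in words}


def jaccard(a, b):
    """Exact Jaccard coefficient as a fraction."""
    a, b = set(a), set(b)
    return Fraction(len(a & b), len(a | b))


# Word sets without related words: both coefficients coincide.
A = {"大谷", "メジャー", "本塁打"}
B = {"松井", "メジャー", "本塁打"}
PLAIN = jaccard(A, B)                     # 2 / 4
EXTENDED = jaccard(expand(A), expand(B))  # 4 / 8

# Word sets with related words: the extended coefficient is larger.
EXTENDED_RELATED = jaccard(expand({"猫", "ご飯"}), expand({"にゃんこ", "えさ"}))
```

When no word has a related word, expansion doubles both the intersection and the union, so the ratio is unchanged. This is why unique virtual words keep the two coefficients consistent.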

ここで、ジャッカール係数やハッシュ関数の数学的側面について説明する。 Here, the mathematical aspects of the Jaccard coefficient and the hash functions will be described.

第２の実施の形態で使用するハッシュ関数はMin-hash関数であり、単語集合を整数に変換する写像である。複数のハッシュ関数によって近似するベクトルが生成される問い合わせ発話と検索対象発話は、共通する単語の出現割合が高い発話である。一方の発話（例えば、１つの検索対象発話）の単語集合をＡ、他方の発話（例えば、問い合わせ発話）の単語集合をＢとすると、ジャッカール係数は数式（１）のように定義される。単語集合Ａ，Ｂは出現する単語の種類を示し、同じ単語が２回以上出現しても重複カウントしない。 The hash functions used in the second embodiment are Min-hash functions, each of which is a mapping that converts a word set into an integer. An inquiry utterance and a search target utterance for which the hash functions generate similar vectors are utterances that share a high proportion of words. Letting A be the word set of one utterance (for example, one search target utterance) and B be the word set of the other utterance (for example, the inquiry utterance), the Jaccard coefficient is defined as in formula (1). The word sets A and B represent the distinct words that appear; a word that appears two or more times is counted only once.
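
Formula (1) itself is not reproduced in this text. Under the definition given above, it corresponds to the standard Jaccard coefficient, which can be written as follows (the notation is a reconstruction, not a copy of the original formula image):

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1}
```

For the inquiry utterance 31 and the search target utterance 32, this gives $|A \cap B| = 3$ and $|A \cup B| = 9$, hence $J(A, B) = 3/9 \approx 0.33$.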

１つのハッシュ関数のハッシュ値が２つのベクトルの間で一致している場合、当該ハッシュ値に対応する単語が両方の発話に出現している。一致していない場合、小さい方のハッシュ値に対応する単語が一方の発話にのみ出現している。前述のように、１つのハッシュ関数のハッシュ値が一致する確率はジャッカール係数に等しいため、複数のハッシュ関数についての一致割合によってジャッカール係数を近似できる。単語集合Ａ，Ｂに含まれる単語の数に依存しない所定回数のハッシュ値計算によって近似値が得られる。 When the hash value of one hash function matches between two vectors, the word corresponding to that hash value appears in both utterances. When the values do not match, the word corresponding to the smaller hash value appears in only one of the utterances. As described above, the probability that the hash values of one hash function match is equal to the Jaccard coefficient, so the Jaccard coefficient can be approximated by the proportion of matches over a plurality of hash functions. The approximation is obtained with a fixed number of hash value calculations that does not depend on the number of words in the word sets A and B.

ｎを検索対象発話の数、ｍを発話１つ当たりの平均の単語数、ｋをハッシュ関数の数とする。すると、予め検索対象発話のベクトルを計算する計算量はＯ（ｋｍｎ）となる。１つの問い合わせ発話のベクトルを計算する計算量はＯ（ｋｍ）となり、このベクトルを用いた近傍探索の計算コストはＯ（ｌｏｇ（ｎ））となる。ハッシュ関数は予めランダムにｋ個生成される。ｋをＯ（ｌｏｇ（ｎ））とすると、異なる検索対象発話から同一のハッシュ値が算出される衝突確率を十分に小さくすることができる。 Let n be the number of utterances to be searched, m be the average number of words per utterance, and k be the number of hash functions. Then, the amount of calculation for calculating the vector of the utterance to be searched in advance is O (kmn). The amount of calculation for calculating the vector of one inquiry utterance is O (km), and the calculation cost of the neighborhood search using this vector is O (log (n)). K hash functions are randomly generated in advance. When k is O (log (n)), the collision probability in which the same hash value is calculated from different search target utterances can be sufficiently reduced.

これに対して、ジャッカール係数とハッシュ関数を拡張する。単語集合Ａ，Ｂの中に関連語が存在しない場合、拡張ジャッカール係数はジャッカール係数に等しい。関連語が存在する場合、拡張ジャッカール係数は、関連語を異なる単語とみなした際のジャッカール係数よりも大きく、関連語を同一単語とみなした際のジャッカール係数よりも小さい。ハッシュ値の一致確率が拡張ジャッカール係数に等しくなるように、単語集合Ａ，Ｂに仮想語を追加すると共に、仮想語に対しても整数を割り当てるようハッシュ関数を拡張する。 The Jaccard coefficient and the hash functions are extended as follows. When the word sets A and B contain no related words, the extended Jaccard coefficient is equal to the Jaccard coefficient. When related words are present, the extended Jaccard coefficient is larger than the Jaccard coefficient obtained when related words are regarded as different words and smaller than the Jaccard coefficient obtained when related words are regarded as the same word. So that the probability of matching hash values equals the extended Jaccard coefficient, virtual words are added to the word sets A and B, and the hash functions are extended so that integers are also assigned to the virtual words.

検索対象発話に出現し得る単語の集合をＷとし、単語集合Ｗに含まれない仮想語の集合をＷ’とする。単語集合Ｗに属する任意の単語ａに対して、仮想語集合Ｗ’に属する一意な仮想語ｓ_ｉが割り当てられる。同一の関連語グループに属する単語に対しては同一の仮想語が割り当てられる。何れの関連語グループにも属さない単語に対しては、他の単語に対して使用されない仮想語が割り当てられる。数式（２）に示す関数ｆはＷからＷ×Ｗ’への写像であり、１つの単語から当該単語と仮想語の組を返す。 Let W be the set of words that can appear in search target utterances, and let W' be a set of virtual words that are not contained in the word set W. To any word a belonging to the word set W, exactly one virtual word s_i belonging to the virtual word set W' is assigned. Words belonging to the same related word group are assigned the same virtual word. A word that belongs to no related word group is assigned a virtual word that is not used for any other word. The function f shown in formula (2) is a mapping from W to W × W' and returns, for one word, the pair consisting of that word and its virtual word.

拡張ジャッカール係数Ｊ’は、ジャッカール係数Ｊの定義および関数ｆを用いて数式（３）のように定義される。関数ｆによって、単語集合Ａに含まれる各単語に対応する仮想語が単語集合Ａに挿入され、単語集合Ｂに含まれる各単語に対応する仮想語が単語集合Ｂに挿入される。そして、仮想語を挿入した単語集合に対してジャッカール係数Ｊを算出したものが拡張ジャッカール係数Ｊ’となる。元の単語集合Ａ，Ｂに関連語が含まれない場合、Ｊ’（Ａ，Ｂ）＝Ｊ（Ａ，Ｂ）である。 The extended Jaccard coefficient J' is defined as in formula (3) using the definition of the Jaccard coefficient J and the function f. By the function f, the virtual word corresponding to each word in the word set A is inserted into A, and the virtual word corresponding to each word in the word set B is inserted into B. The extended Jaccard coefficient J' is the Jaccard coefficient J calculated for the word sets into which the virtual words have been inserted. If the original word sets A and B contain no related words, then J'(A, B) = J(A, B).
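
Formulas (2) and (3) are not reproduced in this text. A reconstruction that is consistent with the description above could read as follows; the index notation $i(a)$ and the set $\hat{f}(A)$ are introduced here only for illustration and are not taken from the original formulas.

```latex
f : W \to W \times W', \qquad f(a) = (a,\, s_{i(a)}) \tag{2}

\hat{f}(A) = \bigcup_{a \in A} \{\, a,\; s_{i(a)} \,\}, \qquad
J'(A, B) = J\bigl(\hat{f}(A),\, \hat{f}(B)\bigr)
         = \frac{|\hat{f}(A) \cap \hat{f}(B)|}{|\hat{f}(A) \cup \hat{f}(B)|} \tag{3}
```

For $A = \{\text{猫}, \text{ご飯}\}$ and $B = \{\text{にゃんこ}, \text{えさ}\}$, the expanded sets share $s_1$ and $s_2$ out of six distinct elements, so $J'(A, B) = 2/6 \approx 0.33$.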

ハッシュ関数を拡張するにあたり、ハッシュ関数毎に数式（４）のように関数ｈ’を定義する。関数ｈは、単語集合Ｗに属する単語ａを整数ｎに変換するものであり、拡張前のハッシュ関数で使用される。整数の集合をＮとすると、関数ｈはＷからＮへの写像である。関数ｈ’は、単語集合Ｗに属する単語ａまたは仮想語集合Ｗ’に属する仮想語ａを整数ｎに変換するものであり、拡張後のハッシュ関数で使用される。ｎを整数集合Ｎの中の任意の整数とし、Ｎ’＝｛ｎ＋ｍａｘＮ＋１｝とすると、関数ｈ’はＷ∪Ｗ’からＮ∪Ｎ’への写像である。整数集合Ｎ’は、関数ｈの値域である整数集合Ｎを２倍に拡張したときの拡張部分を示している。これは、仮想語集合Ｗ’のサイズは単語集合Ｗのサイズ以下であるためである。関数ｅは、仮想語集合Ｗ’に属する仮想語ａを単語集合Ｗに属する単語に変換するものであり、Ｗ’からＷへの単射である。関数ｇは、単語または仮想語と整数との対応関係を入れ替えるものであり、Ｎ∪Ｎ’からＮ∪Ｎ’への全単射である。 To extend the hash functions, a function h' is defined for each hash function as in equation (4). The function h converts a word a in the word set W into an integer n and is used by the hash function before extension. With N denoting the set of integers, h is a mapping from W to N. The function h' converts a word a in W, or a virtual word a in the virtual word set W', into an integer n and is used by the extended hash function. With n ranging over the integers in N and N' = {n + max N + 1}, h' is a mapping from W ∪ W' to N ∪ N'. The integer set N' is the added portion obtained when the integer set N, the range of h, is doubled in size. Doubling suffices because the size of W' is at most the size of W. The function e converts a virtual word a in W' into a word in W and is an injection from W' to W. The function g permutes the correspondence between words or virtual words and integers and is a bijection from N ∪ N' to N ∪ N'.

すなわち、情報処理装置１００は、ハッシュ関数を拡張するにあたり、既存の単語に割り当てられた整数よりも大きい整数を仮想語に割り当てる。仮想語に割り当てる整数は関数ｅ，ｈを用いて決定される。ある仮想語ａから関数ｅによって一意な既存の単語が選択され、関数ｈによってその単語に対応付けられた整数ｈ（ｅ（ａ））に、関数ｈの値域の最大値ｍａｘＮと１を加えたものが、仮想語ａに割り当てられる。そして、情報処理装置１００は、既存の単語と整数との対応関係および仮想語と整数との対応関係を、関数ｇによってシャッフルして新たな対応関係を生成する。 That is, in extending a hash function, the information processing apparatus 100 first assigns to each virtual word an integer larger than the integers assigned to the existing words. The integer assigned to a virtual word is determined using the functions e and h. The function e selects one existing word for a virtual word a, and the integer h(e(a)) that h associates with that word, plus the maximum value max N of the range of h, plus 1, is assigned to the virtual word a. The information processing apparatus 100 then shuffles both the word-to-integer correspondence and the virtual-word-to-integer correspondence with the function g to generate a new correspondence.

なお、関数ｅは、入力された仮想語から、当該仮想語が示す関連語グループに属する１つの単語を選択するものであってもよいし、単語集合Ｗに属する単語をランダムに選択するものであってもよい。また、割り当てる整数を入れ替える関数ｇは、ランダムに生成してもよい。また、関数ｅや関数ｇは、複数のハッシュ関数の間で共通に使用してもよいし、ハッシュ関数毎にランダムに生成するようにしてもよい。 The function e may select, for an input virtual word, one word belonging to the related word group that the virtual word represents, or it may select a word from the word set W at random. The function g, which permutes the assigned integers, may be generated randomly. The functions e and g may be shared among the hash functions, or they may be generated randomly for each hash function.

上記の関数ｈ’から、数式（５）のように拡張後のハッシュ関数Ｈ’が定義される。ハッシュ関数Ｈ’は、ｆ（Ａ）に含まれる単語および仮想語それぞれに対応する整数を関数ｈ’によって算出し、その中から最小の整数を選択するものである。なお、関数ｈ’の定義域のサイズは関数ｈの定義域のサイズの高々２倍である。関数ｈ’の値域を１桁大きくするか、または、ハッシュ関数の数ｋを１つ増やすことで、ハッシュ関数Ｈ’のもとでハッシュ値が衝突する確率を元のハッシュ関数と同等に維持することができる。 From the function h' above, the extended hash function H' is defined as in equation (5). The hash function H' uses h' to compute the integer corresponding to each word and each virtual word contained in f(A) and selects the smallest of these integers. The domain of h' is at most twice the size of the domain of h. By making the range of h' one digit larger, or by increasing the number k of hash functions by one, the probability that hash values collide under H' can be kept comparable to that of the original hash functions.
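Equations (2) to (5) are not reproduced in this text. Read from the surrounding prose and from steps S19 and S22, they can plausibly be written as follows. This is a reconstruction under that assumption, not the published formulas; in particular, f(A) is taken here to mean the word set A together with the virtual words of its members.

```latex
\begin{align}
f(a) &= (a,\ s_i), \qquad a \in W,\ s_i \in W' \tag{2} \\
J'(A, B) &= J\bigl(f(A),\, f(B)\bigr)
          = \frac{\lvert f(A) \cap f(B) \rvert}{\lvert f(A) \cup f(B) \rvert} \tag{3} \\
h'(a) &=
  \begin{cases}
    g\bigl(h(a)\bigr) & a \in W \\
    g\bigl(h(e(a)) + \max N + 1\bigr) & a \in W'
  \end{cases} \tag{4} \\
H'(A) &= \min_{x \in f(A)} h'(x) \tag{5}
\end{align}
```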

次に、仮想語を考慮したベクトルの算出例を説明する。 Next, an example of calculating vectors with virtual words taken into account is described.
図９は、単語集合からベクトルを算出する例を示す図である。 FIG. 9 is a diagram showing an example of calculating vectors from word sets.
表６０は、単語集合Ａに対応するベクトルと単語集合Ｂに対応するベクトルを算出して比較する過程を示している。単語集合Ａは「猫」と「ご飯」を含み、更に仮想語「ｓ_１」と「ｓ_２」が挿入されている。単語集合Ｂは「にゃんこ」と「えさ」を含み、更に仮想語「ｓ_１」と「ｓ_２」が挿入されている。また、単語および仮想語と整数との対応関係が異なる１２個のハッシュ関数が定義されており、１２個のハッシュ値を列挙したものが、単語集合Ａに対応するベクトルおよび単語集合Ｂに対応するベクトルになる。 Table 60 shows the process of calculating and comparing the vector for word set A and the vector for word set B. Word set A contains "cat" and "rice", and the virtual words "s_1" and "s_2" have been inserted into it. Word set B contains "nyanko" and "food", and the virtual words "s_1" and "s_2" have likewise been inserted. Twelve hash functions are defined, each with a different correspondence between words or virtual words and integers, and the list of the twelve hash values forms the vector for word set A and the vector for word set B, respectively.

ハッシュ関数Ｈ’_１で使用される関数ｈ’_１は、「猫」に０、「ご飯」に１、「にゃんこ」に２、「えさ」に３、「ｓ_１」に４、「ｓ_２」に５を割り当てている。よって、単語集合Ａに対するハッシュ値は「猫」に対応する０になり、単語集合Ｂに対するハッシュ値は「にゃんこ」に対応する２になる。また、ハッシュ関数Ｈ’_２で使用される関数ｈ’_２は、「猫」に１、「ご飯」に０、「にゃんこ」に５、「えさ」に４、「ｓ_１」に３、「ｓ_２」に２を割り当てている。よって、単語集合Ａに対するハッシュ値は「ご飯」に対応する０になり、単語集合Ｂに対するハッシュ値は「ｓ_２」に対応する２になる。 The function h'_1 used by the hash function H'_1 assigns 0 to "cat", 1 to "rice", 2 to "nyanko", 3 to "food", 4 to "s_1", and 5 to "s_2". The hash value for word set A is therefore 0, corresponding to "cat", and the hash value for word set B is 2, corresponding to "nyanko". The function h'_2 used by the hash function H'_2 assigns 1 to "cat", 0 to "rice", 5 to "nyanko", 4 to "food", 3 to "s_1", and 2 to "s_2". The hash value for word set A is therefore 0, corresponding to "rice", and the hash value for word set B is 2, corresponding to "s_2".

また、ハッシュ関数Ｈ’_５で使用される関数ｈ’_５は、「猫」に２、「ご飯」に３、「にゃんこ」に４、「えさ」に５、「ｓ_１」に０、「ｓ_２」に１を割り当てている。よって、単語集合Ａに対するハッシュ値は「ｓ_１」に対応する０になり、単語集合Ｂに対するハッシュ値は「ｓ_１」に対応する０になる。このようにして、単語集合Ａからはベクトル（０，０，２，２，０，０，０，１，１，０，０，０）が算出される。単語集合Ｂからはベクトル（２，２，０，０，０，０，１，０，０，０，０，１）が算出される。 The function h'_5 used by the hash function H'_5 assigns 2 to "cat", 3 to "rice", 4 to "nyanko", 5 to "food", 0 to "s_1", and 1 to "s_2". The hash value for word set A is therefore 0, corresponding to "s_1", and the hash value for word set B is also 0, corresponding to "s_1". In this way, the vector (0, 0, 2, 2, 0, 0, 0, 1, 1, 0, 0, 0) is calculated from word set A, and the vector (2, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1) is calculated from word set B.

元の単語集合Ａ，Ｂの間には共通する単語が存在しないものの、１２個のハッシュ関数のうちハッシュ関数Ｈ’_５，Ｈ’_６，Ｈ’_１０，Ｈ’_１１のハッシュ値が一致しており、ハッシュ値の一致割合が０．３３になっている。仮に、仮想語を使用しない場合はハッシュ値の一致割合が０．００になる。よって、発話の類似性を適切に評価することができる。 Although the original word sets A and B have no word in common, the hash values of four of the twelve hash functions, H'_5, H'_6, H'_10, and H'_11, match, so the proportion of matching hash values is 0.33. If virtual words were not used, the proportion of matching hash values would be 0.00. The similarity of the utterances can therefore be evaluated appropriately.
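The FIG. 9 example can be checked with a short Python sketch. The integer assignments for the first, second, and fifth functions and the two twelve-element vectors are copied from the description above; the variable and function names are chosen here for illustration.

```python
# Integer assignments quoted in the text for three of the twelve functions.
h1 = {"猫": 0, "ご飯": 1, "にゃんこ": 2, "えさ": 3, "s_1": 4, "s_2": 5}
h2 = {"猫": 1, "ご飯": 0, "にゃんこ": 5, "えさ": 4, "s_1": 3, "s_2": 2}
h5 = {"猫": 2, "ご飯": 3, "にゃんこ": 4, "えさ": 5, "s_1": 0, "s_2": 1}

# Word sets after the virtual words have been inserted.
A = {"猫", "ご飯", "s_1", "s_2"}
B = {"にゃんこ", "えさ", "s_1", "s_2"}

def min_hash(h, words):
    """Extended hash function: smallest integer assigned to any element."""
    return min(h[w] for w in words)

# Full vectors as listed in the text (one hash value per function).
vec_a = (0, 0, 2, 2, 0, 0, 0, 1, 1, 0, 0, 0)
vec_b = (2, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1)

def match_ratio(u, v):
    """Fraction of dimensions in which the two vectors agree."""
    return sum(x == y for x, y in zip(u, v)) / len(u)

# 1-based indices of the hash functions whose values agree.
matching = [i + 1 for i, (x, y) in enumerate(zip(vec_a, vec_b)) if x == y]

print(min_hash(h1, A), min_hash(h1, B))   # first components of the vectors
print(matching, round(match_ratio(vec_a, vec_b), 2))
```

The matching dimensions are 5, 6, 10, and 11, which gives 4/12, or about 0.33. This equals the extended Jaccard coefficient of the two word sets, since the expanded sets share 2 of 6 elements.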

図１０は、問い合わせ発話に対応する返答発話の選択例を示す図である。 FIG. 10 is a diagram showing an example of selecting a response utterance corresponding to an inquiry utterance.
前述のように、情報処理装置１００は、蓄積された検索対象発話の中から問い合わせ発話に最も類似する検索対象発話を選択し、選択した検索対象発話に対応する返答発話を出力する。一例として、検索対象発話７１～７３が蓄積されている。 As described above, the information processing apparatus 100 selects, from the accumulated search target utterances, the search target utterance most similar to the inquiry utterance, and outputs the response utterance corresponding to the selected search target utterance. As an example, search target utterances 71 to 73 are accumulated.

検索対象発話７１は「メールを調べて」というメッセージであり、検索対象発話７２は「スケジュールの調整」というメッセージであり、検索対象発話７３は「打ち合わせを調整したい」というメッセージである。情報処理装置１００は予め、検索対象発話７１からベクトル７４を生成し、検索対象発話７２からベクトル７５を生成し、検索対象発話７３からベクトル７６を生成しておく。ベクトル７４は（０，０，１，１，２，２，…）という整数の列であり、ベクトル７５は（２，２，０，０，１，１，…）という整数の列であり、ベクトル７６は（１，２，２，２，１，１，…）という整数の列である。 The search target utterance 71 is the message "check the mail", the search target utterance 72 is the message "schedule adjustment", and the search target utterance 73 is the message "I want to arrange a meeting" (using the word 打ち合わせ). The information processing apparatus 100 generates in advance a vector 74 from the search target utterance 71, a vector 75 from the search target utterance 72, and a vector 76 from the search target utterance 73. The vector 74 is the integer sequence (0, 0, 1, 1, 2, 2, ...), the vector 75 is the integer sequence (2, 2, 0, 0, 1, 1, ...), and the vector 76 is the integer sequence (1, 2, 2, 2, 1, 1, ...).

情報処理装置１００は、問い合わせ発話７７を受け付ける。問い合わせ発話７７は、「会議を調整したい」というメッセージである。情報処理装置１００は、問い合わせ発話７７からベクトル７８を生成する。ベクトル７８は、（１，１，２，２，１，１，…）という整数の列である。ベクトル７４～７６のうちベクトル７８に最も近似するものはベクトル７６である。そこで、情報処理装置１００は、検索対象発話７３を選択し、検索対象発話７３に対応する返答発話７９を出力する。返答発話７９は、問い合わせ発話７７に対する応答であり、「スケジュールを調整します」というメッセージである。「会議」と「打ち合わせ」が類似する意味をもつ関連語であることが予め登録されていれば、情報処理装置１００は、検索対象発話７３が問い合わせ発話７７に類似すると判定できる。 The information processing apparatus 100 receives an inquiry utterance 77. The inquiry utterance 77 is the message "I want to arrange a meeting" (using the word 会議). The information processing apparatus 100 generates a vector 78 from the inquiry utterance 77. The vector 78 is the integer sequence (1, 1, 2, 2, 1, 1, ...). Of the vectors 74 to 76, the one closest to the vector 78 is the vector 76. The information processing apparatus 100 therefore selects the search target utterance 73 and outputs the response utterance 79 corresponding to it. The response utterance 79 is the reply to the inquiry utterance 77 and is the message "I will adjust the schedule". If 会議 and 打ち合わせ, two different words that both mean "meeting", are registered in advance as related words with similar meanings, the information processing apparatus 100 can determine that the search target utterance 73 is similar to the inquiry utterance 77.
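The selection in FIG. 10 can be sketched as a count of matching dimensions. The sketch below is only an illustration: it uses just the six leading components that the text lists for each vector (the remaining components are elided in the source), and the dictionary keys and the helper `matches` are names chosen here.

```python
# Leading components of the stored vectors, keyed by search target utterance.
stored = {
    "utterance 71": (0, 0, 1, 1, 2, 2),
    "utterance 72": (2, 2, 0, 0, 1, 1),
    "utterance 73": (1, 2, 2, 2, 1, 1),
}
# Leading components of the vector for inquiry utterance 77.
query = (1, 1, 2, 2, 1, 1)

def matches(u, v):
    """Number of dimensions in which two hash value vectors agree."""
    return sum(x == y for x, y in zip(u, v))

scores = {name: matches(vec, query) for name, vec in stored.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

On these six components, the utterances 71, 72, and 73 agree with the query in 0, 2, and 5 dimensions respectively, so utterance 73 is selected, consistent with the description.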

なお、探索木１４２の葉ノードに単一の検索対象発話が対応付けられている場合など、ベクトルを用いた近傍探索によって単一の検索対象発話が検索された場合、情報処理装置１００は当該検索対象発話に対応する返答発話を出力すればよい。一方、探索木１４２の葉ノードに２以上の検索対象発話が対応付けられている場合など、ベクトルを用いた近傍探索によって２以上の検索対象発話が検索された場合、情報処理装置１００は何らかの方法で検索対象発話を１つに絞り込むことが考えられる。例えば、情報処理装置１００は、ベクトルを用いた簡易的な類似判定とは異なる詳細分析により検索対象発話を絞り込む。 When the neighborhood search using the vectors returns a single search target utterance, for example because a single search target utterance is associated with the leaf node of the search tree 142, the information processing apparatus 100 simply outputs the response utterance corresponding to that search target utterance. When the neighborhood search returns two or more search target utterances, for example because two or more are associated with the leaf node of the search tree 142, the information processing apparatus 100 may narrow them down to one by some method. For example, the information processing apparatus 100 narrows down the search target utterances by a detailed analysis that differs from the simple vector-based similarity determination.

次に、情報処理装置１００の機能および処理手順について説明する。 Next, the functions and processing procedures of the information processing apparatus 100 will be described.
図１１は、情報処理装置の機能例を示すブロック図である。 FIG. 11 is a block diagram showing a functional example of the information processing apparatus.
情報処理装置１００は、発話データベース１２１、関連語記憶部１２２、ハッシュ関数記憶部１２３およびインデックス記憶部１２４を有する。これらの記憶部は、例えば、ＲＡＭ１０２またはＨＤＤ１０３の記憶領域を用いて実現される。また、情報処理装置１００は、ハッシュ関数生成部１３１、インデックス生成部１３２、問い合わせ受信部１３３、ハッシュ値算出部１３４、検索部１３５および返答出力部１３６を有する。これらの処理部は、例えば、ＣＰＵ１０１が実行するプログラムを用いて実現される。 The information processing apparatus 100 has an utterance database 121, a related word storage unit 122, a hash function storage unit 123, and an index storage unit 124. These storage units are realized by using, for example, the storage area of the RAM 102 or the HDD 103. The information processing apparatus 100 also has a hash function generation unit 131, an index generation unit 132, an inquiry reception unit 133, a hash value calculation unit 134, a search unit 135, and a response output unit 136. These processing units are realized, for example, by using a program executed by the CPU 101.

発話データベース１２１は、複数の検索対象発話とそれに対応する複数の返答発話とが登録された発話テーブル１４１を記憶する。検索対象発話と返答発話の組は、事前に登録されていてもよいし、情報処理装置１００を使用するユーザとの対話を通じて追加されてもよいし、ネットワークから自動的に収集されてもよい。 The utterance database 121 stores an utterance table 141 in which a plurality of search target utterances and a plurality of response utterances corresponding to the plurality of search target utterances are registered. The set of the search target utterance and the response utterance may be registered in advance, may be added through a dialogue with a user who uses the information processing apparatus 100, or may be automatically collected from the network.

関連語記憶部１２２は、関連語グループを示す関連語テーブル１４３を記憶する。関連語グループは、事前に登録されていてもよいし、情報処理装置１００を使用するユーザとの対話を通じて追加されてもよいし、ネットワークから自動的に収集されてもよい。 The related word storage unit 122 stores the related word table 143 showing the related word group. The related word group may be registered in advance, may be added through a dialogue with a user who uses the information processing apparatus 100, or may be automatically collected from the network.

ハッシュ関数記憶部１２３は、異なる複数のハッシュ関数を記憶する。各ハッシュ関数は、検索対象発話に出現し得る単語および検索対象発話に出現しない仮想語それぞれに対して一意な整数を対応付ける対応関係をもち、単語集合を受け付けて１つのハッシュ値を出力する。異なるハッシュ関数は異なる対応関係をもつ。 The hash function storage unit 123 stores a plurality of different hash functions. Each hash function has a correspondence relationship in which a unique integer is associated with each of a word that can appear in the search target utterance and a virtual word that does not appear in the search target utterance, accepts a word set, and outputs one hash value. Different hash functions have different correspondences.

インデックス記憶部１２４は、問い合わせ発話に類似する検索対象発話を検索するためのインデックスとして探索木１４２を記憶する。探索木１４２は、複数のハッシュ関数を用いて検索対象発話から算出されたベクトルに基づいて生成される。 The index storage unit 124 stores the search tree 142 as an index for searching for a search target utterance similar to the inquiry utterance. The search tree 142 is generated based on a vector calculated from the search target utterance using a plurality of hash functions.

ハッシュ関数生成部１３１は、発話データベース１２１に記憶された検索対象発話と関連語記憶部１２２に記憶された関連語テーブル１４３に基づいて、複数のハッシュ関数を生成し、生成した複数のハッシュ関数をハッシュ関数記憶部１２３に格納する。 The hash function generation unit 131 generates a plurality of hash functions based on the search target utterances stored in the utterance database 121 and the related word table 143 stored in the related word storage unit 122, and stores the generated hash functions in the hash function storage unit 123.

まず、ハッシュ関数生成部１３１は、検索対象発話で使用されている単語を抽出し、抽出した単語をランダムに整列して拡張前のハッシュ関数を生成する。また、ハッシュ関数生成部１３１は、関連語テーブル１４３に登録された関連語グループと関連語をもたない単語に対して仮想語を割り当てる。ハッシュ関数生成部１３１は、仮想語に対しても整数が割り当てられるようにハッシュ関数を拡張する。 First, the hash function generation unit 131 extracts words used in the search target utterance, randomly arranges the extracted words, and generates a hash function before expansion. Further, the hash function generation unit 131 assigns a virtual word to a related word group registered in the related word table 143 and a word having no related word. The hash function generation unit 131 extends the hash function so that an integer can be assigned to a virtual word as well.

インデックス生成部１３２は、発話データベース１２１に記憶された検索対象発話とハッシュ関数記憶部１２３に記憶されたハッシュ関数に基づいて、探索木１４２を生成し、生成した探索木１４２をインデックス記憶部１２４に格納する。 The index generation unit 132 generates the search tree 142 based on the search target utterances stored in the utterance database 121 and the hash functions stored in the hash function storage unit 123, and stores the generated search tree 142 in the index storage unit 124.

まず、インデックス生成部１３２は、検索対象発話毎に単語集合を抽出し、各単語に対応する仮想語を単語集合に追加する。インデックス生成部１３２は、仮想語を追加した単語集合を複数のハッシュ関数それぞれに入力して、ハッシュ値のベクトルを算出する。インデックス生成部１３２は、複数の検索対象発話に対応する複数のベクトルを効率的に検索できるように探索木１４２を生成する。例えば、インデックス生成部１３２は、ベクトルの中の１つの次元に着目し、ベクトルの集合が二分割されるように当該次元のハッシュ値の閾値を決定することを繰り返す。インデックス生成部１３２は、探索木１４２の葉ノードにはできる限り単一のベクトルが対応付けられるように中間ノードを生成する。 First, the index generation unit 132 extracts a word set for each search target utterance, and adds a virtual word corresponding to each word to the word set. The index generation unit 132 inputs a word set to which a virtual word is added into each of a plurality of hash functions, and calculates a vector of hash values. The index generation unit 132 generates a search tree 142 so that a plurality of vectors corresponding to a plurality of search target utterances can be efficiently searched. For example, the index generation unit 132 pays attention to one dimension in the vector, and repeatedly determines the threshold value of the hash value of the dimension so that the set of vectors is divided into two. The index generation unit 132 generates an intermediate node so that a single vector is associated with the leaf node of the search tree 142 as much as possible.

問い合わせ受信部１３３は、問い合わせ発話を受信する。問い合わせ受信部１３３は、ユーザから文字列として入力された問い合わせ発話を受信してもよいし、ユーザが口頭で発した問い合わせ発話の音声信号を文字列に変換してもよい。また、問い合わせ受信部１３３は、他の情報処理装置から文字列または音声信号を受信してもよい。 The inquiry receiving unit 133 receives the inquiry utterance. The inquiry receiving unit 133 may receive the inquiry utterance input as a character string from the user, or may convert the voice signal of the inquiry utterance uttered by the user into a character string. Further, the inquiry receiving unit 133 may receive a character string or an audio signal from another information processing device.

ハッシュ値算出部１３４は、ハッシュ関数記憶部１２３に記憶された複数のハッシュ関数に基づいて、問い合わせ発話に対応するベクトルを生成する。まず、ハッシュ値算出部１３４は、問い合わせ発話から単語集合を抽出し、各単語に対応する仮想語を単語集合に追加する。ハッシュ値算出部１３４は、仮想語を追加した単語集合を複数のハッシュ関数それぞれに入力して、ハッシュ値のベクトルを算出する。 The hash value calculation unit 134 generates a vector corresponding to the inquiry utterance based on the plurality of hash functions stored in the hash function storage unit 123. First, the hash value calculation unit 134 extracts a word set from the inquiry utterance and adds a virtual word corresponding to each word to the word set. The hash value calculation unit 134 inputs a word set to which a virtual word is added into each of a plurality of hash functions, and calculates a vector of hash values.

検索部１３５は、インデックス記憶部１２４に記憶された探索木１４２と問い合わせ発話のベクトルに基づいて、近傍探索により問い合わせ発話に最も類似する検索対象発話を検索する。問い合わせ発話に最も類似する検索対象発話は、ベクトル同士を比較したときにハッシュ値が一致する次元が最も多い検索対象発話である。検索部１３５は、問い合わせ発話のベクトルの中の特定の次元のハッシュ値と閾値とを比較しながら、探索木１４２をルートノードから葉ノードに向かって辿り、特定の葉ノードに到達する。検索部１３５は、到達した葉ノードに対応する検索対象発話を選択する。なお、到達した葉ノードに２以上の検索対象発話が対応付けられている場合、すなわち、探索木１４２では検索対象発話を１つに絞り込めない場合、他の方法で検索対象発話を１つ選択する。 The search unit 135 searches for the search target utterance most similar to the inquiry utterance by neighborhood search, based on the search tree 142 stored in the index storage unit 124 and the vector of the inquiry utterance. The search target utterance most similar to the inquiry utterance is the one whose vector has the largest number of dimensions with matching hash values when the vectors are compared. The search unit 135 traverses the search tree 142 from the root node toward the leaf nodes, comparing at each node the hash value of a specific dimension of the inquiry utterance vector with a threshold, until it reaches a leaf node. The search unit 135 selects the search target utterance corresponding to the leaf node reached. If two or more search target utterances are associated with that leaf node, that is, if the search tree 142 cannot narrow the candidates down to one, the search unit 135 selects one search target utterance by another method.
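The threshold-based tree described for the index generation unit 132 and the search unit 135 might look like the following Python sketch. The description does not fix how the dimension and threshold are chosen; the round-robin choice of dimension, the lower-median threshold, and all names (`Node`, `build`, `search`) are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vector = Tuple[int, ...]

@dataclass
class Node:
    dim: int = -1                        # dimension compared at an internal node
    threshold: int = 0                   # go left if vector[dim] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    items: Optional[List[int]] = None    # utterance ids stored at a leaf

def build(vectors: Sequence[Vector], ids: List[int], depth: int = 0) -> Node:
    """Split the vectors in two on one dimension, recursively, until each
    leaf holds a single vector where possible."""
    if len(ids) <= 1:
        return Node(items=list(ids))
    k = len(vectors[0])
    for offset in range(k):              # try dimensions until one splits the set
        dim = (depth + offset) % k
        values = sorted(vectors[i][dim] for i in ids)
        threshold = values[(len(values) - 1) // 2]     # lower median
        left = [i for i in ids if vectors[i][dim] <= threshold]
        right = [i for i in ids if vectors[i][dim] > threshold]
        if left and right:
            return Node(dim, threshold,
                        build(vectors, left, depth + 1),
                        build(vectors, right, depth + 1))
    return Node(items=list(ids))         # identical vectors cannot be separated

def search(node: Node, query: Vector) -> List[int]:
    """Descend from the root to a leaf, comparing one dimension per node."""
    while node.items is None:
        node = node.left if query[node.dim] <= node.threshold else node.right
    return node.items
```

A descent of this kind visits one leaf only, so it is an approximate neighborhood search: the leaf reached is not guaranteed to hold the vector with the most matching dimensions. A leaf holding several ids corresponds to the case where another method is needed to pick one utterance.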

返答出力部１３６は、選択された検索対象発話に対応付けられた返答発話を出力する。返答出力部１３６は、返答発話の文字列をディスプレイ１０４ａに表示してもよいし、返答発話を音声信号に変換してスピーカーにより音声を再生してもよい。また、返答出力部１３６は、他の情報処理装置に文字列または音声信号を送信してもよい。 The response output unit 136 outputs the response utterance associated with the selected search target utterance. The response output unit 136 may display the character string of the response utterance on the display 104a, or may convert the response utterance into a voice signal and reproduce the voice by the speaker. Further, the response output unit 136 may transmit a character string or an audio signal to another information processing device.

図１２は、インデックス生成の手順例を示すフローチャートである。 FIG. 12 is a flowchart showing an example of the index generation procedure.
（Ｓ１０）ハッシュ関数生成部１３１は、発話テーブル１４１に登録された複数の検索対象発話に含まれる単語を収集して単語集合Ｗを抽出する。 (S10) The hash function generation unit 131 collects the words contained in the plurality of search target utterances registered in the utterance table 141 and extracts the word set W.

（Ｓ１１）ハッシュ関数生成部１３１は、単語集合Ｗに含まれる単語をランダムに整列し、単語の列の先頭から末尾に向かって整数を小さい順に割り当てる。ハッシュ関数生成部１３１は、これをｋ回繰り返してｋ個のハッシュ関数Ｈを生成する。 (S11) The hash function generation unit 131 randomly arranges the words included in the word set W, and assigns integers in ascending order from the beginning to the end of the word sequence. The hash function generation unit 131 repeats this k times to generate k hash functions H.

（Ｓ１２）ハッシュ関数生成部１３１は、単語集合Ｗに含まれる単語ｗを１つ選択し、選択した単語ｗに対応するフラグｆｗを０に初期化する。 (S12) The hash function generation unit 131 selects one word w contained in the word set W and initializes the flag fw corresponding to the selected word w to 0.
（Ｓ１３）ハッシュ関数生成部１３１は、ステップＳ１２で選択した単語ｗを関連語テーブル１４３から検索し、単語ｗが何れかの関連語グループに属するか、すなわち、単語ｗに類似する意味をもつ関連語が存在するか判断する。関連語がある場合はステップＳ１４に進み、関連語がない場合はステップＳ１５に進む。 (S13) The hash function generation unit 131 looks up the word w selected in step S12 in the related word table 143 and determines whether w belongs to any related word group, that is, whether a related word with a meaning similar to w exists. If there is a related word, the process proceeds to step S14; if not, the process proceeds to step S15.

（Ｓ１４）ハッシュ関数生成部１３１は、フラグｆｗを１に更新する。 (S14) The hash function generation unit 131 updates the flag fw to 1.
（Ｓ１５）ハッシュ関数生成部１３１は、単語集合Ｗの中にステップＳ１２でまだ選択していない単語ｗがあるか判断する。未選択の単語ｗがある場合はステップＳ１２に進み、未選択の単語ｗがない場合はステップＳ１６に進む。 (S15) The hash function generation unit 131 determines whether the word set W contains a word w that has not yet been selected in step S12. If there is an unselected word w, the process proceeds to step S12; if not, the process proceeds to step S16.

（Ｓ１６）ハッシュ関数生成部１３１は、関連語テーブル１４３に登録された関連語グループそれぞれに対して一意な仮想語ｓを割り当てる。また、ハッシュ関数生成部１３１は、ｆｗ＝０である単語それぞれに対して一意な仮想語ｓを割り当てる。ここで割り当てる仮想語ｓは、単語集合Ｗに含まれないダミー語ないしラベルであり、関連語グループおよびｆｗ＝０である単語の間で重複しない。仮想語ｓは「ｓ_１」のような記号でもよい。 (S16) The hash function generation unit 131 assigns a unique virtual word s to each related word group registered in the related word table 143. The hash function generation unit 131 also assigns a unique virtual word s to each word with fw = 0. The virtual words s assigned here are dummy words or labels that are not contained in the word set W, and no virtual word is shared among the related word groups and the words with fw = 0. A virtual word s may be a symbol such as "s_1".
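Steps S12 to S16 amount to building a word-to-virtual-word map. A minimal Python sketch follows; it assumes that related word groups are disjoint and non-overlapping with the label strings, and the function name and the `s_<number>` label format are choices made for this illustration.

```python
from typing import Dict, Iterable, List, Set

def assign_virtual_words(vocabulary: Iterable[str],
                         groups: List[Set[str]]) -> Dict[str, str]:
    """Return a word -> virtual word map in the manner of steps S12 to S16."""
    word_to_virtual: Dict[str, str] = {}
    counter = 0
    # Words with flag fw = 1: all words of a group share one virtual word.
    for group in groups:
        counter += 1
        for w in group:
            word_to_virtual[w] = f"s_{counter}"
    # Words with flag fw = 0: each gets a virtual word used by no other word.
    for w in sorted(vocabulary):
        if w not in word_to_virtual:
            counter += 1
            word_to_virtual[w] = f"s_{counter}"
    return word_to_virtual

mapping = assign_virtual_words(
    ["会議", "打ち合わせ", "調整", "メール"],
    [{"会議", "打ち合わせ"}],
)
print(mapping)
```

Here the two related words receive the same virtual word, while the two remaining words each receive their own, so three distinct virtual words are produced for four words.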

（Ｓ１７）ハッシュ関数生成部１３１は、ステップＳ１１で生成したｋ個のハッシュ関数Ｈの中から１つのハッシュ関数Ｈを選択する。 (S17) The hash function generation unit 131 selects one hash function H from the k hash functions H generated in step S11.
（Ｓ１８）ハッシュ関数生成部１３１は、単語集合Ｗから単語ｗを１つ選択する。 (S18) The hash function generation unit 131 selects one word w from the word set W.

（Ｓ１９）ハッシュ関数生成部１３１は、ステップＳ１７で選択したハッシュ関数Ｈにおいて、ステップＳ１８で選択した単語ｗに対応付けられている整数ｈ（ｗ）を特定する。そして、ハッシュ関数生成部１３１は、シャッフル用の所定の関数ｇを用いて、単語ｗに対応付ける新たな整数ｈ’（ｗ）＝ｇ（ｈ（ｗ））を算出する。 (S19) The hash function generation unit 131 specifies an integer h (w) associated with the word w selected in step S18 in the hash function H selected in step S17. Then, the hash function generation unit 131 calculates a new integer h'(w) = g (h (w)) associated with the word w by using a predetermined function g for shuffling.

（Ｓ２０）ハッシュ関数生成部１３１は、単語集合Ｗの中にステップＳ１８でまだ選択していない単語ｗがあるか判断する。未選択の単語ｗがある場合はステップＳ１８に進み、未選択の単語ｗがない場合はステップＳ２１に進む。 (S20) The hash function generation unit 131 determines whether or not there is a word w that has not yet been selected in step S18 in the word set W. If there is an unselected word w, the process proceeds to step S18, and if there is no unselected word w, the process proceeds to step S21.

図１３は、インデックス生成の手順例を示すフローチャート（続き）である。 FIG. 13 is a flowchart (continued) showing an example of the index generation procedure.
（Ｓ２１）ハッシュ関数生成部１３１は、ステップＳ１６で割り当てた仮想語の集合（仮想語集合Ｗ’）の中から仮想語ｓを１つ選択する。 (S21) The hash function generation unit 131 selects one virtual word s from the set of virtual words assigned in step S16 (the virtual word set W').

（Ｓ２２）ハッシュ関数生成部１３１は、仮想語を既出の単語に変換する所定の関数ｅを用いて、ステップＳ２１で選択した仮想語ｓに対応する単語ｅ（ｓ）を特定する。単語ｅ（ｓ）は、仮想語ｓに対応する関連語グループの中の１つの単語または仮想語ｓに対応する単一の単語でもよく、単語集合Ｗの中からランダムに選択されたものでもよい。ハッシュ関数生成部１３１は、ステップＳ１７で選択したハッシュ関数Ｈにおいて、単語ｅ（ｓ）に対応付けられている整数ｈ（ｅ（ｓ））を特定する。ハッシュ関数生成部１３１は、シャッフル用の所定の関数ｇを用いて、仮想語ｓに対応付ける新たな整数ｈ’（ｓ）＝ｇ（ｈ（ｅ（ｓ））＋ｍａｘＮ＋１）を算出する。なお、ｍａｘＮはハッシュ関数Ｈが出力し得るハッシュ値（値域）の最大値である。 (S22) Using a predetermined function e that converts a virtual word into an existing word, the hash function generation unit 131 identifies the word e(s) corresponding to the virtual word s selected in step S21. The word e(s) may be one word in the related word group corresponding to the virtual word s, or the single word corresponding to the virtual word s, or it may be selected at random from the word set W. The hash function generation unit 131 identifies the integer h(e(s)) associated with the word e(s) in the hash function H selected in step S17. Using a predetermined shuffling function g, the hash function generation unit 131 calculates the new integer h'(s) = g(h(e(s)) + max N + 1) to be associated with the virtual word s. Here, max N is the maximum hash value (the maximum of the range) that the hash function H can output.

（Ｓ２３）ハッシュ関数生成部１３１は、仮想語集合Ｗ’の中にステップＳ２１でまだ選択していない仮想語ｓがあるか判断する。未選択の仮想語ｓがある場合はステップＳ２１に進み、未選択の仮想語ｓがない場合はステップＳ２４に進む。 (S23) The hash function generation unit 131 determines whether or not there is a virtual word s that has not yet been selected in step S21 in the virtual word set W'. If there is an unselected virtual word s, the process proceeds to step S21, and if there is no unselected virtual word s, the process proceeds to step S24.

（Ｓ２４）ハッシュ関数生成部１３１は、ステップＳ１９，Ｓ２２の結果から、ステップＳ１７で選択したハッシュ関数Ｈに対応する拡張後のハッシュ関数Ｈ’を決定する。 (S24) From the results of steps S19 and S22, the hash function generation unit 131 determines the extended hash function H' corresponding to the hash function H selected in step S17.
（Ｓ２５）ハッシュ関数生成部１３１は、ステップＳ１７でまだ選択していないハッシュ関数Ｈがあるか判断する。未選択のハッシュ関数Ｈがある場合はステップＳ１７に進み、未選択のハッシュ関数Ｈがない場合はステップＳ２６に進む。 (S25) The hash function generation unit 131 determines whether there is a hash function H that has not yet been selected in step S17. If there is an unselected hash function H, the process proceeds to step S17; if not, the process proceeds to step S26.
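The construction of one extended mapping in steps S11 and S17 to S24 can be sketched in Python as follows. The sketch assumes that integers start at 0, that the function e is given as a dictionary `representative` from virtual words to distinct existing words (so that e is injective), and that g is a random permutation of N ∪ N'; the function names are chosen for illustration.

```python
import random
from typing import Dict, Iterable, List

def base_hash(words: List[str], rng: random.Random) -> Dict[str, int]:
    """S11: order the words randomly and number them 0, 1, 2, ... in order."""
    order = words[:]
    rng.shuffle(order)
    return {w: i for i, w in enumerate(order)}

def extend_hash(h: Dict[str, int], representative: Dict[str, str],
                rng: random.Random) -> Dict[str, int]:
    """S17 to S24: derive h' over words and virtual words from a mapping h.

    representative plays the role of e (virtual word -> one existing word)
    and must map different virtual words to different words.
    """
    max_n = max(h.values())
    g = list(range(2 * (max_n + 1)))     # N ∪ N' = {0, ..., 2 * max_n + 1}
    rng.shuffle(g)                       # g: random bijection on N ∪ N'
    h_ext = {w: g[n] for w, n in h.items()}            # S19: g(h(w))
    for s, w in representative.items():
        h_ext[s] = g[h[w] + max_n + 1]                 # S22: g(h(e(s)) + maxN + 1)
    return h_ext

def min_hash(h_ext: Dict[str, int], expanded_words: Iterable[str]) -> int:
    """H': the smallest integer among the words and virtual words of a set."""
    return min(h_ext[x] for x in expanded_words)

rng = random.Random(0)
words = ["猫", "ご飯", "にゃんこ", "えさ"]
h = base_hash(words, rng)
h_ext = extend_hash(h, {"s_1": "猫", "s_2": "ご飯"}, rng)
print(h_ext)
```

Because h and e are injective and g is a bijection, every word and every virtual word ends up with a different integer, which is what lets the smallest value identify a single element of the set. Repeating the call k times with independent shuffles would give k extended mappings.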

（Ｓ２６）インデックス生成部１３２は、発話テーブル１４１に登録された複数の検索対象発話の中から検索対象発話を１つ選択する。 (S26) The index generation unit 132 selects one search target utterance from the plurality of search target utterances registered in the utterance table 141.
（Ｓ２７）インデックス生成部１３２は、ステップＳ２６で選択した検索対象発話に含まれる単語を抽出し、抽出した単語の集合である単語集合Ａを生成する。 (S27) The index generation unit 132 extracts the words contained in the search target utterance selected in step S26 and generates a word set A, which is the set of the extracted words.

（Ｓ２８）インデックス生成部１３２は、単語集合Ａに含まれる単語それぞれに対応する仮想語を判定し、判定した仮想語を単語集合Ａに追加する。何れかの関連語グループに属する単語には、当該関連語グループに対してステップＳ１６で割り当てられた仮想語が追加される。何れの関連語グループにも属さない単語（関連語がない単語）には、当該単語に対してステップＳ１６で割り当てられた仮想語が追加される。 (S28) The index generation unit 132 determines a virtual word corresponding to each word included in the word set A, and adds the determined virtual word to the word set A. The virtual word assigned in step S16 is added to the word belonging to any of the related word groups. A virtual word assigned in step S16 is added to a word that does not belong to any of the related word groups (a word that does not have a related word).

（Ｓ２９）インデックス生成部１３２は、仮想語を追加した単語集合Ａを、ステップＳ２４で決定された複数のハッシュ関数Ｈ’にそれぞれ入力し、複数のハッシュ値を求める。各ハッシュ関数Ｈ’は、単語集合Ａに含まれる元の単語および仮想語それぞれに対応する整数を判定し、最小の整数をハッシュ値として出力する。インデックス生成部１３２は、複数のハッシュ値を列挙したベクトルを算出する。 (S29) The index generation unit 132 inputs the word set A to which the virtual word is added into the plurality of hash functions H'determined in step S24, and obtains a plurality of hash values. Each hash function H'determines an integer corresponding to each of the original word and the virtual word included in the word set A, and outputs the smallest integer as a hash value. The index generation unit 132 calculates a vector in which a plurality of hash values are listed.

（Ｓ３０）インデックス生成部１３２は、ステップＳ２６でまだ選択していない検索対象発話があるか判断する。未選択の検索対象発話がある場合はステップＳ２６に進み、未選択の検索対象発話がない場合はステップＳ３１に進む。 (S30) The index generation unit 132 determines whether there is a search target utterance that has not yet been selected in step S26. If there is an unselected search target utterance, the process proceeds to step S26, and if there is no unselected search target utterance, the process proceeds to step S31.

（Ｓ３１）インデックス生成部１３２は、ステップＳ２９で算出した複数のベクトルから、検索対象発話のインデックスとしての探索木１４２を生成する。 (S31) The index generation unit 132 generates the search tree 142 as an index of the search target utterances from the plurality of vectors calculated in step S29.
図１４は、発話検索の手順例を示すフローチャートである。 FIG. 14 is a flowchart showing an example of the utterance search procedure.

（Ｓ４０）問い合わせ受信部１３３は、問い合わせ発話を受信する。 (S40) The inquiry receiving unit 133 receives an inquiry utterance.
（Ｓ４１）ハッシュ値算出部１３４は、ステップＳ４０で受信した問い合わせ発話に含まれる単語を抽出し、抽出した単語の集合である単語集合Ｂを生成する。 (S41) The hash value calculation unit 134 extracts the words contained in the inquiry utterance received in step S40 and generates a word set B, which is the set of the extracted words.

（Ｓ４２）ハッシュ値算出部１３４は、単語集合Ｂに含まれる単語それぞれに対応する仮想語を判定し、判定した仮想語を単語集合Ｂに追加する。何れかの関連語グループに属する単語には、当該関連語グループに対してステップＳ１６で割り当てられた仮想語が追加される。何れの関連語グループにも属さない単語（関連語がない単語）には、当該単語に対してステップＳ１６で割り当てられた仮想語が追加される。 (S42) The hash value calculation unit 134 determines a virtual word corresponding to each word included in the word set B, and adds the determined virtual word to the word set B. The virtual word assigned in step S16 is added to the word belonging to any of the related word groups. A virtual word assigned in step S16 is added to a word that does not belong to any of the related word groups (a word that does not have a related word).

（Ｓ４３）ハッシュ値算出部１３４は、仮想語を追加した単語集合Ｂを、ステップＳ２４で決定された複数のハッシュ関数Ｈ’にそれぞれ入力し、複数のハッシュ値を求める。各ハッシュ関数Ｈ’は、単語集合Ｂに含まれる元の単語および仮想語それぞれに対応する整数を判定し、最小の整数をハッシュ値として出力する。ハッシュ値算出部１３４は、複数のハッシュ値を列挙したベクトルを算出する。 (S43) The hash value calculation unit 134 inputs the word set B to which the virtual word is added into the plurality of hash functions H'determined in step S24, and obtains a plurality of hash values. Each hash function H'determines an integer corresponding to each of the original word and the virtual word included in the word set B, and outputs the smallest integer as a hash value. The hash value calculation unit 134 calculates a vector in which a plurality of hash values are listed.

（Ｓ４４）検索部１３５は、ステップＳ３１で生成された探索木１４２とステップＳ４３で算出されたベクトルとを照合して、最も類似する検索対象発話を判定する。 (S44) The search unit 135 collates the search tree 142 generated in step S31 with the vector calculated in step S43 and determines the most similar search target utterance.
（Ｓ４５）返答出力部１３６は、発話テーブル１４１からステップＳ４４の検索対象発話に対応付けられた返答発話を取得し、取得した返答発話を出力する。 (S45) The response output unit 136 acquires from the utterance table 141 the response utterance associated with the search target utterance determined in step S44 and outputs the acquired response utterance.
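The flow from index generation to steps S40 through S45 can be put together in a compact Python sketch. It makes several simplifying assumptions that are not taken from the description: utterances arrive already split into words, each extended mapping is drawn directly as a random one-to-one numbering of words and virtual words (standing in for the h, e, g construction), query words outside the vocabulary are ignored, and a linear scan over the stored vectors replaces the search tree 142. All names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple

def vectorize(words: Sequence[str], virtual: Dict[str, str],
              hashes: List[Dict[str, int]]) -> Tuple[int, ...]:
    """S41 to S43: add virtual words, then take the minimum under each mapping.

    Assumes at least one word of the utterance is in the vocabulary.
    """
    known = {w for w in words if w in virtual}       # skip unknown words
    expanded = known | {virtual[w] for w in known}
    return tuple(min(h[x] for x in expanded) for h in hashes)

def build_index(utterances: Sequence[Sequence[str]], groups: List[Set[str]],
                k: int, seed: int = 0):
    """Return (virtual word map, k mappings, one vector per stored utterance)."""
    vocab = sorted({w for u in utterances for w in u} | {w for g in groups for w in g})
    virtual: Dict[str, str] = {}
    for i, group in enumerate(groups, start=1):
        for w in group:
            virtual[w] = f"s_{i}"                    # shared by the whole group
    for w in vocab:
        virtual.setdefault(w, f"s_{w}")              # own label for other words
    symbols = vocab + sorted(set(virtual.values()))
    rng = random.Random(seed)
    hashes = []
    for _ in range(k):
        order = symbols[:]
        rng.shuffle(order)
        hashes.append({x: n for n, x in enumerate(order)})
    vectors = [vectorize(u, virtual, hashes) for u in utterances]
    return virtual, hashes, vectors

def reply(query_words: Sequence[str], responses: Sequence[str], index):
    """S44 and S45: pick the stored vector with the most matching dimensions."""
    virtual, hashes, vectors = index
    q = vectorize(query_words, virtual, hashes)
    scores = [sum(a == b for a, b in zip(q, v)) for v in vectors]
    best = max(range(len(vectors)), key=scores.__getitem__)
    return responses[best], scores
```

A call such as `reply(["会議", "調整"], responses, index)` with the related word group {会議, 打ち合わせ} registered would then tend to favor a stored utterance containing 打ち合わせ and 調整, although with randomly drawn mappings the exact scores depend on the seed and on k.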

第２の実施の形態の情報処理装置１００によれば、対話システムを実現するために、問い合わせ発話に類似する検索対象発話がデータベースの中から検索される。このとき、各発話に含まれる単語集合から複数のMin-hash関数によりハッシュ値のベクトルが算出され、ベクトル同士の近さによって発話の類似度が評価される。よって、多数の検索対象発話の中から問い合わせ発話に類似するものを高速に検索することができる。 According to the information processing apparatus 100 of the second embodiment, a search target utterance similar to the inquiry utterance is retrieved from the database in order to realize a dialogue system. In doing so, a vector of hash values is calculated from the word set of each utterance by a plurality of Min-hash functions, and the similarity between utterances is evaluated by the closeness of their vectors. An utterance similar to the inquiry utterance can therefore be found quickly among a large number of search target utterances.

In addition, with reference to the related word dictionary indicating the related word groups, a virtual word indicating the related word group to which a word belongs is added to the word set while the original word is retained. The plurality of Min-hash functions are extended so that different integers are assigned to original words and virtual words. The vector is then calculated from the word set with the virtual words added, using the extended Min-hash functions. As a result, when the inquiry utterance and a search target utterance do not share the same word but contain related words with close meanings, the similarity between the two can be evaluated appropriately.

For example, the similarity is rated higher than when the closeness of meaning between words is ignored, and lower than when related words are treated as identical. As a result, the similarity between an inquiry utterance and a search target utterance that use different expressions can be evaluated appropriately, and the search target utterances similar to the inquiry utterance can be narrowed down efficiently. Because vectors of Min-hash hash values are used, the search remains fast.
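
This ordering can be checked on a small example. The sketch below computes the exact Jaccard similarity of the word sets, which is the quantity that the fraction of matching Min-hash values estimates. The words, the related word group, and the `group:` and `virtual:` token names are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity: shared elements divided by all distinct elements."""
    return len(a & b) / len(a | b)


def add_virtual_words(words, group_of):
    """Keep the original words and add one virtual word per word."""
    return set(words) | {group_of.get(w, "virtual:" + w) for w in words}


def equate_related_words(words, group_of):
    """Replace each word by its group token, so related words become identical."""
    return {group_of.get(w, w) for w in words}


# Hypothetical related word group containing "price" and "cost".
group_of = {"price": "group:1", "cost": "group:1"}
inquiry = {"price", "change"}
stored = {"cost", "change"}

# Relatedness ignored: only "change" is shared (1 of 3 distinct words).
ignoring = jaccard(inquiry, stored)

# Virtual words added: "change", "group:1", "virtual:change" are shared (3 of 5).
with_virtual = jaccard(add_virtual_words(inquiry, group_of),
                       add_virtual_words(stored, group_of))

# Related words equated: both sets become {"group:1", "change"} (2 of 2).
equating = jaccard(equate_related_words(inquiry, group_of),
                   equate_related_words(stored, group_of))
```

For this example the three values are 1/3, 3/5, and 1, so adding virtual words yields a similarity between the two extremes.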

10 Similar text search device
11 Storage unit
12 Processing unit
13 Related word dictionary
14, 15a, 15b Text
16, 17a, 17b Word set
18, 19a, 19b Feature information

Claims

A similar text search method in which a computer:
accepts input of a first text;
extracts two or more first words included in the first text and, referring to a related word dictionary indicating groups of related words, assigns, to a first word among the two or more first words that belongs to any of the groups, a first dummy word indicating that group;
generates first feature information corresponding to a first word set that includes the two or more first words and the first dummy word; and
searches for a second text similar to the first text based on a comparison between the first feature information and second feature information, the second feature information being stored in correspondence with each of a plurality of second texts and corresponding to a second word set that includes two or more second words included in that second text and a second dummy word indicating a group to which any of the second words belongs.

The similar text search method according to claim 1, wherein:
the first feature information is a first vector containing a plurality of hash values calculated from the first word set by a plurality of different hash functions; and
the second feature information is a second vector containing a plurality of hash values calculated from the second word set by the plurality of hash functions.

The similar text search method according to claim 2, wherein each of the plurality of hash functions has a correspondence relationship that associates a unique value with each word and each dummy word, and outputs, as a hash value, the smallest of the values corresponding to the words and dummy words included in an input word set.

The similar text search method according to claim 1, wherein the computer further outputs, as a response to the first text, a third text stored in association with the retrieved second text.

A similar text search device comprising:
a storage unit that stores a related word dictionary indicating groups of related words and stores, in correspondence with each of a plurality of second texts, second feature information corresponding to a second word set that includes two or more second words included in that second text and a second dummy word indicating a group to which any of the second words belongs; and
a processing unit that accepts input of a first text, extracts two or more first words included in the first text, refers to the related word dictionary to assign, to a first word among the two or more first words that belongs to any of the groups, a first dummy word indicating that group, generates first feature information corresponding to a first word set that includes the two or more first words and the first dummy word, and searches for a second text similar to the first text based on a comparison between the second feature information and the first feature information.

A similar text search program that causes a computer to execute a process comprising:
accepting input of a first text;
extracting two or more first words included in the first text and, referring to a related word dictionary indicating groups of related words, assigning, to a first word among the two or more first words that belongs to any of the groups, a first dummy word indicating that group;
generating first feature information corresponding to a first word set that includes the two or more first words and the first dummy word; and
searching for a second text similar to the first text based on a comparison between the first feature information and second feature information, the second feature information being stored in correspondence with each of a plurality of second texts and corresponding to a second word set that includes two or more second words included in that second text and a second dummy word indicating a group to which any of the second words belongs.