JP4579638B2

JP4579638B2 - Data search apparatus and data search method

Info

Publication number: JP4579638B2
Application number: JP2004292606A
Authority: JP
Inventors: 英生久保山
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2004-10-05
Filing date: 2004-10-05
Publication date: 2010-11-10
Anticipated expiration: 2024-10-05
Also published as: JP2006107108A

Description

The present invention relates to an apparatus and a method for retrieving data based on speech recognition result annotation data obtained by performing speech recognition on voice annotation data attached to the data.

In recent years, digital cameras and similar devices have become remarkably widespread. A user typically manages digital images captured with a portable imaging device, such as a digital camera, on a PC or a server. For example, captured images can be organized into folders on a PC or server, or specific images can be printed and incorporated into New Year's cards. When images are managed on a server, some of them can also be made available to other users.

Such tasks require finding the specific image the user has in mind. When the number of candidate images is small, the images can be displayed as thumbnails and the desired one found by eye from the list. However, when there are hundreds of candidate images, or when the candidate images are split across multiple folders, finding the image by eye is difficult.

Therefore, a voice annotation (a spoken comment) is attached to an image on the imaging device, and that information is used at search time. For example, a user captures an image of a mountain with a portable imaging device and utters “Hakone no yama” (the mountain of Hakone) for that image. The voice data is stored in the imaging device paired with the image data. It is then speech-recognized, either in the imaging device or on the PC to which the image has been uploaded, and converted into the text information “hakonenoyama”. Once the voice annotation data has been converted into text information, it can be processed with general text search techniques, and the image can be retrieved with words such as “yama” or “hakone”.

Prior art using such voice annotation includes Patent Literature 1, Patent Literature 2, and Patent Literature 3. In these documents, the user inputs a voice annotation when or after an image is captured, and the voice data is used for image retrieval by means of existing speech recognition technology.
JP 2003-219327 A; JP 2002-325225 A; JP-A-9-135417

When voice annotations are converted by speech recognition and then searched, misrecognition is unavoidable. If a large proportion of an annotation is misrecognized, the matching correlation is poor even when the search key is entered correctly, and the data is not retrieved correctly. However, even if a small part of the annotation is wrong due to misrecognition, the data can often be retrieved correctly as long as most of the annotation is correct.

When the retrieved images are ranked using the degree of correlation with the search key as a score, annotations with a high correlation with the search key usually appear correctly near the top of the ranking. Images whose correlation with the search key is low, for example because of misrecognition, become hard to distinguish from the other annotations, and their rank drops sharply. When such low-correlation images are arranged by rank, finding the desired image among them is difficult. For these images it is preferable to arrange them by name, time, or the like, as in a conventional folder, or to display no search results and report that the search failed.

In order to solve the above problems, a data search method of the present invention is a data search method for a data search apparatus that searches for desired data from a plurality of data, each stored in association with predetermined voice data, and displays a search result on a display means. The method includes: a storage step of storing the plurality of data and the predetermined voice data associated with each of the plurality of data; an acquisition step of acquiring a first phoneme string obtained by performing speech recognition on each of the voice data; an input step of inputting a search key corresponding to a search condition in response to an operation by a user; a conversion step of performing morphological analysis on the search key to divide it into a word string, and assigning readings to the word string to obtain a second phoneme string; a determination step of determining, by phoneme matching, a degree of correlation between the second phoneme string and each first phoneme string obtained from the voice data; and a display control step of controlling the display means so that the data associated with voice data whose first phoneme string has a degree of correlation equal to or greater than a predetermined threshold are displayed in order of rank by the degree of correlation, and the data associated with voice data whose first phoneme string has a degree of correlation less than the threshold are displayed arranged according to any one of name order of the data names, order of the time information held by the data, order of data size, and order of display size of the data.

Further, a data search method of the present invention is a data search method for a data search apparatus that searches for desired data from a plurality of data, each stored in association with predetermined voice data, and displays a search result on a display means. The method includes: a storage step of storing the plurality of data and the predetermined voice data associated with each of the plurality of data; an acquisition step of acquiring a first word string obtained by performing speech recognition on each of the voice data; an input step of inputting a search key corresponding to a search condition in response to an operation by a user; a conversion step of performing morphological analysis on the search key to obtain a second word string; a determination step of determining, by word matching, a degree of correlation between the second word string and each first word string obtained from the voice data; and a display control step of controlling the display means so that the data associated with voice data whose first word string has a degree of correlation equal to or greater than a predetermined threshold are displayed in order of rank by the degree of correlation, and the data associated with voice data whose first word string has a degree of correlation less than the threshold are displayed arranged according to any one of name order of the data names, order of the time information held by the data, order of data size, and order of display size of the data.

As described above, in the data search of the present invention, the method of displaying search results is switched according to a score representing the correlation between the search condition and the speech recognition result annotation data, which is the result of speech recognition of the voice corresponding to the data. This allows data that appears at a high rank and data that does not to be distinguished when searching, which improves user convenience.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the functions of a data search apparatus according to an embodiment of the present invention. In the figure, reference numeral 101 denotes a database. Reference numeral 102 denotes data, such as images and documents, stored in the database 101. Reference numeral 103 denotes voice annotation data, which is a spoken annotation attached to the data 102. Reference numeral 104 denotes speech recognition result annotation data obtained by performing speech recognition on the voice annotation data 103 and converting it into a phoneme string, a word string, or the like. Reference numeral 105 denotes a search key input unit for entering a search key as a search condition for retrieving the desired data 102. Reference numeral 106 denotes a search key conversion unit that converts the search key into a phoneme string or word string in the same format as the speech recognition result annotation data 104, so that the search key can be matched against it.

Reference numeral 107 denotes a search unit that matches the search key against the plurality of speech recognition result annotation data 104 in the database 101, obtains a correlation score for each speech recognition result annotation data 104, and ranks the data 102 corresponding to the speech recognition result annotation data 104. Reference numeral 108 denotes a display switching unit that switches the data display method based on the correlation score of each data item. Reference numeral 109 denotes a display unit that displays the data 102 according to the method designated for each item by the display switching unit 108.

The processing flow of this embodiment will be described in detail with reference to FIG. 1. Each data item 102, such as an image or a document, has corresponding voice annotation data 103 and speech recognition result annotation data 104, which is the result of performing speech recognition on that voice annotation data. The speech recognition result annotation data may be created by a speech recognition unit provided in this apparatus, or by a speech recognition unit provided in another apparatus, such as the camera that captures the image. In addition, because the annotation data used for the data search in the present invention is the speech recognition result annotation data 104, the voice annotation data 103 itself need not be present.

FIG. 2 shows an example of the speech recognition result annotation data 104. In the figure, reference numeral 201 denotes recognition result phoneme strings obtained by performing speech recognition on the voice annotation data 103 and converting it into phoneme strings; the five most likely candidates are listed in order. Reference numeral 202 denotes the name of the grammar used for speech recognition. This embodiment is described using a grammar that converts speech into a phoneme string, but a grammar that converts speech into a word string may also be used. Reference numeral 203 denotes a recognition likelihood representing how likely the phoneme string is for the voice. Of these pieces of information, the data search in this embodiment uses only the recognition result phoneme strings 201, so the grammar name 202 and the recognition likelihood 203 may be omitted.
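As an illustration only, the annotation record described for FIG. 2 could be held in a structure like the following Python sketch. The class names, field names, grammar name, and likelihood values are hypothetical and are not taken from the patent; only the three kinds of information (N-best phoneme strings, grammar name, recognition likelihood) come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecognitionCandidate:
    """One recognition result phoneme string (201) with its optional likelihood (203)."""
    phonemes: List[str]
    likelihood: Optional[float] = None  # may be omitted; the search does not use it


@dataclass
class RecognitionAnnotation:
    """Speech recognition result annotation data (104) for one data item."""
    candidates: List[RecognitionCandidate]  # ordered from most to least likely
    grammar: Optional[str] = None           # grammar name (202); may be omitted


# Made-up example values: two of the up-to-five candidates for the utterance "Hakone no yama".
annotation = RecognitionAnnotation(
    candidates=[
        RecognitionCandidate("h a k o n e n o y a m a".split(), -120.5),
        RecognitionCandidate("f a k o n e n o y a m a".split(), -123.0),
    ],
    grammar="phoneme_grammar",
)
```

Making the grammar name and likelihood optional mirrors the statement that only the phoneme strings are needed for the search in this embodiment.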

First, the user enters a search key as a search condition in the search key input unit 105. FIG. 3 shows an example of the search key input dialog presented to the user by the search key input unit. In this dialog, the user enters, as text, a word or sentence corresponding to all or part of the voice attached to the data to be retrieved. When the user enters the search key and presses the search button, the search key is passed to the search key conversion unit 106 and converted into a phoneme string in the same format as the recognition result phoneme strings 201. FIG. 4 shows how the search key is converted into a phoneme string. The search key “箱根の山” (Hakone no yama) is morphologically analyzed and divided into a word string. Readings are then assigned to the word string to obtain a phoneme string. General natural language processing techniques are applied for the morphological analysis and the assignment of readings.
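The conversion just described (segmentation into words, assignment of readings, then phonemes) can be sketched in Python as follows. This is a toy under stated assumptions: the tiny lexicon, the kana table, and the greedy longest-match segmentation are hypothetical stand-ins for the general morphological analysis and reading assignment that the text refers to, and they cover only this one example.

```python
# Hypothetical toy lexicon: surface form -> reading in hiragana.
TOY_LEXICON = {"箱根": "はこね", "の": "の", "山": "やま"}

# Hypothetical kana-to-phoneme table covering only the kana used above.
KANA_TO_PHONEMES = {
    "は": ["h", "a"], "こ": ["k", "o"], "ね": ["n", "e"],
    "の": ["n", "o"], "や": ["y", "a"], "ま": ["m", "a"],
}


def segment(text):
    """Split text into words by greedy longest match against the toy lexicon."""
    words = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in TOY_LEXICON:
                words.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no lexicon entry at position {i}")
    return words


def to_phonemes(text):
    """Convert a search key to a phoneme list via word readings."""
    phonemes = []
    for word in segment(text):
        for kana in TOY_LEXICON[word]:
            phonemes.extend(KANA_TO_PHONEMES[kana])
    return phonemes


key_phonemes = to_phonemes("箱根の山")
```

A real system would replace `segment` and the lexicon with a proper morphological analyzer; the point here is only the three-stage shape of the conversion.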

Next, the search unit 107 performs phoneme matching between the phoneme string of the search key and the speech recognition result annotation data 104 corresponding to every data item 102 to be searched, and obtains a phoneme accuracy that represents the degree of correlation with the search key. The matching can be performed by a general method such as DP matching. FIG. 5 shows how the phoneme accuracy is obtained. When the number of correct phonemes, the number of insertion errors, the number of deletion errors, and the number of substitution errors have been obtained by DP matching or the like, the phoneme accuracy is calculated as

{(number of correct phonemes) - (number of insertion errors) - (number of deletion errors) - (number of substitution errors)} × 100 / (number of correct phonemes)

In FIG. 5 there are two insertion errors, “o” and “a”, and one substitution error in which “h” is mistaken for “f”, and the phoneme accuracy is 75%. The data 102 are ranked using the phoneme accuracy obtained in this way as the search score. The speech recognition result annotation data in FIG. 2 contains the top five recognition result phoneme strings; matching is performed against each of them to obtain a phoneme accuracy, and the best phoneme accuracy and its recognition result phoneme string are adopted. However, the present invention is not limited to this: the phoneme accuracy may be multiplied by a weighting coefficient that depends on the candidate's rank before the maximum is taken, or the sum may be taken. Furthermore, the speech recognition result annotation data is not limited to the form that holds the top N recognition results as in FIG. 2; a lattice (word graph) composed of phonemes (or words or the like) may be output, and the phoneme accuracy may be obtained for each path from the start to the end of the lattice.
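The matching and scoring described above can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation: it uses a standard minimum-edit-distance alignment as the DP matching, treats the search key as the reference string (an assumption), applies the accuracy formula exactly as written in the text, and returns negative infinity when there are no correct phonemes (a guard added here, since the formula would otherwise divide by zero). The function names are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (correct, insertions, deletions, substitutions) from a
    minimum-edit alignment of the hypothesis against the reference."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back through the table to count each kind of step.
    correct = ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                correct += 1
            else:
                sub += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dele += 1  # reference phoneme missing from the hypothesis
            i -= 1
        else:
            ins += 1   # extra phoneme in the hypothesis
            j -= 1
    return correct, ins, dele, sub


def phoneme_accuracy(correct, ins, dele, sub):
    """Accuracy in percent, following the formula as stated in the text."""
    if correct == 0:
        return float("-inf")  # guard added for this sketch
    return (correct - ins - dele - sub) * 100 / correct


def best_accuracy(key, candidates):
    """Score a data item by the best accuracy over its N-best phoneme strings."""
    return max(phoneme_accuracy(*align_counts(key, c)) for c in candidates)
```

For instance, `align_counts(list("abc"), list("abxc"))` finds three correct symbols and one insertion. The rank-weighted maximum or sum mentioned in the text would replace the plain `max` in `best_accuracy`.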

Next, the display switching unit 108 compares the phoneme accuracy of each data item 102 with a threshold. Data whose phoneme accuracy is equal to or greater than the threshold are displayed on the display unit 109 in order of rank by phoneme accuracy. Data whose phoneme accuracy is less than the threshold are displayed in a separate area of the display unit 109 according to a criterion other than the score order, such as name order of the data names, order of the time information held by the data, order of data size, or order of display size of the data.
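The threshold-based switching can be sketched in Python as below. The `Item` type, the function name, and the sample values are hypothetical; the default of 60 is the value the description arrives at from its preliminary experiment, and name order is used here as one of the alternative orderings the text lists.

```python
from typing import List, NamedTuple, Tuple


class Item(NamedTuple):
    name: str      # data name, e.g. a file name
    score: float   # phoneme accuracy against the search key, in percent


def split_for_display(items: List[Item], threshold: float = 60.0) -> Tuple[List[Item], List[Item]]:
    """Return (ranked, others): items at or above the threshold in score order,
    and the remaining items in name order."""
    ranked = sorted((it for it in items if it.score >= threshold),
                    key=lambda it: it.score, reverse=True)
    others = sorted((it for it in items if it.score < threshold),
                    key=lambda it: it.name)
    return ranked, others


# Made-up sample data for illustration.
ranked, others = split_for_display([
    Item("img_003.jpg", 75.0),
    Item("img_001.jpg", 20.0),
    Item("img_004.jpg", 91.7),
    Item("img_002.jpg", 41.5),
])
```

Here `ranked` would feed the rank-ordered area and `others` the separately ordered area of the display.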

FIG. 6 shows how the search data is displayed. In the figure, reference numeral 601 denotes a search result display window in which data (here, images) whose phoneme accuracy with respect to the search key is equal to or greater than the threshold are displayed in rank order. Reference numeral 602 denotes a data display window in which data whose phoneme accuracy is less than the threshold are displayed in an order other than rank order, such as name order or time order. In FIG. 6, the images in the data display window are arranged in name order. Pressing the musical note button plays the corresponding voice annotation. The user first looks at the data with high phoneme accuracy displayed in the search result display window 601 and, if the desired data is not there, looks for it among the data arranged by name, time, or the like in the data display window. In this embodiment, the data arranged in rank order and the other data are displayed in separate windows, but the present invention is not limited to this; for example, they may be displayed in separate areas of the same window. By arranging data with high subword accuracy in rank order and the other data in name order, time order, or the like, and displaying both side by side, the user can first look at a limited number of rank-ordered data and, if the desired data is not there, search by name or time as usual. This combined usage improves convenience.

The following describes how the phoneme accuracy threshold is set from the relationship between phoneme accuracy and data search performance, and why switching the display method by threshold processing is effective. FIG. 7 is a scatter diagram obtained by searching 1000 data items; it plots, for each desired correct data item, its phoneme accuracy with respect to the search key against the search rank it receives when ranked by phoneme accuracy. The figure shows that data whose phoneme accuracy exceeds 60% are concentrated at good ranks, and none of them drops far down the ranking. In contrast, data whose phoneme accuracy is below 60% drop sharply in rank, and the range of ranks varies widely from one data item to another. Correct data with a phoneme accuracy above 60% can therefore be retrieved robustly near the top, whereas data with a phoneme accuracy below 60% drop sharply in rank over a range that depends heavily on the data, so presenting them as search results is undesirable.

Taking advantage of the characteristic confirmed in this preliminary experiment, the phoneme accuracy threshold is set to 60%. Correct data whose phoneme accuracy exceeds 60% (in practice, all data whose phoneme accuracy exceeds 60%, since the system does not know which data is correct) are displayed in rank order in the search result display window 601. Correct data whose phoneme accuracy with respect to the search key exceeds the 60% threshold can thus be retrieved robustly at a high rank in the search result display window 601. For data whose phoneme accuracy does not exceed 60%, it is impossible to know in which range of ranks they would appear even if they were arranged by rank, and search efficiency would actually suffer. These data are therefore displayed in the data display window 602 according to a criterion other than the score order, such as name order of the data names, order of the time information held by the data, order of data size, or order of display size of the data.

In other words, if the system designer prepares in advance a search set in which the correct data for each search key is known and creates a scatter diagram like FIG. 7, an appropriate threshold can be set from the characteristic shape of the graph in FIG. 7, which reflects the relationship between phoneme accuracy and search performance. This enables a hybrid presentation to the user that combines a limited number of rank-ordered search results with an ordinary display ordered by name, time, or the like.
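The text leaves the choice of threshold to inspection of the scatter diagram. As one possible way to read a threshold off such labeled data automatically, the Python sketch below picks the lowest accuracy such that every labeled sample at or above it was ranked within an acceptable limit. The heuristic, the function name, the `max_rank` parameter, and the sample values are all assumptions made for illustration and are not described in the patent.

```python
def choose_threshold(samples, max_rank):
    """samples: (phoneme_accuracy, search_rank) pairs for known-correct data.
    Returns the lowest accuracy t such that every sample with accuracy >= t
    has rank <= max_rank, or None if even the top-scoring sample fails."""
    threshold = None
    # Walk from the highest accuracy downward; for equal accuracies the worse
    # rank comes first, so a tie containing a bad rank stops the walk early.
    for accuracy, rank in sorted(samples, reverse=True):
        if rank > max_rank:
            break
        threshold = accuracy
    return threshold


# Made-up labeled results: (accuracy in percent, rank among all data).
labeled = [(90, 1), (75, 3), (62, 5), (55, 120), (58, 4), (40, 300)]
threshold = choose_threshold(labeled, max_rank=10)
```

With these made-up samples the walk stops at the sample ranked 120th, so the threshold settles just above it. A designer would still want to check the result against the scatter diagram itself.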

FIG. 8 shows a hardware configuration for implementing the data search apparatus of the present invention. In the figure, reference numeral 801 denotes a screen display unit, such as a display, that displays data, GUI panels, and the like. Reference numeral 802 denotes a data input unit, such as a keyboard or a mouse, used to enter a search key or press GUI buttons. Reference numeral 803 denotes a sound output unit, such as a speaker, that outputs sounds such as voice annotation data and warning sounds. Reference numeral 804 denotes an external storage unit, such as a ROM or a hard disk, that holds the database 101 and the program of this data search method. Reference numeral 805 denotes a RAM that holds temporary information, such as programs and data, while the program of this data search method is executed. Reference numeral 806 denotes a CPU that executes the program of this data search method.

(Other Embodiments)
In the above embodiment, the phoneme accuracy obtained by phoneme string matching is used as the search score, but the present invention is not limited to this. For example, the accuracy may be obtained by matching syllables instead of phonemes, or by matching in units of words. The recognition likelihood 203 obtained by speech recognition may also be taken into account, or the score may be weighted using the similarity between phonemes (for example, “p” and “t” are highly similar). Further, in the above embodiment the phoneme accuracy obtained by matching the entire phoneme string, as shown in FIG. 5, is used as the search score, but the search may instead be performed by partial matching of the search key, using a score devised, for example, to suppress the degradation caused by insertion errors. In such an embodiment, when the annotation “Hakone no yama” has been assigned in the speech recognition result annotation data, the data can be retrieved by partial matching with “Hakone” or “yama” as the search key.

(Other Embodiments)
In the above embodiment, threshold processing of the phoneme accuracy is used to switch between arrangement by rank and another arrangement, with the two displayed in separate areas. However, the present invention is not limited to this and is applicable to any embodiment that switches the data display method by threshold processing of the phoneme accuracy. For example, data whose phoneme accuracy is less than the threshold may not be displayed at all, so that only data whose phoneme accuracy is equal to or greater than the threshold are displayed. Alternatively, only data whose phoneme accuracy is equal to or greater than the threshold may be displayed as large images, while data below the threshold are displayed only as small icons or link text.

(Other Embodiments)
The present invention may be applied to an apparatus consisting of a single device or to a system composed of a plurality of devices. Needless to say, the object of the present invention is also achieved by supplying a system or apparatus with a recording medium on which the program code of software that implements the functions of the above-described embodiments is recorded, and having a computer (or CPU or MPU) of the system or apparatus read and execute the program code stored on the recording medium. In this case, the program code itself read from the recording medium implements the functions of the above-described embodiments, and the recording medium on which the program code is recorded constitutes the present invention.

In the above embodiment, the case where the program is held in a ROM has been described, but the present invention is not limited to this and may be implemented using any storage medium. It may also be implemented by a circuit that performs the same operation.

Examples of recording media that can be used to supply the program code include a floppy (registered trademark) disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a magnetic tape, a nonvolatile memory card, and a ROM.

Needless to say, the present invention covers not only the case where the functions of the above-described embodiments are implemented by the computer executing the program code it has read, but also the case where an OS or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are implemented by that processing.

Needless to say, the present invention also covers the case where the program code read from the recording medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on the function expansion board or in the function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are implemented by that processing.

FIG. 1 is a functional block diagram of the data search apparatus of the present invention.
FIG. 2 is an example of the speech recognition result annotation data in the present invention.
FIG. 3 is an example of the search key input dialog in the search key input unit of the present invention.
FIG. 4 is a diagram showing the processing in the search key conversion unit of the present invention.
FIG. 5 is a diagram showing the phoneme matching in the search unit of the present invention.
FIG. 6 is an example of the search result display window and the data display window in the display unit of the present invention.
FIG. 7 is a scatter diagram showing the relationship between phoneme accuracy and data search rank.
FIG. 8 is a hardware configuration diagram for implementing the data search apparatus of the present invention.

Claims

A data search method for a data search apparatus that searches for desired data among a plurality of data stored in association with predetermined audio data and displays a search result on a display means, the method comprising:
an accumulation step of accumulating the plurality of data and the predetermined audio data associated with each of the plurality of data;
an acquisition step of acquiring a first phoneme string obtained by speech recognition of each of the audio data;
an input step of inputting a search key corresponding to a search condition in response to an operation by a user;
a conversion step of obtaining a second phoneme string by morphologically analyzing the search key to divide it into a word string and then assigning readings to the word string;
a determination step of determining a degree of correlation with the second phoneme string by performing phoneme matching on the first phoneme string obtained from each of the audio data; and
a display control step of controlling the display means so that the data associated with the audio data corresponding to a first phoneme string whose degree of correlation is equal to or greater than a predetermined threshold are displayed side by side in order of rank by the degree of correlation, and the data associated with the audio data corresponding to a first phoneme string whose degree of correlation is less than the threshold are displayed side by side in any one of the order of the names of the data, the order of time information included in the data, the order of the data sizes of the data, and the order of the display sizes of the data.
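The determination and display control steps of this claim can be illustrated with a minimal sketch. The claim does not fix how the degree of correlation is computed, so the sketch assumes a whole-string similarity derived from the Levenshtein edit distance; the names `Item`, `correlation`, and `order_for_display`, and the choice of name order for the below-threshold group, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str       # data name, used here for the below-threshold ordering
    phonemes: list  # first phoneme string (recognition result of the annotation)


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def correlation(first, second):
    """Assumed similarity in [0, 1]; 1.0 means identical phoneme strings."""
    longest = max(len(first), len(second))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(first, second) / longest


def order_for_display(items, key_phonemes, threshold):
    """Return (ranked, rest): items at or above the threshold sorted by
    descending correlation, and the remaining items sorted by name."""
    scored = [(correlation(it.phonemes, key_phonemes), it) for it in items]
    ranked = sorted((s for s in scored if s[0] >= threshold), key=lambda s: -s[0])
    rest = sorted((s for s in scored if s[0] < threshold), key=lambda s: s[1].name)
    return [it for _, it in ranked], [it for _, it in rest]
```

Calling `order_for_display(items, key_phonemes, 0.5)` returns two lists: the first holds the data to be shown in correlation order, the second the data to be shown in name order. A practical matcher would likely compare the search key against subsequences of the annotation rather than the whole string; that refinement is omitted here.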

A data search method for a data search apparatus that searches for desired data among a plurality of data stored in association with predetermined audio data and displays a search result on a display means, the method comprising:
an accumulation step of accumulating the plurality of data and the predetermined audio data associated with each of the plurality of data;
an acquisition step of acquiring a first word string obtained by speech recognition of each of the audio data;
an input step of inputting a search key corresponding to a search condition in response to an operation by a user;
a conversion step of obtaining a second word string by morphologically analyzing the search key;
a determination step of determining a degree of correlation with the second word string by performing word matching on the first word string obtained from each of the audio data; and
a display control step of controlling the display means so that the data associated with the audio data corresponding to a first word string whose degree of correlation is equal to or greater than a predetermined threshold are displayed side by side in order of rank by the degree of correlation, and the data associated with the audio data corresponding to a first word string whose degree of correlation is less than the threshold are displayed side by side in any one of the order of the names of the data, the order of time information included in the data, the order of the data sizes of the data, and the order of the display sizes of the data.
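For the word-based variant, the degree of correlation could be sketched as follows. The claim leaves the word-matching measure open, so this example assumes the simplest possible one: the fraction of search-key words that also occur in the recognized word string. The function name `word_correlation` is hypothetical, and both word strings are assumed to have been produced already by speech recognition and morphological analysis.

```python
def word_correlation(first_words, second_words):
    """Assumed measure: share of search-key words (second word string)
    that appear in the recognized annotation (first word string)."""
    if not second_words:
        return 0.0
    recognized = set(first_words)
    hits = sum(1 for word in second_words if word in recognized)
    return hits / len(second_words)
```

The resulting value can be compared against the threshold in the same way as a phoneme-based degree of correlation.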

The data search method according to claim 1, wherein, in the display control step, the display means is controlled so that the data associated with the audio data corresponding to a first phoneme string whose degree of correlation is equal to or greater than the predetermined threshold are arranged in order of rank by the degree of correlation and displayed in a first window, and the data associated with the audio data corresponding to a first phoneme string whose degree of correlation is less than the threshold are arranged in any one of the order of the names of the data, the order of time information included in the data, the order of the data sizes of the data, and the order of the display sizes of the data, and displayed in a second window different from the first window.
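The split into two windows with a selectable ordering for the below-threshold group might look like the sketch below. It assumes the degree of correlation has already been computed for each item; the `Entry` fields, the `FALLBACK_KEYS` table, and the use of a pixel count as the "display size" are illustrative assumptions, not details taken from the claims.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Entry:
    name: str
    taken_at: datetime  # time information included in the data
    size_bytes: int     # data size
    pixels: int         # display size, assumed here to be width * height
    score: float        # degree of correlation, computed elsewhere


# One sort key per ordering named in the claim (hypothetical labels).
FALLBACK_KEYS = {
    "name": lambda e: e.name,
    "time": lambda e: e.taken_at,
    "data_size": lambda e: e.size_bytes,
    "display_size": lambda e: e.pixels,
}


def split_into_windows(entries, threshold, fallback="name"):
    """Return the contents of the first window (ranked by correlation)
    and of the second window (ordered by the chosen fallback key)."""
    first = sorted((e for e in entries if e.score >= threshold),
                   key=lambda e: e.score, reverse=True)
    second = sorted((e for e in entries if e.score < threshold),
                    key=FALLBACK_KEYS[fallback])
    return first, second
```

Keeping the fallback orderings in a lookup table makes the choice among the four orderings a single parameter rather than separate code paths.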

A program that causes a computer to execute the data search method according to any one of claims 1 to 3.

A computer-readable storage medium storing the program according to claim 4.

A data search apparatus that searches for desired data among a plurality of data stored in association with predetermined audio data and displays a search result on a display means, the apparatus comprising:
storage means for storing the plurality of data and the predetermined audio data associated with each of the plurality of data;
acquisition means for acquiring a first phoneme string obtained by speech recognition of each of the audio data;
input means for inputting a search key corresponding to a search condition in response to an operation by a user;
conversion means for obtaining a second phoneme string by morphologically analyzing the search key to divide it into a word string and then assigning readings to the word string;
determination means for determining a degree of correlation with the second phoneme string by performing phoneme matching on the first phoneme string obtained from each of the audio data; and
display control means for controlling the display means so that the data associated with the audio data corresponding to a first phoneme string whose degree of correlation is equal to or greater than a predetermined threshold are displayed side by side in order of rank by the degree of correlation, and the data associated with the audio data corresponding to a first phoneme string whose degree of correlation is less than the threshold are displayed side by side in any one of the order of the names of the data, the order of time information included in the data, the order of the data sizes of the data, and the order of the display sizes of the data.

A data search apparatus that searches for desired data among a plurality of data stored in association with predetermined audio data and displays a search result on a display means, the apparatus comprising:
storage means for storing the plurality of data and the predetermined audio data associated with each of the plurality of data;
acquisition means for acquiring a first word string obtained by speech recognition of each of the audio data;
input means for inputting a search key corresponding to a search condition in response to an operation by a user;
conversion means for obtaining a second word string by morphologically analyzing the search key;
determination means for determining a degree of correlation with the second word string by performing word matching on the first word string obtained from each of the audio data; and
display control means for controlling the display means so that the data associated with the audio data corresponding to a first word string whose degree of correlation is equal to or greater than a predetermined threshold are displayed side by side in order of rank by the degree of correlation, and the data associated with the audio data corresponding to a first word string whose degree of correlation is less than the threshold are displayed side by side in any one of the order of the names of the data, the order of time information included in the data, the order of the data sizes of the data, and the order of the display sizes of the data.