JP2002229584A

JP2002229584A - Speech recognition method, speech information search method, program, recording medium, speech recognition system, speech recognition server computer, and speech information search server computer

Info

Publication number: JP2002229584A
Application number: JP2001023494A
Authority: JP
Inventors: Ryuta Terajima; 立太寺嶌; Toshihiro Wakita; 敏裕脇田
Original assignee: Toyota Central R&D Labs Inc
Current assignee: Toyota Central R&D Labs Inc
Priority date: 2001-01-31
Filing date: 2001-01-31
Publication date: 2002-08-16

Abstract

(57) [Abstract]
[Problem] In a method of recognizing, by computer, speech uttered by a speaker, to improve the accuracy of speech recognition by considering not only the speech itself but also other factors related to the utterance.
[Solution] In a recognition dictionary memory 80 used by a speech recognition unit 82, multiple types of speech recognition dictionaries, each including a language model and an acoustic model, are stored in advance in association with multiple types of personal attributes assumed in advance for the speaker and with the attributes of multiple types of environments in which the speaker is assumed in advance to be placed when uttering speech. Based on the actual attributes of the speaker who utters the speech to be recognized and the attributes of the environment in which the speaker is actually placed when uttering it, the speech recognition unit 82 selects, in the recognition dictionary memory 80, the language model and acoustic model suited to the speaker, and recognizes the speech uttered by the speaker by using the selected models.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a technique for recognizing, by computer, speech uttered by a specific or unspecified speaker.

[0002]

[Description of the Related Art] Techniques for recognizing, by computer, speech uttered by a speaker already exist. These speech recognition techniques include the specific-speaker (speaker-dependent) method, which recognizes only speech uttered by speakers registered in advance, and the unspecified-speaker (speaker-independent) method, which places no restriction on the speaker.

[0003] In either method, speech recognition generally recognizes the actual speech by applying a speech recognition dictionary to actual speech data representing the speech actually uttered by the speaker.

[0004] One existing application of such speech recognition technology is speech information search, in which information is searched for by voice.

[0005] One conventional example of such speech information search is described in Japanese Patent Application Laid-Open No. 2000-216888. In this conventional example, when a user makes a spoken inquiry to a client computer, the inquiry information is transmitted to a server computer connected to that client computer via a network. On receiving the inquiry information, the server computer performs speech recognition on it and searches a specific database for information matching the speech recognition result. The retrieved information is transmitted from the server computer to the client computer, so that the user obtains the required information at the client computer.
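The request flow of this conventional example can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and function names (`Server`, `handle_inquiry`, `client_inquiry`) and the database contents are hypothetical, and the recognizer is a stand-in that treats the "audio" as UTF-8 text so that the flow can run end to end.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Server:
    """Hypothetical server holding a recognizer and an information database."""

    database: Dict[str, str]  # recognized inquiry text -> information to return

    def recognize(self, inquiry_audio: bytes) -> str:
        # Stand-in for real speech recognition (an assumption of this sketch):
        # the "audio" is simply UTF-8 text that is decoded and normalized.
        return inquiry_audio.decode("utf-8").strip().lower()

    def handle_inquiry(self, inquiry_audio: bytes) -> str:
        # Recognition runs on the server, then the database is searched
        # for information matching the recognition result.
        text = self.recognize(inquiry_audio)
        return self.database.get(text, "no matching information")


def client_inquiry(server: Server, spoken: str) -> str:
    # The client only forwards the inquiry and receives the retrieved result.
    return server.handle_inquiry(spoken.encode("utf-8"))


server = Server(database={"nearest parking lot": "Lot A, 300 m ahead"})
print(client_inquiry(server, "Nearest parking lot"))
```

The point of the sketch is the division of labor: the client forwards the inquiry, while recognition and database search both take place on the server.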

[0006]

[Problems to Be Solved by the Invention] In speech recognition, accuracy generally depends on the performance of the speech recognition dictionary. Improving the accuracy of speech recognition therefore requires improving the performance of the speech recognition dictionary.

[0007] As a result of research on speech recognition dictionaries, the present inventor obtained the following findings.

[0008] When a speech recognition dictionary is used, the multiple types of speech data stored in it in advance (assumed speech data) are searched for the one most similar to the actual speech data, and the actual speech is thereby recognized. However, when speech recognition relies on the actual speech data alone, it is difficult to improve the accuracy of speech recognition.
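The "most similar assumed speech data" search described above can be pictured as a nearest-neighbor lookup. The sketch below is a simplification under stated assumptions: the patent does not specify the features or the similarity measure, so the three-element feature vectors, the Euclidean distance, and the labels are all hypothetical.

```python
import math
from typing import Dict, List


def distance(a: List[float], b: List[float]) -> float:
    # Euclidean distance, used here as an assumed similarity measure.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(actual: List[float], assumed: Dict[str, List[float]]) -> str:
    # Return the label of the assumed speech data closest to the actual data.
    return min(assumed, key=lambda label: distance(actual, assumed[label]))


# Hypothetical assumed speech data stored in the dictionary in advance.
ASSUMED_SPEECH_DATA = {
    "parking": [0.9, 0.1, 0.3],
    "restaurant": [0.2, 0.8, 0.5],
    "hotel": [0.4, 0.4, 0.9],
}

print(recognize([0.85, 0.2, 0.25], ASSUMED_SPEECH_DATA))
```

Because the decision uses only the actual speech data, two entries that sit close together in feature space are easy to confuse, which is the limitation the paragraph points out.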

[0009] Speech recognition technology is sometimes used without any restriction on its application, but it is also sometimes used for a restricted application. In the latter case, a definite tendency can sometimes be found readily in the speech uttered by speakers.

[0010] For example, when speech recognition technology is used for speech information search, and that search is designed to provide information about the geographical environment in response to information search requests from users (for example, guiding users to the location of facilities or providing congestion information for roads), the content of the inquiry information that a speaker utters for the search is naturally limited. For instance, the content of the inquiry information depends on the environment in which the speaker is placed when uttering it. Consequently, if the speaker's environment differs, the linguistic content (meaning) of the speech uttered by the speaker differs accordingly, and if the speaker's environment is known, the linguistic content (meaning) of that speech can be predicted with reasonable accuracy.
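One way to picture how knowledge of the environment narrows the likely linguistic content is to weight each candidate phrase by an environment-dependent prior. The following is a minimal sketch, not the patent's method: the environments, phrases, probabilities, and the log-linear combination are all assumptions chosen for illustration.

```python
import math
from typing import Dict

# Hypothetical priors: how likely each phrase is in each environment.
ENVIRONMENT_PRIORS: Dict[str, Dict[str, float]] = {
    "expressway": {"traffic jam": 0.6, "rest area": 0.3, "roller coaster": 0.1},
    "amusement park": {"traffic jam": 0.1, "rest area": 0.2, "roller coaster": 0.7},
}


def rescore(acoustic_scores: Dict[str, float], environment: str) -> str:
    # Combine the acoustic likelihood with the environment prior in the
    # log domain and return the best-scoring phrase.
    prior = ENVIRONMENT_PRIORS[environment]
    combined = {
        phrase: math.log(p_acoustic) + math.log(prior[phrase])
        for phrase, p_acoustic in acoustic_scores.items()
    }
    return max(combined, key=combined.get)


# Acoustically ambiguous input: two phrases tie on acoustic evidence alone.
acoustic = {"traffic jam": 0.35, "rest area": 0.30, "roller coaster": 0.35}
print(rescore(acoustic, "expressway"))
print(rescore(acoustic, "amusement park"))
```

With identical acoustic evidence, the chosen phrase changes with the environment, which mirrors the observation that the environment lets the content be predicted to some extent.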

[0011] Furthermore, the acoustic characteristics of the speech uttered by a speaker for information search also depend on the environment in which the speaker is placed. For example, when the speaker is driving a car on an expressway, the running noise of the car is louder than when the car is stopped in a parking lot, and as a result the running noise mixed into the speaker's speech as background sound is also louder.
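The effect of louder running noise can be quantified with the signal-to-noise ratio (SNR), a common measure that the patent itself does not mention. The sketch below assumes arbitrary illustrative power values for the speech and for the noise in each driving state; none of the numbers come from the source.

```python
import math


def snr_db(speech_power: float, noise_power: float) -> float:
    # Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise).
    return 10.0 * math.log10(speech_power / noise_power)


SPEECH_POWER = 1.0  # assumed, in arbitrary units
NOISE_POWER = {"parked": 0.01, "expressway": 0.5}  # assumed illustrative values

for state, noise in NOISE_POWER.items():
    print(f"{state}: {snr_db(SPEECH_POWER, noise):.1f} dB")
```

The same utterance reaches the recognizer at a much lower SNR on the expressway than in the parking lot, so an acoustic model matched to one condition fits the other poorly.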

[0012] Based on these findings, the present inventor further found that, when the linguistic content and acoustic characteristics of speech uttered by a speaker are expected to depend on the environment in which the speaker is placed (the utterance environment), it is best to configure the speech recognition dictionary so that the final speech recognition result is obtained by taking into account not only the actual speech from the speaker but also that environment.

[0013] The findings described above are valid regardless of whether the speech recognition method is the specific-speaker method or the unspecified-speaker method. For the unspecified-speaker method, however, the present inventor also noticed, focusing on the fact that the acoustic characteristics of the speech to be recognized depend on the personal attributes (for example, gender and age) of the target speaker who utters that speech, that it is likewise effective to configure the speech recognition dictionary so that the final speech recognition result is obtained by taking into account not only the actual speech from the speaker but also the speaker's personal attributes.
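A dictionary organized around both kinds of attributes can be pictured as a table keyed by a (personal attribute, environment attribute) pair, where each entry bundles an acoustic model and a language model as in the abstract. This is a minimal sketch under stated assumptions: the attribute labels, model names, and the `select_dictionary` helper are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RecognitionDictionary:
    acoustic_model: str  # hypothetical identifier of an acoustic model
    language_model: str  # hypothetical identifier of a language model


# Dictionaries prepared in advance, one per assumed attribute combination.
DICTIONARIES: Dict[Tuple[str, str], RecognitionDictionary] = {
    ("adult male", "expressway"): RecognitionDictionary("am_male_highway", "lm_road"),
    ("adult male", "parking lot"): RecognitionDictionary("am_male_quiet", "lm_facility"),
    ("adult female", "expressway"): RecognitionDictionary("am_female_highway", "lm_road"),
    ("adult female", "parking lot"): RecognitionDictionary("am_female_quiet", "lm_facility"),
}


def select_dictionary(personal: str, environment: str) -> RecognitionDictionary:
    # Pick the dictionary matching the speaker's actual attributes.
    try:
        return DICTIONARIES[(personal, environment)]
    except KeyError:
        raise LookupError(f"no dictionary prepared for {(personal, environment)}")


chosen = select_dictionary("adult female", "expressway")
print(chosen.acoustic_model, chosen.language_model)
```

In this arrangement the selection happens before recognition starts, so the recognizer itself only ever sees the one dictionary matched to the current speaker and environment.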

[0014]

[Means for Solving the Problems and Effects of the Invention] Against this background, the present invention was made with the object of improving the accuracy of speech recognition by considering not only the speech uttered by a speaker but also other factors related to the utterance, and the present invention provides the following aspects. Like the claims, each aspect is set out as a numbered item that cites the numbers of other items where necessary. This is intended to facilitate understanding of some of the technical features described in this specification and of some of their combinations; the technical features described in this specification and their combinations should not be construed as limited to the following aspects.

[0015] (1) A speech recognition method for recognizing, by computer, speech uttered by a specific or unspecified speaker, the method comprising: a first step of selecting, based on actual environment attribute data representing an actual environment attribute, which is the attribute of the environment in which the speaker is actually placed when uttering speech, the speech recognition dictionary corresponding to the actual environment attribute from among multiple types of speech recognition dictionaries prepared in advance in association with the respective attributes of multiple types of environments in which the speaker is assumed in advance to be placed when uttering speech; and a second step of recognizing the actual speech by using the selected speech recognition dictionary based on actual speech data representing the actual speech actually uttered by the speaker, and creating speech recognition result data representing the recognition result [Claim 1]. In this method, multiple types of speech recognition dictionaries are prepared in advance in association with the attributes of multiple types of environments in which a speaker is assumed in advance to be placed when uttering speech. Then, based on actual environment attribute data representing the actual environment attribute, which is the attribute of the environment in which the speaker is actually placed when uttering speech, the speech recognition dictionary corresponding to the actual environment attribute is selected from among the multiple types of speech recognition dictionaries. Further, the actual speech is recognized by using the selected speech recognition dictionary based on actual speech data representing the actual speech actually uttered by the speaker, and speech recognition result data representing the recognition result is created.
Therefore, according to this method, the actual speech is recognized by considering not only the speaker's actual speech but also the attributes of the environment in which the speaker was actually placed at the time of utterance, so that the accuracy of speech recognition can be improved more easily than when speech recognition considers the actual speech alone. In this item, "environment" can be interpreted to include, for example, the geographical environment in which the speaker is placed. Attributes of the geographical environment include, for example, tourist spots, downtown areas, amusement parks, and expressways. "Environment" can further be interpreted to include the acoustic environment in which the speaker is placed. Attributes of the acoustic environment include, for example, places where background sounds such as natural sounds and noise are present, and vehicles that carry and move the speaker. When the speaker is inside a vehicle, the running noise of the vehicle is a factor that determines the attribute of the acoustic environment. Because the running noise of a vehicle depends on factors such as vehicle type, travel speed, and road surface conditions, these factors also determine the attribute of the acoustic environment.

(2) The speech recognition method according to item (1), further comprising a third step of training the multiple types of speech recognition dictionaries so that their contents reflect the relationship among mutually corresponding actual environment attribute data, actual speech data, and speech recognition result data [Claim 2]. According to this method, the speech recognition dictionaries are trained so as to reflect not only mutually corresponding actual speech data and speech recognition result data but also the corresponding actual environment attribute data, which makes it easy to adapt the contents of the dictionaries reliably to new speech recognition situations.

(3) A speech recognition method for recognizing, by computer, speech uttered by an unspecified speaker, the method comprising: a first step of selecting, based on actual personal attribute data representing an actual personal attribute, which is the actual attribute of the target speaker who utters the speech to be recognized, and on actual environment attribute data representing an actual environment attribute, which is the attribute of the environment in which the target speaker is actually placed when uttering that speech, the speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute from among multiple types of speech recognition dictionaries prepared in advance in association with the respective attributes of multiple types of environments in which the speaker is assumed in advance to be placed when uttering speech and with the respective multiple types of personal attributes assumed in advance for the speaker; and a second step of recognizing the actual speech data by using the selected speech recognition dictionary based on actual speech data representing the actual speech actually uttered by the target speaker, and creating speech recognition result data representing the recognition result [Claim 3]. In this method, in view of the adoption of the unspecified-speaker method, multiple types of speech recognition dictionaries are prepared in advance in association with multiple types of personal attributes assumed in advance for the speaker and with the attributes of multiple types of environments assumed in advance for the speaker. Then, based on actual personal attribute data representing the actual personal attribute of the target speaker who utters the speech to be recognized, and on actual environment attribute data representing the attribute of the environment in which the target speaker is actually placed when uttering that speech, the speech recognition dictionary corresponding to the target speaker's actual personal attribute data and actual environment attribute data is selected from among the multiple types of speech recognition dictionaries.
Further, in this method, the actual speech data is recognized by using the selected speech recognition dictionary based on actual speech data representing the actual speech actually uttered by the target speaker, and speech recognition result data representing the recognition result is created. Therefore, according to this method, the actual speech is recognized by considering not only the speaker's actual speech but also the speaker's personal attributes and the attributes of the environment in which the speaker was actually placed at the time of utterance, so that the accuracy of speech recognition can be improved more easily than when speech recognition considers the actual speech alone. In this item, "personal attribute" can be interpreted to include at least one of, for example, profile, gender, and age. "Environment" in this item can be interpreted in the same way as in item (1).

(4) The speech recognition method according to item (3), further comprising a third step of training the multiple types of speech recognition dictionaries so that their contents reflect the relationship among mutually corresponding actual speech data, speech recognition result data, actual personal attribute data, and actual environment attribute data [Claim 4]. According to this method, the speech recognition dictionaries are trained so as to reflect not only mutually corresponding actual speech data and speech recognition result data but also the corresponding actual personal attribute data and actual environment attribute data, which makes it easy to adapt the contents of the dictionaries reliably to new speech recognition situations.

(5) The speech recognition method according to any one of items (1) to (4), wherein each type of speech recognition dictionary includes an acoustic model and a language model. According to this method, the actual speech is recognized by considering both the acoustic characteristics of the actual speech uttered by the speaker and the linguistic constraints among the words in the character string corresponding to that speech.

(6) A speech information search method in which a server computer capable of communicating with multiple client computers, each used by one of multiple users who wish to search for required information by voice, searches for information matching a spoken information search request from an unspecified user and transmits it to each client computer, the method comprising: a first step of selecting, based on actual personal attribute data representing an actual personal attribute, which is the actual attribute of the target speaker who utters the speech to be recognized, and on actual environment attribute data representing an actual environment attribute, which is the attribute of the environment in which the target speaker is actually placed when uttering that speech, the speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute from among multiple types of speech recognition dictionaries prepared in advance in association with the respective attributes of multiple types of environments in which the speaker is assumed in advance to be placed when uttering speech and with the respective multiple types of personal attributes assumed in advance for the speaker; a second step of recognizing the actual speech data by using the selected speech recognition dictionary based on actual speech data representing the actual speech actually uttered by the target speaker, and creating speech recognition result data representing the recognition result; and a third step of searching, based at least on the created speech recognition result data, a provided-information database in which multiple types of provided information that can be provided to the users are stored in advance in association with multiple types of speech recognition result data assumed in advance, for the provided information corresponding to the created speech recognition result data, and transmitting the retrieved provided information to the client computer used by the target user [Claim 5]. According to this method, following basically the same principle as the method of item (3), the server computer recognizes the actual speech by considering not only the user's actual speech but also the user's personal attributes and the attributes of the environment in which the user was actually placed at the time of utterance, so that the accuracy of speech recognition can be improved more easily than when speech recognition considers the actual speech alone.
Furthermore, according to this method, the information the user wants can be searched for accurately and provided to the user based on the highly accurate speech recognition result. In this item, communication between "the client computer and the server computer" may use radio waves, or may use a dedicated or public telephone line. Whichever of these communication methods is used, the "communication" may take place over an open network such as the Internet or over a closed network such as a LAN or WAN. This interpretation may also apply to the items below.

(7) The speech information search method according to item (6), wherein the third step includes a step of searching the provided-information database for the provided information corresponding to the created speech recognition result data, based on the created speech recognition result data and by referring to at least one of the actual personal attribute data and the actual environment attribute data [Claim 6]. In this method, the information the user wants is searched for in the provided-information database based not only on the speech recognition result data but also on at least one of the actual personal attribute data and the actual environment attribute data. Therefore, according to this method, the accuracy of the search can be improved more easily than when the information the user wants is searched for based on the speech recognition result data alone. Furthermore, according to this method, at least one of the actual personal attribute data and the actual environment attribute data can be used effectively for both speech recognition and information search.

(8) The speech information search method according to item (6) or (7), wherein the multiple types of provided information include multiple types of geographical-environment-related information concerning the geographical environment [Claim 7]. In this item, "geographical environment" can be interpreted to include, for example, environments relating to facilities such as amusement parks, lodging facilities, stores, factories, event venues, parking lots, parks, museums, art museums, and public facilities, or to include environments relating to roads such as ordinary roads, expressways, gravel roads, and snow-covered roads.

(9) A program executed by a computer to carry out the method according to any one of items (1) to (8) [Claim 8].
When this program is executed by a computer, the same operation and effects as those of the method according to any one of items (1) to (8) can be achieved. The program according to this item can be interpreted to include not only the combination of instructions executed by the computer to perform its functions but also the files and data processed by those instructions.

(10) A recording medium on which the program according to item (9) is recorded in a computer-readable manner [Claim 9]. When the program recorded on this recording medium is executed by a computer, the same operation and effects as those of the method according to any one of items (1) to (8) can be achieved. The "recording medium" in this item can take various forms; for example, at least one of a magnetic recording medium such as a floppy (registered trademark) disk, an optical recording medium such as a CD or CD-ROM, a magneto-optical recording medium such as an MO, and non-removable storage such as a ROM can be used.

(11) A speech recognition system for recognizing, by computer, speech uttered by an unspecified speaker, the system comprising: multiple types of speech recognition dictionaries prepared in advance in association with the respective attributes of multiple types of environments in which the speaker is assumed in advance to be placed when uttering speech and with the respective multiple types of personal attributes assumed in advance for the speaker; and recognition means for selecting, based on actual personal attribute data representing an actual personal attribute, which is the actual attribute of the target speaker who utters the speech to be recognized, and on actual environment attribute data representing an actual environment attribute, which is the attribute of the environment in which the target speaker is actually placed when uttering that speech, the speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute from among the multiple types of speech recognition dictionaries, and for recognizing the actual speech data by using the selected speech recognition dictionary based on actual speech data representing the actual speech actually uttered by the target speaker, and creating speech recognition result data representing the recognition result [Claim 10]. According to this system, following basically the same principle as the method of item (3), the actual speech is recognized by considering not only the speaker's actual speech but also the speaker's personal attributes and the attributes of the environment in which the speaker was actually placed at the time of utterance, so that the accuracy of speech recognition can be improved more easily than when speech recognition considers the actual speech alone.
(12) A server computer that recognizes a voice uttered by an unspecified user by communicating with a plurality of client computers respectively used by a plurality of users who utter voices to be recognized, the server computer including: a plurality of types of speech recognition dictionaries prepared in advance in association with a plurality of types of personal attributes assumed in advance for a speaker and a plurality of types of attributes of environments in which the speaker is assumed in advance to be placed when uttering a voice; and a recognition unit that selects, from among the plurality of types of speech recognition dictionaries, a speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute, based on actual personal attribute data representing the actual personal attribute, which is the actual attribute of a target speaker who utters the voice to be recognized, and on actual environment attribute data representing the actual environment attribute, which is the attribute of the environment in which the target speaker is actually placed when uttering the voice, recognizes the actual speech data representing the actual speech actually uttered by the target speaker by using the selected speech recognition dictionary, and creates speech recognition result data representing the recognition result. According to this server computer, on basically the same principle as the method according to the above item (3), the actual speech is recognized by taking into account not only the actual voice of the user but also the attributes of the user and of the environment in which the user was actually placed when uttering the voice, so the accuracy of voice recognition can be easily improved.
(13) A voice information search server computer that communicates with a plurality of client computers respectively used by a plurality of users who wish to search for necessary information by voice, and thereby searches for information matching a voice information search request from an unspecified user and transmits it to the corresponding client computer, the server computer including: a plurality of types of speech recognition dictionaries prepared in advance in association with a plurality of types of personal attributes assumed in advance for a speaker and a plurality of types of attributes of environments in which the speaker is assumed in advance to be placed when uttering a voice; a provided information database in which a plurality of types of provided information that can be provided to the users are stored in advance in association with respective speech recognition result data; recognition means that selects, from among the plurality of types of speech recognition dictionaries, a speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute, based on actual personal attribute data representing the actual personal attribute, which is the actual attribute of a target user who utters by voice the information needed to search for the necessary information, and on actual environment attribute data representing the actual environment attribute, which is the attribute of the environment in which the target user is actually placed when uttering the voice, recognizes the actual speech data representing the actual speech actually uttered by the target user by using the selected speech recognition dictionary, and creates speech recognition result data representing the recognition result; and information retrieval means that searches the provided information database for the provided information corresponding to the created speech recognition result data and transmits the retrieved provided information to the client computer used by the target user [Claim 12]. According to this server computer, on basically the same principle as the method according to the above item (3), the actual voice is recognized by taking into account not only the actual voice of the user but also the attributes of the user and of the environment in which the user was actually placed when uttering the voice, so the accuracy of voice recognition can be improved more easily than when voice recognition considers only the actual voice. Further, according to this server computer, the information desired by the user can be retrieved with high accuracy based on the highly accurate speech recognition result and provided to the user.
(14) The voice information search server computer according to (13), wherein the information retrieval means includes means for searching for the provided information corresponding to the created speech recognition result data, based on that data and with reference to at least one of the actual personal attribute data and the actual environment attribute data. According to this server computer, as with the method according to the above item (7), the accuracy of the search can be improved more easily than when the information desired by the user is searched for based only on the created speech recognition result data. Further, according to this server computer, at least one of the actual personal attribute data and the actual environment attribute data can be used effectively for both voice recognition and information retrieval, as in the method according to the above item (7).

【００１６】[0016]

【発明の実施の形態】以下、本発明のさらに具体的な一
実施形態を図面に基づいて詳細に説明する。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, a more specific embodiment of the present invention will be described in detail with reference to the drawings.

【００１７】図１には、本実施形態に従う音声情報検索
方法を実施するための音声情報検索システムが機能ブロ
ック図で示されている。その音声情報検索方法は、本発
明の別の側面の一実施形態に従う音声認識方法を利用す
るものである。また、その音声情報検索システムは、本
発明のさらに別の側面に従う音声認識システムを利用す
るとともに、本発明のさらに別の側面に従う音声認識用
サーバ・コンピュータを利用し、さらに、本発明のさら
に別の側面に従う音声情報検索用サーバ・コンピュータ
を利用するものである。FIG. 1 is a functional block diagram showing a voice information search system for implementing a voice information search method according to the present embodiment. The voice information search method utilizes a voice recognition method according to an embodiment of another aspect of the present invention. The voice information search system, in turn, uses a voice recognition system according to still another aspect of the present invention, a voice recognition server computer according to yet another aspect of the present invention, and a voice information search server computer according to yet another aspect of the present invention.

【００１８】本実施形態に従う音声情報検索方法（以
下、単に「検索方法」という）は、図１に示すように、
複数の利用者（図においては一人の利用者のみが代表的
に示されている）と、音声情報検索センタに設置された
サーバ・コンピュータ１０（これが上記音声認識用サー
バ・コンピュータの一例であるとともに上記音声情報検
索用サーバ・コンピュータの一例である）との間におい
て使用される。各利用者は、各クライアント・コンピュ
ータ１２を介してサーバ・コンピュータ１０と通信する
ことにより、そのサーバ・コンピュータ１０から、自分
が希望する情報を取得する。As shown in FIG. 1, the voice information search method according to the present embodiment (hereinafter simply referred to as the “search method”) is used between a plurality of users (only one user is shown in the figure as a representative) and a server computer 10 installed in a voice information search center (this is an example of both the voice recognition server computer and the voice information search server computer described above). Each user obtains the information he or she desires from the server computer 10 by communicating with it via a client computer 12.

【００１９】各クライアント・コンピュータ１２は、図
１に示すように、音声入力部２０と、情報提示部２２と
を含むように構成される。As shown in FIG. 1, each client computer 12 is configured to include a voice input unit 20 and an information presenting unit 22.

【００２０】音声入力部２０は、情報検索のためにサー
バ・コンピュータ１０にとって必要な情報である問い合
わせ情報を利用者が音声により発した場合にその音声を
取り込んで音声データ（実際音声データ）を作成する部
分である。The voice input unit 20 is the part that, when the user utters by voice the inquiry information that the server computer 10 needs for the information search, captures that voice and creates voice data (actual voice data).

【００２１】これに対して、情報提示部２２は、利用者
から音声により発せられた問い合わせ情報に適合した情
報がサーバ・コンピュータ１０により検索されて送信さ
れてきた場合に、その情報を視覚的または／および聴覚
的に利用者に提示する部分である。In contrast, the information presenting unit 22 is the part that, when information matching the inquiry information uttered by the user has been retrieved and transmitted by the server computer 10, presents that information to the user visually and/or audibly.

【００２２】各クライアント・コンピュータ１２は、さ
らに、図１に示すように、位置取得部２４と、ＩＤ記憶
部２６とを含むように構成される。Each client computer 12 is further configured to include a position acquisition unit 24 and an ID storage unit 26, as shown in FIG.

【００２３】位置取得部２４は、地球上における利用者
の現在位置を取得する部分である。これに対して、ＩＤ
記憶部２６は、利用者本人の識別情報である利用者ＩＤ
を記憶している部分である。The position acquisition unit 24 is the part that acquires the user's current position on the earth. The ID storage unit 26, in contrast, is the part that stores a user ID, which is identification information of the user.

【００２４】クライアント・コンピュータ１２は、それ
ら４つの機能、すなわち、音声入力、情報提示、位置取
得およびＩＤ記憶を果たすため、図２に示すように構成
される。クライアント・コンピュータ１２は、よく知ら
れているように、プロセシング・ユニット３０（以下、
「ＰＵ」と略称する）とメモリ３２とがバス３４により
互いに接続されて構成されたコンピュータ部３６を含む
ように構成される。メモリ３２は、ＲＯＭ，ＲＡＭ，磁
気ディスク，光ディスク等の記録媒体を含むように構成
される。このメモリ３２には、音声入力プログラムと利
用者ＩＤとが予め記憶されている。メモリ３２のうち利
用者ＩＤを記憶する部分が前記ＩＤ記憶部２６の一例で
ある。The client computer 12 is configured as shown in FIG. 2 to perform these four functions, namely voice input, information presentation, position acquisition, and ID storage. As is well known, the client computer 12 includes a computer unit 36 in which a processing unit 30 (hereinafter abbreviated as “PU”) and a memory 32 are connected to each other by a bus 34. The memory 32 includes recording media such as a ROM, a RAM, a magnetic disk, and an optical disk. A voice input program and a user ID are stored in the memory 32 in advance. The part of the memory 32 that stores the user ID is an example of the ID storage unit 26.

【００２５】クライアント・コンピュータ１２は、さら
に、図２に示すように、送受信部４０を備えており、そ
の送受信部４０により、サーバ・コンピュータ１０と電
波により通信することが可能となっている。As shown in FIG. 2, the client computer 12 further includes a transmission/reception unit 40, by which it can communicate with the server computer 10 by radio waves.

【００２６】クライアント・コンピュータ１２は、さら
に、図２に示すように、グローバル・ポジショニング・
システム４２（以下、「ＧＰＳ」と略称する）を備えて
いる。ＧＰＳ４２は、よく知られているように、人工衛
星を利用してＧＰＳ４２の現在位置を測定する装置であ
る。ＧＰＳ４２は、前記位置取得部２４の一例である。As shown in FIG. 2, the client computer 12 further includes a global positioning system 42 (hereinafter abbreviated as “GPS”). As is well known, the GPS 42 is a device that measures its own current position using artificial satellites. The GPS 42 is an example of the position acquisition unit 24.

【００２７】クライアント・コンピュータ１２は、さら
に、利用者により操作されるスイッチ４６と、利用者か
ら発せられた音声を入力するマイク４８と、情報を画面
に表示するディスプレイ５０と、情報を音により出力す
るスピーカ５２とを備えている。マイク４８は前記音声
入力部２０の一例であり、ディスプレイ５０およびスピ
ーカ５２はそれぞれ前記情報提示部２２の一例である。The client computer 12 further includes a switch 46 operated by the user, a microphone 48 for inputting the voice uttered by the user, a display 50 for displaying information on a screen, and a speaker 52 for outputting information as sound. The microphone 48 is an example of the voice input unit 20, and the display 50 and the speaker 52 are each an example of the information presenting unit 22.

【００２８】近年、図３に示すように、車両としての自
動車にナビゲーション装置６０を搭載することが普及し
てきている。ナビゲーション装置６０は、一般に、共に
図示しないコンピュータおよびＧＰＳを搭載するととも
に、スイッチ６２、ディスプレイ６４およびスピーカ６
６を含むように構成される。さらに、そのようなナビゲ
ーション装置６０に携帯電話機またはＰＨＳである移動
電話機７０を接続して、電波による外部との送受信を可
能にすることも普及してきている。移動電話機７０は、
話者からの音声が入力される受話部７２を備えている。In recent years, as shown in FIG. 3, it has become common to mount a navigation device 60 on an automobile. The navigation device 60 is generally equipped with a computer and a GPS (neither shown) and includes a switch 62, a display 64, and a speaker 66. It has also become common to connect a mobile telephone 70, which is a cellular phone or a PHS handset, to such a navigation device 60 to enable transmission to and reception from the outside by radio waves. The mobile telephone 70 includes a receiver 72 into which the speaker's voice is input.

【００２９】それらナビゲーション装置６０および移動
電話機７０を利用することにより、クライアント・コン
ピュータ１２を実現することが可能である。この場合、
クライアント・コンピュータ１２のコンピュータ部３６
は上記ナビゲーション装置６０のコンピュータ、ＧＰＳ
４２は上記ナビゲーション装置６０のＧＰＳ、スイッチ
４６は上記スイッチ６２、マイク４８は上記移動電話機
７０の受話部７２、ディスプレイ５０は上記ディスプレ
イ６４、スピーカ５２は上記スピーカ６６、送受信部４
０は上記ナビゲーション装置６０の受信部および移動電
話機７０の送受信部によりそれぞれ構成することが可能
である。このようにそれらナビゲーション装置６０およ
び移動電話機７０を利用することによってクライアント
・コンピュータ１２を実現する場合には、この検索方法
を自動車において実施可能とするために特別のハードウ
エア資源を自動車に追加することが不可欠ではなくな
り、よって、この検索方法を安価に実施可能となる。By using the navigation device 60 and the mobile telephone 70, the client computer 12 can be realized. In this case, the computer unit 36 of the client computer 12 can be constituted by the computer of the navigation device 60, the GPS 42 by the GPS of the navigation device 60, the switch 46 by the switch 62, the microphone 48 by the receiver 72 of the mobile telephone 70, the display 50 by the display 64, the speaker 52 by the speaker 66, and the transmission/reception unit 40 by the receiving unit of the navigation device 60 and the transmission/reception unit of the mobile telephone 70. When the client computer 12 is realized in this way by using the navigation device 60 and the mobile telephone 70, it is no longer indispensable to add special hardware resources to the automobile in order to make the search method practicable there, and the search method can therefore be implemented at low cost.

【００３０】以上、クライアント・コンピュータ１２の
ハードウエア資源を説明したが、次に、サーバ・コンピ
ュータ１０のハードウエア資源を説明する。The hardware resources of the client computer 12 have been described above. Next, the hardware resources of the server computer 10 will be described.

【００３１】図１に示すように、サーバ・コンピュータ
１０は、音声認識という機能を果たすために、認識辞書
メモリ８０、音声認識部８２、参照情報データベース８
４（以下、単に「参照情報ＤＢ」で表す。他のデータベ
ースについても同じとする）、データ蓄積部８６、蓄積
データメモリ８８および辞書学習部９０を備えている。As shown in FIG. 1, the server computer 10 includes, in order to perform the speech recognition function, a recognition dictionary memory 80, a speech recognition unit 82, a reference information database 84 (hereinafter simply written “reference information DB”; the same applies to other databases), a data storage unit 86, a storage data memory 88, and a dictionary learning unit 90.

【００３２】認識辞書メモリ８０は、複数種類の音声を
識別して認識するために参照される音声認識辞書を記憶
するメモリである。認識辞書メモリ８０は、本実施形態
においては、音響モデルと言語モデルとを音声認識辞書
として記憶するように構成されている。The recognition dictionary memory 80 is a memory for storing a voice recognition dictionary referred to for identifying and recognizing a plurality of types of voices. In the present embodiment, the recognition dictionary memory 80 is configured to store the acoustic model and the language model as a speech recognition dictionary.

【００３３】音響モデルは、音声の音響上の特徴情報を
考慮して音声認識を行うためのモデルである。本実施形
態においては、この音響モデルを利用することにより、
話者からの実際音声（問い合わせ情報）が問い合わせ文
（文字列）に変換される。The acoustic model is a model for performing speech recognition in consideration of the acoustic feature information of the speech. In the present embodiment, this acoustic model is used to convert the actual voice (inquiry information) from the speaker into an inquiry sentence (a character string).

【００３４】この音響モデルは、図４に示すように、想
定された複数種類の話者の本人の属性（例えば、性別、
年齢等）を表す各想定本人属性データと、想定された複
数種類の環境（発声環境）の属性（例えば、繁華街、高
速道路等）を表す各想定環境属性データとに関連付けら
れることにより、同じ認識辞書メモリ８０に複数種類設
けられている。As shown in FIG. 4, a plurality of types of this acoustic model are provided in the same recognition dictionary memory 80, each associated with assumed personal attribute data representing one of a plurality of assumed types of speaker attributes (for example, gender and age) and with assumed environment attribute data representing one of a plurality of assumed types of environment (utterance environment) attributes (for example, downtown area and expressway).

【００３５】これに対して、言語モデルは、一連の音声
により表現される文字列における複数の単語間における
言語上の複数種類の制約情報（例えば、複数の単語が互
いに連接する際の制約情報や、文法上の制約情報）を考
慮して音声認識を行うためのモデルである。本実施形態
においては、この言語モデルを用いることにより、上記
問い合わせ文が問い合わせ命令に変換される。In contrast, the language model is a model for performing speech recognition in consideration of a plurality of types of linguistic constraint information among the words in a character string represented by a series of sounds (for example, constraints on how words may be connected to one another, and grammatical constraints). In the present embodiment, this language model is used to convert the above inquiry sentence into an inquiry command.

【００３６】ここに、問い合わせ命令は、利用者が希望
する情報検索サービス（後に詳述する）においてサーバ
・コンピュータ１０が参照することが必要である検索条
件を意味すると考えることができる。例えば、問い合わ
せ文が、２つの名詞である「わたし」と「あなた」とが
格助詞である「と」により互いに連結されて構成されて
いると予想される場合には、検索条件として、検索対象
が、「わたし」という単語と、「あなた」という単語と
を、それら２つの単語が同時に存在するという関連と共
に、含むという条件が導かれることになる。Here, the inquiry command can be regarded as the search condition that the server computer 10 needs to refer to in the information search service desired by the user (described in detail later). For example, if the inquiry sentence is expected to consist of the two nouns “I” and “you” connected by the case particle “to” (“and”), the derived search condition is that the search target contains the word “I” and the word “you”, together with the relation that the two words are present simultaneously.
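
The “I” and “you” example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `InquiryCommand`, the function `to_inquiry_command`, and the naive split on the particle 「と」 are hypothetical stand-ins for whatever the language model actually does, which the specification does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InquiryCommand:
    """Hypothetical search condition: every keyword must occur in the target."""
    keywords: tuple

    def matches(self, target_text: str) -> bool:
        # The two words must be present simultaneously (logical AND).
        return all(word in target_text for word in self.keywords)


def to_inquiry_command(inquiry_sentence: str, particle: str = "と") -> InquiryCommand:
    """Naively split a 'noun + particle + noun' sentence into an AND condition.

    Assumption: a real language model would apply connection and grammar
    constraints; this split only mirrors the worked example in the text.
    """
    words = tuple(w for w in inquiry_sentence.split(particle) if w)
    return InquiryCommand(keywords=words)


command = to_inquiry_command("わたしとあなた")
print(command.keywords)
print(command.matches("あなたとわたしの旅行"))
```

The resulting command accepts only search targets that contain both words, which is the relation the paragraph describes.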

【００３７】この言語モデルも、音響モデルと同様に、
図５に示すように、想定された複数種類の話者の本人の
属性（例えば、性別、年齢等）を表す各想定本人属性デ
ータと、想定された複数種類の環境（発声環境）の属性
（例えば、観光地、高速道路等）を表す各想定環境属性
データとに関連付けられることにより、同じ認識辞書メ
モリ８０に複数種類設けられている。Like the acoustic model, a plurality of types of this language model are provided in the same recognition dictionary memory 80, as shown in FIG. 5, each associated with assumed personal attribute data representing one of a plurality of assumed types of speaker attributes (for example, gender and age) and with assumed environment attribute data representing one of a plurality of assumed types of environment (utterance environment) attributes (for example, sightseeing spot and expressway).

【００３８】前記音声認識部８２は、それら音響モデル
および言語モデルを利用していわゆる音声認識処理を行
うことにより、話者から音声として実際に発せられた問
い合わせ情報を問い合わせ文に変換し、さらに、その変
換された問い合わせ文を問い合わせ命令に変換する。The speech recognition unit 82 performs so-called speech recognition processing using the acoustic model and the language model, thereby converting the inquiry information actually uttered as speech by the speaker into an inquiry sentence and then converting that inquiry sentence into an inquiry command.

【００３９】この音声認識部８２は、音響モデルおよび
言語モデルの利用に先立ち、複数種類の言語モデルおよ
び複数種類の音響モデルの中から、認識されるべき音声
を発した利用者である利用者本人の実際の属性を表す実
際本人属性データと、その利用者がその音声を発した際
におけるその利用者の実際の環境の属性を表す実際環境
属性データとに対応する音響モデルおよび言語モデルを
選択する。その選択された音響モデルおよび言語モデル
を利用することにより、音声認識部８２は、話者として
の利用者から発せられた音声を認識する。Before using the acoustic model and the language model, the speech recognition unit 82 selects, from among the plurality of types of language models and the plurality of types of acoustic models, the acoustic model and the language model that correspond to actual personal attribute data representing the actual attributes of the user who uttered the voice to be recognized and to actual environment attribute data representing the attributes of that user's actual environment at the time of the utterance. Using the selected acoustic model and language model, the speech recognition unit 82 recognizes the voice uttered by the user as the speaker.
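
The selection step can be pictured as a table lookup keyed by the pair (personal attribute, environment attribute). The sketch below assumes that shape; the table names, the attribute tuples, and the model identifiers are all hypothetical placeholders, not the contents of FIG. 4 or FIG. 5.

```python
# Hypothetical contents of the recognition dictionary memory 80:
# one acoustic model and one language model per (personal, environment) key.
ACOUSTIC_MODELS = {
    (("female", "adult"), "downtown"): "am_female_adult_downtown",
    (("female", "adult"), "expressway"): "am_female_adult_expressway",
    (("male", "adult"), "expressway"): "am_male_adult_expressway",
}
LANGUAGE_MODELS = {
    (("female", "adult"), "downtown"): "lm_female_adult_downtown",
    (("female", "adult"), "expressway"): "lm_female_adult_expressway",
    (("male", "adult"), "expressway"): "lm_male_adult_expressway",
}


def select_models(personal_attribute, environment_attribute):
    """Pick the acoustic and language model matching both actual attributes.

    Assumption: an exact match always exists; the text does not describe
    what happens when no dictionary fits, so a KeyError is simply raised.
    """
    key = (personal_attribute, environment_attribute)
    return ACOUSTIC_MODELS[key], LANGUAGE_MODELS[key]


acoustic, language = select_models(("female", "adult"), "expressway")
print(acoustic, language)
```

Keeping the two tables separate mirrors the text, which associates each model type with the attributes independently.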

【００４０】ここに、利用者本人の実際の属性（実際本
人属性）は、その利用者の利用者ＩＤにより特定され、
同様に、利用者の実際の環境の属性（実際環境属性）
は、その利用者のクライアント・コンピュータ１２から
サーバ・コンピュータ１０が受信した、その利用者の現
在位置を表す情報（現在位置情報）により特定される。Here, the user's actual attributes (actual personal attributes) are identified from the user's user ID; similarly, the attributes of the user's actual environment (actual environment attributes) are identified from the information representing the user's current position (current position information) that the server computer 10 receives from the user's client computer 12.

【００４１】そして、本実施形態においては、それら実
際本人属性と利用者ＩＤとの関係と、それら実際環境属
性と現在位置情報との関係とが、基本的には、その利用
者がサーバ・コンピュータ１０において情報検索サービ
スを利用する前に、例えば、その情報検索サービスを運
営する事業体により、前記参照情報ＤＢ８４に記憶させ
られる。参照情報ＤＢ８４は、図１に示すように、位置
情報ＤＢ９２と利用者情報ＤＢ９４とを備えている。In the present embodiment, the relationship between the actual personal attributes and the user ID and the relationship between the actual environment attributes and the current position information are, as a rule, stored in the reference information DB 84 before the user uses the information search service on the server computer 10, for example by the business entity that operates the service. As shown in FIG. 1, the reference information DB 84 includes a position information DB 92 and a user information DB 94.

【００４２】ここに、本人属性と利用者ＩＤとの関係を
説明すると、本人属性は例えば、性別、年齢、身長、体
重等を含み、これに対して、利用者ＩＤは例えば、サー
バ・コンピュータ１０において情報検索サービスの利用
が許可された会員として登録された際にそのサーバ・コ
ンピュータ１０から付与された、会員番号としての識別
番号を意味する。その登録に際して、その利用者が本人
属性を登録することとすれば、その情報検索サービスの
利用に先立って、本人属性と利用者ＩＤとの関係を上記
利用者情報ＤＢ９４に記憶させ得る。To explain the relationship between the personal attributes and the user ID: the personal attributes include, for example, gender, age, height, and weight, whereas the user ID is, for example, an identification number serving as a membership number, assigned by the server computer 10 when the user is registered as a member permitted to use the information search service on the server computer 10. If the user registers his or her personal attributes at the time of that registration, the relationship between the personal attributes and the user ID can be stored in the user information DB 94 before the information search service is used.

【００４３】次に、環境属性と位置情報との関係を説明
するに、環境属性は例えば、繁華街、高速道路等を含
み、これに対して、位置情報は例えば、ＧＰＳ４２から
提供される情報、例えば、緯度および経度を含む。それ
ら環境属性と位置情報との関係は、事前の調査により判
明し得る。したがって、本実施形態においては、それら
環境属性と位置情報との関係が上記位置情報ＤＢ９２に
予め記憶させられている。Next, to explain the relationship between the environment attributes and the position information: the environment attributes include, for example, downtown area and expressway, whereas the position information includes, for example, information provided by the GPS 42, such as latitude and longitude. The relationship between the environment attributes and the position information can be determined by a prior survey. Therefore, in the present embodiment, that relationship is stored in the position information DB 92 in advance.
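
The two lookups described in paragraphs [0040] to [0043] might look like the following minimal sketch. The record layouts are assumptions made for illustration: the user information DB 94 is modelled as a dictionary keyed by user ID, and the position information DB 92 as a list of latitude/longitude bounding boxes; the sample ID, coordinates, and attribute values are invented.

```python
# Hypothetical user information DB 94: user ID -> registered personal attributes.
USER_INFO_DB = {
    "U0001": {"gender": "female", "age": 34},
}

# Hypothetical position information DB 92:
# (lat_min, lat_max, lon_min, lon_max, environment attribute).
POSITION_INFO_DB = [
    (35.16, 35.18, 136.89, 136.92, "downtown"),
    (35.00, 35.10, 137.00, 137.20, "expressway"),
]


def lookup_personal_attributes(user_id):
    """Identify the actual personal attributes from the user ID."""
    return USER_INFO_DB[user_id]


def lookup_environment_attribute(latitude, longitude):
    """Identify the actual environment attribute from the current position.

    Returns None when the position falls in no surveyed area (an assumption;
    the text does not say how unsurveyed positions are handled).
    """
    for lat_min, lat_max, lon_min, lon_max, environment in POSITION_INFO_DB:
        if lat_min <= latitude <= lat_max and lon_min <= longitude <= lon_max:
            return environment
    return None


print(lookup_personal_attributes("U0001"))
print(lookup_environment_attribute(35.17, 136.90))
```

Both tables are filled before the service is used, so the server only needs the user ID and the GPS coordinates at request time.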

【００４４】図１に示すように、サーバ・コンピュータ
１０は、さらに、情報検索のために、提供情報ＤＢ１０
０と情報検索／提供部１０２とを備えている。As shown in FIG. 1, the server computer 10 further includes, for information search, a provided information DB 100 and an information search/providing unit 102.

【００４５】提供情報ＤＢ１００は、利用者に提供可能
な情報である提供情報が複数種類、各音声認識結果デー
タに関連付けて予め記憶させられたものである。それら
複数種類の提供情報は、地理的環境に関する複数種類の
地理的環境関連情報（例えば、施設案内情報、道路渋滞
情報、店舗情報、旅行情報等）を含んでいる。The provided information DB 100 stores in advance a plurality of types of provided information, that is, information that can be provided to users, each associated with speech recognition result data. The plurality of types of provided information include a plurality of types of geographic environment-related information on the geographic environment (for example, facility guidance information, road congestion information, store information, and travel information).

【００４６】これに対して、情報検索／提供部１０２
は、前記音声認識部８２による音声認識結果に基づき、
利用者が希望する情報と一致すると予想される情報を前
記提供情報ＤＢ１００において検索する。その検索に際
し、情報検索／提供部１０２は、利用者の利用者ＩＤに
対応する実際本人属性データを前記利用者情報ＤＢ９４
において検索するとともに、利用者の現在位置に対応す
る実際環境属性データを前記位置情報ＤＢ９２において
検索する。そして、情報検索／提供部１０２は、まず、
音声認識部８２による音声認識結果に基づき、利用者が
希望する情報と一致すると予想される情報を提供情報Ｄ
Ｂ１００において暫定的な提供情報として検索し、その
検索された暫定的な提供情報のうち、上記検索された実
際本人属性データと実際環境属性データとに適合するも
のを最終的な提供情報として利用者に提供する。In contrast, the information search/providing unit 102 searches the provided information DB 100, based on the result of speech recognition by the speech recognition unit 82, for information expected to match the information desired by the user. For that search, the information search/providing unit 102 looks up the actual personal attribute data corresponding to the user's user ID in the user information DB 94 and the actual environment attribute data corresponding to the user's current position in the position information DB 92. The information search/providing unit 102 then first searches the provided information DB 100, based on the speech recognition result, for information expected to match what the user desires, treating the hits as provisional provided information. From that provisional provided information, it provides to the user, as the final provided information, the items that match the retrieved actual personal attribute data and actual environment attribute data.
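
The two-stage search in paragraph [0046] can be sketched as a keyword match followed by an attribute filter. Everything concrete here is an assumption for illustration: the record fields, the sample entries, and the function name `search_provided_info` do not come from the specification.

```python
# Hypothetical provided information DB 100.
PROVIDED_INFO_DB = [
    {"keywords": {"restaurant"}, "age_groups": {"adult"},
     "environment": "downtown", "info": "Bistro near the station"},
    {"keywords": {"restaurant"}, "age_groups": {"adult", "child"},
     "environment": "expressway", "info": "Service-area food court"},
    {"keywords": {"parking"}, "age_groups": {"adult"},
     "environment": "downtown", "info": "Multi-storey car park"},
]


def search_provided_info(command_keywords, age_group, environment):
    """Return (provisional, final) lists of provided information.

    Stage 1 uses only the speech recognition result (the inquiry command);
    stage 2 keeps the provisional hits that fit both actual attributes.
    """
    provisional = [record for record in PROVIDED_INFO_DB
                   if command_keywords <= record["keywords"]]
    final = [record["info"] for record in provisional
             if age_group in record["age_groups"]
             and record["environment"] == environment]
    return [record["info"] for record in provisional], final


provisional, final = search_provided_info({"restaurant"}, "adult", "expressway")
print(provisional)
print(final)
```

Running the filter after the keyword match keeps the first stage identical to a search that ignores the attributes, so the attribute data only ever narrows the result.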

【００４７】前記データ蓄積部８６、蓄積データメモリ
８８および辞書学習部９０は、前記認識辞書の学習を行
うために設けられている。The data storage unit 86, the storage data memory 88, and the dictionary learning unit 90 are provided for training the recognition dictionary.

【００４８】データ蓄積部８６は、音声データ蓄積部８
６ａと、問い合わせデータ蓄積部８６ｂとを備えてい
る。音声データ蓄積部８６ａは、サーバ・コンピュータ
１０がクライアント・コンピュータ１２から実際に受信
した音声データを蓄積データメモリ８８に蓄積する。問
い合わせデータ蓄積部８６ｂは、問い合わせデータであ
って、（ａ）その音声データに基づいて前記音声認識部
８２により取得された音声認識結果データとしての問い
合わせ命令と、（ｂ）その取得のために参照された実際
本人属性および実際環境属性との双方を表すものを蓄積
データメモリ８８に蓄積する。それら音声データと問い
合わせデータとは、蓄積データメモリ８８に互いに関連
付けて蓄積される。The data storage unit 86 includes a voice data storage unit 86a and an inquiry data storage unit 86b. The voice data storage unit 86a stores in the storage data memory 88 the voice data that the server computer 10 actually received from the client computer 12. The inquiry data storage unit 86b stores in the storage data memory 88 inquiry data representing both (a) the inquiry command obtained by the speech recognition unit 82 from that voice data as speech recognition result data and (b) the actual personal attribute and actual environment attribute that were referred to in obtaining it. The voice data and the inquiry data are stored in the storage data memory 88 in association with each other.

【００４９】辞書学習部９０は、蓄積データメモリ８８
から各音声データと各問い合わせデータとを読み出し、
さらに、その読み出された音声データと問い合わせデー
タとによってそれらの内容が反映されるように認識辞書
メモリ８０を更新する。認識辞書メモリ８０の学習を行
うのであり、この学習は、本実施形態においては、蓄積
データメモリ８８に蓄積されたデータの量が設定値より
多い場合に行われる。The dictionary learning unit 90 reads each piece of voice data and each piece of inquiry data from the storage data memory 88 and updates the recognition dictionary memory 80 so that the contents of the read voice data and inquiry data are reflected in it. In other words, the recognition dictionary memory 80 is trained; in the present embodiment, this learning is performed when the amount of data accumulated in the storage data memory 88 exceeds a set value.
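
The accumulate-then-learn behaviour of paragraphs [0048] and [0049] can be sketched as follows. The class and function names are hypothetical, and the “learning” here merely groups the accumulated samples by attribute pair; the specification does not describe the actual update algorithm, so that part is a placeholder.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StoredRecord:
    """Voice data stored in association with its inquiry data."""
    voice_data: bytes
    inquiry_command: str
    personal_attribute: str
    environment_attribute: str


class StorageDataMemory:
    """Hypothetical stand-in for the storage data memory 88."""

    def __init__(self, set_value: int):
        self.set_value = set_value
        self.records = []

    def accumulate(self, record: StoredRecord) -> None:
        self.records.append(record)

    def exceeds_set_value(self) -> bool:
        return len(self.records) > self.set_value


def learn_if_ready(memory, dictionary_memory) -> bool:
    """Update the dictionary only once the stored amount exceeds the set value."""
    if not memory.exceeds_set_value():
        return False
    for record in memory.records:
        key = (record.personal_attribute, record.environment_attribute)
        # Placeholder update: attach the sample to the matching dictionary.
        dictionary_memory[key].append((record.voice_data, record.inquiry_command))
    return True


memory = StorageDataMemory(set_value=2)
dictionary_memory = defaultdict(list)
for i in range(2):
    memory.accumulate(StoredRecord(b"\x00", f"command-{i}", "adult", "downtown"))
ran_early = learn_if_ready(memory, dictionary_memory)
memory.accumulate(StoredRecord(b"\x01", "command-2", "adult", "downtown"))
ran_late = learn_if_ready(memory, dictionary_memory)
print(ran_early, ran_late)
```

With a set value of 2, learning is skipped at two stored records and runs once a third arrives.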

【００５０】図６には、サーバ・コンピュータ１０のハ
ードウエア構成が概念的に示されている。サーバ・コン
ピュータ１０は、クライアント・コンピュータ１２のコ
ンピュータ部３６と同様に、ＰＵ１２０とメモリ１２２
とがバス１２４により互いに接続されて構成される。メ
モリ１２２は、コンピュータ部３６のメモリ３２と同様
に、ＲＯＭ，ＲＡＭ，磁気ディスク，光ディスク等の記
録媒体を含むように構成される。このメモリ１２２はい
くつかの領域を有しており、それら領域により、プログ
ラムメモリ１３０と、前記認識辞書メモリ８０と、前記
参照情報ＤＢ８４と、前記提供情報ＤＢ１００と、前記
蓄積データメモリ８８とが構成されている。プログラム
メモリ１３０には、前述の音声認識および情報検索を行
うための音声情報検索プログラムと、認識辞書メモリ８
０を学習させるための辞書学習プログラムとが予め記憶
させられている。FIG. 6 conceptually shows the hardware configuration of the server computer 10. Like the computer unit 36 of the client computer 12, the server computer 10 is configured with a PU 120 and a memory 122 connected to each other by a bus 124. Like the memory 32 of the computer unit 36, the memory 122 includes recording media such as a ROM, a RAM, a magnetic disk, and an optical disk. The memory 122 has several areas, which constitute a program memory 130, the recognition dictionary memory 80, the reference information DB 84, the provided information DB 100, and the storage data memory 88. The program memory 130 stores in advance a voice information search program for performing the speech recognition and information search described above and a dictionary learning program for training the recognition dictionary memory 80.

【００５１】サーバ・コンピュータ１０は、さらに、ク
ライアント・コンピュータ１２と同様に、図６に示すよ
うに、送受信部１３２を備えており、その送受信部１３
２により、任意のクライアント・コンピュータ１２と電
波により通信することが可能となっている。Like the client computer 12, the server computer 10 further includes a transmission/reception unit 132, as shown in FIG. 6, by which it can communicate with any client computer 12 by radio waves.

【００５２】図７には、クライアント・コンピュータ１
２により実行される音声入力プログラムの内容と、サー
バ・コンピュータ１０により実行される音声情報検索プ
ログラムの内容とが概念的に、かつ、利用者の各行為に
時期的に関連付けてフローチャートで表されている。FIG. 7 is a flowchart that conceptually shows the contents of the voice input program executed by the client computer 12 and of the voice information search program executed by the server computer 10, in temporal association with the actions of the user.

【００５３】情報検索サービスを受けることを希望する
利用者は、ステップＳ１（以下、単に「Ｓ１」で表す。
他のステップについても同じとする）において、スイッ
チ４６をＯＮに操作する。すると、音声入力プログラム
のＳ２１の判定がＹＥＳとなり、Ｓ２２において、送受
信部４０（例えば、移動電話機７０）による音声情報検
索センタへの発呼が行われる。A user who wishes to receive the information search service turns the switch 46 ON in step S1 (hereinafter simply written “S1”; the same applies to the other steps). The determination in S21 of the voice input program then becomes YES, and in S22 the transmission/reception unit 40 (for example, the mobile telephone 70) places a call to the voice information search center.

【００５４】その結果、音声情報検索プログラムのＳ４
１において、音声情報検索センタにより使用されるサー
バ・コンピュータ１０が、今回の利用者により使用され
るクライアント・コンピュータ１２と電話回線により接
続される。As a result, in S41 of the voice information search program, the server computer 10 used by the voice information search center is connected by a telephone line to the client computer 12 used by the current user.

【００５５】その後、音声入力プログラムのＳ２３にお
いて、サーバ・コンピュータ１０の応呼が確認されて、
その判定がＹＥＳとなる。Thereafter, in S23 of the voice input program, the answer from the server computer 10 is confirmed, and the determination becomes YES.

【００５６】以上で、それらサーバ・コンピュータ１０
とクライアント・コンピュータ１２とが、互いに電波に
より通信可能な状態に移行する。With this, the server computer 10 and the client computer 12 shift to a state in which they can communicate with each other by radio waves.

[0057] Thereafter, in S24 of the voice input program, the user ID of the current user is read from the memory 32 and transmitted to the server computer 10. The server computer 10, having received the user ID, stores it in the memory 122 in S42.

[0058] Thereafter, in S25 of the voice input program, the client computer 12 detects the current position of the current user by using the GPS 42. The detected current position is expressed numerically, for example as latitude and longitude, and transmitted to the server computer 10 as current position data. Subsequently, in S43, the server computer 10 stores the received current position in the memory 122. Thereafter, in S44, the server computer 10 reads question data from the memory 122 and transmits it to the client computer 12.

[0059] Subsequently, the client computer 12 receives the question data in S26 and then, in S27, outputs the content of the question represented by the question data to the user by voice via the speaker 52. In response to this question, the user, in S2, inputs inquiry information for the server computer 10 by voice via the microphone 48.

[0060] Thereafter, in S28, the client computer 12 transmits the voice uttered by the user to the server computer 10 as voice data.
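The client-side exchange in S24 through S28 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the publication: the `Client` class, the message kinds (`"user_id"`, `"position"`, `"voice"`), and the in-memory `sent` list that stands in for the radio link are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    """Hypothetical stand-in for the client computer 12."""
    user_id: str
    position: tuple  # (latitude, longitude), as obtained from GPS in S25
    sent: list = field(default_factory=list)  # stands in for the radio link

    def send(self, kind, payload):
        # A real client would transmit over the telephone line; here we just log.
        self.sent.append((kind, payload))

    def run_session(self, question, recorded_voice):
        self.send("user_id", self.user_id)               # S24: send user ID
        lat, lon = self.position
        self.send("position", {"lat": lat, "lon": lon})  # S25: send current position
        prompt = f"(speaker) {question}"                 # S26, S27: receive and play question
        self.send("voice", recorded_voice)               # S28: send the user's utterance
        return prompt
```

The order of the entries in `sent` mirrors the order of the steps: the user ID and position reach the server before any voice data, so the server already holds both keys it needs for dictionary selection when the utterance arrives.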

[0061] Subsequently, in S45, the server computer 10 performs a voice information search based on the received voice data.

[0062] The details of S45 are shown in the flowchart of FIG. 8 as a voice information search routine. In this routine, first, in S61, the actual personal attribute data corresponding to the user ID stored in the memory 122 is retrieved from the user information DB 94. Next, in S62, the actual environment attribute data corresponding to the current position stored in the memory 122 is retrieved from the position information DB 92.

[0063] Thereafter, in S63, the acoustic model corresponding to the retrieved actual personal attribute data and actual environment attribute data is retrieved from the recognition dictionary memory 80. Subsequently, in S64, the language model corresponding to the retrieved actual personal attribute data and actual environment attribute data is retrieved from the recognition dictionary memory 80.
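The lookups in S61 through S64 amount to two key translations followed by a table selection. The sketch below assumes, purely for illustration, that each database is a dictionary, that the position is matched on a coarse latitude/longitude grid, and that the attribute values and model names shown exist; none of these details are given in the text.

```python
# All table contents below are hypothetical examples.
USER_INFO_DB = {"user-001": "adult_male"}          # user ID -> personal attribute (S61)
POSITION_INFO_DB = {(35.2, 137.0): "expressway"}   # grid cell -> environment attribute (S62)

# Recognition dictionary memory: models keyed by (personal, environment) attributes.
ACOUSTIC_MODELS = {("adult_male", "expressway"): "am_adult_male_expressway"}
LANGUAGE_MODELS = {("adult_male", "expressway"): "lm_adult_male_expressway"}


def lookup_environment(lat, lon):
    # Assumption: positions are matched by rounding to a 0.1-degree grid cell.
    return POSITION_INFO_DB[(round(lat, 1), round(lon, 1))]


def select_dictionary(user_id, lat, lon):
    personal = USER_INFO_DB[user_id]             # S61
    environment = lookup_environment(lat, lon)   # S62
    key = (personal, environment)
    acoustic = ACOUSTIC_MODELS[key]              # S63
    language = LANGUAGE_MODELS[key]              # S64
    return acoustic, language
```

For example, `select_dictionary("user-001", 35.17, 136.97)` returns the pair of models registered for an adult male speaker on an expressway. A production system would also need a fallback for attribute pairs that have no registered model, which this sketch omits.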

[0064] Thereafter, in S65, based on the voice data received from the client computer 12 and using the selected acoustic model and language model, an inquiry command corresponding to the inquiry information uttered by the user is created in the manner described above.

[0065] Subsequently, in S66, provided information matching the user's inquiry information is retrieved from the provided information DB 100 based on the created inquiry command. Specifically, as described above, the actual personal attribute data corresponding to the user's user ID is retrieved from the user information DB 94, the actual environment attribute data corresponding to the user's current position is retrieved from the position information DB 92, and the provided information is retrieved based on the retrieved actual personal attribute data and actual environment attribute data together with the created inquiry command.

[0066] Thereafter, in S67, the retrieved provided information is transmitted to the client computer 12. Subsequently, in S68, the voice data used in the current voice recognition is stored in the storage data memory 88, and further, in S69, inquiry data representing the inquiry command, the actual personal attribute, and the actual environment attribute obtained in the current voice recognition is stored in the storage data memory 88. This completes one execution of the voice information search routine, that is, one execution of S45 in FIG. 7, and consequently one execution of the voice information search program.
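Steps S65 through S69 can be sketched as a single function. The recognizer here is a stub (real decoding with an acoustic and a language model is outside the scope of the sketch), and the table contents, the function names, and the two lists standing in for the storage data memory 88 are assumptions made for illustration.

```python
def recognize(voice_data, acoustic_model, language_model):
    """Stub recognizer (assumption): treats the bytes as already-transcribed text."""
    return voice_data.decode("utf-8").strip().lower()


# Hypothetical provided information DB keyed by (command, personal, environment).
PROVIDED_INFO_DB = {
    ("restaurant", "adult_male", "expressway"): "Next service area: 5 km",
}

stored_voice = []    # stands in for voice data kept in the storage data memory 88
stored_queries = []  # stands in for inquiry data kept in the storage data memory 88


def search_routine(voice_data, personal_attr, environment_attr, models):
    acoustic, language = models
    command = recognize(voice_data, acoustic, language)           # S65
    key = (command, personal_attr, environment_attr)
    info = PROVIDED_INFO_DB.get(key)                              # S66
    # S67 would transmit `info` to the client; this sketch returns it instead.
    stored_voice.append(voice_data)                               # S68
    stored_queries.append((command, personal_attr, environment_attr))  # S69
    return info
```

Storing the utterance and the inquiry record on every call is what later gives the dictionary learning program material to train on, which is why S68 and S69 run even though the answer has already been sent in S67.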

[0067] The client computer 12, having received the retrieved provided information, presents it to the user in S29 of FIG. 7 via at least one of the display 50 and the speaker 52. This completes one execution of the voice input program. As a result, in S3, the user obtains the retrieved provided information, that is, the information the user wanted. This completes the series of procedures performed by the user for the voice information search.

[0068] FIG. 9 conceptually shows the content of the dictionary learning program executed by the server computer 10. This program is executed repeatedly. In each execution, it is first determined in S81 whether the data storage amount, that is, the amount of data stored in the storage data memory 88, is larger than a set value. If it is not, the determination is NO, and the current execution of the program ends immediately.

[0069] If, on the other hand, the data storage amount is larger than the set value, the determination in S81 is YES and the process proceeds to S82. In S82, the acoustic model is trained so that it reflects the voice data and the inquiry data stored in the storage data memory 88. The language model is then trained so that it reflects the inquiry data stored in the storage data memory 88. This completes one execution of the dictionary learning program.
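One execution of the dictionary learning program can be sketched as a threshold check followed by two training calls. The sketch assumes that the "data storage amount" is measured as a record count and that training is delegated to caller-supplied functions; the record layout and all names are hypothetical.

```python
def learning_cycle(accumulated, threshold, train_acoustic, train_language):
    """One execution of the dictionary learning program (illustrative sketch).

    `accumulated` is a list of records, each a dict with "voice" and "query"
    keys; measuring the stored amount by record count is an assumption.
    """
    if len(accumulated) <= threshold:   # S81: not larger than the set value -> NO
        return False                    # end this execution immediately
    voice = [rec["voice"] for rec in accumulated]
    queries = [rec["query"] for rec in accumulated]
    train_acoustic(voice, queries)      # S82: acoustic model reflects voice and inquiry data
    train_language(queries)             # then: language model reflects inquiry data
    return True
```

Because the comparison is strict ("larger than" the set value), a store holding exactly `threshold` records does not trigger training. Whether the stored data is cleared after training is not stated in the text, so the sketch leaves `accumulated` untouched.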

[0070] As is clear from the above description, in the present embodiment, the part of the server computer 10 that executes S61 through S65 of FIG. 8 constitutes the voice recognition unit 82, the part that executes S66 and S67 constitutes the information search/providing unit 102, the part that executes S68 constitutes the voice data storage unit 86a, and the part that executes S69 constitutes the inquiry data storage unit 86b. Further, the part of the server computer 10 that executes the dictionary learning program constitutes the dictionary learning unit 90.

[0071] As is clear from the above description, in the present embodiment, the acoustic model and the language model together constitute an example of the "speech recognition dictionary" in claim 1 or 3, S61 through S64 of FIG. 8 together constitute an example of the "first step" in those claims, and S65 of FIG. 8 constitutes an example of the "second step" in those claims.

[0072] Further, in the present embodiment, S81 through S83 of FIG. 9 together constitute an example of the "third step" in claim 2 or 4.

[0073] Further, in the present embodiment, the acoustic model and the language model together constitute an example of the "speech recognition dictionary" in claim 5, S61 through S64 of FIG. 8 together constitute an example of the "first step" in that claim, S65 of FIG. 8 constitutes an example of the "second step" in that claim, S66 and S67 of FIG. 8 together constitute an example of the "third step" in claim 5 or 6, the provided information DB 100 constitutes an example of the "provided information database" in those claims, and the provided information constitutes an example of the "geographical environment-related information" in claim 7.

[0074] Further, in the present embodiment, S61 through S65 of FIG. 8, taken from among all the steps of the voice information search program, and all the steps of the voice information search program each constitute an example of the "program" according to claim 8.

[0075] Further, in the present embodiment, the memory 122 constitutes an example of the "recording medium" according to claim 9.

[0076] Further, in the present embodiment, the acoustic model and the language model together constitute an example of the "speech recognition dictionary" in claim 10 or 11, and the voice recognition unit 82 constitutes an example of the "recognition means" in those claims.

[0077] Further, in the present embodiment, the acoustic model and the language model together constitute an example of the "speech recognition dictionary" in claim 12, the provided information DB 100 constitutes an example of the "provided information database" in that claim, the voice recognition unit 82 constitutes an example of the "recognition means" in that claim, and the information search/providing unit 102 constitutes an example of the "information search means" in that claim.

[0078] An embodiment of the present invention has been described in detail above with reference to the drawings. This embodiment is an example, and the present invention can be carried out in other forms with various modifications and improvements based on the knowledge of those skilled in the art, including the aspects described in the section "Means for Solving the Problems and Effects of the Invention".

[Brief description of the drawings]

FIG. 1 is a functional block diagram conceptually showing a voice information search system used to carry out a voice information search method according to an embodiment of the present invention.

FIG. 2 is a block diagram conceptually showing the hardware configuration of the client computer 12 in FIG. 1.

FIG. 3 is a front view illustrating the in-vehicle navigation device 60 and the mobile telephone 70 that constitute the client computer 12 in FIG. 2.

FIG. 4 is a diagram conceptually showing, in tabular form, the configuration of the acoustic model in FIG. 1.

FIG. 5 is a diagram conceptually showing, in tabular form, the configuration of the language model in FIG. 1.

FIG. 6 is a block diagram conceptually showing the hardware configuration of the server computer 10 in FIG. 1.

FIG. 7 is a flowchart conceptually showing the content of the voice input program in FIG. 2 and the content of the voice information search program in FIG. 6, each associated in time with the corresponding action of the user.

FIG. 8 is a flowchart conceptually showing the content of S45 in FIG. 7 as a voice information search routine.

FIG. 9 is a flowchart conceptually showing the content of the dictionary learning program in FIG. 6.

[Explanation of symbols]

10 Server computer
12 Client computer
80 Recognition dictionary memory
82 Voice recognition unit
90 Dictionary learning unit
100 Provided information database
102 Information search/providing unit

Continued from the front page: (51) Int. Cl.⁷ G10L 15/28; FI G10L 3/00 521S, 551A, 551P

Claims


1. A speech recognition method for recognizing, by a computer, a voice uttered by a specific or unspecified speaker, comprising: a first step of selecting, based on actual environment attribute data representing an actual environment attribute, which is an attribute of the environment in which the speaker is actually placed when uttering the voice, a speech recognition dictionary corresponding to the actual environment attribute from among a plurality of types of speech recognition dictionaries prepared in advance in association with the respective attributes of a plurality of types of environments in which the speaker is assumed in advance to be placed when uttering a voice; and a second step of recognizing the actual voice by using the selected speech recognition dictionary, based on actual voice data representing the actual voice actually uttered by the speaker, and creating speech recognition result data representing the recognition result.

2. The speech recognition method according to claim 1, further comprising a third step of training the plurality of types of speech recognition dictionaries so that their contents reflect the relationship among the mutually corresponding actual environment attribute data, actual voice data, and speech recognition result data.

3. A speech recognition method for recognizing, by a computer, a voice uttered by an unspecified speaker, comprising: a first step of selecting, based on actual personal attribute data representing an actual personal attribute, which is an actual attribute of a target speaker who is the speaker uttering the voice to be recognized, and on actual environment attribute data representing an actual environment attribute, which is an attribute of the environment in which the target speaker is actually placed when uttering the voice, a speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute from among a plurality of types of speech recognition dictionaries prepared in advance in association with the attributes of a plurality of types of environments in which a speaker is assumed in advance to be placed when uttering a voice and with a plurality of types of personal attributes assumed in advance for the speaker; and a second step of recognizing the actual voice data by using the selected speech recognition dictionary, based on actual voice data representing the actual voice actually uttered by the target speaker, and creating speech recognition result data representing the recognition result.

4. The speech recognition method according to claim 3, further comprising a third step of training the plurality of types of speech recognition dictionaries so that their contents reflect the relationship among the mutually corresponding actual voice data, speech recognition result data, actual personal attribute data, and actual environment attribute data.

5. A voice information search method, carried out in a server computer capable of communicating with a plurality of client computers respectively used by a plurality of users who wish to search for necessary information by voice, of searching for information conforming to the information search request from each user and transmitting the information to each client computer, the method comprising: a first step of selecting, based on actual personal attribute data representing an actual personal attribute, which is an actual attribute of a target speaker who is the speaker uttering the voice to be recognized, and on actual environment attribute data representing an actual environment attribute, which is an attribute of the environment in which the target speaker is actually placed when uttering the voice, a speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute from among a plurality of types of speech recognition dictionaries prepared in advance in association with the attributes of a plurality of types of environments in which a speaker is assumed in advance to be placed when uttering a voice and with a plurality of types of personal attributes assumed in advance for the speaker; a second step of recognizing the actual voice data by using the selected speech recognition dictionary, based on actual voice data representing the actual voice actually uttered by the target speaker, and creating speech recognition result data representing the recognition result; and a third step of searching, based at least on the created speech recognition result data, a provided information database, in which a plurality of types of provided information that can be provided to each user are stored in advance in association with a plurality of types of speech recognition result data assumed in advance, for the provided information corresponding to the created speech recognition result data, and transmitting the retrieved provided information to the client computer used by the target user.

6. The voice information search method according to claim 5, wherein the third step includes a step of searching the provided information database, based on the created speech recognition result data and with reference to at least one of the actual personal attribute data and the actual environment attribute data, for the provided information corresponding to the created speech recognition result data.

7. The voice information search method according to claim 5 or 6, wherein the plurality of types of provided information include a plurality of types of geographical environment-related information relating to geographical environments.

8. A computer-executable program for performing the method according to claim 1.

9. A recording medium recording the program according to claim 8 in a computer-readable manner.

10. A speech recognition system for recognizing, by a computer, a voice uttered by an unspecified speaker, comprising: a plurality of types of speech recognition dictionaries prepared in advance in association with the attributes of a plurality of types of environments in which a speaker is assumed in advance to be placed when uttering a voice and with a plurality of types of personal attributes assumed in advance for the speaker; and recognition means for selecting, from among the plurality of types of speech recognition dictionaries, a speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute, based on actual personal attribute data representing the actual personal attribute, which is an actual attribute of a target speaker who is the speaker uttering the voice to be recognized, and on actual environment attribute data representing the actual environment attribute, which is an attribute of the environment in which the target speaker is actually placed when uttering the voice, for recognizing the actual voice data by using the selected speech recognition dictionary, based on actual voice data representing the actual voice actually uttered by the target speaker, and for creating speech recognition result data representing the recognition result.

11. A speech recognition server computer that recognizes a voice uttered by an unspecified user by communicating with a plurality of client computers respectively used by a plurality of users who utter voices to be recognized, comprising: a plurality of types of speech recognition dictionaries prepared in advance in association with the attributes of a plurality of types of environments in which a speaker is assumed in advance to be placed when uttering a voice and with a plurality of types of personal attributes assumed in advance for the speaker; and recognition means for selecting, from among the plurality of types of speech recognition dictionaries, a speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute, based on actual personal attribute data representing the actual personal attribute, which is an actual attribute of a target speaker who is the speaker uttering the voice to be recognized, and on actual environment attribute data representing the actual environment attribute, which is an attribute of the environment in which the target speaker is actually placed when uttering the voice, for recognizing the actual voice data by using the selected speech recognition dictionary, based on actual voice data representing the actual voice actually uttered by the target speaker, and for creating speech recognition result data representing the recognition result.

12. A server computer that, by communicating with a plurality of client computers respectively used by a plurality of users who wish to search for necessary information by voice, searches for information conforming to an information search request made by voice by an unspecified user and transmits the information to each client computer, wherein: a plurality of types of speech recognition dictionaries prepared in advance in association with the attributes of a plurality of types of environments in which a speaker is assumed in advance to be placed when uttering a voice and with a plurality of types of personal attributes assumed in advance for the speaker; a provided information database in which a plurality of types of provided information that can be provided to each user are stored in advance in association with a plurality of types of speech recognition result data assumed in advance; recognition means for selecting, from among the plurality of types of speech recognition dictionaries, a speech recognition dictionary corresponding to the actual personal attribute and the actual environment attribute, based on actual personal attribute data representing the actual personal attribute, which is an actual attribute of a target user who is the user uttering by voice the information needed to search for the necessary information, and on actual environment attribute data representing the actual environment attribute, which is an attribute of the environment in which the target user is actually placed when uttering the voice, for recognizing the actual voice data by using the selected speech recognition dictionary, based on actual voice data representing the actual voice actually uttered by the target user, and for creating speech recognition result data representing the recognition result; information search means for searching the provided information database, based on the created speech recognition result data, for the provided information corresponding to that speech recognition result data, and transmitting the retrieved provided information to the client computer used by the target user;