JP4769345B2 - Audio processing apparatus and recording medium

Info

Publication number: JP4769345B2
Application number: JP08113098A
Authority: JP
Inventors: 一郎服部; 晃鈴木
Original assignee: Animo Ltd
Current assignee: Animo Ltd
Priority date: 1998-03-27
Filing date: 1998-03-27
Publication date: 2011-09-07
Anticipated expiration: 2018-03-27
Also published as: JPH11282856A

Description

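[0001]
[Technical Field of the Invention]
The present invention relates to a voice processing apparatus and a recording medium, and more particularly to a voice processing apparatus that performs speaker identification, identifying a speaker on the basis of the speaker's input voice data, and to a recording medium on which is recorded a program that causes a computer to execute such processing.
[0002]
[Prior Art]
For example, a large retailer may register a customer's name, address, telephone number, purchased product names, and so on in a customer list when the customer purchases a product, and then refer to that list to provide information (for example, by direct mail) that appears to match each customer's needs.
[0003]
[Problems to Be Solved by the Invention]
However, with such a conventional method, only customers who have actually purchased a product are registered in the customer list. For example, information about a customer who intends to buy a product and has visited the store several times to evaluate it cannot be registered.
[0004]
In addition, large retailers in particular employ many sales clerks, so a different clerk often serves the customer on each visit. As a result, the customer may have to repeat what was already explained to another clerk, or may be given again an explanation already received from another clerk, which is cumbersome.
[0005]
The present invention has been made in view of these points, and its object is to provide a voice processing apparatus that makes it possible to refer to information about the past behavior and the like of the customer or user currently being served.
[0006]
To solve the above problems, the present invention provides a voice processing apparatus comprising: voice data input accepting means for accepting input of a speaker's voice data; feature quantity extracting means for extracting a feature quantity from the voice data accepted by the voice data input accepting means; recording means for supplying the feature quantity extracted by the feature quantity extracting means, or the original voice data, to a recording device for recording; similarity calculating means for calculating the similarity between a new feature quantity extracted by the feature quantity extracting means and either a feature quantity obtained by having the feature quantity extracting means process other voice data already recorded in the recording device or another feature quantity already recorded in the recording device; attribute information input accepting means for accepting input of attribute information that supplements identification of the speaker; associating means for associating with one another the pieces of attribute information associated with the voice data or feature quantities whose similarity, as calculated by the similarity calculating means, exceeds a predetermined value; and output means for causing a display device to display identification information for identifying the voice data in descending order of the similarity calculated by the similarity calculating means; wherein the recording means supplies the attribute information accepted by the attribute information input accepting means to the recording device and has it recorded in association with the voice data or feature quantity; the similarity calculating means, when calculating similarity, calculates similarity only with feature quantities other than those whose associated attribute information has been associated, by the associating means, with the attribute information associated with a feature quantity whose similarity has already been calculated; and the output means causes the display device to display, together with the identification information for identifying the voice data, the attribute information associated with that voice data.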
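[0009]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a diagram illustrating the principle of the present invention.
[0010]
In this figure, the voice data input accepting means 1 accepts voice data obtained, for example, by capturing speech through a microphone or the like and converting it into a digital signal with an A/D converter.
[0011]
The feature quantity extracting means 2 extracts a feature quantity from the voice data input from the voice data input accepting means 1.
The attribute information input accepting means 3 accepts input of attribute information about the speaker corresponding to the voice data input from the voice data input accepting means 1.
[0012]
The recording means 4 records the feature quantity extracted by the feature quantity extracting means 2 and the attribute information input from the attribute information input accepting means 3 in the recording device 5 in association with each other.
The recording device 5 stores the feature quantities of voice data extracted by the feature quantity extracting means 2 in association with the attribute information of the corresponding speakers; in other words, it forms a voice-based database.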
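[0013]
The similarity calculating means 6 calculates the similarity between the feature quantity of newly input voice data and the feature quantities of voice data already recorded in the recording device 5.
The associating means 7 associates the newly input voice data with data that is highly related to it.
[0014]
The output means 8 refers to the processing result of the similarity calculating means 6 and displays, on the display device 9, the attribute information for the feature quantities of voice data that are highly similar to the newly input voice data, for example in list form.
[0015]
The display device 9 displays the information supplied from the output means 8 on its screen.
Next, the operation of the arrangement shown in the principle diagram of FIG. 1 is described. The description below assumes that the apparatus shown in FIG. 1 is used to manage information about customers visiting, for example, a department store.
[0016]
Suppose a customer comes into the store and says, "I would like a pair of frameless glasses made by XX." The voice data corresponding to this utterance is input through the voice data input accepting means 1 and supplied to the feature quantity extracting means 2.
[0017]
The feature quantity extracting means 2 extracts a feature quantity from the supplied voice data. Information on the physical characteristics of the voice data (for example, information on the formant distribution of the voice) is used as this feature quantity.
[0018]
The recording means 4 assigns a predetermined file name to the extracted feature quantity and supplies it to the recording device 5 for recording.
The similarity calculating means 6 calculates the similarity between the feature quantity extracted by the feature quantity extracting means 2 and each feature quantity already recorded in the recording device. If any feature quantity has a similarity exceeding a predetermined value, the similarity calculating means 6 supplies the output means 8 with identification information for that feature quantity (for example, its file name), the attribute information recorded in association with it, and the calculated similarity.
[0019]
The output means 8 supplies the identification information, attribute information, and similarity received from the similarity calculating means 6 to the display device 9. The display device 9 displays them in list form.
[0020]
When the clerk has finished serving the customer, the clerk inputs attribute information about the customer (age, gender, purchased products, and so on) through the attribute information input accepting means 3.
[0021]
The recording means 4 records the feature quantity extracted by the feature quantity extracting means 2 and the attribute information input from the attribute information input accepting means 3 in the recording device 5 in association with each other. As a result, a new entry is added to the database in the recording device 5.
[0022]
The associating means 7 links the newly input feature quantity to the feature quantity of any voice data whose similarity to the newly input voice data exceeds the predetermined value. As a result, the entries of the database recorded in the recording device 5 are associated with one another according to similarity (entries whose similarity exceeds the predetermined value are linked to each other). Thus, when several highly similar entries exist, identifying one of the linked entries allows the others to be retrieved by following the links.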
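As a non-limiting illustration of the linking described in paragraph [0022], the following Python sketch keeps a symmetric link table and retrieves the entries linked to a given one. The names `links`, `link_similar`, and `linked_records`, the in-memory dictionary, and the 70% threshold are assumptions made for this example; the specification itself refers only to a "predetermined value".

```python
from collections import defaultdict

# Hypothetical in-memory link table: file name -> set of linked file names.
links: dict[str, set[str]] = defaultdict(set)


def link_similar(new_id: str, similarities: dict[str, float],
                 threshold: float = 70.0) -> None:
    """Link new_id to every stored entry whose similarity exceeds threshold."""
    for other_id, score in similarities.items():
        if score > threshold:
            links[new_id].add(other_id)
            links[other_id].add(new_id)


def linked_records(start_id: str) -> set[str]:
    """Return every entry reachable from start_id by following links."""
    seen, stack = {start_id}, [start_id]
    while stack:
        for neighbour in links[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen - {start_id}
```

Once one linked entry has been identified, `linked_records` returns the rest without any further similarity calculation.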
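[0023]
With the voice processing apparatus according to the present invention described above, when the voice data of a visiting customer (free, unscripted speech) is input, the attribute data of highly similar voice data is retrieved from the database recorded in the recording device 5 and displayed on the display device 9. Consequently, for any customer who has visited at least once before and spoken with a clerk, the attribute information is retrieved and displayed, and the clerk can refer to it to conduct the sales conversation more smoothly.
[0024]
FIG. 2 is a block diagram showing an example of the overall configuration of a voice processing system that includes an embodiment of the present invention.
In this figure, microphones 10a to 10z convert customers' voices into electrical signals and supply them to terminal devices 11a to 11z, respectively.
[0025]
The terminal devices 11a to 11z are, for example, portable terminals. They convert the voice signals input from the microphones 10a to 10z into digital signals and transmit them to the voice processing apparatus 14, and they display the attribute information transmitted from the voice processing apparatus 14 on display devices 12a to 12z. Customer attribute information can also be entered by operating an input unit (not shown).
[0026]
When a customer purchases a product, the cash register 13 transmits the product name and related information to the voice processing apparatus 14.
The voice processing apparatus 14 has the configuration shown in FIG. 3, which is described in detail later.
[0027]
The database 15 records the feature quantities obtained when the voice processing apparatus 14 processes the voice data input from the terminal devices 11a to 11z, in association with the corresponding attribute information.
[0028]
The terminal devices 11a to 11z, the cash register 13, the voice processing apparatus 14, and the database 15 form a LAN (Local Area Network) and can exchange information with one another.
[0029]
The router 16 connects the LAN consisting of the terminal devices 11a to 11z, the cash register 13, the voice processing apparatus 14, and the database 15 to the Internet 17, so that information can be exchanged with, for example, the LAN of another store connected to the Internet 17.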
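[0030]
Next, the detailed configuration of the voice processing apparatus 14 shown in FIG. 2 is described with reference to FIG. 3.
The CPU 14a controls each part of the apparatus and executes various arithmetic operations. The LAN unit 14b exchanges data with other devices using, for example, the CSMA/CD (Carrier Sense Multiple Access with Collision Detection) method.
[0031]
The CD-ROM drive 14c reads required data from a CD-ROM.
The hard disk device 14d stores the programs executed by the CPU 14a and other data.
[0032]
The CRT monitor 14e displays the processing results of the CPU 14a and the like on its screen.
The memory 14f consists of RAM and ROM; the programs the CPU 14a needs for its processing are read from it or temporarily stored in it.
[0033]
The input device 14g consists of, for example, a keyboard and a mouse, and is operated to enter required information.
The correspondence between the principle diagram of FIG. 1 and the embodiment of FIGS. 2 and 3 is as follows.
[0034]
The voice data input accepting means 1 corresponds to the LAN unit 14b. The feature quantity extracting means 2 corresponds to the CPU 14a. The attribute information input accepting means 3 corresponds to the LAN unit 14b.
[0035]
The recording means 4 corresponds to the CPU 14a and the LAN unit 14b. The recording device 5 corresponds to the hard disk device 14d or the database 15.
[0036]
The similarity calculating means 6 corresponds to the CPU 14a. The associating means 7 corresponds to the CPU 14a.
The output means 8 corresponds to the LAN unit 14b. The display device 9 corresponds to the display devices 12a to 12z.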
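[0037]
Next, the operation of the above embodiment is described with reference to the flowchart in FIG. 4. When this flowchart starts, the following processing is executed. The description assumes that voice data is input from the terminal device 11a.
[S1] When the input unit (not shown) of the terminal device 11a is operated, input of voice data starts.
[0038]
Suppose a customer has come into the store and begun talking with a clerk, and the clerk starts voice input by pressing a predetermined key on the input unit of the terminal device 11a. The voice signal output from the microphone 10a is then supplied to the terminal device 11a, converted there into a digital signal, and supplied to the voice processing apparatus 14.
[0039]
Alternatively, the clerk may hold the predetermined key down only while the customer is speaking, so that the voices of the clerk and the customer are not both captured.
[S2] The CPU 14a of the voice processing apparatus 14 extracts a feature quantity from the voice data received through the LAN unit 14b.
[0040]
As this feature quantity, for example, predetermined phonemes (such as "a", "i", and "u") may be cut out of the speech contained in the speaker's utterance, and the formant distribution of the extracted phonemes may be used.
[S3] The CPU 14a reads, through the LAN unit 14b, a feature quantity of voice data registered in the database 15 (a feature quantity of the voice of a customer who visited previously).
[0041]
FIG. 5 shows an example of the structure of the data registered in the database 15. The data shown in this figure consists of a header d1, a feature quantity d2, and attribute information d3. As shown enlarged in the figure, the header d1 consists of a file name d11 and a recording date and time d12. The attribute information d3 consists of the customer's gender d31, (estimated) age d32, purchased products d33, products the customer plans to purchase d34, and other information d35.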
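For illustration only, the record layout of FIG. 5 might be modelled as follows. The class and field names and the choice of Python types are assumptions; the specification defines only the items d1 to d35 and does not fix the internal layout of the feature quantity d2.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Header:                      # d1
    file_name: str                 # d11
    recorded_at: datetime          # d12


@dataclass
class Attributes:                  # d3
    gender: str = ""               # d31
    estimated_age: str = ""        # d32
    purchased_products: list[str] = field(default_factory=list)  # d33
    planned_purchases: list[str] = field(default_factory=list)   # d34
    other: str = ""                # d35


@dataclass
class VoiceRecord:
    header: Header
    feature: list[float]           # d2 (a flat list is an assumed layout)
    attributes: Attributes = field(default_factory=Attributes)
```

Under these assumptions, a record could be created as `VoiceRecord(Header("0204", datetime(1998, 11, 20, 11, 24, 34)), feature=[...])` and its attributes filled in once the clerk has entered them.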
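[0042]
FIG. 6 shows a specific example of the header d1 and the attribute information d3 of FIG. 5. As shown in this figure, the file name d11 is, for example, a number generated by incrementing a counter by one each time new data is input.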
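A minimal sketch of such a file-name counter is given below. The function name `next_file_name` and the four-digit zero padding are assumptions, the latter inferred from example names such as "0204"; the specification states only that the value is incremented by one.

```python
def next_file_name(existing: list[str], width: int = 4) -> str:
    """Return the next sequential file name, zero-padded to `width` digits."""
    last = max((int(name) for name in existing), default=0)
    return f"{last + 1:0{width}d}"
```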
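[0043]
The recording date and time d12 indicates when the data was registered. The gender d31 and age d32 are entered by the clerk through the input unit (not shown) of the terminal device.
[0044]
The purchased products d33 are input, for example, from the cash register 13 when the purchase is paid for.
The planned purchases d34 are information that the clerk infers during the conversation with the customer and enters through the input unit (not shown) of the terminal device.
[0045]
The other information d35 is, for example, the clerk's impression of the customer, entered through the input unit (not shown) of the terminal device.
[S4] The CPU 14a calculates the similarity between the feature quantity read from the database 15 through the LAN unit 14b and the feature quantity extracted in step S2.
[0046]
The similarity may be calculated, for example, by computing the correlation between the formants of each extracted phoneme.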
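The specification does not define the correlation measure. As one possible reading, the sketch below computes a Pearson correlation over the formant frequencies of the phonemes that two feature quantities share, and scales it to a percentage. The dictionary layout (phoneme to list of formant frequencies in Hz), the function names, and the clamping to the range 0 to 100 are all assumptions for illustration.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def similarity(feat_a: dict[str, list[float]],
               feat_b: dict[str, list[float]]) -> float:
    """Similarity in percent over the phonemes present in both features.

    Assumes each shared phoneme has the same number of formants in both.
    """
    shared = sorted(set(feat_a) & set(feat_b))
    if not shared:
        return 0.0
    xs = [f for p in shared for f in feat_a[p]]
    ys = [f for p in shared for f in feat_b[p]]
    return max(0.0, min(1.0, pearson(xs, ys))) * 100.0
```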
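[S5] The CPU 14a determines whether any unprocessed feature quantity remains in the database 15. If so, processing returns to step S3; otherwise it proceeds to step S6.
[0047]
By repeating steps S3 to S5, every feature quantity registered in the database 15 is read into the voice processing apparatus 14, and its similarity to the feature quantity extracted in step S2 is calculated.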
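The loop formed by steps S3 to S5 can be sketched as follows. The function name `score_all`, the dictionary standing in for the database 15, and the caller-supplied `similarity` function are assumptions for illustration.

```python
from typing import Any, Callable


def score_all(new_feature: Any,
              database: dict[str, Any],
              similarity: Callable[[Any, Any], float]) -> dict[str, float]:
    """Steps S3 to S5: score every stored feature against the new one."""
    scores = {}
    for file_name, stored_feature in database.items():  # S3: read next entry
        scores[file_name] = similarity(new_feature, stored_feature)  # S4
    return scores  # S5: the loop ends when no unprocessed entry remains
```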
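[0048]
All of these feature quantities may instead be stored on the hard disk device 14d.
[S6] The CPU 14a determines whether any of the feature quantities read has a similarity equal to or greater than a predetermined value. If at least one exists, processing proceeds to step S7; otherwise it proceeds to step S8.
[0049]
For example, if any data has a similarity of 70% or more, processing proceeds to step S7.
[S7] The CPU 14a reads from the database, through the LAN unit 14b, the attribute information for each feature quantity whose similarity is equal to or greater than the predetermined value, and supplies it to the relevant terminal device 11a together with the similarity of each entry, which serves as a reliability value. The terminal device then displays the supplied attribute information and reliability values as a list on its display device.
[0050]
FIG. 7 shows an example of the information displayed on the display device 12a of the terminal device 11a. In this example, the attribute information presumed to belong to the customer being served is listed together with its reliability, in descending order of reliability. For example, candidate No. 1 is the data with file name "0204": this customer visited at "1998/11/20-11:24:34", is a "male" in his "20s", plans to purchase a "personal computer", and is "interested in graphics". The display also shows that the reliability of the determination that the customer currently in the store is the customer with these attributes is "98%".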
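A candidate list like that of FIG. 7 could be assembled roughly as follows. The function name `candidates`, the dictionary keys, and the default threshold of 70% (taken from the example in paragraph [0049]) are assumptions for illustration.

```python
def candidates(scores: dict[str, float],
               attributes: dict[str, dict],
               threshold: float = 70.0) -> list[dict]:
    """Steps S6 and S7: keep entries at or above threshold, best first."""
    hits = [(name, score) for name, score in scores.items() if score >= threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return [
        {"candidate": rank, "file_name": name, "reliability": score,
         **attributes.get(name, {})}
        for rank, (name, score) in enumerate(hits, start=1)
    ]
```

An empty result corresponds to the "no" branch of step S6, in which the customer is treated as new.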
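[S8] The CPU 14a treats the visiting customer as a new customer and has the customer registered in the database 15.
[S9] The CPU 14a sends a predetermined control code to the terminal device through the LAN unit 14b, and the display device indicates that the customer currently in the store is a new customer.
[0051]
In this example, a message such as "This appears to be a new customer." is displayed on the display device 12a of the terminal device 11a.
[S10] The CPU 14a receives the attribute information output from the cash register 13 and the terminal device.
[S11] The CPU 14a sends the feature quantity obtained in step S2 and the attribute information obtained in step S10 to the database 15, where they are stored in association with each other.
[0052]
As a result, data such as that shown in FIG. 6 is recorded.
According to the above embodiment, when a customer visits, the customer's voice data is captured and sent to the voice processing apparatus 14, where a feature quantity is extracted. The similarity between the extracted feature quantity and the feature quantities recorded in the database 15 (those of every customer who has previously visited and spoken with a clerk) is then calculated, and if any feature quantity has a similarity equal to or greater than the predetermined value, its attribute information is read out and displayed, together with the reliability, on the display device of the terminal device.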
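The branch at step S6 and the registration in steps S10 and S11 might be sketched as below. The sketch assumes that registration follows both branches, which is how the description of FIG. 4 reads; the function names, the dictionary standing in for the database 15, the message strings, and the file naming by entry count are likewise assumptions.

```python
from typing import Any, Callable


def handle_visit(new_feature: Any, database: dict[str, dict],
                 similarity: Callable[[Any, Any], float],
                 threshold: float = 70.0) -> tuple[str, list[tuple[str, float]]]:
    """Steps S3 to S9: return a status message and the ranked candidates."""
    scored = sorted(
        ((similarity(new_feature, rec["feature"]), name)
         for name, rec in database.items()),
        reverse=True,
    )
    hits = [(name, score) for score, name in scored if score >= threshold]
    message = "candidates found" if hits else "probably a new customer"
    return message, hits


def register(database: dict[str, dict], new_feature: Any, attributes: dict) -> str:
    """Steps S10 and S11: store the feature and its attributes together."""
    name = f"{len(database) + 1:04d}"  # assumes entries are never deleted
    database[name] = {"feature": new_feature, "attributes": attributes}
    return name
```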
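[0053]
As a result, when a feature quantity presumed to belong to the customer currently in the store exists, its attribute information is displayed together with the reliability. By referring to this information, the clerk can serve the customer with the customer's past behavior and preferences in mind, and the sales conversation can proceed smoothly.
[0054]
In the above embodiment, attribute information is recorded in the database. Alternatively, attribute information may be left unrecorded and only the identification information and reliability displayed. Even in that configuration, the clerk can tell how many times the customer has visited before, which is useful background for the sales conversation.
[0055]
Data may also be exchanged with other stores (or sales floors) through the Internet 17. With such a configuration, a customer who is recognized as a valued regular at one sales floor or store can be treated the same way at another, making more attentive service possible.
[0056]
Furthermore, in the above embodiment only the feature quantities and their attribute information are recorded in the database 15. However, as shown in FIG. 8, for example, link information that associates highly similar data with one another may be added. This processing is executed by the CPU 14a, which corresponds to the associating means 7, when a new feature quantity is registered in the database 15.
[0057]
With such a configuration, highly similar entries in the database 15 are linked to one another. Therefore, when several entries are highly similar to the voice of the customer in the store and one of them is hit, the others can easily be obtained by following that entry's links.
[0058]
Information such as that shown in FIG. 9 can be used as this link information.
In this example, the feature quantities with file names "0001", "0003", "0005", and "0006" have a mutual similarity (for example, measured relative to file name "0001") of, say, 70% or more. In such a case, link information that cycles through all of these feature quantities may be assigned, as shown in FIG. 9. Then, when a search hits any one of them, all of them can be retrieved by following the link information. For example, if "0003" is retrieved first, its link points to "0001", so the feature quantity with file name "0001" is retrieved; "0006" and "0005" are then retrieved in the same way.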
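The circular link of FIG. 9 can be followed with a few lines of code. In the sketch below the dictionary `ring` encodes the order given in the text (0003, 0001, 0006, 0005, and back to 0003); the function name and the dictionary representation are assumptions.

```python
def follow_ring(next_link: dict[str, str], start: str) -> list[str]:
    """Follow circular links from `start` until the ring returns to it."""
    found = []
    current = next_link[start]
    for _ in range(len(next_link)):  # guard against a malformed ring
        if current == start:
            break
        found.append(current)
        current = next_link[current]
    return found


ring = {"0003": "0001", "0001": "0006", "0006": "0005", "0005": "0003"}
```

Starting from "0003", `follow_ring(ring, "0003")` visits "0001", "0006", and "0005" in that order.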
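[0059]
According to the above embodiment, it is no longer necessary to calculate the similarity against every feature quantity, so the time until information appears on the display device can be shortened, and customers can be served smoothly as a result.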
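One way to realize this saving, sketched under stated assumptions, is to skip every entry that is linked to an entry already scored. The function name `search_with_links`, the dictionary representations of the database and the links, and the 70% threshold are illustrative and not taken from the specification.

```python
from typing import Any, Callable


def search_with_links(new_feature: Any, database: dict[str, Any],
                      next_link: dict[str, str],
                      similarity: Callable[[Any, Any], float],
                      threshold: float = 70.0) -> tuple[dict[str, float], set[str]]:
    """Score one entry per linked group; return (hits, skipped entries)."""
    scores: dict[str, float] = {}
    skipped: set[str] = set()
    for name, feature in database.items():
        if name in skipped:
            continue  # linked to an entry that has already been scored
        scores[name] = similarity(new_feature, feature)
        # Mark the rest of this entry's ring so it is not scored again.
        current = next_link.get(name, name)
        while current != name and current not in skipped:
            skipped.add(current)
            current = next_link.get(current, name)
    hits = {n: s for n, s in scores.items() if s >= threshold}
    return hits, skipped
```

For each hit, the skipped members of its group can then be fetched by following the links rather than by further similarity calculations.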
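[0060]
Next, a configuration example of a second embodiment of the present invention is described with reference to FIG. 10. Parts in this figure that correspond to those in FIG. 2 carry the same reference numerals, and their description is omitted.
[0061]
Compared with FIG. 2, in this embodiment the microphones 10a to 10z are replaced by headsets 22a to 22z, and the cash register 13 is omitted. In addition, the headsets 22a to 22z are connected to a public line 20 through a PBX (Private Branch Exchange) 21. The rest of the configuration is the same as in FIG. 2.
[0062]
The public line 20 is, for example, an analog line or an ISDN line. The PBX 21 distributes calls to the headsets 22a to 22z.
Each of the headsets 22a to 22z consists of a speaker and a microphone and allows a conversation with a user who has placed a call through the PBX 21. The speaker-side voice signal (the user's voice signal) is supplied to the terminal devices 11a to 11z.
[0063]
In this embodiment, when a user who has purchased a product (a personal computer in this example) telephones about the product, the attribute information whose associated feature quantity is highly similar to that of the user's voice data is retrieved and displayed on the display devices 12a to 12z to assist the operator in handling the call.
[0064]
The processing executed in this embodiment is the same as that shown in FIG. 4; only the data handled differs. Its operation is therefore described briefly with reference to FIGS. 11 to 13.
[0065]
FIG. 11 shows an example of the structure of the data stored in the database 15 of the embodiment shown in FIG. 10. As shown in this figure, the embodiment of FIG. 10 differs from that of FIG. 2 only in the attribute information d4. The attribute information d4 consists of the user's gender d41, (estimated) age d42, type of personal computer owned d43, level of computer proficiency d44, inquiry content d45, and the response given d46.
[0066]
FIG. 12 shows details of the header d1 and the attribute information d4 of FIG. 11. As shown in this figure, the file name is a value generated by incrementing a counter by one each time a record is made.
[0067]
The recording date and time d12 indicates when the data was registered. The gender d41 and age d42 are the gender and age inferred from the conversation, entered by the operator through the input unit (not shown) of the terminal device.
[0068]
The computer type d43 indicates the type of personal computer the user owns. The proficiency d44 indicates the user's level of proficiency with the computer, expressed as "beginner", "intermediate", or "advanced". The operator infers this information from the conversation and enters it.
[0069]
The inquiry content d45 is the content of the inquiry, and the response d46 is the answer given to it. The operator enters this information through the input unit (not shown) after the conversation ends.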
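For illustration, the attribute information d4 of the second embodiment might be modelled as follows. The class names, field names, and the use of an enumeration for the three proficiency levels are assumptions; the sample values are those of the FIG. 13 example.

```python
from dataclasses import dataclass
from enum import Enum


class Proficiency(Enum):           # values allowed for d44
    BEGINNER = "beginner"
    INTERMEDIATE = "intermediate"
    ADVANCED = "advanced"


@dataclass
class SupportAttributes:           # d4
    gender: str                    # d41 (inferred by the operator)
    estimated_age: str             # d42 (inferred by the operator)
    pc_model: str                  # d43
    proficiency: Proficiency       # d44
    inquiry: str                   # d45
    response: str                  # d46


example = SupportAttributes(
    gender="male",
    estimated_age="50s",
    pc_model="KZR-10",
    proficiency=Proficiency.BEGINNER,
    inquiry="The power does not turn on",
    response="The power cable was not connected",
)
```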
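[0070]
FIG. 13 shows an example of the attribute information displayed on the terminal-side display device by the processing of step S7 in FIG. 4.
In this example, three candidates are displayed. The attribute information for candidate No. 1, registered at 10:20:30 on January 20, 1998, describes a man in his 50s who owns a KZR-10 personal computer and whose proficiency is beginner. His previous inquiry was "the power does not turn on", and the response recorded is "the power cable was not connected". The reliability of the determination that the caller is this man is 98%.
[0071]
By referring to such information, the operator can learn the type of computer the caller owns, the caller's proficiency, and the content of the previous inquiry, and can use them to provide appropriate information matched to the user's level.
[0072]
The processing functions described above can be implemented by a computer. In that case, the processing content of the functions the voice processing apparatus should have is described in a program recorded on a computer-readable recording medium, and the above processing is realized on the computer by executing this program. Computer-readable recording media include magnetic recording devices and semiconductor memories.
[0073]
For distribution on the market, the program may be stored and distributed on a portable recording medium such as a CD-ROM (Compact Disk Read Only Memory) or a floppy disk, or it may be stored in the storage device of a computer connected through a network and transferred to other computers over the network. For execution, the program is stored on a hard disk device or the like in the computer and loaded into main memory.
[0074]
[Effects of the Invention]
As described above, in the present invention a feature quantity is extracted from a speaker's voice data, the similarity between the extracted feature quantity and the feature quantities recorded in the recording device is calculated, and for feature quantities with a high similarity the attribute information recorded in association with them is displayed on the display device. The attribute information presumed to belong to the speaker can therefore be consulted in order to respond to the speaker appropriately.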
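[Brief Description of the Drawings]
[FIG. 1] A diagram illustrating the principle of the present invention.
[FIG. 2] A block diagram showing an example of the overall configuration of a voice processing system that includes an embodiment of the present invention.
[FIG. 3] A block diagram showing a detailed configuration example of the voice processing apparatus shown in FIG. 2.
[FIG. 4] A flowchart illustrating an example of the processing executed by the voice processing apparatus shown in FIG. 3.
[FIG. 5] A diagram showing an example of the structure of the data stored in the database shown in FIG. 3.
[FIG. 6] A diagram showing an example of the header and attribute information shown in FIG. 5.
[FIG. 7] An example of the information displayed on the terminal-side display device as a result of the processing shown in FIG. 4.
[FIG. 8] A diagram showing an example in which link information is added to the data shown in FIG. 6.
[FIG. 9] A diagram showing how entries are linked by the link information shown in FIG. 8.
[FIG. 10] A block diagram showing a configuration example of the second embodiment of the present invention.
[FIG. 11] A diagram showing an example of the structure of the data stored in the database of the embodiment shown in FIG. 10.
[FIG. 12] A diagram showing an example of the header and attribute information shown in FIG. 11.
[FIG. 13] An example of the information displayed on the terminal-side display device as a result of the processing shown in FIG. 4.
[Explanation of Reference Numerals]
1 Voice data input accepting means
2 Feature quantity extracting means
3 Attribute information input accepting means
4 Recording means
5 Recording device
6 Similarity calculating means
7 Associating means
8 Output means
9 Display device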
BACKGROUND OF THE INVENTION
The present invention relates to a speech processing apparatus and a recording medium, and more particularly, a speech processing apparatus for performing speaker identification for identifying a speaker based on input speech data of the speaker and a program for causing a computer to execute such processing. The present invention relates to a recording medium on which is recorded.
[0002]
[Prior art]
For example, at a major retailer, when a customer purchases a product, the customer's name, address, telephone number, purchased product name, etc. are registered in the customer list, and each customer is referred to by referring to that list. Information that seems to match the needs of the user (for example, information by direct mail) may be provided.
[0003]
[Problems to be solved by the invention]
However, in such a conventional method, there is a problem that registration in the customer list is limited only to customers who have purchased the product. For example, there is a problem in that there is a product to be purchased and information regarding customers who have visited the store many times to determine the product cannot be registered.
[0004]
Also, especially at large retail stores, the number of salesclerks is large, so there are many cases where different salesclerks correspond each time a customer visits. Therefore, it is necessary to repeat the content previously explained to another store clerk, or the explanation received from another store clerk may be repeated again, which is troublesome.
[0005]
The present invention has been made in view of these points, and an object of the present invention is to provide an audio processing device that can refer to information related to past behaviors of customers and users currently supported. .
[0006]
In the present invention, in order to solve the above problems,Voice data input accepting means for accepting input of speaker's voice data, feature quantity extracting means for extracting feature quantities from the voice data accepted by the voice data input accepting means, and feature quantities extracted by the feature quantity extracting means Or a recording means for supplying the original audio data to the recording apparatus for recording, and a feature quantity obtained by processing the other audio data already recorded in the recording apparatus by the feature quantity extraction means Or another feature quantity already recorded in the recording device and a similarity calculation means for calculating the similarity of a new feature quantity extracted by the feature quantity extraction means, and complementing the speaker identification Attribute information input accepting means for accepting input of attribute information to be performed, and the voice data or feature amount having a similarity higher than a predetermined value calculated by the similarity calculating means An output that causes the display device to display identification information for identifying the audio data in descending order of the similarity value calculated by the similarity calculation unit, and an association unit that associates the attribute information with each other And the recording means supplies the attribute information received by the attribute information input receiving means to the recording device in association with the audio data or feature quantity, and records the attribute information. When calculating the similarity, a feature amount other than the feature amount in which the attribute information associated by the association unit is associated with the attribute information associated with the feature amount that has already been calculated. 
The similarity is calculated, and the output means displays the attribute information associated with the audio data together with the identification information for identifying the audio data on the display device. Audio processing apparatus is provided to symptoms.
[0009]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a principle diagram for explaining the principle of the present invention.
[0010]
In this figure, audio data input accepting means 1 accepts audio data that is input via, for example, a microphone and is converted into a digital signal by an A / D converter.
[0011]
The feature amount extraction unit 2 extracts a feature amount from the voice data input from the voice data input reception unit 1.
The attribute information input accepting unit 3 accepts input of attribute information related to the speaker corresponding to the speech data input from the speech data input accepting unit 1.
[0012]
The recording unit 4 records the feature quantity extracted by the feature quantity extraction unit 2 and the attribute information input from the attribute information input receiving unit 3 in the recording device 5 in association with each other.
The recording device 5 records the feature amount of the voice data extracted by the feature amount extraction unit 2 and the attribute information of the speaker corresponding to the voice data in association with each other, and in other words, forms a voice database. ing.
[0013]
The similarity calculation unit 6 calculates the similarity between the feature amount of the newly input audio data and the feature amount of the audio data already recorded in the recording device 5.
The associating means 7 associates the newly input audio data with the one having a high degree of association.
[0014]
The output unit 8 refers to the processing result of the similarity calculation unit 6 and displays the attribute information of the feature amount of the voice data having high similarity to the newly input voice data on the display device 9 in a list format, for example. To do.
[0015]
The display device 9 displays the information supplied from the output unit 8 on the screen.
Next, the operation of the principle diagram shown in FIG. 1 will be described. In the following, description will be made on the assumption that, for example, information on customers who have visited a department store is managed by the apparatus shown in FIG.
[0016]
Suppose that a customer visits the store and utters "I want frameless glasses made of XX." Then, the audio data corresponding to the audio is input by the audio data input receiving unit 1 and supplied to the feature amount extracting unit 2.
[0017]
The feature quantity extraction unit 2 extracts a feature quantity from the supplied audio data. Note that information relating to the physical characteristics of the audio data (for example, information relating to the audio formant distribution) is used as the feature amount.
[0018]
The recording unit 4 assigns a predetermined file name to the extracted feature amount, and supplies it to the recording device 5 for recording.
The similarity calculation means 6 calculates the similarity between the feature quantity extracted by the feature quantity extraction means 2 and the feature quantity already recorded in the recording device. If there is a feature quantity whose similarity exceeds a predetermined value, identification information (for example, a file name) for identifying the feature quantity and an attribute recorded in association with the feature quantity Information and the calculated similarity are supplied to the output means 8.
[0019]
The output unit 8 supplies the identification information, attribute information, and similarity supplied from the similarity calculation unit 6 to the display device 9. The display device 9 displays the identification information, the attribute information, and the similarity supplied from the output unit 8 in a list format.
[0020]
Then, when the response to the customer is completed, the store clerk inputs attribute information (information regarding age, sex, purchased product, etc.) from the attribute information input receiving means 3.
[0021]
The recording unit 4 records the feature quantity extracted by the feature quantity extraction unit 2 and the attribute information input from the attribute information input receiving unit 3 in the recording device 5 in association with each other. As a result, new data is added to the database of the recording device 5.
[0022]
The associating means 7 associates (links) the feature amount of the voice data whose similarity with the newly input voice data exceeds a predetermined value with the newly input feature amount. As a result, each item of the database recorded in the recording device 5 is associated with each other according to the similarity (data whose similarity exceeds a predetermined value is linked to each other). In the case where there are a plurality of pieces of data having a high degree of similarity, when one of the linked data is specified, the other data is acquired by following the link.
[0023]
According to the voice processing apparatus related to the present invention as described above, when voice data of a visitor (voice by free utterance) is input, voice having a high similarity is recorded from the database recorded in the recording device 5. The attribute data of the data is acquired and displayed on the display device 9. Therefore, for customers who have visited the store more than once in the past and who have had a conversation with the store clerk, the attribute information is acquired and displayed, so the store clerk refers to the attribute information. , Business negotiations can be promoted more smoothly.
[0024]
FIG. 2 is a block diagram showing an example of the overall configuration of the speech processing system including the embodiment of the present invention.
In this figure, microphones 10a to 10z convert customer voices into electrical signals and supply them to the terminal devices 11a to 11z, respectively.
[0025]
The terminal devices 11a to 11z are, for example, portable terminal devices, convert audio signals input from the microphones 10a to 10z into digital signals, transmit the digital signals to the audio processing device 14, and transmit from the audio processing device 14. The attribute information thus displayed is displayed on the display devices 12a to 12z. It is also possible to input customer attribute information by operating an input unit (not shown).
[0026]
When the customer purchases a product, the register 13 transmits the product name of the product to the sound processing device 14.
The voice processing device 14 has the configuration shown in FIG. 3, and details thereof will be described later.
[0027]
The database 15 records the feature amount obtained as a result of processing the voice data input from the terminal devices 11a to 11z by the voice processing device 14 and the attribute information in association with each other.
[0028]
Note that the terminal devices 11a to 11z, the register 13, the voice processing device 14, and the database 15 form a LAN (Local Area Network) and can exchange information with each other.
[0029]
The router 16 connects a LAN composed of the terminal devices 11 a to 11 z, the register 13, the sound processing device 14, and the database 15 to the Internet 17, for example, with a LAN of another store connected to the Internet 17. It is possible to exchange information between them.
[0030]
Next, the detailed configuration of the audio processing device 14 shown in FIG. 2 will be described with reference to FIG.
The CPU 14a controls each part of the apparatus and executes various arithmetic processes. The LAN unit 14b exchanges data with other devices based on, for example, CSMA / CD (Carrier Sense Multiple Access with Collision Detection).
[0031]
The CD-ROM drive 14c reads necessary data from the CD-ROM.
The hard disk device 14d records a program executed by the CPU 14a.
[0032]
The CRT monitor 14e displays and outputs the processing result of the CPU 14a on the screen.
The memory 14f includes a RAM and a ROM, and reads out or temporarily stores a program required when the CPU 14a performs arithmetic processing.
[0033]
The input device 14g is configured by, for example, a keyboard and a mouse, and is operated when inputting necessary information.
The correspondence between the principle diagram of FIG. 1 and the embodiment of FIGS. 2 and 3 is shown below.
[0034]
That is, the voice data input receiving unit 1 corresponds to the LAN unit 14b. The feature quantity extraction unit 2 corresponds to the CPU 14a. The attribute information input receiving means 3 corresponds to the LAN unit 14b.
[0035]
The recording unit 4 corresponds to the CPU 14a and the LAN unit 14b. The recording device 5 corresponds to the hard disk device 14d or the database 15.
[0036]
The similarity calculation means 6 corresponds to the CPU 14a. The associating means 7 corresponds to the CPU 14a.
The output unit 8 corresponds to the LAN unit 14b. The display device 9 corresponds to the display devices 12a to 12z.
[0037]
Next, the operation of the above embodiment will be described with reference to the flowchart shown in FIG. When this flowchart is started, the following processing is executed. Note that the following processing will be described assuming that audio data is input from the terminal device 11a.
[S1] When an input unit (not shown) of the terminal device 11a is operated, input of audio data starts.
[0038]
Now, a customer visits the store, a conversation with the store clerk is started, and the voice input process is started by the store clerk operating a predetermined key of the input unit of the terminal device 11a. As a result, the audio signal output from the microphone 10a is supplied to the terminal device 11a, where it is converted into a digital signal and then supplied to the audio processing device 14.
[0039]
Note that only when a customer is speaking, a predetermined key may be continuously pressed to prevent both the store clerk and the customer's voice from being input.
[S2] The CPU 14a of the voice processing device 14 extracts feature amounts from voice data input via the LAN unit 14b.
[0040]
As the feature amount, for example, a predetermined phoneme (“A”, “I”, “U”, etc.) is extracted from the speech included in the speaker's speech, and the formant distribution of the extracted phoneme is used. You can do it.
[S3] The CPU 14a reads the feature value of the voice data registered in the database 15 via the LAN unit 14b (the feature value of the voice of the customer who visited the store before).
[0041]
An example of the structure of data registered in the database 15 is shown in FIG. The data shown in this figure is composed of a header d1, a feature quantity d2, and attribute information d3. The header d1 is composed of a file name d11 and a recording date and time d12, as shown in an enlarged manner in the drawing. The attribute information d3 is composed of the customer's gender d31, (estimated) age d32, purchased product d33, planned purchase product d34, and other d35.
[0042]
FIG. 6 is a diagram showing a specific example of the header d1 and the attribute information d3 shown in FIG. As shown in this figure, the file name d11 is given a numerical value generated by incrementing by one each time new data is input, for example.
[0043]
The recording date and time d12 indicates the date and time when the data was registered. The sex d31 and the age d32 are input by the store clerk operating an input unit (not shown) of the terminal device.
[0044]
For example, the purchased product d33 is input from the register 13 when the purchase amount is settled.
The planned purchase product d34 is obtained by inputting information estimated by the store clerk during the dialogue with the customer from an input unit (not shown) of the terminal device.
[0045]
Furthermore, other d35 is obtained by inputting, for example, an impression that the store clerk has held to the customer from an input unit (not shown) of the terminal device.
[S4] The CPU 14a calculates the similarity between the feature amount of the audio data read from the database 15 via the LAN unit 14b and the feature amount extracted in step S2.
[0046]
As a method for calculating the similarity, for example, a correlation between formants of each extracted phoneme may be calculated.
[S5] The CPU 14a determines whether or not there is still an unprocessed feature amount in the database 15. If there is still an unprocessed feature amount, the CPU 14a returns to step S3, and otherwise proceeds to step S6.
[0047]
By repeating the processing of step S3 to step S5, all feature quantities registered in the database 15 are read into the speech processing device 14, and the similarity between the feature quantity and the feature quantity extracted in step S2 is determined. Will be calculated.
[0048]
It should be noted that all of these feature quantities may be stored in the hard disk device 14d.
[S6] The CPU 14a determines whether or not there is a feature quantity with a similarity greater than or equal to a predetermined value among the read feature quantities. If there is at least one, the process proceeds to step S7. If so, the process proceeds to step S8.
[0049]
For example, if there is data having a similarity of 70% or more, the process proceeds to step S7.
[S7] The CPU 14a reads out the attribute information of the feature quantity whose similarity is a predetermined value or more from the database via the LAN unit 14b, and supplies the similarity of each data to the predetermined terminal device 11a as the reliability. To do. As a result, the terminal device causes the display device to display a list of the supplied attribute information and reliability.
[0050]
FIG. 7 shows an example of information displayed on the display device 12a of the terminal device 11a. In this example, the attribute information presumed to belong to the customer who the store clerk is dealing with and its reliability are listed in descending order of reliability. For example, as the candidate No. 1, data of the file name “0204” is displayed, and this customer comes to “1998/11 / 20-11: 24: 34” and the age is “20s”. He is a “male”, plans to purchase a “computer” and is shown to be “interested in graphics”. It is also shown that the reliability of the determination that the customer having this attribute is a customer who is currently visiting the store is “98%”.
[S8] The CPU 14a registers in the database 15 that the customer who visited the store is a new customer.
[S9] The CPU 14a sends a predetermined control code to the terminal device via the LAN unit 14b, and indicates to the display device that the customer visiting the store is a new customer.
[0051]
In the present example, for example, the message “I think you are a new customer” is displayed on the display device 12a of the terminal device 11a.
[S10] The CPU 14a inputs the attribute information output from the register 13 or the terminal device.
[S11] The CPU 14a sends the feature amount acquired in the process of step S2 and the attribute information acquired in step S10 to the database 15, and stores them in association with each other.
[0052]
As a result, data as shown in FIG. 6 is recorded.
According to the above embodiment, when there is a visitor, the voice data is acquired and sent to the voice processing device 14 where the feature amount is extracted. Then, the degree of similarity between the extracted feature quantity and the feature quantity recorded in the database 15 (feature quantity corresponding to all customers who have visited the store in the past and talked with the store clerk) is calculated, and a predetermined value is calculated. When there is a feature amount having the above similarity, the attribute information is read out and displayed on the display device of the terminal device together with the reliability.
[0053]
As a result, if there is a feature that is estimated to be for a customer visiting the store, the attribute information will be displayed along with the reliability, so the store clerk can refer to this information by referring to this information. In addition, since the customer can be treated in consideration of the past behavior and preferences of the customer, the business negotiation can be smoothly advanced.
[0054]
In the above embodiment, the attribute information is recorded in the database. However, it is also possible, for example, to display only the identification information and the reliability without recording the attribute information. Even in this case, the clerk can tell how many times the customer has visited the store in the past, which can serve as a reference when conducting business negotiations.
[0055]
In addition, data may be exchanged with other stores (or sales floors) via the Internet 17. With such a configuration, a customer who has been recognized as a customer at one sales floor or store can be treated in the same way at other sales floors, making it possible to provide more attentive service.
[0056]
Furthermore, in the above embodiment, only the feature amount and its attribute information are recorded in the database 15. However, link information may also be added, for example, as shown in FIG. 8. Such processing is executed by the CPU 14a, which corresponds to the association means 7, when a new feature amount is registered in the database 15.
[0057]
With such a configuration, items of data recorded in the database 15 that have a high similarity to one another are linked together. Thus, when several items of data have a high similarity to the voice of the customer visiting the store and one of them is hit, the others can be obtained easily by following the links of that data.
[0058]
As such link information, information as shown in FIG. 9 can be used.
In this example, the feature amounts with the file names “0001”, “0003”, “0005”, and “0006” have a similarity (for example, the similarity measured with the file name “0001” as the reference) of, for example, 70% or more. In such a case, as shown in FIG. 9, link information that circulates through all of these feature amounts may be assigned. Then, when a search hits any one of them, all of the feature amounts can be acquired by following the link information. For example, if “0003” is acquired first, its link points to “0001”, so the feature amount with the file name “0001” is acquired. “0006” and “0005” are then acquired in the same way.
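A circulating link of this kind can be sketched as a mapping from each file name to the next one in the group. The following Python fragment is an illustration under assumptions: the order of the links beyond “0003” pointing to “0001” is assumed, and the helper names `follow_links` and `splice_into_ring` are hypothetical.

```python
def follow_links(links, start):
    """Return every file name reachable from `start` by following the
    circular link information, beginning with `start` itself."""
    group = [start]
    current = links.get(start)
    # Stop on returning to a name already seen (normally `start`) or on a
    # missing link, which also guards against a malformed chain.
    while current is not None and current not in group:
        group.append(current)
        current = links.get(current)
    return group


def splice_into_ring(links, existing, new):
    """Insert `new` into the ring immediately after `existing`.

    If `existing` has no link yet, the two records form a ring of two.
    """
    links[new] = links.get(existing, existing)
    links[existing] = new


# Assumed ring: 0003 -> 0001 -> 0006 -> 0005 -> back to 0003
links = {"0003": "0001", "0001": "0006", "0006": "0005", "0005": "0003"}
group = follow_links(links, "0003")

# Registering a hypothetical new record "0007" that resembles "0005"
splice_into_ring(links, "0005", "0007")
```

Storing only a single "next" pointer per record keeps the link information small while still allowing the whole group to be recovered from any one member.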
[0059]
According to the above embodiment, it is not necessary to calculate the similarity for every feature amount, so the time until information is displayed on the display device can be shortened and the customer can be served smoothly.
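One way to read this saving is that, once the similarity has been calculated for one record, the records linked to it need not be compared again. The Python sketch below illustrates that idea under assumptions: features are reduced to single numbers, `toy_similarity` is a placeholder rather than a real speaker-similarity measure, and all function names and the 0.7 threshold are hypothetical.

```python
def ring_members(links, start):
    """List `start` followed by the records linked to it."""
    group = [start]
    current = links.get(start)
    while current is not None and current not in group:
        group.append(current)
        current = links.get(current)
    return group


def toy_similarity(a, b):
    """Placeholder similarity for scalar features in the range 0..1."""
    return 1.0 - abs(a - b)


def search_skipping_linked(new_feature, features, links, threshold=0.7):
    """Compare against one record per linked group.

    Returns (hits, computed), where each hit is
    (file_name, similarity, linked_file_names) and `computed` counts
    how many similarity calculations were actually performed.
    """
    skipped = set()
    hits = []
    computed = 0
    for name, feature in features.items():
        if name in skipped:
            continue  # linked to a record that was already compared
        score = toy_similarity(new_feature, feature)
        computed += 1
        group = ring_members(links, name)
        skipped.update(group)
        if score >= threshold:
            hits.append((name, score, group[1:]))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits, computed


# Hypothetical data: four linked records and one unlinked record
features = {"0001": 0.50, "0002": 0.10, "0003": 0.52,
            "0005": 0.49, "0006": 0.51}
links = {"0003": "0001", "0001": "0006", "0006": "0005", "0005": "0003"}
hits, computed = search_skipping_linked(0.50, features, links)
```

Here only two of the five records are compared directly; the other three are obtained through the link information of the record that was hit.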
[0060]
Next, a configuration example of the second embodiment of the present invention will be described with reference to FIG. 10. In this figure, parts corresponding to those in FIG. 2 are denoted by the same reference numerals, and their description is omitted.
[0061]
In this embodiment, compared with the configuration of FIG. 2, the microphones 10a to 10z are replaced with headsets 22a to 22z, and the register 13 is omitted. Further, the headsets 22a to 22z are connected to the public line 20 via a PBX (Private Branch Exchange) 21. The rest of the configuration is the same as in FIG. 2.
[0062]
The public line 20 is, for example, a public line such as an analog line or ISDN. The PBX 21 has a function of distributing calls to the headsets 22a to 22z.
Each of the headsets 22a to 22z consists of a speaker and a microphone and allows the operator to talk with a user who has called in via the PBX 21. The audio signal on the speaker side (the user's audio signal) is supplied to the terminal devices 11a to 11z.
[0063]
In this embodiment, when a telephone call about a product (in this example, a personal computer) is received from a user who has purchased it, attribute information whose feature amount has a high similarity to that of the user's voice data is retrieved and displayed on the display devices 12a to 12z to support the operator's response.
[0064]
The processing executed in this embodiment is the same as that shown in FIG. 4; only the data handled is different. The operation is briefly described below with reference to the drawings.
[0065]
FIG. 11 is a diagram illustrating an example of the structure of the data stored in the database 15 in the embodiment shown in FIG. 10. As this figure shows, the only difference from the case of FIG. 5 is the attribute information d4. That is, the attribute information d4 consists of the user's gender d41, (estimated) age d42, owned personal computer type d43, personal computer proficiency d44, consultation content d45, and response d46.
[0066]
FIG. 12 is a diagram showing details of the header d1 and the attribute information d4 shown in FIG. 11. As shown in this figure, the file name is a value generated by incrementing by 1 each time a recording is made.
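A sequential file name of this kind could be produced as in the Python sketch below. The four-digit, zero-padded format is inferred from names such as “0204” that appear in the examples; the generator name and its parameters are hypothetical.

```python
import itertools


def file_name_generator(start=1, width=4):
    """Yield zero-padded file names, incrementing by 1 per recording."""
    for number in itertools.count(start):
        yield f"{number:0{width}d}"


names = file_name_generator()
first = next(names)
second = next(names)
```

In a real system the last assigned number would have to be persisted alongside the database so that numbering continues after a restart.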
[0067]
The recording date and time d12 indicates the date and time when the data was registered. The gender d41 and the age d42 are the gender and age that the operator estimates from the conversation and enters by operating an input unit (not shown) of the terminal device.
[0068]
The personal computer type d43 indicates the type of personal computer owned by the user. The proficiency level d44 indicates the user's level of proficiency with the personal computer and takes one of the values “beginner”, “intermediate”, and “advanced”. The operator estimates this information from the content of the conversation and inputs it.
[0069]
The consultation content d45 is the content of the user's consultation. The response d46 is the response given to that consultation. These pieces of information are input by operating an input unit (not shown) after the conversation has ended.
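The record layout described in the preceding paragraphs can be pictured with the following Python sketch. It is only an illustration: the class and field names are hypothetical, the feature amount is assumed to be a list of numbers, and the sample values (including the proficiency level) are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

PROFICIENCY_LEVELS = ("beginner", "intermediate", "advanced")


@dataclass
class AttributeInfo:
    """Attribute information d4 of the second embodiment."""
    gender: str             # d41, estimated by the operator
    age: str                # d42, estimated, e.g. "50s"
    pc_type: str            # d43, type of personal computer owned
    proficiency: str        # d44, one of PROFICIENCY_LEVELS
    consultation: str = ""  # d45, entered after the call
    response: str = ""      # d46, entered after the call

    def __post_init__(self):
        if self.proficiency not in PROFICIENCY_LEVELS:
            raise ValueError(f"unknown proficiency: {self.proficiency!r}")


@dataclass
class Record:
    """One database entry: header fields, feature amount, attributes."""
    file_name: str          # header d1: sequential file name
    recorded_at: datetime   # header d1: recording date and time d12
    feature: List[float]    # feature amount (vector form is assumed)
    attributes: AttributeInfo


record = Record(
    file_name="0001",
    recorded_at=datetime(1998, 1, 20, 10, 20, 30),
    feature=[0.12, 0.80, 0.33],  # made-up values
    attributes=AttributeInfo(
        gender="male",
        age="50s",
        pc_type="KZR-10",
        proficiency="beginner",  # placeholder value
        consultation="power cannot be turned on",
        response="power cable was not connected",
    ),
)
```

Giving d45 and d46 empty defaults reflects the fact that they are entered only after the conversation has ended, whereas the other fields can be filled in during the call.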
[0070]
FIG. 13 is an example of the attribute information displayed on the display device on the terminal device side by the process of step S7 shown in FIG. 4.
In this example, three candidates are displayed. As candidate No. 1, information registered at 10:20:30 on January 20, 1998 is displayed for a man in his 50s who owns a KZR-10, together with his proficiency level. It also shows that the man's previous consultation was that the power could not be turned on and that the response was that the power cable had not been connected. The reliability of the determination that the caller is this man is 98%.
[0071]
By referring to such information, the operator can learn the type of personal computer owned by the user who is calling, the user's proficiency level, and the content of the previous consultation, and can therefore provide appropriate information that matches the user's level.
[0072]
The above processing functions can be realized by a computer. In that case, the processing contents of the functions that the voice processing apparatus should have are described in a program recorded on a computer-readable recording medium, and the above processing is realized on the computer by having the computer execute this program. Examples of the computer-readable recording medium include a magnetic recording device and a semiconductor memory.
[0073]
For distribution on the market, the program may be stored on a portable recording medium such as a CD-ROM (Compact Disk Read Only Memory) or a floppy disk, or it may be stored in the storage device of a computer connected via a network and transferred to other computers through the network. For execution on a computer, the program may be stored on a hard disk device or the like in the computer, loaded into main memory, and executed.
[0074]
[Effect of the invention]
As described above, according to the present invention, a feature amount is extracted from the speaker's voice data, the similarity between the extracted feature amount and the feature amounts recorded in the recording device is calculated, and the attribute information recorded in association with a feature amount having a high similarity is displayed on the display device. An appropriate response to the speaker can therefore be made by referring to the attribute information presumed to be that of the speaker.
[Brief description of the drawings]
FIG. 1 is a principle diagram for explaining the principle of the present invention.
FIG. 2 is a block diagram showing an example of the overall configuration of a voice processing system including an embodiment of the present invention.
FIG. 3 is a block diagram showing a detailed configuration example of the voice processing apparatus shown in FIG. 2.
FIG. 4 is a flowchart for explaining an example of the processing executed in the voice processing apparatus shown in FIG. 3.
FIG. 5 is a diagram showing an example of the structure of the data stored in the database shown in FIG. 3.
FIG. 6 is a diagram showing an example of the header and attribute information shown in FIG. 5.
FIG. 7 is an example of the information displayed on the display device on the terminal device side as a result of the processing shown in FIG. 4.
FIG. 8 is a diagram showing an example in which link information is added to the data shown in FIG. 6.
FIG. 9 is a diagram showing the state of the links based on the link information shown in FIG. 8.
FIG. 10 is a block diagram showing a configuration example of the second embodiment of the present invention.
FIG. 11 is a diagram showing an example of the structure of the data stored in the database in the embodiment shown in FIG. 10.
FIG. 12 is a diagram showing an example of the header and attribute information shown in FIG. 11.
FIG. 13 is an example of the information displayed on the display device on the terminal device side as a result of the processing shown in FIG. 4.
[Explanation of symbols]
1 Voice data input acceptance means
2 Feature amount extraction means
3 Attribute information input acceptance means
4 Recording means
5 Recording device
6 Similarity calculation means
7 Association means
8 Output means
9 Display device

Claims

A voice processing apparatus comprising:
voice data input acceptance means for accepting input of a speaker's voice data;
feature amount extraction means for extracting a feature amount from the voice data accepted by the voice data input acceptance means;
recording means for supplying the feature amount extracted by the feature amount extraction means, or the original voice data, to a recording device to be recorded;
similarity calculation means for calculating the similarity between a feature amount obtained by the feature amount extraction means processing other voice data already recorded in the recording device, or another feature amount already recorded in the recording device, and the new feature amount extracted by the feature amount extraction means;
attribute information input acceptance means for accepting input of attribute information that complements the identification of the speaker;
association means for associating with one another the items of attribute information that correspond to the voice data or feature amounts whose similarity calculated by the similarity calculation means exceeds a predetermined value; and
output means for causing a display device to display identification information for identifying the voice data in descending order of the similarity calculated by the similarity calculation means,
wherein
the recording means supplies the attribute information accepted by the attribute information input acceptance means to the recording device to be recorded in association with the voice data or feature amount,
the similarity calculation means, when calculating the similarity, calculates the similarity with feature amounts other than any feature amount whose attribute information has been associated, by the association means, with the attribute information of a feature amount for which the similarity has already been calculated, and
the output means causes the display device to display the attribute information associated with the voice data together with the identification information for identifying the voice data.

The voice processing apparatus according to claim 1, wherein the recording device is connected via a network.

The voice processing apparatus according to claim 1, wherein the attribute information includes at least one of the customer's appearance, the customer's preferences, a purchased product, or a product the customer plans to purchase.

A computer-readable recording medium on which is recorded a program that causes a computer to function as:
voice data input acceptance means for accepting input of a speaker's voice data;
feature amount extraction means for extracting a feature amount from the voice data accepted by the voice data input acceptance means;
recording means for supplying the feature amount extracted by the feature amount extraction means, or the original voice data, to a recording device to be recorded;
similarity calculation means for calculating the similarity between a feature amount obtained by the feature amount extraction means processing other voice data already recorded in the recording device, or another feature amount already recorded in the recording device, and the new feature amount extracted by the feature amount extraction means;
attribute information input acceptance means for accepting input of attribute information that complements the identification of the speaker;
association means for associating with one another the items of attribute information that correspond to the voice data or feature amounts whose similarity calculated by the similarity calculation means exceeds a predetermined value; and
output means for causing a display device to display identification information for identifying the voice data in descending order of the similarity calculated by the similarity calculation means,
wherein
the recording means supplies the attribute information accepted by the attribute information input acceptance means to the recording device to be recorded in association with the voice data or feature amount,
the similarity calculation means, when calculating the similarity, calculates the similarity with feature amounts other than any feature amount whose attribute information has been associated, by the association means, with the attribute information of a feature amount for which the similarity has already been calculated, and
the output means causes the display device to display the attribute information associated with the voice data together with the identification information for identifying the voice data.