JP2000148189A

JP2000148189A - Speech processing device

Info

Publication number: JP2000148189A
Application number: JP10327230A
Authority: JP
Inventors: Hideyuki Takahashi (秀享 高橋)
Original assignee: Olympus Optical Co Ltd
Current assignee: Olympus Corp
Priority date: 1998-11-17
Filing date: 1998-11-17
Publication date: 2000-05-26
Anticipated expiration: 2018-11-17
Also published as: JP4146949B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech processing device capable of searching recorded speech more accurately. SOLUTION: Speech feature parameters of each speaker are extracted in advance by a feature parameter extraction unit 14, and a speech model is created from them by a speech model creation unit 15 and registered in a speech model recording unit 13B. After a speaker's voice has been recorded, feature parameters of the audio data recorded in an audio data recording unit 13A are extracted by the extraction unit 14, and their similarity to the registered speech model of each speaker is computed by a similarity calculation unit 16 to identify the speaker. The code of the identified speaker, or a code indicating that no speaker could be identified, is then recorded together with the position information of the corresponding audio data.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a speech processing device, and more particularly to a speech processing device capable of speaker identification.

[0002]

[Prior art] In recent years, digital information recording and reproducing devices known as digital recorders have been developed. Such a device converts an audio signal obtained from a microphone or the like into a digital signal and records it in, for example, a semiconductor memory; at playback, it reads the audio signal from the semiconductor memory, converts it into an analog signal, and outputs it as sound through a speaker or the like. JP-A-63-259700 discloses a digital audio recording and reproducing device of this kind.

[0003] For such digital audio recording and reproducing devices, it is desirable to further improve operability and searchability when playing back recorded audio data, and various proposals have been made to that end. For example, in Japanese Patent Laid-Open No. 10-3300 the present applicant discloses a digital audio recording and reproducing device provided with an index mark recording button for playing back a desired range of audio data.

[0004] In Japanese Patent Application No. 9-149728, the present applicant has also proposed an audio data processing control device that allows recorded data transferred from a digital recording device to a personal computer to be handled on the personal computer with simple operations.

[0005] Furthermore, with recent advances in speech processing technology, speech recognition, speaker recognition, and related technologies are becoming practical. For example, Japanese Patent Laid-Open No. 8-153118 discloses an audio data processing device that applies speaker identification technology to search audio data and play back only the speech of a designated speaker.

[0006]

[Problem to be solved by the invention] However, conventional speaker identification technology for speech whose content is not restricted inevitably produces misidentifications. As a result, even when a speaker is designated for playback, the device may fail to play back all of that speaker's speech, which has been a problem.

[0007] The present invention has been made in view of this problem, and its object is to provide a speech processing device that can search recorded speech more accurately.

[0008]

[Means for solving the problem] To achieve the above object, a first speech processing device of the present invention comprises: speech input means for inputting speech; registered speech model recording means for recording, for each speaker, feature parameters of registration speech as a registered speech model; speech recording means for recording the audio data input through the speech input means; speaker identification means for extracting feature parameters from the audio data recorded by the speech recording means and performing speaker identification by obtaining the similarity between those feature parameters and the registered speech model of each speaker; and speaker identification data recording means for recording a speaker code corresponding to the speaker identified by the speaker identification means together with position information of the audio data to which the identification applies. The device is characterized in that, when the speaker identification means cannot identify the speaker of a piece of audio data, the speaker identification data recording means records the position information of that audio data together with a special speaker code indicating that the speaker could not be identified.

[0009] To achieve the above object, a second speech processing device of the present invention is the first speech processing device in which the speaker identification means comprises: feature parameter extraction means for extracting feature parameters from the audio data recorded by the speech recording means; similarity calculation means for obtaining the similarity between the feature parameters extracted by the feature parameter extraction means and the registered speech model of each speaker; and speaker specifying means for identifying the speaker by comparing the similarity calculated by the similarity calculation means with a speaker recognition threshold.

[0010] To achieve the above object, a third speech processing device of the present invention is the first or second speech processing device further comprising voiced-sound detection means for detecting voiced data in the audio data recorded by the speech recording means, wherein the speaker identification means performs speaker identification on the voiced data detected by the voiced-sound detection means.

[0011] To achieve the above object, a fourth speech processing device of the present invention is the third speech processing device in which the voiced-sound detection means performs detection based on voiced/silent information recorded in a header portion of the audio data recorded by the speech recording means.

[0012] To achieve the above object, a fifth speech processing device of the present invention is the first speech processing device further comprising playback means that, when a speaker code is designated, plays back the audio data corresponding to that speaker code and the audio data corresponding to the special speaker code.

[0013]

[Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings.

[0014] FIG. 1 is a block diagram showing the configuration of a digital speech processing device according to an embodiment of the present invention. As shown in FIG. 1, the digital speech processing device of this embodiment comprises: a system control unit 10 that controls the device as a whole; a microphone 1 for inputting external sound and the like; a preamplifier 2 that amplifies the audio signal from the microphone 1; a power amplifier 6 that amplifies the output audio signal processed by an encoding/decoding processing unit 7, described later; a speaker 5 that outputs the amplified audio signal; a CODEC 3, a circuit that removes unnecessary high-frequency components from these input and output audio signals and performs A/D or D/A conversion on them; the encoding/decoding processing unit (DSP) 7, which applies predetermined processing such as encoding to the audio data A/D-converted by the CODEC 3; a flash memory 13 that records, under the control of a memory control unit 8, the audio data processed by the encoding/decoding processing unit 7; a feature parameter extraction unit 14 that extracts predetermined parameters from the audio data A/D-converted by the CODEC 3 or from the decoded audio signal; a speech model creation unit 15 that creates a speech model based on data from the feature parameter extraction unit 14; a similarity calculation unit 16 that calculates similarity based on data from the feature parameter extraction unit 14; a speaker specifying unit 17 that identifies the speaker based on the result calculated by the similarity calculation unit 16; a display unit 9 that displays predetermined status of the device; an operation input unit 11 consisting of operation buttons or switches for recording, playback, and so on; and a power supply 12 for the device.

[0015] The CODEC 3 comprises a low-pass filter 3A that removes unnecessary high-frequency components from the audio signal from the microphone 1, an A/D converter 3B that A/D-converts the resulting analog audio signal, a D/A converter 3C that D/A-converts the audio data from the encoding/decoding processing unit 7, and a low-pass filter 3D that removes unnecessary high-frequency components from the D/A-converted audio signal.

[0016] The encoding/decoding processing unit 7 encodes and decodes speech by, for example, the CELP (Code Excited Linear Predictive Coding) method, and is implemented as a DSP (Digital Signal Processor).

[0017] The system control unit 10 controls the operation of each part of the digital speech processing device and, in this embodiment, is an 8-bit CPU. It is connected to the encoding/decoding processing unit 7 and the flash memory 13 through the memory control unit 8, which is a 16-bit CPU, and is also connected to the CODEC 3, the display unit 9, the operation input unit 11 consisting of operation buttons or switches for recording, playback, and so on, and the power supply 12.

[0018] The flash memory 13 is divided into three areas: an audio data recording unit 13A, a speech model recording unit 13B, and a speaker code recording unit 13C. The flash memory 13 may be built into the speech processing device or may be removable.

[0019] The feature parameter extraction unit 14 is connected to the encoding/decoding processing unit 7 and, via the speech model creation unit 15, to the memory control unit 8. The feature parameter extraction unit 14 is also connected to the memory control unit 8 via the similarity calculation unit 16 and the speaker specifying unit 17.

The main recording and playback operations of the digital speech processing device configured in this way are briefly described next. When the operator performs a recording operation through the operation input unit 11, the analog audio signal input through the microphone 1 is, under the control of the system control unit 10, amplified by the preamplifier 2, and the low-pass filter 3A blocks its unnecessary high-frequency components. The output signal of the low-pass filter 3A is converted into a digital signal by the A/D converter 3B.

[0020] The encoding/decoding processing unit 7 then encodes the digital signal from the A/D converter 3B. The encoded data obtained by this encoding is stored, via the memory control unit 8, in the audio data recording unit 13A area of the flash memory 13.

[0021] During this series of operations, the memory control unit 8 controls the input and output of the signals exchanged between the flash memory 13 and the encoding/decoding processing unit 7. In addition to the encoded data output from the memory control unit 8, the flash memory 13 records header information. Examples of this header information include voiced/silent information, as disclosed in Japanese Patent Laid-Open No. 9-218694 by the present applicant.

[0022] When the operator performs a playback operation through the operation input unit 11, the encoded data is read from the flash memory 13 under the control of the system control unit 10 and supplied, via the memory control unit 8, to the encoding/decoding processing unit 7, which produces decoded data.

[0023] The decoded data is converted into an analog audio signal by the D/A converter 3C of the CODEC 3, and the low-pass filter 3D blocks the unnecessary high-frequency components of that analog audio signal. The signal is then amplified by the power amplifier 6 and output from the speaker 5 as the playback signal.

[0024] Next, the speaker registration operation used for speaker identification in the digital speech processing device is described. Speaker registration is assumed to be performed before the recording operation described above.

[0025] FIG. 2 is a flowchart showing the flow of speaker registration. When the operator performs a speaker registration operation through the operation input unit 11, audio data for speaker registration is input to the feature parameter extraction unit 14 (step S1). The registration audio data may be spoken by the speaker into the microphone 1 at that time, or it may be recorded in advance and the recorded data input instead.

[0026] The feature parameter extraction unit 14 converts the input registration audio data into a representation suited to speaker identification by extracting feature parameters such as pitch and cepstrum (step S2). Next, the time-series data of the feature parameters is input to the speech model creation unit 15, which creates a standard pattern of the feature parameters as the speech model (step S3). The created speech model is then recorded, via the memory control unit 8, in the speech model recording unit 13B of the flash memory 13 (step S4).
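Steps S1 to S4 can be illustrated with a minimal sketch. It rests on loud assumptions: the specification names pitch and cepstrum as example feature parameters and does not say how the standard pattern is formed, so the sketch substitutes two much simpler per-frame features (log energy and zero-crossing rate) and takes the mean feature vector as the model. The names `extract_features`, `build_model`, and `register_speaker` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Frame = Sequence[float]  # one frame of PCM samples (assumed: at least 2 samples)


def extract_features(frame: Frame) -> List[float]:
    """Stand-in for unit 14 (step S2): [log energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [math.log(energy + 1e-12), crossings / (len(frame) - 1)]


def build_model(frames: Sequence[Frame]) -> List[float]:
    """Stand-in for unit 15 (step S3): mean feature vector over all frames."""
    features = [extract_features(f) for f in frames]
    return [sum(column) / len(features) for column in zip(*features)]


def register_speaker(models: Dict[str, List[float]], code: str,
                     frames: Sequence[Frame]) -> None:
    """Step S4: store the model under the speaker's code (role of area 13B)."""
    models[code] = build_model(frames)


# Usage: register one speaker from two tiny illustrative frames (step S1 input).
models: Dict[str, List[float]] = {}
register_speaker(models, "A", [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, 1.0]])
```

A dictionary keyed by speaker code is used here only because it mirrors the one-model-per-speaker layout of the speech model recording unit 13B; a real device would serialize the models into flash memory.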

[0027] The speaker registration operation described above is performed for each speaker who is expected to be recorded.

[0028] Next, the processing that identifies the speaker of recorded data, that is, of the audio data recorded in the audio data recording unit 13A, is described.

[0029] The speaker is identified from the audio data already recorded in the audio data recording unit 13A because speaker identification generally requires an enormous amount of computation, and it is difficult, particularly for a small, inexpensive speech processing device, to identify the speaker of the microphone input while recording.

[0030] Identifying the speaker from the audio data recorded in the audio data recording unit 13A in this way gives flexibility in when the speaker identification processing is executed. For example, the processing may be performed automatically immediately after recording ends, or it may be performed when the operator carries out a speaker identification operation through the operation input unit 11.

[0031] FIG. 3 is a flowchart showing the speaker identification processing. First, the variable f, which indicates the frame number of the audio data, is set to 0 (step S5). Next, 1 is added to the value of f (step S6). It is then determined whether the audio of the frame corresponding to the value of f is voiced (step S7). This determination may use, for example, the method disclosed in Japanese Patent Laid-Open No. 9-281987 by the present applicant, or it may use voiced/silent information recorded in the header in advance, as disclosed in the aforementioned Japanese Patent Laid-Open No. 9-218694.

[0032] If the determination in step S7 is yes, the address in the audio data recording unit 13A at which the audio data of the current frame is recorded is written to the speaker code recording unit 13C (step S8). Next, the audio of the current frame is input to the feature parameter extraction unit 14, which extracts predetermined feature parameters (step S9).

[0033] The similarity calculation unit 16 calculates the similarity between the speech model of each speaker recorded in the speech model recording unit 13B and the extracted feature parameters (step S10). Next, the speaker whose similarity is higher than the speaker identification threshold is identified as the speaker of the current frame (step S11).
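The specification does not state which similarity measure step S10 uses. As one possible illustration only, the sketch below maps the Euclidean distance between a frame's feature vector and each registered model to a score in (0, 1], where 1.0 means the two vectors are identical. The function names and the choice of measure are assumptions, not the method of the specification.

```python
import math
from typing import Dict, Mapping, Sequence


def similarity(features: Sequence[float], model: Sequence[float]) -> float:
    """Assumed measure: 1 / (1 + Euclidean distance), so closer means higher."""
    distance = math.sqrt(sum((x - m) ** 2 for x, m in zip(features, model)))
    return 1.0 / (1.0 + distance)


def score_all(features: Sequence[float],
              models: Mapping[str, Sequence[float]]) -> Dict[str, float]:
    """Step S10: similarity of one frame to every registered speaker model."""
    return {code: similarity(features, model) for code, model in models.items()}


# Usage: the frame matches speaker "A" exactly and is far from speaker "B".
scores = score_all([0.0, 0.0], {"A": [0.0, 0.0], "B": [3.0, 4.0]})
```

Any measure that grows as the frame resembles the model would serve the same role; the only property the later threshold comparison relies on is that larger values mean a better match.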

[0034] If the similarity exceeds the speaker recognition threshold for more than one speaker, or does not exceed it for any speaker, the speaker of that frame is treated as unidentified. The speaker code of the identified speaker, or a code indicating that the speaker could not be identified, is then recorded in the speaker code recording unit 13C in association with the information recorded in step S8 (step S12).
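The decision rule of steps S11 and S12 can be written compactly: a speaker is reported only when exactly one registered speaker exceeds the threshold. In this sketch the name `decide_speaker` and the value `"?"` for the special "could not be identified" code are hypothetical; the specification does not give the code's actual value.

```python
from typing import Mapping

UNKNOWN = "?"  # hypothetical value for the special speaker code


def decide_speaker(similarities: Mapping[str, float], threshold: float) -> str:
    """Return the single speaker above the threshold, otherwise UNKNOWN."""
    above = [code for code, score in similarities.items() if score > threshold]
    # Zero candidates or several candidates both count as "not identified".
    return above[0] if len(above) == 1 else UNKNOWN
```

For example, with a threshold of 0.5, the scores `{"A": 0.9, "B": 0.2}` yield `"A"`, while `{"A": 0.9, "B": 0.8}` and `{"A": 0.1, "B": 0.2}` both yield the special code.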

[0035] Next, it is determined whether the end of the file has been reached (step S13). If not, the process returns to step S6 and repeats for the next frame; if so, the process ends. If the determination in step S7 is no, nothing is done for that frame and the process returns to step S6 to repeat for the next frame.
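The loop of FIG. 3 (steps S5 to S13) can be sketched as follows. The voiced/silent test and the per-frame classifier are passed in as callables because the specification leaves their details to other publications; `identify_file`, `is_voiced`, and `classify` are hypothetical names, and the 1-based frame number stands in for the address in the audio data recording unit 13A. One deliberate difference from the flowchart as described: this sketch also checks for the end of the file after a silent frame, so a recording that ends in silence terminates cleanly.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]


def identify_file(frames: Sequence[Frame],
                  is_voiced: Callable[[Frame], bool],
                  classify: Callable[[Frame], str]) -> List[Tuple[int, str]]:
    """Return (address, speaker code) pairs, the role of area 13C."""
    records: List[Tuple[int, str]] = []
    f = 0                                   # S5: frame counter starts at 0
    while f < len(frames):                  # S13: stop at the end of the file
        f += 1                              # S6: advance to the next frame
        frame = frames[f - 1]
        if not is_voiced(frame):            # S7: silent frames are skipped
            continue
        address = f                         # S8: frame number used as address
        code = classify(frame)              # S9 to S11: features, similarity, decision
        records.append((address, code))     # S12: store code with its address
    return records


# Usage with toy callables: nonzero frames are voiced; frames starting with 1.0 are "A".
demo = identify_file(
    [[0.0, 0.0], [1.0, -1.0], [0.0, 0.0], [2.0, 2.0]],
    is_voiced=lambda fr: any(s != 0.0 for s in fr),
    classify=lambda fr: "A" if fr[0] == 1.0 else "?",
)
```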

[0036] Speaker identification is performed only on voiced sections because silent sections carry no speaker-specific characteristics and are therefore useless for the similarity calculation; restricting identification to voiced sections yields more accurate results.

[0037] After the speaker identification processing described above, the operator can perform a speaker selection operation and a playback operation through the operation input unit 11 to play back only the speech of the selected speaker together with the portions whose speaker could not be identified. For example, when playing back a recording of an interview or a round-table discussion, the operator can listen to a specific speaker only and grasp the content much more quickly.
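Selecting what to play back then reduces to a filter over the recorded (address, speaker code) pairs: keep the entries for the designated speaker and the entries carrying the special code. The sketch below assumes that pair layout and a hypothetical special code `"?"`; `frames_to_play` is an illustrative name.

```python
from typing import List, Sequence, Tuple


def frames_to_play(records: Sequence[Tuple[int, str]],
                   selected_code: str,
                   unknown_code: str = "?") -> List[int]:
    """Addresses of frames for the selected speaker plus unidentified frames."""
    wanted = (selected_code, unknown_code)
    return [address for address, code in records if code in wanted]


# Usage: choosing speaker "A" also returns frame 3, whose speaker is unknown.
playlist = frames_to_play([(1, "A"), (2, "B"), (3, "?"), (4, "A")], "A")
```

Including the unidentified frames is what keeps a designated speaker's speech from being dropped when the identifier was unsure, at the cost of occasionally playing another speaker's frame.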

[0038] With current speaker identification technology it is difficult to always identify the speaker correctly, and misidentifications are frequent. Under these circumstances, rather than always forcing an assignment to one of the speakers, leaving the speaker unassigned when identification fails, as described above, prevents portions from being left unplayed when a speaker is designated.

[0039]

[Effect of the invention] As described above, the present invention provides a speech processing device that can search recorded speech more accurately.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a digital speech processing device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the flow of speaker registration in the digital speech processing device of the embodiment.

FIG. 3 is a flowchart showing the speaker identification processing in the digital speech processing device of the embodiment.

[Explanation of symbols]

1: Microphone
3: CODEC
3A: Low-pass filter
3B: A/D converter
3C: D/A converter
3D: Low-pass filter
7: Encoding/decoding processing unit
8: Memory control unit
10: System control unit
11: Operation input unit
13: Flash memory
13A: Audio data recording unit
13B: Speech model recording unit
13C: Speaker code recording unit
14: Feature parameter extraction unit
15: Speech model creation unit
16: Similarity calculation unit
17: Speaker specifying unit

Claims

[Claims]

1. A speech processing device comprising: speech input means for inputting speech; registered speech model recording means for recording, for each speaker, feature parameters of registration speech as a registered speech model; speech recording means for recording the audio data input through the speech input means; speaker identification means for extracting feature parameters from the audio data recorded by the speech recording means and performing speaker identification by obtaining the similarity between those feature parameters and the registered speech model of each speaker; and speaker identification data recording means for recording a speaker code corresponding to the speaker identified by the speaker identification means together with position information of the audio data to which the identification applies, wherein, when the speaker identification means cannot identify the speaker of a piece of audio data, the speaker identification data recording means records the position information of that audio data together with a special speaker code indicating that the speaker could not be identified.

2. The speech processing device according to claim 1, wherein the speaker identification means comprises: feature parameter extraction means for extracting feature parameters from the audio data recorded by the speech recording means; similarity calculation means for obtaining the similarity between the feature parameters extracted by the feature parameter extraction means and the registered speech model of each speaker; and speaker specifying means for identifying the speaker by comparing the similarity calculated by the similarity calculation means with a speaker recognition threshold.

3. The speech processing device according to claim 1 or 2, further comprising voiced-sound detection means for detecting voiced data in the audio data recorded by the speech recording means, wherein the speaker identification means performs speaker identification on the voiced data detected by the voiced-sound detection means.

4. The speech processing device according to claim 3, wherein the voiced-sound detection means performs detection based on voiced/silent information recorded in a header portion of the audio data recorded by the speech recording means.

5. The speech processing device according to claim 1, further comprising playback means that, when a speaker code is designated, plays back the audio data corresponding to that speaker code and the audio data corresponding to the special speaker code.