KR101023211B1

KR101023211B1 - Microphone array based speech recognition system and target speech extraction method in the system

Info

Publication number: KR101023211B1
Application number: KR1020080088318A
Authority: KR
Inventors: 조훈영; 이윤근; 강점자; 강병옥; 김갑기; 이성주; 정호영; 정훈; 박전규; 전형배
Original assignee: 한국전자통신연구원
Priority date: 2007-12-11
Filing date: 2008-09-08
Publication date: 2011-03-18
Anticipated expiration: 2028-09-08
Also published as: KR20090061566A

Abstract

The present invention relates to a microphone array based speech recognition system using blind source separation and to a target speech extraction method in that system. The speech recognition system separates the mixed signals input through a plurality of microphones by independent component analysis (ICA), extracts from the separated sound source signals the one target speech uttered for speech recognition using a Gaussian mixture model (GMM) or a hidden Markov model (HMM), and automatically recognizes the desired speech from the extracted target speech, thereby securing a higher recognition rate even in the presence of noise.

Microphone array, speech recognition, blind source separation, independent component analysis (ICA), Gaussian mixture model (GMM), hidden Markov model (HMM), target speech, feature vector, log-likelihood ratio (LLR).

Description

MICROPHONE ARRAY BASED SPEECH RECOGNITION SYSTEM AND TARGET SPEECH EXTRACTION METHOD OF THE SYSTEM

The present invention relates to a speech recognition system and a method of extracting target speech in that system, and more particularly to a microphone array based speech recognition system using blind source separation and a target speech extraction method in that system.

The present invention is derived from research conducted as part of the IT Growth Engine Technology Development Program of the Ministry of Knowledge Economy and the Institute for Information Technology Advancement [Project number: 2006-S-036-02, Project title: Development of large-capacity interactive distributed-processing speech interface technology for new growth engine industries].

Conventional speech recognition technology achieves a word recognition rate of 95% or higher when recognizing speech in a relatively quiet environment. In real application environments with various kinds of noise, however, the recognition rate drops sharply, so securing a high recognition rate is essential for commercializing speech recognition technology.

Over recent decades, various noise handling methods have been studied for the preprocessing, recognition, and postprocessing stages of speech recognition, but no method is yet known that handles all kinds of noise well.

Recently, microphone array based blind source separation (BSS) techniques, which separate a desired speech signal using two or more microphones, have been actively studied. Independent component analysis (ICA), one of the main BSS methodologies, can effectively remove or attenuate interference such as nearby speakers or TV or radio noise in devices that take speech input, such as speech recognizers and wired or wireless phones. That is, when there are N sound sources including the input speech and M microphones, and M and N are similar, the original N sound source signals can be recovered from the M microphone input signals.

However, the N sound source signals separated by ICA come out in an arbitrarily permuted order.
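The ordering problem can be illustrated with an instantaneous mixing model x = A s, which is a simplifying assumption made here (real rooms add reverberation). Swapping two sources while swapping the matching columns of the mixing matrix produces the same microphone signals, so an algorithm that sees only the mixtures cannot tell the two orderings apart. The matrix `A` and the toy source values are illustrative only.

```python
def mix(A, S):
    """Instantaneous mixing: X[m][t] = sum_n A[m][n] * S[n][t]."""
    return [[sum(a * s[t] for a, s in zip(row, S)) for t in range(len(S[0]))]
            for row in A]

# Two toy sources (rows) observed over four samples.
S = [[1.0, 0.0, -1.0, 0.0],
     [0.5, 0.5, 0.5, -0.5]]
# Hypothetical 2x2 mixing matrix (two microphones, two sources).
A = [[1.0, 0.6],
     [0.4, 1.0]]

# Swap the source order and the corresponding columns of A.
S_swapped = [S[1], S[0]]
A_swapped = [[row[1], row[0]] for row in A]

X = mix(A, S)
X_swapped = mix(A_swapped, S_swapped)
```

Both orderings explain the observations equally well, which is why a separate step is needed to decide which separated output is the target speech.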

In addition, conventional ICA technology was limited to creating mixed signals by multiplying each sound source signal by an arbitrary weight in the time domain and summing the results, and then separating those mixtures again with an ICA algorithm. With recent advances, the technology can now separate the original sound sources even when real room reverberation is present. Even with such advanced ICA technology, however, no method is yet known for automatically determining what each separated sound source signal corresponds to, so the target speech that should be fed to the recognition system for speech recognition still has to be found automatically.

To solve the problems described above, the present invention provides a microphone array based speech recognition system that automatically finds the desired target speech using microphone array based blind source separation for speech recognition, and a target speech extraction method in that system.

To achieve these objects, a microphone array based speech recognition system according to the present invention comprises: a signal separator that separates the mixed signals input through a plurality of microphones into sound source signals by independent component analysis; a target speech extractor that extracts, from the sound source signals separated by the signal separator, the one target speech uttered for speech recognition; and a speech recognizer that recognizes the desired speech from the extracted target speech. The system further comprises an additional information unit that transmits additional information used for the target speech extraction to the target speech extractor.

A target speech extraction method in a microphone array based speech recognition system according to the present invention comprises: separating the mixed signals input through a plurality of microphones into sound source signals by independent component analysis; extracting, from the separated sound source signals, the one target speech uttered for speech recognition; and recognizing the desired speech from the extracted target speech.

The present invention automatically finds, among the sound source signals separated by independent component analysis, the one target speech uttered for speech recognition, using a hidden Markov model (HMM) and a Gaussian mixture model (GMM). Because this identifies what the separated sound source signals correspond to, a higher recognition rate can be secured even when noise is present during speech recognition.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention.

The speech recognition technology for target speech extraction according to an embodiment of the present invention belongs to blind source separation (BSS), a field of microphone array based signal processing, and to preprocessing for speech recognition in noisy environments. Independent component analysis has recently been developed in connection with blind source separation; it can successfully separate sound sources using two or more microphones and can be applied to various fields such as acoustic signal separation, multichannel EEG signal separation, and image pattern analysis.

A speech recognition system for microphone array based target speech extraction using blind source separation is now described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram illustrating the structure of a microphone array based speech recognition system for target speech extraction according to an embodiment of the present invention.

Referring to FIG. 2, the speech recognition system may include a plurality of microphones 101, a signal separator 110, a target speech extractor 120, a speech recognizer 130, and an additional information unit 140.

Each of the plurality of microphones 101 receives a mixture of the sound source signals S₁(t), S₂(t), …, S_N(t).

The signal separator 110 receives the mixed signals X₁(t), X₂(t), …, X_N(t) output from the respective microphones 101, separates them by independent component analysis (ICA), and outputs the separated sound source signals r₁(t), r₂(t), …, r_N(t).

The target speech extractor 120 receives the separated sound source signals and extracts one target speech from them. The extraction involves feature extraction for the N separated sound source signals, calculation of a log-likelihood ratio (LLR) confidence using a Gaussian mixture model (GMM), confidence comparison, and target speech determination. The target speech extractor 120 can extract the target speech in two cases: when additional information received from the additional information unit 140 is used, and when no additional information is given. The additional information that the extractor takes into account includes gender information (the target speech is male or female, and the other sound sources are of the other gender or are any other signals), information that the sounds other than the target speech are noise signals, and speaker identity information indicating that the target speech is the voice of a specific speaker.

When the additional information is given, the target speech extractor 120 uses Gaussian mixture models to perform a hypothesis test on each sound source signal and determines the sound source with the highest confidence to be the target speech. When the additional information is not given, the target speech extractor 120 calculates the confidence of each sound source using the hidden Markov model (HMM) built into the speech recognition system.

The speech recognizer 130 receives the extracted target speech signal y(t) and recognizes the desired speech.

A method for extracting the target speech in a speech recognition system with this structure is now described in detail with reference to the accompanying drawings.

FIG. 3 is a flowchart illustrating a method for extracting target speech in the target speech extractor of a microphone array based speech recognition system according to an embodiment of the present invention.

Referring to FIG. 3, in step 210 the target speech extractor 120 of the speech recognition system receives the separated sound source signals from the signal separator 110.

Then, in step 220, given the N separated sound source signals sampled at 8 kHz or 16 kHz, the target speech extractor 120 extracts features by computing an N-dimensional feature vector x_t from a 20 ms frame every 10 ms of the audio signal. The feature vector can be computed by various methods such as linear prediction cepstral coefficients (LPCC), perceptual linear prediction (PLP), and mel-frequency cepstral coefficients (MFCC).
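The framing in step 220 can be sketched as follows. This is a minimal illustration that only cuts the signal into 20 ms frames every 10 ms; the function name `frame_signal`, its parameters, and the choice to drop a trailing partial frame are assumptions made here, and the actual LPCC, PLP, or MFCC computation on each frame is not shown.

```python
def frame_signal(samples, sample_rate, frame_ms=20, hop_ms=10):
    """Split a 1-D sample sequence into overlapping frames.

    A new frame starts every `hop_ms` milliseconds and spans `frame_ms`
    milliseconds; a trailing partial frame is dropped (an assumption).
    """
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# One second of silence at 8 kHz: 160-sample frames starting every 80 samples.
frames_8k = frame_signal([0.0] * 8000, sample_rate=8000)
# One second of silence at 16 kHz: 320-sample frames starting every 160 samples.
frames_16k = frame_signal([0.0] * 16000, sample_rate=16000)
```

Each frame would then be passed to a feature extractor to obtain one feature vector x_t.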

In step 230, the target speech extractor 120 computes an LLR confidence for each set of extracted features using a GMM. A GMM is a statistical modeling technique that models an N-dimensional vector space as a sum of M multivariate Gaussian distributions, and it is widely used in speech and speaker recognition. The m-th Gaussian distribution of an arbitrary GMM can be expressed as in Equation 1 below.
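For reference, the textbook density of an N-dimensional multivariate Gaussian is sketched below, writing x for the feature vector and μ_m and Σ_m for the mean vector and covariance matrix of the m-th component. It is offered as the standard form consistent with that description, not as a verbatim copy of the patent's equation.

```latex
\mathcal{N}(x;\,\mu_m,\Sigma_m)
  = \frac{1}{(2\pi)^{N/2}\,|\Sigma_m|^{1/2}}
    \exp\!\left(-\tfrac{1}{2}\,(x-\mu_m)^{\top}\Sigma_m^{-1}(x-\mu_m)\right)
```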

In Equation 1, x, μ_m, and Σ_m denote the feature vector, the mean vector of the m-th Gaussian distribution, and its covariance matrix, respectively. Let X = [x₁, x₂, …, x_T] be the sequence of feature vectors over the T frames that span the entire audio signal. If adjacent feature vectors are independent, the GMM output probability can be expressed as in Equation 2 below.
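Under the independence assumption just stated, the standard GMM likelihood of the whole sequence factorizes over frames. A sketch in the same notation follows, where λ stands for the GMM and w_m is an assumed symbol for the mixture weight of the m-th component:

```latex
p(X \mid \lambda) = \prod_{t=1}^{T} \sum_{m=1}^{M} w_m\,\mathcal{N}(x_t;\,\mu_m,\Sigma_m)
```

In practice the logarithm of this product is used, which turns it into a sum of per-frame log-likelihoods.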

The GMM parameters μ_m and Σ_m can be estimated from a training database using, for example, the well-known maximum likelihood estimation algorithm. In addition, each of the M Gaussian distributions has a mixture weight that represents the contribution of the m-th Gaussian distribution, and these weights sum to 1.
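The GMM output probability described above can be evaluated in the log domain as sketched below. The sketch assumes diagonal covariance matrices for brevity, which the patent does not require, and all function names and the toy parameters are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance `var` at point x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))

def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log sum_m w_m N(x_t; mu_m, var_m).

    Frames are treated as independent; log-sum-exp keeps the mixture
    sum numerically stable.
    """
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gaussian_diag(x, mu, var)
                for w, mu, var in zip(weights, means, variances)]
        peak = max(logs)
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total

# Hypothetical two-component GMM over 2-dimensional features.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
near = gmm_log_likelihood([[0.1, -0.1], [2.9, 3.1]], weights, means, variances)
far = gmm_log_likelihood([[10.0, 10.0], [-8.0, 9.0]], weights, means, variances)
```

Frames that lie close to the component means (`near`) score a higher log-likelihood than frames far from every component (`far`).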

Then, in step 240, the target speech extractor 120 finds K, the index of the sound source with the maximum LLR. In step 250, it checks whether LLR_K > θ; if so, it sets the separated sound source signal K as the target speech in step 260, and otherwise it declares the target speech absent, that is, rejects the input, in step 270.
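The decision in steps 240 to 270 can be sketched as follows. This is a minimal illustration; the function name `select_target` and the use of `None` to signal rejection are assumptions made here.

```python
def select_target(llrs, theta):
    """Return the index of the source with the largest LLR, or None.

    None stands for "target speech absent" (the input is rejected) when
    the largest LLR does not exceed the threshold theta.
    """
    best = max(range(len(llrs)), key=lambda i: llrs[i])
    return best if llrs[best] > theta else None

accepted = select_target([-4.0, 7.5, 1.2], theta=2.0)
rejected = select_target([-4.0, 0.5, 1.2], theta=2.0)
```

In the first call the second source clears the threshold and is selected; in the second call no source does, so the input is rejected.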

As noted above, steps 230 to 270 can be carried out in several ways, depending on the additional information available: gender information, speech-music information, speech-noise information, the speaker's identity information, or no additional information at all. Each of these target speech extraction methods is described in detail below.

First, when gender information is used, that is, when the information provided is that the target sound source is female and the other sound sources are a male speaker's voice or other audio signals, the target speech is extracted as follows.

First, the speech database is divided into male and female speakers, and Gaussian mixture models (GMMs) λ_Male and λ_Female are trained for the respective groups.

Then, letting X¹, X², …, X^N be the feature vector sequences extracted from the N separated sound source signals, the log-likelihood ratio LLR_i of the feature vector sequence Xⁱ with respect to the λ_Male and λ_Female models is calculated as in Equation 3 below.

If the i-th sound source Xⁱ is actually a female speaker's voice, the numerator term in Equation 3 takes a high value, so LLR_i is high; otherwise LLR_i is relatively low.
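Consistent with this description, with the female model in the numerator, the log-likelihood ratio for a female target can be sketched in the usual two-model form. The exact layout of Equation 3 is an assumption here.

```latex
\mathrm{LLR}_i
  = \log \frac{p(X^{i} \mid \lambda_{Female})}{p(X^{i} \mid \lambda_{Male})}
  = \log p(X^{i} \mid \lambda_{Female}) - \log p(X^{i} \mid \lambda_{Male})
```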

The maximum among all the LLR_i values can be found as in Equation 4 below.
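The selection of the maximum can be written compactly as an argmax over the source index. The sketch below assumes that k denotes the index of the winning source:

```latex
k = \arg\max_{1 \le i \le N} \mathrm{LLR}_i
```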

As in step 250, the maximum log-likelihood ratio LLR_k is compared with a predetermined threshold θ. If it exceeds the threshold, the target speech extractor 120 determines X^k to be the target speech and outputs it to the speech recognizer 130. If LLR_k is smaller than the threshold, the target speech extractor 120 determines that no target speech is present among the separated sound source signals. The threshold θ is calculated as the product of the number of frames and a standard threshold, where the standard threshold is determined experimentally for each application system.
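Because the sequence-level LLR is accumulated over frames, scaling the threshold by the number of frames keeps the decision comparable across utterances of different lengths. A minimal sketch follows; the function names and the per-frame value 0.5 are hypothetical.

```python
def frame_scaled_threshold(num_frames, standard_threshold):
    """Threshold theta = (number of frames) * (standard threshold)."""
    return num_frames * standard_threshold

def accept_target(max_llr, num_frames, standard_threshold):
    """True if the best source's LLR exceeds the frame-scaled threshold."""
    return max_llr > frame_scaled_threshold(num_frames, standard_threshold)

# Hypothetical standard threshold of 0.5 per frame.
short_ok = accept_target(max_llr=80.0, num_frames=100, standard_threshold=0.5)
long_ok = accept_target(max_llr=80.0, num_frames=300, standard_threshold=0.5)
```

The same total LLR of 80 is accepted for a 100-frame utterance (threshold 50) but rejected for a 300-frame utterance (threshold 150).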

Conversely, when the information provided is that the target sound source is male and the other sound sources are a female speaker's voice or other audio signals, LLR_i is given by Equation 5 below; the rest of the procedure is the same as described above for a female target.
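For a male target the roles of the two models are exchanged. A sketch in the usual two-model form, assumed here to match the layout of Equation 5:

```latex
\mathrm{LLR}_i = \log \frac{p(X^{i} \mid \lambda_{Male})}{p(X^{i} \mid \lambda_{Female})}
```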

Second, when speech-music information is used, that is, when the information provided is that the target sound source is speech and the other sound sources are music signals, the target speech is extracted as follows.

First, the database is divided into speech data and music data, and Gaussian mixture models (GMMs) λ_Speech and λ_Music are trained for the respective sets.

Then the log-likelihood ratio LLR_i of the feature vector sequence Xⁱ is calculated as in Equation 6 below.

The subsequent target speech extraction is the same as in the first method, so its description is omitted.

Third, when speech-noise information is used, that is, when the information provided is that the target sound source is speech and the other sound sources are noise signals, the target speech is extracted as follows.

First, the database is divided into speech data and noise data, and Gaussian mixture models (GMMs) λ_Speech and λ_Noise are trained for the respective sets.

Then, letting X¹, X², …, X^N be the feature vector sequences extracted from the N separated sound source signals, the log-likelihood ratio LLR_i of the feature vector sequence Xⁱ with respect to the λ_Speech and λ_Noise models is calculated as in Equation 7 below.

Fourth, when the speaker's identity information is used, that is, when the information provided is that the target sound source is a specific speaker known in advance and the other sound sources are other speakers' voices or other audio signals, the target speech is extracted as follows.

First, the database is divided into data from the specific speaker and all other audio signals, and Gaussian mixture models (GMMs) λ_Individual and λ_Others are trained for the respective sets.

Then, letting X¹, X², …, X^N be the feature vector sequences extracted from the N separated sound source signals, the log-likelihood ratio LLR_i of the feature vector sequence Xⁱ with respect to the λ_Individual and λ_Others models is calculated as in Equation 8 below.

Fifth, when no additional information about the target speech is given, it is assumed that one of the N separated sound source signals was uttered for the purpose of using the recognizer, and an LLR based confidence is computed using the hidden Markov model (HMM) that serves as the acoustic model for speech recognition. The confidence is computed as follows.

First, each separated sound source signal is passed through the speech recognizer 130 in a first pass using the recognizer's HMMs, and the HMM acoustic models are then aligned to the word sequence produced as the recognition result.

Next, each feature vector in the feature vector sequence of the i-th sound source is assigned its corresponding HMM state, indexed by m and j to denote the j-th state of the m-th HMM subword model. LLR_i is then calculated as in Equation 9 below.

In Equation 9, […] denotes the word sequence obtained through the speech recognizer 130, […] is the complement of […], and […] is the complement of […]. Because […] is difficult to obtain directly in practice, it can be approximated as in the last line of Equation 9. K is the number of states contained in all the HMMs corresponding to […]. […] is a constant that can be determined experimentally. If this constant is set very large, the second summation term in the last line of Equation 9 is dominated by the largest likelihood value. Conversely, the smaller the constant, the more significant the contributions of the other likelihood values become.
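The behaviour of the experimentally chosen constant can be illustrated with a power mean over K likelihood values, one common way to obtain a summation that is dominated by its largest term when the exponent is large. Treating Equation 9 as a power mean is an assumption made only for this illustration; the name `alpha`, the function name, and the toy likelihoods are hypothetical.

```python
import math

def log_power_mean(log_likelihoods, alpha):
    """Return (1/alpha) * log( (1/K) * sum_k p_k**alpha ), given log p_k."""
    k = len(log_likelihoods)
    scaled = [alpha * l for l in log_likelihoods]
    peak = max(scaled)
    # Log-sum-exp avoids underflow when alpha is large.
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return (log_sum - math.log(k)) / alpha

logs = [math.log(0.1), math.log(0.2), math.log(0.9)]
plain_mean = log_power_mean(logs, alpha=1.0)
near_max = log_power_mean(logs, alpha=200.0)
```

With `alpha=1.0` all three likelihoods contribute equally and the result is the log of their arithmetic mean; with `alpha=200.0` the result is governed almost entirely by the largest likelihood, 0.9.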

The subsequent target speech extraction is the same as in the first method, so a detailed description is omitted.

Finally, when the additional information provided is that the target sound source is speech with a specific property A and the other sound sources are audio signals with a specific property B, the target speech is extracted as follows.

First, the database is divided into speech data with the specific property A and other audio signal data with the specific property B, and Gaussian mixture models (GMMs) λ_{property_A} and λ_{property_B} are trained for the respective sets.

Next, letting X¹, X², …, X^N be the feature vector sequences extracted from the N separated sound source signals, the LLR_i of the feature vector sequence Xⁱ with respect to λ_{property_A} and λ_{property_B} is calculated as in Equation 10 below.
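The general two-model case can be sketched end to end as follows. To keep the example short, each "GMM" is reduced to a single scalar Gaussian, which is a simplification made here; the model parameters, feature values, and function names are hypothetical.

```python
import math

def gaussian_log_likelihood(frames, mean, var):
    """Log-likelihood of scalar frames under one Gaussian (a one-component GMM)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
               for x in frames)

def llr(frames, model_a, model_b):
    """log p(X | model_a) - log p(X | model_b); each model is (mean, var)."""
    return (gaussian_log_likelihood(frames, *model_a)
            - gaussian_log_likelihood(frames, *model_b))

# Hypothetical models for property A (target) and property B (everything else).
model_a = (1.0, 1.0)
model_b = (-1.0, 1.0)

# Three separated sources, each reduced to a short scalar feature sequence.
sources = [[-1.2, -0.8, -1.1], [0.9, 1.3, 1.0], [0.1, -0.2, 0.0]]
scores = [llr(x, model_a, model_b) for x in sources]
best = max(range(len(scores)), key=lambda i: scores[i])
```

The second source, whose features sit near the property A model, receives the highest score. The gender, speech-music, speech-noise, and speaker identity cases follow the same pattern with different model pairs.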

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. The scope of the present invention should therefore not be limited to the described embodiments but should be determined by the claims below and their equivalents.

FIG. 1 is a block diagram illustrating a signal separation system using a general independent component analysis technique;

FIG. 2 is a block diagram illustrating the structure of a microphone array based speech recognition system for target speech extraction according to an embodiment of the present invention; and

FIG. 3 is a flowchart illustrating a method for extracting target speech in the target speech extractor of a microphone array based speech recognition system according to an embodiment of the present invention.

Claims

A signal separator for separating the mixed signals input through a plurality of microphones into sound source signals through independent component analysis;

A target speech extractor for extracting, from among the sound source signals separated by the signal separator, one target speech uttered for speech recognition; and

A speech recognizer for recognizing a desired speech through the extracted target speech,

wherein the target speech extractor

extracts a feature vector sequence from the separated sound source signals, calculates a log-likelihood ratio for each extracted feature vector sequence, finds the maximum value among the calculated log-likelihood ratios, compares the maximum value with a predetermined threshold, and, when the maximum value is greater than the threshold, determines the sound source signal having the maximum value to be the target speech.

The system of claim 1,

further comprising an additional information unit for transmitting additional information used to extract the target speech to the target speech extractor.

(Deleted)

The system of claim 1,

wherein the target speech extractor determines that the target speech does not exist among the separated sound source signals when the maximum value is smaller than the threshold.

The system of claim 2,

wherein, when the additional information is given, the target speech extractor performs a hypothesis test on the separated sound source signals using a Gaussian mixture model and determines the sound source signal with the highest confidence to be the target speech.

The system of claim 5,

wherein the additional information is gender information, speech-music information, speech-noise information, or speaker identity information.

The system of claim 1,

wherein, when no additional information about the target speech is given, the target speech extractor calculates a log-likelihood ratio (LLR) based confidence using a hidden Markov model (HMM), the acoustic model for speech recognition.

Separating the mixed signals input through a plurality of microphones into sound source signals through independent component analysis;

Extracting, from the separated sound source signals, one target speech uttered for speech recognition; and

Recognizing a desired speech through the extracted target speech,

wherein extracting the target speech comprises:

Extracting a feature vector sequence (Xⁱ) from the separated sound source signals;

Calculating a log-likelihood ratio (LLR_i) for the extracted feature vector sequence;

Calculating a maximum value k using the calculated log-likelihood ratio;

Comparing the maximum value with a preset threshold; and

Determining the sound source signal having the maximum value to be the target speech when the maximum value is greater than the threshold.

(Deleted)

The method of claim 8,

further comprising determining that the target speech does not exist among the separated sound source signals when the maximum value is smaller than the threshold.

The method of claim 8,

wherein, when additional information indicating that the target speech is female is given, the log-likelihood ratio (LLR_i) is calculated as in Equation 3 below by passing the extracted feature vector sequence (Xⁱ) through pre-generated male and female Gaussian mixture models (λ_Male, λ_Female).

The method of claim 8,

wherein, when additional information indicating that the target speech is male is given, the log-likelihood ratio (LLR_i) is calculated as in Equation 5 below by passing the extracted feature vector sequence (Xⁱ) through pre-generated male and female Gaussian mixture models (λ_Male, λ_Female).

The method of claim 8,

wherein, when speech-music information is given as the additional information, the log-likelihood ratio (LLR_i) is calculated as in Equation 6 below by passing the extracted feature vector sequence (Xⁱ) through the Gaussian mixture models (λ_Speech, λ_Music) generated in advance according to the speech-music information.

The method of claim 8,

wherein, when speech-noise information is given as the additional information, the log-likelihood ratio (LLR_i) is calculated as in Equation 7 below by passing the extracted feature vector sequence (Xⁱ) through the Gaussian mixture models (λ_Speech, λ_Noise) generated in advance according to the speech-noise information.

The method of claim 8,

wherein, when additional information indicating that the target speech is a specific speaker is given, the log-likelihood ratio (LLR_i) is calculated as in Equation 8 below by passing the extracted feature vector sequence (Xⁱ) through the pre-generated Gaussian mixture models (λ_Individual, λ_Others).

The method of claim 8,

wherein, when additional information indicating that the target speech has a first specific property (Property_A) is given, the log-likelihood ratio (LLR_i) is calculated as in Equation 10 below by passing the extracted feature vector sequence (Xⁱ) through the pre-generated Gaussian mixture models (λ_{Property_A}, λ_{Property_B}).

The method of claim 8, wherein the extracting of the target speech comprises:

Performing additional speech recognition on the separated sound source signals using a hidden Markov model (HMM), the acoustic model for speech recognition, when no additional information on the target speech is given;

Calculating the closest hidden Markov model and the corresponding state sequence for the word sequence obtained from the speech recognition;

Calculating a log-likelihood ratio LLR_i using the calculated hidden Markov model and the corresponding state sequence;

Calculating a maximum value k using the calculated log-likelihood ratio;

Comparing the maximum value with a preset threshold value;

The method of claim 17,

wherein the log-likelihood ratio LLR_i is calculated as in Equation 9 below,

<Equation 9>

where […] denotes the word sequence obtained by performing the speech recognition, […] denotes the complement of […], […] denotes the complement of […], K denotes the number of states included in all hidden Markov models corresponding to […], and […] is a constant that can be determined experimentally.