
CN1932974A - Speaker identifying equipment, speaker identifying program and speaker identifying method - Google Patents


Info

Publication number: CN1932974A
Application number: CNA2005100995434A
Authority: CN (China)
Prior art keywords: feature vector, speech feature, distance, speaker, vector sequence
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 柿野友成, 伊久美智则
Current Assignee: Toshiba Tec Corp
Original Assignee: Toshiba Tec Corp
Application filed by Toshiba Tec Corp

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

In the speaker distance calculation section of the speaker recognition apparatus, a quantization distance is obtained between each speech feature vector of a speech feature vector sequence, generated from the speech of a speaker to be recognized, and a representative vector in a codebook. Each speech feature vector is quantized based on this quantization distance, and the quantization distortion is obtained using a high-order speech feature vector group in the speech feature vector sequence. In the recognition section of the speaker recognition apparatus, speaker recognition is performed based on the quantization distortion, for example the average value of a plurality of quantization distortions.

Description

Speaker recognition device, speaker recognition program, and speaker recognition method

Technical Field

The present invention relates to a speaker recognition device, a computer program for speaker recognition, and a speaker recognition method for identifying a speaker by using the personal information contained in sound waves.

Background Art

A text-dependent speaker recognition device, which recognizes a speaker from speech of a predetermined content, and a text-independent speaker recognition device, which identifies a speaker from speech of arbitrary content, have both been proposed as speaker recognition devices.

A speaker recognition device generally converts an input sound wave into an analog signal, converts the analog signal into a digital signal, performs dispersion analysis of the digital signal, and then generates a speech feature vector sequence that includes personal information. Here, cepstrum coefficients are used as the speech feature vectors. In the registration mode, the speaker recognition device clusters the speech feature vector sequence into a predetermined number of clusters, for example thirty-two, and generates a representative vector that is the centroid of each cluster (see Furui, "Speech Information Processing", Morikita Shuppan Co., Ltd., Japan, first edition, pp. 56-57). In the identification mode, the speaker recognition device calculates, for each speech feature vector, the distance between the speech feature vector sequence generated from the input sound wave and the codebook registered in advance in the registration mode, computes the average value (average distance), and identifies the speaker based on that average distance.
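The registration-mode processing described above (cluster the cepstral frames, keep one centroid per cluster as the codebook) can be sketched as follows. This is a minimal sketch using a plain Lloyd (k-means) iteration and NumPy; `train_codebook` and its parameters are illustrative names, not the patent's exact clustering algorithm:

```python
import numpy as np

def train_codebook(features, codebook_size=32, n_iter=20, seed=0):
    """Cluster a speech feature vector sequence (frames x order) into
    codebook_size clusters with a plain Lloyd iteration and return the
    cluster centroids as the codebook."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with randomly chosen feature vectors.
    idx = rng.choice(len(features), size=codebook_size, replace=False)
    centroids = features[idx].astype(float).copy()
    for _ in range(n_iter):
        # Assign every frame to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned frames.
        for k in range(codebook_size):
            members = features[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids
```

The codebook size of 32 matches the example in the text; one such codebook would be trained and stored per registered speaker.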

When the speaker recognition device is used as a speaker verification device, the distance between the speech feature vector sequence generated from the speech of the speaker to be recognized and the codebook of that speaker is calculated, and this distance is compared with a threshold to perform speaker verification. When the speaker recognition device is used as a speaker identification device, the distances between the speech feature vector sequence generated from the speech of the speaker to be identified and the codebooks of all registered speakers are calculated, and the shortest of the distances corresponding to the registered speakers is selected to perform speaker identification.
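The two decision rules above can be sketched as follows (a hedged sketch assuming NumPy; `codebook_distance`, `verify`, and `identify` are illustrative names, and the threshold value would be tuned in practice):

```python
import numpy as np

def codebook_distance(features, codebook):
    """Average, over all frames, of the distance to the nearest representative vector."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def verify(features, claimed_codebook, threshold):
    """Speaker verification: accept when the average distance is below the threshold."""
    return codebook_distance(features, claimed_codebook) < threshold

def identify(features, codebooks):
    """Speaker identification: pick the registered speaker whose codebook
    yields the shortest average distance."""
    return min(codebooks, key=lambda name: codebook_distance(features, codebooks[name]))
```

Verification compares one distance against a threshold; identification takes the argmin over all registered speakers' codebooks.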

Currently, cepstrum coefficients, which reflect the shape of the vocal tract, or pitch, which indicates the vibration frequency of the vocal cords, are generally used as speech feature quantities. This information includes phonological information indicating the content of the utterance and personal information depending on the speaker. When the difference between speakers' voices is calculated as a distance, the dispersion of the phonological information is larger than that of the personal information, so it is undesirable to compare across differing phonological information; rather, it is desirable to compare identical phonological information. Therefore, in existing speaker recognition devices, approximate normalization of phonemes is performed by clustering the vector dispersion in the observation space, and the personality-reflecting speaker distance obtained by comparing approximately identical phonemes is calculated as the amount of distortion.

However, when clustering the speech feature vector sequence, the question is to which order the speech feature vectors should be set. In general, much of the phonological information resides in the low orders, while much of the personal information resides in the high orders. Therefore, if the speech feature vector order is set low when clustering in order to improve phonological discrimination performance, speaker discrimination performance may be degraded. Conversely, if the order is set high in order to improve speaker discrimination performance, phonological discrimination performance may be degraded. This leads to a trade-off. Because of this problem, the speech feature vector order is currently set to the most suitable order determined experimentally.

Summary of the Invention

Therefore, an object of the present invention is to eliminate the trade-off between phonological discrimination performance and speaker discrimination performance, and to realize accurate speaker recognition.

According to one aspect of the present invention, there is provided a speaker recognition device in which distances between the speech feature vectors of a first speech feature vector sequence, generated from the speech of a speaker to be registered, are obtained based on a low-order speech feature vector group of the first speech feature vector sequence. The first speech feature vector sequence is clustered based on the obtained distances, and a codebook including a plurality of representative vectors is generated and stored. Based on a low-order speech feature vector group of a second speech feature vector sequence generated from the speech of a speaker to be recognized, a quantization distance is obtained between (a) each speech feature vector of the second speech feature vector sequence and (b) a corresponding one of the plurality of representative vectors stored in the codebook. Each speech feature vector of the second speech feature vector sequence is quantized based on the obtained quantization distance. Then, based on a high-order speech feature vector group of the second speech feature vector sequence, a quantization distortion is obtained between each speech feature vector of the second speech feature vector sequence and the corresponding one of the plurality of representative vectors stored in the codebook. Speaker recognition is performed based on the obtained quantization distortion.

According to another aspect of the present invention, there is provided a speaker recognition device in which weighted vector distances based on a first weight are obtained between the speech feature vectors of a first speech feature vector sequence generated from the speech of a speaker to be registered. The first speech feature vector sequence is clustered based on the obtained weighted vector distances, and a codebook including a plurality of representative vectors is generated and stored. A weighted quantization distance based on a second weight is obtained between a corresponding one of the plurality of representative vectors stored in the codebook and each speech feature vector of a second speech feature vector sequence generated from the speech of the speaker to be recognized. Each speech feature vector of the second speech feature vector sequence is quantized based on the obtained weighted quantization distance. A weighted quantization distortion, based on a third weight different from the first and second weights, is then obtained between the corresponding one of the plurality of representative vectors stored in the codebook and each speech feature vector of the second speech feature vector sequence. Speaker recognition is performed based on this quantization distortion.

According to the present invention, highly accurate speaker recognition can be realized.

Brief Description of the Drawings

Fig. 1 is a block diagram showing the structure of a speaker recognition device of the present invention;

Fig. 2 is a schematic diagram of the clustering performed to obtain representative vectors from a speech feature vector sequence;

Fig. 3 is a block diagram showing the structure of a speaker recognition section provided in the speaker recognition device;

Fig. 4 is a schematic diagram showing the structure of feature vectors; and

Fig. 5 is a block diagram showing an example structure of the speaker recognition device of the present invention realized by software.

Detailed Description of the Embodiments

A first embodiment of the present invention will be described with reference to Figs. 1 to 4. Fig. 1 is a block diagram showing the structure of a speaker recognition device 100 of the first embodiment.

As shown in Fig. 1, the speaker recognition device 100 includes a microphone 1, a low-pass filter 2, an A/D converter 3, a feature vector generation section 4, a speaker recognition section 5, a speaker model generation section 6, and a storage section 7. With these sections and elements, various means (or steps) can be carried out.

The microphone 1 converts the input speech into an electrical analog signal. The low-pass filter 2 removes frequencies higher than a predetermined frequency from the input analog signal. The A/D converter 3 converts the input analog signal into a digital signal at a predetermined sampling frequency and quantization bit number. A speech input section 8 comprises the microphone 1, the low-pass filter 2, and the A/D converter 3.

The feature vector generation section 4 performs discrete analysis of the input digital signal and generates and outputs an M-order speech feature vector sequence (a time series of feature parameters). The feature vector generation section 4 also includes a switch (not shown) for selecting between a registration mode and an identification mode. Depending on the selected mode, in the registration mode the feature vector generation section 4 is electrically connected to the speaker model generation section 6 and outputs the M-order speech feature vector sequence to the speaker model generation section 6, while in the identification mode it is electrically connected to the speaker recognition section 5 and outputs the M-order speech feature vector sequence to the speaker recognition section 5. In this embodiment of the present invention, the M-order speech feature vector sequence is a 16-order speech feature vector sequence (M = 16), and each feature vector comprises the 1st- to 16th-order LPC cepstrum coefficients, but the invention is not limited to this example.

In the registration mode, the speaker model generation section 6 generates a codebook, as a speaker model, from the speech feature vector sequence generated at the feature vector generation section 4. The storage section 7 is a dictionary that stores (registers) the codebook generated at the speaker model generation section 6.

The speaker recognition section 5 calculates the distance between a codebook stored in advance in the storage section 7 and the speech feature vector sequence generated at the feature vector generation section 4, identifies the speaker based on that distance, and outputs the result as the speaker recognition result.

Next, the speaker model generation section 6 will be described with reference to Fig. 2, which is a schematic diagram of the clustering performed to obtain representative vectors (centroids) from a speech feature vector sequence.

As shown in Fig. 2, in the registration mode the speaker model generation section 6 clusters the M-order speech feature vector sequence, generated at the feature vector generation section 4 from the speech of the speaker to be registered, into a number of clusters corresponding to a predetermined codebook size. The speaker model generation section 6 then obtains, for each cluster, the centroid (the weighted center of the cluster) as the representative vector of that cluster, and registers the plurality of representative vectors (one centroid per cluster) in the storage section 7 (dictionary) as the elements of a codebook. A codebook is generated for each registered speaker.

Here, clustering is performed using the N-order (N < M) speech feature vector sequence (the shaded area in Fig. 2) within the M-order speech feature vector sequence, and M-order representative vectors are obtained. The N-order speech feature vector sequence is the low-order speech feature vector group.

The vector distance D1 between the vectors used in clustering can be obtained from the following formula (1). In this embodiment of the present invention, N = 8, M = 16, and the codebook size is 32.

D1 = [ Σ_{K=1}^{N} (X_K − Y_K)² ]^{1/2}    … (1)

where D1: vector distance
      X_K, Y_K: M-order feature vectors
      N < M

That is, the speaker model generation section 6 obtains the vector distances D1 according to formula (1) using the N-order speech feature vector sequence within the M-order speech feature vector sequence generated at the feature vector generation section 4 in the registration mode, clusters the M-order speech feature vector sequence based on the obtained vector distances D1, and generates a codebook composed of a plurality of M-order representative vectors.
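The key point of this registration step, that the clustering distance D1 of formula (1) uses only the first N elements while the centroids keep all M elements, can be sketched as follows (a hedged NumPy sketch with a plain Lloyd iteration; `cluster_codebook` and its parameters are illustrative names):

```python
import numpy as np

N, M = 8, 16  # low-order prefix length and full feature order in this embodiment

def d1(x, y, n=N):
    """Vector distance D1 of formula (1): Euclidean distance over the
    first n (low-order) elements of the M-order vectors only."""
    return float(np.sqrt(np.sum((x[:n] - y[:n]) ** 2)))

def cluster_codebook(features, codebook_size=32, n_iter=20, seed=0):
    """Cluster M-order feature vectors using only their N-order prefix for
    the distance, while averaging the full M-order vectors, so that the
    resulting representative vectors are M-order."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), codebook_size, replace=False)].astype(float)
    for _ in range(n_iter):
        # D1 of formula (1): distance over the first N elements only.
        dists = np.linalg.norm(features[:, None, :N] - centroids[None, :, :N], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(codebook_size):
            members = features[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)  # full M-order centroid
    return centroids
```

Cluster membership is thus decided by the phonologically rich low orders, yet each stored representative vector still carries the personally rich high orders.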

Next, the speaker recognition section 5 will be described with reference to Fig. 3, which is a block diagram showing its structure.

As shown in Fig. 3, the speaker recognition section 5 includes a speaker distance calculation section 11 and a recognition section 12.

The speaker distance calculation section 11 calculates the distance between the plurality of representative vectors stored in the codebook and the M-order speech feature vector sequence generated at the feature vector generation section 4 from the speech of the speaker to be recognized (the distance between the codebook and the feature vector sequence). That is, for each feature vector of the M-order speech feature vector sequence generated at the feature vector generation section 4, the speaker distance calculation section 11 calculates the distance between that feature vector and a representative vector of the codebook in the storage section 7 (the distance between a representative vector and a feature vector).

Here, the distance between the codebook and the feature vector sequence is obtained by: (a) quantizing each M-order speech feature vector of the speech feature vector sequence based on the quantization distance D2 between representative vectors and the feature vector, calculated using the N-order elements; and (b) obtaining the distortion distance D3 (quantization distortion) between the selected representative vector and the feature vector using the M-order speech feature vector. The distance between the codebook and the feature vector sequence is then calculated as the average of the obtained quantization distortions. Here, the N-order speech feature vector sequence is the low-order speech feature vector group, and the M-order speech feature vector sequence is the high-order speech feature vector group.

The quantization distance D2 between a representative vector and a feature vector used in the quantization process can be obtained from the following formula (2), and the distortion distance D3 from the following formula (3).

D2 = [ Σ_{K=1}^{N} (C_K − X_K)² ]^{1/2}    … (2)

where D2: representative vector-feature vector distance (quantization distance)
      C_K: representative vector
      X_K: M-order feature vector

D3 = [ Σ_{K=1}^{M} (C_K − X_K)² ]^{1/2}    … (3)

where D3: representative vector-feature vector distance (distortion distance)
      C_K: representative vector
      X_K: M-order feature vector

The speaker distance calculation section 11 obtains the quantization distance D2 according to formula (2). D2 is the quantization distance between a representative vector and a feature vector, that is, between each speech feature vector of the M-order speech feature vector sequence generated at the feature vector generation section 4 and each of the plurality of representative vectors of the codebook stored in the storage section 7 in the registration mode. Then, based on the obtained quantization distances D2, quantization of the M-order speech feature vector sequence is performed using the N-order speech feature vector sequence; that is, each speech feature vector of the M-order speech feature vector sequence is quantized. The distortion distance D3 between the representative vector and the feature vector is then calculated according to formula (3). D3 is the distortion distance between the plurality of representative vectors stored in the codebook of the storage section 7 and the M-order speech feature vector sequence generated at the feature vector generation section 4.
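The identification-mode computation just described (choose the representative vector by D2 over the first N elements, then accumulate D3 over all M elements, then average) can be sketched as follows. This is a minimal NumPy sketch; `sequence_distortion` is an illustrative name:

```python
import numpy as np

N, M = 8, 16  # low-order prefix length and full feature order

def sequence_distortion(features, codebook):
    """Codebook-to-sequence distance of the identification mode: each frame is
    quantized by the quantization distance D2 (formula (2), first N elements),
    then the distortion distance D3 (formula (3), all M elements) to the chosen
    representative vector is accumulated, and the average is returned."""
    total = 0.0
    for x in features:
        d2 = np.linalg.norm(codebook[:, :N] - x[:N], axis=1)  # D2, formula (2)
        c = codebook[d2.argmin()]                             # quantization step
        total += np.sqrt(np.sum((c - x) ** 2))                # D3, formula (3)
    return total / len(features)
```

Note that the representative vector is selected on phonological (low-order) similarity, while the distortion that drives the speaker decision is measured over the full vector, including the personally rich high orders.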

In this embodiment, the quantization distortion is obtained using the M-order speech feature vector sequence, but the invention is not limited to this example. For example, the quantization distortion may be obtained using a speech feature vector sequence that includes the (m to M)-order (N < m < M) speech feature vector sequence (the high-order speech feature vector sequence). A speech feature vector sequence including the (m to M)-order (N < m < M) speech feature vector sequence qualifies as a high-order speech feature vector group; it suffices that the high-order speech feature vector group include the high-order speech feature vector sequence. The (m to M)-order (N < m < M) speech feature vector sequence may be any of the following: a speech feature vector sequence including only the (m to M)-order cepstrum coefficients shown in the shaded area of Fig. 4(b); a speech feature vector sequence including the (m to M)-order cepstrum coefficients and a part of the (1 to N)-order cepstrum coefficients, shown in the shaded area of Fig. 4(c); or a speech feature vector sequence including the (1 to M)-order cepstrum coefficients shown in the shaded area of Fig. 4(d) (the M-order speech feature vector sequence). Here, the (1 to N)-order cepstrum coefficients (the shaded area in Fig. 4(a)) are low-order cepstrum coefficients, and the (m to M)-order cepstrum coefficients are high-order cepstrum coefficients. High-order cepstrum coefficients contain more personal information than low-order cepstrum coefficients, and low-order cepstrum coefficients contain more phonological information than high-order cepstrum coefficients. In this embodiment, N = 8 and M = 16, but the invention is not limited to these values.

The recognition section 12 identifies the speaker based on the average of the quantization distortions obtained at the speaker distance calculation section 11, and outputs the identification result as the speaker recognition result. When the speaker recognition device 100 is used as a speaker verification device, the speaker distance calculation section 11 calculates the distance (the average of the quantization distortions) between the speech feature vector sequence generated from the speech of the speaker to be recognized and the plurality of representative vectors stored in the codebook of the speaker to be recognized, and the recognition section 12 recognizes the speaker by comparing this distance with a threshold. When the speaker recognition device 100 is used as a speaker identification device, the speaker distance calculation section 11 calculates the distances between the speech feature vector sequence generated from the speech of the speaker to be recognized and the plurality of representative vectors stored in the codebooks of all registered speakers, and the speaker is then identified by selecting the shortest of these distances.

According to the first embodiment of the present invention, in the registration mode the vector-to-vector distance D1 of each speech feature vector can be obtained using the N-order vector elements of the M-order speech feature vector sequence generated from the speech of the speaker to be registered. The M-order speech feature vector sequence is clustered based on the vector distances D1, and a codebook composed of a plurality of M-order centroids is generated. In the identification mode, each speech feature vector of the M-order speech feature vector sequence generated from the speech of the speaker to be recognized is quantized based on the quantization distance D2 between the N-order vector elements of that feature vector and of each representative vector of the codebook, the distortion distance D3 is obtained using the M-order vector elements, and speaker recognition is performed based on the average of the quantization distortions. With the above structure, the trade-off between phonological discrimination performance and speaker discrimination performance can be eliminated and a good balance between them ensured, so highly accurate speaker recognition can be realized.

In this embodiment of the present invention, the first speech feature vector sequence generated from the speech of the speaker to be registered and the second speech feature vector sequence generated from the speech of the speaker to be recognized are both M-order speech feature vector sequences, the low-order speech feature vector group is an N-order (N < M) speech feature vector sequence, the codebook includes M-order representative vectors, and the high-order speech feature vector group is the M-order speech feature vector sequence. Stable recognition performance can therefore be reliably ensured.

Alternatively, according to this embodiment of the present invention, the first and second speech feature vector sequences are both M-order speech feature vector sequences, the low-order speech feature vector group is an N-order (N < M) speech feature vector sequence, the codebook includes M-order representative vectors, and the high-order speech feature vector group is a speech feature vector sequence including the (m to M)-order (N < m < M) speech feature vector sequence. Stable recognition performance can therefore be reliably ensured.

A second embodiment of the present invention will now be described. The second embodiment is a modification of the speaker recognition section 5 and the speaker model generation section 6 of the first embodiment. Parts of the second embodiment having the same structure as in the first embodiment are therefore denoted by the same reference numerals, and their description is omitted except for the speaker recognition section 5 and the speaker model generation section 6.

The speaker model generation section 6 according to the second embodiment will be described with reference to Fig. 2. In the registration mode, the speaker model generation section 6 clusters the M-order speech feature vector sequence, generated at the feature vector generation section 4 from the speech of the speaker to be registered, into a number of clusters corresponding to a predetermined codebook size, obtains for each cluster the centroid (the weighted center of the cluster) as the representative vector of that cluster, and registers the plurality of representative vectors in the storage section (dictionary) 7 as a codebook. A codebook is generated for each registered speaker.

Here, clustering is performed using the M-order speech feature vector sequence, and M-order representative vectors are obtained. The weighted vector distance D1 between the vectors used in clustering can be obtained from the following formula (4). In this embodiment, N = 8, M = 16, and the codebook size is 32.

D1 = [ Σ_{K=1}^{M} U_K (X_K − Y_K)² ]^{1/2}    … (4)

where D1: vector distance
      U_K: weight, U_K = 1 (K ≤ N), U_K = 0 (K > N)
      X_K, Y_K: M-order feature vectors

The speaker model generation section 6 obtains each weighted vector distance D1 according to formula (4) using the M-order speech feature vector sequence generated at the feature vector generation section 4, clusters the M-order speech feature vector sequence based on the obtained weighted vector distances D1, and generates a codebook composed of a plurality of M-order representative vectors.
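The weighted distance of formula (4) can be sketched as follows (a hedged NumPy sketch; `weighted_d1` is an illustrative name). With the binary weight of the second embodiment, U_K = 1 for K ≤ N and 0 otherwise, formula (4) reduces to the low-order distance of formula (1):

```python
import numpy as np

N, M = 8, 16

def weighted_d1(x, y, u):
    """Weighted vector distance of formula (4):
    D1 = [ sum_K U_K * (X_K - Y_K)^2 ]^(1/2)."""
    return float(np.sqrt(np.sum(u * (x - y) ** 2)))

# The second embodiment's weight: U_K = 1 for K <= N, 0 for K > N.
# Under this weight, formula (4) coincides with the prefix distance of formula (1).
u = np.where(np.arange(1, M + 1) <= N, 1.0, 0.0)
```

General (non-binary) weight vectors allow the relative contribution of each cepstral order to be tuned separately for clustering, quantization, and distortion computation.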

Next, the speaker recognition section 5 (see FIG. 3) according to the second embodiment will be explained. The speaker recognition section 5 basically has a structure similar to that of the first embodiment, and includes a speaker distance calculation section 11 and a recognition section 12.

The speaker distance calculation section 11 calculates the distance between the plurality of representative vectors stored in the codebook of the storage section 7 and the M-order speech feature vector sequence generated at the feature vector generation section 4 from the voice of the speaker to be identified (the codebook-to-feature-vector-sequence distance). That is, for each feature vector in the M-order speech feature vector sequence generated at the feature vector generation section 4, the speaker distance calculation section 11 calculates the distance between that feature vector and a representative vector of the codebook (the representative-vector-to-feature-vector distance).

Here, each M-order speech feature vector in the sequence is quantized based on the weighted quantization distance D2, and the weighted distortion distance D3 (the quantization distortion) between the feature vector and its representative vector is then obtained using the full M-order vector. The distance between the codebook and the feature vector sequence is obtained as the average of these quantization distortions.

According to the second embodiment, the weighted quantization distance D2 between a representative vector and a feature vector used in quantization can be obtained from formula (5) below, and the weighted distortion distance D3 used to obtain the quantization distortion can be obtained from formula (6) below.

D2 = [Σ_{K=1}^{M} U_K (C_K − X_K)²]^{1/2}    (5)

where D2: representative-vector-to-feature-vector distance (quantization distance)

      U_K: weight, U_K = 1 (K ≤ N), 0 (K > N)

      C_K: representative vector

      X_K: M-order feature vector

D3 = [Σ_{K=1}^{M} V_K (C_K − X_K)²]^{1/2}    (6)

where D3: representative-vector-to-feature-vector distance (distortion distance)

      V_K: weight (V_K = 1)

      C_K: representative vector

      X_K: M-order feature vector

Therefore, in the recognition mode the speaker distance calculation section 11 obtains, according to formula (5), the weighted quantization distance D2 between each representative vector stored in the codebook of the storage section 7 and each speech feature vector in the M-order speech feature vector sequence generated at the feature vector generation section 4. Quantization of the M-order speech feature vector sequence is then performed based on the obtained weighted quantization distances D2. That is, each speech feature vector in the M-order speech feature vector sequence is quantized, the weighted distortion distance D3 between each speech feature vector and the corresponding representative vector stored in the codebook of the storage section 7 is obtained according to formula (6), and the average of the obtained weighted distortion distances D3 (the average quantization distortion) is computed.
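As a minimal sketch of this two-distance scheme, the function below quantizes each vector with the low-order distance D2 of formula (5) and scores it with the full-order distortion distance D3 of formula (6); the function name and array layout are assumptions, not the patent's notation.

```python
import numpy as np

N, M = 8, 16  # low-order group size and feature order used in this embodiment

def codebook_distance(features, codebook):
    """Average quantization distortion between an M-order feature vector
    sequence (shape [T, M]) and a codebook (shape [size, M])."""
    u = np.where(np.arange(M) < N, 1.0, 0.0)  # U_K for D2: low-order group only
    v = np.ones(M)                            # V_K = 1 for D3: all M orders
    distortions = []
    for x in features:
        # Formula (5): weighted quantization distance to every representative vector.
        d2 = np.sqrt(((codebook - x) ** 2 * u).sum(axis=1))
        nearest = codebook[d2.argmin()]       # quantize by the low-order distance
        # Formula (6): full-order distortion distance to the chosen representative.
        d3 = np.sqrt((v * (nearest - x) ** 2).sum())
        distortions.append(d3)
    return float(np.mean(distortions))
```

Note the asymmetry: the winning representative vector is chosen using only the first N orders, but the reported distortion measures all M orders.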

The recognition section 12 identifies the speaker based on the average quantization distortion obtained at the speaker distance calculation section 11, and outputs the result as the speaker recognition result. When the speaker recognition device 100 is used as a speaker verification device, the speaker distance calculation section 11 calculates the distance between the speech feature vector sequence generated from the voice of the speaker to be identified and the plurality of representative vectors stored in the codebook of the claimed speaker, and the recognition section 12 verifies the speaker by comparing that distance with a threshold. When the speaker recognition device 100 is used as a speaker identification device, the speaker distance calculation section 11 calculates the distances (average quantization distortions) between the speech feature vectors generated from the voice of the speaker to be identified and the representative vectors stored in the codebooks of all registered speakers, and the speaker is identified by selecting the shortest of the obtained distances.
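The two decision rules can be sketched as follows; here `scores` maps each registered speaker to the average quantization distortion between the input utterance and that speaker's codebook, and the threshold value is an illustrative assumption.

```python
def verify(score, threshold):
    """Speaker verification: accept the claimed identity when the average
    quantization distortion is below the threshold."""
    return score < threshold

def identify(scores):
    """Speaker identification: pick the registered speaker whose codebook
    gives the shortest distance (smallest average distortion)."""
    return min(scores, key=scores.get)
```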

According to the second embodiment of the present invention described above, in the registration mode, the weighted vector-to-vector distance D1 is obtained for each vector of the M-order speech feature vector sequence generated from the voice of the speaker to be registered, the M-order speech feature vector sequence is clustered based on the obtained weighted vector distances D1, and a codebook comprising a plurality of M-order representative vectors is generated. In the identification mode, each speech feature vector in the M-order speech feature vector sequence generated from the voice of the speaker to be identified is quantized based on the weighted quantization distance D2 between that vector and each representative vector of the codebook, the quantization distortion is obtained from the full M-order speech feature vector using the distortion distance D3, and speaker recognition is then performed based on the average of the quantization distortions. With this structure, the trade-off between phonological resolving performance and speaker resolving performance can be eliminated and a good balance between them ensured. Therefore, highly accurate speaker recognition can be achieved.

In the second embodiment of the present invention, formulas (4), (5) and (6) are used, but the invention is not limited to these formulas. For example, formula (4) may be replaced by formula (7) below (weight: U_K = 1), formula (5) by formula (8) below (weight: U_K = 1), and formula (6) by formula (9) below (weight: V_K = 1/S_K). Here, the standard deviation S_K, the dispersion value of each speech feature vector order, is obtained statistically in advance.

D1 = [Σ_{K=1}^{M} U_K (X_K − Y_K)²]^{1/2}    (7)

where D1: vector distance

      U_K: weight (U_K = 1)

      X_K, Y_K: M-order feature vectors

D2 = [Σ_{K=1}^{M} U_K (C_K − X_K)²]^{1/2}    (8)

where D2: representative-vector-to-feature-vector distance (quantization distance)

      U_K: weight (U_K = 1)

      C_K: representative vector

      X_K: M-order feature vector

D3 = [Σ_{K=1}^{M} V_K (C_K − X_K)²]^{1/2}    (9)

where D3: representative-vector-to-feature-vector distance (distortion distance)

      V_K: weight (V_K = 1/S_K)

      C_K: representative vector

      X_K: M-order feature vector

      S_K: K-order standard deviation
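A sketch of this alternative weighting: the per-order standard deviations S_K are estimated once from a set of training vectors (an assumption about where they come from; the text only says they are obtained statistically in advance), and the distortion distance of formula (9) divides each order by its S_K.

```python
import numpy as np

def order_std(training_features):
    """S_K: per-order standard deviation of the speech feature vectors,
    computed statistically in advance (shape [T, M] -> [M])."""
    return training_features.std(axis=0)

def distortion_d3(c, x, s):
    """D3 = [sum_K (1/S_K) (c_K - x_K)^2]^(1/2) -- formula (9)."""
    return np.sqrt(np.sum((c - x) ** 2 / s))
```

Dividing by S_K normalizes each order's contribution, so orders with naturally large spread do not dominate the distortion.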

In the second embodiment of the present invention, the first speech feature vector sequence generated from the voice of the speaker to be registered and the second speech feature vector sequence generated from the voice of the speaker to be identified are both M-order speech feature vector sequences. The weighted vector distance and the weighted quantization distance are each obtained using a weight satisfying the following relation,

U_K = 1 (k ≤ N), 0 (k > N), where N < M.

Here U_K is the first weight and the second weight. The weighted distortion distance can be obtained using a weight satisfying the following relation,

V_K = 1 (k ≤ M), where the third weight is V_K.

Therefore, highly accurate recognition performance can be achieved.

Alternatively, in the second embodiment of the present invention, the first speech feature vector sequence and the second speech feature vector sequence are both M-order speech feature vector sequences. The weighted vector distance and the weighted quantization distance are each obtained using a weight satisfying the following relation,

U_K = 1 (k ≤ M)

Here U_K is the first weight and the second weight. The weighted distortion distance can be obtained using a weight satisfying the following relation,

V_K = 1/S_K (k ≤ M), where the third weight is V_K.

Therefore, highly accurate recognition performance can be achieved.

The hardware structure is not limited to the specific structure described above; it may also be realized by software. The speaker recognition section 5 and the speaker model generation section 6 can be implemented in software. FIG. 5 is a block diagram showing the speaker recognition device 100 implemented by software.

As shown in FIG. 5, the speaker recognition device 100 includes a CPU 101 connected via a bus to a ROM storing a BIOS and the like, and a memory 102 including ROM and RAM, together constituting a microcomputer. Via an I/O interface (not shown), the CPU 101 is connected over the bus to an HDD 103, a CD-ROM drive 105 that reads a computer-readable CD-ROM 104, a communication device 106 for communicating with the Internet and the like, a keyboard 107, a display 108 such as a CRT or LCD, and the microphone 1.

The CD-ROM 104, a computer-readable storage medium, stores a program implementing the speaker recognition function of the present invention, and the CPU 101 realizes that function by installing the program. Voice input through the microphone 1 is stored in the HDD 103 or the like. When the program runs, the stored voice data is read from the HDD 103 or the like to perform the speaker recognition processing. The speaker recognition processing realizes functions similar to those of the feature vector generation section 4, the speaker recognition section 5, and the speaker model generation section 6, so effects similar to those described above can be obtained.

As the storage medium, various optical disks such as DVDs, various magneto-optical disks, various magnetic disks such as floppy disks, semiconductor memories, and the like can be used. Furthermore, the present invention can be realized by downloading the program from a network such as the Internet and installing it on the HDD 103 serving as the storage section. In this case, the storage device holding the program at the server on the transmitting side is the storage medium of the present invention. The program may run on a given OS (operating system); in that case the program may delegate part of the above processing to the OS, and it may form part of a program file group that includes prescribed application software such as word processor software, an OS, and the like.

Claims (8)

1. A speaker identification device (100), characterized by comprising:
a device for obtaining, based on a low-order speech feature vector group of a first speech feature vector sequence generated from the voice of a speaker to be registered, distances between the speech feature vectors of the first speech feature vector sequence, clustering the first speech feature vector sequence, and generating a codebook comprising a plurality of representative vectors based on the obtained distances;
a device for storing the generated codebook;
a device for obtaining, based on a low-order speech feature vector group of a second speech feature vector sequence generated from the voice of a speaker to be identified, a quantization distance between (a) each speech feature vector in the second speech feature vector sequence and (b) a corresponding one of the plurality of representative vectors stored in the codebook, quantizing each said speech feature vector in the second speech feature vector sequence based on the obtained quantization distance, and obtaining, based on a high-order speech feature vector group of the second feature vector sequence, a quantization distortion between each said speech feature vector in the second speech feature vector sequence and the corresponding one of the plurality of representative vectors stored in the codebook; and
a device for performing speaker identification based on the obtained quantization distortion.
2. The speaker identification device as claimed in claim 1, wherein each of the first speech feature vector sequence and the second speech feature vector sequence is an M-order speech feature vector sequence, the low-order speech feature vector group is an N-order (N < M) speech feature vector group, the corresponding vectors are M-order representative vectors, and the high-order speech feature vector group is the M-order speech feature vector group.
3. The speaker identification device as claimed in claim 1, wherein each of the first speech feature vector sequence and the second speech feature vector sequence is an M-order speech feature vector sequence, the low-order speech feature vector group is an N-order (N < M) speech feature vector group, the representative vectors are M-order representative vectors, and the high-order speech feature vector group is a group comprising the m-th to M-th order speech feature vectors (N < m < M).
4. A speaker identification device (100), characterized by comprising:
a device for obtaining, based on a first weight, weighted vector distances between the speech feature vectors of a first speech feature vector sequence generated from the voice of a speaker to be registered, clustering the first speech feature vector sequence, and generating a codebook comprising a plurality of representative vectors based on the obtained weighted vector distances;
a device for storing the generated codebook;
a device for obtaining, based on a second weight, a weighted quantization distance between a corresponding one of the plurality of representative vectors stored in the codebook and each speech feature vector in a second speech feature vector sequence generated from the voice of a speaker to be identified, quantizing each said speech feature vector in the second speech feature vector sequence based on the obtained weighted quantization distance, and obtaining a weighted quantization distortion between the corresponding one of the plurality of representative vectors stored in the codebook and each said speech feature vector in the second speech feature vector sequence based on a third weight different from the first weight and the second weight; and
a device for performing speaker identification based on the quantization distortion.
5. The speaker identification device as claimed in claim 4, wherein each of the first speech feature vector sequence and the second speech feature vector sequence is an M-order speech feature vector sequence;
wherein the first weight of the weighted vector distance and the second weight of the weighted quantization distance are both U_K, where:
U_K = 1 (k ≤ N), 0 (k > N),
and N < M; and
wherein the third weight of the weighted distortion distance is V_K, where:
V_K = 1 (k ≤ M).
6. The speaker identification device as claimed in claim 4, wherein each of the first speech feature vector sequence and the second speech feature vector sequence is an M-order speech feature vector sequence;
wherein the first weight of the weighted vector distance and the second weight of the weighted quantization distance are both U_K, where:
U_K = 1 (k ≤ M); and
wherein the third weight of the weighted distortion distance is V_K, where:
V_K = 1/S_K (k ≤ M),
and the dispersion value of each of the M orders is S_K.
7. A speaker identification method, characterized by comprising:
a step of obtaining, based on a low-order speech feature vector group of a first speech feature vector sequence generated from the voice of a speaker to be registered, vector distances between the speech feature vectors in the first speech feature vector sequence, clustering the first speech feature vector sequence based on the obtained vector distances, and generating a codebook comprising a plurality of representative vectors;
a step of storing the generated codebook;
a step of obtaining, based on a low-order speech feature vector group of a second speech feature vector sequence generated from the voice of a speaker to be identified, a quantization distance between (a) each speech feature vector in the second speech feature vector sequence and (b) a corresponding one of the plurality of representative vectors stored in the codebook, quantizing each said speech feature vector in the second speech feature vector sequence based on the obtained quantization distance, and obtaining, based on a high-order speech feature vector group of the second speech feature vector sequence, a quantization distortion between each said speech feature vector in the second speech feature vector sequence and the corresponding one of the plurality of representative vectors stored in the codebook; and
a step of performing speaker identification based on the obtained quantization distortion.
8. A speaker identification method, characterized by comprising:
a step of obtaining, based on a first weight, weighted vector distances between the speech feature vectors of a first speech feature vector sequence generated from the voice of a speaker to be registered, clustering the first speech feature vector sequence based on the obtained weighted vector distances, and generating a codebook comprising a plurality of representative vectors;
a step of storing the generated codebook;
a step of obtaining, based on a second weight, a weighted quantization distance between a corresponding one of the plurality of representative vectors stored in the codebook and each speech feature vector in a second speech feature vector sequence generated from the voice of a speaker to be identified, quantizing each said speech feature vector in the second speech feature vector sequence based on the obtained weighted quantization distance, and obtaining a weighted quantization distortion between the corresponding one of the plurality of representative vectors stored in the codebook and each said speech feature vector in the second speech feature vector sequence based on a third weight different from the first weight and the second weight; and
a step of performing speaker identification based on the obtained quantization distortion.
CNA2005100995434A 2005-09-13 2005-09-13 Speaker identifying equipment, speaker identifying program and speaker identifying method Pending CN1932974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2005100995434A CN1932974A (en) 2005-09-13 2005-09-13 Speaker identifying equipment, speaker identifying program and speaker identifying method


Publications (1)

Publication Number Publication Date
CN1932974A true CN1932974A (en) 2007-03-21

Family

ID=37878761

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2005100995434A Pending CN1932974A (en) 2005-09-13 2005-09-13 Speaker identifying equipment, speaker identifying program and speaker identifying method

Country Status (1)

Country Link
CN (1) CN1932974A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544337A (en) * 2012-05-29 2014-01-29 通用汽车环球科技运作有限责任公司 Dialogue models for vehicle occupants
CN107210040A (en) * 2015-02-11 2017-09-26 三星电子株式会社 Operation method of voice function and electronic device supporting the method
CN107210040B (en) * 2015-02-11 2021-01-12 三星电子株式会社 Method for operating voice function and electronic device supporting the same
WO2019136811A1 (en) * 2018-01-09 2019-07-18 平安科技(深圳)有限公司 Audio comparison method, and terminal and computer-readable storage medium
CN113168837A (en) * 2018-11-22 2021-07-23 三星电子株式会社 Method and apparatus for processing human voice data of voice

Similar Documents

Publication Publication Date Title
CN1162839C (en) Method and apparatus for generating an acoustic model
CN1112669C (en) Method and system for speech recognition using continuous density hidden Markov models
CN1143263C (en) System and method for recognizing tonal languages
CN1551101A (en) Adaptation of compressed acoustic models
CN101051461A (en) Feature-vector compensating apparatus and feature-vector compensating method
CN1750120A (en) Indexing apparatus and indexing method
CN1159691A (en) Method for linear predictive analyzing audio signals
CN1238058A (en) voice processing system
CN101051464A (en) Registration and varification method and device identified by speaking person
CN1746973A (en) Distributed speech recognition system and method
CN101030369A (en) Built-in speech discriminating method based on sub-word hidden Markov model
CN1141698C (en) Pitch interval standardizing device for speech identification of input speech
CN1151218A (en) Training methods for neural networks for speech recognition
KR102198598B1 (en) Method for generating synthesized speech signal, neural vocoder, and training method thereof
CN1234110C (en) Noise adaptation system of speech model, noise adaptation method, and noise adaptation program for speech recognition
CN1486486A (en) Method, device and program for encoding and decoding acoustic parameters and method, device and program for encoding and decoding speech
CN1112672C (en) Multi-pulse analysis speech processing system and method
KR102198597B1 (en) Neural vocoder and training method of neural vocoder for constructing speaker-adaptive model
CN1534596A (en) Method and device for resonance peak tracing using residuum model
CN1932974A (en) Speaker identifying equipment, speaker identifying program and speaker identifying method
US7606707B2 (en) Speaker recognition apparatus and speaker recognition method to eliminate a trade-off relationship between phonological resolving performance and speaker resolving performance
CN1253851C (en) Speaker&#39;s inspection and speaker&#39;s identification system and method based on prior knowledge
CN113611309B (en) Tone conversion method and device, electronic equipment and readable storage medium
JP4922225B2 (en) Speech recognition apparatus and speech recognition program
CN1345028A (en) Speech sunthetic device and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20070321