
CN1763843A - Pronunciation quality evaluating method for language learning machine - Google Patents

Pronunciation quality evaluating method for language learning machine

Info

Publication number
CN1763843A
Authority
CN
China
Prior art keywords
model
pronunciation
voice
network
received pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005101148488A
Other languages
Chinese (zh)
Other versions
CN100411011C (en)
Inventor
梁维谦
董明
丁玉国
刘润生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CNB2005101148488A priority Critical patent/CN100411011C/en
Publication of CN1763843A publication Critical patent/CN1763843A/en
Application granted granted Critical
Publication of CN100411011C publication Critical patent/CN100411011C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a pronunciation quality evaluation method for language learning machines in the field of computer-aided language learning and speech technology. The method comprises: extracting speech features for training; training a standard pronunciation model; generating a standard pronunciation network; detecting speech endpoints; extracting speech features for evaluation; searching for the optimal path; and calculating a pronunciation quality score. The method gives objective and stable evaluations and can be used to build embedded English learning systems that support interactive human-machine teaching and self-assessment of spoken English.

Description

Pronunciation Quality Evaluation Method for a Language Learning Machine

Technical Field

The invention belongs to the field of computer-aided language learning and speech technology, and in particular relates to a pronunciation quality evaluation method implemented on digital signal processing chips of 16 bits and above.

Background Art

In recent years, embedded language learning products have developed rapidly at home and abroad. Early products were mainly repeaters: an analog tape recorder fitted with a device that digitally stores a short segment of speech, which can be replayed many times so that learners can listen repeatedly, read along, and memorize. The mainstream language learning machines currently on the market are second-generation products based on digital signal processing (DSP) chips. The hardware system generally includes a microcontroller unit (MCU), a DSP, a multimedia codec (CODEC), ROM, SRAM, flash memory, a universal serial bus (USB) interface, a keyboard, and a liquid crystal display (LCD). The MCU acts as the main control chip, running system control programs such as device drivers and program scheduling, while the DSP runs the application algorithms. The application software includes basic modules for recording, playback, and speech-rate adjustment, and some products also include an MP3 module. Typical functions include repeating, follow-up reading, follow-up reading comparison, synchronized text display, content retrieval, and playback at an adjustable speech rate. Most of these learning machines can download and update learning materials over the Internet. The Haojixing English learning machine from Shenzhen Haojixing Company is a typical representative of the second-generation digital English learning machine.

The key to learning a language, especially spoken language, lies in an interactive learning process in which a teacher gives timely, targeted feedback and guidance. In traditional teacher-centered language learning this task cannot be accomplished because qualified teachers are scarce, and existing language learning machines lack any ability to evaluate a learner's pronunciation.

Summary of the Invention

The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a pronunciation quality evaluation method for language learning machines that achieves high-performance, text- and speaker-independent pronunciation quality evaluation on an embedded device. The method features moderate computational complexity, high evaluation accuracy, and good robustness. In particular, its evaluation accuracy for speakers with a Chinese accent reaches or even exceeds the current international state of the art.

The pronunciation quality evaluation method for language learning machines proposed by the present invention comprises the following parts: speech feature extraction for training; standard pronunciation model training; generation of the standard pronunciation network; speech endpoint detection; speech feature extraction for evaluation; optimal path search; and calculation of the pronunciation quality score. The implementation of each part comprises the following steps:

A. Speech feature extraction for training:

(1) Build in advance a training database containing a large amount of read speech;

(2) Apply pre-emphasis, framing, and windowing to the digital speech in each speech file of the training database to obtain quasi-stationary framed speech;

(3) Extract speech features, namely cepstral coefficients, from the framed speech;

B. Standard pronunciation model training

(1) Train a phoneme-based standard pronunciation model using the speech features of step A;

(2) Adapt the standard pronunciation model to Chinese-accented speech to obtain the final standard pronunciation model, optimizing the model's evaluation performance for Chinese speakers;

C. Generation of the standard pronunciation network

Segment the given text into words, look up the pronunciation dictionary to obtain the phoneme transcription, and finally use the phoneme-based standard pronunciation model to obtain a linear standard pronunciation network whose nodes are HMM states;

D. Speech endpoint detection:

(1) Convert the analog speech signal into digital speech by A/D conversion;

(2) Apply pre-emphasis, framing, and windowing to the digital speech to obtain quasi-stationary framed speech;

(3) Compute the time-domain logarithmic energy of the framed speech;

(4) Apply a moving-average filter to the time-domain logarithmic energy to obtain the feature used for endpoint detection (hereafter, the endpoint-detection feature);

(5) Perform endpoint detection on the endpoint-detection feature using upper and lower thresholds combined with a finite state machine, obtaining the start and end points of the speech;

E. Speech feature extraction for evaluation

Extract speech features from the framed speech of step D; the process is identical to step A(3).

F. Optimal path search:

(1) Force-align the speech features of step E against the standard pronunciation network of step C to obtain all possible path information in the network;

(2) Using this path information, backtrack the optimal path from the terminal nodes permitted by the network;

G. Calculation of the pronunciation quality score:

(1) Compute the confidence score of each frame of speech features using the optimal path information of step F;

(2) Compute the confidence score of each state on the path using the optimal path information of step F; average the confidence scores of all states on the optimal path to obtain the sentence-level confidence score;

(3) Map the sentence-level confidence score to the subjective evaluation score range with a mapping function to obtain the final pronunciation quality score.

The cepstral coefficients in step A may be Mel-frequency cepstral coefficients (MFCC), which exploit the frequency resolution characteristics of the human ear.

The standard pronunciation model in step B(1) is a phoneme-based hidden Markov model (HMM). Its training process is: initialize one Gaussian model from all the speech features, replicate this model into all the phoneme models, and train the models several times with the Baum-Welch method; then repeatedly increase the number of Gaussian components of each phoneme model and re-run Baum-Welch training.

The accent adaptation of the standard pronunciation model to Chinese speakers in step B(2) is implemented as follows: apply accent adaptation based on maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) methods to the trained standard pronunciation model to obtain the final standard pronunciation model.

The standard pronunciation network of step C may be a grammar-free linear network of HMM states with a definite start node and end node, in which the current node is related only to its predecessor nodes.

The optimal path search of step F uses a frame-synchronous Viterbi beam search.

To implement the pronunciation quality evaluation method of the present invention within the limited memory of an embedded system, steps D, E, F, and G are all carried out in time segments of a preset fixed number of frames; this greatly reduces the demand on system resources and allows the embedded learning system to handle relatively long utterances.

The pronunciation quality evaluation method of the invention gives the language learning machine an interactive capability. An embedded English learning system built with this method has achieved good performance in practical use.

The present invention has the following features:

(1) High evaluation accuracy, good robustness, and low system resource overhead;

(2) The phoneme-based standard pronunciation model lets the embedded learning system change courseware content easily, without retraining;

(3) Considering the influence of the native accent on English pronunciation, the phoneme models are accent-adapted;

(4) Real-time endpoint detection based on moving-average filtering and a finite state machine improves the accuracy and robustness of endpoint detection for English speech;

(5) The method suits DSP-centered embedded language learning systems, with the outstanding advantages of small size, light weight, low power consumption, and low cost;

(6) Combined with rich courseware formats, the pronunciation quality evaluation method of the invention can change the traditional working mode of learning machines as well as the classroom teaching mode.

Brief Description of the Drawings

Fig. 1 is a schematic flow chart of the overall method of an embodiment of the present invention.

Fig. 2 is the standard pronunciation model training flow chart of an embodiment of the present invention; Fig. 2(a) shows the whole training process, and Fig. 2(b) shows the training process of one particular hidden Markov model.

Fig. 3 shows the topology of the standard pronunciation models of an embodiment of the present invention; Fig. 3(a) shows the pause model, and Fig. 3(b) shows the phoneme and silence models.

Fig. 4 is the flow chart of HMM accent adaptation of an embodiment of the present invention.

Fig. 5 is a schematic topology of the standard pronunciation network of an embodiment of the present invention; Fig. 5(a) shows the sentence-level linear network with words as nodes, and Fig. 5(b) shows the linear network inside each word with phonemes as nodes.

Fig. 6 is a schematic diagram of the generation process of the standard pronunciation network of an embodiment of the present invention.

Fig. 7 is a detailed flow chart of the pronunciation quality evaluation method implemented on an embedded platform according to an embodiment of the present invention.

Detailed Description of the Embodiments

An embodiment of the pronunciation quality evaluation method for language learning machines proposed by the present invention is described in detail below with reference to the figures.

The overall flow of the embodiment is shown in Fig. 1 and comprises: A, speech feature extraction for training; B, standard pronunciation model training; C, standard pronunciation network generation (these steps can be completed in advance on a computer); D, speech endpoint detection; E, speech feature extraction for evaluation; F, optimal path search; G, calculation and output of the pronunciation quality score (these steps are carried out on the embedded platform). Each step of the embodiment is detailed below.

A. Speech feature extraction for training:

(1) Build in advance a training database containing a large amount of read English speech (the material must cover every phoneme a sufficient number of times);

(2) Pre-emphasize the digital speech in each speech file of the training database with the pre-emphasis filter H(z) = 1 - 0.9375z^{-1}; then frame and window the pre-emphasized speech (using a Hamming window), with a frame length of 32 ms and a frame shift of 16 ms, to obtain quasi-stationary framed speech;

(3) Extract Mel-frequency cepstral coefficients (MFCC) from the framed speech as the speech features. The short-time frequency-domain characteristics of speech describe its variation precisely; MFCC is a feature vector computed according to the frequency resolution characteristics of human hearing and is built on Fourier spectrum analysis. The MFCC computation is: first apply a fast Fourier transform (FFT) to each frame to obtain the short-time spectrum of the signal; next divide the short-time spectrum into a number of band-pass groups equally spaced on the Mel scale, each band-pass having a triangular frequency response; then compute the signal energy of the corresponding filter bank; and finally compute the corresponding cepstral coefficients by a discrete cosine transform;

MFCC features mainly reflect the static characteristics of speech; the dynamic characteristics of the speech signal can be described by the first-order and second-order difference spectra of the static features. The complete speech feature vector consists of the MFCC parameters, their first- and second-order difference coefficients, and the normalized energy coefficient with its first- and second-order difference coefficients; each frame contains 39 feature dimensions in total. (A sketch of this feature extraction pipeline is given below.)
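The following Python sketch, written against NumPy and SciPy, illustrates the feature pipeline of steps (2)-(3). The FFT size, the filter-bank size, and the use of np.gradient for the difference coefficients are illustrative choices, not values taken from the patent.

```python
import numpy as np
from scipy.fftpack import dct  # type-II DCT yields the cepstral coefficients

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_39(signal, fs=8000, frame_ms=32, shift_ms=16, n_filt=24, n_ceps=12):
    # Step (2): pre-emphasis H(z) = 1 - 0.9375 z^-1
    x = np.append(signal[0], signal[1:] - 0.9375 * signal[:-1])
    # Step (2): framing (32 ms frames, 16 ms shift) with a Hamming window
    flen, fshift = fs * frame_ms // 1000, fs * shift_ms // 1000
    n_frames = 1 + (len(x) - flen) // fshift       # assumes len(x) >= flen
    win = np.hamming(flen)
    frames = np.stack([x[i * fshift : i * fshift + flen] * win
                       for i in range(n_frames)])
    # Step (3): short-time power spectrum via the FFT
    nfft = 512
    pspec = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # Step (3): triangular filters equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # Log filter-bank energies, then DCT -> cepstral coefficients
    logfb = np.log(np.maximum(pspec @ fbank.T, 1e-10))
    ceps = dct(logfb, type=2, axis=1, norm='ortho')[:, :n_ceps]
    # Normalized log energy as the 13th static coefficient
    log_e = np.log(np.maximum((frames ** 2).sum(axis=1), 1e-10))
    static = np.hstack([ceps, (log_e - log_e.max())[:, None]])
    # First- and second-order differences: 13 + 13 + 13 = 39 dims per frame
    d1 = np.gradient(static, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([static, d1, d2])
```

In a deployed system these loops would be replaced by fixed-point DSP routines; the sketch only fixes the arithmetic being described.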

B. Training of the standard pronunciation model:

(1) The process of training the phoneme-based standard pronunciation model from the speech features of step A is shown in Fig. 2:

a. According to the dimensionality of the speech features, build a prototype single-stream multivariate Gaussian distribution with a diagonal covariance matrix, and estimate its mean vector and covariance matrix from all of the speech data.

b. Determine the pronunciation dictionary and the phonetic symbol set, and complete the phoneme-level annotation of all the speech. The phonetic symbol set of this embodiment comprises 40 phonemes plus one silence label and one pause label.

c. This embodiment uses phoneme-based hidden Markov models (HMM) as the standard pronunciation model; the HMM is the statistical speech recognition model most widely used today, and its left-to-right state transition structure describes the articulation of speech well. The phoneme and silence models used in the invention are 3-state HMMs, and the pause model is a single-state skippable HMM; their topologies are shown in Fig. 3, where q_i denotes an HMM state, a_ij the transition probability, and b_j(O_t) the multi-stream Gaussian-mixture output density of a state, given by Eq. (1):

$$b_j(O_t) = \prod_{s=1}^{S}\left[\,\sum_{m=1}^{M_s} c_{jsm}\, N(O_{st};\, \mu_{jsm},\, \phi_{jsm})\right]^{\gamma_s} \qquad (1)$$

where S is the number of data streams, M_s is the number of Gaussian mixture components in each data stream, and N is the multivariate Gaussian distribution of Eq. (2):

$$N(o;\, \mu,\, \phi) = \frac{1}{\sqrt{(2\pi)^n\,|\phi|}}\; e^{-\frac{1}{2}(o-\mu)'\,\phi^{-1}\,(o-\mu)} \qquad (2)$$

The standard pronunciation model of this embodiment comprises 40 phoneme HMMs plus one silence HMM and one pause HMM. The Gaussian prototype is copied into each HMM, and each HMM is then re-estimated several times with the Baum-Welch algorithm; the number of re-estimations may be 5;

d. Gradually increase the number of Gaussian components in each HMM and run Baum-Welch training again on the resulting models; the number of Gaussian components grows through 2, 4, 6, and 8. When it reaches 8, the training is repeated 10 times and the training process ends. (A sketch of this training loop follows.)
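A minimal sketch of the per-phoneme training loop, using the third-party hmmlearn library (an assumption; the patent names no toolkit). `features_for_phoneme` is a hypothetical helper returning the feature frames aligned to one phoneme, and hmmlearn re-initializes the Gaussians at each stage rather than splitting the previous stage's components, which simplifies the schedule described above.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

PHONES = ["aa", "ae", "iy", "sil", "sp"]   # abbreviated; 42 labels in total

def left_to_right(n_states):
    """Start in state 0; allow only self-loops and forward transitions."""
    start = np.zeros(n_states); start[0] = 1.0
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        trans[i, i] = 0.5
        trans[i, min(i + 1, n_states - 1)] += 0.5
    return start, trans

def train_models(features_for_phoneme):
    """features_for_phoneme(p) -> (X, lengths): frames annotated as p."""
    models = {}
    for p in PHONES:
        X, lengths = features_for_phoneme(p)
        n_states = 1 if p == "sp" else 3     # single-state pause model
        for n_mix in (2, 4, 6, 8):           # grow the mixture count
            model = GMMHMM(n_components=n_states, n_mix=n_mix,
                           covariance_type='diag',
                           n_iter=10 if n_mix == 8 else 5,
                           init_params='mcw')    # keep our topology below
            model.startprob_, model.transmat_ = left_to_right(n_states)
            model.fit(X, lengths)            # Baum-Welch re-estimation
        models[p] = model
    return models
```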

(2) Adapt the standard pronunciation model to Chinese-accented speech. This embodiment uses a serial accent adaptation method based on global MLLR followed by MAP, with the number of adaptation passes set to 4; the procedure is shown in Fig. 4.

a. MLLR is an adaptation algorithm based on model transformation. The basic assumption of this class of algorithms is that similar sounds share a similar transformation between the speaker-independent model space and the target speaker's speech space; this transformation can therefore be estimated from the sounds that occur in the adaptation speech and applied to the models of unseen sounds, mapping the speaker-independent models into the target speech space and thereby achieving adaptation. The model space is partitioned into R classes according to some measure (such as Euclidean distance or likelihood); the transformation of each class is T_r(·), the adaptation speech set of each class is X_r, r = 1, 2, ..., R, and the model parameters are λ_r, r = 1, 2, ..., R. Adaptive training then satisfies:

$$T_r = \arg\max_{T}\, P(X_r \mid T_r), \qquad r = 1, 2, \ldots, R \qquad (3)$$

The adapted parameters $\hat{\lambda}_r,\ r = 1, 2, \ldots, R$ satisfy

$$\hat{\lambda}_r = T_r(\lambda_r), \qquad r = 1, 2, \ldots, R \qquad (4)$$

Because this class of algorithms makes full use of the relationships among sounds, many models share a single transformation and only the transformation coefficients need to be estimated; data for estimating them accumulate easily, so the estimates become effective with little adaptation data and adaptation is fast. This embodiment uses unclassified global MLLR adaptation.

b. The basic criterion of the MAP algorithm is maximization of the posterior probability, so it is theoretically optimal:

$$\hat{\theta}_i = \arg\max_{\theta_i}\, P(\theta_i \mid x) \qquad (5)$$

The mean vector estimation formula of the standard MAP algorithm is:

$$\hat{\mu} = \frac{\sum_{t=1}^{T} L_t}{\sum_{t=1}^{T} L_t + \tau}\,\bar{\mu} + \frac{\tau}{\sum_{t=1}^{T} L_t + \tau}\,\mu \qquad (6)$$

where L_t is the occupation probability of the Gaussian mixture component for the observation vector at time t, τ is the prior weight of the adaptation speech data, \bar{μ} is the mean vector of the adaptation speech, and μ is the mean vector of the speaker-independent model. It follows that when there is enough adaptation data, the adapted mean vector tends toward the speaker-dependent mean vector \bar{μ}. The purpose of applying MAP adaptation after MLLR adaptation in this embodiment is to make full use of the adaptation speech data and further improve the accent adaptation. (A sketch of the MAP mean update follows.)
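A NumPy sketch of the MAP mean update of Eq. (6); the variable names are illustrative, with `gamma` playing the role of the occupation probabilities L_t and `tau` the prior weight.

```python
import numpy as np

def map_update_mean(obs, gamma, mu_si, tau=10.0):
    """MAP re-estimation of one Gaussian mean vector, Eq. (6).

    obs:   (T, D) observation vectors o_t assigned to this component
    gamma: (T,)   occupation probabilities L_t of the component
    mu_si: (D,)   speaker-independent prior mean (mu in Eq. (6))
    tau:   prior weight of the adaptation data (assumed value)
    """
    occ = gamma.sum()                                   # sum_t L_t
    mu_bar = (gamma[:, None] * obs).sum(axis=0) / max(occ, 1e-10)
    # Interpolate between the adaptation-data mean and the prior mean;
    # with abundant data (occ >> tau) the estimate approaches mu_bar.
    return (occ / (occ + tau)) * mu_bar + (tau / (occ + tau)) * mu_si
```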

Store the final standard pronunciation model in the external memory of the embedded system.

C. Generation of the standard pronunciation network:

The standard pronunciation network of this embodiment is shown in Fig. 5, where (a) is an example of the sentence-level linear network with words as nodes, whose start node is the leading "sil" and whose end node is the trailing "sil", and (b) is the linear network inside each word with phonemes as nodes; inside each phoneme is the state-level network of Fig. 3. The generation process is shown in Fig. 6: first the original text is segmented into words, giving Fig. 5(a); then the pronunciation dictionary is looked up for each word, giving Fig. 5(b). To handle words with multiple pronunciations while saving storage space and improving search efficiency, this embodiment aligns the phoneme strings of the alternative pronunciations by dynamic programming and merges the multiple phoneme sequences into a single phoneme-node network, so that identical phonemes are shared among the pronunciations. Finally the phoneme HMMs are used to expand the network into a network whose nodes are states; each state node records its state identifier, phoneme identifier, word identifier, the number of its predecessor nodes, and the identifiers of those predecessors. The result is the standard pronunciation network of this embodiment: a grammar-free linear network of HMM states with a definite start node P and end node T, in which the current node is related only to its predecessor nodes. (A simplified construction sketch follows.)
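A simplified Python sketch of this expansion, assuming a pronunciation dictionary that maps each word to a single phoneme sequence (the dynamic-programming merge of alternative pronunciations is omitted); the StateNode record mirrors the fields listed above, but the data layout is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class StateNode:
    state_id: int                  # HMM state index within the phoneme
    phoneme: str                   # phoneme identifier
    word: str                      # word identifier
    preds: list = field(default_factory=list)   # predecessor node indices

def build_network(text, lexicon, states_per_phone=3):
    """text -> linear state network: sil + word phonemes + sil."""
    words = text.lower().split()                     # word segmentation
    phones = [("sil", "<s>")]
    for w in words:
        phones += [(p, w) for p in lexicon[w]]       # dictionary lookup
    phones.append(("sil", "</s>"))
    nodes, prev = [], None
    for phone, word in phones:
        n_states = 1 if phone == "sp" else states_per_phone
        for s in range(n_states):                    # expand HMM states
            node = StateNode(s, phone, word)
            if prev is not None:
                node.preds.append(prev)              # left-to-right link
            prev = len(nodes)
            nodes.append(node)
    return nodes     # nodes[0] is start node P, nodes[-1] is end node T

# Hypothetical usage:
# net = build_network("good morning",
#                     {"good": ["g", "uh", "d"],
#                      "morning": ["m", "ao", "r", "n", "ih", "ng"]})
```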

Store this standard pronunciation network in the external memory of the embedded system.

D. Speech endpoint detection:

(1) The speech signal is first low-pass filtered and then sampled and quantized by a 16-bit linear A/D converter into digital speech; the sampling rate is 8 kHz;

(2) Apply pre-emphasis, framing, and windowing to the digital speech to obtain quasi-stationary framed speech; the method is identical to step A(2);

(3) Compute the short-time logarithmic energy of the framed speech.

(4) Apply moving-average filtering to the time-domain logarithmic energy to obtain the endpoint-detection feature. Endpoint detection is performed in real time, and a real-time endpoint detection method must meet the following requirements: a, consistent output under different background noise levels; b, detection of both the start point and the end point; c, short delay; d, a finite response interval; e, maximal signal-to-noise ratio at the endpoints; f, accurate localization of the detected endpoints; g, maximal suppression of detection errors. An objective function defined from all of these requirements closely resembles the edge detection functions (moving-average filters) commonly used in image processing. The moving-average filter is given by Eq. (7), where g(·) is the time-domain logarithmic energy, t is the current frame index, and h(·) is the moving-average filter of Eq. (8); h(·) is an odd-symmetric function, W may be taken as 13, and f(·) is given by Eq. (9), with parameters A = 0.2208, s = 0.5383, and [K_1 ... K_6] = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56].

$$F(t) = \sum_{i=-W}^{W} h(i)\, g(t+i) \qquad (7)$$

$$h(i) = \begin{cases} -f(-i), & -W \le i < 0 \\ f(i), & 0 \le i \le W \end{cases} \qquad (8)$$

$$f(x) = e^{Ax}\big[K_1\sin(Ax) + K_2\cos(Ax)\big] + e^{-Ax}\big[K_3\sin(Ax) + K_4\cos(Ax)\big] + K_5 + K_6\, e^{sx} \qquad (9)$$

(5) Perform endpoint detection on the endpoint-detection feature using upper and lower thresholds combined with a finite state machine to obtain the start and end points of the speech. The endpoint-detection feature F(t) is positive at the start of speech, negative at its end, and close to zero in silent segments. Based on preset upper and lower thresholds and a minimum speech duration, each speech frame drives transitions among the speech, silence, and leaving-speech states. The machine starts in the silence state; when F(t) reaches the upper threshold, the start endpoint is output and the machine enters the speech state. In the speech state, when F(t) reaches the lower threshold the machine enters the leaving-speech state. When the time spent in the leaving-speech state reaches a set threshold, the end endpoint is output, the recording channel is closed, and endpoint detection ends. (A sketch of the filter and state machine is given below.)
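The following sketch builds the filter of Eqs. (7)-(9) from the constants quoted above and runs the double-threshold state machine over the filtered log energy. The threshold values and the trailing-silence frame count are illustrative placeholders, not values from the patent.

```python
import numpy as np

A, S_PARAM = 0.2208, 0.5383
K = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]
W = 13

def f(x):  # Eq. (9)
    return (np.exp(A * x) * (K[0] * np.sin(A * x) + K[1] * np.cos(A * x))
            + np.exp(-A * x) * (K[2] * np.sin(A * x) + K[3] * np.cos(A * x))
            + K[4] + K[5] * np.exp(S_PARAM * x))

# Odd-symmetric filter of Eq. (8): h(i) = -f(-i) for i < 0, f(i) for i >= 0.
h = np.array([-f(-i) if i < 0 else f(i) for i in range(-W, W + 1)])

def edge_feature(log_energy):
    """Eq. (7): F(t) = sum_i h(i) g(t+i), a correlation of the log-energy
    contour with h; positive at onsets, negative at offsets."""
    return np.correlate(log_energy, h, mode='same')

def detect_endpoints(F, upper=2.0, lower=-2.0, min_trailing_frames=20):
    """Double thresholds plus a finite state machine; the numeric
    thresholds here are assumptions to be tuned per device."""
    state, start, run = 'silence', None, 0
    for t, v in enumerate(F):
        if state == 'silence' and v >= upper:
            state, start = 'speech', t            # start endpoint found
        elif state == 'speech' and v <= lower:
            state, run = 'leaving', 0
        elif state == 'leaving':
            if v >= upper:
                state = 'speech'                  # speech resumed
            else:
                run += 1
                if run >= min_trailing_frames:
                    return start, t               # end endpoint; stop here
    return (start, len(F) - 1) if start is not None else (None, None)
```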

E. Speech feature extraction for evaluation:

Extract speech features from the framed speech of step D; the process is identical to step A(3).

F. Optimal path search:

(1) Force-align the speech features of step E against the standard pronunciation network of step C to obtain all possible path information in the network. The standard pronunciation network of this embodiment is a left-to-right linear network (Fig. 5), so the optimal path can be obtained with a frame-synchronous Viterbi beam search. Given the HMM model Φ and the observation vector sequence O = {o_1, ..., o_T}, we seek the best state sequence S = {s_1, ..., s_T} that produces this observation sequence, i.e.

$$\hat{S} = \arg\max_{S}\, P(S, O \mid \Phi) \qquad (10)$$

In the Viterbi algorithm, the likelihood of the best path at time t is defined as

$$V_i(t) = P(o_1, \ldots, o_t,\, s_1, \ldots, s_{t-1},\, s_t = i \mid \Phi) \qquad (11)$$

In a linear network, the optimal path at any time depends only on the information of the current frame and the previous frame, i.e., it satisfies the principle of no after-effect. Therefore, if the globally optimal path passes through node i at time t, then the portion of the path between time 0 and time t must be optimal among all paths ending at node i at time t. If only the optimal path is required, then at each time t it suffices to keep a single path ending at each node i.

Based on these principles, the search algorithm of this embodiment is as follows:

Definitions: PreNode(i) is the set of predecessor nodes of node i. BestPre(t, i) is the optimal predecessor node of node i at time t. L(t, i) is the likelihood score of the speech frame at time t for node i. L_Path(-1, i) and L_Path(0, i) are the likelihood scores of the optimal paths ending at node i for the previous frame and the current frame, respectively.

Step 1: at time t = 0,

$$L\_Path(-1, i) = \begin{cases} L(0, i), & i \in \mathrm{Entry} \\ 0, & i \notin \mathrm{Entry} \end{cases} \qquad (12)$$

where i ∈ Entry means that i is a start node.

Step 2: at time t, the likelihood score L(t, i) of the current frame has been obtained for every node i, and the optimal path score of the current frame is:

$$L\_Path(0, i) = \max_{j}\big(L\_Path(-1, j)\big) + L(t, i), \qquad \forall\, j \in \mathrm{PreNode}(i) \qquad (13)$$

Record the optimal predecessor node in BestPre(t, i), and swap the data of L_Path(-1, i) and L_Path(0, i) in preparation for the computation of the next frame.

Step 3: if t < T, go to Step 2; otherwise, stop.

(2) When the speech ends, the optimal forced-alignment state path is obtained by backtracking BestPre(t, i) from the terminal nodes permitted by the network. (A sketch of this search follows.)
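A Python sketch of the frame-synchronous search, reusing the StateNode layout from the network sketch above and a hypothetical `log_likelihood(t, i)` callback for L(t, i); log-scores are used so path scores add, and the beam pruning of the full algorithm is omitted for brevity.

```python
import numpy as np

def viterbi_force_align(n_frames, nodes, entry, exits, log_likelihood):
    """Frame-synchronous Viterbi over a linear state network.

    nodes[i].preds -- predecessor indices of node i (PreNode(i))
    entry / exits  -- indices of permitted start / terminal nodes
    log_likelihood(t, i) -- L(t, i), the frame-level log score
    Returns the optimal state path s_0 .. s_{T-1}.
    """
    N = len(nodes)
    prev = np.full(N, -np.inf)                  # L_Path(-1, i)
    best_pre = np.zeros((n_frames, N), dtype=int)
    for i in entry:                             # Eq. (12), t = 0
        prev[i] = log_likelihood(0, i)
    for t in range(1, n_frames):                # Eq. (13)
        cur = np.full(N, -np.inf)
        for i in range(N):
            cands = [i] + list(nodes[i].preds)  # self-loop or advance
            j = max(cands, key=lambda c: prev[c])
            if prev[j] > -np.inf:
                cur[i] = prev[j] + log_likelihood(t, i)
                best_pre[t, i] = j              # record BestPre(t, i)
        prev = cur                              # swap buffers
    end = max(exits, key=lambda i: prev[i])     # best permitted terminal
    path = [end]                                # backtrack BestPre(t, i)
    for t in range(n_frames - 1, 0, -1):
        path.append(best_pre[t, path[-1]])
    return path[::-1]
```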

G. Calculation of the pronunciation quality score

(1) Use the optimal path information from step F to compute the confidence score of each frame of speech features, as in Eq. (14):

$$C_j = \log\big(p(O_j \mid s_i)\big) - \log\Big(\sum_i p(O_j \mid s_i)\Big) \qquad (14)$$

(2) Use the optimal path information from step F to compute the confidence score of each state on the path; average the confidence scores of all states on the optimal path to obtain the sentence-level confidence score, as in Eq. (15), where N is the number of states on the optimal path.

$$C = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\sum_{j=js}^{je} C_j}{je - js}\right) \qquad (15)$$

(3) Map the sentence-level confidence score to the subjective evaluation score range with a mapping function. The directly computed confidence score usually lies in the interval (-∞, a], where a is a constant, which does not match the subjective score range. This embodiment maps it to the subjective score range with the piecewise linear function of Eq. (16), where a and b are determined experimentally and α is a scaling factor:

$$S = \begin{cases} \alpha C, & a \le C \le b \\ 100, & C > b \\ 0, & C < a \end{cases} \qquad (16)$$

The resulting S can also be further quantized into pronunciation quality grades of excellent, good, fair, and poor. (A scoring sketch follows.)
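A sketch of Eqs. (14)-(16), assuming per-frame log-likelihoods are available for the force-aligned state and for every competing state (the denominator of Eq. (14)); the constants a, b, and α are left as parameters because the patent determines them experimentally.

```python
import numpy as np
from scipy.special import logsumexp

def frame_confidence(loglik_aligned, loglik_all):
    """Eq. (14): C_j = log p(O_j|s_i) - log sum_i p(O_j|s_i).

    loglik_aligned: (T,)   log-likelihood of each frame under its
                           force-aligned state
    loglik_all:     (T, N) log-likelihood of each frame under every state
    """
    return loglik_aligned - logsumexp(loglik_all, axis=1)

def sentence_confidence(frame_scores, segments):
    """Eq. (15): average the frame scores within each state segment,
    then average over the N states on the optimal path.
    segments -- list of (js, je) frame ranges, one per state."""
    state_scores = [frame_scores[js:je].mean() for js, je in segments]
    return float(np.mean(state_scores))

def map_to_score(C, a, b, alpha):
    """Eq. (16): S = alpha*C for a <= C <= b, 100 above b, 0 below a.
    a, b, alpha must be fitted against subjective ratings."""
    if C > b:
        return 100.0
    if C < a:
        return 0.0
    return alpha * C
```

The score S can then be bucketed into the four grades (excellent, good, fair, poor) mentioned above.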

Owing to the limited memory resources, steps D, E, F, and G of this embodiment are all carried out in time segments of a preset fixed number of frames; each segment may be 40 frames.

Based on the above method, this embodiment implements an embedded English learning system built on pronunciation quality evaluation. The learning content can be updated automatically at any time according to teaching requirements. Pronunciation quality evaluation enables interactive human-machine learning, greatly reduces the workload of classroom oral-English teaching, eases the shortage of teachers, and enables autonomous learning and automatic testing of spoken English. The invention can evaluate the English pronunciation quality of Mandarin Chinese speakers; with a four-grade scale (excellent, good, fair, poor), the method's evaluation of Chinese speakers' English pronunciation quality reaches a correlation of 0.74 with subjective evaluation.

Claims (5)

1. A pronunciation quality evaluation method for a language learning machine, comprising speech feature extraction for training, standard pronunciation model training, standard pronunciation network generation, speech endpoint detection, speech feature extraction for evaluation, optimal path search, and pronunciation quality score calculation; characterized in that the implementation of each part comprises the following steps:
A. Speech feature extraction for training:
(1) building in advance a training database containing a large amount of read speech;
(2) applying pre-emphasis, framing, and windowing to the digital speech in each speech file of said training database to obtain quasi-stationary framed speech;
(3) extracting speech features, namely cepstral coefficients, from said framed speech;
B. Standard pronunciation model training:
(1) training a phoneme-based standard pronunciation model using the speech features of step A;
(2) adapting said standard pronunciation model to Chinese-accented speech to obtain the final standard pronunciation model, optimizing the model's evaluation performance for Chinese speakers;
C. Generation of the standard pronunciation network:
segmenting the given text into words, looking up a pronunciation dictionary to obtain the phoneme transcription, and finally using said phoneme-based standard pronunciation model to obtain a linear standard pronunciation network whose nodes are states;
D. Speech endpoint detection:
(1) converting the analog speech signal into digital speech by A/D conversion;
(2) applying pre-emphasis, framing, and windowing to said digital speech to obtain quasi-stationary framed speech;
(3) computing the time-domain logarithmic energy of said framed speech;
(4) applying a moving-average filter to said time-domain logarithmic energy to obtain the endpoint-detection feature;
(5) performing endpoint detection on said endpoint-detection feature using upper and lower thresholds combined with a finite state machine to obtain the start and end points of the speech;
E. Speech feature extraction for evaluation:
extracting speech features from said framed speech of step D, the process being identical to step A(3);
F. Optimal path search:
(1) force-matching the speech features of step E against the standard pronunciation network of step C to obtain all possible path information in the network;
(2) using said path information, backtracking the optimal path from the terminal nodes permitted by the network;
G. Calculation of the pronunciation quality score:
(1) computing the confidence score of each frame of speech features using the optimal path information of step F;
(2) computing the confidence score of each state on the path using the optimal path information of step F, and averaging the confidence scores of all states on the optimal path to obtain the sentence-level confidence score;
(3) mapping said sentence-level confidence score to the subjective evaluation score range with a mapping function to obtain the final pronunciation quality score.
2. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the cepstral coefficients in step A are Mel-frequency cepstral coefficients, which exploit the frequency resolution characteristics of the human ear.
3. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the standard pronunciation model in step B(1) is a phoneme-based hidden Markov model, trained as follows: initialize one Gaussian model from all the speech features, replicate this model into all the phoneme models, train the models repeatedly with the Baum-Welch method, and repeatedly increase the number of Gaussian components of each phoneme model and re-run Baum-Welch training.
4. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the adaptation of the standard pronunciation model to Chinese-accented speech in step B(2) is implemented by applying accent adaptation based on maximum likelihood linear regression and maximum a posteriori probability methods to the trained standard pronunciation model, obtaining the final standard pronunciation model.
5. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the standard pronunciation network of step C is a grammar-free linear network of HMM states with definite start and end nodes, in which the current node is related only to its predecessor nodes.
CNB2005101148488A 2005-11-18 2005-11-18 Pronunciation quality assessment method for language learning machine Expired - Fee Related CN100411011C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005101148488A CN100411011C (en) 2005-11-18 2005-11-18 Pronunciation quality assessment method for language learning machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005101148488A CN100411011C (en) 2005-11-18 2005-11-18 Pronunciation quality assessment method for language learning machine

Publications (2)

Publication Number Publication Date
CN1763843A true CN1763843A (en) 2006-04-26
CN100411011C CN100411011C (en) 2008-08-13

Family

ID=36747941

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005101148488A Expired - Fee Related CN100411011C (en) 2005-11-18 2005-11-18 Pronunciation quality assessment method for language learning machine

Country Status (1)

Country Link
CN (1) CN100411011C (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1236928A (en) * 1998-05-25 1999-12-01 郭巧 Computer aided Chinese intelligent education system and its implementation method
CN1123863C (en) * 2000-11-10 2003-10-08 清华大学 Information check method based on speed recognition

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101105894B (en) * 2006-07-12 2011-08-10 陈修志 Multifunctional language learning machine
WO2009097738A1 (en) * 2008-01-30 2009-08-13 Institute Of Computing Technology, Chinese Academy Of Sciences Method and system for audio matching
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN101739868B (en) * 2008-11-19 2012-03-28 中国科学院自动化研究所 An automatic assessment and diagnosis method of text reading level for oral test
CN101826263B (en) * 2009-03-04 2012-01-04 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
US8385221B2 (en) 2010-02-28 2013-02-26 International Business Machines Corporation System and method for monitoring of user quality-of-experience on a wireless network
CN102237086A (en) * 2010-04-28 2011-11-09 三星电子株式会社 Compensation device and method for voice recognition equipment
CN101996635A (en) * 2010-08-30 2011-03-30 清华大学 English pronunciation quality evaluation method based on accent highlight degree
KR101417975B1 (en) * 2010-10-29 2014-07-09 안후이 유에스티씨 아이플라이텍 캄파니 리미티드 Method and system for endpoint automatic detection of audio record
US9330667B2 (en) 2010-10-29 2016-05-03 Iflytek Co., Ltd. Method and system for endpoint automatic detection of audio record
WO2012055113A1 (en) * 2010-10-29 2012-05-03 安徽科大讯飞信息科技股份有限公司 Method and system for endpoint automatic detection of audio record
CN102253976B (en) * 2011-06-17 2013-05-15 苏州思必驰信息科技有限公司 A metadata processing method and system for oral language learning
CN102253976A (en) * 2011-06-17 2011-11-23 苏州思必驰信息科技有限公司 Metadata processing method and system for spoken language learning
CN102568475A (en) * 2011-12-31 2012-07-11 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
CN102568475B (en) * 2011-12-31 2014-11-26 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
CN103366759A (en) * 2012-03-29 2013-10-23 北京中传天籁数字技术有限公司 Speech data evaluation method and speech data evaluation device
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN102982811B (en) * 2012-11-24 2015-01-14 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN103177733A (en) * 2013-03-11 2013-06-26 哈尔滨师范大学 Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality
CN103177733B (en) * 2013-03-11 2015-09-09 哈尔滨师范大学 Standard Chinese suffixation of a nonsyllabic "r" sound voice quality evaluating method and system
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN106803424A (en) * 2015-11-26 2017-06-06 北京奥鹏远程教育中心有限公司 A kind of Chinese proficiency measuring technology
CN105261246A (en) * 2015-12-02 2016-01-20 武汉慧人信息科技有限公司 Spoken English error correcting system based on big data mining technology
CN105261246B (en) * 2015-12-02 2018-06-05 武汉慧人信息科技有限公司 A kind of Oral English Practice error correction system based on big data digging technology
CN105529030A (en) * 2015-12-29 2016-04-27 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
CN106328123B (en) * 2016-08-25 2020-03-20 苏州大学 Method for recognizing middle ear voice in normal voice stream under condition of small database
CN106328123A (en) * 2016-08-25 2017-01-11 苏州大学 Method of recognizing ear speech in normal speech flow under condition of small database
CN106558308A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 A kind of internet audio quality of data auto-scoring system and method
CN106558308B (en) * 2016-12-02 2020-05-15 深圳撒哈拉数据科技有限公司 A system and method for automatically scoring the quality of Internet audio data
CN106847308A (en) * 2017-02-08 2017-06-13 西安医学院 A kind of pronunciation of English QA system
CN109313892A (en) * 2017-05-17 2019-02-05 北京嘀嘀无限科技发展有限公司 Robust language recognition method and system
CN109313892B (en) * 2017-05-17 2023-02-21 北京嘀嘀无限科技发展有限公司 Robust speech recognition method and system
CN107767858A (en) * 2017-09-08 2018-03-06 科大讯飞股份有限公司 Pronunciation dictionary generating method and device, storage medium and electronic equipment
CN107958673A (en) * 2017-11-28 2018-04-24 北京先声教育科技有限公司 A kind of spoken language methods of marking and device
CN108520749A (en) * 2018-03-06 2018-09-11 杭州孚立计算机软件有限公司 A kind of voice-based grid-based management control method and control device
CN109859773A (en) * 2019-02-14 2019-06-07 北京儒博科技有限公司 A kind of method for recording of sound, device, storage medium and electronic equipment
CN110415725A (en) * 2019-07-15 2019-11-05 北京语言大学 Use the method and system of first language data assessment second language pronunciation quality
CN110390948A (en) * 2019-07-24 2019-10-29 厦门快商通科技股份有限公司 A kind of method and system of Rapid Speech identification
CN111128181A (en) * 2019-12-09 2020-05-08 科大讯飞股份有限公司 Recitation question evaluation method, device and equipment
CN111710332A (en) * 2020-06-30 2020-09-25 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN112185416A (en) * 2020-09-28 2021-01-05 上海松鼠课堂人工智能科技有限公司 AR-based word recitation method
CN112530455A (en) * 2020-11-24 2021-03-19 东风汽车集团有限公司 Automobile door closing sound quality evaluation method and evaluation system based on MFCC
CN114267341A (en) * 2021-12-28 2022-04-01 中国工商银行股份有限公司 Speech recognition processing method and device based on automatic teller machine business logic

Also Published As

Publication number Publication date
CN100411011C (en) 2008-08-13

Similar Documents

Publication Publication Date Title
CN1763843A (en) Pronunciation quality evaluating method for language learning machine
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN101246685B (en) Pronunciation quality evaluation method of computer auxiliary language learning system
US8949125B1 (en) Annotating maps with user-contributed pronunciations
US12462788B2 (en) Instantaneous learning in text-to-speech during dialog
WO2017067246A1 (en) Acoustic model generation method and device, and speech synthesis method and device
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
CN113593522B (en) Voice data labeling method and device
CN106297800A (en) A kind of method and apparatus of adaptive speech recognition
CN109697988B (en) A voice evaluation method and device
CN110600013A (en) Training method and device for non-parallel corpus voice conversion data enhancement model
CN115376547B (en) Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium
CN111179914B (en) A Speech Sample Screening Method Based on Improved Dynamic Time Warping Algorithm
CN110047474A (en) A kind of English phonetic pronunciation intelligent training system and training method
Mary et al. Searching speech databases: features, techniques and evaluation measures
Yuan et al. Using forced alignment for phonetics research
CN118398032A (en) Audio evaluation method, electronic device and storage medium
CN119380719A (en) Audio to text conversion method and device, electronic device, and storage medium
Sinclair et al. A semi-markov model for speech segmentation with an utterance-break prior
Rasanen Basic cuts revisited: Temporal segmentation of speech into phone-like units with statistical learning at a pre-linguistic level
CN1787070B (en) On-chip system for language learner
CN115019775A (en) Phoneme-based language identification method for language distinguishing characteristics
CN119920244A (en) An intelligent real-time language synchronous translation system and terminal thereof
Christina et al. Hmm-based speech recognition system for the dysarthric speech evaluation of articulatory subsystem
CN119360883A (en) A method for constructing a multi-national English pronunciation database and automatically recognizing it

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080813

Termination date: 20191118