CN114220419A

CN114220419A - A voice evaluation method, device, medium and equipment

Info

Publication number: CN114220419A
Application number: CN202111679151.0A
Authority: CN
Inventors: 金海�; 吴奎; 盛志超
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-03-22
Anticipated expiration: 2041-12-31
Also published as: CN114220419B

Abstract

The present application discloses a speech evaluation method, the method includes: acquiring speech to be evaluated and reference text, extracting acoustic features from the speech, inputting the acoustic features into K sub-acoustic models of the acoustic model, and obtaining comprehensive state posterior probability; Synthesizing the state posterior probability and the segmentation network constructed by the reference text, the phonemes of the speech are aligned with the reference text, and the scoring feature is obtained based on the alignment result; the scoring feature is input into the scoring model to regress the score of the voice. Since the K sub-acoustic models are obtained by training their respective in-set and out-of-set data, each sub-acoustic model has different out-of-set data, so the acoustic model can distinguish the out-of-set data, thereby improving the scoring accuracy of abnormal speech, solving the problem of It solves the problem of unstable abnormal speech score, and has high reliability and usability.

Description

A voice evaluation method, device, medium and equipment

技术领域technical field

本申请涉及语音处理技术领域，尤其涉及一种语音评价方法、装置、存储介质及设备。The present application relates to the technical field of speech processing, and in particular, to a speech evaluation method, apparatus, storage medium and device.

背景技术Background technique

随着教育改革的不断深入以及企业等机构对语言能力要求的提高，越来越多的人致力于提升自身的口语表达能力，并试图通过参与口试获得相应的证书。其中，朗读题是口语考试中最为常见的一种题型，因此，朗读题评分的准确性至关重要。With the deepening of education reform and the improvement of language ability requirements of enterprises and other institutions, more and more people are committed to improving their oral expression ability, and try to obtain corresponding certificates by participating in oral examinations. Among them, the reading questions are the most common type of questions in the oral exam, so the accuracy of the reading questions is very important.

近年来，语音识别技术在许多领域取得了显著的效果。因此，很多研究人员尝试将语音识别应用到口语考试，实现自动地对参试人员的语音进行评价，从而减轻老师的工作量。In recent years, speech recognition technology has achieved remarkable results in many fields. Therefore, many researchers try to apply speech recognition to oral examinations to automatically evaluate the participants' speech, thereby reducing the workload of teachers.

目前，口语考试中朗读题评分过程主要分为以下阶段：第一阶段是基于预先训练好的声学模型对朗读文本和考生语音做复杂网络强制切分对齐，获取增漏读信息、单词和音素级时间边界以及每帧的状态后验概率等信息；第二阶段是基于强制切分结果，提取与完整度、流畅度和发音准确度有关的评分特征；第三阶段是基于正常语音(例如按照文本正常朗读的语音)和异常语音(例如不按照文本朗读的语音)采用评分特征训练的二分类模型进行异常语音检测；第四阶段是根据人工分训练评分特征回归模型，测试阶段回归出机器分。At present, the scoring process of reading questions in the speaking test is mainly divided into the following stages: The first stage is based on the pre-trained acoustic model to do complex network segmentation and alignment of the reading text and the candidate's voice, and obtain the information of increased and missed reading, word and phoneme levels Information such as the time boundary and the state posterior probability of each frame; the second stage is based on the forced segmentation results, and the scoring features related to completeness, fluency and pronunciation accuracy are extracted; the third stage is based on normal speech (for example, according to text). Normal speech) and abnormal speech (such as speech not read according to text) are detected by using a binary classification model trained by scoring features; the fourth stage is to train the scoring feature regression model based on manual scores, and the machine score is regressed in the testing phase.

然而，声学模型的训练数据通常是由正常语音转写得到，声学模型在训练过程中并未见过异常语音，导致声学模型对于异常语音的分类效果未知，因此，基于声学模型提取的评分特征在区分性上会出现不稳定，进而导致朗读题评分结果具有不确定性。如何提供一种语音评价方法，从而获得准确的评分成为业界重点关注的问题。However, the training data of the acoustic model is usually transcribed from normal speech, and the acoustic model has not seen abnormal speech during the training process, so the classification effect of the acoustic model on abnormal speech is unknown. Therefore, the scoring features extracted based on the acoustic model are in There will be instability in the distinction, which will lead to uncertainty in the scoring results of reading questions. How to provide a voice evaluation method so as to obtain an accurate score has become a key concern of the industry.

发明内容SUMMARY OF THE INVENTION

本申请实施例的主要目的在于提供一种语音评价方法、系统、计算设备集群、计算机可读存储介质以及计算机程序产品，能够提高提高异常语音的评分准确性，解决了异常语音评分不稳定的问题，具有较高可靠性和可用性。The main purpose of the embodiments of the present application is to provide a voice evaluation method, system, computing device cluster, computer-readable storage medium, and computer program product, which can improve the scoring accuracy of abnormal voice and solve the problem of unstable abnormal voice scoring. , with high reliability and availability.

本申请实施例提供了一种语音评价方法，包括：The embodiment of the present application provides a voice evaluation method, including:

获取待评价的语音以及参考文本；Obtain the speech and reference text to be evaluated;

从所述语音中提取声学特征，将所述声学特征输入声学模型的K个子声学模型，获得综合状态后验概率，所述K个子声学模型通过各自的集内数据和集外数据训练得到，每个子声学模型的集外数据不同；Acoustic features are extracted from the speech, and the acoustic features are input into K sub-acoustic models of the acoustic model to obtain comprehensive state posterior probability. The out-of-set data of each sub-acoustic model are different;

根据所述综合状态后验概率以及通过所述参考文本构建的切分网络，将所述语音的音素与所述参考文本对齐，基于对齐结果获得评分特征；According to the comprehensive state posterior probability and the segmentation network constructed by the reference text, the phonemes of the speech are aligned with the reference text, and scoring features are obtained based on the alignment results;

将所述评分特征输入评分模型，以回归出所述语音的评分。The scoring features are input into a scoring model to regress the scores for the speech.

本申请实施例还提供了一种语音评价系统，包括：The embodiment of the present application also provides a voice evaluation system, including:

获取模块，用于获取待评价的语音以及参考文本；The acquisition module is used to acquire the speech to be evaluated and the reference text;

声学处理模块，用于从所述语音中提取声学特征，将所述声学特征输入声学模型的K个子声学模型，获得综合状态后验概率，所述K个子声学模型通过各自的集内数据和集外数据训练得到，每个子声学模型的集外数据不同；The acoustic processing module is used for extracting acoustic features from the speech, and inputting the acoustic features into K sub-acoustic models of the acoustic model to obtain comprehensive state posterior probability, and the K sub-acoustic models pass the respective data in the set and the sub-acoustic model. The out-of-set data is obtained by training, and the out-of-set data of each sub-acoustic model is different;

后处理模块，用于根据所述综合状态后验概率以及通过所述参考文本构建的切分网络，将所述语音的音素与所述参考文本对齐，基于对齐结果获得评分特征；A post-processing module, configured to align the phonemes of the speech with the reference text according to the comprehensive state posterior probability and the segmentation network constructed by the reference text, and obtain scoring features based on the alignment results;

评价模块，用于将所述评分特征输入评分模型，以回归出所述语音的评分。The evaluation module is used for inputting the scoring feature into a scoring model to regress the score of the speech.

本申请实施例还提供了一种计算设备集群，计算设备集群包括至少一个计算设备，所述至少一个计算设备包括：处理器、存储器、系统总线。An embodiment of the present application further provides a computing device cluster, where the computing device cluster includes at least one computing device, and the at least one computing device includes: a processor, a memory, and a system bus.

其中，所述处理器以及所述存储器通过所述系统总线相连；Wherein, the processor and the memory are connected through the system bus;

所述存储器用于存储一个或多个程序，所述一个或多个程序包括指令，所述指令当被所述处理器执行时使所述计算设备集群执行上述语音评价方法中的任意一种实现方式。The memory is used to store one or more programs, the one or more programs comprising instructions that, when executed by the processor, cause the cluster of computing devices to perform any one of the above-described speech evaluation methods. Way.

本申请实施例还提供了一种计算机可读存储介质，所述计算机可读存储介质中存储有指令，当所述指令在计算设备集群上运行时，使得所述计算设备集群执行上述方法中的任意一种实现方式。Embodiments of the present application further provide a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computing device cluster, the computing device cluster is made to execute the above method. any implementation.

本申请实施例还提供了一种计算机程序产品，所述计算机程序产品在计算设备集群上运行时，使得所述计算设备集群执行上述语音评价方法中的任意一种实现方式。Embodiments of the present application further provide a computer program product, which, when running on a computing device cluster, enables the computing device cluster to execute any one of the above voice evaluation methods.

本申请实施例提供的一种语音评价方法。具体地，语音评价系统获取待评价的语音以及参考文本(例如是朗读文本)，从所述语音中提取声学特征，将所述声学特征输入声学模型的K个子声学模型，获得综合状态后验概率，根据所述综合状态后验概率以及通过所述参考文本构建的切分网络，将所述语音的音素与所述参考文本对齐，基于对齐结果获得评分特征，将所述评分特征输入评分模型，以回归出所述语音的评分。A voice evaluation method provided by an embodiment of the present application. Specifically, the speech evaluation system obtains the speech to be evaluated and the reference text (for example, reading text), extracts acoustic features from the speech, inputs the acoustic features into the K sub-acoustic models of the acoustic model, and obtains the comprehensive state posterior probability , according to the comprehensive state posterior probability and the segmentation network constructed by the reference text, align the phonemes of the speech with the reference text, obtain scoring features based on the alignment results, and input the scoring features into the scoring model, to return the score for the speech.

该方法借鉴K折交叉验证的思想，获得K个子声学模型各自的集内数据和集外数据，并且每个子声学模型的集外数据不同。通过各自的集内数据和集外数据训练子声学模型，可以使得声学模型对集外数据具有区分性，进而提高了异常语音的评分准确性，解决了异常语音评分不稳定的问题，具有较高可靠性和可用性。This method draws on the idea of K-fold cross-validation to obtain the in-set data and out-set data of each of the K sub-acoustic models, and the out-of-set data of each sub-acoustic model is different. By training the sub-acoustic model with the respective in-set data and out-of-set data, the acoustic model can be differentiated from the out-of-set data, thereby improving the scoring accuracy of abnormal speech, solving the problem of unstable score of abnormal speech, and has high performance. reliability and availability.

附图说明Description of drawings

为了更清楚地说明本申请实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are For some embodiments of the present application, for those of ordinary skill in the art, other drawings can also be obtained according to these drawings without any creative effort.

图1为本申请实施例提供的一种语音评价方法的系统架构图；1 is a system architecture diagram of a voice evaluation method provided by an embodiment of the present application;

图2为本申请实施例提供的一种语音评价方法的流程图；2 is a flowchart of a voice evaluation method provided by an embodiment of the present application;

图3为本申请实施例提供的一种训练声学模型的流程示意图；3 is a schematic flowchart of a training acoustic model provided by an embodiment of the present application;

图4为本申请实施例提供的一种语音评价系统的结构示意图；4 is a schematic structural diagram of a voice evaluation system provided by an embodiment of the present application;

图5为本申请实施例提供的一种计算设备集群的结构示意图。FIG. 5 is a schematic structural diagram of a computing device cluster according to an embodiment of the present application.

具体实施方式Detailed ways

本申请实施例中的术语“第一”、“第二”仅用于描述目的，而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此，限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多该特征。The terms "first" and "second" in the embodiments of the present application are only used for the purpose of description, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features. Thus, a feature defined as "first", "second" may expressly or implicitly include one or more of that feature.

首先对本申请实施例中所涉及到的一些技术术语进行介绍。First, some technical terms involved in the embodiments of this application are introduced.

语音识别也称为自动语音识别(Automatic Speech Recognition，ASR)、计算机语音识别或语音转文本。语音识别用于将语音(例如是人类语音)处理为书面格式的文本。Speech recognition is also known as Automatic Speech Recognition (ASR), computer speech recognition or speech-to-text. Speech recognition is used to process speech (eg, human speech) into written text.

语音识别在许多领域取得了显著的效果。因此，很多研究人员尝试将语音识别应用到口语考试，实现自动地对参试人员的语音进行评价，从而减轻老师的工作量。类似地，语音识别也可以应用到口语训练，实现对训练时的语音进行评价，从而辅助口语学习。Speech recognition has achieved remarkable results in many fields. Therefore, many researchers try to apply speech recognition to oral examinations to automatically evaluate the participants' speech, thereby reducing the workload of teachers. Similarly, speech recognition can also be applied to oral language training to evaluate the speech during training, thereby assisting oral language learning.

口语考试或口语训练中包括朗读题。朗读题是指示按照文本进行朗读的题型。目前，朗读题评分过程主要分为以下阶段：第一阶段是基于预先训练好的声学模型对朗读文本和考生语音做复杂网络强制切分对齐，获取增漏读信息、单词和音素级时间边界以及每帧的状态后验概率等信息；第二阶段是基于强制切分结果，提取与完整度、流畅度和发音准确度有关的评分特征；第三阶段是基于正常语音(例如按照文本正常朗读的语音)和异常语音(例如不按照文本朗读的语音)采用评分特征训练的二分类模型进行异常语音检测；第四阶段是根据人工分训练评分特征回归模型，测试阶段回归出机器分。Reading questions are included in the speaking test or training. A reading question is a question type that indicates reading aloud according to the text. At present, the scoring process of reading questions is mainly divided into the following stages: The first stage is based on the pre-trained acoustic model to do complex network segmentation and alignment of the reading text and the candidate's voice, and obtain the information of increased and missed reading, word and phoneme-level time boundaries, and The state posterior probability of each frame and other information; the second stage is based on the forced segmentation results, and the scoring features related to completeness, fluency and pronunciation accuracy are extracted; the third stage is based on normal speech (such as normal reading according to text) Speech) and abnormal speech (such as speech that is not read according to text) are detected by using a binary classification model trained by scoring features; the fourth stage is to train a regression model of scoring features based on manual scores, and the machine score is returned in the testing phase.

然而，声学模型的训练数据通常是由正常语音转写得到，声学模型在训练过程中并未见过异常语音，导致声学模型对于异常语音的分类效果未知，因此，基于声学模型提取的评分特征在区分性上会出现不稳定，进而导致朗读题评分结果具有不确定性。However, the training data of the acoustic model is usually transcribed from normal speech, and the acoustic model has not seen abnormal speech during the training process, so the classification effect of the acoustic model on abnormal speech is unknown. Therefore, the scoring features extracted based on the acoustic model are in There will be instability in the distinction, which will lead to uncertainty in the scoring results of reading questions.

为了解决上述问题，本申请实施例提供了一种语音评价方法。该方法可以由语音评价系统执行。在一些实施例中，语音评价系统可以是软件系统，计算设备或计算设备集群通过运行该软件系统的程序代码，以执行语音评价方法。在另一些实施例中，该语音评价系统也可以是用于语音评价的硬件系统。本申请实施例以语音评价系统为软件系统进行示例说明。In order to solve the above problem, an embodiment of the present application provides a voice evaluation method. The method may be performed by a speech evaluation system. In some embodiments, the speech evaluation system may be a software system, and a computing device or a cluster of computing devices executes the speech evaluation method by running the program code of the software system. In other embodiments, the speech evaluation system may also be a hardware system for speech evaluation. The embodiment of the present application uses a voice evaluation system as a software system for illustration.

具体地，语音评价系统获取待评价的语音以及参考文本(例如是朗读文本)，从所述语音中提取声学特征，将所述声学特征输入声学模型的K个子声学模型，获得综合状态后验概率，根据所述综合状态后验概率以及通过所述参考文本构建的切分网络，将所述语音的音素与所述参考文本对齐，基于对齐结果获得评分特征，将所述评分特征输入评分模型，以回归出所述语音的评分。Specifically, the speech evaluation system obtains the speech to be evaluated and the reference text (for example, reading text), extracts acoustic features from the speech, inputs the acoustic features into the K sub-acoustic models of the acoustic model, and obtains the comprehensive state posterior probability , according to the comprehensive state posterior probability and the segmentation network constructed by the reference text, align the phonemes of the speech with the reference text, obtain scoring features based on the alignment results, and input the scoring features into the scoring model, to return the score for the speech.

该方法借鉴K折交叉验证的思想，将声学模型的输出层的状态类别分为K组，并基于K组状态类别获得K个子声学模型各自的集内数据和集外数据，并且每个子声学模型的集外数据不同。利用各自的集内数据和集外数据训练子声学模型可以使得声学模型对集外数据具有区分性，进而提高了异常语音的评分准确性，解决了异常语音评分不稳定的问题，具有较高可靠性和可用性。This method draws on the idea of K-fold cross-validation, divides the state categories of the output layer of the acoustic model into K groups, and obtains the respective in-set and out-set data of the K sub-acoustic models based on the K-group state categories, and each sub-acoustic model The out-of-set data is different. Using the respective in-set data and out-of-set data to train the sub-acoustic model can make the acoustic model distinguishable from the out-of-set data, thereby improving the scoring accuracy of abnormal speech, solving the problem of unstable abnormal speech scoring, and having high reliability sex and availability.

为了使得本申请的技术方案更加清楚、易于理解，下面结合附图对本申请的应用场景进行介绍。In order to make the technical solutions of the present application clearer and easier to understand, the following describes the application scenarios of the present application with reference to the accompanying drawings.

参见图1所示的语音评价方法的应用场景示意图，如图1所示，服务器10与终端20建立有通信连接。该通信连接例如可以实现有线网络连接或者无线无线网络连接。Referring to the schematic diagram of the application scenario of the speech evaluation method shown in FIG. 1 , as shown in FIG. 1 , a communication connection is established between the server 10 and the terminal 20 . The communication connection may, for example, implement a wired network connection or a wireless wireless network connection.

服务器10中部署有语音评价系统(图1中未示出)，以用于提供语音评价服务。终端20具有录音功能，用于向用户呈现口语考试的试卷，并采集用户针对该试卷的作答语音。其中，口语考试可以是英语口语考试，或者是普通话口语考试，本实施例以英语口语考试进行示例说明。具体地，试卷中包括朗读题，终端20可以向用户呈现朗读题对应的朗读文本，然后采集用户根据朗读文本进行朗读时的语音。终端20可以将采集到的上述语音作为待评价的语音，上述朗读文本作为参考文本，发送至服务器10，由服务器10通过语音评价系统进行评价。A voice evaluation system (not shown in FIG. 1 ) is deployed in the server 10 to provide a voice evaluation service. The terminal 20 has a recording function for presenting the test paper of the oral test to the user, and collecting the user's answering voice for the test paper. Wherein, the oral test may be an English oral test or a Putonghua oral test, and the present embodiment uses the English oral test as an example for description. Specifically, the test paper includes reading questions, and the terminal 20 may present the reading text corresponding to the reading question to the user, and then collect the user's voice when reading aloud according to the reading text. The terminal 20 may use the collected voice as the voice to be evaluated, and the read-aloud text as the reference text, and send it to the server 10, and the server 10 will evaluate through the voice evaluation system.

具体地，服务器10中的语音评价系统在获取待评价的语音以及参考文本后，从语音中提取声学特征，将声学特征输入声学模型的K个子声学模型，获得综合状态后验概率，然后根据所述综合状态后验概率以及通过所述参考文本构建的切分网络，将所述语音的音素与所述参考文本对齐，基于对齐结果获得评分特征，接着将所述评分特征输入评分模型，以回归出所述语音的评分。Specifically, after acquiring the speech to be evaluated and the reference text, the speech evaluation system in the server 10 extracts the acoustic features from the speech, inputs the acoustic features into the K sub-acoustic models of the acoustic model, obtains the comprehensive state posterior probability, and then according to the The integrated state posterior probability and the segmentation network constructed by the reference text, align the phonemes of the voice with the reference text, obtain scoring features based on the alignment results, and then input the scoring features into a scoring model to regression out a score for the speech.

进一步地，服务器10还可以向终端20返回所述语音的评分，如此，终端20可以向用户呈现语音的评分。Further, the server 10 can also return the score of the voice to the terminal 20, so that the terminal 20 can present the score of the voice to the user.

需要说明的是，图1仅以语音评价系统部署在服务器进行示例说明，在本申请实施例其他可能的实现方式中，语音评价系统也可以部署在终端，或者是部署在多个计算设备形成的计算设备集群，本实施例对此不作限制。It should be noted that FIG. 1 only illustrates that the voice evaluation system is deployed on the server. In other possible implementations of the embodiments of the present application, the voice evaluation system may also be deployed on a terminal, or a system formed by being deployed on multiple computing devices. A computing device cluster, which is not limited in this embodiment.

下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

参见图2所示的语音评价方法的流程图，该方法包括以下步骤：Referring to the flowchart of the speech evaluation method shown in Figure 2, the method includes the following steps:

S202：语音评价系统获取待评价的语音以及参考文本。S202: The speech evaluation system acquires the speech to be evaluated and the reference text.

待评价的语音是指用户(如口语考试参试人员、口语学员)按照参考文本进行朗读时的语音。其中，朗读文本为预设的文本，例如是口语考试试卷中的文本。参考文本可以是英语文本或者是中文文本，相应地，待评价的语音可以是英语语音或者是普通话语音，本实施例对此不作限制。The voice to be evaluated refers to the voice of a user (such as a test taker of an oral language test, an oral language student) reading aloud according to the reference text. The read-aloud text is a preset text, for example, the text in the oral examination paper. The reference text may be English text or Chinese text, and correspondingly, the speech to be evaluated may be English speech or Mandarin speech, which is not limited in this embodiment.

S204：语音评价系统从所述语音中提取声学特征，将所述声学特征输入声学模型的K个子声学模型，获得综合状态后验概率。S204: The speech evaluation system extracts acoustic features from the speech, and inputs the acoustic features into the K sub-acoustic models of the acoustic model to obtain comprehensive state posterior probability.

声学特征可以为线性预测倒谱系数(Linear Predictive Cepstral Coding，LPCC)特征、滤波器(FilterBank，fBank)特征、梅尔频率倒谱系数(Mel-frequency CeptraICoefficients，MFCC)特征或感知线性预测(Perceptual Linear Predictive，PLP)特征中的任意一种或多种。The acoustic features can be Linear Predictive Cepstral Coding (LPCC) features, Filter Bank (fBank) features, Mel-frequency CeptraIC Coefficients (MFCC) features or Perceptual Linear Prediction (Perceptual Linear Prediction) features. Any one or more of Predictive, PLP) features.

语音评价系统可以通过声学特征提取算法，从语音中提取相应的特征。下面以提取MFCC特征进行示例说明。具体地，语音评价系统可以先对语音进行预处理，例如是进行分帧、预加重和加窗，如此可以得到多个语音帧，然后语音评价系统可以将多个语音帧由时域转换至频域，例如，语音评价系统可以对多个语音帧进行快速傅里叶变换(fast Fouriertransform，FFT)。经过FFT所得的特征为一个复数矩阵。该复数矩阵表示一个能量谱。由于能量谱中的相位谱包含的信息量极少，语音评价系统可以通过对复数矩阵中的复数求平方或者求绝对值的方式，实现丢弃相位谱，而保留幅度谱。语音评价系统可以对幅度谱进行Mel滤波，从而将线性频率转换为非线性分布的Mel频率。经过Mel滤波之后的特征也称作fBank特征。The speech evaluation system can extract the corresponding features from the speech through the acoustic feature extraction algorithm. The following is an example of extracting MFCC features. Specifically, the speech evaluation system can preprocess the speech, such as framing, pre-emphasis and windowing, so that multiple speech frames can be obtained, and then the speech evaluation system can convert the multiple speech frames from time domain to frequency domain. Domain, for example, a speech evaluation system may perform a fast Fourier transform (FFT) on multiple speech frames. The feature obtained through FFT is a complex matrix. The complex number matrix represents an energy spectrum. Since the phase spectrum in the energy spectrum contains very little information, the speech evaluation system can discard the phase spectrum and retain the amplitude spectrum by squaring or finding the absolute value of the complex numbers in the complex number matrix. The speech evaluation system can perform Mel filtering on the magnitude spectrum to convert linear frequencies into non-linearly distributed Mel frequencies. The features after Mel filtering are also called fBank features.

由于人耳对声音的感受是成对数值增长的，所以语音评价系统可以将fBank特征再进行一次对数运算，以模拟人耳的感受。进一步地，语音评价系统还可以在进行对数运算之后，进行离散余弦变换(Discrete Cosine Transform，DCT)，以消除不同阶数之间的相关性，将信号映射到低维空间中，得到MFCC特征。Since the human ear's perception of sound increases in pairs, the speech evaluation system can perform a logarithmic operation on the fBank feature to simulate the human ear's perception. Further, the speech evaluation system can also perform discrete cosine transform (Discrete Cosine Transform, DCT) after logarithmic operation to eliminate the correlation between different orders, map the signal into a low-dimensional space, and obtain MFCC features. .

上述MFCC特征为静态的，语音评价系统还可以对静态的MFCC特征进行差分运算，得到动态的MFCC特征。语音评价系统可以将静态的MFCC特征和动态的MFCC特征进行结合，以提高识别性能。The above-mentioned MFCC features are static, and the speech evaluation system can also perform differential operations on the static MFCC features to obtain dynamic MFCC features. The speech evaluation system can combine static MFCC features and dynamic MFCC features to improve the recognition performance.

声学模型的输出层的状态类别是指语音识别中基于隐马尔科夫模型(HiddenMarkov Model，HMM)对三音素(tri-phone)建模的隐状态。例如，状态类别可以包括由音素/k/、/uh/、/p/建模的隐状态“cup”。当用户发出语音“would u like a cup of tea”时，该语音的一个语音帧的状态类别可以为上述“cup”。The state category of the output layer of the acoustic model refers to the hidden state of tri-phone modeling based on Hidden Markov Model (HMM) in speech recognition. For example, a state class may include the hidden state "cup" modeled by the phonemes /k/, /uh/, /p/. When the user utters the speech "would u like a cup of tea", the state category of one speech frame of the speech may be the above-mentioned "cup".

声学模型的输出层的状态类别可以分为K组。其中，状态类别可以平均分为K组。在分组时，可以采用随机方式进行分组，也可以根据音素类型进行分组，例如英语中可以根据元音和辅音进行分组，普通话中可以根据声母和韵母进行分组。The state categories of the output layer of the acoustic model can be divided into K groups. Among them, the state categories can be equally divided into K groups. When grouping, grouping can be done in a random manner or according to phoneme types. For example, in English, groups can be grouped according to vowels and consonants, and in Mandarin, groups can be grouped according to initials and finals.

声学模型具有K个子声学模型。K个声学子模型通过各自的集内数据和集外数据训练得到。其中，每个子声学模型的集外数据为数据集中状态类别属于所述K组中的一组的数据。每个子声学模型的集外数据不同。每个子声学模型的集内数据为所述数据集中状态类别属于所述K组中的剩余K-1组的数据。其中，集内数据的标签为状态类别的向量，例如是one-hot向量，集外数据没有对应的状态类别，其标签设置为对应组的状态类别的均匀分布。The acoustic model has K sub-acoustic models. The K acoustic sub-models are trained by their respective in-set and out-of-set data. Wherein, the out-of-set data of each sub-acoustic model is the data whose state category in the dataset belongs to one of the K groups. The out-of-set data is different for each sub-acoustic model. The intra-set data of each sub-acoustic model is the data in the data set whose state category belongs to the remaining K-1 groups in the K groups. Among them, the label of the data in the set is the vector of the state category, such as a one-hot vector, and the data outside the set has no corresponding state category, and its label is set to the uniform distribution of the state category of the corresponding group.

基于此，语音评价系统将所述声学特征输入声学模型的K个子声学模型后，针对每个状态类型，K个声学子模型中有K-1个声学子模型可以输出该状态类型的后验概率。语音评价系统可以根据K-1个后验概率获得综合状态后验概率。例如语音评价系统可以确定K-1个后验概率的平均值，作为综合状态后验概率。Based on this, after the speech evaluation system inputs the acoustic features into the K sub-acoustic models of the acoustic model, for each state type, K-1 acoustic sub-models in the K acoustic sub-models can output the posterior probability of the state type . The speech evaluation system can obtain the comprehensive state posterior probability according to K-1 posterior probability. For example, the speech evaluation system can determine the average value of K-1 posterior probabilities as the comprehensive state posterior probability.

S206：语音评价系统根据所述综合状态后验概率以及通过所述参考文本构建的切分网络，将所述语音的音素与所述参考文本对齐，基于对齐结果获得评分特征。S206: The speech evaluation system aligns the phonemes of the speech with the reference text according to the comprehensive state posterior probability and the segmentation network constructed by the reference text, and obtains scoring features based on the alignment result.

切分网络用于语音与文本的强制切分对齐(Force Alignment)。切分网络可以对参考文本进行切分，从而获得用户对参考文本的朗读路径的多种可能，例如包括正常的朗读路径、单词或句子增读、漏读、重读或者句子回读的路径。通常情况下，切分网络可以基于设定的规则，对参考文本进行构建得到。The segmentation network is used for Force Alignment of speech and text. The segmentation network can segment the reference text, so as to obtain various possibilities of the user's reading path for the reference text, such as the normal reading path, the word or sentence addition, omission, stress or sentence readback path. Usually, the segmentation network can be obtained by constructing the reference text based on the set rules.

强制切分对齐是指采用维特比算法(Viterbi algorithm)，基于切分网络和声学模型输出的状态后验概率找出的最优的朗读路径，从而根据该最优的朗读路径实现语音和参考文本对齐。Forced segmentation and alignment refers to the use of the Viterbi algorithm to find the optimal reading path based on the state posterior probability output by the segmentation network and the acoustic model, so as to realize the speech and reference text according to the optimal reading path. Align.

其中，维特比算法是一种动态规划算法。该算法用于寻找最有可能产生观测事件序列(或者称为观察事件序列)的维特比路径，该维特比路径通过隐含状态序列(或者称为隐状态序列)表征。具体到语音评价场景，语音(例如是语音中的音素序列)可以视为观察事件序列，参考文本中的字符序列可以视为隐状态序列。因此，语音评价系统可以对语音应用维特比算法寻找最有可能的隐状态序列，从而实现语音和参考文本对齐。Among them, the Viterbi algorithm is a dynamic programming algorithm. The algorithm is used to find a Viterbi path that is most likely to generate an observation event sequence (or called an observation event sequence), and the Viterbi path is characterized by a hidden state sequence (or called a hidden state sequence). Specific to the speech evaluation scenario, speech (such as phoneme sequence in speech) can be regarded as the sequence of observation events, and the sequence of characters in the reference text can be regarded as the sequence of hidden states. Therefore, the speech evaluation system can apply the Viterbi algorithm to the speech to find the most probable sequence of hidden states, so as to achieve the alignment of speech and reference text.

具体地，语音评价系统可以根据综合状态后验概率以及通过所述参考文本构建的切分网络，通过维特比算法确定目标朗读路径，该目标朗读路径例如可以是上述最优的朗读路径，然后语音评价系统可以根据所述目标朗读路径将所述语音的音素与所述参考文本对齐，获得所述音素的时间边界信息。Specifically, the speech evaluation system can determine the target reading path through the Viterbi algorithm according to the comprehensive state posterior probability and the segmentation network constructed by the reference text. For example, the target reading path can be the above optimal reading path, and then the speech The evaluation system may align the phonemes of the speech with the reference text according to the target reading path, and obtain time boundary information of the phonemes.

在对语音进行评价时，可以从不同维度进行评价。相应地，评分特征可以包括以下一种或多种：发音良好度(goodness of pronunciation，GOP)、单词增漏读比例、平均音素发音时长。When evaluating speech, it can be evaluated from different dimensions. Correspondingly, the scoring features may include one or more of the following: goodness of pronunciation (GOP), the proportion of words that are added or omitted, and the average phoneme pronunciation duration.

在本实施例中，语音评价系统可以基于音素的时间边界信息进一步计算发音良好度GOP、单词增漏读比例、平均音素发音时长等评分特征，以用于衡量用户的发音准确度、完整度和流畅度。In this embodiment, the speech evaluation system can further calculate scoring features such as good pronunciation GOP, word increase and omission reading ratio, average phoneme pronunciation duration, etc. based on the time boundary information of phonemes, so as to measure the user's pronunciation accuracy, completeness and accuracy. Fluency.

GOP基于音素的时间边界信息和综合状态后验概率计算得到，用于评价用户的发音正确程度。GOP计算公式如下所示：GOP is calculated based on the time boundary information of the phoneme and the posterior probability of the comprehensive state, and is used to evaluate the correctness of the user's pronunciation. The GOP calculation formula is as follows:

其中，p为某个音素，o为观察向量(也称作观测向量)，每一个观察向量是由一个具有相应概率密度分布的状态序列产生。Q为所有音素的集合，p(o|p)为音素p的似然概率，S_i是第i个可能的状态序列，状态序列通过Viterbi算法找到，s_i,t是t时刻对应的tri-phone状态。Among them, p is a phoneme, o is an observation vector (also called observation vector), and each observation vector is generated by a state sequence with a corresponding probability density distribution. Q is the set of all phonemes, p(o|p) is the likelihood probability of the phoneme p, S _i is the ith possible state sequence, the state sequence is found by the Viterbi algorithm, s _{i, t} are the tri- phone status.

语音评价系统还可以基于音素的时间边界信息确定增读的单词、漏读的单词中的至少一个。例如，语音评价系统可以基于增读的单词的数量和单词总数量，确定单词增读比例，基于漏读的单词数量和单词总数量确定单词漏读比例。此外，语音评价系统还可以基于上述因素的时间边界信息确定用户对各音素的发音时长，基于该发音时长，可以确定平均音素发音时长。The speech evaluation system may further determine at least one of the added word and the missed word based on the time boundary information of the phonemes. For example, the speech evaluation system may determine the word read-increase ratio based on the number of added words and the total number of words, and determine the word miss-read ratio based on the number of missed words and the total number of words. In addition, the speech evaluation system can also determine the pronunciation duration of each phoneme by the user based on the time boundary information of the above factors, and based on the pronunciation duration, the average phoneme pronunciation duration can be determined.

在本实施例中，综合状态后验概率具有区分性，基于该综合状态后验概率确定的GOP等评分特征也具有区分性，将该评分特征用于语音评价，具有较高准确度。In this embodiment, the comprehensive state posterior probability is distinguishable, and the scoring features such as GOP determined based on the comprehensive state posterior probability are also distinguishable, and the scoring feature is used for speech evaluation with high accuracy.

S208：语音评价系统将所述评分特征输入评分模型，以回归出所述语音的评分。S208: The speech evaluation system inputs the scoring feature into a scoring model to regress the speech score.

评分模型是一种对语音进行评价的模型。该评分模型以语音的评分特征为输入，例如是以语音的GOP、单词增漏读比例、平均音素发音时长为输入，以语音的评分为输出。其中，评分模型通常是回归模型，例如为线性回归模型、决策树回归模型、高斯回归模型或者神经网络回归模型中的任意一种。相应地，评分模型可以通过回归方式输出语音的评分，该评分也称作机器分。A scoring model is a model that evaluates speech. The scoring model takes the scoring features of speech as input, for example, the GOP of speech, the ratio of word augmentation and omission reading, and the average phoneme pronunciation time as input, and the speech score as output. The scoring model is usually a regression model, such as any one of a linear regression model, a decision tree regression model, a Gaussian regression model, or a neural network regression model. Correspondingly, the scoring model can output the score of speech through regression, which is also called machine score.

其中，评分模型可以基于从数据集中提取的评分特征以及相应的人工分(具体是人工标注的评分)，对初始化的回归模型进行训练得到。语音评价系统可以利用数据集预先训练该评分模型，然后在推理阶段，利用该评分模型进行推理，从而实现对语音进行评价。The scoring model can be obtained by training the initialized regression model based on the scoring features extracted from the data set and the corresponding manual scores (specifically, manually labeled scores). The speech evaluation system can use the data set to pre-train the scoring model, and then use the scoring model to perform inference in the inference stage, so as to evaluate the speech.

基于上述内容描述，本申请实施例提供了一种语音评价方法。该方法借鉴K折交叉验证的思想，将声学模型的输出层的状态类别分为K组，K个子声学模型通过各自的集内数据和集外数据训练得到，每个子声学模型的集外数据为数据集中状态类别属于所述K组中的一组的数据，并且每个子声学模型的集外数据不同。每个子声学模型的集内数据为所述数据集中状态类别属于所述K组中的剩余K-1组的数据。如此使得声学模型对集外数据具有区分性，进而提高了异常语音的评分准确性，解决了异常语音评分不稳定的问题，具有较高可靠性和可用性。Based on the above content description, an embodiment of the present application provides a voice evaluation method. This method draws on the idea of K-fold cross-validation, and divides the state categories of the output layer of the acoustic model into K groups. The K sub-acoustic models are obtained by training their respective in-set and out-of-set data. The out-of-set data of each sub-acoustic model is The state category in the dataset belongs to the data of one of the K groups, and the out-of-set data is different for each sub-acoustic model. The intra-set data of each sub-acoustic model is the data in the data set whose state category belongs to the remaining K-1 groups in the K groups. In this way, the acoustic model can distinguish the data outside the set, thereby improving the scoring accuracy of abnormal speech, solving the problem of unstable scoring of abnormal speech, and having high reliability and usability.

本申请实施例提供的语音评价方法依赖于声学模型，该声学模型可以预先训练得到，接下来对声学模型的训练过程进行介绍。The speech evaluation method provided by the embodiment of the present application relies on an acoustic model, and the acoustic model can be obtained by pre-training. Next, the training process of the acoustic model is introduced.

参见图3所示的训练声学模型的流程示意图，该方法包括：Referring to the schematic flowchart of training an acoustic model shown in FIG. 3, the method includes:

S302：语音评价系统获取数据集。S302: The speech evaluation system acquires a data set.

数据集中的每个样本数据具有K个标签。其中，K-1个标签为该样本数据对应的状态类别的向量，例如是one-hot向量，剩余1个标签为状态类别的均匀分布。Each sample data in the dataset has K labels. Among them, K-1 labels are vectors of state categories corresponding to the sample data, such as one-hot vectors, and the remaining 1 label is a uniform distribution of state categories.

具体地，用户可以将声学模型的输出层的状态类别分为K组。假设输出层的状态类别数为N，则用户可以将状态类别平均分为K组，每组包括N/K个状态类别。用户在分组时，可以采用随机方式，或者按照音素进行分组。然后借鉴K折交叉验证的思想依次剔除其中一组状态类别，将剩余K-1组包括的状态类别构成一个状态类别集合，最终可以构建K个状态类别集合，且每个状态类别均包括在K-1个状态类别集合中。Specifically, the user can divide the state categories of the output layer of the acoustic model into K groups. Assuming that the number of state categories in the output layer is N, the user can equally divide the state categories into K groups, each group including N/K state categories. When users are grouped, they can use a random method or group them according to phonemes. Then, referring to the idea of K-fold cross-validation, one group of state categories is removed in turn, and the remaining K-1 groups of state categories are formed into a state category set, and finally K state category sets can be constructed, and each state category is included in K - 1 state category set.

接着，用户可以基于每一个状态类别集合，触发为数据集中的样本数据生成标签的操作。以其中一个状态类别集合进行示例说明，数据集中的样本数据可以分为集内数据和集外数据。集内数据是指状态类别在该状态类别集合中的样本数据(如语音帧)，集外数据是指状态类别不在该状态类别集合中的样本数据。针对集内数据，确定该集内数据的状态类别，将状态类别的one-hot向量作为标签，针对集外数据，由于无法确定其状态类别，可以将当前组中状态类别的均匀分布作为标签。Then, based on each state category set, the user can trigger the operation of generating labels for the sample data in the dataset. Taking one of the state category sets as an example, the sample data in the dataset can be divided into in-set data and out-of-set data. In-set data refers to sample data (such as speech frames) whose state category is in the state category set, and out-set data refers to sample data whose state category is not in the state category set. For the data in the set, the state category of the data in the set is determined, and the one-hot vector of the state category is used as the label. For the data outside the set, since the state category cannot be determined, the uniform distribution of the state categories in the current group can be used as the label.

需要说明的是，本实施例中的数据集不仅可以包括正常语音生成的样本数据，还可以包括异常语音生成的样本数据。异常语音包括不按照所述参考文本朗读的语音和/或音质异常的语音。以英语口语考试场景为例，不按照所述参考文本朗读的语音可以包括普通话、方言、中式英语、敲键盘、吹话筒、唱歌或者咳嗽等等。It should be noted that the data set in this embodiment may include not only sample data generated by normal speech, but also sample data generated by abnormal speech. Abnormal speech includes speech that is not read according to the reference text and/or speech with abnormal sound quality. Taking the English speaking test scenario as an example, the speech that is not read out according to the reference text may include Mandarin, dialect, Chinese English, typing on a keyboard, blowing a microphone, singing or coughing, and so on.

S304：语音评价系统根据数据集训练声学模型中的K个声学子模型。S304: The speech evaluation system trains K acoustic sub-models in the acoustic model according to the data set.

其中，数据集中的每个样本数据具有K个标签，也即数据集包括K组标签。基于此，语音评价系统可以根据数据集的一组标签训练一个声学子模型，每组标签分别用于训练不同的声学子模型。Wherein, each sample data in the dataset has K labels, that is, the dataset includes K groups of labels. Based on this, the speech evaluation system can train an acoustic sub-model according to a set of labels of the dataset, and each set of labels is used to train a different acoustic sub-model.

为了保证最终的声学模型对集内数据和集外数据具有区分性，在训练过程中采用两个训练目标(如两个损失函数)更新声学子模型的参数。下面以其中一个声学子模型的训练进行示例说明。In order to ensure that the final acoustic model is distinguishable between in-set data and out-of-set data, two training objectives (such as two loss functions) are used to update the parameters of the acoustic sub-model during the training process. The following is an example of the training of one of the acoustic sub-models.

对于集内数据，其标签为one-hot向量，因而可以采用标准的交叉熵(CrossEntropy)函数计算损失，其中，该损失也称作标准的交叉熵损失，记作

在一些实施例中，也可以采用Focal Loss函数计算损失，该损失记作

如此，可以缓解模型训练过程中各类样本数量比例不均衡的情况，让模型更加关注难分类样本。其中，

和

计算公式如下所示：For the data in the set, the label is a one-hot vector, so the standard cross entropy (CrossEntropy) function can be used to calculate the loss, where the loss is also called the standard cross entropy loss, denoted as

In some embodiments, the loss can also be calculated using the Focal Loss function, and the loss is recorded as

In this way, the unbalanced proportion of various samples in the model training process can be alleviated, and the model can pay more attention to the difficult-to-classify samples. in,

and

The calculation formula is as follows:

其中，X为用于训练该声学子模型的集内数据，Y为集内数据的标签，f(x_i)是指x_i帧数据的模型预测结果，

是指x_i帧数据在所述类别标签为c的模型预测结果，α_c是类别c的权重，γ为衰减系数。Among them, X is the data in the set used to train the acoustic sub-model, Y is the label of the data in the set, f(x _i ) refers to the model prediction result of the _xi frame data,

Refers to the model prediction result of the x _i frame data in the category label c, α _c is the weight of category c, and γ is the attenuation coefficient.

对集外数据，其标签为状态类别的均匀分布，可以使用KL散度(Kullback-LeiblerDivergence)计算损失

也可以使用JS散度(Jensen-Shannon Divergence)计算损失

两者之间的差异在于JS散度解决了KL散度不对称的问题。其中，

和

计算公式如下所示：For out-of-set data, the label is a uniform distribution of state categories, and the loss can be calculated using KL divergence (Kullback-Leibler Divergence)

Loss can also be calculated using JS Divergence (Jensen-Shannon Divergence)

The difference between the two is that JS divergence solves the problem of KL divergence asymmetry. in,

and

The calculation formula is as follows:

其中，

为用于训练该声学子模型的集外数据，μ为集外数据的标签，f(x_i)是指x_i帧数据的模型预测结果。in,

is the out-of-set data used to train the acoustic sub-model, μ is the label of the out-of-set data, and f( _xi ) refers to the model prediction result of the x _i frame data.

根据上述损失可以确定声学子模型的损失，具体如下所示：According to the above losses, the loss of the acoustic sub-model can be determined as follows:

其中，λ是一个超参，取值范围为[0,1]区间，通常可以基于数据集中的验证集微调确定。Among them, λ is a hyperparameter with a value in the [0,1] interval, which can usually be determined by fine-tuning based on the validation set in the dataset.

该方法通过K折声学模型训练框架解决异常作答语音评分不稳定问题，不需要引入单独的异常检测模块。具体地，首先依据K折交叉验证的思想将声学模型输出层的状态类别数分成K个状态类别集合；然后依据每个状态类别集合将数据集中的样本数据划分为集内数据和集外数据，标签分别用one-hot向量和类别的均匀分布表示；接着分别训练K个独立的子声学模型，在训练每个子声学模型时，集内数据和集外数据采用CE Loss和KL Loss两种不同的目标函数训练；最后在计算帧级状态后验概率时将K-1个模型的后验概率平均，进一步做强制切分和提取评分特征。本案通过在训练声学模型时引入K-折声学模型训练框架，让模型对集外数据具有区分性，进而提高集外异常语音数据的评分准确性。The method solves the problem of instability of abnormal response speech scores through the K-fold acoustic model training framework, and does not need to introduce a separate abnormality detection module. Specifically, first, according to the idea of K-fold cross-validation, the state categories of the output layer of the acoustic model are divided into K state category sets; then the sample data in the dataset is divided into in-set data and out-of-set data according to each state category set. Labels are represented by a one-hot vector and a uniform distribution of categories respectively; then K independent sub-acoustic models are trained respectively. When training each sub-acoustic model, the data in the set and the data outside the set use CE Loss and KL Loss. Objective function training; finally, the posterior probability of K-1 models is averaged when calculating the posterior probability of frame-level state, and further forced segmentation and extraction of scoring features are performed. In this case, by introducing the K-fold acoustic model training framework when training the acoustic model, the model can distinguish the data outside the set, thereby improving the scoring accuracy of the abnormal speech data outside the set.

基于本申请实施例提供的语音评价方法，本申请实施例还提供一种语音评价系统。该语音评价系统用于执行上述语音评价方法。接下来，将结合附图对本申请实施例的语音评价系统进行介绍。Based on the voice evaluation method provided by the embodiment of the present application, the embodiment of the present application further provides a voice evaluation system. The speech evaluation system is used to execute the above-mentioned speech evaluation method. Next, the speech evaluation system of the embodiment of the present application will be introduced with reference to the accompanying drawings.

参见图4所示的语音评价系统的结构示意图，如图4所示，语音评价系统400包括：Referring to the schematic structural diagram of the speech evaluation system shown in FIG. 4 , as shown in FIG. 4 , the speech evaluation system 400 includes:

获取模块402，用于获取待评价的语音以及参考文本；an acquisition module 402, configured to acquire the speech to be evaluated and the reference text;

声学处理模块404，用于从所述语音中提取声学特征，将所述声学特征输入声学模型的K个子声学模型，获得综合状态后验概率，所述K个子声学模型通过各自的集内数据和集外数据训练得到，每个子声学模型的集外数据不同；The acoustic processing module 404 is configured to extract acoustic features from the speech, input the acoustic features into the K sub-acoustic models of the acoustic model, and obtain the comprehensive state posterior probability. The out-of-set data training is obtained, and the out-of-set data of each sub-acoustic model is different;

后处理模块406，用于根据所述综合状态后验概率以及通过所述参考文本构建的切分网络，将所述语音的音素与所述参考文本对齐，基于对齐结果获得评分特征；A post-processing module 406, configured to align the phonemes of the speech with the reference text according to the comprehensive state posterior probability and the segmentation network constructed by the reference text, and obtain scoring features based on the alignment result;

评价模块408，用于将所述评分特征输入评分模型，以回归出所述语音的评分。The evaluation module 408 is configured to input the scoring feature into a scoring model to regress the score of the speech.

在一些可能的实现方式中，所述系统400还包括：In some possible implementations, the system 400 further includes:

训练模块，用于将所述声学模型的输出层的状态类别分为K组；分别以状态类别属于第一组的数据至状态类别属于第K组的数据为集外数据，以状态类别为剩余K-1组的数据为集内数据，所述集外数据的标签为对应分组的状态类别的均匀分布，所述集内数据的标签为对应状态类别的向量；通过所述K个声学子模型中每个声学子模型的集内数据和集外数据训练所述声学子模型。The training module is used to divide the state categories of the output layer of the acoustic model into K groups; the data whose state category belongs to the first group to the data whose state category belongs to the Kth group is the out-of-set data, and the state category is the remaining data. The data of the K-1 group is the data in the set, the label of the data outside the set is the uniform distribution of the state category corresponding to the group, and the label of the data in the set is the vector corresponding to the state category; through the K acoustic sub-models The in-set and out-of-set data of each acoustic sub-model in train the acoustic sub-model.

在一些可能的实现方式中，训练所述声学子模型采用的目标函数为多目标函数，所述多目标函数根据第一目标函数和第二目标函数获得，所述第一目标函数为通过所述集内数据训练所述声学子模型时的目标函数，所述第二目标函数为通过所述集外数据训练所述声学子模型时的目标函数。In some possible implementations, the objective function used for training the acoustic sub-model is a multi-objective function, and the multi-objective function is obtained according to a first objective function and a second objective function, and the first objective function is obtained by the The objective function when the acoustic sub-model is trained by the data in the set, and the second objective function is the objective function when the acoustic sub-model is trained by the data outside the set.

在一些可能的实现方式中，所述声学处理模块404具体用于：In some possible implementations, the acoustic processing module 404 is specifically used for:

将所述声学特征输入所述声学模型的所述K个子声学模型，获得每个状态类型的K-1个后验概率；Inputting the acoustic features into the K sub-acoustic models of the acoustic model to obtain K-1 posterior probabilities for each state type;

根据所述每个状态类型的K-1个后验概率，确定所述综合状态后验概率。The comprehensive state posterior probability is determined according to the K-1 posterior probabilities of each state type.

在一些可能的实现方式中，所述后处理模块406具体用于：In some possible implementations, the post-processing module 406 is specifically used for:

根据所述综合状态后验概率以及通过所述参考文本构建的切分网络，通过维特比算法确定目标朗读路径；According to the comprehensive state posterior probability and the segmentation network constructed by the reference text, determine the target reading path through the Viterbi algorithm;

根据所述目标朗读路径将所述语音的音素与所述参考文本对齐，获得所述音素的时间边界信息。The phonemes of the speech are aligned with the reference text according to the target reading path, and time boundary information of the phonemes is obtained.

在一些可能的实现方式中，所述评分特征包括发音良好度、单词增漏读比例、平均音素发音时长中的至少一种；In some possible implementations, the scoring feature includes at least one of good pronunciation, word augmentation and omission reading ratio, and average phoneme pronunciation duration;

所述后处理模块406具体用于：The post-processing module 406 is specifically used for:

根据所述音素的时间边界信息，确定所述发音良好度、所述单词增漏读比例和/或所述平均音素发音时长。According to the time boundary information of the phonemes, the pronunciation good degree, the word increase and omission reading ratio and/or the average phoneme pronunciation duration are determined.

在一些可能的实现方式中，所述数据集中还包括由异常语音生成的样本数据。In some possible implementations, the data set also includes sample data generated by abnormal speech.

根据本申请实施例的语音评价系统400可对应于执行本申请实施例中描述的方法，并且语音评价系统400的各个模块/单元的上述和其它操作和/或功能分别为了实现图2、图3所示实施例中的各个方法的相应流程，为了简洁，在此不再赘述。The speech evaluation system 400 according to the embodiment of the present application may correspond to executing the method described in the embodiment of the present application, and the above-mentioned and other operations and/or functions of the various modules/units of the speech evaluation system 400 are for the purpose of realizing FIG. 2 and FIG. 3 , respectively. For the sake of brevity, the corresponding processes of the respective methods in the illustrated embodiment are not repeated here.

进一步地，本申请实施例还提供了一种计算设备集群，该计算设备集群包括至少一台计算设备，该至少一台计算设备中的任一台计算设备可以是来自云环境或者边缘环境的服务器，也可以是终端设备。该计算设备集群或计算设备具体用于实现如图4所示实施例中的语音评价系统400的功能。Further, an embodiment of the present application also provides a computing device cluster, where the computing device cluster includes at least one computing device, and any computing device in the at least one computing device may be a server from a cloud environment or an edge environment , or a terminal device. The computing device cluster or computing device is specifically used to implement the functions of the speech evaluation system 400 in the embodiment shown in FIG. 4 .

图5提供了一种计算设备集群的结构示意图，如图5所示，计算设备集群50包括多台计算设备500，计算设备500包括总线501、处理器502、通信接口503和存储器504。处理器502、存储器504和通信接口503之间通过总线501通信。FIG. 5 provides a schematic structural diagram of a computing device cluster. As shown in FIG. 5 , the computing device cluster 50 includes multiple computing devices 500 . The processor 502 , the memory 504 and the communication interface 503 communicate through the bus 501 .

总线501可以是外设部件互连标准(peripheral component interconnect，PCI)总线或扩展工业标准结构(extended industry standard architecture，EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示，图5中仅用一条粗线表示，但并不表示仅有一根总线或一种类型的总线。The bus 501 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 5, but it does not mean that there is only one bus or one type of bus.

处理器502可以为中央处理器(central processing unit，CPU)、图形处理器(graphics processing unit，GPU)、微处理器(micro processor，MP)或者数字信号处理器(digital signal processor，DSP)等处理器中的任意一种或多种。The processor 502 may be a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP), etc. any one or more of the devices.

通信接口503用于与外部通信。例如，通信接口503用于获取待评价的语音以及参考文件，返回语音的评分等等。The communication interface 503 is used for external communication. For example, the communication interface 503 is used to obtain the speech to be evaluated and the reference file, return the score of the speech, and so on.

存储器504可以包括易失性存储器(volatile memory)，例如随机存取存储器(random access memory，RAM)。存储器504还可以包括非易失性存储器(non-volatilememory)，例如只读存储器(read-only memory，ROM)，快闪存储器，硬盘驱动器(hard diskdrive，HDD)或固态驱动器(solid state drive，SSD)。Memory 504 may include volatile memory, such as random access memory (RAM). The memory 504 may also include non-volatile memory (non-volatile memory) such as read-only memory (ROM), flash memory, hard disk drive (HDD) or solid state drive (SSD) ).

存储器504中存储有计算机可读指令，处理器502执行该计算机可读指令，以使得计算设备500或计算设备集群50执行前述语音评价方法(或实现前述语音评价系统400的功能)。Computer-readable instructions are stored in the memory 504, and the processor 502 executes the computer-readable instructions to cause the computing device 500 or the computing device cluster 50 to perform the aforementioned voice evaluation method (or implement the aforementioned functions of the voice evaluation system 400).

具体地，在实现图4所示系统的实施例的情况下，且图4中所描述的语音评价系统400的各模块如获取模块402、声学处理模块404、后处理模块406、评价模块408的功能为通过软件实现的情况下，执行图4中各模块的功能所需的软件或程序代码可以存储在计算设备集群50中的至少一个存储器504中。至少一个处理器502执行存储器504中存储的程序代码，以使得计算设备500或计算设备集群50执行前述语音评价方法。Specifically, in the case of implementing the embodiment of the system shown in FIG. 4 , and each module of the speech evaluation system 400 described in FIG. In the case where the functions are implemented by software, the software or program codes required to perform the functions of the modules in FIG. 4 may be stored in at least one memory 504 in the computing device cluster 50 . At least one processor 502 executes program code stored in memory 504 to cause computing device 500 or computing device cluster 50 to perform the aforementioned speech evaluation method.

需要说明的是，计算设备集群包括多个计算设备时，每个计算设备可以独立地执行语音评价方法，也即每个计算设备可以提供完整的语音评价服务。在另一些实施例中，多个计算设备可以协同执行语音评价方法，例如一些计算设备可以执行语音评价方法的若干步骤，另一些计算设备执行语音评价方法的其他步骤，多个计算设备通过协作提供完整的语音评价服务。It should be noted that when the computing device cluster includes multiple computing devices, each computing device can independently execute the voice evaluation method, that is, each computing device can provide a complete voice evaluation service. In other embodiments, a plurality of computing devices may cooperate to perform the speech evaluation method. For example, some computing devices may perform several steps of the speech evaluation method, and other computing devices may perform other steps of the speech evaluation method. Complete voice evaluation service.

进一步地，本申请实施例还提供了一种计算机可读存储介质，所述计算机可读存储介质中存储有指令，当所述指令在计算设备集群50上运行时，使得所述计算设备集群50执行上述语音评价方法的任一种实现方法。Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on the computing device cluster 50, the computing device cluster 50 is made Any implementation method of the above-mentioned speech evaluation method is performed.

进一步地，本申请实施例还提供了一种计算机程序产品，所述计算机程序产品在计算设备集群50上运行时，使得所述计算设备集群50执行上述语音评价方法的任一种实现方法。Further, an embodiment of the present application also provides a computer program product, which, when running on the computing device cluster 50 , enables the computing device cluster 50 to execute any one of the above voice evaluation methods.

通过以上的实施方式的描述可知，本领域的技术人员可以清楚地了解到上述实施例方法中的全部或部分步骤可借助软件加必需的通用硬件平台的方式来实现。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者诸如媒体网关等网络通信设备，等等)执行本申请各个实施例或者实施例的某些部分所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the present application can be embodied in the form of software products in essence or the parts that make contributions to the prior art, and the computer software products can be stored in storage media, such as ROM/RAM, magnetic disks , CD, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the various embodiments or parts of the embodiments of the present application. method.

需要说明的是，本说明书中各个实施例采用递进的方式描述，每个实施例重点说明的都是与其他实施例的不同之处，各个实施例之间相同相似部分互相参见即可。对于实施例公开的装置而言，由于其与实施例公开的方法相对应，所以描述的比较简单，相关之处参见方法部分说明即可。It should be noted that the various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same and similar parts between the various embodiments may be referred to each other. As for the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant part can be referred to the description of the method.

还需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should also be noted that in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply those entities or operations There is no such actual relationship or order between them. Moreover, the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device comprising a list of elements includes not only those elements, but also includes not explicitly listed or other elements inherent to such a process, method, article or apparatus. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in a process, method, article or apparatus that includes the element.

对所公开的实施例的上述说明，使本领域专业技术人员能够实现或使用本申请。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的，本文中所定义的一般原理可以在不脱离本申请的精神或范围的情况下，在其它实施例中实现。因此，本申请将不会被限制于本文所示的这些实施例，而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。The above description of the disclosed embodiments enables any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, this application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A speech evaluation method, comprising:

acquiring a voice to be evaluated and a reference text;

extracting acoustic features from the voice, inputting the acoustic features into K sub-acoustic models of an acoustic model to obtain a comprehensive state posterior probability, wherein the K sub-acoustic models are obtained by training respective in-set data and out-set data, and the out-set data of each sub-acoustic model is different;

aligning the phoneme of the voice with the reference text according to the comprehensive state posterior probability and a segmentation network constructed by the reference text, and obtaining a scoring characteristic based on an alignment result;

inputting the scoring characteristics into a scoring model to regress the score of the voice.

2. The method of claim 1, wherein the K sub-acoustic models are trained by:

dividing the state categories of the output layer of the acoustic model into K groups;

respectively taking data of which the state class belongs to a first group to data of which the state class belongs to a K-th group as off-set data, taking data of which the state class belongs to the remaining K-1 group as in-set data, wherein labels of the off-set data are uniformly distributed in the state classes of the corresponding groups, and labels of the in-set data are vectors of the corresponding state classes;

training the acoustic submodels through the in-set data and the out-set data of each acoustic submodel in the K acoustic submodels.

3. The method of claim 2, wherein an objective function used for training the acoustic submodel is a multi-objective function, the multi-objective function is obtained according to a first objective function and a second objective function, the first objective function is an objective function when the acoustic submodel is trained by the intra-set data, and the second objective function is an objective function when the acoustic submodel is trained by the out-set data.

4. The method according to any one of claims 1 to 3, wherein the inputting the acoustic features into K sub-acoustic models of an acoustic model to obtain a comprehensive state posterior probability comprises:

inputting the acoustic features into the K sub-acoustic models of the acoustic model to obtain K-1 posterior probabilities of each state type;

and determining the comprehensive state posterior probability according to the K-1 posterior probabilities of each state type.

5. The method according to any one of claims 1 to 3, wherein aligning the phonemes of the speech with the reference text according to the integrated state posterior probabilities and a segmentation network constructed by the reference text comprises:

determining a target reading path through a Viterbi algorithm according to the comprehensive state posterior probability and a segmentation network constructed through the reference text;

and aligning the phoneme of the voice with the reference text according to the target reading path to obtain the time boundary information of the phoneme.

6. The method of claim 5, wherein the scoring characteristics comprise at least one of pronunciation goodness, word miss-add ratio, average phoneme pronunciation duration;

the obtaining scoring characteristics based on the alignment result comprises:

and determining the pronunciation goodness, the word increase and omission ratio and/or the average phoneme pronunciation duration according to the time boundary information of the phonemes.

7. The method according to any one of claims 1 to 3, wherein the data set further comprises sample data generated by abnormal speech.

8. A speech assessment system, the system comprising:

the acquisition module is used for acquiring the voice to be evaluated and the reference text;

the acoustic processing module is used for extracting acoustic features from the voice, inputting the acoustic features into K sub-acoustic models of an acoustic model to obtain comprehensive state posterior probability, wherein the K sub-acoustic models are obtained through respective in-set data and out-set data training, and the out-set data of each sub-acoustic model is different;

the post-processing module is used for aligning the phonemes of the voice with the reference text according to the comprehensive state posterior probability and the segmentation network constructed by the reference text, and obtaining scoring characteristics based on an alignment result;

and the evaluation module is used for inputting the scoring characteristics into a scoring model so as to regress the score of the voice.

9. A cluster of computing devices, the cluster of computing devices comprising at least one computing device, the at least one computing device comprising: a processor, a memory, a system bus;

the processor and the memory are connected through the system bus;

the memory to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the cluster of computing devices to perform the method of any of claims 1-7.

10. A computer-readable storage medium having instructions stored thereon, which when executed on a cluster of computing devices, cause the cluster of computing devices to perform the method of any of claims 1-7.