
CN112951276B - Method and device for comprehensively evaluating voice and electronic equipment - Google Patents


Info

Publication number
CN112951276B
Authority
CN
China
Prior art keywords
word
phoneme
target
model
determining
Prior art date
Legal status
Active
Application number
CN202110442432.8A
Other languages
Chinese (zh)
Other versions
CN112951276A (en)
Inventor
王丹
饶丰
庞永强
黄伟
袁佳艺
Current Assignee
Beijing Yiyi Education Technology Co., Ltd.
Original Assignee
Beijing Yiyi Education Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Yiyi Education Technology Co., Ltd.
Priority to CN202110442432.8A
Publication of CN112951276A
Application granted
Publication of CN112951276B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/90: Pitch determination of speech signals
    • Y02P90/30: Computing systems specially adapted for manufacturing (under Y02P: Climate change mitigation technologies in the production or processing of goods)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a method, a device, and an electronic device for comprehensively evaluating voice. The method includes: performing recognition processing on target voice data with a standard text as a reference, and determining the word parameters of each target word voice and the phoneme confidence of each target phoneme voice; determining the word accuracy of the corresponding target word voice, the phoneme accuracy of the corresponding target phoneme voice, and the corresponding time parameters; and determining an evaluation result for the target voice data with the word accuracy, the phoneme accuracy, and the time parameters as evaluation dimensions. The method, device, and electronic device provided by the embodiments of the invention need no additional model for determining speech rate and rhythm, so processing efficiency is high; and because the word accuracy, the phoneme accuracy, and the time parameters are all determined with reference to the word parameters, the evaluation dimensions are associated with one another, the correlation between them improves, and the final evaluation result is more accurate.

Description

Method and device for comprehensively evaluating voice and electronic equipment
Technical Field
The present invention relates to the field of speech evaluation technologies, and in particular to a method, an apparatus, an electronic device, and a computer-readable storage medium for comprehensively evaluating speech.
Background
For students in the K12 stage (kindergarten through twelfth grade, i.e., preschool through high school education), the purpose of homework, exercises, and examinations is detection and diagnosis, that is, finding where a student's weak points lie. Because of the nature of the task, spoken-language exercises must be completed with the aid of a speech scoring system. When a student completes a spoken exercise and does not obtain full marks, teachers, parents, and the student all want to know where the problems occurred; even a student with full marks wants to know his or her overall ability and the level of each individual spoken dimension. It is therefore necessary, both at the teaching level and at the level of cultivating core spoken-language ability, to analyze overall ability as well as spoken ability in each dimension.
At the current stage, the dimensions of mainstream reports on the market are relatively simple: spoken-language ability is basically analyzed in terms of integrity, fluency, pronunciation accuracy, and the like. The dimension granularity is coarse and the resulting data are abstract, so they bring little real value to diagnosis, correction, or teaching assistance. Although some research schemes attempt evaluation with finer-grained dimensions, the recognition methods of the different dimensions differ, the correlation between dimensions is weak, evaluation efficiency is low, the final evaluation effect is poor, and the evaluation result is inaccurate.
Disclosure of Invention
In order to solve the existing technical problems, the embodiment of the invention provides a method, a device, electronic equipment and a computer readable storage medium for comprehensively evaluating voice.
In a first aspect, an embodiment of the present invention provides a method for comprehensively evaluating speech, including:
acquiring target voice data to be evaluated and a standard text corresponding to the target voice data;
performing recognition processing on the target voice data by taking the standard text as a reference, determining word parameters of each target word voice in the target voice data, and determining phoneme confidence of each target phoneme voice in the target word voice; the word parameters comprise word start time, word end time and word confidence;
determining word accuracy of corresponding target word voices according to the word confidence, determining phoneme accuracy of corresponding target phoneme voices according to the phoneme confidence, and determining corresponding time parameters according to word start time and word end time of a plurality of continuous target word voices, wherein the time parameters comprise speech speed and/or rhythm;
and taking the word accuracy, the phoneme accuracy and the time parameter as evaluation dimensions, and determining an evaluation result of the target voice data according to all the evaluation dimensions.
In a second aspect, an embodiment of the present invention further provides a device for comprehensively evaluating voice, including:
an acquisition module, configured to acquire target voice data to be evaluated and a standard text corresponding to the target voice data;
a recognition module, configured to perform recognition processing on the target voice data with the standard text as a reference, determine word parameters of each target word voice in the target voice data, and determine a phoneme confidence of each target phoneme voice in the target word voice, the word parameters comprising word start time, word end time, and word confidence;
an evaluation dimension determining module, configured to determine the word accuracy of the corresponding target word voice according to the word confidence, determine the phoneme accuracy of the corresponding target phoneme voice according to the phoneme confidence, and determine corresponding time parameters according to the word start times and word end times of a plurality of continuous target word voices, the time parameters comprising speech rate and/or rhythm;
and an evaluation module, configured to take the word accuracy, the phoneme accuracy, and the time parameters as evaluation dimensions, and determine an evaluation result of the target voice data according to all the evaluation dimensions.
In a third aspect, an embodiment of the present invention provides an electronic device including a bus, a transceiver, a memory, a processor, and a computer program stored in the memory and executable on the processor, where the transceiver, the memory, and the processor are connected by the bus, and where the computer program, when executed by the processor, implements the steps of any of the foregoing methods for comprehensively evaluating speech.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any of the above methods for comprehensively evaluating speech.
According to the method, the device, the electronic device, and the computer-readable storage medium provided by the embodiments of the invention, recognition of the target voice data covers not only word voices but also phoneme-level voices, so the target voice data can be evaluated at a finer granularity. Moreover, the word parameters determined during recognition can be used to conveniently and accurately determine the speech rate, rhythm, and the like of the target voice data. No additional model is needed to determine speech rate and rhythm, so processing efficiency is high; and because the word accuracy, the phoneme accuracy, and the time parameters are all determined with reference to the word parameters, the evaluation dimensions are associated with one another, the correlation between them improves, and the final evaluation result is more accurate.
Drawings
In order to more clearly describe the embodiments of the present invention or the technical solutions in the background art, the following description will describe the drawings that are required to be used in the embodiments of the present invention or the background art.
FIG. 1 is a flowchart of a method for comprehensively evaluating speech according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a recognition model in the method for comprehensively evaluating speech according to the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a device for comprehensively evaluating voice according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device for performing a method for comprehensively evaluating speech according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
Fig. 1 shows a flowchart of a method for comprehensively evaluating speech according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step 101: and acquiring target voice data to be evaluated and standard text corresponding to the target voice data.
In the embodiment of the invention, the voice data to be evaluated is called the target voice data, and the target voice data is input on the basis of a certain text, which is the standard text. For example, a standard text is shown to the user for the user to read aloud, and the voice uttered by the user is collected by a terminal such as the user's smartphone, thereby collecting the target voice data.
Step 102: performing recognition processing on the target voice data by taking the standard text as a reference, determining word parameters of each target word voice in the target voice data, and determining phoneme confidence coefficient of each target phoneme voice in the target word voice; the word parameters include word start time, word end time, and word confidence.
In the embodiment of the invention, the target voice data is voice input for the standard text; the standard text comprises a plurality of words, and correspondingly a certain piece of the target voice data corresponds to each word. In general, all words in the standard text may be used as target words, or the words remaining after removing English articles such as "a" and "an" may be used as target words. Each word has corresponding phonemes; in this embodiment the phonemes of a target word are referred to as target phonemes, and correspondingly the pieces of voice in the target word voice that correspond to those phonemes are referred to as target phoneme voices. For example, if the standard text is English and one of the target words is "good", it has three target phonemes: /g/, /ʊ/, /d/. Accordingly, the target voice data contains a target word voice corresponding to the position of "good", with a target phoneme voice corresponding to each target phoneme.
In the embodiment of the invention, a speech recognition model can be preset, and the words and phonemes in the target voice data are recognized based on it; by comparison with the standard text, the word confidence of each target word voice and the phoneme confidence of each target phoneme voice can be determined. The word confidence indicates the likelihood that the target word voice is recognized as the corresponding target word in the standard text, and the phoneme confidence indicates the likelihood that the target phoneme voice is recognized as the corresponding target phoneme in the standard text. The word confidence and the phoneme confidence may be probability values between 0 and 1, or values in another range, for example 0 to 10; this embodiment does not limit this.
In the embodiment of the invention, because the target voice data may contain noise and the user may pause between two words, this embodiment extracts from the target voice data the piece of voice corresponding to each target word, namely the target word voice, so that the target word voices and target phoneme voices can be recognized accurately; the target word voice contains no noise, pauses, or other interfering voice. Each target word voice therefore has two word parameters, the word start time and the word end time, in addition to its word confidence. Accordingly, each target phoneme voice also corresponds to a start time and an end time, i.e., a phoneme start time and a phoneme end time. Recognizing voice data generally requires determining the word parameters of each word voice anyway, so the voice between the word start time and the word end time can be extracted as the corresponding target word voice, and subsequent recognition only needs to focus on the target word voice, which reduces the influence of noise and other interfering voice. This embodiment can perform voice recognition in an existing manner and conveniently extract the word parameters of each target word voice.
Step 103: determining word accuracy of the corresponding target word voices according to the word confidence, determining phoneme accuracy of the corresponding target phoneme voices according to the phoneme confidence, and determining corresponding time parameters according to word start time and word end time of a plurality of continuous target word voices, wherein the time parameters comprise speech speed and/or rhythm.
In the embodiment of the present invention, as described above, the word confidence may be a probability value, or it may be a probability distribution. To better evaluate the target word voice, this embodiment determines a word accuracy based on the word confidence, where the word accuracy indicates the degree to which the user accurately produced the sound of the target word; the greater the word confidence, the greater the word accuracy, the two being positively correlated. Likewise, the phoneme accuracy of a target phoneme voice can be determined from the phoneme confidence, and these two are also positively correlated. Furthermore, this embodiment calculates the time-related parameters, namely speech rate and rhythm, i.e., the time parameters, based on the word parameters determined during recognition.
In the embodiment of the invention, the speech rate represents how quickly word units are produced over a period of time, where a word unit may be a word, a phoneme, and so on. Rhythm refers to how the production of word units varies along the time dimension. The target voice data includes a plurality of target word voices; taking several consecutive target word voices as a group, the rate at which the user produces target words or target phonemes over a period of time can be determined from the word start time and word end time of each target word voice, which gives the speech rate. Likewise, the time interval between adjacent target words or target phonemes can be determined, and the variation of those intervals along the time dimension gives the rhythm.
Step 104: and taking the word accuracy, the phoneme accuracy and the time parameter as evaluation dimensions, and determining an evaluation result of the target voice data according to all the evaluation dimensions.
In the embodiment of the invention, a plurality of evaluation dimensions are selected, and the target voice data is evaluated from each of them. The evaluation dimensions at least include the word accuracy, the phoneme accuracy, and the time parameters, so the target voice data can be evaluated on these dimensions to obtain a corresponding evaluation result. The evaluation result can be an overall evaluation of the target voice data, or a comprehensive evaluation that includes the evaluation of words and phonemes as well as of the user's speech rate, rhythm, and so on, making it convenient for the user to locate his or her weak links and improve them through subsequent practice.
Optionally, the evaluation result of the embodiment of the invention can evaluate the target voice data from each dimension, so the user can conveniently locate existing problems along different dimensions. Specifically, determining the evaluation result of the target voice data may include: determining the word accuracies of the plurality of target word voices, and presenting to the user, as error-prone words, the target words whose word accuracy is below a preset threshold; similarly, determining the phoneme accuracies of the plurality of target phoneme voices, and presenting to the user, as error-prone phonemes, the target phonemes whose phoneme accuracy is below a preset threshold.
In addition, the speech rate or rhythm can be determined per sentence or per paragraph, and each sentence or paragraph can be evaluated according to whether its speech rate or rhythm is uniform, so that the user can quickly locate sentences or paragraphs that are too fast, too slow, or rhythmically uneven. The evaluation result of this embodiment may include detailed content that tells the user where a problem occurred, for example where getting stuck disrupted the rhythm, so the user can improve in a targeted manner.
According to the method for comprehensively evaluating voice provided by the embodiment of the invention, recognition of the target voice data covers not only word voices but also phoneme-level voices, so the target voice data can be evaluated at a finer granularity. The word parameters determined during recognition can be used to conveniently and accurately determine the speech rate, rhythm, and the like of the target voice data. No additional model is needed to determine speech rate and rhythm, so processing efficiency is high; and because the word accuracy, the phoneme accuracy, and the time parameters are all determined with reference to the word parameters, the evaluation dimensions are associated with one another, the correlation between them improves, and the final evaluation result is more accurate.
Based on the above embodiments, the embodiment of the present invention adopts a specific recognition model to recognize a phoneme voice in target voice data. Specifically, the step 102 of determining the phoneme confidence of each target phoneme in the target word speech includes:
step A1: setting an identification model, wherein the identification model comprises a coding sub-model, an alignment output sub-model and an identification output sub-model; the coding sub-model is used for coding the input data into feature vectors, the alignment output sub-model is used for determining corresponding phoneme alignment information according to the feature vectors, and the recognition output sub-model is used for determining a recognition result of each phoneme in the input data according to the feature vectors and the phoneme alignment information.
In the embodiment of the present invention, the recognition model mainly includes three parts, namely the coding sub-model, the alignment output sub-model, and the recognition output sub-model, as shown in FIG. 2. The coding sub-model encodes input data into feature vectors: the input data can be voice data, and corresponding feature vectors are generated by extracting features from it; a feature vector may be a one-dimensional vector, a two-dimensional matrix, or the like, which this embodiment does not limit. After the coding sub-model generates the feature vector, the alignment output sub-model and the recognition output sub-model share it and each perform their own processing. The alignment output sub-model determines the corresponding phoneme alignment information from the feature vector, where the phoneme alignment information represents the position of each phoneme in the input data, for example the start time frame and end time frame of each phoneme, i.e., the phoneme's start and end times. The recognition output sub-model determines a recognition result for each phoneme in the input data from the feature vector and the phoneme alignment information; the recognition result can be the probability that a phoneme in the input data is recognized as each possible phoneme. For example, if the input data is English voice data, the recognition result may represent the probability that each phoneme in the input data is recognized as any one of the 50 English phonemes. Alternatively, the recognition result may simply state which phoneme a phoneme in the input data is recognized as, for example that the first phoneme in the input data is recognized as the phoneme /iː/.
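As an illustration only, the shared-encoder structure just described can be sketched as follows. This is a minimal sketch assuming PyTorch; the patent does not name a framework, and all layer types and sizes here are hypothetical.

import torch
import torch.nn as nn

class SharedEncoderRecognizer(nn.Module):
    # Encoder shared by an alignment head and a recognition head.
    # Layer sizes are illustrative, not taken from the patent.
    def __init__(self, feat_dim=40, hidden=512, num_phonemes=50):
        super().__init__()
        # Coding sub-model: encodes per-frame acoustic features.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Alignment output sub-model: per-frame phoneme posterior,
        # used to derive phoneme start/end frames.
        self.align_head = nn.Linear(hidden, num_phonemes)
        # Recognition output sub-model: per-frame phoneme posterior,
        # used for confidence scoring.
        self.rec_head = nn.Linear(hidden, num_phonemes)

    def forward(self, frames):            # frames: (T, feat_dim)
        shared = self.encoder(frames)     # Y_share, reused by both heads
        return self.align_head(shared), self.rec_head(shared)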
Step A2: and training the coding sub-model and the alignment output sub-model, and then training the recognition output sub-model under the condition of keeping the coding sub-model unchanged to determine a trained recognition model.
Although a conventional GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) or DNN-HMM (Deep Neural Network - Hidden Markov Model) can realize speech recognition, such conventional models cannot distinguish well between high-quality and near-high-quality phonemes; their discrimination ability is poor. In the embodiment of the invention, the recognition model is trained in a two-pass decoding manner, which achieves accurate phoneme alignment and improves phoneme discrimination. Specifically, the first training pass trains the coding sub-model and the alignment output sub-model; together they act as an alignment acoustic model that learns the characteristics of massive data through deep learning. This alignment acoustic model has strong pronunciation tolerance and can learn many pronunciation possibilities, such as Chinese-accented English pronunciations, so its phoneme alignment works well. High-quality phoneme pronunciation data, by contrast, are relatively scarce, and a model with strong phoneme recognition is difficult to train on them directly (overfitting easily occurs during training).
In the embodiment of the invention, the coding sub-model and the recognition output sub-model are used as a phoneme recognition model in the second training pass. Because the coding sub-model was already trained in the first pass, that is, its weights and other parameters are determined, it can be kept unchanged during the second pass: its weights are frozen, and the alignment output sub-model and the recognition output sub-model share them. In this way the second pass adds only a small amount of computation (one reason is that the second pass needs only a small amount of high-quality training data; another is that only the newly added recognition output sub-model needs training while the original coding sub-model stays unchanged), which effectively overcomes the shortage of high-quality data during training and avoids the increase in engine computational complexity that adding the recognition output sub-model would otherwise cause.
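Continuing the hypothetical sketch above (and reusing its SharedEncoderRecognizer class), the two-pass schedule can be expressed by freezing the encoder weights before the second pass. The loss function and optimizer choices are assumptions, not taken from the patent.

import torch

model = SharedEncoderRecognizer()
ce = torch.nn.CrossEntropyLoss()   # per-frame label loss for both passes

# Pass 1: train coding sub-model + alignment output sub-model on the
# large first data set (per-frame labels from forced alignment).
opt1 = torch.optim.Adam(list(model.encoder.parameters()) +
                        list(model.align_head.parameters()))
# ... training loop over the first data set ...

# Pass 2: freeze the encoder so its shared weights stay unchanged, then
# train only the recognition output sub-model on the smaller, correctly
# pronounced second data set.
for p in model.encoder.parameters():
    p.requires_grad = False
opt2 = torch.optim.Adam(model.rec_head.parameters())
# ... training loop over the second data set ...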
Step A3: inputting target voice data into a recognition model, determining feature vectors of the target voice data and phoneme alignment information of each target phoneme voice, and determining phoneme confidence of each target phoneme voice in the target voice data based on the recognition output submodel; wherein the phoneme alignment information includes a phoneme start time and a phoneme end time.
In the embodiment of the invention, after the training of the evaluation model is finished, the voice data (namely, the target voice data) provided by the user can be evaluated based on the evaluation model. In this embodiment, after the target voice data is input to the trained evaluation model, phoneme alignment information of the target voice data, that is, a position corresponding to each phoneme voice in the target voice data, may be determined based on the alignment output submodel; typically, the voice data is divided into frames according to time, for example, one frame of 25ms, and the interval between two adjacent frames is 10ms; and, each phoneme voice will generally correspond to multiple frames of voice, and the phoneme alignment information may represent data of which frames each phoneme voice corresponds to in the target voice data. Further, the recognition output submodel may determine a phoneme confidence of each phoneme voice in the target voice data, for example, a probability that the phoneme voice in the target voice data is recognized as corresponding to each phoneme. It should be noted that, the recognition output sub-model needs to determine which part of the target speech data corresponds to one phoneme based on the phoneme alignment information determined by the alignment output sub-model, so as to determine the phoneme confidence of each phoneme speech.
Optionally, the training the coding sub-model and the alignment output sub-model in the step A2 "above, and then, training the identification output sub-model while keeping the coding sub-model unchanged" includes:
step A21: and acquiring a first data set, performing phoneme alignment on the first voice data in the first data set, and determining the label of each frame of data in the first voice data.
Step A22: the first data set is used as a training set, the first voice data is used as the input of the coding sub-model, the label of each frame of data in the first voice data is used as the output of the alignment output sub-model, and the coding sub-model and the alignment output sub-model are trained.
In the embodiment of the invention, a coding sub-model and an alignment output sub-model in an evaluation model are trained for the first time based on a first data set; the first data set may be a conventional data set, which includes a large amount of voice data, i.e., a large amount of first voice data; for example, the first data set may be a data set used for speech recognition of children in spoken language evaluation. The features of the speech data may be learned by a first training pass based on a large amount of the first speech data.
Alternatively, the first voice data may be phoneme-aligned with a classical DNN-HMM model. In this embodiment, the first voice data is taken as the base signal, and its MFCC (Mel-Frequency Cepstral Coefficients) features are extracted as the input features of the DNN-HMM model; a tri-phone is then used as the minimum unit of GMM modeling, and the output of the GMM-HMM model is used as the label of each frame of the voice signal, which solves the problem that continuous voice data cannot be labeled manually. The DNN-HMM model can thus perform phoneme alignment of the first voice data. Compared with the traditional GMM-HMM model, the DNN model is superior in word error rate and system robustness. After the label of each frame of the first voice data is determined, the first data set can be used as the training set for training.
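For concreteness, MFCC extraction might look like the sketch below, using librosa as one possible toolkit (the patent names only the MFCC features themselves) and the 25 ms frame / 10 ms shift convention mentioned later in the text.

import librosa

def mfcc_features(wav_path, sr=16000, n_mfcc=13):
    # Per-frame MFCC features: 25 ms window, 10 ms hop (assumed values).
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),        # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms frame shift
    ).T                               # shape: (num_frames, n_mfcc)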
In addition, conventional approaches generally adopt recurrent networks, represented by the RNN (Recurrent Neural Network), to obtain better speech recognition performance. In the embodiment of the present invention, however, the alignment output sub-model is not meant to maximize recognition performance but to obtain a more accurate phoneme distribution for each frame of voice, so the coding sub-model and the alignment output sub-model in this embodiment may specifically use a TDNN (Time-Delay Neural Network) model.
Step A23: acquiring a second data set, aligning phonemes of second voice data of the second data set, and determining a text corresponding to the second voice data; the second voice data is data with correct pronunciation, and the number of the second voice data is smaller than that of the first voice data.
Step A24: and taking the second data set as a training set, taking the second voice data as the input of the coding submodel and taking the text corresponding to the second voice data as the output of the recognition output submodel under the condition of keeping the coding submodel unchanged, and training the recognition output submodel.
In the embodiment of the invention, after the first training pass, a trained alignment output sub-model is obtained. However, the coding sub-model and the alignment output sub-model are essentially an acoustic model for speech recognition: they can perform phoneme alignment and compute phoneme likelihood probabilities, but at this point the model discriminates poorly between near-high-quality and high-quality phoneme data, i.e., high-quality and near-high-quality phonemes are hard to tell apart. Therefore the second training pass is performed based on the second data set, and it mainly trains the recognition output sub-model, so that similar phoneme data can be distinguished more accurately. The second data set contains fewer samples, but the second voice data are correctly pronounced (for example, corpus screened by experts) and of higher quality than the first voice data. During the second pass the coding sub-model is kept unchanged and training proceeds on the basis of the original phoneme alignment information, so the phoneme-classification weights learned in the first pass are retained and the recognition output sub-model can be trained for the more precise phoneme recognition task.
Wherein the second voice data corresponds to a text, which represents the meaning (label) of each phoneme in the second voice data. The phoneme alignment information of the second voice data can be determined by the alignment output sub-model, and the phoneme label corresponding to each alignment interval in the second voice data can then be determined from the text. For example, if the second voice data is a correctly pronounced "good", it corresponds to the text "good", from which it can be determined that it has three phonemes: /g/, /ʊ/, /d/. Three alignment intervals of the second voice data can be determined by the alignment output sub-model, and these three intervals correspond in order to the three phonemes /g/, /ʊ/, /d/.
Optionally, the step A3 "determining the phoneme confidence of each target phoneme speech in the target speech data based on the recognition output submodel" includes:
step A31: inputting the feature vector of the target voice data into the recognition output submodel for forward calculation, and determining an output matrix Y of the recognition output submodel rec Output matrix Y rec Is a matrix of the number m of frames x the total number of phonemes n.
In the embodiment of the invention, the coding sub-model performs coding processing on input data such as target voice data, and can generate corresponding feature vectors, for example, a generation matrix Y share The matrix Y share I.e. feature vectors common to the recognition output sub-model and the alignment output sub-model. Then the feature vector is input into the recognition output submodel to obtain the output result of the recognition output submodel, namely matrix Y rec . In this embodiment, the matrix Y rec Matrix Y is a matrix of frame number m×phoneme total dimension n (either m rows and n columns or n rows and m columns) rec Elements of (a)Representing t i Element corresponding to frame, phoneme j, which element +.>Can be expressed as t i The likelihood that the frame is identified as phoneme k. The number of frames m is the number of frames included in the target voice data, and the total number of phonemes n is the total number of phonemes, for example, the english language includes 50 phonemes, so when evaluating the spoken english language, n=50.
Step A32: determining each phoneme voice in the target voice data based on the phoneme alignment information of the target voice data, and determining, for any phoneme j, the likelihood probability p(i, j) over the alignment interval of phoneme voice i:
p(i, j) = ( 1 / (t_{i,end} - t_{i,start} + 1) ) * Σ_{t_i = t_{i,start}}^{t_{i,end}} y(t_i, j)
where t_{i,start} denotes the start frame of the alignment interval of phoneme voice i, t_{i,end} denotes the end frame of that alignment interval, y(t_i, j) denotes the element of the output matrix Y_rec corresponding to frame t_i and phoneme j, and j ∈ [1, n].
In the embodiment of the invention, phoneme voice i is the i-th phoneme voice in the target voice data. Each phoneme voice corresponds to an alignment interval containing multiple frames of voice; for the i-th phoneme voice, t_{i,start} and t_{i,end} denote the start frame and end frame of its alignment interval, i.e., the phoneme start time and phoneme end time. The values y(t_i, j) of all frames t_i between the start frame and the end frame are averaged to give the likelihood probability of the corresponding phoneme voice. This yields the likelihood probability p(i, j) that phoneme voice i is recognized as each phoneme j, and this likelihood probability can serve as the phoneme confidence. Correspondingly, j ranges over [1, n].
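The computation of p(i, j) can be written directly against the output matrix. A minimal numpy sketch, with the per-interval averaging following the description above (the data layout is an assumption):

import numpy as np

def phoneme_confidences(Y_rec, intervals):
    # Y_rec: (m, n) matrix of per-frame phoneme likelihoods from the
    # recognition output sub-model.
    # intervals: list of (t_start, t_end) frame indices, one per phoneme
    # voice, taken from the alignment output sub-model.
    # Returns p with p[i, j] = mean likelihood of phoneme j over interval i.
    return np.stack([Y_rec[t_start:t_end + 1].mean(axis=0)
                     for t_start, t_end in intervals])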
On the basis of the above-described embodiments, word accuracy and phoneme accuracy are determined statistically. The step 103 of determining the word accuracy of the corresponding target word speech according to the word confidence level and determining the phoneme accuracy of the corresponding target phoneme speech according to the phoneme confidence level includes:
step B1: word tags consistent with corresponding words in the standard text are added for each target word voice in the target voice data, and phoneme tags consistent with corresponding phonemes in the standard text are added for the target phoneme voices.
In the embodiment of the invention, the target voice data includes a plurality of word voices, and different word voices may correspond to the same word; corresponding word tags are therefore added to the word voices, and whether two word voices correspond to the same word is judged by whether their word tags are identical. Specifically, word tags are added to the target word voices based on the standard text. For example, if the target voice data includes three target word voices A, B, C and the corresponding standard text is "have some bread", then the word tag of target word voice A is "have", that of B is "some", and that of C is "bread".
Similarly, different phoneme voices may correspond to the same phoneme, and this embodiment likewise adds a corresponding phoneme tag to each target phoneme voice based on the standard text. For example, the target word voice A, "have", includes three target phoneme voices a1, a2, a3, whose phonemes are known from the standard text to be /h/, /æ/, /v/; the phoneme tags of a1, a2, a3 are therefore, in order: /h/, /æ/, /v/.
Step B2: taking the word label as a unit, taking the average value of word confidence degrees of a plurality of target word voices with the same word label as the word accuracy of the word label.
Step B3: taking the phoneme label as a unit, taking an average value of the phoneme confidence of a plurality of target phoneme voices with the same phoneme label as the phoneme accuracy of the phoneme label.
In the embodiment of the invention, for each word tag, all target word voices corresponding to that tag in the target voice data are determined, and the average of their word confidences is used as the word accuracy of the tag. For example, for the word tag "have", the target voice data may contain the word "have" uttered by the user alone as well as inside some sentences or phrases; the word confidences of all word voices corresponding to "have" are determined, and their average is used as the word accuracy of the tag "have". This word accuracy represents the degree, or likelihood, to which the user can accurately pronounce "have". Similarly, the phoneme accuracy of a phoneme tag is determined from the confidences of the target phoneme voices corresponding to that same tag, which this embodiment does not describe again.
In the embodiment of the present invention, the target voice data may be a single piece of voice input by the user, or several pieces input at different times; this embodiment does not limit this. Determining the word accuracy and the phoneme accuracy statistically evaluates the user's ability as a whole and is more reliable.
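A small sketch of this per-tag averaging (the data layout is an assumption):

from collections import defaultdict

def accuracy_by_label(items):
    # items: iterable of (tag, confidence) pairs, one per recognized
    # target word (or phoneme) voice. Returns tag -> mean confidence,
    # i.e. the word/phoneme accuracy of that tag.
    sums, counts = defaultdict(float), defaultdict(int)
    for tag, conf in items:
        sums[tag] += conf
        counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

# e.g. accuracy_by_label([("have", 0.9), ("have", 0.7)]) -> {"have": 0.8}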
Optionally, the step 103 "determining the corresponding time parameter according to the word start time and the word end time of the continuous plurality of target word voices" includes:
step C1: in the case where the time parameter includes the speech rate, the time between the word start time and the word end time of the same target word speech is taken as the effective time, the number of target phoneme speech in a time period composed of a plurality of continuous effective times is determined, and the speech rate is determined according to the number of target phoneme speech in the time period.
In the embodiment of the invention, if the time parameter includes the speech rate, that is, if the speech rate is to be used as an evaluation dimension, the speech rate must be determined. Conventional methods generally take the number of word units (such as words) uttered in a period of time as the speech rate; but a user may pause abnormally while recording voice data, and in languages such as English different words have different pronunciation durations, so the speech rate determined conventionally is inaccurate. In the embodiment of the invention, the phoneme is taken as the minimum unit, and the speech rate is determined from the number of phonemes.
Specifically, the embodiment of the invention treats the time between two words as invalid time; only the time occupied by a word voice, namely the time between the word start time and word end time of the same target word voice, counts as effective time, and the speech rate is calculated from the number of target phoneme voices within the effective time. For example, suppose the target voice data is "have some bread": the start time of the word "have" (the word start time) is 0.69 seconds and its end time (the word end time) is 1.11 seconds, so the effective time of that word is 0.42 seconds and the number of target phoneme voices in it is 3. The word "some" has a word start time of 1.34 seconds, an end time of 1.81 seconds, and thus an effective time of 0.47 seconds. The effective time of the two words is 0.42 + 0.47 = 0.89 seconds, and the invalid time between them is 1.34 - 1.11 = 0.23 seconds. When calculating the speech rate, only the effective time is accumulated; for example, one minute of effective time is selected, and the number of target phoneme voices within that minute is taken as the speech rate.
The method for determining the speech rate in the embodiment of the invention excludes the invalid portions between words and instead determines the speech rate from the pronunciation speed of the phonemes within words, which more accurately reflects the student's mastery of the content. Moreover, by comparing a student's speech rates across different exercises, one can find the exercises or words on which the speech rate is too high or too low and correct the problem in a targeted manner.
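A sketch of the effective-time speech rate, using a hypothetical word-record layout; the example values reproduce the "have"/"some" numbers above.

def speech_rate(words):
    # words: list of dicts with word start/end times (seconds) and the
    # number of target phoneme voices in the word. Returns phonemes per
    # second of effective (within-word) time, ignoring inter-word gaps.
    effective = sum(w["end"] - w["start"] for w in words)
    phonemes = sum(w["num_phonemes"] for w in words)
    return phonemes / effective if effective > 0 else 0.0

words = [
    {"start": 0.69, "end": 1.11, "num_phonemes": 3},  # "have"
    {"start": 1.34, "end": 1.81, "num_phonemes": 3},  # "some"
]
# speech_rate(words) -> 6 / 0.89, about 6.7 phonemes per effective second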
Step C2: in the case that the time parameter comprises the rhythm, taking, for two adjacent target word voices, the time between the word start time of the later target word voice and the word end time of the earlier target word voice as a pause time; determining the degree of dispersion of the plurality of pause times, and determining the rhythm from the degree of dispersion, where the degree of dispersion comprises variance and/or standard deviation.
In the embodiment of the invention, if the rhythm is to be used as an evaluation dimension, it must be determined; the embodiment again determines it from the word start times and word end times. Unlike the speech rate above, the rhythm is calculated from the "invalid time" that was useless for the speech rate: the time between the word start time of the next target word voice and the word end time of the previous target word voice is taken as a pause time. This is exactly the "invalid time" above, which is useless for the speech rate but useful for calculating the rhythm. The variance or standard deviation of the pause times can be used as the rhythm; the larger it is, the more uneven the rhythm. For example, the sentence "have some bread too" in the target voice data has three pauses, i.e., three pause times, and how balanced these three pause times are indicates how balanced the student's rhythm is. If the three pause times are "0.01, 0.5, 0.02", the group clearly has a relatively long pause in the second position, i.e., a long pause between "some" and "bread", so the sentence is judged "non-uniform in rhythm". The rhythm may be determined per sentence of the target voice data, or per text; this embodiment does not limit this.
Further optionally, a pause threshold may be set: if the degree of dispersion is greater than the pause threshold, the rhythm is non-uniform; if it is smaller, the rhythm is uniform. A batch of data can be labeled in advance by experts as uniform or non-uniform; the method of this embodiment determines the degree of dispersion of each sentence in the batch, tries different values as the pause threshold, and records the recognition result each value produces, i.e., the rhythm (uniform or non-uniform) that would be assigned to each sentence under that threshold. The value whose recognition results have the highest accuracy is taken as the final pause threshold; for example, the pause threshold may be 0.19.
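A sketch of the rhythm check, with the 0.19 pause threshold from the text; the word-record layout is the same assumption as in the speech-rate sketch.

import statistics

def pause_times(words):
    # Gap between consecutive words: next word's start time minus the
    # previous word's end time (seconds).
    return [nxt["start"] - prev["end"] for prev, nxt in zip(words, words[1:])]

def is_uniform_rhythm(words, pause_threshold=0.19):
    # Rhythm counts as uniform when the spread of the pauses stays
    # below the threshold tuned as described above.
    pauses = pause_times(words)
    if not pauses:
        return True  # a single word has no pauses to judge
    return statistics.pstdev(pauses) <= pause_threshold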
Optionally, the embodiment of the invention also takes whether the pause is abnormal as an evaluation dimension. Specifically, before "determining the evaluation result of the target voice data according to all the evaluation dimensions" in step 104, the method further includes:
step C3: taking the pause time greater than a preset threshold value as abnormal pause time under the condition that the time parameter comprises the rhythm, and determining the abnormal pause parameter of the abnormal pause time; the abnormal pause parameter comprises one or more of pause position, pause duration and number of all abnormal pause times of the abnormal pause time; the abnormal pause parameter is taken as an evaluation dimension.
In the embodiment of the present invention, after the pause times are determined in step C2, it is further checked whether each pause time is too long, i.e., greater than a preset threshold (e.g., 1 second). An overlong pause indicates that the student got stuck while reading or speaking; based on the pause position and duration of the abnormal pause time, the student's problem can be pointed out specifically, which makes correction easier. In addition, the number of all abnormal pause times can serve as an abnormal pause parameter; gathered statistically, it can evaluate the pausing behavior of a student or even a whole class and locate where the student or the class is most likely to have problems.
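Reusing pause_times from the rhythm sketch above, abnormal pauses can be flagged as follows; the 1-second threshold is the example value from the text.

def abnormal_pauses(words, max_pause=1.0):
    # Flag pauses longer than the preset threshold. Returns the
    # (position index, duration) of each abnormal pause plus the total
    # count, i.e. the abnormal pause parameters described above.
    flagged = [(i, gap) for i, gap in enumerate(pause_times(words))
               if gap > max_pause]
    return flagged, len(flagged)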
Based on the above embodiments, the method may also use accents and/or intonation as evaluation dimensions. Specifically, before "determining the evaluation result of the target voice data according to all the evaluation dimensions" in step 104, the method further includes:
step D1: accent evaluation process and/or intonation evaluation process.
The "accent evaluation process" in step D1 includes:
step D11: and determining accent phonemes which are accents according to the standard text, and determining accent confidence that the accent phonemes are correctly recognized according to a preset accent recognition model.
Step D12: and determining accent accuracy according to accent confidence degrees of the accent phonemes, and taking the accent accuracy as an evaluation dimension.
In the embodiment of the invention, the standard text determines which phonemes are accented, i.e., the accent phonemes, so it can be determined which phoneme voices in the target voice data should be accented. The confidence that an accent phoneme voice is correctly recognized as accented, i.e., the accent confidence, is then determined by a pre-trained accent recognition model. For example, the accent recognition model may output one of three values: 0 meaning the accent is wrong, 1 meaning the accent is correct, and 0.5 meaning the model cannot decide. After the accent confidences of all accent phonemes are determined, they can be aggregated into the accent accuracy; for example, the proportion of accent phonemes whose result is 1 among all accent phonemes in all words of the target voice data, or alternatively the proportion whose result is 0.
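A sketch of the first statistic suggested above (the share of accent phonemes scored 1); the {0, 0.5, 1} encoding follows the example in the text, and a non-empty input is assumed.

def accent_accuracy(accent_results):
    # accent_results: per accent phoneme, the accent recognition model's
    # output in {0, 0.5, 1} (0 = wrong, 1 = correct, 0.5 = undecided).
    return sum(1 for r in accent_results if r == 1) / len(accent_results)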
In addition, the "intonation evaluation process" in the above step D1 includes:
step D13: the last target word voice of each sentence in the target voice data is used as an effective target word voice, the pitch information of the effective target word voice is segmented, and the slope of the pitch information of each segment is determined.
Step D14: if the difference between the number of segments with positive slope and the number of segments with negative slope is larger than the preset difference, determining that the intonation of the effective target word voice is an ascending tone, otherwise, determining that the intonation of the effective target word voice is a descending tone.
In the embodiment of the invention, whether the intonation is correct is determined from the pronunciation of the last word of each sentence, where intonation comprises rising tone and falling tone. As shown in the steps above, the voice of the last word of a sentence is taken as the effective target word voice, and the intonation is judged from the pitch information of that voice. Specifically, the pitch information indicates the tone (or volume) of the corresponding word; this embodiment divides the pitch information into several segments, calculates the slope of each segment, and determines rising or falling tone from the numbers of segments with positive and negative slopes. For example, the preset difference may be set to 3: if the number of positive-slope segments exceeds the number of negative-slope segments by more than 3, the effective target word voice is considered a rising tone.
Step D15: and determining whether the intonation of the effective target word voice is correct according to the standard text, determining the intonation accuracy according to the plurality of effective target word voices, and taking the intonation accuracy as an evaluation dimension.
According to the embodiment of the invention, the intonation that the last word of each sentence, i.e., each effective target word voice, should have can be determined from the standard text. Therefore, based on the standard text, it can be judged whether the intonation determined in step D14 is correct, the intonation accuracy of the target voice data can be determined, and the intonation accuracy can be used as an evaluation dimension for evaluating the target voice data.
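A sketch of the rising-tone decision: segment the pitch contour, fit a slope per segment, and compare the counts of positive and negative slopes against the preset difference of 3 from the text. The number of segments and the least-squares slope fit are assumptions.

import numpy as np

def is_rising_tone(pitch, num_segments=10, min_diff=3):
    # pitch: per-frame pitch values of the last word of a sentence.
    pos = neg = 0
    for seg in np.array_split(np.asarray(pitch, dtype=float), num_segments):
        if len(seg) < 2:
            continue
        slope = np.polyfit(np.arange(len(seg)), seg, 1)[0]  # linear fit
        if slope > 0:
            pos += 1
        elif slope < 0:
            neg += 1
    # Rising tone when positive-slope segments outnumber negative-slope
    # ones by more than the preset difference; falling tone otherwise.
    return pos - neg > min_diff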
Further optionally, since there are several evaluation dimensions that differ from one another, all evaluation dimensions may be normalized so that a total evaluation result for the target voice data can be generated from all of them. For example, each evaluation dimension is graded into several levels, such as the four levels excellent, good, medium, and bad, and the total evaluation result, the comprehensive ability value, is determined from the level of each dimension. For example, with excellent = 95, good = 82.5, medium = 67.5, and bad = 30, if there are 8 evaluation dimensions in total, graded as 2 excellent, 3 good, and 3 medium, then the comprehensive ability value = (95×2 + 82.5×3 + 67.5×3) / 8 = 80.
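The grade normalization above reduces to an average of per-dimension grade values; a short check of the worked example (the grade-to-value mapping is taken from the text):

GRADE_VALUES = {"excellent": 95, "good": 82.5, "medium": 67.5, "bad": 30}

def overall_ability(grades):
    # grades: one grade string per evaluation dimension.
    return sum(GRADE_VALUES[g] for g in grades) / len(grades)

# overall_ability(["excellent"]*2 + ["good"]*3 + ["medium"]*3) == 80.0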
It should be noted that the evaluation result in the embodiment of the invention is an evaluation of "ability": the comprehensive ability value above measures the user's ability rather than a "score" in the conventional sense. A conventional score only evaluates the performance of one recording or examination, which differs slightly from the student's ability. The embodiment of the invention takes words or phonemes as the minimum granularity and can evaluate the user's ability more accurately along detailed dimensions; for example, the word accuracy indicates how well the user can pronounce the corresponding word. Evaluating ability lets users understand their own ability more accurately and improve it in a targeted way.
The method for comprehensively evaluating the voice provided by the embodiment of the invention is described in detail above, the method can also be realized by a corresponding device, and the device for comprehensively evaluating the voice provided by the embodiment of the invention is described in detail below.
Fig. 3 is a schematic structural diagram of a device for comprehensively evaluating voice according to an embodiment of the present invention. As shown in fig. 3, the apparatus for comprehensively evaluating voices includes:
an obtaining module 31, configured to obtain target voice data to be evaluated and a standard text corresponding to the target voice data;
the recognition module 32 is configured to perform recognition processing on the target voice data based on the standard text, determine a word parameter of each target word voice in the target voice data, and determine a phoneme confidence level of each target phoneme voice in the target word voice; the word parameters comprise word start time, word end time and word confidence;
an evaluation dimension determining module 33, configured to determine a word accuracy of a corresponding target word voice according to the word confidence, determine a phoneme accuracy of a corresponding target phoneme voice according to the phoneme confidence, and determine corresponding time parameters according to word start time and word end time of a plurality of continuous target word voices, where the time parameters include a speech speed and/or a rhythm;
and an evaluation module 34, configured to take the word accuracy, the phoneme accuracy and the time parameter as evaluation dimensions, and determine an evaluation result of the target voice data according to all the evaluation dimensions.
On the basis of the above embodiment, the recognition module 32 determines a phoneme confidence of each target phoneme voice in the target word voices, including:
setting a recognition model, wherein the recognition model comprises a coding sub-model, an alignment output sub-model and a recognition output sub-model; the coding sub-model is used for coding input data into feature vectors, the alignment output sub-model is used for determining corresponding phoneme alignment information according to the feature vectors, and the recognition output sub-model is used for determining recognition results of each phoneme in the input data according to the feature vectors and the phoneme alignment information;
training the coding sub-model and the alignment output sub-model, and then training the recognition output sub-model under the condition of keeping the coding sub-model unchanged to determine a trained recognition model;
inputting the target voice data into the recognition model, determining feature vectors of the target voice data and phoneme alignment information of each target phoneme voice, and determining the phoneme confidence of each target phoneme voice in the target voice data based on the recognition output submodel; wherein the phoneme alignment information includes a phoneme start time and a phoneme end time.
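By way of illustration only, the three-sub-model structure described above might be sketched as follows in PyTorch; the layer types, dimensions and phoneme inventory size are assumptions, since the text specifies only the roles of the sub-models, not their architecture:

```python
import torch
import torch.nn as nn

class RecognitionModel(nn.Module):
    """Coding sub-model + alignment output sub-model + recognition output
    sub-model. Layer choices and sizes are illustrative assumptions."""

    def __init__(self, n_mels=80, hidden=256, n_phones=70):
        super().__init__()
        # Coding sub-model: encodes acoustic frames into feature vectors.
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Alignment output sub-model: per-frame phoneme labels, from which
        # phoneme start/end times (the alignment information) are derived.
        self.align_head = nn.Linear(2 * hidden, n_phones)
        # Recognition output sub-model: per-frame recognition scores whose
        # posteriors yield the phoneme confidence.
        self.recog_head = nn.Linear(2 * hidden, n_phones)

    def forward(self, feats):                     # feats: (B, T, n_mels)
        enc, _ = self.encoder(feats)              # (B, T, 2*hidden)
        return self.align_head(enc), self.recog_head(enc)
```

The phoneme confidence can then be taken, for example, as the mean posterior of the aligned phoneme over the frames between its phoneme start time and phoneme end time.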
On the basis of the above embodiment, the process in which the recognition module 32 trains the coding sub-model and the alignment output sub-model and then, while keeping the coding sub-model unchanged, trains the recognition output sub-model includes:
acquiring a first data set, aligning phonemes of first voice data in the first data set, and determining a label of each frame of data in the first voice data;
training the coding sub-model and the alignment output sub-model by taking the first data set as a training set, taking the first voice data as the input of the coding sub-model and taking the label of each frame of data in the first voice data as the output of the alignment output sub-model;
acquiring a second data set, aligning phonemes of second voice data of the second data set, and determining a text corresponding to the second voice data; the second voice data are data with correct pronunciation, and the number of the second voice data is smaller than that of the first voice data;
and taking the second data set as a training set, taking the second voice data as the input of the coding submodel and taking the text corresponding to the second voice data as the output of the recognition output submodel under the condition of keeping the coding submodel unchanged, and training the recognition output submodel.
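Continuing the sketch above, the two-stage training (alignment first, then recognition with a frozen encoder) might look as follows; the data loaders, batch shapes and the per-frame cross-entropy objective are illustrative assumptions — in particular, the text trains the recognition output sub-model against the text corresponding to the second voice data, which is simplified here to frame-level phoneme targets:

```python
import torch
import torch.nn.functional as F

def train_two_stage(model, first_loader, second_loader, lr=1e-3):
    """Assumed: both loaders yield (feats, frame_labels) batches, with
    feats of shape (B, T, n_mels) and frame_labels of shape (B, T)."""
    # Stage 1: train the coding sub-model and the alignment output
    # sub-model on the (large) first data set.
    opt1 = torch.optim.Adam(
        list(model.encoder.parameters()) + list(model.align_head.parameters()),
        lr=lr)
    for feats, frame_labels in first_loader:
        align_logits, _ = model(feats)            # (B, T, n_phones)
        loss = F.cross_entropy(align_logits.transpose(1, 2), frame_labels)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: keep the coding sub-model unchanged (frozen) and train only
    # the recognition output sub-model on the smaller, correctly
    # pronounced second data set.
    for p in model.encoder.parameters():
        p.requires_grad = False
    opt2 = torch.optim.Adam(model.recog_head.parameters(), lr=lr)
    for feats, frame_phones in second_loader:
        _, recog_logits = model(feats)
        loss = F.cross_entropy(recog_logits.transpose(1, 2), frame_phones)
        opt2.zero_grad(); loss.backward(); opt2.step()
    return model
```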
On the basis of the above embodiment, the evaluation dimension determination module 33 determines the word accuracy of the corresponding target word speech according to the word confidence, determines the phoneme accuracy of the corresponding target phoneme speech according to the phoneme confidence, including:
adding word labels consistent with corresponding words in the standard text for each target word voice in the target voice data, and adding phoneme labels consistent with corresponding phonemes in the standard text for the target phoneme voice;
taking the word label as a unit, taking the average value of word confidence degrees of a plurality of target word voices with the same word label as the word accuracy of the word label;
taking the phoneme label as a unit, taking an average value of the phoneme confidence of a plurality of target phoneme voices with the same phoneme label as the phoneme accuracy of the phoneme label.
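A minimal sketch of this label-wise averaging, in plain Python; the function and variable names are illustrative:

```python
from collections import defaultdict

def accuracy_by_label(items):
    """items: iterable of (label, confidence) pairs, where label is the
    word label or phoneme label attached from the standard text.
    Returns a mapping label -> mean confidence, i.e. the word accuracy or
    phoneme accuracy of that label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for label, conf in items:
        sums[label] += conf
        counts[label] += 1
    return {lab: sums[lab] / counts[lab] for lab in sums}

# e.g. two occurrences of the word label "apple" with word confidences
# 0.9 and 0.7 give a word accuracy of 0.8 for that label.
```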
On the basis of the above embodiment, the evaluation dimension determining module 33 determines corresponding time parameters according to word start times and word end times of a plurality of continuous target word voices, including:
in the case that the time parameter includes a speech rate, taking a time between the word start time and the word end time of the same target word voice as an effective time, determining the number of the target phoneme voices in a time period consisting of a plurality of continuous effective times, and determining the speech rate according to the number of the target phoneme voices in the time period;
in the case that the time parameter comprises a rhythm, in two adjacent target word voices, taking the time between the word start time of the latter target word voice and the word end time of the former target word voice as a pause time, determining the degree of dispersion of a plurality of pause times, and determining the rhythm according to the degree of dispersion; wherein the degree of dispersion comprises a variance and/or a standard deviation.
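A minimal sketch of both time parameters, assuming word boundaries in seconds; standard deviation is used here for the degree of dispersion, though the text equally allows variance:

```python
import statistics

def speech_rate(words, phone_counts):
    """words: list of (start, end) times of consecutive target word voices;
    phone_counts: number of target phoneme voices in each word.
    Speech rate = phonemes per second over the effective (voiced) time."""
    effective = sum(end - start for start, end in words)
    return sum(phone_counts) / effective if effective > 0 else 0.0

def rhythm(words):
    """Rhythm = dispersion of the pause times between adjacent words."""
    pauses = [nxt_start - prev_end
              for (_, prev_end), (nxt_start, _) in zip(words, words[1:])]
    return statistics.stdev(pauses) if len(pauses) > 1 else 0.0
```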
On the basis of the above embodiment, before the evaluation results of the target voice data are determined according to all the evaluation dimensions, the evaluation dimension determination module 33 is further configured to:
taking the pause time greater than a preset threshold value as abnormal pause time under the condition that the time parameter comprises the rhythm, and determining the abnormal pause parameter of the abnormal pause time; the abnormal pause parameter comprises one or more of pause position, pause duration and quantity of all abnormal pause time of the abnormal pause time;
and taking the abnormal pause parameter as an evaluation dimension.
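A minimal sketch of extracting the abnormal pause parameters; the 0.8-second threshold is an assumed preset value, not one given in the text:

```python
def abnormal_pauses(words, threshold=0.8):
    """words: list of (start, end) times of consecutive target word voices.
    Returns the abnormal pause parameters: positions (index of each pause),
    durations, and the total count of abnormal pause times."""
    pauses = [(i, words[i + 1][0] - words[i][1])
              for i in range(len(words) - 1)]
    abnormal = [(i, d) for i, d in pauses if d > threshold]
    return {"positions": [i for i, _ in abnormal],
            "durations": [d for _, d in abnormal],
            "count": len(abnormal)}
```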
On the basis of the above embodiment, before the evaluation results of the target voice data are determined according to all the evaluation dimensions, the evaluation dimension determination module 33 is further configured to:
performing an accent evaluation process and/or an intonation evaluation process;
the accent evaluation process comprises the following steps: determining accent phonemes which are accents according to the standard text, and determining accent confidence degrees of correctly recognized accent phonemes according to a preset accent recognition model; determining accent accuracy according to accent confidence degrees of a plurality of accent phonemes, and taking the accent accuracy as an evaluation dimension;
the intonation evaluation process comprises the following steps: taking the last target word voice of each sentence in the target voice data as an effective target word voice, segmenting pitch information of the effective target word voice, and determining the slope of the pitch information of each segment;
if the difference between the number of segments with a positive slope and the number of segments with a negative slope is larger than the preset difference, determining that the intonation of the effective target word voice is a rising tone; otherwise, determining that the intonation of the effective target word voice is a falling tone; and
and determining whether the intonation of the effective target word voice is correct or not according to the standard text, determining intonation accuracy according to a plurality of the effective target word voices, and taking the intonation accuracy as an evaluation dimension.
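Finally, a minimal sketch of the intonation accuracy over a plurality of effective target word voices; it reuses the classify_intonation sketch given earlier, and the rule mapping the standard text to an expected tone (e.g. rising for questions, falling for statements) is an assumption:

```python
def intonation_accuracy(final_word_pitches, expected_tones, classify):
    """final_word_pitches: pitch contours of the sentence-final (effective)
    target word voices; expected_tones: 'rising'/'falling' per sentence,
    derived from the standard text; classify: e.g. classify_intonation."""
    pairs = list(zip(final_word_pitches, expected_tones))
    if not pairs:
        return 0.0
    correct = sum(classify(p) == t for p, t in pairs)
    return correct / len(pairs)
```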
In addition, an embodiment of the invention also provides an electronic device, which comprises a bus, a transceiver, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the transceiver, the memory and the processor are connected through the bus. When the computer program is executed by the processor, the processes of the above method embodiment for comprehensively evaluating voice are implemented and the same technical effects can be achieved; to avoid repetition, they are not described here again.
In particular, referring to FIG. 4, an embodiment of the invention also provides an electronic device comprising a bus 1110, a processor 1120, a transceiver 1130, a bus interface 1140, a memory 1150, and a user interface 1160.
In an embodiment of the present invention, the electronic device further includes: computer programs stored on the memory 1150 and executable on the processor 1120, which when executed by the processor 1120, implement the various processes of the method embodiments for comprehensively evaluating speech described above.
A transceiver 1130 for receiving and transmitting data under the control of the processor 1120.
In an embodiment of the invention, a bus architecture is represented by bus 1110. Bus 1110 may include any number of interconnected buses and bridges, and connects together various circuits, including one or more processors represented by processor 1120 and memory represented by memory 1150.
Bus 1110 represents one or more of any of several types of bus structures, including a memory bus and a memory controller, a peripheral bus, an accelerated graphics port (Accelerate Graphical Port, AGP), a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such an architecture includes: industry standard architecture (Industry Standard Architecture, ISA) bus, micro channel architecture (Micro Channel Architecture, MCA) bus, enhanced ISA (EISA) bus, video electronics standards association (Video Electronics Standards Association, VESA) bus, peripheral component interconnect (Peripheral Component Interconnect, PCI) bus.
Processor 1120 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits in hardware or by instructions in the form of software in the processor. The processor includes: general purpose processors, central processing units (Central Processing Unit, CPU), network processors (Network Processor, NP), digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA), complex programmable logic devices (Complex Programmable Logic Device, CPLD), programmable logic arrays (Programmable Logic Array, PLA), micro control units (Microcontroller Unit, MCU) or other programmable logic devices, discrete gates, transistor logic devices, and discrete hardware components, which may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. For example, the processor may be a single-core processor or a multi-core processor, and the processor may be integrated on a single chip or located on multiple different chips.
The processor 1120 may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software modules may be located in a random access Memory (Random Access Memory, RAM), flash Memory (Flash Memory), read-Only Memory (ROM), programmable ROM (PROM), erasable Programmable ROM (EPROM), registers, and so forth, as are known in the art. The readable storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
Bus 1110 may also connect together various other circuits, such as peripheral devices, voltage regulators, or power management circuits; bus interface 1140 provides an interface between bus 1110 and transceiver 1130. All of this is well known in the art, and accordingly it is not further described in the embodiments of the present invention.
The transceiver 1130 may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. For example: the transceiver 1130 receives external data from other devices, and the transceiver 1130 is configured to transmit the data processed by the processor 1120 to the other devices. Depending on the nature of the computer system, a user interface 1160 may also be provided, for example: touch screen, physical keyboard, display, mouse, speaker, microphone, trackball, joystick, stylus.
It should be appreciated that in embodiments of the present invention, the memory 1150 may further comprise memory located remotely from the processor 1120; such remotely located memory may be connected to a server through a network. One or more portions of the above network may be an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless local area network (WLAN), a wide area network (WAN), a wireless wide area network (WWAN), a metropolitan area network (MAN), the Internet, a public switched telephone network (PSTN), a plain old telephone service (POTS) network, a cellular telephone network, a wireless fidelity (Wi-Fi) network, or a combination of two or more of the above networks. For example, the cellular telephone network and the wireless network may be a Global System for Mobile Communications (GSM) system, a Code Division Multiple Access (CDMA) system, a Worldwide Interoperability for Microwave Access (WiMAX) system, a General Packet Radio Service (GPRS) system, a Wideband Code Division Multiple Access (WCDMA) system, a Long Term Evolution (LTE) system, an LTE Frequency Division Duplex (FDD) system, an LTE Time Division Duplex (TDD) system, a Long Term Evolution-Advanced (LTE-A) system, a Universal Mobile Telecommunications System (UMTS), an enhanced Mobile Broadband (eMBB) system, a massive Machine Type Communication (mMTC) system, an Ultra Reliable Low Latency Communications (URLLC) system, and the like.
It should be appreciated that the memory 1150 in embodiments of the present invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory includes: Read-Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), or Flash Memory.
The volatile memory includes: random access memory (Random Access Memory, RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as: Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 1150 of the electronic device described in embodiments of the present invention includes, but is not limited to, the above and any other suitable types of memory.
In an embodiment of the invention, memory 1150 stores the following elements of the operating system 1151 and the application programs 1152: executable modules, data structures, or a subset or an extended set thereof.
Specifically, the operating system 1151 includes various system programs, such as: a framework layer, a core library layer, a driving layer and the like, which are used for realizing various basic services and processing tasks based on hardware. The applications 1152 include various applications such as: a Media Player (Media Player), a Browser (Browser) for implementing various application services. A program for implementing the method of the embodiment of the present invention may be included in the application 1152. The application 1152 includes: applets, objects, components, logic, data structures, and other computer system executable instructions that perform particular tasks or implement particular abstract data types.
In addition, the embodiment of the present invention further provides a computer readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements each process of the above method embodiment for comprehensively evaluating speech and can achieve the same technical effects; to avoid repetition, they are not described here again.
The computer-readable storage medium includes: persistent and non-persistent, removable and non-removable media are tangible devices that may retain and store instructions for use by an instruction execution device. The computer-readable storage medium includes: electronic storage, magnetic storage, optical storage, electromagnetic storage, semiconductor storage, and any suitable combination of the foregoing. The computer-readable storage medium includes: phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), non-volatile random access memory (NVRAM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassette storage, magnetic tape disk storage or other magnetic storage devices, memory sticks, mechanical coding (e.g., punch cards or bump structures in grooves with instructions recorded thereon), or any other non-transmission medium that may be used to store information that may be accessed by a computing device. In accordance with the definition in the present embodiments, the computer-readable storage medium does not include a transitory signal itself, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a pulse of light passing through a fiber optic cable), or an electrical signal transmitted through a wire.
In several embodiments provided herein, it should be understood that the disclosed apparatus, electronic device, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one position or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of the embodiment of the invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the embodiments of the present invention is essentially or partly contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (including: a personal computer, a server, a data center or other network device) to perform all or part of the steps of the method according to the embodiments of the present invention. And the storage medium includes various media as exemplified above that can store program codes.
In the description of the embodiments of the present invention, those skilled in the art will appreciate that the embodiments of the present invention may be implemented as a method, an apparatus, an electronic device, and a computer-readable storage medium. Thus, embodiments of the present invention may be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software. Furthermore, in some embodiments, embodiments of the invention may also be implemented in the form of a computer program product in one or more computer-readable storage media having computer program code embodied therein.
Any combination of one or more computer-readable storage media may be employed by the computer-readable storage media described above. The computer-readable storage medium includes: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium include the following: portable computer diskette, hard disk, random Access Memory (RAM), read-only Memory (ROM), erasable programmable read-only Memory (EPROM), flash Memory (Flash Memory), optical fiber, compact disc read-only Memory (CD-ROM), optical storage device, magnetic storage device, or any combination thereof. In embodiments of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, device.
The computer program code embodied in the computer readable storage medium may be transmitted using any appropriate medium, including: wireless, wire, fiber optic cable, radio Frequency (RF), or any suitable combination thereof.
Computer program code for carrying out operations of embodiments of the present invention may be written in assembly instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, integrated circuit configuration data, or in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the C language or similar programming languages. The computer program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the remote computer case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer.
The embodiment of the invention describes a method, a device and electronic equipment through flowcharts and/or block diagrams.
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions. These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer readable program instructions may also be stored in a computer readable storage medium that can cause a computer or other programmable data processing apparatus to function in a particular manner. Thus, the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means which implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The foregoing is merely a specific implementation of the embodiment of the present invention, but the protection scope of the embodiment of the present invention is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the embodiment of the present invention, and the changes or substitutions are covered by the protection scope of the embodiment of the present invention. Therefore, the protection scope of the embodiments of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method for comprehensively evaluating speech, comprising:
acquiring target voice data to be evaluated and a standard text corresponding to the target voice data;
performing recognition processing on the target voice data by taking the standard text as a reference, determining word parameters of each target word voice in the target voice data, and determining phoneme confidence of each target phoneme voice in the target word voice; the word parameters comprise word start time, word end time and word confidence;
determining word accuracy of corresponding target word voices according to the word confidence, determining phoneme accuracy of corresponding target phoneme voices according to the phoneme confidence, and determining corresponding time parameters according to word start time and word end time of a plurality of continuous target word voices, wherein the time parameters comprise speech speed and/or rhythm;
Taking the word accuracy, the phoneme accuracy and the time parameter as evaluation dimensions, and determining an evaluation result of the target voice data according to all the evaluation dimensions;
the determining the phoneme confidence of each target phoneme speech in the target word speech includes:
setting a recognition model, wherein the recognition model comprises a coding sub-model, an alignment output sub-model and a recognition output sub-model; the coding sub-model is used for coding input data into feature vectors, the alignment output sub-model is used for determining corresponding phoneme alignment information according to the feature vectors, and the recognition output sub-model is used for determining recognition results of each phoneme in the input data according to the feature vectors and the phoneme alignment information;
training the coding sub-model and the alignment output sub-model, and then training the recognition output sub-model under the condition of keeping the coding sub-model unchanged to determine a trained recognition model;
inputting the target voice data into the recognition model, determining feature vectors of the target voice data and phoneme alignment information of each target phoneme voice, and determining the phoneme confidence of each target phoneme voice in the target voice data based on the recognition output submodel; wherein the phoneme alignment information includes a phoneme start time and a phoneme end time;
Wherein the training the coding sub-model and the alignment output sub-model, and then, while keeping the coding sub-model unchanged, training the recognition output sub-model includes:
acquiring a first data set, aligning phonemes of first voice data in the first data set, and determining a label of each frame of data in the first voice data;
training the coding sub-model and the alignment output sub-model by taking the first data set as a training set, taking the first voice data as the input of the coding sub-model and taking the label of each frame of data in the first voice data as the output of the alignment output sub-model;
acquiring a second data set, aligning phonemes of second voice data of the second data set, and determining a text corresponding to the second voice data; the second voice data are data with correct pronunciation, and the number of the second voice data is smaller than that of the first voice data;
and taking the second data set as a training set, taking the second voice data as the input of the coding submodel and taking the text corresponding to the second voice data as the output of the recognition output submodel under the condition of keeping the coding submodel unchanged, and training the recognition output submodel.
2. The method of claim 1, wherein said determining word accuracy of the corresponding target word speech from the word confidence level, determining phoneme accuracy of the corresponding target phoneme speech from the phoneme confidence level, comprises:
adding word labels consistent with corresponding words in the standard text for each target word voice in the target voice data, and adding phoneme labels consistent with corresponding phonemes in the standard text for the target phoneme voice;
taking the word label as a unit, taking the average value of word confidence degrees of a plurality of target word voices with the same word label as the word accuracy of the word label;
taking the phoneme label as a unit, taking an average value of the phoneme confidence of a plurality of target phoneme voices with the same phoneme label as the phoneme accuracy of the phoneme label.
3. The method of claim 1, wherein said determining respective time parameters from word start times and word end times of a consecutive plurality of said target word voices comprises:
in the case that the time parameter includes a speech rate, taking a time between the word start time and the word end time of the same target word voice as an effective time, determining the number of the target phoneme voices in a time period consisting of a plurality of continuous effective times, and determining the speech rate according to the number of the target phoneme voices in the time period;
in the case that the time parameter comprises a rhythm, in two adjacent target word voices, taking the time between the word start time of the latter target word voice and the word end time of the former target word voice as a pause time, determining the degree of dispersion of a plurality of pause times, and determining the rhythm according to the degree of dispersion; wherein the degree of dispersion comprises a variance and/or a standard deviation.
4. A method according to claim 3, further comprising, prior to said determining the evaluation result of the target speech data from all of the evaluation dimensions:
taking the pause time greater than a preset threshold value as abnormal pause time under the condition that the time parameter comprises the rhythm, and determining the abnormal pause parameter of the abnormal pause time; the abnormal pause parameter comprises one or more of pause position, pause duration and quantity of all abnormal pause time of the abnormal pause time;
and taking the abnormal pause parameter as an evaluation dimension.
5. The method according to claim 1, further comprising, prior to said determining the evaluation result of the target speech data from all of the evaluation dimensions:
performing an accent evaluation process and/or an intonation evaluation process;
the accent evaluation process comprises the following steps: determining accent phonemes which are accents according to the standard text, and determining accent confidence degrees of correctly recognized accent phonemes according to a preset accent recognition model; determining accent accuracy according to accent confidence degrees of a plurality of accent phonemes, and taking the accent accuracy as an evaluation dimension;
the intonation evaluation process comprises the following steps: taking the last target word voice of each sentence in the target voice data as an effective target word voice, segmenting pitch information of the effective target word voice, and determining the slope of the pitch information of each segment;
if the difference between the number of segments with a positive slope and the number of segments with a negative slope is larger than the preset difference, determining that the intonation of the effective target word voice is a rising tone; otherwise, determining that the intonation of the effective target word voice is a falling tone; and
and determining whether the intonation of the effective target word voice is correct or not according to the standard text, determining intonation accuracy according to a plurality of the effective target word voices, and taking the intonation accuracy as an evaluation dimension.
6. An apparatus for comprehensively evaluating speech, comprising:
The acquisition module is used for acquiring target voice data to be evaluated and standard texts corresponding to the target voice data;
the recognition module is used for carrying out recognition processing on the target voice data by taking the standard text as a reference, determining word parameters of each target word voice in the target voice data and determining the phoneme confidence coefficient of each target phoneme voice in the target word voice; the word parameters comprise word start time, word end time and word confidence;
the evaluation dimension determining module is used for determining word accuracy of corresponding target word voices according to the word confidence, determining phoneme accuracy of the corresponding target phoneme voices according to the phoneme confidence, and determining corresponding time parameters according to word start time and word end time of a plurality of continuous target word voices, wherein the time parameters comprise speech speed and/or rhythm;
the evaluation module is used for taking the word accuracy, the phoneme accuracy and the time parameter as evaluation dimensions and determining the evaluation result of the target voice data according to all the evaluation dimensions;
the recognition module determines a phoneme confidence of each target phoneme speech in the target word speech, including:
setting a recognition model, wherein the recognition model comprises a coding sub-model, an alignment output sub-model and a recognition output sub-model; the coding sub-model is used for coding input data into feature vectors, the alignment output sub-model is used for determining corresponding phoneme alignment information according to the feature vectors, and the recognition output sub-model is used for determining recognition results of each phoneme in the input data according to the feature vectors and the phoneme alignment information;
training the coding sub-model and the alignment output sub-model, and then training the recognition output sub-model under the condition of keeping the coding sub-model unchanged to determine a trained recognition model;
inputting the target voice data into the recognition model, determining feature vectors of the target voice data and phoneme alignment information of each target phoneme voice, and determining the phoneme confidence of each target phoneme voice in the target voice data based on the recognition output submodel; wherein the phoneme alignment information includes a phoneme start time and a phoneme end time;
the training the coding sub-model and the alignment output sub-model by the identification module, and then training the identification output sub-model under the condition of keeping the coding sub-model unchanged, includes:
Acquiring a first data set, aligning phonemes of first voice data in the first data set, and determining a label of each frame of data in the first voice data;
training the coding sub-model and the alignment output sub-model by taking the first data set as a training set, taking the first voice data as the input of the coding sub-model and taking the label of each frame of data in the first voice data as the output of the alignment output sub-model;
acquiring a second data set, aligning phonemes of second voice data of the second data set, and determining a text corresponding to the second voice data; the second voice data are data with correct pronunciation, and the number of the second voice data is smaller than that of the first voice data;
and taking the second data set as a training set, taking the second voice data as the input of the coding submodel and taking the text corresponding to the second voice data as the output of the recognition output submodel under the condition of keeping the coding submodel unchanged, and training the recognition output submodel.
7. An electronic device comprising a bus, a transceiver, a memory, a processor and a computer program stored on the memory and executable on the processor, the transceiver, the memory and the processor being connected by the bus, characterized in that the computer program when executed by the processor implements the steps of the method of comprehensively evaluating speech according to any of claims 1 to 5.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of comprehensively evaluating speech according to any of claims 1 to 5.