CN114360502B

CN114360502B - Speech recognition model processing method, speech recognition method and device

Info

Publication number: CN114360502B
Application number: CN202111292319.2A
Authority: CN
Inventors: 邓克琦; 曹松军; 马龙
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-11-03
Filing date: 2021-11-03
Publication date: 2025-05-06
Anticipated expiration: 2041-11-03
Also published as: CN114360502A

Abstract

The application relates to a processing method of a voice recognition model, a voice recognition method and a voice recognition device. The method relates to a voice recognition technology in the field of artificial intelligence, and comprises the steps of obtaining voice features corresponding to sample signals through a voice recognition model and outputting a first predicted character sequence based on the voice features, inputting a forward character sequence corresponding to a labeling character sequence into a decoder, wherein the forward character sequence is generated based on the previous character of each character in the labeling character sequence, decoding the voice features according to semantic features corresponding to the forward character sequence in the decoder to obtain voice semantic joint features, obtaining a second predicted character sequence based on the voice semantic joint features, and jointly training the voice recognition model and the decoder based on voice recognition loss calculated according to the labeling character sequence and the first predicted character sequence and semantic recognition loss calculated according to the labeling character sequence and the second predicted character sequence. By adopting the method, the accuracy of voice recognition can be improved.

Description

Processing method of voice recognition model, voice recognition method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and apparatus for processing a speech recognition model.

Background

With the development of computer technology and artificial intelligence technology, speech recognition is required in many scenarios, such as virtual robot interaction scenarios, intelligent device control scenarios, machine translation scenarios, text conversion scenarios of voice messages, etc. For example, the terminal receives a voice signal input by a user through a virtual robot program installed on the terminal, performs voice recognition on the voice signal to obtain a voice recognition result, and performs a corresponding operation based on the voice recognition result. For another example, a voice control client is installed on the intelligent device, the intelligent device receives a voice signal input by a user through the voice control client, performs voice recognition on the voice signal to obtain a voice recognition result, obtains a control instruction based on the voice recognition result, and further executes corresponding operation.

At present, the non-autoregressive voice recognition model has wide application due to the advantages of high voice recognition speed and the like. But the non-autoregressive voice recognition model only uses the information of the voice signal on the voice level, and has the defect of low recognition accuracy.

Disclosure of Invention

Accordingly, in order to solve the above-mentioned problems, it is necessary to provide a method for processing a speech recognition model, a method for recognizing speech, and a device for recognizing speech, which can improve the accuracy of speech recognition.

A method of processing a speech recognition model, the method comprising:

Acquiring a sample signal and a corresponding labeling character sequence;

Inputting the sample signal into a voice recognition model to obtain voice characteristics corresponding to the sample signal and a first predicted character sequence output based on the voice characteristics;

inputting a forward character sequence corresponding to the labeling character sequence into a decoder, wherein the forward character sequence is generated based on the previous character of each character in the labeling character sequence;

In the decoder, decoding the voice features according to the semantic features corresponding to the forward character sequence to obtain voice semantic joint features corresponding to the sample signals, and predicting based on the voice semantic joint features to obtain second predicted character sequences corresponding to the sample signals;

the speech recognition model is trained jointly with the decoder based on speech recognition losses calculated from the sequence of annotated characters and the first sequence of predicted characters, and semantic recognition losses calculated from the sequence of annotated characters and the second sequence of predicted characters.

A processing apparatus of a speech recognition model, the apparatus comprising:

The acquisition module is used for acquiring the sample signal and the corresponding labeling character sequence;

The coding module is used for inputting the sample signal into a voice recognition model to obtain voice characteristics corresponding to the sample signal and a first predicted character sequence output based on the voice characteristics;

the input module is used for inputting a forward character sequence corresponding to the marking character sequence into the decoder, and the forward character sequence is generated based on the previous character of each character in the marking character sequence;

The decoding module is used for decoding the voice features according to the semantic features corresponding to the forward character sequences in the decoder to obtain voice semantic joint features corresponding to the sample signals, and predicting the voice semantic joint features to obtain second predicted character sequences corresponding to the sample signals;

And the training module is used for jointly training the voice recognition model and the decoder based on the voice recognition loss calculated according to the labeling character sequence and the first predicted character sequence and the semantic recognition loss calculated according to the labeling character sequence and the second predicted character sequence.

In one embodiment, the encoding module is further configured to input the sample signal into the speech recognition model, output, by an encoder of the speech recognition model, speech features corresponding to the sample signal, and output, by a classifier in the speech recognition model coupled to the encoder, the first predicted character sequence based on the speech features.

In one embodiment, the encoder comprises a feature extraction network and a self-attention-based voice context network, wherein the encoding module is further used for inputting the sample signals into the encoder to obtain voice vector sequences corresponding to the sample signals output by the feature extraction network in the encoder, carrying out random masking processing on voice vectors in the voice vector sequences, inputting the masked voice vector sequences into the voice context network to obtain contextual voice features output by the voice context network as voice features corresponding to the sample signals.

In one embodiment, the decoder comprises a vectorization layer, a self-attention-based semantic context network and a cross-attention-based voice semantic context network, wherein the decoding module is further used for converting the forward character sequence into a corresponding forward character vector sequence through the vectorization layer of the decoder, inputting the forward character vector sequence into the semantic context network, calculating context semantic features corresponding to the forward character sequence based on the forward character vector sequence through the semantic context network to serve as semantic features corresponding to the forward character sequence, and calculating voice semantic joint features corresponding to the sample signals through the voice semantic context network based on the semantic features corresponding to the forward character sequence and the voice features.

In one embodiment, the decoding module is further configured to input the speech semantic joint feature into a classifier of the decoder, and output, by the classifier, a second predicted character sequence corresponding to the sample signal based on the speech semantic joint feature.

In one embodiment, the voice recognition model comprises an encoder and a classifier connected with the encoder, the encoder is a pre-trained encoder obtained by self-supervision training by using a non-labeling sample signal, the training module is further used for performing supervision training on the decoder and the classifier of the voice recognition model according to the voice recognition loss and the semantic recognition loss, and performing supervision training on the decoder and the voice recognition model according to the voice recognition loss and the semantic recognition loss when a supervision training stop condition is met.

In one embodiment, the encoder is a pre-trained encoder obtained by performing self-supervision training by using unlabeled sample signals, the voice recognition model further comprises a pre-training module, the pre-training module is used for obtaining the unlabeled sample signals, inputting the unlabeled sample signals into an initial encoder to obtain a voice vector sequence corresponding to the unlabeled sample signals output by a feature extraction network in the initial encoder, performing quantization operation on the voice vector sequence to obtain a voice quantized vector sequence, determining a masking voice vector after performing random masking processing on the voice vectors in the voice vector sequence, inputting the masked voice vector sequence into a voice context network of the initial encoder to obtain a predicted voice vector corresponding to the masking voice vector output by the voice context network, constructing self-training loss based on the difference between the voice quantized vector corresponding to the masking voice vector and the predicted voice vector in the voice quantized vector sequence, updating network parameters of the initial encoder according to the self-supervision training loss, and continuing to obtain the pre-trained sample after the acquisition of the network parameters of the initial encoder.

In one embodiment, the training module is further configured to construct the speech recognition penalty based on a difference between the labeling character sequence and the first predicted character sequence, construct a semantic recognition penalty based on a difference between the labeling character sequence and the second predicted character sequence, weight sum the speech recognition penalty and the semantic recognition penalty according to a preset penalty weighting coefficient to obtain a target penalty, and jointly train the speech recognition model and the decoder according to the target penalty.

In one embodiment, the processing device of the voice recognition model further comprises a voice recognition module, wherein the voice recognition module is used for acquiring a signal to be recognized, inputting the signal to be recognized into a trained voice recognition model to obtain voice characteristics output by an encoder in the voice recognition model, and outputting a voice recognition result based on the voice characteristics by a classifier in the voice recognition model.

A computer device comprising a memory storing a computer program and a processor implementing the steps of the method for processing a speech recognition model as described above when the computer program is executed by the processor.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the steps of the method for processing a speech recognition model as described above.

A computer program comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium by a processor of a computer device, the computer instructions being executed by the processor to cause the computer device to perform the steps of the method of processing a speech recognition model as described above.

According to the processing method, the processing device, the computer equipment and the storage medium of the voice recognition model, the sample signal is input into the voice recognition model to obtain the voice feature corresponding to the sample signal, the first predicted character sequence output based on the voice feature is input into the decoder, the forward character sequence corresponding to the labeling character sequence is input into the decoder, the voice feature is decoded according to the semantic feature corresponding to the forward character sequence in the decoder to obtain the voice semantic joint feature corresponding to the sample signal, and the forward character sequence is generated based on the previous character of each character in the labeling character sequence, so that the voice semantic joint feature obtained by decoding and encoding the voice feature output by the encoder according to the semantic feature corresponding to the forward character sequence carries context information of a semantic level, the second predicted character sequence corresponding to the sample signal is obtained by prediction based on the voice semantic joint feature, and the voice recognition model is assisted by the semantic recognition loss training constructed according to the second predicted character sequence and the labeling character sequence, so that the context information of the semantic level can be distilled into the voice recognition model, thereby improving the recognition accuracy of the voice recognition model.

A method of speech recognition, the method comprising:

Acquiring a signal to be identified;

inputting the signal to be recognized into a trained voice recognition model to obtain voice characteristics output by an encoder in the voice recognition model and voice recognition results output by a classifier in the voice recognition model based on the voice characteristics;

The speech recognition model and the decoder are obtained based on joint training of speech recognition loss and semantic recognition loss, the speech recognition loss is obtained by calculating a labeling character sequence corresponding to the sample signal according to a first predicted character sequence, the semantic recognition loss is obtained by calculating a second predicted character sequence and the labeling character sequence, the first predicted character sequence is obtained after classification based on speech features output by the encoder, the second predicted character sequence is obtained by predicting speech semantic joint features obtained by decoding the speech features through semantic features corresponding to a forward character sequence corresponding to the labeling character sequence, and the forward character sequence is generated based on previous characters of all characters in the labeling character sequence.

A speech recognition device, the device comprising:

The acquisition module is used for acquiring the signal to be identified;

The voice recognition module is used for inputting the signal to be recognized into a trained voice recognition model to obtain voice characteristics output by an encoder in the voice recognition model and voice recognition results output by a classifier in the voice recognition model based on the voice characteristics;

A computer device comprising a memory storing a computer program and a processor implementing the steps of the above-described speech recognition method when the processor executes the computer program.

A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described speech recognition method.

A computer program comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium by a processor of a computer device, the computer instructions being executed by the processor to cause the computer device to perform the steps of the speech recognition method described above.

According to the voice recognition method, the device, the computer equipment and the storage medium, the signals to be recognized are input into the trained voice recognition model, the voice characteristics output by the encoder in the voice recognition model and the voice recognition results output by the classifier in the voice recognition model based on the voice characteristics are obtained, and the trained voice recognition model can utilize context information of semantic hierarchy to perform voice recognition, so that the voice recognition accuracy can be improved.

Drawings

FIG. 1 is an application environment diagram of a method of processing a speech recognition model in one embodiment;

FIG. 2 is a schematic diagram of a speech recognition scenario in one embodiment;

FIG. 3 is a flow diagram of a method of processing a speech recognition model in one embodiment;

FIG. 4 is a schematic diagram of training a speech recognition model with the aid of a decoder in one embodiment;

FIG. 5 is a schematic diagram of obtaining a speech feature corresponding to a sample signal by an encoder in one embodiment;

FIG. 6 is a schematic diagram of self-supervised pre-training of an initial encoder in one embodiment;

FIG. 7 is a schematic diagram of speech recognition model training with the aid of a decoder in another embodiment;

FIG. 8 is a flow diagram of a method of processing a speech recognition model in one embodiment;

FIG. 9 is a schematic diagram of a further embodiment of training a speech recognition model with the aid of a decoder;

FIG. 10 is a schematic diagram of test results in one embodiment;

FIG. 11 is a flow diagram of a method of speech recognition in one embodiment;

FIG. 12 is a block diagram of a processing device of a speech recognition model in one embodiment;

FIG. 13 is a block diagram of a voice recognition device in one embodiment;

FIG. 14 is an internal block diagram of a computer device in one embodiment;

fig. 15 is an internal structural view of a computer device in another embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

The application provides a processing method of a voice recognition model and a voice recognition method, which relate to an artificial intelligence (ARTIFICIAL INTELLIGENCE, AI) technology, wherein the artificial intelligence is a theory, a method, a technology and an application system which simulate, extend and expand human intelligence by using a digital computer or a machine controlled by the digital computer, sense environment, acquire knowledge and acquire an optimal result by using the knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.

The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.

The embodiment of the application provides a processing method of a voice recognition model, which mainly relates to an artificial intelligence machine learning technology (MACHINE LEARNING, ML). Machine learning is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.

For example, in an embodiment of the present application, a speech recognition model and a decoder are trained based on a combination of speech recognition loss and semantic recognition loss, and finally a speech recognition model for recognizing a speech signal is obtained.

The voice recognition method provided by the embodiment of the application mainly relates to an artificial intelligence voice technology (Speech Technology). Key technologies of the voice technology are an automatic voice recognition technology and a voice synthesis technology, and a voiceprint recognition technology. The method can enable the computer to listen, watch, say and feel, is the development direction of human-computer interaction in the future, and voice becomes one of the best human-computer interaction modes in the future.

For example, in the embodiment of the present application, the encoder in the trained speech recognition model outputs the speech features corresponding to the signal to be recognized, and the classifier in the trained speech recognition model outputs the speech recognition result based on the speech features.

The processing method and the voice recognition method of the voice recognition model provided by the embodiment of the application can also relate to a block chain technology. Blockchains are novel application modes of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, encryption algorithms, and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.

For example, in the embodiment of the present application, the server may be a blockchain node in a blockchain network, and the trained speech recognition model may be stored on the blockchain, and the signal to be recognized is uploaded to a data block of the blockchain to perform speech recognition on the signal to be recognized.

The processing method and the voice recognition method of the voice recognition model provided by the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but is not limited to, various smartphones, tablet computers, notebook computers, desktop computers, portable wearable devices, smart speakers, vehicle-mounted devices, and the like. The server 104 may be an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), and basic cloud computing services such as big data and artificial intelligence platforms.

In one embodiment, the terminal 102 obtains a sample signal and a corresponding labeling character sequence, sends the sample signal and the corresponding labeling character sequence to the server 104, the server 104 inputs the sample signal into a speech recognition model to obtain a speech feature corresponding to the sample signal and a first predicted character sequence output based on the speech feature, inputs a forward character sequence corresponding to the labeling character sequence into a decoder, the forward character sequence is generated based on a previous character of each character in the labeling character sequence, decodes the speech feature according to a semantic feature corresponding to the forward character sequence in the decoder to obtain a speech semantic joint feature corresponding to the sample signal, predicts based on the speech semantic joint feature to obtain a second predicted character sequence corresponding to the sample signal, and jointly trains the speech recognition model and the decoder based on a speech recognition loss calculated according to the labeling character sequence and the first predicted character sequence and a semantic recognition loss calculated according to the labeling character sequence and the second predicted character sequence.

The execution main body of the processing method of the voice recognition model provided by the embodiment of the application can be the processing device of the voice recognition model provided by the embodiment of the application or the computer equipment integrated with the processing device of the voice recognition model, wherein the processing device of the voice recognition model can be realized in a hardware or software mode. The computer device may be the terminal 102 or the server 104 shown in fig. 1.

In one embodiment, the terminal 102 obtains a signal to be recognized, sends the signal to be recognized to the server 104, the server 104 inputs the signal to be recognized into a trained speech recognition model to obtain speech features output by an encoder in the speech recognition model and speech recognition results output by a classifier in the speech recognition model based on the speech features, wherein the speech recognition model and the decoder are obtained by joint training based on speech recognition loss and semantic recognition loss, the speech recognition loss is calculated according to a first predicted character sequence and a labeled character sequence corresponding to a sample signal, the semantic recognition loss is calculated according to a second predicted character sequence and the labeled character sequence, the first predicted character sequence is obtained by classifying the speech features output by the encoder, the second predicted character sequence is obtained by predicting speech semantic joint features obtained by decoding the speech features by using semantic features corresponding to a forward character sequence corresponding to the labeled character sequence, and the forward character sequence is generated based on a previous character of each character in the labeled character sequence.

The implementation main body of the voice recognition method provided by the embodiment of the application can be the voice recognition device provided by the embodiment of the application or the computer equipment integrated with the voice recognition device, wherein the voice recognition device can be realized in a hardware or software mode. The computer device may be the terminal 102 or the server 104 shown in fig. 1.

The voice recognition method provided by the embodiment of the application can be applied to voice interaction scenes, such as virtual robot interaction scenes, intelligent equipment control scenes, machine translation scenes, text conversion scenes of voice messages and the like. Speech interaction scenarios typically involve speech recognition techniques that convert speech signals into text and semantic recognition techniques that recognize the intent of the text resulting from the conversion of the speech signals. The speech recognition model obtained by training the application is particularly applied to the speech recognition technology.

For example, a virtual robot program is installed on the terminal, and a speech recognition model trained by the present application is stored in a background server of the virtual robot program. The terminal receives a voice signal input by a user through the virtual robot program, the voice recognition model stored by the background server recognizes a text corresponding to the voice signal, and the terminal can execute corresponding operation based on the text or a semantic recognition result of the text.

Taking a vehicle-mounted robot as an example, the vehicle-mounted robot is a social robot applied to a vehicle-mounted intelligent cabin scene, and belongs to a service robot. The in-vehicle robot may provide corresponding services such as playing music/radio/news/e-book, navigating, inquiring weather/surrounding food, making a call, interactive chat, etc., in response to the input voice of the in-vehicle user.

Referring to fig. 2, a voice recognition system of an in-vehicle robot may include an acoustic front-end module, a cloud voice recognition module, an off-line/cloud semantic recognition module, and the like. The acoustic front-end module is used for providing functions of voice noise reduction, sound source positioning, echo cancellation and the like. The offline voice recognition module is used for providing functions of fixed wake-up word wake-up, customized wake-up word wake-up, offline voice recognition and the like. The cloud speech recognition module may include a speech recognition model for recognizing the speech signal as text, optionally the speech recognition model may be split into an acoustic model for recognizing the speech signal as a phoneme, a language model and a dictionary for converting the phoneme into text, and a decoder for performing an entire search process of the speech signal to the text in combination with the acoustic model, the language model and the dictionary. The offline/cloud semantic recognition module is used for recognizing the intention of the text obtained by converting the voice signal. The voice recognition model obtained through training can be applied to a cloud voice recognition module of the vehicle-mounted robot to improve the accuracy of the vehicle-mounted robot voice recognition.

For another example, a voice control client is installed on the intelligent device, and a background server of the voice control client stores a voice recognition model trained by the method. The intelligent device receives a voice signal input by a user through the voice control client, the voice recognition model stored by the background server recognizes a text corresponding to the voice signal, and the intelligent device can obtain a control instruction based on the text or a semantic recognition result of the text so as to execute corresponding operation. Smart devices include, but are not limited to, smart home devices and the like.

For example, a terminal is provided with a translation client, and a background server of the translation client stores a speech recognition model trained by the application. The terminal receives a voice signal input by a user through a translation client, a voice recognition model stored by a background server recognizes a text corresponding to the voice signal, the text or a semantic recognition result of the text is translated, a translation result is obtained, and the terminal outputs the translation result corresponding to the voice signal.

For another example, a session client is installed on the terminal, and a background server of the session client stores the speech recognition model trained by the application. The terminal receives the voice message input by the user through the session client, responds to the voice message conversion instruction, the voice recognition model stored by the background server recognizes the text corresponding to the voice message, and the terminal can display the text message corresponding to the voice message based on the text or the semantic recognition result of the text.

In one embodiment, as shown in fig. 3, a method for processing a speech recognition model is provided, and this embodiment is mainly applied to the computer device (the terminal 102 or the server 104) in fig. 1 to exemplify the method, and includes the following steps:

step S302, a sample signal and a corresponding labeling character sequence are obtained.

Wherein the sample signal is a speech signal for training a speech recognition model, which has timing characteristics. The sample signal may be an original analog sound signal or a digital signal obtained by processing the original analog sound signal. The speech recognition model is an acoustic model with speech recognition capability after training, and specifically can be a model which is obtained by training by taking a sample signal as training data and is used for recognizing phonemes or characters of a speech signal. Each sample signal has a corresponding sequence of annotation characters, which may be a phoneme sequence or a word sequence. For example, the sample signal "good weather today" may be the phoneme sequence "bei3jin 1 tian qi4 hao3" or the word sequence "good weather today".

In one embodiment, the speech recognition model may be a non-autoregressive model based on CTCs (Connectionist temporal classification, time series class classification). The CTC algorithm is used for solving the problem of labeling time sequence data. In conventional acoustic model training, for each frame of sample signal, the corresponding annotation character needs to be known to perform effective training, so that alignment processing of the sample signal is required before training, which is a time-consuming task. The CTC loss function is adopted for training, alignment processing of sample signals is not needed, and training can be achieved only by providing the sample signals and the labeling character sequences corresponding to the sample signals. The autoregressive (Autoregressive Translation, ART) model needs to predict the next word by using the generated word during voice recognition, and has the characteristic of high recognition accuracy, but has low recognition speed, while the autoregressive voice recognition model can simultaneously generate predicted words within a specific iteration number during voice recognition, so that the recognition speed is high, but the recognition accuracy is not as high as that of the autoregressive voice recognition model. In the application, the decoder is introduced when the voice recognition model is trained, the context information of the semantic hierarchy is learned by the voice recognition model through the combined training of the voice recognition model and the decoder, and the decoder does not participate in the voice recognition process of the voice recognition model during voice recognition, so that the recognition accuracy of the voice recognition model is improved, and the recognition speed of the voice recognition model is not influenced.

In one embodiment, a computer device obtains a sample signal and a corresponding sequence of annotation characters, and trains a speech recognition model using the sample signal and the corresponding sequence of annotation characters.

Step S304, inputting the sample signal into a voice recognition model to obtain voice characteristics corresponding to the sample signal and outputting a first predicted character sequence based on the voice characteristics.

Where speech features are data describing characteristics of the sample signal at the speech level. The speech features may be in the form of vectors. For example, convert speech signals to "[1 0.2 4 0.3 0.10 0.8 0.7 0 0.7 2.1 5.2 0. ]. The first predicted character sequence is a predicted result obtained by performing voice recognition on the sample signal by a voice recognition model based on the voice feature, and can be a phoneme sequence or a text sequence.

In one embodiment, a computer device inputs a sample signal into a speech recognition model, outputs speech features corresponding to the sample signal through an encoder of the speech recognition model, and outputs a first predicted character sequence based on the speech features through a classifier coupled to the encoder in the speech recognition model.

In one embodiment, the speech recognition model may include an encoder for encoding the sample signal to obtain speech features corresponding to the sample signal, and a classifier for recognizing characters corresponding to each of the time-period signals in the sample signal based on the speech features to output a first predicted character sequence corresponding to the sample signal.

For example, referring to FIG. 4, FIG. 4 is a schematic diagram of speech recognition model training with decoder assistance in one embodiment. The computer device inputs the sample signal into a speech recognition model, outputs a speech feature [ c 1c 2c 3 c4 c5] corresponding to the sample signal through an encoder of the speech recognition model, and outputs a first predicted character sequence "w1w2w3w4w5" based on the speech feature [ c 1c 2c 3 c4 c5] through a classifier of the speech recognition model.

In one embodiment, the encoder may employ a generic encoder structure, such as CNN (Convolutional Neural Networks, convolutional neural network), RNN (Recurrent Neural Network, cyclic neural network), or the like. The classifier may also employ a generic classifier structure, such as a linear classifier, etc.

In step S306, a forward character sequence corresponding to the labeling character sequence is input to the decoder, the forward character sequence being generated based on a preceding character of each character in the labeling character sequence.

Wherein the forward character sequence is generated based on a preceding character of each character in the sequence of labeling characters. For example, if the character sequence L is "good today's weather", the forward character sequence corresponding to L can be obtained as "/good today's weather" according to the previous character of each character in L. Specifically, since the first character "present" in L does not have a corresponding previous character, the previous character "present" can be represented by "/" and the first character in the forward character sequence corresponding to L is obtained. Similarly, the second character in L is "day", and its previous character is "jin", so that the second character in the forward character sequence corresponding to L is "jin". By this, the forward character sequence corresponding to L is obtained as "/today's weather".

Step S308, in the decoder, the voice features are decoded according to the semantic features corresponding to the forward character sequence, so as to obtain voice semantic joint features corresponding to the sample signals, and prediction is performed based on the voice semantic joint features, so as to obtain a second predicted character sequence corresponding to the sample signals.

The second predicted character sequence is a predicted result obtained by the decoder through voice recognition based on the voice semantic joint characteristics, and can be a phoneme sequence or a text sequence. The voice semantic joint feature is a feature obtained by decoding and recoding the voice feature by using the above semantic information of the labeling character sequence reflected by the forward character sequence by the decoder. As the name implies, the voice semantic joint feature considers the feature of the voice signal on the voice level and the information of the labeling character sequence corresponding to the voice information on the semantic level.

In one embodiment, the decoder decodes and re-encodes the voice feature output by the encoder according to the semantic feature corresponding to the forward character sequence to obtain the voice semantic joint feature, so that the voice semantic joint feature carries context information of a semantic hierarchy, a second predicted character sequence corresponding to a sample signal is obtained based on voice semantic joint feature prediction, and a semantic recognition loss auxiliary voice recognition model constructed according to the second predicted character sequence and the labeling character sequence is trained, so that the context information of the semantic hierarchy can be distilled into the voice recognition model, the voice recognition model is helped to relieve the defects of independence assumption and incapability of utilizing the context information of the semantic hierarchy, and the recognition accuracy of the voice recognition model is improved.

In one embodiment, the computer device inputs the forward character sequence corresponding to the labeling character sequence into a decoder, obtains semantic features corresponding to the forward character sequence in the decoder, decodes and re-encodes the speech features according to the semantic features corresponding to the forward character sequence to obtain speech semantic joint features, predicts based on the speech semantic joint features, and obtains a second predicted character sequence corresponding to the sample signal.

In one embodiment, the decoder may include a vectorization layer and a cross-attention based speech semantic context network. The vectorization layer is used for acquiring semantic features corresponding to the forward character sequence. The feature dimension of the semantic feature corresponding to the forward character sequence is consistent with the feature dimension of the voice feature. The voice semantic context network based on the cross attention is used for decoding voice features and encoding by utilizing semantic features corresponding to the forward character sequence, so that the obtained voice semantic joint features carry context information of semantic hierarchy.

For example, with continued reference to fig. 4, the computer device inputs the forward character sequence "/x2x3x4x5" corresponding to the labeling character sequence into the decoder 402, obtains the semantic feature [ e1 e2 e3 e4 e5] corresponding to the forward character sequence through the vectorization layer of the decoder 402, inputs the semantic feature [ e1 e2 e3 e4 e5] and the speech feature [ c 1c 2 c3 c4 c5] extracted by the encoder into the speech semantic context network of the decoder 402, obtains the speech semantic joint feature [ r1 r2 r 3r 4 r5] through the speech semantic context network, predicts based on the speech semantic joint feature [ r1 r2 r 3r 4 r5], and obtains the second predicted character sequence "y1y2y3y4y5" corresponding to the sample signal.

Step S310, a speech recognition model and a decoder are jointly trained based on speech recognition losses calculated according to the labeling character sequence and the first predicted character sequence, and semantic recognition losses calculated according to the labeling character sequence and the second predicted character sequence.

It can be appreciated that the general penalty function meets the requirements of the embodiments of the present application for speech recognition penalty and semantic recognition penalty, so that the computer device can construct the speech recognition penalty and the semantic recognition penalty using the general penalty function. Common loss functions such as cross entropy loss functions, cosine similarity loss functions, and the like.

For example, with continued reference to fig. 4, the computer device obtains, via the decoder 402, a second predicted character sequence "y1y2y3y4y5" corresponding to the sample signal, and the classifier of the speech recognition model outputs a first predicted character sequence "w1w2w3w4w5" corresponding to the sample signal based on the speech feature [ c 1c 2c 3 c4 c5 ]. Thus, the computer device may calculate a speech recognition penalty based on the labeling character sequence "x1x2x3x4x5" and the first predicted character sequence "w1w2w3w4w5", and calculate a semantic recognition penalty based on the labeling character sequence "x1x2x3x4x5" and the second predicted character sequence "y1y2y3y4y5", and train the speech recognition model and decoder based on the speech recognition penalty and the semantic recognition penalty in combination.

In one embodiment, the computer device performs weighted summation of the speech recognition loss and the semantic recognition loss according to a preset loss weighting coefficient to obtain a target loss, and jointly trains the speech recognition model and the decoder according to the target loss.

In one embodiment, the target penalty is a composite penalty function that is a combination of speech recognition and semantic recognition penalty. The target loss can be expressed by the following formula:

L_t＝λ₁L_v+λ₂L_s

Where L _t represents a target loss, L _v represents a speech recognition loss, lambda ₁ represents a loss weighting factor corresponding to the speech recognition loss, e.g., lambda ₁ may be 0.3, L _s represents a semantic recognition loss, lambda ₂ represents a loss weighting factor corresponding to the semantic recognition loss, e.g., lambda ₂ may be 0.7.

In one embodiment, the computer device obtains a gradient corresponding to the present training based on a gradient descent algorithm in a direction that minimizes target loss, and updates network parameters of the speech recognition model and the decoder according to the gradient. The gradient descent algorithm may be a random gradient descent algorithm, or an algorithm optimized based on a random gradient descent algorithm, such as a random gradient descent algorithm with a vector term, etc.

In the processing method of the speech recognition model, the sample signal is input into the speech recognition model to obtain the speech feature corresponding to the sample signal, the first prediction character sequence output based on the speech feature is input into the decoder, the forward character sequence corresponding to the labeling character sequence is input into the decoder, the speech feature is decoded according to the semantic feature corresponding to the forward character sequence in the decoder to obtain the speech semantic joint feature corresponding to the sample signal, and the forward character sequence is generated based on the previous character of each character in the labeling character sequence, so that the speech semantic joint feature obtained by decoding and encoding the speech feature output by the encoder according to the semantic feature corresponding to the forward character sequence carries context information of the semantic layer, the second prediction character sequence corresponding to the sample signal is obtained based on the speech semantic joint feature, and the speech recognition model is trained according to the semantic recognition loss assistance constructed by the second prediction character sequence and the labeling character sequence, so that the context information of the semantic layer can be distilled into the speech recognition model, thereby improving the recognition accuracy of the speech recognition model.

In one embodiment, the encoder comprises a feature extraction network and a self-attention-based voice context network, wherein the encoder outputs voice features corresponding to sample signals through a voice recognition model, and the method comprises the steps of inputting the sample signals into the encoder to obtain voice vector sequences corresponding to the sample signals output by the feature extraction network in the encoder, randomly masking voice vectors in the voice vector sequences, and inputting the masked voice vector sequences into the voice context network to obtain contextual voice features output by the voice context network as voice features corresponding to the sample signals.

The speech vector sequence is a sequence of speech vectors, and the speech vectors are the results obtained by mapping speech signals to a high-dimensional vector space.

In one embodiment, the computer device inputs the sample signal to an encoder to obtain a sequence of speech vectors corresponding to the sample signal output by a feature extraction network in the encoder, each speech vector in the sequence of speech vectors being a speech vector corresponding to a speech signal of each period in the sample signal. For example, the computer device divides the sample signal into speech signals of time period t1 to time period t5, inputs the sample signal into the encoder, and obtains a speech vector sequence [ z1 z2 z3 z4 z5] output by the feature extraction network in the encoder, wherein the speech vector z1 is a speech vector corresponding to the speech signal of time period t 1. It is understood that the duration of each period may be set according to practical applications, and the present application is not particularly limited.

In one embodiment, the encoder may include a feature extraction network for feature extraction of the sample signal to obtain a sequence of speech vectors corresponding to the sample signal, and a self-attention-based speech context network for encoding the sequence of speech vectors to obtain contextual speech features corresponding to the sample signal, the self-attention-based speech context network being capable of encoding the sequence of speech vectors using context information, while the self-attention mechanism ensures efficient parallel efficiency and direct connection to long-distance information, thereby improving the characterization of the speech features.

In one embodiment, the feature extraction network may employ a generic feature extraction network, such as CNN (Convolutional Neural Networks, convolutional neural network), RNN (Recurrent Neural Network ), or the like. The self-attention based voice context network may employ a generic self-attention (self-attention) model, such as a transducer model, conformer model, or the like.

In one embodiment, a computer device performs a random masking process on speech vectors in a sequence of speech vectors. It will be appreciated that the general masking approach meets the need for masking in accordance with embodiments of the present application, and thus the general masking approach may be used to mask speech vectors in a sequence of speech vectors. Alternatively, the computer device may mask the speech vectors in the sequence of speech vectors by GELU (Gaussian Error Linerar Units).

In one embodiment, the computer device inputs the masked sequence of speech vectors into a speech context network, calculates the self-attentiveness corresponding to each speech vector in the masked sequence of speech vectors through the speech context network, the self-attentiveness being capable of reflecting the importance of each speech vector in the masked sequence of speech vectors, and outputs contextual speech features based on each speech vector and its corresponding self-attentiveness through a feedforward neural network.

In one embodiment, the computer device calculates similarities between each speech vector in the masked speech vector sequence and the masked speech vector sequence through a self-attention network in the speech context network, and normalizes each similarity to obtain a self-attention corresponding to each speech vector in the masked speech vector sequence. Optionally, the computer device calculates a sum of the similarities corresponding to the respective voice vectors in the masked voice vector sequence, and calculates a ratio of the similarities corresponding to the respective voice vectors in the masked voice vector sequence to the sum of the similarities as the self-attention corresponding to the respective voice vectors in the masked voice vector sequence.

For example, referring to fig. 5, fig. 5 is a schematic diagram illustrating a voice feature corresponding to a sample signal obtained by an encoder in one embodiment. The computer device divides the sample signal into speech signals of time period t 1-time period t5, inputs the sample signal into the encoder 502, and obtains a speech vector sequence [ z1 z2 z3 z4 z5] output by the feature extraction network in the encoder 502. The computer device performs random masking processing on the voice vectors in the voice vector sequence [ z1 z2 z3 z4 z5] to obtain a masked voice vector sequence [ z2×z4 ]. The computer device inputs the masked speech vector sequence [ ×z2×z4 ] into the self-attention-based speech context network 504, calculates the similarity s1, s2, s3, s4, s5 between each speech vector in the masked speech vector sequence [ ×z2×z4 ] and the masked speech vector sequence [ ×z2×z4 ], and normalizes the similarity s1, s2, s3, s4, s5 to obtain the self-attention p1, p2, p3, p4, p5 by the self-attention network in the self-attention-based speech context network 504. The computer device encodes each speech vector in the self-attention p1, p2, p3, p4, p5 and the masked sequence of speech vectors [. Z2 [. Z4 ] ], input to a feedforward neural network in the self-attention-based speech context network 504 to obtain a contextual speech feature [ c1 c 2c 3c 4 c5] output by the feedforward neural network.

In this embodiment, the encoder includes a self-attention-based voice context network, and the self-attention-based voice context network can utilize context information to encode a voice vector sequence output by the feature extraction network, and meanwhile, the self-attention mechanism ensures efficient parallel efficiency and direct connection to long-distance information, so as to improve the characterization capability of voice features.

In one embodiment, the encoder is a pre-trained encoder obtained by performing self-supervision training by using unlabeled sample signals, the method further comprises the steps of obtaining unlabeled sample signals, inputting the unlabeled sample signals into an initial encoder to obtain a voice vector sequence corresponding to the unlabeled sample signals output by a feature extraction network in the initial encoder, performing quantization operation on the voice vector sequence to obtain a voice quantized vector sequence, randomly masking voice vectors in the voice vector sequence, determining masking voice vectors, inputting the masked voice vector sequence into a voice context network of the initial encoder to obtain predicted voice vectors corresponding to the masking voice vectors output by the voice context network, constructing self-supervision training loss based on differences between the voice quantized vectors corresponding to the masking voice vectors and the predicted voice vectors in the voice quantized vector sequence, and returning to the step of obtaining the unlabeled sample signals to continue training after updating network parameters of the initial encoder according to the self-supervision training loss until the pre-supervision training is finished.

Wherein the unlabeled exemplar signal is a speech signal used for self-supervised pre-training of the encoder. The unlabeled sample signal has no corresponding labeling data. The initial encoder is the encoder to be self-supervised pre-trained.

In one embodiment, a computer device performs a quantization operation on a sequence of speech vectors resulting in a sequence of speech quantized vectors. The quantization operation may be a discretization process, such as a product quantization process, i.e., a Cartesian product (CARTESIAN PRODUCT). Infinite feature space is collapsed into a limited discrete space through quantization operation, so that the robustness of the features is enhanced, and the characterization capability of the features is improved.

In one embodiment, each speech vector in the speech quantization vector sequence includes a first speech vector corresponding to a speech signal of each period in the sample signal, and includes a second speech vector corresponding to a speech signal of each period in the sample signal in a speech feature corresponding to the unlabeled sample signal. The computer equipment constructs the speech vector prediction loss corresponding to the speech signals in the same time period based on the difference between the first speech vector and the second speech vector corresponding to the speech signals in the same time period, and fuses the speech vector prediction loss corresponding to the speech signals in each time period to obtain the self-supervision training loss.

In one embodiment, the speech vector prediction loss for a speech signal of period t may be expressed by the following formula:

Wherein L _m represents a speech vector prediction loss corresponding to the speech signal of the t period, Q _t represents a first speech vector corresponding to the speech signal of the t period, c _t represents a second speech vector corresponding to the speech signal of the t period, Q _t represents a candidate speech vector set comprising Q _t and k error speech vectors; Sim (c _t,q_t) represents the correlation between c _t and Q _t; representing c _t and Correlation between them.

In one embodiment, the computer device obtains a preset loss weighting coefficient, and sums the predicted loss weights of the voice vectors corresponding to the voice signals of each period according to the preset loss weighting coefficient to obtain the self-supervision training loss.

In one embodiment, training is ended when the number of exercises reaches a preset number, or the loss value calculated from the supervised exercise loss is less than a preset value.

For example, referring to fig. 6, fig. 6 is a schematic diagram of self-supervised pre-training of an initial encoder in one embodiment. The computer device divides the unlabeled sample signal into voice signals of t1 time period-t 5 time period, inputs the unlabeled sample signal into an initial encoder, and obtains a voice vector sequence [ z1 z2 z3 z4 z5] output by a feature extraction network in the initial encoder. The computer device performs a quantization operation on the speech vector sequence [ z1 z2 z3 z4 z5] to obtain a speech quantized vector sequence [ q1 q2 q3 q4 q5]. After the computer device performs random masking processing on the speech vectors in the speech vector sequence [ z1 z2 z3 z4 z5], masked speech vectors z1, z3, z5 are determined. The computer device inputs the masked speech vector sequence [ [ z2 ] z4 ] into the self-attention-based speech context network, calculates similarity s1, s2, s3, s4, s5 between each speech vector in the masked speech vector sequence [ [ z2 ] z4 ] and the masked speech vector sequence [ [ z2 ] z4 ] respectively through the self-attention network in the self-attention-based speech context network, and normalizes the similarity s1, s2, s3, s4, s5 to obtain self-attention p1, p2, p3, p4, p5. The computer device predicts predicted speech vectors c1, c3, c5 corresponding to the masking speech vectors z1, z3, z5 based on the self-attention p1, p3, p5 through a feedforward neural network in the self-attention based speech context network. The computer device trains the initial encoder based on the speech vector predictive loss constructed based on the difference between c1 and q1, the speech vector predictive loss constructed based on the difference between c3 and q3, and the speech vector predictive loss constructed based on the difference between c5 and q 5.

In this embodiment, the encoder is subjected to self-supervision pre-training, so that the representation capability of the voice features output by the encoder can be improved, and further the subsequent training efficiency and training effect are improved.

In one embodiment, the speech recognition model comprises an encoder and a classifier connected with the encoder, the encoder is a pre-trained encoder obtained by self-supervision training by using unlabeled sample signals, the speech recognition model and the decoder are trained jointly based on speech recognition losses calculated according to a labeled character sequence and a first predicted character sequence and semantic recognition losses calculated according to a labeled character sequence and a second predicted character sequence, the classifier of the decoder and the speech recognition model is supervised trained according to the speech recognition losses and the semantic recognition losses, and the decoder and the speech recognition model are supervised according to the speech recognition losses and the semantic recognition losses when a supervision training stop condition is met.

In one embodiment, the computer device performs self-supervision pre-training on the encoder in advance, fixes network parameters of the encoder after obtaining the pre-trained encoder, updates the network parameters of the decoder and the classifier of the speech recognition model according to the speech recognition loss and the semantic recognition loss, and updates the network parameters of the decoder and the speech recognition model according to the speech recognition loss and the semantic recognition loss when the training stop condition is satisfied.

In one embodiment, the decoder comprises a vectorization layer, a self-attention-based semantic context network and a cross-attention-based voice semantic context network, decodes voice features according to semantic features corresponding to forward character sequences to obtain voice semantic joint features corresponding to sample signals, and comprises the steps of converting the forward character sequences into corresponding forward character vector sequences through the vectorization layer of the decoder, inputting the forward character vector sequences into the semantic context network, calculating context semantic features corresponding to the forward character sequences based on the forward character vector sequences through the semantic context network to serve as the semantic features corresponding to the forward character sequences, and calculating voice semantic joint features corresponding to the sample signals based on the semantic features and the voice features corresponding to the forward character sequences through the voice semantic context network.

In one embodiment, the decoder may include a vectorization layer, a self-attention-based semantic context network, and a cross-attention-based speech semantic context network. The vectorization layer is used for converting the forward character sequence into a vector form, namely a forward character vector sequence. The self-attention-based semantic context network is used to determine the attention of each forward character vector itself in the sequence of forward character vectors, i.e., the importance of each forward character vector in the sequence of forward character vectors. The cross-attention based phonetic semantic context network is used to determine the attention contribution of the previous character to predict the next character, i.e., how much attention the previous character needs to pay to predict the next character.

In one embodiment, the computer device inputs the forward character vector sequence into a self-attention-based semantic context network of the decoder, calculates the similarity between each forward character vector and the forward character vector sequence through the semantic context network, normalizes each similarity, and obtains the self-attention of each forward character vector in the forward character vector sequence as the contextual semantic feature of the forward character vector sequence. Optionally, the computer device calculates a sum of the similarities, and calculates a ratio of the similarities to the sum of the similarities, respectively, as self-attention of the forward character vectors in the sequence of forward character vectors.

In one embodiment, the computer device inputs the self-attention and the voice feature extracted by the encoder into a voice semantic context network based on cross-attention of the decoder, calculates the similarity between the self-attention and the voice feature corresponding to each forward character vector through the voice semantic context network, normalizes each similarity to obtain the cross-attention of the self-attention corresponding to each forward character vector in the voice feature, and obtains the voice semantic joint feature based on each cross-attention. In one embodiment, the computer device inputs the self-attention and the voice feature extracted by the encoder into a voice semantic context network based on cross-attention of the decoder, calculates the similarity between the self-attention and the voice feature corresponding to each forward character vector through the cross-attention network in the voice semantic context network, and normalizes each similarity to obtain the cross-attention of the self-attention corresponding to each forward character vector in the voice feature. Optionally, the computer device calculates a sum of the similarities, and calculates a ratio of the similarities to the sum of the similarities, respectively, as a cross-attention of the self-attention corresponding to the forward character vectors in the speech feature.

In one embodiment, the computer device encodes a feedforward neural network in the cross-attention input speech semantic context network corresponding to each forward character vector to obtain speech semantic joint features output by the feedforward neural network. For example, referring to FIG. 7, FIG. 7 is a schematic diagram of speech recognition model training with decoder assistance in one embodiment. The computer device inputs the forward character sequence "/x2x3x4x5" into the decoder 702, and converts the forward character sequence "/x2x3x4x5" into a corresponding forward character vector sequence [ e1 e2 e3 e4 e5] through the vectorization layer of the decoder 702. The computer device inputs the forward character vector sequence [ e1 e2 e3 e4 e5] into the self-attention-based semantic context network of the decoder 702, calculates the similarity s1, s2, s3, s4, s5 between each of the forward character vectors e1, e2, e3, e4, e5 and the forward character vector sequence [ e1 e2 e3 e4 e5] through the semantic context network, normalizes the similarity s1, s2, s3, s4, s5, and obtains the self-attention o1, o2, o3, o4, o5 of each of the forward character vectors in the forward character vector sequence as the contextual semantic features of the forward character vector sequence [ e1 e2 e3 e4 e5]. The computer device inputs the self-attentions o1, o2, o3, o4, o5 and the speech features [ c1 c2 c 3c 4c 5] extracted by the encoder into a cross-attentions based speech semantic context network 704 of the decoder 702, calculates the similarity s1, s2, s3, s4, s5 between the respective attentions and the speech features through the cross-attentions network in the speech semantic context network 704, normalizes the similarity s1, s2, s3, s4, s5 to obtain the cross attentions u1, u2, u3, u4, u5 of the respective attentions o1, o2, o3, o4, o5 in the speech features [ c1 c2 c 3c 4c 5]. The computer device encodes the cross-attentions u1, u2, u3, u4, u5 input to the feedforward neural network in the speech semantic context network 704 to obtain the speech semantic joint feature [ r1 r2 r3 r4 r5] output by the feedforward neural network.

The cross attentions u1, u2, u3, u4 and u5 are used for representing the contribution of each forward character vector to the next character corresponding to the predicted forward character vector. For example, if the character sequence is "good weather today" and the forward character sequence is "/weather today", the cross attention u2 corresponding to the forward character vector of the forward character "present" is used to represent the importance degree of the forward character "present" for predicting "day".

In one embodiment, the predicting based on the speech semantic joint features to obtain a second predicted character sequence corresponding to the sample signal includes inputting the speech semantic joint features to a classifier of a decoder, and outputting the second predicted character sequence corresponding to the sample signal based on the speech semantic joint features through the classifier.

In one embodiment, the decoder may include a vectorization layer, a self-attention-based semantic context network, a cross-attention-based speech semantic context network, and a classifier for identifying a character corresponding to each period signal in the sample signal based on speech semantic joint features, and outputting a second predicted character sequence corresponding to the sample signal. For example, with continued reference to fig. 7, the computer device inputs the speech semantic joint feature [ r 1r 2r 3r 4r 5] into a classifier of the decoder 702, through which a second predicted character sequence "y1y2y3y4y5" corresponding to the sample signal is output based on the speech semantic joint feature. And the classifier of the speech recognition model outputs a first predicted character sequence 'w 1w2w3w4w 5' corresponding to the sample signal based on the speech feature 'c 1 c 2c 3c 4c 5'. Thus, the computer device may jointly train the speech recognition model with the decoder 702 based on the speech recognition penalty calculated from the labeling character sequence "x1x2x3x4x5" and the first predicted character sequence "w1w2w3w4w5" and the semantic recognition penalty calculated from the labeling character sequence "x1x2x3x4x5" and the second predicted character sequence "y1y2y3y4y5".

In this embodiment, the decoder includes a cross-attention-based speech semantic context network, and the cross-attention-based speech semantic context network can utilize speech features carrying a speech hierarchy output by the encoder to assist the speech recognition model in training, so as to distill the context information of the semantic hierarchy into the speech recognition model, help the speech recognition model alleviate the independence assumption and the inadequacies of the context information incapable of utilizing the semantic hierarchy, and further improve the speech recognition accuracy.

In one embodiment, the method further comprises obtaining a signal to be recognized, inputting the signal to be recognized into a trained speech recognition model, obtaining speech features output by an encoder in the speech recognition model, and outputting speech recognition results based on the speech features by a classifier in the speech recognition model.

The signal to be recognized is a voice signal to be subjected to voice recognition by the method provided by the embodiment of the application. The signal to be recognized may be a voice signal received in a voice interaction scenario, such as a virtual robot interaction scenario, an intelligent device control scenario, a machine translation scenario, a text conversion scenario of a voice message, etc.

In one embodiment, the computer device obtains a signal to be recognized, inputs the signal to be recognized into a trained speech recognition model, obtains speech features output by an encoder in the speech recognition model, and outputs a speech recognition result based on the speech features by a classifier in the speech recognition model, where the speech recognition result may be a phoneme or a word corresponding to the signal to be recognized.

In this embodiment, since the trained speech recognition model can perform speech recognition by using context information of semantic hierarchy, the speech recognition accuracy can be improved.

In one embodiment, referring to fig. 8, a method for processing a speech recognition model is provided, comprising the steps of:

Step S802, obtaining a non-labeling sample signal, inputting the non-labeling sample signal into an initial encoder to obtain a voice vector sequence corresponding to the non-labeling sample signal output by a feature extraction network in the initial encoder, performing quantization operation on the voice vector sequence to obtain a voice quantization vector sequence, sequentially masking the voice vectors in the voice vector sequence from the first voice vector of the voice vector sequence, sequentially inputting the masked voice vector sequence into a voice context network of the initial encoder to obtain a context voice feature output by the voice context network as a voice feature corresponding to the non-labeling sample signal, constructing self-supervision training loss based on the difference between the voice quantization vector sequence and the voice feature corresponding to the non-labeling sample signal, and returning to the step of obtaining the non-labeling sample signal to continue training until training is finished after updating network parameters of the initial encoder according to the self-supervision training loss, thereby obtaining the pre-training encoder.

Step 804, obtaining a sample signal and a corresponding labeling character sequence, inputting the sample signal into a pre-trained encoder in a speech recognition model to obtain a speech vector sequence corresponding to the sample signal output by a feature extraction network in the encoder, randomly masking speech vectors in the speech vector sequence, inputting the masked speech vector sequence into a speech context network to obtain context speech features output by the speech context network as speech features corresponding to the sample signal, and outputting a first predicted character sequence based on the speech features through a classifier connected with the encoder in the speech recognition model.

Step 806, inputting the forward character sequence corresponding to the labeling character sequence into a decoder, wherein the forward character sequence is generated based on the previous character of each character in the labeling character sequence, converting the forward character sequence into the corresponding forward character vector sequence through a vectorization layer of the decoder in the decoder, inputting the forward character vector sequence into a semantic context network, calculating context semantic features corresponding to the forward character sequence as semantic features corresponding to the forward character sequence through the semantic context network based on the forward character vector sequence, calculating to obtain voice semantic joint features corresponding to sample signals based on the semantic features and the voice features corresponding to the forward character sequence through the voice semantic context network, inputting the voice semantic joint features into a classifier of the decoder, and outputting a second predicted character sequence corresponding to the sample signals based on the voice semantic joint features through the classifier.

Step 808, constructing speech recognition loss based on the difference between the labeling character sequence and the first predicted character sequence, constructing semantic recognition loss based on the difference between the labeling character sequence and the second predicted character sequence, weighting and summing the speech recognition loss and the semantic recognition loss according to a preset loss weighting coefficient to obtain target loss, performing supervised training on the decoder and the classifier of the speech recognition model according to the target loss, and performing supervised training on the decoder and the speech recognition model according to the target loss when the stop condition of the supervised training is satisfied.

For example, referring to FIG. 9, FIG. 9 is a schematic diagram of training a speech recognition model with the aid of a decoder in one embodiment. The computer device divides the sample signal into voice signals of t1 time period to t5 time period, inputs the sample signal into an encoder, and obtains a voice vector sequence [ z1 z2 z3 z4 z5] output by a feature extraction network in the encoder. The computer device performs random masking processing on the voice vectors in the voice vector sequence [ z1 z2 z3 z4 z5] to obtain a masked voice vector sequence [ z2×z4 ]. The computer device inputs the masked speech vector sequence [ [ z2 ] z4 ] into the self-attention-based speech context network, calculates similarity s1, s2, s3, s4, s5 between each speech vector in the masked speech vector sequence [ [ z2 ] z4 ] and the masked speech vector sequence [ [ z2 ] z4 ] respectively through the self-attention network in the self-attention-based speech context network, and normalizes the similarity s1, s2, s3, s4, s5 to obtain self-attention p1, p2, p3, p4, p5. The computer device inputs each speech vector in the self-attention p1, p2, p3, p4, p5 and the masked speech vector sequence [ [ z2 ] z4 ] into a feedforward neural network in the self-attention-based speech context network for encoding, and obtains a context speech feature [ c1 c2 c3 c4 c5] output by the feedforward neural network. The computer device inputs the forward character sequence "/x2x3x4x5" into a decoder, and converts the forward character sequence "/x2x3x4x5" into a corresponding forward character vector sequence [ e1 e2 e3 e4 e5] through a vectorization layer of the decoder. The computer equipment inputs the forward character vector sequence [ e1 e2 e 3e 4 e5] into a semantic context network based on self-attention of the decoder, calculates the similarity s1, s2, s3, s4 and s5 between each forward character vector e1, e2, e3, e4 and e5 and the forward character vector sequence [ e1 e2 e 3e 4 e5] through the semantic context network, normalizes the similarity s1, s2, s3, s4 and s5, and obtains the self-attention o1, o2, o3, o4 and o5 of each forward character vector in the forward character vector sequence as the context semantic feature of the forward character vector sequence [ e1 e2 e 3e 4 e5 ]. The computer equipment inputs self-attentions o1, o2, o3, o4, o5 and voice characteristics [ c 1c 2 c3 c 4c 5] extracted by the encoder into a voice semantic context network based on cross attentions of the decoder, calculates similarity s1, s2, s3, s4 and s5 between the respective attentions and the voice characteristics respectively through the cross attentions network in the voice semantic context network, normalizes the similarity s1, s2, s3, s4 and s5 to obtain cross attentions u1, o2, o3, o4 and o5 in the voice characteristics [ c 1c 2 c3 c 4c 5], u2, u3, u4, u5. The computer equipment encodes the feedforward neural network in the voice semantic context network input by the cross attentions u1, u2, u3, u4 and u5 to obtain the voice semantic joint feature [ r1 r2 r3 r4 r5] output by the feedforward neural network.

The computer equipment inputs the voice semantic joint feature [ r1 r2 r3 r4 r5] into a classifier of the decoder, and outputs a second predicted character sequence 'y 1y2y3y4y 5' corresponding to the sample signal based on the voice semantic joint feature through the classifier. And the classifier of the speech recognition model outputs a first predicted character sequence 'w 1w2w3w4w 5' corresponding to the sample signal based on the speech feature 'c 1c 2c 3c 4c 5'. Thus, the computer device may jointly train the speech recognition model with the decoder based on the speech recognition penalty calculated from the labeling character sequence "x1x2x3x4x5" and the first predicted character sequence "w1w2w3w4w5" and the semantic recognition penalty calculated from the labeling character sequence "x1x2x3x4x5" and the second predicted character sequence "y1y2y3y4y5".

In one embodiment, the self-attention based speech context network may have M layers, each of which in turn comprises a structure of Multi-head Self Attention (Multi-head self-attention), add (summing operation), norm (normalization operation), feed Forward (Feed-Forward neural network), add (summing operation), norm (normalization operation). M may take on a value of 12.

In one embodiment, the decoder may specifically include Embedding Layer (vectorization layer), N intermediate coding layers connected to the vectorization layer, where the N intermediate coding layers may include a self-attention-based semantic context network, a cross-attention-based speech semantic context network. The decoder may further include a classifier connected to the intermediate coding layer of the N layers. The specific structure of the semantic context network based on Self-Attention sequentially comprises Multi-head Self-Attention, add and Norm. The specific structure of the Cross-Attention-based speech semantic context network sequentially comprises Multi-head Cross-Attention, add, norm, feed Forward, add and Norm. N may take a value of 6.

According to the processing method of the voice recognition model, the sample signal is input into the voice recognition model to obtain the voice feature corresponding to the sample signal, the first prediction character sequence output based on the voice feature is input into the decoder, the forward character sequence corresponding to the labeling character sequence is input into the decoder, the voice feature is decoded according to the semantic feature corresponding to the forward character sequence in the decoder to obtain the voice semantic joint feature corresponding to the sample signal, and the forward character sequence is generated based on the previous character of each character in the labeling character sequence, so that the voice semantic joint feature obtained by decoding and encoding the voice feature output by the encoder according to the semantic feature corresponding to the forward character sequence carries context information of a semantic layer, the second prediction character sequence corresponding to the sample signal is obtained by predicting based on the voice semantic joint feature, and the voice recognition model is assisted by training according to the semantic recognition loss constructed by the second prediction character sequence and the labeling character sequence, so that the context information of the semantic layer can be distilled into the voice recognition model, thereby improving the recognition accuracy of the voice recognition model.

In order to verify the effect produced by the scheme provided by the embodiment of the application, a test was carried out by a comparative experiment. The test adopts two training modes for the voice recognition model, one is to jointly train the voice recognition model and the decoder (hereinafter referred to as a joint training mode), and the other is to train the voice recognition model alone (hereinafter referred to as a single training mode). Two specific implementations of the training mode will now be described.

For the joint training mode, the computer device performs self-supervision pre-training on the encoder of the speech recognition model, and the pre-training step of the encoder refers to the step S802, which is not described herein again. After obtaining the pre-trained encoder, the computer equipment obtains a sample signal and a corresponding labeling character sequence, inputs the sample signal into the pre-trained encoder in a voice recognition model to obtain a voice vector sequence corresponding to the sample signal output by a feature extraction network in the encoder, performs random masking processing on voice vectors in the voice vector sequence, inputs the masked voice vector sequence into a voice context network to obtain context voice features output by the voice context network as voice features corresponding to the sample signal, and outputs a first predicted character sequence based on the voice features through a classifier connected with the encoder in the voice recognition model. The method comprises the steps of inputting a forward character sequence corresponding to a labeling character sequence into a decoder by computer equipment, wherein the forward character sequence is generated based on the previous character of each character in the labeling character sequence, converting the forward character sequence into a corresponding forward character vector sequence through a vectorization layer of the decoder, inputting the forward character vector sequence into a semantic context network, calculating context semantic features corresponding to the forward character sequence as semantic features corresponding to the forward character sequence through the semantic context network based on the forward character vector sequence, calculating to obtain voice semantic joint features corresponding to sample signals based on the semantic features and the voice features corresponding to the forward character sequence through the voice semantic context network, inputting the voice semantic joint features into a classifier of the decoder, outputting a second prediction character sequence corresponding to a sample signal through the classifier based on the voice semantic joint features, constructing a voice recognition loss based on the difference between the labeling character sequence and the first prediction character sequence, constructing a semantic recognition loss based on the difference between the labeling character sequence and the second prediction character sequence, weighting recognition and the target recognition loss according to a preset weighting coefficient, carrying out a training condition for the weighted recognition and the target recognition loss, and the target recognition loss is obtained when the target recognition and the target loss and the target recognition loss are satisfied, and the target loss is supervised and the target recognition loss is obtained and the target is judged and the model is judged.

For the independent training mode, the computer device performs self-supervision pre-training on the encoder of the speech recognition model, and the pre-training step of the encoder refers to the step S802, which is not described herein again. After obtaining a pre-trained encoder, a computer device obtains a sample signal and a corresponding labeling character sequence, inputs the sample signal into the pre-trained encoder in a voice recognition model to obtain a voice vector sequence corresponding to the sample signal output by a feature extraction network in the encoder, performs random masking processing on voice vectors in the voice vector sequence, inputs the masked voice vector sequence into a voice context network to obtain a context voice feature output by the voice context network as a voice feature corresponding to the sample signal, outputs a first prediction character sequence based on the voice feature through a classifier connected with the encoder in the voice recognition model, builds voice recognition loss based on the difference between the labeling character sequence and the first prediction character sequence, performs supervise training on the classifier of the voice recognition model according to the voice recognition loss, and performs supervise training on the encoder and the classifier of the voice recognition model according to the voice recognition loss when a supervise training stop condition is met.

The self-supervision training data adopted by the two training modes are librispeech data of 960 hours, the adopted supervision training data are open-source Chinese voice recognition data sets Aishell-1, the Aishell-1 data sets comprise training sets, verification sets and test sets, the number of test strips of the Aishell-1 training sets is 120098, the number of test strips of the Aishell-1 verification sets is 14326, and the number of test strips of the Aishell-1 test sets is 7176. The feature dimension of both the decoder and the encoder is 768. For the joint training method, the loss weighting coefficient of the speech recognition loss is 0.3, and the loss weighting coefficient of the semantic recognition loss is 0.7. The M-layer value of the self-attention based speech context network is 12 and the N-layer value of the decoder is 6.

The speech recognition model obtained by training the combined training mode and the independent training mode respectively is tested, and the obtained test result is shown in fig. 10. It can be seen that, compared with the speech recognition model obtained by training in the independent training mode, the speech recognition model obtained by training in the combined training mode has significantly reduced word error rate, that is, the performance of the model can be significantly improved in the combined training mode.

In one embodiment, as shown in fig. 11, a voice recognition method is provided, and this embodiment is mainly applied to the computer device (the terminal 102 or the server 104) in fig. 1 to exemplify the method, and includes the following steps:

In step S1102, a signal to be identified is acquired.

And step S1104, inputting the signal to be recognized into a trained voice recognition model to obtain voice characteristics output by an encoder in the voice recognition model and voice recognition results output by a classifier in the voice recognition model based on the voice characteristics, wherein the voice recognition model and the decoder are obtained by joint training based on voice recognition loss and semantic recognition loss, the voice recognition loss is calculated according to a first predicted character sequence and a labeling character sequence corresponding to a sample signal, the semantic recognition loss is calculated according to a second predicted character sequence and the labeling character sequence, the first predicted character sequence is obtained after classification based on the voice characteristics output by the encoder, the second predicted character sequence is obtained by predicting voice semantic joint characteristics obtained by decoding voice characteristics by using semantic characteristics corresponding to the forward character sequence corresponding to the labeling character sequence, and the forward character sequence is generated based on the previous character of each character in the labeling character sequence.

Reference may be made to the above embodiments for the training manner of the speech recognition model, and details are not repeated here.

In the above voice recognition method, the signal to be recognized is input into the trained voice recognition model to obtain the voice feature output by the encoder in the voice recognition model and the voice recognition result output by the classifier in the voice recognition model based on the voice feature, and the trained voice recognition model can utilize the context information of the semantic hierarchy to perform voice recognition, so that the voice recognition accuracy can be improved.

It should be understood that, although the steps in the flowcharts of fig. 3, 8, and 11 are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps of fig. 3, 8, 11 may include steps or stages that are not necessarily performed at the same time, but may be performed at different times, nor do the order in which the steps or stages are performed necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the steps or stages in other steps.

In one embodiment, as shown in fig. 12, a processing apparatus of a speech recognition model is provided, which may use a software module or a hardware module, or a combination of both, as a part of a computer device, and specifically includes an acquisition module 1202, an encoding module 1204, an input module 1206, a decoding module 1208, and a training module 1210, where:

an obtaining module 1202, configured to obtain a sample signal and a corresponding labeling character sequence;

The encoding module 1204 is configured to input the sample signal into a speech recognition model, obtain a speech feature corresponding to the sample signal, and output a first predicted character sequence based on the speech feature;

An input module 1206 for inputting a forward character sequence corresponding to the labeling character sequence into the decoder, the forward character sequence being generated based on a previous character of each character in the labeling character sequence;

The decoding module 1208 is configured to decode, in the decoder, the speech feature according to the semantic feature corresponding to the forward character sequence, obtain a speech semantic joint feature corresponding to the sample signal, and predict based on the speech semantic joint feature, so as to obtain a second predicted character sequence corresponding to the sample signal;

The training module 1210 is configured to jointly train the speech recognition model and the decoder based on the speech recognition penalty calculated from the labeling character sequence and the first predicted character sequence, and the semantic recognition penalty calculated from the labeling character sequence and the second predicted character sequence.

In one embodiment, the encoding module 1204 is further configured to input the sample signal into a speech recognition model, output speech features corresponding to the sample signal through an encoder of the speech recognition model, and output a first predicted character sequence based on the speech features through a classifier coupled to the encoder in the speech recognition model.

In one embodiment, the encoder comprises a feature extraction network and a self-attention-based voice context network, the encoding module 1204 is further configured to input the sample signal into the encoder to obtain a voice vector sequence corresponding to the sample signal output by the feature extraction network in the encoder, perform random masking processing on voice vectors in the voice vector sequence, and input the masked voice vector sequence into the voice context network to obtain a context voice feature output by the voice context network as a voice feature corresponding to the sample signal.

In one embodiment, the decoder comprises a vectorization layer, a self-attention-based semantic context network and a cross-attention-based speech semantic context network, the decoding module 1208 is further configured to convert the forward character sequence into a corresponding forward character vector sequence through the vectorization layer of the decoder, input the forward character vector sequence into the semantic context network, calculate context semantic features corresponding to the forward character sequence as semantic features corresponding to the forward character sequence based on the forward character vector sequence through the semantic context network, and calculate speech semantic joint features corresponding to the sample signal based on the semantic features and the speech features corresponding to the forward character sequence through the speech semantic context network.

In one embodiment, the decoding module 1208 is further configured to input the speech semantic joint feature into a classifier of the decoder, and output a second predicted character sequence corresponding to the sample signal based on the speech semantic joint feature through the classifier.

In one embodiment, the speech recognition model comprises an encoder and a classifier connected to the encoder, the encoder is a pre-trained encoder obtained by self-supervised training using unlabeled sample signals, the training module 1210 is further configured to supervise the decoder and the classifier of the speech recognition model based on speech recognition loss and semantic recognition loss, and to supervise the decoder and the speech recognition model based on speech recognition loss and semantic recognition loss when a supervised training stop condition is satisfied.

In one embodiment, the encoder is a pre-trained encoder obtained by performing self-supervision training by using unlabeled sample signals, the voice recognition model further comprises a pre-training module 1210, the pre-training module 1210 is used for obtaining unlabeled sample signals, inputting the unlabeled sample signals into the initial encoder to obtain a voice vector sequence corresponding to the unlabeled sample signals output by a feature extraction network in the initial encoder, performing quantization operation on the voice vector sequence to obtain a voice quantized vector sequence, randomly masking voice vectors in the voice vector sequence, determining masked voice vectors, inputting the masked voice vector sequence into a voice context network of the initial encoder to obtain a predicted voice vector corresponding to the masked voice vectors output by the voice context network, constructing self-supervision training loss based on differences between the voice quantized vectors corresponding to the masked voice vectors in the voice quantized vector sequence and the predicted voice vectors, and returning to the step of obtaining the unlabeled sample signals to continue training after updating network parameters of the initial encoder according to the self-supervision training loss until training is finished.

In one embodiment, the training module 1210 is further configured to construct a speech recognition penalty based on the difference between the labeling character sequence and the first predicted character sequence, construct a semantic recognition penalty based on the difference between the labeling character sequence and the second predicted character sequence, weight sum the speech recognition penalty and the semantic recognition penalty according to a preset penalty weighting coefficient to obtain a target penalty, and jointly train the speech recognition model and the decoder according to the target penalty.

In one embodiment, the processing device of the voice recognition model further comprises a voice recognition module, wherein the voice recognition module is used for acquiring a signal to be recognized, inputting the signal to be recognized into the trained voice recognition model to obtain voice characteristics output by an encoder in the voice recognition model, and outputting a voice recognition result based on the voice characteristics by a classifier in the voice recognition model.

For specific limitations on the processing means of the speech recognition model, reference may be made to the above limitations on the processing method of the speech recognition model, and no further description is given here. The respective modules in the processing means of the speech recognition model described above may be implemented wholly or partly by software, hardware or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

In the processing device of the speech recognition model, the sample signal is input into the speech recognition model to obtain the speech feature corresponding to the sample signal, the first prediction character sequence output based on the speech feature is input into the decoder, the forward character sequence corresponding to the labeling character sequence is input into the decoder, the speech feature is decoded according to the semantic feature corresponding to the forward character sequence in the decoder to obtain the speech semantic joint feature corresponding to the sample signal, and the forward character sequence is generated based on the previous character of each character in the labeling character sequence, so that the speech semantic joint feature obtained by decoding and encoding the speech feature output by the encoder according to the semantic feature corresponding to the forward character sequence carries context information of the semantic layer, the second prediction character sequence corresponding to the sample signal is obtained based on the speech semantic joint feature, and the speech recognition model is assisted to train according to the semantic recognition loss constructed by the second prediction character sequence and the labeling character sequence, so that the context information of the semantic layer can be distilled into the speech recognition model, thereby improving the recognition accuracy of the speech recognition model.

In one embodiment, as shown in fig. 13, a speech recognition apparatus is provided, which may employ a software module or a hardware module, or a combination of both, as part of a computer device, and specifically includes an acquisition module 1302 and a speech recognition module 1304, where:

an acquisition module 1302, configured to acquire a signal to be identified;

A speech recognition module 1304 for inputting the signal to be recognized into the trained speech recognition model to obtain speech features output by the encoder in the speech recognition model, and speech recognition results output by the classifier in the speech recognition model based on the speech features;

The voice recognition model and the decoder are obtained based on joint training of voice recognition loss and semantic recognition loss, the voice recognition loss is calculated according to a labeling character sequence corresponding to a first predicted character sequence and a sample signal, the semantic recognition loss is calculated according to a second predicted character sequence and a labeling character sequence, the first predicted character sequence is obtained after classification based on voice features output by the encoder, the second predicted character sequence is obtained by predicting voice semantic joint features obtained by decoding voice features through the decoder by using semantic features corresponding to a forward character sequence corresponding to the labeling character sequence, and the forward character sequence is generated based on previous characters of all characters in the labeling character sequence.

For specific limitations of the speech recognition device, reference may be made to the above limitations of the speech recognition method, and no further description is given here. The various modules in the speech recognition device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

In the above-mentioned speech recognition device, the signal to be recognized is input into the trained speech recognition model to obtain the speech feature output by the encoder in the speech recognition model and the speech recognition result output by the classifier in the speech recognition model based on the speech feature, and since the trained speech recognition model can perform speech recognition by using the context information of the semantic hierarchy, the speech recognition accuracy can be improved.

In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 14. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing processing data and/or image generation data of the speech recognition model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of processing a speech recognition model and/or a method of speech recognition.

In one embodiment, a computer device is provided, which may be a terminal or a face collection device, and the internal structure of the computer device may be as shown in fig. 15. The computer device comprises a processor, a memory, a communication interface and a voice acquisition device which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of processing a speech recognition model and/or a method of speech recognition.

It will be appreciated by those skilled in the art that the structures shown in fig. 14 and 15 are merely block diagrams of portions of structures associated with aspects of the present application and are not intended to limit the computer apparatus to which aspects of the present application may be applied, and that a particular computer apparatus may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.

In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.

In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the steps of the method embodiments described above.

In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in various forms such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), etc.

The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A method for processing a speech recognition model, characterized in that the method comprises:

Obtaining sample signals and corresponding labeled character sequences;

Inputting the sample signal into a speech recognition model to obtain speech features corresponding to the sample signal and a first predicted character sequence output based on the speech features;

Inputting a forward character sequence corresponding to the marked character sequence into a decoder, wherein the forward character sequence is generated based on the previous character of each character in the marked character sequence;

In the decoder, the speech feature is decoded according to the semantic feature corresponding to the forward character sequence to obtain the speech and semantic joint feature corresponding to the sample signal, and prediction is performed based on the speech and semantic joint feature to obtain a second predicted character sequence corresponding to the sample signal;

The speech recognition model and the decoder are jointly trained based on the speech recognition loss calculated based on the labeled character sequence and the first predicted character sequence, and the semantic recognition loss calculated based on the labeled character sequence and the second predicted character sequence.

2. The method according to claim 1, characterized in that the step of inputting the sample signal into a speech recognition model to obtain speech features corresponding to the sample signal and outputting a first predicted character sequence based on the speech features comprises:

Inputting the sample signal into the speech recognition model;

Outputting speech features corresponding to the sample signal through an encoder of the speech recognition model;

The first predicted character sequence is output based on the speech features through a classifier connected to the encoder in the speech recognition model.

3. The method according to claim 2, characterized in that the encoder includes a feature extraction network and a speech context network based on self-attention;

The outputting of the speech features corresponding to the sample signal by the encoder of the speech recognition model includes:

Inputting the sample signal into the encoder to obtain a speech vector sequence corresponding to the sample signal output by a feature extraction network in the encoder;

Performing random masking processing on the speech vectors in the speech vector sequence;

The masked speech vector sequence is input into the speech context network to obtain context speech features output by the speech context network as speech features corresponding to the sample signal.

4. The method according to claim 1, characterized in that the decoder comprises a vectorization layer, a semantic context network based on self-attention and a speech semantic context network based on cross-attention;

The decoding of the speech feature according to the semantic feature corresponding to the forward character sequence to obtain the speech and semantic joint feature corresponding to the sample signal includes:

The forward character sequence is converted into a corresponding forward character vector sequence through the vectorization layer of the decoder, and the forward character vector sequence is input into the semantic context network;

By means of the semantic context network, based on the forward character vector sequence, a contextual semantic feature corresponding to the forward character sequence is calculated as a semantic feature corresponding to the forward character sequence;

The speech semantic context network is used to calculate the speech semantic joint feature corresponding to the sample signal based on the semantic feature corresponding to the forward character sequence and the speech feature.

5. The method according to claim 4, characterized in that the step of performing prediction based on the speech-semantic joint feature to obtain a second predicted character sequence corresponding to the sample signal comprises:

Inputting the speech-semantic joint feature into the classifier of the decoder;

The classifier outputs a second predicted character sequence corresponding to the sample signal based on the speech-semantic joint feature.

6. The method according to claim 1, characterized in that the speech recognition model comprises an encoder and a classifier connected to the encoder; the encoder is a pre-trained encoder obtained by self-supervised training using unlabeled sample signals;

The method of jointly training the speech recognition model and the decoder based on the speech recognition loss calculated according to the labeled character sequence and the first predicted character sequence, and the semantic recognition loss calculated according to the labeled character sequence and the second predicted character sequence, comprises:

Performing supervised training on the decoder and the classifier of the speech recognition model according to the speech recognition loss and the semantic recognition loss;

When the supervised training stop condition is met, supervised training is performed on the decoder and the speech recognition model according to the speech recognition loss and the semantic recognition loss.

7. The method according to claim 2, 3 or 6, characterized in that the encoder is a pre-trained encoder obtained by self-supervised training using unlabeled sample signals;

The method further comprises:

Acquire the unlabeled sample signal;

Inputting the unlabeled sample signal into an initial encoder to obtain a speech vector sequence corresponding to the unlabeled sample signal output by a feature extraction network in the initial encoder;

Performing a quantization operation on the speech vector sequence to obtain a speech quantization vector sequence;

After performing random masking processing on the speech vectors in the speech vector sequence, a masked speech vector is determined;

Inputting the masked speech vector sequence into the speech context network of the initial encoder to obtain a predicted speech vector corresponding to the masked speech vector output by the speech context network;

constructing a self-supervised training loss based on a difference between a speech quantization vector in the speech quantization vector sequence corresponding to the masked speech vector and the predicted speech vector;

After updating the network parameters of the initial encoder according to the self-supervised training loss, returning to the step of obtaining the unlabeled sample signal to continue training until the training is completed, thereby obtaining the pre-trained encoder.

8. The method according to claim 1, characterized in that the joint training of the speech recognition model and the decoder based on the speech recognition loss calculated according to the labeled character sequence and the first predicted character sequence, and the semantic recognition loss calculated according to the labeled character sequence and the second predicted character sequence, comprises:

constructing the speech recognition loss based on the difference between the labeled character sequence and the first predicted character sequence;

constructing a semantic recognition loss based on the difference between the labeled character sequence and the second predicted character sequence;

The speech recognition loss and the semantic recognition loss are weighted and summed according to a preset loss weighting coefficient to obtain a target loss;

The speech recognition model and the decoder are jointly trained according to the target loss.

9. The method according to claim 1, characterized in that the method further comprises:

Acquire a signal to be identified;

The signal to be recognized is input into a trained speech recognition model to obtain speech features output by an encoder in the speech recognition model, and speech recognition results output by a classifier in the speech recognition model based on the speech features.

10. A speech recognition method, characterized in that the method comprises:

Acquire a signal to be identified;

Inputting the signal to be recognized into a trained speech recognition model to obtain speech features output by an encoder in the speech recognition model, and speech recognition results output by a classifier in the speech recognition model based on the speech features;

Among them, the speech recognition model and decoder are obtained by joint training based on speech recognition loss and semantic recognition loss, the speech recognition loss is calculated based on a first predicted character sequence and a labeled character sequence corresponding to a sample signal, and the semantic recognition loss is calculated based on a second predicted character sequence and the labeled character sequence. The first predicted character sequence is obtained after classification based on the speech features output by the encoder, and the second predicted character sequence is obtained by predicting the speech and semantic joint features obtained by decoding the speech features using the semantic features corresponding to the forward character sequence corresponding to the labeled character sequence by the decoder, and the forward character sequence is generated based on the previous character of each character in the labeled character sequence.

11. A processing device for a speech recognition model, characterized in that the device comprises:

An acquisition module, used to acquire a sample signal and a corresponding labeled character sequence;

An encoding module, used for inputting the sample signal into a speech recognition model to obtain speech features corresponding to the sample signal and a first predicted character sequence output based on the speech features;

An input module, used for inputting a forward character sequence corresponding to the marked character sequence into a decoder, wherein the forward character sequence is generated based on the previous character of each character in the marked character sequence;

A decoding module, configured to decode the speech feature in the decoder according to the semantic feature corresponding to the forward character sequence, obtain the speech and semantic joint feature corresponding to the sample signal, and make a prediction based on the speech and semantic joint feature to obtain a second predicted character sequence corresponding to the sample signal;

A training module is used to jointly train the speech recognition model and the decoder based on the speech recognition loss calculated according to the labeled character sequence and the first predicted character sequence, and the semantic recognition loss calculated according to the labeled character sequence and the second predicted character sequence.

12. The device according to claim 11 is characterized in that the encoding module is also used to: input the sample signal into the speech recognition model; output the speech features corresponding to the sample signal through the encoder of the speech recognition model; and output the first predicted character sequence based on the speech features through a classifier connected to the encoder in the speech recognition model.

13. The device according to claim 12 is characterized in that the encoder includes a feature extraction network and a speech context network based on self-attention; the encoding module is also used to: input the sample signal into the encoder to obtain a speech vector sequence corresponding to the sample signal output by the feature extraction network in the encoder; perform random masking on the speech vectors in the speech vector sequence; input the masked speech vector sequence into the speech context network to obtain contextual speech features output by the speech context network as speech features corresponding to the sample signal.

14. The device according to claim 11 is characterized in that the decoder includes a vectorization layer, a semantic context network based on self-attention and a speech semantic context network based on cross-attention; the decoding module is also used to: convert the forward character sequence into a corresponding forward character vector sequence through the vectorization layer of the decoder, and input the forward character vector sequence into the semantic context network; calculate the contextual semantic features corresponding to the forward character sequence based on the forward character vector sequence through the semantic context network as the semantic features corresponding to the forward character sequence; calculate the speech semantic joint features corresponding to the sample signal based on the semantic features corresponding to the forward character sequence and the speech features through the speech semantic context network.

15. The device according to claim 14 is characterized in that the decoding module is also used to: input the joint feature of speech and semantics into the classifier of the decoder; and output a second predicted character sequence corresponding to the sample signal based on the joint feature of speech and semantics through the classifier.

16. The device according to claim 11 is characterized in that the speech recognition model includes an encoder and a classifier connected to the encoder; the encoder is a pre-trained encoder obtained by self-supervised training using unlabeled sample signals; the training module is also used to: perform supervised training on the decoder and the classifier of the speech recognition model according to the speech recognition loss and the semantic recognition loss; when the supervised training stop condition is met, perform supervised training on the decoder and the speech recognition model according to the speech recognition loss and the semantic recognition loss.

17. The device according to claim 12, 13 or 16, characterized in that the encoder is a pre-trained encoder obtained by self-supervised training using an unlabeled sample signal; the speech recognition model also includes a pre-training module, which is used to: obtain the unlabeled sample signal; input the unlabeled sample signal into the initial encoder to obtain a speech vector sequence corresponding to the unlabeled sample signal output by the feature extraction network in the initial encoder; perform a quantization operation on the speech vector sequence to obtain a speech quantization vector sequence; after randomly masking the speech vectors in the speech vector sequence, determine the masked speech vector; input the masked speech vector sequence into the speech context network of the initial encoder to obtain a predicted speech vector corresponding to the masked speech vector output by the speech context network; construct a self-supervised training loss based on the difference between the speech quantization vector corresponding to the masked speech vector in the speech quantization vector sequence and the predicted speech vector; after updating the network parameters of the initial encoder according to the self-supervised training loss, return to the step of obtaining the unlabeled sample signal to continue training until the training is completed, and obtain the pre-trained encoder.

18. The device according to claim 11 is characterized in that the training module is also used to: construct the speech recognition loss based on the difference between the labeled character sequence and the first predicted character sequence; construct the semantic recognition loss based on the difference between the labeled character sequence and the second predicted character sequence; weightedly sum the speech recognition loss and the semantic recognition loss according to a preset loss weighting coefficient to obtain a target loss; and jointly train the speech recognition model and the decoder according to the target loss.

19. The device according to claim 11 is characterized in that the processing device of the speech recognition model also includes a speech recognition module, and the speech recognition module is used to: obtain a signal to be recognized; input the signal to be recognized into a trained speech recognition model to obtain speech features output by an encoder in the speech recognition model, and a speech recognition result output by a classifier in the speech recognition model based on the speech features.

20. A speech recognition device, characterized in that the device comprises:

An acquisition module, used for acquiring a signal to be identified;

A speech recognition module, used for inputting the signal to be recognized into a trained speech recognition model, obtaining speech features output by an encoder in the speech recognition model, and a speech recognition result output by a classifier in the speech recognition model based on the speech features;

21. A computer device comprising a memory and a processor, wherein the memory stores a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 10 when executing the computer program.

22. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 10 are implemented.

23. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 10 are implemented.