
CN120564706A - Voice recognition method, device, storage medium, electronic device and vehicle - Google Patents

Voice recognition method, device, storage medium, electronic device and vehicle

Info

Publication number
CN120564706A
CN120564706A (application number CN202410231953.2A)
Authority
CN
China
Prior art keywords
word
candidate text
candidate
score
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410231953.2A
Other languages
Chinese (zh)
Inventor
王菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Co Wheels Technology Co Ltd
Original Assignee
Beijing Co Wheels Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Co Wheels Technology Co Ltd filed Critical Beijing Co Wheels Technology Co Ltd
Priority to CN202410231953.2A priority Critical patent/CN120564706A/en
Publication of CN120564706A publication Critical patent/CN120564706A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract


The present application relates to a speech recognition method, device, storage medium, electronic device and vehicle in the field of artificial intelligence. The method includes: performing audio parsing on the speech information to be recognized to obtain multiple candidate texts; analyzing the association between each word in a candidate text and its context to obtain a score for each word, and computing a weighted combination of the word scores to obtain the score of the candidate text; and selecting the highest-scoring target candidate text from the multiple candidate texts as the speech recognition result corresponding to the speech information. By further exploiting the information contained in each word's context, the technical solution identifies text content that better fits the current context, improving the accuracy of speech recognition.

Description

Speech recognition method, device, storage medium, electronic equipment and vehicle
Technical Field
The application relates to the technical field of artificial intelligence (AI), and in particular to a voice recognition method, a voice recognition device, a storage medium, an electronic device and a vehicle.
Background
With the development of intelligent cockpit technology in smart vehicles, voice interaction has been increasingly applied in intelligent cabins. After a user speaks a specific wake word, the in-vehicle interaction system is awakened, after which the head-unit system can be controlled by voice commands to perform operations across vertical domains such as vehicle control, navigation and media, for example opening a window, adjusting seat heating, playing music, or showing the distance to an address on a map.
At present, during speech recognition, the words corresponding to each pronunciation are analyzed sequentially in the order the user speaks, and the same pronunciation can correspond to several candidate words. When determining the word for the current state, a prediction is made based on the preceding words recognized in previous states, and a suitable word is selected from the candidate words for the current state. Recognition then proceeds to the next state until words for all states are obtained, which are combined into the final speech recognition result.
However, if a previously recognized word is wrong, the prediction accuracy for the current state suffers, which in turn reduces the accuracy of voice recognition.
Disclosure of Invention
In view of the above, the present application provides a voice recognition method, device, storage medium, electronic device and vehicle, aiming to solve the technical problem of low accuracy in prior-art voice recognition.
In a first aspect, the present application provides a speech recognition method, including:
Performing audio analysis on the voice information to be recognized to obtain a plurality of candidate texts;
Analyzing the association between each word in the candidate text and its context in the candidate text to obtain a score for each word, and computing a weighted combination of the word scores to obtain the score of the candidate text, wherein the position of a word in the candidate text determines the scoring weight for that word;
and acquiring the target candidate text with the highest score from the plurality of candidate texts as a voice recognition result corresponding to the voice information.
In a second aspect, the present application provides a speech recognition apparatus comprising:
The analysis module is configured to perform audio parsing on the voice information to be recognized to obtain a plurality of candidate texts;
the scoring module is configured to analyze the association between each word in the candidate text and its context in the candidate text to obtain a score for each word, and to compute a weighted combination of the word scores to obtain the score of the candidate text, wherein the position of a word in the candidate text determines the scoring weight for that word;
and the acquisition module is configured to acquire the target candidate text with the highest score from the plurality of candidate texts as the voice recognition result corresponding to the voice information.
In a third aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect.
In a fourth aspect, the present application provides an electronic device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, the processor implementing the method of the first aspect when executing the computer program.
In a fifth aspect, the present application provides a vehicle comprising the apparatus of the second aspect and the electronic device of the fourth aspect.
According to the technical scheme, voice information to be recognized is subjected to audio parsing to obtain a plurality of candidate texts. The association between each word in a candidate text and its context is then analyzed to obtain a score for each word, and a weighted combination of the word scores yields the score of the candidate text, where the position of a word in the candidate text determines its scoring weight. Finally, the highest-scoring target candidate text is selected from the plurality of candidate texts as the voice recognition result corresponding to the voice information. By applying this technical scheme, the candidate texts are scored and ranked according to each word's context and the association between the context and the word, so that text content more accurate in the current context can be recognized.
The foregoing is only an overview of the technical solution of the present application. In order that the technical means of the application may be more clearly understood and implemented in accordance with the specification, and to make the above and other objects, features and advantages of the application more readily apparent, specific embodiments are set forth below.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a voice recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating another speech recognition method according to an embodiment of the present application;
FIG. 3 is a flow chart of an example application provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the application may be more clearly understood, the application is described further below. It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
In order to solve the technical problem of low voice recognition accuracy in the prior art, the present embodiment provides a voice recognition method, as shown in fig. 1, including:
Step 101: performing audio parsing on the voice information to be recognized to obtain a plurality of candidate texts.
The execution body of this embodiment may be, at the hardware level, a voice recognition device or equipment, and at the software level, a voice recognition client, for example a voice recognition application installed in a vehicle or other terminal equipment.
Automatic speech recognition (Automatic Speech Recognition, ASR) is a technique for converting human speech into computer-recognizable text or commands, and mainly includes the steps of recording speech, preprocessing speech, feature extraction, speech recognition, and post-processing.
Specifically, the user's voice is first recorded with a microphone or other recording equipment. The system then preprocesses the recording, including background-noise removal, framing and feature extraction, converting each frame of the voice signal into a set of numerical features such as Mel-frequency cepstral coefficients (MFCCs) and power spectral density. A speech recognition algorithm matches these feature vectors against a known voice model to recognize the corresponding text or command, and post-processing such as spell checking and grammar analysis is applied to the result to improve accuracy. In general, speech recognition involves multiple steps, and the audio parsing stage can yield multiple candidate texts.
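As a hedged illustration of the framing step mentioned above, the following sketch splits a sampled signal into overlapping frames. The frame length and hop size here are hypothetical (roughly 25 ms and 10 ms at 16 kHz); the application does not specify concrete parameters.

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Split a 1-D sample sequence into overlapping frames.

    frame_len and hop are illustrative values, not taken from the application.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

# A toy "signal" of 1000 samples yields 4 full frames at this hop size.
frames = frame_signal(list(range(1000)))
```

Each frame would then be mapped to a feature vector (e.g. MFCCs) by a real front end.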
Step 102: analyzing the association between each word in the candidate text and its context in the candidate text to obtain the score for each word, and computing a weighted combination of the word scores to obtain the score of the candidate text.
The context of a word in the candidate text may include its preceding context plus its following context. For example, if the candidate text consists of the words a, b, c, d in order: the preceding context of word a is empty and its following context is words b, c and d; the preceding context of word b is word a and its following context is words c and d; the preceding context of word c is words a and b and its following context is word d; the preceding context of word d is words a, b and c and its following context is empty.
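The preceding/following context split described above can be sketched in a few lines of Python (a minimal illustration; the function name is hypothetical and the words are assumed unique):

```python
def word_contexts(words):
    # For each word, return (preceding context, following context),
    # matching the a, b, c, d example above.
    return {w: (words[:i], words[i + 1:]) for i, w in enumerate(words)}

ctx = word_contexts(["a", "b", "c", "d"])
```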
The association between a word and its context may be a semantic, grammatical, contextual or logical association established between the word and the words around it. For a semantic association, the words in context are linked by meaning: in the sentence "Zhang San likes to eat bananas", there is a semantic association between "likes", "eat" and "bananas", which together form the complete meaning of the sentence. For a grammatical association, the words are linked by grammatical structure: words in a sentence follow certain grammatical rules, such as the subject-predicate-object pattern, and together form a complete sentence structure. For a contextual association, the words are linked by the situation they describe: in "In winter, Li Si likes to eat ice cream", there is a contextual association between "winter", "likes", "eat" and "ice cream", which together form a context involving season, personal preference and food. For a logical association, the words are linked by logical relations: in "Because of the traffic jam, Lao Wang was late for work", the words "because", "traffic jam" and "late for work" form a logical chain of cause and effect.
According to the above, after obtaining the score for each word in the candidate text from the association between the word and its context, the word scores can be weighted and summed to obtain the score of the candidate text. The position of a word in the candidate text determines its scoring weight, and the specific weight values can be preset according to actual conditions.
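A minimal sketch of the position-weighted aggregation follows. The weight values are hypothetical; the application only states that the weights are determined by word position and preset according to actual conditions.

```python
def candidate_score(word_scores, position_weights):
    """Weighted sum of per-word scores, normalized by total weight."""
    assert len(word_scores) == len(position_weights)
    total_w = sum(position_weights)
    return sum(w * s for w, s in zip(position_weights, word_scores)) / total_w

# Illustrative values: three word scores, middle position weighted double.
score = candidate_score([0.9, 0.6, 0.8], [1.0, 2.0, 1.0])
```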
For this embodiment, the candidate texts obtained in step 101 are ranked according to the probability-distribution sampling results. This embodiment can then score each candidate text based on each word's context and the association between the context and the word, re-score and re-rank the candidate texts, and determine the speech recognition result from the re-ranked result, thereby improving the accuracy of speech recognition; specifically, the process shown in step 103 may be executed.
Step 103: acquiring the target candidate text with the highest score from the plurality of candidate texts as the voice recognition result corresponding to the voice information.
According to this method, voice information is subjected to audio parsing to obtain a plurality of candidate texts; the candidate texts are then scored and ranked according to each word's context and the association between the context and the word, and the highest-scoring target candidate text is selected as the voice recognition result. By further exploiting the information contained in each word's context, text content more accurate in the current context can be recognized.
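The final selection in step 103 reduces to an argmax over candidate scores. A trivial sketch, where `score_fn` stands in for the context-based scoring and the candidate strings and scores are purely illustrative:

```python
def pick_recognition_result(candidates, score_fn):
    # Return the candidate text with the highest score.
    return max(candidates, key=score_fn)

# Toy stand-in scores, not produced by any real model.
scores = {"open the window": 0.91, "open the widow": 0.42}
best = pick_recognition_result(list(scores), scores.get)
```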
Further, as a refinement and extension of the foregoing embodiment, in order to fully describe a specific implementation procedure of the method of the present embodiment, the present embodiment provides a specific method as shown in fig. 2, where the method includes:
step 201, performing audio analysis on the voice information to be recognized to obtain a plurality of candidate texts.
Optionally, step 201 may specifically include: computing over the speech information with an acoustic model to obtain a probability distribution representing the probability of each phoneme occurring in each time frame; decoding this distribution with a decoder by searching the decoding graph for the recognition result of each time frame; and taking the recognition results of the decoding paths whose scores exceed a preset threshold as the plurality of candidate texts.
Acoustic models play a vital role in speech recognition. The acoustic model converts the preprocessed feature vectors into acoustic-model scores used to evaluate the model's ability to recognize the speech signal. Acoustic models generally include duration models based on hidden Markov models (HMMs) and acoustic models based on deep neural networks (DNNs). In the front-end processing part, the acoustic model also transforms the input voice signal via acoustic characteristics such as spectral features, excitation features, band-periodicity features and their dynamic features. In the decoding part, an encoder-decoder model structure with an attention mechanism is used to find the recognition result of each time frame by searching the TLG decoding graph. The acoustic model is an indispensable part of speech recognition: by extracting features from the input speech signal and performing acoustic modeling, accurate recognition of the speech signal is achieved.
In some examples, the acoustic model finds the recognition result of each time frame on the TLG decoding graph by a method similar to beam search. The output of the acoustic model is a probability distribution over the phonemes likely to occur in each time frame. To convert this output into the final text or command, it is decoded with a TLG decoder: the decoder samples a number of candidate sequences from the probability distribution according to a search strategy such as k-best or best-first search, evaluates and ranks each candidate sequence, and outputs the best recognition results.
In this embodiment, after finding the recognition result of each time frame by a beam-search-like method, instead of keeping only the recognition result of the single highest-scoring decoding path, the n-best recognition results are kept, that is, the results of the top-n decoding paths whose scores exceed a certain threshold, which serve as the plurality of candidate texts. Decoding with a beam-search-like method in the acoustic model improves recognition accuracy and allows longer context information to be processed, improving the overall performance of the speech recognition system.
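The n-best, beam-search-style decoding described above can be sketched over toy per-frame token probabilities. This is a simplification under stated assumptions: a real decoder searches a TLG graph, whereas this illustration works on a flat token set with log-probability scores.

```python
import math

def nbest_beam_search(frame_probs, beam_width=3, n_best=2):
    """Keep the beam_width best partial paths per frame; return the
    n_best complete paths instead of only the single top-scoring one."""
    beams = [((), 0.0)]  # (token sequence, cumulative log score)
    for probs in frame_probs:
        expanded = [(seq + (tok,), score + math.log(p))
                    for seq, score in beams
                    for tok, p in probs.items() if p > 0]
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[:n_best]

# Two toy frames with made-up token probabilities.
results = nbest_beam_search([{"na": 0.6, "la": 0.4},
                             {"vi": 0.7, "bi": 0.3}])
```

The two returned paths play the role of the multiple candidate texts handed to the rescoring step.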
Step 202: for each candidate text, use the bidirectional language model to analyze the association between each word in the candidate text and its context, obtaining a score for each word, and compute a weighted combination of the word scores to obtain the bidirectional language model's score for the candidate text.
In this embodiment, for a scene to be optimized, the voice information and corresponding corpus in that scene may be collected, for example vertical domains rich in entity words and/or encyclopedia-style scenes dominated by long text. A screening and cleaning mechanism is established for the corpus, which is then used as training text for the bidirectional language model.
Alternatively, multiple different bidirectional language models may be used, including a first bidirectional language model and a second bidirectional language model. Correspondingly, step 202 may specifically include: taking the BERT model as the base architecture of the first bidirectional language model and training it in masked fashion on the training text sequences; and taking the GPT-2 model as the base architecture of the second bidirectional language model and training it in causal fashion on the training text sequences.
Masked and causal language modeling are two common pre-training tasks. A masked language model is pre-trained by replacing some words in the text with the "[MASK]" symbol, so that the model learns to produce a reasonable completion without knowing the specific words. A causal language model is a pre-training task that respects temporal order, improving the model's generation ability by learning the sequence of events in the text. Both pre-training tasks have their own advantages and limitations and can be combined with other techniques, such as lightweight pre-trained models.
In some embodiments, the masked bidirectional language model is an important component of the BERT model, which is trained by marking certain words in the input sentence with "mask" tokens and predicting those masked words. Unlike traditional autoregressive language models, BERT adopts a Transformer architecture and a bidirectional training scheme, allowing it to better learn the internal semantics of text. During training, the BERT model tokenizes the input sentence, encodes each segment's information into vector form, and feeds the concatenated vectors through a Transformer network for deep understanding. During prediction, BERT marks certain words in the input sentence as "mask" and uses the pre-trained model to predict them. The masked bidirectional language model has broad application prospects in natural language processing and related fields. The causal bidirectional language model is another important language model, similar to the masked one, but it takes temporal order and causal relations into account during modeling and can better handle tasks that need contextual information, such as text summarization and machine translation. The causal bidirectional language model is built autoregressively: the encoder encodes the input sequence into a fixed-length vector representation, and the decoder generates the output sequence from the encoder's output. The causal form tends to work better than the masked form on long text.
In some embodiments, the masked bidirectional language model is built on the reference architecture of the BERT model, which keeps the loss function consistent.
Optionally, step 202 further includes performing multiple iterations of training based on the network structure of the GPT-2 model to obtain the second bidirectional language model. In each iteration, while learning the occurrence probability of a target word in the training text from left to right, the word order of the training text is inverted, and the occurrence probability of the target word is learned from right to left on the inverted order. The left-to-right and right-to-left learning is evaluated through a loss function, where the number of original (left-to-right) modules equals the number of inverted (right-to-left) modules and the two share the same network structure.
For example, suppose the training text X contains the words A, B, C and D. When learning the occurrence probability of word B at the current moment, it can be learned from left to right, for example learning from word A the probability of word B appearing after A. At the same time, the word order of X is inverted to give D, C, B, A, and from words D and C the probability of word B appearing after them is learned, so the occurrence probability of word B at the current moment can also be learned from right to left.
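The word-B example above can be made concrete with a small sketch that builds (context, target) training pairs in both directions; the function name is hypothetical and the tokens are the A, B, C, D of the example.

```python
def causal_pairs(tokens):
    # (preceding context, target word) pairs for left-to-right prediction.
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

text = ["A", "B", "C", "D"]
left_to_right = causal_pairs(text)                    # B learned after A
right_to_left = causal_pairs(list(reversed(text)))    # B learned after D, C
```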
With this training scheme, the occurrence probability of each word can be learned accurately from its context in the training text and the association between the context and the word. Candidate texts predicted later are then scored according to the learned probabilities: when a candidate text is scored by the bidirectional language model, it is first segmented into words, the occurrence probability of each word and its association with the context are computed, and the candidate text's score is obtained by combining the word probabilities (for example via a weighted average).
Illustratively, the GPT-2 reference architecture may be chosen when constructing the causal bidirectional language model. As shown in fig. 3, the classical GPT structure is a module that learns sequence information in one direction, from left to right. To let the model also learn information after the current moment, this embodiment designs a loss function on top of this structure and inverts the input sequence. The inverted modules match the original modules in number and network structure; for example, 12 left-to-right and 12 right-to-left modules may be used.
In some embodiments, the candidate text is input into the bidirectional language model, which segments it into words, obtains each word's context in the candidate text, and analyzes the association between each word and its context to obtain the score for each word. Note that the candidate texts may be scored by different implementations and the scores then combined (for example via a weighted average), after which the highest-scoring target candidate text is taken from the multiple candidate texts as the speech recognition result for the speech information.
Optionally, analyzing and calculating according to the association relation between each word in the candidate text and its context in the candidate text to obtain the score corresponding to each word may specifically include: converting any first word of the plurality of words into a first vector, converting the context corresponding to the first word into a second vector, calculating a target Euclidean distance between the first vector and the second vector, and obtaining, from a score list of Euclidean distances, the score corresponding to the target Euclidean distance as the score corresponding to the first word. The score list stores the scores corresponding to different Euclidean distances.
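A minimal sketch of this distance-then-lookup scoring follows. The bucket thresholds and scores in `SCORE_LIST` are invented for illustration; the embodiment's actual score list is not specified:

```python
# Hypothetical sketch: score a word by the Euclidean distance between its
# vector and its context vector, then look the distance up in a score list.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented score list: (max_distance, score) buckets — a smaller distance
# (word and context vectors closer together) maps to a higher score.
SCORE_LIST = [(0.5, 1.0), (1.0, 0.8), (2.0, 0.5), (float("inf"), 0.1)]

def score_from_distance(d):
    for max_d, score in SCORE_LIST:
        if d <= max_d:
            return score

word_vec = [0.2, 0.9]      # first vector (the first word)
context_vec = [0.3, 0.7]   # second vector (its context)
d = euclidean(word_vec, context_vec)
word_score = score_from_distance(d)
```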
In some examples, computing the association between words and contexts is an important task in natural language processing, and a commonly used method is the word vector model. A word vector model is a technique for mapping words into a vector space; by representing each word as a vector, semantic and grammatical relations between words can be better captured. In a word vector model, each word is represented as an n-dimensional vector, where n is a predefined dimension. These vectors are trained by machine learning algorithms to capture the semantic and grammatical relations between words.
For example, in calculating the association between a word and a context, the context may be represented as a vector using a word vector model, and then the vector may be compared with the vector of the target word. The comparison may be by cosine similarity or euclidean distance, etc. By comparing the similarity between the different context vectors and the target word vector, the degree of association between the word and the context can be calculated.
In addition to the word vector model, there are other methods for calculating the association relationship between words and context; different methods suit different scenarios and tasks and should be selected according to the specific situation.
Optionally, obtaining the score corresponding to each word in the candidate text may alternatively include: performing dependency syntax analysis on the candidate text to obtain the dependency relationships among the words in the candidate text; converting the candidate text into a grammar tree according to the dependency relationships, the grammar tree representing the grammatical structure of the candidate text; determining the node of any second word (i.e., any word of the plurality of words) in the grammar tree and the node of the context corresponding to the second word in the grammar tree; calculating a target cosine similarity between the node corresponding to the second word and the node corresponding to the context; and obtaining, from a score list of cosine similarities, the score corresponding to the target cosine similarity as the score corresponding to the second word. The score list stores the scores corresponding to different cosine similarities.
The process of calculating the target cosine similarity between the node corresponding to the second word and the node corresponding to the context may specifically include: obtaining the node features of the node corresponding to the second word and of the node corresponding to the context, where the node features may include the position of the node in the grammar tree, its part of speech, its vocabulary content, and other features; combining the node features of each node into a feature vector; and calculating the cosine similarity between the feature vector of the node corresponding to the second word and the feature vector of the node corresponding to the context as the target cosine similarity between the two nodes.
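The feature-vector comparison above can be sketched as follows. The numeric encoding of node features (tree depth, child count, a part-of-speech id) is invented for illustration; the embodiment does not prescribe a specific encoding:

```python
# Hypothetical sketch: build feature vectors for two grammar-tree nodes and
# compute their cosine similarity.
import math

def node_features(depth, num_children, pos_id):
    # Invented encoding of position / structure / part-of-speech features.
    return [float(depth), float(num_children), float(pos_id)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

word_node = node_features(depth=2, num_children=0, pos_id=3)  # e.g. a leaf word
ctx_node = node_features(depth=2, num_children=1, pos_id=3)   # a context node
sim = cosine_similarity(word_node, ctx_node)
```

The resulting similarity would then be looked up in the cosine-similarity score list to yield the word's score.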
In some examples, the grammar tree-based approach to computing associations between words and contexts typically uses dependency syntax analysis. Dependency syntax analysis is a natural language processing technique for analyzing the grammatical relations between the words in a sentence: each word is represented as a node, and the nodes are connected by edges that represent the grammatical relations between them. In the grammar tree-based method, dependency syntax analysis is first performed on the input sentence to obtain the dependency relationships among its words. Based on these dependencies, the entire sentence can then be represented as a tree structure. This tree structure reflects the syntactic structure of the sentence and can be used to extract context information when calculating the association between a word and its context. For example, the nodes of all words in the context may be added as child nodes to the tree structure, and the similarity between the node of the target word and the other nodes may then be calculated, for example by cosine similarity.
Further, by calculating the similarity between different nodes, the degree of association between a word and its context can be obtained. The context information represented by the node most similar to the target node typically has a greater influence on the meaning of the target word. Therefore, performing dependency syntax analysis on the candidate text makes it possible to obtain the association between words and contexts more accurately.
And 203, weighted analysis and calculation are performed on the scores given to the candidate text by different bi-directional language models, so that the score of the candidate text is obtained.
Optionally, based on the score corresponding to each word in the candidate text obtained in step 202, step 203 specifically further includes performing weighted analysis calculation on the score corresponding to each word in the candidate text to obtain the score of the candidate text, where the position of the word in the candidate text is used to determine the weight of the score corresponding to the word.
In some examples, the re-scoring component combines the mask-form bi-directional language model with the causal-form bi-directional language model. For example, when scoring a piece of text, the scores obtained from the mask-form bi-directional language model (model 1) and the causal-form bi-directional language model (model 2) may be weighted to obtain the final score:
Score = w1 × score(model 1) + w2 × score(model 2)   (Equation 1)
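The weighted combination can be sketched directly. The weight values w1 = 0.6 and w2 = 0.4 here are illustrative assumptions; the patent does not fix particular weights:

```python
# Hypothetical sketch of the weighted combination: the final score blends the
# mask-form model (model 1) and the causal-form model (model 2).
def combined_score(score_model1, score_model2, w1=0.6, w2=0.4):
    return w1 * score_model1 + w2 * score_model2

final = combined_score(score_model1=0.9, score_model2=0.7)
```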
And 204, the target candidate text with the highest score is acquired from the plurality of candidate texts as the speech recognition result corresponding to the speech information.
After the recognition results corresponding to the n decoding paths with scores greater than the preset threshold are obtained, n candidate texts are available. The candidate texts are first sorted by score to obtain the original ranking of the n candidate texts, yielding a recognition result list; the n candidate texts are then re-scored using the bi-directional language models. The n candidate texts are reordered according to the new scores, and the 1-best recognition result is output as the speech recognition result; that is, the candidate text with the highest score after re-scoring is used as the final speech recognition result.
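The n-best re-scoring and re-ranking flow can be sketched as follows. The candidate strings, first-pass scores, and rescore values are invented for illustration, and `rescore_fn` stands in for the bi-directional language models:

```python
# Hypothetical sketch of n-best re-ranking: candidates arrive with first-pass
# decoder scores, receive a new score from a re-scoring function, and the
# 1-best result after re-ranking is output.
def rerank(candidates, rescore_fn):
    """candidates: list of (text, first_pass_score). Returns the 1-best text."""
    rescored = [(text, rescore_fn(text)) for text, _ in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[0][0]

n_best = [("candidate A", 0.91), ("candidate B", 0.90), ("candidate C", 0.89)]
toy_rescore = {"candidate A": 0.4, "candidate B": 0.8, "candidate C": 0.6}
best = rerank(n_best, toy_rescore.get)
```

Here the re-scoring model overturns the first-pass ranking, which is exactly the situation in the ice-cream example below.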
For example, suppose the user speaks a query about something they want to eat, and the pronunciations of the 3 highest-scoring results are similar; the result of reordering these 3 recognition results according to the re-scoring model is shown in Table 1 below:
TABLE 1
The candidate text with the highest score after reordering is "I want to eat ice cream", and this text is output as the speech recognition result corresponding to the speech information.
In this embodiment, by establishing a re-scoring and re-ranking mechanism for the candidate texts, the accuracy of speech recognition is improved, and the needs of users in free dialogue scenarios are covered.
Compared with the prior art, this embodiment uses bi-directional language models to re-score and reorder the multiple candidate texts obtained by speech recognition, and can further exploit the information contained in the context corresponding to each word in the text, so that text content that is more accurate in the current context is recognized. This improves the accuracy of speech recognition, meets the needs of users in free dialogue scenarios, and improves the user experience.
Further, as a specific implementation of the method shown in fig. 1, the embodiment provides a voice recognition device, as shown in fig. 4, which includes an analysis module 31, a scoring module 32, and an obtaining module 33.
The parsing module 31 is configured to perform audio parsing on the voice information to be recognized to obtain a plurality of candidate texts;
The scoring module 32 is configured to perform analysis and calculation according to the association relation between each word in the candidate text and the context thereof in the candidate text to obtain a score corresponding to each word in the candidate text, and perform weighted analysis and calculation on the score corresponding to each word in the candidate text to obtain a score of the candidate text, wherein the position of the word in the candidate text is used for determining the weight of the score corresponding to the word;
And an obtaining module 33 configured to obtain a target candidate text with the highest score from the plurality of candidate texts as a speech recognition result corresponding to the speech information.
In some examples, the scoring module 32 is specifically configured to input each candidate text into different bi-directional language models, obtain a score of the candidate text from each bi-directional language model, and perform weighted analysis and calculation on the scores given to the candidate text by the different bi-directional language models to obtain the score of the candidate text.
In some examples, the scoring module 32 is specifically further configured such that the plurality of different bi-directional language models includes a first bi-directional language model and a second bi-directional language model, wherein the training process of the first bi-directional language model includes training the first bi-directional language model using a mask training manner based on a training text sequence with a BERT model as an infrastructure of the first bi-directional language model, and the training process of the second bi-directional language model includes training the second bi-directional language model using a causal training manner based on a training text sequence with a GPT-2 model as an infrastructure of the second bi-directional language model.
In some examples, the scoring module 32 is specifically further configured to perform iterative training for multiple times based on a network structure corresponding to the GPT-2 model to obtain the second bi-directional language model, for each iterative training, while learning the occurrence probability of the target word in the training text from left to right, invert the word sequence of the training text, learn the occurrence probability of the target word in the training text from right to left based on the word sequence obtained by inversion, and evaluate the index information learned from left to right and the index information learned from right to left through a loss function, where, for the number of original modules corresponding to learning from left to right is the same as the number of inversion modules corresponding to learning from right to left, and the network structure of the original modules is consistent with the network structure of the inversion modules.
In some examples, the parsing module 31 is specifically configured to calculate the speech information by using an acoustic model to obtain a probability distribution, where the probability distribution is used to represent a probability of a phoneme appearing on each time frame, identify the probability distribution by using a decoder, and search an identification result of each time frame on a decoding graph to obtain identification results corresponding to a plurality of decoding paths with scores greater than a preset threshold, where the identification results correspond to the plurality of candidate texts.
In some examples, the scoring module 32 is specifically further configured to segment the candidate text to obtain a plurality of terms, obtain the context corresponding to each term in the candidate text, perform analysis and calculation according to the association relationship between each term and its context to obtain the score corresponding to each term in the candidate text, and perform weighted analysis and calculation on the score corresponding to each term in the candidate text to obtain the score of the candidate text, where the position of a term in the candidate text is used to determine the weight of the score corresponding to the term.
In some examples, the scoring module 32 is specifically further configured to convert any first term of the plurality of terms into a first vector, convert the context corresponding to the first term into a second vector, calculate a target Euclidean distance between the first vector and the second vector, and obtain, from a score list of Euclidean distances, the score corresponding to the target Euclidean distance as the score corresponding to the first term.
In some examples, the scoring module 32 is specifically further configured to perform dependency syntax analysis on the candidate text to obtain a dependency relationship between each term in the candidate text, convert the candidate text into a syntax tree according to the dependency relationship, where the syntax tree is used to represent a syntax structure of the candidate text, determine a node of any second term in the plurality of terms in the syntax tree and a node of a context corresponding to the second term in the syntax tree, calculate a target cosine similarity between the node corresponding to the second term and the node corresponding to the context, and obtain a score corresponding to the target cosine similarity from a scoring list of cosine similarity as the score corresponding to the second term.
Based on the above-described methods shown in fig. 1 and 2, correspondingly, the present embodiment further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the above-described methods shown in fig. 1 and 2.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and includes several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to execute the method of each implementation scenario of the present application.
Based on the method shown in fig. 1 and fig. 2 and the virtual device embodiment shown in fig. 4, in order to achieve the above object, the embodiment of the present application further provides an electronic device, which may be configured on an end side of a vehicle (such as a new energy automobile), and the device includes a storage medium and a processor, where the storage medium is used to store a computer program, and the processor is used to execute the computer program to implement the method shown in fig. 1 and fig. 2.
Optionally, the entity device may further include a user interface, a network interface, a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and so on. The user interface may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), etc.
Further, the present embodiment also provides a vehicle, which may include the apparatus shown in fig. 4, or include the electronic device described above.
It will be appreciated by those skilled in the art that the physical device structure provided in this embodiment does not constitute a limitation on the physical device, which may include more or fewer components, combine certain components, or have a different arrangement of components.
The storage medium may also include an operating system, a network communication module. The operating system is a program that manages the physical device hardware and software resources described above, supporting the execution of information handling programs and other software and/or programs. The network communication module is used for realizing communication among all components in the storage medium and communication with other hardware and software in the information processing entity equipment.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general hardware platforms, or may be implemented by hardware. By applying the scheme of the embodiment, the multiple candidate texts obtained by voice recognition are re-scored and reordered by using the bidirectional language model, so that the text content which is more accurate in the current context can be recognized by further combining the information which is stored in the context corresponding to each word in the text, the accuracy of voice recognition is improved, the requirements of users in free dialogue scenes are met, and the user experience is improved.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
The foregoing is merely exemplary of embodiments of the present application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A speech recognition method, comprising:
performing audio parsing on speech information to be recognized to obtain multiple candidate texts;
analyzing and calculating the association between each word in a candidate text and its context in the candidate text to obtain a score corresponding to each word in the candidate text, and performing weighted analysis and calculation on the score corresponding to each word in the candidate text to obtain a score of the candidate text, wherein the position of a word in the candidate text is used to determine the weight of the score corresponding to the word; and
obtaining a target candidate text with the highest score from the multiple candidate texts as a speech recognition result corresponding to the speech information.
2. The method according to claim 1, wherein analyzing and calculating the association between each word in the candidate text and its context in the candidate text to obtain the score corresponding to each word in the candidate text comprises:
converting any first word among the plurality of words into a first vector, and converting a context corresponding to the first word into a second vector;
calculating a target Euclidean distance between the first vector and the second vector; and
obtaining, from a score list of Euclidean distances, a score corresponding to the target Euclidean distance as the score corresponding to the first word.
3. The method according to claim 1, wherein analyzing and calculating the association between each word in the candidate text and its context in the candidate text to obtain the score corresponding to each word in the candidate text comprises:
performing dependency syntactic analysis on the candidate text to obtain the dependency relationships between the words in the candidate text;
converting the candidate text into a syntax tree according to the dependency relationships, the syntax tree being used to represent the grammatical structure of the candidate text;
determining a node of any second word among the plurality of words in the syntax tree, and a node of a context corresponding to the second word in the syntax tree;
calculating a target cosine similarity between the node corresponding to the second word and the node corresponding to the context; and
obtaining, from a score list of cosine similarities, a score corresponding to the target cosine similarity as the score corresponding to the second word.
4. The method according to claim 1, wherein analyzing and calculating the association between each word in the candidate text and its context in the candidate text to obtain the score corresponding to each word in the candidate text, and performing weighted analysis and calculation on the scores corresponding to each word in the candidate text to obtain a score for the candidate text, comprises:
using a bidirectional language model to analyze and calculate the association between each word in the candidate text and its context in the candidate text to obtain a score corresponding to each word in the candidate text, and performing weighted analysis and calculation on the score corresponding to each word in the candidate text to obtain a score given to the candidate text by the bidirectional language model; and
performing weighted analysis and calculation on the scores given to the candidate text by different bidirectional language models to obtain the score of the candidate text.
5. The method according to claim 4, wherein the different bidirectional language models include a first bidirectional language model and a second bidirectional language model;
the training process of the first bidirectional language model includes: using the BERT model as the basic structure of the first bidirectional language model, and training the first bidirectional language model based on a training text sequence using a mask training method; and
the training process of the second bidirectional language model includes: using the GPT-2 model as the basic structure of the second bidirectional language model, and training the second bidirectional language model based on a training text sequence using a causal training method.
6. The method according to claim 5, wherein using the GPT-2 model as the basic structure of the second bidirectional language model and training the second bidirectional language model based on a training text sequence using a causal training method comprises:
performing multiple rounds of iterative training based on the network structure corresponding to the GPT-2 model to obtain the second bidirectional language model; and
for each round of iterative training, while learning the occurrence probability of a target word in the training text from left to right, reversing the word sequence of the training text, learning the occurrence probability of the target word in the training text from right to left based on the reversed word sequence, and evaluating the indicator information of the left-to-right learning and the right-to-left learning by a loss function, wherein the number of original modules corresponding to the left-to-right learning is the same as the number of reversed modules corresponding to the right-to-left learning, and the network structure of the original modules is consistent with the network structure of the reversed modules.
7. The method according to any one of claims 1 to 6, wherein performing audio parsing on the speech information to be recognized to obtain multiple candidate texts comprises:
calculating the speech information using an acoustic model to obtain a probability distribution, the probability distribution being used to represent the probability of a phoneme appearing in each time frame; and
identifying the probability distribution by a decoder, searching the recognition results of each time frame on a decoding graph, and obtaining recognition results corresponding to multiple decoding paths with scores greater than a preset threshold as the multiple candidate texts.
8. A speech recognition apparatus, comprising:
a parsing module configured to perform audio parsing on speech information to be recognized to obtain multiple candidate texts;
a scoring module configured to analyze and calculate the association between each word in a candidate text and its context in the candidate text to obtain a score corresponding to each word in the candidate text, and perform weighted analysis and calculation on the score corresponding to each word in the candidate text to obtain a score of the candidate text, wherein the position of a word in the candidate text is used to determine the weight of the score corresponding to the word; and
an obtaining module configured to obtain a target candidate text with the highest score from the multiple candidate texts as a speech recognition result corresponding to the speech information.
9. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the method according to any one of claims 1 to 7 is implemented.
10. An electronic device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 7 when executing the computer program.
11. A vehicle, comprising the apparatus according to claim 8 or the electronic device according to claim 10.
CN202410231953.2A 2024-02-29 2024-02-29 Voice recognition method, device, storage medium, electronic device and vehicle Pending CN120564706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410231953.2A CN120564706A (en) 2024-02-29 2024-02-29 Voice recognition method, device, storage medium, electronic device and vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410231953.2A CN120564706A (en) 2024-02-29 2024-02-29 Voice recognition method, device, storage medium, electronic device and vehicle

Publications (1)

Publication Number Publication Date
CN120564706A true CN120564706A (en) 2025-08-29

Family

ID=96820163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410231953.2A Pending CN120564706A (en) 2024-02-29 2024-02-29 Voice recognition method, device, storage medium, electronic device and vehicle

Country Status (1)

Country Link
CN (1) CN120564706A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination