
CN110556127B - Method, device, equipment and medium for detecting voice recognition result - Google Patents


Info

Publication number
CN110556127B
Authority
CN
China
Prior art keywords
voice
instruction
instructions
voice instruction
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910907336.9A
Other languages
Chinese (zh)
Other versions
CN110556127A (en)
Inventor
赵鑫
苏少炜
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN201910907336.9A priority Critical patent/CN110556127B/en
Publication of CN110556127A publication Critical patent/CN110556127A/en
Application granted granted Critical
Publication of CN110556127B publication Critical patent/CN110556127B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 to G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, an apparatus, a device, and a medium for detecting a speech recognition result, and belongs to the field of network technologies. The method obtains two adjacent voice instructions received by the same user equipment within a time interval smaller than a certain value, together with their respective speech recognition results, computes the similarity of the two recognition results, and determines from that similarity whether the recognition result of the earlier of the two instructions contains an error. This detection method finds erroneous speech recognition results automatically, removes the need to manually inspect every voice instruction and recognition result, and improves the efficiency of identifying speech recognition errors.

Description

Method, device, equipment and medium for detecting voice recognition result
Technical Field
The present invention relates to the field of network technologies, and in particular, to a method, an apparatus, a device, and a medium for detecting a speech recognition result.
Background
In recent years, smart voice devices such as smart speakers have attracted wide attention as effective entry points for human-computer interaction in home scenarios. During use, a smart voice device receives a user's voice instructions and processes them through several complex technologies to realize human-computer interaction. Automatic Speech Recognition (ASR), the technology that recognizes a voice instruction and converts speech information into text information, plays an important role in this processing. After speech recognition, determining whether the recognition result contains errors makes it possible to optimize the ASR model with the erroneously recognized results and thereby improve recognition accuracy.
At present, detecting speech recognition results generally involves constructing a corpus containing many voice instructions, manually labeling the text corresponding to each instruction in the corpus, and determining whether a recognition result contains an error by comparing the recognition result of the current instruction against the manual label.
In this method the manual labeling workload is heavy. Moreover, as ASR recognition keeps improving, the probability that a recognition result contains an error keeps shrinking, so with the same amount of manpower the efficiency of finding erroneous recognition results keeps falling, and a great deal of human effort is wasted.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for detecting a voice recognition result, which can solve the problem of low detection efficiency of determining the voice recognition result in the related art. The technical scheme is as follows:
in one aspect, a method for detecting a speech recognition result is provided, where the method includes:
acquiring voice instruction information, wherein the voice instruction information comprises two voice instructions received by the same user equipment and respective voice recognition results, the two voice instructions comprise a first voice instruction and a second voice instruction, and the second voice instruction is a previous voice instruction of the first voice instruction;
when the receiving time interval of the two voice instructions is smaller than a time interval threshold value, acquiring the similarity of the voice recognition results of the two voice instructions according to the voice instruction information;
and when the similarity meets the similarity condition, determining that the voice recognition result of the second voice instruction has errors.
In a possible implementation manner, when the receiving time interval of the two voice commands is smaller than the time interval threshold, according to the voice command information, obtaining the similarity of the voice recognition results of the two voice commands includes:
when the receiving time interval of the two voice instructions is smaller than a time interval threshold value, acquiring at least one of semantic difference or signal difference of the two voice instructions according to the voice instruction information;
and when at least one of the semantic difference or the signal difference meets an error condition, executing the step of acquiring the similarity.
In one possible implementation, the at least one of the semantic difference or the signal difference comprises a plurality of comparison results;
the obtaining, according to the voice instruction information, at least one of the semantic difference or the signal difference of the two voice instructions includes:
obtaining a first comparison result of the two voice instructions according to the voice instruction information;
and when the first comparison result meets the error condition, acquiring a second comparison result of the two voice commands.
In one possible implementation, the semantic difference includes a domain relationship to which semantic parsing results of the two voice commands belong, and the signal difference includes at least one of a pitch comparison result, an emotion comparison result, or a voiceprint comparison result of the two voice commands;
the at least one of semantic differences or signal differences meets an error condition, including at least one of:
the semantic analysis results of the two voice instructions belong to different fields, or the fields of the semantic analysis results of the two voice instructions belong to target fields;
the pitch comparison result indicates that the pitch of the first voice command is higher than the pitch of the second voice command;
the emotion comparison result indicates that the emotion of the first voice instruction is different from that of the second voice instruction;
the voiceprint comparison result indicates that the voiceprints of the two voice commands are the same.
In one possible implementation, the voice instruction information further includes semantic parsing results of the two voice instructions;
the obtaining, according to the voice instruction information, at least one of the semantic difference or the signal difference of the two voice instructions includes:
and acquiring the domain relation of the semantic analysis results of the two voice instructions according to the semantic analysis results of the two voice instructions.
In one possible implementation, the method further comprises:
and when any one of the comparison results does not meet the error condition, determining that the voice recognition result of the second voice instruction is correct.
In a possible implementation manner, the obtaining the similarity of the speech recognition results of the two speech commands according to the speech command information includes any one of:
acquiring an editing distance between voice recognition results of the two voice instructions, and taking the editing distance as the similarity of the voice recognition results of the two voice instructions;
calling an acoustic model, inputting the voice recognition results of the two voice commands into the acoustic model, comparing the pinyin of the two voice recognition results by the acoustic model, and outputting the similarity of the voice recognition results of the two voice commands;
and calling a language model, inputting the voice recognition results of the two voice instructions into the language model, comparing the characters of the two voice recognition results by the language model, and outputting the similarity of the voice recognition results of the two voice instructions.
In one possible implementation manner, after determining that the voice recognition result of the second voice instruction has an error, the method further includes:
and sending the second voice instruction and the voice recognition result of the second voice instruction, or sending the two voice instructions and the respective voice recognition results to target equipment, wherein the target equipment is used for prompting manual marking and detection of the received voice instruction and the voice recognition result of the voice instruction.
In one aspect, an apparatus for detecting a speech recognition result is provided, the apparatus including:
the acquisition module is used for acquiring voice instruction information, wherein the voice instruction information comprises two voice instructions received by the same user equipment and respective voice recognition results, the two voice instructions comprise a first voice instruction and a second voice instruction, and the second voice instruction is a previous voice instruction of the first voice instruction;
the acquisition module is further used for acquiring the similarity of the voice recognition results of the two voice instructions according to the voice instruction information when the receiving time interval of the two voice instructions is smaller than a time interval threshold value;
and the determining module is used for determining that the voice recognition result of the second voice instruction has errors when the similarity meets the similarity condition.
In one possible implementation, the obtaining module is further configured to:
when the receiving time interval of the two voice instructions is smaller than a time interval threshold value, acquiring at least one of semantic difference or signal difference of the two voice instructions according to the voice instruction information;
and when at least one of the semantic difference or the signal difference meets an error condition, executing the step of acquiring the similarity.
In one possible implementation, the at least one of the semantic difference or the signal difference comprises a plurality of comparison results;
the acquisition module is further configured to:
obtaining a first comparison result of the two voice instructions according to the voice instruction information;
and when the first comparison result meets the error condition, acquiring a second comparison result of the two voice commands.
In one possible implementation, the semantic difference includes a domain relationship to which semantic parsing results of the two voice commands belong, and the signal difference includes at least one of a pitch comparison result, an emotion comparison result, or a voiceprint comparison result of the two voice commands;
the at least one of semantic differences or signal differences meets an error condition, including at least one of:
the semantic analysis results of the two voice instructions belong to different fields, or the fields of the semantic analysis results of the two voice instructions belong to target fields;
the pitch comparison result indicates that the pitch of the first voice command is higher than the pitch of the second voice command;
the emotion comparison result indicates that the emotion of the first voice instruction is different from that of the second voice instruction;
the voiceprint comparison result indicates that the voiceprints of the two voice commands are the same.
In one possible implementation, the voice instruction information further includes semantic parsing results of the two voice instructions;
the acquisition module is further configured to:
and acquiring the domain relation of the semantic analysis results of the two voice instructions according to the semantic analysis results of the two voice instructions.
In one possible implementation, the determining module is further configured to:
and when any one of the comparison results does not meet the error condition, determining that the voice recognition result of the second voice instruction is correct.
In one possible implementation, the obtaining module is further configured to perform any one of:
acquiring an editing distance between voice recognition results of the two voice instructions, and taking the editing distance as the similarity of the voice recognition results of the two voice instructions;
calling an acoustic model, inputting the voice recognition results of the two voice commands into the acoustic model, comparing the pinyin of the two voice recognition results by the acoustic model, and outputting the similarity of the voice recognition results of the two voice commands;
and calling a language model, inputting the voice recognition results of the two voice instructions into the language model, comparing the characters of the two voice recognition results by the language model, and outputting the similarity of the voice recognition results of the two voice instructions.
In one possible implementation, the apparatus further includes:
and the sending module is used for sending the second voice instruction and the voice recognition result of the second voice instruction, or sending the two voice instructions and the respective voice recognition results to the target equipment, and the target equipment is used for prompting manual marking and detection of the received voice instruction and the voice recognition result of the voice instruction.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one instruction stored therein, the instruction being loaded and executed by the one or more processors to implement operations performed by the method for detecting speech recognition results.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the operation performed by the method for detecting the speech recognition result.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
according to the technical scheme provided by the embodiment of the invention, two adjacent voice instructions received by the same user equipment within a time interval smaller than a certain value, together with their respective speech recognition results, are obtained; the similarity of the two recognition results is computed, and whether the recognition result of the earlier of the two instructions contains an error is determined according to that similarity. This detection method finds erroneous speech recognition results automatically, removes the need to manually inspect every voice instruction and recognition result, and improves the efficiency of identifying speech recognition errors.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is an implementation environment of a method for detecting a speech recognition result according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for detecting a speech recognition result according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for detecting a speech recognition result according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a device for detecting a speech recognition result according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is an implementation environment of a method for detecting a speech recognition result according to an embodiment of the present invention, and referring to fig. 1, the implementation environment may include a database 110 and a computer device 120, or may be the computer device 120 alone, and the computer device 120 may be a server or a terminal.
Where the implementation environment includes a database 110 and a computer device 120, the database 110 may store different voice instructions and their respective speech recognition results. The computer device 120 may be used to obtain voice instruction information; when doing so, it may retrieve the information from the database 110 over a network.
When the implementation environment only includes the computer device 120, the computer device 120 may be configured to receive a voice instruction and process the voice instruction to obtain a corresponding voice recognition result, and may also be configured to obtain voice instruction information.
In one possible implementation, the computer device 120 may be a server serving at least one user equipment. The user equipment may be any smart voice device, such as a smart speaker. It receives a voice instruction uttered by a user and sends the instruction to the server; the server recognizes the instruction to obtain a speech recognition result, obtains the resource the user expects according to that result, and returns the resource to the user equipment, which then performs the corresponding operation and thereby gives feedback on the voice instruction.
In another possible implementation, the computer device 120 may be a terminal, which may itself be any smart voice device such as a smart speaker. The terminal receives a voice instruction uttered by a user, recognizes it to obtain a speech recognition result, obtains the resource the user expects according to that result, and performs the corresponding operation to give feedback on the voice instruction.
Fig. 2 is a flowchart of a method for detecting a speech recognition result according to an embodiment of the present invention, where the method may be applied to a server or a terminal, and the following description takes the application of the method to the server as an example, with reference to fig. 2, where the method includes:
201. the server acquires the voice instruction information.
The voice instruction information includes two voice instructions received by the same user equipment together with their respective speech recognition results; the two instructions comprise a first voice instruction and a second voice instruction, the second being the instruction received immediately before the first. A speech recognition result may be the text information corresponding to a voice instruction, obtained by performing speech recognition processing on that instruction.
In one possible implementation, the server may extract the voice instruction information from a voice instruction information repository. The voice instruction information base can store voice instructions and voice recognition results of the voice instructions, and the voice instructions and the voice recognition results can be obtained through the following steps.
In a possible implementation manner, the voice instruction information base may further include a semantic parsing result of the voice instruction, and the semantic parsing result may also be obtained through the following steps. The following describes the information acquisition sources in the voice instruction information base:
step one, a server may receive voice instructions of a plurality of user equipments, where each voice instruction may carry user equipment information, and the user equipment information may be used to indicate a user equipment to which each voice instruction belongs.
And step two, the server performs voice activity detection on the received voice instruction. The specific process may include detecting the leading and trailing endpoints of the instruction and removing non-speech segments to obtain a trimmed voice instruction.
And step three, the server performs voice recognition on the simplified voice command to obtain a corresponding voice recognition result. The voice recognition process is actually a process of converting a voice signal into text information.
And fourthly, performing semantic analysis on the voice recognition result by the server to obtain a semantic analysis result. According to the semantic analysis result, the real intention of the user for sending the voice command can be known, and how to feed back the voice command is further determined.
And step five, the server stores the voice instruction information to the voice instruction information base, wherein the voice instruction information can comprise the voice instruction and a voice recognition result corresponding to the voice instruction, and can also comprise a semantic analysis result and user equipment information carried by the voice instruction.
After the voice instruction information base adds the relevant information of the voice instruction each time, the server can extract the newly added voice instruction and the corresponding voice recognition result from the voice instruction information base, and extract the previous voice instruction of the newly added voice instruction and the corresponding voice recognition result, wherein the newly added voice instruction is the first voice instruction, and the previous voice instruction of the newly added voice instruction is the second voice instruction. Therefore, whether the voice recognition result of the second voice instruction has errors or not is judged according to the voice instruction information. In a possible implementation manner, the server may determine the second voice command according to user equipment information carried by the first voice command.
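As a minimal sketch of this pairing step (the record layout and field names below are illustrative assumptions, not taken from the patent):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class InstructionRecord:
    device_id: str    # user equipment the instruction came from
    timestamp: float  # receiving time, in seconds
    audio: bytes      # the raw voice instruction
    asr_text: str     # speech recognition result (text)
    domain: str       # domain of the semantic parsing result, e.g. "music"

def pair_with_previous(history: List[InstructionRecord],
                       new_record: InstructionRecord
                       ) -> Optional[Tuple[InstructionRecord, InstructionRecord]]:
    """Pair a newly added instruction (the 'first') with the previous
    instruction of the same user equipment (the 'second'), if any."""
    earlier = [r for r in history
               if r.device_id == new_record.device_id
               and r.timestamp < new_record.timestamp]
    if not earlier:
        return None
    second = max(earlier, key=lambda r: r.timestamp)  # most recent earlier one
    return new_record, second
```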
202. And when the receiving time interval of the two voice instructions is smaller than the time interval threshold, the server acquires at least one of semantic difference or signal difference of the two voice instructions according to the voice instruction information.
The time interval threshold may be the time required for the server to process a voice instruction and for the user equipment to give feedback. When the recognition result of a voice instruction contains an error, the user equipment may give no feedback, or may feed back the wrong resource; the user then often reissues the voice instruction, or sends the next voice instruction to obtain the expected resource without waiting for the user equipment to finish responding to the previous one. Therefore, when the interval between receiving the two voice instructions is smaller than the time interval threshold, the second voice instruction may have been recognized incorrectly, and the following detection steps are performed on the pair. The time interval threshold may be set by a technician, which is not limited in the embodiment of the present invention.
Conversely, when the receiving interval exceeds the time interval threshold, the two voice instructions lack continuity, and it cannot be concluded from them whether the recognition result of the second voice instruction is erroneous.
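Under the same assumed record layout, this gating check reduces to a single comparison; the 10-second value below is purely illustrative, since the patent leaves the threshold to the technician:

```python
TIME_INTERVAL_THRESHOLD = 10.0  # seconds; an illustrative value only

def within_interval(first: InstructionRecord,
                    second: InstructionRecord) -> bool:
    # Only pairs received close together are examined further; pairs
    # farther apart lack continuity and are not detected.
    return (first.timestamp - second.timestamp) < TIME_INTERVAL_THRESHOLD
```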
The semantic difference comprises the domain relation of semantic analysis results of the two voice commands, and the signal difference comprises at least one of tone comparison results, emotion comparison results or voiceprint comparison results of the two voice commands. In a specific implementation process, the semantic difference or the signal difference of the two voice commands acquired by the server may have different combinations, for example, the server may acquire a domain relationship to which semantic analysis results of the two voice commands belong and a voiceprint comparison result, and the server may also acquire other comparison results, and the combination may be arbitrary, which is not limited in the embodiment of the present invention.
203. And when at least one of the semantic difference or the signal difference meets an error condition, the server acquires the similarity of the voice recognition results of the two voice commands according to the voice command information.
The semantic difference or the signal difference of the two voice commands obtained by the server in step 202 may be different, and correspondingly, if at least one of the semantic difference or the signal difference meets the error condition, there may be different error conditions. In one possible implementation, the error condition may include at least one of the following error conditions one through four:
and the error condition I is that the semantic analysis results of the two voice instructions belong to different fields, or the fields of the semantic analysis results of the two voice instructions belong to target fields.
In one possible implementation, the server obtains, from the recognition result corresponding to each voice instruction, the semantic parsing result of that instruction, then obtains the domain to which each parsing result belongs, and thereby obtains the domain relationship between the semantic parsing results of the two instructions.
In another possible implementation manner, the voice instruction information may further include semantic analysis results of the two voice instructions, and the server may directly obtain a domain relationship to which the semantic analysis results of the two voice instructions belong according to the semantic analysis results of the two voice instructions. The semantic parsing result can be the intention of the user for sending the voice instruction, which is obtained by understanding the voice recognition result corresponding to the voice instruction through natural language.
The server obtains the domains to which the semantic parsing results of the first and second voice instructions belong; the two domains may be either the same or different. Whether the error condition is met can be judged from these two cases together with whether the domain is the target domain. The rule may be set as follows: when the domains of the two voice instructions differ, the error condition is met; when the domains are the same, the error condition is met only if that shared domain is the target domain.
Specifically, when the domains to which the semantic analysis results of the two voice commands belong are the same domain and the same domain is not the target domain, the first voice command and the second voice command do not meet an error condition; when the fields to which the semantic analysis results of the two voice commands belong are different fields or are both target fields, the first voice command and the second voice command meet an error condition.
The domain to which a semantic parsing result belongs may be one of several domains defined over parsing results, such as a weather domain, a music domain, a traffic domain, and a chat domain. In one possible implementation, the target domain may be the chat domain: a voice instruction is assigned to the chat domain when its parsing result cannot be placed in any of the defined domains such as weather, music, or traffic, or when, while processing the instruction, the server cannot find the resource or operation the instruction requires.
The two voice instructions may be a continuous dialogue on the same user equipment, and by interaction habit a user's continuous dialogue generally stays within one domain; that is, the semantic parsing results of the first and second voice instructions belong to the same domain. For example, if the second voice instruction searches for a singer and the first voice instruction asks to play the next song, the parsing results of both instructions belong to the music domain. Therefore, when the two parsing results belong to the same (non-target) domain, the recognition result of the second voice instruction is taken to contain no error.
When the domains to which the semantic analysis results of the two voice commands belong are different, the voice recognition result of the second voice command may have an error. For example, the first voice instruction is to search for a singer, the second voice instruction is to query the weather of today, the domain to which the semantic analysis result of the first voice instruction belongs is the music domain, the domain to which the semantic analysis result of the second voice instruction belongs is the weather domain, the domains to which the semantic analysis results of the two voice instructions belong are different domains, and the first voice instruction and the second voice instruction meet the error condition. For another example, the target field may be a chat field, and the first voice command and the second voice command are semantically analyzed to obtain that the fields to which semantic analysis results belong are chat fields, and the first voice command and the second voice command meet an error condition.
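Error condition one then reduces to the following check, a sketch assuming the domain labels are plain strings:

```python
TARGET_DOMAIN = "chat"  # illustrative name for the target domain

def domain_error_condition(first_domain: str, second_domain: str) -> bool:
    if first_domain != second_domain:
        return True                       # different domains: condition met
    return first_domain == TARGET_DOMAIN  # same domain: met only if target
```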
Error condition two, the pitch comparison result indicates that the pitch of the first voice command is higher than the pitch of the second voice command.
The server acquires the tones of the first voice instruction and the second voice instruction, when the tone of the first voice instruction is higher than the tone of the second voice instruction, the first voice instruction and the second voice instruction are in accordance with an error condition, and when the tone of the first voice instruction is lower than or equal to the tone of the second voice instruction, the first voice instruction and the second voice instruction are not in accordance with the error condition.
During the process in which the user equipment obtains the resource the user expects based on a voice instruction, if the recognition result contains an error, the user equipment does not return the expected resource, and the user may unconsciously raise their voice so that the equipment hears the instruction more clearly. The pitch of the voice instruction the user equipment receives again therefore becomes higher, and the server can judge from this pitch change whether the recognition result of the second voice instruction contains an error.
And an error condition three, wherein the emotion comparison result indicates that the emotion of the first voice instruction is different from that of the second voice instruction.
In the process of acquiring the user desired resource by the user equipment based on the voice instruction, when the voice recognition result of the voice instruction is wrong and the user desired resource is not acquired, the emotion of the voice instruction issued by the user may change, for example, the anger component in the emotion may increase. The server can compare the emotion corresponding to the first voice instruction and the emotion corresponding to the second voice instruction, and judge whether the voice recognition result of the second voice instruction is possibly wrong.
And an error condition four, wherein the voiceprint comparison result indicates that the voiceprints of the two voice commands are the same.
The voiceprint is a sound wave spectrum carrying speech information and displayed by an electro-acoustic instrument, and has specificity and relative stability. Thus, it is possible to determine whether or not two voices have been uttered by the same person from the voiceprint. The server compares the voiceprints of the voices corresponding to the first voice instruction and the second voice instruction, and when the voiceprint of the first voice instruction is the same as the voiceprint of the second voice instruction, the first voice instruction and the second voice instruction are sent by the same user. When a user sends a voice instruction to the user equipment and the user equipment performs voice recognition on the voice instruction in error, the user can continue to send the same voice instruction, so that the previous voice instruction in the two voice instructions may have an error.
When the voiceprint of the first voice command is different from the voiceprint of the second voice command, the first voice command and the second voice command are sent by different users. Different users may issue different voice commands to the user equipment, which is not the case that the voice recognition of the user equipment is wrong, and thus, the case may be determined that there is no error in the voice recognition result of the second voice command.
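Error conditions two and four can be sketched as below; `estimate_pitch` and `voiceprints_match` are hypothetical stand-ins for whatever pitch-tracking and speaker-verification components an implementation actually provides:

```python
from typing import Callable

def pitch_error_condition(first_audio: bytes, second_audio: bytes,
                          estimate_pitch: Callable[[bytes], float]) -> bool:
    # Condition two: the later instruction is spoken at a higher pitch,
    # suggesting the user raised their voice after a failed response.
    return estimate_pitch(first_audio) > estimate_pitch(second_audio)

def voiceprint_error_condition(first_audio: bytes, second_audio: bytes,
                               voiceprints_match: Callable[[bytes, bytes], bool]
                               ) -> bool:
    # Condition four: both instructions were uttered by the same speaker.
    return voiceprints_match(first_audio, second_audio)
```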
In one possible implementation, at least one of the semantic difference or the signal difference obtained in step 202 may include multiple comparison results. In that case the server may obtain one comparison result first, judge from it whether an error is possible, and only then obtain the next one, and so on until the multiple comparison results have been obtained. The specific acquisition process may be as follows:
the server obtains a first comparison result of the two voice instructions according to the voice instruction information, wherein the first comparison result is any one of the multiple comparison results, and when the first comparison result meets an error condition, the server obtains a second comparison result of the two voice instructions, and the second comparison result is a comparison result except the first comparison result in the multiple comparison results.
For example, taking as an example that the multiple comparison results include the domain relationship to which the semantic analysis results of the two voice commands belong, the pitch comparison result, and the voiceprint comparison result, a certain comparison order may be set for the multiple comparison results. Specifically, the domain relationship to which the semantic analysis results of the two voice commands belong may be compared, the pitch comparison results of the two voice commands may be compared, and the voiceprint comparison results of the two voice commands may be compared.
When the domain relationship to which the two instructions' semantic parsing results belong is compared before their pitch comparison result, the first comparison result is the domain relationship and the second comparison result is the pitch comparison result: when the first comparison result meets the error condition, the server may obtain the pitch comparison result of the two voice instructions as the second comparison result.
When the pitch comparison result of the two voice instructions is compared before their voiceprint comparison result, the first comparison result is the pitch comparison result and the second comparison result is the voiceprint comparison result: when the first comparison result meets error condition two, the server may obtain the voiceprint comparison result as the second comparison result, and when that second comparison result meets error condition four, the step of obtaining the similarity is performed. This ordering is not limited in the embodiment of the present invention.
In one possible implementation manner, when any one of the comparison results does not meet the error condition, the server determines that the voice recognition result of the second voice instruction is correct. In this case, the server may not obtain the next comparison result and may not perform the step of obtaining the similarity.
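The sequential acquisition just described is, in effect, short-circuit evaluation: each comparison result is computed only if every earlier one already met its error condition, and the first one that does not ends the detection. A sketch, reusing the checks above:

```python
from typing import Callable, Sequence

def all_error_conditions_met(checks: Sequence[Callable[[], bool]]) -> bool:
    """Each element returns True when its comparison result meets the
    error condition. Evaluation is lazy, so later comparison results
    are never acquired once one check fails."""
    return all(check() for check in checks)

# Illustrative ordering: domain relation, then pitch, then voiceprint.
# estimate_pitch and voiceprints_match are the hypothetical components
# from the earlier sketch.
def should_compute_similarity(first, second, estimate_pitch, voiceprints_match):
    return all_error_conditions_met([
        lambda: domain_error_condition(first.domain, second.domain),
        lambda: pitch_error_condition(first.audio, second.audio, estimate_pitch),
        lambda: voiceprint_error_condition(first.audio, second.audio,
                                           voiceprints_match),
    ])
```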
The above steps 202 and 203 constitute the process of obtaining, when the receiving interval of the two voice instructions is smaller than the time interval threshold, the similarity of their speech recognition results according to the voice instruction information. In this process, after determining that the interval is below the threshold, at least one of the semantic difference or the signal difference may be evaluated to decide whether to perform the similarity step; alternatively, the similarity step may be performed directly, without evaluating semantic or signal differences, whenever the interval is below the threshold. The present invention does not limit this.
204. And when the similarity meets the similarity condition, the server determines that the voice recognition result of the second voice instruction has errors.
When the similarity of the speech recognition results of the first and second voice instructions is high, the recognition result of the second voice instruction can be considered to contain an error.
When at least one of the semantic difference or the signal difference meets an error condition, the server may obtain a similarity of the speech recognition results of the two speech commands according to the speech command information, and in a possible implementation manner, a manner of obtaining the similarity of the speech recognition results of the two speech commands by the server may include any one of the following manners:
in the first mode, the server acquires the editing distance between the voice recognition results of the two voice instructions, and the editing distance is used as the similarity of the voice recognition results of the two voice instructions.
In this manner, the similarity condition may be that the edit distance is less than the edit distance threshold. The edit distance is the minimum number of editing operations required to change from one string to another string, and the smaller the edit distance, the more similar the two strings are. The first method may be that the server obtains an edit distance of the voice recognition results corresponding to the first voice instruction and the second voice instruction, and when the edit distance is smaller than an edit distance threshold, the similarity of the voice recognition results corresponding to the first voice instruction and the second voice instruction is high, and the voice recognition result of the second voice instruction may have an error. The edit distance threshold may be set by a technician, which is not limited in the embodiment of the present invention.
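Mode one is the classic dynamic-programming edit distance; a sketch, with an arbitrary illustrative threshold:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

EDIT_DISTANCE_THRESHOLD = 2  # illustrative; the patent leaves it to a technician

def similarity_condition_met(text1: str, text2: str) -> bool:
    # The edit distance itself serves as the similarity measure; a small
    # distance means the two recognition results are highly similar.
    return edit_distance(text1, text2) < EDIT_DISTANCE_THRESHOLD
```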
And secondly, calling an acoustic model by the server, inputting the voice recognition results of the two voice commands into the acoustic model, comparing the pinyin of the two voice recognition results by the acoustic model, and outputting the similarity of the voice recognition results of the two voice commands.
And thirdly, calling a language model by the server, inputting the voice recognition results of the two voice instructions into the language model, comparing the characters of the two voice recognition results by the language model, and outputting the similarity of the voice recognition results of the two voice instructions.
For the second and third modes, the server may input the speech recognition results of the first speech instruction and the second speech instruction into the model, and directly output the similarity of the speech recognition results of the first speech instruction and the second speech instruction through model calculation, where the similarity condition may be that the speech similarity is greater than a similarity threshold. The acoustic model is used for analyzing pinyin similarity of two voice recognition results, and the language model is used for analyzing character similarity of the two voice recognition results. The similarity threshold may be set by a technician, which is not limited in the embodiment of the present invention.
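As a rough sketch of the pinyin comparison in mode two: the patent names no concrete model or library, so the snippet below assumes the third-party `pypinyin` package for the text-to-pinyin step and a simple aligned-syllable match in place of a trained acoustic model's score:

```python
from pypinyin import lazy_pinyin  # third-party package; an assumed choice

def pinyin_similarity(text1: str, text2: str) -> float:
    """Fraction of aligned syllables whose pinyin matches, in [0, 1];
    a crude stand-in for an acoustic model's similarity output."""
    p1, p2 = lazy_pinyin(text1), lazy_pinyin(text2)
    if not p1 or not p2:
        return 0.0
    matches = sum(s1 == s2 for s1, s2 in zip(p1, p2))
    return matches / max(len(p1), len(p2))
```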
In the determination process of the above steps 202 to 204, when the server detects the speech recognition result of the speech instruction, it may determine whether the speech recognition result has an error according to the conversation habit of the user, and the conversation habit of the user may be embodied as the receiving time interval, the semantic difference, the signal difference, and the similarity.
The user conversation habit may be the habits a user shows in spoken expression over multiple rounds of dialogue with the user equipment. For example, when the recognition result of one of the user's voice instructions contains an error and the user equipment gives no feedback, the user will, by habit, issue the same or a similar instruction again; the recognition results of adjacent instructions from the same user equipment can therefore be compared to obtain their similarity, from which it is determined whether the earlier instruction's recognition result contains an error.
In a specific example, the semantic difference includes a domain relationship to which semantic parsing results of the two voice commands belong, and the signal difference includes a voiceprint comparison result of the two voice commands.
In this example, the server may first obtain a receiving time interval corresponding to the first voice instruction and the second voice instruction, and when the receiving time interval is greater than a time interval threshold, determine that a voice recognition result of the second voice instruction is correct, and no longer perform detection.
And when the receiving time interval is smaller than a time interval threshold value, the server acquires the domain relation to which the semantic analysis results of the two voice instructions belong, and when the domain relation to which the semantic analysis results belong does not accord with the error condition, the voice recognition result of the second voice instruction is determined to be correct, and detection is not performed any more.
And when the domain relation of the semantic analysis result accords with the error condition, the server acquires the voiceprints corresponding to the first voice instruction and the second voice instruction, and when the voiceprints do not accord with the error condition, the server determines that the voice recognition result of the second voice instruction is correct and does not perform detection any more.
And when the voiceprint meets the error condition, the server acquires the similarity of the voice recognition results corresponding to the first voice command and the second voice command, and when the similarity does not meet the similarity condition, the server determines that the voice recognition result of the second voice command is correct. And when the similarity meets the similarity condition, determining that the voice recognition result of the second voice instruction has errors.
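Assembled from the sketches above, the worked example's whole decision chain becomes a cascade of early exits. This is an illustrative composition, not the patent's reference implementation:

```python
def second_result_is_erroneous(first: InstructionRecord,
                               second: InstructionRecord,
                               voiceprints_match,
                               similarity_threshold: float = 0.8) -> bool:
    # Any stage whose condition is not met ends detection: the second
    # instruction's recognition result is then judged correct.
    if not within_interval(first, second):
        return False
    if not domain_error_condition(first.domain, second.domain):
        return False
    if not voiceprints_match(first.audio, second.audio):
        return False
    # Similarity condition: here via the pinyin similarity sketch.
    return pinyin_similarity(first.asr_text, second.asr_text) >= similarity_threshold
```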
205. And the server sends the second voice instruction and the voice recognition result of the second voice instruction, or sends the two voice instructions and respective voice recognition results to the target equipment.
The target device is used to prompt manual labeling and inspection of the received voice instructions and their recognition results. When the server determines that the recognition result of the second voice instruction is erroneous, it may send only the second voice instruction and its recognition result to the target device; alternatively, since the recognition result of the first voice instruction may also be erroneous, it may send both voice instructions together with their respective recognition results.
After the target device receives a voice instruction and its recognition result, a technician can play back the instruction to determine whether the recognition result contains an error. In this way, the two-stage process of automatic detection by the server followed by manual re-inspection yields a more accurate detection of recognition results. If an error exists, the technician can label the voice instruction, then feed the labeled result and the received recognition result into the ASR model and adjust its model parameters, thereby optimizing the ASR model.
Through the automatic detection of steps 201 to 204 above, the server identifies voice instructions whose recognition results contain errors, and then sends those instructions and their recognition results to the target device, where whether each recognition result is erroneous is confirmed manually. Automatic detection screens all voice instructions and recognition results, and only the instructions preliminarily judged erroneous are handed to a person; there is no need to inspect every instruction and result manually, which improves the efficiency of identifying speech recognition errors.
According to the method provided by the embodiment of the invention, two adjacent voice instructions received by the same user equipment within a time interval smaller than a certain value, together with their respective speech recognition results, are obtained; the similarity of the two recognition results is computed, and whether the recognition result of the earlier of the two instructions contains an error is determined according to that similarity. This detection method finds erroneous speech recognition results automatically, removes the need to manually inspect every voice instruction and recognition result, and improves the efficiency of identifying speech recognition errors.
In a specific example, the detection method based on the voice recognition result may have a plurality of specific implementation manners in specific implementation, a specific application example of the detection method based on the voice recognition result is provided below, and fig. 3 is a flowchart of the detection method based on the voice recognition result provided in the embodiment of the present invention, and refer to fig. 3.
Fig. 3 can be divided into two parts, a service flow 301 and an error-finding flow 302: the service flow 301 is the normal flow for processing a user's voice instruction, and the error-finding flow 302 is the flow of the speech recognition error detection method.
In the business process 301, the specific process may include the following steps one to four:
Step one, the server acquires the user's voice instruction and passes it to the central control module.
A device such as a smart speaker detects the voice instruction uttered by the user and sends it over the network to the server, which thus obtains the instruction. The central control module is responsible for processing the user's voice instructions: on the one hand it coordinates and completes the user's normal service needs, including but not limited to processing speech, processing the speech recognition result, and storing audio; on the other hand it records each processing result in the access history module, and whether the voice instruction was recognized erroneously is determined through the error-finding flow 302.
And step two, a Voice Activity Detection (VAD) module acquires the voice instruction, performs voice activity detection, detects the leading and trailing endpoints of the user's voice instruction, and hands the instruction, one utterance at a time, to the Automatic Speech Recognition (ASR) module for processing.
And step three, the ASR module identifies the voice instruction and identifies the audio information of the voice instruction as text information.
And fourthly, performing semantic analysis on the text information by a Natural Language Processing (NLP) module, and analyzing the actual intention of the user according to the text information to further obtain the resources expected by the user.
After the server completes the service flow 301, the central control module stores the user information, the speech recognition result, and the semantic parsing result corresponding to the voice instruction into the access history module. For voice instruction information newly added to the access history module, the error-finding flow 302 may be executed; its specific flow may include the following steps one to five:
Step one, when the access history module acquires the related information of a newly added voice instruction, it sends that information to the result filter module to check whether a speech recognition error exists. In one possible implementation, the server may also obtain the related information of the previous voice instruction of the same user equipment and send it to the result filter module together. This step corresponds to step 201 above.
And step two, after the result filter module acquires the related information of the voice instruction, it checks the instruction against a filter rule base. If the instruction passes the check, the error-finding flow stops; otherwise the related information is sent on to the voiceprint filter module for further checking.
The filter rule base may be constructed from user access habits and may include multiple checks. In one possible implementation it contains two rules. The first is a time rule: the receiving interval of the two voice instructions, as obtained by the server in step 202, must be smaller than the time interval threshold. The second is a semantic rule, namely the domain relationship of the two instructions' semantic parsing results described under the semantic difference. In one specific possible implementation, the server may define a semantic-rule whitelist: if the two consecutive utterances are recognized in the same domain and that domain is on the whitelist, the dialogue is considered a normal dialogue.
In one possible implementation, the history format of the voice instruction related information is as follows:
user IDentity number (ID) | timestamp | recognition result | semantic recognition field
When the access history module acquires the related information of a newly added voice instruction, it first checks, for the same user ID, whether the timestamps of the current voice instruction and the previous voice instruction of the same user equipment satisfy the time rule; it then checks whether the semantic recognition domains of the two instructions satisfy the semantic rule. If the pair satisfies neither the time rule nor the semantic rule, it fails the filter rule base check.
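A sketch of parsing that history line format and applying the two rules; the whitelist contents and threshold reuse the illustrative values assumed earlier:

```python
SEMANTIC_RULE_WHITELIST = {"music", "weather", "traffic"}  # illustrative

def parse_record_line(line: str):
    user_id, timestamp, result, domain = (f.strip() for f in line.split("|"))
    return user_id, float(timestamp), result, domain

def passes_filter_rule_base(curr_line: str, prev_line: str) -> bool:
    """True when the pair looks like a normal dialogue, so the
    error-finding flow stops at the result filter module."""
    uid1, t1, _, dom1 = parse_record_line(curr_line)
    uid2, t2, _, dom2 = parse_record_line(prev_line)
    if uid1 != uid2:
        return True  # records of different user IDs are not compared
    if t1 - t2 >= TIME_INTERVAL_THRESHOLD:
        return True  # time rule not triggered: treated as normal
    # Semantic rule: same whitelisted recognition domain means normal dialogue.
    return dom1 == dom2 and dom1 in SEMANTIC_RULE_WHITELIST
```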
Step three: the voiceprint filter module acquires the voice instructions that fail the filtering rule base check and performs voiceprint comparison.
Multiple users may use the same user equipment at the same time; for example, user A may say "play a song" by voice and user B may say "change to another one". In this case, although the voice operations are consecutive, the speech recognition is not considered problematic, because the users differ. Therefore, when two voice instructions fail the check in step two, step three performs voiceprint comparison to verify whether the two voice instructions were issued by the same user, and only instructions from the same user proceed to the next check.
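The embodiment does not prescribe a particular voiceprint comparison algorithm. A common approach, sketched below under that assumption, is to reduce each instruction to a fixed-length speaker-embedding vector and compare the vectors by cosine similarity; the embedding extraction is not shown, and the 0.75 threshold is illustrative only.

```python
import math

# Sketch of a voiceprint comparison, assuming each instruction has already
# been reduced to a fixed-length speaker-embedding vector by some voiceprint
# model (not shown). The 0.75 threshold is an illustrative assumption.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / ((norm_a * norm_b) or 1e-12)  # guard against zero vectors

def same_speaker(emb_a: list[float], emb_b: list[float],
                 threshold: float = 0.75) -> bool:
    """True when the two voiceprints are close enough to be one user."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```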
Step four: the error recognition computing module acquires the voice instructions that fail the voiceprint comparison check of the voiceprint filter module and computes the similarity of the consecutive voice instructions; the computation may use methods such as edit distance, calling an acoustic model, or calling a language model. That is, as in steps 203 and 204 above, whether the voice recognition result of the second voice instruction has an error is determined according to the similarity of the voice recognition results of the two voice instructions.
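Of the three similarity methods mentioned, the edit distance is the simplest to illustrate. The sketch below computes the Levenshtein distance between two recognition texts, normalizes it into a similarity score in [0, 1], and applies a decision threshold; the 0.6 threshold is an assumption for illustration.

```python
# Sketch of the edit-distance similarity between two recognition results.
# The 0.6 similarity threshold is an illustrative assumption.

def edit_distance(s: str, t: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def recognition_similarity(first: str, second: str) -> float:
    """Map the edit distance to a similarity score in [0, 1]."""
    longest = max(len(first), len(second)) or 1
    return 1.0 - edit_distance(first, second) / longest

def second_result_has_error(first: str, second: str,
                            threshold: float = 0.6) -> bool:
    """High similarity between consecutive instructions suggests the user
    repeated a misrecognized instruction, i.e. the earlier result is wrong."""
    return recognition_similarity(first, second) >= threshold
```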
Steps two to four correspond to the processes described in steps 202 to 204 above.
Step five: voice instructions that fail the voice similarity check of the error recognition computing module are sent to a manual labeling platform, where an annotator labels them manually. This step corresponds to step 205 described above.
Fig. 4 is a schematic structural diagram of an apparatus for detecting a voice recognition result according to an embodiment of the present invention. Referring to Fig. 4, the apparatus includes an obtaining module 401 and a determining module 402.
The obtaining module 401 is configured to obtain voice instruction information, where the voice instruction information includes two voice instructions received by the same user equipment and their respective voice recognition results; the two voice instructions include a first voice instruction and a second voice instruction, the second voice instruction being the voice instruction previous to the first voice instruction.
The obtaining module 401 is further configured to obtain similarity of the speech recognition results of the two speech instructions according to the speech instruction information when the receiving time interval of the two speech instructions is smaller than the time interval threshold.
A determining module 402, configured to determine that a voice recognition result of the second voice instruction has an error when the similarity meets the similarity condition.
In one possible implementation, the obtaining module is further configured to:
when the receiving time interval of the two voice instructions is smaller than a time interval threshold value, acquiring at least one of semantic difference or signal difference of the two voice instructions according to the voice instruction information;
and when at least one of the semantic difference or the signal difference meets an error condition, executing the step of acquiring the similarity.
In one possible implementation, the at least one of the semantic difference or the signal difference comprises a plurality of comparison results;
the acquisition module is further configured to:
obtaining a first comparison result of the two voice instructions according to the voice instruction information;
and when the first comparison result meets the error condition, acquiring a second comparison result of the two voice commands.
In one possible implementation, the semantic difference includes a domain relationship to which semantic parsing results of the two voice commands belong, and the signal difference includes at least one of a pitch comparison result, an emotion comparison result, or a voiceprint comparison result of the two voice commands;
the at least one of semantic differences or signal differences meets an error condition, including at least one of:
the semantic parsing results of the two voice instructions belong to different domains, or the domains to which the semantic parsing results of the two voice instructions belong are target domains;
the pitch comparison result indicates that the pitch of the first voice command is higher than the pitch of the second voice command;
the emotion comparison result indicates that the emotion of the first voice instruction is different from that of the second voice instruction;
the voiceprint comparison result indicates that the voiceprints of the two voice commands are the same.
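The sketch below illustrates how such comparison results might be chained, in line with the staged acquisition described above: each check runs only after the previous one met its error condition, and the first check that fails to meet its condition clears the pair as correct. The comparison functions, field names, and target-domain set are hypothetical stand-ins, not the embodiment's actual modules.

```python
# Sketch of the staged pre-checks: each comparison result is computed only
# if the previous one already met its error condition; any result that fails
# to meet its condition marks the recognition result as correct.

from typing import Callable

def domains_suspicious(first: dict, second: dict) -> bool:
    """Error condition: different domains, or the domain is a target domain."""
    TARGET_DOMAINS = {"music"}  # assumed target-domain set
    return (first["domain"] != second["domain"]
            or second["domain"] in TARGET_DOMAINS)

def pitch_suspicious(first: dict, second: dict) -> bool:
    """Error condition: the later (first) instruction is higher pitched."""
    return first["pitch"] > second["pitch"]

def needs_similarity_check(first: dict, second: dict) -> bool:
    checks: list[Callable[[dict, dict], bool]] = [
        domains_suspicious,  # semantic difference
        pitch_suspicious,    # signal difference: pitch comparison
        # emotion and voiceprint checks would follow the same pattern
    ]
    for check in checks:
        if not check(first, second):
            return False     # any non-error result: recognition is correct
    return True              # all checks met their error conditions
```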
In one possible implementation, the voice instruction information further includes semantic parsing results of the two voice instructions;
the acquisition module is further configured to:
and acquiring the domain relation of the semantic analysis results of the two voice instructions according to the semantic analysis results of the two voice instructions.
In one possible implementation, the determining module is further configured to:
and when any one of the comparison results does not meet the error condition, determining that the voice recognition result of the second voice instruction is correct.
In one possible implementation, the obtaining module is further configured to perform any one of the following (a simplified sketch of the pinyin comparison follows this list):
acquiring an edit distance between the voice recognition results of the two voice instructions, and taking the edit distance as the similarity of the voice recognition results of the two voice instructions;
calling an acoustic model, inputting the voice recognition results of the two voice commands into the acoustic model, comparing the pinyin of the two voice recognition results by the acoustic model, and outputting the similarity of the voice recognition results of the two voice commands;
and calling a language model, inputting the voice recognition results of the two voice instructions into the language model, comparing the characters of the two voice recognition results by the language model, and outputting the similarity of the voice recognition results of the two voice instructions.
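As a much simplified stand-in for the acoustic-model comparison, the sketch below converts both recognition results to pinyin with the third-party pypinyin library and measures sequence similarity with difflib. A trained acoustic model would be considerably more involved; this only approximates the pinyin-level comparison described here.

```python
# Simplified stand-in for the acoustic-model pinyin comparison: convert both
# recognition results to pinyin and measure sequence similarity with difflib.

from difflib import SequenceMatcher
from pypinyin import lazy_pinyin  # pip install pypinyin

def pinyin_similarity(first_text: str, second_text: str) -> float:
    first_py = lazy_pinyin(first_text)    # e.g. "播放" -> ["bo", "fang"]
    second_py = lazy_pinyin(second_text)
    return SequenceMatcher(None, first_py, second_py).ratio()
```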
In one possible implementation, the apparatus further includes:
and the sending module is used for sending the second voice instruction and the voice recognition result of the second voice instruction, or sending the two voice instructions and the respective voice recognition results to the target equipment, and the target equipment is used for prompting manual marking and detection of the received voice instruction and the voice recognition result of the voice instruction.
The device provided by the embodiment of the invention obtains the similarity of the voice recognition results of two adjacent voice instructions of the same user equipment, the time interval of which is less than a certain value, and the respective voice recognition results, and determines whether the voice recognition result of the previous voice instruction in the two voice instructions has errors or not according to the similarity. The detection method of the voice recognition result automatically detects and determines the voice recognition result with errors, does not need to manually detect all voice instructions and the voice recognition result, and improves the efficiency of determining the voice recognition errors.
It should be noted that: the voice recognition result detection apparatus provided in the above embodiment is only illustrated by the division of the above functional modules when detecting the voice recognition result, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the above described functions. In addition, the detection apparatus for the speech recognition result and the detection method embodiment of the speech recognition result provided by the above embodiments belong to the same concept, and the specific implementation process thereof is detailed in the method embodiment and will not be described herein again.
Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention. The terminal 500 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 500 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.
In general, the terminal 500 includes: one or more processors 501 and one or more memories 502.
The processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one instruction for execution by processor 501 to implement the method of detecting speech recognition results provided by the method embodiments of the present invention.
In some embodiments, the terminal 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, memory 502 and peripheral interface 503 may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 504, display screen 505, camera assembly 506, audio circuitry 507, positioning assembly 508, and power supply 509.
The peripheral interface 503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 501 and the memory 502. In some embodiments, the processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 504 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 504 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 504 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 504 may further include NFC (Near Field Communication) related circuits, which are not limited in the present disclosure.
The display screen 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display screen, it also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 501 as a control signal for processing. In that case, the display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 505, provided on the front panel of the terminal 500; in other embodiments, there may be at least two display screens 505, respectively disposed on different surfaces of the terminal 500 or in a folded design; in still other embodiments, the display screen 505 may be a flexible display disposed on a curved or folded surface of the terminal 500. The display screen 505 may even be arranged as a non-rectangular irregular figure, i.e., a shaped screen. The display screen 505 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 506 is used to capture images or video. Optionally, the camera assembly 506 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fusion shooting functions. In some embodiments, the camera assembly 506 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuitry 507 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert them into electrical signals, and input them to the processor 501 for processing, or input them to the radio frequency circuit 504 to realize voice communication. For stereo collection or noise reduction, there may be multiple microphones, disposed at different portions of the terminal 500. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The speaker may be a traditional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can not only convert an electrical signal into sound waves audible to humans, but also convert an electrical signal into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuitry 507 may also include a headphone jack.
The positioning component 508 is used to determine the current geographic location of the terminal 500 for navigation or LBS (Location Based Service). The positioning component 508 may be based on the United States' GPS (Global Positioning System), China's BeiDou system, Russia's GLONASS system, or the European Union's Galileo system.
The power supply 509 is used to supply power to the various components in the terminal 500. The power supply 509 may use alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 509 includes a rechargeable battery, the battery may support wired or wireless charging, and may also support fast-charging technology.
In some embodiments, terminal 500 also includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: acceleration sensor 511, gyro sensor 512, pressure sensor 513, fingerprint sensor 514, optical sensor 515, and proximity sensor 516.
The acceleration sensor 511 may detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 500. For example, the acceleration sensor 511 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 501 may control the display screen 505 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 511. The acceleration sensor 511 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 512 may detect a body direction and a rotation angle of the terminal 500, and the gyro sensor 512 may cooperate with the acceleration sensor 511 to acquire a 3D motion of the user on the terminal 500. The processor 501 may implement the following functions according to the data collected by the gyro sensor 512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 513 may be disposed on a side frame of the terminal 500 and/or underneath the display screen 505. When the pressure sensor 513 is disposed on the side frame of the terminal 500, a user's holding signal of the terminal 500 may be detected, and the processor 501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed at the lower layer of the display screen 505, the processor 501 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 505. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 514 is used for collecting a fingerprint of the user, and the processor 501 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 501 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 514 may be provided on the front, back, or side of the terminal 500. When a physical button or a vendor Logo is provided on the terminal 500, the fingerprint sensor 514 may be integrated with the physical button or the vendor Logo.
The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the display screen 505 based on the ambient light intensity collected by the optical sensor 515. Specifically, when the ambient light intensity is high, the display brightness of the display screen 505 is increased; when the ambient light intensity is low, the display brightness of the display screen 505 is reduced. In another embodiment, processor 501 may also dynamically adjust the shooting parameters of camera head assembly 506 based on the ambient light intensity collected by optical sensor 515.
The proximity sensor 516, also referred to as a distance sensor, is typically disposed on the front panel of the terminal 500. The proximity sensor 516 is used to detect the distance between the user and the front surface of the terminal 500. In one embodiment, when the proximity sensor 516 detects that the distance between the user and the front surface of the terminal 500 gradually decreases, the processor 501 controls the display screen 505 to switch from the screen-on state to the screen-off state; when the proximity sensor 516 detects that the distance gradually increases, the processor 501 controls the display screen 505 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 5 is not intended to be limiting of terminal 500 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention. The server 600 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 601 and one or more memories 602, where the one or more memories 602 store at least one instruction that is loaded and executed by the one or more processors 601 to implement the method for detecting a voice recognition result provided by the above method embodiments. Of course, the server 600 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and may include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, including instructions executable by a processor to perform the method of detecting a speech recognition result in the above embodiments is also provided. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for detecting a speech recognition result, the method comprising:
acquiring voice instruction information, wherein the voice instruction information comprises two voice instructions received by the same user equipment and respective voice recognition results obtained through an automatic speech recognition (ASR) model, the two voice instructions comprise a first voice instruction and a second voice instruction, and the second voice instruction is the voice instruction previous to the first voice instruction;
when the receiving time interval of the two voice instructions is smaller than a time interval threshold value, acquiring the similarity of the voice recognition results of the two voice instructions according to the voice instruction information;
when the similarity meets a similarity condition, determining that the voice recognition result of the second voice instruction has errors;
sending the second voice instruction and the voice recognition result of the second voice instruction, or the two voice instructions and the respective voice recognition results, to a target device; further inputting a labeling result, obtained by labeling the voice instruction and the voice recognition result received by the target device, into the ASR model, and adjusting the model parameters of the ASR model; wherein the target device is used for prompting manual labeling and checking of the received voice instruction and the voice recognition result of the voice instruction.
2. The method according to claim 1, wherein the obtaining the similarity of the speech recognition results of the two speech instructions according to the speech instruction information when the time interval between the two speech instructions is smaller than the time interval threshold comprises:
when the receiving time interval of the two voice instructions is smaller than a time interval threshold value, acquiring at least one of semantic difference or signal difference of the two voice instructions according to the voice instruction information;
and when at least one of the semantic difference or the signal difference meets an error condition, executing the step of acquiring the similarity.
3. The method of claim 2, wherein the at least one of semantic differences or signal differences comprises a plurality of comparison results;
the obtaining at least one of a semantic difference or a signal difference of the two voice commands according to the voice command information includes:
acquiring a first comparison result of the two voice instructions according to the voice instruction information;
and when the first comparison result meets the error condition, acquiring a second comparison result of the two voice instructions.
4. The method according to claim 2, wherein the semantic difference comprises a domain relationship to which semantic parsing results of the two voice commands belong, and the signal difference comprises at least one of a pitch comparison result, an emotion comparison result, or a voiceprint comparison result of the two voice commands;
the at least one of semantic differences or signal differences meeting an error condition, including at least one of:
the semantic parsing results of the two voice instructions belong to different domains, or the semantic parsing results of the two voice instructions belong to target domains;
the pitch comparison result indicates that the first voice instruction is higher in pitch than the second voice instruction;
the emotion comparison result indicates that the emotion of the first voice instruction and the emotion of the second voice instruction are different;
the voiceprint comparison result indicates that the voiceprints of the two voice instructions are the same.
5. The method of claim 2, wherein the voice instruction information further comprises semantic parsing results of the two voice instructions;
the obtaining at least one of a semantic difference or a signal difference of the two voice commands according to the voice command information includes:
and acquiring the domain relation of the semantic analysis results of the two voice instructions according to the semantic analysis results of the two voice instructions.
6. The method of claim 2, further comprising:
when any one of the comparison results does not meet the error condition, determining that the voice recognition result of the second voice instruction is correct.
7. The method according to claim 1, wherein the obtaining the similarity of the voice recognition results of the two voice commands according to the voice command information comprises any one of:
acquiring an edit distance between the voice recognition results of the two voice instructions, and taking the edit distance as the similarity of the voice recognition results of the two voice instructions;
calling an acoustic model, inputting the voice recognition results of the two voice commands into the acoustic model, comparing the pinyin of the two voice recognition results by the acoustic model, and outputting the similarity of the voice recognition results of the two voice commands;
and calling a language model, inputting the voice recognition results of the two voice instructions into the language model, comparing the characters of the two voice recognition results by the language model, and outputting the similarity of the voice recognition results of the two voice instructions.
8. An apparatus for detecting a result of speech recognition, the apparatus comprising:
an obtaining module, configured to obtain voice instruction information, where the voice instruction information includes two voice instructions received by the same user equipment and respective voice recognition results obtained through an automatic speech recognition (ASR) model; the two voice instructions include a first voice instruction and a second voice instruction, and the second voice instruction is the voice instruction previous to the first voice instruction;
the obtaining module being further configured to obtain the similarity of the voice recognition results of the two voice instructions according to the voice instruction information when the receiving time interval of the two voice instructions is smaller than a time interval threshold;
a determining module, configured to determine that the voice recognition result of the second voice instruction has an error when the similarity meets a similarity condition;
an adjusting module, configured to send the second voice instruction and the voice recognition result of the second voice instruction, or the two voice instructions and the respective voice recognition results, to a target device, input a labeling result, obtained by labeling the voice instruction and the voice recognition result received by the target device, into the ASR model, and adjust the model parameters of the ASR model; the target device being used for prompting manual labeling and checking of the received voice instruction and the voice recognition result of the voice instruction.
9. A computer device, comprising one or more processors and one or more memories having stored therein at least one instruction, the instruction being loaded and executed by the one or more processors to implement the operations performed by the method for detecting a speech recognition result according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor to perform operations performed by the method for detecting a speech recognition result according to any one of claims 1 to 7.
