Detailed Description
The following description of the embodiments of the present application is made clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiment of the application provides a voice recognition method, a voice recognition device, computer equipment and a storage medium.
The main execution body of the voice recognition method may be the voice recognition device provided by the embodiment of the application, or computer equipment integrated with the voice recognition device. The voice recognition device may be implemented in hardware or software. The computer equipment may be a terminal or a server, and the terminal may be a smart phone, a tablet computer, a palm computer, a notebook computer, a smart speaker, or the like.
Referring to fig. 1, fig. 1 is a schematic diagram of an application scenario of a voice recognition method according to an embodiment of the present application. The voice recognition method is applied to the computer equipment 10 in fig. 1. The computer equipment 10 acquires a voice to be recognized of a user and inputs the voice to be recognized into a preset voice recognition model to obtain a voice recognition text and the content confidence and domain confidence of the voice recognition text. Whether the text confidence of the voice recognition text is smaller than a preset confidence threshold is judged according to the content confidence and the domain confidence. If the text confidence of the voice recognition text is smaller than the confidence threshold, a third party resource library is called to perform result supplement processing on the voice recognition text to obtain supplemented voice recognition texts, and the voice text with the highest domain confidence among the supplemented voice recognition texts is determined as the target voice recognition text.
Fig. 2 is a flow chart of a voice recognition method according to an embodiment of the present application. As shown in fig. 2, the method includes the following steps S110 to S160.
S110, acquiring voice to be recognized.
In this embodiment, the user may wake up the computer device and then speak to the computer device to input the voice to be recognized, so that the computer device obtains the voice to be recognized.
S120, inputting the voice to be recognized into a preset voice recognition model to obtain a voice recognition text, and obtaining content confidence and field confidence of the voice recognition text.
After the computer equipment acquires the voice to be recognized, the voice to be recognized is input into a pre-trained voice recognition model, and the model outputs the voice recognition text with the highest content confidence, together with the content confidence and the domain confidence corresponding to the voice recognition text. The content confidence reflects the recognition accuracy of the voice recognition text, and the domain confidence reflects which service domain the voice recognition text belongs to. The service domains include a voice interaction domain, a terminal control domain, and the like, and the terminal control domain specifically includes an air conditioner control domain, a curtain control domain, a television control domain, and the like.
It should be noted that the speech recognition model in this embodiment not only implements the function of converting speech into text, but also determines the domain confidence of the domain to which the recognized text belongs. In some embodiments, the speech recognition model may be a convolutional neural network model, for example an N-gram neural network model.
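For illustration only, the output of such a speech recognition model can be represented by a simple data structure. The following Python sketch is merely one possible assumption about the model interface; the type and field names (RecognitionResult, content_confidence, domain_confidences) and the example values are hypothetical and are not limited by the application.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class RecognitionResult:
    """Hypothetical output of the preset speech recognition model."""
    text: str                             # recognized text with the highest content confidence
    content_confidence: float             # reflects how accurately the text was recognized
    domain_confidences: Dict[str, float]  # one domain sub-confidence per service domain

# Illustrative example of what the model might return for one utterance
example = RecognitionResult(
    text="open the Bluetooth of the smart speaker",
    content_confidence=0.42,
    domain_confidences={
        "voice_interaction": 0.30,
        "air_conditioner_control": 0.05,
        "speaker_control": 0.55,
    },
)
```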
S130, judging whether the text confidence of the voice recognition text is smaller than a preset confidence threshold according to the content confidence and the domain confidence, if so, executing the steps S140-S150, and if not, executing the step S160.
The step S130 includes determining the text confidence according to the content confidence, a preset content confidence weight, the domain confidence and a preset domain confidence weight, and then judging whether the text confidence is smaller than the confidence threshold.
Specifically, in some embodiments, text confidence = content confidence × preset content confidence weight + domain confidence × preset domain confidence weight.
It should be noted that the domain confidence output by the speech recognition model in this embodiment includes a plurality of domain sub-confidences, and in this embodiment the text confidence is calculated using the highest-valued domain sub-confidence among the plurality of domain sub-confidences.
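For example, the text confidence of step S130 can be computed as the weighted sum described above, taking the highest-valued domain sub-confidence as the domain term. The following sketch only illustrates the calculation; the weight values are placeholders, not values prescribed by the application.

```python
def compute_text_confidence(content_confidence: float,
                            domain_confidences: dict,
                            content_weight: float = 0.6,
                            domain_weight: float = 0.4) -> float:
    """Weighted sum of the content confidence and the highest domain sub-confidence."""
    top_domain_confidence = max(domain_confidences.values())
    return content_confidence * content_weight + top_domain_confidence * domain_weight

# With the illustrative values above: 0.42 * 0.6 + 0.55 * 0.4 = 0.472,
# which is then compared with the preset confidence threshold in step S130.
score = compute_text_confidence(0.42, {"voice_interaction": 0.30, "speaker_control": 0.55})
```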
S140, calling a third party resource library to perform result supplement processing on the voice recognition text to obtain the supplemented voice recognition text.
When the text confidence of the voice recognition text is smaller than the confidence threshold, it can be considered that the voice recognition model has no corresponding recognition result and that the accuracy of the recognition result is low. At this time, a third party resource library needs to be called to perform result supplement processing on the voice recognition text, where the third party resource library includes an Internet search engine, a social network search interface, and/or a plurality of domain resource libraries.
In some embodiments, when the third party resource library comprises a plurality of domain resource libraries, step S140 includes obtaining text pinyin corresponding to the speech recognition text, determining a target domain resource library from the plurality of domain resource libraries according to the domain confidence level, and finally determining the post-supplement speech recognition text in the target domain resource library according to the text pinyin.
The domain confidence comprises a plurality of domain sub-confidences. Determining a target domain resource library from the plurality of domain resource libraries according to the domain confidence comprises: extracting a preset number of the highest-valued sub-confidences from the plurality of domain sub-confidences to obtain a plurality of target domain sub-confidences, and determining, from the plurality of domain resource libraries, the domain resource libraries respectively corresponding to the target domain sub-confidences to obtain a plurality of target domain resource libraries.
For example, if the preset number is 3, the 3 highest-valued sub-confidences are extracted from the plurality of domain sub-confidences as target domain sub-confidences. Each domain sub-confidence carries a domain label of its corresponding domain, so the domain corresponding to each target domain sub-confidence is determined according to its domain label, and the 3 domain resource libraries whose domains match those domains are selected from the plurality of domain resource libraries as the target domain resource libraries.
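A minimal sketch of this selection is given below, assuming that the domain sub-confidences are keyed by their domain labels and that each domain label maps to one domain resource library; the names and the toy repository contents are illustrative only.

```python
from typing import Dict, List

def select_target_repositories(domain_confidences: Dict[str, float],
                               domain_repositories: Dict[str, List[str]],
                               preset_number: int = 3) -> Dict[str, List[str]]:
    """Pick the resource libraries of the top-N domains ranked by sub-confidence."""
    top_domains = sorted(domain_confidences, key=domain_confidences.get, reverse=True)[:preset_number]
    # Keep only those top domains for which a resource library is actually configured
    return {label: domain_repositories[label] for label in top_domains if label in domain_repositories}

# Illustrative usage: each "resource library" is just a list of known phrases here
repositories = {
    "speaker_control": ["open the Bluetooth of the smart speaker"],
    "air_conditioner_control": ["set the air conditioner to 26 degrees"],
    "curtain_control": ["close the bedroom curtain"],
}
targets = select_target_repositories(
    {"speaker_control": 0.55, "voice_interaction": 0.30, "curtain_control": 0.10},
    repositories,
)
```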
At this time, the step of determining the post-supplement speech recognition text in the target domain resource libraries according to the text pinyin includes respectively performing pinyin-based text retrieval in each of the target domain resource libraries to obtain a plurality of post-supplement speech recognition texts.
Specifically, recognition texts with the same pronunciation (the same text pinyin) are searched in each target domain resource library, and each retrieved recognition text is taken as a post-supplement speech recognition text, so that a plurality of post-supplement speech recognition texts are obtained. The post-supplement speech recognition texts include the texts retrieved from the domain resource libraries and also include the speech recognition text recognized by the speech recognition model (i.e., the originally recognized text).
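A minimal sketch of this same-pronunciation lookup follows. It assumes Chinese text and uses the third-party pypinyin package for the pinyin conversion; the helper names and the idea of comparing tone-less pinyin strings are illustrative assumptions rather than a prescribed implementation.

```python
from typing import Dict, List

from pypinyin import lazy_pinyin  # third-party package, assumed to be installed

def to_pinyin_key(text: str) -> str:
    """Collapse a text to a tone-less pinyin string so that homophones compare equal."""
    return " ".join(lazy_pinyin(text))

def supplement_by_pinyin(recognized_text: str,
                         target_repositories: Dict[str, List[str]]) -> List[str]:
    """Return the originally recognized text plus every repository entry with the same pronunciation."""
    key = to_pinyin_key(recognized_text)
    candidates = [recognized_text]  # the original recognition is kept as one candidate
    for entries in target_repositories.values():
        candidates.extend(entry for entry in entries if to_pinyin_key(entry) == key)
    return candidates
```

Repository entries whose pinyin matches the recognized text are added as post-supplement candidates alongside the original recognition.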
For example, suppose the text recognized by the voice recognition model for the voice to be recognized is "open the Bluetooth of the Angel smart speaker", but the text confidence calculated at this time is low. The third party resource library is therefore called to perform result supplement processing on the voice recognition text, and the supplementary search retrieves the corresponding text with the same pronunciation, "open the Bluetooth of the Angel smart speaker". The domain confidence is then detected again; since the domain confidence of the text obtained by the supplementary search is found to be higher, the text obtained by the supplementary search is taken as the finally recognized text.
S150, determining the voice text with the highest domain confidence in the supplemented voice recognition texts as the target voice recognition text.
In some embodiments, S150 specifically includes: determining the domain confidence of each post-supplement speech recognition text according to a preset domain confidence calculation model; determining the largest among the domain confidences of the plurality of post-supplement speech recognition texts as the target domain confidence; and determining the post-supplement speech recognition text corresponding to the target domain confidence as the target speech recognition text. The speech recognition model includes the domain confidence calculation model and a content confidence calculation model, and the domain confidence calculation model is used for calculating the domain confidence of an input speech recognition text, specifically according to the similarity of each term in the speech recognition text to each domain.
That is, the domain confidence calculation model may calculate the domain confidence of each input text (each post-supplement speech recognition text) according to the domain probability of each word in the text, obtain the domain confidence corresponding to each post-supplement speech recognition text, determine the highest-valued domain confidence as the target domain confidence, and finally determine the post-supplement speech recognition text corresponding to the target domain confidence as the final recognition text.
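A minimal sketch of this selection step is shown below; domain_confidence_model stands in for the preset domain confidence calculation model and is assumed to return a single score per candidate text, while the toy scoring function is purely illustrative.

```python
from typing import Callable, List

def pick_target_text(candidates: List[str],
                     domain_confidence_model: Callable[[str], float]) -> str:
    """Return the supplemented candidate whose domain confidence is highest."""
    return max(candidates, key=domain_confidence_model)

# Illustrative stand-in for the domain confidence calculation model:
# score a candidate by how many of its words appear in a small per-domain vocabulary.
def toy_domain_confidence(text: str) -> float:
    speaker_vocabulary = {"open", "bluetooth", "speaker"}
    words = set(text.lower().split())
    return len(words & speaker_vocabulary) / max(len(words), 1)

best = pick_target_text(
    ["open the bluetooth of the smart speaker", "open the blue tooth of the smart sneaker"],
    toy_domain_confidence,
)
```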
S160, determining the voice recognition text as the target voice recognition text.
In this embodiment, if the text confidence of the speech recognition text is greater than or equal to the confidence threshold, it indicates that the accuracy of the text recognized by the speech recognition model is relatively high and that the speech recognition model has an accurate recognition result; at this time, the speech recognition text can be directly used as the target speech recognition text.
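Putting steps S130 to S160 together, the branching logic can be sketched as follows. The helper callables and the example threshold value of 0.5 are assumptions for illustration; any supplement routine and domain confidence model with these signatures could be plugged in.

```python
from typing import Callable, List

def resolve_target_text(recognized_text: str,
                        text_confidence: float,
                        supplement: Callable[[str], List[str]],
                        domain_confidence_model: Callable[[str], float],
                        confidence_threshold: float = 0.5) -> str:
    """S130-S160: keep the recognized text when it is confident enough; otherwise
    supplement it from third-party resources and pick the candidate with the
    highest domain confidence."""
    if text_confidence >= confidence_threshold:            # S160: recognition is trusted as-is
        return recognized_text
    candidates = supplement(recognized_text)                # S140: result supplement processing
    return max(candidates, key=domain_confidence_model)     # S150: highest domain confidence wins
```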
Fig. 3 is a flowchart of a voice recognition method according to another embodiment of the present application. As shown in fig. 3, the voice recognition method of the present embodiment includes steps S210 to S290. Steps S210 to S260 are similar to steps S110 to S160 in the above embodiment, and are not described herein. Steps S270 to S290 added in the present embodiment are described in detail below.
S270, determining a semantic understanding model of the domain corresponding to the target voice recognition text.
In this embodiment, the target speech recognition text carries a domain label. In order to improve the accuracy of semantic understanding, the computer device in this embodiment sets corresponding semantic understanding models for different service domains; after the target speech recognition text is determined, the semantic understanding model of the domain corresponding to the target speech recognition text is invoked.
S280, carrying out semantic understanding on the target voice recognition text according to the semantic understanding model to obtain a semantic understanding result.
After the corresponding semantic understanding model is called, semantic understanding is performed by that model. For example, if the target voice recognition text is "open the Bluetooth of the antenna smart speaker", the semantic understanding model parses this sentence, and the obtained semantic understanding result is "call the antenna smart speaker and turn on its Bluetooth function".
S290, executing processing corresponding to the semantic understanding result.
After the semantic understanding result is obtained, corresponding processing is performed according to the semantic understanding result, such as automatic control of a service system, automatic search in a service system, or invocation of other intelligent functions of the service system, for example turning on the Bluetooth function of the antenna smart speaker.
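A minimal sketch of steps S270 to S290 is given below, assuming that one semantic understanding model and one handler are registered per service domain. The registries, the dictionary-shaped understanding result, and the handler names are all hypothetical; they only illustrate dispatching by domain label.

```python
from typing import Callable, Dict

# Hypothetical registries: one semantic understanding model per domain, one handler per intent
SEMANTIC_MODELS: Dict[str, Callable[[str], dict]] = {
    "speaker_control": lambda text: {"intent": "open_bluetooth", "device": "smart speaker"},
}
HANDLERS: Dict[str, Callable[[dict], None]] = {
    "open_bluetooth": lambda result: print(f"turning on Bluetooth of the {result['device']}"),
}

def understand_and_execute(target_text: str, domain_label: str) -> None:
    """S270: pick the domain's model; S280: run semantic understanding; S290: execute the result."""
    model = SEMANTIC_MODELS[domain_label]       # S270: semantic understanding model of the domain
    result = model(target_text)                 # S280: semantic understanding result
    HANDLERS[result["intent"]](result)          # S290: processing corresponding to the result

understand_and_execute("open the Bluetooth of the smart speaker", "speaker_control")
```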
Therefore, the application can introduce external information resources into the voice recognition system to complete the recognition of content that is recognized inaccurately because the language recognition model is not updated in time, can call different supplementary information resources to perform renewed recognition according to the judged domain of the voice, and can recommend the recognition result semantically closest to the user's recognition requirement according to the judged domain of the voice.
During speech recognition, recognition of content not covered by the language recognition model in the speech recognition engine is completed. In the recognition process, a same-pronunciation (pinyin) content search can be performed by calling a configurable industry scene vocabulary dictionary, a social network search interface, and the content of domain knowledge bases. The retrieved texts in one or more domains are selected as supplements to the recognition result, and the application further provides semantically related recognition results as supplementary recognition result candidates for the user. The application solves the problem that the language recognition model in the speech recognition engine cannot be updated in time, and forms an open, Internet/mobile-Internet-oriented architecture for the speech recognition engine, so that the recognition results better match search and application requirements.
In some embodiments, prior to performing the present application, a speech recognition model and a semantic understanding model need to be constructed, as follows:
1. Obtaining the scene and the control command words of a terminal, such as the terminal name (standard name, short name, unique name, industry-specific name, and the like);
2. Constructing an industry scene semantic understanding model within the semantic understanding model: based on the industry name and the terminal name, searching industry resource libraries, including professional Internet industry websites, to construct a semantic model understood by the industry terminal;
3. Repeating the above actions based on the control command words of the terminal to construct a control command understanding model within the semantic understanding model (in essence, obtaining more command word sets of one terminal, or one type of terminal, under a certain scene in the industry; a minimal sketch follows this list);
4. Constructing a voice recognition model according to the user's common command phrases.
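As a rough illustration of steps 1 to 3 above, the sketch below assembles a per-scene command word set from a handful of resource entries and wraps it in a simple keyword-based command understanding model. The data, matching rule, and function names are illustrative assumptions; they are not the construction procedure itself.

```python
from typing import List, Optional, Set

def build_command_word_set(terminal_names: List[str], resource_entries: List[str]) -> Set[str]:
    """Collect command phrases that mention any known name of the terminal (step 3)."""
    return {entry for entry in resource_entries
            if any(name.lower() in entry.lower() for name in terminal_names)}

def understand_command(utterance: str, command_words: Set[str]) -> Optional[str]:
    """Toy control-command understanding model: return the matching command phrase, if any."""
    for command in command_words:
        if command.lower() in utterance.lower():
            return command
    return None

# Illustrative data for one industry scene
commands = build_command_word_set(
    terminal_names=["smart speaker"],
    resource_entries=["turn on the smart speaker", "open the Bluetooth of the smart speaker",
                      "close the bedroom curtain"],
)
matched = understand_command("please open the Bluetooth of the smart speaker", commands)
```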
The application thus solves the problem that the language model in the voice recognition engine cannot be updated in time, and forms an open, Internet/mobile-Internet-oriented architecture for the voice recognition engine.
The application can be applied to a voice recognition system, a call center automatic question answering system and other various search engine systems.
For example, when the method and system are applied to speech recognition in a call center, speech recognition and semantic understanding of the services handled by the call center can be realized, enabling the discovery of hot-spot problems in the call center. In the recording data of the call center, new descriptions of things by users, or newly occurring events and content, can be identified by exploiting the strong timeliness of information resource content on the Internet and the strong pertinence of vertical information resource content, so as to identify those recognition requests that the language model has not been trained on.
The method first imports information in the vertical domain into the system and extracts corresponding features to train the speech recognition model, the semantic understanding model, the search engine model, and the like. When the front end obtains a speech input signal, the speech recognition model recognizes the audio signal; when the input signal is found to be similar to the trained vertical domain information, the speech input signal is recognized in a targeted manner. After recognition is completed, the recognition result is scored by the model, and when the score exceeds a certain threshold, the corresponding text is output together with a domain label. The semantic understanding module then calls the semantic understanding model rules of the corresponding domain to perform the corresponding semantic understanding and information search. In this way, natural language recognition capability in a given vertical domain can be formed, together with the corresponding semantic understanding and information search capability. The vertical domain semantic features can be invoked by different speech recognition engines to form a core feature library for semantic understanding of such domain knowledge, which can be gradually expanded and perfected, so that speech recognition coverage of the selected service domain, with the corresponding semantic understanding and information search capability, can be built up step by step.
In addition, the application can provide a secondary disambiguation capability through natural language understanding of the text, so that possible errors in the speech recognition output can be corrected and a more accurate result can be achieved.
In summary, the method includes: acquiring a voice to be recognized; inputting the voice to be recognized into a preset voice recognition model to obtain a voice recognition text and the content confidence and domain confidence of the voice recognition text; judging whether the text confidence of the voice recognition text is smaller than a preset confidence threshold according to the content confidence and the domain confidence; if the text confidence of the voice recognition text is smaller than the confidence threshold, calling a third party resource library to perform result supplement processing on the voice recognition text; and determining the voice text with the highest domain confidence in the supplemented voice recognition texts as the target voice recognition text. In the embodiment of the application, when the confidence of the text recognized by the voice recognition model is low, a third party resource library is called to supplement the voice recognition text, and the text with the highest domain confidence is then selected from the supplemented voice recognition texts as the target voice recognition text. In this way, when the accuracy of the voice recognition text obtained from the preset voice recognition model is not high, the voice recognition text can be supplemented through other databases, and the optimal text is then selected from the supplemented texts as the target voice recognition text.
Fig. 4 is a schematic block diagram of a voice recognition apparatus according to an embodiment of the present application. As shown in fig. 4, the present application also provides a voice recognition device corresponding to the above voice recognition method. The speech recognition device comprises means for performing the above-described speech recognition method, which device can be arranged in a desktop computer, a tablet computer, a laptop computer, etc. Specifically, referring to fig. 4, the voice recognition apparatus includes an acquisition unit 401, an input unit 402, a judgment unit 403, a supplementation unit 404, and a first determination unit 405.
An acquisition unit 401 for acquiring a voice to be recognized;
An input unit 402, configured to input the voice to be recognized into a preset voice recognition model, to obtain a voice recognition text, a content confidence level of the voice recognition text, and a domain confidence level;
A judging unit 403, configured to judge whether the text confidence of the speech recognition text is smaller than a preset confidence threshold according to the content confidence and the domain confidence;
The supplementing unit 404 is configured to invoke a third party resource library to perform result supplementing processing on the speech recognition text when the text confidence of the speech recognition text is smaller than the confidence threshold, so as to obtain a supplemented speech recognition text;
the first determining unit 405 is configured to determine, as the target speech recognition text, a speech text with the highest domain confidence in the post-supplement speech recognition text.
In some embodiments, the third party repository includes a plurality of domain repositories, and the supplementing unit 404 is specifically configured to:
acquiring text pinyin corresponding to the voice recognition text;
Determining a target domain resource library from a plurality of domain resource libraries according to the domain confidence level;
And determining the supplemented voice recognition text in the target domain resource library according to the text pinyin.
In some embodiments, the domain confidence includes a plurality of domain sub-confidences, and when determining a target domain resource library from the plurality of domain resource libraries according to the domain confidence, the supplementing unit 404 is specifically further configured to:
extracting a preset number of the highest-valued sub-confidences from the plurality of domain sub-confidences to obtain a plurality of target domain sub-confidences;
determining, from the plurality of domain resource libraries, the domain resource libraries respectively corresponding to the target domain sub-confidences to obtain a plurality of target domain resource libraries;
the determining the post-supplement speech recognition text in the target domain resource library according to the text pinyin comprises the following steps:
and respectively performing pinyin-based text retrieval in each target domain resource library according to the text pinyin to obtain a plurality of supplemented voice recognition texts.
In some embodiments, the first determining unit 405 is specifically configured to:
respectively determining the domain confidence coefficient of each supplemented voice recognition text according to a preset domain confidence coefficient calculation model;
determining the largest among the domain confidences of the plurality of supplemented voice recognition texts as the target domain confidence;
And determining the supplemented voice recognition text corresponding to the target domain confidence as the target voice recognition text.
In some embodiments, the determining unit 403 is specifically configured to:
determining the text confidence according to the content confidence, a preset content confidence weight, the domain confidence and a preset domain confidence weight;
And judging whether the text confidence is smaller than the confidence threshold.
Fig. 5 is a schematic block diagram of a speech recognition apparatus according to another embodiment of the present application. As shown in fig. 5, the voice recognition apparatus of the present embodiment is an embodiment in which a second determination unit 406, an understanding unit 407, a processing unit 408, and a third determination unit 409 are added to the above embodiments.
A second determining unit 406, configured to determine a semantic understanding model in the domain corresponding to the target speech recognition text;
an understanding unit 407, configured to perform semantic understanding on the target speech recognition text according to the semantic understanding model, so as to obtain a semantic understanding result;
And a processing unit 408, configured to perform processing corresponding to the semantic understanding result.
A third determining unit 409, configured to determine the speech recognition text as the target speech recognition text when the text confidence of the speech recognition text is greater than or equal to the confidence threshold.
It should be noted that, as those skilled in the art can clearly understand the specific implementation process of the above voice recognition device and each unit, reference may be made to the corresponding description in the foregoing method embodiments, and for convenience and brevity of description, details are not repeated here.
The speech recognition apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 6.
Referring to fig. 6, fig. 6 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 600 may be a terminal or a server, where the terminal may be an electronic device with a communication function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device. The server may be an independent server or a server cluster formed by a plurality of servers.
With reference to FIG. 6, the computer device 600 includes a processor 602, memory and a network interface 605 connected by a system bus 601, wherein the memory may include a non-volatile storage medium 603 and an internal memory 604.
The non-volatile storage medium 603 may store an operating system 6031 and a computer program 6032. The computer program 6032 comprises program instructions that, when executed, cause the processor 602 to perform a speech recognition method.
The processor 602 is used to provide computing and control capabilities to support the operation of the overall computer device 600.
The internal memory 604 provides an environment for the execution of a computer program 6032 in the non-volatile storage medium 603, which computer program 6032, when executed by the processor 602, causes the processor 602 to perform a speech recognition method.
The network interface 605 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of some of the structures related to the solution of the present application and does not limit the computer device 600 to which the solution of the present application is applied; a particular computer device 600 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
Wherein the processor 602 is configured to execute a computer program 6032 stored in a memory to implement the steps of:
Acquiring voice to be recognized;
Inputting the voice to be recognized into a preset voice recognition model to obtain a voice recognition text, and content confidence and field confidence of the voice recognition text;
Judging whether the text confidence of the voice recognition text is smaller than a preset confidence threshold according to the content confidence and the domain confidence;
if the text confidence of the voice recognition text is smaller than the confidence threshold, a third party resource library is called to perform result supplement processing on the voice recognition text to obtain a supplemented voice recognition text;
And determining the voice text with the highest domain confidence in the supplemented voice recognition text as the target voice recognition text.
In some embodiments, the third party resource library includes a plurality of domain resource libraries, and when the processor 602 performs the step of calling the third party resource library to perform result supplement processing on the speech recognition text to obtain a post-supplement speech recognition text, the following steps are specifically implemented:
acquiring text pinyin corresponding to the voice recognition text;
Determining a target domain resource library from a plurality of domain resource libraries according to the domain confidence level;
And determining the supplemented voice recognition text in the target domain resource library according to the text pinyin.
In some embodiments, the domain confidence level includes a plurality of domain sub-confidence levels, and the processor 602, when implementing the step of determining the target domain resource library from the plurality of domain resource libraries according to the domain confidence levels, specifically implements the following steps:
extracting a preset number of the highest-valued sub-confidences from the plurality of domain sub-confidences to obtain a plurality of target domain sub-confidences;
determining, from the plurality of domain resource libraries, the domain resource libraries respectively corresponding to the target domain sub-confidences to obtain a plurality of target domain resource libraries;
the determining the post-supplement speech recognition text in the target domain resource library according to the text pinyin comprises the following steps:
and respectively performing pinyin-based text retrieval in each target domain resource library according to the text pinyin to obtain a plurality of supplemented voice recognition texts.
In some embodiments, when the step of determining the speech text with the highest domain confidence in the post-supplement speech recognition text as the target speech recognition text is implemented by the processor 602, the following steps are specifically implemented:
respectively determining the domain confidence coefficient of each supplemented voice recognition text according to a preset domain confidence coefficient calculation model;
determining the largest among the domain confidences of the plurality of supplemented voice recognition texts as the target domain confidence;
And determining the supplemented voice recognition text corresponding to the target domain confidence as the target voice recognition text.
In some embodiments, when implementing the step of determining whether the text confidence of the speech recognition text is less than a preset confidence threshold according to the content confidence and the domain confidence, the processor 602 specifically implements the following steps:
determining the text confidence according to the content confidence, a preset content confidence weight, the domain confidence and a preset domain confidence weight;
And judging whether the text confidence is smaller than the confidence threshold.
In some embodiments, after implementing the step of determining the speech text with the highest domain confidence in the post-supplement speech recognition text as the target speech recognition text, the processor 602 further implements the following steps:
Determining a semantic understanding model of the domain corresponding to the target voice recognition text;
carrying out semantic understanding on the target voice recognition text according to the semantic understanding model to obtain a semantic understanding result;
And executing processing corresponding to the semantic understanding result.
In some embodiments, after implementing the step of determining whether the text confidence of the speech recognition text is less than a preset confidence threshold according to the content confidence and the domain confidence, the processor 602 further implements the following steps:
And if the text confidence of the voice recognition text is greater than or equal to the confidence threshold, determining the voice recognition text as the target voice recognition text.
It should be appreciated that in embodiments of the present application, the processor 602 may be a central processing unit (CPU), and the processor 602 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Those skilled in the art will appreciate that all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing related hardware. The computer program comprises program instructions and may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present application also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program, wherein the computer program includes program instructions. The program instructions, when executed by the processor, cause the processor to perform the steps of:
Acquiring voice to be recognized;
Inputting the voice to be recognized into a preset voice recognition model to obtain a voice recognition text, and content confidence and field confidence of the voice recognition text;
Judging whether the text confidence of the voice recognition text is smaller than a preset confidence threshold according to the content confidence and the domain confidence;
if the text confidence of the voice recognition text is smaller than the confidence threshold, a third party resource library is called to perform result supplement processing on the voice recognition text to obtain a supplemented voice recognition text;
And determining the voice text with the highest domain confidence in the supplemented voice recognition text as the target voice recognition text.
In some embodiments, the third party resource library includes a plurality of domain resource libraries, and when the processor executes the program instructions to implement the step of calling the third party resource library to perform result supplement processing on the speech recognition text to obtain a post-supplement speech recognition text, the processor specifically implements the following steps:
acquiring text pinyin corresponding to the voice recognition text;
Determining a target domain resource library from a plurality of domain resource libraries according to the domain confidence level;
And determining the supplemented voice recognition text in the target domain resource library according to the text pinyin.
In some embodiments, the domain confidence level includes a plurality of domain sub-confidence levels, and the processor, when executing the program instructions to implement the step of determining a target domain resource library from a plurality of the domain resource libraries according to the domain confidence level, specifically implements the steps of:
extracting a preset number of the highest-valued sub-confidences from the plurality of domain sub-confidences to obtain a plurality of target domain sub-confidences;
determining, from the plurality of domain resource libraries, the domain resource libraries respectively corresponding to the target domain sub-confidences to obtain a plurality of target domain resource libraries;
the determining the post-supplement speech recognition text in the target domain resource library according to the text pinyin comprises the following steps:
and respectively performing pinyin-based text retrieval in each target domain resource library according to the text pinyin to obtain a plurality of supplemented voice recognition texts.
In some embodiments, when the processor executes the program instructions to implement the step of determining the speech text with the highest domain confidence in the supplemented speech recognition text as the target speech recognition text, the method specifically includes the following steps:
respectively determining the domain confidence coefficient of each supplemented voice recognition text according to a preset domain confidence coefficient calculation model;
determining the largest among the domain confidences of the plurality of supplemented voice recognition texts as the target domain confidence;
And determining the supplemented voice recognition text corresponding to the target domain confidence as the target voice recognition text.
In some embodiments, when the processor executes the program instructions to implement the step of determining whether the text confidence level of the speech recognition text is less than a preset confidence level threshold according to the content confidence level and the domain confidence level, the method specifically includes the following steps:
determining the text confidence according to the content confidence, a preset content confidence weight, the domain confidence and a preset domain confidence weight;
And judging whether the text confidence is smaller than the confidence threshold.
In some embodiments, after executing the program instructions to implement the step of determining the speech text with the highest domain confidence in the supplemented speech recognition text as the target speech recognition text, the processor further implements the steps of:
Determining a semantic understanding model of the domain corresponding to the target voice recognition text;
carrying out semantic understanding on the target voice recognition text according to the semantic understanding model to obtain a semantic understanding result;
And executing processing corresponding to the semantic understanding result.
In some embodiments, after executing the program instructions to implement the step of determining whether the text confidence level of the speech recognition text is less than a preset confidence threshold according to the content confidence level and the domain confidence level, the processor further implements the steps of:
And if the text confidence of the voice recognition text is greater than or equal to the confidence threshold, determining the voice recognition text as the target voice recognition text.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two, and the units and steps of the examples have been described above generally in terms of their functions to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.