CN110970018B - Speech recognition method and device - Google Patents
Speech recognition method and device
- Publication number
- CN110970018B (application CN201811143178.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Abstract
The invention discloses a speech recognition method and apparatus. The method comprises: acquiring voice information; based on the voice information, determining the language to which the voice information belongs according to a pre-trained language judgment model, wherein the language judgment model is obtained by training on training data comprising voice information in a plurality of languages and labels indicating the language to which each piece of voice information belongs; and calling the language conversion module corresponding to the language to which the voice information belongs, so as to convert the voice information into corresponding text information. The invention solves the technical problem in the prior art that the speech transcription process can only run in a fixed-language mode, resulting in a low degree of intelligence.
Description
Technical Field
The invention relates to the field of language processing, in particular to a voice recognition method and a voice recognition device.
Background
Artificial intelligence technology is developing rapidly, smart home devices have an ever-greater influence on users' daily lives, and their convenience continues to improve; nevertheless, some problems still call for improvement.
For example, an existing smart home system can receive voice commands, but only fixed ones: if the language in which a voice command is issued changes, the device has difficulty recognizing the command correctly when transcribing speech to text; and if a voice command mixes speech in several languages, the device still transcribes all of the voice information as a single language. The degree of intelligence is therefore low, and the user experience is poor.
No effective solution has yet been proposed for the problem in the prior art that the speech transcription process can only run in a fixed-language mode, resulting in a low degree of intelligence.
Disclosure of Invention
The embodiments of the invention provide a speech recognition method and apparatus, so as to at least solve the technical problem in the prior art that the speech transcription process can only run in a fixed-language mode, resulting in a low degree of intelligence.
According to one aspect of the embodiments of the present invention, there is provided a speech recognition method comprising: acquiring voice information; based on the voice information, determining the language to which the voice information belongs according to a pre-trained language judgment model, wherein the language judgment model is obtained by training on training data comprising voice information in a plurality of languages and labels indicating the language to which each piece of voice information belongs; and calling the language conversion module corresponding to the language to which the voice information belongs, so as to convert the voice information into corresponding text information.
Further, acquiring collected sound information; and denoising the sound information to obtain the voice information of the target object.
Further, inputting the voice information into a language judgment model, wherein the language judgment model outputs the probability that the voice information belongs to each candidate language; and determining the language to which the voice information belongs according to the probability corresponding to each candidate language.
Further, acquiring preset candidate languages and weights corresponding to the candidate languages; and determining the language to which the voice information belongs according to the probability corresponding to each candidate language and the weight of each candidate language.
Further, after calling the language conversion module corresponding to the language to which the voice information belongs and converting the voice information into corresponding text information, a preset text display type is acquired, and the text information is displayed according to that text display type.
Further, before obtaining the voice information, obtaining a language judgment model, wherein obtaining the language judgment model includes: acquiring training data and an initial convolutional neural network model, wherein the initial convolutional neural network model has initial network parameters; and training the initial convolutional neural network model by using the training data to obtain target network parameters, wherein the target network parameters are used for forming a language judgment model.
According to another aspect of the embodiments of the present invention, there is also provided a speech recognition apparatus, including: an acquisition module, configured to acquire voice information; a determining module, configured to determine, based on the voice information, the language to which the voice information belongs according to a pre-trained language judgment model, where the language judgment model is obtained by training on training data comprising voice information in a plurality of languages and labels indicating the language to which each piece of voice information belongs; and a conversion module, configured to call the language conversion module corresponding to the language to which the voice information belongs and convert the voice information into corresponding text information.
Further, the acquisition module includes: the acquisition submodule is used for acquiring the acquired sound information; and the processing submodule is used for carrying out denoising processing on the sound information to obtain the voice information of the target object.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the above-mentioned speech recognition method.
According to another aspect of the embodiments of the present invention, there is also provided a processor, wherein the processor is configured to execute a program, and wherein the program executes to perform the above-mentioned speech recognition method.
In the embodiments of the invention, voice information is acquired; the language to which it belongs is determined, based on the voice information, according to a pre-trained language judgment model; and the language conversion module corresponding to that language is called to convert the voice information into corresponding text information. In this scheme the language of the voice information is judged by the language judgment model, avoiding the inaccurate recognition that results from recognizing speech in multiple languages according to a single language, and thereby solving the technical problem in the prior art that the speech transcription process can only run in a fixed-language mode, resulting in a low degree of intelligence.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a speech recognition method according to an embodiment of the present invention; and
fig. 2 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, there is provided an embodiment of a speech recognition method, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order different than here.
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:
step S102, voice information is acquired.
Specifically, the voice information may be speech uttered by the user directly, or speech issued through another device under the user's control.
The speech recognition method can be applied in various scenarios. In a smart home scenario, for example, it can be applied to air-conditioning equipment so that a user controls the air conditioner by voice; it can likewise be applied to the control of an intelligent terminal.
In an alternative embodiment, taking application in an intelligent terminal as an example, the terminal may pick up the user's voice information through a microphone; similarly, in a smart home system, the home devices may also pick up the user's voice information through a microphone.
Step S104, based on the voice information, determining the language to which the voice information belongs according to a language judgment model obtained in advance, wherein the language judgment model is obtained by training according to training data, and the training data comprises: the voice information of a plurality of languages and a label for indicating the language to which the voice information belongs.
In the foregoing solution, the language judgment model may be obtained by pre-training according to training data, and the speech information is judged according to the pre-obtained language judgment model, so that corresponding speech recognition can be performed according to a language judgment result of the speech information.
In an optional embodiment, after receiving the voice information, the device does not input it to the speech recognition module directly, but first inputs it to the language judgment model to determine the language of the speech.
The training data comprises voice information of a plurality of languages and a label used for representing the language to which the voice information belongs, the label and the language have a preset corresponding relation, and the label is used for determining the language to which the voice information in the training data belongs when the language judgment model is trained.
It should be noted here that the voice information received by the device may belong entirely to one language: for example, a command spoken wholly in Chinese belongs to the Chinese language, while "turn on the TV" belongs to the English language. However, a single sentence may also contain speech in multiple languages, for example "play My Heart Will Go On", where "play" is spoken in Chinese and the song title in English. Therefore, when the language judgment model judges the language to which the voice information belongs, it can either judge the language of the whole sentence, or segment the sentence and judge each word individually.
Step S106, calling a corresponding language conversion module according to the language to which the language information belongs, and converting the voice information into corresponding character information.
In an alternative embodiment, a plurality of speech conversion modules may be provided in the speech recognition system, for example a Chinese speech conversion module, an English speech conversion module, a French speech conversion module, and so on. During speech recognition, the conversion module corresponding to the determined language of the voice information is called to perform the recognition.
In another alternative embodiment, the Chinese speech conversion module may be further subdivided by dialect, for example into a Cantonese conversion module, a Henan dialect conversion module, a Mandarin conversion module, and so on. When the language to which the voice information belongs is determined to be Chinese, the dialect of the voice information can then be determined as well, so that the conversion module is called more precisely and the accuracy of recognition is further improved.
In the prior art, a device always recognizes speech according to the recognition language specified by the user. If the user specifies Chinese, then even when the user utters voice information containing a foreign language, the device still recognizes it as Chinese. For example, suppose the user instructs a smart speaker, in Chinese, to "play My Heart Will Go On": the device can only recognize "play" and has difficulty recognizing "My Heart Will Go On", so it cannot correctly execute the user's instruction.
In the above scheme, the device may first determine the language to which the voice information belongs and then recognize it accordingly. Taking the voice instruction "play My Heart Will Go On" as an example again: after the smart speaker determines that "play" is Chinese and "My Heart Will Go On" is English, it may call the Chinese speech recognition module to recognize "play" and the English recognition module to recognize "My Heart Will Go On", so as to determine the user's instruction and execute it accurately.
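The per-segment dispatch described above can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: `judge_language` stands in for the trained language judgment model (here it crudely treats ASCII-only segments as English), and the converter callables are placeholders for real speech-to-text modules.

```python
def judge_language(segment):
    # Stand-in for the trained language judgment model: treat ASCII-only
    # segments as English and anything else as Chinese (illustrative only).
    return "en" if segment.isascii() else "zh"

def recognize(segments, converters):
    """Route each segment to the conversion module for its detected language."""
    parts = []
    for seg in segments:
        lang = judge_language(seg)
        convert = converters.get(lang, converters["zh"])  # default fallback
        parts.append(convert(seg))
    return " ".join(parts)

converters = {
    "zh": lambda s: f"<zh:{s}>",   # placeholder Chinese conversion module
    "en": lambda s: f"<en:{s}>",   # placeholder English conversion module
}

# Mixed-language command: Chinese "播放" ("play") + an English song title.
result = recognize(["播放", "my heart will go on"], converters)
print(result)  # <zh:播放> <en:my heart will go on>
```

Each segment is routed independently, so a single utterance can exercise several conversion modules.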
Therefore, in the embodiments of the present application, voice information is acquired; the language to which it belongs is determined, based on the voice information, according to a pre-trained language judgment model; and the language conversion module corresponding to that language is called to convert the voice information into corresponding text information. In this scheme the language of the voice information is judged by the language judgment model, avoiding the inaccurate recognition that results from recognizing speech in multiple languages according to a single language, and thereby solving the technical problem in the prior art that the speech transcription process can only run in a fixed-language mode, resulting in a low degree of intelligence.
As an alternative embodiment, acquiring the voice information includes: acquiring collected sound information; and denoising the sound information to obtain the voice information of the target object.
When the device collects sound information, other interfering sounds may be present in the environment. In an indoor environment, for example, besides the user's speech there may also be sounds emitted by other devices such as a television or a stereo, which make it difficult to obtain an accurate recognition result. After collecting the sound information, the device may therefore denoise it, for example by filtering, so as to extract the voice information from the sound information.
In an alternative embodiment, the sound information may be denoised by means of the wavelet transform: the signal is decomposed into multiple scales, each layer of wavelet coefficients is thresholded to separate the noise coefficients from those of the sound signal, and the signal is then recovered with a wavelet reconstruction algorithm, achieving noise reduction.
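A minimal sketch of wavelet-threshold denoising, assuming a single-level Haar transform with soft thresholding (a real system would likely use a multi-scale decomposition, e.g. via the PyWavelets library; the signal values and threshold here are illustrative):

```python
import numpy as np

def haar_denoise(signal, threshold):
    """One-level Haar wavelet denoising: decompose, soft-threshold the
    detail (high-frequency) coefficients, then reconstruct."""
    s = np.asarray(signal, dtype=float)       # length must be even
    approx = (s[0::2] + s[1::2]) / np.sqrt(2)  # low-pass (approximation)
    detail = (s[0::2] - s[1::2]) / np.sqrt(2)  # high-pass (detail / noise)
    # Soft thresholding shrinks small detail coefficients toward zero.
    detail = np.sign(detail) * np.maximum(np.abs(detail) - threshold, 0.0)
    # Inverse Haar transform recovers the denoised signal.
    out = np.empty_like(s)
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

clean = haar_denoise([1.0, 1.1, 1.0, 0.9], threshold=0.2)
print(clean)
```

With the small high-frequency fluctuations suppressed, each pair of samples collapses toward its local average, which is the noise-reduction effect described above.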
As an alternative embodiment, determining, based on the speech information, a language to which the speech information belongs according to a language judgment model obtained in advance includes: inputting the voice information into a language judgment model, wherein the language judgment model outputs the probability that the voice information belongs to each candidate language; and determining the language to which the voice information belongs according to the probability corresponding to each candidate language.
Specifically, the candidate languages may be selected by the user in advance, or preset as device defaults. The language judgment model predicts a probability for each candidate language, representing the likelihood that the voice information belongs to that language, and the language to which the voice information belongs is determined from these predictions.
In an alternative embodiment, the candidate language with the highest probability may be determined to be the language to which the voice information belongs. For example, if the candidate languages are Chinese and English and, after the device receives the voice information, the language judgment model predicts a 98% probability that it is Chinese and a 2% probability that it is English, the sentence can be determined to be Chinese.
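The highest-probability selection can be expressed in a few lines; the probability values below mirror the example above (the function name is illustrative, not from the patent):

```python
def pick_language(probs):
    """Return the candidate language with the highest predicted probability."""
    return max(probs, key=probs.get)

# Illustrative model output for the Chinese/English example above.
probs = {"zh": 0.98, "en": 0.02}
print(pick_language(probs))  # zh
```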
As an alternative embodiment, determining the language to which the speech information belongs according to the probability corresponding to each candidate language includes: acquiring preset candidate languages and weights corresponding to the candidate languages; and determining the language to which the voice information belongs according to the probability corresponding to each candidate language and the weight of each candidate language.
Specifically, the candidate languages and their weights may be set by the user as needed. Taking a smart air conditioner in a smart home scenario as an example: if a family includes both native Chinese speakers and native English speakers, the candidate languages can be set to Chinese and English, and since more family members speak Chinese natively, the weight of Chinese can be set greater than that of English.
After the probability result output by the language judgment model is obtained, each candidate language's probability is multiplied by its corresponding weight, and the language to which the voice information belongs is determined from these weighted probabilities.
In an alternative embodiment, the candidate languages are set to Chinese and English, with weights of 0.65 for Chinese and 0.35 for English. The language judgment model's prediction for the voice information is: Chinese 0.6, English 0.4. Multiplying the predicted probability 0.6 for Chinese by its weight 0.65 gives 0.39, and multiplying the predicted probability 0.4 for English by its weight 0.35 gives 0.14; since 0.39 is greater than 0.14, the voice information is determined to belong to Chinese.
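The weighted selection can be sketched as below, reproducing the arithmetic of the example (function and variable names are illustrative):

```python
def pick_language_weighted(probs, weights):
    """Multiply each candidate language's predicted probability by its
    preset weight and return the language with the largest weighted score."""
    scores = {lang: probs[lang] * weights.get(lang, 1.0) for lang in probs}
    return max(scores, key=scores.get), scores

lang, scores = pick_language_weighted(
    probs={"zh": 0.6, "en": 0.4},        # model prediction
    weights={"zh": 0.65, "en": 0.35},    # user-configured weights
)
print(lang)  # zh  (0.6 * 0.65 = 0.39 beats 0.4 * 0.35 = 0.14)
```

A missing weight defaults to 1.0 here, so unweighted candidates fall back to the plain probability comparison.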
In application scenarios such as the smart home, the voice information received by the device is usually a voice instruction to the device, so once the device recognizes the voice information and obtains the instruction, it can execute it. In other scenarios, however, the recognition result needs to be displayed: in a voice-input scenario, for example, the result must be shown in the terminal's display interface; and for users who are deaf or hard of hearing, another person's voice information needs to be recognized as text and displayed so that they can communicate. Therefore, as an optional embodiment, after calling the language conversion module corresponding to the language to which the voice information belongs and converting the voice information into corresponding text information, the method further includes: acquiring a preset text display type; and displaying the text information according to the text display type.
Specifically, the text display type may cover typeface, font size, color, and the like. Taking Chinese as an example, the display types may include Simplified and Traditional Chinese, with Traditional further divided into the Hong Kong and Taiwan variants, and typefaces such as Song and regular script; for English, the display type may include, for example, ornate script or plain styles.
In the above scheme, the user selects the desired display type in advance, and the device displays the recognition result according to the type selected.
As an optional embodiment, before acquiring the voice information, the method further includes: obtaining a language judgment model, wherein obtaining the language judgment model comprises: acquiring training data and an initial convolutional neural network model, wherein the initial convolutional neural network model has initial network parameters; and training the initial convolutional neural network model by using the training data to obtain target network parameters, wherein the target network parameters are used for forming a language judgment model.
Specifically, the training data may include multiple sets of training data, each set of training data includes at least one piece of voice information and a label corresponding to the voice information, and the label corresponding to the voice information is used to indicate a language to which the voice information belongs. And training the initial convolutional neural network model by using the training data to obtain a language judgment model.
In an optional embodiment, the speech information in the training data may be input to the initial convolutional neural network model, and the cross-entropy loss between the model's output and the label corresponding to that speech information may then be computed. After sufficient training the cross-entropy loss converges, yielding the network parameters of the language judgment model and thus the model itself.
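The training loop described above can be sketched with a toy stand-in. This is not the patent's network: a plain softmax classifier replaces the convolutional neural network, the 2-D "features" and two classes (standing in for Chinese/English) are synthetic, and only the cross-entropy gradient-descent loop matches the description:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, well-separated "acoustic features" for two language classes.
X = np.vstack([rng.normal(-1, 0.3, (50, 2)),   # class 0 (e.g. "Chinese")
               rng.normal(+1, 0.3, (50, 2))])  # class 1 (e.g. "English")
y = np.array([0] * 50 + [1] * 50)

W = np.zeros((2, 2))  # initial network parameters
b = np.zeros(2)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(500):                      # iterate until the loss converges
    p = softmax(X @ W + b)
    onehot = np.eye(2)[y]
    grad = p - onehot                     # d(cross-entropy)/d(logits)
    W -= 0.1 * (X.T @ grad) / len(X)
    b -= 0.1 * grad.mean(axis=0)

loss = -np.log(softmax(X @ W + b)[np.arange(len(y)), y]).mean()
acc = (softmax(X @ W + b).argmax(axis=1) == y).mean()
print(f"loss={loss:.3f} acc={acc:.2f}")
```

The converged parameters `W` and `b` play the role of the "target network parameters" that form the language judgment model.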
Example 2
According to an embodiment of the present invention, there is provided a speech recognition apparatus, and fig. 2 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present invention, and with reference to fig. 2, the apparatus includes:
and an obtaining module 20, configured to obtain the voice information.
The determining module 22 is configured to determine, based on the speech information, a language to which the speech information belongs according to a language judgment model obtained in advance, where the language judgment model is obtained by training according to training data, and the training data includes: the voice information of a plurality of languages and a label for indicating the language to which the voice information belongs.
And the conversion module 24 is configured to call the language conversion module corresponding to the language to which the voice information belongs and convert the voice information into corresponding text information.
As an alternative embodiment, the obtaining module includes: the acquisition submodule is used for acquiring the acquired sound information; and the processing submodule is used for carrying out denoising processing on the sound information to obtain the voice information of the target object.
As an alternative embodiment, the determining module includes: the input submodule is used for inputting the voice information into the language judgment model, wherein the language judgment model outputs the probability that the voice information belongs to each candidate language; and the determining submodule is used for determining the language to which the voice information belongs according to the probability corresponding to each candidate language.
As an alternative embodiment, the determining sub-module includes: the acquiring unit is used for acquiring preset candidate languages and weights corresponding to the candidate languages; and the determining unit is used for determining the language to which the voice information belongs according to the probability corresponding to each candidate language and the weight of each candidate language.
As an alternative embodiment, the apparatus further comprises: a type acquisition module, configured to acquire a preset text display type after the language conversion module corresponding to the language to which the voice information belongs is called and the voice information is converted into corresponding text information; and a display module, configured to display the text information according to the text display type.
As an alternative embodiment, the apparatus further comprises: the model acquisition module is used for acquiring the language judgment model before acquiring the voice information, wherein the model acquisition module comprises: the acquisition submodule is used for acquiring training data and an initial convolutional neural network model, wherein the initial convolutional neural network model has initial network parameters; and the training module is used for training the initial convolutional neural network model by using the training data to obtain target network parameters, wherein the target network parameters are used for forming a language judgment model.
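The patent trains a convolutional neural network from initial network parameters to target network parameters. As a stand-in sketch only, the loop below trains a tiny softmax classifier on precomputed feature vectors; it is not the CNN the patent describes, and the feature dimensionality, learning rate, and epoch count are all assumptions:

```python
import numpy as np

def train_language_model(features, labels, n_langs, lr=0.1, epochs=200):
    """Train a linear softmax classifier (CNN stand-in): start from
    random initial parameters and gradient-descend to the target
    parameters that form the language judgment model."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(features.shape[1], n_langs))  # initial network parameters
    b = np.zeros(n_langs)
    Y = np.eye(n_langs)[labels]                 # one-hot language labels
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)       # softmax probabilities
        grad = (p - Y) / len(features)          # cross-entropy gradient
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b                                 # target network parameters

def predict_probs(W, b, x):
    """Probability that the input belongs to each candidate language."""
    logits = x @ W + b
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()
```

The returned probabilities per candidate language are exactly what the determining submodule above consumes.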
Embodiment 3
According to an embodiment of the present invention, there is provided a storage medium including a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to execute the speech recognition method described in embodiment 1.
Embodiment 4
According to an embodiment of the present invention, there is provided a processor configured to run a program, wherein, when the program runs, the speech recognition method described in embodiment 1 is performed.
The serial numbers of the above embodiments of the present invention are merely for description and do not imply any ranking of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a logical functional division; in actual implementation there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present invention that contributes to the prior art, or all or part of the technical solution, may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A speech recognition method, comprising:
acquiring voice information;
inputting the voice information into a language judgment model, wherein the language judgment model outputs the probability that the voice information belongs to each candidate language; and determining the language to which the voice information belongs according to the probability corresponding to each candidate language, wherein the language judgment model is obtained by training according to training data, and the training data comprises: voice information of a plurality of languages and labels for indicating the language to which the voice information belongs; wherein determining the language to which the voice information belongs according to the probability corresponding to each candidate language comprises: acquiring preset candidate languages and a weight corresponding to each candidate language; and determining the language to which the voice information belongs according to the probability corresponding to each candidate language and the weight of each candidate language;
and calling a corresponding language conversion module according to the language to which the voice information belongs, and converting the voice information into corresponding text information.
2. The method of claim 1, wherein obtaining voice information comprises:
acquiring collected sound information;
and denoising the sound information to obtain the voice information of the target object.
3. The method according to claim 1, wherein after the corresponding language conversion module is called according to the language to which the voice information belongs to convert the voice information into corresponding text information, the method further comprises:
acquiring a preset character display type;
and displaying the text information according to the text display type.
4. The method of claim 1, wherein prior to obtaining the voice information, the method further comprises: obtaining the language judgment model, wherein obtaining the language judgment model comprises:
obtaining the training data and an initial convolutional neural network model, wherein the initial convolutional neural network model has initial network parameters;
and training the initial convolutional neural network model by using the training data to obtain target network parameters, wherein the target network parameters are used for forming the language judgment model.
5. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring voice information;
the determining module is used for inputting the voice information into a language judgment model, wherein the language judgment model outputs the probability that the voice information belongs to each candidate language, and for determining the language to which the voice information belongs according to the probability corresponding to each candidate language, wherein the language judgment model is obtained by training according to training data, and the training data comprises: voice information of a plurality of languages and labels for indicating the language to which the voice information belongs; wherein determining the language to which the voice information belongs according to the probability corresponding to each candidate language comprises: acquiring preset candidate languages and a weight corresponding to each candidate language; and determining the language to which the voice information belongs according to the probability corresponding to each candidate language and the weight of each candidate language;
and the conversion module is used for calling the corresponding language conversion module according to the language to which the voice information belongs, and converting the voice information into corresponding text information.
6. The apparatus of claim 5, wherein the obtaining module comprises:
the acquisition submodule is used for acquiring the acquired sound information;
and the processing submodule is used for carrying out denoising processing on the sound information to obtain the voice information of the target object.
7. A storage medium comprising a stored program, wherein the program, when executed, controls an apparatus in which the storage medium is located to perform the speech recognition method according to any one of claims 1 to 4.
8. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to perform the speech recognition method according to any one of claims 1 to 4 when running.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811143178.6A CN110970018B (en) | 2018-09-28 | 2018-09-28 | Speech recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811143178.6A CN110970018B (en) | 2018-09-28 | 2018-09-28 | Speech recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110970018A CN110970018A (en) | 2020-04-07 |
CN110970018B true CN110970018B (en) | 2022-05-27 |
Family
ID=70027236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811143178.6A Active CN110970018B (en) | 2018-09-28 | 2018-09-28 | Speech recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110970018B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111477220B (en) * | 2020-04-15 | 2023-04-25 | 南京邮电大学 | Neural network voice recognition method and system for home spoken language environment |
CN113539233B (en) * | 2020-04-16 | 2024-07-30 | 北京搜狗科技发展有限公司 | Voice processing method and device and electronic equipment |
CN111798836B (en) * | 2020-08-03 | 2023-12-05 | 上海茂声智能科技有限公司 | An automatic language switching method, device, system, equipment and storage medium |
CN112259111A (en) * | 2020-09-18 | 2021-01-22 | 惠州高盛达智显科技有限公司 | Emergency broadcast method and system based on Raspberry Pi |
CN112885346B (en) * | 2021-01-19 | 2024-07-23 | 安徽迪科数金科技有限公司 | Intelligent language identification and translation system |
CN112908333B (en) * | 2021-05-08 | 2021-07-16 | 鹏城实验室 | Speech recognition method, device, equipment and computer readable storage medium |
CN113380224A (en) * | 2021-06-04 | 2021-09-10 | 北京字跳网络技术有限公司 | Language determination method and device, electronic equipment and storage medium |
CN113077781B (en) * | 2021-06-04 | 2021-09-07 | 北京世纪好未来教育科技有限公司 | Speech recognition method, device, electronic device and storage medium |
CN113597641A (en) * | 2021-06-22 | 2021-11-02 | 华为技术有限公司 | Voice processing method, device and system |
CN116597827A (en) * | 2023-05-23 | 2023-08-15 | 苏州科帕特信息科技有限公司 | Target language model determining method and device |
CN117558269B (en) * | 2024-01-11 | 2024-03-15 | 深圳波洛斯科技有限公司 | Voice recognition method, device, medium and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102262440A (en) * | 2010-06-11 | 2011-11-30 | Microsoft Corporation | Multi-modal gender recognition |
CN103853703A (en) * | 2014-02-19 | 2014-06-11 | 联想(北京)有限公司 | Information processing method and electronic equipment |
CN105389162A (en) * | 2014-09-09 | 2016-03-09 | 北京金山安全软件有限公司 | Method and device for changing terminal system language and terminal |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GR20050100216A (en) * | 2005-05-04 | 2006-12-18 | i-sieve | Method for probabilistic information fusion to filter multi-lingual, semi-structured and multimedia electronic content. |
CN101702314B (en) * | 2009-10-13 | 2011-11-09 | 清华大学 | Method for establishing identified type language recognition model based on language pair |
CN102695096A (en) * | 2012-05-17 | 2012-09-26 | 深圳市九洲电器有限公司 | Method and device for acquiring output image-text language of bit stream and set top box |
US9966064B2 (en) * | 2012-07-18 | 2018-05-08 | International Business Machines Corporation | Dialect-specific acoustic language modeling and speech recognition |
CN103400577B (en) * | 2013-08-01 | 2015-09-16 | 百度在线网络技术(北京)有限公司 | The acoustic model method for building up of multilingual speech recognition and device |
CN104036774B (en) * | 2014-06-20 | 2018-03-06 | 国家计算机网络与信息安全管理中心 | Tibetan dialect recognition methods and system |
CN105138663A (en) * | 2015-09-01 | 2015-12-09 | 百度在线网络技术(北京)有限公司 | Word bank query method and device |
CN105222275B (en) * | 2015-09-25 | 2018-04-13 | 珠海格力电器股份有限公司 | Display data switching method, device and system |
CN106598937B (en) * | 2015-10-16 | 2019-10-18 | 阿里巴巴集团控股有限公司 | Language Identification, device and electronic equipment for text |
CN105957516B (en) * | 2016-06-16 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | More voice identification model switching method and device |
CN106328125B (en) * | 2016-10-28 | 2023-08-04 | 许昌学院 | A Speech Recognition System for Henan Dialect |
CN108510976B (en) * | 2017-02-24 | 2021-03-19 | 芋头科技(杭州)有限公司 | Multi-language mixed voice recognition method |
CN106997762A (en) * | 2017-03-08 | 2017-08-01 | 广东美的制冷设备有限公司 | The sound control method and device of household electrical appliance |
CN106961508A (en) * | 2017-06-01 | 2017-07-18 | 诺雪(重庆)科技有限公司 | Communication means and device based on Sex criminals |
CN107169566A (en) * | 2017-06-09 | 2017-09-15 | 山东师范大学 | Dynamic neural network model training method and device |
- 2018-09-28 CN CN201811143178.6A patent/CN110970018B/en active Active
Non-Patent Citations (1)
Title |
---|
Language identification method based on independent component analysis; Chen Gang et al.; Computer Engineering; 2006-12-20 (No. 24); full text * |
Also Published As
Publication number | Publication date |
---|---|
CN110970018A (en) | 2020-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110970018B (en) | Speech recognition method and device | |
CN107945792B (en) | Voice processing method and device | |
CN110610707B (en) | Voice keyword recognition method and device, electronic equipment and storage medium | |
CN110619889B (en) | Sign data identification method, device, electronic device and storage medium | |
US10438586B2 (en) | Voice dialog device and voice dialog method | |
CN107492379B (en) | Voiceprint creating and registering method and device | |
CN108305643B (en) | Method and device for determining emotion information | |
CN109660865B (en) | Method and device for automatically labeling videos, medium and electronic equipment | |
CN108447471A (en) | Audio recognition method and speech recognition equipment | |
CN110942763B (en) | Speech recognition method and device | |
CN110148399A (en) | A kind of control method of smart machine, device, equipment and medium | |
CN104391673A (en) | Voice interaction method and voice interaction device | |
US9251808B2 (en) | Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof | |
CN114387945B (en) | Voice generation method, device, electronic equipment and storage medium | |
CN110262278B (en) | Control method and device of intelligent household electrical appliance and intelligent household electrical appliance | |
CN110931018A (en) | Intelligent voice interaction method and device and computer readable storage medium | |
KR102345625B1 (en) | Caption generation method and apparatus for performing the same | |
CN112420049A (en) | Data processing method, device and storage medium | |
CN113362815A (en) | Voice interaction method, system, electronic equipment and storage medium | |
CN111477212A (en) | Content recognition, model training and data processing method, system and equipment | |
CN112597889A (en) | Emotion processing method and device based on artificial intelligence | |
CN110895936B (en) | Voice processing method and device based on household appliance | |
KR102631143B1 (en) | Voice synthesizer using artificial intelligence, operating method of voice synthesizer and computer redable recording medium | |
CN110970019A (en) | Control method and device of intelligent home system | |
CN111128127A (en) | Voice recognition processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||