CN120146035B - A speech error correction method and system based on artificial intelligence - Google Patents
A speech error correction method and system based on artificial intelligence
- Publication number
- CN120146035B CN120146035B CN202510234170.4A CN202510234170A CN120146035B CN 120146035 B CN120146035 B CN 120146035B CN 202510234170 A CN202510234170 A CN 202510234170A CN 120146035 B CN120146035 B CN 120146035B
- Authority
- CN
- China
- Prior art keywords
- chinese character
- sequence
- pinyin
- voice
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
- G06F40/30—Semantic analysis
Abstract
The invention discloses a speech error correction method and system based on artificial intelligence, relating to the technical field of voice interaction. The method comprises: acquiring the voice input information entered at the current time node and outputting a corresponding pinyin sequence; importing the pinyin sequence into a preset word database and traversing the plurality of Chinese character sequences corresponding to it, the sequences being homophones of one another; scoring the semantic fitness of each Chinese character sequence against the pinyin sequence with a context verification model based on an attention mechanism, and outputting a semantic score for each sequence; and determining the sequence with the highest semantic score as the target Chinese character sequence, yielding the target Chinese character text corresponding to the voice input information. By scoring the candidate character sequences with an attention-based context verification model, the invention solves the prior-art problems of inaccurate speech error correction and low speech conversion accuracy caused by polyphones during speech-to-text conversion.
Description
Technical Field
The invention relates to the technical field of voice interaction, in particular to a voice error correction method and system based on artificial intelligence.
Background
Polyphones are Chinese characters that have multiple pronunciations under different semantic scenarios (e.g., "重" reads zhòng or chóng), so that the correct pronunciation and the corresponding character choice depend strongly on context. Traditional approaches to polyphone matching face the following core challenges:
1. Limitations of a static word stock
A predefined dictionary (e.g., fixing "行" to háng as in "银行", bank) cannot adapt dynamically to unregistered words or complex contexts (e.g., "行" in "代码行", a line of code, should read háng, while in "行走", walking, it should read xíng).
2. Deficient short-range dependency modeling
An N-gram language model captures only local word sequences (e.g., the preceding 2-3 words) and struggles with long-distance semantic logic, such as the differing pronunciations of the two "重" characters in "他重(chóng)新称了体重(zhòng)" ("he weighed his weight again").
3. Ambiguity in under-determined scenes
When the context does not point firmly to a single pronunciation (e.g., "好" in "这个人好说话" may read hǎo, easy-going, or hào, fond of talking), traditional models lack the deep semantic reasoning capability to resolve it.
Consequently, in the prior art, polyphones make speech error correction inaccurate during speech-to-text conversion, lowering the accuracy of the voice input conversion result.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a speech error correction method and system based on artificial intelligence that solve the prior-art problems of inaccurate speech error correction and low conversion accuracy caused by polyphones during speech-to-text conversion.
A first aspect of the present invention provides an artificial intelligence based speech error correction method, the method comprising:
Acquiring voice input information input by a current time node, and outputting a corresponding pinyin sequence according to the voice input information;
Importing the pinyin sequence into a preset word database and traversing, in the word database, a plurality of Chinese character sequences corresponding to the pinyin sequence, wherein the Chinese character sequences are homophones of one another;
scoring semantic agreements of the Chinese character sequences and the pinyin sequences by adopting a context verification model based on an attention mechanism, and outputting semantic scoring values of each Chinese character sequence;
And determining the Chinese character sequence with the highest semantic score value as a target Chinese character sequence to obtain a target Chinese character text corresponding to the voice input information.
According to an aspect of the above technical solution, the step of scoring semantic fitness of the plurality of chinese character sequences and the pinyin sequence by using a context verification model based on an attention mechanism, and outputting a semantic scoring value of each chinese character sequence includes:
Respectively converting the Chinese character sequence and the pinyin sequence into embedded vectors, and correspondingly adding position codes;
the Chinese character sequence and the pinyin sequence are focused mutually through the attention mechanism of the context verification model, and the context-related target features are generated;
performing feature fusion and pooling on the target features, and extracting key information of the target features;
And mapping the fused target features through a full connection layer in the context verification model to obtain the semantic score value of each Chinese character sequence.
According to one aspect of the above technical solution, the Chinese character sequence and the pinyin sequence are focused on each other by the attention mechanism of the context verification model, a step of generating a contextually relevant targeted feature, comprising:
Calculating the cross attention weight between the Chinese character sequence and the pinyin sequence according to a multi-head attention mechanism through the context verification model, and capturing semantic association between the Chinese character sequence and the pinyin sequence;
Taking the Chinese character sequence as the Query vectors and the pinyin sequence as the Key and Value vectors, generating an attention matrix, and measuring the local correlation between Chinese characters and pinyin with scaled dot-product attention;
And carrying out weighted summation on the cross attention weight and the Value vector of the pinyin vector to generate a context enhancement representation of the Chinese character sequence, thereby obtaining target characteristics.
According to an aspect of the above technical solution, the step of mapping the fused target feature to obtain the semantic score value of each chinese character sequence through the full connection layer in the context verification model includes:
mapping the target feature to an intermediate dimension through a weight matrix and a bias term according to at least one full connection layer in the context verification model;
and outputting the score corresponding to the Chinese character sequence by using the linear output activation function through an output layer in the context verification model to obtain the semantic score corresponding to each Chinese character sequence.
According to an aspect of the foregoing technical solution, the method further includes:
performing background denoising on the voice input information by using a pre-trained deep learning model, and enhancing the voice definition of the voice input information;
And detecting the end point of the voice input information, determining the starting and ending points of the voice fragments in the voice input information, and cutting according to the starting and ending points of the voice fragments to obtain target voice fragments.
According to one aspect of the above technical solution, the step of performing endpoint detection on the voice input information, determining a start and stop point of a voice segment in the voice input information, and cutting according to the start and stop point of the voice segment to obtain a target voice segment further includes:
and outputting a corresponding pinyin sequence according to the target voice segment.
A second aspect of the present invention provides an artificial intelligence-based speech error correction system, which is applied to the method described in the above technical solution, and the system includes:
The phonetic sequence output module is used for acquiring the phonetic input information input by the current time node and outputting a corresponding phonetic sequence according to the phonetic input information;
the Chinese character sequence traversing module is used for importing the pinyin sequence into a preset word database and traversing, in the word database, a plurality of Chinese character sequences corresponding to the pinyin sequence, the Chinese character sequences being homophones of one another;
The Chinese character sequence scoring module is used for scoring the semantic fitness of the plurality of Chinese character sequences and the pinyin sequence respectively by adopting a context verification model based on an attention mechanism, and outputting the semantic scoring value of each Chinese character sequence;
and the Chinese character text output module is used for determining the Chinese character sequence with the highest semantic score value as a target Chinese character sequence to obtain a target Chinese character text corresponding to the voice input information.
According to an aspect of the foregoing technical solution, the chinese character sequence scoring module is specifically configured to:
Respectively converting the Chinese character sequence and the pinyin sequence into embedded vectors, and correspondingly adding position codes;
the Chinese character sequence and the pinyin sequence are focused mutually through the attention mechanism of the context verification model, and the context-related target features are generated;
performing feature fusion and pooling on the target features, and extracting key information of the target features;
And mapping the fused target features through a full connection layer in the context verification model to obtain the semantic score value of each Chinese character sequence.
A third aspect of the present invention is to provide a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method described in the above technical solutions.
A fourth aspect of the present invention provides an electronic device comprising a memory, a processor, and a computer program stored in the memory which, when executed by the processor, implements the method described in the foregoing technical solutions.
Compared with the prior art, the voice error correction method and system based on artificial intelligence have the beneficial effects that:
The method acquires the voice input information entered at the current time node, outputs the corresponding pinyin sequence, imports that sequence into a preset word database, and traverses the plurality of mutually homophonous Chinese character sequences corresponding to it. A context verification model based on an attention mechanism then scores the semantic fitness of each Chinese character sequence against the pinyin sequence, and the sequence with the highest semantic score is determined as the target Chinese character sequence, yielding the target Chinese character text corresponding to the voice input information. By scoring the candidate character sequences of the pinyin sequence with an attention-based context verification model, the invention overcomes the prior-art problems of inaccurate error correction and low conversion accuracy caused by polyphones, and can be applied effectively to accurate speech error correction in terminal applications such as electronic devices and automobiles, improving the user experience.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of an artificial intelligence based speech error correction method according to an embodiment of the invention;
FIG. 2 is a block diagram of an artificial intelligence based speech error correction system in accordance with an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below. Several embodiments of the invention are presented in the figures. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1, a first embodiment of the present invention provides an artificial intelligence-based speech error correction method, which includes steps S10 to S40:
Step S10, voice input information input by the current time node is obtained, and a corresponding pinyin sequence is output according to the voice input information.
In this embodiment, the current time node is the time node at which the user inputs voice through a sound pickup; it may be the start time of the voice input, its end time, or any point during the input process. The voice input information is obtained from that voice input.
The voice input information may be a passage spoken into an input method, for example "你知道太阳为什么是东升西落吗" ("do you know why the sun rises in the east and sets in the west") when talking with an AI model, or a control instruction for an electronic device or a terminal device such as a vehicle, e.g. "help me set an alarm for seven o'clock in the morning" or "set the air conditioner temperature to 26 degrees".
Specifically, after the voice content is entered at the current time node to obtain the voice input information, the corresponding pinyin sequence is output from it. For example, "你知道太阳为什么是东升西落吗" yields the pinyin sequences "ni", "zhidao", "taiyang", "weishenme", "shi", "dongshengxiluo", "ma"; at least one pinyin sequence is output.
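Assuming the front end already produces a toneless, unsegmented pinyin string, its segmentation into syllables can be sketched by greedy longest-match against a syllable inventory. This is an illustrative sketch, not part of the claimed implementation; the tiny inventory below is an assumption, not the full Mandarin set:

```python
# Greedy longest-match segmentation of a pinyin string into syllables.
# SYLLABLES is a small illustrative subset of the Mandarin inventory.
SYLLABLES = {"ni", "zhi", "dao", "tai", "yang", "wei", "shen", "me",
             "shi", "dong", "sheng", "xi", "luo", "ma"}

def segment_pinyin(s, inventory=SYLLABLES, max_len=6):
    out, i = [], 0
    while i < len(s):
        for l in range(min(max_len, len(s) - i), 0, -1):  # try longest match first
            if s[i:i + l] in inventory:
                out.append(s[i:i + l])
                i += l
                break
        else:
            raise ValueError(f"cannot segment at offset {i}: {s[i:]!r}")
    return out

print(segment_pinyin("zhidao"))          # ['zhi', 'dao']
print(segment_pinyin("dongshengxiluo"))  # ['dong', 'sheng', 'xi', 'luo']
```

Greedy longest-match is a simplification; a production system would score alternative segmentations, since some pinyin strings segment ambiguously.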
Of course, it should be further noted that after the voice input information is obtained, the voice input information may be preprocessed, so that the voice content in the voice input information is clearer, and accurate output of the pinyin sequence is facilitated.
Step S20, importing the pinyin sequence into a preset word database and traversing, in the word database, a plurality of Chinese character sequences corresponding to the pinyin sequence, the Chinese character sequences being homophones of one another.
In this embodiment, once the pinyin sequence has been output from the voice input information, the pinyin of the user's input is known. The pinyin sequence is then imported into a preset word database: a database built from dictionary content that holds at least the common words, their pinyin, and the correspondence between them. Using the pinyin sequence or its subsequences as indexes, all words corresponding to them are looked up in the word database and combined, yielding a plurality of Chinese character sequences.
Continuing the example above, for the pinyin sequence "zhidao" the corresponding Chinese character sequences include "知道" (know), "指导" (guide), "直到" (until), and so on; the other pinyin sequences are handled in the same way and are not enumerated here.
That is, for a speech segment or control instruction in the voice input information, each pinyin sequence it contains corresponds to several Chinese character sequences, and these combine in multiple ways: candidate character sequences such as "知道", "指导", and "直到" are homophones of one another. To obtain the final target Chinese character text accurately, at least part of the voice content, i.e., the Chinese character sequences, must be error-corrected; for example, an initially generated "你指导太阳为什么是东升西落吗" is corrected to "你知道太阳为什么是东升西落吗" during speech conversion. Specifically, this is realized through steps S30 to S40.
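The traversal described above can be sketched as enumerating every combination of homophone candidates. The small word database below is a hypothetical stand-in for the dictionary-built database the embodiment describes:

```python
from itertools import product

# Hypothetical word database: each pinyin token maps to its homophone
# candidates. A real database would be built from dictionary content,
# as described in step S20.
WORD_DB = {
    "ni": ["你", "尼"],
    "zhidao": ["知道", "指导", "直到"],
    "ma": ["吗", "马"],
}

def candidate_sequences(pinyin_tokens, db=WORD_DB):
    """Traverse the database and return every homophone character sequence."""
    pools = [db[tok] for tok in pinyin_tokens]
    return ["".join(combo) for combo in product(*pools)]

cands = candidate_sequences(["ni", "zhidao", "ma"])
print(len(cands))  # 2 * 3 * 2 = 12 candidate sequences
```

The candidate set grows multiplicatively with sentence length, which is why the subsequent scoring step (S30) is needed to rank and prune it.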
And step S30, scoring semantic agreements of the Chinese character sequences and the pinyin sequences by adopting a context verification model based on an attention mechanism, and outputting semantic scoring values of the Chinese character sequences.
In this embodiment, the step of scoring semantic agreements of the plurality of chinese character sequences and the pinyin sequence by using a context verification model based on an attention mechanism, and outputting a semantic scoring value of each chinese character sequence includes:
Respectively converting the Chinese character sequence and the pinyin sequence into embedded vectors, and correspondingly adding position codes;
the Chinese character sequence and the pinyin sequence are focused mutually through the attention mechanism of the context verification model, and the context-related target features are generated;
performing feature fusion and pooling on the target features, and extracting key information of the target features;
And mapping the fused target features through a full connection layer in the context verification model to obtain the semantic score value of each Chinese character sequence.
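The first step above — embedding conversion plus position codes — can be sketched as follows. Sinusoidal encoding is one common choice and is an assumption here, since the text does not prescribe a particular scheme:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position codes (one common scheme; an assumption here)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_position(embeddings):
    """Element-wise add position codes to the token embedding vectors."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(vec, pvec)] for vec, pvec in zip(embeddings, pe)]

# Toy 4-dimensional embeddings for a two-token sequence
emb = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
out = add_position(emb)
# At position 0, sin(0)=0 is added to even dims and cos(0)=1 to odd dims,
# so out[0] == [0.1, 1.2, 0.3, 1.4]
```

The same encoding would be applied to both the character and the pinyin embedding sequences so that the attention step can exploit order information on both sides.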
Wherein, the step of generating the contextually relevant target feature by focusing the Chinese character sequence and the pinyin sequence on each other through the attention mechanism of the context verification model comprises the following steps:
Calculating the cross attention weight between the Chinese character sequence and the pinyin sequence according to a multi-head attention mechanism through the context verification model, and capturing semantic association between the Chinese character sequence and the pinyin sequence;
Taking the Chinese character sequence as the Query vectors and the pinyin sequence as the Key and Value vectors, generating an attention matrix, and measuring the local correlation between Chinese characters and pinyin with scaled dot-product attention;
And carrying out weighted summation on the cross attention weight and the Value vector of the pinyin vector to generate a context enhancement representation of the Chinese character sequence, thereby obtaining target characteristics.
In addition, the step of mapping the fused target features to obtain the semantic score value of each Chinese character sequence through the full connection layer in the context verification model comprises the following steps:
mapping the target feature to an intermediate dimension through a weight matrix and a bias term according to at least one full connection layer in the context verification model;
and outputting the score corresponding to the Chinese character sequence by using the linear output activation function through an output layer in the context verification model to obtain the semantic score corresponding to each Chinese character sequence.
Specifically, when scoring the semantic fitness of a Chinese character sequence, a context verification model based on an attention mechanism is used. The Chinese character sequence is converted into a vector representation by an embedding layer, with position codes added to capture order information; the pinyin sequence is either split into initials, finals, and tones or converted directly into pinyin embedding vectors, again with position codes added. A multi-head attention mechanism then computes the cross attention weights between the Chinese character sequence and the pinyin sequence, capturing their phonetic association: the Chinese character sequence serves as the Query vectors and the pinyin sequence as the Key and Value vectors (or vice versa), an attention matrix is generated, and scaled dot-product attention measures the local relevance between characters and pinyin. The attention weights are then applied in a weighted sum over the Value vectors of the pinyin sequence to produce a context-enhanced representation of the Chinese character sequence, i.e., the target features. The original character embeddings are concatenated with the attention-enhanced context representations, preserving both the original information and the attention features in a hierarchical feature concatenation; the concatenated features are reduced by global average pooling or max pooling into a fixed-length semantic vector.
The pooled semantic vector is mapped to a scoring value by a fully connected layer. Specifically, an activation function such as Sigmoid confines the output to the range [0, 1], representing a fitness probability; alternatively, an unnormalized score is output in regression fashion and fitness is compared by ranking the scores.
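The scoring pipeline just described — cross attention with the character sequence as Query and the pinyin sequence as Key/Value, feature concatenation, average pooling, and a fully connected layer with a Sigmoid — can be sketched with untrained toy weights. The dimensions and random initialization below are illustrative assumptions; a single attention head stands in for the multi-head mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative)

# Toy embeddings for a 3-character candidate and its 3-syllable pinyin;
# in the described model these come from trained embedding layers.
hanzi = rng.normal(size=(3, d))    # Query side
pinyin = rng.normal(size=(3, d))   # Key/Value side

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_score(Q, K, V, W, b):
    """Scaled dot-product cross attention (characters attend to pinyin),
    concatenation with the original embeddings, global average pooling,
    and a fully connected layer with a Sigmoid."""
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[1]))   # attention matrix
    context = attn @ V                              # context-enhanced repr.
    fused = np.concatenate([Q, context], axis=1)    # keep original + attended
    pooled = fused.mean(axis=0)                     # fixed-length semantic vector
    logit = pooled @ W + b                          # fully connected layer
    return 1.0 / (1.0 + np.exp(-logit))             # Sigmoid -> score in (0, 1)

W = rng.normal(size=2 * d)  # untrained weights, for shape illustration only
score = cross_attention_score(hanzi, pinyin, pinyin, W, 0.0)
print(0.0 < score < 1.0)  # True
```

With untrained weights the score is meaningless; the sketch only shows how the tensor shapes and operations compose. Training would fit the embeddings, attention projections, and the fully connected layer to labeled fitness data.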
And S40, determining the Chinese character sequence with the highest semantic score value as a target Chinese character sequence, and obtaining a target Chinese character text corresponding to the voice input information.
In this embodiment, the target Chinese character sequence is determined from the semantic scores: the Chinese-character-sequence/pinyin-sequence pairs are computed in parallel and a semantic score is output for each pair; pairs with low fitness are filtered out by a score threshold, or the best candidate pair is output by score ranking, giving the target Chinese character sequence. The target Chinese character text corresponding to the voice content in the voice input information is then determined from that sequence, realizing speech error correction during speech conversion.
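The selection step can be sketched as threshold filtering followed by an argmax over the semantic scores. The scores and threshold below are hypothetical stand-ins for the model's output and a tuned cutoff:

```python
# Hypothetical semantic scores for the homophone candidates of pinyin
# "zhidao"; in the described system these come from the context
# verification model.
scores = {"知道": 0.93, "指导": 0.41, "直到": 0.22}

THRESHOLD = 0.3  # filter out low-fitness pairs (illustrative value)
kept = {seq: s for seq, s in scores.items() if s >= THRESHOLD}
target = max(kept, key=kept.get)  # highest semantic score wins
print(target)  # 知道
```

Because each candidate is scored independently, the per-pair computations parallelize naturally, matching the parallel evaluation described above.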
It should be noted that, in this embodiment, the attention-based context verification model is implemented with cross attention rather than self attention, so as to guarantee semantic alignment across the two modalities (Chinese characters and pinyin).
Compared with the prior art, the voice error correction method based on artificial intelligence, which is shown in the embodiment, has the beneficial effects that:
The method acquires the voice input information entered at the current time node, outputs the corresponding pinyin sequence, imports that sequence into a preset word database, and traverses the plurality of mutually homophonous Chinese character sequences corresponding to it. A context verification model based on an attention mechanism scores the semantic fitness of each Chinese character sequence against the pinyin sequence, and the sequence with the highest semantic score is determined as the target Chinese character sequence, yielding the target Chinese character text corresponding to the voice input information. By scoring the candidate character sequences of the pinyin sequence with an attention-based context verification model, the method solves the prior-art problems of inaccurate error correction and low conversion accuracy caused by polyphones during speech-to-text conversion, and can be applied effectively to accurate speech error correction in terminal applications such as electronic devices and automobiles, improving the user experience.
Example two
The second embodiment of the present invention also provides a speech error correction method based on artificial intelligence, where the method shown in the present embodiment further includes:
performing background denoising on the voice input information by using a pre-trained deep learning model, and enhancing the voice definition of the voice input information;
And detecting the end point of the voice input information, determining the starting and ending points of the voice fragments in the voice input information, and cutting according to the starting and ending points of the voice fragments to obtain target voice fragments.
The method comprises the steps of detecting the end point of the voice input information, determining the starting and ending points of the voice fragments in the voice input information, cutting according to the starting and ending points of the voice fragments, and obtaining target voice fragments, and then further comprises the following steps:
and outputting a corresponding pinyin sequence according to the target voice segment.
Specifically, in this embodiment, after the voice input information is obtained, a pre-trained deep learning model such as Conv-TasNet performs background denoising on it, removing background noise and enhancing speech clarity; in a noisy environment, for example, the voiceprint characteristics of the target speaker are extracted by the speech separation model. The start and stop points of the effective speech segments are then detected so that silence or noise segments are not fed into the model, which effectively reduces the model's data processing load and improves the efficiency of speech conversion.
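The endpoint detection step can be sketched with a minimal short-time-energy detector; real systems, including the separation and denoising models mentioned above, are far more robust, and the frame length and threshold here are illustrative assumptions:

```python
# Minimal energy-based endpoint detection: mark frames whose short-time
# energy exceeds a threshold, then cut from the first to the last active
# frame to obtain the target speech segment.
def detect_endpoints(samples, frame_len=4, threshold=0.1):
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    energy = [sum(x * x for x in f) / len(f) for f in frames]
    active = [i for i, e in enumerate(energy) if e > threshold]
    if not active:
        return None  # no speech detected
    start, stop = active[0] * frame_len, (active[-1] + 1) * frame_len
    return samples[start:stop]  # the cut target speech segment

# silence, then a loud burst, then silence
sig = [0.0] * 8 + [0.9, -0.8, 0.7, -0.9] + [0.0] * 8
seg = detect_endpoints(sig)
print(len(seg))  # 4
```

Feeding only `seg` (rather than the full signal) to the downstream recognizer is what realizes the data-reduction benefit the embodiment describes.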
Example III
Referring to fig. 2, a third embodiment of the present invention provides an artificial intelligence-based speech error correction system, which is applied to the method described in any of the above embodiments, and includes a pinyin sequence output module 10, a hanzi sequence traversing module 20, a hanzi sequence scoring module 30, and a hanzi text output module 40.
The pinyin sequence output module 10 is used for acquiring the voice input information input by the current time node and outputting a corresponding pinyin sequence according to the voice input information;
The Chinese character sequence traversing module 20 is configured to import the pinyin sequence into a preset word database and traverse, in the word database, a plurality of Chinese character sequences corresponding to the pinyin sequence, the Chinese character sequences being homophones of one another;
A Chinese character sequence scoring module 30, configured to score semantic agreements of the plurality of Chinese character sequences and the pinyin sequence respectively by using a context verification model based on an attention mechanism, and output a semantic scoring value of each Chinese character sequence;
And the chinese character text output module 40 is configured to determine the chinese character sequence with the highest semantic score value as a target chinese character sequence, and obtain a target chinese character text corresponding to the voice input information.
In this embodiment, the chinese character sequence scoring module 30 is specifically configured to:
Respectively converting the Chinese character sequence and the pinyin sequence into embedded vectors, and correspondingly adding position codes;
the Chinese character sequence and the pinyin sequence are focused mutually through the attention mechanism of the context verification model, and the context-related target features are generated;
performing feature fusion and pooling on the target features, and extracting key information of the target features;
And mapping the fused target features through a full connection layer in the context verification model to obtain the semantic score value of each Chinese character sequence.
In this embodiment, the chinese character sequence scoring module 30 is further configured to:
Calculating the cross attention weight between the Chinese character sequence and the pinyin sequence according to a multi-head attention mechanism through the context verification model, and capturing semantic association between the Chinese character sequence and the pinyin sequence;
Taking the Chinese character sequence as the Query vectors and the pinyin sequence as the Key and Value vectors, generating an attention matrix, and measuring the local correlation between Chinese characters and pinyin with scaled dot-product attention;
And carrying out weighted summation on the cross attention weight and the Value vector of the pinyin vector to generate a context enhancement representation of the Chinese character sequence, thereby obtaining target characteristics.
In addition, the Chinese character sequence scoring module 30 is further configured to:
Map the target features to an intermediate dimension via a weight matrix and a bias term in at least one fully connected layer of the context verification model;
And output the score corresponding to each Chinese character sequence through the output layer of the context verification model using a linear output activation function, obtaining the semantic score corresponding to each Chinese character sequence.
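These two steps amount to a small feed-forward scoring head. The sketch below assumes one ReLU hidden layer as the intermediate mapping, which the patent leaves unspecified beyond "at least one fully connected layer", followed by the linear (identity) output activation it does name:

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def linear(x, W, b):
    """Fully connected layer y = W x + b, with W given row-wise."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def scoring_head(features, W1, b1, W2, b2):
    """Map target features to a scalar semantic score."""
    hidden = relu(linear(features, W1, b1))  # map to intermediate dimension
    return linear(hidden, W2, b2)[0]         # linear output activation
```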
Compared with the prior art, the artificial-intelligence-based voice error correction system shown in this embodiment has the following beneficial effects:
The system of this embodiment acquires the voice input information entered at the current time node and outputs a corresponding pinyin sequence; imports the pinyin sequence into a preset word database and traverses the plurality of Chinese character sequences corresponding to the pinyin sequence in that database, the plurality of Chinese character sequences being homophones; scores the semantic agreement between each Chinese character sequence and the pinyin sequence using a context verification model based on an attention mechanism and outputs the semantic score of each Chinese character sequence; and determines the Chinese character sequence with the highest semantic score as the target Chinese character sequence, obtaining the target Chinese character text corresponding to the voice input information. By scoring the candidate Chinese character sequences for a pinyin sequence with an attention-based context verification model, the system addresses the inaccuracy of prior-art speech-to-text error correction caused by polyphonic characters and the resulting low conversion accuracy, and can be used effectively for voice error correction in terminal applications such as electronic devices and automobiles, thereby improving the user experience.
Example Four
A fourth embodiment of the invention provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method described in any of the embodiments above.
Example Five
A fifth embodiment of the invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described in any of the embodiments above when executing the computer program.
In the description of this specification, reference to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail, but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art may make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention. Accordingly, the scope of protection of the invention is to be determined by the appended claims.
Claims (8)
1. A speech error correction method based on artificial intelligence, the method comprising:
Acquiring the voice input information entered at the current time node, and outputting a corresponding pinyin sequence according to the voice input information;
Importing the pinyin sequence into a preset word database, and traversing a plurality of Chinese character sequences corresponding to the pinyin sequence in the word database, wherein the plurality of Chinese character sequences are homophones;
Scoring the semantic agreement between each of the plurality of Chinese character sequences and the pinyin sequence using a context verification model based on an attention mechanism, and outputting a semantic score for each Chinese character sequence, wherein the semantic agreement is the speech association between each of the Chinese character sequences and the single speech sequence, respectively;
Determining the Chinese character sequence with the highest semantic score as the target Chinese character sequence, to obtain the target Chinese character text corresponding to the voice input information;
Wherein the step of scoring the semantic agreement between each of the plurality of Chinese character sequences and the pinyin sequence using a context verification model based on an attention mechanism and outputting the semantic score of each Chinese character sequence comprises:
Converting the Chinese character sequence and the pinyin sequence into embedding vectors, respectively, and adding the corresponding positional encodings;
Attending the Chinese character sequence and the pinyin sequence to each other through the attention mechanism of the context verification model to generate context-related target features;
Performing feature fusion and pooling on the target features to extract their key information;
And mapping the fused target features through a fully connected layer in the context verification model to obtain the semantic score of each Chinese character sequence.
2. The artificial intelligence based speech error correction method according to claim 1, wherein the step of attending the Chinese character sequence and the pinyin sequence to each other through the attention mechanism of the context verification model to generate context-related target features comprises:
Calculating, through the context verification model, the cross-attention weights between the Chinese character sequence and the pinyin sequence using a multi-head attention mechanism, capturing the semantic association between the two sequences;
Taking the Chinese character sequence as the Query vectors and the pinyin sequence as the Key and Value vectors to generate an attention matrix, measuring the local correlation between Chinese characters and pinyin with scaled dot-product attention;
And computing the weighted sum of the cross-attention weights and the Value vectors of the pinyin sequence to generate a context-enhanced representation of the Chinese character sequence, thereby obtaining the target features.
3. The artificial intelligence based speech error correction method according to claim 2, wherein the step of mapping the fused target features through a fully connected layer in the context verification model to obtain the semantic score of each Chinese character sequence comprises:
Mapping the target features to an intermediate dimension via a weight matrix and a bias term in at least one fully connected layer of the context verification model;
And outputting the score corresponding to each Chinese character sequence through the output layer of the context verification model using a linear output activation function, obtaining the semantic score corresponding to each Chinese character sequence.
4. The artificial intelligence based speech error correction method of claim 1, further comprising:
Performing background denoising on the voice input information using a pre-trained deep learning model to enhance the speech clarity of the voice input information;
And performing endpoint detection on the voice input information, determining the start and end points of the speech segments in the voice input information, and cutting according to those start and end points to obtain target speech segments.
5. The artificial intelligence based speech error correction method according to claim 4, wherein after the steps of performing endpoint detection on the voice input information, determining the start and end points of the speech segments, and cutting according to those start and end points to obtain the target speech segments, the method further comprises:
Outputting a corresponding pinyin sequence according to the target speech segment.
6. An artificial intelligence based speech error correction system for implementing the method of any one of claims 1-5, the system comprising:
A pinyin sequence output module, configured to acquire the voice input information entered at the current time node and output a corresponding pinyin sequence according to the voice input information;
A Chinese character sequence traversing module, configured to import the pinyin sequence into a preset word database and traverse a plurality of Chinese character sequences corresponding to the pinyin sequence in the word database, wherein the plurality of Chinese character sequences are homophones;
A Chinese character sequence scoring module, configured to score the semantic agreement between each of the plurality of Chinese character sequences and the pinyin sequence using a context verification model based on an attention mechanism, and to output a semantic score for each Chinese character sequence;
A Chinese character text output module, configured to determine the Chinese character sequence with the highest semantic score as the target Chinese character sequence, to obtain the target Chinese character text corresponding to the voice input information;
Wherein the Chinese character sequence scoring module is specifically configured to:
Convert the Chinese character sequence and the pinyin sequence into embedding vectors, respectively, and add the corresponding positional encodings;
Attend the Chinese character sequence and the pinyin sequence to each other through the attention mechanism of the context verification model to generate context-related target features;
Perform feature fusion and pooling on the target features to extract their key information;
And map the fused target features through a fully connected layer in the context verification model to obtain the semantic score of each Chinese character sequence.
7. A readable storage medium having stored thereon a computer program, which when executed by a processor implements the method of any of claims 1-5.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-5 when executing the computer program.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510234170.4A CN120146035B (en) | 2025-02-28 | 2025-02-28 | A speech error correction method and system based on artificial intelligence |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510234170.4A CN120146035B (en) | 2025-02-28 | 2025-02-28 | A speech error correction method and system based on artificial intelligence |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN120146035A CN120146035A (en) | 2025-06-13 |
| CN120146035B true CN120146035B (en) | 2025-10-10 |
Family
ID=95953648
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510234170.4A Active CN120146035B (en) | 2025-02-28 | 2025-02-28 | A speech error correction method and system based on artificial intelligence |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120146035B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110428822A (en) * | 2019-08-05 | 2019-11-08 | 重庆电子工程职业学院 | A kind of speech recognition error correction method and interactive system |
| CN116151227A (en) * | 2023-02-15 | 2023-05-23 | 平安科技(深圳)有限公司 | Text error correction method, device, equipment and medium |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104238991B (en) * | 2013-06-21 | 2018-05-25 | 腾讯科技(深圳)有限公司 | Phonetic entry matching process and device |
| CN108549637A (en) * | 2018-04-19 | 2018-09-18 | 京东方科技集团股份有限公司 | Method for recognizing semantics, device based on phonetic and interactive system |
| CN111523306A (en) * | 2019-01-17 | 2020-08-11 | 阿里巴巴集团控股有限公司 | Text error correction method, device and system |
| CN111428474B (en) * | 2020-03-11 | 2025-04-08 | 中国平安人寿保险股份有限公司 | Error correction method, device, equipment and storage medium based on language model |
| CN112735396B (en) * | 2021-02-05 | 2024-10-15 | 北京小米松果电子有限公司 | Speech recognition error correction method, device and storage medium |
| CN113268974B (en) * | 2021-05-18 | 2022-11-29 | 平安科技(深圳)有限公司 | Method, device and equipment for marking pronunciations of polyphones and storage medium |
| CN114707467B (en) * | 2022-03-18 | 2024-06-14 | 浙江大学 | Automatic pinyin-to-Chinese character conversion method based on self-attention mechanism |
| CN119203998A (en) * | 2024-09-19 | 2024-12-27 | 昆明理工大学 | A Chinese grammar correction method based on detection and pinyin joint enhancement |
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110428822A (en) * | 2019-08-05 | 2019-11-08 | 重庆电子工程职业学院 | A kind of speech recognition error correction method and interactive system |
| CN116151227A (en) * | 2023-02-15 | 2023-05-23 | 平安科技(深圳)有限公司 | Text error correction method, device, equipment and medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120146035A (en) | 2025-06-13 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112002308B (en) | Voice recognition method and device | |
| CN108711422B (en) | Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment | |
| CN109243468B (en) | Voice recognition method and device, electronic equipment and storage medium | |
| CN111798840A (en) | Speech keyword recognition method and device | |
| WO2022142041A1 (en) | Training method and apparatus for intent recognition model, computer device, and storage medium | |
| CN116127953A (en) | Chinese spelling error correction method, device and medium based on contrast learning | |
| US11615787B2 (en) | Dialogue system and method of controlling the same | |
| CN113205813B (en) | Error correction method for speech recognition text | |
| CN117875395A (en) | Training method, device and storage medium of multi-mode pre-training model | |
| CN120524951A (en) | Text processing method and device | |
| CN119600997A (en) | Mixed identification processing method, device, equipment and medium | |
| CN118427215A (en) | Method for generating structured query statement, method for processing question and answer and corresponding device | |
| CN113160801B (en) | Speech recognition method, device and computer readable storage medium | |
| CN111554275A (en) | Speech recognition method, apparatus, device, and computer-readable storage medium | |
| CN120146035B (en) | A speech error correction method and system based on artificial intelligence | |
| CN113327581B (en) | Recognition model optimization method and system for improving speech recognition accuracy | |
| JP2005084436A (en) | Speech recognition apparatus and computer program | |
| CN114492421A (en) | Emotion recognition method, storage medium, device and terminal equipment | |
| CN114707518B (en) | Semantic fragment-oriented target emotion analysis method, device, equipment and medium | |
| CN118428474A (en) | Question and answer processing method, device, storage medium and electronic device | |
| CN118312612A (en) | A Chinese multi-label classification method integrating named entity recognition | |
| CN114444492B (en) | Non-standard word class discriminating method and computer readable storage medium | |
| CN114048737B (en) | Entity recognition model training method, device, equipment and entity recognition method | |
| CN118038868A (en) | Voice interaction method, server and computer-readable storage medium | |
| CN117112738B (en) | Deep learning question-answering method, device and system based on massive knowledge patterns |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |