Disclosure of Invention
In view of this, an embodiment of the invention provides a method for synthesizing English dubbing, to solve the problem that improving the accuracy of English dubbing currently consumes a large amount of manpower and time and is therefore inefficient.
The embodiment of the invention also provides a synthesizing device for English dubbing, which is used for ensuring the practical realization and application of the method.
In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:
A method for synthesizing English dubbing includes:
when an instruction for performing speech synthesis on an English text is received, determining whether each English word in the English text meets a preset replacement condition, and determining each English word that meets the preset replacement condition as a word to be replaced;
determining, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
replacing each word to be replaced in the English text with its corresponding target spliced word group to obtain a replaced English text; and
processing the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and taking the synthesized speech as the English dubbing corresponding to the English text.
In the above method, optionally, the determining whether each English word in the English text meets a preset replacement condition includes:
for each English word in the English text, judging whether a word matching the English word exists among the preset words, and if so, determining that the English word meets the preset replacement condition.
In the above method, optionally, the judging whether a word matching the English word exists among the preset words includes:
comparing each preset word with the English word;
if one of the preset words is the same as the English word, determining that a word matching the English word exists among the preset words; and
if none of the preset words is the same as the English word, determining that no word matching the English word exists among the preset words.
In the above method, optionally, the determining, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text includes:
determining a one-to-one correspondence between the plurality of spliced word groups and the preset words;
determining a target preset word corresponding to each word to be replaced, wherein the target preset word is the preset word, among the preset words, that matches the word to be replaced; and
determining the spliced word group corresponding to the target preset word of each word to be replaced as the target spliced word group corresponding to that word to be replaced.
In the above method, optionally, the processing the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text includes:
loading the replaced English text into an input layer of the speech synthesis model, and, after the speech synthesis model has processed it, acquiring output speech from an output layer of the speech synthesis model; and
taking the acquired output speech as the synthesized speech corresponding to the replaced English text.
A device for synthesizing English dubbing includes:
a first determining unit, configured to, when an instruction for performing speech synthesis on an English text is received, determine whether each English word in the English text meets a preset replacement condition, and determine each English word that meets the preset replacement condition as a word to be replaced;
a second determining unit, configured to determine, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
a replacing unit, configured to replace each word to be replaced in the English text with its corresponding target spliced word group to obtain a replaced English text; and
a synthesizing unit, configured to process the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and to take the synthesized speech as the English dubbing corresponding to the English text.
In the above device, optionally, the first determining unit includes:
a judging subunit, configured to, for each English word in the English text, judge whether a word matching the English word exists among the preset words, and if so, determine that the English word meets the preset replacement condition.
In the above device, optionally, the judging subunit includes:
a comparison subunit, configured to compare each preset word with the English word;
a first determining subunit, configured to determine that a word matching the English word exists among the preset words if one of the preset words is the same as the English word; and
a second determining subunit, configured to determine that no word matching the English word exists among the preset words if none of the preset words is the same as the English word.
A storage medium includes stored instructions, wherein the instructions, when executed, control a device in which the storage medium is located to perform the method for synthesizing English dubbing described above.
An electronic device includes a memory and one or more instructions, wherein the one or more instructions are stored in the memory and are configured to be executed by one or more processors to perform the method for synthesizing English dubbing described above.
Based on the above method for synthesizing English dubbing provided by the embodiment of the invention, when an instruction for performing speech synthesis on an English text is received, it is determined whether each English word in the English text meets a preset replacement condition, and each English word meeting the preset replacement condition is determined as a word to be replaced. A target spliced word group corresponding to each word to be replaced is determined from a plurality of preset spliced word groups, where each spliced word group is formed by splicing a plurality of words and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced. Each word to be replaced in the English text is replaced with its corresponding target spliced word group to obtain a replaced English text. The replaced English text is processed through a preset speech synthesis model to obtain synthesized speech, and the synthesized speech is taken as the English dubbing corresponding to the English text.
By applying the method provided by the embodiment of the invention, a replacement condition can be preset, and the English words to be replaced can be identified in the English text and replaced with their corresponding spliced word groups. A single word that the speech synthesis model has difficulty pronouncing accurately is thereby replaced with a spliced word group made up of several simple words that the model can pronounce accurately. This improves the accuracy of the English dubbing of the English text without consuming a large amount of manpower and time on manual dubbing or model adjustment, and improves dubbing efficiency.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
In the present disclosure, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As is known from the background art, in existing methods for synthesizing English dubbing, the English text to be dubbed is generally input directly into a speech synthesis model for synthesis. If the synthesized dubbing contains inaccurate pronunciation, the text is generally dubbed manually, or the training samples are adjusted to modify the speech synthesis model and the text is dubbed again, in order to improve accuracy. However, such an approach generally requires a large amount of manpower and time, and manual dubbing also suffers from inconsistent speech styles.
Therefore, the embodiment of the invention provides a method and a device for synthesizing English dubbing, a storage medium and electronic equipment, wherein a spliced word group is used for replacing a designated word in an English text, and the replaced English text is used for performing voice synthesis, so that the accuracy of voice synthesis is improved, manual correction is avoided, and the dubbing efficiency is improved.
It should be noted that the method and device for synthesizing English dubbing, the storage medium, and the electronic device provided by the invention can be used in the financial field or in other fields; for example, they can be used in a banking service dubbing scenario in the financial field. Other fields are any fields other than the financial field, for example the communication service field. The foregoing is merely an example, and the application field of the method and device for synthesizing English dubbing, the storage medium, and the electronic device provided by the invention is not limited.
The embodiment of the invention provides a synthesizing method of English dubbing, which can be applied to a voice synthesizing system, wherein an execution subject of the method can be a processor of the system, and a flow chart of the method is shown in fig. 1 and comprises the following steps:
S101, when an instruction for performing speech synthesis on an English text is received, determining whether each English word in the English text meets a preset replacement condition, and determining each English word that meets the preset replacement condition as a word to be replaced;
In the method provided by the embodiment of the invention, the user can import the English text through the front end and send a speech synthesis instruction to the processor. When the processor receives the instruction, it can parse the English text, identify the English words contained in the text one by one, and judge whether each English word meets the preset replacement condition. Each English word in the English text that meets the replacement condition is determined as a word to be replaced.
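As an illustration only, step S101 might be sketched as follows. The function name `find_words_to_replace`, the regular expression used to delimit English words, and the case-sensitive comparison are assumptions made for this sketch; the description does not specify how words are tokenized.

```python
import re

# Assumption: an English word is a run of letters, optionally with internal
# apostrophes. The description does not define tokenization.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def find_words_to_replace(text, preset_words):
    """Return (start, end, word) for every word meeting the replacement condition.

    The replacement condition used here is the one given in the description:
    an identical preset word exists. Matching is case-sensitive (assumption).
    """
    hits = []
    for match in WORD_PATTERN.finditer(text):
        word = match.group(0)
        if word in preset_words:
            hits.append((match.start(), match.end(), word))
    return hits


# "finish" is the example word used in the description.
hits = find_words_to_replace(
    "Please finish the form, then finish the call.", {"finish"}
)
```

Recording the character offsets keeps each occurrence addressable, so a later step can replace every occurrence rather than only the first.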
S102, determining, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
In the method provided by the embodiment of the invention, a plurality of spliced word groups can be preset. Each spliced word group is formed by splicing a plurality of words; for example, the spliced word group fine-nance is formed by splicing the two words fine and nance. A spliced word group is obtained in advance by analyzing a single word that the speech synthesis model tends to mispronounce, splitting it into a plurality of words whose combined pronunciation is homophonic with it, and then splicing the split words together. For example, for the word finish, the pronunciation synthesized by the speech synthesis model does not match the correct pronunciation of finish. The word can instead be split into fine-nance, that is, fine-nance is used as its corresponding spliced word group, and the pronunciation obtained when the speech synthesis model synthesizes the spliced word group fine-nance is the same as the correct pronunciation of finish.
For each word to be replaced, the corresponding target spliced word group can be determined in each spliced word group.
S103, replacing each word to be replaced in the English text with a target spliced word group corresponding to the word to be replaced to obtain a replaced English text;
In the method provided by the embodiment of the invention, every word to be replaced in the English text is replaced in full, that is, each occurrence of a word to be replaced is replaced with its corresponding target spliced word group. Taking the word to be replaced finish as an example, each finish in the English text is replaced with its corresponding target spliced word group fine-nance. After every word to be replaced has been replaced with its corresponding target spliced word group, the replaced English text is obtained. It should be noted that this is merely an example, and the English text may contain a plurality of different words to be replaced.
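A minimal sketch of the replacement in S103 is shown below, assuming the word-to-group correspondence is held in a dictionary. The function name `replace_words` and the use of regular-expression word boundaries are choices made for this sketch, not details given in the description.

```python
import re


def replace_words(text, groups):
    """Replace every whole-word occurrence of each key with its spliced word group."""
    if not groups:
        return text
    # Longest keys first, so a longer preset word is tried before a shorter one.
    alternatives = "|".join(
        re.escape(word) for word in sorted(groups, key=len, reverse=True)
    )
    # \b restricts matches to whole words, so "finishing" is left untouched.
    pattern = re.compile(rf"\b(?:{alternatives})\b")
    return pattern.sub(lambda match: groups[match.group(0)], text)


replaced = replace_words(
    "We finish today; finishing later.", {"finish": "fine-nance"}
)
```

Replacing in a single pass avoids re-scanning text that has already been substituted, which matters if a spliced word group happens to contain another preset word.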
S104, processing the replaced English text through a preset voice synthesis model to obtain synthesized voice corresponding to the replaced English text, and taking the synthesized voice as English dubbing corresponding to the English text.
In the method provided by the embodiment of the invention, the replaced English text can be input into the preset voice synthesis model, so that the voice synthesis model carries out voice synthesis processing on the replaced English text to obtain the synthesized voice corresponding to the replaced English text, and the synthesized voice obtained by processing the voice synthesis model is used as English dubbing corresponding to the English text.
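Step S104 can be sketched with the speech synthesis model treated as an opaque callable. The `Dubbing` record, the function `synthesize_dubbing`, and `placeholder_model` are hypothetical names introduced here; the placeholder merely encodes text as bytes and stands in for a real model, which the description does not specify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Dubbing:
    """Associates synthesized audio with the original (unreplaced) English text."""
    original_text: str
    replaced_text: str
    audio: bytes


def synthesize_dubbing(
    original_text: str,
    replaced_text: str,
    model: Callable[[str], bytes],
) -> Dubbing:
    # The model only ever sees the replaced text; its output is then recorded
    # as the dubbing of the original text.
    audio = model(replaced_text)
    return Dubbing(original_text, replaced_text, audio)


def placeholder_model(text: str) -> bytes:
    """Stand-in for a real speech synthesis model (assumption for this sketch)."""
    return text.encode("utf-8")


dubbing = synthesize_dubbing("finish now", "fine-nance now", placeholder_model)
```

Keeping the original text alongside the audio reflects the point that the synthesized speech is the dubbing of the original English text, even though the model was fed the replaced text.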
Based on the method provided by the embodiment of the invention, in the process of performing speech synthesis on an English text, it is determined whether each English word in the English text meets the preset replacement condition, and each word to be replaced is replaced with a target spliced word group whose combined pronunciation matches the pronunciation of that word. The replaced English text is then processed by the speech synthesis model to obtain the English dubbing corresponding to the English text. By applying this method, an English word that the speech synthesis model cannot readily pronounce accurately is replaced with a spliced word group formed by splicing a plurality of words, whose combined pronunciation is the same as that of the English word. Because the speech synthesis model is more accurate on simple words (words with simpler syllables, such as monosyllabic words) than on complex words (words with more complicated syllables, such as polysyllabic words), synthesizing the speech of each word in the spliced word group accurately restores the correct pronunciation of the corresponding English word. This improves the accuracy of speech synthesis and of the English dubbing of the English text without consuming a large amount of manpower and time on manual dubbing or model adjustment, and improves dubbing synthesis efficiency.
Based on the method shown in fig. 1, in the method provided by the embodiment of the present invention, the determining whether each english word in the english text meets a preset replacement condition includes:
for each English word in the English text, judging whether a word matching the English word exists among the preset words, and if so, determining that the English word meets the preset replacement condition.
In the method provided by the embodiment of the invention, a plurality of preset words are configured in advance. Each preset word can be a word that the speech synthesis model tends to mispronounce, and staff can determine these words by analyzing speech samples synthesized by the model. Whether an English word meets the preset replacement condition can be determined by judging whether a word matching the English word exists among the preset words.
Based on the method provided by the above embodiment, in the method provided by the embodiment of the present invention, the judging whether a word matching the English word exists among the preset words includes:
comparing each preset word with the English word;
if one of the preset words is the same as the English word, determining that a word matching the English word exists among the preset words; and
if none of the preset words is the same as the English word, determining that no word matching the English word exists among the preset words.
In the method provided by the embodiment of the invention, the English word currently being judged can be compared with each preset word in turn to judge whether the English word is identical to that preset word. If, during the comparison, one preset word is found to be the same as the English word, a word matching the English word exists among the preset words. The preset words are all different from one another, so at most one preset word is identical to the English word. If every preset word differs from the English word, no word matching the English word exists among the preset words.
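The comparison described above can be written as a plain loop. This is a sketch under the assumption that "the same as" means exact, case-sensitive string equality; the function names are hypothetical.

```python
def find_matching_preset_word(word, preset_words):
    """Compare the word with each preset word; return the identical one, or None."""
    for preset in preset_words:
        if preset == word:
            # Preset words are mutually distinct, so the first identical
            # preset word is the only one and the loop can stop here.
            return preset
    return None


def meets_replacement_condition(word, preset_words):
    """The word meets the condition when a matching preset word exists."""
    return find_matching_preset_word(word, preset_words) is not None


# "finish" is the example from the description; "example" is a hypothetical entry.
presets = ["finish", "example"]
```

When the preset words are held in a set or used as dictionary keys, the same check reduces to a membership test (`word in preset_words`), which avoids scanning the whole list for every word in the text.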
On the basis of the method provided in the foregoing embodiment, fig. 2 shows the process, in the method for synthesizing English dubbing provided by the embodiment of the invention, of determining, from a plurality of preset spliced word groups, the target spliced word group corresponding to each word to be replaced in the English text. The process includes:
S201, determining a one-to-one correspondence between the plurality of spliced word groups and the preset words;
In the method provided by the embodiment of the invention, the spliced word groups and the preset words are configured in one-to-one correspondence, and the combined pronunciation of the words contained in each spliced word group is the same as the pronunciation of the corresponding preset word.
The one-to-one correspondence between the plurality of spliced word groups and the preset words is determined according to the preset correspondence, that is, the spliced word group corresponding to each preset word is determined.
S202, determining a target preset word corresponding to each word to be replaced, wherein the target preset word is the preset word, among the preset words, that matches the word to be replaced;
In the method provided by the embodiment of the invention, for each word to be replaced, the preset word that matches it is taken as the target preset word corresponding to that word to be replaced.
S203, determining the spliced word group corresponding to the target preset word of each word to be replaced as the target spliced word group corresponding to that word to be replaced.
In the method provided by the embodiment of the invention, based on the correspondence between spliced word groups and preset words, the spliced word group corresponding to each target preset word can be taken as the target spliced word group of the word to be replaced that corresponds to that target preset word. For example, if the word to be replaced is finish, its target preset word is the preset word finish, the spliced word group corresponding to the target preset word finish is fine-nance, and the spliced word group fine-nance is taken as the target spliced word group corresponding to the word to be replaced finish.
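Steps S201 to S203 might be sketched as below. The function names, the representation of the preset correspondence as two parallel lists, and the second pair (`example` / `egg-sample`) are assumptions added for illustration; only the `finish` / `fine-nance` pair comes from the description.

```python
def build_correspondence(preset_words, spliced_groups):
    """S201: build the one-to-one correspondence, rejecting inputs that are not one-to-one."""
    if len(preset_words) != len(spliced_groups):
        raise ValueError("each preset word needs exactly one spliced word group")
    if len(set(preset_words)) != len(preset_words):
        raise ValueError("preset words must be mutually distinct")
    if len(set(spliced_groups)) != len(spliced_groups):
        raise ValueError("spliced word groups must be mutually distinct")
    return dict(zip(preset_words, spliced_groups))


def target_groups(words_to_replace, correspondence):
    """S202 and S203: map each word to be replaced to its target spliced word group."""
    targets = {}
    for word in words_to_replace:
        # S202: matching means identity, so the target preset word is the
        # preset word equal to the word to be replaced.
        target_preset_word = word
        # S203: look up the spliced word group of that target preset word.
        targets[word] = correspondence[target_preset_word]
    return targets


correspondence = build_correspondence(
    ["finish", "example"],        # "example" is a hypothetical second entry
    ["fine-nance", "egg-sample"],
)
targets = target_groups(["finish"], correspondence)
```

Validating the correspondence once, when it is built, means the per-word lookup in S202 and S203 can rely on there being exactly one spliced word group for each preset word.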
In order to better explain the synthesis process of english dubbing provided by the embodiment of the invention, the synthesis process of english dubbing provided by the embodiment of the invention is further described below with reference to the flowchart shown in fig. 3.
As shown in fig. 3, a specific implementation flow of the method provided by the embodiment of the present invention includes:
1. Searching for inaccurate pronunciations;
By listening to the speech synthesized by the speech synthesis model, staff can find and analyze the places where polysyllabic words are pronounced inaccurately. It should be noted that analyzing polysyllabic words is only one specific embodiment: in practice, the words that the speech synthesis model tends to mispronounce are usually polysyllabic words, but the object of analysis is not limited in a specific application, and any inaccurately pronounced word can be treated as an inaccurate pronunciation.
2. Splitting the polysyllabic words into homophonic words in a targeted manner, splicing the homophonic words, and storing them in the background;
For each inaccurate pronunciation, the corresponding English word is split in a targeted manner into a plurality of homophonic words that together give the correct pronunciation, the homophonic words are spliced, and the resulting spliced word group is stored in the background in correspondence with the English word.
3. Requesting from the background, replacing the inaccurately pronounced words in the English text with the spliced text, and performing speech synthesis.
When speech synthesis needs to be performed on an English text, the pre-configured inaccurately pronounced English words and their corresponding spliced word groups can be requested from the background. Each English word in the English text is identified, and all words in the text that belong to the inaccurately pronounced words (usually all the inaccurately pronounced polysyllabic words) are replaced: each word to be replaced is replaced with its corresponding spliced word group, that is, with a plurality of homophonic words having the same combined pronunciation. Speech synthesis is then performed using the replaced English text.
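The store-then-request flow in steps 2 and 3 could look roughly like this. `BackgroundStore` is a hypothetical in-memory stand-in for the background; the JSON payload, the hyphen used to splice the parts (as in fine-nance), and the whitespace tokenization are all assumptions of this sketch.

```python
import json


class BackgroundStore:
    """In-memory stand-in for the background service (hypothetical)."""

    def __init__(self):
        self._pairs = {}

    def store(self, word, homophone_parts):
        # Step 2: splice the homophonic words and store the spliced word
        # group in correspondence with the original word.
        self._pairs[word] = "-".join(homophone_parts)

    def handle_request(self):
        # Step 3: return every configured pair as a JSON payload.
        return json.dumps(self._pairs)


store = BackgroundStore()
store.store("finish", ["fine", "nance"])

# Request the pairs from the background, then replace words in the text.
pairs = json.loads(store.handle_request())
text = "They finish early"
# Simplification: whitespace tokenization, so punctuation attached to a
# word would prevent a match.
replaced = " ".join(pairs.get(token, token) for token in text.split())
```

Keeping the pairs in the background rather than in the synthesis code lets staff add newly discovered mispronounced words without changing or retraining anything on the synthesis side.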
Based on the method shown in fig. 1, in the method provided by the embodiment of the present invention, the processing the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text includes:
loading the replaced English text into an input layer of the speech synthesis model, and, after the speech synthesis model has processed it, acquiring output speech from an output layer of the speech synthesis model; and
taking the acquired output speech as the synthesized speech corresponding to the replaced English text.
In the method provided by the embodiment of the invention, the replaced English text is input into the preset speech synthesis model, and after the model has processed it, the speech output by the model is used as the English dubbing corresponding to the English text for which speech synthesis was requested.
Corresponding to the method for synthesizing English dubbing shown in FIG. 1, the embodiment of the invention also provides a device for synthesizing English dubbing, which is used for realizing the method shown in FIG. 1, and the structure diagram is shown in FIG. 4, and comprises:
a first determining unit 301, configured to, when an instruction for performing speech synthesis on an English text is received, determine whether each English word in the English text meets a preset replacement condition, and determine each English word that meets the preset replacement condition as a word to be replaced;
a second determining unit 302, configured to determine, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, where each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
a replacing unit 303, configured to replace each word to be replaced in the English text with its corresponding target spliced word group to obtain a replaced English text; and
a synthesizing unit 304, configured to process the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and to use the synthesized speech as the English dubbing corresponding to the English text.
Based on the device provided by the embodiment of the invention, in the process of performing speech synthesis on an English text, it is determined whether each English word in the English text meets the preset replacement condition, and each word to be replaced is replaced with a target spliced word group whose combined pronunciation matches the pronunciation of that word. The replaced English text is then processed by the speech synthesis model to obtain the English dubbing corresponding to the English text. By applying this device, an English word that the speech synthesis model cannot readily pronounce accurately is replaced with a spliced word group formed by splicing a plurality of words. Because the speech synthesis model is more accurate on simple words (words with simpler syllables, such as monosyllabic words) than on complex words (words with more complicated syllables, such as polysyllabic words), synthesizing the speech of each word in the spliced word group accurately restores the correct pronunciation of the corresponding English word. This improves the accuracy of speech synthesis and of the English dubbing of the English text without consuming a large amount of manpower and time on manual dubbing or model adjustment, and improves dubbing synthesis efficiency.
On the basis of the device provided by the foregoing embodiment, in the device provided by the embodiment of the present invention, the first determining unit 301 includes:
a judging subunit, configured to, for each English word in the English text, judge whether a word matching the English word exists among the preset words, and if so, determine that the English word meets the preset replacement condition.
On the basis of the device provided by the above embodiment, in the device provided by the embodiment of the present invention, the judging subunit includes:
a comparison subunit, configured to compare each preset word with the English word;
a first determining subunit, configured to determine that a word matching the English word exists among the preset words if one of the preset words is the same as the English word; and
a second determining subunit, configured to determine that no word matching the English word exists among the preset words if none of the preset words is the same as the English word.
On the basis of the device provided by the foregoing embodiment, in the device provided by the embodiment of the present invention, the second determining unit 302 includes:
a third determining subunit, configured to determine a one-to-one correspondence between the plurality of spliced word groups and the preset words;
a fourth determining subunit, configured to determine a target preset word corresponding to each word to be replaced, where the target preset word is the preset word, among the preset words, that matches the word to be replaced; and
a fifth determining subunit, configured to determine the spliced word group corresponding to the target preset word of each word to be replaced as the target spliced word group corresponding to that word to be replaced.
On the basis of the device provided by the foregoing embodiment, in the device provided by the embodiment of the present invention, the synthesizing unit 304 includes:
a model processing subunit, configured to load the replaced English text into the input layer of the speech synthesis model, acquire output speech from the output layer of the speech synthesis model after the model has processed it, and take the acquired output speech as the synthesized speech corresponding to the replaced English text.
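One possible way to mirror the unit structure of units 301 to 304 in software is a single class with one method per unit. This is a structural sketch only: the class name `DubbingDevice`, the letter-run tokenization, and the byte-encoding stand-in for the speech synthesis model are assumptions, and the description does not prescribe any particular implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

WORD = re.compile(r"[A-Za-z]+")  # assumption: a word is a run of letters


@dataclass
class DubbingDevice:
    groups: Dict[str, str]          # preset word -> spliced word group
    model: Callable[[str], bytes]   # stand-in for the speech synthesis model

    def first_determining_unit(self, text: str) -> List[str]:
        """Unit 301: collect the distinct words that meet the replacement condition."""
        found: List[str] = []
        for word in WORD.findall(text):
            if word in self.groups and word not in found:
                found.append(word)
        return found

    def second_determining_unit(self, to_replace: List[str]) -> Dict[str, str]:
        """Unit 302: determine the target spliced word group of each word."""
        return {word: self.groups[word] for word in to_replace}

    def replacing_unit(self, text: str, targets: Dict[str, str]) -> str:
        """Unit 303: replace every occurrence of each word to be replaced."""
        return WORD.sub(lambda m: targets.get(m.group(0), m.group(0)), text)

    def synthesizing_unit(self, replaced_text: str) -> bytes:
        """Unit 304: pass the replaced text through the model."""
        return self.model(replaced_text)

    def synthesize(self, text: str) -> bytes:
        to_replace = self.first_determining_unit(text)
        targets = self.second_determining_unit(to_replace)
        return self.synthesizing_unit(self.replacing_unit(text, targets))


device = DubbingDevice({"finish": "fine-nance"}, lambda t: t.encode("utf-8"))
audio = device.synthesize("finish it, finish all")
```

Passing the model in as a callable keeps the sketch independent of any particular speech synthesis library, and makes each unit testable in isolation.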
The embodiment of the invention also provides a storage medium, which includes stored instructions, wherein the instructions, when executed, control the device in which the storage medium is located to perform the above method for synthesizing English dubbing.
The embodiment of the present invention further provides an electronic device, whose structural schematic diagram is shown in fig. 5. The electronic device specifically includes a memory 401 and one or more instructions 402, where the one or more instructions 402 are stored in the memory 401 and are configured to be executed by one or more processors 403 to perform the following operations:
when an instruction for performing speech synthesis on an English text is received, determining whether each English word in the English text meets a preset replacement condition, and determining each English word that meets the preset replacement condition as a word to be replaced;
determining, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
replacing each word to be replaced in the English text with its corresponding target spliced word group to obtain a replaced English text; and
processing the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and taking the synthesized speech as the English dubbing corresponding to the English text.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the device and system embodiments are substantially similar to the method embodiment, they are described relatively briefly, and the relevant parts of the description of the method embodiment may be referred to. The device and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the present invention without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.