
CN114120963B - Synthesis method and device for English dubbing, storage medium and electronic device - Google Patents

Info

Publication number
CN114120963B
Authority
CN
China
Prior art keywords
word
english
preset
replaced
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111412688.0A
Other languages
Chinese (zh)
Other versions
CN114120963A (en)
Inventor
李健保
盛沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN202111412688.0A
Publication of CN114120963A
Application granted
Publication of CN114120963B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

本申请公开了一种英文配音的合成方法及装置、存储介质及电子设备,可应用于金融领域或其他领域。该方法包括:当接收到对英文文本进行语音合成的指令时,确定英文文本中的每个英文单词是否符合预设的替换条件,并将符合该替换条件的英文单词确定为待替换单词;在预设的多个拼接单词组中,确定每个待替换单词对应的目标拼接单词组;将英文文本中每个待替换单词替换为其对应的目标拼接单词组,得到替换后的英文文本;通过预设的语音合成模型对替换后的英文文本进行处理,将处理得到的合成语音作为该英文文本对应的英文配音。应用本申请的方法,可将不易准确发音的英文单词替换成拼接词组,有利于提高合成发音的准确度,可避免人工纠正,提高效率。

The present application discloses a synthesis method and device for English dubbing, a storage medium and an electronic device, which can be applied to the financial field or other fields. The method comprises: when receiving an instruction for speech synthesis of an English text, determining whether each English word in the English text meets a preset replacement condition, and determining the English words that meet the replacement condition as words to be replaced; determining a target splicing word group corresponding to each word to be replaced in a preset plurality of splicing word groups; replacing each word to be replaced in the English text with its corresponding target splicing word group to obtain the replaced English text; processing the replaced English text through a preset speech synthesis model, and using the synthesized speech obtained by the processing as the English dubbing corresponding to the English text. By using the method of the present application, English words that are difficult to pronounce accurately can be replaced with splicing word groups, which is conducive to improving the accuracy of synthesized pronunciation, avoiding manual correction, and improving efficiency.

Description

English dubbing synthesis method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of speech synthesis, and in particular to a method and apparatus for synthesizing English dubbing, a storage medium, and an electronic device.
Background
In the daily work of service institutions such as banks, various English-language materials often need to be dubbed so that voice services can be provided to users.
At present, English material is usually dubbed by speech synthesis: the text of the English material is input into a preset speech synthesis model to obtain the corresponding English dubbing.
In practice, English dubbing obtained with existing synthesis methods is often pronounced inaccurately. To improve accuracy, the pronunciation is usually corrected by manual dubbing, or the speech synthesis model is adjusted and the speech is synthesized again. Both approaches consume a large amount of manpower and time, so dubbing synthesis is inefficient.
Disclosure of Invention
In view of this, the embodiment of the invention provides a method for synthesizing English dubbing, to solve the problem that improving the accuracy of English dubbing consumes a large amount of manpower and time and is inefficient.
The embodiment of the invention also provides an apparatus for synthesizing English dubbing, to ensure that the method can be implemented and applied in practice.
In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:
A method for synthesizing English dubbing, including:
when an instruction for performing speech synthesis on an English text is received, determining whether each English word in the English text meets a preset replacement condition, and determining each English word that meets the preset replacement condition as a word to be replaced;
determining, among a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
replacing each word to be replaced in the English text with its corresponding target spliced word group to obtain a replaced English text;
processing the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and taking the synthesized speech as the English dubbing corresponding to the English text.
In the above method, optionally, determining whether each English word in the English text meets the preset replacement condition includes:
for each English word in the English text, judging whether a word matching the English word exists among the preset words, and if so, determining that the English word meets the preset replacement condition.
In the above method, optionally, judging whether a word matching the English word exists among the preset words includes:
comparing each preset word with the English word;
if one of the preset words is the same as the English word, determining that a word matching the English word exists among the preset words;
if none of the preset words is the same as the English word, determining that no word matching the English word exists among the preset words.
In the above method, optionally, determining, among the plurality of preset spliced word groups, the target spliced word group corresponding to each word to be replaced in the English text includes:
determining the one-to-one correspondence between the plurality of spliced word groups and the preset words;
determining the target preset word corresponding to each word to be replaced, wherein the target preset word corresponding to a word to be replaced is the preset word that matches that word;
determining the spliced word group corresponding to the target preset word of each word to be replaced as the target spliced word group corresponding to that word.
In the above method, optionally, processing the replaced English text through the preset speech synthesis model to obtain the synthesized speech corresponding to the replaced English text includes:
loading the replaced English text into the input layer of the speech synthesis model, and, after the speech synthesis model has processed it, acquiring the output speech from the output layer of the speech synthesis model;
taking the acquired output speech as the synthesized speech corresponding to the replaced English text.
An apparatus for synthesizing English dubbing, including:
a first determining unit, configured to determine, when an instruction for performing speech synthesis on an English text is received, whether each English word in the English text meets a preset replacement condition, and to determine each English word that meets the preset replacement condition as a word to be replaced;
a second determining unit, configured to determine, among a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
a replacing unit, configured to replace each word to be replaced in the English text with its corresponding target spliced word group to obtain a replaced English text;
a synthesizing unit, configured to process the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and to take the synthesized speech as the English dubbing corresponding to the English text.
In the above apparatus, optionally, the first determining unit includes:
a judging subunit, configured to judge, for each English word in the English text, whether a word matching the English word exists among the preset words, and if so, to determine that the English word meets the preset replacement condition.
In the above apparatus, optionally, the judging subunit includes:
a comparison subunit, configured to compare each preset word with the English word;
a first determining subunit, configured to determine that a word matching the English word exists among the preset words if one of the preset words is the same as the English word;
a second determining subunit, configured to determine that no word matching the English word exists among the preset words if none of the preset words is the same as the English word.
A storage medium including stored instructions, wherein the instructions, when executed, control the device in which the storage medium is located to perform the method for synthesizing English dubbing described above.
An electronic device including a memory and one or more instructions, wherein the one or more instructions are stored in the memory and are configured to be executed by one or more processors to perform the method for synthesizing English dubbing described above.
In the method for synthesizing English dubbing, when an instruction for performing speech synthesis on an English text is received, it is determined whether each English word in the English text meets a preset replacement condition, and each English word that meets the condition is determined as a word to be replaced. A target spliced word group corresponding to each word to be replaced is determined among a plurality of preset spliced word groups, where each spliced word group is formed by splicing a plurality of words and the combined pronunciation of the words in each target spliced word group matches the pronunciation of the corresponding word to be replaced. Each word to be replaced in the English text is replaced with its target spliced word group to obtain a replaced English text. The replaced English text is processed through a preset speech synthesis model to obtain synthesized speech, which is taken as the English dubbing corresponding to the English text.
By applying the method provided by the embodiment of the invention, a replacement condition can be preset, and the English words to be replaced can be identified in the English text and replaced with their corresponding spliced word groups. A single word that the speech synthesis model has difficulty pronouncing accurately is thereby replaced with a spliced word group of several simple words that the model can pronounce accurately. This improves the accuracy of the English dubbing of the English text, avoids consuming a large amount of manpower and time on manual dubbing or model adjustment, and improves dubbing efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for synthesizing English dubbing provided by an embodiment of the invention;
Fig. 2 is another flowchart of the method for synthesizing English dubbing provided by an embodiment of the invention;
Fig. 3 is a flowchart of a synthesis process of English dubbing provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of an apparatus for synthesizing English dubbing provided by an embodiment of the invention;
Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the present disclosure, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As noted in the background, in existing methods for synthesizing English dubbing, the English text to be dubbed is generally input directly into a speech synthesis model. If the synthesized dubbing is pronounced inaccurately, it is generally re-dubbed manually, or the training samples are adjusted to modify the speech synthesis model and the dubbing is synthesized again, in order to improve accuracy. However, this approach generally requires a large amount of manpower and time, and manual dubbing also introduces inconsistent speech styles.
Therefore, the embodiment of the invention provides a method and a device for synthesizing English dubbing, a storage medium and electronic equipment, wherein a spliced word group is used for replacing a designated word in an English text, and the replaced English text is used for performing voice synthesis, so that the accuracy of voice synthesis is improved, manual correction is avoided, and the dubbing efficiency is improved.
It should be noted that the method and device for synthesizing English dubbing, the storage medium and the electronic device provided by the invention can be used in the financial field or other fields, for example, can be used in the application scene of banking service dubbing in the financial field. Other fields are any field other than the financial field, for example, the communication service field. The foregoing is merely an example, and the method and apparatus for synthesizing english dubbing, the storage medium, and the application field of the electronic device provided by the present invention are not limited.
The embodiment of the invention provides a synthesizing method of English dubbing, which can be applied to a voice synthesizing system, wherein an execution subject of the method can be a processor of the system, and a flow chart of the method is shown in fig. 1 and comprises the following steps:
S101, when an instruction for performing speech synthesis on an English text is received, determining whether each English word in the English text meets a preset replacement condition, and determining each English word that meets the preset replacement condition as a word to be replaced;
In the method provided by the embodiment of the invention, the user can import English text through the front end and send the voice synthesis instruction to the processor. When the processor receives the instruction, the English text can be identified, each English word contained in the text is identified one by one, and whether each English word meets the preset replacement condition is judged. And determining each English word meeting the replacement condition in the English text as a word to be replaced.
S102, determining, among a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
In the method provided by the embodiment of the invention, a plurality of spliced word groups can be preset. Each spliced word group is a word group formed by splicing a plurality of words; for example, the spliced word group fine-nance is formed by splicing the two words fine and nance. Each spliced word group is obtained in advance by analyzing a single word that speech synthesis tends to mispronounce, splitting it into several words whose combined pronunciation is homophonic with it, and then splicing the split words together. For example, for the word finish, the pronunciation synthesized by the speech synthesis model does not match the correct pronunciation of finish. The word can instead be split into fine-nance, that is, fine-nance is used as its corresponding spliced word group; the pronunciation that the speech synthesis model produces for the spliced word group fine-nance is the same as the correct pronunciation of finish.
For each word to be replaced, the corresponding target spliced word group can be determined in each spliced word group.
S103, replacing each word to be replaced in the English text with its corresponding target spliced word group to obtain a replaced English text;
In the method provided by the embodiment of the invention, the words to be replaced are replaced throughout the English text: every occurrence of each word to be replaced is replaced with its corresponding target spliced word group. Taking finish as the word to be replaced, every finish in the English text is replaced with its target spliced word group fine-nance. After every word to be replaced has been replaced with its target spliced word group, the replaced English text is obtained. It should be noted that this is merely an example, and the English text may contain a plurality of different words to be replaced.
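The whole-text replacement in S103 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `replace_words`, the use of regular-expression word boundaries, and case-sensitive matching are all assumptions. Only the pair finish / fine-nance comes from the text.

```python
import re

# Hypothetical table; the pair "finish" -> "fine-nance" is the example from the text.
SPLICED_GROUPS = {"finish": "fine-nance"}


def replace_words(text: str, spliced_groups: dict) -> str:
    """Replace every whole-word occurrence of each word to be replaced."""
    if not spliced_groups:
        return text
    # \b keeps "finishing" intact; matching is case-sensitive (an assumption).
    alternatives = "|".join(re.escape(word) for word in spliced_groups)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.sub(lambda m: spliced_groups[m.group(0)], text)


replaced = replace_words("Please finish the form. We finish today.", SPLICED_GROUPS)
print(replaced)
```

Matching on word boundaries is one way to ensure that only complete words are replaced; the text itself does not specify how words are delimited.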
S104, processing the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and taking the synthesized speech as the English dubbing corresponding to the English text.
In the method provided by the embodiment of the invention, the replaced English text can be input into the preset speech synthesis model, which performs speech synthesis on it to obtain the corresponding synthesized speech. The synthesized speech produced by the model is used as the English dubbing corresponding to the English text.
Based on the method provided by the embodiment of the invention, during speech synthesis of an English text, it is determined whether each English word in the text meets the preset replacement condition, and each word to be replaced is replaced with the target spliced word group whose combined pronunciation matches it. The replaced English text is then processed by the speech synthesis model to obtain the English dubbing corresponding to the English text. With this method, an English word that the speech synthesis model cannot easily pronounce accurately is replaced with a spliced word group formed from several words whose combined pronunciation is the same as that of the English word. Because the speech synthesis model is more accurate on simple words (words with simpler syllables, such as monosyllabic words) than on complex words (words with more complicated syllables, such as polysyllabic words), synthesizing each word in the spliced word group accurately restores the correct pronunciation of the corresponding English word. This improves the accuracy of the English dubbing without consuming a large amount of manpower and time on manual dubbing or model adjustment, and improves the efficiency of dubbing synthesis.
Based on the method shown in Fig. 1, in the method provided by the embodiment of the invention, determining whether each English word in the English text meets the preset replacement condition includes:
for each English word in the English text, judging whether a word matching the English word exists among the preset words, and if so, determining that the English word meets the preset replacement condition.
In the method provided by the embodiment of the invention, a plurality of preset words are configured in advance. Each preset word can be a word that the speech synthesis model tends to mispronounce, which a staff member can identify by analyzing speech samples synthesized by the model. Whether an English word meets the preset replacement condition is determined by judging whether a word matching it exists among the preset words.
Based on the method provided by the above embodiment, in the method provided by the embodiment of the invention, judging whether a word matching the English word exists among the preset words includes:
comparing each preset word with the English word;
if one of the preset words is the same as the English word, determining that a word matching the English word exists among the preset words;
if none of the preset words is the same as the English word, determining that no word matching the English word exists among the preset words.
In the method provided by the embodiment of the invention, the English word currently being judged is compared with each preset word to determine whether the two are identical. If, during this comparison, one preset word is the same as the English word, a word matching the English word exists among the preset words. The preset words are all different from one another, so at most one preset word can be identical to the English word. If every preset word differs from the English word, no matching word exists among the preset words.
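The comparison described above amounts to an exact-equality check against each preset word. The sketch below is illustrative only: the names `PRESET_WORDS` and `meets_replacement_condition` are hypothetical, and splitting the text on whitespace is a simplifying assumption that ignores punctuation.

```python
# Hypothetical preset word list; "finish" is the example word used in the text.
PRESET_WORDS = ["finish"]


def meets_replacement_condition(english_word: str, preset_words: list) -> bool:
    """Compare the English word with each preset word; identical means matched."""
    for preset_word in preset_words:
        if preset_word == english_word:
            # Preset words are all distinct, so at most one can be identical.
            return True
    return False


# Whitespace splitting is an assumption made to keep the sketch short.
words = "We finish the finishing touches".split()
to_replace = [w for w in words if meets_replacement_condition(w, PRESET_WORDS)]
print(to_replace)
```

Here finishing is not selected, because the check requires the preset word and the English word to be the same, not merely similar.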
On the basis of the method provided in the foregoing embodiment, Fig. 2 shows the process of determining, among the plurality of preset spliced word groups, the target spliced word group corresponding to each word to be replaced in the English text. The process includes:
S201, determining the one-to-one correspondence between the plurality of spliced word groups and the preset words;
In the method provided by the embodiment of the invention, the spliced word groups and the preset words are configured in one-to-one correspondence, and the combined pronunciation of the words contained in each spliced word group is the same as the pronunciation of the corresponding preset word.
The one-to-one correspondence between the spliced word groups and the preset words is determined from the preset correspondence, that is, the spliced word group corresponding to each preset word is determined.
S202, determining the target preset word corresponding to each word to be replaced, wherein the target preset word corresponding to a word to be replaced is the preset word that matches that word;
In the method provided by the embodiment of the invention, for each word to be replaced, the preset word that matches it is taken as its target preset word.
S203, determining the spliced word group corresponding to the target preset word of each word to be replaced as the target spliced word group corresponding to that word.
In the method provided by the embodiment of the invention, based on the correspondence between spliced word groups and preset words, the spliced word group corresponding to each target preset word is taken as the target spliced word group of the word to be replaced that corresponds to that target preset word. For example, if the word to be replaced is the preset word finish and the spliced word group corresponding to the target preset word finish is fine-nance, then fine-nance is taken as the target spliced word group corresponding to the word to be replaced finish.
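Steps S201 to S203 can be sketched with a dictionary that holds the one-to-one correspondence. The function names and the choice of a dictionary are assumptions made for illustration; the text only requires that each preset word correspond to exactly one spliced word group.

```python
def build_correspondence(preset_words: list, spliced_groups: list) -> dict:
    """S201: pair each preset word with its spliced word group, one to one."""
    if len(preset_words) != len(spliced_groups):
        raise ValueError("each preset word needs exactly one spliced word group")
    if len(set(preset_words)) != len(preset_words):
        raise ValueError("preset words must all be different")
    return dict(zip(preset_words, spliced_groups))


def target_spliced_groups(words_to_replace: list, correspondence: dict) -> dict:
    """S202 and S203: find each target preset word, then its spliced word group."""
    targets = {}
    for word in words_to_replace:
        # S202: the target preset word is the preset word identical to the word.
        target_preset = next(p for p in correspondence if p == word)
        # S203: its spliced word group becomes the target spliced word group.
        targets[word] = correspondence[target_preset]
    return targets


# The pair below is the example from the text.
correspondence = build_correspondence(["finish"], ["fine-nance"])
targets = target_spliced_groups(["finish"], correspondence)
print(targets)
```

Because the match is an exact comparison, S202 and S203 collapse to a single dictionary lookup in practice; they are kept separate here only to mirror the steps in the text.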
To better explain the synthesis process of English dubbing provided by the embodiment of the invention, the process is further described below with reference to the flowchart shown in Fig. 3.
As shown in Fig. 3, a specific implementation flow of the method provided by the embodiment of the invention includes:
1. Finding the places where pronunciation is inaccurate;
By listening to speech synthesized by the speech synthesis model, a staff member can find and analyze the places where polysyllabic words are pronounced inaccurately. It should be noted that analyzing polysyllabic words is only one specific embodiment. In practice, the words that the speech synthesis model tends to mispronounce are usually polysyllabic, but the object of analysis is not limited in a specific application: any word that is pronounced inaccurately can be treated as a place of inaccurate pronunciation.
2. Splitting the polysyllabic words into homophonic words in a targeted manner, splicing the homophonic words, and storing them in the backend;
For each place of inaccurate pronunciation, the corresponding English word is split in a targeted manner into several words that together are homophonic with its correct pronunciation. These words are spliced together, and the spliced word group is then stored in the backend in correspondence with the English word.
3. Requesting the data from the backend, replacing the inaccurately pronounced words in the English text with the spliced text, and performing speech synthesis.
When speech synthesis needs to be performed on an English text, the preconfigured inaccurately pronounced English words and their corresponding spliced word groups can be requested from the backend. Each English word in the English text is identified, and every word in the text that belongs to the inaccurately pronounced words (usually all of the inaccurately pronounced polysyllabic words) is replaced with its corresponding spliced word group, that is, with several words whose combined pronunciation is homophonic with it. Speech is then synthesized from the replaced English text.
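The store-then-request flow in steps 2 and 3 might look like the sketch below. The text does not say how the backend stores the pairs, so a local JSON file stands in for it here; the file name and the helpers `save_pair`, `load_pairs`, and `prepare_text` are hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path


def save_pair(store: Path, word: str, spliced_group: str) -> None:
    """Step 2: store a mispronounced word together with its spliced word group."""
    pairs = json.loads(store.read_text()) if store.exists() else {}
    pairs[word] = spliced_group
    store.write_text(json.dumps(pairs))


def load_pairs(store: Path) -> dict:
    """Step 3: request the preconfigured pairs from the (stand-in) backend."""
    return json.loads(store.read_text())


def prepare_text(text: str, pairs: dict) -> str:
    """Step 3: replace each stored word in the text before speech synthesis."""
    return re.sub(r"[A-Za-z]+", lambda m: pairs.get(m.group(0), m.group(0)), text)


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "spliced_groups.json"  # hypothetical backend file
    save_pair(store, "finish", "fine-nance")   # example pair from the text
    pairs = load_pairs(store)
    prepared = prepare_text("We finish at noon.", pairs)

print(prepared)
```

A real deployment would presumably replace the JSON file with a database or service call; the point of the sketch is only that the pairs are configured once and then fetched each time a text is synthesized.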
Based on the method shown in Fig. 1, in the method provided by the embodiment of the invention, processing the replaced English text through the preset speech synthesis model to obtain the synthesized speech corresponding to the replaced English text includes:
loading the replaced English text into the input layer of the speech synthesis model, and, after the speech synthesis model has processed it, acquiring the output speech from the output layer of the speech synthesis model;
taking the acquired output speech as the synthesized speech corresponding to the replaced English text.
In the method provided by the embodiment of the invention, the replaced English text is input into the preset speech synthesis model, and after the model has processed it, the speech output by the model is used as the English dubbing corresponding to the English text for which speech synthesis was requested.
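Putting S101 to S104 together, the flow can be sketched end to end. The speech synthesis model is represented by any callable that maps text to audio bytes; this interface, the stand-in `fake_model`, and the function name `synthesize_dubbing` are assumptions, since the text does not specify the model or its API.

```python
import re
from typing import Callable

# Assumed interface: the model takes text as input and returns audio bytes.
SynthesisModel = Callable[[str], bytes]


def synthesize_dubbing(english_text: str, spliced_groups: dict,
                       model: SynthesisModel) -> bytes:
    # S101: identify the words that meet the replacement condition.
    words = re.findall(r"[A-Za-z]+", english_text)
    to_replace = {w for w in words if w in spliced_groups}
    # S102: determine the target spliced word group for each of them.
    targets = {w: spliced_groups[w] for w in to_replace}
    # S103: replace every occurrence to obtain the replaced English text.
    replaced_text = re.sub(
        r"[A-Za-z]+", lambda m: targets.get(m.group(0), m.group(0)), english_text
    )
    # S104: the model output for the replaced text is the dubbing of the original.
    return model(replaced_text)


def fake_model(text: str) -> bytes:
    """Stand-in for a real model: returns the input text as bytes, not audio."""
    return text.encode("utf-8")


audio = synthesize_dubbing("We finish soon.", {"finish": "fine-nance"}, fake_model)
print(audio)
```

Passing the model as a parameter keeps the replacement logic independent of any particular synthesis engine, which matches the text's point that the model itself does not need to be adjusted.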
Corresponding to the method for synthesizing English dubbing shown in FIG. 1, an embodiment of the invention also provides a device for synthesizing English dubbing, which is used to implement the method shown in FIG. 1. Its structure is shown in FIG. 4, and it comprises:
a first determining unit 301, configured to determine, when an instruction for performing speech synthesis on an English text is received, whether each English word in the English text meets a preset replacement condition, and to determine each English word that meets the preset replacement condition as a word to be replaced;
a second determining unit 302, configured to determine, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, where each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
a replacing unit 303, configured to replace each word to be replaced in the English text with the target spliced word group corresponding to that word, so as to obtain a replaced English text;
a synthesizing unit 304, configured to process the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and to use the synthesized speech as the English dubbing corresponding to the English text.
With the device provided by the embodiment of the invention, in the process of performing speech synthesis on the English text, it can be determined whether each English word in the English text meets the preset replacement condition, and each word to be replaced, that is, each English word that meets the condition, is replaced with the target spliced word group whose combined pronunciation matches the pronunciation of that word. The replaced English text is then processed by the speech synthesis model to obtain the English dubbing corresponding to the English text. With this device, an English word whose accurate pronunciation the speech synthesis model cannot easily synthesize can be replaced with a spliced word group formed by splicing a plurality of words. Because the speech synthesis model is more accurate on simple words (words with simpler syllables, such as monosyllabic words) than on complex words (words with more complicated syllables, such as polysyllabic words), synthesizing the speech of each word in the spliced word group accurately restores the correct pronunciation of the corresponding English word. This improves the accuracy of speech synthesis and therefore of the English dubbing of the English text, and it does so without spending large amounts of manpower and time on manual dubbing or model adjustment, which also improves dubbing synthesis efficiency.
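One way to picture how the four units cooperate is the sketch below, in which each unit becomes a method of a single class. The class name, the method names, the tokenizing regular expression, the sample table entry, and the stand-in model are all assumptions made for illustration; the source does not prescribe any particular software structure.

```python
import re
from typing import Callable, Dict, List

WORD_PATTERN = r"[A-Za-z']+"  # assumed definition of an English word


class EnglishDubbingDevice:
    """Illustrative composition of units 301 to 304 (hypothetical names)."""

    def __init__(self, spliced_groups: Dict[str, str], model: Callable[[str], bytes]):
        self.spliced_groups = spliced_groups  # preset word -> spliced word group
        self.model = model                    # preset speech synthesis model

    def first_determining_unit(self, text: str) -> List[str]:
        # Unit 301: collect the words that meet the replacement condition.
        return [w for w in re.findall(WORD_PATTERN, text) if w in self.spliced_groups]

    def second_determining_unit(self, words: List[str]) -> Dict[str, str]:
        # Unit 302: find the target spliced word group for each word.
        return {w: self.spliced_groups[w] for w in words}

    def replacing_unit(self, text: str, targets: Dict[str, str]) -> str:
        # Unit 303: substitute each word to be replaced in the text.
        return re.sub(WORD_PATTERN, lambda m: targets.get(m.group(0), m.group(0)), text)

    def synthesizing_unit(self, replaced_text: str) -> bytes:
        # Unit 304: run the model and use its output as the dubbing.
        return self.model(replaced_text)

    def dub(self, text: str) -> bytes:
        words = self.first_determining_unit(text)
        targets = self.second_determining_unit(words)
        replaced_text = self.replacing_unit(text, targets)
        return self.synthesizing_unit(replaced_text)


# The table entry and the byte-encoding "model" are invented stand-ins.
device = EnglishDubbingDevice({"forecast": "four cast"}, lambda t: t.encode("utf-8"))
result = device.dub("The forecast is good.")
```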
On the basis of the apparatus provided by the foregoing embodiment, in the apparatus provided by the embodiment of the present invention, the first determining unit 301 includes:
a judging subunit, configured to judge, for each English word in the English text, whether a word matching the English word exists among the preset words, and if such a word exists, to determine that the English word meets the preset replacement condition.
On the basis of the device provided by the above embodiment, in the device provided by the embodiment of the present invention, the judging subunit includes:
a comparison subunit, configured to compare each preset word with the English word;
a first determining subunit, configured to determine that a word matching the English word exists among the preset words if one of the preset words is the same as the English word;
a second determining subunit, configured to determine that no word matching the English word exists among the preset words if none of the preset words is the same as the English word.
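The matching rule carried out by these subunits amounts to an exact comparison against each preset word, which can be sketched as below. The function name `has_matching_preset_word` and the sample preset words are hypothetical and chosen only for illustration.

```python
def has_matching_preset_word(english_word, preset_words):
    """Return True if some preset word is the same as `english_word`."""
    for preset_word in preset_words:
        # Comparison subunit: compare each preset word with the English word.
        if preset_word == english_word:
            # First determining subunit: a matching word exists.
            return True
    # Second determining subunit: no preset word is the same as the English word.
    return False


# Hypothetical preset words, invented for illustration only.
preset_words = ["forecast", "window"]
print(has_matching_preset_word("forecast", preset_words))  # True
print(has_matching_preset_word("cat", preset_words))       # False
```

Storing the preset words in a set or as dictionary keys would give the same result with a constant-time lookup; the explicit loop is kept here only to mirror the three subunits.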
On the basis of the apparatus provided by the foregoing embodiment, in the apparatus provided by the embodiment of the present invention, the second determining unit 302 includes:
a third determining subunit, configured to determine the one-to-one correspondence between the plurality of spliced word groups and the preset words;
a fourth determining subunit, configured to determine the target preset word corresponding to each word to be replaced, where the target preset word corresponding to a word to be replaced is the preset word that matches that word;
a fifth determining subunit, configured to determine, for each word to be replaced, the spliced word group corresponding to its target preset word as the target spliced word group corresponding to that word.
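The lookup performed by these three subunits can be sketched with a dictionary that maps each preset word to its spliced word group. The function name `find_target_groups` and the sample entries are hypothetical, and the explicit one-to-one check is an assumption about how the correspondence might be validated.

```python
def find_target_groups(words_to_replace, group_for_preset_word):
    """Map each word to be replaced to its target spliced word group."""
    # Third determining subunit: the correspondence must be one-to-one,
    # so no two preset words may share a spliced word group.
    groups = list(group_for_preset_word.values())
    if len(set(groups)) != len(groups):
        raise ValueError("correspondence is not one-to-one")

    targets = {}
    for word in words_to_replace:
        # Fourth determining subunit: the target preset word is the preset
        # word matching the word to be replaced (the same string here).
        target_preset_word = word
        # Fifth determining subunit: its spliced word group is the target.
        targets[word] = group_for_preset_word[target_preset_word]
    return targets


# Hypothetical entries, invented for illustration only.
group_for_preset_word = {"forecast": "four cast", "window": "win dough"}
targets = find_target_groups(["window"], group_for_preset_word)
print(targets)  # {'window': 'win dough'}
```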
On the basis of the apparatus provided by the foregoing embodiment, in the apparatus provided by the embodiment of the present invention, the synthesizing unit 304 includes:
The model processing subunit is used for loading the replaced English text to the input layer of the voice synthesis model, acquiring output voice from the output layer of the voice synthesis model after the processing of the voice synthesis model, and taking the acquired output voice as the synthesized voice corresponding to the replaced English text.
An embodiment of the invention also provides a storage medium that includes stored instructions, where the instructions, when executed, control the device in which the storage medium is located to execute the above method for synthesizing English dubbing.
An embodiment of the present invention further provides an electronic device, whose structural schematic diagram is shown in FIG. 5. It specifically includes a memory 401 and one or more instructions 402, where the one or more instructions 402 are stored in the memory 401 and are configured to be executed by one or more processors 403 to perform the following operations:
when an instruction for synthesizing English text is received, determining whether each English word in the English text accords with a preset replacement condition, and determining the English word which accords with the preset replacement condition as a word to be replaced;
determining, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
Replacing each word to be replaced in the English text with a target spliced word group corresponding to the word to be replaced to obtain a replaced English text;
processing the replaced English text through a preset voice synthesis model to obtain synthesized voice corresponding to the replaced English text, and taking the synthesized voice as English dubbing corresponding to the English text.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because a device or system embodiment is substantially similar to the method embodiment, its description is relatively brief, and the relevant details can be found in the description of the method embodiment. The device and system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for synthesizing English dubbing, characterized by comprising:
when an instruction for performing speech synthesis on an English text is received, determining whether each English word in the English text meets a preset replacement condition, and determining each English word that meets the preset replacement condition as a word to be replaced;
determining, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
replacing each word to be replaced in the English text with the target spliced word group corresponding to that word to obtain a replaced English text;
processing the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and taking the synthesized speech as the English dubbing corresponding to the English text.
2. The method of claim 1, wherein determining whether each English word in the English text meets a preset replacement condition comprises:
for each English word in the English text, judging whether a word matching the English word exists among the preset words, and if such a word exists, determining that the English word meets the preset replacement condition.
3. The method of claim 2, wherein judging whether a word matching the English word exists among the preset words comprises:
comparing each preset word with the English word;
if one of the preset words is the same as the English word, determining that a word matching the English word exists among the preset words;
if none of the preset words is the same as the English word, determining that no word matching the English word exists among the preset words.
4. The method of claim 2, wherein determining, from the plurality of preset spliced word groups, the target spliced word group corresponding to each word to be replaced in the English text comprises:
determining the one-to-one correspondence between the plurality of spliced word groups and the preset words;
determining the target preset word corresponding to each word to be replaced, wherein the target preset word corresponding to a word to be replaced is the preset word that matches that word;
determining, for each word to be replaced, the spliced word group corresponding to its target preset word as the target spliced word group corresponding to that word.
5. The method of claim 1, wherein processing the replaced English text through the preset speech synthesis model to obtain the synthesized speech corresponding to the replaced English text comprises:
loading the replaced English text into the input layer of the speech synthesis model, and acquiring the output speech from the output layer of the speech synthesis model after the model has processed the text;
taking the acquired output speech as the synthesized speech corresponding to the replaced English text.
6. An apparatus for synthesizing English dubbing, comprising:
a first determining unit, configured to determine, when an instruction for performing speech synthesis on an English text is received, whether each English word in the English text meets a preset replacement condition, and to determine each English word that meets the preset replacement condition as a word to be replaced;
a second determining unit, configured to determine, from a plurality of preset spliced word groups, a target spliced word group corresponding to each word to be replaced in the English text, wherein each spliced word group is formed by splicing a plurality of words, and the combined pronunciation of the words contained in each target spliced word group matches the pronunciation of the corresponding word to be replaced;
a replacing unit, configured to replace each word to be replaced in the English text with the target spliced word group corresponding to that word to obtain a replaced English text;
a synthesizing unit, configured to process the replaced English text through a preset speech synthesis model to obtain synthesized speech corresponding to the replaced English text, and to take the synthesized speech as the English dubbing corresponding to the English text.
7. The apparatus according to claim 6, wherein the first determining unit includes:
a judging subunit, configured to judge, for each English word in the English text, whether a word matching the English word exists among the preset words, and if such a word exists, to determine that the English word meets the preset replacement condition.
8. The apparatus of claim 7, wherein the judging subunit comprises:
a comparison subunit, configured to compare each preset word with the English word;
a first determining subunit, configured to determine that a word matching the English word exists among the preset words if one of the preset words is the same as the English word;
a second determining subunit, configured to determine that no word matching the English word exists among the preset words if none of the preset words is the same as the English word.
9. A storage medium, wherein the storage medium includes stored instructions, and the instructions, when executed, control the device in which the storage medium is located to execute the method for synthesizing English dubbing according to any one of claims 1 to 5.
10. An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform the method for synthesizing English dubbing according to any one of claims 1 to 5.
CN202111412688.0A 2021-11-25 2021-11-25 Synthesis method and device for English dubbing, storage medium and electronic device Active CN114120963B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111412688.0A CN114120963B (en) 2021-11-25 2021-11-25 Synthesis method and device for English dubbing, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111412688.0A CN114120963B (en) 2021-11-25 2021-11-25 Synthesis method and device for English dubbing, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN114120963A CN114120963A (en) 2022-03-01
CN114120963B true CN114120963B (en) 2025-04-15

Family

ID=80372920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111412688.0A Active CN114120963B (en) 2021-11-25 2021-11-25 Synthesis method and device for English dubbing, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN114120963B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101236743A (en) * 2007-01-30 2008-08-06 国际商业机器公司 System and method for generating high quality speech
CN108109610A (en) * 2017-11-06 2018-06-01 芋头科技(杭州)有限公司 A kind of simulation vocal technique and simulation sonification system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2119397C (en) * 1993-03-19 2007-10-02 Kim E.A. Silverman Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation
CN109002454B (en) * 2018-04-28 2022-05-27 陈逸天 Method and electronic equipment for determining spelling partition of target word
CN109979257B (en) * 2019-04-27 2021-01-08 深圳市数字星河科技有限公司 Method for performing accurate splitting operation correction based on English reading automatic scoring
CN110765785B (en) * 2019-09-19 2024-03-22 平安科技(深圳)有限公司 Chinese-English translation method based on neural network and related equipment thereof
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN112906360A (en) * 2021-02-05 2021-06-04 李四艳 Spelling and marking method and device for English text
CN113450760A (en) * 2021-06-07 2021-09-28 北京一起教育科技有限责任公司 Method and device for converting text into voice and electronic equipment

Also Published As

Publication number Publication date
CN114120963A (en) 2022-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant