
CN118298803B - Speech cloning method - Google Patents

Speech cloning method

Info

Publication number
CN118298803B
Authority
CN
China
Prior art keywords
phoneme
text
processed
voice
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410388083.XA
Other languages
Chinese (zh)
Other versions
CN118298803A (en)
Inventor
雷涛 (Lei Tao)
谭可华 (Tan Kehua)
徐东 (Xu Dong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyun Rongchuang Data Science & Technology Beijing Co ltd
Original Assignee
Tianyun Rongchuang Data Science & Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyun Rongchuang Data Science & Technology Beijing Co ltd filed Critical Tianyun Rongchuang Data Science & Technology Beijing Co ltd
Priority to CN202410388083.XA priority Critical patent/CN118298803B/en
Publication of CN118298803A publication Critical patent/CN118298803A/en
Application granted granted Critical
Publication of CN118298803B publication Critical patent/CN118298803B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G10L2013/105 - Duration
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/18 - the extracted parameters being spectral information of each sub-band
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application provides a voice cloning method, which relates to the technical field of artificial intelligence. The method comprises the steps of: obtaining a phoneme identifier corresponding to a text to be processed, a tone identifier of a target object, and a mel frequency spectrum corresponding to the voice of the target object; inputting the phoneme identifier, the tone identifier and the mel frequency spectrum into an acoustic model; calculating the extracted phoneme features of the text to be processed and the tone features of the target object by utilizing a multi-head attention mechanism to obtain a first voice feature of each phoneme in the text to be processed; predicting the pronunciation duration of each phoneme according to the first voice feature of each phoneme and the phoneme features of the text to be processed to obtain a second voice feature of each phoneme; inputting the second voice feature of each phoneme into a decoding module of the acoustic model for decoding to obtain the continuous voice feature of each phoneme; and inputting the continuous voice feature of each phoneme into a vocoder model for voice synthesis, and outputting the cloned voice of the target object for the text to be processed.

Description

Speech cloning method
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a voice cloning method.
Background
Voice cloning synthesizes speech that closely resembles an original speaker's voice, and can be widely applied in scenes such as digital humans, virtual humans and video creation. In the related art, text is generally converted into phonemes, the phonemes are converted into acoustic features, and voice imitating the target speaker is then generated based on the acoustic features and the mel frequency spectrum of the target speaker, so as to obtain the cloned voice of the target speaker.
However, with this voice cloning method, the obtained cloned voice has low similarity with the actual pronunciation of the target speaker.
Disclosure of Invention
In order to solve the problem of low similarity between the cloned voice and the actual pronunciation of the target speaker, the application provides a voice cloning method, a voice cloning device, electronic equipment and a storage medium.
In a first aspect, the present application provides a voice cloning method, including:
Acquiring a phoneme identifier corresponding to a text to be processed, a tone identifier of a target object and a Mel frequency spectrum corresponding to the voice of the target object;
Inputting the phoneme identifier, the tone identifier and the mel frequency spectrum into an acoustic model, and extracting the phoneme features of the text to be processed and the tone features of the target object by using an encoding module in the acoustic model;
Calculating the phoneme features and the tone features by utilizing a multi-head attention mechanism to obtain first voice features of each phoneme in the text to be processed;
Predicting the pronunciation time length of each phoneme according to the first voice characteristic of each phoneme and the phoneme characteristic of the text to be processed to obtain a second voice characteristic of each phoneme, wherein the second voice characteristic comprises the pronunciation time length of each phoneme;
Inputting the second voice characteristic of each phoneme into a decoding module of the acoustic model for decoding processing to obtain continuous voice characteristics of each phoneme;
And inputting the continuous voice characteristics of each phoneme into a vocoder model for voice synthesis, and outputting the cloned voice of the target object aiming at the text to be processed.
As an optional implementation manner of the embodiment of the present application, the encoding module of the acoustic model includes a phoneme encoder and a spectrum encoder; the inputting the phoneme identifier, the tone identifier and the mel spectrum into an acoustic model, and extracting the phoneme features of the text to be processed and the tone features of the target object by using the encoding module of the acoustic model, includes:
Inputting the phoneme identification corresponding to the text to be processed and the tone identification of the target object into the phoneme encoder for feature extraction to obtain the phoneme features of the text to be processed;
And inputting the Mel spectrum corresponding to the voice of the target object into the spectrum encoder for feature extraction to obtain the tone feature of the target object.
As an optional implementation manner of the embodiment of the present application, the acoustic model includes an attention module, and the calculating the phoneme feature and the tone feature by using a multi-head attention mechanism to obtain a first speech feature of each phoneme in the text to be processed includes:
taking the phoneme characteristics of the text to be processed as query contents, and taking the tone characteristics of the target object as keywords and content values to be input into the attention module;
And calculating the phoneme characteristics of the text to be processed and the tone characteristics of the target object by utilizing a multi-head attention mechanism to obtain a first voice characteristic of each phoneme in the text to be processed.
As an optional implementation manner of the embodiment of the present application, the acoustic model comprises a duration predictor, and the predicting the pronunciation duration of each phoneme according to the first voice feature of each phoneme and the phoneme features of the text to be processed to obtain the second voice feature of each phoneme comprises the following steps:
Extracting global tone characteristics of the mel spectrum;
And inputting the characteristics obtained by adding the first voice characteristics and the global tone characteristics of each phoneme in the text to be processed and the phoneme characteristics of the text to be processed into the duration predictor, predicting the pronunciation duration of each phoneme, normalizing the pronunciation duration of each phoneme, and obtaining and outputting the second voice characteristics of each phoneme.
As an optional implementation manner of the embodiment of the present application, the vocoder model comprises a multi-head vector quantizer and a HifiGAN vocoder; the inputting the continuous voice features of each phoneme into the vocoder model for voice synthesis, and outputting the cloned voice of the target object for the text to be processed, comprises the following steps:
inputting the continuous voice characteristics of each phoneme into the multi-head vector quantizer for quantization processing to obtain discrete voice characteristics of each phoneme;
And inputting the discrete voice characteristics of each phoneme into the HifiGAN vocoder to perform voice synthesis, obtaining and outputting the cloned voice of the target object aiming at the text to be processed.
As an optional implementation manner of the embodiment of the present application, the obtaining a phoneme identifier corresponding to a text to be processed includes:
Acquiring a text to be processed, and converting characters in the text to be processed into text phonemes;
And converting the text phonemes into phoneme identifications corresponding to the text to be processed based on a phoneme identification dictionary.
As an optional implementation manner of the embodiment of the present application, the obtaining the text to be processed, converting characters in the text to be processed into text phonemes, includes:
acquiring a text to be processed, if the text to be processed comprises Chinese characters and English characters, separating the Chinese characters from the English characters, and adding prosody grades for the Chinese characters;
Converting the Chinese characters and prosodic grades into Chinese phonemes, wherein the Chinese phonemes comprise character phonemes corresponding to the Chinese characters and prosodic phonemes corresponding to the prosodic grades, and converting the English characters into English phonemes;
and splicing the Chinese phonemes and the English phonemes to form text phonemes corresponding to the text to be processed.
In a second aspect, the present application provides a voice cloning apparatus comprising:
The acquisition module is used for acquiring a phoneme identifier corresponding to the text to be processed, a tone identifier of a target object and a Mel frequency spectrum corresponding to the voice of the target object;
the encoding module is used for inputting the phoneme identifications, the tone identifications and the mel frequency spectrum into an acoustic model, and extracting the phoneme characteristics of the text to be processed and the tone characteristics of the target object by utilizing the encoding module in the acoustic model;
The computing module is used for computing the phoneme features and the tone features by utilizing a multi-head attention mechanism to obtain first voice features of each phoneme in the text to be processed;
The processing module is used for predicting the pronunciation time length of each phoneme according to the first voice characteristic of each phoneme and the phoneme characteristic of the text to be processed to obtain a second voice characteristic of each phoneme, wherein the second voice characteristic comprises the pronunciation time length of each phoneme;
the decoding module is used for inputting the second voice characteristic of each phoneme into the decoding module of the acoustic model for decoding processing to obtain the continuous voice characteristic of each phoneme;
And the synthesis module is used for inputting the continuous voice characteristics of each phoneme into a vocoder model to perform voice synthesis and outputting the cloned voice of the target object aiming at the text to be processed.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the voice cloning method according to the first aspect or any optional implementation manner of the first aspect when the computer program is invoked.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium, on which a computer program is stored, the computer program implementing the method for cloning speech according to the first aspect or any alternative implementation manner of the first aspect when being executed by a processor.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
The voice cloning method comprises the steps of: obtaining a phoneme identifier corresponding to a text to be processed, a tone identifier of a target object, and a mel frequency spectrum corresponding to the voice of the target object; inputting the phoneme identifier, the tone identifier and the mel frequency spectrum into an acoustic model, and extracting the phoneme features of the text to be processed and the tone features of the target object by using a coding module in the acoustic model; calculating the phoneme features and the tone features by using a multi-head attention mechanism to obtain the first voice feature of each phoneme in the text to be processed; predicting the pronunciation duration of each phoneme according to the first voice feature of each phoneme and the phoneme features of the text to be processed to obtain the second voice feature of each phoneme; inputting the second voice feature of each phoneme into a decoding module of the acoustic model for decoding to obtain the continuous voice feature of each phoneme; and inputting the continuous voice feature of each phoneme into a vocoder model for voice synthesis, and outputting the cloned voice of the target object for the text to be processed.
According to the embodiment of the application, the multi-head attention mechanism is utilized to calculate the phoneme features of the text to be processed and the tone features of the target object to obtain the first voice features, and duration prediction is carried out on the basis of the first voice features and the phoneme features of the text to be processed to obtain the second voice features. That is, the second voice features comprise the phoneme features, the tone features and the duration features of the phonemes, which is equivalent to extracting the voice features of the target object for each phoneme from multiple angles, so that the voice synthesized by the vocoder model on the basis of the continuous voice features of each phoneme is closer to the pronunciation of the target object, and the authenticity of the cloned voice is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of generating cloned voice corresponding to text to be processed according to one embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of a method for voice cloning according to one embodiment of the present application;
FIG. 3 is a flowchart of a voice cloning method according to an embodiment of the present application;
FIG. 4 is a block diagram of a voice cloning apparatus according to one embodiment of the present application;
Fig. 5 is an internal structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, embodiments and advantages of the present application more apparent, an exemplary embodiment of the present application will be described more fully hereinafter with reference to the accompanying drawings in which exemplary embodiments of the application are shown, it being understood that the exemplary embodiments described are merely some, but not all, of the examples of the application.
Based on the exemplary embodiments described herein, all other embodiments that may be obtained by one of ordinary skill in the art without making any inventive effort are within the scope of the appended claims. Furthermore, while the present disclosure has been described in terms of an exemplary embodiment or embodiments, it should be understood that each aspect of the disclosure can be practiced separately from the other aspects. It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The voice cloning method provided by the embodiment of the application can be obtained based on a trained voice cloning model, wherein the voice cloning model comprises an acoustic model and a vocoder model.
Referring to fig. 1, fig. 1 is a schematic flow chart of generating cloned voice corresponding to a text to be processed according to an embodiment of the present application. The flow comprises: obtaining a text to be processed and normalizing it, where text normalization includes normalizing punctuation in the text; when the text to be processed is a mixed Chinese-English text, separating the Chinese and English characters; converting the English characters into corresponding phonemes; adding prosody grades to the Chinese characters and converting them into corresponding phonemes; mapping the phonemes corresponding to the Chinese characters and the phonemes corresponding to the English characters to corresponding phoneme identifiers; and inputting the phoneme identifiers, the tone identifier of the target object and the voice of the target object into a trained voice cloning model to perform voice cloning, obtaining and outputting the cloned voice of the target object for the text to be processed.
In one embodiment, as shown in fig. 2, a voice cloning method is provided, and fig. 2 is a step flowchart of the voice cloning method provided in one embodiment of the present application, including the following steps S21-S26.
S21, obtaining a phoneme identifier corresponding to the text to be processed, a tone identifier of a target object and a Mel frequency spectrum corresponding to the voice of the target object.
The phoneme identifier is used for identifying a corresponding text phoneme. The text to be processed in the embodiment of the application may be a full Chinese text, a full English text, or a mixed Chinese-English text.
After the text to be processed is obtained, characters in the text to be processed are converted into text phonemes, and the text phonemes are converted into phoneme identifiers corresponding to the text to be processed based on a phoneme identification dictionary. The text phonemes in this embodiment refer to the phonemes corresponding to all characters obtained by converting the text, a phoneme being the minimum unit that constitutes a syllable; for example, the Chinese character "中" ("middle") in a text is converted into two phonemes, "zh" and "ong1".
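The character-to-phoneme conversion described above can be sketched as a simple table lookup. The table below is a tiny hypothetical stand-in for a real pronunciation dictionary; only the entry for "中" (into "zh" and "ong1") follows the example in the text:

```python
# Minimal grapheme-to-phoneme sketch; G2P_TABLE is a hypothetical
# stand-in for a real Mandarin pronunciation dictionary.
G2P_TABLE = {
    "中": ["zh", "ong1"],  # initial "zh" + final "ong" with tone 1
}

def chars_to_phonemes(text):
    """Convert each character of the text into its phoneme sequence."""
    phonemes = []
    for ch in text:
        phonemes.extend(G2P_TABLE.get(ch, []))
    return phonemes

print(chars_to_phonemes("中"))  # ['zh', 'ong1']
```

A real system would back this lookup with a full pronunciation lexicon and handle polyphonic characters by context, which this sketch omits.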
If the text to be processed is a full Chinese text, carrying out standardization processing on punctuation marks in the text to be processed, converting the text such as numbers, dates and the like in the text to be processed, removing punctuation and special characters except comma, period, semicolon and other symbols, replacing other punctuation marks with target punctuation marks, adding prosodic grades for Chinese characters, converting the Chinese characters with the prosodic grades into corresponding phonemes, and mapping the phonemes into corresponding phoneme identifications.
If the text to be processed is a full English text, carrying out standardization processing on punctuation marks in the text to be processed, converting text such as numbers and dates in the text to be processed, removing punctuation and special characters except symbols such as comma, period and semicolon, replacing other punctuation marks with target punctuation marks, converting the English characters into corresponding phonemes, and mapping the phonemes into corresponding phoneme identifiers.
If the text to be processed comprises Chinese characters and English characters, converting the characters in the text to be processed into text phonemes can comprise separating the Chinese characters from the English characters, adding prosody grades for the Chinese characters, wherein the prosody grades are used for representing pronunciation time of the Chinese characters, converting the Chinese characters and the prosody grades into Chinese phonemes, wherein the Chinese phonemes comprise character phonemes corresponding to the Chinese characters and prosody phonemes corresponding to the prosody grades, converting the English characters into English phonemes, and splicing the Chinese phonemes and the English phonemes to form the text phonemes corresponding to the text to be processed.
The punctuation marks in the mixed Chinese-English text are normalized, text such as numbers and dates in the text to be processed is converted, punctuation and special characters except symbols such as comma, period and semicolon are removed, and other punctuation marks are replaced with target punctuation marks. The prosody level may include four levels, such as a borderless level, a prosodic word boundary level, a prosodic phrase boundary level, and a prosodic intonation boundary level, prosody being the pronunciation rhythm and pattern of the sound to which the text corresponds. In the same sentence, different prosodic structures correspond to different pronunciations, and each prosodic level represents a different duration of pause; that is, different prosodic levels differ in stress, pause duration, and so on. One prosodic level also corresponds to one phoneme, i.e., the text phonemes comprise the phonemes corresponding to the characters and the phonemes corresponding to the prosodic levels; three phonemes are thus obtained by conversion after adding a prosodic level to "中".
Illustratively, the prosody level corresponding to the chinese character may be derived based on a chinese prosody model, which is a BERT-based classification model. After the text phonemes are obtained, the text phonemes are converted into phoneme identifications corresponding to the text to be processed based on a phoneme identification dictionary, and mapping relations of the phonemes and the phoneme identifications are recorded in the phoneme identification dictionary.
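The phoneme identification dictionary described above is a plain phoneme-to-integer mapping. In the sketch below, the identifier values and the "#1" notation for a prosodic-boundary phoneme are hypothetical, chosen only to illustrate that a character plus a prosody level yields three phonemes and hence three identifiers:

```python
# Hypothetical phoneme identification dictionary: each phoneme (including
# prosody-boundary "phonemes", written here as "#1") maps to an integer ID.
PHONEME_ID_DICT = {"zh": 11, "ong1": 42, "#1": 3}

def phonemes_to_ids(phonemes):
    """Map a text-phoneme sequence to its phoneme identifiers."""
    return [PHONEME_ID_DICT[p] for p in phonemes]

# "中" plus one prosodic level yields three phonemes, hence three IDs
print(phonemes_to_ids(["zh", "ong1", "#1"]))  # [11, 42, 3]
```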
The target object refers to the real sound object to which the cloned voice corresponds; for example, if the text to be processed is to be cloned into the voice of user A, user A is the target object. The tone identifier of the target object can be obtained based on one or more segments of voice of user A, which is not described in detail in the embodiments of the present application.
The mel spectrum corresponding to the voice of the target object may be obtained based on one or more segments of voice of the target object, which is not described in detail in the embodiments of the present application.
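The patent does not specify how the mel spectrum is computed. A standard construction (framing, windowed FFT, triangular mel filterbank) can be sketched in plain NumPy as follows; the sample rate, frame size, hop length and mel-band count are illustrative choices, not values from the patent:

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filterbank matrix of shape (n_mels, n_fft // 2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Equally spaced points on the mel scale, converted back to FFT bins
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb

def mel_spectrogram(wav, sr=16000, n_fft=512, hop=128, n_mels=8):
    """Mel spectrum of shape (n_mels, n_frames) from a mono waveform."""
    window = np.hanning(n_fft)
    frames = [wav[s:s + n_fft] * window
              for s in range(0, len(wav) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    return mel_filterbank(sr, n_fft, n_mels) @ power.T

wav = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
mel = mel_spectrogram(wav)
print(mel.shape)  # (8, 122)
```

In practice a library routine such as librosa's mel-spectrogram function would be used instead of hand-rolling the filterbank.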
S22, inputting the phoneme identification, the tone color identification and the Mel frequency spectrum into an acoustic model, and extracting the phoneme characteristics of the text to be processed and the tone color characteristics of the target object by using an encoding module in the acoustic model.
Referring to fig. 3, fig. 3 is a schematic flow chart of a speech cloning method according to an embodiment of the present application, where an encoding module of the acoustic model includes a phoneme encoder and a spectrum encoder.
Inputting the phoneme identifier corresponding to the text to be processed and the tone identifier of the target object into the phoneme encoder for feature extraction to obtain the phoneme features of the text to be processed; and inputting the mel frequency spectrum corresponding to the voice of the target object into the spectrum encoder for feature extraction to obtain the tone features of the target object.
S23, calculating the phoneme features and the tone features by utilizing a multi-head attention mechanism to obtain first voice features of each phoneme in the text to be processed.
The acoustic model comprises an attention module, wherein the process of obtaining the first voice characteristic can comprise the steps of taking the phoneme characteristic of the text to be processed as query content, taking the tone characteristic of the target object as a keyword and a content value to be input into the attention module, and calculating the phoneme characteristic of the text to be processed and the tone characteristic of the target object by utilizing a multi-head attention mechanism to obtain the first voice characteristic of each phoneme in the text to be processed.
Illustratively, the phoneme identifier corresponding to the text to be processed and the tone identifier of the target object are input into the phoneme encoder for encoding to obtain a hidden vector Hm, which serves as Q (Query, the query content); multi-head attention calculation is performed with the tone feature Gm of the target object, obtained from the mel spectrum, serving as K/V (Keys/Values), so as to obtain the first voice feature of each phoneme in the text to be processed, the first voice feature being a phoneme-level voice feature. The multi-head attention mechanism itself is not described in further detail in the embodiments of the present application.
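The Q/K/V arrangement above can be illustrated in NumPy. The dimensions are invented for the example, and the per-head projections are taken as identity slices for brevity, whereas a real acoustic model would use learned projection matrices per head:

```python
import numpy as np

def multi_head_attention(Q, K, V, n_heads=4):
    """Scaled dot-product attention over n_heads, identity projections.

    Q: (Tq, d) phoneme features Hm (query content, one row per phoneme).
    K, V: (Tk, d) timbre features Gm from the mel-spectrum encoder.
    Returns (Tq, d): a first speech feature per phoneme.
    """
    Tq, d = Q.shape
    dh = d // n_heads
    out = np.empty_like(Q)
    for h in range(n_heads):
        q = Q[:, h * dh:(h + 1) * dh]
        k = K[:, h * dh:(h + 1) * dh]
        v = V[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)
        # Numerically stable softmax over the key axis
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[:, h * dh:(h + 1) * dh] = weights @ v
    return out

rng = np.random.default_rng(0)
Hm = rng.normal(size=(6, 16))   # 6 phonemes, feature dim 16 (illustrative)
Gm = rng.normal(size=(40, 16))  # 40 mel frames of timbre features
first_speech_features = multi_head_attention(Hm, Gm, Gm)
print(first_speech_features.shape)  # (6, 16)
```

Note how the timbre features Gm serve as both K and V, matching the text: each phoneme query attends over the target object's timbre frames.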
S24, predicting the pronunciation time length of each phoneme according to the first voice feature of each phoneme and the phoneme feature of the text to be processed to obtain the second voice feature of each phoneme.
The second voice features comprise the pronunciation duration of each phoneme, the pronunciation duration features being embedded in the second voice features.
Referring to fig. 3, the acoustic model includes a duration predictor, and the predicting the pronunciation duration of each phoneme according to the first speech feature of each phoneme and the phoneme feature of the text to be processed to obtain the second speech feature of each phoneme includes:
Extracting global tone features of the mel frequency spectrum; inputting the features obtained by adding the first voice feature of each phoneme in the text to be processed and the global tone features, together with the phoneme features of the text to be processed, into the duration predictor; predicting the pronunciation duration of each phoneme; and normalizing the pronunciation duration of each phoneme to obtain and output the second voice feature of each phoneme.
Illustratively, global tone features are extracted based on the mel frequency spectrum corresponding to the voice of the target object, the global tone features and the first voice features are added and input into the duration predictor, the pronunciation duration and semantic representation (pronunciation duration feature) of each phoneme are predicted through up-sampling, and the pronunciation duration of each phoneme is normalized to obtain and output the second voice feature of each phoneme. For example, assuming the phoneme sequence is a c d b and the predicted pronunciation durations are 2, 1, 3 and 4 respectively, the sequence after normalization processing is a a c d d d b b b b; since the pronunciation duration feature of each phoneme is taken into account, the pronunciation characteristics of the target object can be captured more accurately.
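The duration expansion illustrated here behaves like the length regulator familiar from FastSpeech-style synthesizers (the name is an analogy, not a term from the patent): each phoneme's feature is repeated for its predicted number of frames.

```python
def length_regulate(phonemes, durations):
    """Expand each phoneme by its predicted duration (in frames)."""
    expanded = []
    for ph, d in zip(phonemes, durations):
        expanded.extend([ph] * d)
    return expanded

# Durations 2, 1, 3, 4 for phonemes a, c, d, b
print(length_regulate(["a", "c", "d", "b"], [2, 1, 3, 4]))
# ['a', 'a', 'c', 'd', 'd', 'd', 'b', 'b', 'b', 'b']
```

In the acoustic model the same expansion would be applied to the per-phoneme feature vectors rather than to symbolic labels.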
S25, inputting the second voice characteristic of each phoneme into a decoding module of the acoustic model for decoding processing, and obtaining the continuous voice characteristic of each phoneme.
Referring to fig. 3, the second voice feature is input into the decoding module and progressively downsampled in multiple stages to obtain multi-stage continuous semantic features (continuous voice features), i.e., features of different granularities within the speech.
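The multi-stage downsampling idea can be sketched as follows. This is a toy NumPy illustration of how successive stages expose features at several granularities (the patent's decoder is a learned network, not simple pooling); the function name is hypothetical.

```python
import numpy as np

def multi_stage_downsample(x, stages=3):
    """Average-pool the time axis by 2 at each stage, keeping every
    stage's output so features of different granularities are exposed."""
    outs = [x]
    for _ in range(stages):
        x = x.reshape(-1, 2).mean(axis=1)   # halve the time axis
        outs.append(x)
    return outs

feats = np.arange(16, dtype=float)          # 16 toy feature frames
outs = multi_stage_downsample(feats)
print([len(o) for o in outs])               # [16, 8, 4, 2]
```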
S26, inputting the continuous voice characteristics of each phoneme into a vocoder model for voice synthesis, and outputting cloned voice of the target object aiming at the text to be processed.
Referring to fig. 3, the vocoder model includes a multi-headed vector quantizer and HifiGAN vocoder, the inputting the continuous speech feature of each phoneme into the vocoder model for speech synthesis, outputting the cloned speech of the target object for the text to be processed includes:
Inputting the continuous voice features of each phoneme into the multi-head vector quantizer for quantization processing to obtain discrete voice features of each phoneme; and inputting the discrete voice features of each phoneme into the HifiGAN vocoder for voice synthesis to obtain and output the cloned voice of the target object for the text to be processed.
The multi-head vector quantizer (multi-head VQ) comprises an encoder, a decoder and a discrete codebook. The discrete codebook replaces continuous features with discrete ones, which facilitates further extraction of natural voice features, i.e., makes the cloned voice more natural. Discretizing the continuous voice features with a multi-head vector quantizer also mitigates the problem that a single large codebook (more codewords) yields a lower compression ratio and higher model-training complexity.
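The multi-head quantization step can be illustrated with a minimal NumPy sketch: the feature vector is split across heads, and each head is replaced by its nearest codeword in that head's small codebook. This is a generic vector-quantization sketch under assumed shapes, not the patent's model; all names are illustrative.

```python
import numpy as np

def multi_head_quantize(x, codebooks):
    """Nearest-neighbour quantization per head. x: (D,) split into
    len(codebooks) equal chunks; each codebook has shape (K, D/H).
    Returns the quantized vector and one code index per head."""
    heads = np.split(x, len(codebooks))
    indices, quantized = [], []
    for h, cb in zip(heads, codebooks):
        idx = int(np.argmin(np.linalg.norm(cb - h, axis=1)))  # closest codeword
        indices.append(idx)
        quantized.append(cb[idx])
    return np.concatenate(quantized), indices

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(2)]  # 2 heads, 8 codes each
x = rng.normal(size=8)
q, idx = multi_head_quantize(x, codebooks)
print(q.shape, len(idx))  # (8,) 2
```

Two heads of 8 codewords each can represent 8 × 8 = 64 combinations while only storing 16 codewords, which is the compression/complexity trade-off the text refers to.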
The HifiGAN vocoder is an efficient vocoder with multi-scale and multi-period discriminators. The model is built on generative adversarial networks (Generative Adversarial Networks, GAN); HiFiGAN employs both a Multi-Scale Discriminator (Multi-Scale Discriminator, MSD) and a multi-period discriminator, strengthening as far as possible the GAN discriminator's ability to distinguish synthesized from real audio. The HiFiGAN generator mainly comprises two blocks: an upsampling structure, composed of one-dimensional transposed convolutions, and a Multi-Receptive Field Fusion (Multi-Receptive Field Fusion, MRF) module, which refines the sample points obtained by upsampling and is composed of residual networks. The discrete voice features are converted into the final speech by HifiGAN.
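As a rough illustration of the generator's two stages, transposed-convolution upsampling followed by residual refinement (standing in for the MRF module), consider this toy NumPy sketch. It is not the HiFiGAN implementation; the kernels, stride and function names are all illustrative.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Naive 1-D transposed convolution: upsamples len(x) -> len(x)*stride,
    the role the generator's upsampling blocks play."""
    up = np.zeros(len(x) * stride)
    up[::stride] = x                          # zero-stuffing
    return np.convolve(up, kernel, mode="same")

def residual_block(x, kernel):
    """Stand-in for multi-receptive-field fusion: smooth the upsampled
    samples and add them back to the input (residual connection)."""
    return x + np.convolve(x, kernel, mode="same")

x = np.ones(5)                                # 5 toy "feature frames"
y = transposed_conv1d(x, np.array([0.5, 1.0, 0.5]), stride=4)
y = residual_block(y, np.array([0.25, 0.5, 0.25]))
print(y.shape)                                # (20,)
```

Five input frames become 20 audio samples, mirroring how the real generator expands frame-rate features to waveform-rate samples.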
Referring to fig. 3, in the process of training the vocoder model, a multi-order encoder is used to extract features of the mel spectrum corresponding to the speech of the target object, and the features of the mel spectrum are input into a multi-headed vector quantizer for training, so as to improve the similarity between the cloned speech and the speech of the target object.
In the embodiment of the present application, the voices of the target object in the acoustic model and the vocoder model may be the same voices emitted by the target object or may be different voices, and the embodiment of the present application is not particularly limited.
The embodiment of the application provides a voice cloning method, which comprises: acquiring a phoneme identifier corresponding to a text to be processed, a timbre identifier of a target object and a mel spectrum corresponding to the voice of the target object; inputting the phoneme identifier, the timbre identifier and the mel spectrum into an acoustic model, and extracting the phoneme features of the text to be processed and the timbre features of the target object by using an encoding module in the acoustic model; calculating the phoneme features and the timbre features by using a multi-head attention mechanism to obtain a first voice feature of each phoneme in the text to be processed; predicting the pronunciation duration of each phoneme according to the first voice feature of each phoneme and the phoneme features of the text to be processed to obtain a second voice feature of each phoneme, wherein the second voice feature comprises the pronunciation duration of each phoneme; inputting the second voice feature of each phoneme into a decoding module of the acoustic model for decoding to obtain continuous voice features of each phoneme; and inputting the continuous voice features of each phoneme into a vocoder model for voice synthesis, and outputting the cloned voice of the target object for the text to be processed.
According to the embodiment of the application, the multi-head attention mechanism is utilized to calculate the phoneme characteristics of the text to be processed and the tone characteristics of the target object to obtain the first voice characteristics, and the duration prediction is carried out on the basis of the first voice characteristics and the phoneme characteristics of the text to be processed to obtain the second voice characteristics, namely the second voice characteristics comprise the phoneme characteristics, the tone characteristics and the duration characteristics of the phonemes, which are equivalent to extracting the voice characteristics of the target object aiming at each phoneme from multiple angles, so that the voice synthesized by the vocoder model on the basis of the continuous voice characteristics of each phoneme is more approximate to the pronunciation of the target object, and the authenticity of cloned voice is improved.
Based on the same inventive concept, as an implementation of the method, the embodiment of the present application further provides a voice cloning apparatus for executing the foregoing embodiment, where the embodiment of the apparatus corresponds to the foregoing method embodiment, and for convenience of reading, the embodiment of the present application does not describe details in the foregoing method embodiment one by one, but it should be clear that the voice cloning apparatus in the present embodiment can correspondingly implement all the details in the foregoing method embodiment.
Fig. 4 is a schematic structural diagram of a voice cloning apparatus according to an embodiment of the present application, and as shown in fig. 4, a voice cloning apparatus 400 according to this embodiment includes:
An obtaining module 410, configured to obtain a phoneme identifier corresponding to a text to be processed, a timbre identifier of a target object, and a mel spectrum corresponding to a voice of the target object;
An encoding module 420, configured to input the phoneme identifier, the timbre identifier, and the mel spectrum into an acoustic model, and extract, by using an encoding module in the acoustic model, a phoneme feature of the text to be processed and a timbre feature of the target object;
A calculation module 430, configured to calculate the phoneme features and the timbre features by using a multi-head attention mechanism, so as to obtain a first voice feature of each phoneme in the text to be processed;
A processing module 440, configured to predict a pronunciation time length of each phoneme according to the first speech feature of each phoneme and the phoneme feature of the text to be processed, so as to obtain a second speech feature of each phoneme, where the second speech feature includes the pronunciation time length of each phoneme;
A decoding module 450, configured to input the second speech feature of each phoneme into the decoding module of the acoustic model for decoding, so as to obtain continuous speech features of each phoneme;
And a synthesis module 460, configured to input the continuous speech feature of each phoneme into a vocoder model for performing speech synthesis, and output cloned speech of the target object for the text to be processed.
As an optional implementation manner of the embodiment of the present application, the encoding module 420 is specifically configured to input a phoneme identifier corresponding to the text to be processed and a timbre identifier of the target object into the phoneme encoder to perform feature extraction to obtain a phoneme feature of the text to be processed, and input a mel spectrum corresponding to the voice of the target object into the spectrum encoder to perform feature extraction to obtain a timbre feature of the target object.
As an alternative implementation manner of the embodiment of the present application, the acoustic model includes an attention module, the calculation module 430 is specifically configured to use the phoneme feature of the text to be processed as query content, input the tone color feature of the target object as a keyword and a content value to the attention module, and calculate the phoneme feature of the text to be processed and the tone color feature of the target object by using a multi-head attention mechanism to obtain a first voice feature of each phoneme in the text to be processed.
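The query/key/value arrangement described above can be sketched with single-head scaled dot-product attention (the multi-head case runs several such heads in parallel and concatenates the results). This is a generic NumPy sketch under assumed shapes, not the patent's attention module; all names are illustrative.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: phoneme features act as queries,
    timbre features as keys and values."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # softmax over timbre frames
    return w @ v

phonemes = np.random.default_rng(1).normal(size=(6, 8))  # 6 phonemes, dim 8
timbre = np.random.default_rng(2).normal(size=(4, 8))    # 4 timbre frames
out = attention(phonemes, timbre, timbre)
print(out.shape)  # (6, 8): one timbre-conditioned feature per phoneme
```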
As an alternative implementation manner of the embodiment of the present application, the acoustic model includes a duration predictor, the processing module 440 is specifically configured to extract a global timbre feature of the mel spectrum, input a feature obtained by adding the first voice feature and the global timbre feature of each phoneme in the text to be processed and a phoneme feature of the text to be processed into the duration predictor, predict a pronunciation duration of each phoneme, normalize the pronunciation duration of each phoneme, and obtain and output a second voice feature of each phoneme.
The vocoder model comprises a multi-head vector quantizer and HifiGAN vocoders, wherein the synthesis module 460 is specifically configured to input the continuous speech feature of each phoneme into the multi-head vector quantizer for quantization processing to obtain the discrete speech feature of each phoneme, input the discrete speech feature of each phoneme into the HifiGAN vocoders for speech synthesis to obtain the cloned speech of the target object for the text to be processed, and output the cloned speech.
As an optional implementation manner of the embodiment of the present application, the obtaining module 410 is specifically configured to obtain a text to be processed, convert characters in the text to be processed into text phonemes, and convert the text phonemes into phoneme identifications corresponding to the text to be processed based on a phoneme identification dictionary.
As an optional implementation manner of the embodiment of the present application, the obtaining module 410 is specifically configured to obtain a text to be processed, if the text to be processed includes chinese characters and english characters, separate the chinese characters from the english characters, and add prosodic grades to the chinese characters, convert the chinese characters and prosodic grades into chinese phonemes, where the chinese phonemes include character phonemes corresponding to the chinese characters and prosodic phonemes corresponding to the prosodic grades, convert the english characters into english phonemes, and splice the chinese phonemes and the english phonemes to form text phonemes corresponding to the text to be processed.
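The separation of Chinese and English characters described above can be sketched with a simple regular expression that splits a mixed sentence into language-tagged runs before per-language phoneme conversion. This is an assumed minimal approach (it only covers the basic CJK Unified Ideographs block and ASCII letters), not the patent's implementation; the function name is hypothetical.

```python
import re

def split_zh_en(text):
    """Split a mixed sentence into runs of Chinese and English characters,
    tagged by language, prior to per-language phoneme conversion."""
    runs = re.findall(r"[\u4e00-\u9fff]+|[A-Za-z]+(?:[ '][A-Za-z]+)*", text)
    return [("zh" if re.match(r"[\u4e00-\u9fff]", r) else "en", r)
            for r in runs]

print(split_zh_en("今天学习 machine learning 很开心"))
# [('zh', '今天学习'), ('en', 'machine learning'), ('zh', '很开心')]
```

Each Chinese run would then receive prosody grades and be converted to Chinese phonemes, while each English run is converted to English phonemes, before the two phoneme streams are spliced back in order.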
The voice cloning device provided in this embodiment may perform the voice cloning method provided in the foregoing method embodiment, and its implementation principle is similar to that of the technical effect, and will not be described herein again. The various modules in the voice cloning apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, an electronic device is provided, comprising a memory storing a computer program and a processor implementing the steps of any one of the speech cloning methods described in the method embodiments above when the computer program is executed.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device provided in this embodiment includes a memory 51 and a processor 52, where the memory 51 is configured to store a computer program, and the processor 52 is configured to execute steps in the voice cloning method provided in the foregoing method embodiment when the computer program is called, and its implementation principle and technical effects are similar, and are not repeated herein. It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the electronic device to which the present inventive arrangements are applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of any one of the speech cloning methods described in the method embodiments above.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program, which may be stored on a non-transitory computer readable storage medium and which, when executed, may comprise the steps of the above-described embodiments of the methods. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the application to enable those skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method of voice cloning, comprising:
Acquiring a phoneme identifier corresponding to a text to be processed, a tone identifier of a target object and a Mel frequency spectrum corresponding to the voice of the target object;
Inputting the phoneme mark, the tone mark and the Mel frequency spectrum into an acoustic model, and extracting the phoneme features of the text to be processed and the tone features of the target object by using an encoding module in the acoustic model;
Calculating the phoneme features and the tone features by utilizing a multi-head attention mechanism to obtain first voice features of each phoneme in the text to be processed;
Predicting the pronunciation time length of each phoneme according to the first voice characteristic of each phoneme and the phoneme characteristic of the text to be processed to obtain a second voice characteristic of each phoneme, wherein the second voice characteristic comprises the pronunciation time length of each phoneme;
Inputting the second voice characteristic of each phoneme into a decoding module of the acoustic model for decoding processing to obtain continuous voice characteristics of each phoneme;
Inputting the continuous voice characteristics of each phoneme into a vocoder model for voice synthesis, and outputting cloned voice of the target object aiming at the text to be processed;
Wherein the coding module of the acoustic model comprises a phoneme coder and a spectrum coder; inputting the phoneme identifications, the tone marks and the Mel frequency spectrums into an acoustic model, and extracting the phoneme characteristics of the text to be processed and the tone characteristics of the target object by using an encoding module of the acoustic model, wherein the method comprises the steps of inputting the phoneme identifications corresponding to the text to be processed and the tone marks of the target object into the phoneme encoder for characteristic extraction to obtain the phoneme characteristics of the text to be processed;
The acoustic model comprises a duration predictor, wherein the predicting of the pronunciation duration of each phoneme is carried out according to the first voice characteristic of each phoneme and the phoneme characteristic of the text to be processed to obtain the second voice characteristic of each phoneme, the method comprises the steps of extracting global tone characteristics of a Mel frequency spectrum, inputting the characteristics obtained by adding the first voice characteristic and the global tone characteristics of each phoneme in the text to be processed and the phoneme characteristics of the text to be processed into the duration predictor, predicting the pronunciation duration of each phoneme, normalizing the pronunciation duration of each phoneme to obtain and output the second voice characteristic of each phoneme.
2. The method of claim 1, wherein the acoustic model comprises an attention module, wherein the computing the phoneme feature and the timbre feature using a multi-headed attention mechanism to obtain a first speech feature for each phoneme in the text to be processed comprises:
taking the phoneme characteristics of the text to be processed as query contents, and taking the tone characteristics of the target object as keywords and content values to be input into the attention module;
And calculating the phoneme characteristics of the text to be processed and the tone characteristics of the target object by utilizing a multi-head attention mechanism to obtain a first voice characteristic of each phoneme in the text to be processed.
3. The method of claim 1, wherein the vocoder model comprises a multi-headed vector quantizer and HifiGAN vocoder, wherein inputting the continuous speech feature of each of the phonemes into the vocoder model for speech synthesis and outputting cloned speech of the target object for the text to be processed comprises:
inputting the continuous voice characteristics of each phoneme into the multi-head vector quantizer for quantization processing to obtain discrete voice characteristics of each phoneme;
And inputting the discrete voice characteristics of each phoneme into the HifiGAN vocoder to perform voice synthesis, obtaining and outputting the cloned voice of the target object aiming at the text to be processed.
4. A method according to any one of claims 1-3, wherein the obtaining a phoneme identification corresponding to the text to be processed comprises:
Acquiring a text to be processed, and converting characters in the text to be processed into text phonemes;
And converting the text phonemes into phoneme identifications corresponding to the text to be processed based on a phoneme identification dictionary.
5. The method of claim 4, wherein the obtaining the text to be processed, converting characters in the text to be processed into text phonemes, comprises:
acquiring a text to be processed, if the text to be processed comprises Chinese characters and English characters, separating the Chinese characters from the English characters, and adding prosody grades for the Chinese characters;
Converting the Chinese characters and prosodic grades into Chinese phonemes, wherein the Chinese phonemes comprise character phonemes corresponding to the Chinese characters and prosodic phonemes corresponding to the prosodic grades, and converting the English characters into English phonemes;
and splicing the Chinese phonemes and the English phonemes to form text phonemes corresponding to the text to be processed.
6. A speech cloning apparatus comprising:
The acquisition module is used for acquiring a phoneme identifier corresponding to the text to be processed, a tone identifier of a target object and a Mel frequency spectrum corresponding to the voice of the target object;
the encoding module is used for inputting the phoneme identifications, the tone identifications and the mel frequency spectrum into an acoustic model, and extracting the phoneme characteristics of the text to be processed and the tone characteristics of the target object by utilizing the encoding module in the acoustic model;
The computing module is used for computing the phoneme features and the tone features by utilizing a multi-head attention mechanism to obtain first voice features of each phoneme in the text to be processed;
The processing module is used for predicting the pronunciation time length of each phoneme according to the first voice characteristic of each phoneme and the phoneme characteristic of the text to be processed to obtain a second voice characteristic of each phoneme, wherein the second voice characteristic comprises the pronunciation time length of each phoneme;
the decoding module is used for inputting the second voice characteristic of each phoneme into the decoding module of the acoustic model for decoding processing to obtain the continuous voice characteristic of each phoneme;
the synthesis module is used for inputting the continuous voice characteristics of each phoneme into a vocoder model to perform voice synthesis and outputting cloned voice of the target object aiming at the text to be processed;
The encoding module of the acoustic model comprises a phoneme encoder and a spectrum encoder, wherein the encoding module is specifically used for inputting a phoneme identifier corresponding to the text to be processed and a tone identifier of the target object into the phoneme encoder for feature extraction to obtain a phoneme feature of the text to be processed;
The acoustic model comprises a duration predictor, and the processing module is specifically configured to extract a global timbre feature of the mel spectrum, input a feature obtained by adding the first voice feature of each phoneme in the text to be processed to the global timbre feature, together with the phoneme feature of the text to be processed, into the duration predictor, predict the pronunciation duration of each phoneme, and normalize the pronunciation duration of each phoneme to obtain and output the second voice feature of each phoneme.
7. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the speech cloning method of any one of claims 1 to 5 when executing the computer program.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the speech cloning method of any one of claims 1 to 5.
CN202410388083.XA 2024-04-01 2024-04-01 Speech cloning method Active CN118298803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410388083.XA CN118298803B (en) 2024-04-01 2024-04-01 Speech cloning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410388083.XA CN118298803B (en) 2024-04-01 2024-04-01 Speech cloning method

Publications (2)

Publication Number Publication Date
CN118298803A CN118298803A (en) 2024-07-05
CN118298803B true CN118298803B (en) 2025-01-17

Family

ID=91687846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410388083.XA Active CN118298803B (en) 2024-04-01 2024-04-01 Speech cloning method

Country Status (1)

Country Link
CN (1) CN118298803B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119207370A (en) * 2024-09-27 2024-12-27 维沃移动通信有限公司 Audio processing method, device, equipment and product
CN119763540B (en) * 2024-12-03 2025-09-30 马上消费金融股份有限公司 Audio synthesis method, training method of audio synthesis model and related device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN113160794A (en) * 2021-04-30 2021-07-23 京东数字科技控股股份有限公司 Voice synthesis method and device based on timbre clone and related equipment
CN114220414A (en) * 2021-10-29 2022-03-22 广州虎牙科技有限公司 Speech synthesis method and related devices and equipment

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN115394285B (en) * 2022-08-30 2025-09-02 广州酷狗计算机科技有限公司 Voice cloning method, device, equipment and storage medium
CN116994553A (en) * 2022-09-15 2023-11-03 腾讯科技(深圳)有限公司 Training method of speech synthesis model, speech synthesis method, device and equipment
CN116030786A (en) * 2023-02-02 2023-04-28 澳克多普有限公司 A Speech Synthesis Method and System Based on Adaptive Attention Mechanism
CN116312462A (en) * 2023-03-13 2023-06-23 腾讯音乐娱乐科技(深圳)有限公司 Speech synthesis method, prediction network training method, server and storage medium

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN113160794A (en) * 2021-04-30 2021-07-23 京东数字科技控股股份有限公司 Voice synthesis method and device based on timbre clone and related equipment
CN114220414A (en) * 2021-10-29 2022-03-22 广州虎牙科技有限公司 Speech synthesis method and related devices and equipment

Also Published As

Publication number Publication date
CN118298803A (en) 2024-07-05

Similar Documents

Publication Publication Date Title
KR102639322B1 (en) Voice synthesis system and method capable of duplicating tone and prosody styles in real time
EP4266306B1 (en) Processing a speech signal
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
US20070106513A1 (en) Method for facilitating text to speech synthesis using a differential vocoder
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN118298803B (en) Speech cloning method
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
CN115966196B (en) Text-based voice editing method, system, electronic device and storage medium
WO2024178710A1 (en) Systems and methods for using neural codec language model for zero-shot cross-lingual text-to-speech synthesis
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
CN117711371A (en) Speech synthesis method, device, electronic equipment and storage medium
CN114420083B (en) Audio generation method and related model training method and related device
CN113314097A (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN114974218B (en) Speech conversion model training method and device, speech conversion method and device
CN114333903A (en) Voice conversion method and device, electronic equipment and storage medium
CN117524190A (en) Speech synthesis method, device, electronic equipment and storage medium
CN114255735B (en) Speech synthesis method and system
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
JPH05197398A (en) Method for expressing assembly of acoustic units in compact mode and chaining text-speech synthesizer system
CN114333902B (en) Voice conversion method, device, electronic device and storage medium
CN120164454B (en) A low-delay speech synthesis method, device, equipment and medium
CN120599998A (en) A voice cloning method, device and related medium based on emotion enhancement
CN120126445A (en) Speech synthesis solution, device, electronic device, storage medium and program product
CN119516999A (en) TTS system, speech synthesis method, device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant