Disclosure of Invention
Based on this, it is necessary to provide a speech synthesis method, apparatus, live broadcast system, electronic device, and computer readable storage medium, which can generate high-quality synthesized speech for live broadcast rooms.
A method of speech synthesis, comprising:
acquiring a target text to be synthesized and timbre features of a speaker;
encoding a target guide audio to obtain target discrete semantic codes with timbre removed, wherein the target discrete semantic codes contain prosodic features;
predicting the target discrete semantic codes and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized comprises the prosody of the target guide audio and the semantic features of the target text;
and decoding the audio code to be synthesized according to the target text and the timbre features to obtain speech synthesis audio.
In one embodiment, encoding the target guide audio to obtain target discrete semantic codes with timbre removed comprises:
inputting the target guide audio into an audio feature extraction model based on speaker timbre decoupling to obtain semantic features with timbre removed and containing prosodic information;
and inputting the semantic features of the target guide audio into a residual quantization model for processing to obtain the target discrete semantic codes.
In one embodiment, predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized includes:
invoking a pre-trained large language model, wherein the large language model is trained on training text and training audio based on a general language model;
and inputting the target discrete semantic codes and the target text into the large language model for prediction to obtain the audio code to be synthesized, wherein the audio code to be synthesized comprises the prosody of the target guide audio and the semantic features of the target text.
In one embodiment, the method for synthesizing speech further includes:
Constructing a large language model;
acquiring a text-audio pair dataset comprising training audio and its corresponding text content;
inputting the training audio into an encoder to obtain discrete semantic codes of the training audio, wherein the discrete semantic codes are targets for optimizing a large language model;
Inputting the discrete semantic codes of the training audio and the text content into a large language model together for prediction to obtain the discrete semantic codes of the audio to be synthesized;
and optimizing the large language model by using cross entropy as an objective function, based on the discrete semantic codes of the audio to be synthesized and the discrete semantic codes of the training audio.
In one embodiment, decoding the audio code to be synthesized according to the target text and the timbre features to obtain speech synthesis audio includes:
invoking a pre-trained decoder, wherein the decoder is trained based on training text and timbre features of a speaker;
and inputting the audio code to be synthesized, the target text, and the timbre features into the decoder for decoding to obtain the speech synthesis audio.
In one embodiment, the method for synthesizing speech further includes:
constructing a decoder, wherein the decoder comprises a prior encoder, a posterior encoder, a normalizing flow, a generator, and a discriminator;
setting a loss function, wherein the loss function comprises an L1 reconstruction loss, a KL divergence, an adversarial loss, and a feature matching loss;
and inputting training text and training audio into the decoder, and jointly optimizing the decoder based on the loss function.
In one embodiment, the KL divergence is used to measure the difference between the posterior distribution and the prior distribution;
the L1 reconstruction loss is used to measure the difference between the spectrogram predicted by the generator and the target spectrogram;
the adversarial loss is used to train the generator and the discriminator;
and the feature matching loss is used to optimize the generator.
In one embodiment, inputting the audio code to be synthesized, the target text, and the timbre features into a pre-trained decoder for decoding to obtain the speech synthesis audio includes:
passing the audio code to be synthesized, the target text, and the timbre features through the prior encoder to obtain a prior distribution; passing the prior distribution through the normalizing flow to obtain a posterior distribution; and inputting the posterior distribution and the timbre features into the generator for fusion to obtain the speech synthesis audio.
A speech synthesis apparatus comprising:
an input module for acquiring a target text to be synthesized and timbre features of a speaker;
an encoding module for encoding the target guide audio to obtain target discrete semantic codes with timbre removed, wherein the target discrete semantic codes contain prosodic features;
a prediction module for predicting the target discrete semantic codes and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized comprises the prosody of the target guide audio and the semantic features of the target text;
and a decoding module for decoding the audio code to be synthesized according to the target text and the timbre features to obtain speech synthesis audio.
A live broadcast system comprises an anchor terminal, an audience terminal, and a live broadcast server, wherein the anchor terminal and the audience terminal are each connected to the live broadcast server through a communication network;
the anchor terminal is used for accessing an anchor of a live broadcast room, collecting the anchor's live video stream, and uploading the live video stream to the live broadcast server;
the live broadcast server is used for forwarding the anchor's live video stream to the audience terminal and generating an audio stream in the live video stream by using the speech synthesis method;
and the audience terminal is used for accessing audience users of the live broadcast room, receiving and playing the anchor's live video stream, and displaying pushed information.
An electronic device, comprising:
One or more processors;
A memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the steps of the speech synthesis method.
A computer readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded by a processor to perform the steps of the speech synthesis method.
According to the above technical solution, the discrete semantic codes of the guide audio are obtained by encoding; the discrete semantic codes and the target text are predicted by a large language model to obtain the audio code to be synthesized; and the audio code to be synthesized is decoded in combination with the timbre features of the speaker to obtain the speech synthesis audio, so that the prosody of the synthesized speech follows the guide audio while the timbre follows the selected speaker.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more; for example, a plurality of objects means two or more objects. The words "comprise" or "comprising" mean that the information listed after the word is encompassed, together with its equivalents, without excluding additional information. Reference to "and/or" in embodiments of the application indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
The technical solution provided by the embodiments of the application can be applied to the application scenario shown in Fig. 1, which is a schematic diagram of an exemplary live broadcast service application scenario. The live broadcast system can comprise a live broadcast server, an anchor terminal, and an audience terminal, wherein the anchor terminal and the audience terminal are in data communication with the live broadcast server through a communication network, so that live broadcasting between the anchor at the anchor terminal and the audience users at the audience terminal can be realized. The terminal devices of the anchor terminal and the audience terminal can be, but are not limited to, various personal computers, notebook computers, smart phones, and tablet computers; the live broadcast server can be implemented by an independent server or a server cluster composed of multiple servers.
Referring to Fig. 2, a flow chart of an exemplary speech synthesis algorithm: the inputs are the target text to be synthesized, the guide audio, and the timbre features of a speaker. The guide audio is passed through an encoder to obtain discrete semantic codes that contain prosody after timbre removal; the discrete semantic codes and the target text are passed through a trained large language model to obtain the discrete semantic codes of the audio to be synthesized, which contain the prosody of the guide audio and the semantic features of the target text; and the discrete semantic codes of the audio to be synthesized, the target text, and the timbre features of the speaker are passed through a trained decoder to obtain the final speech synthesis audio.
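To make the data flow of Fig. 2 concrete, the following is a minimal Python sketch of the pipeline. All module objects and their interfaces (encoder, rvq, llm, decoder) are hypothetical stand-ins for the components described below, not the actual implementation of the application.

```python
import torch

def synthesize(guide_audio: torch.Tensor, target_text: str,
               timbre: torch.Tensor, encoder, rvq, llm, decoder) -> torch.Tensor:
    # 1. Encode the guide audio into timbre-free semantic features with prosody.
    semantic_feats = encoder(guide_audio)          # (T, D)
    # 2. Quantize the features into discrete semantic codes (tokens).
    guide_tokens = rvq.encode(semantic_feats)      # integer token ids
    # 3. The LLM predicts the tokens of the audio to be synthesized from the
    #    guide tokens (prosody) and the target text (semantics).
    synth_tokens = llm.generate(text=target_text, prompt_tokens=guide_tokens)
    # 4. The decoder fuses tokens, text, and timbre into the final waveform.
    return decoder(synth_tokens, target_text, timbre)
```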
Referring to Fig. 3, which is a flowchart of a speech synthesis method according to one embodiment, the method includes the following steps:
Step S10, acquiring a target text to be synthesized and timbre features of a speaker.
In this step, the target text to be synthesized is generated according to the text content, and the timbre features of the speaker are obtained. The timbre features of a single speaker may be selected, the timbre features of multiple speakers may be mixed, or timbre features with the desired tone and matching prosody may be modulated as required.
Step S20, encoding the target guide audio to obtain target discrete semantic codes with timbre removed, wherein the target discrete semantic codes contain prosodic features.
In this step, the target guide audio is encoded by an encoder to remove the timbre, so as to obtain target discrete semantic codes containing prosody. The guide audio can be a section of speech read aloud by a speaker or any section of pure human voice from a speaker.
In one embodiment, the step S20 of encoding the target guide audio to obtain target discrete semantic codes with timbre removed, the target discrete semantic codes containing prosodic features, comprises the following steps:
inputting the target guide audio into an audio feature extraction model based on speaker timbre decoupling to obtain semantic features with timbre removed and containing prosodic information, and inputting the semantic features of the target guide audio into a residual quantization model to obtain the target discrete semantic codes.
Specifically, the encoder may include a pre-trained audio feature extraction model based on speaker timbre decoupling (e.g., ContentVec) and a residual vector quantization model (Residual Vector Quantization, RVQ). The target guide audio is passed through the audio feature extraction model to obtain semantic features with timbre removed, mainly containing prosodic information; the semantic features of the guide audio are then passed through the residual vector quantization model to obtain the corresponding discrete semantic codes (tokens), which serve as the input of the large language model (Large Language Model, LLM).
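As an illustration of the residual vector quantization step, the following is a minimal sketch in PyTorch; the number of quantizers, the codebook size, and the feature dimension are assumptions for illustration, not values from the application.

```python
import torch
import torch.nn as nn

class ResidualVectorQuantizer(nn.Module):
    """Each codebook quantizes the residual left by the previous one."""
    def __init__(self, num_quantizers=8, codebook_size=1024, dim=256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers))

    def encode(self, x):                               # x: (T, dim) semantic features
        residual, codes = x, []
        for cb in self.codebooks:
            dists = torch.cdist(residual, cb.weight)   # (T, codebook_size)
            idx = dists.argmin(dim=-1)                 # nearest codebook entry
            residual = residual - cb(idx)              # pass the residual on
            codes.append(idx)
        return torch.stack(codes, dim=0)               # (num_quantizers, T)
```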
Step S30, predicting the target discrete semantic codes and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized comprises the prosody of the target guide audio and the semantic features of the target text.
In this step, the tokens of the target guide audio obtained by encoding and the target text are fed into a pre-trained LLM, which predicts the tokens of the audio to be synthesized, containing the prosody of the target guide audio and the semantic features of the target text.
In one embodiment, predicting the target discrete semantic codes and the target text based on the large language model in step S30 to obtain the audio code to be synthesized includes:
(1) Calling a pre-trained large language model, wherein the large language model is trained on training text and training audio based on a general language model.
As an embodiment, referring to Fig. 4, which is a training schematic diagram of a large language model according to an embodiment, the training method of the large language model may include the following steps:
a, constructing a large language model;
b, acquiring a text-audio pair dataset comprising training audio and its corresponding text content; specifically, a dataset $D = \{(a_i, x_i)\}_{i=1}^{N}$ is obtained from massive text-audio pairs, where $a_i$ is the training audio and $x_i$ is the text content corresponding to the training audio.
And c, inputting the training audio into an encoder to obtain discrete semantic codes of the training audio, wherein the discrete semantic codes are targets for optimizing a large language model.
And d, inputting the discrete semantic codes of the training audio and the text content into the large language model together for prediction to obtain the discrete semantic codes of the audio to be synthesized. Specifically, the training audio $a_i$ is first passed through the encoder to obtain the corresponding discrete semantic codes $y_i$; $y_i$ serves as the optimization target of LLM learning and, together with the text content $x_i$, enters the LLM to obtain the predicted discrete semantic codes $y'_i$ of the audio to be synthesized.
And e, optimizing the large language model by using cross entropy as an objective function, based on the discrete semantic codes of the audio to be synthesized and the discrete semantic codes of the training audio.
Specifically, the cross entropy loss can be calculated as follows:
$\mathrm{Loss}(y'_i, y_i) = -\sum_{t} \log y'_{i,t}(y_{i,t})$
where $\mathrm{Loss}(y'_i, y_i)$ is the cross entropy loss, $y_{i,t}$ is the target code at position $t$, and $y'_{i,t}(\cdot)$ is the predicted probability distribution over codes at position $t$.
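A minimal training-step sketch under assumed interfaces (the LLM is assumed to return per-position logits over the code vocabulary, and encoder_rvq to return the token sequence of the training audio):

```python
import torch
import torch.nn.functional as F

def llm_training_step(llm, encoder_rvq, text_ids, audio, optimizer):
    # The encoder output y_i is both the teacher-forced input and the target.
    with torch.no_grad():
        y = encoder_rvq(audio)                    # target codes y_i, shape (T,)
    logits = llm(text_ids, y[:-1])                # predicted y'_i, teacher forcing
    # cross entropy between the predicted distributions and the target codes
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y[1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```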
(2) Inputting the target discrete semantic codes and the target text into the large language model for prediction to obtain the audio code to be synthesized, wherein the audio code to be synthesized comprises the prosody of the target guide audio and the semantic features of the target text.
Specifically, the tokens of the guide audio obtained in step S20 and the target text are input into the trained LLM to obtain the predicted tokens of the audio to be synthesized.
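For the prediction itself, an autoregressive generation loop along the following lines is typical; init_state, step, and eos_id are hypothetical names used only for illustration:

```python
def predict_tokens(llm, text_ids, guide_tokens, max_len=1000, eos_id=0):
    # The LLM conditions on the target text and the guide-audio tokens, then
    # emits the tokens of the audio to be synthesized one step at a time.
    state = llm.init_state(text_ids, guide_tokens)
    out = []
    for _ in range(max_len):
        token, state = llm.step(state)
        if token == eos_id:                        # stop at end of sequence
            break
        out.append(token)
    return out
```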
Step S40, decoding the audio code to be synthesized according to the target text and the timbre features to obtain speech synthesis audio.
In this step, the target text, the timbre features, and the audio code to be synthesized are decoded by a pre-trained decoder to obtain the final speech synthesis audio output.
In one embodiment, when the speech synthesis audio is obtained through decoding in step S40, a pre-trained decoder is called, and the audio code to be synthesized, the target text, and the timbre features are input into the decoder for decoding to obtain the speech synthesis audio, wherein the decoder is trained based on training text and the timbre features of speakers.
In one embodiment, referring to Fig. 5, which is a schematic diagram of a decoder training flow in one embodiment, the decoder training method may be as follows:
1) Constructing a decoder, wherein the decoder comprises a prior encoder, a posterior encoder, a normalizing flow, a generator, and a discriminator.
As shown in Fig. 5, a decoder model may be constructed that includes a prior encoder, a posterior encoder, a normalizing flow (Normalizing Flow), a generator, and a discriminator.
2) Setting a loss function, wherein the loss function comprises an L1 reconstruction loss, a KL divergence, an adversarial loss, and a feature matching loss.
Specifically, by inputting training text and training audio, the decoder is jointly optimized using four loss functions, namely the L1 reconstruction loss, the KL divergence, the adversarial loss, and the feature matching loss.
For the L1 reconstruction loss: the training audio is used to obtain a target spectrogram $x_{mel}$; the target spectrogram is passed through the posterior encoder to obtain a latent variable $z$; the latent variable is passed through the generator to obtain a predicted spectrogram $\hat{x}_{mel}$; finally, the L1 reconstruction loss is used for optimization based on the target spectrogram and the predicted spectrogram.
For the KL divergence loss, which measures the distance between one probability distribution and another: the training audio is passed through the encoder to obtain discrete semantic codes; the discrete semantic codes, the training text, and the speaker timbre are passed through the prior encoder to obtain the prior distribution; the latent variable $z$ is passed through the flow to obtain the posterior distribution; and the KL divergence loss is used for optimization based on the prior distribution and the posterior distribution.
For the adversarial loss: it is used to train the generator and the discriminator so that the predicted spectrogram $\hat{x}_{mel}$ becomes more realistic.
For the feature matching loss: it is used to optimize the generator, with the expectation that the target spectrogram $x_{mel}$ and the predicted spectrogram $\hat{x}_{mel}$ yield similar feature maps at each layer of the discriminator.
3) Inputting training text and training audio into the decoder, and jointly optimizing the decoder based on the loss function.
Specifically, the prior encoder follows a variational inference (Variational Inference, VAE) structure and uses a normalizing flow (Normalizing Flow) to enhance the expressive power of the prior distribution; normalizing flows are used to learn complex data distributions. The quality of the synthesized audio is improved through generative adversarial network (Generative Adversarial Networks, GAN) training, and combining VAE and GAN training yields the final loss function, in which the reconstruction loss, the KL divergence, the adversarial loss, and the feature matching loss are combined.
Illustratively, the reconstruction loss uses the L1 loss to measure the difference between the spectrogram predicted by the generator and the target spectrogram, as follows:
$L_{recon} = \lVert x_{mel} - \hat{x}_{mel} \rVert_1$
where $x_{mel}$ is the target spectrogram and $\hat{x}_{mel}$ is the predicted spectrogram. The difference between the posterior distribution and the prior distribution is measured by the KL divergence, as follows:
$L_{kl} = \log(q_\phi) - \log(p_\theta)$
where $q_\phi$ is the posterior distribution and $p_\theta$ is the prior distribution.
The adversarial loss includes the loss of the discriminator and the loss of the generator, wherein the loss of the discriminator is as follows:
$L_{adv}(D) = \mathbb{E}_{(y,a)}\left[(D(a)-1)^2 + (D(G(z)))^2\right]$
and the loss of the generator is as follows:
$L_{adv}(G) = \mathbb{E}_{z}\left[(D(G(z))-1)^2\right]$
where $D$ is the discriminator, $G$ is the generator, $a$ is the input audio, and $z$ is the latent variable output by the posterior encoder. The feature matching loss is mainly used to train the generator, as follows:
$L_{fm}(G) = \mathbb{E}_{(a,z)}\left[\sum_{l=1}^{T}\frac{1}{N_l}\lVert D_l(a) - D_l(G(z))\rVert_1\right]$
where $D_l$ is the feature map output by the $l$-th layer of the discriminator, $T$ is the number of layers, and $N_l$ is the number of features in the $l$-th layer.
Combining all the loss functions yields the total loss function, as follows:
$L_{total} = L_{recon} + L_{kl} + L_{adv}(G) + L_{fm}(G)$
where $L_{total}$ represents the total loss function.
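The four terms can be assembled as in the following sketch; the discriminator scores and per-layer feature maps for real and generated audio are assumed to be precomputed, and in practice the discriminator term is optimized in a separate step from the generator terms:

```python
import torch

def decoder_losses(x_mel, x_mel_hat, log_q, log_p,
                   d_real, d_fake, feats_real, feats_fake):
    l_recon = torch.mean(torch.abs(x_mel - x_mel_hat))      # L1 reconstruction
    l_kl = torch.mean(log_q - log_p)                        # KL divergence
    l_adv_d = torch.mean((d_real - 1) ** 2) + torch.mean(d_fake ** 2)
    l_adv_g = torch.mean((d_fake - 1) ** 2)
    l_fm = sum(torch.mean(torch.abs(fr.detach() - ff))      # feature matching;
               for fr, ff in zip(feats_real, feats_fake))   # real features detached
    l_total = l_recon + l_kl + l_adv_g + l_fm               # generator objective
    return l_total, l_adv_d                                 # discriminator objective
```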
In one embodiment, when the decoder is used for decoding to obtain the speech synthesis audio, the audio code to be synthesized, the target text, and the timbre features are passed through the prior encoder to obtain the prior distribution; the prior distribution is passed through the normalizing flow to obtain the posterior distribution; and the posterior distribution and the timbre features are input into the generator for fusion to obtain the speech synthesis audio.
Specifically, based on the decoder obtained by the above training and referring to Fig. 6, which is a flow chart of decoder inference according to an embodiment: during inference, the decoder receives the target text, the tokens predicted by the LLM, and the timbre features of the speaker, and finally obtains the speech synthesis audio. The prosody of the speech synthesis audio depends on the guide audio, and the timbre depends on the timbre features of the speaker.
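A sketch of this inference path, under assumed module interfaces (the flow here maps the prior sample toward the posterior space, in the direction described above):

```python
import torch

def decode(tokens, text, timbre, prior_encoder, flow, generator):
    mu, log_sigma = prior_encoder(tokens, text, timbre)     # prior distribution
    z_p = mu + torch.randn_like(mu) * torch.exp(log_sigma)  # sample from the prior
    z = flow(z_p, reverse=True)                             # prior -> posterior space
    return generator(z, timbre)                             # fuse timbre, output waveform
```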
Preferably, the timbre features of the speaker may be specified as a fusion of multiple timbres. For example, assuming $n$ timbre features, denoted $e_1, e_2, \ldots, e_n$, with corresponding fusion proportions $p_1, p_2, \ldots, p_n$ satisfying $\sum_{i=1}^{n} p_i = 1$, the timbre fusion formula is $embedding = e_1 \times p_1 + e_2 \times p_2 + \cdots + e_n \times p_n$. In this way, the requirement of timbre diversification can be met, and the timbre features of a specific speaker can also be reproduced.
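A direct rendering of the fusion formula (the embedding shapes and the 70/30 example are assumptions for illustration):

```python
import torch

def fuse_timbres(embeddings, proportions):
    # embedding = e_1*p_1 + ... + e_n*p_n, with the proportions summing to 1
    p = torch.tensor(proportions, dtype=torch.float32)
    assert torch.isclose(p.sum(), torch.tensor(1.0)), "proportions must sum to 1"
    return torch.stack(embeddings).mul(p.unsqueeze(-1)).sum(dim=0)

# e.g. blend two speaker embeddings 70/30 to create a new voice:
# fused = fuse_timbres([e1, e2], [0.7, 0.3])
```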
The technical solution of this embodiment can be used for an AI assistant in a live broadcast room. By adding an AI live broadcast assistance function, a user can autonomously customize the voice of the AI assistant, so that the synthesized speech has prosody approaching that of a real person. For example, in a group chat scenario of a live broadcast platform, a group chat function between multiple virtual persons and users can be realized, with virtual voice effects including the anchor's timbre and non-anchor timbres: speech synthesized with the anchor's timbre can approximate the real person, while non-anchor timbres can be modulated to meet the expected timbre and prosody. A unique timbre and prosody can be created by adjusting the guide audio and the timbre mixing, so as to meet various service demands.
An embodiment of the speech synthesis apparatus is set forth below.
As shown in Fig. 7, which is a schematic structural diagram of a speech synthesis apparatus according to an embodiment, the apparatus includes:
an input module 10 for acquiring a target text to be synthesized and timbre features of a speaker;
an encoding module 20 for encoding the target guide audio to obtain target discrete semantic codes with timbre removed, wherein the target discrete semantic codes contain prosodic features;
a prediction module 30 for predicting the target discrete semantic codes and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized comprises the prosody of the target guide audio and the semantic features of the target text;
and a decoding module 40 for decoding the audio code to be synthesized according to the target text and the timbre features to obtain speech synthesis audio.
The speech synthesis apparatus of this embodiment can perform the speech synthesis method provided by the embodiments of the present application, and its implementation principle is similar. The actions performed by the modules of the speech synthesis apparatus correspond to the steps of the speech synthesis method in the embodiments of the present application; for a detailed functional description of the modules, reference may be made to the description of the corresponding speech synthesis method shown above, which is not repeated here.
An embodiment of a live system is set forth below.
In the live broadcast system of this embodiment, referring to Fig. 8, which is a schematic structural diagram of an exemplary live broadcast system, the system framework may include an anchor terminal, a live broadcast server, and an audience terminal. The audience terminal and the anchor terminal may enter the live broadcast platform through clients installed on electronic devices; the anchor terminal and the audience terminal may be, for example, computer devices such as a PDA, a smart phone, a tablet computer, a desktop computer, or a notebook computer, without being limited thereto, and may also be software modules of an application program. The live broadcast server may be implemented by an independent server or a server cluster composed of multiple servers. The anchor terminal is used for accessing an anchor of a live broadcast room and collecting the anchor's live video stream to upload it to the live broadcast server; the live broadcast server is used for forwarding the anchor's live video stream to the audience terminal and generating an audio stream in the live video stream by using the speech synthesis method of any of the embodiments; and the audience terminal is used for accessing audience users of the live broadcast room and receiving and playing the anchor's live video stream.
Exemplarily, an AI assistant applied to the live broadcast room performs voice live broadcasting, and a user can autonomously customize the voice of the AI assistant to synthesize live broadcast speech, so that the synthesized speech has prosody approaching that of a real person. For example, multiple virtual persons and users can take part in a group chat, where the virtual persons include an anchor timbre and non-anchor timbres, and the AI assistant can create a unique timbre and prosody, so as to meet various business requirements.
According to the above technical solution, the synthesized speech can approximate the effect of a real human voice; the prosody and timbre are adjusted based on the large language model, and the prosody and timbre of the synthesized speech can be modulated, so that a unique timbre and prosody are created, the speech synthesis effect is improved, and the diversified application requirements of network live broadcasting are met.
Embodiments of an electronic device and a computer-readable storage medium are set forth below.
The application provides a technical solution of an electronic device for realizing the related functions of the speech synthesis method. The electronic device of this embodiment comprises one or more processors and a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the steps of the speech synthesis method of any embodiment.
As shown in Fig. 9, which is a block diagram of an example electronic device, the electronic device may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like. The electronic device 100 can include one or more of a processing component 102, a memory 104, a power component 106, a multimedia component 108, an audio component 110, an input/output (I/O) interface 112, a sensor component 114, and a communication component 116.
The processing component 102 generally controls overall operation of the electronic device 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
The memory 104 is configured to store various types of data to support operations at the electronic device 100, and may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The power component 106 provides power to the various components of the electronic device 100.
The multimedia component 108 includes a screen that provides an output interface between the electronic device 100 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). In some embodiments, the multimedia component 108 includes a front-facing camera and/or a rear-facing camera.
The audio component 110 is configured to output and/or input audio signals.
The I/O interface 112 provides an interface between the processing component 102 and peripheral interface modules, which may be keyboards, click wheels, buttons, and the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 114 includes one or more sensors for providing status assessments of various aspects of the electronic device 100. The sensor component 114 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
The communication component 116 is configured to facilitate communication between the electronic device 100 and other devices in a wired or wireless manner. The electronic device 100 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof.
The application provides a technical solution of a computer readable storage medium for realizing the related functions of the speech synthesis method. The computer readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded by a processor to perform the speech synthesis method of any embodiment.
In an exemplary embodiment, the computer-readable storage medium may be a non-transitory computer-readable storage medium including instructions, such as a memory including instructions, for example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.