CN119360822A - Speech synthesis method, device, live broadcast system, electronic device and storage medium - Google Patents

Speech synthesis method, device, live broadcast system, electronic device and storage medium

Info

Publication number
CN119360822A
Authority
CN
China
Prior art keywords
audio
target
code
synthesized
speech synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411374449.4A
Other languages
Chinese (zh)
Inventor
苏正航
宫凯程
陈增海
贺灏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Cubesili Information Technology Co Ltd filed Critical Guangzhou Cubesili Information Technology Co Ltd
Priority to CN202411374449.4A priority Critical patent/CN119360822A/en
Publication of CN119360822A publication Critical patent/CN119360822A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to a speech synthesis method, device, live broadcast system, electronic device and computer-readable storage medium. The method comprises: obtaining target text for speech synthesis and the timbre characteristics of a speaker; encoding a target guide audio to obtain a target discrete semantic code with the timbre removed, wherein the target discrete semantic code includes prosodic features; predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized includes the prosody of the target guide audio and the semantic features of the target text; and decoding the audio code to be synthesized according to the target text and the timbre characteristics to obtain speech synthesis audio. With this technical solution, the synthesized speech can approach the effect of real human speech, and the prosody and timbre of the synthesized speech can be modulated to meet the diverse application needs of online live broadcasting.

Description

Speech synthesis method, device, live broadcast system, electronic device and storage medium
Technical Field
The present application relates to the field of speech processing technology, and in particular, to a speech synthesis method, apparatus, live broadcast system, electronic device, and computer readable storage medium.
Background
At present, with the rapid development of artificial intelligence and machine learning technologies, TTS (Text To Speech) has gradually become an important research direction in the field of human-computer interaction. TTS can convert text information into natural and fluent speech output and is widely applied in fields such as intelligent assistants, navigation systems, education, and entertainment.
With the development of network technology, network live broadcasting has been adopted by most network users. Thanks to the intuitiveness, immediacy and interactivity of its content and form, network live broadcasting plays an important role in promoting flexible employment, driving economic and social development, and enriching people's cultural life; anchors can better show their talents in live broadcasts, allowing more of them to realize their own value. TTS is increasingly applied on network live broadcasting platforms. Most conventional TTS technologies sound mechanical, and TTS technologies based on LLMs (Large Language Models) have become a mainstream trend in recent years due to their high naturalness. For example, existing techniques encode continuous speech features into audio tokens (i.e., discrete semantic codes) with an audio encoder, establish the relationship between text and audio tokens with a large language model, and recover speech from the audio tokens with a vocoder.
However, in the field of network live broadcasting, although conventional LLM-based TTS achieves good speech synthesis results, it still has many limitations in timbre modulation, prosody variation, and the like; practical applications often depend on the quality of recorded voices and specific template files, making it difficult to satisfy the diversified application requirements of network live broadcasting.
Disclosure of Invention
Based on this, it is necessary to provide a speech synthesis method, apparatus, live broadcast system, electronic device, and computer-readable storage medium, with which the synthesized speech can approach the effect of real human speech and its prosody and timbre can be modulated to meet the diversified application requirements of network live broadcasting.
A method of speech synthesis, comprising:
Acquiring target text for speech synthesis and timbre characteristics of a speaker;
encoding the target guide audio to obtain target discrete semantic codes without timbre, wherein the target discrete semantic codes contain prosodic features;
predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized comprises prosody of target guide audio and semantic features of target text;
and decoding the audio code to be synthesized according to the target text and the timbre characteristics to obtain speech synthesis audio.
In one embodiment, encoding the target guide audio to obtain the target discrete semantic code with timbre removed includes:
inputting the target guide audio into an audio feature extraction model based on speaker timbre decoupling to obtain semantic features which remove timbre and contain prosody information;
and inputting semantic features of the target guide audio into a residual quantization model for processing to obtain a target discrete semantic code.
In one embodiment, predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized includes:
invoking a pre-trained large language model, wherein the large language model is obtained by training by utilizing training text and training audio based on a general language model;
and inputting the target discrete semantic code and the target text into the large language model for prediction to obtain the audio code to be synthesized, wherein the audio code to be synthesized includes the prosody of the target guide audio and the semantic features of the target text.
In one embodiment, the method for synthesizing speech further includes:
Constructing a large language model;
acquiring a text audio pair data set comprising training audio and corresponding text content thereof;
inputting the training audio into an encoder to obtain discrete semantic codes of the training audio, wherein the discrete semantic codes are targets for optimizing a large language model;
Inputting the discrete semantic codes of the training audio and the text content into a large language model together for prediction to obtain the discrete semantic codes of the audio to be synthesized;
The large language model is optimized by using cross entropy as an objective function based on the discrete semantic coding of the audio to be synthesized and the discrete semantic coding of the training audio.
In one embodiment, decoding the audio code to be synthesized according to the target text and the timbre characteristics to obtain speech synthesis audio includes:
Invoking a pre-trained decoder, wherein the decoder is trained based on training text and the timbre characteristics of a speaker;
inputting the audio code to be synthesized, the target text and the timbre characteristics into the decoder for decoding to obtain the speech synthesis audio.
In one embodiment, the method for synthesizing speech further includes:
Constructing a decoder, wherein the decoder comprises a prior encoder, a posterior encoder, a normalizing flow, a generator and a discriminator;
Setting a loss function, wherein the loss function comprises an L1 reconstruction loss, a KL divergence, an adversarial loss and a feature matching loss;
Training text and training audio are input to the decoder and the decoder is jointly optimized based on the loss function.
In one embodiment, the KL-divergence is used to measure the difference between a posterior distribution and an a priori distribution;
the L1 reconstruction loss is used for measuring the difference between the spectrogram predicted by the generator and the target spectrogram;
The adversarial loss is used to train the generator and the discriminator;
the feature matching penalty is used to optimize the generator.
In one embodiment, inputting the audio code to be synthesized, the target text and the timbre characteristics into a pre-trained decoder for decoding to obtain the speech synthesis audio includes:
passing the audio code to be synthesized, the target text and the timbre characteristics through the prior encoder to obtain a prior distribution, passing the prior distribution through the normalizing flow to obtain a posterior distribution, and inputting the posterior distribution and the timbre characteristics into the generator for fusion to obtain the speech synthesis audio.
A speech synthesis apparatus comprising:
the input module is used for acquiring target text for speech synthesis and timbre characteristics of a speaker;
the encoding module is used for encoding target guide audio to obtain a target discrete semantic code with timbre removed, wherein the target discrete semantic code contains prosodic features;
the prediction module is used for predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized includes the prosody of the target guide audio and the semantic features of the target text;
and the decoding module is used for decoding the audio code to be synthesized according to the target text and the timbre characteristics to obtain speech synthesis audio.
A live broadcast system comprises an anchor terminal, an audience terminal and a live broadcast server, wherein the anchor terminal and the audience terminal are respectively connected to the live broadcast server through a communication network;
the anchor terminal is used for accessing the anchor of a live broadcast room, collecting the anchor's live video stream and uploading it to the live broadcast server;
the live broadcast server is used for forwarding the anchor's live video stream to the audience terminal and generating the audio stream in the live video stream by using the speech synthesis method;
the audience terminal is used for accessing audience users of the live broadcast room, receiving and playing the anchor's live video stream, and displaying the pushed information.
An electronic device, comprising:
One or more processors;
A memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the steps of the speech synthesis method.
A computer readable storage medium storing at least one instruction, at least one program, set of codes, or set of instructions, the at least one instruction, the at least one program, the set of codes, or set of instructions being loaded by the processor and performing the steps of the speech synthesis method.
According to the above technical solution, the discrete semantic code of the guide audio is obtained through encoding, the discrete semantic code and the target text are predicted by a large language model to obtain the audio code to be synthesized, and the audio code to be synthesized is decoded in combination with the speaker's timbre characteristics to obtain the speech synthesis audio. In this way, the synthesized speech can approach the effect of real human speech, and its prosody and timbre can be modulated to meet the diversified application requirements of network live broadcasting.
Drawings
FIG. 1 is a schematic illustration of an exemplary live service application scenario;
FIG. 2 is a block diagram of an exemplary speech synthesis algorithm;
FIG. 3 is a flow chart of a method of speech synthesis of one embodiment;
FIG. 4 is a training schematic of a large language model of one embodiment;
FIG. 5 is a schematic diagram of a decoder training flow of one embodiment;
FIG. 6 is an inference flow diagram of a decoder of one embodiment;
FIG. 7 is a schematic diagram of the structure of a speech synthesis apparatus according to one embodiment;
FIG. 8 is a schematic diagram of an exemplary live system architecture;
Fig. 9 is a block diagram of an example electronic device.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more; for example, a plurality of objects means two or more objects. The words "comprise" or "comprising" mean that the information preceding them encompasses the information listed thereafter and its equivalents, without excluding additional information. "And/or" in embodiments of the application indicates that three relationships may exist; for example, "A and/or B" may mean A alone, both A and B, or B alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
The technical solution provided by the embodiments of the present application can be applied in the application scenario shown in FIG. 1, which is a schematic diagram of an exemplary live service application scenario. The live broadcast system may comprise a live broadcast server, an anchor terminal and an audience terminal, wherein the anchor terminal and the audience terminal communicate with the live broadcast server through a communication network, so that live broadcasting between the anchor at the anchor terminal and the audience users at the audience terminal can be realized. The terminal devices of the anchor terminal and the audience terminal may be, but are not limited to, personal computers, notebook computers, smart phones and tablet computers, and the live broadcast server may be implemented by an independent server or a server cluster composed of multiple servers.
Referring to FIG. 2, which is a block diagram of an exemplary speech synthesis algorithm, the target text for speech synthesis, the guide audio and the speaker's timbre characteristics are taken as input. The guide audio is passed through an encoder to obtain a discrete semantic code that contains prosody with the timbre removed; the discrete semantic code and the target text are passed through a trained large language model to obtain the discrete semantic code of the audio to be synthesized, which contains the prosody of the guide audio and the semantic features of the target text; and the discrete semantic code of the audio to be synthesized, the target text and the speaker's timbre characteristics are passed through a trained decoder to obtain the final speech synthesis audio.
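As a rough illustration of this end-to-end flow, the following Python sketch strings the three stages together; the `encoder`, `llm` and `decoder` objects and their method names are assumptions introduced here for illustration and are not defined by the patent.

```python
# Minimal sketch of the flow in FIG. 2; all component interfaces are hypothetical.
def synthesize(target_text, guide_audio, speaker_timbre, encoder, llm, decoder):
    # 1. Encode the guide audio into timbre-free discrete semantic codes
    #    that retain its prosody.
    guide_tokens = encoder.encode(guide_audio)

    # 2. The large language model predicts the discrete codes of the audio
    #    to be synthesized from the guide tokens and the target text.
    synth_tokens = llm.predict(guide_tokens, target_text)

    # 3. Decode the predicted codes together with the target text and the
    #    speaker timbre embedding into the final waveform.
    return decoder.decode(synth_tokens, target_text, speaker_timbre)
```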
Referring to fig. 3, fig. 3 is a flowchart of a speech synthesis method according to one embodiment, including the steps of:
Step S10, obtaining target text for speech synthesis and timbre characteristics of a speaker.
In this step, the target text for speech synthesis is generated according to the text content, and the timbre characteristics of a speaker are produced. The timbre characteristics of a specific speaker may be selected, the timbre characteristics of several speakers may be mixed, or timbre characteristics matching the desired prosody and timbre may be modulated as required.
Step S20, encoding the target guide audio to obtain a target discrete semantic code with timbre removed, wherein the target discrete semantic code contains prosodic features.
In this step, the target guide audio is encoded by an encoder to remove the timbre, obtaining a target discrete semantic code that contains the prosody. The guide audio may be a piece of speech read aloud by a speaker, or any segment of clean human voice from a speaker.
In one embodiment, the step S20 of encoding the target guide audio to obtain target discrete semantic codes with timbre removed, wherein the target discrete semantic codes contain prosodic features comprises the following steps:
Inputting the target guide audio into an audio feature extraction model based on speaker timbre decoupling to obtain semantic features which remove timbre and contain prosody information, and inputting the semantic features of the target guide audio into a residual quantization model to obtain target discrete semantic codes.
Specifically, the encoder may include a pre-trained audio feature extraction model based on speaker timbre decoupling (ContentVec) and a residual vector quantization model (Residual Vector Quantization, RVQ). The target guide audio is passed through the audio feature extraction model to obtain semantic features with the timbre removed, which mainly contain prosodic information; the semantic features of the guide audio are then passed through the residual vector quantization model to obtain the corresponding discrete semantic codes (tokens). These tokens, referred to as the discrete semantic code, are the input to the large language model (Large Language Model, LLM).
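A minimal sketch of this two-stage encoder is given below; it assumes that a ContentVec-style feature extractor and a residual vector quantizer are available as PyTorch modules, and the class name, method names and tensor shapes are illustrative assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class GuideAudioEncoder(nn.Module):
    """Sketch: timbre-decoupled semantic features followed by residual vector quantization."""

    def __init__(self, feature_extractor: nn.Module, quantizer: nn.Module):
        super().__init__()
        self.feature_extractor = feature_extractor  # e.g. a ContentVec-style model
        self.quantizer = quantizer                  # residual vector quantizer (RVQ)

    @torch.no_grad()
    def encode(self, waveform: torch.Tensor) -> torch.Tensor:
        # Semantic features with speaker timbre removed, prosody retained.
        features = self.feature_extractor(waveform)   # (frames, feature_dim)
        # Discrete semantic codes (tokens) that will be fed to the LLM.
        codes = self.quantizer(features)               # (frames, num_quantizers)
        return codes
```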
And step S30, predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized comprises prosody of target guide audio and semantic features of target text.
In this step, the tokens of the target guide audio obtained by encoding and the target text are fed to a pre-trained LLM, which predicts the tokens of the audio to be synthesized, containing the prosody of the target guide audio and the semantic features of the target text.
In one embodiment, predicting the target discrete semantic code and the target text based on the large language model in step S30 to obtain an audio code to be synthesized includes:
(1) And calling a pre-trained large language model, wherein the large language model is obtained by training by utilizing training text and training audio based on the universal language model.
As an embodiment, referring to fig. 4, fig. 4 is a training schematic diagram of a large language model according to an embodiment, for a training method of a large language model, the method may include the following steps:
a, constructing a large language model;
b, acquiring a text-audio pair dataset comprising training audio and its corresponding text content; specifically, the dataset is built from massive text-audio pairs $\{(a_i, x_i)\}$, where $a_i$ is the training audio and $x_i$ is the text content corresponding to the training audio.
And c, inputting the training audio into an encoder to obtain discrete semantic codes of the training audio, wherein the discrete semantic codes are targets for optimizing a large language model.
And d, inputting the discrete semantic code of the training audio together with the text content into the large language model to predict the discrete semantic code of the audio to be synthesized. Specifically, the training audio $a_i$ is first passed through the encoder to obtain the corresponding discrete semantic code $y_i$; $y_i$ serves as the optimization target for LLM learning and, together with the text content $x_i$, is fed into the LLM to obtain the predicted discrete semantic code $y'_i$ of the audio to be synthesized.
E, optimizing a large language model by using cross entropy as an objective function based on the discrete semantic coding of the audio to be synthesized and the discrete semantic coding of the training audio.
Specifically, the cross-entropy loss is calculated as
$\mathrm{Loss}(y'_i, y_i) = -\sum_{t} \sum_{v} y_{i,t,v} \log y'_{i,t,v}$
where $y'_{i,t,v}$ is the predicted probability of code $v$ at position $t$, $y_{i,t,v}$ is the corresponding one-hot target, and $\mathrm{Loss}(y'_i, y_i)$ is the cross-entropy loss.
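A training step along these lines could look like the following sketch; the encoder and LLM interfaces, the teacher-forcing layout and the tensor shapes are assumptions made for illustration, and only the use of cross entropy between the predicted and target discrete semantic codes follows the description above.

```python
import torch
import torch.nn.functional as F

def llm_training_step(llm, encoder, optimizer, audio_batch, text_tokens):
    # Target codes y_i: discrete semantic codes of the training audio.
    with torch.no_grad():
        target_codes = encoder.encode(audio_batch)       # (batch, T)

    # Predicted codes y'_i: logits over the audio-code vocabulary, conditioned
    # on the text content and (teacher-forced) target codes.
    logits = llm(text_tokens, target_codes)              # (batch, T, vocab)

    # Cross entropy between predicted and target discrete semantic codes.
    loss = F.cross_entropy(logits.transpose(1, 2), target_codes)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```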
(2) And inputting the target discrete semantic codes and the target text into the large language model for prediction to obtain the audio codes to be synthesized, wherein the audio codes to be synthesized comprise the rhythm of the target guide audio and the semantic characteristics of the target text.
Specifically, the tokens of the guide audio obtained in step S20 and the target text obtained in step S10 are input into the trained LLM to obtain the predicted tokens of the audio to be synthesized.
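At inference time the same model can be used autoregressively. The sketch below assumes a GPT-style `generate` interface and a simple concatenated prompt layout; neither of these details is specified by the patent.

```python
import torch

@torch.no_grad()
def predict_synth_tokens(llm, guide_tokens, text_tokens, max_new_tokens=2048, eos_id=0):
    # Prompt = discrete semantic codes of the guide audio + target text tokens;
    # the model continues the sequence with the codes of the audio to synthesize.
    prompt = torch.cat([guide_tokens, text_tokens], dim=-1)
    out = llm.generate(prompt, max_new_tokens=max_new_tokens, eos_token_id=eos_id)
    # Keep only the newly generated audio codes.
    return out[..., prompt.shape[-1]:]
```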
Step S40, decoding the audio code to be synthesized according to the target text and the timbre characteristics to obtain speech synthesis audio.
In this step, the target text, the timbre characteristics and the audio code to be synthesized are decoded by a pre-trained decoder to obtain the final speech synthesis audio output.
In one embodiment, when the speech synthesis audio is obtained through decoding in step S40, a pre-trained decoder is invoked, and the audio code to be synthesized, the target text and the timbre characteristics are input into the decoder for decoding to obtain the speech synthesis audio, wherein the decoder is trained based on training text and the speaker's timbre characteristics.
In one embodiment, referring to fig. 5, fig. 5 is a schematic diagram of a decoder training flow in one embodiment, for a decoder training method, the following may be used:
1) Constructing a decoder, wherein the decoder comprises a prior encoder, a posterior encoder, a normalizing flow, a generator and a discriminator.
As shown in fig. 5, a decoder model may be constructed that includes a prior encoder, a posterior encoder, a normalizing flow (Normalizing Flow), a generator and a discriminator.
2) Setting a loss function, wherein the loss function comprises an L1 reconstruction loss, a KL divergence, an adversarial loss and a feature matching loss.
Specifically, training text and training audio are input, and the decoder is jointly optimized using four loss terms, namely the L1 reconstruction loss, the KL divergence (KL Divergence), the adversarial loss and the feature matching loss.
For the L1 reconstruction loss: the training audio is used to obtain a target spectrogram $x_{mel}$, the target spectrogram is passed through the posterior encoder to obtain a latent variable $z$, and the latent variable is passed through the generator to obtain a predicted spectrogram $\hat{x}_{mel}$; finally, the L1 reconstruction loss is used for optimization based on the target spectrogram and the predicted spectrogram.
For the KL divergence loss, which measures the distance between one probability distribution and another: the training audio is passed through the encoder to obtain discrete semantic codes; the discrete semantic codes, the training text and the speaker timbre are passed through the prior encoder to obtain the prior distribution; the latent variable $z$ is passed through the normalizing flow to obtain the posterior distribution; and the KL divergence loss is used for optimization based on the prior and posterior distributions.
The adversarial loss trains the generator and the discriminator so that the predicted spectrogram $\hat{x}_{mel}$ becomes more realistic.
The feature matching loss is used to optimize the generator; it encourages the feature maps at each layer of the discriminator to be similar for the target spectrogram $x_{mel}$ and the predicted spectrogram $\hat{x}_{mel}$.
3) Training text and training audio are input to the decoder and the decoder is jointly optimized based on the loss function.
Specifically, the prior encoder has a variational autoencoder (VAE) structure and uses a normalizing flow (Normalizing Flow) to enhance the expressive power of the prior distribution; normalizing flows are used to learn complex data distributions. The quality of the synthesized audio is improved through generative adversarial network (Generative Adversarial Networks, GAN) training, and combining the VAE and GAN training yields the final loss function, in which the reconstruction loss, KL divergence, adversarial loss and feature matching loss are combined.
Illustratively, the reconstruction loss uses the L1 loss to measure the difference between the spectrogram predicted by the generator and the target spectrogram:
$L_{recon} = \lVert x_{mel} - \hat{x}_{mel} \rVert_1$
where $x_{mel}$ is the target spectrogram and $\hat{x}_{mel}$ is the predicted spectrogram. The difference between the posterior distribution and the prior distribution is measured by the KL divergence:
$L_{kl} = \log(q_\phi) - \log(p_\theta)$
where $q_\phi$ is the posterior distribution and $p_\theta$ is the prior distribution.
The adversarial loss includes the loss of the discriminator and the loss of the generator, wherein the loss of the discriminator is
$L_{adv}(D) = \mathbb{E}_{(y,a)}\big[(D(a) - 1)^2 + (D(G(z)))^2\big]$
and the loss of the generator is
$L_{adv}(G) = \mathbb{E}_{z}\big[(D(G(z)) - 1)^2\big]$
where $D$ is the discriminator, $G$ is the generator, $a$ is the input audio and $z$ is the latent variable output by the posterior encoder. The feature matching loss is mainly used for training the generator:
$L_{fm}(G) = \mathbb{E}\Big[\sum_{l} \frac{1}{N_l} \lVert D_l(a) - D_l(G(z)) \rVert_1\Big]$
where $D_l$ is the feature map output by the $l$-th layer of the discriminator and $N_l$ is the number of feature maps.
Combining all the loss functions yields the total loss function:
$L_{total} = L_{recon} + L_{kl} + L_{adv}(G) + L_{fm}(G)$
where $L_{total}$ represents the total loss function.
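The joint objective can be assembled along the lines of the following PyTorch sketch; the helper signatures, the way the discriminator features are passed in, and the equal weighting of the terms are assumptions made for illustration, mirroring the formulas above rather than reproducing an implementation from the patent.

```python
import torch
import torch.nn.functional as F

def generator_total_loss(x_mel, x_mel_hat, log_q, log_p, disc_fake, feats_real, feats_fake):
    # L1 reconstruction loss between target and predicted spectrograms.
    l_recon = F.l1_loss(x_mel_hat, x_mel)

    # KL divergence term between posterior q_phi and prior p_theta (per-sample log densities).
    l_kl = (log_q - log_p).mean()

    # Generator side of the adversarial loss: D(G(z)) should look real.
    l_adv_g = ((disc_fake - 1.0) ** 2).mean()

    # Feature matching loss over the discriminator's intermediate feature maps.
    l_fm = sum(F.l1_loss(f, r) for f, r in zip(feats_fake, feats_real)) / len(feats_real)

    # L_total = L_recon + L_kl + L_adv(G) + L_fm(G)
    return l_recon + l_kl + l_adv_g + l_fm

def discriminator_loss(disc_real, disc_fake):
    # L_adv(D) = E[(D(a) - 1)^2 + D(G(z))^2]
    return ((disc_real - 1.0) ** 2).mean() + (disc_fake ** 2).mean()
```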
In one embodiment, when the decoder is used to obtain the speech synthesis audio, the audio code to be synthesized, the target text and the timbre characteristics are passed through the prior encoder to obtain the prior distribution, the prior distribution is passed through the normalizing flow to obtain the posterior distribution, and the posterior distribution and the timbre characteristics are input into the generator for fusion to obtain the speech synthesis audio.
Specifically, with the decoder obtained from the above training, referring to fig. 6, which is an inference flow diagram of the decoder of one embodiment, at inference time the decoder receives the target text, the tokens predicted by the LLM and the speaker's timbre characteristics, and finally produces the speech synthesis audio; the prosody of the speech synthesis audio depends on the guide audio, while the timbre depends on the speaker's timbre characteristics.
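A corresponding inference sketch is given below; the prior encoder, normalizing flow and generator are assumed to be available modules with the hypothetical interfaces shown, and the data flow simply follows the description above.

```python
import torch

@torch.no_grad()
def decode_to_waveform(prior_encoder, flow, generator, synth_tokens, text_tokens, timbre_embedding):
    # Prior distribution from the predicted audio codes, the target text and the timbre.
    z_prior = prior_encoder(synth_tokens, text_tokens, timbre_embedding)
    # The normalizing flow maps the prior sample into the posterior latent space.
    z = flow(z_prior, reverse=True)
    # The generator fuses the latent with the timbre embedding to produce the waveform.
    return generator(z, timbre_embedding)
```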
Preferably, the speaker's timbre characteristics may be specified as a fusion of multiple timbres. For example, assuming there are n timbre features, denoted $e_1, e_2, \dots, e_n$, with corresponding fusion ratios $p_1, p_2, \dots, p_n$ satisfying $\sum_{i=1}^{n} p_i = 1$, the timbre fusion formula is $\mathrm{embedding} = e_1 \times p_1 + e_2 \times p_2 + \dots + e_n \times p_n$. In this way, the requirement for diversified timbres can be met while the timbre characteristics of a specific speaker can still be reproduced.
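The weighted fusion of timbre embeddings is straightforward; the sketch below assumes each timbre feature is a fixed-length vector, and the function name and speaker names in the usage example are purely illustrative.

```python
import torch

def fuse_timbres(embeddings, ratios):
    """embedding = e1*p1 + e2*p2 + ... + en*pn, with the ratios summing to 1."""
    ratios = torch.tensor(ratios, dtype=embeddings[0].dtype)
    assert torch.isclose(ratios.sum(), torch.tensor(1.0, dtype=ratios.dtype)), \
        "fusion ratios must sum to 1"
    return sum(p * e for p, e in zip(ratios, embeddings))

# Example: blend an anchor timbre with a custom timbre at a 70/30 ratio.
# fused = fuse_timbres([anchor_timbre, custom_timbre], [0.7, 0.3])
```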
The technical solution of this embodiment can be used for an AI assistant in a live broadcast room. By adding an AI live broadcast assistance function, a user can autonomously customize the voice of the AI assistant, so that the synthesized speech has prosody approaching that of a real person. For example, in a group chat scenario on a live broadcast platform, several virtual characters can group-chat with users, and voices with both the anchor's timbre and non-anchor timbres can be produced: speech synthesized with the anchor's timbre can approximate the real person, while non-anchor timbres can be modulated to the expected timbre and prosody. By adjusting the guide audio and mixing timbres, unique timbres and prosody can be created to meet various service demands.
An embodiment of the speech synthesis apparatus is set forth below.
As shown in fig. 7, fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment, including:
The input module 10 is used for acquiring target text for speech synthesis and timbre characteristics of a speaker;
the encoding module 20 is used for encoding target guide audio to obtain a target discrete semantic code with timbre removed, wherein the target discrete semantic code contains prosodic features;
the prediction module 30 is used for predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized includes the prosody of the target guide audio and the semantic features of the target text;
the decoding module 40 is used for decoding the audio code to be synthesized according to the target text and the timbre characteristics to obtain speech synthesis audio.
The voice synthesis device of the present embodiment may perform a voice synthesis method provided by the embodiment of the present application, and its implementation principle is similar, and actions performed by each module in the voice synthesis device of each embodiment of the present application correspond to steps in the voice synthesis method of each embodiment of the present application, and detailed functional descriptions of each module of the voice synthesis device may be specifically referred to the descriptions in the corresponding voice synthesis method shown in the foregoing, which are not repeated herein.
An embodiment of a live system is set forth below.
In the live broadcast system of this embodiment, referring to fig. 8, which is a schematic structural diagram of an exemplary live broadcast system, the system framework may include an anchor terminal, a live broadcast server and an audience terminal, where the audience terminal and the anchor terminal may enter the live broadcast platform through clients installed on electronic devices. The anchor terminal and the audience terminal may be computer devices such as a PDA, a smart phone, a tablet computer, a desktop computer or a notebook computer, without being limited thereto, and may also be software modules of an application program; the live broadcast server may be implemented by an independent server or a server cluster composed of multiple servers. The anchor terminal is used for accessing the anchor of a live broadcast room and collecting the anchor's live video stream to upload to the live broadcast server; the live broadcast server is used for forwarding the anchor's live video stream to the audience terminal and generating the audio stream in the live video stream by using the speech synthesis method of any of the above embodiments; and the audience terminal is used for accessing the audience users of the live broadcast room and receiving and playing the anchor's live video stream.
Illustratively, an AI assistant applied to the live broadcast room performs voice live broadcasting: a user can autonomously customize the voice of the AI assistant to synthesize the live broadcast speech, so that the synthesized speech has prosody approaching that of a real person. For example, several virtual characters, including ones with the anchor's timbre and ones with non-anchor timbres, can group-chat with users, and the AI assistant can create unique timbres and prosody to meet various business demands.
According to the above technical solution, the synthesized speech can approximate the effect of real human speech; prosody and timbre are adjusted based on the large language model, and the prosody and timbre of the synthesized speech can be modulated to create unique timbres and prosody, thereby improving the speech synthesis effect and meeting the diversified application requirements of network live broadcasting.
Embodiments of an electronic device and a computer-readable storage medium are set forth below.
The present application provides a technical solution of an electronic device for realizing the relevant functions of the speech synthesis method. The electronic device of this embodiment includes one or more processors and a memory, and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the steps of the speech synthesis method of any embodiment.
As shown in FIG. 9, FIG. 9 is a block diagram of an example electronic device, which may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like. The electronic device 100 can include one or more of a processing component 102, a memory 104, a power component 106, a multimedia component 108, an audio component 109, an input/output (I/O) interface 112, a sensor component 114, and a communication component 116.
The processing component 102 generally controls overall operation of the electronic device 100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
The memory 104 is configured to store various types of data to support operations at the electronic device 100, and may be implemented by any type of volatile or non-volatile storage device, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The power supply assembly 106 provides power to the various components of the electronic device 100.
The multimedia component 108 includes a screen providing an output interface between the electronic device 100 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). In some embodiments, the multimedia component 108 includes a front-facing camera and/or a rear-facing camera.
The audio component 109 is configured to output and/or input audio signals.
The I/O interface 112 provides an interface between the processing component 102 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, a home button, a volume button, an activate button, and a lock button.
The sensor assembly 114 includes one or more sensors for providing status assessment of various aspects of the electronic device 100. The sensor assembly 114 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
The communication component 116 is configured to facilitate communication between the electronic device 100 and other devices in a wired or wireless manner. The electronic device 100 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof.
The application provides a technical scheme of a computer readable storage medium, which is used for realizing relevant functions of a voice synthesis method. The computer readable storage medium stores at least one instruction, at least one program, code set, or instruction set, the at least one instruction, at least one program, code set, or instruction set being loaded by a processor and performing the speech synthesis method of any of the embodiments.
In an exemplary embodiment, the computer-readable storage medium may be a non-transitory computer-readable storage medium including instructions, such as a memory including instructions, for example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (12)

1. A speech synthesis method, comprising:
obtaining target text for speech synthesis and timbre characteristics of a speaker;
encoding target guide audio to obtain a target discrete semantic code with timbre removed, wherein the target discrete semantic code contains prosodic features;
predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized includes the prosody of the target guide audio and the semantic features of the target text;
decoding the audio code to be synthesized according to the target text and the timbre characteristics to obtain speech synthesis audio.

2. The speech synthesis method according to claim 1, wherein encoding the target guide audio to obtain the target discrete semantic code with timbre removed comprises:
inputting the target guide audio into an audio feature extraction model based on speaker timbre decoupling to obtain semantic features with timbre removed and containing prosodic information;
inputting the semantic features of the target guide audio into a residual quantization model to obtain the target discrete semantic code.

3. The speech synthesis method according to claim 1, wherein predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized comprises:
invoking a pre-trained large language model, wherein the large language model is obtained by training a general language model with training text and training audio;
inputting the target discrete semantic code and the target text into the large language model for prediction to obtain the audio code to be synthesized, which includes the prosody of the target guide audio and the semantic features of the target text.

4. The speech synthesis method according to claim 3, further comprising:
constructing a large language model;
acquiring a text-audio pair dataset comprising training audio and its corresponding text content;
inputting the training audio into an encoder to obtain a discrete semantic code of the training audio, wherein the discrete semantic code is the optimization target of the large language model;
inputting the discrete semantic code of the training audio together with the text content into the large language model to predict the discrete semantic code of the audio to be synthesized;
optimizing the large language model using cross entropy as the objective function, based on the discrete semantic code of the audio to be synthesized and the discrete semantic code of the training audio.

5. The speech synthesis method according to claim 1, wherein decoding the audio code to be synthesized according to the target text and the timbre characteristics to obtain speech synthesis audio comprises:
invoking a pre-trained decoder, wherein the decoder is trained based on training text and the speaker's timbre characteristics;
inputting the audio code to be synthesized, the target text and the timbre characteristics into the decoder for decoding to obtain the speech synthesis audio.

6. The speech synthesis method according to claim 5, further comprising:
constructing a decoder, wherein the decoder comprises a prior encoder, a posterior encoder, a normalizing flow, a generator and a discriminator;
setting a loss function, wherein the loss function comprises an L1 reconstruction loss, a KL divergence, an adversarial loss and a feature matching loss;
inputting training text and training audio into the decoder, and jointly optimizing the decoder based on the loss function.

7. The speech synthesis method according to claim 6, wherein the KL divergence is used to measure the difference between the posterior distribution and the prior distribution;
the L1 reconstruction loss is used to measure the difference between the spectrogram predicted by the generator and the target spectrogram;
the adversarial loss is used to train the generator and the discriminator;
the feature matching loss is used to optimize the generator.

8. The speech synthesis method according to claim 6, wherein inputting the audio code to be synthesized, the target text and the timbre characteristics into a pre-trained decoder for decoding to obtain speech synthesis audio comprises:
passing the audio code to be synthesized, the target text and the timbre characteristics through the prior encoder to obtain a prior distribution, passing the prior distribution through the normalizing flow to obtain a posterior distribution, and inputting the posterior distribution and the timbre characteristics into the generator for fusion to obtain the speech synthesis audio.

9. A speech synthesis apparatus, comprising:
an input module for obtaining target text for speech synthesis and timbre characteristics of a speaker;
an encoding module for encoding target guide audio to obtain a target discrete semantic code with timbre removed, wherein the target discrete semantic code contains prosodic features;
a prediction module for predicting the target discrete semantic code and the target text based on a large language model to obtain an audio code to be synthesized, wherein the audio code to be synthesized includes the prosody of the target guide audio and the semantic features of the target text;
a decoding module for decoding the audio code to be synthesized according to the target text and the timbre characteristics to obtain speech synthesis audio.

10. A live broadcast system, comprising an anchor terminal, an audience terminal and a live broadcast server, wherein the anchor terminal and the audience terminal are respectively connected to the live broadcast server through a communication network;
the anchor terminal is used for accessing the anchor of a live broadcast room, collecting the anchor's live video stream and uploading it to the live broadcast server;
the live broadcast server is used for forwarding the anchor's live video stream to the audience terminal, and generating the audio stream in the live video stream by using the speech synthesis method according to any one of claims 1-8;
the audience terminal is used for accessing the audience users of the live broadcast room, receiving and playing the anchor's live video stream, and displaying the pushed information.

11. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the steps of the speech synthesis method according to any one of claims 1-8.

12. A computer-readable storage medium storing at least one instruction, at least one program, a code set or an instruction set, wherein the at least one instruction, the at least one program, the code set or the instruction set is loaded by a processor and performs the steps of the speech synthesis method according to any one of claims 1-8.
CN202411374449.4A 2024-09-29 2024-09-29 Speech synthesis method, device, live broadcast system, electronic device and storage medium Pending CN119360822A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411374449.4A CN119360822A (en) 2024-09-29 2024-09-29 Speech synthesis method, device, live broadcast system, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411374449.4A CN119360822A (en) 2024-09-29 2024-09-29 Speech synthesis method, device, live broadcast system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN119360822A true CN119360822A (en) 2025-01-24

Family

ID=94313420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411374449.4A Pending CN119360822A (en) 2024-09-29 2024-09-29 Speech synthesis method, device, live broadcast system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN119360822A (en)

Similar Documents

Publication Publication Date Title
CN112767910B (en) Audio information synthesis method, device, computer readable medium and electronic equipment
CN108711423A (en) Intelligent sound interacts implementation method, device, computer equipment and storage medium
US20220301250A1 (en) Avatar-based interaction service method and apparatus
CN113316078B (en) Data processing method and device, computer equipment and storage medium
CN114866856B (en) Audio signal processing method, audio generation model training method and device
CN112837669B (en) Speech synthesis method, device and server
US20250056078A1 (en) Server, method and computer program
WO2024193227A1 (en) Voice editing method and apparatus, and storage medium and electronic apparatus
CN116072095A (en) Character interaction method, device, electronic device and storage medium
KR20250010714A (en) Voice Chat Translation
CN113707122B (en) Method and device for constructing voice synthesis model
CN118433437A (en) Live broadcasting room voice live broadcasting method and device, live broadcasting system, electronic equipment and medium
CN118762684A (en) Speech synthesis model training method, speech synthesis method, device and medium
CN113571082A (en) Voice call control method and device, computer readable medium and electronic equipment
CN117373463A (en) Model training method, device, medium and program product for speech processing
CN119360822A (en) Speech synthesis method, device, live broadcast system, electronic device and storage medium
CN114283789B (en) Singing voice synthesis method, device, computer equipment and storage medium
CN116229996A (en) Audio production method, device, terminal, storage medium and program product
CN114093341A (en) Data processing method, apparatus and medium
CN116347123A (en) Bullet screen generation method and device
CN114299915B (en) Speech synthesis method and related equipment
CN116708951B (en) Video generation method and device based on neural network
US12387000B1 (en) Privacy-preserving avatar voice transmission
CN117768741A (en) Voice interaction method and device for live broadcasting room, live broadcasting system, equipment and medium
CN112383722B (en) Method and apparatus for generating video

Legal Events

Date Code Title Description
PB01 Publication