
CN119360818A - Speech generation method, device, computer equipment and medium based on artificial intelligence - Google Patents

Speech generation method, device, computer equipment and medium based on artificial intelligence

Info

Publication number
CN119360818A
CN119360818A (application CN202411368508.7A)
Authority
CN
China
Prior art keywords
speech
text
synthesized
voice
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411368508.7A
Other languages
Chinese (zh)
Inventor
Sun Aolan (孙奥兰)
Wang Jianzong (王健宗)
Cheng Ning (程宁)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202411368508.7A
Publication of CN119360818A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application belongs to the field of artificial intelligence and the field of financial technology, and relates to a speech generation method based on artificial intelligence, which comprises: receiving a text to be synthesized and an initial speech; preprocessing the text to be synthesized to obtain a specified text; performing text encoding on the specified text based on a text encoder to obtain a text encoding vector; extracting a speaker embedding vector from the initial speech; processing the text encoding vector and the speaker embedding vector based on a large language model to generate a target voice mark sequence; processing the target voice mark sequence and the speaker embedding vector based on a target conditional flow matching model to obtain a mel spectrogram; and converting the mel spectrogram based on a vocoder to obtain the synthesized speech. The application also provides an artificial-intelligence-based speech generating device, a computer device and a storage medium. Furthermore, the synthesized speech of the present application may be stored in a blockchain. The application improves the quality of the generated synthesized speech and helps improve the user experience.

Description

Speech generation method, device, computer equipment and medium based on artificial intelligence
Technical Field
The application relates to the technical fields of artificial intelligence and financial technology, and in particular to an artificial-intelligence-based speech generation method and device, a computer device and a storage medium.
Background
A speech synthesis method converts text into corresponding speech and is widely applied in fields such as the Internet, finance, medical care and education. In financial business scenarios, customer service staff are usually employed to answer customers' questions about business content such as financial products and transaction processes. Because financial business is complex and varied, large volumes of simple tasks such as consultation and after-sales service seriously occupy the staff's energy and time and reduce their working efficiency and quality. An intelligent conversation mode based on automatic speech synthesis can save a great deal of labor cost, and controlling the synthesized speech can improve the quality of service to customers; speech synthesis technology therefore plays an important auxiliary role in financial business scenarios.
Financial enterprises currently tend to adopt speech synthesis based on large language models, which convert the speech signal into a sequence of discrete tokens and model that token sequence autoregressively with the text as a condition. A token-based vocoder then reconstructs the original waveform from the predicted tokens, completing the generation of the synthesized speech. However, speech generated in this way often lacks naturalness and quality, resulting in a poor user experience.
Disclosure of Invention
The embodiment of the application aims to provide a voice generation method, a voice generation device, a computer device and a storage medium based on artificial intelligence, so as to solve the technical problem that the quality of generated synthesized voice is poor in a voice synthesis mode based on a large language model adopted by the existing financial enterprises.
In order to solve the above technical problems, the embodiment of the present application provides a speech generating method based on artificial intelligence, which adopts the following technical scheme:
receiving a text to be synthesized and an initial speech input by a user;
preprocessing the text to be synthesized to obtain a corresponding specified text;
performing text encoding processing on the specified text based on a preset text encoder to obtain a corresponding text encoding vector;
extracting a corresponding speaker embedding vector from the initial speech;
processing the text encoding vector and the speaker embedding vector based on a preset large language model to generate a corresponding target voice mark sequence;
processing the target voice mark sequence and the speaker embedding vector based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram;
and converting the mel spectrogram based on a preset vocoder to obtain the corresponding synthesized speech.
Further, the step of processing the text encoding vector and the speaker embedding vector based on the preset large language model to generate a corresponding target voice mark sequence specifically includes:
Invoking the large language model;
processing the text encoding vector and the speaker embedding vector based on the large language model to obtain a corresponding first voice mark sequence;
Removing repeated items from the first voice mark sequence to obtain a corresponding second voice mark sequence;
Performing sequence length adjustment processing on the second voice mark sequence to obtain a corresponding third voice mark sequence;
Performing sequence smoothing on the third voice mark sequence to obtain a corresponding fourth voice mark sequence;
and taking the fourth voice mark sequence as the target voice mark sequence.
Further, the step of performing a sequence length adjustment process on the second voice mark sequence to obtain a corresponding third voice mark sequence specifically includes:
Acquiring the sequence length of the second voice mark sequence;
Acquiring a preset target sequence length;
Judging whether the sequence length is smaller than the target sequence length or not;
If yes, performing element addition processing on the second voice mark sequence based on a preset element addition strategy to obtain an element-added second voice mark sequence, and taking the element-added second voice mark sequence as the third voice mark sequence;
If not, carrying out element deletion processing on the second voice mark sequence based on a preset element deletion strategy to obtain an element deleted second voice mark sequence, and taking the element deleted second voice mark sequence as the third voice mark sequence.
Further, the step of extracting the corresponding speaker embedding vector from the initial speech specifically includes:
Calling a preset voiceprint model;
Inputting the initial speech into the voiceprint model;
and carrying out vector extraction processing on the initial voice based on the voiceprint model to obtain the speaker embedded vector corresponding to the initial voice.
Further, before the step of processing the target voice mark sequence and the speaker embedding vector based on the preset target conditional flow matching model to obtain a corresponding mel spectrogram, the method further includes:
acquiring an initial condition flow matching model;
Acquiring a preset cosine scheduler;
Adjusting the initial conditional flow matching model based on the cosine scheduler to obtain a corresponding first conditional flow matching model;
Optimizing the first conditional flow matching model based on a preset classifier-free guidance strategy to obtain a corresponding second conditional flow matching model;
And constructing and obtaining the target conditional flow matching model based on the second conditional flow matching model.
Further, after the step of converting the mel-frequency spectrogram based on the preset vocoder to obtain the corresponding synthesized voice, the method further includes:
optimizing the synthesized voice based on a preset quality optimization strategy to obtain a corresponding target synthesized voice;
generating a corresponding target audio file based on the target synthesized speech;
acquiring a preset pushing mode;
and pushing the target audio file to the user based on the pushing mode.
Further, the step of optimizing the synthesized voice based on the preset quality optimization strategy to obtain the corresponding target synthesized voice specifically includes:
denoising the synthesized voice to obtain a corresponding first synthesized voice;
Performing volume adjustment processing on the first synthesized voice to obtain a corresponding second synthesized voice;
performing range compression processing on the second synthesized voice to obtain a corresponding third synthesized voice;
And taking the third synthesized voice as the target synthesized voice.
In order to solve the technical problems, the embodiment of the application also provides a voice generating device based on artificial intelligence, which adopts the following technical scheme:
the receiving module is used for receiving the text to be synthesized and the initial voice input by the user;
The preprocessing module is used for preprocessing the text to be synthesized to obtain a corresponding specified text;
the coding module is used for performing text encoding processing on the specified text based on a preset text encoder to obtain a corresponding text encoding vector;
the extraction module is used for extracting a corresponding speaker embedding vector from the initial voice;
The first processing module is used for processing the text encoding vector and the speaker embedding vector based on a preset large language model to generate a corresponding target voice mark sequence;
The second processing module is used for processing the target voice mark sequence and the speaker embedding vector based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram;
And the conversion module is used for carrying out conversion processing on the Mel spectrogram based on a preset vocoder to obtain corresponding synthesized voice.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
receiving a text to be synthesized and an initial speech input by a user;
preprocessing the text to be synthesized to obtain a corresponding specified text;
performing text encoding processing on the specified text based on a preset text encoder to obtain a corresponding text encoding vector;
extracting a corresponding speaker embedding vector from the initial speech;
processing the text encoding vector and the speaker embedding vector based on a preset large language model to generate a corresponding target voice mark sequence;
processing the target voice mark sequence and the speaker embedding vector based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram;
and converting the mel spectrogram based on a preset vocoder to obtain the corresponding synthesized speech.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
receiving a text to be synthesized and an initial speech input by a user;
preprocessing the text to be synthesized to obtain a corresponding specified text;
performing text encoding processing on the specified text based on a preset text encoder to obtain a corresponding text encoding vector;
extracting a corresponding speaker embedding vector from the initial speech;
processing the text encoding vector and the speaker embedding vector based on a preset large language model to generate a corresponding target voice mark sequence;
processing the target voice mark sequence and the speaker embedding vector based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram;
and converting the mel spectrogram based on a preset vocoder to obtain the corresponding synthesized speech.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
The application receives a text to be synthesized and an initial speech input by a user, and preprocesses the text to be synthesized to obtain a corresponding specified text. Text encoding is then performed on the specified text based on a preset text encoder to obtain a corresponding text encoding vector, and a corresponding speaker embedding vector is extracted from the initial speech. The text encoding vector and the speaker embedding vector are processed based on a preset large language model to generate a corresponding target voice mark sequence; the target voice mark sequence and the speaker embedding vector are then processed based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram; and finally the mel spectrogram is converted based on a preset vocoder to obtain the corresponding synthesized speech. By encoding the text with a text encoder, extracting the speaker embedding vector from the initial speech, generating the target voice mark sequence with a large language model, converting it into a mel spectrogram with the target conditional flow matching model, and synthesizing the waveform with a vocoder, the application achieves fast synthesis from text to high-quality speech, effectively improves the quality of the generated synthesized speech, and helps improve the user experience.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, it being apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained from these drawings without the exercise of inventive effort for a person of ordinary skill in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of an artificial intelligence based speech generation method according to the present application;
FIG. 3 is a schematic diagram of one embodiment of an artificial intelligence based speech generating device in accordance with the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms used in the description herein are for the purpose of describing particular embodiments only and are not intended to limit the application. The terms "comprising" and "having" and any variations thereof in the description, the claims and the above description of the drawings are intended to cover non-exclusive inclusions. The terms "first", "second" and the like in the description, the claims or the above figures are used to distinguish between different objects and not necessarily to describe a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103, where the terminal device 101 may be a notebook 1011, a tablet 1012, or a cell phone 1013. Network 102 is the medium used to provide communication links between terminal device 101 and server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 103 via the network 102 using the terminal device 101 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal device 101.
The terminal device 101 may be various electronic devices having a display screen and supporting web browsing; in addition to the notebook 1011, the tablet 1012 or the mobile phone 1013, the terminal device 101 may be an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, a desktop computer, and the like.
The server 103 may be a server providing various services, such as a background server providing support for pages displayed on the terminal device 101.
It should be noted that, the speech generating method based on artificial intelligence provided by the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the speech generating device based on artificial intelligence is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flowchart of one embodiment of an artificial-intelligence-based speech generation method in accordance with the present application is shown. The order of the steps in the flowchart may be changed and some steps may be omitted according to various needs. The speech generation method based on artificial intelligence provided by the embodiment of the application can be applied to any scenario requiring speech synthesis, and to products in such scenarios, for example speech synthesis in the financial insurance field. The artificial-intelligence-based speech generation method comprises the following steps:
step S201, receiving a text to be synthesized and an initial voice input by a user.
In this embodiment, the electronic device (for example, the server/terminal device shown in FIG. 1) on which the artificial-intelligence-based speech generation method runs may acquire the text to be synthesized and the initial speech input by the user through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G/5G, Wi-Fi, Bluetooth, WiMAX, ZigBee and UWB (ultra wideband) connections, as well as other wireless connection means now known or developed in the future. The execution subject of the present application is specifically a speech synthesis system, or simply a system. The user may input the text to be synthesized, as well as the initial speech, through an interface or API of the speech system. In a business scenario of promoting financial insurance products, the text to be synthesized may be business text related to transaction data, payment data, service data, and the like. The initial speech may be speech containing useful acoustic features, which may include pitch, intensity, duration and timbre, as well as a higher-level speaker embedding vector. The speaker embedding vector is used to characterize the identity of the speaker.
The speech synthesis system of the application consists of four components: a text encoder, a speech tokenizer, a large language model and a target conditional flow matching model. Specifically, the text encoder is used to align the semantic spaces of text and voice marks (i.e., speech tokens), while the speech tokenizer extracts semantic voice marks from speech. By learning the sequence mapping from the whole text encoding to the voice marks with a large language model, the TTS (text-to-speech) task is re-expressed as an autoregressive sequence generation problem for a given text prompt. The target conditional flow matching model converts the voice marks into a mel spectrogram through a denoising process along an optimal path. To obtain a perceptible signal, a vocoder then synthesizes the waveform with the generated mel spectrogram as input.
Step S202, preprocessing the text to be synthesized to obtain a corresponding specified text.
In this embodiment, preprocessing is performed on the text to be synthesized, which specifically includes removing unnecessary spaces and punctuations, performing lowercase processing, and the like, so as to ensure consistency of text formats, thereby obtaining corresponding specified text.
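As a concrete illustration, a minimal preprocessing pass might look like the Python sketch below; the exact normalization rules (which punctuation to strip, whether to lowercase) are assumptions for demonstration rather than requirements of the method.

    import re
    import unicodedata

    def preprocess_text(text: str) -> str:
        """Normalize a text to be synthesized into a 'specified text':
        Unicode normalization, whitespace collapsing, removal of
        decorative punctuation, and lowercasing."""
        text = unicodedata.normalize("NFKC", text)
        text = re.sub(r"\s+", " ", text).strip()     # collapse extra spaces
        text = re.sub(r"[\"“”‘’*_#`]", "", text)     # drop decorative marks
        return text.lower()                          # consistent casing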
And step S203, performing text coding processing on the specified text based on a preset text coder to obtain a corresponding text coding vector.
In this embodiment, the text encoder may specifically be a pre-trained, Transformer-based encoder. The preprocessed specified text is converted into a token sequence by a tokenizer, and the token sequence is then input into the text encoder. The text encoder outputs a text encoding vector aligned with the semantic space of the voice marks. This text encoding vector captures the semantic information of the specified text and provides the input for the subsequent steps.
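A skeleton of such an encoder is sketched below in PyTorch; the vocabulary size, model width and depth are illustrative assumptions, and a production system would load pre-trained weights rather than train from scratch.

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        """Transformer text encoder mapping BPE token ids to text encoding
        vectors aligned with the voice-mark semantic space (positional
        encoding omitted for brevity; all sizes are illustrative)."""
        def __init__(self, vocab_size=8192, d_model=512, n_layers=6, n_heads=8):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len) ids -> (batch, seq_len, d_model) encodings
            return self.encoder(self.embed(token_ids))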
Step S204, extracting the corresponding speaker embedded vector from the initial voice.
In this embodiment, the specific implementation process of extracting the corresponding speaker embedding vector from the initial speech will be described in further detail in the following specific embodiments, and is not repeated here.
Step S205, processing the text encoding vector and the speaker embedding vector based on a preset large language model to generate a corresponding target voice mark sequence.
In this embodiment, the specific implementation process of processing the text encoding vector and the speaker embedding vector based on the preset large language model to generate the corresponding target voice mark sequence will be described in further detail in the following embodiments, and is not repeated here.
Step S206, processing the target voice mark sequence and the speaker embedding vector based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram.
In this embodiment, the target conditional flow matching model is a model generated by adjusting and optimizing an initial conditional flow matching model based on a preset cosine scheduler and a classifier-free guidance strategy. The target conditional flow matching model can generate a matching mel spectrogram from the generated voice mark sequence and the speaker embedding vector.
Step S207, the Mel spectrogram is converted based on a preset vocoder to obtain corresponding synthesized voice.
In this embodiment, the vocoder has a function of synthesizing an audio waveform based on an input mel-frequency spectrogram. The generated mel-frequency spectrogram is input into a vocoder, and the vocoder converts the mel-frequency spectrogram into a final audio waveform, so that the synthesized voice is obtained.
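A trained neural vocoder (HiFi-GAN and the like) would normally perform this step; as a self-contained stand-in that illustrates only the interface, the sketch below inverts the mel spectrogram with librosa's Griffin-Lim-based mel inversion. The sample rate and STFT parameters are assumptions and must match those used to compute the spectrogram.

    import numpy as np
    import librosa

    def mel_to_waveform(mel_db: np.ndarray, sr: int = 22050,
                        n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
        """Convert a (n_mels, frames) mel spectrogram in dB to a waveform.
        A neural vocoder would replace this Griffin-Lim-based stand-in."""
        mel_power = librosa.db_to_power(mel_db)      # dB -> power scale
        return librosa.feature.inverse.mel_to_audio(
            mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length)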
The application receives a text to be synthesized and an initial speech input by a user, and preprocesses the text to be synthesized to obtain a corresponding specified text. Text encoding is then performed on the specified text based on a preset text encoder to obtain a corresponding text encoding vector, and a corresponding speaker embedding vector is extracted from the initial speech. The text encoding vector and the speaker embedding vector are processed based on a preset large language model to generate a corresponding target voice mark sequence; the target voice mark sequence and the speaker embedding vector are then processed based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram; and finally the mel spectrogram is converted based on a preset vocoder to obtain the corresponding synthesized speech. By encoding the text with a text encoder, extracting the speaker embedding vector from the initial speech, generating the target voice mark sequence with a large language model, converting it into a mel spectrogram with the target conditional flow matching model, and synthesizing the waveform with a vocoder, the application achieves fast synthesis from text to high-quality speech, effectively improves the quality of the generated synthesized speech, and helps improve the user experience.
In some alternative implementations, step S205 includes the steps of:
And calling the large language model.
In this embodiment, the large language model is a large language model architecture suitable for sequence generation; GPT, BERT and the like may be used, for example. Specifically, the TTS (text-to-speech) task is expressed as an autoregressive voice mark generation problem for a large language model (LLM). For the large language model, sequence construction is the most important item; the sequence is constructed as [⟨S⟩, v, ȳ₁, …, ȳ_U, ⟨T⟩, μ₁, …, μ_L, ⟨E⟩], where ⟨S⟩ and ⟨E⟩ denote the beginning and the end of the sequence respectively. v is the speaker embedding vector extracted from the original speech X, obtained by a pre-trained voiceprint model. The text encoding ȳ₁, …, ȳ_U is obtained by passing the input text to be synthesized through a byte pair encoding (BPE) tokenizer and the text encoder. Since text and voice marks are at different semantic levels, the text encoder is used to align their semantic spaces and facilitate the modeling of the large language model. A start identifier ⟨T⟩ is inserted between the text encoding and the voice mark sequence μ₁, …, μ_L extracted by the supervised semantic tokenizer. During the training phase, a teacher forcing scheme is adopted, in which the left-shifted sequence is used as the model input and the original sequence as the expected output; only the voice marks and the end identifier ⟨E⟩ are considered in the training loss, to optimize the performance of the large language model. In addition, the training process of the large language model specifically comprises taking the text encoding and the speaker embedding vector as input; the large language model generates a corresponding voice mark sequence based on this input. The performance of the large language model is then evaluated using a cross entropy loss, and the parameters of the large language model are optimized by a back propagation algorithm, thereby obtaining a large language model capable of generating the corresponding voice mark sequence from the text encoding.
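The following is a minimal sketch of this sequence construction and the masked teacher-forcing loss; the tensor shapes, the special-embedding names (bos, task, eos) and the helper signatures are illustrative assumptions, not the patent's actual implementation.

    import torch
    import torch.nn.functional as F

    def build_lm_input(bos, spk_emb, text_enc, task, speech_emb, eos):
        """Assemble [<S>, v, y_1..y_U, <T>, mu_1..mu_L, <E>] along the time
        axis; every argument is a (batch, len, d_model) embedding tensor."""
        return torch.cat([bos, spk_emb, text_enc, task, speech_emb, eos], dim=1)

    def masked_ce_loss(logits, targets, speech_mask):
        """Teacher-forcing cross entropy counted only on the voice marks and
        the end identifier, as described above.
        logits: (batch, seq, vocab); targets, speech_mask: (batch, seq)."""
        per_token = F.cross_entropy(
            logits.transpose(1, 2), targets, reduction="none")
        return (per_token * speech_mask).sum() / speech_mask.sum()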
And processing the text coding vector and the speaker embedded vector based on the large language model to obtain a corresponding first voice mark sequence.
In this embodiment, the text encoding vector and the speaker embedding vector are input into the above-mentioned large language model, so that the text encoding vector and the speaker embedding vector are processed by the large language model, and the corresponding first voice markup sequence is output. Wherein the first phonetic symbol sequence may comprise phonemes, syllables or other phonetic units.
And removing repeated items from the first voice mark sequence to obtain a corresponding second voice mark sequence.
In this embodiment, by performing the duplicate term removal process on the first voice mark sequence, unnecessary duplicate elements in the generated sequence, which may be due to model errors or coding problems, are avoided. Specifically, each element in the sequence is traversed one by one starting from the first element of the sequence. Then for each element in the sequence it is checked whether it is identical to the next element. If adjacent elements are found to be identical, then a decision is made to delete one (typically the second) of them on demand. And after deleting or modifying the duplicate, updating the sequence to reflect the corresponding change.
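A minimal sketch of this adjacent-duplicate removal, keeping the first element of each run, is:

    def remove_adjacent_repeats(marks: list[int]) -> list[int]:
        """Collapse runs of identical adjacent voice marks, keeping the
        first occurrence of each run."""
        out: list[int] = []
        for mark in marks:
            if not out or mark != out[-1]:
                out.append(mark)
        return out

    # remove_adjacent_repeats([5, 5, 9, 9, 9, 2]) -> [5, 9, 2]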
And performing sequence length adjustment processing on the second voice mark sequence to obtain a corresponding third voice mark sequence.
In this embodiment, the foregoing sequence length adjustment process is performed on the second voice mark sequence to obtain a specific implementation process of the corresponding third voice mark sequence, which will be described in further detail in the following specific embodiments, which will not be described herein.
And performing sequence smoothing processing on the third voice mark sequence to obtain a corresponding fourth voice mark sequence.
In this embodiment, the sequence smoothing process includes analyzing the generated third voice mark sequence to determine the pronunciation duration of each mark. The pronunciation time of each tag is then adjusted according to linguistic rules, statistical models, or user preferences. For example, some phones may tend to have longer pronunciation times, while others are shorter. The adjusted pronunciation time information is then integrated into the sequence for use in a subsequent speech synthesis step.
And taking the fourth voice mark sequence as the target voice mark sequence.
According to the method, the large language model is called to process the text encoding vector and the speaker embedding vector to obtain the corresponding first voice mark sequence; duplicate removal, sequence length adjustment and sequence smoothing are then performed automatically and intelligently on the first voice mark sequence. This completes the automatic optimization of the voice mark sequence output by the large language model and generates the corresponding target voice mark sequence, which effectively ensures that the generated target voice mark sequence meets the requirements of the subsequent speech synthesis steps, guarantees smooth speech synthesis, and further improves the voice quality of the generated synthesized speech.
In some optional implementations of this embodiment, the performing a sequence length adjustment process on the second voice mark sequence to obtain a corresponding third voice mark sequence includes the following steps:
and acquiring the sequence length of the second voice mark sequence.
In this embodiment, the corresponding sequence length is obtained by performing length statistics on the second voice mark sequence.
And acquiring a preset target sequence length.
In this embodiment, the selection of the target sequence length is not specifically limited, and may be set according to actual service usage requirements, where the target sequence length may be a fixed length, or may be dynamically determined based on a certain rule (e.g., the same as the reference sequence length).
And judging whether the sequence length is smaller than the target sequence length.
In this embodiment, the sequence length of the generated second voice mark sequence is compared with the target sequence length to obtain a corresponding comparison result, i.e., whether the sequence length is smaller than the target sequence length or not.
If so, performing element addition processing on the second voice mark sequence based on a preset element addition strategy to obtain an element-added second voice mark sequence, and taking the element-added second voice mark sequence as the third voice mark sequence.
In this embodiment, the policy content of the element addition policy may include: if the sequence length of the generated second voice mark sequence is smaller than the target length, adding filler elements to the second voice mark sequence until it reaches the target length. Illustratively, the filler element may be a particular placeholder or a repetition of the last element of the sequence.
If not, carrying out element deletion processing on the second voice mark sequence based on a preset element deletion strategy to obtain an element deleted second voice mark sequence, and taking the element deleted second voice mark sequence as the third voice mark sequence.
In this embodiment, the policy content of the element deletion policy may include: if the sequence length of the generated second voice mark sequence is greater than the target length, deleting elements at the end of the second voice mark sequence, or selectively deleting elements by probability or importance, so that the length of the second voice mark sequence equals the target length. A minimal sketch of both strategies is given below.
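In the sketch, padding repeats the last element and truncation drops from the tail; both choices are illustrative assumptions.

    def adjust_length(marks: list[int], target_len: int) -> list[int]:
        """Pad a too-short (non-empty) voice mark sequence by repeating its
        last element; truncate a too-long one from the tail."""
        if len(marks) < target_len:
            return marks + [marks[-1]] * (target_len - len(marks))
        return marks[:target_len]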
According to the method and the device, the sequence length of the second voice mark sequence is obtained, the preset target sequence length is obtained, and then according to the comparison result of the sequence length and the target sequence length, the element addition strategy or the element deletion strategy is intelligently and accurately adopted to carry out sequence length adjustment processing corresponding to the target length on the second voice mark sequence, so that the generated second voice mark sequence is effectively ensured to meet specific requirements, and the accuracy and normalization of the obtained second voice mark sequence are ensured.
In some alternative implementations, step S204 includes the steps of:
and calling a preset voiceprint model.
In this embodiment, the voiceprint model is a pre-trained model with an acoustic feature extraction function.
The initial speech is input into the voiceprint model.
In this embodiment, the initial speech serves as the model input and is fed into the voiceprint model. The voiceprint model can extract useful acoustic features from the original speech, which may include pitch, intensity, duration and timbre, as well as a higher-level speaker embedding vector.
And carrying out vector extraction processing on the initial voice based on the voiceprint model to obtain the speaker embedded vector corresponding to the initial voice.
In this embodiment, the speaker embedding vector is an identity feature used to characterize the speaker. Through the speaker embedding vector, the model can capture the unique voice characteristics of the speaker, such as timbre, intonation and speaking rate. In addition, in speech synthesis and voice cloning applications, the speaker embedding vector can be input into the model as a condition to control the speaker style of the generated speech. For example, in a text-to-speech (TTS) task, a speech output with a particular speaker style can be generated by a model that takes the text encoding vector and the speaker embedding vector as inputs. In a voice cloning task, a new speaker embedding vector can be inferred from a small number of speech samples and then applied to a multi-speaker generation model to generate speech with that speaker's style.
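As one concrete possibility (an assumption, not the patent's specified model), a publicly available x-vector-style encoder such as SpeechBrain's ECAPA-TDNN exposes exactly this interface:

    import torch
    import torchaudio
    from speechbrain.pretrained import EncoderClassifier

    def extract_speaker_embedding(wav_path: str) -> torch.Tensor:
        """Extract a fixed-size speaker embedding vector from a reference
        utterance; the ECAPA model stands in for the pre-trained
        voiceprint model described above."""
        signal, sr = torchaudio.load(wav_path)       # (channels, samples)
        encoder = EncoderClassifier.from_hparams(
            source="speechbrain/spkrec-ecapa-voxceleb")
        with torch.no_grad():
            emb = encoder.encode_batch(signal)       # (1, 1, 192)
        return emb.squeeze()                         # (192,)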
The method comprises the steps of calling a preset voiceprint model, inputting the initial voice into the voiceprint model, and carrying out vector extraction processing on the initial voice based on the voiceprint model to obtain the speaker embedded vector corresponding to the initial voice. The application carries out vector extraction processing on the initial voice based on the use of the voiceprint model, can rapidly and accurately extract the speaker embedded vector corresponding to the initial voice, and improves the acquisition efficiency and the acquisition intelligence of the speaker embedded vector.
In some alternative implementations, before step S206, the electronic device may further perform the following steps:
An initial conditional flow matching model is obtained.
In the present embodiment, an initial conditional flow matching model is designed based on optimal transport (OT) and continuous normalizing flows (CNF) according to actual processing requirements. The training process of the initial conditional flow matching model comprises taking the generated voice mark sequence and the speaker embedding vector as conditional inputs. The target mel spectrogram distribution is then gradually generated from the prior distribution by solving an ordinary differential equation (ODE). The similarity between the generated mel spectrogram and the target distribution is measured with an optimal transport loss (such as the Wasserstein distance), and the OT-CFM parameters are optimized by a back propagation algorithm, thereby obtaining an initial conditional flow matching model meeting the requirements. The distribution of the mel spectrogram is learned with an optimal transport conditional flow matching model (OT-CFM), and samples are generated from the distribution using the generated voice marks as conditions. Compared with diffusion probability models (DPMs), OT-CFM has simpler gradients, is easier to train and generates faster, and can therefore achieve better performance. In a continuous-time normalizing flow (CNF), a probability density path is constructed from the prior distribution p₀(x) to the mel spectrogram data distribution q(x).
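The core OT-CFM training objective can be sketched in a few lines; the path parameterization below follows the standard optimal-transport formulation (x_t interpolates from noise x0 to data x1, and the network regresses the constant target velocity), with sigma_min and the shapes as illustrative assumptions:

    import torch

    def ot_cfm_loss(v_theta, x1, cond, sigma_min: float = 1e-4):
        """One OT-CFM training step: sample time t and noise x0, move along
        the optimal-transport path, and regress v_theta onto the target
        velocity x1 - (1 - sigma_min) * x0. x1 is a batch of target mel
        spectrograms; cond carries the voice marks and speaker embedding."""
        x0 = torch.randn_like(x1)                        # prior sample
        t = torch.rand(x1.shape[0], device=x1.device)    # uniform time
        t_ = t.view(-1, *([1] * (x1.dim() - 1)))
        x_t = (1 - (1 - sigma_min) * t_) * x0 + t_ * x1  # OT path point
        target = x1 - (1 - sigma_min) * x0               # constant velocity
        return ((v_theta(x_t, t, cond) - target) ** 2).mean()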
And acquiring a preset cosine scheduler.
In this embodiment, OT-CFM gradually generates the target mel spectrogram distribution from the prior distribution by solving an ordinary differential equation (ODE). In this process, a cosine scheduler may be used to control the time step: the time variable t starts from 0 and gradually increases as the iteration proceeds until it reaches a maximum value of 1, which ensures the smoothness of the mel spectrogram generation process.
And adjusting the initial conditional flow matching model based on the cosine scheduler to obtain a corresponding first conditional flow matching model.
In this embodiment, the cosine scheduler is adapted to the initial conditional flow matching model so that a smoother transition is provided in the early stage of the mel spectrogram generation process, making it easier for the model to learn the switch from the initial state to the target state.
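A cosine time schedule of this kind can be written in one line; the exact functional form below, t = 1 - cos(pi/2 · s), is one common choice and is an assumption, the key property being the small early increments:

    import math

    def cosine_timestep(step: int, total_steps: int) -> float:
        """Map a solver step index to flow time t in [0, 1]: t increases
        slowly at first (smooth transition out of the prior) and reaches
        1 at the final step."""
        return 1.0 - math.cos(0.5 * math.pi * step / total_steps)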
And optimizing the first conditional flow matching model based on a preset classifier-free guidance strategy to obtain a corresponding second conditional flow matching model.
In this embodiment, the classifier-free guiding strategy is specifically classifier-free guidance (CFG). By adapting CFG to the first conditional flow matching model, the conditions are randomly dropped with a fixed probability of 0.2 in the training phase, so that both the conditional and the unconditional flow can be learned; the in-context learning capability is further enhanced by masking 70% to 100% of the preceding feature conditions.
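Both sides of classifier-free guidance fit in a short sketch; the zero-vector stand-in for the dropped condition and the guidance scale value are assumptions:

    import torch

    def drop_condition(cond: torch.Tensor, p: float = 0.2) -> torch.Tensor:
        """Training side: with fixed probability p (0.2 above), replace the
        condition so the unconditional flow is also learned."""
        if torch.rand(()).item() < p:
            return torch.zeros_like(cond)
        return cond

    def guided_velocity(v_cond: torch.Tensor, v_uncond: torch.Tensor,
                        scale: float = 2.0) -> torch.Tensor:
        """Inference side: extrapolate from the unconditional prediction
        toward the conditional one; scale > 1 strengthens conditioning."""
        return v_uncond + scale * (v_cond - v_uncond)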
And constructing and obtaining the target conditional flow matching model based on the second conditional flow matching model.
In this embodiment, the second conditional flow matching model may be used directly as the final target conditional flow matching model. Alternatively, the second conditional flow matching model may be mask-optimized to obtain the target conditional flow matching model. Specifically, masked speech is used to train the model to learn the inherent structure and features of the speech signal: by randomly masking part of the speech frames, the model must predict or reconstruct the masked frames from the remaining unmasked frames, which forces it to learn a more robust and comprehensive speech representation. This self-supervised learning mode needs no large amount of labeled data, can make full use of unlabeled speech resources, and reduces the cost of data collection and labeling.
According to the application, the initial condition flow matching model is obtained, and then the initial condition flow matching model is adjusted and optimized based on the use of the cosine scheduler and the classifier free guidance strategy, so that the corresponding target condition flow matching model can be quickly and intelligently constructed, and the model processing efficiency of the target condition flow matching model is improved. And processing the target voice mark sequence and the speaker embedded vector based on a target conditional flow matching model, so that the quality of the obtained Mel spectrogram can be improved.
In some optional implementations of this embodiment, after step S207, the electronic device may further perform the following steps:
and optimizing the synthesized voice based on a preset quality optimization strategy to obtain a corresponding target synthesized voice.
In this embodiment, the above-mentioned optimization process is performed on the synthesized speech based on a preset quality optimization policy to obtain a specific implementation process of the corresponding target synthesized speech, which will be described in further detail in the following specific embodiments, which are not described herein.
And generating a corresponding target audio file based on the target synthesized voice.
In this embodiment, the target synthesized speech may be saved as an audio file (e.g., WAV, MP3, etc.) to obtain a corresponding target audio file.
And acquiring a preset pushing mode.
In this embodiment, the selection of the pushing manner is not specifically limited, and for example, modes such as mail sending, interface sending, and application sending may be adopted.
And pushing the target audio file to the user based on the pushing mode.
In this embodiment, the target audio file is pushed to the user by using the pushing manner, so that the user can hear the voice generated by the input text to be synthesized through conversion.
The method comprises the steps of optimizing the synthesized voice based on a preset quality optimization strategy to obtain corresponding target synthesized voice, generating a corresponding target audio file based on the target synthesized voice, obtaining a preset pushing mode, and pushing the target audio file to the user based on the pushing mode. According to the application, after the conversion processing is carried out on the Mel spectrogram based on the preset vocoder to obtain the synthesized voice, the optimization processing is carried out on the synthesized voice based on the use of the quality optimization strategy, so that the voice quality of the generated target synthesized voice is effectively improved. And generating a corresponding target audio file based on the target synthesized voice, and pushing the target audio file to the user based on the obtained pushing mode, so that the user can hear high-quality voice generated by converting the input text to be synthesized, and the user experience is improved.
In some optional implementations of this embodiment, the optimizing the synthesized speech based on a preset quality optimization policy to obtain a corresponding target synthesized speech includes the following steps:
and denoising the synthesized voice to obtain a corresponding first synthesized voice.
In the present embodiment, the above-described denoising process refers to a process of removing unnecessary background noise or interfering sound in synthesized speech. In particular, the estimated noise spectrum may be subtracted from the spectrum of the noisy speech by analyzing the spectral characteristics of the noise in the synthesized speech.
And performing volume adjustment processing on the first synthesized voice to obtain a corresponding second synthesized voice.
In this embodiment, the above-mentioned volume adjustment processing refers to a processing procedure of changing the overall loudness of the first synthesized speech. Specifically, the amplitude of the first synthesized speech may be scaled to increase or decrease the volume of the first synthesized speech. This can be achieved in particular by multiplying the first synthesized speech by a constant (gain factor). Or the gain of the audio signal is dynamically adjusted using Automatic Gain Control (AGC) to maintain a consistent volume level.
And performing range compression processing on the second synthesized voice to obtain a corresponding third synthesized voice.
In this embodiment, performing dynamic range compression on the second synthesized speech effectively reduces the difference between the highest and lowest levels in the second synthesized speech, so that it sounds smoother and more consistent, improving the auditory quality of the target synthesized speech.
And taking the third synthesized voice as the target synthesized voice.
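The three-stage pass can be illustrated with a crude waveform-domain sketch; the noise gate below merely stands in for the spectral-subtraction denoising described above, and every constant is an assumption:

    import numpy as np

    def quality_optimize(wav: np.ndarray, gain: float = 1.4,
                         threshold: float = 0.6, ratio: float = 4.0) -> np.ndarray:
        """Denoise (crude gate), adjust volume, then compress the dynamic
        range of a waveform normalized to [-1, 1]."""
        wav = np.where(np.abs(wav) < 0.01, 0.0, wav)   # 1. gate low-level noise
        wav = np.clip(wav * gain, -1.0, 1.0)           # 2. fixed-gain volume
        over = np.abs(wav) > threshold                 # 3. compress peaks
        wav[over] = np.sign(wav[over]) * (
            threshold + (np.abs(wav[over]) - threshold) / ratio)
        return wav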
According to the application, through denoising, volume adjustment and range compression processing on the synthesized voice, the intelligent and accurate optimization processing on the synthesized voice is realized, and the voice quality of the generated target synthesized voice is effectively improved.
In some alternative implementations, the user information obtained is collected with the user's consent and in compliance with relevant laws and policies.
In addition, any third-party software tools or components appearing in the embodiments of the present application are presented by way of example only and do not represent actual use.
The codec-based speech synthesis scheme consists of a large language model for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. The model adopts a conditional flow matching method which, compared with a traditional diffusion model, accelerates both training and inference. The speech synthesis system of the present application is a scalable zero-shot TTS system that combines a large language model for text-to-token generation with a conditional flow matching model for token-to-speech synthesis, eliminating the need for an additional phonemizer and forced aligner. To further improve the quality of the generated speech, speech modeling is divided into three components, semantics, speaker and prosody, by integrating an x-vector into the large language model. The large language model is responsible for modeling semantic content and prosody, while the conditional flow matching model captures timbre and environmental information. In addition, techniques such as classifier-free guidance, the cosine scheduler and masked conditions are further used to optimize the flow matching process.
In addition, the application provides a scalable multilingual speech generation system which supports zero-shot in-context learning, cross-lingual voice cloning, instructed generation and fine control over emotion and paralinguistic features. The model can be extended with an enhanced instruction-following capability. Specifically, it supports control over aspects such as speaker identity (i.e., the characteristics of the speaker), speaking style (including emotion, gender, speaking rate and pitch) and fine-grained paralinguistic features, including the ability to insert laughter and breaths, to speak while laughing, and to emphasize certain words. It also exhibits zero-shot in-context learning, allowing an arbitrary voice to be reproduced from only a brief reference speech sample. Furthermore, the speaker embedding and the mel spectrogram of the prompt speech are integrated to further enhance sound quality and environmental consistency.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
It is emphasized that to further ensure privacy and security of the synthesized speech, the synthesized speech may also be stored in a blockchain node.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain, essentially a decentralized database, is a string of data blocks generated in association using cryptographic methods; each data block contains the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of its information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Those skilled in the art will appreciate that implementing all or part of the methods described above may be accomplished by computer readable instructions stored in a computer readable storage medium which, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk or a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an artificial intelligence-based speech generating apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 3, the artificial intelligence-based speech generation apparatus 300 according to the present embodiment includes a receiving module 301, a preprocessing module 302, an encoding module 303, an extraction module 304, a first processing module 305, a second processing module 306, and a conversion module 307. Wherein:
the receiving module 301 is configured to receive a text to be synthesized and an initial speech input by a user;
the preprocessing module 302 is configured to preprocess the text to be synthesized to obtain a corresponding specified text;
the encoding module 303 is configured to perform text encoding processing on the specified text based on a preset text encoder to obtain a corresponding text encoding vector;
the extraction module 304 is configured to extract a corresponding speaker embedding vector from the initial speech;
the first processing module 305 is configured to process the text encoding vector and the speaker embedding vector based on a preset large language model to generate a corresponding target speech token sequence;
the second processing module 306 is configured to process the target speech token sequence and the speaker embedding vector based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram;
the conversion module 307 is configured to convert the mel spectrogram based on a preset vocoder to obtain a corresponding synthesized speech.
The operations performed by the above modules correspond one-to-one to the steps of the artificial intelligence-based speech generation method in the foregoing embodiment, and are not described herein again.
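For illustration only, the following minimal Python sketch shows one way the modules above could be wired together. Every name in it (preprocess, the component objects, and their method names) is a hypothetical placeholder introduced for this sketch, not an identifier from the present application or from any particular library:

    # Hypothetical end-to-end pipeline mirroring modules 301-307.
    def preprocess(text: str) -> str:
        # Stand-in for the preprocessing module: normalize whitespace only;
        # a real system would also handle punctuation, numerals, etc.
        return " ".join(text.split())

    def generate_speech(text_to_synthesize, initial_speech,
                        text_encoder, voiceprint_model,
                        llm, flow_matcher, vocoder):
        specified_text = preprocess(text_to_synthesize)             # module 302
        text_vec = text_encoder.encode(specified_text)              # module 303
        speaker_vec = voiceprint_model.embed(initial_speech)        # module 304
        token_seq = llm.generate_tokens(text_vec, speaker_vec)      # module 305
        mel = flow_matcher.tokens_to_mel(token_seq, speaker_vec)    # module 306
        return vocoder.synthesize(mel)                              # module 307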
In some optional implementations of the present embodiment, the first processing module 305 includes:
a first calling sub-module, configured to call the large language model;
a first processing sub-module, configured to process the text encoding vector and the speaker embedding vector based on the large language model to obtain a corresponding first speech token sequence;
a second processing sub-module, configured to remove duplicate items from the first speech token sequence to obtain a corresponding second speech token sequence;
a third processing sub-module, configured to perform sequence length adjustment on the second speech token sequence to obtain a corresponding third speech token sequence;
a fourth processing sub-module, configured to perform sequence smoothing on the third speech token sequence to obtain a corresponding fourth speech token sequence;
and a first determining sub-module, configured to take the fourth speech token sequence as the target speech token sequence.
The operations performed by the above sub-modules correspond one-to-one to the steps of the artificial intelligence-based speech generation method in the foregoing embodiment, and are not described herein again.
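A minimal sketch, assuming integer speech tokens, of how these sub-modules might chain together; the deduplication rule and the width-3 majority-vote smoothing are illustrative assumptions, since the application does not fix a particular strategy. The adjust_length helper is sketched after the next list:

    def postprocess_tokens(first_seq, target_len):
        # Duplicate removal: first -> second speech token sequence.
        second_seq = [t for i, t in enumerate(first_seq)
                      if i == 0 or t != first_seq[i - 1]]
        # Length adjustment: second -> third sequence (see adjust_length below).
        third_seq = adjust_length(second_seq, target_len)
        # Smoothing: third -> fourth sequence, via a majority vote over a
        # sliding window of width 3.
        fourth_seq = []
        for i in range(len(third_seq)):
            window = third_seq[max(0, i - 1): i + 2]
            fourth_seq.append(max(set(window), key=window.count))
        return fourth_seq  # the target speech token sequence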
In some optional implementations of this embodiment, the third processing sub-module includes:
a first obtaining unit, configured to obtain the sequence length of the second speech token sequence;
a second obtaining unit, configured to obtain a preset target sequence length;
a judging unit, configured to judge whether the sequence length is smaller than the target sequence length;
a first processing unit, configured to, if so, perform element addition on the second speech token sequence based on a preset element addition strategy to obtain an element-added second speech token sequence, and take the element-added second speech token sequence as the third speech token sequence;
and a second processing unit, configured to, if not, perform element deletion on the second speech token sequence based on a preset element deletion strategy to obtain an element-deleted second speech token sequence, and take the element-deleted second speech token sequence as the third speech token sequence.
The operations performed by the above units correspond one-to-one to the steps of the artificial intelligence-based speech generation method in the foregoing embodiment, and are not described herein again.
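Assuming the element addition strategy is right-padding with a pad token and the element deletion strategy is tail truncation (both assumptions; other strategies would equally fit the description above), the length adjustment could look like:

    def adjust_length(second_seq, target_len, pad_token=0):
        if len(second_seq) < target_len:
            # Element addition strategy (assumed): right-pad with pad_token.
            return second_seq + [pad_token] * (target_len - len(second_seq))
        # Element deletion strategy (assumed): truncate from the tail.
        return second_seq[:target_len]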
In some optional implementations of the present embodiment, the extraction module 304 includes:
a second calling sub-module, configured to call a preset voiceprint model;
an input sub-module, configured to input the initial speech into the voiceprint model;
and an extraction sub-module, configured to perform vector extraction on the initial speech based on the voiceprint model to obtain the speaker embedding vector corresponding to the initial speech.
The operations performed by the above sub-modules correspond one-to-one to the steps of the artificial intelligence-based speech generation method in the foregoing embodiment, and are not described herein again.
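As one possible realization of the voiceprint model (chosen here purely for illustration; the application does not name a specific model), a pretrained ECAPA-TDNN speaker encoder such as the one distributed with SpeechBrain maps a waveform to a fixed-size speaker embedding:

    import torchaudio
    from speechbrain.pretrained import EncoderClassifier

    # Load a pretrained speaker encoder (ECAPA-TDNN trained on VoxCeleb).
    encoder = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb")

    # "initial_speech.wav" is a placeholder path for the user's initial speech.
    signal, sample_rate = torchaudio.load("initial_speech.wav")
    speaker_vec = encoder.encode_batch(signal)  # a 192-dimensional embedding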
In some optional implementations of the present embodiment, the artificial intelligence-based speech generation apparatus further includes:
a first acquisition module, configured to acquire an initial conditional flow matching model;
a second acquisition module, configured to acquire a preset cosine scheduler;
an adjusting module, configured to adjust the initial conditional flow matching model based on the cosine scheduler to obtain a corresponding first conditional flow matching model;
a first optimization module, configured to optimize the first conditional flow matching model based on a preset classifier-free guidance strategy to obtain a corresponding second conditional flow matching model;
and a construction module, configured to construct the target conditional flow matching model based on the second conditional flow matching model.
The operations performed by the above modules correspond one-to-one to the steps of the artificial intelligence-based speech generation method in the foregoing embodiment, and are not described herein again.
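For intuition: a cosine scheduler warps the flow matching timestep so that integration steps are distributed non-uniformly, and classifier-free guidance blends conditional and unconditional predictions of the vector field. The sketch below uses one common formulation; the exact warp, the guidance scale, and the model call signature are assumptions rather than details fixed by the application:

    import math
    import torch

    def cosine_schedule(t: torch.Tensor) -> torch.Tensor:
        # Warp uniform timesteps t in [0, 1]; one common cosine warp.
        return 1.0 - torch.cos(t * math.pi / 2.0)

    def guided_vector_field(model, x, t, cond, guidance_scale=2.0):
        # Classifier-free guidance: push the conditional prediction away
        # from the unconditional one. model(x, t, cond) is a hypothetical
        # signature for the conditional flow matching network.
        v_cond = model(x, t, cond)
        v_uncond = model(x, t, None)
        return v_uncond + guidance_scale * (v_cond - v_uncond)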
In some optional implementations of the present embodiment, the artificial intelligence-based speech generation apparatus further includes:
a second optimization module, configured to optimize the synthesized speech based on a preset quality optimization strategy to obtain a corresponding target synthesized speech;
a generation module, configured to generate a corresponding target audio file based on the target synthesized speech;
a second acquisition module, configured to acquire a preset push mode;
and a push module, configured to push the target audio file to the user based on the push mode.
The operations performed by the above modules correspond one-to-one to the steps of the artificial intelligence-based speech generation method in the foregoing embodiment, and are not described herein again.
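As a concrete illustration of the generation module only, an optimized waveform can be written out as a WAV audio file with the soundfile library; the file name and the 24 kHz sample rate are assumptions for this sketch:

    import numpy as np
    import soundfile as sf

    # Placeholder waveform standing in for the target synthesized speech
    # (float samples in [-1, 1]); here, one second of silence at 24 kHz.
    target_speech = np.zeros(24000, dtype=np.float32)
    sf.write("target_speech.wav", target_speech, 24000)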
In some optional implementations of this embodiment, the second optimization module includes:
a denoising sub-module, configured to denoise the synthesized speech to obtain a corresponding first synthesized speech;
an adjusting sub-module, configured to perform volume adjustment on the first synthesized speech to obtain a corresponding second synthesized speech;
a compression sub-module, configured to perform dynamic range compression on the second synthesized speech to obtain a corresponding third synthesized speech;
and a second determining sub-module, configured to take the third synthesized speech as the target synthesized speech.
The operations performed by the above sub-modules correspond one-to-one to the steps of the artificial intelligence-based speech generation method in the foregoing embodiment, and are not described herein again.
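A minimal numpy sketch of this three-stage strategy, with deliberately simple stand-ins: a noise gate for denoising, peak normalization for volume adjustment, and a tanh soft limiter for dynamic range compression. Each stand-in is an assumption; a production system would use stronger techniques (spectral subtraction or a learned denoiser, loudness normalization, a proper compressor):

    import numpy as np

    def optimize_speech(synth: np.ndarray) -> np.ndarray:
        # Denoising -> first synthesized speech: naive amplitude noise gate.
        first = np.where(np.abs(synth) < 0.01, 0.0, synth)
        # Volume adjustment -> second synthesized speech: peak normalization.
        peak = max(float(np.max(np.abs(first))), 1e-9)
        second = first / peak * 0.95
        # Dynamic range compression -> third synthesized speech: soft limiter.
        third = np.tanh(2.0 * second) / np.tanh(2.0)
        return third  # the target synthesized speech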
In order to solve the above technical problems, an embodiment of the application further provides a computer device. Referring specifically to fig. 4, fig. 4 is a block diagram of the basic structure of the computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It should be noted that only a computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components need be implemented, and more or fewer components may be implemented instead. As will be appreciated by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing device. The computer device may perform human-computer interaction with a user through a keyboard, a mouse, a remote control, a touchpad, a voice control device, or the like.
The memory 41 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used to store an operating system and various application software installed on the computer device 4, such as computer-readable instructions of the artificial intelligence-based speech generation method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer-readable instructions stored in the memory 41 or to process data, for example to execute the computer-readable instructions of the artificial intelligence-based speech generation method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and is typically used to establish a communication connection between the computer device 4 and other electronic devices.
The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor, so as to cause the at least one processor to perform the steps of the artificial intelligence-based speech generation method as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or alternatively by hardware alone, though in many cases the former is preferred. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disc) and including instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application; the preferred embodiments shown in the drawings do not limit the scope of the claims. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the present application will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. Any equivalent structure made using the content of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the application.

Claims (10)

1. An artificial intelligence-based speech generation method, characterized by comprising the following steps:
receiving a text to be synthesized and an initial speech input by a user;
preprocessing the text to be synthesized to obtain a corresponding specified text;
performing text encoding processing on the specified text based on a preset text encoder to obtain a corresponding text encoding vector;
extracting a corresponding speaker embedding vector from the initial speech;
processing the text encoding vector and the speaker embedding vector based on a preset large language model to generate a corresponding target speech token sequence;
processing the target speech token sequence and the speaker embedding vector based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram; and
converting the mel spectrogram based on a preset vocoder to obtain a corresponding synthesized speech.
2. The artificial intelligence-based speech generation method according to claim 1, characterized in that the step of processing the text encoding vector and the speaker embedding vector based on the preset large language model to generate the corresponding target speech token sequence specifically comprises:
calling the large language model;
processing the text encoding vector and the speaker embedding vector based on the large language model to obtain a corresponding first speech token sequence;
removing duplicate items from the first speech token sequence to obtain a corresponding second speech token sequence;
performing sequence length adjustment on the second speech token sequence to obtain a corresponding third speech token sequence;
performing sequence smoothing on the third speech token sequence to obtain a corresponding fourth speech token sequence; and
taking the fourth speech token sequence as the target speech token sequence.
3. The artificial intelligence-based speech generation method according to claim 2, characterized in that the step of performing sequence length adjustment on the second speech token sequence to obtain the corresponding third speech token sequence specifically comprises:
obtaining the sequence length of the second speech token sequence;
obtaining a preset target sequence length;
judging whether the sequence length is smaller than the target sequence length;
if so, performing element addition on the second speech token sequence based on a preset element addition strategy to obtain an element-added second speech token sequence, and taking the element-added second speech token sequence as the third speech token sequence;
if not, performing element deletion on the second speech token sequence based on a preset element deletion strategy to obtain an element-deleted second speech token sequence, and taking the element-deleted second speech token sequence as the third speech token sequence.
4. The artificial intelligence-based speech generation method according to claim 1, characterized in that the step of extracting the corresponding speaker embedding vector from the initial speech specifically comprises:
calling a preset voiceprint model;
inputting the initial speech into the voiceprint model; and
performing vector extraction on the initial speech based on the voiceprint model to obtain the speaker embedding vector corresponding to the initial speech.
5. The artificial intelligence-based speech generation method according to claim 1, characterized in that, before the step of processing the target speech token sequence and the speaker embedding vector based on the preset target conditional flow matching model to obtain the corresponding mel spectrogram, the method further comprises:
obtaining an initial conditional flow matching model;
obtaining a preset cosine scheduler;
adjusting the initial conditional flow matching model based on the cosine scheduler to obtain a corresponding first conditional flow matching model;
optimizing the first conditional flow matching model based on a preset classifier-free guidance strategy to obtain a corresponding second conditional flow matching model; and
constructing the target conditional flow matching model based on the second conditional flow matching model.
6. The artificial intelligence-based speech generation method according to claim 1, characterized in that, after the step of converting the mel spectrogram based on the preset vocoder to obtain the corresponding synthesized speech, the method further comprises:
optimizing the synthesized speech based on a preset quality optimization strategy to obtain a corresponding target synthesized speech;
generating a corresponding target audio file based on the target synthesized speech;
obtaining a preset push mode; and
pushing the target audio file to the user based on the push mode.
7. The artificial intelligence-based speech generation method according to claim 6, characterized in that the step of optimizing the synthesized speech based on the preset quality optimization strategy to obtain the corresponding target synthesized speech specifically comprises:
denoising the synthesized speech to obtain a corresponding first synthesized speech;
performing volume adjustment on the first synthesized speech to obtain a corresponding second synthesized speech;
performing dynamic range compression on the second synthesized speech to obtain a corresponding third synthesized speech; and
taking the third synthesized speech as the target synthesized speech.
8. An artificial intelligence-based speech generation apparatus, characterized by comprising:
a receiving module, configured to receive a text to be synthesized and an initial speech input by a user;
a preprocessing module, configured to preprocess the text to be synthesized to obtain a corresponding specified text;
an encoding module, configured to perform text encoding processing on the specified text based on a preset text encoder to obtain a corresponding text encoding vector;
an extraction module, configured to extract a corresponding speaker embedding vector from the initial speech;
a first processing module, configured to process the text encoding vector and the speaker embedding vector based on a preset large language model to generate a corresponding target speech token sequence;
a second processing module, configured to process the target speech token sequence and the speaker embedding vector based on a preset target conditional flow matching model to obtain a corresponding mel spectrogram; and
a conversion module, configured to convert the mel spectrogram based on a preset vocoder to obtain a corresponding synthesized speech.
9. A computer device, characterized by comprising a memory and a processor, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the artificial intelligence-based speech generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that computer-readable instructions are stored on the computer-readable storage medium, and the computer-readable instructions, when executed by a processor, implement the steps of the artificial intelligence-based speech generation method according to any one of claims 1 to 7.
CN202411368508.7A 2024-09-27 2024-09-27 Speech generation method, device, computer equipment and medium based on artificial intelligence Pending CN119360818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411368508.7A CN119360818A (en) 2024-09-27 2024-09-27 Speech generation method, device, computer equipment and medium based on artificial intelligence


Publications (1)

Publication Number Publication Date
CN119360818A true CN119360818A (en) 2025-01-24

Family

ID=94308830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411368508.7A Pending CN119360818A (en) 2024-09-27 2024-09-27 Speech generation method, device, computer equipment and medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN119360818A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120071891A (en) * 2025-02-24 2025-05-30 中国科学院声学研究所 Zero-sample voice cloning method and device
CN119993115A (en) * 2025-02-26 2025-05-13 平安科技(深圳)有限公司 Speech generation method, device and related components based on conditional stream matching model
CN119993115B (en) * 2025-02-26 2025-11-28 平安科技(深圳)有限公司 Speech generation method and device based on conditional flow matching model and related components
CN120279887A (en) * 2025-04-23 2025-07-08 全灵(深圳)网络有限公司 Training method of Mel spectrogram generation model and voice conversion method
CN120340506A (en) * 2025-05-14 2025-07-18 智明日新(南京)人工智能科技有限公司 A method and device for generating ASR audio corpus based on multimodal large model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination