CN113825005B - Face video and audio synchronization method and system based on joint training - Google Patents
- Publication number
- CN113825005B (application CN202111159455.4A)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- video
- mouth shape
- sentence
- appointed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8547—Content authoring involving timestamps for synchronizing content
Abstract
The invention relates to a face video and audio synchronization method and system based on joint training, and a computer device. A new processing logic is adopted: based on the phoneme sequences to be processed that correspond to the pronunciation mouth shapes in a sample video, network training is performed by combining the specified type features of the preset specified phonemes in the phoneme sequence to be processed corresponding to each mouth shape video with the specified type mouth shape features corresponding to each mouth shape video, so as to obtain a mouth shape feature generation module. On this basis, the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio is obtained and used to correct the face mouth shape of the corresponding video segment in the target face video, and each sentence of Chinese speech in the target audio is loaded according to the timestamp of that video segment, thereby achieving synchronization between the target audio and the target face video. The overall design can accurately and stably synthesize the target audio and the target video and improves the quality of the resulting audio and video.
Description
Technical Field
The invention relates to a face video and audio synchronization method and system based on joint training and computer equipment, and belongs to the technical field of audio and video synthesis processing.
Background
There are many demands for generating video content in the internet and media fields. Some methods can synthesize face pictures and videos, referred to as TTA (text-to-animation), such as the ATVG method; another class of methods can synthesize sound, referred to as TTS (text-to-speech), and many such methods are used in practice. However, if TTA and TTS models are trained separately, the sound and picture easily fall out of sync, i.e. the mouth movements do not match the sound. The prior art therefore lacks a method that can accurately and stably synthesize audio and video that are generated independently of each other.
Disclosure of Invention
The invention aims to solve the technical problem of providing a face video and audio synchronization method based on joint training, which learns mouth shape features from phoneme features to obtain mouth shape features consistent with the audio, and then synchronizes the audio with the face mouth shape video by correction.
The invention adopts the following technical scheme for solving the technical problems: the invention designs a face video and audio synchronization method based on joint training, which comprises the following steps of I to V, generating a mouth shape feature generating module, and then applying the mouth shape feature generating module to obtain the synchronization between a target audio and a target face video according to the following steps A to C;
step I, acquiring each section of audio in the sample video, acquiring each video to be selected corresponding to each section of audio respectively, and then entering step II;
Step II, respectively segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese voice in the audio and the time stamp corresponding to each Chinese voice, further obtaining the phoneme sequence corresponding to each sentence of Chinese voice and the time stamp corresponding to each phoneme sequence, and combining to form the phoneme sequence group corresponding to the video to be selected; obtaining phoneme sequence groups corresponding to the videos to be selected respectively, and then entering a step III;
Step III, for each video to be selected, according to the phoneme sequence group corresponding to the video to be selected, segmenting the video into mouth shape videos based on the different pronunciation mouth shapes of adjacent frames, and obtaining the phoneme sequence to be processed corresponding to each mouth shape video; after the mouth shape videos corresponding to all the videos to be selected and the phoneme sequences to be processed corresponding to the mouth shape videos are obtained, entering step IV;
Step IV, obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, obtaining appointed type mouth shape characteristics, which are respectively corresponding to each mouth shape video, and then entering step V;
step V, according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, taking the appointed type mouth shape characteristics corresponding to the mouth shape video as output, and training aiming at a preset appointed network to form a mouth shape characteristic generating module;
Step A, according to the step I to the step IV, obtaining phoneme sequences corresponding to each sentence of Chinese voice in the target audio, obtaining appointed type characteristics of each phoneme in each phoneme sequence, and then entering the step B;
B, according to the appointed type characteristics of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, taking the appointed type characteristics of each phoneme as input, applying a mouth shape characteristic generating module to obtain the appointed type mouth shape characteristics corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, further obtaining the appointed type mouth shape characteristic sequences corresponding to each sentence of Chinese speech in the target audio, and then entering the step C;
and C, correcting the face mouth shape of the corresponding video segment in the target face video corresponding to each sentence of Chinese voice according to the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese voice in the target audio and the time stamp corresponding to the target face video corresponding to each sentence of Chinese voice in the preset target audio, and loading each sentence of Chinese voice according to the time stamp corresponding to the target face video so as to realize the synchronization between the target audio and the target face video.
As a preferred technical scheme of the invention: the specified type feature of the phoneme is embedding features of the phoneme or one-hot features of the phoneme.
As a preferred technical scheme of the invention: in the step IV, the following operations are executed for the phoneme sequences to be processed corresponding to each mouth shape video respectively, so as to obtain the specified type characteristics of the preset specified phonemes in the phoneme sequences to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained.
As a preferred technical scheme of the invention: in the step IV, the following operations are performed for the to-be-processed phoneme sequences corresponding to the mouth-shaped videos respectively to obtain the specific type characteristics of the preset specific phonemes in the to-be-processed phoneme sequences corresponding to the mouth-shaped videos; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained.
As a preferred technical scheme of the invention: obtaining embedding characteristics of phonemes according to the following steps a to c;
Step a, according to the length of each section of audio in the sample video, combined with the timestamps of the pinyin labeling of the initials and finals within each sentence, pronunciation-level labeling is performed for each Chinese sentence according to a preset initial-to-final duration ratio, and face alignment is applied to perform phoneme labeling on the pronunciation-level labeling, so as to obtain the labeled phoneme sequence corresponding to each Chinese sentence;
B, applying a phoneme coding network layer in the TTS, aiming at each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, converting each labeling phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c;
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
As a preferred technical scheme of the invention: in the step V, according to the specified type of mouth shape feature corresponding to each mouth shape video and the specified type of feature of the phoneme sequence to be processed corresponding to each mouth shape video, the specified type of feature of the phoneme sequence to be processed corresponding to the mouth shape video is taken as input, the specified type of mouth shape feature corresponding to the mouth shape video is taken as output, and training is performed aiming at the GAN network to form a mouth shape feature generating module.
Correspondingly, the invention also aims to provide a system for the face video and audio synchronization method based on joint training, which learns mouth shape features from phoneme features to obtain mouth shape features consistent with the audio, and then synchronizes the audio with the face mouth shape video by correction.
The invention adopts the following technical scheme for solving the technical problems: the invention designs a system of a human face video and audio synchronization method based on joint training, which comprises an audio and video separation and segmentation module, a phoneme sequence obtaining module, a mouth shape video segmentation module, a feature extraction module, a model training module, a model execution module and a correction loading module;
The audio-video separation and segmentation module is used for acquiring each section of audio in the sample video and acquiring the video to be selected corresponding to each section of audio, and for segmenting, for each video to be selected, the audio corresponding to the video to be selected to obtain each sentence of Chinese speech in the audio and the timestamp corresponding to each sentence of Chinese speech;
The phoneme sequence obtaining module is used for obtaining phoneme sequences corresponding to each sentence of Chinese voice respectively and time stamps corresponding to each phoneme sequence respectively, and combining to form a phoneme sequence group corresponding to the video to be selected; further obtaining a phoneme sequence group corresponding to each video to be selected respectively;
the mouth shape video segmentation module is used for, for each video to be selected, segmenting the video into mouth shape videos according to the phoneme sequence group corresponding to the video to be selected and the different pronunciation mouth shapes of adjacent frames, and obtaining the phoneme sequence to be processed corresponding to each mouth shape video;
The feature extraction module is used for obtaining the appointed type features of preset appointed phonemes in the phoneme sequences to be processed, which are respectively corresponding to the mouth shape videos, and obtaining the appointed type mouth shape features, which are respectively corresponding to the mouth shape videos;
The model training module is used for training aiming at a preset appointed network according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video by taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input and the appointed type mouth shape characteristics corresponding to the mouth shape video as output to form a mouth shape characteristic generating module;
the model execution module is used for obtaining the appointed type mouth shape characteristic corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio by taking the appointed type characteristic of each phoneme as input and applying the mouth shape characteristic generation module according to the appointed type characteristic of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, so as to obtain the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio;
The correction loading module is used for correcting the face mouth shape of the corresponding video segment in the target face video according to the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio and the timestamp corresponding to each sentence of Chinese speech in the target face video, and for loading each sentence of Chinese speech according to the timestamp of the corresponding video segment, so as to realize synchronization between the target audio and the target face video.
The invention designs a computer device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of a face video and audio synchronization method based on joint training when executing the computer program.
And designing a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a face video and audio synchronization method based on joint training.
Compared with the prior art, the face video and audio synchronization method based on the joint training has the following technical effects:
The invention designs a face video and audio synchronization method based on joint training that adopts a new processing logic: based on the phoneme sequences to be processed that correspond to the pronunciation mouth shapes in a sample video, network training is performed by combining the specified type features of the preset specified phonemes in the phoneme sequence to be processed corresponding to each mouth shape video with the specified type mouth shape features corresponding to each mouth shape video, so as to obtain a mouth shape feature generation module; on this basis, the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio is obtained and used to correct the face mouth shape of the corresponding video segment in the target face video, and each sentence of Chinese speech in the target audio is loaded according to the timestamp of that video segment, thereby achieving synchronization between the target audio and the target face video; the overall design can accurately and stably synthesize the target audio and the target video and improves the quality of the resulting audio and video.
Drawings
FIG. 1 is a flow chart of a method for synchronizing synthetic face and synthetic voice based on joint training according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
The invention designs a face video and audio synchronization method based on joint training, which is shown in fig. 1, and generates a mouth shape feature generation module through the following steps I to V.
And I, acquiring each section of audio in the sample video, acquiring each video to be selected corresponding to each section of audio respectively, and then entering the step II.
Step II, respectively segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese voice in the audio and the time stamp corresponding to each Chinese voice, further obtaining the phoneme sequence corresponding to each sentence of Chinese voice and the time stamp corresponding to each phoneme sequence, and combining to form the phoneme sequence group corresponding to the video to be selected; and further obtaining phoneme sequence groups corresponding to the videos to be selected respectively, and then entering the step III.
Step III, for each video to be selected, according to the phoneme sequence group corresponding to the video to be selected, the video is segmented into mouth shape videos based on the different pronunciation mouth shapes of adjacent frames, and the phoneme sequence to be processed corresponding to each mouth shape video is obtained; after the mouth shape videos corresponding to all the videos to be selected and the phoneme sequences to be processed corresponding to the mouth shape videos are obtained, step IV is entered.
And IV, obtaining the appointed type characteristics of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, obtaining the appointed type mouth shape characteristics, which are respectively corresponding to each mouth shape video, and then entering the step V.
In practice, 1 s of audio carries 100 frames of phoneme feature labels, of which about 30 frames are landmark key phonemes, i.e. landmark key phonemes : phonemes ≈ 1:3. Specifically, the specified type feature of a phoneme is the embedding feature of the phoneme or the one-hot feature of the phoneme. In step IV, if the specified type feature of the phoneme is the embedding feature of the phoneme, the following operation 1 is executed for the phoneme sequence to be processed corresponding to each mouth shape video, so as to obtain the specified type feature of the preset specified phoneme in that phoneme sequence; the specified type features of the preset specified phonemes in the phoneme sequences to be processed corresponding to all mouth shape videos are thereby obtained.
Operation 1: obtain each landmark key phoneme in the phoneme sequence to be processed corresponding to the mouth shape video; select a preset number of phonemes before and after each landmark key phoneme as candidate phonemes; select the candidate phoneme with the highest frequency as the preset specified phoneme corresponding to the mouth shape video; then obtain the embedding feature of the preset specified phoneme corresponding to the mouth shape video.
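A minimal sketch of operation 1, assuming the phoneme sequence is a list of phoneme labels and the landmark key phoneme positions are given as indices into that list; the window size k, the function name and the example data are illustrative only.

from collections import Counter

def select_specified_phoneme(phonemes, landmark_positions, k=2):
    # Operation 1 sketch: collect the k phonemes before and after each
    # landmark key phoneme as candidates, then return the most frequent one.
    candidates = []
    for pos in landmark_positions:
        lo = max(0, pos - k)
        hi = min(len(phonemes), pos + k + 1)
        candidates.extend(phonemes[lo:hi])
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]

# Hypothetical example for one mouth shape video
phonemes = ["sh", "i4", "i4", "j", "ie4"]
landmarks = [1, 2]  # indices of landmark key phonemes
print(select_specified_phoneme(phonemes, landmarks, k=1))  # -> "i4"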
Meanwhile, a second manner of obtaining the specified type feature of a phoneme is designed: if the specified type feature of the phoneme is the embedding feature or the one-hot feature of the phoneme, in step IV, the following operation 2 is executed for the phoneme sequence to be processed corresponding to each mouth shape video, so as to obtain the specified type feature of the preset specified phoneme in that phoneme sequence; the specified type features of the preset specified phonemes in the phoneme sequences to be processed corresponding to all mouth shape videos are thereby obtained.
Operation 2: obtain each landmark key phoneme in the phoneme sequence to be processed corresponding to the mouth shape video; multiply the position of each landmark key phoneme by a preset parameter value and round the result to obtain each position, and take the phonemes at those positions as candidate phonemes; then select the candidate phoneme with the highest frequency as the preset specified phoneme corresponding to the mouth shape video; finally obtain the specified type feature of the preset specified phoneme corresponding to the mouth shape video.
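A minimal sketch of operation 2, assuming the landmark key phoneme positions are indices that are mapped into the phoneme sequence by the preset parameter value; the default value 100/30 only reflects the roughly 1:3 landmark-to-phoneme ratio mentioned above and, like the function name, is an assumption.

from collections import Counter

def select_specified_phoneme_v2(phonemes, landmark_positions, scale=100 / 30):
    # Operation 2 sketch: scale each landmark key phoneme position by the
    # preset parameter value, round to the nearest index, take the phonemes
    # at the resulting positions as candidates, and return the most frequent.
    candidates = []
    for pos in landmark_positions:
        idx = int(round(pos * scale))
        idx = min(max(idx, 0), len(phonemes) - 1)  # clamp to a valid index
        candidates.append(phonemes[idx])
    return Counter(candidates).most_common(1)[0][0] if candidates else None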
In practical application, embedding features of the phonemes are obtained as follows in steps a to c.
Step a: according to the length of each section of audio in the sample video, combined with the timestamps of the pinyin labeling of the initials and finals within each sentence, pronunciation-level labeling is performed for each Chinese sentence according to a preset initial-to-final duration ratio, such as 1:2 or 1:1, and face alignment is applied to perform phoneme labeling on the pronunciation-level labeling, so as to obtain the labeled phoneme sequence corresponding to each Chinese sentence.
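A minimal sketch of how one pinyin syllable's time span can be split between its initial and final by the preset duration ratio; the timestamps, the example syllable and the function name are illustrative only.

def split_syllable(start, end, initial, final, ratio=(1, 2)):
    # Split one syllable's time span between its initial and final
    # according to a preset initial-to-final duration ratio.
    cut = start + (end - start) * ratio[0] / (ratio[0] + ratio[1])
    return [(initial, start, cut), (final, cut, end)]

# Hypothetical example: the syllable "ma1" spans 0.00-0.30 s, ratio 1:2
print(split_syllable(0.00, 0.30, "m", "a1"))
# -> approximately [('m', 0.0, 0.1), ('a1', 0.1, 0.3)]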
And b, applying a phoneme coding network layer (phone embedding network layer) in the TTS, aiming at each marked phoneme in the marked phoneme sequence corresponding to each sentence Chinese sentence, converting each marked phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each marked phoneme in the marked phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c.
In practice, the conversion of the labeling phonemes into a phoneme-encoded TTS representation is implemented as follows, using a phoneme-encoding network layer (phone embedding network layer) in the TTS.
# === hparams ===  (assumes: import torch.nn as nn; from math import sqrt)
# n_symbols = len(symbols)        # vocabulary size, 320 in total
# symbols_embedding_dim = 512

# Initialization
self.embedding = nn.Embedding(
    hparams.n_symbols, hparams.symbols_embedding_dim)
std = sqrt(2.0 / (hparams.n_symbols + hparams.symbols_embedding_dim))
val = sqrt(3.0) * std  # uniform bounds derived from std
self.embedding.weight.data.uniform_(-val, val)

# Use
embedded_inputs = self.embedding(text_inputs).transpose(1, 2)
In practical implementation, examples of the conversion from labeled phonemes to phoneme-encoded TTS representations are listed below, in the form "labeled phoneme: phoneme-encoded TTS representation":
uei1:ui1
uei2:ui2
uei3:ui3
uei4:ui4
uei5:ui5
ng:ng1
iou1:ou1
iou2:ou2
iou3:ou3
iou4:ou4
ii1:i1
ii2:i2
ii3:i3
ii4:i4
ii5:i5
io1:o1
io2:o2
io3:o3
io4:o4
io5:o5
va1:a1
va2:a2
va3:a3
va4:a4
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
In practical implementation, examples of the conversion from phoneme-encoded TTS representations to phoneme embedding features are listed below, in the form "phoneme-encoded TTS representation: phoneme embedding feature"; a small embedding-lookup sketch follows the list:
@iao5:@iao1
@iu5:@iu1
@ng1:@n
@ua5:@ua1
@uai5:@uai1
@ue5:@ue1
@ui5:@ui1
@v1:@v2
@v5:@v2
@ve1:@ve4
@ve2:@ve4
@ve3:@ve4
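After the mapping above, obtaining the embedding feature of a labeled phoneme in step c amounts to looking up the corresponding row of the trained embedding layer. A minimal sketch under that assumption; `model.embedding` refers to the nn.Embedding layer shown earlier, and `symbol_to_id`, which maps a converted phoneme symbol to its vocabulary id, is a hypothetical helper.

import torch

def phoneme_embedding(model, symbol_to_id, symbol):
    # Look up the embedding feature (e.g. a 512-dim vector) of one phoneme symbol.
    idx = torch.tensor([symbol_to_id[symbol]])
    with torch.no_grad():
        return model.embedding(idx).squeeze(0)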
And V, according to the specified type mouth shape characteristics corresponding to each mouth shape video and the specified type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the specified type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, taking the specified type mouth shape characteristics corresponding to the mouth shape video as output, and training aiming at a preset specified network such as a GAN network to form a mouth shape characteristic generating module.
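A minimal sketch of the training in step V with a GAN as the preset specified network, assuming the specified type features of the phonemes are 512-dimensional embedding vectors and the specified type mouth shape features are 40-dimensional vectors; the dimensions, network sizes and optimizer settings are assumptions, not values given by the invention.

import torch
import torch.nn as nn

PHONE_DIM, MOUTH_DIM = 512, 40  # assumed feature sizes

# Generator: phoneme feature -> mouth shape feature; this becomes the
# mouth shape feature generation module once training converges.
G = nn.Sequential(nn.Linear(PHONE_DIM, 256), nn.ReLU(), nn.Linear(256, MOUTH_DIM))
# Discriminator: judges (phoneme feature, mouth shape feature) pairs.
D = nn.Sequential(nn.Linear(PHONE_DIM + MOUTH_DIM, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(phone_feat, mouth_feat):
    # phone_feat: (B, 512) specified type phoneme features (input)
    # mouth_feat: (B, 40) specified type mouth shape features (target output)
    real = torch.ones(phone_feat.size(0), 1)
    fake = torch.zeros(phone_feat.size(0), 1)
    # Discriminator step: real pairs vs. generated pairs.
    opt_d.zero_grad()
    d_loss = bce(D(torch.cat([phone_feat, mouth_feat], 1)), real) + \
             bce(D(torch.cat([phone_feat, G(phone_feat).detach()], 1)), fake)
    d_loss.backward()
    opt_d.step()
    # Generator step: make generated mouth shape features look real.
    opt_g.zero_grad()
    g_loss = bce(D(torch.cat([phone_feat, G(phone_feat)], 1)), real)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()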
Based on the acquisition of the mouth shape feature generation module, the mouth shape feature generation module can be applied, as shown in fig. 1, and the synchronization between the target audio and the target face video is obtained according to the following steps a to C.
Step A, according to the step I to the step IV, obtaining phoneme sequences corresponding to each sentence of Chinese voice in the target audio, obtaining the appointed type characteristics of each phoneme in each phoneme sequence, and then entering the step B.
And B, according to the appointed type characteristics of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, taking the appointed type characteristics of each phoneme as input, applying a mouth shape characteristic generating module to obtain the appointed type mouth shape characteristics corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, further obtaining the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio, and then entering the step C.
And C, correcting the face mouth shape of the corresponding video segment in the target face video corresponding to each sentence of Chinese voice according to the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese voice in the target audio and the time stamp corresponding to the target face video corresponding to each sentence of Chinese voice in the preset target audio, and loading each sentence of Chinese voice according to the time stamp corresponding to the target face video so as to realize the synchronization between the target audio and the target face video.
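A minimal sketch of the timestamp alignment in step C, assuming 25 fps video frames, sentence timestamps in seconds, and a hypothetical helper apply_mouth_features that actually deforms the mouth region of one frame; it only illustrates how one sentence's specified type mouth shape feature sequence is spread over the corresponding video segment.

def correct_segment(frames, mouth_feats, start_s, end_s, fps=25):
    # Spread one sentence's mouth shape feature sequence over the frames of the
    # corresponding video segment and correct each frame's face mouth shape.
    first, last = int(start_s * fps), int(end_s * fps)
    seg_len = max(last - first, 1)
    for i in range(first, min(last, len(frames))):
        # Choose the mouth shape feature whose relative position matches this frame.
        j = (i - first) * len(mouth_feats) // seg_len
        frames[i] = apply_mouth_features(frames[i], mouth_feats[j])  # hypothetical helper
    return frames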
The face video and audio synchronization method based on joint training designed by the invention is applied in practice, and a face video and audio synchronization system based on joint training is further specifically designed, which comprises an audio-video separation and segmentation module, a phoneme sequence obtaining module, a mouth shape video segmentation module, a feature extraction module, a model training module, a model execution module and a correction loading module.
The audio-video separation and segmentation module is used for acquiring each section of audio in the sample video and acquiring the video to be selected corresponding to each section of audio, and for segmenting, for each video to be selected, the audio corresponding to the video to be selected to obtain each sentence of Chinese speech in the audio and the timestamp corresponding to each sentence of Chinese speech.
The phoneme sequence obtaining module is used for obtaining phoneme sequences corresponding to each sentence of Chinese voice respectively and time stamps corresponding to each phoneme sequence respectively, and combining to form a phoneme sequence group corresponding to the video to be selected; and further obtaining a phoneme sequence group corresponding to each video to be selected.
The mouth shape video segmentation module is used for, for each video to be selected, segmenting the video into mouth shape videos according to the phoneme sequence group corresponding to the video to be selected and the different pronunciation mouth shapes of adjacent frames, and obtaining the phoneme sequence to be processed corresponding to each mouth shape video.
The feature extraction module is used for obtaining the appointed type features of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, and obtaining the appointed type mouth shape features, which are respectively corresponding to each mouth shape video.
The model training module is used for training aiming at a preset appointed network according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, and taking the appointed type mouth shape characteristics corresponding to the mouth shape video as output to form the mouth shape characteristic generating module.
The model execution module is used for obtaining the appointed type mouth shape characteristic corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio by taking the appointed type characteristic of each phoneme as input and applying the mouth shape characteristic generation module according to the appointed type characteristic of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, so as to obtain the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio.
The correction loading module is used for correcting the face mouth shape of the corresponding video segment in the target face video according to the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio and the timestamp corresponding to each sentence of Chinese speech in the target face video, and for loading each sentence of Chinese speech according to the timestamp of the corresponding video segment, so as to realize synchronization between the target audio and the target face video.
The invention designs a computer device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of a face video and audio synchronization method based on joint training when executing the computer program.
And designing a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a face video and audio synchronization method based on joint training.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of one skilled in the art without departing from the spirit of the present invention.
Any modifications or variations, which are apparent to those skilled in the art in light of the above teachings, are intended to be included within the scope of this invention without departing from its spirit.
Claims (5)
1. A face video and audio synchronization method based on joint training is characterized in that a mouth shape feature generation module is generated through the following steps I to V, and then the mouth shape feature generation module is applied to obtain the synchronization between a target audio and a target face video according to the following steps A to C;
step I, acquiring each section of audio in the sample video, acquiring each video to be selected corresponding to each section of audio respectively, and then entering step II;
Step II, respectively segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese voice in the audio and the time stamp corresponding to each Chinese voice, further obtaining the phoneme sequence corresponding to each sentence of Chinese voice and the time stamp corresponding to each phoneme sequence, and combining to form the phoneme sequence group corresponding to the video to be selected; obtaining phoneme sequence groups corresponding to the videos to be selected respectively, and then entering a step III;
Step III, respectively aiming at each video to be selected, dividing and obtaining each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video according to different pronunciation mouth shapes of adjacent frames according to a phoneme sequence group corresponding to the video to be selected; obtaining all mouth shape videos corresponding to all the video to be selected and phoneme sequences to be processed corresponding to all the mouth shape videos respectively, and then entering the step IV;
Step IV, obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, obtaining appointed type mouth shape characteristics, which are respectively corresponding to each mouth shape video, and then entering step V;
step V, according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, taking the appointed type mouth shape characteristics corresponding to the mouth shape video as output, and training aiming at a preset appointed network to form a mouth shape characteristic generating module;
Step A, according to the step I to the step IV, obtaining phoneme sequences corresponding to each sentence of Chinese voice in the target audio, obtaining appointed type characteristics of each phoneme in each phoneme sequence, and then entering the step B;
B, according to the appointed type characteristics of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, taking the appointed type characteristics of each phoneme as input, applying a mouth shape characteristic generating module to obtain the appointed type mouth shape characteristics corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, further obtaining the appointed type mouth shape characteristic sequences corresponding to each sentence of Chinese speech in the target audio, and then entering the step C;
Step C, according to the appointed type mouth feature sequences respectively corresponding to each sentence of Chinese voice in the target audio and the time stamps respectively corresponding to each sentence of Chinese voice in the preset target audio and the target face video, correcting the face mouth shape of the corresponding video segment in the target face video respectively corresponding to each sentence of Chinese voice, and loading each sentence of Chinese voice according to the time stamp corresponding to the target face video, so that the synchronization between the target audio and the target face video is realized;
In the step IV, the following operations are executed for the phoneme sequences to be processed corresponding to each mouth shape video respectively, so as to obtain the specified type characteristics of the preset specified phonemes in the phoneme sequences to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained;
In the step IV, the following operations are performed for the to-be-processed phoneme sequences corresponding to the mouth-shaped videos respectively to obtain the specific type characteristics of the preset specific phonemes in the to-be-processed phoneme sequences corresponding to the mouth-shaped videos; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained;
Obtaining embedding characteristics of phonemes according to the following steps a to c;
Step a, according to the length of each section of audio in the sample video, the time stamp of the corresponding pinyin marking of the inner initial consonant and vowel of each sentence is combined, according to the preset initial consonant and vowel duration ratio, the pronunciation level marking is executed for each sentence of Chinese sentence, and the face alignment is applied to carry out phoneme marking for the pronunciation level marking, so as to obtain the marked phoneme sequence corresponding to each sentence of Chinese sentence respectively;
B, applying a phoneme coding network layer in the TTS, aiming at each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, converting each labeling phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c;
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
2. The method for synchronizing facial video and audio based on joint training according to claim 1, wherein: in the step V, according to the specified type of mouth shape feature corresponding to each mouth shape video and the specified type of feature of the phoneme sequence to be processed corresponding to each mouth shape video, the specified type of feature of the phoneme sequence to be processed corresponding to the mouth shape video is taken as input, the specified type of mouth shape feature corresponding to the mouth shape video is taken as output, and training is performed aiming at the GAN network to form a mouth shape feature generating module.
3. A system for implementing the face video and audio synchronization method based on joint training according to claim 1 or 2, which is characterized by comprising an audio/video separation and segmentation module, a phoneme sequence obtaining module, a mouth shape video segmentation module, a feature extraction module, a model training module, a model execution module and a correction loading module;
The audio-video separation and segmentation module is used for acquiring each section of audio in the sample video and acquiring the video to be selected corresponding to each section of audio, and for segmenting, for each video to be selected, the audio corresponding to the video to be selected to obtain each sentence of Chinese speech in the audio and the timestamp corresponding to each sentence of Chinese speech;
The phoneme sequence obtaining module is used for obtaining phoneme sequences corresponding to each sentence of Chinese voice respectively and time stamps corresponding to each phoneme sequence respectively, and combining to form a phoneme sequence group corresponding to the video to be selected; further obtaining a phoneme sequence group corresponding to each video to be selected respectively;
the mouth shape video segmentation module is used for, for each video to be selected, segmenting the video into mouth shape videos according to the phoneme sequence group corresponding to the video to be selected and the different pronunciation mouth shapes of adjacent frames, and obtaining the phoneme sequence to be processed corresponding to each mouth shape video;
The feature extraction module is used for obtaining the appointed type features of preset appointed phonemes in the phoneme sequences to be processed, which are respectively corresponding to the mouth shape videos, and obtaining the appointed type mouth shape features, which are respectively corresponding to the mouth shape videos;
The model training module is used for training aiming at a preset appointed network according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video by taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input and the appointed type mouth shape characteristics corresponding to the mouth shape video as output to form a mouth shape characteristic generating module;
the model execution module is used for obtaining the appointed type mouth shape characteristic corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio by taking the appointed type characteristic of each phoneme as input and applying the mouth shape characteristic generation module according to the appointed type characteristic of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, so as to obtain the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio;
The correction loading module is used for correcting the face mouth shape of the corresponding video segment in the target face video according to the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio and the timestamp corresponding to each sentence of Chinese speech in the target face video, and for loading each sentence of Chinese speech according to the timestamp of the corresponding video segment, so as to realize synchronization between the target audio and the target face video;
The specific type features of the phonemes are embedding features of the phonemes, and the feature extraction module performs the following operations respectively for the sequences of the phonemes to be processed corresponding to each mouth shape video to obtain the specific type features of the preset specific phonemes in the sequences of the phonemes to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained;
The specific type feature of the phonemes is embedding features of the phonemes or one-hot features, and the feature extraction module performs the following operations respectively for the sequences of the phonemes to be processed corresponding to each mouth shape video to obtain the specific type feature of the preset specific phonemes in the sequences of the phonemes to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained;
Obtaining embedding characteristics of phonemes according to the following steps a to c;
Step a, according to the length of each section of audio in the sample video, the time stamp of the corresponding pinyin marking of the inner initial consonant and vowel of each sentence is combined, according to the preset initial consonant and vowel duration ratio, the pronunciation level marking is executed for each sentence of Chinese sentence, and the face alignment is applied to carry out phoneme marking for the pronunciation level marking, so as to obtain the marked phoneme sequence corresponding to each sentence of Chinese sentence respectively;
B, applying a phoneme coding network layer in the TTS, aiming at each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, converting each labeling phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c;
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
4. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of claim 1 or 2 when executing the computer program.
5. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of claim 1 or 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111159455.4A CN113825005B (en) | 2021-09-30 | 2021-09-30 | Face video and audio synchronization method and system based on joint training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111159455.4A CN113825005B (en) | 2021-09-30 | 2021-09-30 | Face video and audio synchronization method and system based on joint training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113825005A CN113825005A (en) | 2021-12-21 |
CN113825005B true CN113825005B (en) | 2024-05-24 |
Family
ID=78919849
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111159455.4A Active CN113825005B (en) | 2021-09-30 | 2021-09-30 | Face video and audio synchronization method and system based on joint training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113825005B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115065844B (en) * | 2022-05-24 | 2023-09-12 | 北京跳悦智能科技有限公司 | Self-adaptive adjustment method for motion rhythm of anchor limb |
CN114945110B (en) * | 2022-05-31 | 2023-10-24 | 深圳市优必选科技股份有限公司 | Method and device for synthesizing voice head video, terminal equipment and readable storage medium |
CN116992309B (en) * | 2023-09-26 | 2023-12-19 | 苏州青颖飞帆软件科技股份有限公司 | Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium |
CN118471250B (en) * | 2024-06-20 | 2025-02-07 | 北京林业大学 | A method for automatically generating lip shape and expression by inputting speech |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110149548A (en) * | 2018-09-26 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Video dubbing method, electronic device and readable storage medium storing program for executing |
WO2019219968A1 (en) * | 2018-05-18 | 2019-11-21 | Deepmind Technologies Limited | Visual speech recognition by phoneme prediction |
CN111741326A (en) * | 2020-06-30 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Video synthesis method, device, equipment and storage medium |
CN112750185A (en) * | 2021-01-19 | 2021-05-04 | 清华大学 | Portrait video generation method and device, electronic equipment and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7046300B2 (en) * | 2002-11-29 | 2006-05-16 | International Business Machines Corporation | Assessing consistency between facial motion and speech signals in video |
US7133535B2 (en) * | 2002-12-21 | 2006-11-07 | Microsoft Corp. | System and method for real time lip synchronization |
US20080111887A1 (en) * | 2006-11-13 | 2008-05-15 | Pixel Instruments, Corp. | Method, system, and program product for measuring audio video synchronization independent of speaker characteristics |
US8224652B2 (en) * | 2008-09-26 | 2012-07-17 | Microsoft Corporation | Speech and text driven HMM-based body animation synthesis |
EP3438952B1 (en) * | 2017-08-02 | 2025-05-07 | Tata Consultancy Services Limited | Systems and methods for intelligent generation of inclusive system designs |
US11526808B2 (en) * | 2019-05-29 | 2022-12-13 | The Board Of Trustees Of The Leland Stanford Junior University | Machine learning based generation of ontology for structural and functional mapping |
-
2021
- 2021-09-30 CN CN202111159455.4A patent/CN113825005B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019219968A1 (en) * | 2018-05-18 | 2019-11-21 | Deepmind Technologies Limited | Visual speech recognition by phoneme prediction |
CN110149548A (en) * | 2018-09-26 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Video dubbing method, electronic device and readable storage medium storing program for executing |
CN111741326A (en) * | 2020-06-30 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Video synthesis method, device, equipment and storage medium |
CN112750185A (en) * | 2021-01-19 | 2021-05-04 | 清华大学 | Portrait video generation method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Research on automatic recognition and verification of lip synchronization; Hou Yarong, Xiong Zhang; Computer Engineering and Design; 2004-02-28 (No. 02); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113825005A (en) | 2021-12-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113825005B (en) | Face video and audio synchronization method and system based on joint training | |
US11837216B2 (en) | Speech recognition using unspoken text and speech synthesis | |
CN111954903B (en) | Multi-speaker neuro-text-to-speech synthesis | |
US11990117B2 (en) | Using speech recognition to improve cross-language speech synthesis | |
US11823697B2 (en) | Improving speech recognition with speech synthesis-based model adapation | |
CN112489618A (en) | Neural Text-to-Speech Synthesis Using Multilevel Contextual Features | |
CN111627420A (en) | Specific-speaker emotion voice synthesis method and device under extremely low resources | |
CN112365878A (en) | Speech synthesis method, device, equipment and computer readable storage medium | |
CN111916054B (en) | Lip-based voice generation method, device and system and storage medium | |
US10521945B2 (en) | Text-to-articulatory movement | |
US20240153484A1 (en) | Massive multilingual speech-text joint semi-supervised learning for text-to-speech | |
US12400638B2 (en) | Using aligned text and speech representations to train automatic speech recognition models without transcribed speech data | |
Barve et al. | Multi-language audio-visual content generation based on generative adversarial networks | |
Barve et al. | Synchronized Speech and Video Synthesis | |
Choi et al. | Label Embedding for Chinese Grapheme-to-Phoneme Conversion. | |
US20250078805A1 (en) | Scaling Multilingual Speech Synthesis with Zero Supervision of Found Data | |
Anand et al. | ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams | |
US20250279087A1 (en) | Speech-text prompting for speech tasks | |
Huang et al. | A Multimodal Learning Approach for Translating Live Lectures into MOOCs Materials | |
CN113936627B (en) | Model training methods and components, phoneme pronunciation duration annotation methods and components | |
US20250118292A1 (en) | Word-level end-to-end neural speaker diarization with auxnet | |
Chou et al. | A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data | |
Bandodkar et al. | “Allot?” Is “A Lot!” Towards Developing More Generalized Speech Recognition System for Accessible Communication | |
Priya et al. | Emotion-Aware Text-to-Speech Synthesis for Enhanced Accessibility: Synthetic Data Generation and Automated TTS for the Visually Impaired | |
CN119207370A (en) | Audio processing method, device, equipment and product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |