
CN113825005B - Face video and audio synchronization method and system based on joint training - Google Patents


Info

Publication number
CN113825005B
Authority
CN
China
Prior art keywords
phoneme
video
mouth shape
sentence
appointed
Prior art date
Legal status
Active
Application number
CN202111159455.4A
Other languages
Chinese (zh)
Other versions
CN113825005A (en)
Inventor
包英泽
梁光
卢景熙
冯富森
舒科
Current Assignee
Beijing Tiaoyue Intelligent Technology Co ltd
Original Assignee
Beijing Tiaoyue Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Tiaoyue Intelligent Technology Co ltd
Priority to CN202111159455.4A
Publication of CN113825005A
Application granted
Publication of CN113825005B
Legal status: Active (current)
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a face video and audio synchronization method and system based on joint training, and a computer device. A new logical relationship is adopted: based on the phoneme sequences to be processed that correspond to the pronunciation mouth shapes in a sample video, network training is carried out by combining the specified type features of the preset specified phonemes in the phoneme sequence to be processed corresponding to each mouth shape video with the specified type mouth shape features corresponding to each mouth shape video, so as to obtain a mouth shape feature generation module. On this basis, the specified type mouth shape feature sequences corresponding to the sentences of Chinese speech in the target audio are obtained and used to correct the face mouth shape of the corresponding video segments in the target face video, and each sentence of Chinese speech in the target audio is loaded according to the corresponding timestamp, thereby achieving synchronization between the target audio and the target face video. The overall design realizes accurate and stable synthesis of the target audio and the target video and improves the quality of the resulting audio and video.

Description

Face video and audio synchronization method and system based on joint training
Technical Field
The invention relates to a face video and audio synchronization method, system and computer device based on joint training, and belongs to the technical field of audio and video synthesis processing.
Background
There is strong demand for generated video content in the internet and media fields. One class of methods synthesizes face pictures and videos and is called TTA (text-to-animation), for example the ATVG method; another class synthesizes speech and is called TTS (text-to-speech), and many such methods are practical. However, when TTA and TTS models are trained separately, the sound and the picture easily fall out of synchronization, i.e., the mouth motion does not match the speech. The prior art therefore lacks a method that accurately and stably synthesizes audio and video generated independently of each other.
Disclosure of Invention
The invention aims to solve the technical problem of providing a face video and audio synchronization method based on joint training, which learns mouth shape features from phoneme features so as to obtain mouth shape features consistent with the audio, and then synchronizes the audio with the face mouth shape video through correction.
The invention adopts the following technical scheme to solve the above technical problem: a face video and audio synchronization method based on joint training, in which a mouth shape feature generation module is generated through the following steps I to V, and the mouth shape feature generation module is then applied according to the following steps A to C to obtain synchronization between a target audio and a target face video;
step I, acquiring each section of audio in the sample video, acquiring each video to be selected corresponding to each section of audio respectively, and then entering step II;
Step II, respectively segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese voice in the audio and the time stamp corresponding to each Chinese voice, further obtaining the phoneme sequence corresponding to each sentence of Chinese voice and the time stamp corresponding to each phoneme sequence, and combining to form the phoneme sequence group corresponding to the video to be selected; obtaining phoneme sequence groups corresponding to the videos to be selected respectively, and then entering a step III;
Step III, respectively aiming at each video to be selected, dividing and obtaining each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video according to different pronunciation mouth shapes of adjacent frames according to a phoneme sequence group corresponding to the video to be selected; obtaining all mouth shape videos corresponding to all the video to be selected and phoneme sequences to be processed corresponding to all the mouth shape videos respectively, and then entering the step IV;
Step IV, obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, obtaining appointed type mouth shape characteristics, which are respectively corresponding to each mouth shape video, and then entering step V;
step V, according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, taking the appointed type mouth shape characteristics corresponding to the mouth shape video as output, and training aiming at a preset appointed network to form a mouth shape characteristic generating module;
Step A, according to the step I to the step IV, obtaining phoneme sequences corresponding to each sentence of Chinese voice in the target audio, obtaining appointed type characteristics of each phoneme in each phoneme sequence, and then entering the step B;
B, according to the appointed type characteristics of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, taking the appointed type characteristics of each phoneme as input, applying a mouth shape characteristic generating module to obtain the appointed type mouth shape characteristics corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, further obtaining the appointed type mouth shape characteristic sequences corresponding to each sentence of Chinese speech in the target audio, and then entering the step C;
and C, correcting the face mouth shape of the corresponding video segment in the target face video corresponding to each sentence of Chinese voice according to the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese voice in the target audio and the time stamp corresponding to the target face video corresponding to each sentence of Chinese voice in the preset target audio, and loading each sentence of Chinese voice according to the time stamp corresponding to the target face video so as to realize the synchronization between the target audio and the target face video.
As a preferred technical scheme of the invention: the specified type feature of the phoneme is embedding features of the phoneme or one-hot features of the phoneme.
As a preferred technical scheme of the invention: in the step IV, the following operations are executed for the phoneme sequences to be processed corresponding to each mouth shape video respectively, so as to obtain the specified type characteristics of the preset specified phonemes in the phoneme sequences to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained.
As a preferred technical scheme of the invention: in the step IV, the following operations are performed for the to-be-processed phoneme sequences corresponding to the mouth-shaped videos respectively to obtain the specific type characteristics of the preset specific phonemes in the to-be-processed phoneme sequences corresponding to the mouth-shaped videos; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained.
As a preferred technical scheme of the invention: obtaining embedding characteristics of phonemes according to the following steps a to c;
Step a, according to the length of each section of audio in the sample video, the time stamp of the corresponding pinyin marking of the inner initial consonant and vowel of each sentence is combined, according to the preset initial consonant and vowel duration ratio, the pronunciation level marking is executed for each sentence of Chinese sentence, and the face alignment is applied to carry out phoneme marking for the pronunciation level marking, so as to obtain the marked phoneme sequence corresponding to each sentence of Chinese sentence respectively;
B, applying a phoneme coding network layer in the TTS, aiming at each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, converting each labeling phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c;
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
As a preferred technical scheme of the invention: in the step V, according to the specified type of mouth shape feature corresponding to each mouth shape video and the specified type of feature of the phoneme sequence to be processed corresponding to each mouth shape video, the specified type of feature of the phoneme sequence to be processed corresponding to the mouth shape video is taken as input, the specified type of mouth shape feature corresponding to the mouth shape video is taken as output, and training is performed aiming at the GAN network to form a mouth shape feature generating module.
Correspondingly, the invention also aims to provide a system for the face video and audio synchronization method based on joint training, which learns mouth shape features from phoneme features so as to obtain mouth shape features consistent with the audio, and then synchronizes the audio with the face mouth shape video through correction.
The invention adopts the following technical scheme for solving the technical problems: the invention designs a system of a human face video and audio synchronization method based on joint training, which comprises an audio and video separation and segmentation module, a phoneme sequence obtaining module, a mouth shape video segmentation module, a feature extraction module, a model training module, a model execution module and a correction loading module;
The audio-video separation and segmentation module is used for acquiring each section of audio in the sample video and acquiring each video to be selected corresponding to each section of audio respectively; it is also used for segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese speech in the audio and the timestamp corresponding to each sentence of Chinese speech;
The phoneme sequence obtaining module is used for obtaining phoneme sequences corresponding to each sentence of Chinese voice respectively and time stamps corresponding to each phoneme sequence respectively, and combining to form a phoneme sequence group corresponding to the video to be selected; further obtaining a phoneme sequence group corresponding to each video to be selected respectively;
The mouth shape video segmentation module is used for, respectively for each video to be selected, dividing the video to be selected according to the different pronunciation mouth shapes of adjacent frames and according to the phoneme sequence group corresponding to the video to be selected, so as to obtain each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video;
The feature extraction module is used for obtaining the appointed type features of preset appointed phonemes in the phoneme sequences to be processed, which are respectively corresponding to the mouth shape videos, and obtaining the appointed type mouth shape features, which are respectively corresponding to the mouth shape videos;
The model training module is used for training aiming at a preset appointed network according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video by taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input and the appointed type mouth shape characteristics corresponding to the mouth shape video as output to form a mouth shape characteristic generating module;
the model execution module is used for obtaining the appointed type mouth shape characteristic corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio by taking the appointed type characteristic of each phoneme as input and applying the mouth shape characteristic generation module according to the appointed type characteristic of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, so as to obtain the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio;
The correction loading module is used for correcting the face mouth shape of the corresponding video segments in the target face video for each sentence of Chinese speech, according to the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio and the preset timestamps at which each sentence of Chinese speech in the target audio corresponds to the target face video, and for loading each sentence of Chinese speech according to the timestamp at which it corresponds to the target face video, so as to realize the synchronization between the target audio and the target face video.
The invention designs a computer device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of a face video and audio synchronization method based on joint training when executing the computer program.
And designing a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a face video and audio synchronization method based on joint training.
Compared with the prior art, the face video and audio synchronization method based on the joint training has the following technical effects:
The face video and audio synchronization method based on joint training designed by the invention adopts a new logical relationship: based on the phoneme sequences to be processed corresponding to the pronunciation mouth shapes in a sample video, network training is carried out by combining the specified type features of the preset specified phonemes in the phoneme sequence to be processed corresponding to each mouth shape video with the specified type mouth shape features corresponding to each mouth shape video, so as to obtain a mouth shape feature generation module. On this basis, the specified type mouth shape feature sequences corresponding to the sentences of Chinese speech in the target audio are obtained and used to correct the face mouth shape of the corresponding video segments in the target face video, and each sentence of Chinese speech in the target audio is loaded according to the corresponding timestamp, thereby achieving synchronization between the target audio and the target face video. The overall design realizes accurate and stable synthesis of the target audio and the target video and improves the quality of the resulting audio and video.
Drawings
FIG. 1 is a flow chart of a method for synchronizing synthetic face and synthetic voice based on joint training according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
The invention designs a face video and audio synchronization method based on joint training, which is shown in fig. 1, and generates a mouth shape feature generation module through the following steps I to V.
And I, acquiring each section of audio in the sample video, acquiring each video to be selected corresponding to each section of audio respectively, and then entering the step II.
Step II, respectively segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese voice in the audio and the time stamp corresponding to each Chinese voice, further obtaining the phoneme sequence corresponding to each sentence of Chinese voice and the time stamp corresponding to each phoneme sequence, and combining to form the phoneme sequence group corresponding to the video to be selected; and further obtaining phoneme sequence groups corresponding to the videos to be selected respectively, and then entering the step III.
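The following is a minimal Python sketch of step II under simplifying assumptions: the sentence-level segmentation and the phoneme-level alignment are assumed to be supplied by external tools, so the data structures and the align_phonemes helper below are illustrative placeholders rather than components defined by the patent.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SentenceSegment:
    text: str      # one sentence of Chinese speech
    start: float   # sentence start time within the audio of the video to be selected (s)
    end: float     # sentence end time (s)

@dataclass
class PhonemeSequence:
    phonemes: List[str]                     # phoneme labels for the sentence
    timestamps: List[Tuple[float, float]]   # (start, end) per phoneme, same length

def build_phoneme_sequence_group(sentences: List[SentenceSegment],
                                 align_phonemes: Callable) -> List[PhonemeSequence]:
    # align_phonemes(text, start, end) is assumed to return (phoneme, start, end)
    # triples for one sentence, e.g. from a forced aligner.
    group = []
    for s in sentences:
        aligned = align_phonemes(s.text, s.start, s.end)
        group.append(PhonemeSequence(
            phonemes=[p for p, _, _ in aligned],
            timestamps=[(b, e) for _, b, e in aligned],
        ))
    return group   # the phoneme sequence group corresponding to one video to be selected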
Step III, respectively aiming at each video to be selected, dividing and obtaining each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video according to different pronunciation mouth shapes of adjacent frames according to a phoneme sequence group corresponding to the video to be selected; and further obtaining all mouth shape videos corresponding to all the candidate videos and phoneme sequences to be processed corresponding to all the mouth shape videos respectively, and then entering the step IV.
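A minimal sketch of step III follows, assuming a per-frame mouth shape label (for example a cluster id derived from lip landmarks) is already available; frame_mouth_ids, fps and the phoneme timestamps are illustrative assumptions, not quantities fixed by the patent.

def split_by_mouth_shape(frame_mouth_ids, phonemes, phoneme_times, fps=25.0):
    # Group consecutive frames that share a pronunciation mouth shape into one
    # mouth shape video, then attach the phonemes whose timestamps fall inside it.
    clips, start = [], 0
    for i in range(1, len(frame_mouth_ids) + 1):
        # a new clip starts whenever the mouth shape of adjacent frames differs
        if i == len(frame_mouth_ids) or frame_mouth_ids[i] != frame_mouth_ids[i - 1]:
            clips.append((start, i - 1))
            start = i
    result = []
    for f0, f1 in clips:
        t0, t1 = f0 / fps, (f1 + 1) / fps
        # the phoneme sequence to be processed for this mouth shape video
        in_clip = [p for p, (b, e) in zip(phonemes, phoneme_times) if b < t1 and e > t0]
        result.append(((f0, f1), in_clip))
    return result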
And IV, obtaining the appointed type characteristics of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, obtaining the appointed type mouth shape characteristics, which are respectively corresponding to each mouth shape video, and then entering the step V.
In practice, 1 s of audio carries 100 frames of phoneme feature labels, of which 30 frames are landmark key phonemes, i.e., landmark key phonemes : phoneme labels ≈ 1 : 3. Specifically, the specified type feature of a phoneme is the embedding feature of the phoneme or the one-hot feature of the phoneme. In step IV, if the specified type feature of the phoneme is the embedding feature of the phoneme, the following operation 1 is executed for the phoneme sequence to be processed corresponding to each mouth shape video, so as to obtain the specified type feature of the preset specified phoneme in the phoneme sequence to be processed corresponding to that mouth shape video; the specified type features of the preset specified phonemes in the phoneme sequences to be processed corresponding to all the mouth shape videos are thereby obtained.
Operation 1: each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained.
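A minimal sketch of operation 1 is given below; key_positions (the indices of the landmark key phonemes within the clip's phoneme sequence to be processed) and the window size are assumptions used for illustration only.

from collections import Counter

def specified_phoneme_by_window(phonemes, key_positions, window=2):
    # Collect `window` phonemes before and after each landmark key phoneme as
    # phonemes to be selected, then vote for the one with the highest frequency.
    candidates = []
    for k in key_positions:
        lo, hi = max(0, k - window), min(len(phonemes), k + window + 1)
        candidates.extend(phonemes[lo:hi])
    return Counter(candidates).most_common(1)[0][0] if candidates else None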
Meanwhile, regarding the obtaining of the specific type feature of the phonemes, a second manner is designed, that is, if the specific type feature of the phonemes is embedding features of the phonemes or one-hot features, in the step IV, the following operation 2 is executed for the to-be-processed phoneme sequences corresponding to each mouth shape video respectively, so as to obtain the specific type feature of the preset specific phonemes in the to-be-processed phoneme sequences corresponding to the mouth shape video; and further obtaining the appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed, which are respectively corresponding to the mouth shape videos.
Operation 2: obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained.
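A minimal sketch of operation 2 follows; the scaling factor alpha stands in for the preset parameter value mentioned above (for example the ratio between the 100 phoneme-label frames and the 30 landmark frames per second) and is an assumption, not a value fixed by the patent.

from collections import Counter

def specified_phoneme_by_scaling(phonemes, key_positions, alpha=100.0 / 30.0):
    # Map each landmark key phoneme position onto the phoneme track by multiplying
    # by alpha and rounding, then vote for the phoneme with the highest frequency.
    candidates = []
    for k in key_positions:
        idx = int(round(k * alpha))
        if 0 <= idx < len(phonemes):
            candidates.append(phonemes[idx])
    return Counter(candidates).most_common(1)[0][0] if candidates else None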
In practical application, embedding features of the phonemes are obtained as follows in steps a to c.
In step a, according to the length of each section of audio in the sample video, combined with the timestamps of the pinyin marking of the initial consonants and vowels within each sentence, pronunciation-level marking is executed for each Chinese sentence according to a preset initial-consonant-to-vowel duration ratio, such as 1:2 or 1:1, and face alignment is applied to carry out phoneme marking on the pronunciation-level marking, so as to obtain the marked phoneme sequence corresponding to each Chinese sentence.
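As a minimal sketch of the pronunciation-level marking in step a, the helper below splits one syllable's timestamp into an initial-consonant span and a vowel span at a preset duration ratio; the syllable decomposition passed in is illustrative and would in practice come from the pinyin marking of the sentence.

def split_syllable(start, end, initial, final, ratio=(1, 2)):
    # Return pronunciation-level labels [(phone, t0, t1), ...] for one syllable,
    # giving the initial consonant ratio[0] parts and the vowel ratio[1] parts of the time.
    if not initial:   # zero-initial syllable such as "an"
        return [(final, start, end)]
    cut = start + (end - start) * ratio[0] / (ratio[0] + ratio[1])
    return [(initial, start, cut), (final, cut, end)]

# example: the syllable "ma3" occupying 0.30 s to 0.60 s with a 1:2 ratio
print(split_syllable(0.30, 0.60, "m", "a3"))   # [('m', 0.3, 0.4), ('a3', 0.4, 0.6)]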
And b, applying a phoneme coding network layer (phone embedding network layer) in the TTS, aiming at each marked phoneme in the marked phoneme sequence corresponding to each sentence Chinese sentence, converting each marked phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each marked phoneme in the marked phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c.
In practice, the conversion of the labeling phonemes into a phoneme-encoded TTS representation is implemented as follows, using a phoneme-encoding network layer (phone embedding network layer) in the TTS.
# === hparams ===
# n_symbols = len(symbols)               # vocabulary size, 320 in total
# symbols_embedding_dim = 512
from math import sqrt
import torch.nn as nn

# initialization (inside the TTS model's __init__)
self.embedding = nn.Embedding(hparams.n_symbols, hparams.symbols_embedding_dim)
std = sqrt(2.0 / (hparams.n_symbols + hparams.symbols_embedding_dim))
val = sqrt(3.0) * std  # uniform bounds corresponding to this std
self.embedding.weight.data.uniform_(-val, val)

# use (inside forward): text_inputs holds the ids of the phoneme symbols
embedded_inputs = self.embedding(text_inputs).transpose(1, 2)
In practical implementation, the result of converting labeling phonemes into phoneme-encoded TTS representations is as follows, in the form "labeling phoneme : phoneme-encoded TTS representation":
uei1:ui1
uei2:ui2
uei3:ui3
uei4:ui4
uei5:ui5
ng:ng1
iou1:ou1
iou2:ou2
iou3:ou3
iou4:ou4
ii1:i1
ii2:i2
ii3:i3
ii4:i4
ii5:i5
io1:o1
io2:o2
io3:o3
io4:o4
io5:o5
va1:a1
va2:a2
va3:a3
va4:a4
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
In practical implementation, the result of converting phoneme-encoded TTS representations into phoneme embedding features is as follows, in the form "phoneme-encoded TTS representation : phoneme embedding feature":
@iao5:@iao1
@iu5:@iu1
@ng1:@n
@ua5:@ua1
@uai5:@uai1
@ue5:@ue1
@ui5:@ui1
@v1:@v2
@v5:@v2
@ve1:@ve4
@ve2:@ve4
@ve3:@ve4
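Putting steps b and c together, the sketch below looks up one embedding vector per phoneme by chaining the two mapping tables above with the embedding layer initialized earlier; symbol_to_id and the mapping dictionaries are assumptions standing in for the TTS front end's own tables.

import torch

def phoneme_embedding_features(labeling_phonemes, label_to_tts, tts_merge, symbol_to_id, embedding):
    # labeling phoneme -> phoneme-encoded TTS representation -> merged symbol -> embedding vector
    ids = []
    for p in labeling_phonemes:
        sym = label_to_tts.get(p, p)   # e.g. "uei1" -> "ui1"
        sym = tts_merge.get(sym, sym)  # e.g. "@v1"  -> "@v2"
        ids.append(symbol_to_id[sym])
    with torch.no_grad():
        return embedding(torch.tensor(ids))   # shape: (num_phonemes, embedding_dim)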
And V, according to the specified type mouth shape characteristics corresponding to each mouth shape video and the specified type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the specified type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, taking the specified type mouth shape characteristics corresponding to the mouth shape video as output, and training aiming at a preset specified network such as a GAN network to form a mouth shape characteristic generating module.
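The following is a minimal sketch of step V with a GAN as the preset specified network: the generator maps the specified type features of the phonemes to the specified type mouth shape features, and the discriminator judges whether a (phoneme feature, mouth shape feature) pair is real or generated. The layer sizes, feature dimensions and training schedule are illustrative assumptions, not values fixed by the patent.

import torch
import torch.nn as nn

PHONE_DIM, MOUTH_DIM, HIDDEN = 512, 40, 256   # assumed feature dimensions

generator = nn.Sequential(nn.Linear(PHONE_DIM, HIDDEN), nn.ReLU(),
                          nn.Linear(HIDDEN, MOUTH_DIM))
discriminator = nn.Sequential(nn.Linear(PHONE_DIM + MOUTH_DIM, HIDDEN), nn.ReLU(),
                              nn.Linear(HIDDEN, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(phone_feat, mouth_feat):
    # phone_feat: (B, PHONE_DIM) specified type phoneme features (input)
    # mouth_feat: (B, MOUTH_DIM) specified type mouth shape features (output/target)
    # discriminator step: real pairs vs. generated pairs
    fake = generator(phone_feat).detach()
    d_real = discriminator(torch.cat([phone_feat, mouth_feat], dim=1))
    d_fake = discriminator(torch.cat([phone_feat, fake], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # generator step: try to fool the discriminator
    gen = generator(phone_feat)
    d_gen = discriminator(torch.cat([phone_feat, gen], dim=1))
    loss_g = bce(d_gen, torch.ones_like(d_gen))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

The trained generator then serves as the mouth shape feature generation module.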
Based on the mouth shape feature generation module obtained above, the module can be applied, as shown in fig. 1, to obtain synchronization between the target audio and the target face video according to the following steps A to C.
Step A, according to the step I to the step IV, obtaining phoneme sequences corresponding to each sentence of Chinese voice in the target audio, obtaining the appointed type characteristics of each phoneme in each phoneme sequence, and then entering the step B.
And B, according to the appointed type characteristics of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, taking the appointed type characteristics of each phoneme as input, applying a mouth shape characteristic generating module to obtain the appointed type mouth shape characteristics corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, further obtaining the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio, and then entering the step C.
And C, correcting the face mouth shape of the corresponding video segment in the target face video corresponding to each sentence of Chinese voice according to the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese voice in the target audio and the time stamp corresponding to the target face video corresponding to each sentence of Chinese voice in the preset target audio, and loading each sentence of Chinese voice according to the time stamp corresponding to the target face video so as to realize the synchronization between the target audio and the target face video.
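At inference time, steps B and C can be sketched as follows; correct_mouth_frames and mux_audio_segment are hypothetical helpers standing in for the face-editing and audio-loading operations, which the patent does not describe at code level.

import torch

def synchronize(sentences, target_frames, fps, generator,
                correct_mouth_frames, mux_audio_segment):
    # sentences: one dict per sentence of Chinese speech in the target audio, with
    # 'phone_feats' (an (N, PHONE_DIM) tensor), 'start'/'end' timestamps and 'audio'.
    for s in sentences:
        with torch.no_grad():
            mouth_seq = generator(s["phone_feats"])   # step B: mouth shape feature sequence
        f0, f1 = int(s["start"] * fps), int(s["end"] * fps)
        # step C: correct the face mouth shape of the matching segment of the target face video ...
        target_frames[f0:f1] = correct_mouth_frames(target_frames[f0:f1], mouth_seq)
        # ... and load this sentence of Chinese speech at the same timestamps
        mux_audio_segment(s["audio"], s["start"], s["end"])
    return target_frames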
Applying the face video and audio synchronization method based on joint training designed by the invention in practice, the invention further specifically designs a system for the face video and audio synchronization method based on joint training, which comprises an audio-video separation and segmentation module, a phoneme sequence obtaining module, a mouth shape video segmentation module, a feature extraction module, a model training module, a model execution module and a correction loading module.
The audio-video separation and segmentation module is used for acquiring each section of audio in the sample video and acquiring each video to be selected corresponding to each section of audio respectively; and the method is used for respectively segmenting the audio corresponding to each video to be selected, and obtaining each sentence of Chinese voice in the audio and the corresponding time stamp of each Chinese voice.
The phoneme sequence obtaining module is used for obtaining phoneme sequences corresponding to each sentence of Chinese voice respectively and time stamps corresponding to each phoneme sequence respectively, and combining to form a phoneme sequence group corresponding to the video to be selected; and further obtaining a phoneme sequence group corresponding to each video to be selected.
The mouth shape video segmentation module is used for, respectively for each video to be selected, dividing the video to be selected according to the different pronunciation mouth shapes of adjacent frames and according to the phoneme sequence group corresponding to the video to be selected, so as to obtain each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video.
The feature extraction module is used for obtaining the appointed type features of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, and obtaining the appointed type mouth shape features, which are respectively corresponding to each mouth shape video.
The model training module is used for training aiming at a preset appointed network according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, and taking the appointed type mouth shape characteristics corresponding to the mouth shape video as output to form the mouth shape characteristic generating module.
The model execution module is used for obtaining the appointed type mouth shape characteristic corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio by taking the appointed type characteristic of each phoneme as input and applying the mouth shape characteristic generation module according to the appointed type characteristic of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, so as to obtain the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio.
The correction loading module is used for correcting the face mouth shape of the corresponding video segments in the target face video for each sentence of Chinese speech, according to the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio and the preset timestamps at which each sentence of Chinese speech in the target audio corresponds to the target face video, and for loading each sentence of Chinese speech according to the timestamp at which it corresponds to the target face video, so as to realize the synchronization between the target audio and the target face video.
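As a rough sketch of how the seven modules could be composed, the class below wires them into a training path and a synchronization path; the module objects are placeholders for the components described above, not an actual API defined by the patent.

class FaceAudioSyncSystem:
    def __init__(self, separator, phoneme_builder, mouth_splitter,
                 feature_extractor, trainer, executor, corrector):
        self.separator = separator               # audio-video separation and segmentation module
        self.phoneme_builder = phoneme_builder   # phoneme sequence obtaining module
        self.mouth_splitter = mouth_splitter     # mouth shape video segmentation module
        self.feature_extractor = feature_extractor
        self.trainer = trainer                   # model training module
        self.executor = executor                 # model execution module
        self.corrector = corrector               # correction loading module

    def train(self, sample_videos):
        audio, candidates = self.separator(sample_videos)
        groups = self.phoneme_builder(audio, candidates)
        clips, phoneme_seqs = self.mouth_splitter(candidates, groups)
        phone_feats, mouth_feats = self.feature_extractor(clips, phoneme_seqs)
        return self.trainer(phone_feats, mouth_feats)   # mouth shape feature generation module

    def synchronize(self, generator, target_audio, target_face_video):
        mouth_seqs = self.executor(generator, target_audio)
        return self.corrector(mouth_seqs, target_audio, target_face_video)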
The invention designs a computer device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of a face video and audio synchronization method based on joint training when executing the computer program.
And designing a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a face video and audio synchronization method based on joint training.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of one skilled in the art without departing from the spirit of the present invention.
Any modifications or variations, which are apparent to those skilled in the art in light of the above teachings, are intended to be included within the scope of this invention without departing from its spirit.

Claims (5)

1. A face video and audio synchronization method based on joint training is characterized in that a mouth shape feature generation module is generated through the following steps I to V, and then the mouth shape feature generation module is applied to obtain the synchronization between a target audio and a target face video according to the following steps A to C;
step I, acquiring each section of audio in the sample video, acquiring each video to be selected corresponding to each section of audio respectively, and then entering step II;
Step II, respectively segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese voice in the audio and the time stamp corresponding to each Chinese voice, further obtaining the phoneme sequence corresponding to each sentence of Chinese voice and the time stamp corresponding to each phoneme sequence, and combining to form the phoneme sequence group corresponding to the video to be selected; obtaining phoneme sequence groups corresponding to the videos to be selected respectively, and then entering a step III;
Step III, respectively aiming at each video to be selected, dividing and obtaining each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video according to different pronunciation mouth shapes of adjacent frames according to a phoneme sequence group corresponding to the video to be selected; obtaining all mouth shape videos corresponding to all the video to be selected and phoneme sequences to be processed corresponding to all the mouth shape videos respectively, and then entering the step IV;
Step IV, obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, obtaining appointed type mouth shape characteristics, which are respectively corresponding to each mouth shape video, and then entering step V;
step V, according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, taking the appointed type mouth shape characteristics corresponding to the mouth shape video as output, and training aiming at a preset appointed network to form a mouth shape characteristic generating module;
Step A, according to the step I to the step IV, obtaining phoneme sequences corresponding to each sentence of Chinese voice in the target audio, obtaining appointed type characteristics of each phoneme in each phoneme sequence, and then entering the step B;
B, according to the appointed type characteristics of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, taking the appointed type characteristics of each phoneme as input, applying a mouth shape characteristic generating module to obtain the appointed type mouth shape characteristics corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, further obtaining the appointed type mouth shape characteristic sequences corresponding to each sentence of Chinese speech in the target audio, and then entering the step C;
Step C, according to the appointed type mouth feature sequences respectively corresponding to each sentence of Chinese voice in the target audio and the time stamps respectively corresponding to each sentence of Chinese voice in the preset target audio and the target face video, correcting the face mouth shape of the corresponding video segment in the target face video respectively corresponding to each sentence of Chinese voice, and loading each sentence of Chinese voice according to the time stamp corresponding to the target face video, so that the synchronization between the target audio and the target face video is realized;
In the step IV, the following operations are executed for the phoneme sequences to be processed corresponding to each mouth shape video respectively, so as to obtain the specified type characteristics of the preset specified phonemes in the phoneme sequences to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained;
In the step IV, the following operations are performed for the to-be-processed phoneme sequences corresponding to the mouth-shaped videos respectively to obtain the specific type characteristics of the preset specific phonemes in the to-be-processed phoneme sequences corresponding to the mouth-shaped videos; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained;
Obtaining embedding characteristics of phonemes according to the following steps a to c;
Step a, according to the length of each section of audio in the sample video, the time stamp of the corresponding pinyin marking of the inner initial consonant and vowel of each sentence is combined, according to the preset initial consonant and vowel duration ratio, the pronunciation level marking is executed for each sentence of Chinese sentence, and the face alignment is applied to carry out phoneme marking for the pronunciation level marking, so as to obtain the marked phoneme sequence corresponding to each sentence of Chinese sentence respectively;
B, applying a phoneme coding network layer in the TTS, aiming at each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, converting each labeling phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c;
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
2. The method for synchronizing facial video and audio based on joint training according to claim 1, wherein: in the step V, according to the specified type of mouth shape feature corresponding to each mouth shape video and the specified type of feature of the phoneme sequence to be processed corresponding to each mouth shape video, the specified type of feature of the phoneme sequence to be processed corresponding to the mouth shape video is taken as input, the specified type of mouth shape feature corresponding to the mouth shape video is taken as output, and training is performed aiming at the GAN network to form a mouth shape feature generating module.
3. A system for implementing the face video and audio synchronization method based on joint training according to claim 1 or 2, which is characterized by comprising an audio/video separation and segmentation module, a phoneme sequence obtaining module, a mouth shape video segmentation module, a feature extraction module, a model training module, a model execution module and a correction loading module;
The audio-video separation and segmentation module is used for acquiring each section of audio in the sample video and acquiring each video to be selected corresponding to each section of audio respectively; the method comprises the steps of selecting each video to be selected, and segmenting audio corresponding to the video to be selected to obtain each sentence of Chinese voice in the audio and each corresponding time stamp of each Chinese voice;
The phoneme sequence obtaining module is used for obtaining phoneme sequences corresponding to each sentence of Chinese voice respectively and time stamps corresponding to each phoneme sequence respectively, and combining to form a phoneme sequence group corresponding to the video to be selected; further obtaining a phoneme sequence group corresponding to each video to be selected respectively;
the mouth shape video segmentation module is used for, respectively for each video to be selected, dividing the video to be selected according to the different pronunciation mouth shapes of adjacent frames and according to the phoneme sequence group corresponding to the video to be selected, so as to obtain each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video;
The feature extraction module is used for obtaining the appointed type features of preset appointed phonemes in the phoneme sequences to be processed, which are respectively corresponding to the mouth shape videos, and obtaining the appointed type mouth shape features, which are respectively corresponding to the mouth shape videos;
The model training module is used for training aiming at a preset appointed network according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video by taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input and the appointed type mouth shape characteristics corresponding to the mouth shape video as output to form a mouth shape characteristic generating module;
the model execution module is used for obtaining the appointed type mouth shape characteristic corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio by taking the appointed type characteristic of each phoneme as input and applying the mouth shape characteristic generation module according to the appointed type characteristic of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, so as to obtain the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio;
The correction loading module is used for correcting the face mouth shape of the corresponding video segments in the target face video for each sentence of Chinese speech, according to the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio and the preset timestamps at which each sentence of Chinese speech in the target audio corresponds to the target face video, and for loading each sentence of Chinese speech according to the timestamp at which it corresponds to the target face video, so as to realize the synchronization between the target audio and the target face video;
The specific type features of the phonemes are embedding features of the phonemes, and the feature extraction module performs the following operations respectively for the sequences of the phonemes to be processed corresponding to each mouth shape video to obtain the specific type features of the preset specific phonemes in the sequences of the phonemes to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained;
The specific type feature of the phonemes is embedding features of the phonemes or one-hot features, and the feature extraction module performs the following operations respectively for the sequences of the phonemes to be processed corresponding to each mouth shape video to obtain the specific type feature of the preset specific phonemes in the sequences of the phonemes to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained;
Obtaining embedding characteristics of phonemes according to the following steps a to c;
Step a, according to the length of each section of audio in the sample video, the time stamp of the corresponding pinyin marking of the inner initial consonant and vowel of each sentence is combined, according to the preset initial consonant and vowel duration ratio, the pronunciation level marking is executed for each sentence of Chinese sentence, and the face alignment is applied to carry out phoneme marking for the pronunciation level marking, so as to obtain the marked phoneme sequence corresponding to each sentence of Chinese sentence respectively;
B, applying a phoneme coding network layer in the TTS, aiming at each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, converting each labeling phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c;
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
4. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of claim 1 or 2 when executing the computer program.
5. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of claim 1 or 2.
CN202111159455.4A 2021-09-30 2021-09-30 Face video and audio synchronization method and system based on joint training Active CN113825005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111159455.4A CN113825005B (en) 2021-09-30 2021-09-30 Face video and audio synchronization method and system based on joint training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111159455.4A CN113825005B (en) 2021-09-30 2021-09-30 Face video and audio synchronization method and system based on joint training

Publications (2)

Publication Number Publication Date
CN113825005A CN113825005A (en) 2021-12-21
CN113825005B true CN113825005B (en) 2024-05-24

Family

ID=78919849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111159455.4A Active CN113825005B (en) 2021-09-30 2021-09-30 Face video and audio synchronization method and system based on joint training

Country Status (1)

Country Link
CN (1) CN113825005B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115065844B (en) * 2022-05-24 2023-09-12 北京跳悦智能科技有限公司 Self-adaptive adjustment method for motion rhythm of anchor limb
CN114945110B (en) * 2022-05-31 2023-10-24 深圳市优必选科技股份有限公司 Method and device for synthesizing voice head video, terminal equipment and readable storage medium
CN116992309B (en) * 2023-09-26 2023-12-19 苏州青颖飞帆软件科技股份有限公司 Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium
CN118471250B (en) * 2024-06-20 2025-02-07 北京林业大学 A method for automatically generating lip shape and expression by inputting speech


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7046300B2 (en) * 2002-11-29 2006-05-16 International Business Machines Corporation Assessing consistency between facial motion and speech signals in video
US7133535B2 (en) * 2002-12-21 2006-11-07 Microsoft Corp. System and method for real time lip synchronization
US20080111887A1 (en) * 2006-11-13 2008-05-15 Pixel Instruments, Corp. Method, system, and program product for measuring audio video synchronization independent of speaker characteristics
US8224652B2 (en) * 2008-09-26 2012-07-17 Microsoft Corporation Speech and text driven HMM-based body animation synthesis
EP3438952B1 (en) * 2017-08-02 2025-05-07 Tata Consultancy Services Limited Systems and methods for intelligent generation of inclusive system designs
US11526808B2 (en) * 2019-05-29 2022-12-13 The Board Of Trustees Of The Leland Stanford Junior University Machine learning based generation of ontology for structural and functional mapping

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN110149548A (en) * 2018-09-26 2019-08-20 腾讯科技(深圳)有限公司 Video dubbing method, electronic device and readable storage medium storing program for executing
CN111741326A (en) * 2020-06-30 2020-10-02 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN112750185A (en) * 2021-01-19 2021-05-04 清华大学 Portrait video generation method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Automatic Recognition and Verification of Lip Synchronization; Hou Yarong, Xiong Zhang; Computer Engineering and Design; 2004-02-28 (No. 02); full text *

Also Published As

Publication number Publication date
CN113825005A (en) 2021-12-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant