
CN113825005B - Face video and audio synchronization method and system based on joint training - Google Patents


Info

Publication number
CN113825005B
Authority
CN
China
Prior art keywords
phoneme
video
mouth shape
sentence
appointed
Prior art date
Legal status
Active
Application number
CN202111159455.4A
Other languages
Chinese (zh)
Other versions
CN113825005A (en)
Inventor
包英泽
梁光
卢景熙
冯富森
舒科
Current Assignee
Beijing Tiaoyue Intelligent Technology Co ltd
Original Assignee
Beijing Tiaoyue Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Tiaoyue Intelligent Technology Co ltd
Priority to CN202111159455.4A
Publication of CN113825005A
Application granted
Publication of CN113825005B
Legal status: Active (current)
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a face video and audio synchronization method and system based on joint training, and a computer device. A new logical relationship is adopted: based on the phoneme sequences to be processed that correspond to the pronunciation mouth shapes in a sample video, network training is carried out by combining the specified type features of the preset specified phonemes in the phoneme sequence to be processed corresponding to each mouth shape video with the specified type mouth shape features corresponding to each mouth shape video, so as to obtain a mouth shape feature generation module. On this basis, the specified type mouth shape feature sequences corresponding to the sentences of Chinese speech in the target audio are obtained and used to correct the face mouth shape of the corresponding video segments in the target face video, and each sentence of Chinese speech in the target audio is loaded according to the corresponding timestamp, thereby achieving synchronization between the target audio and the target face video. The overall design realizes accurate and stable synthesis of the target audio and the target video and improves the quality of the resulting audio and video.

Description

Face video and audio synchronization method and system based on joint training
Technical Field
The invention relates to a face video and audio synchronization method, system and computer device based on joint training, and belongs to the technical field of audio and video synthesis processing.
Background
There is strong demand for generated video content in the internet and media fields. One class of methods synthesizes face pictures and videos and is called TTA (text-to-animation), for example the ATVG method; another class synthesizes speech and is called TTS (text-to-speech), and many such methods are practical. However, when TTA and TTS models are trained separately, the sound and the picture easily fall out of synchronization, i.e., the mouth motion does not match the speech. The prior art therefore lacks a method that accurately and stably synthesizes audio and video generated independently of each other.
Disclosure of Invention
The invention aims to solve the technical problem of providing a face video and audio synchronization method based on joint training, which learns mouth shape features from phoneme features so as to obtain mouth shape features consistent with the audio, and then synchronizes the audio with the face mouth shape video through correction.
The invention adopts the following technical scheme to solve the above technical problem: a face video and audio synchronization method based on joint training, in which a mouth shape feature generation module is generated through the following steps I to V, and the mouth shape feature generation module is then applied according to the following steps A to C to obtain synchronization between a target audio and a target face video;
step I, acquiring each section of audio in the sample video, acquiring each video to be selected corresponding to each section of audio respectively, and then entering step II;
Step II, respectively segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese voice in the audio and the time stamp corresponding to each Chinese voice, further obtaining the phoneme sequence corresponding to each sentence of Chinese voice and the time stamp corresponding to each phoneme sequence, and combining to form the phoneme sequence group corresponding to the video to be selected; obtaining phoneme sequence groups corresponding to the videos to be selected respectively, and then entering a step III;
Step III, respectively aiming at each video to be selected, dividing and obtaining each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video according to different pronunciation mouth shapes of adjacent frames according to a phoneme sequence group corresponding to the video to be selected; obtaining all mouth shape videos corresponding to all the video to be selected and phoneme sequences to be processed corresponding to all the mouth shape videos respectively, and then entering the step IV;
Step IV, obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, obtaining appointed type mouth shape characteristics, which are respectively corresponding to each mouth shape video, and then entering step V;
step V, according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, taking the appointed type mouth shape characteristics corresponding to the mouth shape video as output, and training aiming at a preset appointed network to form a mouth shape characteristic generating module;
Step A, according to the step I to the step IV, obtaining phoneme sequences corresponding to each sentence of Chinese voice in the target audio, obtaining appointed type characteristics of each phoneme in each phoneme sequence, and then entering the step B;
B, according to the appointed type characteristics of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, taking the appointed type characteristics of each phoneme as input, applying a mouth shape characteristic generating module to obtain the appointed type mouth shape characteristics corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, further obtaining the appointed type mouth shape characteristic sequences corresponding to each sentence of Chinese speech in the target audio, and then entering the step C;
and C, correcting the face mouth shape of the corresponding video segment in the target face video corresponding to each sentence of Chinese voice according to the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese voice in the target audio and the time stamp corresponding to the target face video corresponding to each sentence of Chinese voice in the preset target audio, and loading each sentence of Chinese voice according to the time stamp corresponding to the target face video so as to realize the synchronization between the target audio and the target face video.
As a preferred technical scheme of the invention: the specified type feature of the phoneme is embedding features of the phoneme or one-hot features of the phoneme.
As a preferred technical scheme of the invention: in the step IV, the following operations are executed for the phoneme sequences to be processed corresponding to each mouth shape video respectively, so as to obtain the specified type characteristics of the preset specified phonemes in the phoneme sequences to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained.
As a preferred technical scheme of the invention: in the step IV, the following operations are performed for the to-be-processed phoneme sequences corresponding to the mouth-shaped videos respectively to obtain the specific type characteristics of the preset specific phonemes in the to-be-processed phoneme sequences corresponding to the mouth-shaped videos; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained.
As a preferred technical scheme of the invention: obtaining embedding characteristics of phonemes according to the following steps a to c;
Step a, according to the length of each section of audio in the sample video, the time stamp of the corresponding pinyin marking of the inner initial consonant and vowel of each sentence is combined, according to the preset initial consonant and vowel duration ratio, the pronunciation level marking is executed for each sentence of Chinese sentence, and the face alignment is applied to carry out phoneme marking for the pronunciation level marking, so as to obtain the marked phoneme sequence corresponding to each sentence of Chinese sentence respectively;
B, applying a phoneme coding network layer in the TTS, aiming at each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, converting each labeling phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c;
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
As a preferred technical scheme of the invention: in the step V, according to the specified type of mouth shape feature corresponding to each mouth shape video and the specified type of feature of the phoneme sequence to be processed corresponding to each mouth shape video, the specified type of feature of the phoneme sequence to be processed corresponding to the mouth shape video is taken as input, the specified type of mouth shape feature corresponding to the mouth shape video is taken as output, and training is performed aiming at the GAN network to form a mouth shape feature generating module.
Correspondingly, the invention also aims to provide a system for the face video and audio synchronization method based on joint training, which learns mouth shape features from phoneme features so as to obtain mouth shape features consistent with the audio, and then synchronizes the audio with the face mouth shape video through correction.
The invention adopts the following technical scheme for solving the technical problems: the invention designs a system of a human face video and audio synchronization method based on joint training, which comprises an audio and video separation and segmentation module, a phoneme sequence obtaining module, a mouth shape video segmentation module, a feature extraction module, a model training module, a model execution module and a correction loading module;
The audio-video separation and segmentation module is used for acquiring each section of audio in the sample video and acquiring each video to be selected corresponding to each section of audio respectively; it is also used for segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese speech in the audio and the timestamp corresponding to each sentence of Chinese speech;
The phoneme sequence obtaining module is used for obtaining phoneme sequences corresponding to each sentence of Chinese voice respectively and time stamps corresponding to each phoneme sequence respectively, and combining to form a phoneme sequence group corresponding to the video to be selected; further obtaining a phoneme sequence group corresponding to each video to be selected respectively;
The mouth shape video segmentation module is used for, respectively for each video to be selected, dividing the video to be selected according to the different pronunciation mouth shapes of adjacent frames and according to the phoneme sequence group corresponding to the video to be selected, so as to obtain each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video;
The feature extraction module is used for obtaining the appointed type features of preset appointed phonemes in the phoneme sequences to be processed, which are respectively corresponding to the mouth shape videos, and obtaining the appointed type mouth shape features, which are respectively corresponding to the mouth shape videos;
The model training module is used for training aiming at a preset appointed network according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video by taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input and the appointed type mouth shape characteristics corresponding to the mouth shape video as output to form a mouth shape characteristic generating module;
the model execution module is used for obtaining the appointed type mouth shape characteristic corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio by taking the appointed type characteristic of each phoneme as input and applying the mouth shape characteristic generation module according to the appointed type characteristic of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, so as to obtain the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio;
The correction loading module is used for correcting the face mouth shape of the corresponding video segments in the target face video for each sentence of Chinese speech, according to the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio and the preset timestamps at which each sentence of Chinese speech in the target audio corresponds to the target face video, and for loading each sentence of Chinese speech according to the timestamp at which it corresponds to the target face video, so as to realize the synchronization between the target audio and the target face video.
The invention designs a computer device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of a face video and audio synchronization method based on joint training when executing the computer program.
And designing a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a face video and audio synchronization method based on joint training.
Compared with the prior art, the face video and audio synchronization method based on the joint training has the following technical effects:
The face video and audio synchronization method based on joint training designed by the invention adopts a new logical relationship: based on the phoneme sequences to be processed corresponding to the pronunciation mouth shapes in a sample video, network training is carried out by combining the specified type features of the preset specified phonemes in the phoneme sequence to be processed corresponding to each mouth shape video with the specified type mouth shape features corresponding to each mouth shape video, so as to obtain a mouth shape feature generation module. On this basis, the specified type mouth shape feature sequences corresponding to the sentences of Chinese speech in the target audio are obtained and used to correct the face mouth shape of the corresponding video segments in the target face video, and each sentence of Chinese speech in the target audio is loaded according to the corresponding timestamp, thereby achieving synchronization between the target audio and the target face video. The overall design realizes accurate and stable synthesis of the target audio and the target video and improves the quality of the resulting audio and video.
Drawings
FIG. 1 is a flow chart of a method for synchronizing synthetic face and synthetic voice based on joint training according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
The invention designs a face video and audio synchronization method based on joint training, which is shown in fig. 1, and generates a mouth shape feature generation module through the following steps I to V.
And I, acquiring each section of audio in the sample video, acquiring each video to be selected corresponding to each section of audio respectively, and then entering the step II.
Step II, respectively segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese voice in the audio and the time stamp corresponding to each Chinese voice, further obtaining the phoneme sequence corresponding to each sentence of Chinese voice and the time stamp corresponding to each phoneme sequence, and combining to form the phoneme sequence group corresponding to the video to be selected; and further obtaining phoneme sequence groups corresponding to the videos to be selected respectively, and then entering the step III.
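The following is a minimal Python sketch of step II under simplifying assumptions: the sentence-level segmentation and the phoneme-level alignment are assumed to be supplied by external tools, so the data structures and the align_phonemes helper below are illustrative placeholders rather than components defined by the patent.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SentenceSegment:
    text: str      # one sentence of Chinese speech
    start: float   # sentence start time within the audio of the video to be selected (s)
    end: float     # sentence end time (s)

@dataclass
class PhonemeSequence:
    phonemes: List[str]                     # phoneme labels for the sentence
    timestamps: List[Tuple[float, float]]   # (start, end) per phoneme, same length

def build_phoneme_sequence_group(sentences: List[SentenceSegment],
                                 align_phonemes: Callable) -> List[PhonemeSequence]:
    # align_phonemes(text, start, end) is assumed to return (phoneme, start, end)
    # triples for one sentence, e.g. from a forced aligner.
    group = []
    for s in sentences:
        aligned = align_phonemes(s.text, s.start, s.end)
        group.append(PhonemeSequence(
            phonemes=[p for p, _, _ in aligned],
            timestamps=[(b, e) for _, b, e in aligned],
        ))
    return group   # the phoneme sequence group corresponding to one video to be selected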
Step III, respectively aiming at each video to be selected, dividing and obtaining each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video according to different pronunciation mouth shapes of adjacent frames according to a phoneme sequence group corresponding to the video to be selected; and further obtaining all mouth shape videos corresponding to all the candidate videos and phoneme sequences to be processed corresponding to all the mouth shape videos respectively, and then entering the step IV.
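A minimal sketch of step III follows, assuming a per-frame mouth shape label (for example a cluster id derived from lip landmarks) is already available; frame_mouth_ids, fps and the phoneme timestamps are illustrative assumptions, not quantities fixed by the patent.

def split_by_mouth_shape(frame_mouth_ids, phonemes, phoneme_times, fps=25.0):
    # Group consecutive frames that share a pronunciation mouth shape into one
    # mouth shape video, then attach the phonemes whose timestamps fall inside it.
    clips, start = [], 0
    for i in range(1, len(frame_mouth_ids) + 1):
        # a new clip starts whenever the mouth shape of adjacent frames differs
        if i == len(frame_mouth_ids) or frame_mouth_ids[i] != frame_mouth_ids[i - 1]:
            clips.append((start, i - 1))
            start = i
    result = []
    for f0, f1 in clips:
        t0, t1 = f0 / fps, (f1 + 1) / fps
        # the phoneme sequence to be processed for this mouth shape video
        in_clip = [p for p, (b, e) in zip(phonemes, phoneme_times) if b < t1 and e > t0]
        result.append(((f0, f1), in_clip))
    return result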
And IV, obtaining the appointed type characteristics of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, obtaining the appointed type mouth shape characteristics, which are respectively corresponding to each mouth shape video, and then entering the step V.
In practice, 1 s of audio carries 100 frames of phoneme feature labels, of which 30 frames are landmark key phonemes, i.e., landmark key phonemes : phoneme labels ≈ 1 : 3. Specifically, the specified type feature of a phoneme is the embedding feature of the phoneme or the one-hot feature of the phoneme. In step IV, if the specified type feature of the phoneme is the embedding feature of the phoneme, the following operation 1 is executed for the phoneme sequence to be processed corresponding to each mouth shape video, so as to obtain the specified type feature of the preset specified phoneme in the phoneme sequence to be processed corresponding to that mouth shape video; the specified type features of the preset specified phonemes in the phoneme sequences to be processed corresponding to all the mouth shape videos are thereby obtained.
Operation 1: each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained.
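A minimal sketch of operation 1 is given below; key_positions (the indices of the landmark key phonemes within the clip's phoneme sequence to be processed) and the window size are assumptions used for illustration only.

from collections import Counter

def specified_phoneme_by_window(phonemes, key_positions, window=2):
    # Collect `window` phonemes before and after each landmark key phoneme as
    # phonemes to be selected, then vote for the one with the highest frequency.
    candidates = []
    for k in key_positions:
        lo, hi = max(0, k - window), min(len(phonemes), k + window + 1)
        candidates.extend(phonemes[lo:hi])
    return Counter(candidates).most_common(1)[0][0] if candidates else None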
Meanwhile, regarding the obtaining of the specific type feature of the phonemes, a second manner is designed, that is, if the specific type feature of the phonemes is embedding features of the phonemes or one-hot features, in the step IV, the following operation 2 is executed for the to-be-processed phoneme sequences corresponding to each mouth shape video respectively, so as to obtain the specific type feature of the preset specific phonemes in the to-be-processed phoneme sequences corresponding to the mouth shape video; and further obtaining the appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed, which are respectively corresponding to the mouth shape videos.
Operation 2: obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained.
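A minimal sketch of operation 2 follows; the scaling factor alpha stands in for the preset parameter value mentioned above (for example the ratio between the 100 phoneme-label frames and the 30 landmark frames per second) and is an assumption, not a value fixed by the patent.

from collections import Counter

def specified_phoneme_by_scaling(phonemes, key_positions, alpha=100.0 / 30.0):
    # Map each landmark key phoneme position onto the phoneme track by multiplying
    # by alpha and rounding, then vote for the phoneme with the highest frequency.
    candidates = []
    for k in key_positions:
        idx = int(round(k * alpha))
        if 0 <= idx < len(phonemes):
            candidates.append(phonemes[idx])
    return Counter(candidates).most_common(1)[0][0] if candidates else None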
In practical application, embedding features of the phonemes are obtained as follows in steps a to c.
In step a, according to the length of each section of audio in the sample video, combined with the timestamps of the pinyin marking of the initial consonants and vowels within each sentence, pronunciation-level marking is executed for each Chinese sentence according to a preset initial-consonant-to-vowel duration ratio, such as 1:2 or 1:1, and face alignment is applied to carry out phoneme marking on the pronunciation-level marking, so as to obtain the marked phoneme sequence corresponding to each Chinese sentence.
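As a minimal sketch of the pronunciation-level marking in step a, the helper below splits one syllable's timestamp into an initial-consonant span and a vowel span at a preset duration ratio; the syllable decomposition passed in is illustrative and would in practice come from the pinyin marking of the sentence.

def split_syllable(start, end, initial, final, ratio=(1, 2)):
    # Return pronunciation-level labels [(phone, t0, t1), ...] for one syllable,
    # giving the initial consonant ratio[0] parts and the vowel ratio[1] parts of the time.
    if not initial:   # zero-initial syllable such as "an"
        return [(final, start, end)]
    cut = start + (end - start) * ratio[0] / (ratio[0] + ratio[1])
    return [(initial, start, cut), (final, cut, end)]

# example: the syllable "ma3" occupying 0.30 s to 0.60 s with a 1:2 ratio
print(split_syllable(0.30, 0.60, "m", "a3"))   # [('m', 0.3, 0.4), ('a3', 0.4, 0.6)]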
And b, applying a phoneme coding network layer (phone embedding network layer) in the TTS, aiming at each marked phoneme in the marked phoneme sequence corresponding to each sentence Chinese sentence, converting each marked phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each marked phoneme in the marked phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c.
In practice, the conversion of the labeling phonemes into a phoneme-encoded TTS representation is implemented as follows, using a phoneme-encoding network layer (phone embedding network layer) in the TTS.
# === hparams ===
# n_symbols = len(symbols)               # vocabulary size, 320 in total
# symbols_embedding_dim = 512
from math import sqrt
import torch.nn as nn

# initialization (inside the TTS model's __init__)
self.embedding = nn.Embedding(hparams.n_symbols, hparams.symbols_embedding_dim)
std = sqrt(2.0 / (hparams.n_symbols + hparams.symbols_embedding_dim))
val = sqrt(3.0) * std  # uniform bounds corresponding to this std
self.embedding.weight.data.uniform_(-val, val)

# use (inside forward): text_inputs holds the ids of the phoneme symbols
embedded_inputs = self.embedding(text_inputs).transpose(1, 2)
In practical implementation, the result of converting labeling phonemes into phoneme-encoded TTS representations is as follows, in the form "labeling phoneme : phoneme-encoded TTS representation":
uei1:ui1
uei2:ui2
uei3:ui3
uei4:ui4
uei5:ui5
ng:ng1
iou1:ou1
iou2:ou2
iou3:ou3
iou4:ou4
ii1:i1
ii2:i2
ii3:i3
ii4:i4
ii5:i5
io1:o1
io2:o2
io3:o3
io4:o4
io5:o5
va1:a1
va2:a2
va3:a3
va4:a4
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
In practical implementation, the result of converting phoneme-encoded TTS representations into phoneme embedding features is as follows, in the form "phoneme-encoded TTS representation : phoneme embedding feature":
@iao5:@iao1
@iu5:@iu1
@ng1:@n
@ua5:@ua1
@uai5:@uai1
@ue5:@ue1
@ui5:@ui1
@v1:@v2
@v5:@v2
@ve1:@ve4
@ve2:@ve4
@ve3:@ve4
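Putting steps b and c together, the sketch below looks up one embedding vector per phoneme by chaining the two mapping tables above with the embedding layer initialized earlier; symbol_to_id and the mapping dictionaries are assumptions standing in for the TTS front end's own tables.

import torch

def phoneme_embedding_features(labeling_phonemes, label_to_tts, tts_merge, symbol_to_id, embedding):
    # labeling phoneme -> phoneme-encoded TTS representation -> merged symbol -> embedding vector
    ids = []
    for p in labeling_phonemes:
        sym = label_to_tts.get(p, p)   # e.g. "uei1" -> "ui1"
        sym = tts_merge.get(sym, sym)  # e.g. "@v1"  -> "@v2"
        ids.append(symbol_to_id[sym])
    with torch.no_grad():
        return embedding(torch.tensor(ids))   # shape: (num_phonemes, embedding_dim)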
And V, according to the specified type mouth shape characteristics corresponding to each mouth shape video and the specified type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the specified type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, taking the specified type mouth shape characteristics corresponding to the mouth shape video as output, and training aiming at a preset specified network such as a GAN network to form a mouth shape characteristic generating module.
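The following is a minimal sketch of step V with a GAN as the preset specified network: the generator maps the specified type features of the phonemes to the specified type mouth shape features, and the discriminator judges whether a (phoneme feature, mouth shape feature) pair is real or generated. The layer sizes, feature dimensions and training schedule are illustrative assumptions, not values fixed by the patent.

import torch
import torch.nn as nn

PHONE_DIM, MOUTH_DIM, HIDDEN = 512, 40, 256   # assumed feature dimensions

generator = nn.Sequential(nn.Linear(PHONE_DIM, HIDDEN), nn.ReLU(),
                          nn.Linear(HIDDEN, MOUTH_DIM))
discriminator = nn.Sequential(nn.Linear(PHONE_DIM + MOUTH_DIM, HIDDEN), nn.ReLU(),
                              nn.Linear(HIDDEN, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(phone_feat, mouth_feat):
    # phone_feat: (B, PHONE_DIM) specified type phoneme features (input)
    # mouth_feat: (B, MOUTH_DIM) specified type mouth shape features (output/target)
    # discriminator step: real pairs vs. generated pairs
    fake = generator(phone_feat).detach()
    d_real = discriminator(torch.cat([phone_feat, mouth_feat], dim=1))
    d_fake = discriminator(torch.cat([phone_feat, fake], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # generator step: try to fool the discriminator
    gen = generator(phone_feat)
    d_gen = discriminator(torch.cat([phone_feat, gen], dim=1))
    loss_g = bce(d_gen, torch.ones_like(d_gen))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

The trained generator then serves as the mouth shape feature generation module.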
Based on the mouth shape feature generation module obtained above, the module can be applied, as shown in fig. 1, to obtain synchronization between the target audio and the target face video according to the following steps A to C.
Step A, according to the step I to the step IV, obtaining phoneme sequences corresponding to each sentence of Chinese voice in the target audio, obtaining the appointed type characteristics of each phoneme in each phoneme sequence, and then entering the step B.
And B, according to the appointed type characteristics of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, taking the appointed type characteristics of each phoneme as input, applying a mouth shape characteristic generating module to obtain the appointed type mouth shape characteristics corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, further obtaining the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio, and then entering the step C.
And C, correcting the face mouth shape of the corresponding video segment in the target face video corresponding to each sentence of Chinese voice according to the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese voice in the target audio and the time stamp corresponding to the target face video corresponding to each sentence of Chinese voice in the preset target audio, and loading each sentence of Chinese voice according to the time stamp corresponding to the target face video so as to realize the synchronization between the target audio and the target face video.
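At inference time, steps B and C can be sketched as follows; correct_mouth_frames and mux_audio_segment are hypothetical helpers standing in for the face-editing and audio-loading operations, which the patent does not describe at code level.

import torch

def synchronize(sentences, target_frames, fps, generator,
                correct_mouth_frames, mux_audio_segment):
    # sentences: one dict per sentence of Chinese speech in the target audio, with
    # 'phone_feats' (an (N, PHONE_DIM) tensor), 'start'/'end' timestamps and 'audio'.
    for s in sentences:
        with torch.no_grad():
            mouth_seq = generator(s["phone_feats"])   # step B: mouth shape feature sequence
        f0, f1 = int(s["start"] * fps), int(s["end"] * fps)
        # step C: correct the face mouth shape of the matching segment of the target face video ...
        target_frames[f0:f1] = correct_mouth_frames(target_frames[f0:f1], mouth_seq)
        # ... and load this sentence of Chinese speech at the same timestamps
        mux_audio_segment(s["audio"], s["start"], s["end"])
    return target_frames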
Applying the face video and audio synchronization method based on joint training designed by the invention in practice, the invention further specifically designs a system for the face video and audio synchronization method based on joint training, which comprises an audio-video separation and segmentation module, a phoneme sequence obtaining module, a mouth shape video segmentation module, a feature extraction module, a model training module, a model execution module and a correction loading module.
The audio-video separation and segmentation module is used for acquiring each section of audio in the sample video and acquiring each video to be selected corresponding to each section of audio respectively; and the method is used for respectively segmenting the audio corresponding to each video to be selected, and obtaining each sentence of Chinese voice in the audio and the corresponding time stamp of each Chinese voice.
The phoneme sequence obtaining module is used for obtaining phoneme sequences corresponding to each sentence of Chinese voice respectively and time stamps corresponding to each phoneme sequence respectively, and combining to form a phoneme sequence group corresponding to the video to be selected; and further obtaining a phoneme sequence group corresponding to each video to be selected.
The mouth shape video segmentation module is used for, respectively for each video to be selected, dividing the video to be selected according to the different pronunciation mouth shapes of adjacent frames and according to the phoneme sequence group corresponding to the video to be selected, so as to obtain each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video.
The feature extraction module is used for obtaining the appointed type features of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, and obtaining the appointed type mouth shape features, which are respectively corresponding to each mouth shape video.
The model training module is used for training aiming at a preset appointed network according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, and taking the appointed type mouth shape characteristics corresponding to the mouth shape video as output to form the mouth shape characteristic generating module.
The model execution module is used for obtaining the appointed type mouth shape characteristic corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio by taking the appointed type characteristic of each phoneme as input and applying the mouth shape characteristic generation module according to the appointed type characteristic of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, so as to obtain the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio.
The correction loading module is used for correcting the face mouth shape of the corresponding video segments in the target face video for each sentence of Chinese speech, according to the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio and the preset timestamps at which each sentence of Chinese speech in the target audio corresponds to the target face video, and for loading each sentence of Chinese speech according to the timestamp at which it corresponds to the target face video, so as to realize the synchronization between the target audio and the target face video.
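As a rough sketch of how the seven modules could be composed, the class below wires them into a training path and a synchronization path; the module objects are placeholders for the components described above, not an actual API defined by the patent.

class FaceAudioSyncSystem:
    def __init__(self, separator, phoneme_builder, mouth_splitter,
                 feature_extractor, trainer, executor, corrector):
        self.separator = separator               # audio-video separation and segmentation module
        self.phoneme_builder = phoneme_builder   # phoneme sequence obtaining module
        self.mouth_splitter = mouth_splitter     # mouth shape video segmentation module
        self.feature_extractor = feature_extractor
        self.trainer = trainer                   # model training module
        self.executor = executor                 # model execution module
        self.corrector = corrector               # correction loading module

    def train(self, sample_videos):
        audio, candidates = self.separator(sample_videos)
        groups = self.phoneme_builder(audio, candidates)
        clips, phoneme_seqs = self.mouth_splitter(candidates, groups)
        phone_feats, mouth_feats = self.feature_extractor(clips, phoneme_seqs)
        return self.trainer(phone_feats, mouth_feats)   # mouth shape feature generation module

    def synchronize(self, generator, target_audio, target_face_video):
        mouth_seqs = self.executor(generator, target_audio)
        return self.corrector(mouth_seqs, target_audio, target_face_video)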
The invention designs a computer device which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of a face video and audio synchronization method based on joint training when executing the computer program.
And designing a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a face video and audio synchronization method based on joint training.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of one skilled in the art without departing from the spirit of the present invention.
Any modifications or variations, which are apparent to those skilled in the art in light of the above teachings, are intended to be included within the scope of this invention without departing from its spirit.

Claims (5)

1. A face video and audio synchronization method based on joint training is characterized in that a mouth shape feature generation module is generated through the following steps I to V, and then the mouth shape feature generation module is applied to obtain the synchronization between a target audio and a target face video according to the following steps A to C;
step I, acquiring each section of audio in the sample video, acquiring each video to be selected corresponding to each section of audio respectively, and then entering step II;
Step II, respectively segmenting the audio corresponding to each video to be selected, obtaining each sentence of Chinese voice in the audio and the time stamp corresponding to each Chinese voice, further obtaining the phoneme sequence corresponding to each sentence of Chinese voice and the time stamp corresponding to each phoneme sequence, and combining to form the phoneme sequence group corresponding to the video to be selected; obtaining phoneme sequence groups corresponding to the videos to be selected respectively, and then entering a step III;
Step III, respectively aiming at each video to be selected, dividing and obtaining each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video according to different pronunciation mouth shapes of adjacent frames according to a phoneme sequence group corresponding to the video to be selected; obtaining all mouth shape videos corresponding to all the video to be selected and phoneme sequences to be processed corresponding to all the mouth shape videos respectively, and then entering the step IV;
Step IV, obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequence to be processed, which are respectively corresponding to each mouth shape video, obtaining appointed type mouth shape characteristics, which are respectively corresponding to each mouth shape video, and then entering step V;
step V, according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video, taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input, taking the appointed type mouth shape characteristics corresponding to the mouth shape video as output, and training aiming at a preset appointed network to form a mouth shape characteristic generating module;
Step A, according to the step I to the step IV, obtaining phoneme sequences corresponding to each sentence of Chinese voice in the target audio, obtaining appointed type characteristics of each phoneme in each phoneme sequence, and then entering the step B;
B, according to the appointed type characteristics of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, taking the appointed type characteristics of each phoneme as input, applying a mouth shape characteristic generating module to obtain the appointed type mouth shape characteristics corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, further obtaining the appointed type mouth shape characteristic sequences corresponding to each sentence of Chinese speech in the target audio, and then entering the step C;
Step C, according to the appointed type mouth feature sequences respectively corresponding to each sentence of Chinese voice in the target audio and the time stamps respectively corresponding to each sentence of Chinese voice in the preset target audio and the target face video, correcting the face mouth shape of the corresponding video segment in the target face video respectively corresponding to each sentence of Chinese voice, and loading each sentence of Chinese voice according to the time stamp corresponding to the target face video, so that the synchronization between the target audio and the target face video is realized;
In the step IV, the following operations are executed for the phoneme sequences to be processed corresponding to each mouth shape video respectively, so as to obtain the specified type characteristics of the preset specified phonemes in the phoneme sequences to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained;
In the step IV, the following operations are performed for the to-be-processed phoneme sequences corresponding to the mouth-shaped videos respectively to obtain the specific type characteristics of the preset specific phonemes in the to-be-processed phoneme sequences corresponding to the mouth-shaped videos; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained;
Obtaining embedding characteristics of phonemes according to the following steps a to c;
Step a, according to the length of each section of audio in the sample video, the time stamp of the corresponding pinyin marking of the inner initial consonant and vowel of each sentence is combined, according to the preset initial consonant and vowel duration ratio, the pronunciation level marking is executed for each sentence of Chinese sentence, and the face alignment is applied to carry out phoneme marking for the pronunciation level marking, so as to obtain the marked phoneme sequence corresponding to each sentence of Chinese sentence respectively;
B, applying a phoneme coding network layer in the TTS, aiming at each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, converting each labeling phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c;
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
2. The method for synchronizing facial video and audio based on joint training according to claim 1, wherein: in the step V, according to the specified type of mouth shape feature corresponding to each mouth shape video and the specified type of feature of the phoneme sequence to be processed corresponding to each mouth shape video, the specified type of feature of the phoneme sequence to be processed corresponding to the mouth shape video is taken as input, the specified type of mouth shape feature corresponding to the mouth shape video is taken as output, and training is performed aiming at the GAN network to form a mouth shape feature generating module.
3. A system for implementing the face video and audio synchronization method based on joint training according to claim 1 or 2, which is characterized by comprising an audio/video separation and segmentation module, a phoneme sequence obtaining module, a mouth shape video segmentation module, a feature extraction module, a model training module, a model execution module and a correction loading module;
The audio-video separation and segmentation module is used for acquiring each section of audio in the sample video and acquiring each video to be selected corresponding to each section of audio respectively; the method comprises the steps of selecting each video to be selected, and segmenting audio corresponding to the video to be selected to obtain each sentence of Chinese voice in the audio and each corresponding time stamp of each Chinese voice;
The phoneme sequence obtaining module is used for obtaining phoneme sequences corresponding to each sentence of Chinese voice respectively and time stamps corresponding to each phoneme sequence respectively, and combining to form a phoneme sequence group corresponding to the video to be selected; further obtaining a phoneme sequence group corresponding to each video to be selected respectively;
the mouth shape video segmentation module is used for, respectively for each video to be selected, dividing the video to be selected according to the different pronunciation mouth shapes of adjacent frames and according to the phoneme sequence group corresponding to the video to be selected, so as to obtain each mouth shape video in the video to be selected and the phoneme sequence to be processed corresponding to each mouth shape video;
The feature extraction module is used for obtaining the appointed type features of preset appointed phonemes in the phoneme sequences to be processed, which are respectively corresponding to the mouth shape videos, and obtaining the appointed type mouth shape features, which are respectively corresponding to the mouth shape videos;
The model training module is used for training aiming at a preset appointed network according to the appointed type mouth shape characteristics corresponding to each mouth shape video and the appointed type characteristics of the phoneme sequence to be processed corresponding to each mouth shape video by taking the appointed type characteristics of the phoneme sequence to be processed corresponding to the mouth shape video as input and the appointed type mouth shape characteristics corresponding to the mouth shape video as output to form a mouth shape characteristic generating module;
the model execution module is used for obtaining the appointed type mouth shape characteristic corresponding to each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio by taking the appointed type characteristic of each phoneme as input and applying the mouth shape characteristic generation module according to the appointed type characteristic of each phoneme in the phoneme sequence corresponding to each sentence of Chinese speech in the target audio, so as to obtain the appointed type mouth shape characteristic sequence corresponding to each sentence of Chinese speech in the target audio;
The correction loading module is used for correcting the face mouth shape of the corresponding video segments in the target face video for each sentence of Chinese speech, according to the specified type mouth shape feature sequence corresponding to each sentence of Chinese speech in the target audio and the preset timestamps at which each sentence of Chinese speech in the target audio corresponds to the target face video, and for loading each sentence of Chinese speech according to the timestamp at which it corresponds to the target face video, so as to realize the synchronization between the target audio and the target face video;
The specific type features of the phonemes are embedding features of the phonemes, and the feature extraction module performs the following operations respectively for the sequences of the phonemes to be processed corresponding to each mouth shape video to obtain the specific type features of the preset specific phonemes in the sequences of the phonemes to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video is obtained, a preset number of phonemes before and after each landmark key phoneme are selected to be used as each phoneme to be selected, and the phoneme to be selected with the highest frequency is selected to be used as a preset appointed phoneme corresponding to the mouth shape video; then embedding features of preset appointed phonemes corresponding to the mouth shape video are obtained;
The specific type feature of the phonemes is embedding features of the phonemes or one-hot features, and the feature extraction module performs the following operations respectively for the sequences of the phonemes to be processed corresponding to each mouth shape video to obtain the specific type feature of the preset specific phonemes in the sequences of the phonemes to be processed corresponding to the mouth shape video; further obtaining appointed type characteristics of preset appointed phonemes in the phoneme sequences to be processed corresponding to the mouth-shaped videos respectively;
Obtaining each landmark key phoneme in a phoneme sequence to be processed corresponding to the mouth shape video, multiplying the position of each landmark key phoneme by a preset parameter value, and obtaining each position in a rounding manner to obtain phonemes corresponding to each position respectively as each phoneme to be selected; then selecting the phoneme to be selected with the highest frequency as a preset appointed phoneme corresponding to the mouth shape video; finally, the appointed type characteristics of the preset appointed phonemes corresponding to the mouth shape video are obtained;
Obtaining embedding characteristics of phonemes according to the following steps a to c;
Step a, according to the length of each section of audio in the sample video, the time stamp of the corresponding pinyin marking of the inner initial consonant and vowel of each sentence is combined, according to the preset initial consonant and vowel duration ratio, the pronunciation level marking is executed for each sentence of Chinese sentence, and the face alignment is applied to carry out phoneme marking for the pronunciation level marking, so as to obtain the marked phoneme sequence corresponding to each sentence of Chinese sentence respectively;
B, applying a phoneme coding network layer in the TTS, aiming at each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, converting each labeling phoneme into each phoneme coding TTS representation, further obtaining the phoneme coding TTS representation of each labeling phoneme in the labeling phoneme sequence corresponding to each sentence Chinese sentence, and then entering the step c;
and c, respectively converting the phoneme coding TTS representation of each labeling phoneme in the corresponding labeling phoneme sequence of each sentence Chinese sentence to obtain embedding features of each labeling phoneme, namely embedding features of the phonemes.
4. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of claim 1 or 2 when executing the computer program.
5. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of claim 1 or 2.
CN202111159455.4A 2021-09-30 2021-09-30 Face video and audio synchronization method and system based on joint training Active CN113825005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111159455.4A CN113825005B (en) 2021-09-30 2021-09-30 Face video and audio synchronization method and system based on joint training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111159455.4A CN113825005B (en) 2021-09-30 2021-09-30 Face video and audio synchronization method and system based on joint training

Publications (2)

Publication Number Publication Date
CN113825005A CN113825005A (en) 2021-12-21
CN113825005B true CN113825005B (en) 2024-05-24

Family

ID=78919849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111159455.4A Active CN113825005B (en) 2021-09-30 2021-09-30 Face video and audio synchronization method and system based on joint training

Country Status (1)

Country Link
CN (1) CN113825005B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115065844B (en) * 2022-05-24 2023-09-12 北京跳悦智能科技有限公司 Self-adaptive adjustment method for motion rhythm of anchor limb
CN114945110B (en) * 2022-05-31 2023-10-24 深圳市优必选科技股份有限公司 Method and device for synthesizing voice head video, terminal equipment and readable storage medium
CN116992309B (en) * 2023-09-26 2023-12-19 苏州青颖飞帆软件科技股份有限公司 Training method of voice mouth shape synchronous detection model, electronic equipment and storage medium
CN118471250B (en) * 2024-06-20 2025-02-07 北京林业大学 A method for automatically generating lip shape and expression by inputting speech


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7046300B2 (en) * 2002-11-29 2006-05-16 International Business Machines Corporation Assessing consistency between facial motion and speech signals in video
US7133535B2 (en) * 2002-12-21 2006-11-07 Microsoft Corp. System and method for real time lip synchronization
US20080111887A1 (en) * 2006-11-13 2008-05-15 Pixel Instruments, Corp. Method, system, and program product for measuring audio video synchronization independent of speaker characteristics
US8224652B2 (en) * 2008-09-26 2012-07-17 Microsoft Corporation Speech and text driven HMM-based body animation synthesis
EP3438952B1 (en) * 2017-08-02 2025-05-07 Tata Consultancy Services Limited Systems and methods for intelligent generation of inclusive system designs
US11526808B2 (en) * 2019-05-29 2022-12-13 The Board Of Trustees Of The Leland Stanford Junior University Machine learning based generation of ontology for structural and functional mapping

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019219968A1 (en) * 2018-05-18 2019-11-21 Deepmind Technologies Limited Visual speech recognition by phoneme prediction
CN110149548A (en) * 2018-09-26 2019-08-20 腾讯科技(深圳)有限公司 Video dubbing method, electronic device and readable storage medium storing program for executing
CN111741326A (en) * 2020-06-30 2020-10-02 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN112750185A (en) * 2021-01-19 2021-05-04 清华大学 Portrait video generation method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Automatic Recognition and Verification of Lip Synchronization; Hou Yarong, Xiong Zhang; Computer Engineering and Design; 2004-02-28 (No. 02); full text *

Also Published As

Publication number Publication date
CN113825005A (en) 2021-12-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant