
CN119815096B - A lip synchronization method and system based on multiple reference frames and controllable style - Google Patents

A lip synchronization method and system based on multiple reference frames and controllable style

Info

Publication number
CN119815096B
CN119815096B (application CN202510274090.1A)
Authority
CN
China
Prior art keywords: lip, driving, audio, key points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510274090.1A
Other languages
Chinese (zh)
Other versions
CN119815096A (en)
Inventor
钟添芸
赵洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202510274090.1A priority Critical patent/CN119815096B/en
Publication of CN119815096A publication Critical patent/CN119815096A/en
Application granted granted Critical
Publication of CN119815096B publication Critical patent/CN119815096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Processing Or Creating Images (AREA)

Abstract


The present invention discloses a lip synchronization method and system based on multiple reference frames and controllable style, belonging to the field of video generation. Speaker video data and driving audio data are obtained. Two groups of video frames are randomly selected from the video as the first and second multi-reference images, and the driving keypoints of the reference images are taken as reference keypoints. The speaker's lip movement style features are extracted from the video and, together with the driving audio features and the reference keypoints of the first multi-reference images, fed to a sparse-to-fine audio-to-keypoint module that predicts fine keypoints. The second multi-reference images and their reference keypoints are then combined with the fine keypoints predicted by the audio-to-keypoint module, and a multi-scale aggregation ratio is introduced in the keypoint-to-video module to generate lip-synchronized frames. At generation time, the speaker's lip movement style features can either be extracted from the original video or specified directly, achieving style-controllable lip synchronization.

Description

Lip synchronization method and system based on multiple reference frames and controllable style
Technical Field
The invention relates to the field of video generation, in particular to a lip synchronization method and a lip synchronization system based on multiple reference frames and controllable styles.
Background
At present, the task of generating speaker (talking-head) videos has attracted a great deal of attention and is an important area of artificial intelligence. Lip synchronization is a common task in this field: given a piece of driving audio and an audio-video segment of a speaker, it modifies the speaker's lip region so that the lips are synchronized with the driving audio. Lip synchronization is widely used in application scenarios such as video dubbing, cross-language education, and e-commerce live streaming.
The mainstream existing lip synchronization methods adopt a two-stage keypoint pipeline: keypoints extracted by a pre-trained tool serve as an intermediate representation, the task is split into audio-to-keypoint and keypoint-to-video generation sub-tasks, and two sub-models are trained separately. Compared with tasks without speaking-style input, such as single-image driving, the lip synchronization task naturally offers richer information, for example the speaker's speaking-style characteristics and the texture information carried by additional reference frames. Existing methods exploit this information incompletely, with two main problems: they ignore the speaker's speaking-style characteristics in the original video, and they generate the synchronized lips using only a single reference frame or a simple average of multiple reference frames.
Disclosure of Invention
In order to overcome the problems, the invention provides a lip synchronization method and a lip synchronization system based on multiple reference frames and controllable styles, which fully utilize the style characteristic information and the multiple reference frame information of a speaker existing in an original video to realize lip synchronization video generation with controllable styles and high fidelity.
The invention adopts the specific technical scheme that:
In a first aspect, the present invention provides a lip synchronization method based on multi-reference frames and controllable styles, comprising the following steps:
S1, acquiring speaker video data and driving audio data, and extracting driving audio characteristics of each audio frame and driving key points of each video frame;
S2, randomly selecting two groups of video frames from the video to serve as a first multi-reference picture and a second multi-reference picture respectively, and taking driving key points of the reference pictures as reference key points;
S3, extracting lip movement style characteristics of a speaker from the video, and predicting fine key points by utilizing an audio-to-key point module by combining the driving audio characteristics and reference key points of the first multi-reference graph;
S4, combining the second multi-reference images, their reference key points, and the fine key points predicted by the audio-to-keypoint module, and generating lip-synchronized frames with the keypoint-to-video module;
S5, calculating loss according to the prediction result of the audio-to-key point module and the generation result of the key point-to-video module, and updating the module;
S6, given the driving audio and the original video, extracting lip movement style characteristics of a speaker from the original video or directly giving the lip movement style characteristics of the speaker, and generating a synthesized video synchronous with the lip of the given driving audio by utilizing the trained audio-to-key point module and the key point-to-video module to finish the lip synchronization task.
Further, the first multi-reference images and the second multi-reference images are selected independently of each other.
Further, the driving key points comprise 3D driving key points and 2D driving key points, the reference key points comprise 3D reference key points and 2D reference key points, and the fine key points predicted by the audio-to-key point module are fine 2D driving key points.
Further, the lip movement style feature consists of two scalar attributes, lip amplitude and lip speed, and the method for extracting the speaker's lip movement style feature from the video is as follows:
Screening lip key points in 3D driving key points of speakers in each frame of video image, wherein the lip key points comprise an upper lip key point and a lower lip key point;
Characterizing the lip amplitude by the average of the y-coordinate differences between the upper and lower lip key points;
and characterizing the lip speed by the temporal first-order average difference of the absolute value of the y-coordinate difference between the upper and lower lip key points.
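The two style attributes above can be sketched in a few lines. This is a hedged illustration, assuming each frame is reduced to one representative y-coordinate for the upper lip and one for the lower lip; the real method operates on full keypoint sets.

```python
# Illustrative sketch (not the patent's exact code): lip amplitude is the
# mean mouth opening; lip speed is the mean first-order temporal
# difference of the absolute opening.

def lip_style_features(upper_y, lower_y):
    """upper_y, lower_y: per-frame representative y-coordinates of the
    upper and lower lip keypoints. Returns (lip_amplitude, lip_speed)."""
    assert len(upper_y) == len(lower_y) and len(upper_y) >= 2
    # y-coordinate difference between upper and lower lip per frame
    d = [lo - up for up, lo in zip(upper_y, lower_y)]
    # lip amplitude: mean opening over the whole clip
    amplitude = sum(d) / len(d)
    # lip speed: mean absolute first-order difference of |d_t|
    speed = sum(abs(abs(d[t]) - abs(d[t - 1]))
                for t in range(1, len(d))) / (len(d) - 1)
    return amplitude, speed
```

Both outputs are scalars for the whole clip, which is what makes the style directly specifiable by a user at inference time.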
Further, the audio-to-keypoint module comprises:
A sparse 3D driving key point predictor which takes driving audio characteristics of each audio frame, lip movement style characteristics of a speaker and 3D reference key points of a first multi-reference graph as inputs and outputs predicted sparse 3D driving key points;
And the fine 2D driving key point predictor takes the 2D reference key points of the first multi-reference graph and the sparse 3D driving key points as inputs and outputs predicted fine 2D driving key points.
Further, the sparse 3D driving keypoint predictor includes a reference keypoint encoder, a lip amplitude embedding layer, a lip speed embedding layer, an audio encoder and a driving keypoint decoder, and the calculation process includes:
the reference key point encoder encodes the 3D reference key points of the first multi-reference graph to obtain encoded 3D reference key point characteristics;
The lip amplitude embedding layer and the lip speed embedding layer respectively encode and embed the lip amplitude and the lip speed in the lip movement style characteristics of a speaker to obtain lip movement style embedding characteristics formed by the lip amplitude embedding and the lip speed embedding;
The audio encoder encodes the driving audio characteristics of each audio frame to obtain audio encoding characteristics;
the encoded 3D reference keypoint features and the lip movement style embedding features are expanded to match the audio encoding features along the time dimension and then concatenated with them to obtain the sparse 3D driving keypoint features, whose number of frames is consistent with the audio frame count;
And decoding the sparse 3D driving key point features through a driving key point decoder to obtain predicted sparse 3D driving key points.
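The expand-and-concatenate step above can be shown at the shape level. A hedged sketch, with made-up feature dimensions: per-clip features (reference keypoints, style) are repeated to the audio frame count, then concatenated with each frame's audio features, so the decoder can emit one keypoint set per audio frame.

```python
# Hypothetical shape-level sketch of the sparse 3D keypoint predictor's
# feature assembly. Vectors are plain Python lists; "+" is list
# concatenation, standing in for feature concatenation.

def assemble_sparse_features(ref_feat, style_feat, audio_feats):
    """ref_feat, style_feat: single feature vectors for the whole clip
    ("repeat filling" duplicates them per frame).
    audio_feats: list of per-audio-frame feature vectors.
    Returns one fused feature vector per audio frame."""
    return [ref_feat + style_feat + a for a in audio_feats]

fused = assemble_sparse_features([0.1, 0.2], [0.5], [[1.0], [2.0], [3.0]])
# one fused feature per audio frame -> one sparse 3D keypoint set per frame
```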
Further, the fine 2D driving keypoint predictor includes a reference keypoint encoder, a linear layer, and a driving keypoint decoder, and the calculating process includes:
the reference key point encoder encodes the 2D reference key points of the first multi-reference graph to obtain encoded 2D reference key point characteristics;
the predicted sparse 3D driving keypoints are mapped by a linear layer into a sparse 3D driving keypoint embedding;
the encoded 2D reference keypoint features are expanded to match the sparse 3D driving keypoint embedding along the time dimension and then concatenated with it to obtain the fine 2D driving keypoint features, whose number of frames is consistent with the audio frame count;
The fine 2D driving keypoint feature is decoded by a driving keypoint decoder to obtain predicted fine 2D driving keypoints.
Further, step S4 includes:
S4-1, warping the feature map of each frame of the second multi-reference images according to its 2D reference key points and the fine 2D driving key points of the current frame, producing multi-warp feature maps;
S4-2, computing a multi-scale aggregation ratio for the multi-warp feature maps from the fine 2D driving key points of the current frame, the 2D reference key points of all second reference images, and the multi-warp feature maps, the ratio comprising a frame-scale ratio and a pixel-scale ratio, and aggregating the multi-warp feature maps with this ratio to obtain the aggregated feature map;
S4-3, decoding the aggregated feature map to obtain a generated image, generating a smooth lower-half-face mask from the driving key points of the original frame image, and fusing the lower half of the speaker's face in the generated image with the original frame image through the smooth mask to obtain the lip-synchronized frame image of the current frame;
S4-4, repeating S4-1 to S4-3 with the fine 2D driving key points of the next frame until all frames of fine 2D driving key points have been traversed.
In a second aspect, the present invention provides a lip synchronization system based on multiple reference frames and controllable style, for implementing the above lip synchronization method based on multiple reference frames and controllable style.
Compared with the prior art, the invention has the following beneficial effects:
The invention realizes style-controllable lip synchronization based on multiple reference frames. Given driving audio and an original video, the speaker's lip movement style features can be extracted from the original video or specified directly, supporting arbitrary explicit or implicit assignment of the speaker's style. The sparse-to-fine audio-to-keypoint module reduces the difficulty of producing the result in a single step, making training more stable. The multi-scale aggregation ratio introduced in the keypoint-to-video module lets the generator exploit multi-reference-frame information effectively and produce video frames with higher fidelity. In combination, style-controllable, high-fidelity lip-synchronized video generation is achieved.
Drawings
FIG. 1 is a schematic diagram of the lip synchronization method based on multiple reference frames and controllable style according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the sparse 3D driving keypoint predictor according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the fine 2D driving keypoint predictor according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated and described below in connection with specific embodiments. The described embodiments are merely exemplary of the present disclosure and do not limit the scope. The technical features of the embodiments of the invention can be combined correspondingly on the premise of no mutual conflict.
The drawings are merely schematic illustrations of the present invention and are not necessarily drawn to scale. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only and not necessarily all steps are included. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Referring to fig. 1, the lip synchronization method based on multi-reference frames and controllable styles provided by the invention mainly comprises the following steps:
Step 1, acquiring the speaker's audio and video data, extracting the 3D driving key points and camera estimation matrix of the speaker's face in each video frame, and projecting the 3D driving key points with the camera estimation matrix to obtain the 2D driving key points;
and extracting the driving audio features of each audio frame.
Step 2, calculating the speaker's lip movement style features based on the 3D driving key points of all video frames.
Step 3, predicting fine 2D driving key points with the audio-to-keypoint module by combining the speaker's lip movement style features, the driving audio features of each audio frame, and the 3D and 2D reference key points of the first multi-reference images; the number of frames of the fine 2D driving key points is consistent with, and corresponds one-to-one to, the number of audio frames.
Step 4, randomly selecting part of the video frames from the original video to form the second multi-reference images, encoding them to obtain the second multi-reference feature maps, and obtaining the 2D reference key points of the second multi-reference images by the same method.
Step 5, combining the second multi-reference images, their 2D reference key points, and the fine 2D driving key points predicted by the audio-to-keypoint module, and generating lip-synchronized frames with the keypoint-to-video module.
Step 6, training the audio-to-keypoint module and the keypoint-to-video module.
Step 7, given the driving audio and the original video, extracting the speaker's lip movement style features from the original video or specifying them directly, and generating a synthesized video lip-synchronized to the driving audio with the trained audio-to-keypoint and keypoint-to-video modules, completing the lip synchronization task.
In step 1, the 2D and 3D facial keypoint information in each frame of the continuous video is extracted by a pre-trained facial keypoint detection tool. One alternative is to use a pre-trained 3D facial keypoint detection tool, such as MediaPipe, to detect 3D keypoints on the face region of each video frame, obtaining the 3D driving keypoint coordinates and a camera estimation matrix, and then to project the 3D driving keypoint coordinates to 2D through the camera estimation matrix to obtain the 2D driving keypoint coordinates. Each video frame thus corresponds to one set of 3D driving keypoints and one set of 2D driving keypoints.
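The 2D projection step can be illustrated with standard homogeneous-coordinate projection. This is a hedged sketch, not the patent's exact tooling: the 3x4 camera matrix and point values are made up, and real camera matrices from a landmark detector would include intrinsics and pose.

```python
# Illustrative sketch: project 3D facial keypoints to 2D with a 3x4
# camera estimation matrix via homogeneous coordinates and a
# perspective divide.

def project_3d_to_2d(points_3d, cam):
    """points_3d: list of (x, y, z) keypoints.
    cam: 3x4 camera matrix as a list of three rows.
    Returns a list of (u, v) 2D keypoints."""
    out = []
    for x, y, z in points_3d:
        h = [x, y, z, 1.0]  # homogeneous coordinates
        u, v, w = (sum(r[i] * h[i] for i in range(4)) for r in cam)
        out.append((u / w, v / w))  # perspective divide
    return out
```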
Likewise, the driving audio features of each audio frame are extracted by a pre-trained audio feature extraction tool; one alternative is HuBERT.
The invention also introduces a multi-reference frame, wherein the multi-reference frame is extracted from the video frames in the input audio and video data, and the extraction mode adopts random extraction. And taking the 3D driving key points and the 2D driving key points corresponding to the selected video frames as the 3D reference key points and the 2D reference key points of the first multi-reference pictures, wherein each reference picture corresponds to a group of 3D reference key points and a group of 2D reference key points.
In the step 2, an alternative way to calculate the lip movement style feature based on the 3D driving key point is as follows:
The quantized lip movement style features are estimated from the lip keypoints among the speaker's 3D driving keypoints, and comprise two speaking-style attributes: lip amplitude and lip speed. Specifically, the lip amplitude is measured by the average y-coordinate difference between the upper and lower lip keypoints, and the lip speed by the temporal first-order average difference of the absolute value of that y-coordinate difference, namely:

$$s_{amp}=\frac{1}{T}\sum_{t=1}^{T} d_t,\qquad s_{spd}=\frac{1}{T-1}\sum_{t=2}^{T}\bigl|\,|d_t|-|d_{t-1}|\,\bigr|$$

where $s_{amp}$ and $s_{spd}$ denote the lip amplitude and lip speed, $d_t$ is the y-coordinate difference between the upper and lower lip keypoints at time step $t$ of the given video, and $T$ is the total number of time steps in the given video. The lip amplitude and lip speed together form the speaker's speaking-style attributes, and the computed lip movement style features correspond to the whole audio-video clip.
When the lip movement style features extracted from the original video are used in the model inference stage, the lip movement style of the generated lip-synchronized video is consistent with the input video. Because the computed lip movement style features are scalars, the style of the generated video can also be modified by letting the user freely specify their values.
In the step 3, the audio-to-keypoint module includes a sparse 3D driving keypoint predictor and a fine 2D driving keypoint predictor, wherein the sparse 3D driving keypoint predictor takes driving audio features of each audio frame, lip movement style features of a speaker, 3D reference keypoints of the first multi-reference map as inputs, outputs predicted sparse 3D driving keypoints, and the fine 2D driving keypoint predictor takes 2D reference keypoints of the first multi-reference map and the sparse 3D driving keypoints as inputs, and outputs predicted fine 2D driving keypoints.
Here, in the sparse 3D driving keypoint predictor, the 3D reference keypoints of the first multi-reference images are encoded and expanded to match the driving audio features of each audio frame along the time dimension, so that the predicted sparse 3D driving keypoints have the same number of frames as the audio: each audio frame corresponds to one frame of sparse 3D driving keypoints, i.e., the set of keypoints of one video frame. Similarly, in the fine 2D driving keypoint predictor, the 2D reference keypoints of the first multi-reference images are encoded and expanded to match the encoded sparse 3D driving keypoints, so that the predicted fine 2D driving keypoints likewise have the same number of frames as the audio, one set of keypoints per video frame.
Through this multi-stage sparse-to-fine model structure, the difficulty of generating the result in a single step is reduced and training becomes more stable.
As shown in fig. 2, in one embodiment of the present invention, the sparse 3D driving keypoint predictor includes a reference keypoint encoder, a lip amplitude embedding layer, a lip speed embedding layer, an audio encoder and a driving keypoint decoder, and the calculation process of the sparse 3D driving keypoint predictor is as follows:
3-1 a) encoding the 3D reference keypoints of the first multi-reference map by a reference keypoint encoder to obtain encoded 3D reference keypoint features;
3-2 a) respectively carrying out coding embedding on the lip amplitude and the lip speed in the lip movement style characteristics of the speaker by the lip amplitude embedding layer and the lip speed embedding layer to obtain lip movement style embedding characteristics formed by the lip amplitude embedding and the lip speed embedding;
In this embodiment, the lip amplitude embedding is obtained by multiplying a scalar by a learnable vector, i.e. f_amp = s_amp × e_amp, where s_amp is the scalar lip amplitude value, e_amp is the learnable vector of the lip amplitude embedding layer, and f_amp is the lip amplitude embedding; the lip speed embedding is computed in the same way and is not repeated here.
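The scalar-times-learnable-vector embedding is simple enough to show directly. A minimal sketch, with made-up vector values standing in for learned parameters:

```python
# Illustrative sketch of the style embedding: a scalar style attribute
# scales a learnable vector (f_amp = s_amp * e_amp); lip speed is
# embedded the same way, and the two embeddings are concatenated.

def style_embedding(scalar, learnable_vec):
    return [scalar * e for e in learnable_vec]

f_amp = style_embedding(0.8, [1.0, -2.0, 0.5])   # lip amplitude embedding
f_spd = style_embedding(0.3, [0.2, 0.4, -1.0])   # lip speed embedding
style_embed = f_amp + f_spd  # lip movement style embedding feature
```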
3-3 A) the audio encoder encodes the driving audio characteristics of each audio frame to obtain audio encoding characteristics;
3-4 a) the encoded 3D reference keypoint features and the lip movement style embedding features are expanded to the same length as the audio encoding features and then concatenated with them to obtain the sparse 3D driving keypoint features, whose number of frames is consistent with the audio frame count; the expansion is done by repeat filling, a known technique in the field;
3-5 a) the sparse 3D driving keypoint feature is decoded by a driving keypoint decoder to obtain a predicted sparse 3D driving keypoint. The sparse 3D driving keypoints mainly include sparse portions of the lip keypoints and facial contour points.
In this embodiment, the keypoint decoder may employ a conventional decoder structure, for example a stack of N residual modules, each of which comprises, in order, a layer normalization layer, a 1D convolution layer, a PReLU activation function, a second 1D convolution layer, and a residual connection from input to output.
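One such residual module can be sketched in pure Python for clarity. This is a hedged, single-channel toy version (kernels and sizes are assumptions; a real model would use learned multi-channel convolutions and stack N modules):

```python
import math

# Toy sketch of one residual module of the keypoint decoder:
# layer norm -> 1D conv -> PReLU -> 1D conv -> residual connection.

def layer_norm(x, eps=1e-5):
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / len(x)
    return [(xi - m) / math.sqrt(v + eps) for xi in x]

def conv1d_same(x, k):
    """'same'-padded single-channel 1D convolution with odd-length kernel k."""
    p = len(k) // 2
    xp = [0.0] * p + list(x) + [0.0] * p
    return [sum(k[j] * xp[i + j] for j in range(len(k))) for i in range(len(x))]

def prelu(x, a=0.25):
    return [xi if xi >= 0 else a * xi for xi in x]

def residual_module(x, k1, k2):
    h = layer_norm(x)
    h = conv1d_same(h, k1)
    h = prelu(h)
    h = conv1d_same(h, k2)
    return [xi + hi for xi, hi in zip(x, h)]  # residual connection
```

With identity kernels the module reduces to adding a normalized, rectified copy of the input back onto itself, which makes the residual structure easy to verify.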
As shown in fig. 3, in one embodiment of the present invention, the fine 2D driving keypoint predictor includes a reference keypoint encoder, a linear layer and a driving keypoint decoder, and the fine 2D driving keypoint predictor is calculated as follows:
3-1 b) encoding the 2D reference keypoints of the first multi-reference map by a reference keypoint encoder to obtain encoded 2D reference keypoint features;
3-2 b) embedding the predicted sparse 3D driving keypoints by the linear layer as sparse 3D driving keypoint embedding;
3-3 b) the encoded 2D reference keypoint features are expanded to match the sparse 3D driving keypoint embedding along the time dimension and concatenated with it to obtain the fine 2D driving keypoint features, whose number of frames is consistent with the audio frame count;
3-4 b) the fine 2D driving keypoint feature is decoded by a driving keypoint decoder to obtain predicted fine 2D driving keypoints.
The second multi-reference map in step 4 and the first multi-reference map in step 1 are not related, and may be different. And the 2D reference key points of the second multi-reference image are obtained in the same way as the first multi-reference image, and the 2D driving key points corresponding to the selected video frames are directly used as the 2D reference key points of the second multi-reference image, wherein each reference image corresponds to a group of 2D reference key points.
In the above step 5, an optional implementation process is as follows:
5-1) Warping the feature map of each frame of the second multi-reference images into a feature map consistent with the driving keypoints, according to the 2D reference keypoints of that frame and the fine 2D driving keypoints of the current frame, generating the warped feature maps of the multi-frame reference images, referred to as multi-warp feature maps.
In one embodiment of the present invention, the multi-warp feature maps can be obtained with, but not limited to, the dense motion hourglass network disclosed in the article "Thin-Plate Spline Motion Model for Image Animation", taking the 2D reference keypoints of each reference frame with its feature map and any frame of fine 2D driving keypoints as input, and producing the warped feature map of each reference frame as output.
5-2) Computing the multi-scale aggregation ratio of the multi-warp feature maps from the fine 2D driving keypoints of the current frame, the 2D reference keypoints of all second reference images, and the multi-warp feature maps, the ratio comprising a frame-scale ratio and a pixel-scale ratio, and aggregating the texture information of the multi-warp feature maps with the multi-scale aggregation ratio to obtain the aggregated feature map.
In one embodiment of the present invention, the frame scale ratio calculation process includes:
For each frame $i$ of the second multi-reference images, a query matrix $Q_i$, key matrix $K_i$ and value matrix $V_i$ are computed from the selected fine 2D driving keypoints and the 2D reference keypoints:

$$Q_i = W_Q\,\hat{k},\qquad K_i = W_K\,k_{ref}^{(i)},\qquad V_i = W_V\,k_{ref}^{(i)}$$

where $W_Q$, $W_K$, $W_V$ are three linear transformation matrices, and $\hat{k}$ and $k_{ref}^{(i)}$ are the selected fine 2D driving keypoints and the 2D reference keypoints of the $i$-th second reference image, respectively.
The cross-attention score of each reference frame is computed from the query, key and value matrices, and the frame-scale ratio $r^{frame}$ is obtained by softmax over the frames:

$$r_i^{frame}=\operatorname{softmax}_i\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right)$$

Here $r^{frame}$ is a set of numbers between 0 and 1, consistent with the number of frames of the second reference images.
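The frame-scale ratio reduces to one similarity score per reference frame, normalized by softmax. A hedged sketch, using a plain dot product between already-projected query and key vectors (the linear projections $W_Q$, $W_K$ are assumed to have been applied):

```python
import math

# Illustrative sketch of the frame-scale ratio: score each
# second-reference frame against the driving keypoints, then softmax
# across frames so the ratios lie in (0, 1) and sum to 1.

def frame_scale_ratio(drive_q, ref_keys):
    """drive_q: query vector derived from the fine 2D driving keypoints.
    ref_keys: one key vector per second-reference frame."""
    scores = [sum(q * k for q, k in zip(drive_q, key)) for key in ref_keys]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]
```

Reference frames whose keypoints better match the driving pose receive a larger share of the aggregation weight.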
The pixel-scale ratio is computed by an adaptive-instance-normalization-style (AdaIN) cross-attention. A query matrix $Q$, and for each reference frame a key matrix $K_i$ and value matrix $V_i$, are computed with a modulated convolution layer:

$$K_i=\mathrm{ModConv}(F_i;\,w),\qquad V_i=\mathrm{ModConv}(F_i;\,w)$$

where $\mathrm{ModConv}$ is the modulated convolution layer, $F_i$ is the warped feature map of the $i$-th reference image, and $w$ is a convolution kernel derived from the reference keypoints.
$Q$ is expanded to the same shape as $K_i$. From the expanded query, key and value matrices, the cross-attention value of each reference frame is computed, and the pixel-scale ratio $r^{pixel}$ is obtained through softmax across frames. Here $r^{pixel}$ is a set of numbers between 0 and 1, consistent with the number of frames of the second reference images. The ratio takes different values at different pixels, reflecting the local similarity between the driving keypoints and each pixel region at pixel-level granularity.
The aggregated feature map is obtained from the texture information of the multi-warp feature maps through the multi-scale aggregation ratio as follows:

$$r_i=\lambda\,r_i^{frame}+(1-\lambda)\,r_i^{pixel},\qquad F_{agg}=\sum_i r_i\odot F_i$$

where $r$ is the mixed multi-scale aggregation ratio, consistent with the number of frames of the second reference images; $r^{frame}$ is the frame-scale ratio, $r^{pixel}$ the pixel-scale ratio, and $\lambda$ a preset frame weight used in this embodiment.
The multiple feature maps are aggregated into a single feature map by this multi-scale attention mixing to reduce the subsequent computation. The aggregation performs a weighted average over the pixels of each warped feature map, with a ratio obtained by mixing the frame-scale and pixel-scale ratios, hence the name "multi-scale".
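The mixing and weighted average can be sketched end-to-end. A hedged illustration, with feature maps flattened to 1D pixel lists and an assumed frame weight `lam` (the embodiment's actual value is not given here):

```python
# Illustrative sketch of the multi-scale aggregation: blend the
# per-frame scalar ratio with the per-pixel ratio using a preset frame
# weight lam, then average the warped feature maps pixel-wise with the
# mixed weights.

def aggregate(warped, r_frame, r_pixel, lam=0.5):
    """warped: list of warped feature maps (each a flat list of pixels).
    r_frame[i]: scalar ratio of frame i.
    r_pixel[i][p]: ratio of frame i at pixel p.
    lam: assumed preset frame weight."""
    n_pix = len(warped[0])
    out = []
    for p in range(n_pix):
        out.append(sum((lam * r_frame[i] + (1.0 - lam) * r_pixel[i][p]) * warped[i][p]
                       for i in range(len(warped))))
    return out
```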
5-3) Decoding the aggregated feature map to obtain the generated image, and fusing the lower half of the speaker's face in the generated image with the original frame image to obtain the lip-synchronized frame image of the current frame;
In one implementation of the present invention, the lower half of the speaker's face in the generated image is fused with the original frame image as follows:
A convex hull is formed from the keypoints of the lower half of the face in the original video image, and the hull region is treated as a hard lower-half-face mask $M_{hard}$. The boundary region of the hull is feathered with a Gaussian kernel of a certain strength to obtain a smooth lower-half-face mask $M$. Finally, the smooth mask blends the original frame image with the generated image:

$$I_{out}=M\odot I_{gen}+(1-M)\odot I_{orig}$$

where $I_{out}$ is the blended result, i.e. the final lip-synchronized frame image, and $I_{gen}$ and $I_{orig}$ are the generated image and the original image, respectively.
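The per-pixel blend is a standard alpha composite. A minimal sketch, with images flattened to 1D pixel lists; mask values of 1 keep the generated pixel, 0 keep the original, and intermediate values in the feathered boundary mix the two:

```python
# Illustrative sketch of the lower-half-face blending:
# I_out = M * I_gen + (1 - M) * I_orig, with a feathered mask M in [0, 1].

def blend(mask, generated, original):
    return [m * g + (1.0 - m) * o for m, g, o in zip(mask, generated, original)]

frame = blend([1.0, 0.5, 0.0], [10.0, 10.0, 10.0], [0.0, 0.0, 0.0])
```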
5-4) Repeating steps 5-1) to 5-3), generating the lip-synchronized frame image of the next frame with the fine 2D driving keypoints of the next frame, until all frames of fine 2D driving keypoints have been traversed.
In step 6 above, an alternative way to train the audio-to-keypoint module and the keypoint-to-video module is as follows:
for training of the audio to key point module, the calculation formula of the loss function is as follows:
where the loss of the audio-to-keypoint module is the sum of three terms: the L1 loss between the ground-truth face 2D driving keypoints and the predicted fine 2D driving keypoints, the L1 loss between the ground-truth face 3D driving keypoints and the predicted sparse 3D driving keypoints, and the L1 loss between the ground-truth and predicted lip movement style features.
The 3D and 2D driving keypoint ground truths are the 3D and 2D driving keypoints of each video frame obtained in step 1. The predicted sparse 3D and fine 2D driving keypoints have the same number of frames as the audio, and the audio and video correspond to each other during the training stage. The ground-truth lip movement style feature is computed in step 2, and the predicted value is computed from the predicted sparse 3D driving keypoints using the same method as step 2.
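The three-term loss above can be sketched as follows. The term weights `w` are an assumption, since this excerpt does not state how (or whether) the terms are weighted.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error (L1 loss)."""
    return np.abs(np.asarray(a) - np.asarray(b)).mean()

def audio_to_keypoint_loss(kp2d_pred, kp2d_true, kp3d_pred, kp3d_true,
                           style_pred, style_true, w=(1.0, 1.0, 1.0)):
    """Sum of the three L1 terms of the audio-to-keypoint module:
    fine 2D keypoints, sparse 3D keypoints, and lip movement style."""
    return (w[0] * l1(kp2d_pred, kp2d_true)
            + w[1] * l1(kp3d_pred, kp3d_true)
            + w[2] * l1(style_pred, style_true))
```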
For the training of the keypoint to video module, the loss function includes two parts, a full graph and a lip region.
For the full graph, the loss function is calculated as:
where the full-image loss of the keypoint-to-video module is the sum of the mean square error between the generated image and the original frame image and their mean square error in the perceptual space of the pretrained VGG-19 model.
An additional loss is introduced for the lip region: the lip region is extracted from the full image, and the same method is used to compute the mean square error between the lip regions of the generated image and the original frame image, both in pixel space and in the perceptual space of the pretrained VGG-19 model; these two terms form the lip-region loss of the keypoint-to-video module.
The complete training loss function from the key point to the video module is as follows:
where the total loss of the keypoint-to-video module is the sum of the full-image loss and the lip-region loss.
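Putting the full-image and lip-region terms together gives the sketch below. Here `feat_fn` stands in for the VGG-19 perceptual embedding, `lip_box = (y0, y1, x0, x1)` is an assumed representation of the extracted lip region, and `lip_weight` is an assumed weighting; none of these names are from the patent.

```python
import numpy as np

def mse(a, b):
    """Mean square error."""
    return ((np.asarray(a) - np.asarray(b)) ** 2).mean()

def keypoint_to_video_loss(gen, real, lip_box, feat_fn, lip_weight=1.0):
    """Full-image loss plus lip-region loss, each combining a pixel-space
    MSE and a perceptual-space MSE (feat_fn stands in for VGG-19)."""
    y0, y1, x0, x1 = lip_box
    full = mse(gen, real) + mse(feat_fn(gen), feat_fn(real))
    g_lip, r_lip = gen[y0:y1, x0:x1], real[y0:y1, x0:x1]
    lip = mse(g_lip, r_lip) + mse(feat_fn(g_lip), feat_fn(r_lip))
    return full + lip_weight * lip
```

In practice `feat_fn` would map an image to intermediate VGG-19 activations; an identity function makes the perceptual terms degenerate to pixel MSE, which is handy for testing.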
In step 7, given a segment of driving audio and an original video, the lip movement style feature of the speaker is either extracted from the original video or specified directly; the trained audio-to-keypoint and keypoint-to-video modules then generate a synthesized video lip-synced to the driving audio, completing the lip synchronization task.
Since the lip movement style feature is a scalar, the user can freely specify its value to modify the style of the generated lip-sync video.
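The lip movement style feature consists of a lip amplitude and a lip speed computed from the lip keypoints, as described in claim 1. A minimal sketch, assuming the keypoints are supplied as `(T, K)` arrays of y coordinates for the upper and lower lip over T frames (the layout and use of the absolute gap for the amplitude are assumptions):

```python
import numpy as np

def lip_style_feature(upper_y, lower_y):
    """Compute (amplitude, speed) from upper/lower lip keypoint y coords.

    Amplitude: mean y gap between upper and lower lip keypoints.
    Speed: mean first-order temporal difference of the absolute gap.
    """
    gap = np.abs(np.asarray(upper_y) - np.asarray(lower_y)).mean(axis=1)  # (T,)
    amplitude = float(gap.mean())
    speed = float(np.abs(np.diff(gap)).mean())
    return amplitude, speed
```

A wider average gap yields a larger amplitude; faster opening and closing of the mouth over time yields a larger speed.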
The above method is applied to the following embodiments to embody the technical effects of the present invention, and specific steps in the embodiments are not described in detail.
The invention was evaluated on HDTF (High-Definition Talking Face), a mainstream lip-sync benchmark; 20 randomly selected samples from its test set form the comparison test set. To evaluate performance objectively, the invention was not trained on the HDTF data distribution.
The comparison adopts mainstream quantitative lip-sync metrics covering image quality, keypoint reconstruction accuracy, and identity preservation: PSNR, SSIM, LPIPS, FID, CSIM, and LipLDM. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) evaluate the similarity between the generated video and the ground-truth video. Learned Perceptual Image Patch Similarity (LPIPS) and Frechet Inception Distance (FID) evaluate distances in a latent feature space, reflecting the visual quality of the generated frames. To better evaluate identity preservation, the cosine similarity (CSIM) of features extracted by ArcFace is also reported. The lip keypoint distance (LipLDM) measures audio-visual lip synchronization; it is computed as the L1 distance between lip keypoints extracted by MediaPipe from the generated results and from the real video.
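Two of these metrics are simple enough to sketch directly: PSNR over image pixels and LipLDM over extracted lip keypoints. The `(T, K, 2)` keypoint layout is an assumption; in the evaluation above the keypoints come from MediaPipe.

```python
import numpy as np

def psnr(gen, real, peak=255.0):
    """Peak signal-to-noise ratio between generated and ground-truth frames."""
    err = ((np.asarray(gen, dtype=np.float64) - real) ** 2).mean()
    return float("inf") if err == 0 else 10.0 * np.log10(peak ** 2 / err)

def lip_ldm(gen_kp, real_kp):
    """Lip keypoint distance: mean L1 distance between lip keypoints of the
    generated and real videos, arrays shaped (T, K, 2) by assumption."""
    return float(np.abs(np.asarray(gen_kp) - np.asarray(real_kp)).mean())
```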
The reference models for comparison are MakeItTalk, Wav2Lip, Wav2Lip-GAN, PC-AVS, VideoReTalking, IP-LAP, and DINet. Like the present method, IP-LAP and DINet use multiple reference frames to enhance the facial prior. The official open-source implementations and pretrained models of these baselines were used at inference time. Some reference models, such as PC-AVS, only generate cropped or resized speaker videos; for these methods, the preprocessed source video is taken as the ground truth.
The experimental results obtained according to the procedure described in the specific embodiment are shown in table 1.
Table 1. Lip-sync comparison results of the present invention on the HDTF dataset
As shown in Table 1, the invention, trained only on publicly accessible datasets, achieves better overall performance on PSNR, SSIM, LPIPS, FID, and CSIM. IP-LAP uses an audio-to-keypoint network trained on two large datasets and therefore performs better on LipLDM. Owing to the design of the keypoint-to-video module and the inference-time reference selection strategy, the FID and LPIPS scores of the invention are significantly better than previous methods: for example, its FID is about 1.45 while the reference models are above 4.9, indicating the high visual quality of the keypoint-to-video module. Furthermore, the CSIM of the invention exceeds 0.95 while the best baseline is only around 0.92, showing the strong identity retention capability of the audio-to-keypoint module.
Based on the same inventive concept, a lip-sync system based on multiple reference frames and controllable styles is also provided in this embodiment, and is used to implement the above embodiment. The terms "module," "unit," and the like, as used below, may be a combination of software and/or hardware that performs a predetermined function. Although the system described in the following embodiments is preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible.
In this embodiment, a lip synchronization system with controllable styles based on multiple reference frames includes:
The audio-video data preprocessing module is used for acquiring speaker video data and driving audio data, extracting driving audio characteristics of each audio frame and driving key points of each video frame, randomly selecting two groups of video frames from video to serve as a first multi-reference picture and a second multi-reference picture respectively, and taking the driving key points of the reference pictures as reference key points;
The audio-to-key point module is used for extracting lip movement style characteristics of a speaker from the video, and predicting fine key points by combining the driving audio characteristics and the reference key points of the first multi-reference graph;
The key point-to-video module is used for generating a lip-shaped synchronous frame by combining the second multi-reference picture and the reference key points thereof and the fine key points predicted by the audio-to-key point module;
the training module is used for calculating loss according to the prediction result of the audio-to-key point module and the generation result of the key point-to-video module and updating the module;
The lip synchronization generating module is used for extracting lip movement style characteristics of a speaker or directly giving the lip movement style characteristics of the speaker from the original video aiming at given driving audio and the original video, and generating a synthesized video synchronous with the given driving audio lip by utilizing the trained audio-to-key point module and the key point-to-video module to finish the lip synchronization task.
For the system embodiment, since it basically corresponds to the method embodiment, refer to the description of the method embodiment for the relevant parts; the implementations of the remaining modules are not repeated here. The system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art can understand and implement the invention without undue effort.
Embodiments of the system of the present invention may be applied to any device with data processing capability, such as a computer. The system embodiment may be implemented in software, in hardware, or in a combination of both. Taking software implementation as an example, the logical device is formed by the processor of the host device reading the corresponding computer program instructions from nonvolatile memory into memory.
The foregoing list is only illustrative of specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, but many variations are possible. All modifications directly derived or suggested to one skilled in the art from the present disclosure should be considered as being within the scope of the present invention.

Claims (7)

1. A lip synchronization method based on multi-reference frames and controllable styles, which is characterized by comprising the following steps:
S1, acquiring speaker video data and driving audio data, and extracting driving audio characteristics of each audio frame and driving key points of each video frame;
The driving key points comprise 3D driving key points and 2D driving key points;
S2, randomly selecting two groups of video frames from the video to serve as a first multi-reference picture and a second multi-reference picture respectively, and taking driving key points of the reference pictures as reference key points;
the reference key points comprise 3D reference key points and 2D reference key points;
S3, extracting lip movement style characteristics of a speaker from the video, and predicting fine key points by utilizing a sparse-to-fine audio-to-key point module in combination with driving audio characteristics and reference key points of a first multi-reference graph;
The fine key points predicted by the audio-to-key point module are fine 2D driving key points;
The lip movement style characteristic is a scalar and consists of a lip amplitude and a lip speed, and the method for extracting the lip movement style characteristic of the speaker from the video is as follows:
Screening lip key points in 3D driving key points of speakers in each frame of video image, wherein the lip key points comprise an upper lip key point and a lower lip key point;
Characterizing the lip amplitude by using the average value of the y coordinate differences between the upper lip key points and the lower lip key points;
Characterizing the lip speed by using a time sequence first-order average difference of the absolute value of the y coordinate difference of the upper lip key point and the lower lip key point;
S4, combining the second multi-reference picture, the reference key points thereof and the fine key points predicted by the audio-to-key point module, and introducing multi-scale aggregation proportion into the key points-to-video module to generate a lip synchronous frame;
The step S4 includes:
s4-1, twisting the second reference map feature map of each frame according to the 2D reference key point of the second reference map of each frame and the fine 2D driving key point of the current frame to generate a multi-twisting feature map;
S4-2, calculating a multi-scale aggregation proportion of the multi-distortion feature map by using the fine 2D driving key points of the current frame, the 2D reference key points of all the second reference maps and the multi-distortion feature map, wherein the multi-scale aggregation proportion comprises a frame scale proportion and a pixel scale proportion;
S4-3, decoding the aggregate feature map to obtain a generated image, generating a smooth lower half face mask by using driving key points of an original frame image, and fusing the lower half face of a speaker of the generated image with the original frame image by using the smooth lower half face mask to obtain a lip-shaped synchronous frame image of the current frame;
s4-4, repeating the steps S4-1 to S4-3, and generating a lip-shaped synchronous frame image of the next frame by using the fine 2D driving key points of the next frame until the frame number of all the fine 2D driving key points is traversed;
S5, calculating loss according to the prediction result of the audio-to-key point module and the generation result of the key point-to-video module, and updating the module;
S6, given the driving audio and the original video, extracting lip movement style characteristics of a speaker from the original video or directly giving the lip movement style characteristics of the speaker, and generating a synthesized video synchronous with the lip of the given driving audio by utilizing the trained audio-to-key point module and the key point-to-video module to finish the lip synchronization task.
2. The lip synchronization method based on multiple reference frames and controllable style according to claim 1, wherein the audio-to-keypoint module comprises:
A sparse 3D driving key point predictor which takes driving audio characteristics of each audio frame, lip movement style characteristics of a speaker and 3D reference key points of a first multi-reference graph as inputs and outputs predicted sparse 3D driving key points;
And the fine 2D driving key point predictor takes the 2D reference key points of the first multi-reference graph and the sparse 3D driving key points as inputs and outputs predicted fine 2D driving key points.
3. The lip synchronization method according to claim 2, wherein the sparse 3D driving keypoint predictor comprises a reference keypoint encoder, a lip amplitude embedding layer, a lip speed embedding layer, an audio encoder and a driving keypoint decoder, and the calculating process comprises:
the reference key point encoder encodes the 3D reference key points of the first multi-reference graph to obtain encoded 3D reference key point characteristics;
The lip amplitude embedding layer and the lip speed embedding layer respectively encode and embed the lip amplitude and the lip speed in the lip movement style characteristics of a speaker to obtain lip movement style embedding characteristics formed by the lip amplitude embedding and the lip speed embedding;
The audio encoder encodes the driving audio characteristics of each audio frame to obtain audio encoding characteristics;
the 3D reference key point characteristics and the lip movement style embedded characteristics after coding are expanded to have the same dimension as the audio coding characteristics, and then are connected with the audio coding characteristics to obtain sparse 3D driving key point characteristics, wherein the number of frames of the sparse 3D driving key point characteristics is consistent with the number of frames of the audio;
And decoding the sparse 3D driving key point features through a driving key point decoder to obtain predicted sparse 3D driving key points.
4. The multi-reference frame and style controllable lip sync method according to claim 2, wherein the fine 2D driving keypoint predictor comprises a reference keypoint encoder, a linear layer and a driving keypoint decoder, and the calculating process comprises:
the reference key point encoder encodes the 2D reference key points of the first multi-reference graph to obtain encoded 2D reference key point characteristics;
the predicted sparse 3D driving key points are embedded into sparse 3D driving key points by a linear layer;
the coded 2D reference key point features are expanded to have the same dimension as the sparse 3D drive key point embedded, and then are connected with the sparse 3D drive key point embedded to obtain fine 2D drive key point features, wherein the number of frames of the fine 2D drive key point features is consistent with the number of audio frames;
The fine 2D driving keypoint feature is decoded by a driving keypoint decoder to obtain predicted fine 2D driving keypoints.
5. The multi-reference frame and style controllable lip synchronization method of claim 1, wherein the frame scale ratio is obtained by calculating a cross attention value from the selected fine 2D driving keypoints and the 2D reference keypoints of the second reference map, and the pixel scale ratio is obtained by calculating an adaptive instance normalized cross attention from the selected fine 2D driving keypoints and the 2D reference keypoints of the second reference map.
6. The multi-reference frame and style controllable lip synchronization method of claim 1, wherein the training loss of the audio-to-keypoint module comprises a loss between a 2D driving keypoint true value and a predicted fine 2D driving keypoint, a loss between a 3D driving keypoint true value and a predicted sparse 3D driving keypoint, and a loss between lip motion style characteristics calculated using the 3D driving keypoint true value and the predicted sparse 3D driving keypoint, respectively;
the training loss from the key point to the video module comprises a full-image loss and a lip region loss, wherein the full-image loss and the lip region loss comprise a mean square error loss between a generated image and an original frame image, a mean square error loss between image features of the generated image and the original frame image, a mean square error loss between a lip region of the generated image and a lip region of the original frame image and a mean square error loss between image features of the lip region of the generated image and the lip region of the original frame image.
7. A lip sync system with controllable style based on multiple reference frames for implementing the method of claim 1, wherein the system comprises:
The audio-video data preprocessing module is used for acquiring speaker video data and driving audio data, extracting driving audio characteristics of each audio frame and driving key points of each video frame, randomly selecting two groups of video frames from video to serve as a first multi-reference picture and a second multi-reference picture respectively, and taking the driving key points of the reference pictures as reference key points;
The audio-to-key point module is used for extracting lip movement style characteristics of a speaker from the video, and predicting fine key points by combining the driving audio characteristics and the reference key points of the first multi-reference graph;
The key point-to-video module is used for generating a lip-shaped synchronous frame by combining the second multi-reference picture and the reference key points thereof and the fine key points predicted by the audio-to-key point module;
the training module is used for calculating loss according to the prediction result of the audio-to-key point module and the generation result of the key point-to-video module and updating the module;
The lip synchronization generating module is used for extracting lip movement style characteristics of a speaker or directly giving the lip movement style characteristics of the speaker from the original video aiming at given driving audio and the original video, and generating a synthesized video synchronous with the given driving audio lip by utilizing the trained audio-to-key point module and the key point-to-video module to finish the lip synchronization task.
CN202510274090.1A 2025-03-10 2025-03-10 A lip synchronization method and system based on multiple reference frames and controllable style Active CN119815096B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510274090.1A CN119815096B (en) 2025-03-10 2025-03-10 A lip synchronization method and system based on multiple reference frames and controllable style

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510274090.1A CN119815096B (en) 2025-03-10 2025-03-10 A lip synchronization method and system based on multiple reference frames and controllable style

Publications (2)

Publication Number Publication Date
CN119815096A CN119815096A (en) 2025-04-11
CN119815096B true CN119815096B (en) 2025-06-13

Family

ID=95271543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510274090.1A Active CN119815096B (en) 2025-03-10 2025-03-10 A lip synchronization method and system based on multiple reference frames and controllable style

Country Status (1)

Country Link
CN (1) CN119815096B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN119562142A (en) * 2025-01-21 2025-03-04 浙江大学 A digital character audio and video data processing system, method and live broadcast system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176408B2 (en) * 2015-08-14 2019-01-08 Elucid Bioimaging Inc. Systems and methods for analyzing pathologies utilizing quantitative imaging
CN116246648A (en) * 2022-12-29 2023-06-09 武汉理工大学重庆研究院 Speech-driven speaking video generation method and device based on semantic operation
CN118660209A (en) * 2024-04-09 2024-09-17 马上消费金融股份有限公司 Model training method, video generating method and related devices
CN119094841A (en) * 2024-08-08 2024-12-06 北京有竹居网络技术有限公司 Video generation method, device, equipment, storage medium and program product

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium
CN119562142A (en) * 2025-01-21 2025-03-04 浙江大学 A digital character audio and video data processing system, method and live broadcast system

Also Published As

Publication number Publication date
CN119815096A (en) 2025-04-11

Similar Documents

Publication Publication Date Title
CN113901894B (en) Video generation method, device, server and storage medium
Ye et al. Real3d-portrait: One-shot realistic 3d talking portrait synthesis
CN108520503B (en) A method for repairing face defect images based on autoencoder and generative adversarial network
CN111861945A (en) A text-guided image restoration method and system
CN110490804A (en) A method of based on the generation super resolution image for generating confrontation network
Zhang et al. No-reference omnidirectional image quality assessment based on joint network
WO2025246674A1 (en) Image processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN118585964B (en) Video saliency prediction method and system based on audio-visual correlation feature fusion strategy
Li et al. Ae-nerf: Audio enhanced neural radiance field for few shot talking head synthesis
Elharrouss et al. Transformer-based image and video inpainting: current challenges and future directions
Huang et al. Parametric implicit face representation for audio-driven facial reenactment
Mir et al. DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers
Zhang et al. SSP-IR: semantic and structure priors for diffusion-based realistic image restoration
Chen et al. LPIPS-AttnWav2Lip: Generic audio-driven lip synchronization for talking head generation in the wild
CN114943746B (en) A motion transfer method using depth information assistance and contour enhancement loss
WO2024060669A1 (en) Action migration method and apparatus, and terminal device and storage medium
CN114638868B (en) A method for generating high-resolution 3D face texture in natural scenes
CN119815096B (en) A lip synchronization method and system based on multiple reference frames and controllable style
CN120378706A (en) Video generation method based on countermeasure generation network
CN116958451B (en) Model processing, image generating method, image generating device, computer device and storage medium
CN116259084B (en) Face image translation method based on parallel multi-stage generation countermeasure network
CN120812364B (en) A method and system for generating digital human speaking videos via text-driven technology
Xu et al. Cpnet: Exploiting clip-based attention condenser and probability map guidance for high-fidelity talking face generation
CN119295616B (en) Face animation generation method and device based on priori knowledge identity preservation
Bao Review on generative adversarial network in computer vision: Methods and metrics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant