
CN118969008A - Voice-driven face video generation method, system, storage medium and electronic device - Google Patents


Info

Publication number
CN118969008A
CN118969008A (application CN202411063361.0A)
Authority
CN
China
Prior art keywords
face
image
voice
video
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411063361.0A
Other languages
Chinese (zh)
Other versions
CN118969008B (en)
Inventor
丁宝进
雷钧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Original Assignee
Zhejiang University ZJU
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Zhejiang Tonghuashun Intelligent Technology Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202411063361.0A priority Critical patent/CN118969008B/en
Publication of CN118969008A publication Critical patent/CN118969008A/en
Application granted granted Critical
Publication of CN118969008B publication Critical patent/CN118969008B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0475 - Generative networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/094 - Adversarial learning
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G06T13/20 - 3D [Three Dimensional] animation
    • G06T13/205 - 3D [Three Dimensional] animation driven by audio data
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 - Adaptation
    • G10L15/07 - Adaptation to the speaker
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a voice-driven face video generation method, which comprises the following steps: acquiring voice data and extracting audio features of the voice data; inputting the audio features, a background image and sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model; and splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video. According to the method, the background image does not need to be modeled in three dimensions separately, the amount of computation can be greatly reduced, and the hardware cost of rendering is lowered, so that the requirements of real-time performance and low cost in application scenarios are met. The application also provides a voice-driven face video generation system, a storage medium and an electronic device, which have the same beneficial effects.

Description

Voice-driven face video generation method and system, storage medium and electronic equipment
Technical Field
The present application relates to the field of image processing, and in particular, to a method and system for generating a voice-driven face video, a storage medium, and an electronic device.
Background
With the development of computer vision and natural language processing technology, artificial intelligence has been widely applied in the field of face generation. Voice-driven face video generation is a technique that generates realistic face images corresponding to a given voice input. With only voice input and a pre-trained model of a specific person, it can generate a speaking video of that person without the real person being present. This can not only enhance entertainment experiences and artistic creation, but also improve the effect of human-computer interaction and education and training, and it is widely used in the field of virtual anchors.
Thanks to the powerful generation capability of the Generative Adversarial Network (GAN), it has been applied to the field of voice-driven talking faces. A GAN network can learn the mapping from voice to face images and, once trained, can generate face images end-to-end from the input voice. However, because such a two-dimensional method directly outputs the RGB values of the target image from the voice, it only considers the result of a single frame, and therefore lacks a constraint on the three-dimensional consistency of the face.
Disclosure of Invention
The application aims to provide a voice-driven face video generation method, system, storage medium and electronic device, in which no three-dimensional modeling of the background is required, image generation is controlled through style variables, and high-quality images can be generated while three-dimensional modeling information is preserved.
In order to solve the above technical problems, the application provides a voice-driven face video generation method, the specific technical solution of which is as follows:
Acquiring voice data and extracting audio characteristics of the voice data;
Inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable;
And splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
Optionally, the generating process of the background image includes:
Acquiring a training video, and extracting the training video to obtain training data; the training data comprises voice features and image features contained in the training video;
Intercepting and obtaining partial face images from the training data; the intercepting part comprises a face area, a neck area and a shoulder area;
Performing face segmentation on the partial face image by using a face analysis model to obtain a face mask and a background mask; the sizes of the face mask and the background mask are consistent with the sizes of the partial face images;
And obtaining a background image with the face removed according to the background mask and the partial face image.
Optionally, the capturing a partial face image from the training data includes:
extracting original image frames from the training video according to a set frame rate; the frame number of the original image frame is the same as the audio frame number;
Determining a face image position in the original image frame;
and intercepting and obtaining a part of face image according to the face image position.
Optionally, after obtaining the face mask and the background mask, the method further includes:
And obtaining a foreground image only containing a human face according to the human face mask and the partial human face image, and reducing the foreground image to obtain a true value of the rendering model.
Optionally, after the foreground image is reduced to obtain the true value of the rendering model, the method further includes:
converting pixel coordinates in the partial face image into world coordinates in a world coordinate system;
and regarding each pixel point as being obtained by rendering rays passing through the pixel point in space, and discretizing each ray to obtain the sampling point.
Optionally, converting the pixel coordinates in the partial face image into world coordinates in a world coordinate system includes:
Extracting the partial face image to obtain two-dimensional key points;
Matching the two-dimensional key points with the three-dimensional key points of a three-dimensional standard face model to determine the face pose of each frame of image;
Calculating a camera extrinsic matrix corresponding to each frame of image according to the face pose;
assuming the image observed by the camera is the true value, acquiring a camera internal reference matrix;
and converting pixel coordinates in the partial face image into world coordinates in a world coordinate system by using the camera external reference matrix and the camera internal reference matrix.
Optionally, the rendering model includes:
The voice coding network comprises a plurality of convolution blocks and is used for extracting audio characteristics of input audio to obtain voice embedded characteristics;
the position coding network is used for coding the three-dimensional position information of the sampling points to obtain position characteristics;
The direction coding network is used for coding the direction of the sampling point to obtain a direction code; the direction and the code of the sampling point on the same ray are the same;
the transparency prediction network is used for outputting the color transparency of all the sampling points and splicing the voice embedding feature and the position feature to obtain a first combined feature;
The RGB prediction network is used for outputting red, green and blue tristimulus values for all the sampling points, and splicing the first combination characteristic and the direction code to obtain a second combination characteristic;
The rendering module is used for performing cumulative rendering on the sampling points on each ray to obtain the RGB color value of the two-dimensional pixel point corresponding to the ray, and accumulating the three color-channel values of the sampling points on each ray along the ray direction to obtain an RGB predicted image;
The background coding network is used for carrying out downsampling coding on the background image to obtain two-dimensional background characteristics;
And the style coding network is used for combining the two-dimensional background characteristic and the second combined characteristic and coding to obtain the style variable.
The application also provides a voice-driven human face video generation system, which comprises:
the audio feature extraction module is used for acquiring voice data and extracting audio features of the voice data;
The face image generation module is used for inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by utilizing the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable;
and the video generation module is used for splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
The application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method as described above.
The application also provides an electronic device comprising a memory in which a computer program is stored and a processor which when calling the computer program in the memory implements the steps of the method as described above.
The application provides a voice-driven face video generation method, which comprises the following steps: acquiring voice data and extracting audio characteristics of the voice data; inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable; and splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
According to the application, the rendering model is used to extract the three-dimensional features of the face, which are combined with the encoding features of the background image, so that the three-dimensional modeling information is preserved while the high-quality image generation capability of the generative adversarial model is exploited; the amount of computation is greatly reduced and the hardware cost of rendering is lowered, thereby meeting the requirements of real-time performance and low cost in application scenarios.
The application also provides a voice-driven human face video generation system, a storage medium and electronic equipment, which have the beneficial effects and are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for generating a voice-driven face video according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a voice-driven face video generating system according to an embodiment of the present application;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a flowchart of a method for generating a voice-driven face video according to an embodiment of the present application, where the method includes:
s101: acquiring voice data and extracting audio characteristics of the voice data;
S102: inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable;
S103: splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
When generating the voice-driven face video, the face speaking images can be obtained simply by inputting the voice data into the face generation model, and the final video is obtained after the images are spliced and combined with the voice data. This embodiment assumes that a complete face generation model has been acquired or trained before step S102 is executed.
The following description is made with respect to the generation process of the background image provided by the present application:
firstly, acquiring a training video, and extracting the training video to obtain training data; the training data comprises voice features and image features contained in the training video;
Step two, intercepting and obtaining partial face images from the training data; the intercepting part comprises a face area, a neck area and a shoulder area;
Thirdly, carrying out face segmentation on the partial face image by utilizing a face analysis model to obtain a face mask and a background mask; the sizes of the face mask and the background mask are consistent with the sizes of the partial face images;
and step four, obtaining a background image with the face removed according to the background mask and the partial face image.
Voice features and image features are acquired from a preset training video to serve as training data. First, the corresponding audio is extracted from the training video, and features are extracted at 25 frames per second using a pre-trained speech recognition model such as Wav2Vec or DeepSpeech, giving a per-frame feature of dimension t. To give the voice features temporal context, a window of w frames is taken in the time dimension, so that the finally extracted feature is d_a ∈ R^(t×w), where w denotes w consecutive frames centered on the current frame; the values are padded when the window meets the sequence boundaries.
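As an illustration of the windowing described above, a minimal sketch is given below (a hypothetical helper; it assumes the per-frame features have already been extracted into a NumPy array of shape [num_frames, t], and edge padding is one possible choice for filling the boundaries):

```python
import numpy as np

def window_audio_features(frames: np.ndarray, w: int) -> np.ndarray:
    """Stack w consecutive frames (centered on the current frame) for every time step.

    frames: [num_frames, t] per-frame audio features (e.g. from Wav2Vec/DeepSpeech at 25 fps).
    returns: [num_frames, t, w] windowed features d_a; boundaries are edge-padded.
    """
    half = w // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")  # replicate boundary frames
    windows = [padded[i:i + w] for i in range(frames.shape[0])]   # each window: [w, t]
    return np.stack(windows).transpose(0, 2, 1)                   # -> [num_frames, t, w]
```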
Image feature extraction is then performed. The video is first decomposed into original image frames at a frame rate of 25, consistent with the frame rate of the audio features, i.e. the number of image frames equals the number of audio frames. The position of the face in each image is then determined, and an upper-body face image is cropped from a fixed position in each original frame; the crop contains the face area, the neck area and the shoulder area. The crop is then scaled to 512×512 to obtain the partial face image I ∈ R^(512×512×3). Face segmentation is performed on I with a face parsing model to obtain a face mask F and a background mask B, whose sizes are consistent with I. A foreground image I_c containing only the face is obtained from the face mask F, with the pixel values of all other areas set to 0, and a background image I_b with the face and neck removed is obtained from the background mask B, with the pixel values of all other areas set to 0:
I_c = I ⊙ F, I_b = I ⊙ B;
where ⊙ denotes element-wise multiplication. The foreground image I_c is scaled down to obtain I_m ∈ R^(c×c×3), c < 512, which serves as the true value of the rendering model. The rendering model is not limited here; NeRF is taken as an example.
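A minimal sketch of this mask-and-downscale step, assuming OpenCV is available, that F and B are binary masks of shape [512, 512] produced by the face parsing model, and that the downscaled size c is a hypothetical value:

```python
import cv2
import numpy as np

def split_foreground_background(image: np.ndarray, face_mask: np.ndarray,
                                bg_mask: np.ndarray, c: int = 128):
    """image I: [512, 512, 3]; face_mask F / bg_mask B: [512, 512] binary masks.

    Returns the face-only foreground I_c, the face-removed background I_b,
    and the downscaled rendering ground truth I_m of shape [c, c, 3]."""
    I_c = image * face_mask[..., None]   # I_c = I ⊙ F, all other pixels become 0
    I_b = image * bg_mask[..., None]     # I_b = I ⊙ B
    I_m = cv2.resize(I_c, (c, c), interpolation=cv2.INTER_AREA)  # c < 512
    return I_c, I_b, I_m
```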
Two-dimensional key points are extracted from the partial face image and matched with the three-dimensional key points of a three-dimensional standard face model to determine the face pose of each frame of image. The camera extrinsic matrix corresponding to each frame of image is calculated from the face pose, and the camera intrinsic matrix is obtained under the assumption that the image observed by the camera is the true value; the camera extrinsic matrix and camera intrinsic matrix are then used to convert pixel coordinates in the partial face image into world coordinates in the world coordinate system.
Specifically, two-dimensional face key points are extracted from the partial face image I and matched with the three-dimensional key points of the three-dimensional standard face model to estimate the face pose of each frame of image, from which the camera extrinsic matrix corresponding to each frame is calculated; at the same time, the camera intrinsic matrix is obtained by assuming that the image observed by the camera is the reduced image I_m. The extrinsic and intrinsic matrices can transform pixel coordinates in the image into the world coordinate system.
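The back-projection from pixels to world coordinates can be sketched as follows (a hypothetical helper; it assumes the extrinsics are given as a world-to-camera rotation R and translation t, and that a depth value per pixel is available, e.g. from the sampling range):

```python
import numpy as np

def pixels_to_world(pixels: np.ndarray, depth: np.ndarray,
                    K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """pixels: [N, 2] (u, v) coordinates; depth: [N] depths along the camera z-axis.
    K: [3, 3] camera intrinsic matrix; R, t: world-to-camera extrinsics.
    Returns [N, 3] world coordinates."""
    ones = np.ones((pixels.shape[0], 1))
    homog = np.concatenate([pixels, ones], axis=1)           # homogeneous pixel coordinates
    cam = (np.linalg.inv(K) @ homog.T).T * depth[:, None]    # points in camera space
    world = (R.T @ (cam - t[None, :]).T).T                   # invert the extrinsic transform
    return world
```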
According to the NeRF rendering model, each pixel point is rendered by a ray passing through that point in space, and each ray is discretized to obtain n points. The number of points in space corresponding to all the pixel points in the I_m image is therefore n·c².
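The ray construction and discretization can be sketched as follows (the near/far bounds and the number of samples n are hypothetical parameters); for an I_m image of c×c pixels this produces the n·c² sampling points mentioned above:

```python
import numpy as np

def sample_points_on_rays(origins: np.ndarray, directions: np.ndarray,
                          t_near: float, t_far: float, n: int):
    """origins, directions: [num_rays, 3] for rays r(t) = O + t * d.
    Returns sample points [num_rays, n, 3] and their depths [num_rays, n]."""
    t_vals = np.linspace(t_near, t_far, n)                               # uniform discretization
    points = origins[:, None, :] + t_vals[None, :, None] * directions[:, None, :]
    depths = np.broadcast_to(t_vals, (origins.shape[0], n))
    return points, depths
```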
The face generation model is built from NeRF and a GAN. The background image I_b is downsampled and encoded, combined with the three-dimensional face features rendered by NeRF, and used as a style variable to control the face image produced by the generative adversarial model. The NeRF rendering model comprises: a voice encoding network, a position encoding network, a direction encoding network, a transparency prediction network, an RGB prediction network and a rendering module. The generative adversarial model, for example a GAN network, may include: a background encoding network, a style encoding network, a decoding generation network and a discriminator network.
The voice encoding network consists of several convolution blocks in sequence, where each block consists of a one-dimensional convolution, a nonlinear activation function and other basic network structures. The stride of each convolution is set to 2 so that the last dimension of the audio feature d_a ∈ R^(t×w) is repeatedly halved until it reaches 1; at the same time, the first dimension of the input audio feature is changed from t to e through the successive convolution blocks. The voice embedding feature f_a ∈ R^e is finally output.
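A minimal PyTorch sketch of such a voice encoding network (the per-frame dimension t, the embedding size e and the block layout are assumptions; w is assumed to be a power of two so that each stride-2 convolution exactly halves it):

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps an audio feature d_a of shape [batch, t, w] to a speech embedding f_a of shape [batch, e]."""

    def __init__(self, t: int = 29, e: int = 64, w: int = 16):
        super().__init__()
        blocks, ch = [], t
        while w > 1:                                  # each stride-2 conv halves the window dimension
            out_ch = e if w // 2 == 1 else ch         # switch to e channels in the last block
            blocks += [nn.Conv1d(ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch, w = out_ch, w // 2
        self.net = nn.Sequential(*blocks)

    def forward(self, d_a: torch.Tensor) -> torch.Tensor:
        return self.net(d_a).squeeze(-1)              # [batch, e, 1] -> [batch, e]
```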
The three-dimensional positions of all the sampling points are encoded. The three coordinates of each point are position-encoded using sine and cosine functions with n_p sets of frequencies. Taking the x coordinate as an example:
f_x = [sin(2^0·x), cos(2^0·x), sin(2^1·x), cos(2^1·x), …, sin(2^(n_p-1)·x), cos(2^(n_p-1)·x)];
where f_x is the sine-cosine encoding feature of the x coordinate. The features f_y and f_z are obtained for the y and z coordinates in the same way. The final position encoding feature is the concatenation of the three:
f_p = [f_x, f_y, f_z].
The directions of all the sampling points are encoded; the directions of sampling points on the same ray are consistent, and so are their encodings. Similar to the position encoding, the three coordinates of each ray direction d are encoded using sine and cosine functions with n_d sets of frequencies and concatenated with the original direction coordinates, giving the direction encoding feature:
f_d = [d, sin(2^0·d), cos(2^0·d), …, sin(2^(n_d-1)·d), cos(2^(n_d-1)·d)].
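Both encodings can be produced by one small helper, sketched below (frequency counts n_p and n_d are hypothetical defaults):

```python
import torch

def sin_cos_encode(x: torch.Tensor, num_freqs: int, include_input: bool = False) -> torch.Tensor:
    """x: [..., 3] coordinates (sampling-point positions or ray directions).
    Returns the concatenation of sin/cos features over num_freqs octave frequencies."""
    feats = [x] if include_input else []
    for k in range(num_freqs):
        feats += [torch.sin((2.0 ** k) * x), torch.cos((2.0 ** k) * x)]
    return torch.cat(feats, dim=-1)

# position feature and direction feature for the sampling points:
# f_p = sin_cos_encode(points, n_p)                       # [..., 3 * 2 * n_p]
# f_d = sin_cos_encode(dirs, n_d, include_input=True)     # shared by all points on the same ray
```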
The transparency prediction network outputs the color transparency for all sampling points. The voice embedding feature and the position feature are concatenated to form the first combined feature [f_a, f_p]. The transparency prediction network consists of several MLP modules in sequence, where each MLP consists of a fully connected layer, a nonlinear activation layer and other basic network structures. Before the last MLP, the feature f_t is obtained. The last MLP output module contains only a fully connected layer, mapping f_t to the transparency σ of dimension 1.
The RGB prediction network outputs the red, green and blue color values for all sampling points. f_t and the direction encoding f_d are concatenated to obtain the second combined feature [f_t, f_d]. The RGB prediction network likewise consists of several MLP modules in sequence, each consisting of a fully connected layer, a nonlinear activation layer and other basic network structures. Before the last MLP, the feature f_c is obtained. The last MLP output module contains only a fully connected layer, mapping f_c to the RGB color value of dimension 3, denoted C_o.
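The two prediction heads can be sketched as follows (hidden sizes are assumptions; f_a is assumed to have been expanded to one copy per sampling point before concatenation):

```python
import torch
import torch.nn as nn

def mlp(dims):
    """Stack of fully connected + LeakyReLU blocks."""
    layers = []
    for i in range(len(dims) - 1):
        layers += [nn.Linear(dims[i], dims[i + 1]), nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(*layers)

class NeRFHeads(nn.Module):
    def __init__(self, dim_fa: int, dim_fp: int, dim_fd: int, hidden: int = 256):
        super().__init__()
        self.sigma_net = mlp([dim_fa + dim_fp, hidden, hidden, hidden])  # produces f_t
        self.sigma_out = nn.Linear(hidden, 1)                            # transparency sigma (dim 1)
        self.rgb_net = mlp([hidden + dim_fd, hidden, hidden])            # produces f_c
        self.rgb_out = nn.Linear(hidden, 3)                              # color C_o (dim 3)

    def forward(self, f_a, f_p, f_d):
        f_t = self.sigma_net(torch.cat([f_a, f_p], dim=-1))  # first combined feature [f_a, f_p]
        sigma = self.sigma_out(f_t)
        f_c = self.rgb_net(torch.cat([f_t, f_d], dim=-1))    # second combined feature [f_t, f_d]
        rgb = self.rgb_out(f_c)
        return sigma, rgb, f_c
```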
The rendering module performs cumulative rendering on the sampling points on each ray to obtain the RGB color value C of the two-dimensional pixel point corresponding to that ray. Each two-dimensional pixel on the I_m image is traversed by a ray emanating from the camera center O:
r(t) = O + t·d;
where d ∈ R^3 is the ray direction and t is the distance along the ray. The color C of the two-dimensional pixel is accumulated from the point colors C_o on the ray between the near bound t_n and the far bound t_f. The rendering formula is:
C = ∫ from t_n to t_f of T(t)·σ(r(t))·C_o(r(t), d) dt;
where σ(·) and C_o(·) are the transparency prediction network and the RGB prediction network, respectively, and T(t) is the accumulated transparency along the ray direction, calculated as:
T(t) = exp(-∫ from t_n to t of σ(r(s)) ds).
The integral in the rendering formula is discretized over the n sampling points on each ray, and the three color-channel values of C_o are accumulated along the ray direction, so that an RGB predicted image I_n ∈ R^(c×c×3) is finally obtained.
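A discretized version of this accumulation, following the standard NeRF quadrature (tensor shapes are assumptions), can be sketched as:

```python
import torch

def volume_render(sigma: torch.Tensor, rgb: torch.Tensor, t_vals: torch.Tensor) -> torch.Tensor:
    """sigma: [num_rays, n, 1]; rgb: [num_rays, n, 3]; t_vals: [num_rays, n] sample depths.
    Returns the accumulated color C per ray, i.e. one pixel of the predicted image I_n."""
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)  # open last interval
    alpha = 1.0 - torch.exp(-torch.relu(sigma.squeeze(-1)) * deltas)            # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]  # T(t)
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)   # [num_rays, 3]
```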
The background encoding network downsamples and encodes the background image I_b. It consists of several convolution blocks, where each block contains a convolution, a batch normalization function, a nonlinear activation function and other basic network structures. The number of feature maps is changed and their size reduced by setting the number of convolution kernels and the convolution stride. Finally the feature f_b ∈ R^(c×c×n_b) is obtained, where n_b is the number of channels and the height and width are both c, consistent with the size of the rendered image I_n.
The style encoding network combines the two-dimensional background feature f_b and the rendered feature together and re-encodes them into the style variable required by StyleGAN. Denoting the number of f_c feature channels obtained for a single sampling point as n_c, the same cumulative rendering applied above to the RGB channels is applied to the feature channels of f_c to obtain the feature F_c ∈ R^(c×c×n_c). F_b (i.e. f_b) and F_c are concatenated to form a new feature that serves as the network input:
F_m = [F_b, F_c] ∈ R^(c×c×(n_b+n_c)).
The style encoding network consists of two parts. The first part comprises several consecutive convolution blocks, each containing a convolution, a batch normalization function, a nonlinear activation function and other basic network structures. By setting the convolution stride, the spatial size of F_m is reduced step by step from c down to 4; the number of channels is changed to 512 only in the first convolution block and left unchanged afterwards. The feature f_s ∈ R^(4×4×512) is thus obtained after the first part, and f_s is then flattened into a one-dimensional feature of length 8192 as the input of the second part of the network. The second part consists of several MLP modules, each made up of a fully connected layer and a nonlinear activation layer, and downsamples the feature by a factor of 16. After the second part, the style variable F_s ∈ R^512 is finally obtained.
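A compact sketch of this fusion path (channel counts n_b and n_c, the rendering resolution c and the hidden MLP size are assumptions):

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Fuses the background feature f_b [B, n_b, c, c] with the rendered NeRF feature
    F_c [B, n_c, c, c] and encodes them into the StyleGAN style variable F_s [B, 512]."""

    def __init__(self, n_b: int = 16, n_c: int = 16, c: int = 64):
        super().__init__()
        convs, ch, size = [], n_b + n_c, c
        while size > 4:                                   # stride-2 convs shrink the map down to 4x4
            convs += [nn.Conv2d(ch, 512, 3, stride=2, padding=1),
                      nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True)]
            ch, size = 512, size // 2
        self.convs = nn.Sequential(*convs)
        self.mlps = nn.Sequential(nn.Linear(4 * 4 * 512, 2048), nn.LeakyReLU(0.2, inplace=True),
                                  nn.Linear(2048, 512))   # flatten 8192 -> 512-dim style variable

    def forward(self, f_b: torch.Tensor, F_c: torch.Tensor) -> torch.Tensor:
        F_m = torch.cat([f_b, F_c], dim=1)                # splice background and rendered features
        f_s = self.convs(F_m)                             # [B, 512, 4, 4]
        return self.mlps(f_s.flatten(1))                  # F_s in R^512
```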
The decoding generation network uses a pre-trained StyleGAN decoder, with F_s as input, and finally generates the RGB image I_g ∈ R^(512×512×3).
The discriminator network uses the StyleGAN discriminator network; it takes 512×512 RGB images as input and outputs the probability that the image is a real image.
Setting a network loss function, training a network, and updating network parameters.
For the three-dimensional rendering, the generation of I_n needs to be supervised, and the rendering loss function is:
L_n = Per(I_m, I_n) + L1(I_m, I_n);
where Per(·) is the perceptual loss function, L1(·) is the mean absolute error loss function, I_n is the rendered result, and I_m is the true value.
For the generation network of the GAN, the generation of I_g needs to be supervised, and the generation loss function is:
L_g = Per(I, I_g) + L1(I, I_g);
where I_g is the result generated by the StyleGAN network and I is the true value.
I and I_g are input into the discriminator network, whose outputs are D_I and D_g respectively.
For the generation network, the total loss adds a discrimination term to the generation loss:
L_G = L_g + BCE(1, D_g);
where BCE(·) is the cross entropy function.
For the discriminator network of the GAN, the loss function is:
L_D = BCE(0, D_g) + BCE(1, D_I).
the network optimizer may choose an Adam optimizer to continually optimize network parameters according to a loss function. The decoder uses the pre-trained parameters and does not participate in the update throughout. Training is stopped until the loss function no longer drops significantly.
The trained model is then used to generate a speaking video. A segment of voice is obtained as the driving signal; first, the audio features of the input voice are extracted, and the audio features, the preprocessed background image I_b and the sampling points are input into the trained model together. For each frame of voice, a corresponding speaking image is generated. The images are then spliced into a video and the voice data is added to obtain the final voice-driven face video.
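The inference loop and video assembly can be sketched as follows (the model interface, file names and frame size are hypothetical; the audio track is muxed in afterwards, for example with ffmpeg):

```python
import cv2
import numpy as np

def generate_video(model, audio_features, background, sample_points,
                   fps: int = 25, out_path: str = "talking_face.mp4"):
    """audio_features: [num_frames, t, w]; model returns one 512x512 RGB frame per voice frame."""
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (512, 512))
    for frame_feat in audio_features:
        img = model(frame_feat, background, sample_points)   # hypothetical model call
        writer.write(cv2.cvtColor(np.asarray(img, dtype=np.uint8), cv2.COLOR_RGB2BGR))
    writer.release()
    # then overlay the voice, e.g.:
    # ffmpeg -i talking_face.mp4 -i speech.wav -c:v copy -shortest output.mp4
```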
The application uses the rendering model to extract the three-dimensional features of the face and combines them with the encoding features of the background image, so that no separate three-dimensional modeling of the background image is required; the combined features serve as a style variable that controls the image generation of the generative adversarial model, which preserves the three-dimensional modeling information while exploiting the high-quality image generation capability of the generative adversarial model. Since NeRF is used at a low resolution to obtain the three-dimensional features and the GAN network then performs super-resolution, the amount of computation can be greatly reduced and the hardware cost of rendering is lowered, thereby meeting the requirements of real-time performance and low cost in application scenarios.
The following describes a voice-driven face video generating system provided by the embodiment of the present application, and the voice-driven face video generating system described below and the voice-driven face video generating method described above may be referred to correspondingly. Referring to fig. 2, fig. 2 is a schematic structural diagram of a voice-driven face video generating system according to an embodiment of the present application, where the system includes:
the audio feature extraction module is used for acquiring voice data and extracting audio features of the voice data;
The face image generation module is used for inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by utilizing the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable;
and the video generation module is used for splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
Based on the above embodiments, as a preferred embodiment, the system includes:
The background image generation module is used for acquiring training videos and extracting the training videos to obtain training data; the training data comprises voice features and image features contained in the training video; intercepting and obtaining partial face images from the training data; the intercepting part comprises a face area, a neck area and a shoulder area; performing face segmentation on the partial face image by using a face analysis model to obtain a face mask and a background mask; the sizes of the face mask and the background mask are consistent with the sizes of the partial face images; and obtaining a background image with the face removed according to the background mask and the partial face image.
Based on the above embodiments, as a preferred embodiment, the background map generating module includes:
The face intercepting unit is used for extracting original image frames from the training video according to a set frame rate; the frame number of the original image frame is the same as the audio frame number; determining a face image position in the original image frame; and intercepting and obtaining a part of face image according to the face image position.
Based on the foregoing embodiment, as a preferred embodiment, the background map generating module may further include:
and the truth value data generating unit is used for obtaining a foreground image only containing a human face according to the human face mask and the partial human face image, and reducing the foreground image to obtain the truth value of the rendering model.
Based on the foregoing embodiment, as a preferred embodiment, the background map generating module may further include:
The world coordinate generation unit is used for converting pixel coordinates in the partial face image into world coordinates in a world coordinate system;
And the sampling point generating unit is used for treating each pixel point as being rendered by rays passing through the pixel point in space, and discretizing each ray to obtain the sampling point.
Based on the above-described embodiments, as a preferred embodiment, the world coordinate generation unit is a unit for performing the steps of:
Extracting the partial face image to obtain two-dimensional key points; matching the two-dimensional key points with the three-dimensional key points of the three-dimensional standard face model to determine the face pose of each frame of image; calculating a camera extrinsic matrix corresponding to each frame of image according to the face pose; assuming the image observed by the camera is the true value, acquiring a camera intrinsic matrix; and converting pixel coordinates in the partial face image into world coordinates in the world coordinate system by using the camera extrinsic matrix and the camera intrinsic matrix.
Based on the above embodiments, as a preferred embodiment, the rendering model includes:
The voice coding network comprises a plurality of convolution blocks and is used for extracting audio characteristics of input audio to obtain voice embedded characteristics;
the position coding network is used for coding the three-dimensional position information of the sampling points to obtain position characteristics;
The direction coding network is used for coding the direction of the sampling point to obtain a direction code; the direction and the code of the sampling point on the same ray are the same;
the transparency prediction network is used for outputting the color transparency of all the sampling points and splicing the voice embedding feature and the position feature to obtain a first combined feature;
The RGB prediction network is used for outputting red, green and blue tristimulus values for all the sampling points, and splicing the first combination characteristic and the direction code to obtain a second combination characteristic;
The rendering module is used for performing cumulative rendering on the sampling points on each ray to obtain the RGB color value of the two-dimensional pixel point corresponding to the ray, and accumulating the three color-channel values of the sampling points on each ray along the ray direction to obtain an RGB predicted image;
The background coding network is used for carrying out downsampling coding on the background image to obtain two-dimensional background characteristics;
And the style coding network is used for combining the two-dimensional background characteristic and the second combined characteristic and coding to obtain the style variable.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the steps provided by the above-described embodiments. The storage medium may include: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The present application also provides an electronic device, referring to fig. 3, and as shown in fig. 3, a block diagram of an electronic device provided in an embodiment of the present application may include a processor 1410 and a memory 1420.
Processor 1410 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor 1410 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) or PLA (Programmable Logic Array). Processor 1410 may also include a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1410 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1410 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1420 may include one or more computer-readable storage media, which may be non-transitory. Memory 1420 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 1420 is used at least to store a computer program 1421, which, when loaded and executed by the processor 1410, can implement relevant steps in the method performed by the electronic device side as disclosed in any of the foregoing embodiments. In addition, the resources stored by memory 1420 may include an operating system 1422, data 1423, and the like, and the storage may be transient storage or permanent storage. Operating system 1422 may include Windows, linux, android, among other things.
In some embodiments, the electronic device may further include a display 1430, an input-output interface 1440, a communication interface 1450, a sensor 1460, a power supply 1470, and a communication bus 1480.
Of course, the structure of the electronic device shown in fig. 3 is not limited to the electronic device in the embodiment of the present application, and the electronic device may include more or fewer components than those shown in fig. 3 or may combine some components in practical applications.
In the description, each embodiment is described in a progressive manner, and each embodiment is mainly described by the differences from other embodiments, so that the same similar parts among the embodiments are mutually referred. The system provided by the embodiment is relatively simple to describe as it corresponds to the method provided by the embodiment, and the relevant points are referred to in the description of the method section.
The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that the present application may be modified and practiced without departing from the spirit of the present application.
It should also be noted that in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A voice-driven face video generation method, characterized by comprising: acquiring voice data and extracting audio features of the voice data; inputting the audio features, a background image and sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model, wherein the face generation model is composed of a rendering model and a generative adversarial model and is used for downsampling and encoding the background image and combining the result with the three-dimensional face features extracted by the rendering model to obtain a style variable, and the generative adversarial model is used for generating the face speaking image according to the style variable; and splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
2. The voice-driven face video generation method according to claim 1, characterized in that the generation process of the background image comprises: acquiring a training video and extracting the training video to obtain training data, the training data comprising voice features and image features contained in the training video; cropping a partial face image from the training data, the cropped portion comprising a face area, a neck area and a shoulder area; performing face segmentation on the partial face image by using a face parsing model to obtain a face mask and a background mask, the sizes of the face mask and the background mask being consistent with the partial face image; and obtaining a background image with the face removed according to the background mask and the partial face image.
3. The voice-driven face video generation method according to claim 2, characterized in that cropping a partial face image from the training data comprises: extracting original image frames from the training video at a set frame rate, the number of original image frames being the same as the number of audio frames; determining a face image position in the original image frames; and cropping a partial face image according to the face image position.
4. The voice-driven face video generation method according to claim 2, characterized in that after obtaining the face mask and the background mask, the method further comprises: obtaining a foreground image containing only the face according to the face mask and the partial face image, and downscaling the foreground image to obtain the true value of the rendering model.
5. The voice-driven face video generation method according to claim 4, characterized in that after downscaling the foreground image to obtain the true value of the rendering model, the method further comprises: converting pixel coordinates in the partial face image into world coordinates in a world coordinate system; and regarding each pixel point as being rendered by a ray passing through the pixel point in space, and discretizing each ray to obtain the sampling points.
6. The voice-driven face video generation method according to claim 5, characterized in that converting the pixel coordinates in the partial face image into world coordinates in a world coordinate system comprises: extracting two-dimensional key points from the partial face image; matching the two-dimensional key points with the three-dimensional key points of a three-dimensional standard face model to determine the face pose of each frame of image; calculating a camera extrinsic matrix corresponding to each frame of image according to the face pose; acquiring a camera intrinsic matrix under the assumption that the image observed by the camera is the true value; and converting the pixel coordinates in the partial face image into world coordinates in the world coordinate system by applying the camera extrinsic matrix and the camera intrinsic matrix.
7. The voice-driven face video generation method according to claim 5, characterized in that the rendering model comprises: a voice encoding network comprising several convolution blocks, for extracting audio features of the input audio to obtain a voice embedding feature; a position encoding network for encoding the three-dimensional position information of the sampling points to obtain a position feature; a direction encoding network for encoding the directions of the sampling points to obtain a direction encoding, wherein the directions and encodings of sampling points on the same ray are the same; a transparency prediction network for outputting the color transparency of all the sampling points, the voice embedding feature and the position feature being concatenated to obtain a first combined feature; an RGB prediction network for outputting red, green and blue color values for all the sampling points, the first combined feature and the direction encoding being concatenated to obtain a second combined feature; a rendering module for performing cumulative rendering on the sampling points on each ray to obtain the RGB color value of the two-dimensional pixel point corresponding to the ray, and accumulating the three color-channel values of the sampling points on each ray along the ray direction to obtain an RGB predicted image; a background encoding network for downsampling and encoding the background image to obtain a two-dimensional background feature; and a style encoding network for combining the two-dimensional background feature and the second combined feature and encoding them to obtain the style variable.
8. A voice-driven face video generation system, characterized by comprising: an audio feature extraction module for acquiring voice data and extracting audio features of the voice data; a face image generation module for inputting the audio features, a background image and sampling points into a face generation model and generating a face speaking image corresponding to each frame of voice by using the face generation model, wherein the face generation model is composed of a rendering model and a generative adversarial model and is used for downsampling and encoding the background image and combining the result with the three-dimensional face features extracted by the rendering model to obtain a style variable, and the generative adversarial model is used for generating the face speaking image according to the style variable; and a video generation module for splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
9. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the voice-driven face video generation method according to any one of claims 1 to 7 are implemented.
10. An electronic device, characterized by comprising a memory and a processor, wherein a computer program is stored in the memory, and when the processor calls the computer program in the memory, the steps of the voice-driven face video generation method according to any one of claims 1 to 7 are implemented.
CN202411063361.0A 2024-08-05 2024-08-05 Voice-driven face video generation method, system, storage medium, and electronic device Active CN118969008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411063361.0A CN118969008B (en) 2024-08-05 2024-08-05 Voice-driven face video generation method, system, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411063361.0A CN118969008B (en) 2024-08-05 2024-08-05 Voice-driven face video generation method, system, storage medium, and electronic device

Publications (2)

Publication Number Publication Date
CN118969008A true CN118969008A (en) 2024-11-15
CN118969008B CN118969008B (en) 2025-09-26

Family

ID=93395931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411063361.0A Active CN118969008B (en) 2024-08-05 2024-08-05 Voice-driven face video generation method, system, storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN118969008B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003259320A (en) * 2002-03-05 2003-09-12 Matsushita Electric Ind Co Ltd Video and audio synthesizer
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 A method and system for generating speech video
KR20230123184A (en) * 2022-02-16 2023-08-23 주식회사 엘지유플러스 Method and apparatus for generating speech video
KR20230172427A (en) * 2022-06-15 2023-12-22 고려대학교 세종산학협력단 Talking face image synthesis system according to audio voice
CN116740788A (en) * 2023-06-14 2023-09-12 平安科技(深圳)有限公司 Virtual human speaking video generation method, server, equipment and storage medium
CN117372585A (en) * 2023-08-09 2024-01-09 广州虎牙科技有限公司 Face video generation method and device and electronic equipment
CN117611608A (en) * 2023-12-04 2024-02-27 上海积图科技有限公司 Video face cartoon method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈莹; 陈湟康: "Speaker recognition based on multi-modal generative adversarial network and triplet loss" (基于多模态生成对抗网络和三元组损失的说话人识别), 电子与信息学报 (Journal of Electronics & Information Technology), no. 02, 15 February 2020 (2020-02-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119851656A (en) * 2025-01-07 2025-04-18 广州航海学院 Method and device for generating face image based on voice

Also Published As

Publication number Publication date
CN118969008B (en) 2025-09-26

Similar Documents

Publication Publication Date Title
CN116363261B (en) Image editing model training method, image editing method and device
CA3137297C (en) Adaptive convolutions in neural networks
CN111783603A (en) Generative confrontation network training method, image face swapping, video face swapping method and device
CN114529574B (en) Image matting method and device based on image segmentation, computer equipment and medium
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
WO2024255423A1 (en) Image processing method and apparatus, and computer device and computer-readable storage medium
CN113486787B (en) Face driving and live broadcasting method and device, computer equipment and storage medium
WO2020077913A1 (en) Image processing method and device, and hardware device
CN118969008B (en) Voice-driven face video generation method, system, storage medium, and electronic device
CN117218246A (en) Training method, device, electronic equipment and storage medium for image generation model
CN118015142A (en) Face image processing method, device, computer equipment and storage medium
CN120894474B (en) A method and related equipment for real-time construction of digital humans
CN112102461B (en) A face rendering method, device, electronic device and storage medium
CN112669431B (en) Image processing methods, devices, equipment, storage media and program products
CN120201259A (en) Video generation method, device, equipment and storage medium
CN119992416A (en) Digital human driving model generation method, device, electronic device and storage medium
CN119788907A (en) Digital human synthesis method, device, electronic equipment and storage medium
CN119048616A (en) Text image generation and text image generation model training method and device
CN118229632A (en) Display screen defect detection method, model training method, device, equipment and medium
CN119052568B (en) Live broadcast real-time face replacement method and electronic device based on deep learning
CN120580339B (en) Digital human reconstruction method, system and computing device based on motion consistency constraints
CN115714888B (en) Video generation method, device, equipment and computer readable storage medium
CN121544760A (en) Normal map generation method and digital human video generation method
CN118097015A (en) A three-dimensional object reconstruction method, device, computer equipment and storage medium
CN116958328A (en) Method, device, equipment and storage medium for synthesizing mouth shape

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant