
CN118969008A - Voice-driven face video generation method, system, storage medium and electronic device - Google Patents


Info

Publication number
CN118969008A
CN118969008A (application CN202411063361.0A)
Authority
CN
China
Prior art keywords
face
image
voice
video
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411063361.0A
Other languages
Chinese (zh)
Other versions
CN118969008B (en)
Inventor
丁宝进
雷钧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Original Assignee
Zhejiang University ZJU
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Zhejiang Tonghuashun Intelligent Technology Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202411063361.0A priority Critical patent/CN118969008B/en
Publication of CN118969008A publication Critical patent/CN118969008A/en
Application granted granted Critical
Publication of CN118969008B publication Critical patent/CN118969008B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0475 - Generative networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/094 - Adversarial learning
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G06T13/20 - 3D [Three Dimensional] animation
    • G06T13/205 - 3D [Three Dimensional] animation driven by audio data
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 - Adaptation
    • G10L15/07 - Adaptation to the speaker
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a voice-driven face video generation method, which comprises the following steps: acquiring voice data and extracting audio features of the voice data; inputting the audio features, a background image and sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model; and splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video. According to the method, the background image does not need to be modeled in three dimensions separately, the amount of computation can be greatly reduced, and the hardware cost of rendering is lowered, so that the requirements of real-time performance and low cost in application scenarios are met. The application also provides a voice-driven face video generation system, a storage medium and an electronic device, which have the same beneficial effects.

Description

Voice-driven face video generation method and system, storage medium and electronic equipment
Technical Field
The present application relates to the field of image processing, and in particular, to a method and system for generating a voice-driven face video, a storage medium, and an electronic device.
Background
With the development of computer vision and natural language processing technology, artificial intelligence has been widely applied in the field of face generation. Voice-driven face video generation is a technique that generates realistic face images corresponding to a given voice input. With only voice input and a pre-trained model of a specific person, it can generate a speaking video of that person without the real person being present. This can not only enhance entertainment experiences and artistic creation, but also improve the effect of human-computer interaction and education and training, and it is widely used in the field of virtual anchors.
Thanks to the powerful generation capability of the Generative Adversarial Network (GAN), it has been applied to the field of voice-driven talking faces. A GAN network can learn the mapping from voice to face images and, once trained, can generate face images end-to-end from the input voice. However, because such a two-dimensional method directly outputs the RGB values of the target image from the voice, it only considers the result of a single frame, and therefore lacks a constraint on the three-dimensional consistency of the face.
Disclosure of Invention
The application aims to provide a voice-driven face video generation method, system, storage medium and electronic device, in which no three-dimensional modeling of the background is required, image generation is controlled through style variables, and high-quality images can be generated while three-dimensional modeling information is preserved.
In order to solve the above technical problems, the application provides a voice-driven face video generation method, the specific technical solution of which is as follows:
Acquiring voice data and extracting audio characteristics of the voice data;
Inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable;
And splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
Optionally, the generating process of the background image includes:
Acquiring a training video, and extracting the training video to obtain training data; the training data comprises voice features and image features contained in the training video;
Intercepting and obtaining partial face images from the training data; the intercepting part comprises a face area, a neck area and a shoulder area;
Performing face segmentation on the partial face image by using a face analysis model to obtain a face mask and a background mask; the sizes of the face mask and the background mask are consistent with the sizes of the partial face images;
And obtaining a background image with the face removed according to the background mask and the partial face image.
Optionally, the capturing a partial face image from the training data includes:
extracting original image frames from the training video according to a set frame rate; the frame number of the original image frame is the same as the audio frame number;
Determining a face image position in the original image frame;
and intercepting and obtaining a part of face image according to the face image position.
Optionally, after obtaining the face mask and the background mask, the method further includes:
And obtaining a foreground image only containing a human face according to the human face mask and the partial human face image, and reducing the foreground image to obtain a true value of the rendering model.
Optionally, after the foreground image is reduced to obtain the true value of the rendering model, the method further includes:
converting pixel coordinates in the partial face image into world coordinates in a world coordinate system;
and regarding each pixel point as being obtained by rendering rays passing through the pixel point in space, and discretizing each ray to obtain the sampling point.
Optionally, converting the pixel coordinates in the partial face image into world coordinates in a world coordinate system includes:
Extracting the partial face image to obtain two-dimensional key points;
Matching the two-dimensional key points with the three-dimensional key points of a three-dimensional standard face model to determine the face pose of each frame of image;
Calculating a camera extrinsic matrix corresponding to each frame of image according to the face pose;
assuming the image observed by the camera is the true value, acquiring a camera internal reference matrix;
and converting pixel coordinates in the partial face image into world coordinates in a world coordinate system by using the camera external reference matrix and the camera internal reference matrix.
Optionally, the rendering model includes:
The voice coding network comprises a plurality of convolution blocks and is used for extracting audio characteristics of input audio to obtain voice embedded characteristics;
the position coding network is used for coding the three-dimensional position information of the sampling points to obtain position characteristics;
The direction coding network is used for coding the direction of the sampling point to obtain a direction code; the direction and the code of the sampling point on the same ray are the same;
the transparency prediction network is used for outputting the color transparency of all the sampling points and splicing the voice embedding feature and the position feature to obtain a first combined feature;
The RGB prediction network is used for outputting red, green and blue tristimulus values for all the sampling points, and splicing the first combination characteristic and the direction code to obtain a second combination characteristic;
The rendering module is used for performing cumulative rendering on the sampling points on each ray to obtain the RGB color value of the two-dimensional pixel point corresponding to the ray, and accumulating the three color-channel values of the sampling points on each ray along the ray direction to obtain an RGB predicted image;
The background coding network is used for carrying out downsampling coding on the background image to obtain two-dimensional background characteristics;
And the style coding network is used for combining the two-dimensional background characteristic and the second combined characteristic and coding to obtain the style variable.
The application also provides a voice-driven human face video generation system, which comprises:
the audio feature extraction module is used for acquiring voice data and extracting audio features of the voice data;
The face image generation module is used for inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by utilizing the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable;
and the video generation module is used for splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
The application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method as described above.
The application also provides an electronic device comprising a memory in which a computer program is stored and a processor which when calling the computer program in the memory implements the steps of the method as described above.
The application provides a voice-driven face video generation method, which comprises the following steps: acquiring voice data and extracting audio characteristics of the voice data; inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable; and splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
According to the application, the rendering model is used to extract the three-dimensional features of the face, which are combined with the encoding features of the background image, so that the three-dimensional modeling information is preserved while the high-quality image generation capability of the generative adversarial model is exploited; the amount of computation is greatly reduced and the hardware cost of rendering is lowered, thereby meeting the requirements of real-time performance and low cost in application scenarios.
The application also provides a voice-driven human face video generation system, a storage medium and electronic equipment, which have the beneficial effects and are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for generating a voice-driven face video according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a voice-driven face video generating system according to an embodiment of the present application;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a flowchart of a method for generating a voice-driven face video according to an embodiment of the present application, where the method includes:
s101: acquiring voice data and extracting audio characteristics of the voice data;
S102: inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable;
S103: splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
When generating the voice-driven face video, the face speaking images can be obtained simply by inputting the voice data into the face generation model, and the final video is obtained after the images are spliced and combined with the voice data. This embodiment assumes that a complete face generation model has been acquired or trained before step S102 is executed.
The following description is made with respect to the generation process of the background image provided by the present application:
firstly, acquiring a training video, and extracting the training video to obtain training data; the training data comprises voice features and image features contained in the training video;
Step two, intercepting and obtaining partial face images from the training data; the intercepting part comprises a face area, a neck area and a shoulder area;
Thirdly, carrying out face segmentation on the partial face image by utilizing a face analysis model to obtain a face mask and a background mask; the sizes of the face mask and the background mask are consistent with the sizes of the partial face images;
and step four, obtaining a background image with the face removed according to the background mask and the partial face image.
Voice features and image features are acquired from a preset training video to serve as training data. First, the corresponding audio is extracted from the training video, and features are extracted at 25 frames per second using a pre-trained speech recognition model such as Wav2Vec or DeepSpeech, giving a per-frame feature of dimension t. To give the voice features temporal context, a window of w frames is taken in the time dimension, so that the finally extracted feature is d_a ∈ R^(t×w), where w denotes w consecutive frames centered on the current frame; the values are padded when the window meets the sequence boundaries.
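As an illustration of the windowing described above, a minimal sketch is given below (a hypothetical helper; it assumes the per-frame features have already been extracted into a NumPy array of shape [num_frames, t], and edge padding is one possible choice for filling the boundaries):

```python
import numpy as np

def window_audio_features(frames: np.ndarray, w: int) -> np.ndarray:
    """Stack w consecutive frames (centered on the current frame) for every time step.

    frames: [num_frames, t] per-frame audio features (e.g. from Wav2Vec/DeepSpeech at 25 fps).
    returns: [num_frames, t, w] windowed features d_a; boundaries are edge-padded.
    """
    half = w // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")  # replicate boundary frames
    windows = [padded[i:i + w] for i in range(frames.shape[0])]   # each window: [w, t]
    return np.stack(windows).transpose(0, 2, 1)                   # -> [num_frames, t, w]
```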
Image feature extraction is then performed. The video is first decomposed into original image frames at a frame rate of 25, consistent with the frame rate of the audio features, i.e. the number of image frames equals the number of audio frames. The position of the face in each image is then determined, and an upper-body face image is cropped from a fixed position in each original frame; the crop contains the face area, the neck area and the shoulder area. The crop is then scaled to 512×512 to obtain the partial face image I ∈ R^(512×512×3). Face segmentation is performed on I with a face parsing model to obtain a face mask F and a background mask B, whose sizes are consistent with I. A foreground image I_c containing only the face is obtained from the face mask F, with the pixel values of all other areas set to 0, and a background image I_b with the face and neck removed is obtained from the background mask B, with the pixel values of all other areas set to 0:
I_c = I ⊙ F, I_b = I ⊙ B;
where ⊙ denotes element-wise multiplication. The foreground image I_c is scaled down to obtain I_m ∈ R^(c×c×3), c < 512, which serves as the true value of the rendering model. The rendering model is not limited here; NeRF is taken as an example.
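A minimal sketch of this mask-and-downscale step, assuming OpenCV is available, that F and B are binary masks of shape [512, 512] produced by the face parsing model, and that the downscaled size c is a hypothetical value:

```python
import cv2
import numpy as np

def split_foreground_background(image: np.ndarray, face_mask: np.ndarray,
                                bg_mask: np.ndarray, c: int = 128):
    """image I: [512, 512, 3]; face_mask F / bg_mask B: [512, 512] binary masks.

    Returns the face-only foreground I_c, the face-removed background I_b,
    and the downscaled rendering ground truth I_m of shape [c, c, 3]."""
    I_c = image * face_mask[..., None]   # I_c = I ⊙ F, all other pixels become 0
    I_b = image * bg_mask[..., None]     # I_b = I ⊙ B
    I_m = cv2.resize(I_c, (c, c), interpolation=cv2.INTER_AREA)  # c < 512
    return I_c, I_b, I_m
```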
Two-dimensional key points are extracted from the partial face image and matched with the three-dimensional key points of a three-dimensional standard face model to determine the face pose of each frame of image. The camera extrinsic matrix corresponding to each frame of image is calculated from the face pose, and the camera intrinsic matrix is obtained under the assumption that the image observed by the camera is the true value; the camera extrinsic matrix and camera intrinsic matrix are then used to convert pixel coordinates in the partial face image into world coordinates in the world coordinate system.
Specifically, two-dimensional face key points are extracted from the partial face image I and matched with the three-dimensional key points of the three-dimensional standard face model to estimate the face pose of each frame of image, from which the camera extrinsic matrix corresponding to each frame is calculated; at the same time, the camera intrinsic matrix is obtained by assuming that the image observed by the camera is the reduced image I_m. The extrinsic and intrinsic matrices can transform pixel coordinates in the image into the world coordinate system.
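The back-projection from pixels to world coordinates can be sketched as follows (a hypothetical helper; it assumes the extrinsics are given as a world-to-camera rotation R and translation t, and that a depth value per pixel is available, e.g. from the sampling range):

```python
import numpy as np

def pixels_to_world(pixels: np.ndarray, depth: np.ndarray,
                    K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """pixels: [N, 2] (u, v) coordinates; depth: [N] depths along the camera z-axis.
    K: [3, 3] camera intrinsic matrix; R, t: world-to-camera extrinsics.
    Returns [N, 3] world coordinates."""
    ones = np.ones((pixels.shape[0], 1))
    homog = np.concatenate([pixels, ones], axis=1)           # homogeneous pixel coordinates
    cam = (np.linalg.inv(K) @ homog.T).T * depth[:, None]    # points in camera space
    world = (R.T @ (cam - t[None, :]).T).T                   # invert the extrinsic transform
    return world
```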
According to the NeRF rendering model, each pixel point is rendered by a ray passing through that point in space, and each ray is discretized to obtain n points. The number of points in space corresponding to all the pixel points in the I_m image is therefore n·c².
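The ray construction and discretization can be sketched as follows (the near/far bounds and the number of samples n are hypothetical parameters); for an I_m image of c×c pixels this produces the n·c² sampling points mentioned above:

```python
import numpy as np

def sample_points_on_rays(origins: np.ndarray, directions: np.ndarray,
                          t_near: float, t_far: float, n: int):
    """origins, directions: [num_rays, 3] for rays r(t) = O + t * d.
    Returns sample points [num_rays, n, 3] and their depths [num_rays, n]."""
    t_vals = np.linspace(t_near, t_far, n)                               # uniform discretization
    points = origins[:, None, :] + t_vals[None, :, None] * directions[:, None, :]
    depths = np.broadcast_to(t_vals, (origins.shape[0], n))
    return points, depths
```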
The face generation model is built from NeRF and a GAN. The background image I_b is downsampled and encoded, combined with the three-dimensional face features rendered by NeRF, and used as a style variable to control the face image produced by the generative adversarial model. The NeRF rendering model comprises: a voice encoding network, a position encoding network, a direction encoding network, a transparency prediction network, an RGB prediction network and a rendering module. The generative adversarial model, for example a GAN network, may include: a background encoding network, a style encoding network, a decoding generation network and a discriminator network.
The voice encoding network consists of several convolution blocks in sequence, where each block consists of a one-dimensional convolution, a nonlinear activation function and other basic network structures. The stride of each convolution is set to 2 so that the last dimension of the audio feature d_a ∈ R^(t×w) is repeatedly halved until it reaches 1; at the same time, the first dimension of the input audio feature is changed from t to e through the successive convolution blocks. The voice embedding feature f_a ∈ R^e is finally output.
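A minimal PyTorch sketch of such a voice encoding network (the per-frame dimension t, the embedding size e and the block layout are assumptions; w is assumed to be a power of two so that each stride-2 convolution exactly halves it):

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps an audio feature d_a of shape [batch, t, w] to a speech embedding f_a of shape [batch, e]."""

    def __init__(self, t: int = 29, e: int = 64, w: int = 16):
        super().__init__()
        blocks, ch = [], t
        while w > 1:                                  # each stride-2 conv halves the window dimension
            out_ch = e if w // 2 == 1 else ch         # switch to e channels in the last block
            blocks += [nn.Conv1d(ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch, w = out_ch, w // 2
        self.net = nn.Sequential(*blocks)

    def forward(self, d_a: torch.Tensor) -> torch.Tensor:
        return self.net(d_a).squeeze(-1)              # [batch, e, 1] -> [batch, e]
```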
The three-dimensional positions of all the sampling points are encoded. The three coordinates of each point are position-encoded using sine and cosine functions with n_p sets of frequencies. Taking the x coordinate as an example:
f_x = [sin(2^0·x), cos(2^0·x), sin(2^1·x), cos(2^1·x), …, sin(2^(n_p-1)·x), cos(2^(n_p-1)·x)];
where f_x is the sine-cosine encoding feature of the x coordinate. The features f_y and f_z are obtained for the y and z coordinates in the same way. The final position encoding feature is the concatenation of the three:
f_p = [f_x, f_y, f_z].
The directions of all the sampling points are encoded; the directions of sampling points on the same ray are consistent, and so are their encodings. Similar to the position encoding, the three coordinates of each ray direction d are encoded using sine and cosine functions with n_d sets of frequencies and concatenated with the original direction coordinates, giving the direction encoding feature:
f_d = [d, sin(2^0·d), cos(2^0·d), …, sin(2^(n_d-1)·d), cos(2^(n_d-1)·d)].
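Both encodings can be produced by one small helper, sketched below (frequency counts n_p and n_d are hypothetical defaults):

```python
import torch

def sin_cos_encode(x: torch.Tensor, num_freqs: int, include_input: bool = False) -> torch.Tensor:
    """x: [..., 3] coordinates (sampling-point positions or ray directions).
    Returns the concatenation of sin/cos features over num_freqs octave frequencies."""
    feats = [x] if include_input else []
    for k in range(num_freqs):
        feats += [torch.sin((2.0 ** k) * x), torch.cos((2.0 ** k) * x)]
    return torch.cat(feats, dim=-1)

# position feature and direction feature for the sampling points:
# f_p = sin_cos_encode(points, n_p)                       # [..., 3 * 2 * n_p]
# f_d = sin_cos_encode(dirs, n_d, include_input=True)     # shared by all points on the same ray
```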
The transparency prediction network outputs the color transparency for all sampling points. The voice embedding feature and the position feature are concatenated to form the first combined feature [f_a, f_p]. The transparency prediction network consists of several MLP modules in sequence, where each MLP consists of a fully connected layer, a nonlinear activation layer and other basic network structures. Before the last MLP, the feature f_t is obtained. The last MLP output module contains only a fully connected layer, mapping f_t to the transparency σ of dimension 1.
The RGB prediction network outputs the red, green and blue color values for all sampling points. f_t and the direction encoding f_d are concatenated to obtain the second combined feature [f_t, f_d]. The RGB prediction network likewise consists of several MLP modules in sequence, each consisting of a fully connected layer, a nonlinear activation layer and other basic network structures. Before the last MLP, the feature f_c is obtained. The last MLP output module contains only a fully connected layer, mapping f_c to the RGB color value of dimension 3, denoted C_o.
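The two prediction heads can be sketched as follows (hidden sizes are assumptions; f_a is assumed to have been expanded to one copy per sampling point before concatenation):

```python
import torch
import torch.nn as nn

def mlp(dims):
    """Stack of fully connected + LeakyReLU blocks."""
    layers = []
    for i in range(len(dims) - 1):
        layers += [nn.Linear(dims[i], dims[i + 1]), nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(*layers)

class NeRFHeads(nn.Module):
    def __init__(self, dim_fa: int, dim_fp: int, dim_fd: int, hidden: int = 256):
        super().__init__()
        self.sigma_net = mlp([dim_fa + dim_fp, hidden, hidden, hidden])  # produces f_t
        self.sigma_out = nn.Linear(hidden, 1)                            # transparency sigma (dim 1)
        self.rgb_net = mlp([hidden + dim_fd, hidden, hidden])            # produces f_c
        self.rgb_out = nn.Linear(hidden, 3)                              # color C_o (dim 3)

    def forward(self, f_a, f_p, f_d):
        f_t = self.sigma_net(torch.cat([f_a, f_p], dim=-1))  # first combined feature [f_a, f_p]
        sigma = self.sigma_out(f_t)
        f_c = self.rgb_net(torch.cat([f_t, f_d], dim=-1))    # second combined feature [f_t, f_d]
        rgb = self.rgb_out(f_c)
        return sigma, rgb, f_c
```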
The rendering module performs cumulative rendering on the sampling points on each ray to obtain the RGB color value C of the two-dimensional pixel point corresponding to that ray. Each two-dimensional pixel on the I_m image is traversed by a ray emanating from the camera center O:
r(t) = O + t·d;
where d ∈ R^3 is the ray direction and t is the distance along the ray. The color C of the two-dimensional pixel is accumulated from the point colors C_o on the ray between the near bound t_n and the far bound t_f. The rendering formula is:
C = ∫ from t_n to t_f of T(t)·σ(r(t))·C_o(r(t), d) dt;
where σ(·) and C_o(·) are the transparency prediction network and the RGB prediction network, respectively, and T(t) is the accumulated transparency along the ray direction, calculated as:
T(t) = exp(-∫ from t_n to t of σ(r(s)) ds).
The integral in the rendering formula is discretized over the n sampling points on each ray, and the three color-channel values of C_o are accumulated along the ray direction, so that an RGB predicted image I_n ∈ R^(c×c×3) is finally obtained.
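A discretized version of this accumulation, following the standard NeRF quadrature (tensor shapes are assumptions), can be sketched as:

```python
import torch

def volume_render(sigma: torch.Tensor, rgb: torch.Tensor, t_vals: torch.Tensor) -> torch.Tensor:
    """sigma: [num_rays, n, 1]; rgb: [num_rays, n, 3]; t_vals: [num_rays, n] sample depths.
    Returns the accumulated color C per ray, i.e. one pixel of the predicted image I_n."""
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=-1)  # open last interval
    alpha = 1.0 - torch.exp(-torch.relu(sigma.squeeze(-1)) * deltas)            # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]  # T(t)
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)   # [num_rays, 3]
```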
The background encoding network downsamples and encodes the background image I_b. It consists of several convolution blocks, where each block contains a convolution, a batch normalization function, a nonlinear activation function and other basic network structures. The number of feature maps is changed and their size reduced by setting the number of convolution kernels and the convolution stride. Finally the feature f_b ∈ R^(c×c×n_b) is obtained, where n_b is the number of channels and the height and width are both c, consistent with the size of the rendered image I_n.
The style encoding network combines the two-dimensional background feature f_b and the rendered feature together and re-encodes them into the style variable required by StyleGAN. Denoting the number of f_c feature channels obtained for a single sampling point as n_c, the same cumulative rendering applied above to the RGB channels is applied to the feature channels of f_c to obtain the feature F_c ∈ R^(c×c×n_c). F_b (i.e. f_b) and F_c are concatenated to form a new feature that serves as the network input:
F_m = [F_b, F_c] ∈ R^(c×c×(n_b+n_c)).
The style encoding network consists of two parts. The first part comprises several consecutive convolution blocks, each containing a convolution, a batch normalization function, a nonlinear activation function and other basic network structures. By setting the convolution stride, the spatial size of F_m is reduced step by step from c down to 4; the number of channels is changed to 512 only in the first convolution block and left unchanged afterwards. The feature f_s ∈ R^(4×4×512) is thus obtained after the first part, and f_s is then flattened into a one-dimensional feature of length 8192 as the input of the second part of the network. The second part consists of several MLP modules, each made up of a fully connected layer and a nonlinear activation layer, and downsamples the feature by a factor of 16. After the second part, the style variable F_s ∈ R^512 is finally obtained.
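A compact sketch of this fusion path (channel counts n_b and n_c, the rendering resolution c and the hidden MLP size are assumptions):

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Fuses the background feature f_b [B, n_b, c, c] with the rendered NeRF feature
    F_c [B, n_c, c, c] and encodes them into the StyleGAN style variable F_s [B, 512]."""

    def __init__(self, n_b: int = 16, n_c: int = 16, c: int = 64):
        super().__init__()
        convs, ch, size = [], n_b + n_c, c
        while size > 4:                                   # stride-2 convs shrink the map down to 4x4
            convs += [nn.Conv2d(ch, 512, 3, stride=2, padding=1),
                      nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True)]
            ch, size = 512, size // 2
        self.convs = nn.Sequential(*convs)
        self.mlps = nn.Sequential(nn.Linear(4 * 4 * 512, 2048), nn.LeakyReLU(0.2, inplace=True),
                                  nn.Linear(2048, 512))   # flatten 8192 -> 512-dim style variable

    def forward(self, f_b: torch.Tensor, F_c: torch.Tensor) -> torch.Tensor:
        F_m = torch.cat([f_b, F_c], dim=1)                # splice background and rendered features
        f_s = self.convs(F_m)                             # [B, 512, 4, 4]
        return self.mlps(f_s.flatten(1))                  # F_s in R^512
```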
The decoding generation network uses a pre-trained StyleGAN decoder, with F_s as input, and finally generates the RGB image I_g ∈ R^(512×512×3).
The discriminator network uses the StyleGAN discriminator network; it takes 512×512 RGB images as input and outputs the probability that the image is a real image.
Setting a network loss function, training a network, and updating network parameters.
For the three-dimensional rendering, the generation of I_n needs to be supervised, and the rendering loss function is:
L_n = Per(I_m, I_n) + L1(I_m, I_n);
where Per(·) is the perceptual loss function, L1(·) is the mean absolute error loss function, I_n is the rendered result, and I_m is the true value.
For the generation network of the GAN, the generation of I_g needs to be supervised, and the generation loss function is:
L_g = Per(I, I_g) + L1(I, I_g);
where I_g is the result generated by the StyleGAN network and I is the true value.
I and I_g are input into the discriminator network, whose outputs are D_I and D_g respectively.
For the generation network, the total loss adds a discrimination term to the generation loss:
L_G = L_g + BCE(1, D_g);
where BCE(·) is the cross entropy function.
For the discriminator network of the GAN, the loss function is:
L_D = BCE(0, D_g) + BCE(1, D_I).
the network optimizer may choose an Adam optimizer to continually optimize network parameters according to a loss function. The decoder uses the pre-trained parameters and does not participate in the update throughout. Training is stopped until the loss function no longer drops significantly.
The trained model is then used to generate a speaking video. A segment of voice is obtained as the driving signal; first, the audio features of the input voice are extracted, and the audio features, the preprocessed background image I_b and the sampling points are input into the trained model together. For each frame of voice, a corresponding speaking image is generated. The images are then spliced into a video and the voice data is added to obtain the final voice-driven face video.
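The inference loop and video assembly can be sketched as follows (the model interface, file names and frame size are hypothetical; the audio track is muxed in afterwards, for example with ffmpeg):

```python
import cv2
import numpy as np

def generate_video(model, audio_features, background, sample_points,
                   fps: int = 25, out_path: str = "talking_face.mp4"):
    """audio_features: [num_frames, t, w]; model returns one 512x512 RGB frame per voice frame."""
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (512, 512))
    for frame_feat in audio_features:
        img = model(frame_feat, background, sample_points)   # hypothetical model call
        writer.write(cv2.cvtColor(np.asarray(img, dtype=np.uint8), cv2.COLOR_RGB2BGR))
    writer.release()
    # then overlay the voice, e.g.:
    # ffmpeg -i talking_face.mp4 -i speech.wav -c:v copy -shortest output.mp4
```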
The application uses the rendering model to extract the three-dimensional features of the face and combines them with the encoding features of the background image, so that no separate three-dimensional modeling of the background image is required; the combined features serve as a style variable that controls the image generation of the generative adversarial model, which preserves the three-dimensional modeling information while exploiting the high-quality image generation capability of the generative adversarial model. Since NeRF is used at a low resolution to obtain the three-dimensional features and the GAN network then performs super-resolution, the amount of computation can be greatly reduced and the hardware cost of rendering is lowered, thereby meeting the requirements of real-time performance and low cost in application scenarios.
The following describes a voice-driven face video generating system provided by the embodiment of the present application, and the voice-driven face video generating system described below and the voice-driven face video generating method described above may be referred to correspondingly. Referring to fig. 2, fig. 2 is a schematic structural diagram of a voice-driven face video generating system according to an embodiment of the present application, where the system includes:
the audio feature extraction module is used for acquiring voice data and extracting audio features of the voice data;
The face image generation module is used for inputting the audio characteristics, the background image and the sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by utilizing the face generation model; the face generation model is composed of a rendering model and a generative adversarial model, and is used for carrying out downsampling coding on the background image and combining the downsampling coding with three-dimensional features of the face extracted by the rendering model to obtain style variables; the generative adversarial model is used for generating the face speaking image according to the style variable;
and the video generation module is used for splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
Based on the above embodiments, as a preferred embodiment, the system includes:
The background image generation module is used for acquiring training videos and extracting the training videos to obtain training data; the training data comprises voice features and image features contained in the training video; intercepting and obtaining partial face images from the training data; the intercepting part comprises a face area, a neck area and a shoulder area; performing face segmentation on the partial face image by using a face analysis model to obtain a face mask and a background mask; the sizes of the face mask and the background mask are consistent with the sizes of the partial face images; and obtaining a background image with the face removed according to the background mask and the partial face image.
Based on the above embodiments, as a preferred embodiment, the background map generating module includes:
The face intercepting unit is used for extracting original image frames from the training video according to a set frame rate; the frame number of the original image frame is the same as the audio frame number; determining a face image position in the original image frame; and intercepting and obtaining a part of face image according to the face image position.
Based on the foregoing embodiment, as a preferred embodiment, the background map generating module may further include:
and the truth value data generating unit is used for obtaining a foreground image only containing a human face according to the human face mask and the partial human face image, and reducing the foreground image to obtain the truth value of the rendering model.
Based on the foregoing embodiment, as a preferred embodiment, the background map generating module may further include:
The world coordinate generation unit is used for converting pixel coordinates in the partial face image into world coordinates in a world coordinate system;
And the sampling point generating unit is used for treating each pixel point as being rendered by rays passing through the pixel point in space, and discretizing each ray to obtain the sampling point.
Based on the above-described embodiments, as a preferred embodiment, the world coordinate generation unit is a unit for performing the steps of:
Extracting the partial face image to obtain two-dimensional key points; matching the two-dimensional key points with the three-dimensional key points of the three-dimensional standard face model to determine the face pose of each frame of image; calculating a camera extrinsic matrix corresponding to each frame of image according to the face pose; assuming the image observed by the camera is the true value, acquiring a camera intrinsic matrix; and converting pixel coordinates in the partial face image into world coordinates in the world coordinate system by using the camera extrinsic matrix and the camera intrinsic matrix.
Based on the above embodiments, as a preferred embodiment, the rendering model includes:
The voice coding network comprises a plurality of convolution blocks and is used for extracting audio characteristics of input audio to obtain voice embedded characteristics;
the position coding network is used for coding the three-dimensional position information of the sampling points to obtain position characteristics;
The direction coding network is used for coding the direction of the sampling point to obtain a direction code; the direction and the code of the sampling point on the same ray are the same;
the transparency prediction network is used for outputting the color transparency of all the sampling points and splicing the voice embedding feature and the position feature to obtain a first combined feature;
The RGB prediction network is used for outputting red, green and blue tristimulus values for all the sampling points, and splicing the first combination characteristic and the direction code to obtain a second combination characteristic;
The rendering module is used for performing cumulative rendering on the sampling points on each ray to obtain the RGB color value of the two-dimensional pixel point corresponding to the ray, and accumulating the three color-channel values of the sampling points on each ray along the ray direction to obtain an RGB predicted image;
The background coding network is used for carrying out downsampling coding on the background image to obtain two-dimensional background characteristics;
And the style coding network is used for combining the two-dimensional background characteristic and the second combined characteristic and coding to obtain the style variable.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the steps provided by the above-described embodiments. The storage medium may include: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The present application also provides an electronic device, referring to fig. 3, and as shown in fig. 3, a block diagram of an electronic device provided in an embodiment of the present application may include a processor 1410 and a memory 1420.
Processor 1410 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor 1410 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) or PLA (Programmable Logic Array). Processor 1410 may also include a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1410 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1410 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1420 may include one or more computer-readable storage media, which may be non-transitory. Memory 1420 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 1420 is used at least to store a computer program 1421, which, when loaded and executed by the processor 1410, can implement relevant steps in the method performed by the electronic device side as disclosed in any of the foregoing embodiments. In addition, the resources stored by memory 1420 may include an operating system 1422, data 1423, and the like, and the storage may be transient storage or permanent storage. Operating system 1422 may include Windows, linux, android, among other things.
In some embodiments, the electronic device may further include a display 1430, an input-output interface 1440, a communication interface 1450, a sensor 1460, a power supply 1470, and a communication bus 1480.
Of course, the structure of the electronic device shown in fig. 3 is not limited to the electronic device in the embodiment of the present application, and the electronic device may include more or fewer components than those shown in fig. 3 or may combine some components in practical applications.
In the description, each embodiment is described in a progressive manner, and each embodiment is mainly described by the differences from other embodiments, so that the same similar parts among the embodiments are mutually referred. The system provided by the embodiment is relatively simple to describe as it corresponds to the method provided by the embodiment, and the relevant points are referred to in the description of the method section.
The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that the present application may be modified and practiced without departing from the spirit of the present application.
It should also be noted that in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A voice-driven face video generation method, characterized by comprising: acquiring voice data and extracting audio features of the voice data; inputting the audio features, a background image and sampling points into a face generation model, and generating a face speaking image corresponding to each frame of voice by using the face generation model, wherein the face generation model is composed of a rendering model and a generative adversarial model and is used for downsampling and encoding the background image and combining the result with the three-dimensional face features extracted by the rendering model to obtain a style variable, and the generative adversarial model is used for generating the face speaking image according to the style variable; and splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
2. The voice-driven face video generation method according to claim 1, characterized in that the generation process of the background image comprises: acquiring a training video and extracting the training video to obtain training data, the training data comprising voice features and image features contained in the training video; cropping a partial face image from the training data, the cropped portion comprising a face area, a neck area and a shoulder area; performing face segmentation on the partial face image by using a face parsing model to obtain a face mask and a background mask, the sizes of the face mask and the background mask being consistent with the partial face image; and obtaining a background image with the face removed according to the background mask and the partial face image.
3. The voice-driven face video generation method according to claim 2, characterized in that cropping a partial face image from the training data comprises: extracting original image frames from the training video at a set frame rate, the number of original image frames being the same as the number of audio frames; determining a face image position in the original image frames; and cropping a partial face image according to the face image position.
4. The voice-driven face video generation method according to claim 2, characterized in that after obtaining the face mask and the background mask, the method further comprises: obtaining a foreground image containing only the face according to the face mask and the partial face image, and downscaling the foreground image to obtain the true value of the rendering model.
5. The voice-driven face video generation method according to claim 4, characterized in that after downscaling the foreground image to obtain the true value of the rendering model, the method further comprises: converting pixel coordinates in the partial face image into world coordinates in a world coordinate system; and regarding each pixel point as being rendered by a ray passing through the pixel point in space, and discretizing each ray to obtain the sampling points.
6. The voice-driven face video generation method according to claim 5, characterized in that converting the pixel coordinates in the partial face image into world coordinates in a world coordinate system comprises: extracting two-dimensional key points from the partial face image; matching the two-dimensional key points with the three-dimensional key points of a three-dimensional standard face model to determine the face pose of each frame of image; calculating a camera extrinsic matrix corresponding to each frame of image according to the face pose; acquiring a camera intrinsic matrix under the assumption that the image observed by the camera is the true value; and converting the pixel coordinates in the partial face image into world coordinates in the world coordinate system by applying the camera extrinsic matrix and the camera intrinsic matrix.
7. The voice-driven face video generation method according to claim 5, characterized in that the rendering model comprises: a voice encoding network comprising several convolution blocks, for extracting audio features of the input audio to obtain a voice embedding feature; a position encoding network for encoding the three-dimensional position information of the sampling points to obtain a position feature; a direction encoding network for encoding the directions of the sampling points to obtain a direction encoding, wherein the directions and encodings of sampling points on the same ray are the same; a transparency prediction network for outputting the color transparency of all the sampling points, the voice embedding feature and the position feature being concatenated to obtain a first combined feature; an RGB prediction network for outputting red, green and blue color values for all the sampling points, the first combined feature and the direction encoding being concatenated to obtain a second combined feature; a rendering module for performing cumulative rendering on the sampling points on each ray to obtain the RGB color value of the two-dimensional pixel point corresponding to the ray, and accumulating the three color-channel values of the sampling points on each ray along the ray direction to obtain an RGB predicted image; a background encoding network for downsampling and encoding the background image to obtain a two-dimensional background feature; and a style encoding network for combining the two-dimensional background feature and the second combined feature and encoding them to obtain the style variable.
8. A voice-driven face video generation system, characterized by comprising: an audio feature extraction module for acquiring voice data and extracting audio features of the voice data; a face image generation module for inputting the audio features, a background image and sampling points into a face generation model and generating a face speaking image corresponding to each frame of voice by using the face generation model, wherein the face generation model is composed of a rendering model and a generative adversarial model and is used for downsampling and encoding the background image and combining the result with the three-dimensional face features extracted by the rendering model to obtain a style variable, and the generative adversarial model is used for generating the face speaking image according to the style variable; and a video generation module for splicing the face speaking images into a video and superimposing the voice data to obtain the voice-driven face video.
9. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the voice-driven face video generation method according to any one of claims 1 to 7 are implemented.
10. An electronic device, characterized by comprising a memory and a processor, wherein a computer program is stored in the memory, and when the processor calls the computer program in the memory, the steps of the voice-driven face video generation method according to any one of claims 1 to 7 are implemented.
CN202411063361.0A 2024-08-05 2024-08-05 Voice-driven face video generation method, system, storage medium, and electronic device Active CN118969008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411063361.0A CN118969008B (en) 2024-08-05 2024-08-05 Voice-driven face video generation method, system, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411063361.0A CN118969008B (en) 2024-08-05 2024-08-05 Voice-driven face video generation method, system, storage medium, and electronic device

Publications (2)

Publication Number Publication Date
CN118969008A true CN118969008A (en) 2024-11-15
CN118969008B CN118969008B (en) 2025-09-26

Family

ID=93395931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411063361.0A Active CN118969008B (en) 2024-08-05 2024-08-05 Voice-driven face video generation method, system, storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN118969008B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003259320A (en) * 2002-03-05 2003-09-12 Matsushita Electric Ind Co Ltd Video and audio synthesizer
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 A method and system for generating speech video
KR20230123184A (en) * 2022-02-16 2023-08-23 주식회사 엘지유플러스 Method and apparatus for generating speech video
KR20230172427A (en) * 2022-06-15 2023-12-22 고려대학교 세종산학협력단 Talking face image synthesis system according to audio voice
CN116740788A (en) * 2023-06-14 2023-09-12 平安科技(深圳)有限公司 Virtual human speaking video generation method, server, equipment and storage medium
CN117372585A (en) * 2023-08-09 2024-01-09 广州虎牙科技有限公司 Face video generation method and device and electronic equipment
CN117611608A (en) * 2023-12-04 2024-02-27 上海积图科技有限公司 Video face cartoon method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈莹; 陈湟康: "Speaker recognition based on multi-modal generative adversarial network and triplet loss" (基于多模态生成对抗网络和三元组损失的说话人识别), 电子与信息学报 (Journal of Electronics & Information Technology), no. 02, 15 February 2020 (2020-02-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119851656A (en) * 2025-01-07 2025-04-18 广州航海学院 Method and device for generating face image based on voice

Also Published As

Publication number Publication date
CN118969008B (en) 2025-09-26

Similar Documents

Publication Publication Date Title
CN116363261B (en) Image editing model training method, image editing method and device
CA3137297C (en) Adaptive convolutions in neural networks
CN111783603A (en) Generative confrontation network training method, image face swapping, video face swapping method and device
CN114529574B (en) Image matting method and device based on image segmentation, computer equipment and medium
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
WO2024255423A1 (en) Image processing method and apparatus, and computer device and computer-readable storage medium
CN113486787B (en) Face driving and live broadcasting method and device, computer equipment and storage medium
WO2020077913A1 (en) Image processing method and device, and hardware device
CN118969008B (en) Voice-driven face video generation method, system, storage medium, and electronic device
CN117218246A (en) Training method, device, electronic equipment and storage medium for image generation model
CN118015142A (en) Face image processing method, device, computer equipment and storage medium
CN120894474B (en) A method and related equipment for real-time construction of digital humans
CN112102461B (en) A face rendering method, device, electronic device and storage medium
CN112669431B (en) Image processing methods, devices, equipment, storage media and program products
CN120201259A (en) Video generation method, device, equipment and storage medium
CN119992416A (en) Digital human driving model generation method, device, electronic device and storage medium
CN119788907A (en) Digital human synthesis method, device, electronic equipment and storage medium
CN119048616A (en) Text image generation and text image generation model training method and device
CN118229632A (en) Display screen defect detection method, model training method, device, equipment and medium
CN119052568B (en) Live broadcast real-time face replacement method and electronic device based on deep learning
CN120580339B (en) Digital human reconstruction method, system and computing device based on motion consistency constraints
CN115714888B (en) Video generation method, device, equipment and computer readable storage medium
CN121544760A (en) Normal map generation method and digital human video generation method
CN118097015A (en) A three-dimensional object reconstruction method, device, computer equipment and storage medium
CN116958328A (en) Method, device, equipment and storage medium for synthesizing mouth shape

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant