
CN113269862B - Scene self-adaptive fine three-dimensional face reconstruction method, system and electronic equipment - Google Patents


Info

Publication number
CN113269862B
Authority
CN
China
Prior art keywords: dimensional, dimensional face, image, shape, face
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110601213.XA
Other languages
Chinese (zh)
Other versions
CN113269862A (en)
Inventor
雷震 (Zhen Lei)
朱翔昱 (Xiangyu Zhu)
于畅 (Chang Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN202110601213.XA
Publication of CN113269862A
Application granted
Publication of CN113269862B
Legal status: Active
Anticipated expiration

Classifications

    • G06T 15/00: 3D [Three Dimensional] image rendering
        • G06T 15/005: General purpose rendering architectures
        • G06T 15/04: Texture mapping
        • G06T 15/50: Lighting effects
            • G06T 15/60: Shadow generation
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
        • G06T 17/20: Finite element generation, e.g. wire-frame surface description, tessellation
    • Y02T 10/00: Road transport of goods or passengers
        • Y02T 10/10: Internal combustion engine [ICE] based vehicles
            • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the technical field of image processing and pattern recognition, and in particular relates to a scene self-adaptive fine three-dimensional face reconstruction method, system, and electronic device, aiming to solve the strong model-like appearance and poor generalization of the reconstruction results of existing three-dimensional face reconstruction methods. The method augments the face shapes of a training set based on a 3DMM and a graphical imaging model to obtain a large number of three-dimensional face samples and their corresponding images; a three-dimensional morphable model is fitted to each image as an initial shape, and virtual multi-view generation is performed from the image and the initial shape to obtain multi-view images; the multi-view images are input into a many-to-one funnel network and optimized through a vision-consistent loss function to obtain a refined three-dimensional face shape. The invention improves three aspects, namely training-data construction, model design, and scene self-adaptation, realizing fine shape reconstruction and improving the robustness of the model in unconstrained scenes.

Description

Scene self-adaptive fine three-dimensional face reconstruction method, system and electronic equipment
Technical Field
The invention belongs to the technical field of image processing and pattern recognition, and particularly relates to a scene self-adaptive fine three-dimensional face reconstruction method, system, and electronic device.
Background
At present, three-dimensional face reconstruction algorithms mainly perform shape reconstruction based on a three-dimensional morphable model (3DMM). However, owing to the cost of acquiring three-dimensional data, most 3DMM models are built from only a few hundred scanned point clouds with very small spans of age and other attributes; the scans are usually captured in a controlled environment with frontal faces and neutral expressions. The resulting models therefore have limited expressive power, and it is difficult for them to describe all the variations that real faces may exhibit.
Existing mainstream algorithms reconstruct three-dimensional faces with convolutional neural networks, whose training generally requires a large number of dense three-dimensional face point clouds and corresponding face images as supervision. However, manual annotation of such data is expensive and difficult, so most existing three-dimensional datasets use 3DMM fitting results as labels for network training; this causes the three-dimensional shapes to lose many details, and the reconstruction results tend to have a strong model-like appearance. Algorithms have therefore been proposed that project the 3DMM fitting result onto the original image for self-supervised learning, but since pose is the main factor affecting face position and shape only slightly affects the positions of points, such techniques focus mainly on the accuracy of pose estimation rather than on shape reconstruction.
In addition, three-dimensional face training data are mostly collected indoors. Owing to the data-driven nature of deep learning, when a model is applied to outdoor unconstrained scenes, existing algorithms show poor robustness because of differences in illumination, resolution, pose, occlusion, and other factors.
Disclosure of Invention
In order to solve the above problems, namely the strong model-like appearance and poor generalization of the reconstruction results of existing three-dimensional face reconstruction methods, the invention provides a scene self-adaptive fine three-dimensional face reconstruction method, system, and electronic device.
The first aspect of the invention provides a scene self-adaptive fine three-dimensional face reconstruction method, comprising the following steps: step S100, augmenting the face shapes of the training set based on a 3DMM and a graphical imaging model to obtain a large amount of high-fidelity training data; the high-fidelity training data comprise three-dimensional face data and their corresponding images;
step S200, based on the acquired high-fidelity training data, fitting a three-dimensional morphable model to the image corresponding to the three-dimensional face data as an initial shape, and performing virtual multi-view generation based on the image and the initial shape to obtain multi-view images;
and inputting the multi-view images into a many-to-one funnel network and optimizing through a vision-consistent loss function to obtain a refined three-dimensional face shape.
In some preferred embodiments, the augmentation in step S100 specifically comprises the following steps: step S110, completing the incomplete depth channel of the input sample image by gridding, and converting the image into a first three-dimensional grid;
the depth channel comprises a depth channel of the face area and a depth channel of the background area; the values of the depth channel of the face area are acquired as follows: the input sample image is fitted based on the 3DMM to obtain an initial three-dimensional face shape, and a non-rigid registration algorithm is then applied to the initial shape to obtain the accurate face depth;
the values of the depth channel of the background area are obtained based on the original depth channel values and a smoothness constraint; the sample image is an RGB-D image containing a human face;
step S120, randomly selecting eyes, a nose, a mouth, and cheeks from the training set and combining them to obtain a first three-dimensional face shape;
Step S130, replacing the face area in the first three-dimensional grid in step S110 with the first three-dimensional face shape, and adjusting the position of a background anchor point through the smoothness constraint of the background area to obtain a second three-dimensional grid as a three-dimensional structure after shape migration;
step S140, performing shadow migration on the second three-dimensional grid based on a graphical imaging model to obtain a third three-dimensional grid;
and step S150, rendering the third three-dimensional grid to obtain the shape-migrated image, thereby completing the high-fidelity augmentation of the three-dimensional face shape.
In some preferred embodiments, the method for acquiring the values of the depth channels of the background area in step S110 is as follows:
uniformly paving a plurality of anchor points in a background area, constructing a triangular network based on the paved anchor points through a Delaunay triangulation algorithm, and calculating the depth of each anchor point through a preset first method; completing the depth channel of the background area based on the depth of each anchor point;
the preset first method solves the following least-squares problem:

$$\min_{\{d_i\}}\ \sum_{i}\mathrm{Mask}(x_i,y_i)\,\bigl(d_i-\mathrm{Depth}(x_i,y_i)\bigr)^2\;+\;\sum_{i,j}\mathrm{Connect}(i,j)\,\bigl(d_i-d_j\bigr)^2$$

where d_i represents the depth of the i-th anchor point, Mask(x_i, y_i) indicates whether the depth channel of the i-th anchor point has a value, Depth(x_i, y_i) is the value of the depth channel of the RGB-D image at the position of the i-th anchor point, and Connect(i, j) indicates whether the i-th and j-th anchor points are connected by an edge of the triangular network.
In some preferred embodiments, the adjustment of the positions of the background anchor points in step S130 solves:

$$\min_{\{p_i\}}\ \sum_{i}\mathrm{FaceContour}(i)\,\bigl\|p_i-p_i^{t}\bigr\|^2\;+\;\sum_{i,j}\mathrm{Connect}(i,j)\,\bigl\|(p_i-p_j)-(p_i^{s}-p_j^{s})\bigr\|^2$$

where p_i^s is the location of the i-th anchor point on the source image, p_i^t is the target location of the i-th anchor point, FaceContour(i) indicates whether the i-th anchor point is located on the facial contour, and Connect(i, j) indicates whether the i-th and j-th anchor points are connected in the background grid.
In some preferred embodiments, the shadow migration in step S140 is specifically:
based on a preset imaging formula, the normal vectors and specular reflection values are replaced by the values after shape migration, while the remaining imaging parameters keep the values of the source image, yielding the target face after shadow migration;
the preset imaging formula is:

$$C_i=\mathrm{Amb}\cdot T_i+\mathrm{Dir}\cdot T_i\,\langle \mathbf{n}_i,\mathbf{l}\rangle+k_s\,\langle \mathbf{ve},\mathbf{r}_i\rangle^{\nu},\qquad \mathbf{r}_i=2\,\langle \mathbf{n}_i,\mathbf{l}\rangle\,\mathbf{n}_i-\mathbf{l}$$

where C_i is the color value of the i-th point of the three-dimensional shape, T_i is the texture value of the i-th point, Amb is the ambient light, the diagonal matrix Dir represents parallel light in the l direction, n_i is the normal direction of the i-th point, k_s is the specular reflection coefficient, ve is the viewing direction, ν is the angle-distribution parameter controlling the specular reflection, and r_i is the direction of the maximum specular reflection value.
In some preferred embodiments, step S200 specifically comprises the following steps: step S210, fitting the input sample image based on the 3DMM to obtain a roughly reconstructed three-dimensional face; gridding the input sample image based on the roughly reconstructed three-dimensional face to obtain a fourth three-dimensional grid; the depth channel values of the face region are taken from the roughly reconstructed three-dimensional face, and the depth channel values of the background region are set to the mean depth of the roughly reconstructed three-dimensional face;
step S220, mirror-flipping the input sample image and repeating step S210 to obtain the three-dimensional grid of the mirrored image as a fifth three-dimensional grid, and combining the fifth three-dimensional grid with the fourth three-dimensional grid to obtain a sixth three-dimensional grid with complete texture;
step S230, rendering the sixth three-dimensional grid shape according to a first preset viewing angle, and calculating the visibility of each triangular mesh to obtain a visibility map;
step S240, obtaining a virtual multi-view image based on the visibility map and the rendering result;
step S250, respectively inputting a preset number of virtual views into corresponding encoders to extract image features, and expanding the image features to a UV space to obtain a UV feature group of the multi-view face;
the UV feature sets are connected in series and then input into a decoder; the offset between the real three-dimensional face and the roughly reconstructed three-dimensional face is obtained by deconvolution, and the offset is added to the roughly reconstructed three-dimensional face to obtain the reconstructed three-dimensional shape;
step S260, setting the texture value of each point of the reconstructed three-dimensional face to (255, 255, 255), placing it under direct frontal light, and calculating the RGB value of each point under this illumination; rendering it according to a second preset viewing angle with a differentiable renderer; repeating the rendering with the real three-dimensional face S* as the target face, and obtaining the visual-similarity evaluation index L_psd from the error between the rendered images:

$$L_{psd}=\sum_{v}\Bigl\|R\bigl(S_{init}+\Delta S,\;T_w,\;I_{orth},\;v\bigr)-R\bigl(S^{*},\;T_w,\;I_{orth},\;v\bigr)\Bigr\|$$

where v denotes the corresponding view-angle index; R denotes the rendering function, whose inputs are a three-dimensional shape, a texture, and an illumination, rendered at view v; S* is the real three-dimensional face; S_init is the roughly reconstructed three-dimensional face; ΔS is the offset; T_w is the all-white texture; and I_orth represents parallel light in the direction (0, 0, 1).
In some preferred embodiments, the method further comprises step S300: fine-tuning the many-to-one funnel network.
The fine-tuning process is specifically as follows: step S310, grouping the data in the training set by attribute; based on the distributions of the two attributes of age and gender, the data are divided into several parts and trained group by group to obtain attribute-specific three-dimensional shape models;
wherein the age-associated model group comprises five parts: infants, teenagers, young adults, the middle-aged, and the elderly; the gender-associated model group comprises two parts: male θ_male and female θ_female;
step S320, inputting an unlabeled image of the target scene, estimating its age and gender, and matching the corresponding models from the attribute-specific model groups to obtain two model parameters θ_age and θ_gender;
step S330, extracting the key points and segmentation masks of the unlabeled images in the target scene;
step S340, performing targeted augmentation on the input sample image to obtain augmented images;
step S350, based on the results of step S330 and step S340, fine-tuning the two model parameters θ_age and θ_gender using weak-supervision and self-supervision constraints to obtain a pseudo-label model;
step S360, generating a plurality of pseudo three-dimensional labels for the target-scene data based on the pseudo-label model, obtaining updated pseudo three-dimensional labels by a mean-fusion strategy, and combining the updated pseudo three-dimensional labels with the source data to retrain the many-to-one funnel network.
A second aspect of the present invention provides a scene-adaptive fine three-dimensional face reconstruction system, comprising: the device comprises an acquisition module, an augmentation module, a fitting module, a generation module and an optimization module;
The acquisition module is configured to acquire a two-dimensional face image to be reconstructed as an input sample image;
the augmentation module is configured to augment an input sample image based on the 3DMM and the graphical imaging model to obtain a plurality of high-fidelity training data;
The fitting module is configured to fit, based on the acquired high-fidelity training data, a three-dimensional morphable model to the image corresponding to the three-dimensional face data to serve as the initial shape;
the generating module is configured to perform virtual multi-view generation based on the image corresponding to the three-dimensional face data and the initial shape to obtain multi-view images;
the optimizing module is configured to input the multi-view images into a many-to-one funnel network and optimize through a vision-consistent loss function to obtain a refined three-dimensional face shape.
A third aspect of the present invention provides an electronic device comprising: at least one processor; and a memory communicatively coupled to at least one of the processors; wherein the memory stores instructions executable by the processor for execution by the processor to implement the scene-adaptive fine three-dimensional face reconstruction method of any one of the above.
A fourth aspect of the present invention proposes a computer readable storage medium storing computer instructions for execution by the computer to implement the scene-adaptive fine three-dimensional face reconstruction method of any one of the above.
1) Aiming at the defects of strong model-like appearance and limited application scenes in the reconstruction results of existing three-dimensional face reconstruction methods, the invention makes improvements in three aspects: training-data construction, model design, and scene self-adaptation. It provides a network structure and a corresponding optimization method for fine shape reconstruction, and on this basis designs a scene self-adaptation method for three-dimensional face reconstruction, realizing fine shape reconstruction and improving the robustness of the model in unconstrained scenes.
2) In the construction of the three-dimensional data, the data are augmented through shape migration, enriching the shape variation of the dataset. On this basis, the single-view fine three-dimensional face reconstruction method based on a deep neural network improves the accuracy and visual consistency of three-dimensional face reconstruction. In addition, the scene self-adaptation method for three-dimensional reconstruction improves the adaptability of the network to the target scene, so that the model can reconstruct stably in any target scene.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of one embodiment of a scene adaptive fine three-dimensional face reconstruction method of the present invention;
FIG. 2 is a flow chart of a method of generating high fidelity virtual three-dimensional face data in the invention;
FIG. 3 is a flow chart of a single view fine three-dimensional face reconstruction method in the present invention;
FIG. 4 is a flow chart of a scene adaptation method of three-dimensional reconstruction in the present invention;
FIG. 5 is a schematic diagram of a frame of one embodiment of a scene adaptive fine three-dimensional face reconstruction system of the present invention;
FIG. 6 is a schematic diagram of a computer system of a server for implementing embodiments of the method, system, and apparatus of the present application.
Detailed Description
In order to make the embodiments, technical solutions and advantages of the present invention more obvious, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the embodiments are some, but not all embodiments of the present invention. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
The first aspect of the invention provides a scene self-adaptive fine three-dimensional face reconstruction method, comprising the following steps: step S100, augmenting the face shapes of the training set based on a 3DMM and a graphical imaging model to obtain a large amount of high-fidelity training data, where the high-fidelity training data comprise three-dimensional face data and their corresponding images; step S200, based on the acquired high-fidelity training data, fitting a three-dimensional morphable model to the image corresponding to the three-dimensional face data as an initial shape, and performing virtual multi-view generation based on the image and the initial shape to obtain multi-view images; and inputting the multi-view images into a many-to-one funnel network and optimizing through a vision-consistent loss function to obtain a refined three-dimensional face shape.
Aiming at the defects of strong model-like appearance and poor generalization in the reconstruction results of existing three-dimensional face reconstruction methods, the invention provides a scene self-adaptive fine three-dimensional face reconstruction method, so that the robustness of the model in unconstrained scenes can be improved while the shape is finely reconstructed. The method comprises a high-fidelity virtual three-dimensional face data generation method, a single-view fine three-dimensional face reconstruction method, and a scene self-adaptation method for three-dimensional reconstruction. In the high-fidelity virtual three-dimensional face data generation method, during training-data construction, the face shapes of the training set are augmented based on a three-dimensional prior and a graphical imaging model to obtain large-scale high-fidelity three-dimensional face data and their corresponding images.
The single-view fine three-dimensional face reconstruction method relies on a deep neural network structure: a three-dimensional morphable model is first fitted to the image as an initial shape; virtual multi-view generation is performed on this basis; the multi-view images are input into a many-to-one funnel network that learns the residual between the initial shape and the target shape; and finally a vision-consistent loss function lets the network concentrate on optimizing the fine structures of the three-dimensional face, improving the reconstructed visual effect.
The invention is further described below in connection with specific embodiments with reference to the accompanying drawings.
Referring to fig. 1 to 4, the first aspect of the present invention provides a scene self-adaptive fine three-dimensional face reconstruction method comprising the following steps: step S100, during training-data construction, augmenting the face shapes of the training set based on a 3DMM (i.e., a statistical 3D morphable model of the face) and a graphical imaging model to obtain a large amount of high-fidelity training data, the high-fidelity training data comprising three-dimensional face data and their corresponding images; step S200, based on the acquired high-fidelity training data, fitting a three-dimensional morphable model to the image corresponding to the three-dimensional face data as an initial shape, and performing virtual multi-view generation based on the image and the initial shape to obtain multi-view images; and inputting the multi-view images into a many-to-one funnel network and optimizing through a vision-consistent loss function to obtain a refined three-dimensional face shape. Based on the obtained high-fidelity training data, the invention provides a single-view fine three-dimensional face reconstruction method in which, through the virtual multi-view image generation method, the many-to-one funnel network structure, and the vision-consistent loss function, the network can concentrate on optimizing the fine structures of the three-dimensional face, effectively improving the visual effect of three-dimensional face reconstruction.
The augmentation in step S100 specifically comprises the following steps: step S110, completing the incomplete depth channel of the input sample image by gridding, and converting the image into a first three-dimensional grid. The depth channel comprises a depth channel of the face region and a depth channel of the background region. The values of the depth channel of the face region are acquired as follows: the input sample image is fitted based on the 3DMM to obtain an initial three-dimensional face shape, and a non-rigid registration algorithm is then applied to the initial shape to obtain the accurate face depth. The sample image is an RGB-D image containing a human face.
The values of the depth channel of the background region are obtained based on the original depth channel values and a smoothness constraint. The acquisition method is specifically: uniformly placing a number of anchor points in the background area, constructing a triangular network over the placed anchor points by the Delaunay triangulation algorithm, calculating the depth of each anchor point by a preset first method, and completing the depth channel of the background area based on the depths of the anchor points.
The preset first method solves the following least-squares problem:

$$\min_{\{d_i\}}\ \sum_{i}\mathrm{Mask}(x_i,y_i)\,\bigl(d_i-\mathrm{Depth}(x_i,y_i)\bigr)^2\;+\;\sum_{i,j}\mathrm{Connect}(i,j)\,\bigl(d_i-d_j\bigr)^2$$

where d_i represents the depth of the i-th anchor point, Mask(x_i, y_i) indicates whether the depth channel of the i-th anchor point has a value, Depth(x_i, y_i) is the value of the depth channel of the RGB-D image at the position of the i-th anchor point, and Connect(i, j) indicates whether the i-th and j-th anchor points are connected by an edge of the triangular network.
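This least-squares problem maps directly onto a sparse linear system. Below is a minimal Python sketch of the background depth completion, under stated assumptions; the function name, the weight lam, and the input layout (per-anchor arrays plus a Delaunay edge list) are illustrative, not taken from the patent.

```python
# Hedged sketch of step S110's background depth completion: a data term keeps
# observed anchor depths, a smoothness term makes triangulation-connected
# anchors agree. The smoothness weight lam is an assumption.
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def complete_anchor_depths(depth, mask, edges, lam=1.0):
    """depth: (N,) observed depth per anchor; mask: (N,) bool validity flags;
    edges: list of (i, j) anchor pairs from the Delaunay triangulation."""
    n = len(depth)
    observed = np.flatnonzero(mask)
    A = lil_matrix((len(observed) + len(edges), n))
    b = np.zeros(A.shape[0])
    for r, i in enumerate(observed):              # data term: d_i ≈ Depth(x_i, y_i)
        A[r, i] = 1.0
        b[r] = depth[i]
    w = np.sqrt(lam)
    for r, (i, j) in enumerate(edges, start=len(observed)):
        A[r, i], A[r, j] = w, -w                  # smoothness term: d_i ≈ d_j
    return lsqr(A.tocsr(), b)[0]                  # least-squares anchor depths
```

The same pattern, a data term on constrained anchors plus a smoothness term on triangulation edges, also covers the anchor-position adjustment of step S130 below.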
In step S120, eyes, a nose, a mouth, and cheeks are randomly selected from the training set (i.e., from the three-dimensional face samples in the dataset) and combined to obtain a first three-dimensional face shape, which serves as the target of the subsequent shape migration; a minimal sketch of this recombination is given below.
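The patent specifies only which regions are recombined, so the index bookkeeping in this sketch is an assumption; it relies on all shapes sharing the 3DMM vertex ordering, with region_idx (region name mapped to a vertex index array) precomputed.

```python
# Hedged sketch of step S120: graft eye/nose/mouth regions from random donor
# faces onto a random base shape that contributes the cheeks and contour.
import numpy as np

def recombine_face(shapes, region_idx, rng=None):
    """shapes: (K, N, 3) array of K registered three-dimensional face shapes."""
    rng = rng or np.random.default_rng()
    new_shape = shapes[rng.integers(len(shapes))].copy()   # base: cheeks/contour
    for region in ("eyes", "nose", "mouth"):
        donor = shapes[rng.integers(len(shapes))]          # random donor face
        idx = region_idx[region]
        new_shape[idx] = donor[idx]                        # copy region vertices
    return new_shape
```

In practice the region boundaries would need blending, e.g., by the non-rigid registration already used in step S110, so that the grafted parts join smoothly.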
Step S130, replacing the face area in the first three-dimensional grid of step S110 with the first three-dimensional face shape, and adjusting the positions of the background anchor points through the smoothness constraint of the background area to obtain a second three-dimensional grid as the three-dimensional structure after shape migration. The position adjustment of the background anchor points solves:

$$\min_{\{p_i\}}\ \sum_{i}\mathrm{FaceContour}(i)\,\bigl\|p_i-p_i^{t}\bigr\|^2\;+\;\sum_{i,j}\mathrm{Connect}(i,j)\,\bigl\|(p_i-p_j)-(p_i^{s}-p_j^{s})\bigr\|^2$$

where p_i^s is the location of the i-th anchor point on the source image, p_i^t is the target location of the i-th anchor point, FaceContour(i) indicates whether the i-th anchor point is located on the facial contour, and Connect(i, j) indicates whether the i-th and j-th anchor points are connected in the background grid.
Step S140, performing shadow migration on the second three-dimensional grid obtained in step S130 based on the graphical imaging model to obtain a third three-dimensional grid. The shadow migration is specifically: based on a preset imaging formula, the normal vectors and specular reflection values are replaced by the values after shape migration, while the remaining imaging parameters keep the values of the source image, yielding the target face after shadow migration.
The preset imaging formula is:

$$C_i=\mathrm{Amb}\cdot T_i+\mathrm{Dir}\cdot T_i\,\langle \mathbf{n}_i,\mathbf{l}\rangle+k_s\,\langle \mathbf{ve},\mathbf{r}_i\rangle^{\nu},\qquad \mathbf{r}_i=2\,\langle \mathbf{n}_i,\mathbf{l}\rangle\,\mathbf{n}_i-\mathbf{l}$$

where C_i is the color value of the i-th point of the three-dimensional shape, T_i is the texture value of the i-th point, Amb is the ambient light, the diagonal matrix Dir represents parallel light in the l direction, n_i is the normal direction of the i-th point, k_s is the specular reflection coefficient, ve is the viewing direction, ν is the angle-distribution parameter controlling the specular reflection, and r_i is the direction of the maximum specular reflection value.
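The formula is a standard Phong-style model, so its evaluation can be sketched in a few lines; the array shapes and parameter names below are assumptions for illustration. For shadow migration, it would be called with the target face's normals (and, if desired, specular parameters) while keeping the source texture and lighting.

```python
# Hedged sketch of the imaging formula of step S140.
import numpy as np

def shade(T, normals, amb, dir_rgb, l, ks, ve, nu):
    """T: (N,3) per-point texture; normals: (N,3) unit normals; amb, dir_rgb:
    (3,) gains of the diagonal matrices Amb and Dir; l, ve: (3,) unit light and
    view directions; ks, nu: specular strength and exponent."""
    ndotl = np.clip(normals @ l, 0.0, None)            # diffuse term <n_i, l>
    r = 2.0 * ndotl[:, None] * normals - l             # max-specular direction r_i
    spec = ks * np.clip(r @ ve, 0.0, None) ** nu       # specular term <ve, r_i>^nu
    return T * amb + T * dir_rgb * ndotl[:, None] + spec[:, None]
```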
And step S150, rendering the obtained third three-dimensional grid to obtain the shape-migrated image, thereby completing the high-fidelity augmentation of the three-dimensional face shape.
Further, step S200 specifically comprises the following steps: step S210, fitting the input sample image based on the 3DMM to obtain a roughly reconstructed three-dimensional face; gridding the input sample image based on the roughly reconstructed three-dimensional face to obtain a fourth three-dimensional grid; the depth channel values of the face region are taken from the roughly reconstructed three-dimensional face, and the depth channel values of the background region are set to the mean depth of the roughly reconstructed three-dimensional face.
Step S220, mirror-flipping the input sample image and repeating step S210 to obtain the three-dimensional grid of the mirrored image as a fifth three-dimensional grid, and combining the fifth three-dimensional grid with the fourth three-dimensional grid to obtain a sixth three-dimensional grid with complete texture;
Step S230, rendering the sixth three-dimensional grid shape according to the first preset viewing angle, where the first preset viewing angle comprises five fixed viewing angles (pitch angle, yaw angle): (0°, 0°), (0°, 25°), (0°, 50°), (15°, 0°), and (-25°, 0°). During rendering, the visibility of each triangular mesh is calculated and rendered to the target view as the color of the face patch (i.e., the face mesh), giving the visibility map vis(m):

$$\mathrm{vis}(m)=\langle \mathbf{l},\,\mathbf{n}(m)\rangle$$

where l = (0, 0, 1)ᵀ is the viewing direction, n(·) is the normal vector of the patch, and m is the patch position at the original viewing angle.
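A direct per-face computation of this visibility, a sketch assuming the mesh has already been transformed into the camera frame of the original view:

```python
# Hedged sketch of the visibility of step S230: vis(m) = <l, n(m)> with
# l = (0, 0, 1), clamped at zero so back-facing triangles get no weight.
import numpy as np

def face_visibility(vertices, faces):
    """vertices: (N,3) mesh vertices in camera coordinates;
    faces: (M,3) integer vertex indices; returns (M,) visibility per face."""
    v0, v1, v2 = vertices[faces[:, 0]], vertices[faces[:, 1]], vertices[faces[:, 2]]
    n = np.cross(v1 - v0, v2 - v0)                         # face normals
    n /= np.linalg.norm(n, axis=1, keepdims=True) + 1e-8   # normalize
    return np.clip(n @ np.array([0.0, 0.0, 1.0]), 0.0, None)
```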
Step S240, obtaining the virtual multi-view image I based on the visibility map and the rendering results:

$$I=\lambda\odot I_{origin}+\lambda_{flip}\odot I_{flip}$$

where I_origin and I_flip are the images rendered from the original three-dimensional grid and the mirrored three-dimensional grid at the target viewing angle, ⊙ denotes the element-wise product, and λ and λ_flip are the weights of the corresponding images.
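The exact form of λ and λ_flip is not preserved in this text, so the sketch below makes one natural assumption: the weights are the rendered visibility maps, normalized per pixel.

```python
# Hedged sketch of step S240's fusion I = λ ⊙ I_origin + λ_flip ⊙ I_flip.
# Using normalized visibility maps as the per-pixel weights is an assumption.
import numpy as np

def fuse_views(I_origin, I_flip, vis_origin, vis_flip, eps=1e-6):
    """I_*: (H,W,3) renderings of the original/mirrored mesh at the target view;
    vis_*: (H,W) visibility maps rendered from the same meshes."""
    total = vis_origin + vis_flip + eps
    lam = (vis_origin / total)[..., None]          # λ: weight of the original view
    lam_flip = (vis_flip / total)[..., None]       # λ_flip: weight of the mirror view
    return lam * I_origin + lam_flip * I_flip
```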
Step S250, inputting a preset number of virtual views into corresponding encoders to extract image features. In this embodiment, the preset number is five, i.e., the five virtual views are input into five encoders respectively, and the extracted image features are unfolded into UV space to obtain the UV feature group of the multi-view face. The UV feature group is concatenated and input into a decoder; the offset ΔS between the real three-dimensional face S* and the roughly reconstructed three-dimensional face S_init is obtained by deconvolution, and the offset ΔS is added to S_init to obtain the reconstructed three-dimensional shape.
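A hedged PyTorch sketch of such a many-to-one funnel network follows; the channel widths, the two-layer encoders, and the choice to feed views already unfolded into UV space are illustrative assumptions, not the patent's exact architecture.

```python
# Minimal many-to-one funnel network (step S250): one encoder per view,
# concatenated UV features, and a deconvolution decoder regressing the
# UV-space offset ΔS that is added to the coarse shape S_init.
import torch
import torch.nn as nn

class FunnelNet(nn.Module):
    def __init__(self, n_views=5, feat=64):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU())
            for _ in range(n_views))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(n_views * feat, feat, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(feat, 3, 4, stride=2, padding=1))  # ΔS map

    def forward(self, uv_views, uv_init):
        # uv_views: list of n_views (B,3,H,W) view images unfolded to UV space;
        # uv_init: (B,3,H,W) UV position map of the coarse shape S_init.
        feats = [enc(v) for enc, v in zip(self.encoders, uv_views)]
        delta = self.decoder(torch.cat(feats, dim=1))   # offset ΔS
        return uv_init + delta                          # S_init + ΔS
```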
Step S260, setting the texture value of each point of the reconstructed three-dimensional face shape to (255, 255, 255), placing it under direct frontal light, and calculating the RGB value of each point under this illumination; rendering it with a differentiable renderer according to the second preset viewing angles (0°, 0°), (0°, 90°), (0°, -90°), (30°, 0°), and (-30°, 0°); and repeating this rendering with the real three-dimensional face S* as the target face, then calculating the error between the rendered images to obtain the visual-similarity evaluation index L_psd, where psd stands for plaster sculpture descriptor (Plaster Sculpture Descriptor):

$$L_{psd}=\sum_{v}\Bigl\|R\bigl(S_{init}+\Delta S,\;T_w,\;I_{orth},\;v\bigr)-R\bigl(S^{*},\;T_w,\;I_{orth},\;v\bigr)\Bigr\|$$

where v denotes the corresponding view-angle index; R denotes the rendering function, whose inputs are a three-dimensional shape, a texture, and an illumination, rendered at view v; S* is the real three-dimensional face; S_init is the roughly reconstructed three-dimensional face; ΔS is the offset; T_w is the all-white texture; and I_orth represents parallel light in the direction (0, 0, 1).
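Given any differentiable renderer, L_psd reduces to a per-view image difference. The sketch below assumes a callable render(shape, view) that already bakes in the all-white texture T_w and the frontal parallel light I_orth; both the interface and the L1 image error are assumptions, not a real library API or the patent's exact norm.

```python
# Hedged sketch of the vision-consistent loss L_psd of step S260.
import torch

def psd_loss(render, S_init, delta_S, S_star, views):
    """render: callable (shape, view) -> (B,3,H,W) image, assumed to render the
    shape with an all-white texture under frontal parallel light."""
    loss = torch.zeros(())
    for v in views:                         # the five fixed viewing angles
        pred = render(S_init + delta_S, v)  # plaster rendering of refined shape
        target = render(S_star, v)          # plaster rendering of real shape
        loss = loss + (pred - target).abs().mean()
    return loss
```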
Further, the scene self-adaptive fine three-dimensional face reconstruction method also comprises step S300, the scene self-adaptation method for three-dimensional reconstruction: pseudo labels are generated for unlabeled images in the target scene to fine-tune the many-to-one funnel network, improving the adaptability of the network to the target scene.
The fine-tuning process (i.e., the self-adaptation method) is specifically: step S310, grouping the data in the training set by attribute; based on the distributions of the two attributes of age and gender, the data are divided into several parts and trained group by group to obtain attribute-specific three-dimensional shape models. The age-associated model group comprises five parts: infants, teenagers, young adults, the middle-aged, and the elderly; the gender-associated model group comprises two parts: male θ_male and female θ_female.
Step S320, inputting an unlabeled image of the target scene, estimating its age and gender, and matching the corresponding models from the attribute-specific model groups to obtain two model parameters θ_age and θ_gender;
step S330, extracting the key points F and segmentation masks M of the unlabeled images in the target scene;
step S340, performing targeted augmentation on the input sample image, including color perturbation, Gaussian blur, pose perturbation, random occlusion, and the like, to obtain a series of augmented images;
step S350, based on the results of steps S330 and S340, fine-tuning the two model parameters θ_age and θ_gender with weak-supervision and self-supervision constraints, and generating pseudo labels for the unlabeled images of the target scene; the total loss L_self is:

$$L_{self}=\alpha\cdot L_{weak}+\beta\cdot L_{self\_shp}+\gamma\cdot L_{self\_pose}$$

$$L_{weak}=\frac{1}{N_{2d}}\sum_{i=1}^{N_{2d}}\bigl\|F_i-V_{2d}^{\,i}\bigr\|+\sum_{s\in V_{seg}}D(s,M)$$

$$L_{self\_shp}=\bigl\|\mathrm{Net}_{shp}(x)-\mathrm{Net}_{shp}(\mathrm{Aug}(x))\bigr\|$$

$$L_{self\_pose}=\bigl\|\mathrm{MLP}\bigl(f(x),\,f(\mathrm{Aug}(x))\bigr)-\Delta p_x\bigr\|$$

where α, β, and γ are the weighting coefficients of the three losses; L_weak is the weak-supervision constraint, L_self_shp is the self-supervised shape-information constraint, and L_self_pose is the self-supervised pose constraint; F denotes the two-dimensional face key points, i is the key-point index, and V_2d denotes the points estimated by the model that are semantically consistent with F after projection into image space; N_2d is the number of two-dimensional face key points; M is the two-dimensional face segmentation mask, V_seg denotes the boundary points of the pseudo three-dimensional face label after projection into two dimensions, semantically consistent with the segmentation, and D is the Euclidean distance from a point to the segmentation-mask region; Net_shp(·) is the shape reconstruction network, x is the input image, and Aug is a random image augmentation operation; the inputs of MLP(·,·) are the top-level features extracted by the three-dimensional reconstruction network from the original image and the augmented image, and Δp_x is the pose offset from the original image to the augmented image. Finally, the pseudo-label model group {θ_age, θ_gender} is obtained.
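Assembling these terms, the fine-tuning reduces to a weighted sum of the three constraints. In the Python sketch below, net, mlp, aug, and weak_fn are stand-ins for the patent's Net_shp, MLP, Aug, and L_weak; only the combination itself is taken from the text.

```python
# Hedged sketch of the step S350 objective
# L_self = α·L_weak + β·L_self_shp + γ·L_self_pose.
import torch

def adaptation_loss(net, mlp, aug, weak_fn, x, F, M,
                    alpha=1.0, beta=1.0, gamma=1.0):
    x_aug, dp = aug(x)                      # augmented image and pose offset Δp_x
    s, feat = net(x)                        # shape prediction + top-level feature
    s_aug, feat_aug = net(x_aug)
    L_weak = weak_fn(s, F, M)               # keypoint/segmentation consistency
    L_shp = torch.norm(s - s_aug)           # shape invariant to augmentation
    L_pose = torch.norm(mlp(feat, feat_aug) - dp)  # predicted vs. applied pose shift
    return alpha * L_weak + beta * L_shp + gamma * L_pose
```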
Step S360, generating a plurality of pseudo three-dimensional labels for the target-scene data based on the generated pseudo-label models, obtaining updated pseudo three-dimensional labels by a mean-fusion strategy (sketched below), and combining the updated pseudo three-dimensional labels with the source data to retrain the many-to-one funnel network, so that the model maintains its performance on the source scene while improving generalization to the target scene.
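The mean-fusion step itself is simple; in this sketch the per-vertex average follows the text, while the stacking convention is an assumption.

```python
# Hedged sketch of step S360's mean fusion of pseudo labels: average the
# per-vertex predictions of the attribute-specific models θ_age, θ_gender, ...
import numpy as np

def fuse_pseudo_labels(predictions):
    """predictions: list of (N,3) three-dimensional shapes from the pseudo-label
    model group; returns the fused (N,3) pseudo three-dimensional label."""
    return np.stack(predictions, axis=0).mean(axis=0)
```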
The first aspect of the invention thus discloses a scene self-adaptive fine three-dimensional face reconstruction method comprising a high-fidelity virtual three-dimensional face data generation method, a single-view fine three-dimensional face reconstruction method, and a scene self-adaptation method for three-dimensional reconstruction, all of which are required for fine three-dimensional reconstruction in unconstrained scenes. In the high-fidelity virtual three-dimensional face data generation method, during training-data construction, the face shapes of the training set are augmented based on a three-dimensional morphable model and a graphical imaging model, obtaining large-scale three-dimensional face data and their corresponding images. Based on the generated high-fidelity training set, the invention designs a single-view fine three-dimensional face reconstruction method based on a deep neural network: a three-dimensional morphable model is first fitted to the image as the initial shape; virtual multi-view generation is performed on this basis; the views are input into a many-to-one funnel network that learns the residual between the initial shape and the target shape; and finally optimization through a vision-consistent loss function lets the network concentrate on the fine structures of the three-dimensional face, improving the reconstructed visual effect. On this basis, the scene self-adaptation method generates pseudo labels for unlabeled images in the target scene to fine-tune the many-to-one funnel network, improving the adaptability of the network to the target scene. Aiming at the strong model-like appearance and poor generalization of the reconstruction results of existing methods, the invention improves three aspects, namely training-data construction, model design, and scene self-adaptation, provides a network structure and a corresponding optimization method for fine shape reconstruction, and designs a scene self-adaptation method for three-dimensional face reconstruction, realizing fine shape reconstruction and improving the robustness of the model in unconstrained scenes.
Referring to fig. 5, a second aspect of the present invention provides a scene-adaptive fine three-dimensional face reconstruction system, comprising: the system comprises an acquisition module 100, an augmentation module 200, a fitting module 300, a generation module 400 and an optimization module 500;
The acquisition module is configured to acquire a two-dimensional face image to be reconstructed as an input sample image;
the augmentation module is configured to augment an input sample image based on the 3DMM and the graphical imaging model to obtain a plurality of high-fidelity training data;
The fitting module is configured to fit, based on the acquired high-fidelity training data, a three-dimensional morphable model to the image corresponding to the three-dimensional face data to serve as the initial shape;
the generating module is configured to perform virtual multi-view generation based on the image corresponding to the three-dimensional face data and the initial shape to obtain multi-view images;
the optimizing module is configured to input the multi-view images into a many-to-one funnel network, and optimize the multi-view images through a vision-consistent loss function so as to obtain a refined three-dimensional face shape.
An electronic device of a third embodiment of the present invention includes: at least one processor; and a memory communicatively coupled to at least one of the processors; the memory stores instructions executable by the processor for execution by the processor to implement the scene-adaptive fine three-dimensional face reconstruction method of any one of the above.
A computer-readable storage medium of a fourth embodiment of the present invention stores computer instructions for execution by the computer to implement the scene-adaptive fine three-dimensional face reconstruction method of any one of the above.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the storage device and the processing device described above and the related description may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Referring now to FIG. 6, there is shown a schematic diagram of a computer system of a server for implementing embodiments of the methods, systems, and apparatus of the present application. The server illustrated in fig. 6 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 6, the computer system includes a central processing unit (CPU, Central Processing Unit) 601 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM, Read-Only Memory) 602 or a program loaded from a storage portion 608 into a random access memory (RAM, Random Access Memory) 603. In the RAM 603, various programs and data required for system operation are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, mouse, etc.; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), and a speaker; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 609 and/or installed from the removable medium 611. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 601. The computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like, are used for distinguishing between similar objects and not for describing a particular sequential or chronological order.
It should be noted that, in the description of the present invention, terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like, which indicate directions or positional relationships, are based on the directions or positional relationships shown in the drawings, are merely for convenience of description, and do not indicate or imply that the apparatus or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Furthermore, it should be noted that, in the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, article, or apparatus/means that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, article, or apparatus/means.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

Claims (9)

1. A scene self-adaptive fine three-dimensional face reconstruction method is characterized by comprising the following steps:
step S100, augmenting the face shapes of the training set based on a 3DMM and a graphical imaging model to obtain a large amount of high-fidelity training data; the high-fidelity training data comprise three-dimensional face data and their corresponding images;
step S200, based on the acquired high-fidelity training data, fitting a three-dimensional morphable model to the image corresponding to the three-dimensional face data as an initial shape, and performing virtual multi-view generation based on the image and the initial shape to obtain multi-view images;
inputting the multi-view images into a many-to-one funnel network, and optimizing the multi-view images through a vision-consistent loss function to obtain a refined three-dimensional face shape; the augmentation in step S100 specifically includes the steps of:
step S110, completing the incomplete depth channel of the input sample image by gridding, and converting the image into a first three-dimensional grid;
the depth channel comprises a depth channel of the face area and a depth channel of the background area; the values of the depth channel of the face area are acquired as follows: the input sample image is fitted based on the 3DMM to obtain an initial three-dimensional face shape, and a non-rigid registration algorithm is then applied to the initial shape to obtain the accurate face depth;
the values of the depth channel of the background area are obtained based on the original depth channel values and a smoothness constraint; the sample image is an RGB-D image containing a human face;
step S120, randomly selecting eyes, a nose, a mouth, and cheeks from the training set and combining them to obtain a first three-dimensional face shape;
Step S130, replacing the face area in the first three-dimensional grid in step S110 with the first three-dimensional face shape, and adjusting the position of a background anchor point through the smoothness constraint of the background area to obtain a second three-dimensional grid as a three-dimensional structure after shape migration;
step S140, performing shadow migration on the second three-dimensional grid based on a graphical imaging model to obtain a third three-dimensional grid;
and step S150, rendering the third three-dimensional grid to obtain the shape-migrated image, thereby completing the high-fidelity augmentation of the three-dimensional face shape.
2. The scene adaptive fine three-dimensional face reconstruction method according to claim 1, wherein the method for acquiring the values of the depth channels of the background area in step S110 is as follows:
uniformly paving a plurality of anchor points in a background area, constructing a triangular network based on the paved anchor points through a Delaunay triangulation algorithm, and calculating the depth of each anchor point through a preset first method; completing the depth channel of the background area based on the depth of each anchor point;
the preset first method solves the following least-squares problem:

$$\min_{\{d_i\}}\ \sum_{i}\mathrm{Mask}(x_i,y_i)\,\bigl(d_i-\mathrm{Depth}(x_i,y_i)\bigr)^2\;+\;\sum_{i,j}\mathrm{Connect}(i,j)\,\bigl(d_i-d_j\bigr)^2$$

where d_i represents the depth of the i-th anchor point, Mask(x_i, y_i) indicates whether the depth channel of the i-th anchor point has a value, Depth(x_i, y_i) is the value of the depth channel of the RGB-D image at the position of the i-th anchor point, and Connect(i, j) indicates whether the i-th and j-th anchor points are connected by an edge of the triangular network.
3. The scene adaptive fine three-dimensional face reconstruction method according to claim 1, wherein the adjustment of the positions of the background anchor points in step S130 solves:

$$\min_{\{p_i\}}\ \sum_{i}\mathrm{FaceContour}(i)\,\bigl\|p_i-p_i^{t}\bigr\|^2\;+\;\sum_{i,j}\mathrm{Connect}(i,j)\,\bigl\|(p_i-p_j)-(p_i^{s}-p_j^{s})\bigr\|^2$$

where p_i^s is the location of the i-th anchor point on the source image, p_i^t is the target location of the i-th anchor point, FaceContour(i) indicates whether the i-th anchor point is located on the facial contour, and Connect(i, j) indicates whether the i-th and j-th anchor points are connected in the background grid.
4. The scene adaptive fine three-dimensional face reconstruction method according to claim 1, wherein the shadow migration in step S140 is specifically:
based on a preset imaging formula, the normal vectors and specular reflection values are replaced by the values after shape migration, while the remaining imaging parameters keep the values of the source image, to obtain the target face after shadow migration;
the preset imaging formula is:

$$C_i=\mathrm{Amb}\cdot T_i+\mathrm{Dir}\cdot T_i\,\langle \mathbf{n}_i,\mathbf{l}\rangle+k_s\,\langle \mathbf{ve},\mathbf{r}_i\rangle^{\nu},\qquad \mathbf{r}_i=2\,\langle \mathbf{n}_i,\mathbf{l}\rangle\,\mathbf{n}_i-\mathbf{l}$$

where C_i is the color value of the i-th point of the three-dimensional shape, T_i is the texture value of the i-th point, Amb is the ambient light, the diagonal matrix Dir represents parallel light in the l direction, n_i is the normal direction of the i-th point, k_s is the specular reflection coefficient, ve is the viewing direction, ν is the angle-distribution parameter controlling the specular reflection, and r_i is the direction of the maximum specular reflection value.
5. The scene adaptive fine three-dimensional face reconstruction method according to claim 1, wherein the step S200 specifically comprises the steps of:
step S210, fitting the input sample image based on the 3DMM to obtain a roughly reconstructed three-dimensional face; gridding the input sample image based on the roughly reconstructed three-dimensional face to obtain a fourth three-dimensional grid; the depth channel values of the face region are taken from the roughly reconstructed three-dimensional face, and the depth channel values of the background region are set to the mean depth of the roughly reconstructed three-dimensional face;
step S220, mirror-image overturning is carried out on the input sample image, the step S210 is repeated to obtain a three-dimensional grid of the image after mirror image, the three-dimensional grid is used as a fifth three-dimensional grid, and the fifth three-dimensional grid is combined with the fourth three-dimensional grid to obtain a sixth three-dimensional grid with complete textures;
Step S230, rendering the sixth three-dimensional mesh shape according to the first preset viewing angle, and calculating the visibility of each triangle mesh to obtain a visibility graph:
step S240, a virtual multi-view image is obtained based on the visibility graph and the rendering result;
step S250, respectively inputting a preset number of virtual views into corresponding encoders to extract image features, and expanding the image features to a UV space to obtain a UV feature group of the multi-view face;
the UV feature groups are concatenated and input into a decoder; the offset between the real three-dimensional face and the roughly reconstructed three-dimensional face is obtained by deconvolution, and the offset is added to the roughly reconstructed three-dimensional face to obtain the reconstructed three-dimensional shape;
step S260, setting the texture value of each point of the reconstructed three-dimensional face to (255, 255, 255), placing it under frontal parallel light, and calculating the RGB value of each point under this illumination; rendering the image from a second preset viewing angle with a differentiable renderer; repeating the rendering with the real three-dimensional face S* as the target face, and obtaining the visual-similarity evaluation index L_psd based on the error between the rendered images;
$$L_{psd} = \sum_{v}\big\|\,R(S_{init}+\Delta S,\ T_w,\ I_{orth};\ v)-R(S^{*},\ T_w,\ I_{orth};\ v)\,\big\|$$

wherein $v$ denotes the corresponding viewing-angle index; $R(\cdot,\cdot,\cdot)$ denotes the rendering function, whose inputs are the three-dimensional shape, the texture and the illumination; $S^{*}$ is the real three-dimensional face; $S_{init}$ is the roughly reconstructed three-dimensional face; $\Delta S$ is the offset; $T_w$ is the all-white texture; and $I_{orth}$ denotes the parallel light in the direction $(0, 0, 1)$.
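Two illustrative sketches for this claim. First, the per-facet visibility of step S230: as a simplification we only test whether each triangle faces the camera (a full implementation would also resolve occlusions, e.g. with a z-buffer); the helper name is ours.

```python
import numpy as np

def triangle_visibility(vertices, faces, view_dir=np.array([0.0, 0.0, 1.0])):
    """vertices: (N, 3); faces: (M, 3) vertex indices. Returns (M,) bool,
    True where a facet faces the viewing direction (back-face test only)."""
    v0, v1, v2 = (vertices[faces[:, k]] for k in range(3))
    n = np.cross(v1 - v0, v2 - v0)                        # facet normals
    n /= np.linalg.norm(n, axis=1, keepdims=True) + 1e-12
    return (n @ view_dir) > 0.0
```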
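Second, the L_psd index itself, assuming a renderer callable `render(shape, texture, light, view)` stands in for $R(\cdot,\cdot,\cdot;v)$; this mirrors the sum-over-views rendering error defined above.

```python
import numpy as np

def l_psd(render, s_init, delta_s, s_star, t_white, i_orth, views):
    """Sum over views of the image error between the refined shape
    (S_init + dS) and the real face S*, both rendered with the all-white
    texture T_w under the frontal parallel light I_orth."""
    total = 0.0
    for view in views:
        pred = render(s_init + delta_s, t_white, i_orth, view)
        true = render(s_star, t_white, i_orth, view)
        total += np.abs(pred - true).sum()   # per-view rendering error
    return total
```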
6. The scene adaptive fine three-dimensional face reconstruction method according to claim 1, further comprising a step S300 of fine-tuning the many-to-one funnel network;
The fine-tuning process is specifically as follows:
step S310, grouping the data in the training set by attribute: dividing the data into a plurality of parts based on the distributions of the two attributes of age and gender, and training on each group to obtain dedicated three-dimensional shape models for the two attributes;
wherein the age-associated model group comprises five models, for infants, teenagers, young adults, middle-aged people and elderly people; the gender-associated model group comprises two models, θmale for males and θfemale for females;
step S320, inputting an unlabeled image of the target scene, estimating its age and gender, and matching the corresponding models from the model groups of the two attributes to obtain two model parameters θage and θgender;
step S330, extracting the key points and segmentation masks of the unlabeled images in the target scene;
step S340, performing targeted augmentation on the input sample image to obtain augmented images;
step S350, based on the results of step S330 and step S340, fine-tuning the two model parameters θage and θgender using weakly-supervised constraints and self-supervised constraints to obtain a pseudo-label model;
step S360, generating a plurality of pseudo three-dimensional labels for the target-scene data based on the pseudo-label model, obtaining updated pseudo three-dimensional labels by a mean-fusion strategy, and combining the updated pseudo three-dimensional labels with the source data to retrain the many-to-one funnel network.
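As a concrete rendering of steps S320 and S360 (entirely hypothetical; all helper names are ours), the adaptation loop picks the attribute-matched models for each unlabeled image and mean-fuses their predictions into one pseudo three-dimensional label:

```python
import numpy as np

def make_pseudo_label(image, age_models, gender_models,
                      estimate_age_group, estimate_gender, predict_shape):
    """age_models / gender_models: dicts of fine-tuned model parameters
    (five age groups, two genders); the estimators and predict_shape are
    stand-ins for the attribute classifiers and the shape regressor."""
    theta_age = age_models[estimate_age_group(image)]     # e.g. "elderly"
    theta_gender = gender_models[estimate_gender(image)]  # "male" / "female"
    shapes = [predict_shape(theta, image)
              for theta in (theta_age, theta_gender)]
    return np.mean(shapes, axis=0)                        # mean-fusion strategy
```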
7. A scene-adaptive fine three-dimensional face reconstruction system, comprising: the device comprises an acquisition module, an augmentation module, a fitting module, a generation module and an optimization module;
The acquisition module is configured to acquire a two-dimensional face image to be reconstructed as an input sample image;
the augmentation module is configured to augment an input sample image based on the 3DMM and the graphical imaging model to obtain a plurality of high-fidelity training data;
The fitting module is configured to fit a three-dimensional morphable model to the image corresponding to the three-dimensional face data, based on the acquired high-fidelity training data, to obtain an initial shape;
the generating module is configured to generate virtual multiple views based on the image corresponding to the three-dimensional face data and the initial shape, so as to obtain multi-view images;
the optimizing module is configured to input the multi-view images into a many-to-one funnel network and optimize them through a visual-consistency loss function, so as to obtain a refined three-dimensional face shape;
The augmentation module comprises:
completing the incomplete depth channel in the input sample image and converting the image into a first three-dimensional mesh by meshing;
wherein the depth channel comprises the depth channel of the face region and the depth channel of the background region, and the values of the depth channel of the face region are acquired by: fitting the input sample image based on the 3DMM to obtain an initial three-dimensional face shape, and deriving the face-region depth from the initial three-dimensional face shape by a non-rigid registration algorithm;
the values of the depth channel of the background region are acquired based on the original depth channel values and a smoothness constraint; the sample image is an RGB-D image containing a human face;
randomly selecting eyes, a nose, a mouth and cheeks from the training set and merging them to obtain a first three-dimensional face shape;
replacing the face region in the first three-dimensional mesh with the first three-dimensional face shape, and adjusting the positions of the background anchor points through the smoothness constraint of the background region to obtain a second three-dimensional mesh as the three-dimensional structure after shape migration;
performing shadow migration on the second three-dimensional mesh based on the graphical imaging model to obtain a third three-dimensional mesh;
and rendering the third three-dimensional mesh to obtain a shape-migrated image, so as to complete the high-fidelity three-dimensional face shape augmentation.
8. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the processor to implement the scene-adaptive fine three-dimensional face reconstruction method of any one of claims 1-6.
9. A computer readable storage medium storing computer instructions for execution by the computer to implement the scene-adaptive fine three-dimensional face reconstruction method of any one of claims 1-6.
CN202110601213.XA 2021-05-31 2021-05-31 Scene self-adaptive fine three-dimensional face reconstruction method, system and electronic equipment Active CN113269862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110601213.XA CN113269862B (en) 2021-05-31 2021-05-31 Scene self-adaptive fine three-dimensional face reconstruction method, system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110601213.XA CN113269862B (en) 2021-05-31 2021-05-31 Scene self-adaptive fine three-dimensional face reconstruction method, system and electronic equipment

Publications (2)

Publication Number Publication Date
CN113269862A CN113269862A (en) 2021-08-17
CN113269862B true CN113269862B (en) 2024-06-21

Family

ID=77233658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110601213.XA Active CN113269862B (en) 2021-05-31 2021-05-31 Scene self-adaptive fine three-dimensional face reconstruction method, system and electronic equipment

Country Status (1)

Country Link
CN (1) CN113269862B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049442B (en) * 2021-11-19 2024-07-23 北京航空航天大学 Calculation method of line of sight of 3D face
CN114238904B (en) * 2021-12-08 2023-02-07 马上消费金融股份有限公司 Identity recognition method, and training method and device of dual-channel hyper-resolution model
CN114399424B (en) * 2021-12-23 2025-01-07 北京达佳互联信息技术有限公司 Model training methods and related equipment
CN114332330A (en) * 2021-12-24 2022-04-12 珠海豹趣科技有限公司 A flow special effect production method, device, electronic device and medium
CN114119849B (en) * 2022-01-24 2022-06-24 阿里巴巴(中国)有限公司 Three-dimensional scene rendering method, device and storage medium
CN114549728B (en) * 2022-03-25 2025-01-07 北京百度网讯科技有限公司 Image processing model training method, image processing method, device and medium
CN114529679B (en) * 2022-04-19 2022-09-16 清华大学 Method and device for generating computed holographic field based on nerve radiation field
CN114972584B (en) * 2022-06-22 2023-03-24 猫小兜动漫影视(深圳)有限公司 Method, system, equipment and product for constructing three-dimensional cartoon multi-dimensional model
CN115410032B (en) * 2022-07-26 2025-05-13 北京工业大学 OCTA image classification structure training method based on self-supervised learning
CN115457097A (en) * 2022-08-22 2022-12-09 杭州欣禾圣世科技有限公司 Face reconstruction method, system, device and storage medium based on generated images
CN115393514A (en) * 2022-08-26 2022-11-25 北京百度网讯科技有限公司 Training method of three-dimensional reconstruction model, three-dimensional reconstruction method, device and equipment
CN115393486B (en) * 2022-10-27 2023-03-24 科大讯飞股份有限公司 Method, device and equipment for generating virtual image and storage medium
CN115564642B (en) * 2022-12-05 2023-03-21 腾讯科技(深圳)有限公司 Image conversion method, image conversion device, electronic apparatus, storage medium, and program product
CN116109778B (en) * 2023-03-02 2025-07-22 南京大学 Face three-dimensional reconstruction method based on deep learning, computer equipment and medium
CN116433812B (en) * 2023-06-08 2023-08-25 海马云(天津)信息技术有限公司 Method and device for generating virtual characters using 2D face pictures
CN119180744A (en) * 2023-06-21 2024-12-24 中兴通讯股份有限公司 Multi-view image data generation method and device, terminal equipment and storage medium
CN117953224B (en) * 2024-03-27 2024-07-05 暗物智能科技(广州)有限公司 Open vocabulary 3D panorama segmentation method and system
CN119850410B (en) * 2024-12-03 2025-10-03 同济大学 A 3D reconstruction method for objects in agricultural machinery scenes based on self-evolution mechanism
CN120236005A (en) * 2025-02-18 2025-07-01 华南师范大学 Three-dimensional head portrait model construction method, system, device and storage medium
CN119672276B (en) * 2025-02-21 2025-07-08 苏州元脑智能科技有限公司 Three-dimensional model editing method, device, storage medium and program product
CN119723251B (en) * 2025-02-26 2025-07-01 天津大学 Antagonistic sample generation method based on automatic driving multi-sensor fusion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619676A (en) * 2019-09-18 2019-12-27 东北大学 End-to-end three-dimensional face reconstruction method based on neural network
CN111832517A (en) * 2020-07-22 2020-10-27 福建帝视信息科技有限公司 Low-resolution face keypoint detection method based on gated convolution

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10335045B2 (en) * 2016-06-24 2019-07-02 Universita Degli Studi Di Trento Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions
CN107122705B (en) * 2017-03-17 2020-05-19 中国科学院自动化研究所 Face key point detection method based on three-dimensional face model
CN108510573B (en) * 2018-04-03 2021-07-30 南京大学 A method for reconstruction of multi-view face 3D model based on deep learning
CN110516642A (en) * 2019-08-30 2019-11-29 电子科技大学 A lightweight face 3D key point detection method and system
CN111882643A (en) * 2020-08-10 2020-11-03 网易(杭州)网络有限公司 Three-dimensional face construction method and device and electronic equipment
CN112002014B (en) * 2020-08-31 2023-12-15 中国科学院自动化研究所 Fine structure-oriented three-dimensional face reconstruction method, system and device
CN112819947B (en) * 2021-02-03 2025-02-11 Oppo广东移动通信有限公司 Three-dimensional face reconstruction method, device, electronic device and storage medium
CN112767468B (en) * 2021-02-05 2023-11-03 中国科学院深圳先进技术研究院 Self-supervised 3D reconstruction method and system based on collaborative segmentation and data enhancement
CN114332415B (en) * 2022-03-09 2022-07-29 南方电网数字电网研究院有限公司 Three-dimensional reconstruction method and device of power transmission line corridor based on multi-view technology

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619676A (en) * 2019-09-18 2019-12-27 东北大学 End-to-end three-dimensional face reconstruction method based on neural network
CN111832517A (en) * 2020-07-22 2020-10-27 福建帝视信息科技有限公司 Low-resolution face keypoint detection method based on gated convolution

Also Published As

Publication number Publication date
CN113269862A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN113269862B (en) Scene self-adaptive fine three-dimensional face reconstruction method, system and electronic equipment
US11538229B2 (en) Image processing method and apparatus, electronic device, and computer-readable storage medium
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
US11823327B2 (en) Method for rendering relighted 3D portrait of person and computing device for the same
US9317970B2 (en) Coupled reconstruction of hair and skin
WO2021174939A1 (en) Facial image acquisition method and system
US9792725B2 (en) Method for image and video virtual hairstyle modeling
US9679192B2 (en) 3-dimensional portrait reconstruction from a single photo
WO2022001236A1 (en) Three-dimensional model generation method and apparatus, and computer device and storage medium
CN111445582A (en) A 3D reconstruction method of single image face based on illumination prior
CN113111861A (en) Face texture feature extraction method, 3D face reconstruction method, device and storage medium
US8670606B2 (en) System and method for calculating an optimization for a facial reconstruction based on photometric and surface consistency
US10169891B2 (en) Producing three-dimensional representation based on images of a person
CN118071968B (en) Intelligent interaction deep display method and system based on AR technology
CN111462030A (en) Multi-image fused stereoscopic set vision new angle construction drawing method
CN111275824A (en) Surface reconstruction for interactive augmented reality
CN118736092A (en) A method and system for rendering virtual human at any viewing angle based on three-dimensional Gaussian splashing
US20250157114A1 (en) Animatable character generation using 3d representations
US10803677B2 (en) Method and system of automated facial morphing for eyebrow hair and face color detection
CN113989434A (en) Human body three-dimensional reconstruction method and device
JP2025107617A (en) Image processing device, image processing method and program
KR102559691B1 (en) Method and device for reconstructing neural rendering-based geometric color integrated 3D mesh
CN119991937A (en) A single-view 3D human body reconstruction method based on Gaussian surface elements
CN114913284A (en) Three-dimensional face reconstruction model training method, device and computer equipment
CN110751026B (en) Video processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant