
WO2017177363A1 - Methods and apparatuses for face hallucination - Google Patents


Info

Publication number
WO2017177363A1
WO2017177363A1 (PCT/CN2016/078960)
Authority
WO
WIPO (PCT)
Prior art keywords
image
hallucination
trained model
dense
network
Prior art date
Legal status
Ceased
Application number
PCT/CN2016/078960
Other languages
French (fr)
Inventor
Xiaoou Tang
Shizhan ZHU
Cheng Li
Chen Change Loy
Current Assignee
Sensetime Group Ltd
Original Assignee
Sensetime Group Ltd
Priority date
Filing date
Publication date
Application filed by Sensetime Group Ltd filed Critical Sensetime Group Ltd
Priority to PCT/CN2016/078960 (WO2017177363A1)
Priority to CN201680084409.3A (CN109313795B)
Publication of WO2017177363A1

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/73Deblurring; Sharpening
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/169Holistic features and representations, i.e. based on the facial image taken as a whole
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Definitions

  • at step S303, the Gauss-Newton steepest descent regression matrix Rk is calculated from the learned average project-out Jacobian Jk.
  • the process 300 may further include steps S304 and S305.
  • at step S304, the deformation coefficients for both the correspondence training set and the hallucination training set are updated.
  • at step S305, the dense correspondence field for each location z is calculated for the hallucination training set. The deformation coefficients and the dense correspondence field obtained at steps S304 and S305 may be used in the later training process.
  • Fig. 4 illustrates a flow chart of the testing process 400 of the estimation unit according to an embodiment of the present application.
  • location for each landmark is obtained from the facial image input to the estimation unit.
  • the input image is the original low-resolution image in the first iteration.
  • in subsequent iterations, the input comprises the image obtained in the (k-1) th iteration, as well as the deformation coefficient obtained in the (k-1) th iteration.
  • the location of each landmark in the input image is obtained.
  • the SIFT feature from around the location of the landmark is obtained.
  • the SIFT feature is the shape-indexed feature described above.
  • the features from all the landmarks are combined as an appearance eigen vector.
  • the deformation coefficients are updated via regression according to the equation (2) .
  • the dense correspondence field for each location z is computed.
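The testing pass of the estimation unit outlined above can be sketched as follows. This is an illustrative sketch only: the patch features stand in for SIFT descriptors, and the zero regressor, bases and template grid are hypothetical stand-ins for the learned and pre-defined parameters.

```python
import numpy as np

# Toy sketch of the estimation unit's testing pass: sample a local
# feature around each landmark, concatenate into an appearance vector,
# regress updated coefficients, then evaluate the dense field.

def local_feature(image, landmark, size=2):
    r, c = landmark
    return image[r:r + size, c:c + size].ravel()  # stand-in for a SIFT descriptor

def estimation_pass(image, landmarks, R, phi_bar, p_prev, z, bases):
    phi = np.concatenate([local_feature(image, lm) for lm in landmarks])
    p = p_prev + R @ (phi - phi_bar)   # regression update of the coefficients
    return z + bases @ p               # dense field W(z) = z + B(z) p

rng = np.random.default_rng(1)
image = rng.random((6, 6))
landmarks = [(0, 0), (3, 3)]
feat_dim = 2 * 4                       # two landmarks, one 2x2 patch each
R = np.zeros((3, feat_dim))            # zero stand-in for the learned regressor
phi_bar = np.zeros(feat_dim)
z = np.zeros((5, 2))                   # toy template coordinates
bases = np.zeros((5, 2, 3))            # toy deformation bases

field = estimation_pass(image, landmarks, R, phi_bar, np.zeros(3), z, bases)
assert field.shape == (5, 2)
```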
  • Fig. 5 illustrates a flow chart of the training process 500 of the hallucination unit according to an embodiment of the present application.
  • images from the training sets are upsampled by bicubic interpolation.
  • the warped high-frequency prior is obtained according to the dense correspondence field.
  • the deep bi-network is trained with three steps: pre-training the common sub-network, pre-training the high-frequency sub-network, and tuning the whole bi-network end-to-end.
  • the bi-network coefficient may be stored in the trained model.
  • a forward pass of the bi-network may be performed to compute the predicted image for both the hallucination training set and the estimation training set.
  • Fig. 6 illustrates a flow chart of the testing process 600 of the hallucination unit according to an embodiment of the present application.
  • an input image I k-1 is upsampled by bicubic interpolation to obtain an upsampled image ⁇ I k-1 .
  • the warped high-frequency prior is obtained according to the dense correspondence field.
  • the learned bi-network coefficient g k is used to forward pass the deep bi-network with the two inputs, i.e., the upsampled image ↑I k-1 and the warped high-frequency prior, so that the image I k is obtained.
  • Algorithm 1 is an exemplary training algorithm for learning the parameters by the apparatus according to an embodiment of the present application.
  • Algorithm 2 is an exemplary testing algorithm for hallucinating a low-resolution face according to an embodiment of the present application.
  • Fig. 7 is a structural schematic diagram of an embodiment of computer equipment provided by the present invention.
  • the computer equipment can be used for implementing the face hallucination method provided in the above embodiments.
  • the computer equipment may differ greatly depending on its configuration or performance, and may include one or more processors (e.g. Central Processing Units, CPUs) 710 and a memory 720.
  • the memory 720 may be a volatile memory or a nonvolatile memory.
  • One or more programs can be stored in the memory 720, and each program may include a series of instruction operations in the computer equipment.
  • the processor 710 can communicate with the memory 720, and execute the series of instruction operations in the memory 720 on the computer equipment.
  • data of one or more operating systems may also be stored in the memory 720.
  • the computer equipment may further include one or more power supplies 730, one or more wired or wireless network interfaces 740, one or more input/output interfaces 750, etc.
  • the method and the device according to the present invention described above may be implemented in hardware or firmware, or implemented as software or computer codes that can be stored in a recording medium (e.g. CD, ROM, RAM, floppy disk, hard disk or magneto-optical disk), or implemented as computer codes that are originally stored in a remote recording medium or a non-transitory machine readable medium and can be downloaded through a network to be stored in a local recording medium, so that the method described herein can be processed by such software stored in the recording medium in a general purpose computer, a dedicated processor or programmable or dedicated hardware (e.g. ASIC or FPGA).
  • the computer, the processor, the microprocessor controller or the programmable hardware include a storage assembly (e.g. RAM, ROM, flash memory, etc.) that can store or receive software or computer codes.
  • when the general purpose computer accesses the codes for implementing the processing shown herein, the execution of the codes converts the general purpose computer into a dedicated computer for executing the processing illustrated herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Methods and apparatus for face hallucination are disclosed. According to an embodiment, a method for face hallucination comprises estimating a dense correspondence field based on a first image and a trained model; executing face hallucination based on the first image, the estimated dense correspondence field and the trained model through a bi-network to obtain a second image; and updating the first image with the second image, wherein the steps of estimating, executing and updating are performed repeatedly until the obtained second image has a desired resolution or the steps of estimating, executing and updating have been repeated for predetermined times.

Description

Methods and Apparatuses for Face Hallucination
Technical Field
The disclosure relates to image processing, in particular, to methods and apparatus for face hallucination.
Background
Increasing attention is devoted to the detection of small facial images with low image resolution, for example, as low as 10 pixels of height. Meanwhile, facial analysis techniques, such as face alignment and verification, have progressed rapidly. However, the performance of most existing techniques would degrade when a low-resolution facial image is given, because such an image naturally carries less information, and images corrupted by down-sampling and blur would interfere with the facial analysis procedure. Face hallucination is a task that improves the resolution of facial images and provides a viable means for improving low-resolution face processing and analysis, e.g., person identification in surveillance videos and facial image enhancement.
Summary
In one aspect of the present application, a method for face hallucination is provided, which comprises: estimating a dense correspondence field based on a first image and a trained model; executing face hallucination based on the first image, the estimated dense correspondence field and the trained model through a bi-network to obtain a second  image; and updating the first image with the second image, wherein the steps of estimating, executing and updating are performed repeatedly until the obtained second image has a desired resolution or the steps of estimating, executing and updating have been repeated for predetermined times.
According to another aspect of the present application, an apparatus for face hallucination is provided, which comprises: an estimating unit configured to estimate a dense correspondence field based on a first image and a trained model; and a hallucination unit configured to execute face hallucination based on the first image, the estimated dense correspondence field and the trained model through a bi-network to obtain a second image; wherein the first image is iteratively updated with the second image, and the estimating unit and the hallucination unit work for a predetermined number of iterations or until the obtained second image has a desired resolution.
In a further aspect of the present application, a device for face hallucination is provided, which comprises a processor and a memory storing computer-readable instructions, wherein, when the instructions are executed by the processor, the processor is operable to: estimate a dense correspondence field based on a first image and a trained model; execute face hallucination based on the first image, the estimated dense correspondence field and the trained model through a bi-network to obtain a second image; and update the first image with the second image, wherein the first image is iteratively updated with the second image for a predetermined number of iterations or until the obtained second image has a desired resolution.
In a further aspect of the present application, a nonvolatile storage medium containing computer-readable instructions is provided, wherein, when the instructions are executed by a processor, the processor is operable to estimate a dense correspondence field based on a first image and a trained model; execute face hallucination based on the first image, the estimated dense correspondence field and the trained model through a bi-network to obtain a second image; and update the first image with the second image, wherein the first image is iteratively updated with the second image for a predetermined number of iterations or until the obtained second image has a desired resolution.
Brief Description of Drawings
Fig. 1 is a flow chart of a method for face hallucination according to an embodiment of the present disclosure.
Fig. 2 illustrates an apparatus for face hallucination according to an embodiment of the present disclosure.
Fig. 3 illustrates a flow chart of the training process of the estimation unit according to an embodiment of the present application.
Fig. 4 illustrates a flow chart of the testing process of the estimation unit according to an embodiment of the present application.
Fig. 5 illustrates a flow chart of the training process of the hallucination unit according to an embodiment of the present application.
Fig. 6 illustrates a flow chart of the testing process of the hallucination unit according to an embodiment of the present application.
Fig. 7 is a structural schematic diagram of an embodiment of computer equipment provided by the present invention.
Detailed Description of Embodiments
According to an embodiment, a method for face hallucination is provided. Fig. 1 is a flow chart of a method 100 for face hallucination according to an embodiment of the present disclosure. According to another embodiment, an apparatus 200 for face hallucination is provided. Fig. 2 illustrates an apparatus 200 for face hallucination according to an embodiment of the present disclosure. As shown in Fig. 2, the apparatus 200 may comprise an estimation unit 201 and a hallucination unit 202.
As shown in Fig. 1, at step S101, a dense correspondence field is estimated by an estimation unit 201 based on an input first image 10 and parameters from a trained model 20. The first image input into the estimation unit may be a facial image with a low resolution. The dense correspondence field indicates the correspondence or mapping relationship of the first image to a warped image and denotes the warping of each pixel from the first image to the warped image. The trained model contains various parameters that may be used for the estimation of the dense correspondence field.
At step S102, face hallucination is executed by the hallucination unit 202 based on the first image 10 and the estimated dense correspondence field to obtain a second image 30. The second image obtained after the face hallucination on the first image usually has a resolution higher than that of the first image. The hallucination unit 202 is a bi-network which comprises a first branch 2021 being a common branch for face hallucination and a second branch 2022 being a high-frequency branch. The processing in the common branch is similar to the face hallucination in the prior art. In the high-frequency branch, the estimated dense correspondence field and parameters from the trained model 20 are further considered in addition to the input image 10. The results obtained from both branches are incorporated through a gate network 2023 to obtain the second image 30.
At step S103, the first image is updated with the second image so that the second image is used as an input to the estimation unit 201. Then, the steps S101 to S103 are performed repeatedly. For example, the steps may be performed repeatedly until the obtained second image has a desired image resolution. Alternatively, the steps may be performed for pre-defined times.
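The iterative scheme of steps S101 to S103 can be sketched as a simple driver loop. The following is an illustrative sketch only; `run_cascade`, `estimate_field` and `hallucinate` are hypothetical stand-ins for the estimation unit 201 and the hallucination unit 202, not the disclosed implementation.

```python
# Illustrative sketch of the cascaded loop of steps S101-S103.

def image_size(image):
    # image is a nested list: a list of pixel rows
    return (len(image), len(image[0]))

def run_cascade(image, model, desired_resolution, max_iterations):
    """Repeat estimate (S101) -> hallucinate (S102) -> update (S103)
    until the output reaches the desired resolution or the iteration
    budget is exhausted."""
    for _ in range(max_iterations):
        field = model["estimate_field"](image)       # step S101
        image = model["hallucinate"](image, field)   # steps S102 and S103
        if min(image_size(image)) >= desired_resolution:
            break
    return image

# Toy stand-ins: the "hallucination" here simply doubles each dimension.
toy_model = {
    "estimate_field": lambda img: None,  # dense field ignored in this toy
    "hallucinate": lambda img, field: [row * 2 for row in img for _ in (0, 1)],
}

result = run_cascade([[1]], toy_model, desired_resolution=4, max_iterations=10)
assert image_size(result) == (4, 4)
```

The stopping condition mirrors the two alternatives stated above: a resolution test and a fixed iteration budget.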
For example, the facial image may be denoted as a matrix I, and each pixel in the image may be denoted as x with coordinates (x, y) . A mean face template for the facial image may be denoted as M, which comprises a plurality of pixels z. The dense correspondence field indicates the mapping from pixels z in the mean face template M to pixels x in the  facial image I, which may be denoted by a warping function W (z) as x=W (z) . It is noted that the pixels in both images are considered in a 2D face region. The warping function W(z) may be determined based on a deformation coefficient p and a deformation base B(z) , which may be denoted as
W (z) =z+B (z) p           (1)
where p denotes the deformation coefficients and B (z) denotes the deformation bases. The bases are pre-defined and shared by all samples.
According to an embodiment, the deformation base B (z) is predefined and shared by all samples, and thus the warping function is actually controlled by the deformation coefficient p for each sample. For an initially input image, p is equal to 0, and thus the warping function W (z) =z, indicating that the dense correspondence field is the mean face template.
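A minimal numeric sketch of the warping function of equation (1), using a toy single-column basis (the actual bases are pre-defined in the trained model); it illustrates that p = 0 yields the identity warp, i.e., the mean face template.

```python
import numpy as np

# Sketch of W(z) = z + B(z) p from equation (1).
# z: (N, 2) template pixel coordinates; B: (N, 2, Np) toy deformation
# bases; p: (Np,) deformation coefficients.

def warp_field(z, B, p):
    # (N, 2, Np) @ (Np,) -> (N, 2): per-pixel displacement added to z
    return z + B @ p

z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.zeros((3, 2, 1))
B[:, 0, 0] = 1.0               # single toy basis: a uniform shift in x
p0 = np.zeros(1)

# With p = 0 the warp is the identity, so the dense correspondence
# field equals the mean face template, as stated above.
assert np.allclose(warp_field(z, B, p0), z)
```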
Taking K iterations (k iterating from 1 to K) as an example, all the notations are appended with the index k to indicate the iteration. A larger k in the notation of Ik, Wk, Bk and Mk indicates a larger resolution, and the same k indicates the same resolution. The whole process starts from I0 and p0, wherein I0 denotes the input low-resolution facial image and p0 is a zero vector representing the deformation coefficients of the mean face template. The final hallucinated facial image output is IK. The deformation coefficient pk, the warping function Wk (z) and the second image Ik are updated in each iteration. For example, the deformation coefficient pk and the warping function Wk (z) are updated by:
pk=pk-1+Rk (φ (↑Ik-1; pk-1) −φ̄) , Wk (z) =z+Bk (z) pk      (2)
wherein fk is a Gauss-Newton descent regressor learned and stored in the trained model for predicting the dense correspondence field coefficients, and is represented in equation (2) by the Gauss-Newton steepest descent regression matrix Rk, which is obtained by training. In equation (2) , φ is the shape-indexed feature that concatenates the local appearance from all L landmarks, and φ̄ is its average over all the training samples.
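The cascaded update of the deformation coefficients described above can be sketched numerically; here `R_k` and `phi_bar` are random stand-ins for the learned regression matrix and the training-set average feature, so only the update structure is illustrative.

```python
import numpy as np

# Toy sketch of one cascaded-regression step: the deformation
# coefficients are refined from the shape-indexed feature phi
# relative to its training-set average phi_bar.

rng = np.random.default_rng(0)
num_coeffs, feat_dim = 4, 16
R_k = 0.01 * rng.standard_normal((num_coeffs, feat_dim))  # stand-in regressor
phi_bar = rng.standard_normal(feat_dim)                   # stand-in average

def update_coefficients(p_prev, phi):
    return p_prev + R_k @ (phi - phi_bar)

# When the observed features already match the training-set average,
# the regressor proposes no change to the coefficients.
p0 = np.zeros(num_coeffs)
assert np.allclose(update_coefficients(p0, phi_bar), p0)
```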
In an embodiment, the dense correspondence field coefficients are estimated based on each pixel in the image. Alternatively, according to another embodiment, the dense correspondence field coefficients are estimated based on landmarks in the image, since using a sparse set of facial landmarks is more robust and accurate under low resolution. Under such circumstances, a landmark base Sk (l) is further considered in the estimation. In particular, two sets of deformation bases, i.e., the deformation base Bk (z) for the dense field and the landmark base Sk (l) for the landmarks, are obtained, where l is the landmark index. The bases for the dense field and the landmarks are one-to-one related, i.e., both Bk (z) and Sk (l) share the same deformation coefficients pk:
Wk (z) =z+Bk (z) pk, xl=zl+Sk (l) pk      (3)
where xl denotes the coordinates of the l-th landmark, and zl denotes its mean location.
For the face hallucination, the common branch conservatively recovers texture details that are only detectable from the low-resolution input, similar to general super-resolution. The high-frequency branch super-resolves faces with the additional high-frequency prior warped by the estimated face correspondence field in the current cascade. Thanks to the guidance of the prior, this branch is capable of recovering and synthesizing texture details that are not revealed in the overly low-resolution input image. A pixel-wise gate network is learned to fuse the results from the two branches.
According to an embodiment, the first image is upscaled and then input to the hallucination unit. In particular, the upscaled image is input to both the common branch and the high-frequency branch. In the common branch, the upscaled image is processed adaptively, for example, under a bicubic interpolation. In the high-frequency branch, the estimated dense correspondence field is further input, and the upscaled image is processed based on the estimated dense correspondence field. The results from both branches are combined in a gate network to obtain the second image. It is noted that the processing in the common branch is not limited to the bicubic interpolation, but may be any suitable process for the face hallucination.
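The fusion of the two branch outputs can be illustrated as a per-pixel convex combination. The sigmoid gate below is an assumption for illustration only; the disclosure specifies a learned pixel-wise gate network without fixing its form.

```python
import numpy as np

# Pixel-wise gate fusing the common branch and the high-frequency
# branch (illustrative; the real gate is a learned network).

def gate_fuse(common_out, highfreq_out, gate_logits):
    g = 1.0 / (1.0 + np.exp(-gate_logits))       # per-pixel weight in (0, 1)
    return g * highfreq_out + (1.0 - g) * common_out

common = np.full((2, 2), 0.2)
high = np.full((2, 2), 0.8)

# Zero logits -> gate 0.5 everywhere -> plain average of the branches.
assert np.allclose(gate_fuse(common, high, np.zeros((2, 2))), 0.5)
```

Large positive logits select the high-frequency branch, large negative logits the common branch, so the learned gate can fall back to conservative recovery where the prior is unreliable.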
For example, for the k-th iteration, the image Ik is obtained by:
Ik=↑Ik-1+gk (↑Ik-1; Wk (z) )    (4)
where gk represents a hallucination bi-network learned and stored in the trained model for face hallucination. The coefficients gk are obtained by training.
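Equation (4) is a residual update: the bi-network predicts a correction that is added to the upscaled input. The following sketch uses nearest-neighbor upscaling standing in for bicubic interpolation and a toy residual function in place of the learned gk.

```python
import numpy as np

# Residual form of equation (4): I_k = up(I_{k-1}) + g_k(up(I_{k-1}); W_k).

def upscale2x(img):
    # nearest-neighbor 2x upscaling (a stand-in for bicubic interpolation)
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def hallucination_step(img_prev, g_k, field=None):
    up = upscale2x(img_prev)
    return up + g_k(up, field)   # upscaled image plus predicted residual

# A zero residual reduces the step to plain upscaling.
zero_residual = lambda up, field: np.zeros_like(up)
out = hallucination_step(np.ones((2, 2)), zero_residual)
assert out.shape == (4, 4) and np.allclose(out, 1.0)
```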
It is noted that both the estimation unit and the hallucination unit may have a testing mode and a training mode. The method 100 shown in Fig. 1 illustrates the working process of the estimation unit and the hallucination unit in the testing mode. When working in the training mode, the estimation unit and the hallucination unit may perform a training process to obtain the parameters required in the testing mode and store them into the trained model. Herein, the estimation unit and the hallucination unit having both a testing mode and a training mode are described as an example. Alternatively, the training process and the testing process may be performed by separate apparatuses or separate units.
In the training process, two training sets, i.e., a hallucination training set and a correspondence field training set, are provided. Each of the two training sets includes a plurality of images, as well as down-sampled versions of each of the plurality of images in various scales. Ground-truth values of the deformation coefficients p are further included in the correspondence field training set. In contrast with the testing process described above, the images input in the training process have high resolution. Fig. 3 illustrates a flow chart of the training process 300 of the estimation unit according to an embodiment of the present application. As shown, at step S301, the dense bases Bk (z) , the landmark bases Sk (l) and the appearance eigen vectors Φk are obtained. These parameters, to be used in the following steps, are predefined. Meanwhile, the dense bases Bk (z) and the landmark bases Sk (l) are stored into the trained model for later use. At step S302, the average project-out Jacobian Jk is learned, for example, by minimizing the following loss:
Jk = argminJ Σi ‖ (φi − φ̄) − J (pi* − pk−1, i) ‖²    (5)

where pi* is the ground-truth deformation coefficient of the i-th training sample, pk−1, i is its estimate from the previous iteration, φ is the shape-indexed feature that concatenates the local appearance from all L landmarks, and φ̄ is its average over all the training samples.
At step S303, the Gauss-Newton steepest descent regression matrix Rk is calculated by:
Rk = (JkᵀJk) ⁻¹Jkᵀ    (6)
That is, the Gauss-Newton steepest descent regression matrix Rk is obtained from the Jacobian Jk via constructing the project-out Hessian Hk = JkᵀJk.
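Assuming the loss above is the standard least-squares Jacobian fit, learning Jk and deriving Rk from the project-out Hessian can be sketched as follows; the function name, the ridge term, and the array shapes are illustrative assumptions, not part of the application:

```python
import numpy as np

def learn_regressor(delta_phi, delta_p, ridge=1e-8):
    """Fit the average Jacobian Jk by least squares and derive the
    Gauss-Newton steepest descent regression matrix Rk.

    delta_phi: (N, D) centred shape-indexed features, phi_i - phi_bar.
    delta_p:   (N, K) coefficient residuals, p*_i - p_{k-1,i}.
    Returns (J, R): J of shape (D, K), R = (J^T J)^{-1} J^T of shape (K, D).
    """
    # Solve min_J ||delta_phi - delta_p @ J.T||^2 row-wise.
    J_T, *_ = np.linalg.lstsq(delta_p, delta_phi, rcond=None)  # (K, D)
    J = J_T.T
    H = J.T @ J + ridge * np.eye(J.shape[1])   # project-out Hessian, regularised
    R = np.linalg.solve(H, J.T)
    return J, R
```

By construction R applied to J Δp recovers Δp (up to the small ridge term), which is what makes the regression update pk = pk−1 + Rk (φ − φ̄) a Gauss-Newton step.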
Optionally, the process 300 may further include steps S304 and S305. At step S304, the deformation coefficients for both the correspondence field training set and the hallucination training set are updated. At step S305, the dense correspondence field for each location z in the hallucination training set is calculated. The deformation coefficients and the dense correspondence fields obtained at steps S304 and S305 may be used in the later training process.
Fig. 4 illustrates a flow chart of the testing process 400 of the estimation unit according to an embodiment of the present application. As shown, at step S401, the location of each landmark is obtained from the facial image input to the estimation unit. In the first iteration, the input image is the original low-resolution image. In each following iteration (for example, the k-th iteration) , the inputs are the image and the deformation coefficient obtained in the (k-1) -th iteration. Based on the landmark bases stored in the trained model, the location of each landmark in the input image is obtained.
At step S402, for each landmark, the SIFT feature around the location of the landmark is extracted; the SIFT feature serves as the shape-indexed feature described above. At step S403, the features from all the landmarks are concatenated into an appearance eigen vector. At step S404, the deformation coefficients are updated via regression according to the equation (2) . At step S405, the dense correspondence field for each location z is computed.
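Steps S401 to S405 can be sketched as one function. The layout of the model dictionary and the feature extractor hook (standing in for the SIFT descriptor) are assumptions for illustration only:

```python
import numpy as np

def estimation_step(image, p_prev, model, extract_feature):
    """One cascade iteration of the estimation unit (steps S401-S405).

    model: dict of per-iteration quantities from the trained model:
      'S'        (L, 2, K) landmark bases, 'mean_lmk' (L, 2) mean locations,
      'B'        (N, 2, K) dense bases,    'coords'   (N, 2) pixel grid,
      'R'        (K, D) regression matrix, 'phi_bar'  (D,) mean feature.
    extract_feature(image, xy) stands in for the SIFT descriptor at xy.
    """
    # S401: landmark locations implied by the current coefficients.
    lmks = model['mean_lmk'] + np.einsum('lij,j->li', model['S'], p_prev)
    # S402-S403: concatenate the local features into one appearance vector.
    phi = np.concatenate([extract_feature(image, xy) for xy in lmks])
    # S404: Gauss-Newton regression update of the deformation coefficients.
    p = p_prev + model['R'] @ (phi - model['phi_bar'])
    # S405: dense correspondence field Wk(z) = z + Bk(z) pk.
    W = model['coords'] + np.einsum('nij,j->ni', model['B'], p)
    return p, W
```

In the full cascade this function would be called once per iteration with that iteration's stored bases and regression matrix.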
Fig. 5 illustrates a flow chart of the training process 500 of the hallucination unit according to an embodiment of the present application. As shown, at step S501, images from the training sets are upsampled by bicubic interpolation. At step S502, the warped high-frequency prior is obtained according to the dense correspondence field. At step S503, the deep bi-network is trained in three steps: pre-training the common sub-network, pre-training the high-frequency sub-network, and tuning the whole bi-network end-to-end. In this step, the bi-network coefficient may be stored in the trained model. Then, at step S504, a forward pass of the bi-network may be performed to compute the predicted image for both the hallucination training set and the correspondence field training set.
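The three-step schedule of step S503 can be sketched as staged optimization. The trainer hook and the epoch counts below are placeholders, not prescribed by the application:

```python
def train_bi_network(common_net, highfreq_net, gate, data, train_fn,
                     epochs=(5, 5, 10)):
    """Three-stage training of the bi-network (step S503):
    1) pre-train the common sub-network alone,
    2) pre-train the high-frequency sub-network alone,
    3) fine-tune the whole bi-network (both branches plus gate) end-to-end.

    train_fn(params, data, epochs) stands in for any gradient-based
    optimiser and returns the updated parameters.
    """
    common_net = train_fn(common_net, data, epochs[0])     # stage 1
    highfreq_net = train_fn(highfreq_net, data, epochs[1])  # stage 2
    whole = {'common': common_net, 'high': highfreq_net, 'gate': gate}
    return train_fn(whole, data, epochs[2])                 # stage 3
```

Staging the training this way lets each branch reach a reasonable solution before the gate learns how to blend them.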
Fig. 6 illustrates a flow chart of the testing process 600 of the hallucination unit according to an embodiment of the present application. As shown, at step S601, an input image Ik-1 is upsampled by bicubic interpolation to obtain an upsampled image ↑ Ik-1. At step S602, the warped high-frequency prior is obtained according to the dense correspondence field. At step S603, the learned bi-network coefficient gk is used in a forward pass of the deep bi-network with the two inputs ↑ Ik-1 and the warped high-frequency prior, so that the image Ik is obtained.
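Steps S601 to S603 can be sketched as a single cascade step implementing Ik=↑ Ik-1+gk (↑ Ik-1; Wk (z) ) ; upsample, warp_prior and bi_network below are stand-ins for bicubic interpolation, the prior warping and the learned bi-network gk:

```python
import numpy as np

def hallucination_step(I_prev, W, upsample, warp_prior, bi_network):
    """One cascade iteration of the hallucination unit (steps S601-S603)."""
    up = upsample(I_prev)                 # S601: bicubic upsampling
    prior = warp_prior(W)                 # S602: warped high-frequency prior
    return up + bi_network(up, prior)     # S603: residual predicted by gk
```

Because the network only predicts a residual on top of the upsampled image, a zero-output network leaves the bicubic result unchanged, which matches the conservative behaviour of the common branch.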
It can be understood from the above description that the two tasks, i.e., the high-level face correspondence estimation and the low-level face hallucination, are complementary and can be alternatingly refined under the guidance of each other through a task-alternating cascaded framework. Experiments have been conducted, and improved results have been obtained.
Exemplary algorithms for training and testing according to the present application are listed below. Algorithm 1 is an exemplary training algorithm for learning the parameters by the apparatus according to an embodiment of the present application. Algorithm 2 is an exemplary testing algorithm for hallucinating a low-resolution face according to an embodiment of the present application.
Fig. 7 is a structural schematic diagram of an embodiment of computer equipment provided by the present invention.
With reference to Fig. 7, the computer equipment can be used for implementing the face hallucination method provided in the above embodiments. Specifically, the computer equipment may vary greatly in configuration and performance, and may include one or more processors (e.g. Central Processing Units, CPU) 710 and a memory 720. The memory 720 may be a volatile memory or a nonvolatile memory. One or more programs can be stored in the memory 720, and each program may include a series of instruction operations in the computer equipment. Further, the processor 710 can communicate with the memory 720, and execute the series of instruction operations in the memory 720 on the computer equipment. Particularly, data of one or more operating systems, e.g. Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc., are further stored in the memory 720. The computer equipment may further include one or more power supplies 730, one or more wired or wireless network interfaces 740, one or more input/output interfaces 750, etc.
The method and the device according to the present invention described above may be implemented in hardware or firmware, or implemented as software or computer codes which can be stored in a recording medium (e.g. CD, ROM, RAM, soft disk, hard disk or magneto-optical disk) , or implemented as computer codes which are originally stored in a remote recording medium or a non-transient machine readable medium and can be downloaded through a network to be stored in a local recording medium, so that the method described herein can be processed by such software stored in the recording medium in a general purpose computer, a dedicated processor or programmable or dedicated hardware (e.g. ASIC or FPGA) . It could be understood that the computer, the processor, the microprocessor controller or the programmable hardware include a storage assembly (e.g. RAM, ROM, flash memory, etc. ) capable of storing or receiving software or computer codes, and when the software or computer codes are accessed and executed by the computer, the processor or the hardware, the processing method described herein is implemented. Moreover, when the general purpose computer accesses the codes for implementing the processing shown herein, the execution of the codes converts the general purpose computer to a dedicated computer for executing the processing illustrated herein.
The foregoing descriptions are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or substitution readily conceivable to those skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Accordingly, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (16)

  1. A method for face hallucination, comprising:
    estimating a dense correspondence field based on a first image and a trained model;
    executing face hallucination based on the first image, the estimated dense correspondence field and the trained model through a bi-network to obtain a second image; and
    updating the first image with the second image,
    wherein the steps of estimating, executing and updating are performed repeatedly until the obtained second image has a desired resolution or the steps of estimating, executing and updating have been repeated a predetermined number of times.
  2. The method according to claim 1, wherein the bi-network comprises a first branch and a second branch, and the step of executing comprises:
    executing face hallucination based on the first image through the first branch to obtain a first result;
    executing face hallucination based on the first image, the estimated dense correspondence field and the trained model through the second branch to obtain a second result; and
    incorporating the first result and the second result to obtain the second image.
  3. The method of claim 1, wherein the trained model stores a dense base, a landmark base, a Gauss-Newton descent regressor for estimating the dense correspondence field,  and a bi-network coefficient for the face hallucination, wherein the dense base and the landmark base are predefined, and the Gauss-Newton descent regressor and the bi-network coefficient are learned by training.
  4. The method of claim 1, wherein the estimated dense correspondence field comprises a deformation coefficient p and a warping function W (z) for mapping a pixel z in a mean face image to a pixel x in the first image, wherein x=W (z) =z+Bp, and B is a predefined dense base.
  5. The method of claim 4, wherein for the repeated steps of estimating, executing and updating, the deformation coefficient p and the warping function W (z) are updated repeatedly.
  6. The method of claim 4, wherein for a k-th iteration, the first image is a (k-1) -th image Ik-1, the warping function is denoted as Wk (z) , and the second image is denoted as Ik and obtained by:
    Ik=↑ Ik-1+gk (↑ Ik-1; Wk (z) )
    wherein ↑ Ik-1 is an upscaled image of the (k-1) -th image Ik-1, and gk is a bi-network coefficient for the k-th iteration obtained from the trained model.
  7. The method of claim 4, wherein for a k-th iteration, the first image is a (k-1) -th image Ik-1, the deformation coefficient denoted as pk and the warping function denoted as Wk (z)  are obtained by:
    pk=pk-1+fk (Ik-1; pk-1)
    Wk (z) =z+Bkpk
    wherein pk-1 is the deformation coefficient obtained in the last iteration, fk is a Gauss-Newton descent regressor obtained from the trained model, and Bk is the predefined dense base for the k-th iteration obtained from the trained model.
  8. An apparatus for face hallucination, comprising:
    an estimation unit configured to estimate a dense correspondence field based on a first image and a trained model; and
    a hallucination unit configured to execute face hallucination based on the first image, the estimated dense correspondence field and the trained model through a bi-network to obtain a second image;
    wherein the first image is iteratively updated with the second image, and the estimation unit and the hallucination unit work for a predetermined number of iterations or until the obtained second image has a desired resolution.
  9. The apparatus of claim 8, wherein the hallucination unit comprises:
    a first branch configured to execute face hallucination based on the first image to obtain a first result;
    a second branch configured to execute face hallucination based on the first image, the estimated dense correspondence field and the trained model to obtain a second result;  and
    a gate network configured to incorporate the first result and the second result to obtain the second image.
  10. The apparatus of claim 8, wherein the trained model stores a dense base, a landmark base, a Gauss-Newton descent regressor for estimating the dense correspondence field, and a bi-network coefficient for the face hallucination, wherein the dense base and the landmark base are predefined, and the Gauss-Newton descent regressor and the bi-network coefficient are learned by training.
  11. The apparatus of claim 8, wherein the estimated dense correspondence field comprises a deformation coefficient p and a warping function W (z) for mapping a pixel z in a mean face image to a pixel x in the first image, wherein x=W (z) =z+Bp, and B is a predefined dense base.
  12. The apparatus of claim 11, wherein for each time of the iterations, the deformation coefficient p and the warping function W (z) are updated repeatedly.
  13. The apparatus of claim 11, wherein for a k-th iteration, the first image is a (k-1) -th image Ik-1, the warping function is denoted as Wk (z) , and the second image obtained by the hallucination unit is denoted as Ik and obtained by:
    Ik=↑ Ik-1+gk (↑ Ik-1; Wk (z) )
    wherein ↑ Ik-1 is an upscaled image of the (k-1) -th image Ik-1, and gk is a bi-network coefficient for the k-th iteration obtained from the trained model.
  14. The apparatus of claim 11, wherein for a k-th iteration, the first image is a (k-1) -th image Ik-1, the deformation coefficient denoted as pk and the warping function denoted as Wk (z) are obtained by the estimation unit according to:
    pk=pk-1+fk (Ik-1; pk-1)
    Wk (z) =z+Bkpk
    wherein pk-1 is the deformation coefficient obtained in the last iteration, fk is a Gauss-Newton descent regressor obtained from the trained model, and Bk is the predefined dense base for the k-th iteration obtained from the trained model.
  15. A device for face hallucination, comprising:
    a processor; and
    a memory storing computer-readable instructions,
    wherein, when the instructions are executed by the processor, the processor is operable to:
    estimate a dense correspondence field based on a first image and a trained model;
    execute face hallucination based on the first image, the estimated dense correspondence field and the trained model through a bi-network to obtain a second image; and
    update the first image with the second image,
    wherein the first image is iteratively updated with the second image for a predetermined number of iterations or until the obtained second image has a desired resolution.
  16. A nonvolatile storage medium containing computer-readable instructions, wherein, when the instructions are executed by a processor, the processor is operable to:
    estimate a dense correspondence field based on a first image and a trained model;
    execute face hallucination based on the first image, the estimated dense correspondence field and the trained model through a bi-network to obtain a second image; and
    update the first image with the second image,
    wherein the first image is iteratively updated with the second image for a predetermined number of iterations or until the obtained second image has a desired resolution.
PCT/CN2016/078960 2016-04-11 2016-04-11 Methods and apparatuses for face hallucination Ceased WO2017177363A1 (en)

Priority Application: PCT/CN2016/078960, filed 2016-04-11 (national-phase application CN201680084409.3A).
Publication: WO2017177363A1, published 2017-10-19.



Also Published As

Publication number Publication date
CN109313795B (en) 2022-03-29
CN109313795A (en) 2019-02-05

