
WO2016119076A1 - A method and a system for face recognition

A method and a system for face recognition

Info

Publication number
WO2016119076A1
Authority
WO
WIPO (PCT)
Prior art keywords
modules
convolution
inception
features
feature maps
Prior art date
Application number
PCT/CN2015/000050
Other languages
French (fr)
Inventor
Xiaoou Tang
Xiaogang Wang
Yi Sun
Original Assignee
Xiaoou Tang
Priority date
Filing date
Publication date
Application filed by Xiaoou Tang filed Critical Xiaoou Tang
Priority to CN201580074278.6A priority Critical patent/CN107209864B/en
Priority to PCT/CN2015/000050 priority patent/WO2016119076A1/en
Publication of WO2016119076A1 publication Critical patent/WO2016119076A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to a method for face recognition and a system thereof.
  • DeepFace and DeepID were independently proposed to learn identity-related facial features through large-scale face identification tasks. DeepID2 made an additional improvement by learning deep facial features with joint face identification-verification tasks. DeepID2+ further improves DeepID2 by increasing the feature dimensions in each layer and adding joint identification-verification supervisory signals to previous feature extraction layers. DeepID2+ achieved the current state-of-the-art face recognition results on a number of widely evaluated face recognition datasets.
  • the network structure of DeepID2+ is still similar to conventional convolutional neural networks with interlacing convolutional and pooling layers.
  • VGG net and GoogLeNet are two representatives.
  • VGG net proposes to use continuous convolutions with small convolutional kernels. In particular, it stacks two or three layers of 3x3 convolutions together between every two pooling layers.
  • GoogLeNet incorporates multi-scale convolutions and pooling into a single feature extraction layer coined inception. To learn efficient features, an inception layer also introduces 1x1 convolutions to reduce the number of feature maps before larger convolutions and after pooling.
  • an apparatus for face recognition may comprise an extractor having a plurality of deep feature extraction hierarchies, the hierarchies extract recognition features from one or more input images; and a recognizer being electronically communicated with the extractor and recognizing face images of the input images based on the extracted recognition features.
  • each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1.
  • a first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features.
  • the features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  • each of the pooling modules is coupled between two adjacent multi-convolution modules, between one multi-convolution module and one adjacent multi-inception module, or between two adjacent multi-inception modules.
  • each of the hierarchies further comprises one or more multi-inception modules.
  • Each of the multi-inception modules performs multi-scale convolutional operation on the features received from previous coupled pooling modules and reduces dimensions of the received features.
  • Each of multi-convolution and multi-inception modules in each hierarchy is followed by one of the pooling modules, and each pooling module is followed by a multi-convolution module or a multi-inception module, except for a last pooling module, a last multi-convolution module, or a last multi-inception module in the hierarchy.
  • each of the multi-inception modules may comprise a plurality of cascaded inception layers.
  • Each of inception layers receives features outputted from a previous inception layer as its input, and the inception layers are configured to perform multi-scale convolution operations and pooling operations on the received features to obtain multi-scale convolutional feature maps and locally invariant feature maps, and perform 1x1 convolution operations before multi-scale convolution operations and after pooling operations to reduce dimensions of features before multi-scale convolution operations and after pooling operations.
  • the obtained multi-scale convolutional feature maps and the obtained locally invariant feature maps are stacked together to form input feature maps of the layer that follows.
  • each of the inception layers comprises: one or more first 1x1 convolution operation layers configured to receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps; and one or more multi-scale convolution operation layers configured to perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form first output feature maps, where N>1.
  • One or more pooling operation layers are configured to pool over local regions of the input feature maps from the previous layer to form locally invariant feature maps; and one or more second 1x1 convolution operation layers are configured to perform 1x1 convolution operations on the locally invariant feature maps received from the pooling operation layers to compress a number of feature maps so as to obtain second output feature maps.
  • One or more third convolution operation layers are configured to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps to obtain third feature maps.
  • the first, second and third feature maps are stacked together to form feature maps for inputting a following inception layer of the inception layers or inputting a next feature extraction module.
  • each of multi-convolution modules may comprise one or more cascaded convolution layers, each of the convolution layers receives features outputted from a previous convolution layer as its input, and each of the convolution layers is configured to perform local convolution operations on inputted features, wherein the convolutional layers share neural weights for the convolution operations only in local areas of the inputted images.
  • a trainer may be electronically communicated with the extractor to add supervisory signals on the feature extraction unit during training so as to adjust neural weights in the deep feature extraction hierarchies by back-propagating supervisory signals through the cascaded multi-convolution modules and pooling modules, or through the cascaded multi-convolution modules, pooling modules and the multi-inception modules.
  • the supervisory signals comprise one identification supervisory signal and one verification supervisory signal, wherein the identification supervisory signal is generated by classifying features in any of the modules extracted from an input face region into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and wherein the verification signal is generated by comparing features in any of the modules extracted from two input face images respectively for determining if they are from the same person, and taking a verification error as the supervisory signal.
  • each of the multi-convolution modules, the pooling modules and the multi-inception modules receives a plurality of supervisory signals which are either added on said each module or back-propagated from later feature extraction modules. These supervisory signals are aggregated to adjust neural weights in each of multi-convolution and multi-inception modules during training.
  • each of the deep feature extraction hierarchies may comprise a different number of the multi-convolution modules, a different number of the multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or takes a different input face region to extract the features.
  • a method for face recognition comprising: extracting, by an extractor having a plurality of deep feature extraction hierarchies, recognition features from one or more input images; and recognizing face images of the input images based on the extracted recognition features, wherein each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1.
  • a first one of the multi-convolution modules extracts local features from the input images, the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features.
  • Features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  • each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers
  • the extracting further comprises: performing, by each of the inception layers, convolution operations on the received features to obtain multi-scale convolutional feature maps, and performing, by said each of the inception layers, pooling operations on the received features to obtain pooled feature maps (i.e. to pool over local regions of the feature maps received from the previous layer to form locally invariant feature maps) , wherein the obtained multi-scale convolutional feature maps and the pooled feature maps are stacked together to form input feature maps of the layer that follows.
  • each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers, and wherein, during the extracting, each of the inception layers operates to: receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps; perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form first output feature maps, where N>1; perform pooling operations on the received features from said previous layer (i.e. to pool over local regions of the input feature maps from the previous layer to form locally invariant feature maps); perform 1x1 convolution operations on the pooled feature maps to compress a number of feature maps so as to obtain second output feature maps; receive the input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps so as to obtain third feature maps; and concatenate the first, second and third feature maps to form feature maps for inputting a following inception layer or a next feature extraction module.
  • an apparatus for face recognition which may comprise: one or more memories that stores executable components; and one or more processors, coupled to the memories, that executes the executable components to perform operations of the apparatus, the executable components comprising:
  • an extracting component having a plurality of deep feature extraction hierarchies configured to extract recognition features from one or more input images
  • a recognizing component recognizing face images of the input images based on the extracted recognition features
  • each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1,
  • a first one of the multi-convolution modules extracts local features from the input images, the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features, and
  • Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.
  • Fig. 2 is a schematic diagram illustrating an apparatus for face recognition when it is implemented in software, consistent with some disclosed embodiments.
  • Fig. 3a and 3b are two schematic diagrams illustrating two examples of deep feature extraction hierarchies in the feature extraction unit as shown in Fig. 1.
  • Fig. 4a is a schematic diagram illustrating structures of a multi-convolution module, consistent with some disclosed embodiments.
  • Fig. 4b is a schematic diagram illustrating structures of a multi-inception module in deep feature extraction hierarchies, consistent with some disclosed embodiments.
  • Fig. 5 is a schematic diagram illustrating structures of an inception layer in multi-inception modules, consistent with some disclosed embodiments.
  • Fig. 6 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 7 is a schematic flowchart illustrating the extractor as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 8 is a schematic flowchart illustrating the recognizer as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5 consistent with some disclosed embodiments.
  • the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit, ” “module” or “system. ” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
  • the apparatus 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
  • the apparatus 1000 may include one or more processors (processors 102, 104, 106 etc. ) , a memory 112, a storage device 116, a communication interface 114, and a bus to facilitate information exchange among various components of apparatus 1000.
  • Processors 102-106 may include a central processing unit ( “CPU” ) , a graphic processing unit ( “GPU” ) , or other suitable information processing devices.
  • processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods or run the modules that will be explained in greater detail below.
  • Memory 112 can include, among other things, a random access memory ( “RAM” ) and a read-only memory ( “ROM” ) .
  • Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106.
  • memory 112 may store one or more software applications.
  • memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106 to carry out the functions as disclosed below for the apparatus 1000. It is noted that although only one block is shown in Fig. 1, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
  • the apparatus 1000 may comprise an extractor 10 and a recognizer 20.
  • the extractor 10 is configured with a plurality of deep feature extraction hierarchies, which may be formed as a neural network configured or trained to extract recognition features from one or more input images.
  • the recognizer 20 is electronically communicated with the extractor 10 and recognizes face images of the input images based on the extracted recognition features.
  • each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1.
  • a first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features.
  • the features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  • the apparatus 1000 may further comprise a trainer 30 used to train the neural network.
  • the feature extractor 10 contains a plurality of deep feature extraction hierarchies. Each of the feature extraction hierarchies is a cascade of feature extraction modules.
  • Fig. 7 is a schematic flowchart illustrating the feature extraction process in the extractor 10, which contains three steps.
  • In step 101, the feature extractor 10 forward propagates an input face image through each of the deep feature extraction hierarchies, respectively.
  • In step 102, the extractor 10 takes the representations outputted by each of the deep feature extraction hierarchies as features.
  • In step 103, it concatenates the features of all deep feature extraction hierarchies.
  • each of the deep feature extraction hierarchies may include a plurality of multi-convolution modules, a plurality of multi-inception modules, a plurality of pooling modules, and a plurality of full-connection modules.
  • Each of the deep feature extraction hierarchies may contain a different number of cascaded multi-convolution modules, a different number of multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or may take a different input face region to extract features.
  • Fig. 3a illustrates an example of feature extraction hierarchies in the extractor 10. As shown in Fig. 3a, each of the deep feature extraction hierarchies contains alternate multi-convolution modules 21-1, 21-2, 21-3... and pooling modules 22-1, 22-2, 22-3.... For purpose of description, four multi-convolution modules 21-1, 21-2, 21-3 and 21-4 and three pooling modules 22-1, 22-2 and 22-3 are illustrated in Fig. 3a as an example.
  • Fig. 4a is a schematic diagram illustrating structures of each of the multi-convolution modules 21-1, 21-2, 21-3.... As shown, each multi-convolution module contains a plurality of cascaded convolutional layers. Fig. 4a shows an example of three cascaded convolutional layers (convolutional layers 1-3). However, in the present application, a multi-convolution module could contain any number of convolutional layers such as one, two, three, or more. In the extreme of a multi-convolution module containing only one convolutional layer, it degrades to a conventional convolution module. Therefore, multi-convolution modules are generalizations of conventional convolution modules. Likewise, a multi-inception module contains one or more cascaded inception layers.
  • the convolutional layers in a multi-convolution module are configured to extract local facial features from input feature maps (which is output feature maps of the previous layer) to form output feature maps of the current layer.
  • each convolutional layer performs convolution operations on the input feature maps to form output feature maps of the current layer, and the formed output feature maps will be input to the next convolutional layer.
  • Each feature map is a certain kind of features organized in 2D.
  • the features in the same output feature map or in local regions of the same feature map are extracted from input feature maps with the same set of neural connection weights.
  • the convolution operation in each convolutional layer may be expressed as y_j^(r) = max(0, b_j^(r) + Σ_i k_ij^(r) * x_i^(r)) (formula 1), wherein
  • x_i and y_j are the i-th input feature map and the j-th output feature map, respectively;
  • k_ij is the convolution kernel between the i-th input feature map and the j-th output feature map, and * denotes convolution;
  • b_j is the bias of the j-th output feature map, and max(0, ·) is the ReLU non-linearity; and
  • r indicates a local region where weights are shared. In one extreme of the local region r corresponding to entire input feature maps, convolution becomes global convolution. In another extreme of the local region r corresponding to a single pixel in input feature maps, a convolutional layer degrades to a local-connection layer.
  • 1x1 convolution operations may be carried out in inception layers (as shown in Fig. 4) to compress the number of feature maps by setting the number of output feature maps significantly smaller than the number of input feature maps, which will be discussed below.
  • Each of the pooling modules 22-1, 22-2... aims to reduce feature dimensions and form more invariant features.
  • the goal of cascading multiple convolution/inception layers is to extract hierarchical local features (i.e. features extracted from local regions of the input images or the input features) , wherein features extracted by higher convolution/inception layers have larger effective receptive field on input images and more complex non-linearity.
  • the pooling modules 22-1, 22-2... are configured to pool local facial features from input feature maps from previous layer to form output feature maps of the current layer.
  • Each of the pooling modules 22-1, 22-2... receives the feature maps from the respective connected multi-convolution/multi-inception module and then reduces the feature dimensions of the received feature maps and forms more invariant features by pooling operations, which may be formulated as y^i_(j,k) = max_(0≤m<M, 0≤n<N) x^i_(j·s+m, k·s+n) (formula 2), wherein
  • each neuron in the i-th output feature map y_i pools over an M × N local region in the i-th input feature map x_i, with s as the step size.
  • the feature maps with the reduced dimensions are then input to the next cascaded convolutional module.
  • each of pooling modules is in addition followed by a full-connection module 23 (23-1, 23-2 and 23-3) .
  • Features extracted in the three full-connection modules 23-1, 23-2 and 23-3 and the last multi-convolution module 21-4 (multi-convolution module 4) are supervised by supervisory signals.
  • Features in the last multi-convolution module 21-4 are used for face recognition.
  • the full-connection modules 23-1, 23-2 and 23-3 in deep feature extraction hierarchies are configured to extract global features (features extracted from the entire region of input feature maps) from previous feature extraction modules, i.e. the pooling modules 22-1, 22-2 and 22-3.
  • the fully-connected layers also serve as interfaces for receiving supervisory signals during training, which will be discussed later.
  • the full-connection modules 23-1, 23-2 and 23-3 also have the function of feature dimension reduction as pooling modules 22-1, 22-2 and 22-3 by restricting the number of neurons in them.
  • the fully-connection modules 23-1, 23-2 and 23-3 may be formulated as y_j = max(0, Σ_i w_(i,j) · x_i) (formula 3), wherein
  • x denotes neural outputs (features) from the cascaded pooling module,
  • y denotes neural outputs (features) in the current fully-connection module, and
  • w denotes neural weights in the current feature extraction module (the current fully-connection module). Neurons in fully-connection modules linearly combine features in the previous feature extraction module, followed by ReLU non-linearity.
  • a feature extraction unit may contain a plurality of the deep feature extraction hierarchies.
  • Features in top feature extraction modules in all deep feature extraction hierarchies are concatenated into a long feature vector as a final feature representation for face recognition.
  • branching-out modules serve as interfaces for receiving supervisory signals during training, which will be disclosed later.
  • the top feature extraction module which extracts features for face recognition
  • all branching-out modules will be discarded and only the module cascade for extracting features for face recognition is reserved in test.
  • the hierarchy contains two multi-convolution modules 21-1 and 21-2, each of which is followed by a pooling module 22 (22-1 or 22-2) .
  • the multi-convolution module 21-1 is connected to an input face image as an input layer, and is configured to extract local facial features (i.e. features extracted from local regions of the input images) from input images by rule of formulation 1) .
  • the pooling module 22-1 is configured to pool local facial features from the previous layer (the multi-convolution module 21-1 ) to form output feature maps of the current layer. To be specific, the pooling module 22-1 receives the feature maps from the respectively connected convolutional module and then reduces the feature dimensions of the received feature maps, and forms more invariant features by pooling operations, which is formulated as formulation 2) .
  • each feature map is a certain kind of features organized in 2D.
  • the feature extraction hierarchy further comprises two multi-inception modules 24-1 and 24-2, each of which is followed by a pooling module 22 (22-3 or 22-4) .
  • Fig. 4b shows an example of three cascaded inception layers 1-3 in each of the multi-inception modules 24-1 and 24-2.
  • the goal of cascading the inception layers is to extract multi-scale local features by incorporating convolutions of various kernel sizes as well as local pooling operations in a single layer.
  • the features extracted by higher convolution/inception layers have larger effective receptive field on input images and more complex non-linearity.
  • each of the inception layers comprises one or more first 1x1 convolution operation layers 241, one or more second 1x1 convolution operation layers 242, one or more multi-convolution operation layers (N×N convolution, N>1) 243, one or more pooling operation layers 244, and one or more third 1x1 convolution operation layers 245.
  • the number of the 1x1 convolution operation layers 241 is the same as that of the multi-scale convolution operation layers 243, and each layer 243 is coupled to a corresponding layer 241.
  • the number of the third 1x1 convolution operation layers 245 is the same as that of the pooling layers 244.
  • the second 1x1 convolution operation layers 242 are coupled to the previous inception layer.
  • the 1x1 convolution layers 241 and 245 are used to make computation efficient before the operations of the multi-convolution operation layers 243 and after the pooling operation layers 244, respectively, which will be discussed below.
  • Fig. 5 just shows two first 1x1 convolution operation layers 241, one second 1x1 convolution operation layer 242, one third 1x1 convolution operation layer 245 and two multi-scale convolution operation layers 243, but the invention is not limited thereto.
  • the inception layer combines convolution operations with convolutional kernel sizes of 1x1, 3x3, and 5x5, as well as pooling operations by rule of formulation 2.
  • the first 1x1 convolution layers 241 are used to make computation efficient before 3x3 and 5x5 convolutions.
  • the number of output feature maps of a 1x1 convolution layer is set to be much smaller than its input feature maps.
  • Since 3x3 and 5x5 convolutions take output feature maps of 1x1 convolutions as their input feature maps, the number of input feature maps of the 3x3 and 5x5 convolutions becomes much smaller. In this way, computations in 3x3 and 5x5 convolutions are reduced significantly.
  • the 1x1 convolution 245 after pooling helps reduce the number of output feature maps of pooling. Since output feature maps of 1x1, 3x3, and 5x5 convolutions are concatenated to form input feature maps of the next layer, a small number of output feature maps of 1x1 convolutions reduces the total number of output feature maps, and therefore reduces computation in next layer.
  • the 1x1 convolution itself does not take the majority computation due to the extremely small convolutional kernel size.
  • Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5 consistent with some disclosed embodiments.
  • each of 1x1 convolution operation layers 241 operates to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received features maps to compress a number of feature maps by rule of formula 1) as stated in the above.
  • the multi-scale convolution operation layer 243 performs N ⁇ N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layer 241 to form a plurality of first output feature maps.
  • the pooling operation layer 244 operates to receive the input feature maps from the previous layer and perform pooling operations on the received feature maps by rule of formula 2) .
  • the pooling operations in inception layers aim to pool over local regions of input feature maps to form locally invariant features as stated in the above.
  • pooling in inception layers may not reduce feature dimensions, which is achieved by setting step-size s equal to 1 by rule of formula 2.
  • the third 1x1 convolution operation layers 245 operate to perform 1x1 convolution operations on the features maps received from the pooling operation layer 244 to compress numbers of the feature maps by rule of formula 1) as stated in the above so as to obtain a plurality of second output feature maps.
  • the second 1x1 convolution operation layers 242 operate to receive the input feature maps from the previous layer and perform 1x1 convolution operations on the received features maps to compress numbers of the feature maps by rule of formula 1) so as to obtain a plurality of third feature maps.
  • the first, second and third feature maps are concatenated to form feature maps for inputting the following inception layer or inputting the following feature extraction module.
  • the recognizer 20 operates to calculate distances between features for different face images extracted by the feature extractor 10, to determine if two face images are from the same identity for face verification, or to determine if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images consisting of the input images for face identification.
  • Fig. 8 is a schematic flowchart illustrating the recognition process in the recognizer 20. In step 201, the recognizer 20 calculates distances between features extracted from different face images by the feature extractor 10.
  • In step 202, the recognizer 20 determines if two face images are from the same identity for face verification, or, alternatively, in step 203, it determines if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images consisting of the input images for face identification.
  • two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, or the probe face image is determined to belong to the same identity as one of gallery face images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images, wherein feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
  • Joint Bayesian distances are used as feature distances.
  • Joint Bayesian has been a popular similarity metric of faces, which represents the extracted facial features x (after subtracting the mean) by the sum of two independent Gaussian variables, x = μ + ε, where μ ~ N(0, S_μ) represents the face identity and ε ~ N(0, S_ε) represents the intra-personal variation.
  • S_μ and S_ε can be learned from data with the EM algorithm. In test, it calculates the likelihood ratio r(x_1, x_2) = log [P(x_1, x_2 | H_I) / P(x_1, x_2 | H_E)] between the intra-personal hypothesis H_I (same identity) and the extra-personal hypothesis H_E (different identities). (A numerical sketch of this ratio is given after this list.)
  • The Trainer 30
  • the trainer 30 is used to update the weights w on connections between neurons in feature extraction layers (i.e. the layers of the multi-convolution modules, the multi-inception modules and the full connection modules) in the feature extractor 10 by inputting initial weights on connections between neurons in feature extraction layers in the feature extractor, a plurality of identification supervisory signals, and a plurality of verification supervisory signals.
  • the trainer 30 aims to iteratively find a set of optimized neural weights in deep feature extraction hierarchies for extracting identity-related features for face recognition.
  • the identification and verification supervisory signals in the trainer 30 are simultaneously added to each of the supervised layers in each of the feature extraction hierarchies in the feature extractor 10,and respectively back-propagated to the input face image, so as to update the weights on connections between neurons in all the cascaded feature extraction modules.
  • the identification supervisory signals are generated in the trainer 30 by classifying all of the supervised layer (layers selected for supervision, which could be those in multi-convolution modules, multi-inception modules, pooling modules, or full-connection modules) representations into one of N identities, wherein the classification errors are used as the identification supervisory signals.
  • the verification supervisory signals in the trainer 30 are generated by verifying the supervised layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals.
  • the feature extractor 10 extracts two feature vectors f i and f j from the two face images respectively in each of the feature extraction modules.
  • the verification error is (1/2) ||f_i - f_j||_2^2 if f_i and f_j are features of face images of the same identity, or (1/2) max(0, m - ||f_i - f_j||_2)^2 if f_i and f_j are features of face images of different identities, where ||·||_2 is the Euclidean distance of the two feature vectors, and m is a positive constant value (the margin).
  • In other words, the verification error is large if f_i and f_j are dissimilar for the same identity, or if f_i and f_j are similar for different identities.
  • Fig. 6 is a schematic flowchart illustrating the training process in the trainer 30.
  • the trainer 30 samples two face images and inputs them to the feature extractor 10, respectively, to get feature representations of each of the two face images in all feature extraction layers of the feature extractor 10.
  • the trainer 30 calculates identification errors by classifying feature representations of each face image in each supervised layer into one of a plurality of (N) identities.
  • the trainer 30 calculates verification errors by verifying if feature representations of two face images, respectively, in each supervised layer are from the same identity.
  • the identification and verification errors are used as identification and verification supervisory signals, respectively.
  • In step 304, the trainer 30 back-propagates all identification and verification supervisory signals through the feature extractor 10 simultaneously, so as to update the weights on connections between neurons in the feature extractor 10.
  • Identification and verification supervisory signals (or errors) simultaneously added to supervised layers are back-propagated through the cascade of feature extraction modules until the input image. After back-propagation, the errors obtained in each layer in the cascade of feature extraction modules are accumulated. Weights on connections between neurons in the feature extractor 10 are updated according to the magnitude of the errors.
  • In step 305, the trainer 30 judges if the training process has converged, and repeats steps 301-304 if a convergence point has not been reached.
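As a numerical illustration of the Joint Bayesian likelihood ratio mentioned above, the following Python sketch evaluates the ratio directly from the block covariances implied by x = μ + ε. This is not taken from the application: the direct multivariate-normal evaluation, the library calls, and the toy covariances are illustrative assumptions (practical implementations typically use a closed-form expression instead).

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
    """Log-likelihood ratio log P(x1, x2 | same identity) - log P(x1, x2 | different identities)
    for mean-subtracted features x = mu + eps, with mu ~ N(0, S_mu) and eps ~ N(0, S_eps)."""
    d = len(x1)
    z = np.concatenate([x1, x2])
    intra = np.block([[S_mu + S_eps, S_mu],
                      [S_mu, S_mu + S_eps]])              # the two faces share the same mu
    extra = np.block([[S_mu + S_eps, np.zeros((d, d))],
                      [np.zeros((d, d)), S_mu + S_eps]])  # independent identities
    zero = np.zeros(2 * d)
    return (multivariate_normal.logpdf(z, mean=zero, cov=intra)
            - multivariate_normal.logpdf(z, mean=zero, cov=extra))

# Toy usage with random placeholder covariances and features.
rng = np.random.default_rng(0)
d = 4
A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))
S_mu, S_eps = A @ A.T + np.eye(d), B @ B.T + np.eye(d)    # positive definite by construction
print(joint_bayesian_ratio(rng.standard_normal(d), rng.standard_normal(d), S_mu, S_eps))
```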

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed is an apparatus for face recognition. The apparatus may comprise an extractor having a plurality of deep feature extraction hierarchies, the hierarchies extract recognition features from one or more input images; and a recognizer being electronically communicated with the extractor and recognizing face images of the input images based on the extracted recognition features.

Description

A METHOD AND A SYSTEM FOR FACE RECOGNITION Technical Field
The present application relates to a method for face recognition and a system thereof.
Background
Learning effective deep face representations using deep neural networks has become a very promising approach to face recognition. With better deep network structures and supervisory methods, face recognition accuracy has been boosted rapidly in recent years. DeepFace and DeepID were independently proposed to learn identity-related facial features through large-scale face identification tasks. DeepID2 made an additional improvement by learning deep facial features with joint face identification-verification tasks. DeepID2+ further improves DeepID2 by increasing the feature dimensions in each layer and adding joint identification-verification supervisory signals to previous feature extraction layers. DeepID2+ achieved the current state-of-the-art face recognition results on a number of widely evaluated face recognition datasets. However, the network structure of DeepID2+ is still similar to conventional convolutional neural networks with interlacing convolutional and pooling layers.
In general object recognition domain, there have been a few successful attempts to improve upon conventional convolutional neural networks. VGG net and GoogLeNet are two representatives. VGG net proposes to use continuous convolutions with small convolutional kernels. In particular, it stacks two or three layers of 3x3 convolutions together between every two pooling layers. GoogLeNet incorporates multi-scale convolutions and pooling into a single feature extraction layer coined inception. To learn efficient features, an inception layer also introduces 1x1 convolutions to reduce the number of feature maps before larger convolutions and after pooling.
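To see why these 1x1 reductions matter, consider a rough multiply count for a 5x5 convolution with and without a preceding 1x1 channel reduction. The Python sketch below uses illustrative map and channel sizes; they are assumptions for the example, not values from GoogLeNet or from this application.

```python
# Rough multiply counts for a 5x5 convolution on an H x W feature map,
# with and without a 1x1 channel reduction first.
H = W = 28
c_in, c_reduced, c_out = 192, 32, 64                 # illustrative channel counts

direct = H * W * c_out * (c_in * 5 * 5)              # 5x5 applied directly to c_in maps
reduced = (H * W * c_reduced * c_in                  # 1x1 reduction to c_reduced maps
           + H * W * c_out * (c_reduced * 5 * 5))    # 5x5 on the reduced maps

print(direct, reduced)   # ~240.8M vs ~45.0M multiplies, roughly a 5x saving
```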
Summary
In one aspect of the present application, disclosed is an apparatus for face recognition. The apparatus may comprise an extractor having a plurality of deep feature extraction hierarchies, the hierarchies extract recognition features from one or more input images; and a recognizer being electronically communicated with the extractor and recognizing face images of the input images based on the extracted recognition features.
In one embodiment of the present application, each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1. A first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features. The features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
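For illustration only, such a hierarchy and the concatenation of features across hierarchies could be sketched as below in Python with PyTorch. The module counts, channel widths, crop sizes, ReLU placement, and max pooling are assumptions made for the sketch rather than parameters disclosed by the application.

```python
import torch
import torch.nn as nn

class MultiConvModule(nn.Module):
    """A cascade of small convolutions with ReLU, in the spirit of Fig. 4a (layer count is illustrative)."""
    def __init__(self, in_ch, out_ch, num_layers=2):
        super().__init__()
        layers = []
        for i in range(num_layers):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

class FeatureHierarchy(nn.Module):
    """Alternating multi-convolution and pooling modules, in the spirit of Fig. 3a;
    the last multi-convolution module is not followed by pooling in this sketch."""
    def __init__(self, channels=(3, 32, 64, 96)):
        super().__init__()
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [MultiConvModule(c_in, c_out), nn.MaxPool2d(kernel_size=2)]
        self.blocks = nn.Sequential(*blocks[:-1])   # drop the final pooling module

    def forward(self, x):
        return torch.flatten(self.blocks(x), start_dim=1)

# Features from several hierarchies (e.g. different face crops) are concatenated
# into a single recognition feature vector.
hierarchies = [FeatureHierarchy() for _ in range(3)]
crops = [torch.randn(1, 3, 56, 56) for _ in range(3)]          # placeholder face regions
recognition_features = torch.cat([h(c) for h, c in zip(hierarchies, crops)], dim=1)
```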
In one embodiment of the present application, each of the pooling modules is coupled between two adjacent multi-convolution modules, between one multi-convolution module and one adjacent multi-inception module, or between two adjacent multi-inception modules.
In one embodiment of the present application, each of the hierarchies further comprises one or more multi-inception modules. Each of the multi-inception modules performs multi-scale convolution operations on the features received from the previously coupled pooling modules and reduces dimensions of the received features. Each of the multi-convolution and multi-inception modules in each hierarchy is followed by one of the pooling modules, and each pooling module is followed by a multi-convolution module or a multi-inception module, except for a last pooling module, a last multi-convolution module, or a last multi-inception module in the hierarchy.
As an example, each of the multi-inception modules may comprise a plurality of cascaded inception layers. Each of the inception layers receives features outputted from a previous inception layer as its input, and the inception layers are configured to perform multi-scale convolution operations and pooling operations on the received features to obtain multi-scale convolutional feature maps and locally invariant feature maps, and to perform 1x1 convolution operations before the multi-scale convolution operations and after the pooling operations to reduce the dimensions of the features at those points. The obtained multi-scale convolutional feature maps and the obtained locally invariant feature maps are stacked together to form input feature maps of the layer that follows.
In particular, each of the inception layers comprises: one or more first 1x1 convolution operation layers configured to receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps; and one or more multi-scale convolution operation layers configured to perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form first output feature maps, where N>1. One or more pooling operation layers are configured to pool over local regions of the input feature maps from the previous layer to form locally invariant feature maps; and one or more second 1x1 convolution operation layers are configured to perform 1x1 convolution operations on the locally invariant feature maps received from the pooling operation layers to compress a number of feature maps so as to obtain second output feature maps. One or more third convolution operation layers are configured to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps to obtain third feature maps. The first, second and third feature maps are stacked together to form feature maps for inputting a following inception layer of the inception layers or inputting a next feature extraction module.
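A minimal sketch of such an inception layer is shown below; it assumes 3x3 and 5x5 branches and max pooling, and the branch widths and ReLU placement are illustrative choices rather than parameters from the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InceptionLayer(nn.Module):
    """Four branches: direct 1x1, 1x1 then 3x3, 1x1 then 5x5, and pooling then 1x1.
    Branch outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)                      # direct 1x1 branch
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, kernel_size=1),   # 1x1 reduction
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, kernel_size=1),   # 1x1 reduction
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
        self.pool_proj = nn.Conv2d(in_ch, pool_proj, kernel_size=1)             # 1x1 after pooling

    def forward(self, x):
        pooled = F.max_pool2d(x, kernel_size=3, stride=1, padding=1)            # stride 1: no downsampling
        branches = [self.branch1(x), self.branch3(x), self.branch5(x), self.pool_proj(pooled)]
        return torch.cat([F.relu(b) for b in branches], dim=1)

# Example: 64 input maps -> 32 + 32 + 16 + 16 = 96 output maps.
layer = InceptionLayer(64, c1=32, c3_red=16, c3=32, c5_red=8, c5=16, pool_proj=16)
out = layer(torch.randn(1, 64, 14, 14))
```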
In one embodiment of the present application, each of multi-convolution modules may comprise one or more cascaded convolution layers, each of the convolution layers receives features outputted from a previous convolution layer as its input, and each of the convolution layers is configured to perform local convolution operations on inputted features, wherein the convolutional layers share neural weights for the convolution operations only in local areas of the inputted images.
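A rough sketch of convolution with locally shared weights follows. It is an approximation made for illustration: the input plane is split into a grid of regions, each region gets its own kernel, and region borders are padded per region rather than handled with overlapping windows. A single region recovers ordinary (global) convolution, while one region per pixel approaches a local-connection layer.

```python
import torch
import torch.nn as nn

class LocallySharedConv2d(nn.Module):
    """Convolution whose kernels are shared only within an r x r grid of spatial regions."""
    def __init__(self, in_ch, out_ch, kernel_size=3, regions=2):
        super().__init__()
        self.regions = regions
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(regions * regions)])

    def forward(self, x):
        out_rows = []
        for i, row in enumerate(torch.chunk(x, self.regions, dim=2)):
            cols = torch.chunk(row, self.regions, dim=3)
            out_cols = [self.convs[i * self.regions + j](col)
                        for j, col in enumerate(cols)]
            out_rows.append(torch.cat(out_cols, dim=3))
        return torch.relu(torch.cat(out_rows, dim=2))
```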
In some of embodiments, a trainer may be electronically communicated with the extractor to add supervisory signals on the feature extraction unit during training so as to adjust neural weights in the deep feature extraction hierarchies by back-propagating supervisory signals through the cascaded multi-convolution modules and pooling modules, or through the cascaded multi-convolution modules, pooling modules and the multi-inception modules. The supervisory signals comprise one identification supervisory signal and one verification supervisory signal, wherein the identification supervisory signal is generated by classifying features in any of the modules extracted from an input face region into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and wherein the verification signal is generated by comparing features in any of the modules extracted from two input face images respectively for determining if they are from the same person, and taking a verification error as the supervisory signal. According to the present application, each of the multi-convolution modules, the pooling modules and the multi-inception modules receives a plurality of supervisory signals which are either added on said each module or back-propagated from later feature extraction modules. These supervisory signals are aggregated to adjust neural weights in each of multi-convolution and multi-inception modules during training.
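The two supervisory signals could be sketched as the following losses. The use of softmax cross-entropy for the identification signal and of a margin-based contrastive error for the verification signal follows the description above, while the classifier itself and the margin value are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def identification_signal(features, identity_labels, classifier):
    """Classify features into one of the N training identities and use the
    classification (cross-entropy) error as the identification signal."""
    return F.cross_entropy(classifier(features), identity_labels)

def verification_signal(f_i, f_j, same_identity, margin=1.0):
    """Compare features of two face images: pull them together when they come
    from the same person, push them at least `margin` apart otherwise."""
    dist = torch.norm(f_i - f_j, p=2, dim=1)
    error_same = 0.5 * dist.pow(2)
    error_diff = 0.5 * torch.clamp(margin - dist, min=0).pow(2)
    return torch.where(same_identity, error_same, error_diff).mean()
```

During training, both signals would be added at each supervised layer and their gradients accumulated during back-propagation, per the procedure described above.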
In the present application, each of the deep feature extraction hierarchies may  comprise a different number of the multi-convolution modules, a different number of the multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or takes a different input face region to extract the features.
In a further aspect of the present application, disclosed is a method for face recognition, comprising: extracting, by an extractor having a plurality of deep feature extraction hierarchies, recognition features from one or more input images; and recognizing face images of the input images based on the extracted recognition features, wherein each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1. A first one of the multi-convolution modules extracts local features from the input images, the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features. Features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
In one embodiment of the present application, each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers, the extracting further comprises: performing, by each of the inception layers, convolution operations on the received features to obtain multi-scale convolutional feature maps, and performing, by said each of the inception layers, pooling operations on the received features to obtain pooled feature maps (i.e. to pool over local regions of the feature maps received from the previous layer to form locally invariant feature maps) , wherein the obtained multi-scale convolutional feature maps and the pooled feature maps are stacked together to form input feature maps of the layer that follows.
In a further embodiment of the present application, each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers, and wherein, during the extracting, each of the inception layers operates to: receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps; perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form first output feature maps, where N>1; perform pooling operations on the received features from said previous layer (i.e. to pool over local regions of the input feature maps from the previous layer to form locally invariant feature maps); perform 1x1 convolution operations on the pooled feature maps received from the pooling operation layers to compress a number of feature maps so as to obtain second output feature maps; receive the input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps so as to obtain third feature maps; and concatenate the first, second and third feature maps to form feature maps for inputting a following inception layer of the inception layers or inputting a next feature extraction module.
In a further aspect of the subject application, there is provided an apparatus for face recognition, which may comprise: one or more memories that store executable components; and one or more processors, coupled to the memories, that execute the executable components to perform operations of the apparatus, the executable components comprising:
an extracting component having a plurality of deep feature extraction hierarchies configured to extract recognition features from one or more input images; and
a recognizing component recognizing face images of the input images based on the extracted recognition features,
wherein each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1,
a first one of the multi-convolution modules extracts local features from the input images, the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features, and
wherein features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.
Fig. 2 is a schematic diagram illustrating an apparatus for face recognition when it is implemented in software, consistent with some disclosed embodiments.
Fig. 3a and 3b are two schematic diagrams illustrating two examples of deep feature extraction hierarchies in the feature extraction unit as shown in Fig. 1.
Fig. 4a is a schematic diagram illustrating structures of a multi-convolution module, consistent with some disclosed embodiments.
Fig. 4b is a schematic diagram illustrating structures of a multi-inception module in deep feature extraction hierarchies, consistent with some disclosed embodiments.
Fig. 5 is a schematic diagram illustrating structures of an inception layer in multi-inception modules, consistent with some disclosed embodiments.
Fig. 6 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.
Fig. 7 is a schematic flowchart illustrating the extractor as shown in Fig. 1 consistent with some disclosed embodiments.
Fig. 8 is a schematic flowchart illustrating the recognizer as shown in Fig. 1  consistent with some disclosed embodiments.
Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5 consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
In the case that the apparatus 1000 as disclosed below is implemented with software, the apparatus 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 2, the apparatus 1000 may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage device 116, a communication interface 114, and a bus to facilitate information exchange among the various components of the apparatus 1000. Processors 102-106 may include a central processing unit ("CPU"), a graphics processing unit ("GPU"), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods or run the modules that will be explained in greater detail below.
Memory 112 can include, among other things, a random access memory ("RAM") and a read-only memory ("ROM"). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106 to carry out the functions disclosed below for the apparatus 1000. It is noted that although only one block is shown in Fig. 2, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
Referring to Fig. 1 again, where the apparatus 1000 is implemented in hardware, it may comprise an extractor 10 and a recognizer 20. The extractor 10 is configured with a plurality of deep feature extraction hierarchies, which may be formed as a neural network configured or trained to extract recognition features from one or more input images. The recognizer 20 is electronically communicated with the extractor 10 and recognizes face images of the input images based on the extracted recognition features. As will be discussed in detail below, each of the hierarchies comprises N multi-convolution modules and M pooling modules, where each of N and M is an integer greater than 1. A first one of the multi-convolution modules extracts local features from the input images, and subsequent ones of the multi-convolution modules extract further local features from the features output by a previous one of the pooling modules, wherein each of the pooling modules receives local features from a respective one of the multi-convolution modules and reduces the dimensions of the received features. The features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features. In addition, the apparatus 1000 may further comprise a trainer 30 used to train the neural network.
The Extractor 10
The feature extractor 10 contains a plurality of deep feature extraction hierarchies. Each of the feature extraction hierarchies is a cascade of feature extraction modules. Fig. 7 is a schematic flowchart illustrating the feature extraction process in the extractor 10, which contains three steps. In step 101, the feature extractor 10 forward propagates an input face image through each of the deep feature extraction hierarchies, respectively. Then, in step 102, the extractor 10 takes the representations output by each of the deep feature extraction hierarchies as features. Finally, in step 103, it concatenates the features of all the deep feature extraction hierarchies.
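By way of illustration only, steps 101-103 may be sketched in Python as follows, assuming each hierarchy is available as a callable that maps an aligned face image to its top-module representation; the function and variable names are illustrative and do not appear in the patent:

```python
import numpy as np

def extract_recognition_features(face_image, hierarchies):
    """Steps 101-103: forward-propagate one face image through every deep
    feature extraction hierarchy and concatenate the resulting features."""
    features = []
    for hierarchy in hierarchies:              # each hierarchy is a callable model
        features.append(np.asarray(hierarchy(face_image)).ravel())
    return np.concatenate(features)            # final recognition feature vector
```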
In one embodiment of the present application, each of the deep feature extraction hierarchies may include a plurality of multi-convolution modules, a plurality of multi-inception modules, a plurality of pooling modules, and a plurality of full-connection modules. Each of the deep feature extraction hierarchies may contain a different number of cascaded multi-convolution modules, a different number of multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or may take a different input face region to extract features.
Fig. 3a illustrates an example of the feature extraction hierarchies in the extractor 10. As shown in Fig. 3a, each of the deep feature extraction hierarchies contains alternate multi-convolution modules 21-1, 21-2, 21-3... and pooling modules 22-1, 22-2, 22-3.... For the purpose of description, four multi-convolution modules 21-1, 21-2, 21-3 and 21-4 and three pooling modules 22-1, 22-2 and 22-3 are illustrated in Fig. 3a as an example.
Fig. 4a is a schematic diagram illustrating the structure of each of the multi-convolution modules 21-1, 21-2, 21-3.... As shown, each multi-convolution module contains a plurality of cascaded convolutional layers. Fig. 4a shows an example of three cascaded convolutional layers, convolutional layers 1-3. However, in the present application, a multi-convolution module could contain any number of convolutional layers, such as one, two, three, or more. In the extreme case where a multi-convolution module contains only one convolutional layer, it degrades to a conventional convolution module. Therefore, multi-convolution modules are generalizations of conventional convolution modules. Likewise, a multi-inception module contains one or more cascaded inception layers.
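A minimal sketch of such a cascade, written in PyTorch, is given below. The kernel size, channel counts and use of globally shared weights are illustrative assumptions; the patent does not fix these values and also allows locally shared weights (see formula (1) below).

```python
import torch.nn as nn

def multi_convolution_module(in_channels, out_channels, num_layers=3):
    """A multi-convolution module as a cascade of convolutional layers with
    ReLU; with num_layers=1 it degrades to a conventional convolution module."""
    layers, channels = [], in_channels
    for _ in range(num_layers):
        layers += [nn.Conv2d(channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        channels = out_channels
    return nn.Sequential(*layers)
```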
The convolutional layers in a multi-convolution module are configured to extract local facial features from the input feature maps (which are the output feature maps of the previous layer) to form the output feature maps of the current layer. In particular, each convolutional layer performs convolution operations on its input feature maps to form the output feature maps of the current layer, and the formed output feature maps are input to the next convolutional layer.
Each feature map is a certain kind of features organized in 2D. The features in the same output feature map, or in local regions of the same feature map, are extracted from the input feature maps with the same set of neural connection weights. The convolution operation in each convolutional layer may be expressed as

$y_j^{r} = \max\Bigl(0,\; b_j^{r} + \sum_i k_{ij}^{r} * x_i^{r}\Bigr),\qquad (1)$

where
xi and yj are the i-th input feature map and the j-th output feature map, respectively;
kij is the convolution kernel between the i-th input feature map and the j-th output feature map;
* denotes convolution;
bj is the bias of the j-th output feature map;
the ReLU nonlinearity y = max(0, x) is used for neurons, and weights in higher convolutional layers of the ConvNets are locally shared; and
r indicates a local region where weights are shared. In one extreme, where the local region r covers the entire input feature maps, the convolution becomes a global convolution. In the other extreme, where the local region r corresponds to a single pixel in the input feature maps, the convolutional layer degrades to a local-connection layer.
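For illustration, a NumPy sketch of formula (1) in the globally shared case (r covering the entire input feature maps) is given below; cross-correlation is used, as is conventional in ConvNet implementations, and the array shapes are assumptions made only for this example:

```python
import numpy as np
from scipy.signal import correlate2d

def conv_forward(x, k, b):
    """Formula (1), globally shared weights: x is (I, H, W), k is (I, J, kh, kw),
    b is (J,); returns the output feature maps y of shape (J, H', W')."""
    I, J, kh, kw = k.shape
    y = np.zeros((J, x.shape[1] - kh + 1, x.shape[2] - kw + 1))
    for j in range(J):
        acc = np.full(y.shape[1:], b[j])
        for i in range(I):
            acc += correlate2d(x[i], k[i, j], mode='valid')
        y[j] = np.maximum(0.0, acc)            # ReLU non-linearity
    return y
```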
In further embodiments of the present application, 1x1 convolution operations may be carried out in the inception layers (as shown in Fig. 5) to compress the number of feature maps by setting the number of output feature maps significantly smaller than the number of input feature maps, which will be discussed below.
Returning to Fig. 3a, as shown, one pooling module is inserted between every two multi-convolution modules. Each of the pooling modules 22-1, 22-2... aims to reduce feature dimensions and form more invariant features.
The goal of cascading multiple convolution/inception layers is to extract hierarchical local features (i.e., features extracted from local regions of the input images or of the input features), wherein features extracted by higher convolution/inception layers have a larger effective receptive field on the input images and more complex non-linearity. The pooling modules 22-1, 22-2... are configured to pool local facial features from the input feature maps of the previous layer to form the output feature maps of the current layer. Each of the pooling modules 22-1, 22-2... receives the feature maps from the respective connected multi-convolution/multi-inception module, then reduces the feature dimensions of the received feature maps and forms more invariant features by pooling operations, which may be formulated as

$y_{j,k}^{i} = \max_{0 \le m < M,\; 0 \le n < N} \bigl\{ x_{j \cdot s + m,\; k \cdot s + n}^{i} \bigr\},\qquad (2)$

where each neuron in the i-th output feature map yi pools over an M×N local region in the i-th input feature map xi, with s as the step size.
The feature maps with the reduced dimensions are then input to the next cascaded convolutional module.
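For illustration, a NumPy sketch of the pooling in formula (2) might look like the following, assuming max pooling (one common choice) and illustrative region sizes:

```python
import numpy as np

def max_pool(x, M=2, N=2, s=2):
    """Formula (2): each output neuron pools over an M x N local region of the
    corresponding input feature map with step size s.  x has shape (C, H, W)."""
    C, H, W = x.shape
    y = np.zeros((C, (H - M) // s + 1, (W - N) // s + 1))
    for j in range(y.shape[1]):
        for k in range(y.shape[2]):
            y[:, j, k] = x[:, j * s:j * s + M, k * s:k * s + N].max(axis=(1, 2))
    return y
```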
As shown in Fig. 3a, each of the pooling modules is in addition followed by a full-connection module 23 (23-1, 23-2 and 23-3). Features extracted in the three full-connection modules 23-1, 23-2 and 23-3 and the last multi-convolution module 21-4 (multi-convolution module 4) are supervised by supervisory signals. Features in the last multi-convolution module 21-4 are used for face recognition.
The full-connection modules 23-1, 23-2 and 23-3 in the deep feature extraction hierarchies are configured to extract global features (features extracted from the entire region of the input feature maps) from the previous feature extraction modules, i.e., the pooling modules 22-1, 22-2 and 22-3. The full-connection modules also serve as interfaces for receiving supervisory signals during training, which will be discussed later. The full-connection modules 23-1, 23-2 and 23-3 also have the function of feature dimension reduction, as the pooling modules 22-1, 22-2 and 22-3 do, by restricting the number of neurons in them. The full-connection modules 23-1, 23-2 and 23-3 may be formulated as

$y = \max\bigl(0,\; w^{\top} x\bigr),\qquad (3)$

where
x denotes the neural outputs (features) from the cascaded pooling module,
y denotes the neural outputs (features) in the current full-connection module, and
w denotes the neural weights in the current feature extraction module (the current full-connection module). Neurons in the full-connection modules linearly combine the features in the previous feature extraction module, followed by the ReLU non-linearity.
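A one-line NumPy sketch of formula (3) is given below; the bias-free linear combination follows the description above, and restricting the number of columns of w is what performs the dimension reduction:

```python
import numpy as np

def full_connection_forward(x, w):
    """Formula (3): x is the feature vector from the cascaded pooling module,
    w is (D_in, D_out); fewer output neurons means a lower feature dimension."""
    return np.maximum(0.0, x @ w)
```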
Features in the highest module of the deep feature extraction hierarchies are used for face recognition. These features are global and can capture highly non-linear mappings from input face images to their identities. As two examples, features in multi-convolution module 4 in Fig. 3a and full-connection module 4 in Fig. 3b are used for face recognition for the two deep feature extraction hierarchies shown in these two figures, respectively. A feature extraction unit may contain a plurality of the deep feature extraction hierarchies. Features in the top feature extraction modules of all deep feature extraction hierarchies are concatenated into a long feature vector as a final feature representation for face recognition. There may be a plurality of feature extraction modules branching out from the module cascade for extracting features; the full-connection modules 1-3 in Fig. 3a and Fig. 3b are examples of such modules. These branching-out modules, as well as the top feature extraction module (which extracts features for face recognition), serve as interfaces for receiving supervisory signals during training, which will be disclosed later. When training is finished, all branching-out modules are discarded and only the module cascade for extracting features for face recognition is retained in testing.
In another example of a feature extraction hierarchy 20-2 in Fig. 3b, the hierarchy contains two multi-convolution modules 21-1 and 21-2, each of which is followed by a pooling module 22 (22-1 or 22-2). The multi-convolution module 21-1 is connected to an input face image as an input layer, and is configured to extract local facial features (i.e., features extracted from local regions of the input images) from the input images according to formula (1).
The pooling module 22-1 is configured to pool local facial features from the previous layer (the multi-convolution module 21-1) to form the output feature maps of the current layer. To be specific, the pooling module 22-1 receives the feature maps from the respectively connected convolutional module, then reduces the feature dimensions of the received feature maps and forms more invariant features by pooling operations, as formulated in formula (2).
Then, the cascaded multi-convolution module 21-2 and pooling module 22-2 receive the feature maps from the pooling module 22-1 and perform the same operations on the received feature maps as the multi-convolution module 21-1 and the pooling module 22-1, respectively. Herein, each feature map is a certain kind of features organized in 2D.
As shown in Fig. 3b, the feature extraction hierarchy further comprises two multi-inception modules 24-1 and 24-2, each of which is followed by a pooling module 22 (22-3 or 22-4) . Fig. 4b shows an example of three cascaded inception layers 1-3 in each of the multi-inception modules 24-1 and 24-2. The goal of cascading the inception layers is to extract multi-scale local features by incorporating convolutions of various kernel sizes as well as local pooling operations in a single layer. The features extracted by higher convolution/inception layers have larger  effective receptive field on input images and more complex non-linearity.
As shown in Fig. 5, each of the inception layers comprises one or more first 1x1 convolution operation layers 241; one or more second 1x1 convolution operation layers 242; one or more multi-scale convolution operation layers (N×N convolution, N>1) 243; one or more pooling operation layers 244; and one or more third 1x1 convolution operation layers 245. The number of the first 1x1 convolution operation layers 241 is the same as that of the multi-scale convolution operation layers 243, and each layer 243 is coupled to a corresponding layer 241. The number of the third 1x1 convolution operation layers 245 is the same as that of the pooling layers 244. The second 1x1 convolution operation layers 242 are coupled to the previous inception layer.
The 1x1 convolution layers are used to make computation efficient before the operations of the multi-scale convolution operation layers 243 and after the pooling operation layers 244, as will be discussed below.
For purposes of clarity, Fig. 5 shows just two first 1x1 convolution operation layers 241, one second 1x1 convolution operation layer 242, one third 1x1 convolution operation layer 245 and two multi-scale convolution operation layers 243, but the invention is not limited thereto. In the example shown in Fig. 5, the inception layer combines convolution operations with convolutional kernel sizes of 1x1, 3x3, and 5x5, as well as pooling operations according to formula (2). The first 1x1 convolution layers 241 are used to make computation efficient before the 3x3 and 5x5 convolutions. The number of output feature maps of a 1x1 convolution layer is set to be much smaller than the number of its input feature maps. Since the 3x3 and 5x5 convolutions take the output feature maps of the 1x1 convolutions as their input feature maps, the number of input feature maps of the 3x3 and 5x5 convolutions becomes much smaller. In this way, the computations in the 3x3 and 5x5 convolutions are reduced significantly. Likewise, the 1x1 convolution 245 after pooling helps reduce the number of output feature maps of the pooling. Since the output feature maps of the 1x1, 3x3, and 5x5 convolutions are concatenated to form the input feature maps of the next layer, a small number of output feature maps of the 1x1 convolutions reduces the total number of output feature maps, and therefore reduces the computation in the next layer. The 1x1 convolution itself does not account for the majority of the computation, due to its extremely small convolutional kernel size.
Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5, consistent with some disclosed embodiments. At step 901, each of the first 1x1 convolution operation layers 241 operates to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress the number of feature maps according to formula (1), as stated above. The multi-scale convolution operation layer 243 then performs N×N convolution operations on the compressed feature maps received from the respective 1x1 convolution operation layer 241 to form a plurality of first output feature maps.
At step 902, the pooling operation layer 244 operates to receive the input feature maps from the previous layer and perform pooling operations on the received feature maps according to formula (2). The pooling operations in inception layers aim to pool over local regions of the input feature maps to form locally invariant features, as stated above. However, to keep the output feature map sizes consistent in layers 242, 243, and 245 so that they can be stacked together later, pooling in inception layers may not reduce feature dimensions, which is achieved by setting the step size s equal to 1 in formula (2). The third 1x1 convolution operation layers 245 operate to perform 1x1 convolution operations on the feature maps received from the pooling operation layer 244 to compress the number of feature maps according to formula (1), as stated above, so as to obtain a plurality of second output feature maps.
At step 903, the second 1x1 convolution operation layers 242 operate to receive the input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress the number of feature maps according to formula (1), so as to obtain a plurality of third feature maps.
At step 904, the first, second and third feature maps are concatenated to form the feature maps that are input to the following inception layer or the following feature extraction module.
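A minimal PyTorch sketch of one inception layer (steps 901-904) is given below. The branch channel counts are illustrative assumptions; the patent fixes only the structure, namely 1x1, 1x1→3x3, 1x1→5x5 and pool→1x1 branches whose output feature maps are concatenated along the feature-map dimension.

```python
import torch
import torch.nn as nn

class InceptionLayer(nn.Module):
    """One inception layer as in Fig. 5; input and output are NCHW tensors of
    identical spatial size, so successive inception layers can be cascaded."""
    def __init__(self, in_ch, ch1x1=64, red3=48, ch3x3=64, red5=16, ch5x5=32, pool_proj=32):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, ch1x1, 1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, red3, 1), nn.ReLU(inplace=True),
                                     nn.Conv2d(red3, ch3x3, 3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, red5, 1), nn.ReLU(inplace=True),
                                     nn.Conv2d(red5, ch5x5, 5, padding=2), nn.ReLU(inplace=True))
        # step size 1 and padding keep the pooled maps the same size as the input
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```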
The recognizer 20
The recognizer 20 operates to calculate distances between the features extracted by the feature extractor 10 from different face images, in order to determine whether two face images are from the same identity for face verification, or to determine whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification. Fig. 8 is a schematic flowchart illustrating the recognition process in the recognizer 20. In step 201, the recognizer 20 calculates distances between the features extracted from different face images by the feature extractor 10. Then, in step 202, the recognizer 20 determines whether two face images are from the same identity for face verification, or, alternatively, in step 203, it determines whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification.
In the recognizer 20, two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, or the probe face image is determined to belong to the same identity as one of the gallery face images if their feature distance is the smallest compared to the feature distances of the probe face image to all the other gallery face images, wherein the feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
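The two decision rules can be sketched as follows; Euclidean distance is used here purely for illustration, while the patent equally allows Joint Bayesian, cosine, Hamming or other distances:

```python
import numpy as np

def verify(feat_a, feat_b, threshold):
    """Face verification: same identity if the feature distance is below a threshold."""
    return float(np.linalg.norm(feat_a - feat_b)) < threshold

def identify(probe_feat, gallery_feats):
    """Face identification: index of the gallery face with the smallest distance."""
    dists = [np.linalg.norm(probe_feat - g) for g in gallery_feats]
    return int(np.argmin(dists))
```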
In one embodiment of the present application, Joint Bayesian distances are used as the feature distances. Joint Bayesian has been a popular similarity metric for faces, which represents the extracted facial features x (after subtracting the mean) by the sum of two independent Gaussian variables

$x = \mu + \epsilon,\qquad (4)$

where μ ~ N(0, Sμ) represents the face identity and ε ~ N(0, Sε) represents the intra-personal variations. Joint Bayesian models the joint probability of two faces given the intra-personal or extra-personal variation hypothesis, P(x1, x2 | HI) and P(x1, x2 | HE). It is readily shown from Equation (4) that these two probabilities are also Gaussian, with variations

$\Sigma_{I} = \begin{bmatrix} S_{\mu} + S_{\epsilon} & S_{\mu} \\ S_{\mu} & S_{\mu} + S_{\epsilon} \end{bmatrix} \qquad (5)$

and

$\Sigma_{E} = \begin{bmatrix} S_{\mu} + S_{\epsilon} & 0 \\ 0 & S_{\mu} + S_{\epsilon} \end{bmatrix},\qquad (6)$

respectively. Sμ and Sε can be learned from data with the EM algorithm. In test, the recognizer calculates the log likelihood ratio

$r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)},\qquad (7)$

which has a closed-form solution and is efficient.
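As an illustration, the log-likelihood ratio of Equation (7) can be evaluated directly from the block covariances of Equations (5) and (6); the sketch below does so numerically rather than through the closed-form solution mentioned above, and assumes Sμ and Sε have already been estimated:

```python
import numpy as np

def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
    """Numerical log P(x1,x2|H_I) - log P(x1,x2|H_E); the constant d*log(2*pi)
    terms cancel in the ratio and are therefore omitted."""
    z = np.concatenate([x1, x2])
    S = S_mu + S_eps
    sigma_I = np.block([[S, S_mu], [S_mu, S]])                          # Equation (5)
    sigma_E = np.block([[S, np.zeros_like(S)], [np.zeros_like(S), S]])  # Equation (6)

    def log_gauss(cov):
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (logdet + z @ np.linalg.solve(cov, z))

    return log_gauss(sigma_I) - log_gauss(sigma_E)
```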
The Trainer 30
The trainer 30 is used to update the weights w on the connections between neurons in the feature extraction layers (i.e., the layers of the multi-convolution modules, the multi-inception modules and the full-connection modules) in the feature extractor 10, taking as input initial weights on the connections between neurons in the feature extraction layers of the feature extractor, a plurality of identification supervisory signals, and a plurality of verification supervisory signals. The trainer 30 aims to iteratively find a set of optimized neural weights in the deep feature extraction hierarchies for extracting identity-related features for face recognition.
As shown in Fig. 3a and Fig. 3b, the identification and verification supervisory signals in the trainer 30 are simultaneously added to each of the supervised layers in each of the feature extraction hierarchies in the feature extractor 10, and respectively back-propagated to the input face image, so as to update the weights on the connections between neurons in all the cascaded feature extraction modules.
The identification supervisory signals are generated in the trainer 30 by classifying all of the supervised layer (layers selected for supervision, which could be those in multi-convolution modules, multi-inception modules, pooling modules, or full-connection modules) representations into one of N identities, wherein the classification errors are used as the identification supervisory signals.
The verification supervisory signals in the trainer 30 are generated by verifying the supervised layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals. Given a pair of training face images, the feature extractor 10 extracts two feature vectors fi and fj from the two face images, respectively, in each of the feature extraction modules. The verification error is

$\tfrac{1}{2} \left\| f_i - f_j \right\|_2^2$

if fi and fj are features of face images of the same identity, or

$\tfrac{1}{2} \max\bigl(0,\; m - \left\| f_i - f_j \right\|_2\bigr)^2$

if fi and fj are features of face images of different identities, where ||fi − fj||2 is the Euclidean distance between the two feature vectors and m is a positive constant. There are errors if fi and fj are dissimilar for the same identity, or if fi and fj are similar for different identities.
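For illustration, the two verification errors above may be written as follows; the function name and arguments are illustrative only:

```python
import numpy as np

def verification_error(f_i, f_j, same_identity, m):
    """Verification supervisory signal for one pair of supervised-layer features."""
    d = float(np.linalg.norm(f_i - f_j))
    if same_identity:
        return 0.5 * d ** 2                      # penalize dissimilar genuine pairs
    return 0.5 * max(0.0, m - d) ** 2            # penalize similar impostor pairs
```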
Fig. 6 is a schematic flowchart illustrating the training process in the trainer 30. In step 301, the trainer 30 samples two face images and inputs them to the feature extractor 10, respectively, to get feature representations of each of the two face images in all feature extraction layers of the feature extractor 10. Then, in step 302, the trainer 30 calculates identification errors by classifying the feature representations of each face image in each supervised layer into one of a plurality of (N) identities. Simultaneously, in step 303, the trainer 30 calculates verification errors by verifying whether the feature representations of the two face images, respectively, in each supervised layer are from the same identity. The identification and verification errors are used as the identification and verification supervisory signals, respectively. In step 304, the trainer 30 back-propagates all identification and verification supervisory signals through the feature extractor 10 simultaneously, so as to update the weights on the connections between neurons in the feature extractor 10. The identification and verification supervisory signals (or errors) simultaneously added to the supervised layers are back-propagated through the cascade of feature extraction modules down to the input image. After back-propagation, the errors obtained in each layer in the cascade of feature extraction modules are accumulated, and the weights on the connections between neurons in the feature extractor 10 are updated according to the magnitude of the errors. At last, in step 305, the trainer 30 judges whether the training process has converged, and repeats steps 301-304 if a convergence point has not been reached.
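A highly simplified PyTorch sketch of one iteration of steps 301-304 is given below. The extractor is assumed to return the list of supervised-layer representations for a batch; the per-layer identity classifiers, the margin m and the unweighted sum of the losses are all illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn.functional as F

def training_step(extractor, classifier_heads, optimizer, img_a, img_b, id_a, id_b, m=1.0):
    """One iteration: identification + verification errors on a sampled image
    pair, back-propagated simultaneously to update the extractor weights."""
    feats_a, feats_b = extractor(img_a), extractor(img_b)
    loss = 0.0
    for fa, fb, head in zip(feats_a, feats_b, classifier_heads):
        # identification supervisory signal: classify into one of N identities
        loss = loss + F.cross_entropy(head(fa), id_a) + F.cross_entropy(head(fb), id_b)
        # verification supervisory signal: contrastive error on the feature pair
        d = torch.norm(fa - fb, dim=1)
        same = (id_a == id_b).float()
        loss = loss + 0.5 * (same * d ** 2 + (1 - same) * torch.clamp(m - d, min=0) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```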
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be construed as covering the preferred examples and all variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent techniques, they also fall into the scope of the present invention.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (27)

  1. An apparatus for face recognition, comprising:
    an extractor having a plurality of deep feature extraction hierarchies configured to extract recognition features from one or more input images; and
    a recognizer electronically communicated with the extractor to recognize face images of the input images based on the extracted recognition features,
    wherein each of the hierarchies comprises a plurality of multi-convolution modules and a plurality of pooling modules, and at least one of the hierarchies further comprises a plurality of multi-inception modules,
    wherein a first one of the multi-convolution or multi-inception modules extracts local features from the input images, and the following ones of the multi-convolution and multi-inception modules extract further local features from the features output from modules of the pooling modules which are coupled thereto, and wherein each of the pooling modules receives local features from respective multi-convolution modules and multi-inception modules and reduces dimensions of the received features, and
    wherein features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  2. The apparatus of claim 1, wherein each of the pooling modules is coupled between two of adjacent multi-convolution modules, between one multi-convolution module and one adjacent multi-inception module, or between two of adjacent multi-inception modules.
  3. The apparatus of claim 1, wherein each of the multi-inception modules performs a multi-scale convolutional operation on the features received from pooling modules coupled thereto and reduces dimensions of the received features,
    wherein, each of the multi-convolution modules and each of the multi-inception modules in each hierarchy are, respectively, followed by one of the pooling modules,  and each of the pooling modules is followed by one of the multi-convolution modules or one of the multi-inception modules, except for a last one of the pooling modules, a last one of the multi-convolution modules, or a last one of the multi-inception modules in each of the hierarchies.
  4. The apparatus of claim 1 or 3, wherein each of the multi-inception modules comprises a plurality of cascaded inception layers, wherein each of the inception layers is configured to perform 1x1 convolutions on the input feature maps to compress numbers thereof before larger convolution operations and after pooling operations.
  5. The apparatus of claim 4, wherein each of the inception layers comprises:
    one or more first 1x1 convolution operation layers configured to receive input feature maps from a previous one of the inception layers and perform 1x1 convolution operations on the received feature maps to compress numbers of the received feature maps;
    one or more multi-scale convolution operation layers configured to perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form a plurality of first output feature maps, where N >1;
    one or more pooling operation layers configured to receive the input feature maps from said previous inception layer to pool over local regions of the received feature maps to form locally invariant features maps;
    one or more second 1x1 convolution operation layers configured to perform 1x1 convolution operations on the locally invariant feature maps to compress numbers of the feature maps so as to obtain a plurality of second output feature maps; and
    one or more third convolution operation layers configured to receive the input feature maps from the previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers of the feature maps so as to obtain a plurality of third feature maps;
    wherein the first, second and third feature maps are stacked together to form feature maps for inputting a following inception layer of the inception layers.
  6. The apparatus of claim 1, wherein each of the multi-convolution modules comprises one or more cascaded convolution layers, each of the convolution layers receives features outputted from a previous one of the convolution layers as its input, and each of the convolution layers is configured to perform local convolution operations on the input features, wherein the convolutional layers share neural weights for the convolution operations only in local areas of the inputted images.
  7. The apparatus of claim 4, wherein one or more of the pooling modules, the multi-convolution modules, or the multi-inception modules are followed by full-connection modules for extracting global features from corresponding pooling modules, multi-convolution modules, or multi-inception modules connected thereto.
  8. The apparatus of claim 7, further comprising:
    a trainer being electronically communicated with the extractor to add supervisory signals on one or more of the pooling modules, the multi-convolution modules, the multi-inception modules, and the full-connection modules during training so as to adjust neural weights in the deep feature extraction hierarchies by back-propagating supervisory signals through the cascaded multi-convolution modules and pooling modules, or through the cascaded multi-convolution modules, pooling modules and the multi-inception modules.
  9. The apparatus of claim 8, wherein the supervisory signals comprise one identification supervisory signal and one verification supervisory signal,
    wherein the identification supervisory signal is generated by classifying features in each of the supervised modules, which are extracted from an input face region, into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and
    wherein the verification signal is generated by comparing features in each of the supervised modules, which are extracted from two input face images respectively, for determining if they are from the same person, and taking a verification error as the supervisory signal.
  10. The apparatus of claim 9, wherein each of the multi-convolution modules, the pooling modules and the multi-inception modules receives a plurality of supervisory signals which are either added on said each module or back-propagated from later feature extraction modules, wherein these supervisory signals are aggregated to adjust neural weights in each of multi-convolution modules, each of multi-inception modules, and each of full-connection modules during training.
  11. The apparatus of claim 1, wherein distances between the features from two input face images are compared to a threshold to determine if the two input face images are from the same person for face verification, or distances between features of an input query face image to features of each of face images in a face image database are computed to determine which identity in the face image database the input query face image belongs to for face identification.
  12. The apparatus of claim 11, wherein the distances between the features is one selected from a group consisting of Euclidean distances, cosine similarities, Joint Bayesian metrics, or any other distances.
  13. The apparatus of claim 7, wherein each of the deep feature extraction hierarchies comprises a different number of the multi-convolution modules, a different number of the multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or takes a different input face region to extract the features.
  14. A method for face recognition, comprising:
    extracting, by a plurality of deep feature extraction hierarchies, recognition features from one or more input images; and
    recognizing face images of the input images based on the extracted recognition features,
    wherein each of the hierarchies comprises a plurality of multi-convolution modules and a plurality of pooling modules, and at least one of the hierarchies further comprises a plurality of multi-inception modules,
    wherein the extracting further comprises:
    extracting, by a first one of the multi-convolution or multi-inception modules, local features from the input images;
    extracting, by the following ones of the multi-convolution modules and multi-inception modules, further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and multi-inception modules and reduces dimensions of the received features, and
    concatenating features obtained from all the extraction hierarchies into a feature vector as said recognition features.
  15. The method of claim 14, wherein each of the multi-inception modules has a plurality of cascaded inception layers configured to perform 1x1 convolutions on the input feature maps to compress numbers of the feature maps before larger convolution operations and after pooling operations.
  16. The method of claim 15, wherein, during the extracting, each of the inception layers operates to:
    receive input feature maps from a previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers of the feature maps;
    perform N×N convolution operations on the compressed feature maps received  from respective 1x1 convolution operation layers to form a plurality of first output feature maps, where N > 1;
    pool over local regions of the input feature maps from said previous inception layer to form locally invariant feature maps;
    perform 1x1 convolution operations on the locally invariant feature maps to compress numbers of the feature maps so as to obtain a plurality of second output feature maps;
    receive the input feature maps from the previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers thereof so as to obtain a plurality of third feature maps; and
    concatenate the first, second and third feature maps to form feature maps for inputting a following inception layer of the inception layers.
  17. The method of claim 14, wherein the recognizing further comprises:
    determining distances between the recognition features; and
    determining, in accordance with the determined distances, if two face images of the input images are from the same identity for face verification, or if one of the input images, as a probe face image, belongs to a same identity as one of gallery face images consisting of the input images for face identification.
  18. The method of claim 17, wherein the determining further comprises:
    comparing distances between the features from two input face images to a threshold to determine if the two input face images are from the same person for face verification, or
    computing distances between features of an input query face image to features of each of face images in a face image database to determine which identity in the face image database the input query face image belongs to for face identification.
  19. The method of claim 18, wherein the distances is one selected from a group  consisting of Euclidean distances, cosine similarities, Joint Bayesian metrics, or any other distances.
  20. The method of claim 15, wherein at least one of the hierarchies further comprises a plurality of full-connection modules for extracting global features from corresponding pooling modules, multi-convolution modules, or multi-inception modules connected thereto.
  21. The method of claim 20, wherein the multi-convolution modules, the multi-inception modules, the pooling modules and the full-connection modules are formed as a neural network, and the method further comprises:
    inputting two face images to the neural network, respectively, to get feature representations of each of the two face images;
    calculating identification errors by classifying feature representations of each face image in the neural network into one of a plurality of identities;
    calculating verification errors by verifying if feature representations of two face images, respectively, are from the same identity, the identification and verification errors being treated as the identification and verification supervisory signals, respectively; and
    back-propagating the identification and verification supervisory signals through the neural network simultaneously, to update neural weights on connections among the cascaded multi-convolution modules, the multi-inception modules, and the full-connection modules in the neural network.
  22. An apparatus for face recognition, comprising:
    one or more memories that store executable components; and
    one or more processors, coupled to the memories, that execute the executable components to perform operations of the apparatus, the executable components comprising:
    an extracting component having a plurality of deep feature extraction hierarchies configured to extract recognition features from one or more input images; and
    a recognizing component recognizing face images of the input images based on the extracted recognition features,
    wherein each of the hierarchies comprises a plurality of multi-convolution modules and a plurality of pooling modules, and at least one of the hierarchies further comprises a plurality of multi-inception modules,
    a first one of the multi-convolution or multi-inception modules extracts local features from the input images, and the following ones of the multi-convolution and multi-inception modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and multi-inception modules and reduces dimensions thereof, and
    wherein features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  23. The apparatus of claim 22, wherein each of the multi-inception modules performs multi-scale convolutional operation on the features received from previous coupled pooling modules and reduces dimensions of the received features.
  24. The apparatus of claim 22 or 23, wherein each of the multi-inception modules comprises a plurality of cascaded inception layers, each of the inception layers receives features outputted from a previous inception layer as its input, and is configured to perform 1x1 convolution operations on the feature maps to reduce numbers thereof.
  25. The apparatus of any one of claims 22-24, wherein each of the inception layers comprises:
    one or more first 1x1 convolution operation layers configured to receive input feature maps from a previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers thereof;
    one or more multi-scale convolution operation layers configured to perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form a plurality of first output feature maps, where N >1;
    one or more pooling operation layers configured to pool over local regions of the input feature maps from the previous inception layer to form locally invariant feature maps;
    one or more second 1x1 convolution operation layers configured to perform 1x1 convolution operations on the locally invariant feature maps to compress numbers of the feature maps so as to obtain a plurality of second output feature maps; and
    one or more third convolution operation layers configured to receive the input feature maps from the previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers thereof so as to obtain a plurality of third feature maps;
    wherein the first, second and third feature maps are stacked together to form feature maps for inputting a following inception layer of the inception layers.
  26. The apparatus of claim 22, wherein each of multi-convolution modules comprises one or more cascaded convolution layers, each of the convolution layers receives features outputted from a previous one of the convolution layers as its input, and each of the convolution layers is configured to perform local convolution operations on the input features, wherein the convolutional layers share neural weights for the convolution operations only in local areas of the inputted images.
  27. The apparatus of claim 22, wherein each of the hierarchies further comprises a plurality of full-connection modules for extracting global features from corresponding pooling modules, multi-convolution modules, or multi-inception modules connected thereto.
PCT/CN2015/000050 2015-01-27 2015-01-27 A method and a system for face recognition WO2016119076A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580074278.6A CN107209864B (en) 2015-01-27 2015-01-27 Face identification method and device
PCT/CN2015/000050 WO2016119076A1 (en) 2015-01-27 2015-01-27 A method and a system for face recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/000050 WO2016119076A1 (en) 2015-01-27 2015-01-27 A method and a system for face recognition

Publications (1)

Publication Number Publication Date
WO2016119076A1 true WO2016119076A1 (en) 2016-08-04

Family

ID=56542092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/000050 WO2016119076A1 (en) 2015-01-27 2015-01-27 A method and a system for face recognition

Country Status (2)

Country Link
CN (1) CN107209864B (en)
WO (1) WO2016119076A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798381A (en) * 2017-11-13 2018-03-13 河海大学 A kind of image-recognizing method based on convolutional neural networks
WO2018090905A1 (en) * 2016-11-15 2018-05-24 Huawei Technologies Co., Ltd. Automatic identity detection
CN108073876A (en) * 2016-11-14 2018-05-25 北京三星通信技术研究有限公司 Facial analyzing device and facial analytic method
US10282589B2 (en) 2017-08-29 2019-05-07 Konica Minolta Laboratory U.S.A., Inc. Method and system for detection and classification of cells using convolutional neural networks
CN110648316A (en) * 2019-09-07 2020-01-03 创新奇智(成都)科技有限公司 Steel coil end face edge detection algorithm based on deep learning
CN110889373A (en) * 2019-11-27 2020-03-17 中国农业银行股份有限公司 Block chain-based identity recognition method, information storage method and related device
US10621424B2 (en) 2018-03-27 2020-04-14 Wistron Corporation Multi-level state detecting system and method

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844541A (en) * 2017-10-25 2018-03-27 北京奇虎科技有限公司 Image duplicate checking method and device
CN110651273B (en) * 2017-11-17 2023-02-14 华为技术有限公司 Data processing method and equipment
CN109344779A (en) * 2018-10-11 2019-02-15 高新兴科技集团股份有限公司 A kind of method for detecting human face under ring road scene based on convolutional neural networks
US10740593B1 (en) * 2019-01-31 2020-08-11 StradVision, Inc. Method for recognizing face using multiple patch combination based on deep neural network with fault tolerance and fluctuation robustness in extreme situation
CN110598716A (en) * 2019-09-09 2019-12-20 北京文安智能技术股份有限公司 Personnel attribute identification method, device and system
WO2021098799A1 (en) * 2019-11-20 2021-05-27 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Face detection device, method and face unlock system
CN111968264A (en) * 2020-10-21 2020-11-20 东华理工大学南昌校区 Sports event time registration device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038337A (en) * 1996-03-29 2000-03-14 Nec Research Institute, Inc. Method and apparatus for object recognition
US20080014563A1 (en) * 2004-06-04 2008-01-17 France Teleom Method for Recognising Faces by Means of a Two-Dimensional Linear Disriminant Analysis
US8345962B2 (en) * 2007-11-29 2013-01-01 Nec Laboratories America, Inc. Transfer learning methods and systems for feed-forward visual recognition systems
CN103530657A (en) * 2013-09-26 2014-01-22 华南理工大学 Deep learning human face identification method based on weighting L2 extraction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SUN, YI ET AL.: "Deep learning Face Representation by Joint Identification-Verification", CORNELL UNIVERSITY LIBRARY, 18 June 2014 (2014-06-18), Retrieved from the Internet <URL:https://arxiv.org/abs/1406.4773> *
SUN, YI ET AL.: "Deep Learning Face Representation from Predicting 10000 Classes", 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 28 June 2014 (2014-06-28), pages 1891 - 1898 *
SUN, YI ET AL.: "Hybrid Deep learning for Face Verification", 2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 8 October 2013 (2013-10-08), pages 1489 - 1496 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073876A (en) * 2016-11-14 2018-05-25 北京三星通信技术研究有限公司 Facial analyzing device and facial analytic method
CN108073876B (en) * 2016-11-14 2023-09-19 北京三星通信技术研究有限公司 Face analysis device and face analysis method
WO2018090905A1 (en) * 2016-11-15 2018-05-24 Huawei Technologies Co., Ltd. Automatic identity detection
US10460153B2 (en) 2016-11-15 2019-10-29 Futurewei Technologies, Inc. Automatic identity detection
US10282589B2 (en) 2017-08-29 2019-05-07 Konica Minolta Laboratory U.S.A., Inc. Method and system for detection and classification of cells using convolutional neural networks
CN107798381A (en) * 2017-11-13 2018-03-13 河海大学 A kind of image-recognizing method based on convolutional neural networks
CN107798381B (en) * 2017-11-13 2021-11-30 河海大学 Image identification method based on convolutional neural network
US10621424B2 (en) 2018-03-27 2020-04-14 Wistron Corporation Multi-level state detecting system and method
CN110648316A (en) * 2019-09-07 2020-01-03 创新奇智(成都)科技有限公司 Steel coil end face edge detection algorithm based on deep learning
CN110889373A (en) * 2019-11-27 2020-03-17 中国农业银行股份有限公司 Block chain-based identity recognition method, information storage method and related device
CN110889373B (en) * 2019-11-27 2022-04-08 中国农业银行股份有限公司 Block chain-based identity recognition method, information storage method and related device

Also Published As

Publication number Publication date
CN107209864B (en) 2018-03-30
CN107209864A (en) 2017-09-26

Similar Documents

Publication Publication Date Title
WO2016119076A1 (en) A method and a system for face recognition
CN112561027B (en) Neural network architecture search method, image processing method, device and storage medium
CN112597941B (en) Face recognition method and device and electronic equipment
EP3732619B1 (en) Convolutional neural network-based image processing method and image processing apparatus
CN112446270B (en) Person re-identification network training method, person re-identification method and device
WO2019228317A1 (en) Face recognition method and device, and computer readable medium
Paisitkriangkrai et al. Pedestrian detection with spatially pooled features and structured ensemble learning
US9811718B2 (en) Method and a system for face verification
JP6345276B2 (en) Face authentication method and system
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN110175671A (en) Construction method, image processing method and the device of neural network
Parashar et al. Deep learning pipelines for recognition of gait biometrics with covariates: a comprehensive review
WO2016086330A1 (en) A method and a system for face recognition
WO2014205231A1 (en) Deep learning framework for generic object detection
CN111914908A (en) Image recognition model training method, image recognition method and related equipment
CN111414875B (en) 3D Point Cloud Head Pose Estimation System Based on Depth Regression Forest
CN114358205B (en) Model training method, model training device, terminal device and storage medium
Imani et al. Neural computation for robust and holographic face detection
CN106803054B (en) Faceform's matrix training method and device
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN113536970A (en) A training method and related device for a video classification model
US20240143977A1 (en) Model training method and apparatus
Liu et al. Self-constructing graph convolutional networks for semantic labeling
CN113762249B (en) Image attack detection and image attack detection model training method and device
CN108496174B (en) Method and system for face recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15879288

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15879288

Country of ref document: EP

Kind code of ref document: A1