
WO2016119076A1 - A method and a system for face recognition

A method and a system for face recognition

Info

Publication number
WO2016119076A1
Authority
WO
WIPO (PCT)
Prior art keywords
modules
convolution
inception
features
feature maps
Prior art date
Application number
PCT/CN2015/000050
Other languages
French (fr)
Inventor
Xiaoou Tang
Xiaogang Wang
Yi Sun
Original Assignee
Xiaoou Tang
Priority date
Filing date
Publication date
Application filed by Xiaoou Tang filed Critical Xiaoou Tang
Priority to CN201580074278.6A priority Critical patent/CN107209864B/en
Priority to PCT/CN2015/000050 priority patent/WO2016119076A1/en
Publication of WO2016119076A1 publication Critical patent/WO2016119076A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to a method for face recognition and a system thereof.
  • DeepFace and DeepID were independently proposed to learn identity-related facial features through large-scale face identification tasks. DeepID2 made an additional improvement by learning deep facial features with joint face identification-verification tasks. DeepID2+ further improves DeepID2 by increasing the feature dimensions in each layer and adding joint identification-verification supervisory signals to previous feature extraction layers. DeepID2+ achieved the current state-of-the-art face recognition results on a number of widely evaluated face recognition datasets.
  • the network structure of DeepID2+ is still similar to conventional convolutional neural networks with interlacing convolutional and pooling layers.
  • VGG net and GoogLeNet are two representatives.
  • VGG net proposes to use continuous convolutions with small convolutional kernels. In particular, it stacks two or three layers of 3x3 convolutions together between every two pooling layers.
  • GoogLeNet incorporates multi-scale convolutions and pooling into a single feature extraction layer coined inception. To learn efficient features, an inception layer also introduces 1x1 convolutions to reduce the number of feature maps before larger convolutions and after pooling.
  • an apparatus for face recognition may comprise an extractor having a plurality of deep feature extraction hierarchies, the hierarchies extract recognition features from one or more input images; and a recognizer being electronically communicated with the extractor and recognizing face images of the input images based on the extracted recognition features.
  • each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1.
  • a first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features.
  • the features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  • each of the pooling modules is coupled between two adjacent multi-convolution modules, between one multi-convolution module and one adjacent multi-inception module, or between two adjacent multi-inception modules.
  • each of the hierarchies further comprises one or more multi-inception modules.
  • Each of the multi-inception modules performs multi-scale convolutional operation on the features received from previous coupled pooling modules and reduces dimensions of the received features.
  • Each of multi-convolution and multi-inception modules in each hierarchy is followed by one of the pooling modules, and each pooling module is followed by a multi-convolution module or a multi-inception module, except for a last pooling module, a last multi-convolution module, or a last multi-inception module in the hierarchy.
  • each of the multi-inception modules may comprise a plurality of cascaded inception layers.
  • Each of inception layers receives features outputted from a previous inception layer as its input, and the inception layers are configured to perform multi-scale convolution operations and pooling operations on the received features to obtain multi-scale convolutional feature maps and locally invariant feature maps, and perform 1x1 convolution operations before multi-scale convolution operations and after pooling operations to reduce dimensions of features before multi-scale convolution operations and after pooling operations.
  • the obtained multi-scale convolutional feature maps and the obtained locally invariant feature maps are stacked together to form input feature maps of the layer that follows.
  • each of the inception layers comprises: one or more first 1x1 convolution operation layers configured to receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps; and one or more multi-scale convolution operation layers configured to perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form first output feature maps, where N>1.
  • One or more pooling operation layers are configured to pool over local regions of the input feature maps from the previous layer to form locally invariant feature maps; and one or more second 1x1 convolution operation layers are configured to perform 1x1 convolution operations on the locally invariant feature maps received from the pooling operation layers to compress a number of feature maps so as to obtain second output feature maps.
  • One or more third convolution operation layers are configured to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps to obtain third feature maps.
  • the first, second and third feature maps are stacked together to form feature maps for inputting a following inception layer of the inception layers or inputting a next feature extraction module.
  • each of multi-convolution modules may comprise one or more cascaded convolution layers, each of the convolution layers receives features outputted from a previous convolution layer as its input, and each of the convolution layers is configured to perform local convolution operations on inputted features, wherein the convolutional layers share neural weights for the convolution operations only in local areas of the inputted images.
  • a trainer may be electronically communicated with the extractor to add supervisory signals on the feature extraction unit during training so as to adjust neural weights in the deep feature extraction hierarchies by back-propagating supervisory signals through the cascaded multi-convolution modules and pooling modules, or through the cascaded multi-convolution modules, pooling modules and the multi-inception modules.
  • the supervisory signals comprise one identification supervisory signal and one verification supervisory signal, wherein the identification supervisory signal is generated by classifying features in any of the modules extracted from an input face region into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and wherein the verification signal is generated by comparing features in any of the modules extracted from two input face images respectively for determining if they are from the same person, and taking a verification error as the supervisory signal.
  • each of the multi-convolution modules, the pooling modules and the multi-inception modules receives a plurality of supervisory signals which are either added on said each module or back-propagated from later feature extraction modules. These supervisory signals are aggregated to adjust neural weights in each of multi-convolution and multi-inception modules during training.
  • each of the deep feature extraction hierarchies may comprise a different number of the multi-convolution modules, a different number of the multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or takes a different input face region to extract the features.
  • a method for face recognition comprising: extracting, by an extractor having a plurality of deep feature extraction hierarchies, recognition features from one or more input images; and recognizing face images of the input images based on the extracted recognition features, wherein each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1.
  • a first one of the multi-convolution modules extracts local features from the input images, the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features.
  • Features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  • each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers
  • the extracting further comprises: performing, by each of the inception layers, convolution operations on the received features to obtain multi-scale convolutional feature maps, and performing, by said each of the inception layers, pooling operations on the received features to obtain pooled feature maps (i.e. to pool over local regions of the feature maps received from the previous layer to form locally invariant feature maps) , wherein the obtained multi-scale convolutional feature maps and the pooled feature maps are stacked together to form input feature maps of the layer that follows.
  • each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers, and wherein, during the extracting, each of the inception layers operates to: receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps; perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form first output feature maps, where N>1; perform pooling operations on the received features from said previous layer (i.e. to pool over local regions of the input feature maps from the previous layer to form locally invariant feature maps); perform 1x1 convolution operations on the pooled feature maps to compress a number of feature maps so as to obtain second output feature maps; receive the input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps so as to obtain third feature maps; and concatenate the first, second and third feature maps to form feature maps for inputting a following inception layer or a next feature extraction module.
  • an apparatus for face recognition which may comprise: one or more memories that stores executable components; and one or more processors, coupled to the memories, that executes the executable components to perform operations of the apparatus, the executable components comprising:
  • an extracting component having a plurality of deep feature extraction hierarchies configured to extract recognition features from one or more input images
  • a recognizing component recognizing face images of the input images based on the extracted recognition features
  • each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1,
  • a first one of the multi-convolution modules extracts local features from the input images, the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features, and
  • Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.
  • Fig. 2 is a schematic diagram illustrating an apparatus for face recognition when it is implemented in software, consistent with some disclosed embodiments.
  • Fig. 3a and 3b are two schematic diagrams illustrating two examples of deep feature extraction hierarchies in the feature extraction unit as shown in Fig. 1.
  • Fig. 4a is a schematic diagram illustrating structures of a multi-convolution module, consistent with some disclosed embodiments.
  • Fig. 4b is a schematic diagram illustrating structures of a multi-inception module in deep feature extraction hierarchies, consistent with some disclosed embodiments.
  • Fig. 5 is a schematic diagram illustrating structures of an inception layer in multi-inception modules, consistent with some disclosed embodiments.
  • Fig. 6 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 7 is a schematic flowchart illustrating the extractor as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 8 is a schematic flowchart illustrating the recognizer as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5 consistent with some disclosed embodiments.
  • the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit, ” “module” or “system. ” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
  • the apparatus 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
  • the apparatus 1000 may include one or more processors (processors 102, 104, 106 etc. ) , a memory 112, a storage device 116, a communication interface 114, and a bus to facilitate information exchange among various components of apparatus 1000.
  • Processors 102-106 may include a central processing unit ( “CPU” ) , a graphic processing unit ( “GPU” ) , or other suitable information processing devices.
  • processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods or run the modules that will be explained in greater detail below.
  • Memory 112 can include, among other things, a random access memory ( “RAM” ) and a read-only memory ( “ROM” ) .
  • Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106.
  • memory 112 may store one or more software applications.
  • memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106 to carry out the functions as disclosed below for the apparatus 1000. It is noted that although only one block is shown in Fig. 1, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
  • the apparatus 1000 may comprise an extractor 10 and a recognizer 20.
  • the extractor 10 is configured with a plurality of deep feature extraction hierarchies, which may be formed as a neural network configured or trained to extract recognition features from one or more input images.
  • the recognizer 20 is electronically communicated with the extractor 10 and recognizes face images of the input images based on the extracted recognition features.
  • each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1.
  • a first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features.
  • the features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  • the apparatus 1000 may further comprise a trainer 30 used to train the neural network.
  • the feature extractor 10 contains a plurality of deep feature extraction hierarchies. Each of the feature extraction hierarchies is a cascade of feature extraction modules.
  • Fig. 7 is a schematic flowchart illustrating the feature extraction process in the extractor 10, which contains three steps.
  • In step 101, the feature extractor 10 forward propagates an input face image through each of the deep feature extraction hierarchies, respectively.
  • In step 102, the extractor 10 takes the representations outputted by each of the deep feature extraction hierarchies as features.
  • In step 103, it concatenates the features of all deep feature extraction hierarchies.
  • each of the deep feature extraction hierarchies may include a plurality of multi-convolution modules, a plurality of multi-inception modules, a plurality of pooling modules, and a plurality of full-connection modules.
  • Each of the deep feature extraction hierarchies may contain a different number of cascaded multi-convolution modules, a different number of multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or may take a different input face region to extract features.
  • Fig. 3a illustrates an example of feature extraction hierarchies in the extractor 10. As shown in Fig. 3a, each of the deep feature extraction hierarchies contains alternate multi-convolution modules 21-1, 21-2, 21-3... and pooling modules 22-1, 22-2, 22-3.... For purpose of description, four multi-convolution modules 21-1, 21-2, 21-3 and 21-4 and three pooling modules 22-1, 22-2 and 22-3 are illustrated in Fig. 3a as an example.
  • Fig. 4a is a schematic diagram illustrating structures of each of the multi-convolution modules 21-1, 21-2, 21-3.... As shown, each multi-convolution module contains a plurality of cascaded convolutional layers. Fig. 4a shows an example of three cascaded convolutional layers (convolutional layers 1-3). However, in the present application, a multi-convolution module could contain any number of convolutional layers such as one, two, three, or more. In the extreme of a multi-convolution module containing only one convolutional layer, it degrades to a conventional convolution module. Therefore, multi-convolution modules are generalizations of conventional convolution modules. Likewise, a multi-inception module contains one or more cascaded inception layers.
  • the convolutional layers in a multi-convolution module are configured to extract local facial features from input feature maps (which is output feature maps of the previous layer) to form output feature maps of the current layer.
  • each convolutional layer performs convolution operations on the input feature maps to form output feature maps of the current layer, and the formed output feature maps will be input to the next convolutional layer.
  • Each feature map is a certain kind of features organized in 2D.
  • the features in the same output feature map or in local regions of the same feature map are extracted from input feature maps with the same set of neural connection weights.
  • the convolution operation in each convolutional layer may be expressed as y_j^(r) = max(0, b_j^(r) + Σ_i k_ij^(r) * x_i^(r)) (formula 1), wherein
  • x_i and y_j are the i-th input feature map and the j-th output feature map, respectively;
  • k_ij is the convolution kernel between the i-th input feature map and the j-th output feature map, and * denotes convolution;
  • b_j is the bias of the j-th output feature map, and max(0, ·) is the ReLU non-linearity; and
  • r indicates a local region where weights are shared. In one extreme of the local region r corresponding to entire input feature maps, convolution becomes global convolution. In another extreme of the local region r corresponding to a single pixel in input feature maps, a convolutional layer degrades to a local-connection layer.
  • 1x1 convolution operations may be carried out in inception layers (as shown in Fig. 4) to compress the number of feature maps by setting the number of output feature maps significantly smaller than the number of input feature maps, which will be discussed below.
  • Each of the pooling modules 22-1, 22-2... aims to reduce feature dimensions and form more invariant features.
  • the goal of cascading multiple convolution/inception layers is to extract hierarchical local features (i.e. features extracted from local regions of the input images or the input features) , wherein features extracted by higher convolution/inception layers have larger effective receptive field on input images and more complex non-linearity.
  • the pooling modules 22-1, 22-2... are configured to pool local facial features from input feature maps from previous layer to form output feature maps of the current layer.
  • Each of the pooling modules 22-1, 22-2... receives the feature maps from the respective connected multi-convolution/multi-inception module and then reduces the feature dimensions of the received feature maps and forms more invariant features by pooling operations, which may be formulated as y^i_(j,k) = max_(0≤m<M, 0≤n<N) x^i_(j·s+m, k·s+n) (formula 2), wherein
  • each neuron in the i-th output feature map y_i pools over an M × N local region in the i-th input feature map x_i, with s as the step size.
  • the feature maps with the reduced dimensions are then input to the next cascaded convolutional module.
  • each of pooling modules is in addition followed by a full-connection module 23 (23-1, 23-2 and 23-3) .
  • Features extracted in the three full-connection modules 23-1, 23-2 and 23-3 and the last multi-convolution module 21-4 (multi-convolution module 4) are supervised by supervisory signals.
  • Features in the last multi-convolution module 21-4 are used for face recognition.
  • the full-connection modules 23-1, 23-2 and 23-3 in deep feature extraction hierarchies are configured to extract global features (features extracted from the entire region of input feature maps) from previous feature extraction modules, i.e. the pooling modules 22-1, 22-2 and 22-3.
  • the fully-connected layers also serve as interfaces for receiving supervisory signals during training, which will be discussed later.
  • the full-connection modules 23-1, 23-2 and 23-3 also have the function of feature dimension reduction as pooling modules 22-1, 22-2 and 22-3 by restricting the number of neurons in them.
  • the fully-connection modules 23-1, 23-2 and 23-3 may be formulated as y_j = max(0, Σ_i w_(i,j) · x_i) (formula 3), wherein
  • x denotes neural outputs (features) from the cascaded pooling module,
  • y denotes neural outputs (features) in the current fully-connection module, and
  • w denotes neural weights in the current feature extraction module (the current fully-connection module). Neurons in fully-connection modules linearly combine features in the previous feature extraction module, followed by ReLU non-linearity.
  • a feature extraction unit may contain a plurality of the deep feature extraction hierarchies.
  • Features in top feature extraction modules in all deep feature extraction hierarchies are concatenated into a long feature vector as a final feature representation for face recognition.
  • branching-out modules serve as interfaces for receiving supervisory signals during training, which will be disclosed later.
  • the top feature extraction module which extracts features for face recognition
  • all branching-out modules will be discarded and only the module cascade for extracting features for face recognition is reserved in test.
  • the hierarchy contains two multi-convolution modules 21-1 and 21-2, each of which is followed by a pooling module 22 (22-1 or 22-2) .
  • the multi-convolution module 21-1 is connected to an input face image as an input layer, and is configured to extract local facial features (i.e. features extracted from local regions of the input images) from input images by rule of formulation 1) .
  • the pooling module 22-1 is configured to pool local facial features from the previous layer (the multi-convolution module 21-1 ) to form output feature maps of the current layer. To be specific, the pooling module 22-1 receives the feature maps from the respectively connected convolutional module and then reduces the feature dimensions of the received feature maps, and forms more invariant features by pooling operations, which is formulated as formulation 2) .
  • each feature map is a certain kind of features organized in 2D.
  • the feature extraction hierarchy further comprises two multi-inception modules 24-1 and 24-2, each of which is followed by a pooling module 22 (22-3 or 22-4) .
  • Fig. 4b shows an example of three cascaded inception layers 1-3 in each of the multi-inception modules 24-1 and 24-2.
  • the goal of cascading the inception layers is to extract multi-scale local features by incorporating convolutions of various kernel sizes as well as local pooling operations in a single layer.
  • the features extracted by higher convolution/inception layers have larger effective receptive field on input images and more complex non-linearity.
  • each of the inception layers comprises one or more first 1x1 convolution operation layers 241, one or more second 1x1 convolution operation layers 242, one or more multi-convolution operation layers (N×N convolution, N>1) 243, one or more pooling operation layers 244, and one or more third 1x1 convolution operation layers 245.
  • the number of the 1x1 convolution operation layers 241 is the same as that of the multi-scale convolution operation layers 243, and each layer 243 is coupled to a corresponding layer 241.
  • the number of the third 1x1 convolution operation layers 245 is the same as that of the pooling layers 244.
  • the second 1x1 convolution operation layers 242 are coupled to the previous inception layer.
  • the 1x1 convolution layers 241 and 245 are used to make computation efficient before the operations of the multi-convolution operation layers 243 and after the pooling operation layers 244, respectively, which will be discussed below.
  • Fig. 5 just shows two first 1x1 convolution operation layers 241, one second 1x1 convolution operation layer 242, one third 1x1 convolution operation layer 245 and two multi-scale convolution operation layers 243, but the invention is not limited thereto.
  • the inception layer combines convolution operations with convolutional kernel sizes of 1x1, 3x3, and 5x5, as well as pooling operations by rule of formulation 2.
  • the first 1x1 convolution layers 241 are used to make computation efficient before 3x3 and 5x5 convolutions.
  • the number of output feature maps of a 1x1 convolution layer is set to be much smaller than its input feature maps.
  • Since 3x3 and 5x5 convolutions take output feature maps of 1x1 convolutions as their input feature maps, the number of input feature maps of the 3x3 and 5x5 convolutions becomes much smaller. In this way, computations in 3x3 and 5x5 convolutions are reduced significantly.
  • the 1x1 convolution 245 after pooling helps reduce the number of output feature maps of pooling. Since output feature maps of 1x1, 3x3, and 5x5 convolutions are concatenated to form input feature maps of the next layer, a small number of output feature maps of 1x1 convolutions reduces the total number of output feature maps, and therefore reduces computation in next layer.
  • the 1x1 convolution itself does not take the majority computation due to the extremely small convolutional kernel size.
  • Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5 consistent with some disclosed embodiments.
  • each of 1x1 convolution operation layers 241 operates to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received features maps to compress a number of feature maps by rule of formula 1) as stated in the above.
  • the multi-scale convolution operation layer 243 performs N ⁇ N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layer 241 to form a plurality of first output feature maps.
  • the pooling operation layer 244 operates to receive the input feature maps from the previous layer and perform pooling operations on the received feature maps by rule of formula 2) .
  • the pooling operations in inception layers aim to pool over local regions of input feature maps to form locally invariant features as stated in the above.
  • pooling in inception layers may not reduce feature dimensions, which is achieved by setting step-size s equal to 1 by rule of formula 2.
  • the third 1x1 convolution operation layers 245 operate to perform 1x1 convolution operations on the features maps received from the pooling operation layer 244 to compress numbers of the feature maps by rule of formula 1) as stated in the above so as to obtain a plurality of second output feature maps.
  • the second 1x1 convolution operation layers 242 operate to receive the input feature maps from the previous layer and perform 1x1 convolution operations on the received features maps to compress numbers of the feature maps by rule of formula 1) so as to obtain a plurality of third feature maps.
  • the first, second and third feature maps are concatenated to form feature maps for inputting the following inception layer or inputting the following feature extraction module.
  • the recognizer 20 operates to calculate distances between features for different face images extracted by the feature extractor 10, to determine if two face images are from the same identity for face verification, or to determine if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images consisting of the input images for face identification.
  • Fig. 8 is a schematic flowchart illustrating the recognition process in the recognizer 20. In step 201, the recognizer 20 calculates distances between features extracted from different face images by the feature extractor 10.
  • In step 202, the recognizer 20 determines if two face images are from the same identity for face verification, or, alternatively, in step 203, it determines if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images consisting of the input images for face identification.
  • two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, or the probe face image is determined to belong to the same identity as one of gallery face images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images, wherein feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
  • Joint Bayesian distances are used as feature distances.
  • Joint Bayesian has been a popular similarity metric of faces, which represents the extracted facial features x (after subtracting the mean) by the sum of two independent Gaussian variables, x = μ + ε, where μ ~ N(0, S_μ) represents the face identity and ε ~ N(0, S_ε) represents the intra-personal variation.
  • S_μ and S_ε can be learned from data with the EM algorithm. In test, it calculates the likelihood ratio r(x_1, x_2) = log [P(x_1, x_2 | H_I) / P(x_1, x_2 | H_E)] between the intra-personal hypothesis H_I (same identity) and the extra-personal hypothesis H_E (different identities). (A numerical sketch of this ratio is given after this list.)
  • The Trainer 30
  • the trainer 30 is used to update the weights w on connections between neurons in feature extraction layers (i.e. the layers of the multi-convolution modules, the multi-inception modules and the full connection modules) in the feature extractor 10 by inputting initial weights on connections between neurons in feature extraction layers in the feature extractor, a plurality of identification supervisory signals, and a plurality of verification supervisory signals.
  • the trainer 30 aims to iteratively find a set of optimized neural weights in deep feature extraction hierarchies for extracting identity-related features for face recognition.
  • the identification and verification supervisory signals in the trainer 30 are simultaneously added to each of the supervised layers in each of the feature extraction hierarchies in the feature extractor 10,and respectively back-propagated to the input face image, so as to update the weights on connections between neurons in all the cascaded feature extraction modules.
  • the identification supervisory signals are generated in the trainer 30 by classifying all of the supervised layer (layers selected for supervision, which could be those in multi-convolution modules, multi-inception modules, pooling modules, or full-connection modules) representations into one of N identities, wherein the classification errors are used as the identification supervisory signals.
  • the verification supervisory signals in the trainer 30 are generated by verifying the supervised layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals.
  • the feature extractor 10 extracts two feature vectors f i and f j from the two face images respectively in each of the feature extraction modules.
  • the verification error is (1/2) ||f_i - f_j||_2^2 if f_i and f_j are features of face images of the same identity, or (1/2) max(0, m - ||f_i - f_j||_2)^2 if f_i and f_j are features of face images of different identities, where ||·||_2 is the Euclidean distance of the two feature vectors, and m is a positive constant value (the margin).
  • In other words, the verification error is large if f_i and f_j are dissimilar for the same identity, or if f_i and f_j are similar for different identities.
  • Fig. 6 is a schematic flowchart illustrating the training process in the trainer 30.
  • the trainer 30 samples two face images and inputs them to the feature extractor 10, respectively, to get feature representations of each of the two face images in all feature extraction layers of the feature extractor 10.
  • the trainer 30 calculates identification errors by classifying feature representations of each face image in each supervised layer into one of a plurality of (N) identities.
  • the trainer 30 calculates verification errors by verifying if feature representations of two face images, respectively, in each supervised layer are from the same identity.
  • the identification and verification errors are used as identification and verification supervisory signals, respectively.
  • In step 304, the trainer 30 back-propagates all identification and verification supervisory signals through the feature extractor 10 simultaneously, so as to update the weights on connections between neurons in the feature extractor 10.
  • Identification and verification supervisory signals (or errors) simultaneously added to supervised layers are back-propagated through the cascade of feature extraction modules until the input image. After back-propagation, the errors obtained in each layer in the cascade of feature extraction modules are accumulated. Weights on connections between neurons in the feature extractor 10 are updated according to the magnitude of the errors.
  • In step 305, the trainer 30 judges if the training process has converged, and repeats steps 301-304 if a convergence point has not been reached.
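As a numerical illustration of the Joint Bayesian likelihood ratio mentioned above, the following Python sketch evaluates the ratio directly from the block covariances implied by x = μ + ε. This is not taken from the application: the direct multivariate-normal evaluation, the library calls, and the toy covariances are illustrative assumptions (practical implementations typically use a closed-form expression instead).

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
    """Log-likelihood ratio log P(x1, x2 | same identity) - log P(x1, x2 | different identities)
    for mean-subtracted features x = mu + eps, with mu ~ N(0, S_mu) and eps ~ N(0, S_eps)."""
    d = len(x1)
    z = np.concatenate([x1, x2])
    intra = np.block([[S_mu + S_eps, S_mu],
                      [S_mu, S_mu + S_eps]])              # the two faces share the same mu
    extra = np.block([[S_mu + S_eps, np.zeros((d, d))],
                      [np.zeros((d, d)), S_mu + S_eps]])  # independent identities
    zero = np.zeros(2 * d)
    return (multivariate_normal.logpdf(z, mean=zero, cov=intra)
            - multivariate_normal.logpdf(z, mean=zero, cov=extra))

# Toy usage with random placeholder covariances and features.
rng = np.random.default_rng(0)
d = 4
A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))
S_mu, S_eps = A @ A.T + np.eye(d), B @ B.T + np.eye(d)    # positive definite by construction
print(joint_bayesian_ratio(rng.standard_normal(d), rng.standard_normal(d), S_mu, S_eps))
```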

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed is an apparatus for face recognition. The apparatus may comprise an extractor having a plurality of deep feature extraction hierarchies, the hierarchies extract recognition features from one or more input images; and a recognizer being electronically communicated with the extractor and recognizing face images of the input images based on the extracted recognition features.

Description

A METHOD AND A SYSTEM FOR FACE RECOGNITION Technical Field
The present application relates to a method for face recognition and a system thereof.
Background
Learning effective deep face representations using deep neural networks has become a very promising approach to face recognition. With better deep network structures and supervisory methods, face recognition accuracy has been boosted rapidly in recent years. DeepFace and DeepID were independently proposed to learn identity-related facial features through large-scale face identification tasks. DeepID2 made an additional improvement by learning deep facial features with joint face identification-verification tasks. DeepID2+ further improves DeepID2 by increasing the feature dimensions in each layer and adding joint identification-verification supervisory signals to previous feature extraction layers. DeepID2+ achieved the current state-of-the-art face recognition results on a number of widely evaluated face recognition datasets. However, the network structure of DeepID2+ is still similar to conventional convolutional neural networks with interlacing convolutional and pooling layers.
In general object recognition domain, there have been a few successful attempts to improve upon conventional convolutional neural networks. VGG net and GoogLeNet are two representatives. VGG net proposes to use continuous convolutions with small convolutional kernels. In particular, it stacks two or three layers of 3x3 convolutions together between every two pooling layers. GoogLeNet incorporates multi-scale convolutions and pooling into a single feature extraction layer coined inception. To learn efficient features, an inception layer also introduces 1x1 convolutions to reduce the number of feature maps before larger convolutions and after pooling.
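To see why these 1x1 reductions matter, consider a rough multiply count for a 5x5 convolution with and without a preceding 1x1 channel reduction. The Python sketch below uses illustrative map and channel sizes; they are assumptions for the example, not values from GoogLeNet or from this application.

```python
# Rough multiply counts for a 5x5 convolution on an H x W feature map,
# with and without a 1x1 channel reduction first.
H = W = 28
c_in, c_reduced, c_out = 192, 32, 64                 # illustrative channel counts

direct = H * W * c_out * (c_in * 5 * 5)              # 5x5 applied directly to c_in maps
reduced = (H * W * c_reduced * c_in                  # 1x1 reduction to c_reduced maps
           + H * W * c_out * (c_reduced * 5 * 5))    # 5x5 on the reduced maps

print(direct, reduced)   # ~240.8M vs ~45.0M multiplies, roughly a 5x saving
```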
Summary
In one aspect of the present application, disclosed is an apparatus for face recognition. The apparatus may comprise an extractor having a plurality of deep feature extraction hierarchies, the hierarchies extract recognition features from one or more input images; and a recognizer being electronically communicated with the extractor and recognizing face images of the input images based on the extracted recognition features.
In one embodiment of the present application, each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1. A first one of the multi-convolution modules extracts local features from the input images, and the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features. The features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
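For illustration only, such a hierarchy and the concatenation of features across hierarchies could be sketched as below in Python with PyTorch. The module counts, channel widths, crop sizes, ReLU placement, and max pooling are assumptions made for the sketch rather than parameters disclosed by the application.

```python
import torch
import torch.nn as nn

class MultiConvModule(nn.Module):
    """A cascade of small convolutions with ReLU, in the spirit of Fig. 4a (layer count is illustrative)."""
    def __init__(self, in_ch, out_ch, num_layers=2):
        super().__init__()
        layers = []
        for i in range(num_layers):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

class FeatureHierarchy(nn.Module):
    """Alternating multi-convolution and pooling modules, in the spirit of Fig. 3a;
    the last multi-convolution module is not followed by pooling in this sketch."""
    def __init__(self, channels=(3, 32, 64, 96)):
        super().__init__()
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [MultiConvModule(c_in, c_out), nn.MaxPool2d(kernel_size=2)]
        self.blocks = nn.Sequential(*blocks[:-1])   # drop the final pooling module

    def forward(self, x):
        return torch.flatten(self.blocks(x), start_dim=1)

# Features from several hierarchies (e.g. different face crops) are concatenated
# into a single recognition feature vector.
hierarchies = [FeatureHierarchy() for _ in range(3)]
crops = [torch.randn(1, 3, 56, 56) for _ in range(3)]          # placeholder face regions
recognition_features = torch.cat([h(c) for h, c in zip(hierarchies, crops)], dim=1)
```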
In one embodiment of the present application, each of the pooling modules is coupled between two adjacent multi-convolution modules, between one multi-convolution module and one adjacent multi-inception module, or between two adjacent multi-inception modules.
In one embodiment of the present application, each of the hierarchies further comprises one or more multi-inception modules. Each of the multi-inception modules performs multi-scale convolution operations on the features received from the previously coupled pooling modules and reduces dimensions of the received features. Each of the multi-convolution and multi-inception modules in each hierarchy is followed by one of the pooling modules, and each pooling module is followed by a multi-convolution module or a multi-inception module, except for a last pooling module, a last multi-convolution module, or a last multi-inception module in the hierarchy.
As an example, each of the multi-inception modules may comprise a plurality of cascaded inception layers. Each of the inception layers receives features outputted from a previous inception layer as its input, and the inception layers are configured to perform multi-scale convolution operations and pooling operations on the received features to obtain multi-scale convolutional feature maps and locally invariant feature maps, and to perform 1x1 convolution operations before the multi-scale convolution operations and after the pooling operations to reduce the dimensions of the features at those points. The obtained multi-scale convolutional feature maps and the obtained locally invariant feature maps are stacked together to form input feature maps of the layer that follows.
In particular, each of the inception layers comprises: one or more first 1x1 convolution operation layers configured to receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps; and one or more multi-scale convolution operation layers configured to perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form first output feature maps, where N>1. One or more pooling operation layers are configured to pool over local regions of the input feature maps from the previous layer to form locally invariant feature maps; and one or more second 1x1 convolution operation layers are configured to perform 1x1 convolution operations on the locally invariant feature maps received from the pooling operation layers to compress a number of feature maps so as to obtain second output feature maps. One or more third convolution operation layers are configured to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps to obtain third feature maps. The first, second and third feature maps are stacked together to form feature maps for inputting a following inception layer of the inception layers or inputting a next feature extraction module.
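A minimal sketch of such an inception layer is shown below; it assumes 3x3 and 5x5 branches and max pooling, and the branch widths and ReLU placement are illustrative choices rather than parameters from the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InceptionLayer(nn.Module):
    """Four branches: direct 1x1, 1x1 then 3x3, 1x1 then 5x5, and pooling then 1x1.
    Branch outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)                      # direct 1x1 branch
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, kernel_size=1),   # 1x1 reduction
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, kernel_size=1),   # 1x1 reduction
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
        self.pool_proj = nn.Conv2d(in_ch, pool_proj, kernel_size=1)             # 1x1 after pooling

    def forward(self, x):
        pooled = F.max_pool2d(x, kernel_size=3, stride=1, padding=1)            # stride 1: no downsampling
        branches = [self.branch1(x), self.branch3(x), self.branch5(x), self.pool_proj(pooled)]
        return torch.cat([F.relu(b) for b in branches], dim=1)

# Example: 64 input maps -> 32 + 32 + 16 + 16 = 96 output maps.
layer = InceptionLayer(64, c1=32, c3_red=16, c3=32, c5_red=8, c5=16, pool_proj=16)
out = layer(torch.randn(1, 64, 14, 14))
```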
In one embodiment of the present application, each of multi-convolution modules may comprise one or more cascaded convolution layers, each of the convolution layers receives features outputted from a previous convolution layer as its input, and each of the convolution layers is configured to perform local convolution operations on inputted features, wherein the convolutional layers share neural weights for the convolution operations only in local areas of the inputted images.
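A rough sketch of convolution with locally shared weights follows. It is an approximation made for illustration: the input plane is split into a grid of regions, each region gets its own kernel, and region borders are padded per region rather than handled with overlapping windows. A single region recovers ordinary (global) convolution, while one region per pixel approaches a local-connection layer.

```python
import torch
import torch.nn as nn

class LocallySharedConv2d(nn.Module):
    """Convolution whose kernels are shared only within an r x r grid of spatial regions."""
    def __init__(self, in_ch, out_ch, kernel_size=3, regions=2):
        super().__init__()
        self.regions = regions
        self.convs = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(regions * regions)])

    def forward(self, x):
        out_rows = []
        for i, row in enumerate(torch.chunk(x, self.regions, dim=2)):
            cols = torch.chunk(row, self.regions, dim=3)
            out_cols = [self.convs[i * self.regions + j](col)
                        for j, col in enumerate(cols)]
            out_rows.append(torch.cat(out_cols, dim=3))
        return torch.relu(torch.cat(out_rows, dim=2))
```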
In some of embodiments, a trainer may be electronically communicated with the extractor to add supervisory signals on the feature extraction unit during training so as to adjust neural weights in the deep feature extraction hierarchies by back-propagating supervisory signals through the cascaded multi-convolution modules and pooling modules, or through the cascaded multi-convolution modules, pooling modules and the multi-inception modules. The supervisory signals comprise one identification supervisory signal and one verification supervisory signal, wherein the identification supervisory signal is generated by classifying features in any of the modules extracted from an input face region into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and wherein the verification signal is generated by comparing features in any of the modules extracted from two input face images respectively for determining if they are from the same person, and taking a verification error as the supervisory signal. According to the present application, each of the multi-convolution modules, the pooling modules and the multi-inception modules receives a plurality of supervisory signals which are either added on said each module or back-propagated from later feature extraction modules. These supervisory signals are aggregated to adjust neural weights in each of multi-convolution and multi-inception modules during training.
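The two supervisory signals could be sketched as the following losses. The use of softmax cross-entropy for the identification signal and of a margin-based contrastive error for the verification signal follows the description above, while the classifier itself and the margin value are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def identification_signal(features, identity_labels, classifier):
    """Classify features into one of the N training identities and use the
    classification (cross-entropy) error as the identification signal."""
    return F.cross_entropy(classifier(features), identity_labels)

def verification_signal(f_i, f_j, same_identity, margin=1.0):
    """Compare features of two face images: pull them together when they come
    from the same person, push them at least `margin` apart otherwise."""
    dist = torch.norm(f_i - f_j, p=2, dim=1)
    error_same = 0.5 * dist.pow(2)
    error_diff = 0.5 * torch.clamp(margin - dist, min=0).pow(2)
    return torch.where(same_identity, error_same, error_diff).mean()
```

During training, both signals would be added at each supervised layer and their gradients accumulated during back-propagation, per the procedure described above.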
In the present application, each of the deep feature extraction hierarchies may  comprise a different number of the multi-convolution modules, a different number of the multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or takes a different input face region to extract the features.
In a further aspect of the present application, disclosed is a method for face recognition, comprising: extracting, by an extractor having a plurality of deep feature extraction hierarchies, recognition features from one or more input images; and recognizing face images of the input images based on the extracted recognition features, wherein each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1. A first one of the multi-convolution modules extracts local features from the input images, the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features. Features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
In one embodiment of the present application, each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers, the extracting further comprises: performing, by each of the inception layers, convolution operations on the received features to obtain multi-scale convolutional feature maps, and performing, by said each of the inception layers, pooling operations on the received features to obtain pooled feature maps (i.e. to pool over local regions of the feature maps received from the previous layer to form locally invariant feature maps) , wherein the obtained multi-scale convolutional feature maps and the pooled feature maps are stacked together to form input feature maps of the layer that follows.
In a further embodiment of the present application, each of the hierarchies further comprises one or more multi-inception modules, each of which has a plurality of cascaded inception layers, and wherein, during the extracting, each of the inception layers operates to: receive input feature maps from a previous feature extraction layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps; perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form first output feature maps, where N>1; perform pooling operations on the received features from said previous layer (i.e. to pool over local regions of the input feature maps from the previous layer to form locally invariant feature maps); perform 1x1 convolution operations on the pooled feature maps received from the pooling operation layers to compress a number of feature maps so as to obtain second output feature maps; receive the input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress a number of feature maps so as to obtain third feature maps; and concatenate the first, second and third feature maps to form feature maps for inputting a following inception layer of the inception layers or inputting a next feature extraction module.
In a further aspect of the subject application, there is provided an apparatus for face recognition, which may comprise: one or more memories that store executable components; and one or more processors, coupled to the memories, that execute the executable components to perform operations of the apparatus, the executable components comprising:
an extracting component having a plurality of deep feature extraction hierarchies configured to extract recognition features from one or more input images; and
a recognizing component recognizing face images of the input images based on the extracted recognition features,
wherein each of the hierarchies comprises N multi-convolution modules and M pooling modules, each of N and M is an integer greater than 1,
a first one of the multi-convolution modules extracts local features from the input images, the subsequent multi-convolution modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and reduces dimensions of the received features, and
wherein features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.
Fig. 2 is a schematic diagram illustrating an apparatus for face recognition when it is implemented in software, consistent with some disclosed embodiments.
Fig. 3a and 3b are two schematic diagrams illustrating two examples of deep feature extraction hierarchies in the feature extraction unit as shown in Fig. 1.
Fig. 4a is a schematic diagram illustrating structures of a multi-convolution module, consistent with some disclosed embodiments.
Fig. 4b is a schematic diagram illustrating structures of a multi-inception module in deep feature extraction hierarchies, consistent with some disclosed embodiments.
Fig. 5 is a schematic diagram illustrating structures of an inception layer in multi-inception modules, consistent with some disclosed embodiments.
Fig. 6 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.
Fig. 7 is a schematic flowchart illustrating the extractor as shown in Fig. 1 consistent with some disclosed embodiments.
Fig. 8 is a schematic flowchart illustrating the recognizer as shown in Fig. 1  consistent with some disclosed embodiments.
Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5 consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
In the case that the apparatus 1000 as disclosed below is implemented with software, the apparatus 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 2, the apparatus 1000 may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage device 116, a communication interface 114, and a bus to facilitate information exchange among the various components of the apparatus 1000. Processors 102-106 may include a central processing unit ("CPU"), a graphics processing unit ("GPU"), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods or run the modules that will be explained in greater detail below.
Memory 112 can include, among other things, a random access memory ("RAM") and a read-only memory ("ROM"). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106 to carry out the functions disclosed below for the apparatus 1000. It is noted that although only one block is shown in Fig. 2, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
Referring to Fig. 1 again, where the apparatus 1000 is implemented in hardware, it may comprise an extractor 10 and a recognizer 20. The extractor 10 is configured with a plurality of deep feature extraction hierarchies, which may be formed as a neural network configured or trained to extract recognition features from one or more input images. The recognizer 20 is electronically communicated with the extractor 10 and recognizes face images of the input images based on the extracted recognition features. As will be discussed in detail below, each of the hierarchies comprises N multi-convolution modules and M pooling modules, where each of N and M is an integer greater than 1. A first one of the multi-convolution modules extracts local features from the input images, and subsequent ones of the multi-convolution modules extract further local features from the features output by a previous one of the pooling modules, wherein each of the pooling modules receives local features from a respective one of the multi-convolution modules and reduces the dimensions of the received features. The features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features. In addition, the apparatus 1000 may further comprise a trainer 30 used to train the neural network.
The Extractor 10
The feature extractor 10 contains a plurality of deep feature extraction hierarchies. Each of the feature extraction hierarchies is a cascade of feature extraction modules. Fig. 7 is a schematic flowchart illustrating the feature extraction process in the extractor 10, which contains three steps. In step 101, the feature extractor 10 forward propagates an input face image through each of the deep feature extraction hierarchies, respectively. Then, in step 102, the extractor 10 takes the representations output by each of the deep feature extraction hierarchies as features. Finally, in step 103, it concatenates the features of all the deep feature extraction hierarchies.
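By way of illustration only, steps 101-103 may be sketched in Python as follows, assuming each hierarchy is available as a callable that maps an aligned face image to its top-module representation; the function and variable names are illustrative and do not appear in the patent:

```python
import numpy as np

def extract_recognition_features(face_image, hierarchies):
    """Steps 101-103: forward-propagate one face image through every deep
    feature extraction hierarchy and concatenate the resulting features."""
    features = []
    for hierarchy in hierarchies:              # each hierarchy is a callable model
        features.append(np.asarray(hierarchy(face_image)).ravel())
    return np.concatenate(features)            # final recognition feature vector
```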
In one embodiment of the present application, each of the deep feature extraction hierarchies may include a plurality of multi-convolution modules, a plurality of multi-inception modules, a plurality of pooling modules, and a plurality of full-connection modules. Each of the deep feature extraction hierarchies may contain a different number of cascaded multi-convolution modules, a different number of multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or may take a different input face region to extract features.
Fig. 3a illustrates an example of the feature extraction hierarchies in the extractor 10. As shown in Fig. 3a, each of the deep feature extraction hierarchies contains alternate multi-convolution modules 21-1, 21-2, 21-3... and pooling modules 22-1, 22-2, 22-3.... For the purpose of description, four multi-convolution modules 21-1, 21-2, 21-3 and 21-4 and three pooling modules 22-1, 22-2 and 22-3 are illustrated in Fig. 3a as an example.
Fig. 4a is a schematic diagram illustrating the structure of each of the multi-convolution modules 21-1, 21-2, 21-3.... As shown, each multi-convolution module contains a plurality of cascaded convolutional layers. Fig. 4a shows an example of three cascaded convolutional layers, convolutional layers 1-3. However, in the present application, a multi-convolution module could contain any number of convolutional layers, such as one, two, three, or more. In the extreme case where a multi-convolution module contains only one convolutional layer, it degrades to a conventional convolution module. Therefore, multi-convolution modules are generalizations of conventional convolution modules. Likewise, a multi-inception module contains one or more cascaded inception layers.
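A minimal sketch of such a cascade, written in PyTorch, is given below. The kernel size, channel counts and use of globally shared weights are illustrative assumptions; the patent does not fix these values and also allows locally shared weights (see formula (1) below).

```python
import torch.nn as nn

def multi_convolution_module(in_channels, out_channels, num_layers=3):
    """A multi-convolution module as a cascade of convolutional layers with
    ReLU; with num_layers=1 it degrades to a conventional convolution module."""
    layers, channels = [], in_channels
    for _ in range(num_layers):
        layers += [nn.Conv2d(channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        channels = out_channels
    return nn.Sequential(*layers)
```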
The convolutional layers in a multi-convolution module are configured to extract local facial features from the input feature maps (which are the output feature maps of the previous layer) to form the output feature maps of the current layer. In particular, each convolutional layer performs convolution operations on its input feature maps to form the output feature maps of the current layer, and the formed output feature maps are input to the next convolutional layer.
Each feature map is a certain kind of features organized in 2D. The features in the same output feature map, or in local regions of the same feature map, are extracted from the input feature maps with the same set of neural connection weights. The convolution operation in each convolutional layer may be expressed as

$y_j^{r} = \max\Bigl(0,\; b_j^{r} + \sum_i k_{ij}^{r} * x_i^{r}\Bigr),\qquad (1)$

where
xi and yj are the i-th input feature map and the j-th output feature map, respectively;
kij is the convolution kernel between the i-th input feature map and the j-th output feature map;
* denotes convolution;
bj is the bias of the j-th output feature map;
the ReLU nonlinearity y = max(0, x) is used for neurons, and weights in higher convolutional layers of the ConvNets are locally shared; and
r indicates a local region where weights are shared. In one extreme, where the local region r covers the entire input feature maps, the convolution becomes a global convolution. In the other extreme, where the local region r corresponds to a single pixel in the input feature maps, the convolutional layer degrades to a local-connection layer.
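For illustration, a NumPy sketch of formula (1) in the globally shared case (r covering the entire input feature maps) is given below; cross-correlation is used, as is conventional in ConvNet implementations, and the array shapes are assumptions made only for this example:

```python
import numpy as np
from scipy.signal import correlate2d

def conv_forward(x, k, b):
    """Formula (1), globally shared weights: x is (I, H, W), k is (I, J, kh, kw),
    b is (J,); returns the output feature maps y of shape (J, H', W')."""
    I, J, kh, kw = k.shape
    y = np.zeros((J, x.shape[1] - kh + 1, x.shape[2] - kw + 1))
    for j in range(J):
        acc = np.full(y.shape[1:], b[j])
        for i in range(I):
            acc += correlate2d(x[i], k[i, j], mode='valid')
        y[j] = np.maximum(0.0, acc)            # ReLU non-linearity
    return y
```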
In further embodiments of the present application, 1x1 convolution operations may be carried out in the inception layers (as shown in Fig. 5) to compress the number of feature maps by setting the number of output feature maps significantly smaller than the number of input feature maps, which will be discussed below.
Returning to Fig. 3a, as shown, one pooling module is inserted between every two multi-convolution modules. Each of the pooling modules 22-1, 22-2... aims to reduce feature dimensions and form more invariant features.
The goal of cascading multiple convolution/inception layers is to extract hierarchical local features (i.e., features extracted from local regions of the input images or of the input features), wherein features extracted by higher convolution/inception layers have a larger effective receptive field on the input images and more complex non-linearity. The pooling modules 22-1, 22-2... are configured to pool local facial features from the input feature maps of the previous layer to form the output feature maps of the current layer. Each of the pooling modules 22-1, 22-2... receives the feature maps from the respective connected multi-convolution/multi-inception module, then reduces the feature dimensions of the received feature maps and forms more invariant features by pooling operations, which may be formulated as

$y_{j,k}^{i} = \max_{0 \le m < M,\; 0 \le n < N} \bigl\{ x_{j \cdot s + m,\; k \cdot s + n}^{i} \bigr\},\qquad (2)$

where each neuron in the i-th output feature map yi pools over an M×N local region in the i-th input feature map xi, with s as the step size.
The feature maps with the reduced dimensions are then input to the next cascaded convolutional module.
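For illustration, a NumPy sketch of the pooling in formula (2) might look like the following, assuming max pooling (one common choice) and illustrative region sizes:

```python
import numpy as np

def max_pool(x, M=2, N=2, s=2):
    """Formula (2): each output neuron pools over an M x N local region of the
    corresponding input feature map with step size s.  x has shape (C, H, W)."""
    C, H, W = x.shape
    y = np.zeros((C, (H - M) // s + 1, (W - N) // s + 1))
    for j in range(y.shape[1]):
        for k in range(y.shape[2]):
            y[:, j, k] = x[:, j * s:j * s + M, k * s:k * s + N].max(axis=(1, 2))
    return y
```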
As shown in Fig. 3a, each of the pooling modules is in addition followed by a full-connection module 23 (23-1, 23-2 and 23-3). Features extracted in the three full-connection modules 23-1, 23-2 and 23-3 and the last multi-convolution module 21-4 (multi-convolution module 4) are supervised by supervisory signals. Features in the last multi-convolution module 21-4 are used for face recognition.
The full-connection modules 23-1, 23-2 and 23-3 in the deep feature extraction hierarchies are configured to extract global features (features extracted from the entire region of the input feature maps) from the previous feature extraction modules, i.e., the pooling modules 22-1, 22-2 and 22-3. The full-connection modules also serve as interfaces for receiving supervisory signals during training, which will be discussed later. The full-connection modules 23-1, 23-2 and 23-3 also have the function of feature dimension reduction, as the pooling modules 22-1, 22-2 and 22-3 do, by restricting the number of neurons in them. The full-connection modules 23-1, 23-2 and 23-3 may be formulated as

$y = \max\bigl(0,\; w^{\top} x\bigr),\qquad (3)$

where
x denotes the neural outputs (features) from the cascaded pooling module,
y denotes the neural outputs (features) in the current full-connection module, and
w denotes the neural weights in the current feature extraction module (the current full-connection module). Neurons in the full-connection modules linearly combine the features in the previous feature extraction module, followed by the ReLU non-linearity.
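A one-line NumPy sketch of formula (3) is given below; the bias-free linear combination follows the description above, and restricting the number of columns of w is what performs the dimension reduction:

```python
import numpy as np

def full_connection_forward(x, w):
    """Formula (3): x is the feature vector from the cascaded pooling module,
    w is (D_in, D_out); fewer output neurons means a lower feature dimension."""
    return np.maximum(0.0, x @ w)
```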
Features in the highest module of the deep feature extraction hierarchies are used for face recognition. These features are global and can capture highly non-linear mappings from input face images to their identities. As two examples, features in multi-convolution module 4 in Fig. 3a and full-connection module 4 in Fig. 3b are used for face recognition for the two deep feature extraction hierarchies shown in these two figures, respectively. A feature extraction unit may contain a plurality of the deep feature extraction hierarchies. Features in the top feature extraction modules of all deep feature extraction hierarchies are concatenated into a long feature vector as a final feature representation for face recognition. There may be a plurality of feature extraction modules branching out from the module cascade for extracting features; the full-connection modules 1-3 in Fig. 3a and Fig. 3b are examples of such modules. These branching-out modules, as well as the top feature extraction module (which extracts features for face recognition), serve as interfaces for receiving supervisory signals during training, which will be disclosed later. When training is finished, all branching-out modules are discarded and only the module cascade for extracting features for face recognition is retained in testing.
In another example of a feature extraction hierarchy 20-2 in Fig. 3b, the hierarchy contains two multi-convolution modules 21-1 and 21-2, each of which is followed by a pooling module 22 (22-1 or 22-2). The multi-convolution module 21-1 is connected to an input face image as an input layer, and is configured to extract local facial features (i.e., features extracted from local regions of the input images) from the input images according to formula (1).
The pooling module 22-1 is configured to pool local facial features from the previous layer (the multi-convolution module 21-1) to form the output feature maps of the current layer. To be specific, the pooling module 22-1 receives the feature maps from the respectively connected convolutional module, then reduces the feature dimensions of the received feature maps and forms more invariant features by pooling operations, as formulated in formula (2).
Then, the cascaded multi-convolution module 21-2 and pooling module 22-2 receive the feature maps from the pooling module 22-1 and perform the same operations on the received feature maps as the multi-convolution module 21-1 and the pooling module 22-1, respectively. Herein, each feature map is a certain kind of features organized in 2D.
As shown in Fig. 3b, the feature extraction hierarchy further comprises two multi-inception modules 24-1 and 24-2, each of which is followed by a pooling module 22 (22-3 or 22-4) . Fig. 4b shows an example of three cascaded inception layers 1-3 in each of the multi-inception modules 24-1 and 24-2. The goal of cascading the inception layers is to extract multi-scale local features by incorporating convolutions of various kernel sizes as well as local pooling operations in a single layer. The features extracted by higher convolution/inception layers have larger  effective receptive field on input images and more complex non-linearity.
As shown in Fig. 5, each of the inception layers comprises one or more first 1x1 convolution operation layers 241; one or more second 1x1 convolution operation layers 242; one or more multi-scale convolution operation layers (N×N convolution, N>1) 243; one or more pooling operation layers 244; and one or more third 1x1 convolution operation layers 245. The number of the first 1x1 convolution operation layers 241 is the same as that of the multi-scale convolution operation layers 243, and each layer 243 is coupled to a corresponding layer 241. The number of the third 1x1 convolution operation layers 245 is the same as that of the pooling layers 244. The second 1x1 convolution operation layers 242 are coupled to the previous inception layer.
The 1x1 convolution layers are used to make computation efficient before the operations of the multi-scale convolution operation layers 243 and after the pooling operation layers 244, as will be discussed below.
For purposes of clarity, Fig. 5 shows just two first 1x1 convolution operation layers 241, one second 1x1 convolution operation layer 242, one third 1x1 convolution operation layer 245 and two multi-scale convolution operation layers 243, but the invention is not limited thereto. In the example shown in Fig. 5, the inception layer combines convolution operations with convolutional kernel sizes of 1x1, 3x3, and 5x5, as well as pooling operations according to formula (2). The first 1x1 convolution layers 241 are used to make computation efficient before the 3x3 and 5x5 convolutions. The number of output feature maps of a 1x1 convolution layer is set to be much smaller than the number of its input feature maps. Since the 3x3 and 5x5 convolutions take the output feature maps of the 1x1 convolutions as their input feature maps, the number of input feature maps of the 3x3 and 5x5 convolutions becomes much smaller. In this way, the computations in the 3x3 and 5x5 convolutions are reduced significantly. Likewise, the 1x1 convolution 245 after pooling helps reduce the number of output feature maps of the pooling. Since the output feature maps of the 1x1, 3x3, and 5x5 convolutions are concatenated to form the input feature maps of the next layer, a small number of output feature maps of the 1x1 convolutions reduces the total number of output feature maps, and therefore reduces the computation in the next layer. The 1x1 convolution itself does not account for the majority of the computation, due to its extremely small convolutional kernel size.
Fig. 9 is a schematic flowchart illustrating the process for the inception layer as shown in Fig. 5, consistent with some disclosed embodiments. At step 901, each of the first 1x1 convolution operation layers 241 operates to receive input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress the number of feature maps according to formula (1), as stated above. The multi-scale convolution operation layer 243 then performs N×N convolution operations on the compressed feature maps received from the respective 1x1 convolution operation layer 241 to form a plurality of first output feature maps.
At step 902, the pooling operation layer 244 operates to receive the input feature maps from the previous layer and perform pooling operations on the received feature maps according to formula (2). The pooling operations in inception layers aim to pool over local regions of the input feature maps to form locally invariant features, as stated above. However, to keep the output feature map sizes consistent in layers 242, 243, and 245 so that they can be stacked together later, pooling in inception layers may not reduce feature dimensions, which is achieved by setting the step size s equal to 1 in formula (2). The third 1x1 convolution operation layers 245 operate to perform 1x1 convolution operations on the feature maps received from the pooling operation layer 244 to compress the number of feature maps according to formula (1), as stated above, so as to obtain a plurality of second output feature maps.
At step 903, the second 1x1 convolution operation layers 242 operate to receive the input feature maps from the previous layer and perform 1x1 convolution operations on the received feature maps to compress the number of feature maps according to formula (1), so as to obtain a plurality of third feature maps.
At step 904, the first, second and third feature maps are concatenated to form the feature maps that are input to the following inception layer or the following feature extraction module.
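A minimal PyTorch sketch of one inception layer (steps 901-904) is given below. The branch channel counts are illustrative assumptions; the patent fixes only the structure, namely 1x1, 1x1→3x3, 1x1→5x5 and pool→1x1 branches whose output feature maps are concatenated along the feature-map dimension.

```python
import torch
import torch.nn as nn

class InceptionLayer(nn.Module):
    """One inception layer as in Fig. 5; input and output are NCHW tensors of
    identical spatial size, so successive inception layers can be cascaded."""
    def __init__(self, in_ch, ch1x1=64, red3=48, ch3x3=64, red5=16, ch5x5=32, pool_proj=32):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, ch1x1, 1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, red3, 1), nn.ReLU(inplace=True),
                                     nn.Conv2d(red3, ch3x3, 3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, red5, 1), nn.ReLU(inplace=True),
                                     nn.Conv2d(red5, ch5x5, 5, padding=2), nn.ReLU(inplace=True))
        # step size 1 and padding keep the pooled maps the same size as the input
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```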
The recognizer 20
The recognizer 20 operates to calculate distances between the features extracted by the feature extractor 10 from different face images, in order to determine whether two face images are from the same identity for face verification, or to determine whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification. Fig. 8 is a schematic flowchart illustrating the recognition process in the recognizer 20. In step 201, the recognizer 20 calculates distances between the features extracted from different face images by the feature extractor 10. Then, in step 202, the recognizer 20 determines whether two face images are from the same identity for face verification, or, alternatively, in step 203, it determines whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification.
In the recognizer 20, two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, or the probe face image is determined to belong to the same identity as one of the gallery face images if their feature distance is the smallest compared to the feature distances of the probe face image to all the other gallery face images, wherein the feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
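The two decision rules can be sketched as follows; Euclidean distance is used here purely for illustration, while the patent equally allows Joint Bayesian, cosine, Hamming or other distances:

```python
import numpy as np

def verify(feat_a, feat_b, threshold):
    """Face verification: same identity if the feature distance is below a threshold."""
    return float(np.linalg.norm(feat_a - feat_b)) < threshold

def identify(probe_feat, gallery_feats):
    """Face identification: index of the gallery face with the smallest distance."""
    dists = [np.linalg.norm(probe_feat - g) for g in gallery_feats]
    return int(np.argmin(dists))
```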
In one embodiment of the present application, Joint Bayesian distances are used as the feature distances. Joint Bayesian has been a popular similarity metric for faces, which represents the extracted facial features x (after subtracting the mean) by the sum of two independent Gaussian variables

$x = \mu + \epsilon,\qquad (4)$

where μ ~ N(0, Sμ) represents the face identity and ε ~ N(0, Sε) represents the intra-personal variations. Joint Bayesian models the joint probability of two faces given the intra-personal or extra-personal variation hypothesis, P(x1, x2 | HI) and P(x1, x2 | HE). It is readily shown from Equation (4) that these two probabilities are also Gaussian, with variations

$\Sigma_{I} = \begin{bmatrix} S_{\mu} + S_{\epsilon} & S_{\mu} \\ S_{\mu} & S_{\mu} + S_{\epsilon} \end{bmatrix} \qquad (5)$

and

$\Sigma_{E} = \begin{bmatrix} S_{\mu} + S_{\epsilon} & 0 \\ 0 & S_{\mu} + S_{\epsilon} \end{bmatrix},\qquad (6)$

respectively. Sμ and Sε can be learned from data with the EM algorithm. In test, the recognizer calculates the log likelihood ratio

$r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)},\qquad (7)$

which has a closed-form solution and is efficient.
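As an illustration, the log-likelihood ratio of Equation (7) can be evaluated directly from the block covariances of Equations (5) and (6); the sketch below does so numerically rather than through the closed-form solution mentioned above, and assumes Sμ and Sε have already been estimated:

```python
import numpy as np

def joint_bayesian_ratio(x1, x2, S_mu, S_eps):
    """Numerical log P(x1,x2|H_I) - log P(x1,x2|H_E); the constant d*log(2*pi)
    terms cancel in the ratio and are therefore omitted."""
    z = np.concatenate([x1, x2])
    S = S_mu + S_eps
    sigma_I = np.block([[S, S_mu], [S_mu, S]])                          # Equation (5)
    sigma_E = np.block([[S, np.zeros_like(S)], [np.zeros_like(S), S]])  # Equation (6)

    def log_gauss(cov):
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (logdet + z @ np.linalg.solve(cov, z))

    return log_gauss(sigma_I) - log_gauss(sigma_E)
```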
The Trainer 30
The trainer 30 is used to update the weights w on the connections between neurons in the feature extraction layers (i.e., the layers of the multi-convolution modules, the multi-inception modules and the full-connection modules) in the feature extractor 10, taking as input initial weights on the connections between neurons in the feature extraction layers of the feature extractor, a plurality of identification supervisory signals, and a plurality of verification supervisory signals. The trainer 30 aims to iteratively find a set of optimized neural weights in the deep feature extraction hierarchies for extracting identity-related features for face recognition.
As shown in Fig. 3a and Fig. 3b, the identification and verification supervisory signals in the trainer 30 are simultaneously added to each of the supervised layers in each of the feature extraction hierarchies in the feature extractor 10, and respectively back-propagated to the input face image, so as to update the weights on the connections between neurons in all the cascaded feature extraction modules.
The identification supervisory signals are generated in the trainer 30 by classifying all of the supervised layer (layers selected for supervision, which could be those in multi-convolution modules, multi-inception modules, pooling modules, or full-connection modules) representations into one of N identities, wherein the classification errors are used as the identification supervisory signals.
The verification supervisory signals in the trainer 30 are generated by verifying the supervised layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals. Given a pair of training face images, the feature extractor 10 extracts two feature vectors fi and fj from the two face images, respectively, in each of the feature extraction modules. The verification error is

$\tfrac{1}{2} \left\| f_i - f_j \right\|_2^2$

if fi and fj are features of face images of the same identity, or

$\tfrac{1}{2} \max\bigl(0,\; m - \left\| f_i - f_j \right\|_2\bigr)^2$

if fi and fj are features of face images of different identities, where ||fi − fj||2 is the Euclidean distance between the two feature vectors and m is a positive constant. There are errors if fi and fj are dissimilar for the same identity, or if fi and fj are similar for different identities.
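For illustration, the two verification errors above may be written as follows; the function name and arguments are illustrative only:

```python
import numpy as np

def verification_error(f_i, f_j, same_identity, m):
    """Verification supervisory signal for one pair of supervised-layer features."""
    d = float(np.linalg.norm(f_i - f_j))
    if same_identity:
        return 0.5 * d ** 2                      # penalize dissimilar genuine pairs
    return 0.5 * max(0.0, m - d) ** 2            # penalize similar impostor pairs
```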
Fig. 6 is a schematic flowchart illustrating the training process in the trainer 30. In step 301, the trainer 30 samples two face images and inputs them to the feature extractor 10, respectively, to get feature representations of each of the two face images in all feature extraction layers of the feature extractor 10. Then, in step 302, the trainer 30 calculates identification errors by classifying the feature representations of each face image in each supervised layer into one of a plurality of (N) identities. Simultaneously, in step 303, the trainer 30 calculates verification errors by verifying whether the feature representations of the two face images, respectively, in each supervised layer are from the same identity. The identification and verification errors are used as the identification and verification supervisory signals, respectively. In step 304, the trainer 30 back-propagates all identification and verification supervisory signals through the feature extractor 10 simultaneously, so as to update the weights on the connections between neurons in the feature extractor 10. The identification and verification supervisory signals (or errors) simultaneously added to the supervised layers are back-propagated through the cascade of feature extraction modules down to the input image. After back-propagation, the errors obtained in each layer in the cascade of feature extraction modules are accumulated, and the weights on the connections between neurons in the feature extractor 10 are updated according to the magnitude of the errors. At last, in step 305, the trainer 30 judges whether the training process has converged, and repeats steps 301-304 if a convergence point has not been reached.
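A highly simplified PyTorch sketch of one iteration of steps 301-304 is given below. The extractor is assumed to return the list of supervised-layer representations for a batch; the per-layer identity classifiers, the margin m and the unweighted sum of the losses are all illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn.functional as F

def training_step(extractor, classifier_heads, optimizer, img_a, img_b, id_a, id_b, m=1.0):
    """One iteration: identification + verification errors on a sampled image
    pair, back-propagated simultaneously to update the extractor weights."""
    feats_a, feats_b = extractor(img_a), extractor(img_b)
    loss = 0.0
    for fa, fb, head in zip(feats_a, feats_b, classifier_heads):
        # identification supervisory signal: classify into one of N identities
        loss = loss + F.cross_entropy(head(fa), id_a) + F.cross_entropy(head(fb), id_b)
        # verification supervisory signal: contrastive error on the feature pair
        d = torch.norm(fa - fb, dim=1)
        same = (id_a == id_b).float()
        loss = loss + 0.5 * (same * d ** 2 + (1 - same) * torch.clamp(m - d, min=0) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```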
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be construed as covering the preferred examples and all variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent techniques, they also fall into the scope of the present invention.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (27)

  1. An apparatus for face recognition, comprising:
    an extractor having a plurality of deep feature extraction hierarchies configured to extract recognition features from one or more input images; and
    a recognizer electronically communicated with the extractor to recognize face images of the input images based on the extracted recognition features,
    wherein each of the hierarchies comprises a plurality of multi-convolution modules and a plurality of pooling modules, and at least one of the hierarchies further comprises a plurality of multi-inception modules,
    wherein a first one of the multi-convolution or multi-inception modules extracts local features from the input images, and the following ones of the multi-convolution and multi-inception modules extract further local features from the features output from modules of the pooling modules which are coupled thereto, and wherein each of the pooling modules receives local features from respective multi-convolution modules and multi-inception modules and reduces dimensions of the received features, and
    wherein features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  2. The apparatus of claim 1, wherein each of the pooling modules is coupled between two of adjacent multi-convolution modules, between one multi-convolution module and one adjacent multi-inception module, or between two of adjacent multi-inception modules.
  3. The apparatus of claim 1, wherein each of the multi-inception modules performs a multi-scale convolutional operation on the features received from pooling modules coupled thereto and reduces dimensions of the received features,
    wherein, each of the multi-convolution modules and each of the multi-inception modules in each hierarchy are, respectively, followed by one of the pooling modules,  and each of the pooling modules is followed by one of the multi-convolution modules or one of the multi-inception modules, except for a last one of the pooling modules, a last one of the multi-convolution modules, or a last one of the multi-inception modules in each of the hierarchies.
  4. The apparatus of claim 1 or 3, wherein each of the multi-inception modules comprises a plurality of cascaded inception layers, wherein each of the inception layers is configured to perform 1x1 convolutions on the input feature maps to compress numbers thereof before larger convolution operations and after pooling operations.
  5. The apparatus of claim 4, wherein each of the inception layers comprises:
    one or more first 1x1 convolution operation layers configured to receive input feature maps from a previous one of the inception layers and perform 1x1 convolution operations on the received feature maps to compress numbers of the received feature maps;
    one or more multi-scale convolution operation layers configured to perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form a plurality of first output feature maps, where N >1;
    one or more pooling operation layers configured to receive the input feature maps from said previous inception layer to pool over local regions of the received feature maps to form locally invariant features maps;
    one or more second 1x1 convolution operation layers configured to perform 1x1 convolution operations on the locally invariant feature maps to compress numbers of the feature maps so as to obtain a plurality of second output feature maps; and
    one or more third convolution operation layers configured to receive the input feature maps from the previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers of the feature maps so as to obtain a plurality of third feature maps;
    wherein the first, second and third feature maps are stacked together to form feature maps for inputting a following inception layer of the inception layers.
  6. The apparatus of claim 1, wherein each of the multi-convolution modules comprises one or more cascaded convolution layers, each of the convolution layers receives features outputted from a previous one of the convolution layers as its input, and each of the convolution layers is configured to perform local convolution operations on the input features, wherein the convolutional layers share neural weights for the convolution operations only in local areas of the inputted images.
  7. The apparatus of claim 4, wherein one or more of the pooling modules, the multi-convolution modules, or the multi-inception modules are followed by full-connection modules for extracting global features from corresponding pooling modules, multi-convolution modules, or multi-inception modules connected thereto.
  8. The apparatus of claim 7, further comprising:
    a trainer being electronically communicated with the extractor to add supervisory signals on one or more of the pooling modules, the multi-convolution modules, the multi-inception modules, and the full-connection modules during training so as to adjust neural weights in the deep feature extraction hierarchies by back-propagating supervisory signals through the cascaded multi-convolution modules and pooling modules, or through the cascaded multi-convolution modules, pooling modules and the multi-inception modules.
  9. The apparatus of claim 8, wherein the supervisory signals comprise one identification supervisory signal and one verification supervisory signal,
    wherein the identification supervisory signal is generated by classifying features in each of the supervised modules, which are extracted from an input face region, into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and
    wherein the verification signal is generated by comparing features in each of the supervised modules, which are extracted from two input face images respectively, for determining if they are from the same person, and taking a verification error as the supervisory signal.
  10. The apparatus of claim 9, wherein each of the multi-convolution modules, the pooling modules and the multi-inception modules receives a plurality of supervisory signals which are either added on said each module or back-propagated from later feature extraction modules, wherein these supervisory signals are aggregated to adjust neural weights in each of multi-convolution modules, each of multi-inception modules, and each of full-connection modules during training.
  11. The apparatus of claim 1, wherein distances between the features from two input face images are compared to a threshold to determine if the two input face images are from the same person for face verification, or distances between features of an input query face image to features of each of face images in a face image database are computed to determine which identity in the face image database the input query face image belongs to for face identification.
  12. The apparatus of claim 11, wherein the distances between the features is one selected from a group consisting of Euclidean distances, cosine similarities, Joint Bayesian metrics, or any other distances.
  13. The apparatus of claim 7, wherein each of the deep feature extraction hierarchies comprises a different number of the multi-convolution modules, a different number of the multi-inception modules, a different number of pooling modules, and a different number of full-connection modules, or takes a different input face region to extract the features.
  14. A method for face recognition, comprising:
    extracting, by a plurality of deep feature extraction hierarchies, recognition features from one or more input images; and
    recognizing face images of the input images based on the extracted recognition features,
    wherein each of the hierarchies comprises a plurality of multi-convolution modules and a plurality of pooling modules, and at least one of the hierarchies further comprises a plurality of multi-inception modules,
    wherein the extracting further comprises:
    extracting, by a first one of the multi-convolution or multi-inception modules, local features from the input images;
    extracting, by the following ones of the multi-convolution modules and multi-inception modules, further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and multi-inception modules and reduces dimensions of the received features, and
    concatenating features obtained from all the extraction hierarchies into a feature vector as said recognition features.
  15. The method of claim 14, wherein each of the multi-inception modules has a plurality of cascaded inception layers configured to perform 1x1 convolutions on the input feature maps to compress numbers of the feature maps before larger convolution operations and after pooling operations.
  16. The method of claim 15, wherein, during the extracting, each of the inception layers operates to:
    receive input feature maps from a previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers of the feature maps;
    perform N×N convolution operations on the compressed feature maps received  from respective 1x1 convolution operation layers to form a plurality of first output feature maps, where N > 1;
    pool over local regions of the input feature maps from said previous inception layer to form locally invariant feature maps;
    perform 1x1 convolution operations on the locally invariant feature maps to compress numbers of the feature maps so as to obtain a plurality of second output feature maps;
    receive the input feature maps from the previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers thereof so as to obtain a plurality of third feature maps; and
    concatenate the first, second and third feature maps to form feature maps for inputting a following inception layer of the inception layers.
  17. The method of claim 14, wherein the recognizing further comprises:
    determining distances between the recognition features; and
    determining, in accordance with the determined distances, if two face images of the input images are from the same identity for face verification, or if one of the input images, as a probe face image, belongs to a same identity as one of gallery face images consisting of the input images for face identification.
  18. The method of claim 17, wherein the determining further comprises:
    comparing distances between the features from two input face images to a threshold to determine if the two input face images are from the same person for face verification, or
    computing distances between features of an input query face image to features of each of face images in a face image database to determine which identity in the face image database the input query face image belongs to for face identification.
  19. The method of claim 18, wherein the distances is one selected from a group  consisting of Euclidean distances, cosine similarities, Joint Bayesian metrics, or any other distances.
  20. The method of claim 15, wherein at least one of the hierarchies further comprises a plurality of full-connection modules for extracting global features from corresponding pooling modules, multi-convolution modules, or multi-inception modules connected thereto.
  21. The method of claim 20, wherein the multi-convolution modules, the multi-inception modules, the pooling modules and the full-connection modules are formed as a neural network, and the method further comprises:
    inputting two face images to the neural network, respectively, to get feature representations of each of the two face images;
    calculating identification errors by classifying feature representations of each face image in the neural network into one of a plurality of identities;
    calculating verification errors by verifying if feature representations of two face images, respectively, are from the same identity, the identification and verification errors being treated as the identification and verification supervisory signals, respectively; and
    back-propagating the identification and verification supervisory signals through the neural network simultaneously, to update neural weights on connections among the cascaded multi-convolution modules, the multi-inception modules, and the full-connection modules in the neural network.
  22. An apparatus for face recognition, comprising:
    one or more memories that store executable components; and
    one or more processors, coupled to the memories, that execute the executable components to perform operations of the apparatus, the executable components comprising:
    an extracting component having a plurality of deep feature extraction hierarchies configured to extract recognition features from one or more input images; and
    a recognizing component recognizing face images of the input images based on the extracted recognition features,
    wherein each of the hierarchies comprises a plurality of multi-convolution modules and a plurality of pooling modules, and at least one of the hierarchies further comprises a plurality of multi-inception modules,
    a first one of the multi-convolution or multi-inception modules extracts local features from the input images, and the following ones of the multi-convolution and multi-inception modules extract further local features from the extracted features outputted from a previous module of the pooling modules, wherein each of the pooling modules receives local features from respective multi-convolution modules and multi-inception modules and reduces dimensions thereof, and
    wherein features obtained from all the extraction hierarchies are concatenated into a feature vector as said recognition features.
  23. The apparatus of claim 22, wherein each of the multi-inception modules performs multi-scale convolutional operation on the features received from previous coupled pooling modules and reduces dimensions of the received features.
  24. The apparatus of claim 22 or 23, wherein each of the multi-inception modules comprises a plurality of cascaded inception layers, each of the inception layers receives features outputted from a previous inception layer as its input, and is configured to perform 1x1 convolution operations on the feature maps to reduce numbers thereof.
  25. The apparatus of any one of claims 22-24, wherein each of the inception layers comprises:
    one or more first 1x1 convolution operation layers configured to receive input feature maps from a previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers thereof;
    one or more multi-scale convolution operation layers configured to perform N×N convolution operations on the compressed feature maps received from respective 1x1 convolution operation layers to form a plurality of first output feature maps, where N >1;
    one or more pooling operation layers configured to pool over local regions of the input feature maps from the previous inception layer to form locally invariant feature maps;
    one or more second 1x1 convolution operation layers configured to perform 1x1 convolution operations on the locally invariant feature maps to compress numbers of the feature maps so as to obtain a plurality of second output feature maps; and
    one or more third convolution operation layers configured to receive the input feature maps from the previous inception layer and perform 1x1 convolution operations on the received feature maps to compress numbers thereof so as to obtain a plurality of third feature maps;
    wherein the first, second and third feature maps are stacked together to form feature maps for inputting a following inception layer of the inception layers.
  26. The apparatus of claim 22, wherein each of multi-convolution modules comprises one or more cascaded convolution layers, each of the convolution layers receives features outputted from a previous one of the convolution layers as its input, and each of the convolution layers is configured to perform local convolution operations on the input features, wherein the convolutional layers share neural weights for the convolution operations only in local areas of the inputted images.
  27. The apparatus of claim 22, wherein each of the hierarchies further comprises a plurality of full-connection modules for extracting global features from corresponding pooling modules, multi-convolution modules, or multi-inception modules connected thereto.
PCT/CN2015/000050 2015-01-27 2015-01-27 A method and a system for face recognition WO2016119076A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580074278.6A CN107209864B (en) 2015-01-27 2015-01-27 Face identification method and device
PCT/CN2015/000050 WO2016119076A1 (en) 2015-01-27 2015-01-27 A method and a system for face recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/000050 WO2016119076A1 (en) 2015-01-27 2015-01-27 A method and a system for face recognition

Publications (1)

Publication Number Publication Date
WO2016119076A1 true WO2016119076A1 (en) 2016-08-04

Family

ID=56542092

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/000050 WO2016119076A1 (en) 2015-01-27 2015-01-27 A method and a system for face recognition

Country Status (2)

Country Link
CN (1) CN107209864B (en)
WO (1) WO2016119076A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798381A (en) * 2017-11-13 2018-03-13 河海大学 A kind of image-recognizing method based on convolutional neural networks
WO2018090905A1 (en) * 2016-11-15 2018-05-24 Huawei Technologies Co., Ltd. Automatic identity detection
CN108073876A (en) * 2016-11-14 2018-05-25 北京三星通信技术研究有限公司 Facial analyzing device and facial analytic method
US10282589B2 (en) 2017-08-29 2019-05-07 Konica Minolta Laboratory U.S.A., Inc. Method and system for detection and classification of cells using convolutional neural networks
CN110648316A (en) * 2019-09-07 2020-01-03 创新奇智(成都)科技有限公司 Steel coil end face edge detection algorithm based on deep learning
CN110889373A (en) * 2019-11-27 2020-03-17 中国农业银行股份有限公司 Block chain-based identity recognition method, information storage method and related device
US10621424B2 (en) 2018-03-27 2020-04-14 Wistron Corporation Multi-level state detecting system and method

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844541A (en) * 2017-10-25 2018-03-27 北京奇虎科技有限公司 Image duplicate checking method and device
CN110651273B (en) * 2017-11-17 2023-02-14 华为技术有限公司 Data processing method and equipment
CN109344779A (en) * 2018-10-11 2019-02-15 高新兴科技集团股份有限公司 A kind of method for detecting human face under ring road scene based on convolutional neural networks
US10740593B1 (en) * 2019-01-31 2020-08-11 StradVision, Inc. Method for recognizing face using multiple patch combination based on deep neural network with fault tolerance and fluctuation robustness in extreme situation
CN110598716A (en) * 2019-09-09 2019-12-20 北京文安智能技术股份有限公司 Personnel attribute identification method, device and system
WO2021098799A1 (en) * 2019-11-20 2021-05-27 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Face detection device, method and face unlock system
CN111968264A (en) * 2020-10-21 2020-11-20 东华理工大学南昌校区 Sports event time registration device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038337A (en) * 1996-03-29 2000-03-14 Nec Research Institute, Inc. Method and apparatus for object recognition
US20080014563A1 (en) * 2004-06-04 2008-01-17 France Teleom Method for Recognising Faces by Means of a Two-Dimensional Linear Disriminant Analysis
US8345962B2 (en) * 2007-11-29 2013-01-01 Nec Laboratories America, Inc. Transfer learning methods and systems for feed-forward visual recognition systems
CN103530657A (en) * 2013-09-26 2014-01-22 华南理工大学 Deep learning human face identification method based on weighting L2 extraction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SUN, YI ET AL.: "Deep learning Face Representation by Joint Identification-Verification", CORNELL UNIVERSITY LIBRARY, 18 June 2014 (2014-06-18), Retrieved from the Internet <URL:https://arxiv.org/abs/1406.4773> *
SUN, YI ET AL.: "Deep Learning Face Representation from Predicting 10000 Classes", 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 28 June 2014 (2014-06-28), pages 1891 - 1898 *
SUN, YI ET AL.: "Hybrid Deep learning for Face Verification", 2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 8 October 2013 (2013-10-08), pages 1489 - 1496 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073876A (en) * 2016-11-14 2018-05-25 北京三星通信技术研究有限公司 Facial analyzing device and facial analytic method
CN108073876B (en) * 2016-11-14 2023-09-19 北京三星通信技术研究有限公司 Face analysis device and face analysis method
WO2018090905A1 (en) * 2016-11-15 2018-05-24 Huawei Technologies Co., Ltd. Automatic identity detection
US10460153B2 (en) 2016-11-15 2019-10-29 Futurewei Technologies, Inc. Automatic identity detection
US10282589B2 (en) 2017-08-29 2019-05-07 Konica Minolta Laboratory U.S.A., Inc. Method and system for detection and classification of cells using convolutional neural networks
CN107798381A (en) * 2017-11-13 2018-03-13 河海大学 A kind of image-recognizing method based on convolutional neural networks
CN107798381B (en) * 2017-11-13 2021-11-30 河海大学 Image identification method based on convolutional neural network
US10621424B2 (en) 2018-03-27 2020-04-14 Wistron Corporation Multi-level state detecting system and method
CN110648316A (en) * 2019-09-07 2020-01-03 创新奇智(成都)科技有限公司 Steel coil end face edge detection algorithm based on deep learning
CN110889373A (en) * 2019-11-27 2020-03-17 中国农业银行股份有限公司 Block chain-based identity recognition method, information storage method and related device
CN110889373B (en) * 2019-11-27 2022-04-08 中国农业银行股份有限公司 Block chain-based identity recognition method, information storage method and related device

Also Published As

Publication number Publication date
CN107209864B (en) 2018-03-30
CN107209864A (en) 2017-09-26

Similar Documents

Publication Publication Date Title
WO2016119076A1 (en) A method and a system for face recognition
CN112561027B (en) Neural network architecture search method, image processing method, device and storage medium
CN112597941B (en) Face recognition method and device and electronic equipment
EP3732619B1 (en) Convolutional neural network-based image processing method and image processing apparatus
CN112446270B (en) Person re-identification network training method, person re-identification method and device
WO2019228317A1 (en) Face recognition method and device, and computer readable medium
Paisitkriangkrai et al. Pedestrian detection with spatially pooled features and structured ensemble learning
US9811718B2 (en) Method and a system for face verification
JP6345276B2 (en) Face authentication method and system
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN110175671A (en) Construction method, image processing method and the device of neural network
Parashar et al. Deep learning pipelines for recognition of gait biometrics with covariates: a comprehensive review
WO2016086330A1 (en) A method and a system for face recognition
WO2014205231A1 (en) Deep learning framework for generic object detection
CN111914908A (en) Image recognition model training method, image recognition method and related equipment
CN111414875B (en) 3D Point Cloud Head Pose Estimation System Based on Depth Regression Forest
CN114358205B (en) Model training method, model training device, terminal device and storage medium
Imani et al. Neural computation for robust and holographic face detection
CN106803054B (en) Faceform's matrix training method and device
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN113536970A (en) A training method and related device for a video classification model
US20240143977A1 (en) Model training method and apparatus
Liu et al. Self-constructing graph convolutional networks for semantic labeling
CN113762249B (en) Image attack detection and image attack detection model training method and device
CN108496174B (en) Method and system for face recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15879288

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15879288

Country of ref document: EP

Kind code of ref document: A1