
WO2016037300A1 - Method and system for multi-class object detection - Google Patents

Method and system for multi-class object detection Download PDF

Info

Publication number
WO2016037300A1
WO2016037300A1 (PCT/CN2014/000833)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
bounding boxes
boxes
training
detector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2014/000833
Other languages
French (fr)
Inventor
Xiaoou Tang
Wanli OUYANG
Xingyu ZENG
Shi QIU
Chen Change Loy
Xiaogang Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to PCT/CN2014/000833 priority Critical patent/WO2016037300A1/en
Priority to CN201480081846.0A priority patent/CN106688011B/en
Publication of WO2016037300A1 publication Critical patent/WO2016037300A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24317Piecewise classification, i.e. whereby each classification requires several discriminant rules
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to a method and a system for multi-class object detection, the aim of which is to automatically detect instances of objects of different classes in digital images and videos.
  • the aim of object detection is to detect instances of objects of a certain class in digital images and videos.
  • the performance of object detection systems depends heavily on image representation, the quality of which can be influenced by many kinds of variations, such as viewpoints, illuminations, poses, and occlusions. Due to such uncontrollable factors, it is non-trivial to design a robust image representation that is sufficiently discriminative to represent a large number of object classes.
  • hand-crafted features such as Gabor, SIFT, and HOG
  • object detection based on hand-crafted features involves extracting multiple features on the landmarks of images with multiple scales, and concatenating them into high-dimensional feature vectors.
  • Deep Convolutional Neural Network has been applied to learn features directly from raw pixels.
  • existing deep CNN learning methods pre-train the CNN by using images without bounding box ground truth, and subsequently fine-tune the deep neural net using another set of images with bounding box ground truth.
  • the image set used for fine-tuning has fewer semantic classes than the image set used for pre-training.
  • the number of semantic classes in the image set used for fine-tuning equals the number of actual classes we wish to detect.
  • the device may comprise a feature learning unit and a sub-boxes detector unit.
  • the feature learning unit is configured to determine a first neural network based on training images of a first training image set, wherein each of the training images has a plurality of bounding boxes with objects inside; and determine a second neural network based on bounding boxes of the training images of the first training image set and then further fine-tune the second neural network based on bounding boxes of training images of a second training image set.
  • the sub-boxes detector unit is configured to determine a binary classifier detector for the bounding boxes of the first and the second image sets based on the second neural network, each score of the determined binary classifier detector predicting one semantic object class inside one of the bounding boxes
  • a feature learning module configured to determine a plurality of classification features for each candidate bounding box of an inputted image
  • a sub-boxes detector module configured to utilize a pre-trained detection neural network to calculate a plurality of detection classes scores for each candidate box based on the classification features determined by the feature learning module (203)
  • a context information module configured to concatenate the calculated classification class scores, and determine a final score for the candidate bounding box, the final score representing one semantic object class inside one of the bounding boxes of the inputted image.
  • a system for multi-class object detection which comprises a training device, configured to determine a classification neural network, and a detection neural network from a plurality of predetermined training image sets.
  • the system further comprises a prediction device, comprising a feature learning module configured to determine a plurality of features for each candidate bounding box of an inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and operates to output detection features for the candidate bounding box; a sub-boxes detector module configured to utilize the classification neural network to calculate a plurality of classification class scores for each candidate bounding box based on said detection features; and a context information module configured to concatenate the calculated classification classes scores, and determine, based on the detection neural network, a final score for the candidate bounding box, the final score representing semantic object class inside the box.
  • a method for training neural networks of multi-class object detection comprising:
  • each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.
  • a method for training neural networks of multi-class object detection comprising:
  • each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.
  • the present application further proposes a method for multi-class object detection, comprising:
  • the detection neural network takes the candidate bounding box as input and calculates features values from a last hidden layer of the detection neural network
  • Fig. 1 is a schematic diagram illustrating an exemplary system for multi-class object detection according to one embodiment of the present application.
  • Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device according to one embodiment of the present application.
  • Fig. 3 illustrates a flow chart of the operations for the selective search unit according to one embodiment of the present application.
  • Fig. 4 illustrates a flow chart of the operations for the feature learning unit according to one embodiment of the present application.
  • Fig. 5 illustrates a flow chart for the feature learning unit to train a neural network according to one embodiment of the present application.
  • Fig. 6 illustrates sub-image patches according to one embodiment of the present application.
  • Fig. 7 illustrates a flow chart of the operations for the sub-boxes detector unit according to one embodiment of the present application.
  • Fig. 8 illustrates a flow chart of the operations for the sub-boxes detector unit according to another embodiment of the present application.
  • Fig. 9 illustrates a flow chart of the operations for the contextual information unit according to another embodiment of the present application.
  • Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application.
  • Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
  • Fig. 12 is a schematic diagram illustrating an exemplary block diagram for the prediction device according to one embodiment of the present application.
  • Fig. 13 is a flow chart for the process showing how to output predicted bounding boxes and the corresponding scores for the predicted bounding boxes according to one embodiment of the present application.
  • Fig. 14 illustrates a flow chart of the operations for the model average unit according to other embodiment of the present application.
  • Fig. 1 is a schematic diagram illustrating an exemplary system 100 for multi-class object detection according to one embodiment of the present application.
  • the system 100 for multi-class object detection may comprise a training device 10 and a prediction device 20.
  • each box contains a target semantic object.
  • the training device 10 determines a classification neural network, a detection neural network, a plurality of (n) sub-boxes detectors and a plurality of (n) context information detectors from the retrieved training set.
  • the prediction device 20 can use the networks, sub-boxes detectors and context detectors to detect semantic classes in the images.
  • the prediction device 20 takes an image as input, and outputs bounding box coordinates (x, y, w, h), with which each box contains a target semantic object.
  • Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device 10 according to one embodiment of the present application.
  • the training device 10 may comprise a selective search unit 101, a region rejection unit 102, a feature learning unit 103, a sub-boxes detector unit 104 and a contextual information unit 105, which will be discussed in detail below.
  • the selective search unit 101 is configured to retrieve at least one digital image of videos, and then propose an over-complete set of candidate bounding boxes that may have objects inside for each retrieved image, and then output a plurality of positive and negative candidate bounding boxes (x, y, w, h) .
  • Fig. 3 illustrates a flow chart of the operations for the selective search unit 101 according to one embodiment of the present application.
  • the selective search unit 101 operates to resize each of the retrieved images to a fixed width, e. g. 500 pixels.
  • the selective search unit 101 performs super-pixel segmentation on each of the images to obtain a set of bounding box locations for each image, for example, a small set of data-driven, class-independent, high quality bounding box locations.
  • the selective search unit 101 compares the candidate bounding boxes (i.e. the obtained bounding boxes) with manually labeled bounding boxes to determine whether the overlap between a candidate bounding box and the manually labeled bounding boxes is larger than a predetermined threshold (in terms of overlap area ratio), for example 0.5. If yes, the bounding box will be regarded as a positive sample in step s304, whereas those with overlap less than 0.5 will be regarded as negative samples in step s305.
  • the region rejection unit 102 is configured to throw away a large part of the candidate bounding boxes, according to their scores, to make the following procedure faster. This unit 102 is applied only on the fine-tuning set. In other words, the region rejection unit 102 receives at least one image of videos and the obtained positive and negative candidate bounding boxes (x, y, w, h), and determines which of the obtained positive and negative candidate bounding boxes will be filtered out based on the received images.
  • the region rejection unit 102 operates to obtain an object detection score for each positive and negative candidate bounding box.
  • the region rejection unit 102 may apply any existing object detector on the input images to obtain an object detection score for each positive and negative candidate bounding box (x, y, w, h).
  • the i-th candidate bounding box is rejected if the following rejection condition is satisfied: ||s_i|| < γ, where ||s_i|| = max_j {s_{i,j}}, i is the sample index, j is the class index, and γ is a pre-determined threshold.
  • the feature learning unit 103 is used to train a neural network whose last hidden-layer values are regarded as features.
  • the feature learning unit 103 receives, as its inputs, a pre-training set, a fine-tuning set and the filtered bounding boxes, and then determines, based on these inputs, a fine-tuned neural network, wherein the values outputted from the last hidden layer of the fine-tuned neural network will be regarded as features.
  • the pre-training set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h) .
  • the pre-training set encompasses m object classes.
  • the fine-tuning set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h).
  • the fine-tuning set encompasses n object classes.
  • Fig. 4 illustrates a flow chart of the operations for the feature learning unit 103 according to one embodiment of the present application.
  • the unit 103 operates to pre-train the first neural network using the images in the pre-training set with positive and negative bounding boxes as determined by the selective search unit 101.
  • the feature learning unit 103 may utilize a back-propagation algorithm to train a neural network.
  • Fig. 5 illustrates a flow chart for the feature learning unit 103 to train a neural network.
  • the feature learning unit 103 creates a neural network and then randomly initializes the created network. The configuration of the created network will be discussed later.
  • the feature learning unit 103 calculates the pre-defined loss function for the inputted images in the pre-training set and the candidate positive and negative image regions corresponding to the positive and negative bounding boxes.
  • in step s4013, the feature learning unit 103 calculates the gradient of the loss with respect to all the parameters, that is, ∂Loss/∂θ. Then in step s4014, the update process can be described as θ ← θ - lr·∂Loss/∂θ, where lr is a prefixed learning rate.
  • in step s4015, the feature learning unit 103 checks whether the stopping criterion, for example, whether the loss value on the validation set is increasing, is satisfied. If not, the feature learning unit 103 returns to step s4012 and runs through steps s4012-s4015 until the stopping criterion is satisfied.
  • a second neural network with the same structure as the pre-trained neural network will be created in step S402.
  • the second neural network is initialized by using the parameters of the pre-trained neural network.
  • the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer having n nodes.
  • the feature learning unit 103 operates to fine-tune the second neural network using the bounding boxes of the images in the pre-training set and then further fine-tune it using the bounding boxes of the images in the fine-tuning set.
  • the first neural network may be trained/tuned by using bounding boxes of the pre-training set, and then in step s405, the feature learning unit 103 operates to fine-tune the second neural network using the bounding boxes of the images in the fine-tuning set.
  • the pre-training step (step s401) uses the whole images in the pre-training set to train the first neural network
  • the fine-tuning step (step s405) uses the image regions (bounding boxes containing objects) in the pre-training set and then further uses the fine-tuning set to train the second neural network.
  • the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer having n nodes; thus the difference between the pre-training step (step s401) and the fine-tuning step (step s405) is that the last layer of the first network has m nodes whereas the last layer of the second network has n nodes.
  • Prior art methods often use the whole images in the pre-training set to train the first neural network and use image regions (bounding boxes containing objects) in the fine-tuning set to train the second neural network.
  • the process as proposed above in the present application uses the image regions (bounding boxes containing objects) in the pre-training set to improve the feature learning performance of the feature learning unit.
  • Sub-boxes detector unit 104
  • the sub-boxes detector unit 104 receives at least one image and the candidate bounding boxes (i. e. the boxes outputted from unit 102) , and then utilizes the fine-tuned network trained by the unit 103 to output a plurality of (n) Support Vector Machine (SVM) detectors, each of which predicts one value for one candidate bounding box for one semantic object class, such that a plurality of (n) Support Vector Machine (SVM) detectors will be obtained for the prediction unit (to be discussed later) to predict detection scores for n object classes.
  • the SVM is discussed as an example only, and any other binary classifier may be used in the embodiments of the present application.
  • the sub-boxes detector unit 104 calculates the feature vector F_B, using the fine-tuned neural network obtained from the feature learning unit 103, to describe each candidate bounding box's contents, and further divides the box into a plurality of sub-image patches.
  • Fig. 6 illustrates 4 sub-image patches as an example. It should be appreciated that different number of sub-image patches can be divided in the embodiments of the present application.
  • Fig. 7 illustrates a flow chart of the operations for the sub-boxes detector unit 104 according to one embodiment of the present application (Following max-average SVM) .
  • the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w.
  • the sub-boxes detector unit 104 calculates its overlapping ratios with all object-bounding-boxes B using the equation O_{w,B} = S_{w∩B} / (S_w + S_B - S_{w∩B}),
  • where S_w, S_B, and S_{w∩B} are the size of the sub-image-patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image-patch w and the object-bounding-box B, respectively.
  • in step s703, for each sub-image-patch w, the object-bounding-box with the highest overlapping ratio is chosen as its corresponding box, i.e., the box B that maximizes O_{w,B}.
  • the feature vector of that object-bounding-box is assigned to the sub-image-patch w to describe its contents.
  • in step s704, for each object-bounding-box proposal B with its sub-image-patches, the element-wise average and the element-wise maximum of the feature vectors of the plurality of sub-image-patches are calculated.
  • the feature vector F_B of the object-bounding-box B is concatenated with the element-wise average and maximum vectors to create a longer feature vector that describes the image contents within the bounding box B.
  • the fine-tuned neural network obtained from the feature learning unit 103 is used to extract features from exact sub-image-patch regions. The element-wise average and maximum of the feature vectors are used to describe the image content.
  • in step s706, the concatenated feature vectors and the ground-truth labels of the object-bounding-boxes B are used to train the binary classifier (for example, the SVM as discussed above) detector to output a likelihood score for every possible object class that the box might belong to.
  • Fig. 8 illustrates a flow chart of the operations for the sub-boxes detector unit 104 according to another embodiment of the present application (Following multiple-feature SVM) .
  • the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w.
  • in step s802, for each object-bounding-box B, its feature vector F_B and the feature vectors from the sub-image-patches are used to train separate support vector machines. For example, where there are 4 sub-image-patches, F_B and the 4 feature vectors from the 4 sub-image-patches are used to train 5 separate support vector machines.
  • step s803 given a new object-bounding-box B and its feature vector extracted by the fine-tuned neural network obtained from feature learning unit 103, the corresponding support vector machine is applied to calculate a likelihood score for each object class.
  • in step s804, for each sub-image-patch w, the sub-boxes detector unit 104 first calculates its overlapping ratios with all proposed object-bounding-boxes B using the equation O_{w,B} = S_{w∩B} / (S_w + S_B - S_{w∩B}),
  • where S_w, S_B, and S_{w∩B} are the size of the sub-image-patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image-patch w and the object-bounding-box B, respectively.
  • a predetermined threshold for example, 0.5
  • the corresponding trained support vector machine of w is used to test all its candidate corresponding bounding boxes. For each candidate bounding box, the trained support vector machine generates a score for each possible object class in step s805. The highest score of each object class from all candidate windows is chosen as the class likelihood score for w.
  • in step s806, the object-bounding-box and its (for example, 4) sub-image-patches are associated with a plurality of (for example, 5) sets of object class likelihood scores; the sets of scores are normalized independently and summed together to output a set of object class likelihoods.
  • the contextual information unit 105 is configured to exploit contextual information to improve detection performance.
  • the contextual information unit 105 receives at least one image and receives the candidate bounding boxes from the unit 102.
  • the unit 105 further retrieves scores of the sub-boxes detector from the sub-boxes detector unit 104 and the contextual information from feature learning unit 103, i. e. classification score outputted from the first network.
  • the unit 105 utilizes the pre-trained network and the fine-tuned network to train one binary classifier (for example, an SVM) for each detection class of the candidate bounding boxes, so as to output n binary classifiers that predict an n-dimension vector for each candidate bounding box.
  • Fig. 9 illustrates a flow chart of the operations for the contextual information unit 105 according to another embodiment of the present application.
  • the contextual information unit 105 utilizes the pre-trained network to output the classification score vector s_c (contextual information) for the whole of the received image, where L_c is the number of classification categories.
  • s_c(i) is the probability of the i-th classification class, i.e., the i-th of the m classes in the pre-training set.
  • the contextual information unit 105 operates to concatenate the classification score s_c and the detection score s_d obtained by the sub-boxes detector unit 104 for each bounding box in this image.
  • a new one-vs-all binary classifier (SVM) is trained for each of the n detection classes with contextual modeling.
  • the feature vector x_B may be concatenated from s_d(j) and a sparse feature vector, scaled by a weight, built from the classification scores s_c (a minimal sketch of this scoring scheme is given after this list).
  • in order to avoid over-fitting to the training data, some irrelevant dimensions of the feature vector are set to zero in step s903.
  • in step s904, the contextual information unit 105 operates to train one binary classifier for each detection class. The most relevant classes in the classification task are selected for the j-th class in the detection task; dimensions of the sparse feature vector corresponding to the selected classes are kept, and the other dimensions are set to zero. Then the final score is outputted as the score of the binary classifier in step s905.
  • the multi-class object detection system 100 has been discussed. It shall be understood that there may be several models obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit. For example, the configuration of the network created by the feature learning unit may be changed to have different layers. As these models share the same selective search unit, the candidate boxes are the same for all models. For each candidate box, different models may output different scores for different classes.
  • the prediction device 20 may further comprise a model average unit (not shown).
  • the model average unit is configured to utilize the advantages of several models and make the performance better. As the system needs to detect instances of multiple classes, different training settings may result in different performance. For example, one model setting may be better on some classes while another model may come out better on other classes. The model average unit is used to select different models for each class.
  • the model average unit tries to find a combination list for each class and averages the scores of the models in this list as the final score for each candidate box.
  • Fig. 14 illustrates a flow chart of the operations for the model average unit according to other embodiment of the present application.
  • in step s1401, the model average unit creates one empty list for each class. Multiple models can be obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit. Those models share the same selective search unit.
  • in step s1402, for each class, this unit starts by selecting the best model as the starting point, and tries to find one more model (s1403) such that the performance on this class would be better by averaging the scores of those two models (the best model and said one more model); it then adds this model to the list in step s1408.
  • steps s1402-s1407 are repeated until no more models can be added or the performance would become worse if one more model were added. The above procedure is repeated for all classes.
  • the model average unit outputs one model list for each class.
  • The neural network
  • the neural network structure consists of several kinds of layers.
  • Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application.
  • Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
  • this layer receives images and their labels, where x_{ij} is the j-th bit value of the d-dimensional feature vector of the i-th input image region, and y_{ij} is the j-th bit value of the n-dimensional label vector of the i-th input image region.
  • the convolution layer receives the output from the data layer and performs convolution, padding, sampling, and non-linear transformation operations.
  • the deformation layer is designed to learn the deformation constraints for different object parts. For a given channel of a convolution layer C with size V*H, the deformation layer takes small blocks of size (2R+1)*(2R+1) from that convolution layer C, subsamples each block to a block B, and produces a single output from that block.
  • both i and j range from -R to R.
  • the deformation layer takes the P part detection maps as input and outputs P part scores, and the deformation layer can capture multiple patterns simultaneously.
  • the output of convolution layer and deformation layer can be regarded as discriminative features.
  • the fully connected layer takes the discriminative feature as input and computes the inner product between the feature and the weights. Then a non-linear transformation is applied to the product.
  • The prediction device 20
  • for each test image, the prediction device 20 outputs predicted bounding boxes (x, y, w, h) and the corresponding scores for the n object classes of the test image.
  • Fig. 12 is a schematic diagram illustrating an exemplary block diagram for the prediction device 20 according to one embodiment of the present application. As shown in Fig. 12, the prediction device 20 comprises a selective search module 201, a region rejection module 202, a feature learning module 203, a sub-boxes detector module 204 and a context information module 205.
  • Fig. 13 illustrates a flow chart for the process showing how these modules cooperate to output predicted bounding boxes (x, y, w, h) and the corresponding scores for the predicted bounding boxes.
  • the selective search module 201 receives at least one test image and then proposes a number of candidate bounding boxes in the test image.
  • the received image includes a plurality of instances of (n) object classes (n semantic classes).
  • in step s1302, the region rejection module 202 selects some boxes from the large number of candidate bounding boxes according to Formula 1. Once a candidate box is rejected, this box will be thrown away. Only bounding boxes that pass the region rejection module will be passed to the following module, as discussed in reference to the training device.
  • the feature learning module 203 calculates the classification features for each candidate box through using the fine-tuned network obtained from the training device. Here the fine-tuned network takes the image regions corresponding to the bounding boxes as input and calculates the classification features from the last hidden layer of the fine-tuned network.
  • the sub-boxes detector module 204 receives the calculated classification features from the module 203 and then uses the sub-boxes detectors (binary classifier detectors) obtained from the training device 10 to calculate the n class scores s_d for each candidate box.
  • the sub-boxes detector module 204 calculates the classification features of the plurality of sub-image-regions (for example, 4 sub-image-regions), obtaining the classification features for each sub-image-region using the fine-tuned network obtained in the training device 10. Then the sub-boxes detector module 204 calculates the classification scores s_d using the sub-boxes detectors (binary classifier detectors) trained in the training device 10.
  • the sub-boxes detector (SVM detector) finds the one bounding box having the max overlap value with each sub-image-region, calculates the feature for that bounding box using the fine-tuned network and uses this feature to represent that sub-image-region. Once all four sub-image-regions get their corresponding representing features, the element-wise-max and element-wise-average values are extracted from those four representing features. The concatenated feature vectors multiplied with the binary classifier (SVM) weights obtained in the training device produce the scores s_d.
  • the sub-boxes detector module 204 uses the detection network (i.e. the second network) obtained in the training device 10 to calculate the classification score s_d; then the context information module 205 concatenates the s_d from the previous step with the s_c calculated in this step, and finally multiplies the concatenated vector with the weights of the binary classifier (SVM) obtained from the training device 10 in step s1305.
  • the product is the final score for the candidate bounding boxes proposed by the selective search module 201. It shall be understood that there may be several models obtained by changing the settings of the feature learning unit and the sub-boxes detector unit. As these models share the same selective search unit, the candidate boxes are the same for all models. For each candidate box, different models would output different scores for different classes.
  • the prediction device 20 may further comprise a model average unit (not shown).
  • the final scores are obtained by averaging the final scores of the multiple models selected by this model average unit for each candidate box, in the same way as discussed in reference to the training device 10.
  • modules 201-205 are omitted herein since they function in the same way as the units 101-105 of the training device 10 as discussed above.
  • the system 100 has been discussed for the case in which it is implemented using certain hardware with specific circuitry or a combination of hardware and software. It shall be appreciated that the systems 10 and 100 may also be implemented using software.
  • embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.
  • the system 100 may run in a general-purpose computer, a computer cluster, a mainframe computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
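The contextual scoring described in the bullets above can be summarized with a short sketch. The following Python snippet is a minimal illustration under assumptions, not the patented implementation: the function name, the plain scalar weight applied to the sparse context vector, and the linear classifier weights are all placeholders introduced for clarity.

```python
import numpy as np

def contextual_score(s_d, s_c, relevant, weight, svm_w, svm_b):
    """Hypothetical sketch of the contextual scoring for one detection class j.

    s_d      : (n,) detection scores for the candidate box (sub-boxes detector output)
    s_c      : (L_c,) classification scores for the whole image (pre-trained network)
    relevant : indices of the classification classes kept for this detection class;
               all other dimensions of the context vector are set to zero
    weight   : scalar weight applied to the sparse context vector
    svm_w, svm_b : weights and bias of the per-class binary classifier (e.g. a linear SVM)
    """
    sparse_context = np.zeros_like(s_c)
    sparse_context[relevant] = s_c[relevant]              # keep only the most relevant classes
    x_b = np.concatenate([s_d, weight * sparse_context])  # concatenated feature vector x_B
    return float(np.dot(svm_w, x_b) + svm_b)              # final score for this class
```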

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a device for training neural networks of multi-class object detection. The device may comprise a feature learning unit and a sub-boxes detector unit. According to one embodiment of the present application, the feature learning unit is configured to determine a first neural network based on training images of a first training image set, wherein each of the images has a plurality of bounding boxes with objects inside, and the determined first neural network outputs contextual information for an inputted image; and to determine a second neural network based on bounding boxes of the images in the first training image set and then further fine-tune the second neural network based on bounding boxes of the images in second training image set. The sub-boxes detector unit is configured to determine sub-boxes detector scores for the bounding boxes based on the second neural network, each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.

Description

[Title established by the ISA under Rule 37.2] METHOD AND SYSTEM FOR MULTI-CLASS OBJECT DETECTION
Technical field
The present application relates to a method and a system for multi-class object detection, the aim of which is to automatically detect instances of objects of different classes in digital images and videos.
Background
The aim of object detection is to detect instances of objects of a certain class in digital images and videos. The performance of object detection systems depends heavily on image representation, the quality of which can be influenced by many kinds of variations, such as viewpoints, illuminations, poses, and occlusions. Due to such uncontrollable factors, it is non-trivial to design a robust image representation that is sufficiently discriminative to represent a large number of object classes.
Substantial efforts have been dedicated to designing hand-crafted features, such as Gabor, SIFT, and HOG, for representing images. Typically, object detection based on hand-crafted features involves extracting multiple features on the landmarks of images at multiple scales, and concatenating them into high-dimensional feature vectors.
Deep Convolutional Neural Networks (CNNs) have been applied to learn features directly from raw pixels. For the object detection task, existing deep CNN learning methods pre-train the CNN using images without bounding box ground truth, and subsequently fine-tune the deep neural net using another set of images with bounding box ground truth. Typically, the image set used for fine-tuning has fewer semantic classes than the image set used for pre-training. In addition, the number of semantic classes in the image set used for fine-tuning equals the number of actual classes we wish to detect.
Summary of invention
In one aspect, disclosed is a device for training neural networks of multi-class object detection. The device may comprise a feature learning unit and a sub-boxes detector unit. According to one embodiment of the present application, the feature learning unit is configured to determine a first neural network based on training images of a first training image set, wherein each of the training images has a plurality of bounding boxes with objects inside; and determine a second neural network based on bounding boxes of the training images of the first training image set and then further fine-tune the second neural network based on bounding boxes of training images of a second training image set. The sub-boxes detector unit is configured to determine a binary classifier detector for the bounding boxes of the first and the second image sets based on the second neural network, each score of the determined binary classifier detector predicting one semantic object class inside one of the bounding boxes
In another aspect, disclosed is a device for multi-class object detection, comprising: a feature learning module configured to determine a plurality of classification features for each candidate bounding box of an inputted image; a sub-boxes detector module configured to utilize a pre-trained detection neural network to calculate a plurality of detection class scores for each candidate box based on the classification features determined by the feature learning module (203); and a context information module configured to concatenate the calculated classification class scores, and determine a final score for the candidate bounding box, the final score representing one semantic object class inside one of the bounding boxes of the inputted image.
In a further aspect, disclosed is a system for multi-class object detection, which comprises a training device configured to determine a classification neural network and a detection neural network from a plurality of predetermined training image sets. The system further comprises a prediction device, comprising a feature learning module configured to determine a plurality of features for each candidate bounding box of an inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and operates to output detection features for the candidate bounding box; a sub-boxes detector module configured to utilize the classification neural network to calculate a plurality of classification class scores for each candidate bounding box based on said detection features; and a context information module configured to concatenate the calculated classification class scores, and determine, based on the detection neural network, a final score for the candidate bounding box, the final score representing the semantic object class inside the box.
In a further aspect, disclosed is a method for training neural networks of multi-class object detection, comprising:
determining a first neural network based on training images of a first training image set, wherein each of the images has a plurality of bounding boxes with objects inside, and the determined first neural network outputs contextual information for an inputted image;
determining a second neural network based on bounding boxes of the images in the first training image set;
fine-tuning the second neural network based on bounding boxes of the images in second training image set; and
determining sub-boxes detector scores for the bounding boxes based on the second neural network, each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.
In a further aspect, disclosed is a method for training neural networks of multi-class object detection, comprising:
determining a first neural network based on a plurality of bounding boxes of a first training image set;
determining a second neural network based on bounding boxes of the images in a second training image set, the determined first neural network outputting contextual  information for an inputted image; and
determining sub-boxes detector scores for the bounding boxes based on the second neural network, each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.
In addition, the present application further proposes a method for multi-class object detection, comprising:
determining a classification neural network, a detection neural network, a plurality of sub-boxes detectors and a plurality of context information detectors from a plurality of predetermined training image sets;
determining a plurality of features for each candidate bounding box of an inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and calculates features values from a last hidden layer of the detection neural network;
calculating a plurality of classification class scores for each candidate box based on the classification neural network; and
concatenating the calculated classification class scores, so as to determine, based on the detection neural network, a final score for the candidate bounding box through the determined sub-boxes detector.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating an exemplary system for multi-class object detection according to one embodiment of the present application.
Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device according to one embodiment of the present application.
Fig. 3 illustrates a flow chart of the operations for the selective search  unit according to one embodiment of the present application.
Fig. 4 illustrates a flow chart of the operations for the feature learning unit according to one embodiment of the present application.
Fig. 5 illustrates a flow chart for the feature learning unit to train a neural network according to one embodiment of the present application.
Fig. 6 illustrates sub-image patches according to one embodiment of the present application.
Fig. 7 illustrates a flow chart of the operations for the sub-boxes detector unit according to one embodiment of the present application.
Fig. 8 illustrates a flow chart of the operations for the sub-boxes detector unit according to another embodiment of the present application.
Fig. 9 illustrates a flow chart of the operations for the contextual information unit according to another embodiment of the present application.
Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application.
Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
Fig. 12 is a schematic diagram illustrating an exemplary block diagram for the prediction device according to one embodiment of the present application.
Fig. 13 is a flow chart for the process showing how to output predicted bounding boxes and the corresponding scores for the predicted bounding boxes according to one embodiment of the present application.
Fig. 14 illustrates a flow chart of the operations for the model average unit according to other embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments,  examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts.
Fig. 1 is a schematic diagram illustrating an exemplary system 100 for multi-class object detection according to one embodiment of the present application. As illustrated in Fig. 1, the system 100 for multi-class object detection may comprise a training device 10 and a prediction device 20.
The training device 10 is configured to retrieve a pre-determined training set, which contains a set of images, each of which is labeled with designated bounding boxes (x, y, w, h), where (x, y) is the top-left coordinate of a bounding box, h is the height of the bounding box, and w is the width of the bounding box. In one embodiment of the present application, each box contains a target semantic object. The training device 10 then determines a classification neural network, a detection neural network, a plurality of (n) sub-boxes detectors and a plurality of (n) context information detectors from the retrieved training set. Once the training device 10 has finished the training procedure, the prediction device 20 can use the networks, sub-boxes detectors and context detectors to detect semantic classes in images. The prediction device 20 takes an image as input, and outputs bounding box coordinates (x, y, w, h), with which each box contains a target semantic object.
Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device 10 according to one embodiment of the present application. As shown, the training device 10 may comprise a selective search unit 101, a region rejection unit 102, a feature learning unit 103, asub-boxes detector unit 104 and a contextual information unit 105, which will be discussed in details as below.
Selective search unit 101
The selective search unit 101 is configured to retrieve at least one digital image of videos, propose an over-complete set of candidate bounding boxes that may have objects inside for each retrieved image, and then output a plurality of positive and negative candidate bounding boxes (x, y, w, h). Fig. 3 illustrates a flow chart of the operations for the selective search unit 101 according to one embodiment of the present application. In step s301, the selective search unit 101 operates to resize each of the retrieved images to a fixed width, e.g. 500 pixels. In step s302, the selective search unit 101 performs super-pixel segmentation on each of the images to obtain a set of bounding box locations for each image, for example, a small set of data-driven, class-independent, high quality bounding box locations. In step s303, the selective search unit 101 compares the candidate bounding boxes (i.e. the obtained bounding boxes) with manually labeled bounding boxes to determine whether the overlap between a candidate bounding box and the manually labeled bounding boxes is larger than a predetermined threshold (in terms of overlap area ratio), for example 0.5. If yes, the bounding box will be regarded as a positive sample in step s304, whereas those with overlap less than 0.5 will be regarded as negative samples in step s305. A minimal sketch of this labeling rule is given below.
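The following Python sketch illustrates the positive/negative labeling rule of steps s303-s305. It assumes the overlap area ratio is measured as intersection-over-union and uses hypothetical helper names; the patent does not prescribe this exact implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, ground_truth_boxes, threshold=0.5):
    """Label each candidate box as positive or negative by its best overlap with the labeled boxes."""
    positives, negatives = [], []
    for box in proposals:
        best = max((iou(box, gt) for gt in ground_truth_boxes), default=0.0)
        (positives if best > threshold else negatives).append(box)
    return positives, negatives
```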
Region rejection unit 102
The region rejection unit 102 is configured to throw away a large part of the candidate bounding boxes, according to their scores, to make the following procedure faster. This unit 102 is applied only on the fine-tuning set. In other words, the region rejection unit 102 receives at least one image of videos and the obtained positive and negative candidate bounding boxes (x, y, w, h), and determines which of the obtained positive and negative candidate bounding boxes will be filtered out based on the received images.
In one embodiment of the present application, the region rejection unit 102 operates to obtain an object detection score for each positive and negative candidate bounding box. The region rejection unit 102 may apply any existing object detector on the input images to obtain an object detection score for each positive and negative candidate bounding box (x, y, w, h). Denote the detection scores for the n classes of the i-th candidate bounding box as s_i. The i-th candidate bounding box is rejected if the following rejection condition is satisfied:
||s_i|| < γ,    (Formula 1)
where ||s_i|| = max_j {s_{i,j}},
i is the sample index,
j is the class index, and
γ is a pre-determined threshold.
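As an illustration only (the detector used to produce the scores is left open in the text above), the rejection rule of Formula 1 can be applied as in the following sketch; the array layout and threshold value are assumptions.

```python
import numpy as np

def reject_candidates(scores, gamma=1.0):
    """Apply the rejection condition of Formula 1.

    scores : (num_boxes, n) array, scores[i, j] = detection score of box i for class j
    gamma  : pre-determined threshold
    Returns a boolean mask that is True for the boxes that are kept.
    """
    max_per_box = scores.max(axis=1)        # ||s_i|| = max_j s_{i,j}
    return max_per_box >= gamma             # reject boxes with ||s_i|| < gamma
```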
Feature learning unit 103
The feature learning unit 103 is used to train a neural network whose last hidden-layer values are regarded as features. In one embodiment of the present application, the feature learning unit 103 receives, as its inputs, a pre-training set, a fine-tuning set and the filtered bounding boxes, and then determines, based on these inputs, a fine-tuned neural network, wherein the values outputted from the last hidden layer of the fine-tuned neural network will be regarded as features. The pre-training set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h). The pre-training set encompasses m object classes. The fine-tuning set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h). The fine-tuning set encompasses n object classes.
Fig. 4 illustrates a flow chart of the operations for the feature learning unit 103 according to one embodiment of the present application. In step s401, the unit 103 operates to pre-train the first neural network using the images in the pre-training set with the positive and negative bounding boxes as determined by the selective search unit 101. To be specific, the feature learning unit 103 may utilize a back-propagation algorithm to train a neural network. Fig. 5 illustrates a flow chart for the feature learning unit 103 to train a neural network. As shown, in step s4011, the feature learning unit 103 creates a neural network and then randomly initializes the created network. The configuration of the created network will be discussed later.
And then in step s4012, the feature learning unit 103 calculates the pre-defined loss function for the inputted images in the pre-training set and the candidate positive and negative image regions corresponding to the positive and negative bounding boxes. The loss function can be described as Loss = f(x, y, θ), where x is the bounding box, y is its label, and θ stands for all parameters, including the convolution filters, deformation layer weights, fully connected weights and biases in the created network. If x is a positive candidate bounding box, then its y should be a non-zero value; if one ground truth box has the max overlap value with x, then y should be the value of the class that ground truth box belongs to. The whole training process for the neural network tries to minimize the loss over the whole set of training images. In step s4013, the feature learning unit 103 calculates the gradient of the loss with respect to all the parameters, that is, ∂Loss/∂θ. Then in step s4014, the update process can be described as θ ← θ - lr·∂Loss/∂θ, where lr is a prefixed learning rate. In step s4015, the feature learning unit 103 checks whether the stopping criterion, for example, whether the loss value on the validation set is increasing, is satisfied. If not, the feature learning unit 103 returns to step s4012 and runs through steps s4012-s4015 until the stopping criterion is satisfied. A minimal sketch of this update loop is given below.
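The gradient-descent loop of steps s4012-s4015 can be sketched as follows. This is a generic illustration in Python, not the network described in the patent: the loss and gradient functions, the parameter vector, and the early-stopping test are placeholders supplied by the caller.

```python
import numpy as np

def train_by_backprop(theta, loss_fn, grad_fn, val_loss_fn, lr=0.01, max_iters=10000):
    """Generic sketch of steps s4012-s4015: compute the loss, compute the gradient,
    update the parameters, and stop when the validation loss starts increasing."""
    prev_val_loss = np.inf
    for _ in range(max_iters):
        _ = loss_fn(theta)                 # s4012: loss on the training regions
        grad = grad_fn(theta)              # s4013: gradient w.r.t. all parameters
        theta = theta - lr * grad          # s4014: theta <- theta - lr * dLoss/dtheta
        val_loss = val_loss_fn(theta)      # s4015: stopping criterion on the validation set
        if val_loss > prev_val_loss:
            break
        prev_val_loss = val_loss
    return theta
```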
Returning to Fig. 4, once the first neural network is created and pre-trained, a second neural network with the same structure as the pre-trained neural network is created in step s402. In step s403, the second neural network is initialized using the parameters of the pre-trained neural network. In step s404, the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer having n nodes. In step s405, the feature learning unit 103 operates to fine-tune the second neural network using the bounding boxes of the images in the pre-training set and then further fine-tune it using the bounding boxes of the images in the fine-tuning set.
Alternatively, in steps s4012-s4015, the first neural network may be trained/tuned by using bounding boxes of the pre-training set, and then in step s405, the feature learning unit 103 operates to fine-tune the second neural network using the  bounding boxes of the images in the fine-tune set.
It shall be appreciated that the pre-training step (step s401) uses the whole images in the pre-training set to train the first neural network, while the fine-tuning step (step s405) uses the image regions (bounding boxes containing objects) in the pre-training set and then further uses the fine-tuning set to train the second neural network. As discussed above in reference to step s404, for the second network, the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer having n nodes; thus the difference between the pre-training step (step s401) and the fine-tuning step (step s405) is that the last layer of the first network has m nodes whereas the last layer of the second network has n nodes.
Prior art methods often use the whole images in the pre-training set to train the first neural network and use image regions (bounding boxes containing objects) in the fine-tuning set to train the second neural network. In contrast to that training scheme, the process proposed above in the present application also uses the image regions (bounding boxes containing objects) in the pre-training set, which improves the feature learning performance of the feature learning unit. A sketch of the output-layer replacement of step s404 is given below.
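The second-network construction of steps s402-s404 (same structure, copied parameters, m-node output layer replaced by an n-node layer) can be illustrated with the following PyTorch-style sketch. The attribute name `output` and the use of torch are assumptions made for illustration; the patent does not specify a framework.

```python
import copy
import torch.nn as nn

def build_second_network(pretrained_net: nn.Module, n_classes: int) -> nn.Module:
    """Create the second network: same structure and parameters as the pre-trained
    network, with the m-node output layer replaced by a new n-node output layer."""
    net = copy.deepcopy(pretrained_net)             # s402/s403: copy structure and parameters
    in_features = net.output.in_features            # assumes the last layer is stored as `net.output`
    net.output = nn.Linear(in_features, n_classes)  # s404: new randomly initialized n-node output layer
    return net
```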
Sub-boxes detector unit 104
The sub-boxes detector unit 104 receives at least one image and the candidate bounding boxes (i. e. the boxes outputted from unit 102) , and then utilizes the fine-tuned network trained by the unit 103 to output a plurality of (n) Support Vector Machine (SVM) detectors, each of which predicts one value for one candidate bounding box for one semantic object class, such that a plurality of (n) Support Vector Machine (SVM) detectors will be obtained for the prediction unit (to be discussed later) to predict detection scores for n object classes. Herein, the SVM is discussed as an example only, and any other binary classifier may be used in the embodiments of the present application.
For each candidate bounding box B, the sub-boxes detector unit 104 calculates the feature vector F_B, using the fine-tuned neural network obtained from the feature learning unit 103, to describe the candidate bounding box's contents, and further divides the box into a plurality of sub-image patches. Fig. 6 illustrates 4 sub-image patches as an example. It should be appreciated that a different number of sub-image patches can be used in the embodiments of the present application.
Fig. 7 illustrates a flow chart of the operations of the sub-boxes detector unit 104 according to one embodiment of the present application (following the max-average SVM scheme). In step s701, the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w. In step s702, for each sub-image patch w, the sub-boxes detector unit 104 calculates its overlapping ratios with all object-bounding-boxes B using the following equation
Ow,B = Sw∩B / (Sw + SB - Sw∩B),    Formula 2)
where Sw, SB, and Sw∩B are the size of the sub-image patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image patch w and the object-bounding-box B, respectively.
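As a purely illustrative aid, a minimal Python sketch of this overlapping ratio (Formula 2) is given below; boxes are assumed to be (x, y, width, height) tuples, which is an assumption of the sketch rather than a requirement of the disclosed method.

def overlap_ratio(w_box, b_box):
    # Formula 2): O = S(w ∩ B) / (S(w) + S(B) - S(w ∩ B)); returns 0 when the boxes do not overlap.
    wx, wy, ww, wh = w_box
    bx, by, bw, bh = b_box
    ix = max(0.0, min(wx + ww, bx + bw) - max(wx, bx))   # width of the intersection
    iy = max(0.0, min(wy + wh, by + bh) - max(wy, by))   # height of the intersection
    inter = ix * iy
    union = ww * wh + bw * bh - inter
    return inter / union if union > 0 else 0.0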
In step s703, for each sub-image patch w, the object-bounding-box B*w with the highest overlapping ratio is chosen as its corresponding box, i.e., B*w = argmaxB Ow,B. The feature vector FB*w of that object-bounding-box is assigned to the sub-image patch w to describe its contents.
In step s704, for each object-bounding-box proposal B with its sub-image patches, the element-wise average Favg and the element-wise maximum Fmax of the feature vectors FB*w of the plurality of sub-image patches are calculated as

Favg = averagew (FB*w),    Formula 3)
Fmax = maxw (FB*w),    Formula 4)

where the average and the maximum are taken element-wise over the sub-image patches of B.
In step s705, the feature vector FB of the object-bounding-box B is concatenated with Favg and Fmax to create a longer feature vector [FB, Favg, Fmax] to describe the image contents within the bounding box B. In one embodiment of the present application, the fine-tuned neural network obtained from the feature learning unit 103 is used to extract features from the exact sub-image-patch regions, and the element-wise average and maximum of those feature vectors are used to describe the image content.
In step s706, the concatenated feature vectors and the ground-truth labels of the object-bounding-boxes B are used to train the binary classifier detector (for example, the SVM discussed above), which outputs a likelihood score for every possible object class that the box might belong to.
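A minimal sketch of steps s701-s705 follows; it assumes the overlap_ratio helper sketched above and assumes that the fine-tuned network has already produced a feature vector for every candidate box, and the function name and argument layout are hypothetical.

import numpy as np

def max_average_feature(f_b, sub_patches, candidate_boxes, candidate_feats):
    # Build the concatenated descriptor [F_B, F_avg, F_max] of steps s701-s705.
    patch_feats = []
    for w in sub_patches:
        ratios = [overlap_ratio(w, b) for b in candidate_boxes]
        best = int(np.argmax(ratios))            # step s703: box with the highest overlap
        patch_feats.append(candidate_feats[best])
    patch_feats = np.stack(patch_feats)
    f_avg = patch_feats.mean(axis=0)             # Formula 3): element-wise average
    f_max = patch_feats.max(axis=0)              # Formula 4): element-wise maximum
    return np.concatenate([f_b, f_avg, f_max])   # step s705: longer feature vector

The returned vector, together with the ground-truth labels, would then be fed to the binary classifier training of step s706.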
Fig. 8 illustrates a flow chart of the operations of the sub-boxes detector unit 104 according to another embodiment of the present application (following the multiple-feature SVM scheme). In step s801, the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w. In step s802, for each object-bounding-box B, its feature vector FB and the feature vectors from the sub-image patches are used to train separate support vector machines. For example, where there are 4 sub-image patches, the feature vector FB together with the 4 feature vectors from the 4 sub-image patches are used to train 5 separate support vector machines.
In step s803, given a new object-bounding-box B and its feature vector extracted by the fine-tuned neural network obtained from the feature learning unit 103, the corresponding support vector machine is applied to calculate a likelihood score for each object class.

In step s804, for each sub-image patch w, the sub-boxes detector unit 104 first calculates its overlapping ratios with all proposed object-bounding-boxes B using the following equation

Ow,B = Sw∩B / (Sw + SB - Sw∩B),    Formula 5)

where Sw, SB, and Sw∩B are the size of the sub-image patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image patch w and the object-bounding-box B, respectively.
Only object-bounding-boxes B with overlapping ratios to the sub-image patch w greater than a predetermined threshold (for example, 0.5) are selected as candidate corresponding bounding boxes for w in step s805.
The corresponding trained support vector machine of w is used to test all its candidate corresponding bounding boxes. For each candidate bounding box, the trained support vector machine generates a score for each possible object class in step s805. The highest score of each object class from all candidate windows is chosen as the class likelihood score for w.
In step s806, the object-bounding-box and its (for example, 4) sub-image patches are associated with a plurality of (for example, 5) sets of object class likelihood scores; the sets of scores are normalized independently and summed together to output a set of object class likelihoods.
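A minimal sketch of the score fusion of steps s805-s806 is given below; since the text does not fix the normalization, an L2 normalization of each score set is assumed here for illustration only.

import numpy as np

def fuse_multiple_feature_scores(score_sets):
    # score_sets: one per-class score vector per region (e.g. 5 of them: the whole box
    # plus 4 sub-image patches); each set is normalized independently and then summed.
    fused = np.zeros(len(score_sets[0]), dtype=float)
    for s in score_sets:
        s = np.asarray(s, dtype=float)
        norm = np.linalg.norm(s)
        fused += s / norm if norm > 0 else s
    return fused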
Contextual information unit 105
The contextual information unit 105 is configured to exploit contextual information to improve detection performance. The contextual information unit 105 receives at least one image and the candidate bounding boxes from the unit 102. The unit 105 further retrieves the scores of the sub-boxes detector from the sub-boxes detector unit 104 and the contextual information from the feature learning unit 103, i.e. the classification score outputted from the first network. The unit 105 then utilizes the pre-trained network and the fine-tuned network to train one binary classifier (for example, an SVM) for each detection class of the candidate bounding boxes, so as to output n classes of binary classifiers that predict an n-dimension vector for each candidate bounding box.
Fig. 9 illustrates a flow chart of the operations for the contextual information unit 105 according to another embodiment of the present application.
In step s901, the contextual information unit 105 utilizes the pre-trained network to output the classification score (contextual information) sc = [sc(1), ..., sc(Lc)] for the whole of the received image, where Lc is the number of classification categories and sc(i) is the probability of the i-th classification class, i.e., the i-th of the m classes in the pre-training set.

In step s902, the contextual information unit 105 operates to concatenate the classification score sc and the detection score sd obtained by the sub-boxes detector unit 104 for each bounding box in this image. After the scores sc and sd have been calculated for all images and their bounding boxes, a new one-vs-all binary classifier (SVM) is trained for each of the n detection classes with contextual modeling. To train the j-th binary classifier, the feature vector xB may be concatenated from sd(j) and a sparse feature vector sc' with a weight η, i.e., by rule of:

xB = [sd(j), η·sc'],    Formula 6)
In order to avoid over-fitting to the training data, some irrelevant dimensions of the feature vector sc' are set to zero in step s903. And then, in step s904, the contextual information unit 105 operates to train one binary classifier for each detection class. Let Ωj select the most relevant classes in the classification task for the j-th class in the detection task: if i∈Ωj, then sc'(i) = sc(i); otherwise sc'(i) = 0. The final score is then outputted as the score of the binary classifier in step s905.
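For illustration, a minimal sketch of building the contextual feature vector of Formula 6 is shown below; the function name, the representation of Ωj as a plain index set, and the parameter names are assumptions of this sketch.

import numpy as np

def contextual_feature(sd_j, sc, omega_j, eta):
    # Formula 6): x_B = [s_d(j), eta * s_c'], where s_c'(i) = s_c(i) for i in Omega_j
    # (the classification classes most relevant to detection class j) and 0 otherwise.
    sc = np.asarray(sc, dtype=float)
    sc_sparse = np.zeros_like(sc)
    idx = list(omega_j)
    sc_sparse[idx] = sc[idx]
    return np.concatenate([[sd_j], eta * sc_sparse])

The resulting vectors xB over all training boxes are what the j-th one-vs-all binary classifier would be trained on in step s904.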
The Model Average Unit
Hereinabove, one of the arrangements (one model of the system) for the multi-class object detection system 100 has been discussed. It shall be understood that several models may be obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit. For example, the configuration of the network created by the feature learning unit may be changed to have different layers. As these models share the same selective search unit, the candidate boxes are the same for all models; for each candidate box, different models may output different scores for different classes.
In one embodiment of the present application, the prediction device 20 may further comprise a model average unit (not shown). The model average unit is configured to exploit the advantages of several models and thereby improve performance. Since instances of multiple classes need to be detected, different training settings may result in different performance; for example, one model setting may be better for some classes while another model may perform better on other classes. The model average unit is used to select different models for each class.
The model average unit tries to find a combination list of models for each class and averages the scores of the models in this list as the final score for each candidate box. Fig. 14 illustrates a flow chart of the operations of the model average unit according to another embodiment of the present application. In step s1401, the unit creates one empty list for each class. Multiple models can be obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit; these models share the same selective search unit.
In step s1402, for each class, the unit selects the best model as the starting point, and then tries to find one more model (step s1403) such that the performance for this class improves when the scores of the two models (the best model and said one more model) are averaged; this model is then added to the list in step s1408. Steps s1402-s1407 are repeated until no more models can be added, or until adding one more model would worsen the performance. The above procedure is repeated for all classes, so that the model average unit outputs one model list for each class.
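A minimal sketch of this greedy per-class model selection is shown below; the evaluate callable (for example a per-class average precision) is a hypothetical stand-in, since the disclosure does not fix the performance measure.

import numpy as np

def select_models_for_class(model_scores, labels, evaluate):
    # model_scores: one score array per model over the candidate boxes of this class.
    remaining = list(range(len(model_scores)))
    best = max(remaining, key=lambda i: evaluate(model_scores[i], labels))   # step s1402
    chosen, current = [best], evaluate(model_scores[best], labels)
    remaining.remove(best)
    improved = True
    while improved and remaining:                                            # steps s1403-s1408
        improved = False
        for i in list(remaining):
            trial = evaluate(np.mean([model_scores[j] for j in chosen + [i]], axis=0), labels)
            if trial > current:
                chosen.append(i)
                remaining.remove(i)
                current = trial
                improved = True
                break
    return chosen   # the model list for this class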
The neural network
Hereinafter, the neural network created and trained by the feature learning unit 103 will be discussed.
The neural network structure consists of several kinds of layers. Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application. Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
Data layer
This layer receives the images {xi} and their labels {yi}, where xij is the j-th bit value of the d-dimension feature vector of the i-th input image region, and yij is the j-th bit value of the n-dimension label vector of the i-th input image region.
Convolution layer
The convolution layer receives the output from the data layer and performs convolution, padding, sampling, and non-linear transformation operations.
Deformation layer
Since objects have different sizes and many semantic parts, filters with different sizes are added into the convolution layer. One filter with one size produces one score map, which describes the corresponding part information. The deformation layer is designed to learn the deformation constraints for the different object parts. For a given channel of a convolution layer C with size V*H, the deformation layer takes small blocks of size (2R+1)*(2R+1) from that convolution layer C and subsamples them, with subsampling steps kh and kv, into an output map B, producing a single output for each block as follows:

B(x,y) = max over i,j of ( C(x+i, y+j) - Σn cn·dn,(i,j) ),

where (x, y) is the center of the (2R+1)*(2R+1) block, both i and j range from -R to R, kh and kv are the subsampling steps, and cn and dn are the deformation parameters to be learned.
The deformation layer takes the P part detection maps as input and outputs P part scores, and it can capture multiple patterns simultaneously. The outputs of the convolution layer and the deformation layer can be regarded as discriminative features.
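For illustration only, one plausible Python sketch of such a deformation layer is given below; the exact block-scoring rule is an assumption reconstructed from the description above, not a verbatim reproduction of the disclosed formula.

import numpy as np

def deformation_layer(conv_map, c, d, R, kv, kh):
    # conv_map: one V*H channel of the convolution layer; c: (N,) and d: (N, 2R+1, 2R+1)
    # are the deformation parameters to be learned; kv, kh are the subsampling steps.
    V, H = conv_map.shape
    penalty = np.tensordot(c, d, axes=1)   # sum_n c[n] * d[n], shape (2R+1, 2R+1)
    out = []
    for y in range(R, V - R, kv):          # subsampled block centres
        row = []
        for x in range(R, H - R, kh):
            block = conv_map[y - R:y + R + 1, x - R:x + R + 1]
            row.append(np.max(block - penalty))   # best part placement minus deformation cost
        out.append(row)
    return np.array(out)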
Fully connected layer
The fully connected layer takes the discriminative feature as input and computes the inner product between the feature and its weights; a non-linear transformation is then applied to the result.
The prediction device 20
Hereinafter, the prediction device 20 will be discussed in detail. For each test image, the prediction device 20 outputs predicted bounding boxes (x, y, w, h) and the corresponding scores for the n object classes of the test image. Fig. 12 is a schematic diagram illustrating an exemplary block diagram of the prediction device 20 according to one embodiment of the present application. As shown in Fig. 12, the prediction device 20 comprises a selective search module 201, a region rejection module 202, a feature learning module 203, a sub-boxes detector module 204, and a context information module 205. Fig. 13 illustrates a flow chart of the process showing how the modules 201-205 cooperate to output the predicted bounding boxes (x, y, w, h) and the corresponding scores for the predicted bounding boxes.
In step s1301, the selective search module 201 receives at least one test image and then proposes a number of candidate bounding boxes in the test image. The received image includes a plurality of instances of (n) object classes (n semantic classes).

In step s1302, the region rejection module 202 selects boxes from the large number of candidate bounding boxes by rule of Formula 1. Once a candidate box is rejected, it is thrown away; only bounding boxes passing through the region rejection module are passed to the following module, as discussed in reference to the training device. In step s1303, the feature learning module 203 calculates the classification features for each candidate box using the fine-tuned network obtained from the training device. Here, the fine-tuned network takes the image regions corresponding to the bounding boxes as input and calculates the classification features from its last hidden layer.
In step s1304, the sub-boxes detector module 204 receives the calculated classification features from the module 203 and then uses the sub-boxes detector (binary classifier detector) obtained from the training device 10 to calculate the n-class scores sd for each candidate box. Here, the sub-boxes detector module calculates the classification features of the plurality of sub-image regions (for example, 4 sub-image regions) for each candidate box using the fine-tuned network obtained in the training device 10, and then calculates the classification scores sd using the sub-boxes detectors (binary classifier detectors) trained in the training device 10. As discussed above, the feature outputted from the last hidden layer of the second network (the detection network or fine-tuned network) is taken as the classification feature and input into the sub-boxes detector module 204, which applies the binary classifier detector (for example, an SVM detector) learned in the training device to output a detection score sd = w·x + b, where x represents the feature of the bounding box received from the module 203, while w and b are the parameters learned/determined in training.
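For illustration, a minimal sketch of this per-class detection score is given below; the parameter names are hypothetical.

import numpy as np

def detection_score(x, w, b):
    # s_d = w * x + b for one detection class, where x is the concatenated feature of a
    # candidate bounding box and (w, b) were learned by the sub-boxes detector in training.
    return float(np.dot(w, x) + b)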
If the sub-boxes detector unit in the training device 10 follows the max-average SVM scheme, the sub-boxes detector (SVM detector) finds the one bounding box having the maximum overlap value with each sub-image region, calculates the feature of that bounding box using the fine-tuned network, and uses this feature to represent that sub-image region. Once all four sub-image regions have obtained their corresponding representing features, the element-wise-max and element-wise-average values are extracted from those four representing features. The concatenated feature vector [FB, Favg, Fmax], multiplied with the binary classifier (SVM) weights obtained in the training device, produces the scores sd.
Once the sub-boxes detector module 204 has used the detection network (i.e. the second network) obtained in the training device 10 to calculate the detection score sd, the context information module 205 concatenates sd with the classification score sc calculated in this step, and finally multiplies the concatenated vector with the weights of the binary classifier (SVM) obtained from the training device 10 in step s1305. The product is the final score for the candidate bounding boxes proposed by the selective search module 201. It shall be understood that there may be several models obtained by changing the settings of the feature learning unit and the sub-boxes detector unit. As these models share the same selective search unit, the candidate boxes are the same for all models, and for each candidate box different models may output different scores for different classes. In one embodiment of the present application, the prediction device 20 may further comprise a model average unit (not shown). For each class, the final scores are obtained by averaging the final scores of the multiple models selected by this model average unit for each candidate box, in the same way as discussed above in reference to the training device 10.
It will be appreciated that a more detailed description of the modules 201-205 is omitted herein, since they function in the same way as the units 101-105 of the training device 10 discussed above.
In the above, the system 100 has been discussed for the case in which it is implemented using certain hardware with specific circuitry, or a combination of hardware and software. It shall be appreciated that the systems 10 and 100 may also be implemented using software. In addition, the embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.
In the case that the system 100 is implemented with software, the system 100 may run on a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be considered as comprising the preferred examples and all the variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and the equivalent technique, they also fall within the scope of the present invention.

Claims (36)

  1. A device for training neural networks of multi-class object detection, comprising:
    a feature learning unit (103) configured to,
    determine a first neural network based on training images of a first training image set, wherein each of the training images has a plurality of bounding boxes with objects inside; and
    determine a second neural network based on bounding boxes of the training images of the first training image set and then further fine-tune the second neural network based on bounding boxes of training images of a second training image set; and
    a sub-boxes detector unit (104) configured to determine a binary classifier detector for the bounding boxes of the first and the second image sets based on the second neural network, each score of the determined binary classifier detector predicting one semantic object class inside one of the bounding boxes.
  2. A device for training neural networks of multi-class object detection, comprising:
    a feature learning unit (103) configured to determine a first neural network based on a plurality of bounding boxes of a first training image set, and then to determine a second neural network based on bounding boxes of the images of a second training image set; and
    a sub-boxes detector unit (104) configured to determine a binary classifier detector for the bounding boxes based on the determined second neural network, each score of the determined binary classifier detector predicting one semantic object class inside one of the bounding boxes.
  3. A device according to claim 1 or 2, wherein the determined first neural network operates to output contextual information for an image inputted thereto,
    the device further comprising:
    a contextual information unit (105) configured to retrieve each score of the binary classifier detector from the sub-boxes detector unit (104) and the contextual information from the feature learning unit (103) so as to train a binary classifier detector for each detection class to predict each of the bounding boxes.
  4. A device according to claim 3, further comprising:
    a selective search unit (101) configured to retrieve at least one inputted image, and then determine the bounding boxes with objects inside for each retrieved image.
  5. A device according to claim 3, further comprising:
    a region rejection unit (102) configured to filter out bounding boxes from the determined boxes based on a predetermined threshold.
  6. A device according to claim 1 or 2, wherein the feature learning unit (103) determines the first neural network using the training images of the first training image set through a back-propagation algorithm.
  7. A device according to claim 1 or 2, wherein the feature learning unit (103) determines the second neural network through a back-propagation algorithm.
  8. A device for multi-class object detection, comprising:
    a feature learning module (203) configured to determine a plurality of classification features for each candidate bounding box of an inputted image;
    a sub-boxes detector module (204) configured to utilize a pre-trained detection neural network to calculate a plurality of detection classes scores for each candidate box based on the classification features determined by the feature learning module  (203) ; and
    a context information module configured to concatenate the calculated classification class scores, and determine a final score for the candidate bounding box, the final score representing one semantic object class inside one of the bounding boxes of the inputted image.
  9. A system for multi-class object detection, comprising:
    a training device (10) configured to determine a classification neural network, and a detection neural network from a plurality of predetermined training image sets;
    a prediction device (20) , comprising:
    a feature learning module (203) configured to determine a plurality of features for each candidate bounding box of an inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and operates to output detection features for the candidate bounding box;
    a sub-boxes detector module (204) configured to utilize the classification neural network to calculate a plurality of classification class scores for each candidate bounding box based on said detection features; and
    a context information module (205) configured to concatenate the calculated classification classes scores, and determine, based on the detection neural network, a final score for the candidate bounding box, the final score representing semantic object class inside the box.
  10. A system according to claim 9, wherein the training device (10) further comprises:
    a feature learning unit (103) configured to,
    determine the classification neural network based on training images of a first training image set, wherein each of the images has a  plurality of bounding boxes with objects inside, and the determined classification neural network outputs contextual information for an image inputted thereto; and
    determine the detection neural network based on bounding boxes of the images in the first training image set and then further fine-tune the detection neural network based on bounding boxes of the images in second training image set; and
    a sub-boxes detector unit (104) configured to determine binary classifier detectors for the bounding boxes based on the detection neural network, each score of the determined binary classifier detector predicting one semantic object class for one of the bounding boxes.
  11. A system according to claim 9, wherein the training device (10) further comprises:
    a feature learning unit (103) configured to determine the classification neural network based on a plurality of bounding boxes of a first training image set, and then to determine the detection neural network based on bounding boxes of the images of a second training image set; and
    a sub-boxes detector unit (104) configured to determine a binary classifier detector for the bounding boxes based on the detection neural network, each score of the determined binary classifier detector predicting one semantic object class for one of the bounding boxes.
  12. A system according to claim 10 or 11, wherein the determined classification neural network is capable of outputting contextual information for an image inputted thereto, and the system further comprises:
    a contextual information unit (105) configured to retrieve scores of the binary classifier detector from the sub-boxes detector unit (104) and the contextual information from the feature learning unit (103) so as to train a binary classifier detector for each detection class of the bounding boxes for predicting each bounding box.
  13. A system according to claim 12, further comprising:
    a selective search unit (101) configured to retrieve at least one inputted image, and then determine the bounding boxes with objects inside for each retrieved image.
  14. A system according to claim 13, further comprising:
    a region rejection unit (102) configured to filter out bounding boxes from the determined boxes based on a predetermined threshold.
  15. A system according to claim 11 or 12, wherein the feature learning unit (103) determines the classification neural network using the images of the first image training set through a back-propagation algorithm.
  16. A system according to claim 11 or 12, wherein the feature learning unit (103) determines the detection neural network through a back-propagation algorithm.
  17. A system according to claim 11 or 12, wherein the sub-boxes detector unit (104) is configured to determine scores of the binary classifier detector based on a max-average SVM.
  18. A system according to claim 11 or 12, wherein the binary classifier detector unit (104) is configured to determine scores of the binary classifier detector based on a multiple-feature SVM.
  19. A method for training neural networks of multi-class object detection, comprising:
    determining a first neural network based on training images of a first training image set, wherein each of the images has a plurality of bounding boxes with objects inside;
    determining a second neural network based on bounding boxes of the images of the first training image set;
    fine-tuning the second neural network based on bounding boxes of the images of a second training image set; and
    determining a binary classifier detector for the bounding boxes based on the second neural network, each of scores of the binary classifier detector predicting one semantic object class inside one of the bounding boxes.
  20. A method for training neural networks of multi-class object detection, comprising:
    determining a first neural network based on a plurality of bounding boxes of a first training image set;
    determining a second neural network based on bounding boxes of the images of a second training image set; and
    determining a binary classifier detector for the bounding boxes based on the second neural network, each score of the determined binary classifier detector predicting one semantic object class for one of the bounding boxes.
  21. A method according to claim 19 or 20, wherein, the determined first neural network outputting contextual information for an inputted image,
    the method further comprising:
    training the binary classifier detector for each detection class of the bounding boxes for predicting each bounding box based on the score of the binary classifier detector and the contextual information.
  22. A method according to claim 21, further comprising:
    retrieving at least one inputted image; and
    determining the bounding boxes with objects inside for each retrieved image.
  23. A method according to claim 21, further comprising:
    filtering out bounding boxes from the determined boxes based on a predetermined threshold.
  24. A method according to claim 19 or 20, wherein the first neural network is determined by using the images of the first image training set through a back-propagation algorithm.
  25. A method according to claim 19 or 20, wherein the second neural network is determined through a back-propagation algorithm.
  26. A method for multi-class object detection, comprising:
    determining a plurality of classification features for each candidate bounding box of an inputted image;
    calculating a plurality of classification classes scores for each candidate box based on the determined classification features;
    concatenating the calculated classification classes scores, and
    determining, from concatenated classes scores, a final score for the candidate bounding box through a pre-trained binary classifier detector, wherein the final score is used to predict one semantic object class inside one of the bounding boxes.
  27. A method for multi-class object detection, comprising:
    1) determining a classification neural network, a detection neural network, a plurality of binary classifier detectors from a plurality of predetermined training image sets;
    2) determining a plurality of features for each candidate bounding box of an  inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and operates to calculate classification features for the inputted box;
    3) calculating, by using the classification neural network, a plurality of classification classes scores for each candidate box based on the calculated features; and
    4) concatenating the calculated classification classes scores, so as to determine, based on the detection neural network, a final score for the candidate bounding box through the determined binary classifier detector so as to predict one semantic object class inside one of the bounding boxes.
  28. A method according to claim 27, wherein the step 1) further comprises:
    determining the classification neural network based on training images of a first training image set, wherein each of the images has a plurality of bounding boxes with objects inside; and
    determining the detection neural network based on bounding boxes of the images of the first training image set and then further fine-tuning the detection neural network based on bounding boxes of the images of a second training image set; and
    determining the binary classifier detectors for the bounding boxes based on the detection neural network, each score of the binary classifier detector predicting one semantic object class for one of the bounding boxes.
  29. A method according to claim 27, wherein the step 1) further comprises:
    determining the classification neural network based on a plurality of bounding boxes of a first training image set;
    determining the detection neural network based on bounding boxes of the images of a second training image set; and
    determining a binary classifier detector for the bounding boxes based on the detection neural network, each score of the binary classifier detector predicting one semantic object class inside one of the bounding boxes.
  30. A method according to claim 28 or 29, wherein the determined classification neural network outputs contextual information for an image inputted thereto,
    the method further comprising:
    training the binary classifier detector for each detection class of the bounding boxes for predicting each bounding box based on the binary classifier detector scores and the contextual information.
  31. A method according to claim 30, further comprising:
    retrieving at least one inputted image, and
    determining the bounding boxes with objects inside for each retrieved image.
  32. A method according to claim 31, further comprising:
    filtering out bounding boxes from the determined boxes based on a predetermined threshold.
  33. A method according to claim 28 or 29, wherein the classification neural network is determined by using the images of the first image training set through a back-propagation algorithm.
  34. A method according to claim 28 or 29, wherein the detection neural network is determined through a back-propagation algorithm.
  35. A method according to claim 28 or 29, wherein scores of the binary classifier detector are determined based on a max-average SVM.
  36. A method according to claim 28 or 29, wherein scores of the binary classifier detector are determined based on a multiple-feature SVM.