
WO2016037300A1 - Method and system for multi-class object detection - Google Patents

Method and system for multi-class object detection Download PDF

Info

Publication number
WO2016037300A1
WO2016037300A1 (PCT/CN2014/000833)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
bounding boxes
boxes
training
detector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2014/000833
Other languages
French (fr)
Inventor
Xiaoou Tang
Wanli OUYANG
Xingyu ZENG
Shi QIU
Chen Change Loy
Xiaogang Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to PCT/CN2014/000833 priority Critical patent/WO2016037300A1/en
Priority to CN201480081846.0A priority patent/CN106688011B/en
Publication of WO2016037300A1 publication Critical patent/WO2016037300A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24317Piecewise classification, i.e. whereby each classification requires several discriminant rules
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to a method and a system for multi-class object detection, the aim of which is to automatically detect instances of objects of different classes in digital images and videos.
  • the aim of object detection is to detect instances of objects of a certain class in digital images and videos.
  • the performance of object detection systems depends heavily on image representation, the quality of which can be influenced by many kinds of variations, such as viewpoints, illuminations, poses, and occlusions. Due to such uncontrollable factors, it is non-trivial to design a robust image representation that is sufficiently discriminative to represent a large number of object classes.
  • hand-crafted features such as Gabor, SIFT, and HOG
  • object detection based on hand-crafted features involves extracting multiple features on the landmarks of images with multiple scales, and concatenating them into high-dimensional feature vectors.
  • Deep Convolutional Neural Network has been applied to learn features directly from raw pixels.
  • existing deep CNN learning methods pre-train the CNN by using images without bounding box ground truth, and subsequently fine-tune the deep neural net using another set of images with bounding box ground truth.
  • the image set used for fine-tuning has fewer semantic classes than the image set used for pre-training.
  • the number of semantic classes in the image set used for fine-tuning equals the number of actual classes we wish to detect.
  • the device may comprise a feature learning unit and a sub-boxes detector unit.
  • the feature learning unit is configured to determine a first neural network based on training images of a first training image set, wherein each of the training images has a plurality of bounding boxes with objects inside; and determine a second neural network based on bounding boxes of the training images of the first training image set and then further fine-tune the second neural network based on bounding boxes of training images of a second training image set.
  • the sub-boxes detector unit is configured to determine a binary classifier detector for the bounding boxes of the first and the second image sets based on the second neural network, each score of the determined binary classifier detector predicting one semantic object class inside one of the bounding boxes
  • a feature learning module configured to determine a plurality of classification features for each candidate bounding box of an inputted image
  • a sub-boxes detector module configured to utilize a pre-trained detection neural network to calculate a plurality of detection classes scores for each candidate box based on the classification features determined by the feature learning module (203)
  • a context information module configured to concatenate the calculated classification class scores, and determine a final score for the candidate bounding box, the final score representing one semantic object class inside one of the bounding boxes of the inputted image.
  • a system for multi-class object detection which comprises a training device, configured to determine a classification neural network, and a detection neural network from a plurality of predetermined training image sets.
  • the system further comprises a prediction device, comprising a feature learning module configured to determine a plurality of features for each candidate bounding box of an inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and operates to output detection features for the candidate bounding box; a sub-boxes detector module configured to utilize the classification neural network to calculate a plurality of classification class scores for each candidate bounding box based on said detection features; and a context information module configured to concatenate the calculated classification classes scores, and determine, based on the detection neural network, a final score for the candidate bounding box, the final score representing semantic object class inside the box.
  • a method for training neural networks of multi-class object detection comprising:
  • each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.
  • a method for training neural networks of multi-class object detection comprising:
  • each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.
  • the present application further proposes a method for multi-class object detection, comprising:
  • the detection neural network takes the candidate bounding box as input and calculates features values from a last hidden layer of the detection neural network
  • Fig. 1 is a schematic diagram illustrating an exemplary system for multi-class object detection according to one embodiment of the present application.
  • Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device according to one embodiment of the present application.
  • Fig. 3 illustrates a flow chart of the operations for the selective search unit according to one embodiment of the present application.
  • Fig. 4 illustrates a flow chart of the operations for the feature learning unit according to one embodiment of the present application.
  • Fig. 5 illustrates a flow chart for the feature learning unit to train a neural network according to one embodiment of the present application.
  • Fig. 6 illustrates sub-image patches according to one embodiment of the present application.
  • Fig. 7 illustrates a flow chart of the operations for the sub-boxes detector unit according to one embodiment of the present application.
  • Fig. 8 illustrates a flow chart of the operations for the sub-boxes detector unit according to another embodiment of the present application.
  • Fig. 9 illustrates a flow chart of the operations for the contextual information unit according to another embodiment of the present application.
  • Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application.
  • Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
  • Fig. 12 is a schematic diagram illustrating an exemplary block diagram for the prediction device according to one embodiment of the present application.
  • Fig. 13 is a flow chart for the process showing how to output predicted bounding boxes and the corresponding scores for the predicted bounding boxes according to one embodiment of the present application.
  • Fig. 14 illustrates a flow chart of the operations for the model average unit according to other embodiment of the present application.
  • Fig. 1 is a schematic diagram illustrating an exemplary system 100 for multi-class object detection according to one embodiment of the present application.
  • the system 100 for multi-class object detection may comprise a training device 10 and a prediction device 20.
  • each box contains a target semantic object.
  • the training device 10 determines a classification neural network, a detection neural network, a plurality of (n) sub-boxes detectors and a plurality of (n) context information detectors from the retrieved training set.
  • the prediction device 20 can use the networks, sub-boxes detectors and context detectors to detect semantic classes in the images.
  • the prediction device 20 takes an image as input, and outputs bounding box coordinates (x, y, w, h), with which each box contains a target semantic object.
  • Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device 10 according to one embodiment of the present application.
  • the training device 10 may comprise a selective search unit 101, a region rejection unit 102, a feature learning unit 103, a sub-boxes detector unit 104 and a contextual information unit 105, which will be discussed in detail below.
  • the selective search unit 101 is configured to retrieve at least one digital image of videos, and then propose an over-complete set of candidate bounding boxes that may have objects inside for each retrieved image, and then output a plurality of positive and negative candidate bounding boxes (x, y, w, h) .
  • Fig. 3 illustrates a flow chart of the operations for the selective search unit 101 according to one embodiment of the present application.
  • the selective search unit 101 operates to resize each of the retrieved images to a fixed width, e. g. 500 pixels.
  • the selective search unit 101 performs super-pixel segmentation on each of the images to obtain a set of bounding box locations for each image, for example, a small set of data-driven, class-independent, high quality bounding box locations.
  • the selective search unit 101 compares the candidate bounding boxes (i.e. the obtained bounding boxes) with manually labeled bounding boxes to determine whether the overlap between a candidate bounding box and the manually labeled bounding boxes is larger than a predetermined threshold (in terms of overlap area ratio), for example 0.5. If yes, the bounding box will be regarded as a positive sample in step s304, whereas those with overlap less than 0.5 will be regarded as negative samples in step s305.
  • the region rejection unit 102 is configured to throw away a large part of the candidate bounding boxes, according to their scores, to make the following procedure faster. This unit 102 is applied only on the fine-tuning set. In other words, the region rejection unit 102 receives at least one image of videos and the obtained positive and negative candidate bounding boxes (x, y, w, h), and determines which of the obtained positive and negative candidate bounding boxes will be filtered out based on the received images.
  • the region rejection unit 102 operates to obtain an object detection score for each positive and negative candidate bounding box.
  • the region rejection unit 102 may apply any existing object detector on the input images to obtain an object detection score for each positive and negative candidate bounding box (x, y, w, h).
  • the i-th candidate bounding box is rejected if the following rejection condition is satisfied: ||s_i|| < γ, where ||s_i|| = max_j {s_{i,j}}, i is the sample index, j is the class index, and γ is a pre-determined threshold.
  • the feature learning unit 103 is used to train a neural network whose last hidden-layer values are regarded as features.
  • the feature learning unit 103 receives, as its inputs, a pre-training set, a fine-tuning set and the filtered bounding boxes, and then determines, based on these inputs, a fine-tuned neural network, wherein the values outputted from the last hidden layer of the fine-tuned neural network will be regarded as features.
  • the pre-training set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h) .
  • the pre-training set encompasses m object classes.
  • the fine-tuning set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h).
  • the fine-tuning set encompasses n object classes.
  • Fig. 4 illustrates a flow chart of the operations for the feature learning unit 103 according to one embodiment of the present application.
  • the unit 103 operates to pre-train the first neural network using the images in the pre-training set with positive and negative bounding boxes as determined by the selective search unit 101.
  • the feature learning unit 103 may utilize a back-propagation algorithm to train a neural network.
  • Fig. 5 illustrates a flow chart for the feature learning unit 103 to train a neural network.
  • the feature learning unit 103 creates a neural network and then randomly initializes the created network. The configuration of the created network will be discussed later.
  • the feature learning unit 103 calculates the pre-defined loss function for the inputted images in the pre-training set and the candidate positive and negative image regions corresponding to the positive and negative bounding boxes.
  • in step s4013, the feature learning unit 103 calculates the gradient of the loss with respect to all the parameters, that is, ∂Loss/∂θ. Then in step s4014, the update process can be described as θ ← θ - lr·∂Loss/∂θ, where lr is a prefixed learning rate.
  • in step s4015, the feature learning unit 103 checks whether the stopping criterion, for example, whether the loss value on the validation set is increasing, is satisfied. If not, the feature learning unit 103 returns to step s4012 and runs through steps s4012-s4015 until the stopping criterion is satisfied.
  • a second neural network with the same structure as the pre-trained neural network will be created in step S402.
  • the second neural network is initialized by using the parameters of the pre-trained neural network.
  • the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer having n nodes.
  • the feature learning unit 103 operates to fine-tune the second neural network using the bounding boxes of the images in the pre-training set and then further fine-tune it using the bounding boxes of the images in the fine-tuning set.
  • the first neural network may be trained/tuned by using bounding boxes of the pre-training set, and then in step s405, the feature learning unit 103 operates to fine-tune the second neural network using the bounding boxes of the images in the fine-tuning set.
  • the pre-training step (step s401) uses the whole images in the pre-training set to train the first neural network
  • the fine-tuning step (step s405) uses the image regions (bounding boxes containing objects) in the pre-training set and then further uses the fine-tuning set to train the second neural network.
  • the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer having n nodes; thus the difference between the pre-training step (step s401) and the fine-tuning step (step s405) is that the last layer of the first network has m nodes whereas the last layer of the second network has n nodes.
  • Prior art methods often use the whole images in the pre-training set to train the first neural network and use image regions (bounding boxes containing objects) in the fine-tuning set to train the second neural network.
  • the process as proposed above in the present application uses the image regions (bounding boxes containing objects) in the pre-training set to improve the feature learning performance of the feature learning unit.
  • Sub-boxes detector unit 104
  • the sub-boxes detector unit 104 receives at least one image and the candidate bounding boxes (i. e. the boxes outputted from unit 102) , and then utilizes the fine-tuned network trained by the unit 103 to output a plurality of (n) Support Vector Machine (SVM) detectors, each of which predicts one value for one candidate bounding box for one semantic object class, such that a plurality of (n) Support Vector Machine (SVM) detectors will be obtained for the prediction unit (to be discussed later) to predict detection scores for n object classes.
  • the SVM is discussed as an example only, and any other binary classifier may be used in the embodiments of the present application.
  • the sub-boxes detector unit 104 calculates the feature vector F_B, using the fine-tuned neural network obtained from the feature learning unit 103, to describe each candidate bounding box's contents, and further divides the box into a plurality of sub-image patches.
  • Fig. 6 illustrates 4 sub-image patches as an example. It should be appreciated that different number of sub-image patches can be divided in the embodiments of the present application.
  • Fig. 7 illustrates a flow chart of the operations for the sub-boxes detector unit 104 according to one embodiment of the present application (Following max-average SVM) .
  • the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w.
  • the sub-boxes detector unit 104 calculates its overlapping ratios with all object-bounding-boxes B using the equation O_{w,B} = S_{w∩B} / (S_w + S_B - S_{w∩B}),
  • where S_w, S_B, and S_{w∩B} are the size of the sub-image-patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image-patch w and the object-bounding-box B, respectively.
  • in step s703, for each sub-image-patch w, the object-bounding-box with the highest overlapping ratio is chosen as its corresponding box, i.e., the box B that maximizes O_{w,B}.
  • the feature vector of that object-bounding-box is assigned to the sub-image-patch w to describe its contents.
  • in step s704, for each object-bounding-box proposal B with its sub-image-patches, the element-wise average and the element-wise maximum of the feature vectors of the plurality of sub-image-patches are calculated.
  • the feature vector F_B of the object-bounding-box B is concatenated with the element-wise average and maximum vectors to create a longer feature vector that describes the image contents within the bounding box B.
  • the fine-tuned neural network obtained from the feature learning unit 103 is used to extract features from exact sub-image-patch regions. The element-wise average and maximum of the feature vectors are used to describe the image content.
  • in step s706, the concatenated feature vectors and the ground-truth labels of the object-bounding-boxes B are used to train the binary classifier (for example, the SVM as discussed above) detector to output a likelihood score for every possible object class that the box might belong to.
  • Fig. 8 illustrates a flow chart of the operations for the sub-boxes detector unit 104 according to another embodiment of the present application (Following multiple-feature SVM) .
  • the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w.
  • in step s802, for each object-bounding-box B, its feature vector F_B and the feature vectors from the sub-image-patches are used to train separate support vector machines. For example, where there are 4 sub-image-patches, F_B and the 4 feature vectors from the 4 sub-image-patches are used to train 5 separate support vector machines.
  • step s803 given a new object-bounding-box B and its feature vector extracted by the fine-tuned neural network obtained from feature learning unit 103, the corresponding support vector machine is applied to calculate a likelihood score for each object class.
  • in step s804, for each sub-image-patch w, the sub-boxes detector unit 104 first calculates its overlapping ratios with all proposed object-bounding-boxes B using the equation O_{w,B} = S_{w∩B} / (S_w + S_B - S_{w∩B}),
  • where S_w, S_B, and S_{w∩B} are the size of the sub-image-patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image-patch w and the object-bounding-box B, respectively.
  • a predetermined threshold for example, 0.5
  • the corresponding trained support vector machine of w is used to test all its candidate corresponding bounding boxes. For each candidate bounding box, the trained support vector machine generates a score for each possible object class in step s805. The highest score of each object class from all candidate windows is chosen as the class likelihood score for w.
  • in step s806, the object-bounding-box and its (for example, 4) sub-image-patches are associated with a plurality of (for example, 5) sets of object class likelihood scores; the sets of scores are normalized independently and summed together to output a set of object class likelihoods.
  • the contextual information unit 105 is configured to exploit contextual information to improve detection performance.
  • the contextual information unit 105 receives at least one image and receives the candidate bounding boxes from the unit 102.
  • the unit 105 further retrieves scores of the sub-boxes detector from the sub-boxes detector unit 104 and the contextual information from feature learning unit 103, i. e. classification score outputted from the first network.
  • the unit 105 utilizes the pre-trained network and the fine-tuned network to train one binary classifier (for example, an SVM) for each detection class of the candidate bounding boxes, so as to output n binary classifiers that predict an n-dimension vector for each candidate bounding box.
  • Fig. 9 illustrates a flow chart of the operations for the contextual information unit 105 according to another embodiment of the present application.
  • the contextual information unit 105 utilizes the pre-trained network to output the classification score vector s_c (contextual information) for the whole of the received image, where L_c is the number of classification categories.
  • s_c(i) is the probability of the i-th classification class, i.e., the i-th of the m classes in the pre-training set.
  • the contextual information unit 105 operates to concatenate the classification score s_c and the detection score s_d obtained by the sub-boxes detector unit 104 for each bounding box in this image.
  • a new one-vs-all binary classifier (SVM) is trained for each of the n detection classes with contextual modeling.
  • the feature vector x_B may be concatenated from s_d(j) and a sparse feature vector, scaled by a weight, built from the classification scores s_c (a minimal sketch of this scoring scheme is given after this list).
  • in order to avoid over-fitting to the training data, some irrelevant dimensions of the feature vector are set to zero in step s903.
  • in step s904, the contextual information unit 105 operates to train one binary classifier for each detection class. The most relevant classes in the classification task are selected for the j-th class in the detection task; dimensions of the sparse feature vector corresponding to the selected classes are kept, and the other dimensions are set to zero. Then the final score is outputted as the score of the binary classifier in step s905.
  • the multi-class object detection system 100 has been discussed. It shall be understood that there may be several models obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit. For example, the configuration of the network created by the feature learning unit may be changed to have different layers. As these models share the same selective search unit, the candidate boxes are the same for all models. For each candidate box, different models may output different scores for different classes.
  • the prediction device 20 may further comprise a model average unit (not shown).
  • the model average unit is configured to utilize the advantages of several models and make the performance better. As the system needs to detect instances of multiple classes, different training settings may result in different performance. For example, one model setting may be better on some classes while another model may come out better on other classes. The model average unit is used to select different models for each class.
  • the model average unit tries to find a combination list for each class and averages the scores of the models in this list as the final score for each candidate box.
  • Fig. 14 illustrates a flow chart of the operations for the model average unit according to other embodiment of the present application.
  • in step s1401, the model average unit creates one empty list for each class. Multiple models can be obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit. Those models share the same selective search unit.
  • in step s1402, for each class, this unit starts by selecting the best model as the starting point, and tries to find one more model (s1403) such that the performance on this class would be better by averaging the scores of those two models (the best model and said one more model); it then adds this model to the list in step s1408.
  • steps s1402-s1407 are repeated until no more models can be added or the performance would become worse if one more model were added. The above procedure is repeated for all classes.
  • the model average unit outputs one model list for each class.
  • The neural network
  • the neural network structure consists of several kinds of layers.
  • Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application.
  • Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
  • this layer receives images and their labels, where x_{ij} is the j-th bit value of the d-dimensional feature vector of the i-th input image region, and y_{ij} is the j-th bit value of the n-dimensional label vector of the i-th input image region.
  • the convolution layer receives the output from the data layer and performs convolution, padding, sampling, and non-linear transformation operations.
  • the deformation layer is designed to learn the deformation constraints for different object parts. For a given channel of a convolution layer C with size V*H, the deformation layer takes small blocks of size (2R+1)*(2R+1) from that convolution layer C, subsamples each block to a block B, and produces a single output from that block.
  • both i and j range from -R to R.
  • the deformation layer takes the P part detection maps as input and outputs P part scores, and the deformation layer can capture multiple patterns simultaneously.
  • the output of convolution layer and deformation layer can be regarded as discriminative features.
  • the fully connected layer takes the discriminative feature as input and computes the inner product between the feature and the weights. Then a non-linear transformation is applied to the product.
  • The prediction device 20
  • for each test image, the prediction device 20 outputs predicted bounding boxes (x, y, w, h) and the corresponding scores for the n object classes of the test image.
  • Fig. 12 is a schematic diagram illustrating an exemplary block diagram for the prediction device 20 according to one embodiment of the present application. As shown in Fig. 12, the prediction device 20 comprises a selective search module 201, a region rejection module 202, a feature learning module 203, a sub-boxes detector module 204 and a context information module 205.
  • Fig. 13 illustrates a flow chart for the process showing how these modules cooperate to output predicted bounding boxes (x, y, w, h) and the corresponding scores for the predicted bounding boxes.
  • the selective search module 201 receives at least one test image and then proposes a number of candidate bounding boxes in the test image.
  • the received image includes a plurality of instances of (n) object classes (n semantic classes).
  • in step s1302, the region rejection module 202 selects some boxes from the large number of candidate bounding boxes according to Formula 1. Once a candidate box is rejected, this box will be thrown away. Only bounding boxes that pass the region rejection module will be passed to the following module, as discussed in reference to the training device.
  • the feature learning module 203 calculates the classification features for each candidate box through using the fine-tuned network obtained from the training device. Here the fine-tuned network takes the image regions corresponding to the bounding boxes as input and calculates the classification features from the last hidden layer of the fine-tuned network.
  • the sub-boxes detector module 204 receives the calculated classification features from the module 203 and then uses the sub-boxes detectors (binary classifier detectors) obtained from the training device 10 to calculate the n class scores s_d for each candidate box.
  • the sub-boxes detector module 204 calculates the classification features of the plurality of sub-image-regions (for example, 4 sub-image-regions), obtaining the classification features for each sub-image-region using the fine-tuned network obtained in the training device 10. Then the sub-boxes detector module 204 calculates the classification scores s_d using the sub-boxes detectors (binary classifier detectors) trained in the training device 10.
  • the sub-boxes detector (SVM detector) finds the one bounding box having the max overlap value with each sub-image-region, calculates the feature for that bounding box using the fine-tuned network and uses this feature to represent that sub-image-region. Once all four sub-image-regions get their corresponding representing features, the element-wise-max and element-wise-average values are extracted from those four representing features. The concatenated feature vectors multiplied with the binary classifier (SVM) weights obtained in the training device produce the scores s_d.
  • the sub-boxes detector module 204 uses the detection network (i.e. the second network) obtained in the training device 10 to calculate the classification score s_d; then the context information module 205 concatenates the s_d from the previous step with the s_c calculated in this step, and finally multiplies the concatenated vector with the weights of the binary classifier (SVM) obtained from the training device 10 in step s1305.
  • the product is the final score for the candidate bounding boxes proposed by the selective search module 201. It shall be understood that there may be several models obtained by changing the settings of the feature learning unit and the sub-boxes detector unit. As these models share the same selective search unit, the candidate boxes are the same for all models. For each candidate box, different models would output different scores for different classes.
  • the prediction device 20 may further comprise a model average unit (not shown).
  • the final scores are obtained by averaging the final scores of the multiple models selected by this model average unit for each candidate box, in the same way as discussed in reference to the training device 10.
  • modules 201-205 are omitted herein since they function in the same way as the units 101-105 of the training device 10 as discussed above.
  • the system 100 has been discussed for the case in which it is implemented using certain hardware with specific circuitry or a combination of hardware and software. It shall be appreciated that the systems 10 and 100 may also be implemented using software.
  • embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.
  • the system 100 may run in a general-purpose computer, a computer cluster, a mainframe computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
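The contextual scoring described in the bullets above can be summarized with a short sketch. The following Python snippet is a minimal illustration under assumptions, not the patented implementation: the function name, the plain scalar weight applied to the sparse context vector, and the linear classifier weights are all placeholders introduced for clarity.

```python
import numpy as np

def contextual_score(s_d, s_c, relevant, weight, svm_w, svm_b):
    """Hypothetical sketch of the contextual scoring for one detection class j.

    s_d      : (n,) detection scores for the candidate box (sub-boxes detector output)
    s_c      : (L_c,) classification scores for the whole image (pre-trained network)
    relevant : indices of the classification classes kept for this detection class;
               all other dimensions of the context vector are set to zero
    weight   : scalar weight applied to the sparse context vector
    svm_w, svm_b : weights and bias of the per-class binary classifier (e.g. a linear SVM)
    """
    sparse_context = np.zeros_like(s_c)
    sparse_context[relevant] = s_c[relevant]              # keep only the most relevant classes
    x_b = np.concatenate([s_d, weight * sparse_context])  # concatenated feature vector x_B
    return float(np.dot(svm_w, x_b) + svm_b)              # final score for this class
```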

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a device for training neural networks of multi-class object detection. The device may comprise a feature learning unit and a sub-boxes detector unit. According to one embodiment of the present application, the feature learning unit is configured to determine a first neural network based on training images of a first training image set, wherein each of the images has a plurality of bounding boxes with objects inside, and the determined first neural network outputs contextual information for an inputted image; and to determine a second neural network based on bounding boxes of the images in the first training image set and then further fine-tune the second neural network based on bounding boxes of the images in second training image set. The sub-boxes detector unit is configured to determine sub-boxes detector scores for the bounding boxes based on the second neural network, each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.

Description

[Title established by the ISA under Rule 37.2] METHOD AND SYSTEM FOR MULTI-CLASS OBJECT DETECTION
Technical field
The present application relates to a method and a system for multi-class object detection, the aim of which is to automatically detect instances of objects of different classes in digital images and videos.
Background
The aim of object detection is to detect instances of objects of a certain class in digital images and videos. The performance of object detection systems depends heavily on image representation, the quality of which can be influenced by many kinds of variations, such as viewpoints, illuminations, poses, and occlusions. Due to such uncontrollable factors, it is non-trivial to design a robust image representation that is sufficiently discriminative to represent a large number of object classes.
Substantial efforts have been dedicated to designing hand-crafted features, such as Gabor, SIFT, and HOG, for representing images. Typically, object detection based on hand-crafted features involves extracting multiple features on the landmarks of images at multiple scales, and concatenating them into high-dimensional feature vectors.
Deep Convolutional Neural Networks (CNNs) have been applied to learn features directly from raw pixels. For the object detection task, existing deep CNN learning methods pre-train the CNN using images without bounding box ground truth, and subsequently fine-tune the deep neural net using another set of images with bounding box ground truth. Typically, the image set used for fine-tuning has fewer semantic classes than the image set used for pre-training. In addition, the number of semantic classes in the image set used for fine-tuning equals the number of actual classes we wish to detect.
Summary of invention
In one aspect, disclosed is a device for training neural networks of multi-class object detection. The device may comprise a feature learning unit and a sub-boxes detector unit. According to one embodiment of the present application, the feature learning unit is configured to determine a first neural network based on training images of a first training image set, wherein each of the training images has a plurality of bounding boxes with objects inside; and determine a second neural network based on bounding boxes of the training images of the first training image set and then further fine-tune the second neural network based on bounding boxes of training images of a second training image set. The sub-boxes detector unit is configured to determine a binary classifier detector for the bounding boxes of the first and the second image sets based on the second neural network, each score of the determined binary classifier detector predicting one semantic object class inside one of the bounding boxes
In another aspect, disclosed is a device for multi-class object detection, comprising: a feature learning module configured to determine a plurality of classification features for each candidate bounding box of an inputted image; a sub-boxes detector module configured to utilize a pre-trained detection neural network to calculate a plurality of detection class scores for each candidate box based on the classification features determined by the feature learning module (203); and a context information module configured to concatenate the calculated classification class scores, and determine a final score for the candidate bounding box, the final score representing one semantic object class inside one of the bounding boxes of the inputted image.
In a further aspect, disclosed is a system for multi-class object detection, which comprises a training device configured to determine a classification neural network and a detection neural network from a plurality of predetermined training image sets. The system further comprises a prediction device, comprising a feature learning module configured to determine a plurality of features for each candidate bounding box of an inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and operates to output detection features for the candidate bounding box; a sub-boxes detector module configured to utilize the classification neural network to calculate a plurality of classification class scores for each candidate bounding box based on said detection features; and a context information module configured to concatenate the calculated classification class scores, and determine, based on the detection neural network, a final score for the candidate bounding box, the final score representing the semantic object class inside the box.
In a further aspect, disclosed is a method for training neural networks of multi-class object detection, comprising:
determining a first neural network based on training images of a first training image set, wherein each of the images has a plurality of bounding boxes with objects inside, and the determined first neural network outputs contextual information for an inputted image;
determining a second neural network based on bounding boxes of the images in the first training image set;
fine-tuning the second neural network based on bounding boxes of the images in second training image set; and
determining sub-boxes detector scores for the bounding boxes based on the second neural network, each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.
In a further aspect, disclosed is a method for training neural networks of multi-class object detection, comprising:
determining a first neural network based on a plurality of bounding boxes of a first training image set;
determining a second neural network based on bounding boxes of the images in a second training image set, the determined first neural network outputting contextual  information for an inputted image; and
determining sub-boxes detector scores for the bounding boxes based on the second neural network, each of sub-boxes detector scores predicting one value for one of the bounding boxes for one semantic object class.
In addition, the present application further proposes a method for multi-class object detection, comprising:
determining a classification neural network, a detection neural network, a plurality of sub-boxes detectors and a plurality of context information detectors from a plurality of predetermined training image sets;
determining a plurality of features for each candidate bounding box of an inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and calculates features values from a last hidden layer of the detection neural network;
calculating a plurality of classification class scores for each candidate box based on the classification neural network; and
concatenating the calculated classification class scores, so as to determine, based on the detection neural network, a final score for the candidate bounding box through the determined sub-boxes detector.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating an exemplary system for multi-class object detection according to one embodiment of the present application.
Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device according to one embodiment of the present application.
Fig. 3 illustrates a flow chart of the operations for the selective search  unit according to one embodiment of the present application.
Fig. 4 illustrates a flow chart of the operations for the feature learning unit according to one embodiment of the present application.
Fig. 5 illustrates a flow chart for the feature learning unit to train a neural network according to one embodiment of the present application.
Fig. 6 illustrates sub-image patches according to one embodiment of the present application.
Fig. 7 illustrates a flow chart of the operations for the sub-boxes detector unit according to one embodiment of the present application.
Fig. 8 illustrates a flow chart of the operations for the sub-boxes detector unit according to another embodiment of the present application.
Fig. 9 illustrates a flow chart of the operations for the contextual information unit according to another embodiment of the present application.
Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application.
Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
Fig. 12 is a schematic diagram illustrating an exemplary block diagram for the prediction device according to one embodiment of the present application.
Fig. 13 is a flow chart for the process showing how to output predicted bounding boxes and the corresponding scores for the predicted bounding boxes according to one embodiment of the present application.
Fig. 14 illustrates a flow chart of the operations for the model average unit according to other embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments,  examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts.
Fig. 1 is a schematic diagram illustrating an exemplary system 100 for multi-class object detection according to one embodiment of the present application. As illustrated in Fig. 1, the system 100 for multi-class object detection may comprise a training device 10 and a prediction device 20.
The training device 10 is configured to retrieve a pre-determined training set, which contains a set of images, each of which is labeled with designated bounding boxes (x, y, w, h), where (x, y) is the top-left coordinate of a bounding box, h is the height of the bounding box, and w is the width of the bounding box. In one embodiment of the present application, each box contains a target semantic object. The training device 10 then determines a classification neural network, a detection neural network, a plurality of (n) sub-boxes detectors and a plurality of (n) context information detectors from the retrieved training set. Once the training device 10 has finished the training procedure, the prediction device 20 can use the networks, sub-boxes detectors and context detectors to detect semantic classes in images. The prediction device 20 takes an image as input, and outputs bounding box coordinates (x, y, w, h), with which each box contains a target semantic object.
Fig. 2 is a schematic diagram illustrating an exemplary block diagram of training device 10 according to one embodiment of the present application. As shown, the training device 10 may comprise a selective search unit 101, a region rejection unit 102, a feature learning unit 103, asub-boxes detector unit 104 and a contextual information unit 105, which will be discussed in details as below.
Selective search unit 101
The selective search unit 101 is configured to retrieve at least one digital image of videos, propose an over-complete set of candidate bounding boxes that may have objects inside for each retrieved image, and then output a plurality of positive and negative candidate bounding boxes (x, y, w, h). Fig. 3 illustrates a flow chart of the operations for the selective search unit 101 according to one embodiment of the present application. In step s301, the selective search unit 101 operates to resize each of the retrieved images to a fixed width, e.g. 500 pixels. In step s302, the selective search unit 101 performs super-pixel segmentation on each of the images to obtain a set of bounding box locations for each image, for example, a small set of data-driven, class-independent, high quality bounding box locations. In step s303, the selective search unit 101 compares the candidate bounding boxes (i.e. the obtained bounding boxes) with manually labeled bounding boxes to determine whether the overlap between a candidate bounding box and the manually labeled bounding boxes is larger than a predetermined threshold (in terms of overlap area ratio), for example 0.5. If yes, the bounding box will be regarded as a positive sample in step s304, whereas those with overlap less than 0.5 will be regarded as negative samples in step s305. A minimal sketch of this labeling rule is given below.
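The following Python sketch illustrates the positive/negative labeling rule of steps s303-s305. It assumes the overlap area ratio is measured as intersection-over-union and uses hypothetical helper names; the patent does not prescribe this exact implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, ground_truth_boxes, threshold=0.5):
    """Label each candidate box as positive or negative by its best overlap with the labeled boxes."""
    positives, negatives = [], []
    for box in proposals:
        best = max((iou(box, gt) for gt in ground_truth_boxes), default=0.0)
        (positives if best > threshold else negatives).append(box)
    return positives, negatives
```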
Region rejection unit 102
The region rejection unit 102 is configured to throw away a large part of the candidate bounding boxes, according to their scores, to make the following procedure faster. This unit 102 is applied only on the fine-tuning set. In other words, the region rejection unit 102 receives at least one image of videos and the obtained positive and negative candidate bounding boxes (x, y, w, h), and determines which of the obtained positive and negative candidate bounding boxes will be filtered out based on the received images.
In one embodiment of the present application, the region rejection unit 102 operates to obtain an object detection score for each positive and negative candidate bounding box. The region rejection unit 102 may apply any existing object detector on the input images to obtain an object detection score for each positive and negative candidate bounding box (x, y, w, h). Denote the detection scores for the n classes of the i-th candidate bounding box as s_i. The i-th candidate bounding box is rejected if the following rejection condition is satisfied:
||s_i|| < γ,    (Formula 1)
where ||s_i|| = max_j {s_{i,j}},
i is the sample index,
j is the class index, and
γ is a pre-determined threshold.
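As an illustration only (the detector used to produce the scores is left open in the text above), the rejection rule of Formula 1 can be applied as in the following sketch; the array layout and threshold value are assumptions.

```python
import numpy as np

def reject_candidates(scores, gamma=1.0):
    """Apply the rejection condition of Formula 1.

    scores : (num_boxes, n) array, scores[i, j] = detection score of box i for class j
    gamma  : pre-determined threshold
    Returns a boolean mask that is True for the boxes that are kept.
    """
    max_per_box = scores.max(axis=1)        # ||s_i|| = max_j s_{i,j}
    return max_per_box >= gamma             # reject boxes with ||s_i|| < gamma
```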
Feature learning unit 103
The feature learning unit 103 is used to train a neural network whose last hidden-layer values are regarded as features. In one embodiment of the present application, the feature learning unit 103 receives, as its inputs, a pre-training set, a fine-tuning set and the filtered bounding boxes, and then determines, based on these inputs, a fine-tuned neural network, wherein the values outputted from the last hidden layer of the fine-tuned neural network will be regarded as features. The pre-training set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h). The pre-training set encompasses m object classes. The fine-tuning set may consist of images and the corresponding ground truth bounding boxes (x, y, w, h). The fine-tuning set encompasses n object classes.
Fig. 4 illustrates a flow chart of the operations for the feature learning unit 103 according to one embodiment of the present application. In step s401, the unit 103 operates to pre-train the first neural network using the images in the pre-training set with the positive and negative bounding boxes as determined by the selective search unit 101. To be specific, the feature learning unit 103 may utilize a back-propagation algorithm to train a neural network. Fig. 5 illustrates a flow chart for the feature learning unit 103 to train a neural network. As shown, in step s4011, the feature learning unit 103 creates a neural network and then randomly initializes the created network. The configuration of the created network will be discussed later.
And then in step s4012, the feature learning unit 103 calculates the pre-defined loss function for the inputted images in the pre-training set and the candidate positive and negative image regions corresponding to the positive and negative bounding boxes. The loss function can be described as Loss = f(x, y, θ), where x is the bounding box, y is its label, and θ stands for all parameters, including the convolution filters, deformation layer weights, fully connected weights and biases in the created network. If x is a positive candidate bounding box, then its y should be a non-zero value; if one ground truth box has the max overlap value with x, then y should be the value of the class that ground truth box belongs to. The whole training process for the neural network tries to minimize the loss over the whole set of training images. In step s4013, the feature learning unit 103 calculates the gradient of the loss with respect to all the parameters, that is, ∂Loss/∂θ. Then in step s4014, the update process can be described as θ ← θ - lr·∂Loss/∂θ, where lr is a prefixed learning rate. In step s4015, the feature learning unit 103 checks whether the stopping criterion, for example, whether the loss value on the validation set is increasing, is satisfied. If not, the feature learning unit 103 returns to step s4012 and runs through steps s4012-s4015 until the stopping criterion is satisfied. A minimal sketch of this update loop is given below.
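The gradient-descent loop of steps s4012-s4015 can be sketched as follows. This is a generic illustration in Python, not the network described in the patent: the loss and gradient functions, the parameter vector, and the early-stopping test are placeholders supplied by the caller.

```python
import numpy as np

def train_by_backprop(theta, loss_fn, grad_fn, val_loss_fn, lr=0.01, max_iters=10000):
    """Generic sketch of steps s4012-s4015: compute the loss, compute the gradient,
    update the parameters, and stop when the validation loss starts increasing."""
    prev_val_loss = np.inf
    for _ in range(max_iters):
        _ = loss_fn(theta)                 # s4012: loss on the training regions
        grad = grad_fn(theta)              # s4013: gradient w.r.t. all parameters
        theta = theta - lr * grad          # s4014: theta <- theta - lr * dLoss/dtheta
        val_loss = val_loss_fn(theta)      # s4015: stopping criterion on the validation set
        if val_loss > prev_val_loss:
            break
        prev_val_loss = val_loss
    return theta
```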
Returning to Fig. 4, once the first neural network is created and pre-trained, a second neural network with the same structure as the pre-trained neural network is created in step s402. In step s403, the second neural network is initialized using the parameters of the pre-trained neural network. In step s404, the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer having n nodes. In step s405, the feature learning unit 103 operates to fine-tune the second neural network using the bounding boxes of the images in the pre-training set and then further fine-tune it using the bounding boxes of the images in the fine-tuning set.
Alternatively, in steps s4012-s4015, the first neural network may be trained/tuned by using bounding boxes of the pre-training set, and then in step s405, the feature learning unit 103 operates to fine-tune the second neural network using the  bounding boxes of the images in the fine-tune set.
It shall be appreciated that the pre-training step (step s401) uses the whole images in the pre-training set to train the first neural network, while the fine-tuning step (step s405) uses the image regions (bounding boxes containing objects) in the pre-training set and then further uses the fine-tuning set to train the second neural network. As discussed above in reference to step s404, for the second network, the feature learning unit 103 operates to replace the output layer of the second neural network, which has m nodes, with a new output layer having n nodes; thus the difference between the pre-training step (step s401) and the fine-tuning step (step s405) is that the last layer of the first network has m nodes whereas the last layer of the second network has n nodes.
Prior art methods often use the whole images in the pre-training set to train the first neural network and use image regions (bounding boxes containing objects) in the fine-tuning set to train the second neural network. In contrast to that training scheme, the process proposed above in the present application also uses the image regions (bounding boxes containing objects) in the pre-training set, which improves the feature learning performance of the feature learning unit. A sketch of the output-layer replacement of step s404 is given below.
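The second-network construction of steps s402-s404 (same structure, copied parameters, m-node output layer replaced by an n-node layer) can be illustrated with the following PyTorch-style sketch. The attribute name `output` and the use of torch are assumptions made for illustration; the patent does not specify a framework.

```python
import copy
import torch.nn as nn

def build_second_network(pretrained_net: nn.Module, n_classes: int) -> nn.Module:
    """Create the second network: same structure and parameters as the pre-trained
    network, with the m-node output layer replaced by a new n-node output layer."""
    net = copy.deepcopy(pretrained_net)             # s402/s403: copy structure and parameters
    in_features = net.output.in_features            # assumes the last layer is stored as `net.output`
    net.output = nn.Linear(in_features, n_classes)  # s404: new randomly initialized n-node output layer
    return net
```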
Sub-boxes detector unit 104
The sub-boxes detector unit 104 receives at least one image and the candidate bounding boxes (i. e. the boxes outputted from unit 102) , and then utilizes the fine-tuned network trained by the unit 103 to output a plurality of (n) Support Vector Machine (SVM) detectors, each of which predicts one value for one candidate bounding box for one semantic object class, such that a plurality of (n) Support Vector Machine (SVM) detectors will be obtained for the prediction unit (to be discussed later) to predict detection scores for n object classes. Herein, the SVM is discussed as an example only, and any other binary classifier may be used in the embodiments of the present application.
For each candidate bounding box B, the sub-boxes detector unit 104 calculates the feature vector F_B, using the fine-tuned neural network obtained from the feature learning unit 103, to describe the candidate bounding box's contents, and further divides the box into a plurality of sub-image patches. Fig. 6 illustrates 4 sub-image patches as an example. It should be appreciated that a different number of sub-image patches can be used in the embodiments of the present application.
Fig. 7 illustrates a flow chart of the operations of the sub-boxes detector unit 104 according to one embodiment of the present application (following the max-average SVM scheme). In step s701, the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w. In step s702, for each sub-image patch w, the sub-boxes detector unit 104 calculates its overlapping ratios with all object-bounding-boxes B using the following equation
Ow,B = Sw∩B / (Sw + SB - Sw∩B),    Formula 2)
where Sw, SB, and Sw∩B are the size of the sub-image patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image patch w and the object-bounding-box B, respectively.
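As a purely illustrative aid, a minimal Python sketch of this overlapping ratio (Formula 2) is given below; boxes are assumed to be (x, y, width, height) tuples, which is an assumption of the sketch rather than a requirement of the disclosed method.

def overlap_ratio(w_box, b_box):
    # Formula 2): O = S(w ∩ B) / (S(w) + S(B) - S(w ∩ B)); returns 0 when the boxes do not overlap.
    wx, wy, ww, wh = w_box
    bx, by, bw, bh = b_box
    ix = max(0.0, min(wx + ww, bx + bw) - max(wx, bx))   # width of the intersection
    iy = max(0.0, min(wy + wh, by + bh) - max(wy, by))   # height of the intersection
    inter = ix * iy
    union = ww * wh + bw * bh - inter
    return inter / union if union > 0 else 0.0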
In step s703, for each sub-image patch w, the object-bounding-box B*w with the highest overlapping ratio is chosen as its corresponding box, i.e., B*w = argmaxB Ow,B. The feature vector FB*w of that object-bounding-box is assigned to the sub-image patch w to describe its contents.
In step s704, for each object-bounding-box proposal B with its sub-image patches, the element-wise average Favg and the element-wise maximum Fmax of the feature vectors FB*w of the plurality of sub-image patches are calculated as

Favg = averagew (FB*w),    Formula 3)
Fmax = maxw (FB*w),    Formula 4)

where the average and the maximum are taken element-wise over the sub-image patches of B.
In step s705, the feature vector FB of the object-bounding-box B is concatenated with Favg and Fmax to create a longer feature vector [FB, Favg, Fmax] to describe the image contents within the bounding box B. In one embodiment of the present application, the fine-tuned neural network obtained from the feature learning unit 103 is used to extract features from the exact sub-image-patch regions, and the element-wise average and maximum of those feature vectors are used to describe the image content.
In step s706, the concatenated feature vectors and the ground-truth labels of the object-bounding-boxes B are used to train the binary classifier detector (for example, the SVM discussed above), which outputs a likelihood score for every possible object class that the box might belong to.
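A minimal sketch of steps s701-s705 follows; it assumes the overlap_ratio helper sketched above and assumes that the fine-tuned network has already produced a feature vector for every candidate box, and the function name and argument layout are hypothetical.

import numpy as np

def max_average_feature(f_b, sub_patches, candidate_boxes, candidate_feats):
    # Build the concatenated descriptor [F_B, F_avg, F_max] of steps s701-s705.
    patch_feats = []
    for w in sub_patches:
        ratios = [overlap_ratio(w, b) for b in candidate_boxes]
        best = int(np.argmax(ratios))            # step s703: box with the highest overlap
        patch_feats.append(candidate_feats[best])
    patch_feats = np.stack(patch_feats)
    f_avg = patch_feats.mean(axis=0)             # Formula 3): element-wise average
    f_max = patch_feats.max(axis=0)              # Formula 4): element-wise maximum
    return np.concatenate([f_b, f_avg, f_max])   # step s705: longer feature vector

The returned vector, together with the ground-truth labels, would then be fed to the binary classifier training of step s706.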
Fig. 8 illustrates a flow chart of the operations of the sub-boxes detector unit 104 according to another embodiment of the present application (following the multiple-feature SVM scheme). In step s801, the sub-boxes detector unit 104 divides the received bounding box into a plurality of (for example, 4) sub-image patches w. In step s802, for each object-bounding-box B, its feature vector FB and the feature vectors from the sub-image patches are used to train separate support vector machines. For example, where there are 4 sub-image patches, the feature vector FB together with the 4 feature vectors from the 4 sub-image patches are used to train 5 separate support vector machines.
In step s803, given a new object-bounding-box B and its feature vector extracted by the fine-tuned neural network obtained from the feature learning unit 103, the corresponding support vector machine is applied to calculate a likelihood score for each object class.

In step s804, for each sub-image patch w, the sub-boxes detector unit 104 first calculates its overlapping ratios with all proposed object-bounding-boxes B using the following equation

Ow,B = Sw∩B / (Sw + SB - Sw∩B),    Formula 5)

where Sw, SB, and Sw∩B are the size of the sub-image patch w, the size of the object-bounding-box B, and the size of the intersected region of the sub-image patch w and the object-bounding-box B, respectively.
Only object-bounding-boxes B with overlapping ratios to the sub-image patch w greater than a predetermined threshold (for example, 0.5) are selected as candidate corresponding bounding boxes for w in step s805.
The corresponding trained support vector machine of w is used to test all its candidate corresponding bounding boxes. For each candidate bounding box, the trained support vector machine generates a score for each possible object class in step s805. The highest score of each object class from all candidate windows is chosen as the class likelihood score for w.
In step s806, the object-bounding-box and its (for example, 4) sub-image patches are associated with a plurality of (for example, 5) sets of object class likelihood scores; the sets of scores are normalized independently and summed together to output a set of object class likelihoods.
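A minimal sketch of the score fusion of steps s805-s806 is given below; since the text does not fix the normalization, an L2 normalization of each score set is assumed here for illustration only.

import numpy as np

def fuse_multiple_feature_scores(score_sets):
    # score_sets: one per-class score vector per region (e.g. 5 of them: the whole box
    # plus 4 sub-image patches); each set is normalized independently and then summed.
    fused = np.zeros(len(score_sets[0]), dtype=float)
    for s in score_sets:
        s = np.asarray(s, dtype=float)
        norm = np.linalg.norm(s)
        fused += s / norm if norm > 0 else s
    return fused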
Contextual information unit 105
The contextual information unit 105 is configured to exploit contextual information to improve detection performance. The contextual information unit 105 receives at least one image and the candidate bounding boxes from the unit 102. The unit 105 further retrieves the scores of the sub-boxes detector from the sub-boxes detector unit 104 and the contextual information from the feature learning unit 103, i.e. the classification score outputted from the first network. The unit 105 then utilizes the pre-trained network and the fine-tuned network to train one binary classifier (for example, an SVM) for each detection class of the candidate bounding boxes, so as to output n classes of binary classifiers that predict an n-dimension vector for each candidate bounding box.
Fig. 9 illustrates a flow chart of the operations for the contextual information unit 105 according to another embodiment of the present application.
In step s901, the contextual information unit 105 utilizes the pre-trained network to output the classification score (contextual information) sc = [sc(1), ..., sc(Lc)] for the whole of the received image, where Lc is the number of classification categories and sc(i) is the probability of the i-th classification class, i.e., the i-th of the m classes in the pre-training set.

In step s902, the contextual information unit 105 operates to concatenate the classification score sc and the detection score sd obtained by the sub-boxes detector unit 104 for each bounding box in this image. After the scores sc and sd have been calculated for all images and their bounding boxes, a new one-vs-all binary classifier (SVM) is trained for each of the n detection classes with contextual modeling. To train the j-th binary classifier, the feature vector xB may be concatenated from sd(j) and a sparse feature vector sc' with a weight η, i.e., by rule of:

xB = [sd(j), η·sc'],    Formula 6)
In order to avoid over-fitting to the training data, some irrelevant dimensions of the feature vector sc' are set to zero in step s903. And then, in step s904, the contextual information unit 105 operates to train one binary classifier for each detection class. Let Ωj select the most relevant classes in the classification task for the j-th class in the detection task: if i∈Ωj, then sc'(i) = sc(i); otherwise sc'(i) = 0. The final score is then outputted as the score of the binary classifier in step s905.
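For illustration, a minimal sketch of building the contextual feature vector of Formula 6 is shown below; the function name, the representation of Ωj as a plain index set, and the parameter names are assumptions of this sketch.

import numpy as np

def contextual_feature(sd_j, sc, omega_j, eta):
    # Formula 6): x_B = [s_d(j), eta * s_c'], where s_c'(i) = s_c(i) for i in Omega_j
    # (the classification classes most relevant to detection class j) and 0 otherwise.
    sc = np.asarray(sc, dtype=float)
    sc_sparse = np.zeros_like(sc)
    idx = list(omega_j)
    sc_sparse[idx] = sc[idx]
    return np.concatenate([[sd_j], eta * sc_sparse])

The resulting vectors xB over all training boxes are what the j-th one-vs-all binary classifier would be trained on in step s904.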
The Model Average Unit
Hereinabove, one of the arrangements (one model of the system) for the multi-class object detection system 100 has been discussed. It shall be understood that several models may be obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit. For example, the configuration of the network created by the feature learning unit may be changed to have different layers. As these models share the same selective search unit, the candidate boxes are the same for all models; for each candidate box, different models may output different scores for different classes.
In one embodiment of the present application, the prediction device 20 may further comprise a model average unit (not shown). The model average unit is configured to exploit the advantages of several models and thereby improve performance. Since instances of multiple classes need to be detected, different training settings may result in different performance; for example, one model setting may be better for some classes while another model may perform better on other classes. The model average unit is used to select different models for each class.
The model average unit tries to find a combination list of models for each class and averages the scores of the models in this list as the final score for each candidate box. Fig. 14 illustrates a flow chart of the operations of the model average unit according to another embodiment of the present application. In step s1401, the unit creates one empty list for each class. Multiple models can be obtained by changing the settings of the feature learning unit, the sub-boxes detector unit and the contextual information unit; these models share the same selective search unit.
In step s1402, for each class, the unit selects the best model as the starting point, and then tries to find one more model (step s1403) such that the performance for this class improves when the scores of the two models (the best model and said one more model) are averaged; this model is then added to the list in step s1408. Steps s1402-s1407 are repeated until no more models can be added, or until adding one more model would worsen the performance. The above procedure is repeated for all classes, so that the model average unit outputs one model list for each class.
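A minimal sketch of this greedy per-class model selection is shown below; the evaluate callable (for example a per-class average precision) is a hypothetical stand-in, since the disclosure does not fix the performance measure.

import numpy as np

def select_models_for_class(model_scores, labels, evaluate):
    # model_scores: one score array per model over the candidate boxes of this class.
    remaining = list(range(len(model_scores)))
    best = max(remaining, key=lambda i: evaluate(model_scores[i], labels))   # step s1402
    chosen, current = [best], evaluate(model_scores[best], labels)
    remaining.remove(best)
    improved = True
    while improved and remaining:                                            # steps s1403-s1408
        improved = False
        for i in list(remaining):
            trial = evaluate(np.mean([model_scores[j] for j in chosen + [i]], axis=0), labels)
            if trial > current:
                chosen.append(i)
                remaining.remove(i)
                current = trial
                improved = True
                break
    return chosen   # the model list for this class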
The neural network
Hereinafter, the neural network created and trained by the feature learning unit 103 will be discussed.
The neural network structure consists of several kinds of layers. Fig. 10 is a schematic diagram illustrating an exemplary configuration of neural network structure according to one embodiment of the present application. Fig. 11 is a schematic diagram illustrating an exemplary configuration of deformation layer of the network according to one embodiment of the present application.
Data layer
This layer receives the images {xi} and their labels {yi}, where xij is the j-th bit value of the d-dimension feature vector of the i-th input image region, and yij is the j-th bit value of the n-dimension label vector of the i-th input image region.
Convolution layer
The convolution layer receives the output from the data layer and performs convolution, padding, sampling, and non-linear transformation operations.
Deformation layer
Since objects have different sizes and many semantic parts, filters with different sizes are added into the convolution layer. One filter with one size produces one score map, which describes the corresponding part information. The deformation layer is designed to learn the deformation constraints for the different object parts. For a given channel of a convolution layer C with size V*H, the deformation layer takes small blocks of size (2R+1)*(2R+1) from that convolution layer C and subsamples them, with subsampling steps kh and kv, into an output map B, producing a single output for each block as follows:

B(x,y) = max over i,j of ( C(x+i, y+j) - Σn cn·dn,(i,j) ),

where (x, y) is the center of the (2R+1)*(2R+1) block, both i and j range from -R to R, kh and kv are the subsampling steps, and cn and dn are the deformation parameters to be learned.
The deformation layer takes the P part detection maps as input and outputs P part scores, and it can capture multiple patterns simultaneously. The outputs of the convolution layer and the deformation layer can be regarded as discriminative features.
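For illustration only, one plausible Python sketch of such a deformation layer is given below; the exact block-scoring rule is an assumption reconstructed from the description above, not a verbatim reproduction of the disclosed formula.

import numpy as np

def deformation_layer(conv_map, c, d, R, kv, kh):
    # conv_map: one V*H channel of the convolution layer; c: (N,) and d: (N, 2R+1, 2R+1)
    # are the deformation parameters to be learned; kv, kh are the subsampling steps.
    V, H = conv_map.shape
    penalty = np.tensordot(c, d, axes=1)   # sum_n c[n] * d[n], shape (2R+1, 2R+1)
    out = []
    for y in range(R, V - R, kv):          # subsampled block centres
        row = []
        for x in range(R, H - R, kh):
            block = conv_map[y - R:y + R + 1, x - R:x + R + 1]
            row.append(np.max(block - penalty))   # best part placement minus deformation cost
        out.append(row)
    return np.array(out)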
Fully connected layer
The fully connected layer takes the discriminative feature as input and computes the inner product between the feature and its weights; a non-linear transformation is then applied to the result.
The prediction device 20
Hereinafter, the prediction device 20 will be discussed in detail. For each test image, the prediction device 20 outputs predicted bounding boxes (x, y, w, h) and the corresponding scores for the n object classes of the test image. Fig. 12 is a schematic diagram illustrating an exemplary block diagram of the prediction device 20 according to one embodiment of the present application. As shown in Fig. 12, the prediction device 20 comprises a selective search module 201, a region rejection module 202, a feature learning module 203, a sub-boxes detector module 204, and a context information module 205. Fig. 13 illustrates a flow chart of the process showing how the modules 201-205 cooperate to output the predicted bounding boxes (x, y, w, h) and the corresponding scores for the predicted bounding boxes.
In step s1301, the selective search module 201 receives at least one test image and then proposes a number of candidate bounding boxes in the test image. The received image includes a plurality of instances of (n) object classes (n semantic classes).

In step s1302, the region rejection module 202 selects boxes from the large number of candidate bounding boxes by rule of Formula 1. Once a candidate box is rejected, it is thrown away; only bounding boxes passing through the region rejection module are passed to the following module, as discussed in reference to the training device. In step s1303, the feature learning module 203 calculates the classification features for each candidate box using the fine-tuned network obtained from the training device. Here, the fine-tuned network takes the image regions corresponding to the bounding boxes as input and calculates the classification features from its last hidden layer.
In step s1304, the sub-boxes detector module 204 receives the calculated classification features from the module 203 and then uses the sub-boxes detector (binary classifier detector) obtained from the training device 10 to calculate the n-class scores sd for each candidate box. Here, the sub-boxes detector module calculates the classification features of the plurality of sub-image regions (for example, 4 sub-image regions) for each candidate box using the fine-tuned network obtained in the training device 10, and then calculates the classification scores sd using the sub-boxes detectors (binary classifier detectors) trained in the training device 10. As discussed above, the feature outputted from the last hidden layer of the second network (the detection network or fine-tuned network) is taken as the classification feature and input into the sub-boxes detector module 204, which applies the binary classifier detector (for example, an SVM detector) learned in the training device to output a detection score sd = w·x + b, where x represents the feature of the bounding box received from the module 203, while w and b are the parameters learned/determined in training.
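For illustration, a minimal sketch of this per-class detection score is given below; the parameter names are hypothetical.

import numpy as np

def detection_score(x, w, b):
    # s_d = w * x + b for one detection class, where x is the concatenated feature of a
    # candidate bounding box and (w, b) were learned by the sub-boxes detector in training.
    return float(np.dot(w, x) + b)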
If the sub-boxes detector unit in the training device 10 follows the max-average SVM scheme, the sub-boxes detector (SVM detector) finds the one bounding box having the maximum overlap value with each sub-image region, calculates the feature of that bounding box using the fine-tuned network, and uses this feature to represent that sub-image region. Once all four sub-image regions have obtained their corresponding representing features, the element-wise-max and element-wise-average values are extracted from those four representing features. The concatenated feature vector [FB, Favg, Fmax], multiplied with the binary classifier (SVM) weights obtained in the training device, produces the scores sd.
Once the sub-boxes detector module 204 has used the detection network (i.e. the second network) obtained in the training device 10 to calculate the detection score sd, the context information module 205 concatenates sd with the classification score sc calculated in this step, and finally multiplies the concatenated vector with the weights of the binary classifier (SVM) obtained from the training device 10 in step s1305. The product is the final score for the candidate bounding boxes proposed by the selective search module 201. It shall be understood that there may be several models obtained by changing the settings of the feature learning unit and the sub-boxes detector unit. As these models share the same selective search unit, the candidate boxes are the same for all models, and for each candidate box different models may output different scores for different classes. In one embodiment of the present application, the prediction device 20 may further comprise a model average unit (not shown). For each class, the final scores are obtained by averaging the final scores of the multiple models selected by this model average unit for each candidate box, in the same way as discussed above in reference to the training device 10.
It will be appreciated that a more detailed description of the modules 201-205 is omitted herein, since they function in the same way as the units 101-105 of the training device 10 discussed above.
In the above, the system 100 has been discussed for the case in which it is implemented using certain hardware with specific circuitry, or a combination of hardware and software. It shall be appreciated that the systems 10 and 100 may also be implemented using software. In addition, the embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.
In the case that the system 100 is implemented with software, the system 100 may run on a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated to providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be considered as comprising the preferred examples and all the variations or modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and the equivalent technique, they also fall within the scope of the present invention.

Claims (36)

  1. A device for training neural networks of multi-class object detection, comprising:
    a feature learning unit (103) configured to,
    determine a first neural network based on training images of a first training image set, wherein each of the training images has a plurality of bounding boxes with objects inside; and
    determine a second neural network based on bounding boxes of the training images of the first training image set and then further fine-tune the second neural network based on bounding boxes of training images of a second training image set; and
    a sub-boxes detector unit (104) configured to determine a binary classifier detector for the bounding boxes of the first and the second image sets based on the second neural network, each score of the determined binary classifier detector predicting one semantic object class inside one of the bounding boxes.
  2. A device for training neural networks of multi-class object detection, comprising:
    a feature learning unit (103) configured to determine a first neural network based on a plurality of bounding boxes of a first training image set, and then to determine a second neural network based on bounding boxes of the images of a second training image set; and
    a sub-boxes detector unit (104) configured to determine a binary classifier detector for the bounding boxes based on the determined second neural network, each score of the determined binary classifier detector predicting one semantic object class inside one of the bounding boxes.
  3. A device according to claim 1 or 2, wherein the determined first neural network operates to output contextual information for an image inputted thereto,
    the device further comprising:
    a contextual information unit (105) configured to retrieve each score of the binary classifier detector from the sub-boxes detector unit (104) and the contextual information from the feature learning unit (103) so as to train a binary classifier detector for each detection class to predict each of the bounding boxes.
  4. A device according to claim 3, further comprising:
    a selective search unit (101) configured to retrieve at least one inputted image, and then determine the bounding boxes with objects inside for each retrieved image.
  5. A device according to claim 3, further comprising:
    a region rejection unit (102) configured to filter out bounding boxes from the determined boxes based on a predetermined threshold.
  6. A device according to claim 1 or 2, wherein the feature learning unit (103) determines the first neural network using the training images of the first training image set through a back-propagation algorithm.
  7. A device according to claim 1 or 2, wherein the feature learning unit (103) determines the second neural network through a back-propagation algorithm.
  8. A device for multi-class object detection, comprising:
    a feature learning module (203) configured to determine a plurality of classification features for each candidate bounding box of an inputted image;
    a sub-boxes detector module (204) configured to utilize a pre-trained detection neural network to calculate a plurality of detection classes scores for each candidate box based on the classification features determined by the feature learning module  (203) ; and
    a context information module configured to concatenate the calculated classification class scores, and determine a final score for the candidate bounding box, the final score representing one semantic object class inside one of the bounding boxes of the inputted image.
  9. A system for multi-class object detection, comprising:
    a training device (10) configured to determine a classification neural network, and a detection neural network from a plurality of predetermined training image sets;
    a prediction device (20) , comprising:
    a feature learning module (203) configured to determine a plurality of features for each candidate bounding box of an inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and operates to output detection features for the candidate bounding box;
    a sub-boxes detector module (204) configured to utilize the classification neural network to calculate a plurality of classification class scores for each candidate bounding box based on said detection features; and
    a context information module (205) configured to concatenate the calculated classification classes scores, and determine, based on the detection neural network, a final score for the candidate bounding box, the final score representing semantic object class inside the box.
  10. A system according to claim 9, wherein the training device (10) further comprises:
    a feature learning unit (103) configured to,
    determine the classification neural network based on training images of a first training image set, wherein each of the images has a  plurality of bounding boxes with objects inside, and the determined classification neural network outputs contextual information for an image inputted thereto; and
    determine the detection neural network based on bounding boxes of the images in the first training image set and then further fine-tune the detection neural network based on bounding boxes of the images in second training image set; and
    a sub-boxes detector unit (104) configured to determine binary classifier detectors for the bounding boxes based on the detection neural network, each score of the determined binary classifier detector predicting one semantic object class for one of the bounding boxes.
  11. A system according to claim 9, wherein the training device (10) further comprises:
    a feature learning unit (103) configured to determine the classification neural network based on a plurality of bounding boxes of a first training image set, and then to determine the detection neural network based on bounding boxes of the images of a second training image set; and
    a sub-boxes detector unit (104) configured to determine a binary classifier detector for the bounding boxes based on the detection neural network, each score of the determined binary classifier detector predicting one semantic object class for one of the bounding boxes.
  12. A system according to claim 10 or 11, wherein the determined classification neural network is capable of outputting contextual information for an image inputted thereto, and the system further comprises:
    a contextual information unit (105) configured to retrieve scores of the binary classifier detector from the sub-boxes detector unit (104) and the contextual information from the feature learning unit (103) so as to train a binary classifier detector for each detection class of the bounding boxes for predicting each bounding box.
  13. A system according to claim 12, further comprising:
    a selective search unit (101) configured to retrieve at least one inputted image, and then determine the bounding boxes with objects inside for each retrieved image.
  14. A system according to claim 13, further comprising:
    a region rejection unit (102) configured to filter out bounding boxes from the determined boxes based on a predetermined threshold.
  15. A system according to claim 11 or 12, wherein the feature learning unit (103) determines the classification neural network using the images of the first image training set through a back-propagation algorithm.
  16. A system according to claim 11 or 12, wherein the feature learning unit (103) determines the detection neural network through a back-propagation algorithm.
  17. A system according to claim 11 or 12, wherein the sub-boxes detector unit (104) is configured to determine scores of the binary classifier detector based on a max-average SVM.
  18. A system according to claim 11 or 12, wherein the binary classifier detector unit (104) is configured to determine scores of the binary classifier detector based on a multiple-feature SVM.
  19. A method for training neural networks of multi-class object detection, comprising:
    determining a first neural network based on training images of a first training image set, wherein each of the images has a plurality of bounding boxes with objects inside;
    determining a second neural network based on bounding boxes of the images of the first training image set;
    fine-tuning the second neural network based on bounding boxes of the images of a second training image set; and
    determining a binary classifier detector for the bounding boxes based on the second neural network, each of scores of the binary classifier detector predicting one semantic object class inside one of the bounding boxes.
  20. A method for training neural networks of multi-class object detection, comprising:
    determining a first neural network based on a plurality of bounding boxes of a first training image set;
    determining a second neural network based on bounding boxes of the images of a second training image set; and
    determining a binary classifier detector for the bounding boxes based on the second neural network, each score of the determined binary classifier detector predicting one semantic object class for one of the bounding boxes.
  21. A method according to claim 19 or 20, wherein, the determined first neural network outputting contextual information for an inputted image,
    the method further comprising:
    training the binary classifier detector for each detection class of the bounding boxes for predicting each bounding box based on the score of the binary classifier detector and the contextual information.
  22. A method according to claim 21, further comprising:
    retrieving at least one inputted image; and
    determining the bounding boxes with objects inside for each retrieved image.
  23. A method according to claim 21, further comprising:
    filtering out bounding boxes from the determined boxes based on a predetermined threshold.
  24. A method according to claim 19 or 20, wherein the first neural network is determined by using the images of the first image training set through a back-propagation algorithm.
  25. A method according to claim 19 or 20, wherein the second neural network is determined through a back-propagation algorithm.
  26. A method for multi-class object detection, comprising:
    determining a plurality of classification features for each candidate bounding box of an inputted image;
    calculating a plurality of classification classes scores for each candidate box based on the determined classification features;
    concatenating the calculated classification classes scores, and
    determining, from concatenated classes scores, a final score for the candidate bounding box through a pre-trained binary classifier detector, wherein the final score is used to predict one semantic object class inside one of the bounding boxes.
  27. A method for multi-class object detection, comprising:
    1) determining a classification neural network, a detection neural network, a plurality of binary classifier detectors from a plurality of predetermined training image sets;
    2) determining a plurality of features for each candidate bounding box of an  inputted image based on the detection neural network, wherein the detection neural network takes the candidate bounding box as input and operates to calculate classification features for the inputted box;
    3) calculating, by using the classification neural network, a plurality of classification classes scores for each candidate box based on the calculated features; and
    4) concatenating the calculated classification classes scores, so as to determine, based on the detection neural network, a final score for the candidate bounding box through the determined binary classifier detector so as to predict one semantic object class inside one of the bounding boxes.
  28. A method according to claim 27, wherein the step 1) further comprises:
    determining the classification neural network based on training images of a first training image set, wherein each of the images has a plurality of bounding boxes with objects inside; and
    determining the detection neural network based on bounding boxes of the images of the first training image set and then further fine-tuning the detection neural network based on bounding boxes of the images of a second training image set; and
    determining the binary classifier detectors for the bounding boxes based on the detection neural network, each score of the binary classifier detector predicting one semantic object class for one of the bounding boxes.
  29. A method according to claim 27, wherein the step 1) further comprises:
    determining the classification neural network based on a plurality of bounding boxes of a first training image set;
    determining the detection neural network based on bounding boxes of the images of a second training image set; and
    determining a binary classifier detector for the bounding boxes based on the detection neural network, each score of the binary classifier detector predicting one semantic object class inside one of the bounding boxes.
  30. A method according to claim 28 or 29, wherein the determined classification neural network outputs contextual information for an image inputted thereto,
    the method further comprising:
    training the binary classifier detector for each detection class of the bounding boxes for predicting each bounding box based on the binary classifier detector scores and the contextual information.
  31. A method according to claim 30, further comprising:
    retrieving at least one inputted image, and
    determining the bounding boxes with objects inside for each retrieved image.
  32. A method according to claim 31, further comprising:
    filtering out bounding boxes from the determined boxes based on a predetermined threshold.
  33. A method according to claim 28 or 29, wherein the classification neural network is determined by using the images of the first image training set through a back-propagation algorithm.
  34. A method according to claim 28 or 29, wherein the detection neural network is determined through a back-propagation algorithm.
  35. A method according to claim 28 or 29, wherein scores of the binary classifier detector are determined based on a max-average SVM.
  36. A method according to claim 28 or 29, wherein scores of the binary classifier detector are determined based on a multiple-feature SVM.