
WO2020020472A1 - A computer-implemented method and system for detecting small objects on an image using convolutional neural networks - Google Patents


Info

Publication number
WO2020020472A1
WO2020020472A1 (PCT application PCT/EP2018/072857)
Authority
WO
WIPO (PCT)
Prior art keywords
feature map, candidate, object detection, regions, region
Prior art date
Legal status
Ceased
Application number
PCT/EP2018/072857
Other languages
French (fr)
Inventor
Victor Manuel BREA SÁNCHEZ
Manuel Felipe MUCIENTES MOLINA
Brais BOSQUET MERA
Current Assignee
Fundacion Centro Tecnoloxico De Telecomunicacions De Galicia
Universidade de Santiago de Compostela
Original Assignee
Fundacion Centro Tecnoloxico De Telecomunicacions De Galicia
Universidade de Santiago de Compostela
Priority date
Filing date
Publication date
Application filed by Fundacion Centro Tecnoloxico De Telecomunicacions De Galicia and Universidade de Santiago de Compostela
Priority to ES202190001A (ES2908944B2)
Publication of WO2020020472A1

Classifications

    • G06F18/00 Pattern recognition
    • G06V20/13 Scene-specific elements; terrestrial scenes; satellite images
    • G06F18/24137 Classification techniques; distances to cluster centroids
    • G06N3/045 Neural network architectures; combinations of networks
    • G06V10/25 Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/454 Local feature extraction; integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • The present disclosure relates to the field of image analysis and, more particularly, to methods and systems for detecting small objects on an image.
  • Object detection has made great progress through deep convolutional neural networks (CNN).
  • Initial approaches combined region proposal methods based on different techniques with deep convolutional networks that automatically extracted very deep features from those regions and, finally, generated a bounding box and the corresponding object category.
  • Current solutions integrate feature extraction, region proposal, and bounding box and object category in the CNN, in some cases with a fully convolutional architecture.
  • a straightforward solution for small object detection would be to modify a state-of-the-art CNN keeping the resolution of the initial image in all the feature maps.
  • This approach is non-viable because, due to the size of the network, it would not fit on a GPU (Graphics Processing Unit) and, also, the forward pass would be very slow.
  • the input image goes through a number of convolutional layers for feature extraction up to the RPN.
  • the RPN is based on anchors, which are predefined regions of different sizes and aspect ratios to cope with multiple scales.
  • the anchors are centered at the sliding window and, for each position and anchor, a fixed length feature vector is generated with a set of convolutional layers.
  • The outputs of the RPN are the coordinates of the bounding boxes and their corresponding classes, namely, object and background.
  • the bounding box and class of the object are determined through a fully-connected classification network.
  • the off-the-shelf Faster-R-CNN is not adequate for small object detection due to two reasons.
  • The detection of small objects requires a finer global effective stride in Faster-R-CNN. This leads to a very large increase in memory, making the implementation impossible for current GPUs.
  • Li et al. [5] introduces a Perceptual Generative Adversarial Network for small object detection.
  • the aim is to enhance the representation of small objects to be similar to that of large ones. This is done by looking for the structural correlations of the objects at different scales.
  • This approach has two networks. First, the generator network transforms the original poor features of small objects to highly discriminative ones. Then, the discriminator network estimates the probability that the input representation belongs to a real large object and, finally, it classifies the proposal and runs bounding box regression.
  • The present disclosure introduces a new CNN architecture for small object detection that solves the aforementioned problems, allowing detection of small targets equal to or under 256 square pixels.
  • the global effective stride must be low, which requires a new architecture in order to keep a reasonable memory overhead.
  • the proposed solution is an image object detector and, as such, it does not feature temporal information as the video object detectors reported in [12] and [13].
  • the present disclosure introduces a new CNN architecture for small object detection.
  • the proposed CNN architecture has a size that is significantly lower than its counterparts for the same resolution of the last feature map.
  • the present invention considers the hypothesis that, after a few convolutional layers, the feature map contains enough information to decide which regions of the image contain candidate objects, but there is not enough data to classify the region or to perform bounding box regression.
  • The present invention applies a novel component, called Region Context Network (RCN), which is a filter that selects the most promising regions of the feature map (all of them with the same size), avoiding the processing of the remaining areas of the image.
  • The RCN ends up with an RoI (Region of Interest) Collection Layer (RCL), which builds a new and reduced filtered feature map by arranging all the regions selected by the RCN. Therefore, the memory overhead of the feature maps following the RCN is much lower, but with the same spatial resolution, as the reduction in size is due to the deletion of the least promising regions with small objects.
  • A computer-implemented method for detecting small objects on an image using convolutional neural networks comprises the following steps:
  • the first set of candidate regions are determined by applying a first convolution operation to the input feature map to obtain an intermediate convolutional layer and an associated intermediate feature map; applying a second convolution operation to the intermediate feature map to obtain a class feature map including class scores as candidate objects; and selecting a determined number of regions in the input feature map according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions includes the selected regions.
  • the step of arranging the first set of candidate regions to form a reduced feature map may comprise concatenating the candidate regions and adding an inter region 0-padding between adjacent candidate regions.
  • the method may also comprise a preprocessing stage wherein the number and the size of the anchors used in the Region Proposal Network are automatically learned through k-means applied to a training set of ground truth boxes.
  • The number of anchors is preferably automatically obtained by performing an iterative k-means with an increasing number of kernels until the maximum inter-kernel IoU ratio exceeds a certain threshold.
  • an image object detector based on convolutional neural networks comprises:
  • a feature extractor module configured to apply one or more convolution operations to an input image to obtain a first set of convolutional layers and an input feature map corresponding to the last convolutional block of said first set; and configured to apply one or more convolution operations to a reduced feature map to obtain a second set of convolutional layers and an output feature map corresponding to the last convolutional block of said second set.
  • a region context network module configured to analyze the input feature map to determine a first set of candidate regions containing candidate objects.
  • An RoI collection layer module configured to arrange the first set of candidate regions to form the reduced feature map.
  • a Region Proposal Network module configured to obtain, from the output feature map, a second set of candidate regions containing candidate objects
  • a classifier module configured to classify and apply bounding box regression to each candidate region of the second set to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image.
  • the region context network module is configured to apply a first convolution operation to the input feature map to obtain an intermediate convolutional layer and an associated intermediate feature map; apply a second convolution operation to the intermediate feature map to obtain a class feature map including class scores as candidate objects; and select a determined number of regions in the input feature map according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions includes the selected regions.
  • The RoI collection layer module is preferably configured to form the reduced feature map by concatenating the candidate regions and adding an inter region 0-padding between adjacent candidate regions.
  • the image object detector may be implemented, for instance, in a processor or a GPU.
  • the present invention also refers to an object detection system for detecting small objects on an image using convolutional neural networks.
  • the object detection system comprises an image object detector as previously defined and a camera configured to capture an input image.
  • a vehicle comprising an object detection system as previously defined and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution by one or more vehicle systems of the vehicle.
  • the vehicle may be, for instance, an unmanned aerial vehicle.
  • an airspace surveillance system comprising an object detection system as previously defined, wherein the camera of the object detection system is mounted on a ground location and is configured to monitor an airspace region; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution.
  • a ground surveillance system comprising an object detection system as previously defined, wherein the object detection system is installed on an aerial platform or vehicle and the camera of the object detection system is configured to monitor a ground region; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution.
  • a detect and avoid system installed onboard a vehicle comprising an object detection system as previously defined, wherein the camera of the object detection system is configured to monitor a region in front of the vehicle; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action to avoid potential collisions.
  • The invention also refers to a computer program product for detecting small objects on an image using convolutional neural networks, comprising at least one computer-readable storage medium having recorded thereon computer code instructions that, when executed by a processor, cause the processor to perform the method as previously defined.
  • the main contributions of the present invention are:
  • a new CNN for small object detection that is able to work with high resolution feature maps in the deeper layers while having a size that is significantly lower than other CNNs.
  • the present invention relies on a novel component, RCN, that selects the most promising regions of the image and generates a new and filtered feature map with these areas. Therefore, the filtered feature maps can keep the same resolution but with a lower memory overhead and a higher frame rate.
  • the present invention uses an RPN that works with anchors, wherein the number and sizes of the anchors can be automatically selected using a novel algorithm based on k-means.
  • the automatic definition of the anchors with k-means improves the classical heuristic approach.
  • The fully convolutional network of the present invention is focused on small targets equal to or under 256 square pixels. It includes an early visual attention mechanism, RCN, to choose the most promising regions with small objects and their context. RCN makes it possible to work with high resolution feature maps with a reduced memory usage, as the regions with the least likely objects are deleted from the filtered feature maps. The filtered feature maps, which only contain the most likely regions with small objects, are forwarded across the network up to the ending Region Proposal Network (RPN), and then classified. RCN is key to increasing localization accuracy through finer spatial resolution due to finer global effective strides, smaller memory overhead and higher frame rates.
  • Figure 1 shows the structure of a CNN object detector according to the prior art.
  • Figure 2 depicts the steps performed by a CNN object detector according to the present invention.
  • FIG. 3 depicts the RCN architecture of the present invention.
  • Figure 4 shows some examples of the feature maps obtained by the RCN.
  • Figure 5 is a schematic diagram of an image object detector according to an embodiment of the present invention.
  • Figure 6 depicts a vehicle with the image object detector installed onboard.
  • Figure 7 depicts the steps performed by the image object detector according to an embodiment of the invention.
  • Figure 8 depicts the steps performed by an ensemble of residual blocks from early or late convolutions of the image object detector in order to extract features from the input feature map.
  • Figure 9 shows an embodiment of the image object detector of Figure 5 applied to airspace surveillance.
  • Figure 10 depicts, according to another embodiment, the image object detector of Figure 5 applied to ground surveillance from an aerial position.
  • Figure 11 depicts, according to yet another embodiment, the image object detector of Figure 5 applied to detect and avoid applications.
  • the present disclosure refers to a system and a computer-implemented method for detecting objects on an image using convolutional neural networks.
  • FIG. 1 schematically depicts, according to the prior art, the internal structure of an object detector using convolutional neural networks, CNN object detector 100, which receives and processes an input image 102 to obtain an object classification in the input image 104, thereby detecting the presence of objects in the input image 102.
  • A feature extractor 110 of the CNN object detector 100 sequentially applies N successive convolution operations (111, 113, 115), obtaining for each convolution operation a convolution layer and the associated feature maps (112, 114, 116) which will be used in the next convolution operation.
  • A Region Proposal Network (RPN) 120 is then applied to the last feature maps 116 obtained by the feature extractor 110.
  • A classifier 130 receives the output of the RPN 120 and the last feature maps 116 of the feature extractor 110 to determine the object classification in the input image 104, including the class of the object, using a fully-connected classification network.
  • a bounding box regression is also performed to obtain the bounding box in the input image 102 for the regions detected as objects.
  • the system of the present invention is a fully convolutional network that detects small objects.
  • the system only considers regions of the feature maps containing most likely objects, deleting those regions of the feature maps with least likely objects and building filtered feature maps with the same resolution but lower memory requirements. This way, the system works with high resolution feature maps while keeping a low memory overhead.
  • FIG. 2 schematically depicts the method performed by a CNN object detector, according to an embodiment of the present invention, to detect small objects on an input image using convolutional neural networks.
  • the method 200 comprises receiving an input image 102 and applying one or more convolution operations (early convolutions 210) to the input image 102 to obtain a first set of convolutional layers 212.
  • the first set of convolutional layers 212 is formed by two convolutional blocks (214, 216).
  • the last feature map of the convolutional block 216 of said first set 212 is an input feature map 302 for the convolution operations applied in the next step of the process (shown in more detail in Figure 3), referred to as Region Context Network (RCN) 220 in Figure 2.
  • the RCN 220 analyzes the input feature map 302 to determine a first set of candidate regions 222 in the input feature map 302 containing candidate objects.
  • the first set of candidate regions 222 are arranged to form a reduced feature map 228 (Rol collection layer).
  • One or more convolution operations (late convolutions 230) are then applied to the reduced feature map 228 to obtain a second set of convolutional layers 232.
  • the second set of convolutional layers 232 comprises two convolutional blocks (234, 236).
  • The last feature map of the last convolutional block 236 of said second set 232 is an output feature map of the late convolutions 230.
  • a Region Proposal Network (RPN) 240 is then applied to said output feature map to obtain a second set of candidate regions 242 (e.g. j candidate regions) in the output feature map containing candidate objects.
  • a classifier 250 classifies and applies bounding box regression to each candidate region of the second set 242 to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image 102.
  • Each of the selected candidate regions 242 may first be converted, prior to the classification and bounding box regression, to a fixed size feature map, obtaining j fixed size feature maps 248 (RoI pooling layers).
  • Figure 3 represents in more detail, according to an embodiment, the RCN 220 process to obtain the first set of candidate regions 222 in the input feature map 302.
  • the RCN 220 receives the input feature map 302 and applies a first convolution operation to the input feature map 302 to obtain an intermediate convolutional layer 224 and an associated intermediate feature map.
  • the first convolution operation is a convolution using a fixed kernel size that acts as a fixed size sliding window 304 that maps the input feature map 302.
  • the first convolution operation is a convolution with a 3x3 kernel size and 128 filters (i.e. a 128-d 3x3 convolution).
  • The RCN 220 applies a second convolution operation to the intermediate feature map to obtain a class feature map 226 (rcn-cls-layer) including class scores as candidate objects.
  • The second convolution operation is a convolution with a 1x1 kernel size and 2 filters (i.e. a 2-d 1x1 convolution).
  • the RCN 220 forms the first set of candidate regions 222 by selecting a determined number of regions in the input feature map according to the scores as candidate objects of the class feature map 226 (for instance, selecting the first n regions with the highest score).
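For illustration only, the following is a minimal PyTorch-style sketch of the RCN flow just described (3x3 convolution with ReLU, 1x1 two-class convolution, softmax objectness, and selection of a fixed number of fixed-size top-scoring regions). The module name, tensor shapes and the way region corners are derived from the top-scoring cells are assumptions rather than the patented code, and the confidence filtering and non-maximum suppression described later are omitted here.

```python
# Hedged sketch of the Region Context Network (RCN) head: 3x3 conv + ReLU,
# 1x1 two-class conv, softmax objectness, then selection of a fixed number of
# fixed-size top-scoring regions. Channel counts (256 in, 128 mid, 2 out),
# 50 regions and the 12-cell region side follow the embodiment in the text;
# the module name, the corner derivation and the use of PyTorch are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionContextNetwork(nn.Module):
    def __init__(self, in_channels=256, mid_channels=128,
                 num_regions=50, region_size=12):   # 12 cells = 48 px at stride 4
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, 2, kernel_size=1)   # fg / bg scores
        self.num_regions = num_regions
        self.region_size = region_size

    def forward(self, feature_map):                 # (1, 256, 180, 320), batch size 1
        x = F.relu(self.conv(feature_map))          # intermediate 128-d feature map
        scores = self.cls(x)                        # class feature map (1, 2, 180, 320)
        fg_prob = F.softmax(scores, dim=1)[:, 1]    # objectness score per cell
        h, w = fg_prob.shape[-2:]
        top = torch.topk(fg_prob.view(-1), self.num_regions).indices
        cy = torch.div(top, w, rounding_mode='floor')
        cx = top % w
        half = self.region_size // 2
        x1 = (cx - half).clamp(0, w - self.region_size)
        y1 = (cy - half).clamp(0, h - self.region_size)
        # Fixed-size candidate regions (x1, y1, x2, y2) in feature-map cells.
        # The real system additionally filters them by confidence and NMS.
        boxes = torch.stack([x1, y1, x1 + self.region_size, y1 + self.region_size], dim=1)
        return boxes, fg_prob
```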
  • The first set of candidate regions 222 are arranged to form a reduced feature map 228 (RoI collection layer, RCL) by concatenating the candidate regions 222 and adding an inter region 0-padding (shown as gaps in the figure) between candidate regions 222.
  • Figure 4 depicts an illustrative example of the reduced feature map 228 (RoI collection layer) for a particular input image 102. Figure 4 shows only 4 filters, out of the total 256 filters used in the example, of the RCN input (i.e. the input feature map 302), and only 7 filters (a row for each filter) out of a total of 256 of the RoI Collection Layer output (i.e. the reduced feature map 228).
  • FIG. 5 is a schematic diagram showing the components of an image object detector 500 based on convolutional neural networks (i.e. CNN object detector) according to an embodiment of the present invention.
  • the image object detector 500 of the present invention is a system (or part of a system) for detecting small objects on an image using convolutional neural networks.
  • the system may be implemented in a processing device including a processor, a GPU or a combination thereof (or any other kind of data processing device) and a computer-readable medium having encoded thereon computer-executable instructions to cause the processor/GPU to execute the method for detecting small objects on an image using convolutional neural networks as previously described.
  • The image object detector 500 comprises a feature extractor module 510, a region context network module 520, a region of interest (RoI) collection layer module 530, a Region Proposal Network module 540 and a classifier module 550.
  • the feature extractor module 510 is configured to apply one or more convolution operations 210 (early convolutions) to an input image 102 to obtain a first set of convolutional layers 212 and an input feature map 302 corresponding to the last convolutional block 216 of the first set of convolutional layers 212.
  • The region context network module 520 analyzes the input feature map 302, looking for and determining the most promising regions containing candidate objects (i.e. a first set of candidate regions 222). Regions are defined as areas of the image that might contain objects together with their context. The region context network module 520 assigns to each region a score, and the top scored regions (first set of candidate regions 222) are passed on to an RoI collection layer module 530, at the final stage of the RCN.
  • the region context network module 520 avoids forwarding the regions of the input image with least likely objects to the deepest convolutional layers, saving memory and increasing frame rate.
  • Memory saving is key to increasing spatial resolution through finer global effective strides across convolutional layers, mandatory in order not to miss the spatial localization of small objects.
  • The region context network module 520 selects the most likely candidate regions with one or more small objects together with their context, and returns them as a set of disjoint regions. Since at this stage the goal is not to get accurate object localization, neither a box regression approach nor a set of anchors with different scales and aspect ratios is needed. A single anchor of a given size suffices to return the most likely candidate regions with small objects.
  • the region context network module 520 first applies a 3 x 3 convolutional filter to each window of the input feature map 302, generating an intermediate 128-d layer with ReLU (Rectified Linear Unit) [17] following.
  • This structure feeds a box-classification layer (rcn-cls-layer) represented by a 1x1 convolutional 2-d layer ("fg", i.e. object, and "bg", i.e. no object) which scores regions obtained with sliding windows over the last early convolution (i.e. input feature map 302).
  • The ground truths of the objects are grown proportionally in all directions until equaling the anchor's defined size. Then, those anchors that have a considerable overlap with the modified ground truth (greater than 0.7 by default) are assigned positive labels, leaving as negative those regions that barely have an overlap (lower than 0.3 by default). As usual, the overlap is measured by the intersection-over-union (IoU) ratio.
  • The objectness score of the candidate regions in the RCN is minimized through the classification loss L({p_i}) = (1/N_cls) · Σ_i L_cls(p_i, p_i*), where p_i is the predicted probability of the i-th anchor being an object in an RCN mini-batch and p_i* is the adapted ground-truth label; the term 1/N_cls normalizes the equation by the mini-batch size; and L_cls is a softmax loss over the object / not-object categories.
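In code, this amounts to a cross-entropy (softmax) loss over the object/background labels of the mini-batch anchors, averaged over the mini-batch; a minimal, hypothetical illustration:

```python
# Minimal, hypothetical illustration of the RCN objectness loss: a softmax
# (cross-entropy) loss over object/background labels of the 64-anchor
# mini-batch, averaged over the mini-batch, i.e. (1/N_cls) * sum_i L_cls(p_i, p_i*).
import torch
import torch.nn.functional as F

logits = torch.randn(64, 2)                  # predicted fg/bg scores for one mini-batch
labels = torch.randint(0, 2, (64,))          # adapted ground-truth labels p_i*
rcn_loss = F.cross_entropy(logits, labels)   # mean reduction performs the 1/N_cls normalization
```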
  • RCN 220 ends up with the so-called RoI collection layer (RCL) (Figure 3), implemented by the RoI collection layer module 530, which is configured to arrange the first set of candidate regions 222 to form a reduced feature map 228.
  • The RoI collection layer module 530 takes as input the feature map generated by the last early convolution and the top scored proposals from the RCN to return a single filtered feature map (reduced feature map 228) with the same information as that of the input feature map 302, but only for the set of selected regions. Successive convolutions with filters greater than 1x1 will affect the neighboring regions' outputs.
  • To avoid this, the RoI collection layer module 530 adds an inter-region 0-padding (shown by gaps between regions in Figure 3), so that the width of the reduced feature map is n·r_w + (n - 1)·p_d, where n is the number of regions from the RCN, r_w and r_h are the width and height of the regions in the RCL input feature map, and p_d is the size of the 0-padding between regions.
  • As an example, a 1280x720 input image has an RCL input feature map of 320x180, and the output RCL generates a 649x12 feature map: 50 regions of size 48x48 in the input image (12x12 in the RCL input feature map for stride 4) with a 1-pixel 0-padding, since 50·12 + 49·1 = 649; i.e. a 7.4x reduction in GPU memory usage (86.5% of the memory is saved).
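The sketch below illustrates one way such an RCL could be implemented: it crops the selected regions from the input feature map and concatenates them along the width with zero-padding columns in between. Only the dimensions quoted in the comments come from the example above; the function name and the use of PyTorch are assumptions.

```python
# Illustrative sketch of the RoI Collection Layer (RCL): crop the selected
# regions from the input feature map and concatenate them along the width,
# inserting zero-padding columns between adjacent regions. Only the dimensions
# (50 regions of 12x12 cells, 1-cell padding, 649-wide output) come from the
# example above; the function name and framework are assumptions.
import torch

def roi_collection_layer(feature_map, boxes, pad=1):
    """feature_map: (C, H, W) tensor; boxes: (n, 4) tensor of same-size regions
    given as (x1, y1, x2, y2) in feature-map cells."""
    c = feature_map.shape[0]
    region_h = int(boxes[0, 3] - boxes[0, 1])
    gap = torch.zeros(c, region_h, pad)                  # inter-region 0-padding
    crops = []
    for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        crops.append(feature_map[:, int(y1):int(y2), int(x1):int(x2)])
        if i < len(boxes) - 1:
            crops.append(gap)
    return torch.cat(crops, dim=2)                       # e.g. (256, 12, 649)

# Width check against the formula n*r_w + (n - 1)*p_d: 50*12 + 49*1 = 649,
# versus the 320x180 input map, i.e. roughly 7.4x less memory (about 86.5% saved).
```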
  • the feature extractor module 510 also applies one or more convolution operations 230 (late convolutions) to the reduced feature map 228 to obtain a second set of convolutional layers 232 and an output feature map 502 corresponding to the last convolutional block 236 of said second set 232.
  • Convolution operations 230 act independently on the first set of candidate regions 222 obtained by the region context network module 520, thanks to the inter-region 0-padding (displayed as gaps between the different candidate regions in Figure 3).
  • The feature extractor module 510 can be any of the most widely used state-of-the-art solutions found in the literature, e.g. ResNet [14], VGG [15], ZF [16], etc. ResNet-50 is preferably used, since it provides a good trade-off between accuracy, speed and GPU memory consumption [14].
  • the Region Proposal Network (RPN) module 540 is configured to obtain, using the output feature map 502, a second set of candidate regions 242 containing candidate objects.
  • The RPN module 540 performs an initial bounding box regression and classification as object (fg) and background (bg) [3], which are finally refined in the classifying stage.
  • the RPN module 540 is based on the RPN presented in [3], but including a set of modifications in order to deal with the fact that the coordinates of its input feature map do not correspond with those of the input image, i.e., the RPN input contains unsorted regions.
  • To map the regions on the input image to the RPN's training function, which is based on the IoU between anchors and ground truth, the region context network module 520 passes the 4 coordinates of every region as a parameter to the RPN module 540 to generate the anchors relative to those regions. Finally, the output of the bounding box regression is transformed back to the input image coordinates, as illustrated in the sketch below.
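A minimal sketch of this coordinate bookkeeping follows, assuming the left-to-right layout produced by the RCL sketch above and a global stride of 4; the helper name and argument packing are hypothetical.

```python
# Minimal sketch of the coordinate bookkeeping: a location in the reduced
# (concatenated) feature map is mapped back to input-image coordinates using
# the per-region offsets passed from the RCN. Layout assumptions: fixed-width
# regions concatenated left to right with a one-cell gap, global stride 4.
def reduced_map_to_image_xy(x_red, y_red, regions, stride=4, region_w=12, pad=1):
    """regions: list of (x1, y1, x2, y2) tuples in input-feature-map cells,
    in the order in which the RCL concatenated them."""
    slot = region_w + pad                        # width occupied by one region plus its gap
    idx = min(x_red // slot, len(regions) - 1)   # which concatenated region we are in
    x_in_region = x_red - idx * slot             # horizontal offset inside that region
    x1, y1, _, _ = regions[idx]
    # back to input-feature-map cells, then to input-image pixels via the stride
    return (x1 + x_in_region) * stride, (y1 + y_red) * stride
```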
  • the approaches that rely on RPNs define the number of anchors and their sizes heuristically.
  • both the number and the size of the anchors are learned through k-means (i.e. automatic anchors initialization by k-means).
  • This approach can be adopted by any other object detection network with anchors, e.g. Faster-R-CNN, regardless of the target size of the objects.
  • The k-means anchor learning procedure is implemented as a preprocessing stage: k-means is applied to the heights and widths of the training set of ground truth boxes.
  • An iterative k-means with an increasing number of kernels is performed until the maximum inter-kernel IoU exceeds a certain threshold.
  • The threshold is set to 0.5, which is the value used in well-known repositories, such as PASCAL VOC [18] or MS COCO [19], to check whether a detection is positive or negative with respect to a ground truth.
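The following is one plausible reading of this procedure, sketched in Python with scikit-learn's KMeans: the ground-truth widths and heights are clustered, the number of clusters is increased until two centroid anchors overlap with IoU above 0.5, and the last set that stayed below the threshold is kept. The stopping convention and the function names are assumptions.

```python
# One plausible reading of the automatic anchor initialization, sketched with
# scikit-learn's KMeans: cluster the ground-truth (width, height) pairs and
# grow the number of clusters until two centroid anchors overlap with IoU above
# the 0.5 threshold; the last set below the threshold is kept.
import numpy as np
from sklearn.cluster import KMeans

def anchor_iou(a, b):
    """IoU of two (w, h) anchors assumed centred on the same point."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def learn_anchors(gt_wh, iou_threshold=0.5, max_k=10):
    """gt_wh: (N, 2) array of ground-truth box widths and heights."""
    anchors = gt_wh.mean(axis=0, keepdims=True)          # k = 1 fallback
    for k in range(2, max_k + 1):
        centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(gt_wh).cluster_centers_
        max_iou = max(anchor_iou(centers[i], centers[j])
                      for i in range(k) for j in range(i + 1, k))
        if max_iou > iou_threshold:      # the new anchors became redundant: stop
            break
        anchors = centers
    return anchors                       # e.g. the 3 anchors used in the embodiment below
```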
  • the classifier module 550 is configured to classify and apply bounding box regression to each candidate region of the second set of candidate regions 242 to obtain, for each candidate region, a class score 552 as a candidate object and a bounding box 554 in the input image 102.
  • Figure 6 depicts an exemplary embodiment of the image object detector 500 installed onboard a vehicle 600, such as a boat, a car, an aircraft, an unmanned aerial vehicle or a drone.
  • Vehicle 600 includes an object detection system 610 for detecting small objects on an image using convolutional neural networks, wherein the object detection system 610 comprises a camera 612 (the term "camera" includes any device able to acquire an image or a set of images, such as a conventional camera or a video camera) configured to capture an input image 102, and the image object detector 500 as previously described in Figure 5.
  • the image object detector 500 is implemented in a processor or a GPU 614.
  • the vehicle 600 may also comprise a decision module 620 that receives the output of the object detection system (the class scores 552 and bounding boxes 554 for the candidate regions selected in the input image 102), and determines, based on the objects detected on the input image 102, one or more actions 622 to be executed by one or more vehicle systems 630 (e.g. communications system 632, navigation system 634 with on-board sensors 635, propulsion system 638) of the vehicle 600.
  • the action may be sent to the navigation system 634 (continuous line) and/or to the communications system 632 (dotted line).
  • The action may include, as an example, guiding the vehicle towards one of the small detected objects, depending on the class score obtained for said object or the size of the bounding box. This could be the case, for instance, when the bounding box is so small that the vehicle 600 is required to get closer in order to confirm the class score 552 assigned to the region.
  • Another case is when the class score assigned is of particular relevance to the vehicle, for example when the vehicle 600 is a drone patrolling a vast secure geographic area, such as a border between countries, looking for people entering that secure geographic area.
  • The navigation system 634 receives a displacement instruction 624 to move towards a determined location (e.g. a detected object) and computes an updated trajectory, which is executed by the propulsion system 638 (e.g. motors, etc.) of the vehicle 600.
  • the actions 622 may include reporting the detected objects 626 to an external entity, such as a server, using the communications systems 632 of the vehicle 600.
  • Figure 7 depicts the steps performed by an image object detector 500 according to an exemplary embodiment of the invention (this is merely an example, different parameters may be employed in other embodiments):
  • image object detector 500 takes an image or a video-frame as an input image 102.
  • the input image is scaled to HD resolution, 1280x720x3 (width x height x number of RGB color channels), keeping its width and height ratio.
  • Early convolutions 210: this ensemble is composed of a first convolution layer 710, a max-pooling layer 712 and a second residual block 714.
  • First convolution layer 710 gets the input image and applies a 7x7 kernel size with stride 2, padding 3 and 64 filters. This operation halves the width and height, returning a 640x360x64 feature map.
  • Max-pooling layer 712 transforms the 640x360x64 feature map into a 320x180x64 feature map through a max-pooling operation with a 3x3 kernel size and stride 2. From this point until the end, the image object detector 500 keeps the current resolution, that is, a resolution four times smaller than that of the original input image.
  • Second residual block 714: a residual block (Figure 8 depicts the steps performed by an ensemble of residual blocks [14] in order to extract features from the input feature map) composed of three blocks that increase the number of filters from 64 to 256, returning a 320x180x256 feature map (i.e. input feature map 302 in Figure 3).
  • RCN 220 consists of two convolutional layers (RCN convolution 720 and RCN class score convolution 722) and a layer for the proposal of regions (RCN proposal layer 724).
  • RCN convolution 720 applies a 3x3 kernel size (stride 1, padding 1) that acts as a 3x3 sliding window, mapping the input feature map 302 information in a 128-d output (320x180x128).
  • RCN class score convolution 722: a 1x1 convolution that learns the necessary characteristics to differentiate between object and non-object regions at each sliding window location (2-d).
  • Each unit of the feature map decides whether the anchor centered in that unit contains an object or not. This is done by comparing the activation values of the two units in the same spatial localization: one of them learns the foreground score and the other one the background score. Returns a 320x180x2 feature map.
  • RCN proposal layer 724: a custom layer that gets the class (object or non-object) scores from RCN class score convolution 722, calculates their regions' coordinates in the input image size and returns a first set of candidate regions 222 most likely to contain an object (50x4 rcn rois, where 50 is the number of regions and 4 are the coordinates for each region).
  • RCL 228 is another custom layer that obtains the first set of candidate regions 222 (rcn rois) from the RCN proposal layer 724 and the feature map information from the second residual block 714 (input feature map 302). With both inputs, it obtains the information from the feature map of the second residual block 714, but only within the selected regions. Then, it concatenates this information in a new output feature map whose size is given by the RCL output expression above. Successive convolutions with filters greater than 1x1 will affect the neighboring regions' outputs. To solve this problem, RCL adds an inter region 0-padding.
  • Late convolutions 230: this ensemble is composed of two residual blocks (third residual block 730 and fourth residual block 732, obtained according to the flow diagram of Figure 8).
  • Third residual block 730: composed of four blocks that take as input the output from RCL 228 and increase the number of filters from 256 to 512, returning a 649x12x512 feature map.
  • restore collection padding is applied (see Figure 8), an auxiliary layer which restores the padding between regions to zero.
  • Fourth residual block 732: a residual block composed of six blocks that increase the number of filters from 512 to 1024, returning a 649x12x1024 feature map. As in the previous case, restore collection padding is applied.
  • RPN 240 consists of three convolutional layers (RPN convolution 740, RPN class score convolution 744 and RPN bounding box regression convolution 746) and a layer for the proposal of regions (RPN proposal layer 748).
  • RPN convolution 740 applies a 3x3 kernel size (stride 1, padding 1) which acts as a 3x3 sliding window that maps the input feature map information into a 256-d output (649x12x256).
  • an auxiliary layer (remove collection padding 742) eliminates the 0-padding between regions since there are no more 3x3 convolutions to be applied on them, returning a 600x12x256 feature map.
  • RPN class score convolution 744: a 1x1 convolution that learns the necessary characteristics to differentiate between object and non-object at each sliding window location and for each defined anchor (6-d since 3 anchors are used). Returns a 600x12x6 feature map.
  • RPN bounding box regression convolution 746: a 1x1 convolution that learns the necessary characteristics to apply regression to each of the four coordinates of each anchor at each sliding window location (12-d since 3 anchors are used). Returns a 600x12x12 feature map.
  • RPN proposal layer 748: a custom layer that gets the first set of candidate regions 222 (rcn rois) from RCN proposal layer 724, the class (object or non-object) scores for each anchor from RPN class score convolution 744 and the coordinates for each anchor from RPN bounding box regression convolution 746.
  • RPN proposal layer 748 maps the sliding window locations for each anchor to the coordinates of those regions at the original input image. Then, it sorts those most likely to contain an object by the scores from RPN class score convolution 744.
  • The RPN proposal layer 748 also returns the coordinates of the 300 regions relative to the RPN input, i.e. the unsorted map of regions (scaled_rois), the second set of candidate regions 242 in Figure 2.
  • RoI pooling layer 248: this layer takes the feature map information of the fourth residual block 732 (i.e. output feature map 502) and the 300 unsorted regions scaled_rois (i.e. the second set of candidate regions 242).
  • The auxiliary layer remove collection padding eliminates the 0-padding between regions in the feature map of the fourth residual block 732 so that the size is 600x12x1024.
  • The RoI pooling layer 248 obtains the information from the feature map of the fourth residual block 732, but only within the selected regions, and converts them to a fixed size (14x14x1024) feature map. Also, the 300 regions go forward to the next stage.
  • Each region of interest from RoI pooling layer 248 is classified independently by the last residual block (fifth residual block 750, Figure 8) and an average pooling 752.
  • Fifth residual block 750: a residual block composed of three blocks. The first one halves the width and height. In addition, the block increases the number of filters from 1024 to 2048, returning a 7x7x2048 feature map.
  • Average pooling 752: an average pooling with a 7x7 kernel size reduces the dimension to 1x1x2048, ready to be classified by fully connected layers.
  • Decision function 760: for each region of interest, the final decision is taken based on two fully connected layers that transform the input 1x1x2048 array into the category of the object (class score fully connected layer 762) and its corresponding bounding box regression (bounding box regression fully connected layer 764).
  • The value obtained by the class score fully connected layer 762 passes through a Softmax function 766 to normalize the score into the range [0, 1] while, on the other hand, a transformation function 768 applies the bounding box regression to the RoIs relative to the original input image obtained from the RPN proposal layer 748. This returns the final class score 552 and bounding box 554 for each region of interest, as illustrated by the sketch below.
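As a small illustration of this decision function, the sketch below implements the two fully connected heads on the pooled 1x1x2048 descriptor: a softmax-normalized classification head and a bounding box regression head. The class count, the per-class regression output and the module name are placeholders, not the patented implementation.

```python
# Small illustration of the decision function: two fully connected heads on the
# pooled 1x1x2048 descriptor, one softmax-normalized classification head and one
# bounding-box regression head. Class count and module name are placeholders.
import torch
import torch.nn as nn

class DecisionFunction(nn.Module):
    def __init__(self, in_features=2048, num_classes=2):   # e.g. object / background
        super().__init__()
        self.cls_fc = nn.Linear(in_features, num_classes)        # classification head
        self.bbox_fc = nn.Linear(in_features, 4 * num_classes)   # regression head

    def forward(self, pooled):                  # (num_rois, 2048, 1, 1)
        x = pooled.flatten(1)                   # (num_rois, 2048)
        class_scores = torch.softmax(self.cls_fc(x), dim=1)   # normalized to [0, 1]
        bbox_deltas = self.bbox_fc(x)           # per-class regression offsets
        return class_scores, bbox_deltas
```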
  • Both the network that acts as a backbone (ResNet-50) and the two modules of the network (RCN module 520 and RPN module 540) can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) [20].
  • the approximate joint training [3] has been selected.
  • The RCN module 520 is trained in a similar way to the RPN module 540, except for bounding box regression, which does not exist in RCN.
  • The fact that RCL keeps the same number of output images per mini-batch as that of the input images makes the rest of the training identical to that of other RPN-based networks like Faster-R-CNN.
  • The initialization of anchors by k-means does not affect training either, since it is performed prior to training.
  • the RCN module 520 obtains its mini-batch from a single image by selecting positive and negative anchors.
  • The mini-batch used within the RCN contains 64 examples, trying to maintain whenever possible a 1:1 ratio of positive and negative labels.
  • The anchor's size is obtained by estimating the effective receptive field (ERF) which, in practice, follows a Gaussian distribution [21], so half of the theoretical receptive field of the convolutions between RCN and RPN is selected as ERF (see the sketch below).
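A back-of-the-envelope sketch of this heuristic follows: the theoretical receptive field is accumulated with the standard recurrence over kernel sizes and strides, and the anchor size is taken as half of it. The layer list in the example is a placeholder, not the actual convolutions between RCN and RPN.

```python
# Back-of-the-envelope sketch of the anchor-size heuristic: accumulate the
# theoretical receptive field with the usual recurrence rf += (k - 1) * jump,
# jump *= stride, then take half of it as the effective receptive field (ERF).
def theoretical_receptive_field(layers):
    """layers: iterable of (kernel_size, stride) tuples, from input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

late_convs = [(3, 1)] * 10                                    # hypothetical stack of 3x3 stride-1 convs
anchor_size = theoretical_receptive_field(late_convs) // 2    # ERF taken as half the RF
```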
  • An aggressive non-maximum suppression with a low threshold (0.3) is applied over the 2,000 best proposals before the RCL, resulting in a low number of scattered regions (around 200 on average).
  • Then, only those regions with confidence higher than 0.3, up to a maximum of 50 regions, are passed through the RCN, as in the sketch below.
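A minimal sketch of this proposal filtering, using torchvision's generic NMS as a stand-in for whatever suppression the actual system uses; the function name and tensor conventions are assumptions:

```python
# Sketch of the proposal filtering before the RCL: keep the 2,000 best-scored
# proposals, run an aggressive NMS with a 0.3 IoU threshold, then keep at most
# 50 regions whose confidence exceeds 0.3.
import torch
from torchvision.ops import nms

def filter_rcn_proposals(boxes, scores, pre_nms=2000, nms_iou=0.3,
                         min_conf=0.3, max_regions=50):
    order = scores.argsort(descending=True)[:pre_nms]
    boxes, scores = boxes[order].float(), scores[order]
    keep = nms(boxes, scores, nms_iou)                 # scattered surviving regions
    keep = keep[scores[keep] > min_conf][:max_regions]
    return boxes[keep], scores[keep]
```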
  • RCN and RCL can be integrated in any object detection convolutional framework by simply adapting the corresponding region proposal method to work with unsorted regions.
  • the method has been implemented over Faster-R-CNN.
  • The hyperparameters for training and testing are the same as those used in Faster-R-CNN.
  • The RPN module 540 is placed between the fourth residual block 732 and fifth residual block 750 convolutional layers, as it is done in [14] for Faster-R-CNN.
  • A box voting scheme after non-maximum suppression is applied [22].
  • the framework Caffe is used [23].
  • Figures 9, 10 and 11 depict several possible applications of the method and the image object detector of the present invention. However, it is noted that the present invention could be applied to many other real case scenarios. Among the different use cases envisaged for the proposed invention, the following are highlighted:
  • the airspace surveillance system 900 of Figure 9 comprises a camera 912 (implemented in this example as a video camera) located on the ground 902, mounted on a pole or on a terrestrial moving platform.
  • the camera 912 is pointing towards the sky, either in a vertical or in an oblique direction.
  • the camera 912 monitors a determined airspace region 903.
  • the airspace region 903 monitored can be a static region, if the position and orientation of the camera is fixed, or a dynamic region, if the position, orientation and /or zoom of the camera 912 dynamically changes.
  • The video stream (sequence of input images 102) acquired by the camera 912 is sent to a processor 914 for further analysis in order to detect all those flying objects 904 (e.g. a drone in the example of Figure 9) which appear in the field of vision 906 of the camera 912 and are represented in the input image 102 as small objects 908 (with a size up to 16x16 pixels).
  • The processor 914 implements the image object detector 500 of Figure 5.
  • The processor 914 may be placed together with the camera 912 (i.e. locally) or remotely, for instance in a remote data center. In the latter case, the input images 102 are transmitted to the processor 914 using a broadband data connection, such as the Internet.
  • The camera 912 and the image object detector 500 implemented by the processor 914 form an object detection system 910 in charge of monitoring the airspace and detecting, in real time, any kind of flying objects 904 (such as aircraft, e.g. drones or airships, parachutists, or even meteorites).
  • The airspace surveillance system 900 is helpful in scenarios where monitoring the airspace for security and/or safety reasons is critical, such as airports (in order to detect flying objects that can pose a potential hazard for commercial or military aviation), nuclear plants, transportation hubs, government facilities, football stadiums and any other critical infrastructure.
  • The small object detection performed by the airspace surveillance system 900 is carried out as soon as possible (i.e. whenever the flying object 904 appears in the field of view 906 of the camera 912, regardless of its size), in order to take the contingency actions required.
  • the airspace surveillance system 900 may optionally include a decision module 920 to determine one or more actions 922 to be carried out, based on the object detection made by the object detection system 910.
  • Actions 922 may include, for instance, neutralizing a drone flying in the airspace near an airport, sending an alarm message, etc.
  • the airspace surveillance system 900 may also comprise means (not shown in the figure) for executing the actions 922 determined by the decision module 920. For example, if the action to be taken is neutralizing a detected drone, the airspace surveillance system 900 may include a missile launcher to destroy the drone.
  • FIG. 10 depicts the image object detector 500 applied to ground surveillance from aerial positions (e.g. from aerial vehicles or platforms).
  • the ground surveillance system 1000 comprises a camera 1012 (e.g. a video camera) mounted on an aerial vehicle 1001 (e.g. a drone), pointing downwards toward the ground 1002, either in a vertical or in an oblique direction, to monitor a ground region 1003.
  • The video stream captured by the camera 1012 is sent to a processor 1014 (either on-board the aerial vehicle 1001 or remotely located on an on-ground facility) in charge of analyzing the input images 102 to detect static or moving small terrestrial objects 1004 (e.g. people).
  • The ground surveillance system 1000 may be applied in different aerial-based surveillance scenarios.
  • the camera 1012 and the image object detector 500 implemented by the processor 1014 form an object detection system 1010 in charge of monitoring the ground region 1003 and detecting, in real time, any kind of terrestrial objects 1004 (e.g. people).
  • the ground surveillance system 1000 may also include a decision module 1020 to determine one or more actions 1022 to be performed based on the object detection made by the object detection system 1010. Actions 1022 may include, among others, sending a message to a remote station informing about the detected objects.
  • the ground surveillance system 1000 may also comprise means (not shown in the figure) for executing the actions 1022.
  • The detect and avoid system 1100 comprises a camera 1112 (e.g. a video camera) mounted onboard a vehicle 1101, such as the aircraft depicted in Figure 11 or any other type of vehicle (e.g. an autonomous car, a drone, etc.).
  • The camera 1112 is pointing forward, in the direction of movement of the vehicle 1101, either in a horizontal or in a slightly oblique direction, towards a dynamic region 1103 (in the embodiment, an airspace region).
  • The video stream of the camera 1112 is analyzed by an on-board processor 1114, in order to detect other small flying objects 1104 which appear in the field of vision 1106 of the camera 1112, represented in the input image 102 as small objects 1108 with a size up to 16x16 pixels, and that may involve a potential obstacle for the vehicle 1101.
  • The camera 1112 and the image object detector 500 implemented by the processor 1114 form an object detection system in charge of monitoring the airspace region 1103 and detecting, in real time, any kind of flying objects 1104 (e.g. drones, birds).
  • The detect and avoid system 1100 comprises a decision module 1120 that determines one or more actions 1122 to be performed based on the flying objects 1104 detected by the object detection system.
  • The actions 1122 determined by the decision module 1120 are aimed to avoid collision against the detected flying objects 1104.
  • A new trajectory may be computed by the decision module 1120 for execution by the vehicle 1101 (e.g. by the FMS of an aircraft or by an autonomous navigation module of a drone).
  • A vehicle 1101 comprising the detect and avoid system 1100 may also be part of the invention, the vehicle comprising means for executing the actions 1122 to avoid collision.
  • The small object detection performed by the detect and avoid system 1100 is performed as soon as possible (i.e. whenever the flying object 1104 appears in the field of view 1106 of the camera 1112, regardless of its size), in order to take the contingency actions required to avoid potential collisions.
  • The method and image object detector 500 of the present invention are especially advantageous, compared with the prior art, for detecting small objects (equal to or under 16x16 pixels) on an image.
  • the invention may also be applied for detecting bigger objects (i.e. above 16x16 pixels) on an image.


Abstract

A computer-implemented method and system for detecting small objects on an image using convolutional neural networks. The method comprises: applying convolution operations (210) to an input image (102) to obtain a first set of convolutional layers (212) and an input feature map (302); analyzing the input feature map (302) to determine a first set of candidate regions (222) containing candidate objects; arranging the first set of candidate regions (222) to form a reduced feature map (228); applying convolution operations (230) to the reduced feature map (228) to obtain a second set of convolutional layers (232) and an output feature map (502); applying a Region Proposal Network (240) to the output feature map (502) to obtain a second set of candidate regions (242) containing candidate objects; classifying and applying bounding box regression (250) to each candidate region of the second set (242) to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image (102).

Description

A COMPUTER-IMPLEMENTED METHOD AND SYSTEM FOR DETECTING SMALL OBJECTS ON AN IMAGE USING CONVOLUTIONAL NEURAL NETWORKS
DESCRIPTION
FIELD
The present disclosure is comprised in the field of image analysis, and more particularly, in the field of methods and systems for detecting small objects on an image.
BACKGROUND
Object detection has made great progress through deep convolutional neural networks (CNN). Initial approaches combined region proposal methods based on different techniques with deep convolutional networks that automatically extracted very deep features from those regions and, finally, generated a bounding box and the corresponding object category. Current solutions integrate feature extraction, region proposal, and bounding box and object category in the CNN, in some cases with a fully convolutional architecture.
Applications like sense and avoid on board of unmanned aerial vehicles (UAVs) or video surveillance over wide areas demand early detections of objects of interest to act quickly. This means detecting an object as far away, and therefore as small, as possible. Recent CNN object detectors provide high accuracy over a wide range of scales, from 32 x 32 pixels up to the image size. Nevertheless, there are no specific CNNs focused on small targets. Qualitatively, in the present disclosure the term "small targets" (or "small objects") refers to those objects without visual cues to assign them to a category or subcategory; quantitatively, "small targets" (or "small objects") refers to objects in an image having a size with a total number of pixels equal to or under 256 square pixels (e.g. 16 x 16 pixels, 25 x 10 pixels). Most of the state-of-the-art CNNs for object detection are unsuitable for the detection of such small objects, because the region proposal, the bounding box regression and the final classification all take the feature maps generated in the last convolutional layers as inputs. These feature maps have much lower resolution than the input image; in most cases the reduction in resolution is up to 16x. Thus, many small objects are represented in the last feature maps by only one pixel, which makes classification and bounding box regression very hard, if not impossible.
A straightforward solution for small object detection would be to modify a state-of-the-art CNN, keeping the resolution of the initial image in all the feature maps. Of course, this approach is not viable because, due to the size of the network, it would not fit on a GPU (Graphics Processing Unit) and, also, the forward pass would be very slow.
Modern object detectors are based on CNNs [2]. Faster-R-CNN [3] has become a milestone in CNNs for object detection thanks to the inclusion of a visual attention mechanism through the so-called Region Proposal Network (RPN). In Faster-R-CNN the input image goes through a number of convolutional layers for feature extraction up to the RPN. The RPN is based on anchors, which are predefined regions of different sizes and aspect ratios to cope with multiple scales. The anchors are centered at the sliding window and, for each position and anchor, a fixed-length feature vector is generated with a set of convolutional layers. The outputs of the RPN are the coordinates of the bounding boxes and their corresponding classes, namely object and background. Finally, given the output of the RPN and the last feature map of the feature extractor network, the bounding box and class of the object are determined through a fully-connected classification network.
The off-the-shelf Faster-R-CNN is not adequate for small object detection for two reasons. First, the sizes of the predefined anchors are very large for small objects. Second, and more important, the global effective stride (the downscaling of the input image with respect to the feature map that is the input to the RPN) is 16, which means that a 16 x 16 object is represented by just one pixel in that feature map. The detection of small objects requires a finer global effective stride in Faster-R-CNN. This leads to a very high increase in memory, making the implementation impossible on current GPUs.
In [4], a fully convolutional approach to object detection, called Region-based Fully Convolutional Network (R-FCN), is presented. The major difference with Faster-R-CNN is that R-FCN generates k x k x (C + 1) feature maps in the last convolutional layer, instead of only one. These maps are position-sensitive, i.e., each of the k x k x (C + 1) maps corresponds with a part of an object of one of the C object categories (+1 for background). This, however, limits the applicability of the R-FCN architecture to small object detection, as it is very hard to distinguish their parts. The capability of dealing with objects of different sizes in Faster-R-CNN and R-FCN is limited to a few scales produced with the anchors. Hence, more recent CNNs for object detection tackle the issue of scale invariance and small object detection through more elaborate solutions.
Li et al. [5] introduces a Perceptual Generative Adversarial Network for small object detection. The aim is to enhance the representation of small objects to be similar to that of large ones. This is done by looking for the structural correlations of the objects at different scales. This approach has two networks. First, the generator network transforms the original poor features of small objects to highly discriminative ones. Then, the discriminator network estimates the probability that the input representation belongs to a real large object and, finally, it classifies the proposal and runs bounding box regression. The proposal has been tested with two datasets: (i) traffic signs from the Tsinghua-Tencent 100k dataset [6], where they consider as small objects those with an area under 32 x 32 pixels; (ii) pedestrians over 50 pixels tall from the Caltech benchmark [7].
In [8] an approach for company logo detection is presented. This approach is based on Faster-R-CNN. As logos usually appear as small objects, Eggert et al. present an architecture with three RPNs to detect objects of different sizes. For instance, the RPN after conv3 has anchors for side lengths under 45 px. Both the RPNs and the final classification and bounding box regression receive as inputs the combination of the feature maps of the last three convolutions: high-level feature maps are upscaled through bilinear interpolation and then summed with the lower-level maps. This proposal was validated on the FlickrLogos dataset.
Also, in [9] an architecture with several RPNs is proposed. Each RPN is in a different branch of the net. Shallower RPNs are adequate for small objects, while deeper ones are appropriate for larger targets. In order to have a more informative RoI (Region of Interest) pooling, principally for small objects, the CNN applies upsampling to the last feature map of each of the branches of the net. In the experimental evaluation the smallest objects range from 25 to 50 pixels in height. Yang et al. [10] separate the detection of objects of different sizes into different branches. Their proposal relies on scale-dependent pooling (pooling for smaller objects uses only the shallower feature maps) and, also, on layer-wise cascaded rejection classifiers in several branches for the different object sizes. This approach considers objects of less than 64 pixels in height as small targets.
In [11] the authors propose a CNN in which the deeper feature maps are upsampled and combined with shallower feature maps. Object detection relies on these combined feature maps: the shallower ones for small objects and the deeper ones for larger objects.
All the previous approaches are based on single images. In [12] the detection of flying objects from a single moving camera is implemented taking into account spatio-temporal image cubes. This proposal has two main components, motion compensation and object detection, both based on CNNs. Motion compensation takes as input an image patch and returns the shift necessary to center the object in the patch. The CNN for object detection receives the motion compensated spatio-temporal image cubes, and returns whether or not there is an object.
The present disclosure introduces a new CNN architecture for small object detection that solves the aforementioned problems, allowing the detection of small targets equal to or under 256 square pixels. This makes a big difference with respect to the above prior art documents, as, firstly, the objects of interest do not feature definitive visual cues to classify them into a category and, secondly, the sizes of the targets considered in the present disclosure are significantly smaller than those considered in the prior art documents, making object detection more difficult. In order to detect such small objects, the global effective stride must be low, which requires a new architecture in order to keep a reasonable memory overhead. Besides, the proposed solution is an image object detector and, as such, it does not use temporal information as the video object detectors reported in [12] and [13] do.
References
[1] J. Redmon, A. Farhadi, Yolo9000: Better, faster, stronger, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 6517-6525.
[2] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al., Speed/accuracy trade-offs for modern convolutional object detectors, in: IEEE Computer Vision and Pattern Recognition (CVPR), 2017.
[3] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 91-99.
[4] J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully convolutional networks, in: Advances in Neural Information Processing Systems (NIPS), 2016, pp. 379-387.
[5] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, S. Yan, Perceptual generative adversarial networks for small object detection, in: IEEE Computer Vision and Pattern Recognition (CVPR), 2017.
[6] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, S. Hu, Traffic-sign detection and classification in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2110-2118.
[7] P. Dollár, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4) (2012) 743-761.
[8] C. Eggert, D. Zecha, S. Brehm, R. Lienhart, Improving small object proposals for company logo detection, in: ACM on International Conference on Multimedia Retrieval, ACM, 2017, pp. 167-174.
[9] Z. Cai, Q. Fan, R. S. Feris, N. Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, in: European Conference on Computer Vision (ECCV), Springer, 2016, pp. 354-370.
[10] F. Yang, W. Choi, Y. Lin, Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2129-2137.
[11] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: IEEE Computer Vision and Pattern Recognition (CVPR), Vol. 1, 2017, p. 4.
[12] A. Rozantsev, V. Lepetit, P. Fua, Detecting flying objects using a single moving camera, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (5) (2017) 879-892.
[13] C. Feichtenhofer, A. Pinz, A. Zisserman, Detect to track and track to detect, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3038-3046.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[15] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.
[16] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conference on Computer Vision (ECCV), Springer, 2014, pp. 818-833.
[17] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: 27th International Conference on Machine Learning (ICML), 2010, pp. 807-814.
[18] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International Journal of Computer Vision 88 (2) (2010) 303-338.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C. L. Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision (ECCV), Springer, 2014, pp. 740-755.
[20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation 1 (4) (1989) 541-551.
[21] W. Luo, Y. Li, R. Urtasun, R. Zemel, Understanding the effective receptive field in deep convolutional neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2016, pp. 4898-4906.
[22] S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware cnn model, in: IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1134-1142.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 675-678.
SUMMARY
The present disclosure introduces a new CNN architecture for small object detection. The proposed CNN architecture has a size that is significantly lower than its counterparts for the same resolution of the last feature map. The present invention considers the hypothesis that, after a few convolutional layers, the feature map contains enough information to decide which regions of the image contain candidate objects, but not enough data to classify the region or to perform bounding box regression. Given an intermediate feature map, the present invention applies a novel component, called Region Context Network (RCN), which is a filter that selects the most promising regions of the feature map (all of them with the same size), avoiding the processing of the remaining areas of the image. The RCN ends with a RoI (Region of Interest) Collection Layer (RCL), which builds a new and reduced filtered feature map by arranging all the regions selected by the RCN. Therefore, the memory overhead of the feature maps following the RCN is much lower, but with the same spatial resolution, as the reduction in size is due to the deletion of the least promising regions with small objects. Finally, the present invention applies a Region Proposal Network (RPN) to the last filtered feature map, classifies the regions and performs bounding box regression.
In accordance with one aspect of the present invention there is provided a computer- implemented method for detecting small objects on an image using convolutional neural networks. The method comprises the following steps:
- Applying one or more convolution operations to an input image to obtain a first set of convolutional layers and an input feature map corresponding to the last convolutional block of said first set.
- Analyzing the input feature map to determine a first set of candidate regions containing candidate objects.
- Arranging the first set of candidate regions to form a reduced feature map.
- Applying one or more convolution operations to the reduced feature map to obtain a second set of convolutional layers and an output feature map corresponding to the last convolutional block of said second set.
- Applying a Region Proposal Network to the output feature map to obtain a second set of candidate regions containing candidate objects.
- Classifying and applying bounding box regression to each candidate region of the second set to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image.
In an embodiment, the first set of candidate regions are determined by applying a first convolution operation to the input feature map to obtain an intermediate convolutional layer and an associated intermediate feature map; applying a second convolution operation to the intermediate feature map to obtain a class feature map including class scores as candidate objects; and selecting a determined number of regions in the input feature map according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions includes the selected regions.
The step of arranging the first set of candidate regions to form a reduced feature map may comprise concatenating the candidate regions and adding an inter region 0-padding between adjacent candidate regions.
The method may also comprise a preprocessing stage wherein the number and the size of the anchors used in the Region Proposal Network are automatically learned through k-means applied to a training set of ground truth boxes. The number of anchors is preferably automatically obtained by performing an iterative k-means with an increasing number of kernels until the maximum inter-kernel IoU ratio exceeds a certain threshold.
In accordance with a further aspect of the present invention there is provided an image object detector based on convolutional neural networks. The image object detector comprises:
- A feature extractor module configured to apply one or more convolution operations to an input image to obtain a first set of convolutional layers and an input feature map corresponding to the last convolutional block of said first set; and configured to apply one or more convolution operations to a reduced feature map to obtain a second set of convolutional layers and an output feature map corresponding to the last convolutional block of said second set.
- A region context network module configured to analyze the input feature map to determine a first set of candidate regions containing candidate objects.
- A Rol collection layer module configured to arrange the first set of candidate regions to form the reduced feature map.
- A Region Proposal Network module configured to obtain, from the output feature map, a second set of candidate regions containing candidate objects;
- A classifier module configured to classify and apply bounding box regression to each candidate region of the second set to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image.
According to an embodiment, the region context network module is configured to apply a first convolution operation to the input feature map to obtain an intermediate convolutional layer and an associated intermediate feature map; apply a second convolution operation to the intermediate feature map to obtain a class feature map including class scores as candidate objects; and select a determined number of regions in the input feature map according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions includes the selected regions.
The RoI collection layer module is preferably configured to form the reduced feature map by concatenating the candidate regions and adding an inter-region 0-padding between adjacent candidate regions. The image object detector may be implemented, for instance, in a processor or a GPU.
The present invention also refers to an object detection system for detecting small objects on an image using convolutional neural networks. The object detection system comprises an image object detector as previously defined and a camera configured to capture an input image.
In accordance with yet a further aspect of the present invention there is provided a vehicle comprising an object detection system as previously defined and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution by one or more vehicle systems of the vehicle. The vehicle may be, for instance, an unmanned aerial vehicle.
In accordance with another aspect of the present invention there is provided an airspace surveillance system comprising an object detection system as previously defined, wherein the camera of the object detection system is mounted on a ground location and is configured to monitor an airspace region; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution.
In accordance with yet another aspect of the present invention there is provided a ground surveillance system comprising an object detection system as previously defined, wherein the object detection system is installed on an aerial platform or vehicle and the camera of the object detection system is configured to monitor a ground region; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution.
In accordance with another aspect of the present invention there is provided a detect and avoid system installed onboard a vehicle, comprising an object detection system as previously defined, wherein the camera of the object detection system is configured to monitor a region in front of the vehicle; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action to avoid potential collisions.
The invention also refers to a computer program product for detecting small objects on an image using convolutional neural networks, comprising at least one computer-readable storage medium having recorded thereon computer code instructions that, when executed by a processor, cause the processor to perform the method as previously defined.
The main contributions of the present invention are:
A new CNN for small object detection that is able to work with high resolution feature maps in the deeper layers while having a size that is significantly lower than other CNNs. The present invention relies on a novel component, RCN, that selects the most promising regions of the image and generates a new and filtered feature map with these areas. Therefore, the filtered feature maps can keep the same resolution but with a lower memory overhead and a higher frame rate.
The present invention uses an RPN that works with anchors, wherein the number and sizes of the anchors can be automatically selected using a novel algorithm based on k-means. The automatic definition of the anchors with k-means improves the classical heuristic approach.
The fully convolutional network (CNN) of the present invention is focused on small targets equal to or under 256 square pixels. It includes an early visual attention mechanism, RCN, to choose the most promising regions with small objects and their context. RCN allows working with high-resolution feature maps with a reduced memory usage, as the regions with the least likely objects are deleted from the filtered feature maps. The filtered feature maps, which only contain the most likely regions with small objects, are forwarded across the network up to the ending Region Proposal Network (RPN), and then classified. RCN is key to increasing localization accuracy through finer spatial resolution due to finer global effective strides, smaller memory overhead and higher frame rates.
Experimental results over small object databases show that the present invention improves the average precision (AP) of the best state-of-the-art approach for small target detection from 52.7% to 60.1%.
BRIEF DESCRIPTION OF THE DRAWINGS
A series of drawings which aid in better understanding the invention and which are expressly related with an embodiment of said invention, presented as a non-limiting example thereof, are very briefly described below.
Figure 1 shows the structure of a CNN object detector according to the prior art.
Figure 2 depicts the steps performed by a CNN object detector according to the present invention.
Figure 3 depicts the RCN architecture of the present invention.
Figure 4 shows some examples of the feature maps obtained by the RCN.
Figure 5 is a schematic diagram of an image object detector according to an embodiment of the present invention.
Figure 6 depicts a vehicle with the image object detector installed onboard.
Figure 7 depicts the steps performed by the image object detector according to an embodiment of the invention.
Figure 8 depicts the steps performed by an ensemble of residual blocks from early or late convolutions of the image object detector in order to extract features from the input feature map.
Figure 9 shows an embodiment of the image object detector of Figure 5 applied to airspace surveillance.
Figure 10 depicts, according to another embodiment, the image object detector of Figure 5 applied to ground surveillance from an aerial position.
Figure 11 depicts, according to yet another embodiment, the image object detector of Figure 5 applied to detect and avoid applications.
DETAILED DESCRIPTION
The present disclosure refers to a system and a computer-implemented method for detecting objects on an image using convolutional neural networks.
Figure 1 schematically depicts, according to the prior art, the internal structure of an object detector using convolutional neural networks, CNN object detector 100, which receives and processes an input image 102 to obtain an object classification in the input image 104, thereby detecting the presence of objects in the input image 102.
A feature extractor 110 of the CNN object detector 100 sequentially applies N successive convolution operations (111, 113, 115), obtaining for each convolution operation a convolution layer and the associated feature maps (112, 114, 116) which will be used in the next convolution operation. A Region Proposal Network (RPN) 120, as described in the prior art (see for instance [3]), is then applied to the last feature maps 116 obtained by the feature extractor 110. A classifier 130 receives the output of the RPN 120 and the last feature maps 116 of the feature extractor 110 to determine the object classification in the input image 104, including the class of the object, using a fully-connected classification network. Along with the classification, a bounding box regression is also performed to obtain the bounding box in the input image 102 for the regions detected as objects.
The system of the present invention is a fully convolutional network that detects small objects. The system only considers regions of the feature maps containing most likely objects, deleting those regions of the feature maps with least likely objects and building filtered feature maps with the same resolution but lower memory requirements. This way, the system works with high resolution feature maps while keeping a low memory overhead.
Figure 2 schematically depicts the method performed by a CNN object detector, according to an embodiment of the present invention, to detect small objects on an input image using convolutional neural networks. The method 200 comprises receiving an input image 102 and applying one or more convolution operations (early convolutions 210) to the input image 102 to obtain a first set of convolutional layers 212. In the example of Figure 2 the first set of convolutional layers 212 is formed by two convolutional blocks (214, 216). The last feature map of the convolutional block 216 of said first set 212 is an input feature map 302 for the convolution operations applied in the next step of the process (shown in more detail in Figure 3), referred to as Region Context Network (RCN) 220 in Figure 2. The RCN 220 analyzes the input feature map 302 to determine a first set of candidate regions 222 in the input feature map 302 containing candidate objects.
The first set of candidate regions 222 is arranged to form a reduced feature map 228 (RoI collection layer). One or more convolution operations (late convolutions 230) are then applied to the reduced feature map 228 to obtain a second set of convolutional layers 232. In the embodiment shown in Figure 2 the second set of convolutional layers 232 comprises two convolutional blocks (234, 236). The last feature map of the last convolutional block 236 of said second set 232 is the output feature map of the late convolutions 230.
A Region Proposal Network (RPN) 240 is then applied to said output feature map to obtain a second set of candidate regions 242 (e.g. j candidate regions) in the output feature map containing candidate objects. A classifier 250 classifies and applies bounding box regression to each candidate region of the second set 242 to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image 102. In an embodiment, each of the selected candidate regions 242 may first be converted, prior to the classification and bounding box regression, to a fixed size feature map, obtaining j fixed size feature maps 248 (RoI pooling layers).
Figure 3 represents in more detail, according to an embodiment, the RCN 220 process to obtain the first set of candidate regions 222 in the input feature map 302. The RCN 220 receives the input feature map 302 and applies a first convolution operation to the input feature map 302 to obtain an intermediate convolutional layer 224 and an associated intermediate feature map. The first convolution operation is a convolution using a fixed kernel size that acts as a fixed size sliding window 304 that maps the input feature map 302. In the embodiment of Figure 3, the first convolution operation is a convolution with a 3x3 kernel size and 128 filters (i.e. a 128-d 3x3 convolution).
The RCN 220 applies a second convolution operation to the intermediate feature map to obtain a class feature map 226 (rcn-cls-layer) including class scores as candidate objects. In the embodiment of Figure 3, the second convolution operation is a convolution with a 1x1 kernel size and 2 filters (i.e. a 2-d 1x1 convolution). The RCN 220 forms the first set of candidate regions 222 by selecting a determined number of regions in the input feature map according to the scores as candidate objects of the class feature map 226 (for instance, selecting the first n regions with the highest score).
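The following minimal sketch illustrates the two RCN convolutions and the top-n selection just described. It is written in PyTorch for illustration only (the reference implementation uses Caffe [23]); the class name, the softmax-based comparison of the two class scores and the top_n default are assumptions, not part of the disclosure.

```python
import torch
import torch.nn as nn

class RCNHead(nn.Module):
    """Sketch of the RCN head: a 3x3, 128-filter convolution with ReLU
    followed by a 2-filter 1x1 convolution (rcn-cls-layer) that scores
    every sliding-window position as object / no-object."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(128, 2, kernel_size=1)  # "fg" / "bg" scores

    def forward(self, feature_map, top_n=50):
        x = self.relu(self.conv(feature_map))
        scores = self.cls(x)                      # (1, 2, H, W)
        fg = torch.softmax(scores, dim=1)[:, 1]   # foreground probability map
        flat = fg.flatten(start_dim=1)            # (1, H*W)
        _, idx = flat.topk(top_n, dim=1)          # top_n highest-scored positions
        ys, xs = idx // fg.shape[-1], idx % fg.shape[-1]
        return fg, ys, xs                         # centres of candidate regions
```

Each selected position would then be expanded to a fixed-size region on the feature map before being handed to the RoI collection layer described next.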
According to the embodiment of Figure 3, the first set of candidate regions 222 is arranged to form a reduced feature map 228 (RoI collection layer, RCL) by concatenating the candidate regions 222 and adding an inter-region 0-padding (shown as gaps in the figure) between candidate regions 222. Figure 4 depicts an illustrative example of the reduced feature map 228 (RoI collection layer) for a particular input image 102. Figure 4 shows only 4 filters, out of the total 256 filters used in the example, of the RCN input (i.e. the input feature map 302), and only 7 filters (a row for each filter) out of a total of 256 of the RoI Collection Layer output (i.e. the reduced feature map 228).
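A possible sketch of this arrangement is shown below: the selected regions are cropped from the input feature map and concatenated along the width, with zero-filled columns inserted between neighbours. The tensor layout, the 12x12 region size and the 1-pixel padding are taken from the example given later in the detailed description; the function itself is illustrative.

```python
import torch

def roi_collection_layer(feature_map, regions, region_size=12, pad=1):
    """Concatenate the selected regions of the feature map along the width,
    inserting `pad` columns of zeros between neighbouring regions so that
    later convolutions with kernels larger than 1x1 do not mix regions.

    feature_map : (C, H, W) tensor from the last early convolution
    regions     : list of (y, x) top-left corners on the feature map
    """
    channels = feature_map.shape[0]
    zero_gap = torch.zeros(channels, region_size, pad)
    pieces = []
    for i, (y, x) in enumerate(regions):
        pieces.append(feature_map[:, y:y + region_size, x:x + region_size])
        if i < len(regions) - 1:
            pieces.append(zero_gap)       # inter-region 0-padding
    return torch.cat(pieces, dim=2)       # (C, region_size, n*(r+pad)-pad)
```

With 50 regions of 12x12 and 256 channels this produces the 649x12x256 reduced feature map used in the worked example of the detailed description.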
Figure 5 is a schematic diagram showing the components of an image object detector 500 based on convolutional neural networks (i.e. CNN object detector) according to an embodiment of the present invention. The image object detector 500 of the present invention is a system (or part of a system) for detecting small objects on an image using convolutional neural networks. The system may be implemented in a processing device including a processor, a GPU or a combination thereof (or any other kind of data processing device) and a computer-readable medium having encoded thereon computer-executable instructions to cause the processor/GPU to execute the method for detecting small objects on an image using convolutional neural networks as previously described.
The image object detector 500 comprises a feature extractor module 510, a region context network module 520, a region of interest (RoI) collection layer module 530, a Region Proposal Network module 540 and a classifier module 550. The feature extractor module 510 is configured to apply one or more convolution operations 210 (early convolutions) to an input image 102 to obtain a first set of convolutional layers 212 and an input feature map 302 corresponding to the last convolutional block 216 of the first set of convolutional layers 212.
The region context network module 520 analyzes the input feature map 302, looking for and determining the most promising regions containing candidate objects (i.e. a first set of candidate regions 222). Regions are defined as areas of the image that might contain objects together with their context. The region context network module 520 assigns to each region a score, and the top scored regions (first set of candidate regions 222) are passed on to a RoI collection layer module 530, at the final stage of the RCN.
The region context network module 520 avoids forwarding the regions of the input image with least likely objects to the deepest convolutional layers, saving memory and increasing frame rate. Memory saving is key to increase spatial resolution through finer global effective strides across convolutional layers, mandatory in order not to miss the spatial localization of small objects.
The region context network module 520 selects the most likely candidate regions with one or more small objects together with their context, and returns them as a set of disjoint regions. As the goal at this stage is not to get accurate object localization, neither a box regression approach nor a set of anchors with different scales and aspect ratios is needed: a single anchor of a given size suffices to return the most likely candidate regions with small objects. The region context network module 520 first applies a 3 x 3 convolutional filter to each window of the input feature map 302, generating an intermediate 128-d layer followed by a ReLU (Rectified Linear Unit) [17]. This structure feeds a box-classification layer (rcn-cls-layer), represented by a 1 x 1 convolutional 2-d layer ("fg", i.e. object, and "bg", i.e. no object), which scores the regions obtained with sliding windows over the last early convolution (i.e. the input feature map 302).
To determine whether the anchor is a positive or a negative candidate in each sliding window region during the RCN's training phase, the ground truths of the objects are grown proportionally in all directions until equaling the anchor's defined size. Then, those anchors that have a considerable overlap with the modified ground truth (greater than 0.7 by default) are assigned positive labels, leaving as negative those regions that barely have an overlap (lower than 0.3 by default). As usual, the overlap is measured by the intersection-over-union (IoU) ratio. The objectness score of the candidate regions in RCN is minimized through:
L(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*})

where the summation is the object / not-object classification term, p_i is the predicted probability of the i-th anchor being an object in an RCN mini-batch, and p_i^{*} is the adapted ground-truth label. The term 1/N_{cls} normalizes the equation, N_{cls} being the size of the RCN's mini-batch. L_{cls} is a softmax loss over the object and not-object categories.
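The anchor labelling rule described above can be sketched as follows. The IoU computation is standard; growing the ground truth to the anchor's size is simplified here to a square of the anchor side centered on the ground-truth box, which is an assumption about a detail the text does not fully specify. Anchors labelled 1 or 0 feed the softmax loss L_cls; anchors returning None are ignored.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def grow_to_anchor(gt, anchor_side):
    """Illustrative helper: grow a ground-truth box to a square of the
    anchor's side length, keeping the ground-truth centre."""
    cx, cy = (gt[0] + gt[2]) / 2.0, (gt[1] + gt[3]) / 2.0
    half = anchor_side / 2.0
    return (cx - half, cy - half, cx + half, cy + half)

def rcn_label(anchor, ground_truths, anchor_side, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (object), 0 (background) or None (ignored) for one anchor."""
    overlaps = [box_iou(anchor, grow_to_anchor(gt, anchor_side))
                for gt in ground_truths]
    best = max(overlaps, default=0.0)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return None  # neither positive nor negative: excluded from the mini-batch
```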
The RCN 220 ends with the so-called RoI collection layer (RCL) (Figure 3), implemented by the RoI collection layer module 530, which is configured to arrange the first set of candidate regions 222 to form a reduced feature map 228. The RoI collection layer module 530 takes as inputs the feature map generated by the last early convolution and the top scored proposals from the RCN, and returns a single filtered feature map (reduced feature map 228) with the same information as that of the input feature map 302, but only for the set of selected regions. Successive convolutions with filters greater than 1x1 would affect the neighboring regions' outputs. To solve this problem, the RoI collection layer module 530 adds an inter-region 0-padding, shown by gaps between regions in Figure 3.
With this configuration, the dimensions of the feature map output are obtained as follows:
W_{RCL} = n (r_w + p_d) - p_d, \qquad H_{RCL} = r_h

where n is the number of regions from the RCN, r_w and r_h are the width and height of the regions in the RCL input feature map, and p_d is the size of the 0-padding between regions. For example, a 1280x720 input image has an RCL input feature map of 320x180, and the RCL generates a 649x12 output feature map: 50 regions of size 48x48 at the input image (12x12 at the RCL input feature map for stride 4) with 1-pixel 0-padding; i.e. a 7.4x reduction of GPU memory usage (86.5% saved memory).
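The figures quoted in this example can be reproduced directly from the formula; the snippet below is plain arithmetic and not part of the network.

```python
def rcl_output_width(n, r_w, pad):
    """Width of the RCL output: n regions of width r_w separated by
    `pad` columns of zero padding."""
    return n * (r_w + pad) - pad

full = 320 * 180                              # RCL input feature map, per channel
reduced = rcl_output_width(50, 12, 1) * 12    # 649 x 12
print(rcl_output_width(50, 12, 1))            # 649
print(round(full / reduced, 1))               # 7.4  -> ~7.4x less memory
print(round(100 * (1 - reduced / full), 1))   # 86.5 -> ~86.5% saved memory
```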
The feature extractor module 510 also applies one or more convolution operations 230 (late convolutions) to the reduced feature map 228 to obtain a second set of convolutional layers 232 and an output feature map 502 corresponding to the last convolutional block 236 of said second set 232. Late convolutions 230 act independently on each of the candidate regions 222 obtained by the region context network module 520, thanks to the inter-region 0-padding (displayed as gaps between the different candidate regions in Figure 3).
The feature extractor module 510 can be based on any of the most widely used state-of-the-art backbones found in the literature, e.g. ResNet [14], VGG [15], ZF [16], etc. ResNet-50 is preferably used, since it provides a good trade-off between accuracy, speed and GPU memory consumption [14].
The Region Proposal Network (RPN) module 540 is configured to obtain, using the output feature map 502, a second set of candidate regions 242 containing candidate objects. The RPN module 540 performs an initial bounding box regression and classification as object (fg) and background (bg) [3], which are finally refined in the classifying stage.
The RPN module 540 is based on the RPN presented in [3], but including a set of modifications in order to deal with the fact that the coordinates of its input feature map do not correspond with those of the input image, i.e., the RPN input contains unsorted regions. To map the regions on the input image to the RPN's training function, which is based on the IoU between anchors and ground truth, the region context network module 520 passes the 4 coordinates of every region as a parameter to the RPN module 540 to generate the anchors relative to those regions. Finally, the output of the bounding box regression is transformed to the input image coordinates.
The approaches that rely on RPNs define the number of anchors and their sizes heuristically. In the present invention, both the number and the size of the anchors are learned through k-means (i.e. automatic anchor initialization by k-means). This approach can be adopted by any other object detection network with anchors, e.g. Faster-R-CNN, regardless of the target size of the objects. The k-means anchor learning procedure is implemented as a preprocessing stage: k-means is applied to the training set of ground truth boxes' heights and widths. In order to obtain the number of kernels, which will be the number of anchors, an iterative k-means with an increasing number of kernels is performed until the maximum inter-kernel IoU exceeds a certain threshold. In an embodiment, the threshold is set to 0.5, which is the value used in well-known repositories, such as PASCAL VOC [18] or MS COCO [19], to check if a detection is positive or negative with respect to a ground truth. A similar contribution was presented in [1], where a k-means algorithm selects the anchors' sizes according to the dataset, but where the selection of the number of anchors is done manually, visualizing the best trade-off between the number of anchors and their average intersection with the dataset objects. The present approach makes the anchor selection completely automatic.
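A sketch of this automatic anchor initialization is given below. The use of scikit-learn's KMeans, and the choice of returning the clustering obtained just before the threshold is exceeded, are assumptions; the disclosure only states that the number of kernels is increased until the maximum inter-kernel IoU exceeds the threshold.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_iou(wh_a, wh_b):
    """IoU of two anchors of sizes (w, h), assumed to share the same centre."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def learn_anchors(gt_wh, iou_threshold=0.5, max_k=10):
    """gt_wh: (N, 2) array of ground-truth box widths and heights."""
    centers = None
    for k in range(1, max_k + 1):
        centers = KMeans(n_clusters=k, n_init=10).fit(gt_wh).cluster_centers_
        max_pairwise = max(
            (anchor_iou(centers[i], centers[j])
             for i in range(k) for j in range(i + 1, k)),
            default=0.0,
        )
        if max_pairwise > iou_threshold:
            # two kernels became too similar: keep the previous, smaller set
            return KMeans(n_clusters=k - 1, n_init=10).fit(gt_wh).cluster_centers_
    return centers

# example call with hypothetical ground-truth sizes (pixels)
anchors = learn_anchors(np.array([[12, 12], [16, 10], [14, 15], [9, 9]]))
```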
The classifier module 550 is configured to classify and apply bounding box regression to each candidate region of the second set of candidate regions 242 to obtain, for each candidate region, a class score 552 as a candidate object and a bounding box 554 in the input image 102.
Figure 6 depicts an exemplary embodiment of the image object detector 500 installed onboard a vehicle 600, such as a boat, a car, an aircraft, an unmanned aerial vehicle or a drone. In particular, the vehicle 600 includes an object detection system 610 for detecting small objects on an image using convolutional neural networks, wherein the object detection system 610 comprises a camera 612 (the term "camera" includes any device able to acquire an image or a set of images, such as a conventional camera or a video camera) configured to capture an input image 102, and the image object detector 500 as previously described in Figure 5. In this example, the image object detector 500 is implemented in a processor or a GPU 614.
The vehicle 600 may also comprise a decision module 620 that receives the output of the object detection system (the class scores 552 and bounding boxes 554 for the candidate regions selected in the input image 102), and determines, based on the objects detected on the input image 102, one or more actions 622 to be executed by one or more vehicle systems 630 (e.g. communications system 632, navigation system 634 with on-board sensors 635, propulsion system 638) of the vehicle 600.
For instance, as shown in the embodiment of Figure 6, the action may be sent to the navigation system 634 (continuous line) and/or to the communications system 632 (dotted line). In the first case, the action may include, as an example, guiding the vehicle towards one of the small detected objects, depending on the class score obtained for said object or the size of the bounding box. This could be the case, for instance:
- When the bounding box is so small that the vehicle 600 is required to confirm the class score 552 assigned to the region by getting closer.
- When the class score assigned is of a particular relevance to the vehicle. For example, if the vehicle 600 is a drone patrolling a vast secure geographic area, such as borders between countries, and is looking for people invading that secure geographic area.
In this first case, the navigation system 634 receives a displacement instruction 624 to move towards a determined location (e.g. a detected object) and computes an updated trajectory, which is executed by the propulsion system 638 (e.g. motors, etc.) of the vehicle 600.
In the second case, the actions 622 may include reporting the detected objects 626 to an external entity, such as a server, using the communications systems 632 of the vehicle 600.
Figure 7 depicts the steps performed by an image object detector 500 according to an exemplary embodiment of the invention (this is merely an example, different parameters may be employed in other embodiments):
Input: image object detector 500 takes an image or a video-frame as an input image 102. The input image is scaled to HD resolution, 1280x720x3 (width x height x number of RGB color channels), keeping its width and height ratio.
Early convolutions 210: This ensemble is composed of a first convolution layer 710, a max-pooling layer 712 and a second residual block 714.
• First convolution layer 710: gets the input image and applies a 7x7 kernel size with stride 2, padding 3 and 64 filters. This operation halves the width and height, returning a 640x360x64 feature map.
• Max-pooling layer 712: transforms the 640x360x64 feature map into a 320x180x64 feature map through a max-pooling operation with a 3x3 kernel size and stride 2. From this point until the end, the image object detector 500 keeps the current resolution, that is, a resolution four times smaller than that of the original input image.
• Second residual block 714: residual block (Figure 8 depicts the steps performed by an ensemble of residual blocks [14] in order to extract features from the input feature map) composed of three blocks that increases the number of filters from 64 to 256, returning a 320x180x256 feature map (i.e. input feature map 302 in Figure 3). The feature map sizes in this list can be checked with the short size helper shown right after it.
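The sizes above follow from the standard output-size formula for convolution and pooling layers; the max-pooling padding of 1 is an assumption needed to obtain exactly 320x180.

```python
def conv_out(size, kernel, stride, padding):
    """Standard output-size formula for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

w, h = 1280, 720
w, h = conv_out(w, 7, 2, 3), conv_out(h, 7, 2, 3)   # first convolution layer
print(w, h)                                          # 640 360
w, h = conv_out(w, 3, 2, 1), conv_out(h, 3, 2, 1)   # max-pooling (padding assumed)
print(w, h)                                          # 320 180
```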
Region Context Network (RCN) 220: RCN 220 consists of two convolutional layers (RCN convolution 720 and RCN class score convolution 722) and a layer for the proposal of regions (RCN proposal layer 724).
• RCN convolution 720: applies a 3x3 kernel size (stride 1, padding 1) that acts as a 3x3 sliding window, mapping the input feature map 302 information in a 128-d output (320x180x128).
• RCN class score convolution 722: a 1x1 convolution that learns the necessary characteristics to differentiate between object and non-object regions at each sliding window location (2-d). Each unit of the feature map decides whether the anchor centered in that unit contains an object or not. This is done by comparing the activation values of the two units at the same spatial location: one of them learns the foreground score and the other one the background score. Returns a 320x180x2 feature map.
• RCN proposal layer 724: a custom layer that gets the class (object or non-object) scores from the RCN class score convolution 722, calculates their regions' coordinates in the input image size and returns a first set of candidate regions 222 most likely to contain an object (50x4 rcn_rois, where 50 is the number of regions and 4 is the number of coordinates for each region).
- RoI Collection Layer (RCL) 228: RCL 228 is another custom layer that obtains the first set of candidate regions 222 (rcn_rois) from the RCN proposal layer 724 and the feature map information from the second residual block 714 (input feature map 302). With both inputs, it obtains the information from the feature map of the second residual block 714, but only within the selected regions. Then, it concatenates this information in a new output feature map of size RCL_output. Successive convolutions with filters greater than 1x1 would affect the neighboring regions' outputs. To solve this problem, the RCL adds an inter-region 0-padding. For this example, if we take the top 50 most likely regions with a region size of 48x48 pixels (12x12 on the feature map of the second residual block 714) and 1-pixel 0-padding, the output feature map size is 649x12x256.
Late convolutions 230: This ensemble is composed of two residual blocks (third residual block 730 and fourth residual block 732, obtained according to the flow diagram of Figure 8).
• Third residual block 730: composed by four blocks that take as input the output from RCL 228 and increase the number of filters from 256 to 512, returning a 649x12x512 feature map. Inside the residual block and after each 3x3 convolution, restore collection padding is applied (see Figure 8), an auxiliary layer which restores the padding between regions to zero.
• Fourth residual block 732: residual block composed of six blocks that increases the number of filters from 512 to 1024, returning a 649x12x1024 feature map. As in the previous case, restore collection padding is applied.
Region Proposal Network (RPN) 240: RPN 240 consists of three convolutional layers (RPN convolution 740, RPN class score convolution 744 and RPN bounding box regression convolution 746) and a layer for the proposal of regions (RPN proposal layer 748).
• RPN convolution 740: applies a 3x3 kernel size (stride 1, padding 1) which acts as a 3x3 sliding window that maps the input feature map information in a 256-d output (649x12x256). After this operation, an auxiliary layer (remove collection padding 742) eliminates the 0-padding between regions, since there are no more 3x3 convolutions to be applied on them, returning a 600x12x256 feature map.
• RPN class score convolution 744: a 1x1 convolution that learns the necessary characteristics to differentiate between object and non-object at each sliding window location and for each defined anchor (6-d since 3 anchors are used). Returns a 600x12x6 feature map.
• RPN bounding box regression convolution 746: a 1x1 convolution that learns the necessary characteristics to apply regression to each of the four coordinates of each anchor at each sliding window location (12-d since 3 anchors are used). Returns a 600x12x12 feature map.
• RPN proposal layer 748: a custom layer that gets the first set of candidate regions 222 (rcn_rois) from the RCN proposal layer 724, the class (object or non-object) scores for each anchor from the RPN class score convolution 744 and the coordinates for each anchor from the RPN bounding box regression convolution 746. With the first set of candidate regions 222 (rcn_rois), it maps the sliding window locations for each anchor to the coordinates of those regions in the original input image. Then, it sorts those most likely to contain an object by the scores from the RPN class score convolution 744. For all of them, the regression values learned by the RPN bounding box regression convolution 746 are applied, obtaining the top N final regions (in the example N=300) in the original input image (rois). Moreover, the RPN proposal layer 748 also returns the coordinates of the 300 regions relative to the RPN input, i.e. the unsorted map of regions (scaled_rois), the second set of candidate regions 242 in Figure 2.
RoI pooling layer 248: this layer takes the feature map information of the fourth residual block 732 (i.e. output feature map 502) and the unsorted map of 300 regions scaled_rois (i.e. the second set of candidate regions 242). The auxiliary layer remove collection padding eliminates the 0-padding between regions in the feature map of the fourth residual block 732, so that the size is 600x12x1024. Then, the RoI pooling layer 248 obtains the information from the feature map of the fourth residual block 732, but only within the selected regions, and converts them to a fixed size (14x14x1024) feature map. Also, the 300 regions go forward to the next stage.
- Classifier 250: Each region of interest from the RoI pooling layer 248 is classified independently by the last residual block (fifth residual block 750, Figure 8) and an average pooling 752.
• Fifth residual block 750: residual block composed of three blocks. The first one halves the width and height. In addition, the block increases the number of filters from 1024 to 2048, returning a 7x7x2048 feature map.
• Average pooling 752: an average pooling with 7x7 kernel size reduces the dimension to 1x1x2048, ready to be classified by fully connected layers.
Decision Function 760: for each region of interest, the final decision is taken based on two fully connected layers that transform the input 1x1x2048 array into the category of the object (class score fully connected layer 762) and its corresponding bounding box regression (bounding box regression fully connected layer 764). On the one hand, the value obtained by the class score fully connected layer 762 passes through a Softmax function 766 to normalize the score into the range [0, 1] and, on the other hand, a transformation function 768 applies the bounding box regression to the rois relative to the original input image obtained from the RPN proposal layer 748. This returns the final class score 552 and bounding box 554 for each region of interest.
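A sketch of the decision function for one region of interest is given below. The softmax is standard; the bounding box decoding follows the usual R-CNN (dx, dy, dw, dh) parameterization, which the text does not spell out, so it should be read as an assumption rather than as the exact transformation function 768.

```python
import numpy as np

def softmax(scores):
    """Normalize raw class scores into the [0, 1] range."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def apply_bbox_regression(roi, deltas):
    """Decode (dx, dy, dw, dh) regression outputs against a proposal
    given as (x1, y1, x2, y2) in input-image coordinates."""
    w, h = roi[2] - roi[0], roi[3] - roi[1]
    cx, cy = roi[0] + 0.5 * w, roi[1] + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * np.exp(dw), h * np.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```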
Regarding the training of the system, as all the learnable layers are convolutional and shared, both the network that acts as a backbone (ResNet-50) and the two modules of the network (RCN module 520 and RPN module 540) can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) [20]. In an embodiment, the approximate joint training [3] has been selected.
The RCN module 520 is trained in a similar way to the RPN module 540, except for bounding box regression, which does not exist in the RCN. The fact that the RCL keeps the same number of output images per mini-batch as that of input images makes the rest of the training identical to other RPN networks like Faster-R-CNN. The initialization of anchors by k-means does not affect training either, since it is performed prior to the training. In the same way as the RPN, the RCN module 520 obtains its mini-batch from a single image by selecting positive and negative anchors. The mini-batch used within the RCN has 64 examples, trying to maintain whenever possible a 1:1 ratio of positive and negative labels. The anchor's size is obtained by estimating the effective receptive field (ERF) which, in practice, follows a Gaussian distribution [21], so half of the theoretical receptive field of the convolutions between RCN and RPN is selected as ERF. In order to eliminate overlapping regions from those proposed by the RCN, an aggressive non-maximum suppression with a low threshold (0.3) is applied over the 2,000 best proposals before the RCL, resulting in a low number of scattered regions (around 200 on average). At test time, those regions with confidence higher than 0.3 are let through the RCN, up to a maximum of 50 regions.
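The aggressive non-maximum suppression mentioned above corresponds to ordinary greedy NMS with a 0.3 overlap threshold; a self-contained reference sketch follows (the box format and helper names are illustrative).

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def greedy_nms(boxes, scores, overlap_threshold=0.3, keep_top=2000):
    """Greedy non-maximum suppression over the keep_top best-scored proposals."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i],
                   reverse=True)[:keep_top]
    kept = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= overlap_threshold for j in kept):
            kept.append(i)
    return kept  # indices of the surviving, scattered regions
```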
RCN and RCL can be integrated in any object detection convolutional framework by just adapting the corresponding region proposal method to work with unsorted regions. In an embodiment, the method has been implemented over Faster-R-CNN. The hyperparameters for training and testing are the same as those used in Faster-R-CNN. The RPN module 540 is placed between the fourth residual block 732 and the fifth residual block 750 convolutional layers, as is done in [14] for Faster-R-CNN. Finally, at test time, a box voting scheme after non-maximum suppression is applied [22]. In this implementation, the Caffe framework is used [23].
Figures 9, 10 and 11 depict several possible applications of the method and the image object detector of the present invention. However, it is noted that the present invention could be applied to many other real case scenarios. Among the different use cases envisaged for the proposed invention, the following are highlighted:
- Airspace surveillance.
- Ground surveillance from an aerial position.
- Detect & avoid.
The first use case, airspace surveillance, is depicted in Figure 9. The airspace surveillance system 900 of Figure 9 comprises a camera 912 (implemented in this example as a video camera) located on the ground 902, mounted on a pole or on a terrestrial moving platform. The camera 912 is pointing towards the sky, either in a vertical or in an oblique direction. The camera 912 monitors a determined airspace region 903. The airspace region 903 monitored can be a static region, if the position and orientation of the camera is fixed, or a dynamic region, if the position, orientation and/or zoom of the camera 912 dynamically changes.
The video stream (sequence of input images 102) acquired by the camera 912 is sent to a processor 914 for further analysis in order to detect all those flying objects 904 (e.g. a drone in the example of Figure 9) which appear in the field of vision 906 of the camera 912 and are represented in the input image 102 as small objects 908 (with a size up to 16x16 pixels). To that end, the processor 914 implements the image object detector 500 of Figure 5. The processor 914 may be placed together with the camera 912 (i.e. locally) or remotely, for instance in a remote data center. In the latter case, the input images 102 are transmitted to the processor 914 using a broadband data connection, such as the Internet. The camera 912 and the image object detector 500 implemented by the processor 914 form an object detection system 910 in charge of monitoring the airspace and detecting, in real time, any kind of flying objects 904 (such as aircraft, e.g. drones or airships, parachutists, or even meteorites).
The airspace surveillance system 900 is helpful in scenarios where monitoring the airspace for security and/or safety reasons is critical, such as airports (in order to detect flying objects that can pose a potential hazard for commercial or military aviation), nuclear plants, transportation hubs, government facilities, football stadiums and any other critical infrastructure. The small object detection performed by the airspace surveillance system 900 is carried out as soon as possible (i.e. whenever the flying object 904 appears in the field of view 906 of the camera 912, regardless of its size), in order to take the contingency actions required. The airspace surveillance system 900 may optionally include a decision module 920 to determine one or more actions 922 to be carried out, based on the object detection made by the object detection system 910. Actions 922 may include, for instance, neutralizing a drone flying in the airspace near an airport, sending an alarm message, etc. The airspace surveillance system 900 may also comprise means (not shown in the figure) for executing the actions 922 determined by the decision module 920. For example, if the action to be taken is neutralizing a detected drone, the airspace surveillance system 900 may include a missile launcher to destroy the drone.
Figure 10 depicts the image object detector 500 applied to ground surveillance from aerial positions (e.g. from aerial vehicles or platforms). In this case, the ground surveillance system 1000 comprises a camera 1012 (e.g. a video camera) mounted on an aerial vehicle 1001 (e.g. a drone), pointing downwards toward the ground 1002, either in a vertical or in an oblique direction, to monitor a ground region 1003. The video stream captured by the camera 1012 is sent to a processor 1014 (either on-board the aerial vehicle 1001 or remotely located on an on-ground facility) in charge of analyzing the input images 102 to detect static or moving small terrestrial objects 1004 (e.g. people, as depicted in the example of Figure 10) on the ground 1002 (land, sea, river, lagoon, etc.) which appear in the field of vision 1006 of the camera 1012 and are represented in the input image 102 as small objects 1008 (with a size up to 16x16 pixels). The ground surveillance system 1000 may be applied in different aerial-based surveillance scenarios, such as:
- Search and rescue, e.g. in maritime environments, for locating either boats or people in the sea; in land environments, for detecting hikers (the small terrestrial objects 1004 in the example of Figure 10) who got lost.
- Security applications, for detecting specific targets (e.g. vehicles or people) approaching a given protected area (e.g. in Homeland Security applications: illegal immigrants approaching a border).
- Traffic surveillance, in order to detect vehicles, traffic jams, and other traffic management related events.
The camera 1012 and the image object detector 500 implemented by the processor 1014 form an object detection system 1010 in charge of monitoring the ground region 1003 and detecting, in real time, any kind of terrestrial objects 1004 (e.g. people).
The ground surveillance system 1000 may also include a decision module 1020 to determine one or more actions 1022 to be performed based on the object detection made by the object detection system 1010. Actions 1022 may include, among others, sending a message to a remote station informing about the detected objects. The ground surveillance system 1000 may also comprise means (not shown in the figure) for executing the actions 1022.
In the embodiment of Figure 11, an application of the image object detector 500 to avoid collisions (i.e. detect and avoid applications) is depicted. This is a particularly useful application for the vehicle 600 of Figure 6. In this embodiment, the detect and avoid system 1100 comprises a camera 1112 (e.g. a video camera) mounted onboard a vehicle 1101, such as the aircraft depicted in Figure 11 or any other type of vehicle (e.g. an autonomous car, a drone, etc.). The camera 1112 points forward, in the direction of movement of the vehicle 1101, either in a horizontal or in a slightly oblique direction, towards a dynamic region 1103 (in the embodiment, an airspace region). The video stream of the camera 1112 is analyzed by an on-board processor 1114, in order to detect other small flying objects 1104 which appear in the field of vision 1106 of the camera 1112, are represented in the input image 102 as small objects 1108 with a size of up to 16x16 pixels, and may pose a potential obstacle for the aerial vehicle 600.
The camera 1112 and the image object detector 500 implemented by the processor 1114 form an object detection system in charge of monitoring the airspace region 1103 and detecting, in real time, any kind of flying objects 1104 (e.g. drones, birds).
The detect and avoid system 1100 comprises a decision module 1120 that determines one or more actions 1122 to be performed based on the flying objects 1104 detected by the object detection system. The actions 1122 determined by the decision module 1120 are aimed at avoiding collisions with the detected flying objects 1104. For instance, a new trajectory may be computed by the decision module 1120 for execution by the vehicle 1101 (e.g. by the FMS of an aircraft or by an autonomous navigation module of a drone). A vehicle 1101 comprising the detect and avoid system 1100 may also be part of the invention, the vehicle comprising means for executing the actions 1122 to avoid collisions.
The small object detection performed by the detect and avoid system 1100 is carried out as soon as possible (i.e. as soon as the flying object 1104 appears in the field of view 1106 of the camera 1112, regardless of its size), in order to take the contingency actions required to avoid potential collisions.
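By way of illustration only, the Python sketch below shows how the components described above (a camera, the image object detector 500 and a decision module such as 1120) could be wired into a detect and avoid loop. The functions capture_frame, run_detector and plan_avoidance_trajectory are hypothetical stubs introduced for this example and do not represent the disclosed implementations.

```python
# Illustrative sketch only. The camera, detector and trajectory planner below are
# hypothetical stubs standing in for the camera 1112, the image object detector 500
# and the decision module 1120; they are not the disclosed implementations.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str      # e.g. "drone", "bird"
    score: float    # class score produced by the classifier module
    box: tuple      # (x, y, w, h) bounding box in the input image, in pixels

def capture_frame():
    """Stub for the on-board camera: returns a dummy frame."""
    return [[0] * 640 for _ in range(480)]

def run_detector(frame) -> List[Detection]:
    """Stub for the image object detector 500 (small objects up to 16x16 pixels)."""
    return [Detection("drone", 0.91, (312, 150, 12, 12))]

def plan_avoidance_trajectory(detections: List[Detection]):
    """Stub for the decision module: returns an action for the vehicle systems."""
    return {"action": "climb", "delta_altitude_m": 30}

def detect_and_avoid_loop(max_frames: int = 3, score_threshold: float = 0.5):
    for _ in range(max_frames):
        frame = capture_frame()                          # video stream of the camera
        detections = run_detector(frame)                 # object detection system
        hazards = [d for d in detections if d.score >= score_threshold]
        if hazards:
            action = plan_avoidance_trajectory(hazards)  # decision module
            print("avoidance action:", action)           # handed to the vehicle systems

if __name__ == "__main__":
    detect_and_avoid_loop()
```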
It is also important to remark that the method and image object detector 500 of the present invention are especially advantageous, compared with the prior art, for detecting small objects (equal to or smaller than 16x16 pixels) on an image. However, the invention may also be applied to detecting bigger objects (i.e. larger than 16x16 pixels) on an image.

Claims

1. A computer-implemented method for detecting small objects on an image using convolutional neural networks, comprising:
applying one or more convolution operations (210) to an input image (102) to obtain a first set of convolutional layers (212) and an input feature map (302) corresponding to the last convolutional block (216) of said first set (212);
analyzing the input feature map (302) to determine a first set of candidate regions (222) containing candidate objects;
arranging the first set of candidate regions (222) to form a reduced feature map (228);
applying one or more convolution operations (230) to the reduced feature map (228) to obtain a second set of convolutional layers (232) and an output feature map (502) corresponding to the last convolutional block (236) of said second set (232);
applying a Region Proposal Network (240) to the output feature map (502) to obtain a second set of candidate regions (242) containing candidate objects;
classifying and applying bounding box regression (250) to each candidate region of the second set (242) to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image (102).
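By way of illustration only, and not as part of the claimed subject-matter, the following Python/PyTorch sketch walks through the tensor shapes produced by the steps of claim 1. The input resolution, channel counts, stride, number K of candidate regions and region size are assumptions chosen for readability, not values prescribed by the claims.

```python
# Shape walk-through of the method steps under assumed sizes (all numbers illustrative).
import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 256)                        # input image (102)

# First set of convolution operations (210) -> input feature map (302)
conv1 = nn.Sequential(nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU())
fmap = conv1(x)                                        # (1, 64, 64, 64)

# Assume K candidate regions (222) of r x r positions are selected from fmap
K, r, C = 10, 8, fmap.shape[1]

# Reduced feature map (228): regions side by side with one 0-padding column in between
reduced = torch.zeros(1, C, r, K * r + (K - 1))        # (1, 64, 8, 89)

# Second set of convolution operations (230) -> output feature map (502)
conv2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
out = conv2(reduced)                                   # (1, 128, 8, 89)

# A Region Proposal Network (240) and the classifier/regressor (250) then operate on out
print(fmap.shape, reduced.shape, out.shape)
```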
2. The computer-implemented method of claim 1, wherein the first set of candidate regions (222) are determined by:
applying a first convolution operation to the input feature map (302) to obtain an intermediate convolutional layer (224) and an associated intermediate feature map;
applying a second convolution operation to the intermediate feature map to obtain a class feature map (226) including class scores as candidate objects;
selecting a determined number of regions in the input feature map (302) according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions (222) includes the selected regions.
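A minimal sketch of the two convolutions and the score-based selection of claim 2 is given below (Python/PyTorch); the channel counts, the number K of retained regions and the 3x3 region size are illustrative assumptions, not values taken from the claims.

```python
# Illustrative sketch of claim 2: two convolutions produce a class feature map and the
# K best-scoring positions of the input feature map are kept as candidate regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

fmap = torch.randn(1, 64, 64, 64)                  # input feature map (302)

inter_conv = nn.Conv2d(64, 64, 3, padding=1)       # first convolution -> intermediate layer (224)
score_conv = nn.Conv2d(64, 1, 1)                   # second convolution -> class feature map (226)

intermediate = F.relu(inter_conv(fmap))
class_map = score_conv(intermediate)               # one object score per spatial position

K, r = 10, 3                                       # assumed number and side of candidate regions
b, c, h, w = fmap.shape
top = class_map.flatten(1).topk(K, dim=1).indices[0]   # K highest-scoring positions

# Crop an r x r window of the input feature map around each selected position
pad = r // 2
fpad = F.pad(fmap, (pad, pad, pad, pad))
regions = torch.stack([
    fpad[0, :, int(i) // w: int(i) // w + r, int(i) % w: int(i) % w + r]
    for i in top
])
print(regions.shape)                               # (K, 64, 3, 3): first set of candidate regions (222)
```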
3. The computer-implemented method of any preceding claim, wherein the step of arranging the first set of candidate regions (222) to form a reduced feature map (228) comprises concatenating the candidate regions (222) and adding an inter-region 0-padding between adjacent candidate regions (222).
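The arrangement of claim 3 may be pictured with the short sketch below (Python/PyTorch): the candidate regions are concatenated side by side with a single column of 0-padding between adjacent regions. The region size and count are assumptions, and a real implementation would also keep track of the original position of each region in the input feature map.

```python
# Illustrative sketch of claim 3: concatenate K candidate regions into one reduced
# feature map, inserting a column of 0-padding between adjacent regions.
import torch

K, C, r = 10, 64, 3
regions = torch.randn(K, C, r, r)                  # first set of candidate regions (222)

pieces = []
zero_column = torch.zeros(C, r, 1)                 # inter-region 0-padding
for i, region in enumerate(regions):
    pieces.append(region)
    if i < K - 1:
        pieces.append(zero_column)

reduced = torch.cat(pieces, dim=2).unsqueeze(0)    # reduced feature map (228)
print(reduced.shape)                               # (1, 64, 3, K*r + K - 1) = (1, 64, 3, 39)
```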
4. The computer-implemented method of any preceding claim, further comprising a preprocessing stage wherein the number and the size of the anchors used in the Region Proposal Network (240) are automatically learned through k-means applied to a training set of ground truth boxes.
5. The computer-implemented method of claim 4, wherein the number of anchors is automatically obtained by performing an iterative k-means with an increasing number of kernels until the maximum inter-kernel IoU ratio exceeds a certain threshold.
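The pre-processing of claims 4 and 5 may be pictured with the sketch below (Python/NumPy). The toy ground-truth boxes, the IoU threshold, the plain Lloyd-style k-means and the stopping rule (keeping the last set of kernels obtained before the threshold is exceeded) are one possible reading given for illustration only; here the IoU is computed between centroid boxes aligned at a common corner.

```python
# Illustrative sketch of claims 4-5: learn anchor sizes by k-means on the widths and
# heights of ground-truth boxes, increasing the number of kernels until two centroids
# become too similar (their IoU exceeds a threshold). Data and threshold are toy values.
import numpy as np

def iou_wh(a, b):
    """IoU of two boxes given as (w, h), assuming they share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_wh(boxes, k, iters=50, seed=0):
    """Plain Lloyd's k-means on (w, h) pairs with Euclidean distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(boxes[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = boxes[labels == j].mean(axis=0)
    return centroids

def learn_anchors(boxes, iou_threshold=0.85, max_k=10):
    anchors = kmeans_wh(boxes, 1)
    for k in range(2, max_k + 1):
        candidate = kmeans_wh(boxes, k)
        max_iou = max(iou_wh(candidate[i], candidate[j])
                      for i in range(k) for j in range(i + 1, k))
        if max_iou > iou_threshold:      # two kernels overlap too much: stop growing
            break
        anchors = candidate
    return anchors

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    gt_wh = rng.uniform(4, 16, size=(200, 2))   # toy ground-truth widths/heights (pixels)
    print(learn_anchors(gt_wh))
```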
6. An image object detector based on convolutional neural networks, comprising:
a feature extractor module (510) configured to:
apply one or more convolution operations (210) to an input image (102) to obtain a first set of convolutional layers (212) and an input feature map (302) corresponding to the last convolutional block (216) of said first set (212);
apply one or more convolution operations (230) to a reduced feature map (228) to obtain a second set of convolutional layers (232) and an output feature map (502) corresponding to the last convolutional block (236) of said second set (232);
a region context network module (520) configured to analyze the input feature map (302) to determine a first set of candidate regions (222) containing candidate objects;
a Rol collection layer module (530) configured to arrange the first set of candidate regions (222) to form the reduced feature map (228);
a Region Proposal Network module (540) configured to obtain, from the output feature map (502), a second set of candidate regions (242) containing candidate objects;
a classifier module (550) configured to classify and apply bounding box regression to each candidate region of the second set (242) to obtain, for each candidate region, a class score (552) as a candidate object and a bounding box (554) in the input image (102).
7. The image object detector of claim 6, wherein the region context network module (520) is configured to:
apply a first convolution operation to the input feature map (302) to obtain an intermediate convolutional layer (224) and an associated intermediate feature map;
apply a second convolution operation to the intermediate feature map to obtain a class feature map (226) including class scores as candidate objects;
select a determined number of regions in the input feature map (302) according to the class scores as candidate objects of the class feature map (226), wherein the first set of candidate regions (222) includes the selected regions.
8. The image object detector of any of claims 6 to 7, wherein the Rol collection layer module (530) is configured to form the reduced feature map (228) by concatenating the candidate regions (222) and adding an inter-region 0-padding between adjacent candidate regions (222).
9. The image object detector of any of claims 6 to 8, implemented in a processor (914; 1014) or a GPU (614).
10. An object detection system for detecting small objects on an image using convolutional neural networks, the object detection system (610; 910) comprising:
a camera (612; 912) configured to capture an input image (102), and
an image object detector (500) according to any of claims 6 to 9.
11. A vehicle (600; 1100), comprising:
an object detection system (610; 1110) according to claim 10; and
a decision module (620; 1120) configured to determine, based on the object detection made by the object detection system (610; 1110), at least one action (622; 1122) for execution by one or more vehicle systems (630; 1130) of the vehicle (600; 1100).
12. An airspace surveillance system (900), comprising:
an object detection system (910) according to claim 10, wherein the camera (912) of the object detection system (910) is mounted on a ground location and is configured to monitor an airspace region (903); and
a decision module (920) configured to determine, based on the object detection made by the object detection system (910), at least one action (922) for execution.
13. A ground surveillance system (1000), comprising:
an object detection system (1010) according to claim 10, wherein the object detection system (1010) is installed on an aerial platform or vehicle (1001 ) and the camera (1012) of the object detection system (1010) is configured to monitor a ground region (1003); and
a decision module (1020) configured to determine, based on the object detection made by the object detection system (1010), at least one action (1022) for execution.
14. A detect and avoid system (1100) installed onboard a vehicle (1101), comprising:
an object detection system (1110) according to claim 10, wherein the camera (1112) of the object detection system (1110) is configured to monitor a region (1103) in front of the vehicle (1101); and
a decision module (1120) configured to determine, based on the object detection made by the object detection system (1110), at least one action (1122) to avoid potential collisions.
15. A computer program product for detecting small objects on an image using convolutional neural networks, comprising at least one computer-readable storage medium having recorded thereon computer code instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1 to 5.
PCT/EP2018/072857 2018-07-24 2018-08-24 A computer-implemented method and system for detecting small objects on an image using convolutional neural networks Ceased WO2020020472A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
ES202190001A ES2908944B2 (en) 2018-07-24 2018-08-24 A COMPUTER IMPLEMENTED METHOD AND SYSTEM FOR DETECTING SMALL OBJECTS IN AN IMAGE USING CONVOLUTIONAL NEURAL NETWORKS

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ESP201830753 2018-07-24
ES201830753 2018-07-24

Publications (1)

Publication Number Publication Date
WO2020020472A1 true WO2020020472A1 (en) 2020-01-30

Family

ID=63557402

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/072857 Ceased WO2020020472A1 (en) 2018-07-24 2018-08-24 A computer-implemented method and system for detecting small objects on an image using convolutional neural networks

Country Status (2)

Country Link
ES (1) ES2908944B2 (en)
WO (1) WO2020020472A1 (en)

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368653A (en) * 2020-02-19 2020-07-03 杭州电子科技大学 A low-altitude small target detection method based on R-D graph and deep neural network
CN111401297A (en) * 2020-04-03 2020-07-10 天津理工大学 Triphibian robot target recognition system and method based on edge calculation and neural network
CN111415000A (en) * 2020-04-29 2020-07-14 Oppo广东移动通信有限公司 Convolutional neural network, and data processing method and device based on convolutional neural network
CN111428602A (en) * 2020-03-18 2020-07-17 浙江科技学院 Binocular saliency image detection method based on edge-assisted enhancement of convolutional neural network
CN111597945A (en) * 2020-05-11 2020-08-28 济南博观智能科技有限公司 Target detection method, device, equipment and medium
CN111611925A (en) * 2020-05-21 2020-09-01 重庆现代建筑产业发展研究院 Building detection and identification method and device
CN111626208A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Method and apparatus for detecting small targets
CN111666850A (en) * 2020-05-28 2020-09-15 浙江工业大学 Cell image detection and segmentation method for generating candidate anchor frame based on clustering
CN111797769A (en) * 2020-07-06 2020-10-20 东北大学 A Small Target Sensitive Vehicle Detection System
CN111916206A (en) * 2020-08-04 2020-11-10 重庆大学 A cascade-based CT image-aided diagnosis system
CN111950488A (en) * 2020-08-18 2020-11-17 山西大学 An Improved Faster-RCNN Remote Sensing Image Object Detection Method
CN112036455A (en) * 2020-08-19 2020-12-04 浙江大华技术股份有限公司 Image identification method, intelligent terminal and storage medium
CN112069907A (en) * 2020-08-11 2020-12-11 盛视科技股份有限公司 X-ray machine image recognition method, device and system based on example segmentation
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN112115847A (en) * 2020-09-16 2020-12-22 深圳印像数据科技有限公司 Method for judging face emotion joyfulness
CN112329861A (en) * 2020-11-06 2021-02-05 北京工业大学 Layered feature fusion method for multi-target detection of mobile robot
CN112364687A (en) * 2020-09-29 2021-02-12 上善智城(苏州)信息科技有限公司 Improved Faster R-CNN gas station electrostatic sign identification method and system
CN112733691A (en) * 2021-01-04 2021-04-30 北京工业大学 Multi-direction unmanned aerial vehicle aerial photography vehicle detection method based on attention mechanism
CN112949499A (en) * 2021-03-04 2021-06-11 北京联合大学 Improved MTCNN face detection method based on ShuffleNet
CN112966579A (en) * 2021-02-24 2021-06-15 湖南三湘绿谷生态科技有限公司 Large-area camellia oleifera forest rapid yield estimation method based on unmanned aerial vehicle remote sensing
CN113012220A (en) * 2021-02-02 2021-06-22 深圳市识农智能科技有限公司 Fruit counting method and device and electronic equipment
CN113011561A (en) * 2021-03-04 2021-06-22 中国人民大学 Method for processing data based on logarithm polar space convolution
CN113139540A (en) * 2021-04-02 2021-07-20 北京邮电大学 Backboard detection method and equipment
EP3905116A1 (en) * 2020-04-29 2021-11-03 FotoNation Limited Image processing system for identifying and tracking objects
CN113705387A (en) * 2021-08-13 2021-11-26 国网江苏省电力有限公司电力科学研究院 Method for detecting and tracking interferent for removing foreign matters on overhead line by laser
CN113780147A (en) * 2021-09-06 2021-12-10 西安电子科技大学 A lightweight dynamic fusion convolutional net-based hyperspectral object classification method and system
KR102344004B1 (en) * 2020-07-09 2021-12-27 정영규 Deep learning based real-time small target detection device for cpu only embedded board
CN113963265A (en) * 2021-09-13 2022-01-21 北京理工雷科电子信息技术有限公司 A fast detection and recognition method of small samples and small targets in complex remote sensing terrestrial environment
CN114120056A (en) * 2021-10-29 2022-03-01 中国农业大学 Small target identification method, small target identification device, electronic equipment, medium and product
WO2022074483A1 (en) * 2020-10-05 2022-04-14 International Business Machines Corporation Action-object recognition in cluttered video scenes using text
CN114596450A (en) * 2022-03-17 2022-06-07 四川邦辰信息科技有限公司 Image inclusion detection method and device
CN114611685A (en) * 2022-03-08 2022-06-10 安谋科技(中国)有限公司 Feature processing method, medium, device, and program product in neural network model
CN114627437A (en) * 2022-05-16 2022-06-14 科大天工智能装备技术(天津)有限公司 A kind of traffic target recognition method and system
EP4016473A1 (en) * 2020-12-16 2022-06-22 HERE Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module to identify objects of interest within an image utilizing digital signatures
CN114677611A (en) * 2021-03-22 2022-06-28 腾讯云计算(北京)有限责任公司 Data identification method, storage medium and device
CN114723733A (en) * 2022-04-26 2022-07-08 湖北工业大学 A Class Activation Mapping Method and Device Based on Axiom Interpretation
US11423252B1 (en) 2021-04-29 2022-08-23 International Business Machines Corporation Object dataset creation or modification using labeled action-object videos
WO2022213307A1 (en) * 2021-04-07 2022-10-13 Nokia Shanghai Bell Co., Ltd. Adaptive convolutional neural network for object detection
CN115203449A (en) * 2022-07-15 2022-10-18 中国人民解放军国防科技大学 Data processing method and device
CN115346170A (en) * 2022-08-11 2022-11-15 北京市燃气集团有限责任公司 Intelligent monitoring method and device for gas facility area
US20220391615A1 (en) * 2021-06-01 2022-12-08 Hummingbird Technologies Limited Tool for counting and sizing plants in a field
US11587253B2 (en) 2020-12-23 2023-02-21 Here Global B.V. Method, apparatus, and computer program product for displaying virtual graphical data based on digital signatures
CN115908874A (en) * 2022-11-29 2023-04-04 华中光电技术研究所(中国船舶集团有限公司第七一七研究所) A Siamese Network Based Target Tracking Model De-redundancy Method
CN115984846A (en) * 2023-02-06 2023-04-18 山东省人工智能研究院 An intelligent recognition method for small targets in high-resolution images based on deep learning
CN113469272B (en) * 2021-07-20 2023-05-19 东北财经大学 Object detection method for hotel scene pictures based on Faster R-CNN-FFS model
CN116597331A (en) * 2023-06-01 2023-08-15 北京联合大学 A Lightweight Object Detection Method for UAV Aerial Images
CN116824386A (en) * 2023-03-22 2023-09-29 齐鲁工业大学(山东省科学院) Method and system for detecting rotating targets in aerial remote sensing images
CN117132856A (en) * 2023-07-31 2023-11-28 南京信息工程大学 A small target detection method using asymmetric modulation fusion features
US11830103B2 (en) 2020-12-23 2023-11-28 Here Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module using augmented data
US11829192B2 (en) 2020-12-23 2023-11-28 Here Global B.V. Method, apparatus, and computer program product for change detection based on digital signatures
CN117292394A (en) * 2023-09-27 2023-12-26 自然资源部地图技术审查中心 Map review method and device
CN117442190A (en) * 2023-12-21 2024-01-26 山东第一医科大学附属省立医院(山东省立医院) Automatic wound surface measurement method and system based on target detection
CN117496132A (en) * 2023-12-29 2024-02-02 数据空间研究院 Scale sensing detection method for small-scale target detection
US11991295B2 (en) 2021-12-07 2024-05-21 Here Global B.V. Method, apparatus, and computer program product for identifying an object of interest within an image from a digital signature generated by a signature encoding module including a hypernetwork
CN118229964A (en) * 2024-05-24 2024-06-21 厦门大学 Small target detection method based on full pipeline improvement
US12073615B2 (en) 2020-12-16 2024-08-27 Here Global B.V. Method, apparatus, and computer program product for identifying objects of interest within an image captured by a relocatable image capture device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782319B (en) * 2022-03-24 2024-08-23 什维新智医疗科技(上海)有限公司 Method for identifying scale for ultrasonic image

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354159B2 (en) * 2016-09-06 2019-07-16 Carnegie Mellon University Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network
CN108009509A (en) * 2017-12-12 2018-05-08 河南工业大学 Vehicle target detection method

Non-Patent Citations (27)

* Cited by examiner, † Cited by third party
Title
A. ROZANTSEV; V. LEPETIT; P. FUA: "Detecting flying objects using a single moving camera", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 39, no. 5, 2017, pages 879 - 892, XP011644552, DOI: doi:10.1109/TPAMI.2016.2564408
BRAIS BOSQUET ET AL: "STDnet: A ConvNet for Small Target Detection", 3 September 2018 (2018-09-03) - 3 September 2018 (2018-09-03), XP055570891, Retrieved from the Internet <URL:http://bmvc2018.org/contents/papers/0897.pdf> [retrieved on 20190319] *
C. EGGERT; D. ZECHA; S. BREHM; R. LIENHART: "ACM on International Conference on Multimedia Retrieval", 2017, ACM, article "Improving small object proposals for company logo detection", pages: 167 - 174
C. FEICHTENHOFER; A. PINZ; A. ZISSERMAN: "Detect to track and track to detect", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2017, pages 3038 - 3046
CHU MENGDIE ET AL: "Rich Features and Precise Localization with Region Proposal Network for Object Detection", 20 October 2017, PROC. INT.CONF. ADV. BIOMETRICS (ICB); [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, ISBN: 978-3-642-17318-9, pages: 605 - 614, XP047451367 *
F. YANG; W. CHOI; Y. LIN: "Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2016, pages 2129 - 2137, XP033021392, DOI: doi:10.1109/CVPR.2016.234
J. DAI; Y. LI; K. HE; J. SUN: "R-fcn: Object detection via region-based fully convolutional networks", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS (NIPS, 2016, pages 379 - 387
J. HUANG; V. RATHOD; C. SUN; M. ZHU; A. KORATTIKARA; A. FATHI; I. FISCHER; Z. WOJNA; Y. SONG; S. GUADARRAMA ET AL.: "Speed/accuracy trade-offs for modern convolutional object detectors", IEEE COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2017
J. LI; X. LIANG; Y. WEI; T. XU; J. FENG; S. YAN: "Perceptual generative adversarial networks for small object detection", IEEE COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2017
J. REDMON; A. FARHADI: "Yolo9000: Better, faster, stronger", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2017, pages 6517 - 6525, XP033250016, DOI: doi:10.1109/CVPR.2017.690
K. HE; X. ZHANG; S. REN; J. SUN: "Deep residual learning for image recognition", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2016, pages 770 - 778, XP033021254, DOI: doi:10.1109/CVPR.2016.90
K. SIMONYAN; A. ZISSERMAN: "Very deep convolutional networks for large-scale image recognition", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2015
M. D. ZEILER; R. FERGUS: "European Conference on Computer Vision (ECCV", 2014, SPRINGER, article "Visualizing and understanding convolutional networks", pages: 818 - 833
M. EVERINGHAM; L. VAN GOOL; C. K. WILLIAMS; J. WINN; A. ZISSERMAN: "The pascal visual object classes (voc) challenge", INTERNATIONAL JOURNAL OF COMPUTER VISION, vol. 88, no. 2, 2010, pages 303 - 338, XP019796004
P. DOLLAR; C. WOJEK; B. SCHIELE; P. PERONA: "Pedestrian detection: An evaluation of the state of the art", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 34, no. 4, 2012, pages 743 - 761, XP011490656, DOI: doi:10.1109/TPAMI.2011.155
S. GIDARIS; N. KOMODAKIS: "Object detection via a multi-region and semantic segmentation-aware cnn model", IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV, 2015, pages 1134 - 1142, XP032866440, DOI: doi:10.1109/ICCV.2015.135
S. REN; K. HE; R. GIRSHICK; J. SUN: "Faster r-cnn: Towards real-time object detection with region proposal networks", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS (NIPS, 2015, pages 91 - 99
SHAOQING REN ET AL: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 39, no. 6, 1 June 2017 (2017-06-01), USA, pages 1137 - 1149, XP055560008, ISSN: 0162-8828, DOI: 10.1109/TPAMI.2016.2577031 *
T.-Y. LIN; M. MAIRE; S. BELONGIE; J. HAYS; P. PERONA; D. RAMANAN; P. DOLLAR; C. L. ZITNICK: "European Conference on Computer Vision (ECCV", 2014, SPRINGER, article "Microsoft coco: Common objects in context", pages: 740 - 755
T.-Y. LIN; P. DOLLAR; R. GIRSHICK; K. HE; B. HARIHARAN; S. BELONGIE: "Feature pyramid networks for object detection", IEEE COMPUTER VISION AND PATTERN RECOGNITION (CVPR, vol. 1, 2017, pages 4
V. NAIR; G. E. HINTON: "Rectified linear units improve restricted boltzmann machines", 27TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML, 2010, pages 807 - 814, XP055398393
W. LUO; Y. LI; R. URTASUN; R. ZEMEL: "Understanding the effective receptive field in deep convolutional neural networks", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS (NIPS, 2016, pages 4898 - 4906
Y. JIA; E. SHELHAMER; J. DONAHUE; S. KARAYEV; J. LONG; R. GIRSHICK; S. GUADARRAMA; T. DARRELL: "22nd ACM International Conference on Multimedia", 2014, ACM, article "Caffe: Convolutional architecture for fast feature embedding", pages: 675 - 678
Y. LECUN; B. BOSER; J. S. DENKER; D. HENDERSON; R. E. HOWARD; W. HUBBARD; L. D. JACKEL: "Backpropagation applied to handwritten zip code recognition", NEURAL COMPUTATION, vol. 1, no. 4, 1989, pages 541 - 551, XP000789854
YAN CHAO ET AL: "A new two-stage object detection network without RoI-Pooling", 2018 CHINESE CONTROL AND DECISION CONFERENCE (CCDC), IEEE, 9 June 2018 (2018-06-09), pages 1680 - 1685, XP033370487, DOI: 10.1109/CCDC.2018.8407398 *
Z. CAI; Q. FAN; R. S. FERIS; N. VASCONCELOS: "European Conference on Computer Vision (ECCV", 2016, SPRINGER, article "A unified multi-scale deep convolutional neural network for fast object detection", pages: 354 - 370
Z. ZHU; D. LIANG; S. ZHANG; X. HUANG; B. LI; S. HU: "Traffic-sign detection and classification in the wild", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2016, pages 2110 - 2118, XP033021390, DOI: doi:10.1109/CVPR.2016.232

Cited By (77)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368653A (en) * 2020-02-19 2020-07-03 杭州电子科技大学 A low-altitude small target detection method based on R-D graph and deep neural network
CN111368653B (en) * 2020-02-19 2023-09-08 杭州电子科技大学 A low-altitude small target detection method based on R-D graph and deep neural network
CN111428602A (en) * 2020-03-18 2020-07-17 浙江科技学院 Binocular saliency image detection method based on edge-assisted enhancement of convolutional neural network
CN111401297A (en) * 2020-04-03 2020-07-10 天津理工大学 Triphibian robot target recognition system and method based on edge calculation and neural network
CN111415000B (en) * 2020-04-29 2024-03-22 Oppo广东移动通信有限公司 Convolutional neural network, data processing method and device based on convolutional neural network
EP3905116A1 (en) * 2020-04-29 2021-11-03 FotoNation Limited Image processing system for identifying and tracking objects
CN111415000A (en) * 2020-04-29 2020-07-14 Oppo广东移动通信有限公司 Convolutional neural network, and data processing method and device based on convolutional neural network
CN111597945B (en) * 2020-05-11 2023-08-18 济南博观智能科技有限公司 A target detection method, device, equipment and medium
CN111597945A (en) * 2020-05-11 2020-08-28 济南博观智能科技有限公司 Target detection method, device, equipment and medium
CN111611925A (en) * 2020-05-21 2020-09-01 重庆现代建筑产业发展研究院 Building detection and identification method and device
CN111626208A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Method and apparatus for detecting small targets
CN111626208B (en) * 2020-05-27 2023-06-13 阿波罗智联(北京)科技有限公司 Method and device for detecting small objects
CN111666850A (en) * 2020-05-28 2020-09-15 浙江工业大学 Cell image detection and segmentation method for generating candidate anchor frame based on clustering
CN111797769A (en) * 2020-07-06 2020-10-20 东北大学 A Small Target Sensitive Vehicle Detection System
CN111797769B (en) * 2020-07-06 2023-06-30 东北大学 A Vehicle Detection System Sensitive to Small Objects
KR102344004B1 (en) * 2020-07-09 2021-12-27 정영규 Deep learning based real-time small target detection device for cpu only embedded board
CN111916206B (en) * 2020-08-04 2023-12-08 重庆大学 A CT image-assisted diagnosis system based on cascade
CN111916206A (en) * 2020-08-04 2020-11-10 重庆大学 A cascade-based CT image-aided diagnosis system
CN112069907A (en) * 2020-08-11 2020-12-11 盛视科技股份有限公司 X-ray machine image recognition method, device and system based on example segmentation
CN111950488B (en) * 2020-08-18 2022-07-19 山西大学 Improved Faster-RCNN remote sensing image target detection method
CN111950488A (en) * 2020-08-18 2020-11-17 山西大学 An Improved Faster-RCNN Remote Sensing Image Object Detection Method
CN112036455B (en) * 2020-08-19 2023-09-01 浙江大华技术股份有限公司 Image identification method, intelligent terminal and storage medium
CN112036455A (en) * 2020-08-19 2020-12-04 浙江大华技术股份有限公司 Image identification method, intelligent terminal and storage medium
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN112115847B (en) * 2020-09-16 2024-05-17 深圳印像数据科技有限公司 Face emotion pleasure degree judging method
CN112115847A (en) * 2020-09-16 2020-12-22 深圳印像数据科技有限公司 Method for judging face emotion joyfulness
CN112364687A (en) * 2020-09-29 2021-02-12 上善智城(苏州)信息科技有限公司 Improved Faster R-CNN gas station electrostatic sign identification method and system
GB2614170A (en) * 2020-10-05 2023-06-28 Ibm Action-object recognition in cluttered video scenes using text
WO2022074483A1 (en) * 2020-10-05 2022-04-14 International Business Machines Corporation Action-object recognition in cluttered video scenes using text
US11928849B2 (en) 2020-10-05 2024-03-12 International Business Machines Corporation Action-object recognition in cluttered video scenes using text
GB2614170B (en) * 2020-10-05 2023-12-13 Ibm Action-object recognition in cluttered video scenes using text
CN112329861B (en) * 2020-11-06 2024-05-28 北京工业大学 A hierarchical feature fusion method for multi-target detection in mobile robots
CN112329861A (en) * 2020-11-06 2021-02-05 北京工业大学 Layered feature fusion method for multi-target detection of mobile robot
US12073615B2 (en) 2020-12-16 2024-08-27 Here Global B.V. Method, apparatus, and computer program product for identifying objects of interest within an image captured by a relocatable image capture device
US11900662B2 (en) 2020-12-16 2024-02-13 Here Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module to identify objects of interest within an image utilizing digital signatures
EP4016473A1 (en) * 2020-12-16 2022-06-22 HERE Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module to identify objects of interest within an image utilizing digital signatures
US11830103B2 (en) 2020-12-23 2023-11-28 Here Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module using augmented data
US11829192B2 (en) 2020-12-23 2023-11-28 Here Global B.V. Method, apparatus, and computer program product for change detection based on digital signatures
US11587253B2 (en) 2020-12-23 2023-02-21 Here Global B.V. Method, apparatus, and computer program product for displaying virtual graphical data based on digital signatures
US12094163B2 (en) 2020-12-23 2024-09-17 Here Global B.V. Method, apparatus, and computer program product for displaying virtual graphical data based on digital signatures
CN112733691A (en) * 2021-01-04 2021-04-30 北京工业大学 Multi-direction unmanned aerial vehicle aerial photography vehicle detection method based on attention mechanism
CN113012220A (en) * 2021-02-02 2021-06-22 深圳市识农智能科技有限公司 Fruit counting method and device and electronic equipment
CN112966579A (en) * 2021-02-24 2021-06-15 湖南三湘绿谷生态科技有限公司 Large-area camellia oleifera forest rapid yield estimation method based on unmanned aerial vehicle remote sensing
CN112949499A (en) * 2021-03-04 2021-06-11 北京联合大学 Improved MTCNN face detection method based on ShuffleNet
CN113011561A (en) * 2021-03-04 2021-06-22 中国人民大学 Method for processing data based on logarithm polar space convolution
CN113011561B (en) * 2021-03-04 2023-06-20 中国人民大学 A Method of Data Processing Based on Logarithmic Polar Space Convolution
CN114677611A (en) * 2021-03-22 2022-06-28 腾讯云计算(北京)有限责任公司 Data identification method, storage medium and device
CN113139540A (en) * 2021-04-02 2021-07-20 北京邮电大学 Backboard detection method and equipment
WO2022213307A1 (en) * 2021-04-07 2022-10-13 Nokia Shanghai Bell Co., Ltd. Adaptive convolutional neural network for object detection
US11423252B1 (en) 2021-04-29 2022-08-23 International Business Machines Corporation Object dataset creation or modification using labeled action-object videos
US20220391615A1 (en) * 2021-06-01 2022-12-08 Hummingbird Technologies Limited Tool for counting and sizing plants in a field
CN113469272B (en) * 2021-07-20 2023-05-19 东北财经大学 Object detection method for hotel scene pictures based on Faster R-CNN-FFS model
CN113705387A (en) * 2021-08-13 2021-11-26 国网江苏省电力有限公司电力科学研究院 Method for detecting and tracking interferent for removing foreign matters on overhead line by laser
CN113705387B (en) * 2021-08-13 2023-11-17 国网江苏省电力有限公司电力科学研究院 An interference detection and tracking method for laser removal of foreign objects on overhead lines
CN113780147A (en) * 2021-09-06 2021-12-10 西安电子科技大学 A lightweight dynamic fusion convolutional net-based hyperspectral object classification method and system
CN113963265A (en) * 2021-09-13 2022-01-21 北京理工雷科电子信息技术有限公司 A fast detection and recognition method of small samples and small targets in complex remote sensing terrestrial environment
CN114120056A (en) * 2021-10-29 2022-03-01 中国农业大学 Small target identification method, small target identification device, electronic equipment, medium and product
US11991295B2 (en) 2021-12-07 2024-05-21 Here Global B.V. Method, apparatus, and computer program product for identifying an object of interest within an image from a digital signature generated by a signature encoding module including a hypernetwork
CN114611685A (en) * 2022-03-08 2022-06-10 安谋科技(中国)有限公司 Feature processing method, medium, device, and program product in neural network model
CN114596450A (en) * 2022-03-17 2022-06-07 四川邦辰信息科技有限公司 Image inclusion detection method and device
CN114723733A (en) * 2022-04-26 2022-07-08 湖北工业大学 A Class Activation Mapping Method and Device Based on Axiom Interpretation
CN114627437A (en) * 2022-05-16 2022-06-14 科大天工智能装备技术(天津)有限公司 A kind of traffic target recognition method and system
CN114627437B (en) * 2022-05-16 2022-08-05 科大天工智能装备技术(天津)有限公司 Traffic target identification method and system
CN115203449A (en) * 2022-07-15 2022-10-18 中国人民解放军国防科技大学 Data processing method and device
CN115346170A (en) * 2022-08-11 2022-11-15 北京市燃气集团有限责任公司 Intelligent monitoring method and device for gas facility area
CN115908874A (en) * 2022-11-29 2023-04-04 华中光电技术研究所(中国船舶集团有限公司第七一七研究所) A Siamese Network Based Target Tracking Model De-redundancy Method
CN115984846A (en) * 2023-02-06 2023-04-18 山东省人工智能研究院 An intelligent recognition method for small targets in high-resolution images based on deep learning
CN115984846B (en) * 2023-02-06 2023-10-10 山东省人工智能研究院 An intelligent recognition method of small targets in high-resolution images based on deep learning
CN116824386A (en) * 2023-03-22 2023-09-29 齐鲁工业大学(山东省科学院) Method and system for detecting rotating targets in aerial remote sensing images
CN116597331A (en) * 2023-06-01 2023-08-15 北京联合大学 A Lightweight Object Detection Method for UAV Aerial Images
CN117132856A (en) * 2023-07-31 2023-11-28 南京信息工程大学 A small target detection method using asymmetric modulation fusion features
CN117292394A (en) * 2023-09-27 2023-12-26 自然资源部地图技术审查中心 Map review method and device
CN117292394B (en) * 2023-09-27 2024-04-30 自然资源部地图技术审查中心 Map auditing method and device
CN117442190B (en) * 2023-12-21 2024-04-02 山东第一医科大学附属省立医院(山东省立医院) An automatic wound measurement method and system based on target detection
CN117442190A (en) * 2023-12-21 2024-01-26 山东第一医科大学附属省立医院(山东省立医院) Automatic wound surface measurement method and system based on target detection
CN117496132A (en) * 2023-12-29 2024-02-02 数据空间研究院 Scale sensing detection method for small-scale target detection
CN118229964A (en) * 2024-05-24 2024-06-21 厦门大学 Small target detection method based on full pipeline improvement

Also Published As

Publication number Publication date
ES2908944R1 (en) 2022-05-13
ES2908944A2 (en) 2022-05-04
ES2908944B2 (en) 2023-01-09

Similar Documents

Publication Publication Date Title
WO2020020472A1 (en) A computer-implemented method and system for detecting small objects on an image using convolutional neural networks
Sudha et al. RETRACTED ARTICLE: An intelligent multiple vehicle detection and tracking using modified vibe algorithm and deep learning algorithm
Bosquet et al. STDnet: A ConvNet for Small Target Detection.
Mujtaba et al. UAV-Based road traffic monitoring via FCN segmentation and deepsort for smart cities
T'Jampens et al. Automatic detection, tracking and counting of birds in marine video content
Jeyabharathi et al. Vehicle tracking and speed measurement system (VTSM) based on novel feature descriptor: diagonal hexadecimal pattern (DHP)
Yusuf et al. Target detection and classification via EfficientDet and CNN over unmanned aerial vehicles
Dorbe et al. FCN and LSTM based computer vision system for recognition of vehicle type, license plate number, and registration country
Delleji et al. An improved YOLOv5 for real-time mini-UAV detection in no fly zones.
Mujtaba et al. Remote Sensing-based Vehicle Monitoring System using YOLOv10 and CrossViT
Ajith et al. Hybrid deep learning for object detection in drone imagery: A new metaheuristic based model
Yass et al. A comprehensive review of deep learning and machine learning techniques for real-time car detection and wrong-way vehicle tracking
Mudavath et al. Object detection challenges: Navigating through varied weather conditions—A comprehensive survey
Şengül et al. Detection of Military Aircraft Using YOLO and Transformer-Based Object Detection Models in Complex Environments
CN109493371A (en) A kind of quadrotor drone pedestrian tracting method of view-based access control model
Mukhopadhyay et al. Performance comparison of different cnn models for indian road dataset
Farhadmanesh et al. Implementing Haar cascade classifiers for automated rapid detection of light aircraft at local airports
Pramanik et al. Real-time detection of traffic anomalies near roundabouts
Talbi et al. An overview on computer vision analysis in the airport applications
Forczmański et al. Deep learning approach to detection of preceding vehicle in advanced driver assistance
Prito et al. Image processing and deep learning based road object detection system for safe transportation
Xing et al. Drone surveillance using detection, tracking and classification techniques
Dilawari et al. Toward generating human-centered video annotations
Cao et al. Visual attention accelerated vehicle detection in low-altitude airborne video of urban environment
Mittal et al. A feature pyramid based multi-stage framework for object detection in low-altitude UAV images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18769052

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18769052

Country of ref document: EP

Kind code of ref document: A1