
WO2020020472A1 - A computer-implemented method and system for detecting small objects on an image using convolutional neural networks - Google Patents


Info

Publication number
WO2020020472A1
WO2020020472A1 (PCT application PCT/EP2018/072857)
Authority
WO
WIPO (PCT)
Prior art keywords
feature map, candidate, object detection, regions, region
Prior art date
Legal status
Ceased
Application number
PCT/EP2018/072857
Other languages
French (fr)
Inventor
Victor Manuel BREA SÁNCHEZ
Manuel Felipe MUCIENTES MOLINA
Brais BOSQUET MERA
Current Assignee
Fundacion Centro Tecnoloxico De Telecomunicacions De Galicia
Universidade de Santiago de Compostela
Original Assignee
Fundacion Centro Tecnoloxico De Telecomunicacions De Galicia
Universidade de Santiago de Compostela
Priority date
Filing date
Publication date
Application filed by Fundacion Centro Tecnoloxico De Telecomunicacions De Galicia and Universidade de Santiago de Compostela
Priority to ES202190001A (ES2908944B2)
Publication of WO2020020472A1

Classifications

    • G06F18/00 Pattern recognition
    • G06V20/13 Scene-specific elements; terrestrial scenes; satellite images
    • G06F18/24137 Classification techniques; distances to cluster centroids
    • G06N3/045 Neural network architectures; combinations of networks
    • G06V10/25 Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/454 Local feature extraction; integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • The present disclosure relates to the field of image analysis and, more particularly, to methods and systems for detecting small objects on an image.
  • Object detection has made great progress through deep convolutional neural networks (CNN).
  • Initial approaches combined region proposal methods based on different techniques with deep convolutional networks that automatically extracted very deep features from those regions and, finally, generated a bounding box and the corresponding object category.
  • Current solutions integrate feature extraction, region proposal, and bounding box and object category in the CNN, in some cases with a fully convolutional architecture.
  • a straightforward solution for small object detection would be to modify a state-of-the-art CNN keeping the resolution of the initial image in all the feature maps.
  • This approach is non-viable because, due to the size of the network, it would not fit on a GPU (Graphics Processing Unit) and, also, the forward pass would be very slow.
  • the input image goes through a number of convolutional layers for feature extraction up to the RPN.
  • the RPN is based on anchors, which are predefined regions of different sizes and aspect ratios to cope with multiple scales.
  • the anchors are centered at the sliding window and, for each position and anchor, a fixed length feature vector is generated with a set of convolutional layers.
  • The outputs of the RPN are the coordinates of the bounding boxes and their corresponding classes, namely, object and background.
  • the bounding box and class of the object are determined through a fully-connected classification network.
  • the off-the-shelf Faster-R-CNN is not adequate for small object detection due to two reasons.
  • The detection of small objects requires a finer global effective stride in Faster-R-CNN. This leads to a very large increase in memory, making the implementation impossible for current GPUs.
  • Li et al. [5] introduces a Perceptual Generative Adversarial Network for small object detection.
  • the aim is to enhance the representation of small objects to be similar to that of large ones. This is done by looking for the structural correlations of the objects at different scales.
  • This approach has two networks. First, the generator network transforms the original poor features of small objects to highly discriminative ones. Then, the discriminator network estimates the probability that the input representation belongs to a real large object and, finally, it classifies the proposal and runs bounding box regression.
  • The present disclosure introduces a new CNN architecture for small object detection that solves the aforementioned problems, allowing detection of small targets equal to or under 256 square pixels.
  • the global effective stride must be low, which requires a new architecture in order to keep a reasonable memory overhead.
  • the proposed solution is an image object detector and, as such, it does not feature temporal information as the video object detectors reported in [12] and [13].
  • the present disclosure introduces a new CNN architecture for small object detection.
  • the proposed CNN architecture has a size that is significantly lower than its counterparts for the same resolution of the last feature map.
  • the present invention considers the hypothesis that, after a few convolutional layers, the feature map contains enough information to decide which regions of the image contain candidate objects, but there is not enough data to classify the region or to perform bounding box regression.
  • The present invention applies a novel component, called Region Context Network (RCN), which is a filter that selects the most promising regions of the feature map (all of them with the same size), avoiding the processing of the remaining areas of the image.
  • The RCN ends up with an RoI (Region of Interest) Collection Layer (RCL), which builds a new and reduced filtered feature map by arranging all the regions selected by the RCN. Therefore, the memory overhead of the feature maps following the RCN is much lower, but with the same spatial resolution, as the reduction in size is due to the deletion of the least promising regions with small objects.
  • A computer-implemented method for detecting small objects on an image using convolutional neural networks comprises the following steps:
  • the first set of candidate regions are determined by applying a first convolution operation to the input feature map to obtain an intermediate convolutional layer and an associated intermediate feature map; applying a second convolution operation to the intermediate feature map to obtain a class feature map including class scores as candidate objects; and selecting a determined number of regions in the input feature map according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions includes the selected regions.
  • the step of arranging the first set of candidate regions to form a reduced feature map may comprise concatenating the candidate regions and adding an inter region 0-padding between adjacent candidate regions.
  • the method may also comprise a preprocessing stage wherein the number and the size of the anchors used in the Region Proposal Network are automatically learned through k-means applied to a training set of ground truth boxes.
  • The number of anchors is preferably automatically obtained by performing an iterative k-means with an increasing number of kernels until the maximum inter-kernel IoU ratio exceeds a certain threshold.
  • an image object detector based on convolutional neural networks comprises:
  • a feature extractor module configured to apply one or more convolution operations to an input image to obtain a first set of convolutional layers and an input feature map corresponding to the last convolutional block of said first set; and configured to apply one or more convolution operations to a reduced feature map to obtain a second set of convolutional layers and an output feature map corresponding to the last convolutional block of said second set.
  • a region context network module configured to analyze the input feature map to determine a first set of candidate regions containing candidate objects.
  • An RoI collection layer module configured to arrange the first set of candidate regions to form the reduced feature map.
  • a Region Proposal Network module configured to obtain, from the output feature map, a second set of candidate regions containing candidate objects
  • a classifier module configured to classify and apply bounding box regression to each candidate region of the second set to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image.
  • the region context network module is configured to apply a first convolution operation to the input feature map to obtain an intermediate convolutional layer and an associated intermediate feature map; apply a second convolution operation to the intermediate feature map to obtain a class feature map including class scores as candidate objects; and select a determined number of regions in the input feature map according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions includes the selected regions.
  • The RoI collection layer module is preferably configured to form the reduced feature map by concatenating the candidate regions and adding an inter region 0-padding between adjacent candidate regions.
  • the image object detector may be implemented, for instance, in a processor or a GPU.
  • the present invention also refers to an object detection system for detecting small objects on an image using convolutional neural networks.
  • the object detection system comprises an image object detector as previously defined and a camera configured to capture an input image.
  • a vehicle comprising an object detection system as previously defined and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution by one or more vehicle systems of the vehicle.
  • the vehicle may be, for instance, an unmanned aerial vehicle.
  • an airspace surveillance system comprising an object detection system as previously defined, wherein the camera of the object detection system is mounted on a ground location and is configured to monitor an airspace region; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution.
  • a ground surveillance system comprising an object detection system as previously defined, wherein the object detection system is installed on an aerial platform or vehicle and the camera of the object detection system is configured to monitor a ground region; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution.
  • a detect and avoid system installed onboard a vehicle comprising an object detection system as previously defined, wherein the camera of the object detection system is configured to monitor a region in front of the vehicle; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action to avoid potential collisions.
  • The invention also refers to a computer program product for detecting small objects on an image using convolutional neural networks, comprising at least one computer-readable storage medium having recorded thereon computer code instructions that, when executed by a processor, cause the processor to perform the method as previously defined.
  • the main contributions of the present invention are:
  • a new CNN for small object detection that is able to work with high resolution feature maps in the deeper layers while having a size that is significantly lower than other CNNs.
  • the present invention relies on a novel component, RCN, that selects the most promising regions of the image and generates a new and filtered feature map with these areas. Therefore, the filtered feature maps can keep the same resolution but with a lower memory overhead and a higher frame rate.
  • the present invention uses an RPN that works with anchors, wherein the number and sizes of the anchors can be automatically selected using a novel algorithm based on k-means.
  • the automatic definition of the anchors with k-means improves the classical heuristic approach.
  • The fully convolutional network of the present invention is focused on small targets equal to or under 256 square pixels. It includes an early visual attention mechanism, RCN, to choose the most promising regions with small objects and their context. RCN makes it possible to work with high resolution feature maps with a reduced memory usage, as the regions with the least likely objects are deleted from the filtered feature maps. The filtered feature maps, which only contain the most likely regions with small objects, are forwarded across the network up to the ending Region Proposal Network (RPN), and then classified. RCN is key to increasing localization accuracy through finer spatial resolution due to finer global effective strides, smaller memory overhead and higher frame rates.
  • Figure 1 shows the structure of a CNN object detector according to the prior art.
  • Figure 2 depicts the steps performed by a CNN object detector according to the present invention.
  • FIG. 3 depicts the RCN architecture of the present invention.
  • Figure 4 shows some examples of the feature maps obtained by the RCN.
  • Figure 5 is a schematic diagram of an image object detector according to an embodiment of the present invention.
  • Figure 6 depicts a vehicle with the image object detector installed onboard.
  • Figure 7 depicts the steps performed by the image object detector according to an embodiment of the invention.
  • Figure 8 depicts the steps performed by an ensemble of residual blocks from early or late convolutions of the image object detector in order to extract features from the input feature map.
  • Figure 9 shows an embodiment of the image object detector of Figure 5 applied to airspace surveillance.
  • Figure 10 depicts, according to another embodiment, the image object detector of Figure 5 applied to ground surveillance from an aerial position.
  • Figure 11 depicts, according to yet another embodiment, the image object detector of Figure 5 applied to detect and avoid applications.
  • the present disclosure refers to a system and a computer-implemented method for detecting objects on an image using convolutional neural networks.
  • FIG. 1 schematically depicts, according to the prior art, the internal structure of an object detector using convolutional neural networks, CNN object detector 100, which receives and processes an input image 102 to obtain an object classification in the input image 104, thereby detecting the presence of objects in the input image 102.
  • A feature extractor 110 of the CNN object detector 100 sequentially applies N successive convolution operations (111, 113, 115), obtaining for each convolution operation a convolution layer and the associated feature maps (112, 114, 116) which will be used in the next convolution operation.
  • A Region Proposal Network (RPN) 120 is then applied to the last feature maps 116 obtained by the feature extractor 110.
  • A classifier 130 receives the output of the RPN 120 and the last feature maps 116 of the feature extractor 110 to determine the object classification in the input image 104, including the class of the object, using a fully-connected classification network.
  • a bounding box regression is also performed to obtain the bounding box in the input image 102 for the regions detected as objects.
  • the system of the present invention is a fully convolutional network that detects small objects.
  • the system only considers regions of the feature maps containing most likely objects, deleting those regions of the feature maps with least likely objects and building filtered feature maps with the same resolution but lower memory requirements. This way, the system works with high resolution feature maps while keeping a low memory overhead.
  • FIG. 2 schematically depicts the method performed by a CNN object detector, according to an embodiment of the present invention, to detect small objects on an input image using convolutional neural networks.
  • the method 200 comprises receiving an input image 102 and applying one or more convolution operations (early convolutions 210) to the input image 102 to obtain a first set of convolutional layers 212.
  • the first set of convolutional layers 212 is formed by two convolutional blocks (214, 216).
  • the last feature map of the convolutional block 216 of said first set 212 is an input feature map 302 for the convolution operations applied in the next step of the process (shown in more detail in Figure 3), referred to as Region Context Network (RCN) 220 in Figure 2.
  • the RCN 220 analyzes the input feature map 302 to determine a first set of candidate regions 222 in the input feature map 302 containing candidate objects.
  • the first set of candidate regions 222 are arranged to form a reduced feature map 228 (Rol collection layer).
  • One or more convolution operations (late convolutions 230) are then applied to the reduced feature map 228 to obtain a second set of convolutional layers 232.
  • the second set of convolutional layers 232 comprises two convolutional blocks (234, 236).
  • The last feature map of the last convolutional block 236 of said second set 232 is an output feature map of the late convolutions 230.
  • a Region Proposal Network (RPN) 240 is then applied to said output feature map to obtain a second set of candidate regions 242 (e.g. j candidate regions) in the output feature map containing candidate objects.
  • a classifier 250 classifies and applies bounding box regression to each candidate region of the second set 242 to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image 102.
  • Each of the selected candidate regions 242 may first be converted, prior to the classification and bounding box regression, to a fixed size feature map, obtaining j fixed size feature maps 248 (RoI pooling layers).
  • Figure 3 represents in more detail, according to an embodiment, the RCN 220 process to obtain the first set of candidate regions 222 in the input feature map 302.
  • the RCN 220 receives the input feature map 302 and applies a first convolution operation to the input feature map 302 to obtain an intermediate convolutional layer 224 and an associated intermediate feature map.
  • the first convolution operation is a convolution using a fixed kernel size that acts as a fixed size sliding window 304 that maps the input feature map 302.
  • the first convolution operation is a convolution with a 3x3 kernel size and 128 filters (i.e. a 128-d 3x3 convolution).
  • The RCN 220 applies a second convolution operation to the intermediate feature map to obtain a class feature map 226 (rcn-cls-layer) including class scores as candidate objects.
  • The second convolution operation is a convolution with a 1x1 kernel size and 2 filters (i.e. a 2-d 1x1 convolution).
  • the RCN 220 forms the first set of candidate regions 222 by selecting a determined number of regions in the input feature map according to the scores as candidate objects of the class feature map 226 (for instance, selecting the first n regions with the highest score).
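For illustration only, the following is a minimal PyTorch-style sketch of the RCN flow just described (3x3 convolution with ReLU, 1x1 two-class convolution, softmax objectness, and selection of a fixed number of fixed-size top-scoring regions). The module name, tensor shapes and the way region corners are derived from the top-scoring cells are assumptions rather than the patented code, and the confidence filtering and non-maximum suppression described later are omitted here.

```python
# Hedged sketch of the Region Context Network (RCN) head: 3x3 conv + ReLU,
# 1x1 two-class conv, softmax objectness, then selection of a fixed number of
# fixed-size top-scoring regions. Channel counts (256 in, 128 mid, 2 out),
# 50 regions and the 12-cell region side follow the embodiment in the text;
# the module name, the corner derivation and the use of PyTorch are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionContextNetwork(nn.Module):
    def __init__(self, in_channels=256, mid_channels=128,
                 num_regions=50, region_size=12):   # 12 cells = 48 px at stride 4
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, 2, kernel_size=1)   # fg / bg scores
        self.num_regions = num_regions
        self.region_size = region_size

    def forward(self, feature_map):                 # (1, 256, 180, 320), batch size 1
        x = F.relu(self.conv(feature_map))          # intermediate 128-d feature map
        scores = self.cls(x)                        # class feature map (1, 2, 180, 320)
        fg_prob = F.softmax(scores, dim=1)[:, 1]    # objectness score per cell
        h, w = fg_prob.shape[-2:]
        top = torch.topk(fg_prob.view(-1), self.num_regions).indices
        cy = torch.div(top, w, rounding_mode='floor')
        cx = top % w
        half = self.region_size // 2
        x1 = (cx - half).clamp(0, w - self.region_size)
        y1 = (cy - half).clamp(0, h - self.region_size)
        # Fixed-size candidate regions (x1, y1, x2, y2) in feature-map cells.
        # The real system additionally filters them by confidence and NMS.
        boxes = torch.stack([x1, y1, x1 + self.region_size, y1 + self.region_size], dim=1)
        return boxes, fg_prob
```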
  • The first set of candidate regions 222 are arranged to form a reduced feature map 228 (RoI collection layer, RCL) by concatenating the candidate regions 222 and adding an inter region 0-padding (shown as gaps in the figure) between candidate regions 222.
  • Figure 4 depicts an illustrative example of the reduced feature map 228 (RoI collection layer) for a particular input image 102. Figure 4 shows only 4 filters, out of the total 256 filters used in the example, of the RCN input (i.e. the input feature map 302), and only 7 filters (a row for each filter) out of a total of 256 of the RoI Collection Layer output (i.e. the reduced feature map 228).
  • FIG. 5 is a schematic diagram showing the components of an image object detector 500 based on convolutional neural networks (i.e. CNN object detector) according to an embodiment of the present invention.
  • the image object detector 500 of the present invention is a system (or part of a system) for detecting small objects on an image using convolutional neural networks.
  • the system may be implemented in a processing device including a processor, a GPU or a combination thereof (or any other kind of data processing device) and a computer-readable medium having encoded thereon computer-executable instructions to cause the processor/GPU to execute the method for detecting small objects on an image using convolutional neural networks as previously described.
  • The image object detector 500 comprises a feature extractor module 510, a region context network module 520, a region of interest (RoI) collection layer module 530, a Region Proposal Network module 540 and a classifier module 550.
  • the feature extractor module 510 is configured to apply one or more convolution operations 210 (early convolutions) to an input image 102 to obtain a first set of convolutional layers 212 and an input feature map 302 corresponding to the last convolutional block 216 of the first set of convolutional layers 212.
  • The region context network module 520 analyzes the input feature map 302, looking for and determining the most promising regions containing candidate objects (i.e. a first set of candidate regions 222). Regions are defined as areas of the image that might contain objects together with their context. The region context network module 520 assigns to each region a score, and the top scored regions (first set of candidate regions 222) are passed on to an RoI collection layer module 530, at the final stage of the RCN.
  • the region context network module 520 avoids forwarding the regions of the input image with least likely objects to the deepest convolutional layers, saving memory and increasing frame rate.
  • Memory saving is key to increasing spatial resolution through finer global effective strides across convolutional layers, mandatory in order not to miss the spatial localization of small objects.
  • The region context network module 520 selects the most likely candidate regions with one or more small objects together with their context, and returns them as a set of disjoint regions. Since at this stage the goal is not to get accurate object localization, neither a box regression approach nor a set of anchors with different scales and aspect ratios is needed. A single anchor of a given size suffices to return the most likely candidate regions with small objects.
  • the region context network module 520 first applies a 3 x 3 convolutional filter to each window of the input feature map 302, generating an intermediate 128-d layer with ReLU (Rectified Linear Unit) [17] following.
  • This structure feeds a box-classification layer (rcn-cls-layer) represented by a 1x1 convolutional 2-d layer ("fg", i.e. object, and "bg", i.e. no object) which scores regions obtained with sliding windows over the last early convolution (i.e. input feature map 302).
  • The ground truths of the objects are grown proportionally in all directions until equaling the anchor's defined size. Then, those anchors that have a considerable overlap with the modified ground truth (greater than 0.7 by default) are assigned positive labels, leaving as negative those regions that barely have an overlap (lower than 0.3 by default). As usual, the overlap is measured by the intersection-over-union (IoU) ratio.
  • The objectness score of the candidate regions in the RCN is minimized through the classification loss L({p_i}) = (1/N_cls) · Σ_i L_cls(p_i, p_i*), where p_i is the predicted probability of the i-th anchor being an object in an RCN mini-batch and p_i* is the adapted ground-truth label; the term 1/N_cls normalizes the equation by the mini-batch size; and L_cls is a softmax loss over the object / not-object categories.
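In code, this amounts to a cross-entropy (softmax) loss over the object/background labels of the mini-batch anchors, averaged over the mini-batch; a minimal, hypothetical illustration:

```python
# Minimal, hypothetical illustration of the RCN objectness loss: a softmax
# (cross-entropy) loss over object/background labels of the 64-anchor
# mini-batch, averaged over the mini-batch, i.e. (1/N_cls) * sum_i L_cls(p_i, p_i*).
import torch
import torch.nn.functional as F

logits = torch.randn(64, 2)                  # predicted fg/bg scores for one mini-batch
labels = torch.randint(0, 2, (64,))          # adapted ground-truth labels p_i*
rcn_loss = F.cross_entropy(logits, labels)   # mean reduction performs the 1/N_cls normalization
```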
  • RCN 220 ends up with the so-called RoI collection layer (RCL) (Figure 3), implemented by the RoI collection layer module 530, which is configured to arrange the first set of candidate regions 222 to form a reduced feature map 228.
  • The RoI collection layer module 530 takes as input the feature map generated by the last early convolution and the top scored proposals from the RCN to return a single filtered feature map (reduced feature map 228) with the same information as that of the input feature map 302, but only for the set of selected regions. Successive convolutions with filters greater than 1x1 will affect the neighboring regions' outputs.
  • To avoid this, the RoI collection layer module 530 adds an inter-region 0-padding (shown by gaps between regions in Figure 3), so that the width of the reduced feature map is n·r_w + (n - 1)·p_d, where n is the number of regions from the RCN, r_w and r_h are the width and height of the regions in the RCL input feature map, and p_d is the size of the 0-padding between regions.
  • As an example, a 1280x720 input image has an RCL input feature map of 320x180, and the output RCL generates a 649x12 feature map: 50 regions of size 48x48 in the input image (12x12 in the RCL input feature map for stride 4) with a 1-pixel 0-padding, since 50·12 + 49·1 = 649; i.e. a 7.4x reduction in GPU memory usage (86.5% of the memory is saved).
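The sketch below illustrates one way such an RCL could be implemented: it crops the selected regions from the input feature map and concatenates them along the width with zero-padding columns in between. Only the dimensions quoted in the comments come from the example above; the function name and the use of PyTorch are assumptions.

```python
# Illustrative sketch of the RoI Collection Layer (RCL): crop the selected
# regions from the input feature map and concatenate them along the width,
# inserting zero-padding columns between adjacent regions. Only the dimensions
# (50 regions of 12x12 cells, 1-cell padding, 649-wide output) come from the
# example above; the function name and framework are assumptions.
import torch

def roi_collection_layer(feature_map, boxes, pad=1):
    """feature_map: (C, H, W) tensor; boxes: (n, 4) tensor of same-size regions
    given as (x1, y1, x2, y2) in feature-map cells."""
    c = feature_map.shape[0]
    region_h = int(boxes[0, 3] - boxes[0, 1])
    gap = torch.zeros(c, region_h, pad)                  # inter-region 0-padding
    crops = []
    for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        crops.append(feature_map[:, int(y1):int(y2), int(x1):int(x2)])
        if i < len(boxes) - 1:
            crops.append(gap)
    return torch.cat(crops, dim=2)                       # e.g. (256, 12, 649)

# Width check against the formula n*r_w + (n - 1)*p_d: 50*12 + 49*1 = 649,
# versus the 320x180 input map, i.e. roughly 7.4x less memory (about 86.5% saved).
```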
  • the feature extractor module 510 also applies one or more convolution operations 230 (late convolutions) to the reduced feature map 228 to obtain a second set of convolutional layers 232 and an output feature map 502 corresponding to the last convolutional block 236 of said second set 232.
  • Convolution operations 230 act independently on the first set of candidate regions 222 obtained by the region context network module 520, thanks to the inter-region 0-padding (displayed as gaps between the different candidate regions in Figure 3).
  • The feature extractor module 510 can be any of the most widely used state-of-the-art solutions found in the literature, e.g. ResNet [14], VGG [15], ZF [16], etc. ResNet-50 is preferably used, since it provides a good trade-off between accuracy, speed and GPU memory consumption [14].
  • the Region Proposal Network (RPN) module 540 is configured to obtain, using the output feature map 502, a second set of candidate regions 242 containing candidate objects.
  • The RPN module 540 performs an initial bounding box regression and classification as object (fg) and background (bg) [3], which are finally refined in the classifying stage.
  • the RPN module 540 is based on the RPN presented in [3], but including a set of modifications in order to deal with the fact that the coordinates of its input feature map do not correspond with those of the input image, i.e., the RPN input contains unsorted regions.
  • To map the regions on the input image to the RPN's training function, which is based on the IoU between anchors and ground truth, the region context network module 520 passes the 4 coordinates of every region as a parameter to the RPN module 540 to generate the anchors relative to those regions. Finally, the output of the bounding box regression is transformed back to the input image coordinates, as illustrated in the sketch below.
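A minimal sketch of this coordinate bookkeeping follows, assuming the left-to-right layout produced by the RCL sketch above and a global stride of 4; the helper name and argument packing are hypothetical.

```python
# Minimal sketch of the coordinate bookkeeping: a location in the reduced
# (concatenated) feature map is mapped back to input-image coordinates using
# the per-region offsets passed from the RCN. Layout assumptions: fixed-width
# regions concatenated left to right with a one-cell gap, global stride 4.
def reduced_map_to_image_xy(x_red, y_red, regions, stride=4, region_w=12, pad=1):
    """regions: list of (x1, y1, x2, y2) tuples in input-feature-map cells,
    in the order in which the RCL concatenated them."""
    slot = region_w + pad                        # width occupied by one region plus its gap
    idx = min(x_red // slot, len(regions) - 1)   # which concatenated region we are in
    x_in_region = x_red - idx * slot             # horizontal offset inside that region
    x1, y1, _, _ = regions[idx]
    # back to input-feature-map cells, then to input-image pixels via the stride
    return (x1 + x_in_region) * stride, (y1 + y_red) * stride
```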
  • the approaches that rely on RPNs define the number of anchors and their sizes heuristically.
  • both the number and the size of the anchors are learned through k-means (i.e. automatic anchors initialization by k-means).
  • This approach can be adopted by any other object detection network with anchors, e.g. Faster-R-CNN, regardless of the target size of the objects.
  • The k-means anchor learning procedure is implemented as a preprocessing stage: k-means is applied to the heights and widths of the training set of ground truth boxes.
  • An iterative k-means with an increasing number of kernels is performed until the maximum inter-kernel IoU exceeds a certain threshold.
  • The threshold is set to 0.5, which is the value used in well-known repositories, such as PASCAL VOC [18] or MS COCO [19], to check whether a detection is positive or negative with respect to a ground truth.
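The following is one plausible reading of this procedure, sketched in Python with scikit-learn's KMeans: the ground-truth widths and heights are clustered, the number of clusters is increased until two centroid anchors overlap with IoU above 0.5, and the last set that stayed below the threshold is kept. The stopping convention and the function names are assumptions.

```python
# One plausible reading of the automatic anchor initialization, sketched with
# scikit-learn's KMeans: cluster the ground-truth (width, height) pairs and
# grow the number of clusters until two centroid anchors overlap with IoU above
# the 0.5 threshold; the last set below the threshold is kept.
import numpy as np
from sklearn.cluster import KMeans

def anchor_iou(a, b):
    """IoU of two (w, h) anchors assumed centred on the same point."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def learn_anchors(gt_wh, iou_threshold=0.5, max_k=10):
    """gt_wh: (N, 2) array of ground-truth box widths and heights."""
    anchors = gt_wh.mean(axis=0, keepdims=True)          # k = 1 fallback
    for k in range(2, max_k + 1):
        centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(gt_wh).cluster_centers_
        max_iou = max(anchor_iou(centers[i], centers[j])
                      for i in range(k) for j in range(i + 1, k))
        if max_iou > iou_threshold:      # the new anchors became redundant: stop
            break
        anchors = centers
    return anchors                       # e.g. the 3 anchors used in the embodiment below
```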
  • the classifier module 550 is configured to classify and apply bounding box regression to each candidate region of the second set of candidate regions 242 to obtain, for each candidate region, a class score 552 as a candidate object and a bounding box 554 in the input image 102.
  • Figure 6 depicts an exemplary embodiment of the image object detector 500 installed onboard a vehicle 600, such as a boat, a car, an aircraft, an unmanned aerial vehicle or a drone.
  • Vehicle 600 includes an object detection system 610 for detecting small objects on an image using convolutional neural networks, wherein the object detection system 610 comprises a camera 612 (the term "camera" includes any device able to acquire an image or a set of images, such as a conventional camera or a video camera) configured to capture an input image 102, and the image object detector 500 as previously described in Figure 5.
  • the image object detector 500 is implemented in a processor or a GPU 614.
  • the vehicle 600 may also comprise a decision module 620 that receives the output of the object detection system (the class scores 552 and bounding boxes 554 for the candidate regions selected in the input image 102), and determines, based on the objects detected on the input image 102, one or more actions 622 to be executed by one or more vehicle systems 630 (e.g. communications system 632, navigation system 634 with on-board sensors 635, propulsion system 638) of the vehicle 600.
  • the action may be sent to the navigation system 634 (continuous line) and/or to the communications system 632 (dotted line).
  • The action may include, as an example, guiding the vehicle towards one of the small detected objects, depending on the class score obtained for said object or the size of the bounding box. This could be the case, for instance, when the bounding box is so small that the vehicle 600 is required to get closer in order to confirm the class score 552 assigned to the region.
  • Another case is when the class score assigned is of particular relevance to the vehicle, for example when the vehicle 600 is a drone patrolling a vast secure geographic area, such as a border between countries, looking for people entering that secure geographic area.
  • The navigation system 634 receives a displacement instruction 624 to move towards a determined location (e.g. a detected object) and computes an updated trajectory, which is executed by the propulsion system 638 (e.g. motors, etc.) of the vehicle 600.
  • the actions 622 may include reporting the detected objects 626 to an external entity, such as a server, using the communications systems 632 of the vehicle 600.
  • Figure 7 depicts the steps performed by an image object detector 500 according to an exemplary embodiment of the invention (this is merely an example, different parameters may be employed in other embodiments):
  • image object detector 500 takes an image or a video-frame as an input image 102.
  • the input image is scaled to HD resolution, 1280x720x3 (width x height x number of RGB color channels), keeping its width and height ratio.
  • Early convolutions 210: this ensemble is composed of a first convolution layer 710, a max-pooling layer 712 and a second residual block 714.
  • First convolution layer 710 gets the input image and applies a 7x7 kernel size with stride 2, padding 3 and 64 filters. This operation halves the width and height, returning a 640x360x64 feature map.
  • Max-pooling layer 712 transforms the 640x360x64 feature map into a 320x180x64 feature map through a max-pooling operation with a 3x3 kernel size and stride 2. From this point until the end, the image object detector 500 keeps the current resolution, that is, a resolution four times smaller than that of the original input image.
  • Second residual block 714: a residual block (Figure 8 depicts the steps performed by an ensemble of residual blocks [14] in order to extract features from the input feature map) composed of three blocks that increase the number of filters from 64 to 256, returning a 320x180x256 feature map (i.e. input feature map 302 in Figure 3).
  • RCN 220 consists of two convolutional layers (RCN convolution 720 and RCN class score convolution 722) and a layer for the proposal of regions (RCN proposal layer 724).
  • RCN convolution 720 applies a 3x3 kernel size (stride 1, padding 1) that acts as a 3x3 sliding window, mapping the input feature map 302 information in a 128-d output (320x180x128).
  • RCN class score convolution 722: a 1x1 convolution that learns the necessary characteristics to differentiate between object and non-object regions at each sliding window location (2-d).
  • Each unit of the feature map decides whether the anchor centered in that unit contains an object or not. This is done by comparing the activation values of the two units in the same spatial localization: one of them learns the foreground score and the other one the background score. Returns a 320x180x2 feature map.
  • RCN proposal layer 724: a custom layer that gets the class (object or non-object) scores from RCN class score convolution 722, calculates their regions' coordinates in the input image size and returns a first set of candidate regions 222 most likely to contain an object (50x4 rcn rois, where 50 is the number of regions and 4 are the coordinates for each region).
  • RCL 228 is another custom layer that obtains the first set of candidate regions 222 (rcn rois) from the RCN proposal layer 724 and the feature map information from the second residual block 714 (input feature map 302). With both inputs, it obtains the information from the feature map of the second residual block 714, but only within the selected regions. Then, it concatenates this information in a new output feature map whose size is given by the RCL output expression above. Successive convolutions with filters greater than 1x1 will affect the neighboring regions' outputs. To solve this problem, RCL adds an inter region 0-padding.
  • Late convolutions 230: this ensemble is composed of two residual blocks (third residual block 730 and fourth residual block 732, obtained according to the flow diagram of Figure 8).
  • Third residual block 730: composed of four blocks that take as input the output from RCL 228 and increase the number of filters from 256 to 512, returning a 649x12x512 feature map.
  • restore collection padding is applied (see Figure 8), an auxiliary layer which restores the padding between regions to zero.
  • Fourth residual block 732: a residual block composed of six blocks that increase the number of filters from 512 to 1024, returning a 649x12x1024 feature map. As in the previous case, restore collection padding is applied.
  • RPN 240 consists of three convolutional layers (RPN convolution 740, RPN class score convolution 744 and RPN bounding box regression convolution 746) and a layer for the proposal of regions (RPN proposal layer 748).
  • RPN convolution 740 applies a 3x3 kernel size (stride 1, padding 1) which acts as a 3x3 sliding window that maps the input feature map information into a 256-d output (649x12x256).
  • an auxiliary layer (remove collection padding 742) eliminates the 0-padding between regions since there are no more 3x3 convolutions to be applied on them, returning a 600x12x256 feature map.
  • RPN class score convolution 744: a 1x1 convolution that learns the necessary characteristics to differentiate between object and non-object at each sliding window location and for each defined anchor (6-d since 3 anchors are used). Returns a 600x12x6 feature map.
  • RPN bounding box regression convolution 746: a 1x1 convolution that learns the necessary characteristics to apply regression to each of the four coordinates of each anchor at each sliding window location (12-d since 3 anchors are used). Returns a 600x12x12 feature map.
  • RPN proposal layer 748: a custom layer that gets the first set of candidate regions 222 (rcn rois) from RCN proposal layer 724, the class (object or non-object) scores for each anchor from RPN class score convolution 744 and the coordinates for each anchor from RPN bounding box regression convolution 746.
  • RPN proposal layer 748 maps the sliding window locations for each anchor to the coordinates of those regions at the original input image. Then, it sorts those most likely to contain an object by the scores from RPN class score convolution 744.
  • The RPN proposal layer 748 also returns the coordinates of the 300 regions relative to the RPN input, i.e. the unsorted map of regions (scaled_rois), the second set of candidate regions 242 in Figure 2.
  • RoI pooling layer 248: this layer takes the feature map information of the fourth residual block 732 (i.e. output feature map 502) and the 300 unsorted regions scaled_rois (i.e. the second set of candidate regions 242).
  • The auxiliary layer remove collection padding eliminates the 0-padding between regions in the feature map of the fourth residual block 732 so that the size is 600x12x1024.
  • The RoI pooling layer 248 obtains the information from the feature map of the fourth residual block 732, but only within the selected regions, and converts them to a fixed size (14x14x1024) feature map. Also, the 300 regions go forward to the next stage.
  • Each region of interest from RoI pooling layer 248 is classified independently by the last residual block (fifth residual block 750, Figure 8) and an average pooling 752.
  • Fifth residual block 750: a residual block composed of three blocks. The first one halves the width and height. In addition, the block increases the number of filters from 1024 to 2048, returning a 7x7x2048 feature map.
  • Average pooling 752: an average pooling with a 7x7 kernel size reduces the dimension to 1x1x2048, ready to be classified by fully connected layers.
  • Decision function 760: for each region of interest, the final decision is taken based on two fully connected layers that transform the input 1x1x2048 array into the category of the object (class score fully connected layer 762) and its corresponding bounding box regression (bounding box regression fully connected layer 764).
  • The value obtained by the class score fully connected layer 762 passes through a Softmax function 766 to normalize the score into the range [0, 1] while, on the other hand, a transformation function 768 applies the bounding box regression to the RoIs relative to the original input image obtained from the RPN proposal layer 748. This returns the final class score 552 and bounding box 554 for each region of interest, as illustrated by the sketch below.
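As a small illustration of this decision function, the sketch below implements the two fully connected heads on the pooled 1x1x2048 descriptor: a softmax-normalized classification head and a bounding box regression head. The class count, the per-class regression output and the module name are placeholders, not the patented implementation.

```python
# Small illustration of the decision function: two fully connected heads on the
# pooled 1x1x2048 descriptor, one softmax-normalized classification head and one
# bounding-box regression head. Class count and module name are placeholders.
import torch
import torch.nn as nn

class DecisionFunction(nn.Module):
    def __init__(self, in_features=2048, num_classes=2):   # e.g. object / background
        super().__init__()
        self.cls_fc = nn.Linear(in_features, num_classes)        # classification head
        self.bbox_fc = nn.Linear(in_features, 4 * num_classes)   # regression head

    def forward(self, pooled):                  # (num_rois, 2048, 1, 1)
        x = pooled.flatten(1)                   # (num_rois, 2048)
        class_scores = torch.softmax(self.cls_fc(x), dim=1)   # normalized to [0, 1]
        bbox_deltas = self.bbox_fc(x)           # per-class regression offsets
        return class_scores, bbox_deltas
```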
  • Both the network that acts as a backbone (ResNet-50) and the two modules of the network (RCN module 520 and RPN module 540) can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) [20].
  • the approximate joint training [3] has been selected.
  • The RCN module 520 is trained in a similar way to the RPN module 540, except for bounding box regression, which does not exist in RCN.
  • The fact that RCL keeps the same number of output images per mini-batch as that of the input images makes the rest of the training identical to that of other RPN-based networks like Faster-R-CNN.
  • The initialization of anchors by k-means does not affect training either, since it is performed prior to training.
  • the RCN module 520 obtains its mini-batch from a single image by selecting positive and negative anchors.
  • The mini-batch used within the RCN contains 64 examples, trying to maintain whenever possible a 1:1 ratio of positive and negative labels.
  • The anchor's size is obtained by estimating the effective receptive field (ERF) which, in practice, follows a Gaussian distribution [21], so half of the theoretical receptive field of the convolutions between RCN and RPN is selected as ERF (see the sketch below).
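A back-of-the-envelope sketch of this heuristic follows: the theoretical receptive field is accumulated with the standard recurrence over kernel sizes and strides, and the anchor size is taken as half of it. The layer list in the example is a placeholder, not the actual convolutions between RCN and RPN.

```python
# Back-of-the-envelope sketch of the anchor-size heuristic: accumulate the
# theoretical receptive field with the usual recurrence rf += (k - 1) * jump,
# jump *= stride, then take half of it as the effective receptive field (ERF).
def theoretical_receptive_field(layers):
    """layers: iterable of (kernel_size, stride) tuples, from input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

late_convs = [(3, 1)] * 10                                    # hypothetical stack of 3x3 stride-1 convs
anchor_size = theoretical_receptive_field(late_convs) // 2    # ERF taken as half the RF
```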
  • An aggressive non-maximum suppression with a low threshold (0.3) is applied over the 2,000 best proposals before the RCL, resulting in a low number of scattered regions (around 200 on average).
  • Then, only those regions with confidence higher than 0.3, up to a maximum of 50 regions, are passed through the RCN, as in the sketch below.
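A minimal sketch of this proposal filtering, using torchvision's generic NMS as a stand-in for whatever suppression the actual system uses; the function name and tensor conventions are assumptions:

```python
# Sketch of the proposal filtering before the RCL: keep the 2,000 best-scored
# proposals, run an aggressive NMS with a 0.3 IoU threshold, then keep at most
# 50 regions whose confidence exceeds 0.3.
import torch
from torchvision.ops import nms

def filter_rcn_proposals(boxes, scores, pre_nms=2000, nms_iou=0.3,
                         min_conf=0.3, max_regions=50):
    order = scores.argsort(descending=True)[:pre_nms]
    boxes, scores = boxes[order].float(), scores[order]
    keep = nms(boxes, scores, nms_iou)                 # scattered surviving regions
    keep = keep[scores[keep] > min_conf][:max_regions]
    return boxes[keep], scores[keep]
```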
  • RCN and RCL can be integrated in any object detection convolutional framework by simply adapting the corresponding region proposal method to work with unsorted regions.
  • the method has been implemented over Faster-R-CNN.
  • The hyperparameters for training and testing are the same as those used in Faster-R-CNN.
  • The RPN module 540 is placed between the fourth residual block 732 and fifth residual block 750 convolutional layers, as it is done in [14] for Faster-R-CNN.
  • A box voting scheme after non-maximum suppression is applied [22].
  • the framework Caffe is used [23].
  • Figures 9, 10 and 11 depict several possible applications of the method and the image object detector of the present invention. However, it is noted that the present invention could be applied to many other real case scenarios. Among the different use cases envisaged for the proposed invention, the following are highlighted:
  • the airspace surveillance system 900 of Figure 9 comprises a camera 912 (implemented in this example as a video camera) located on the ground 902, mounted on a pole or on a terrestrial moving platform.
  • the camera 912 is pointing towards the sky, either in a vertical or in an oblique direction.
  • the camera 912 monitors a determined airspace region 903.
  • the airspace region 903 monitored can be a static region, if the position and orientation of the camera is fixed, or a dynamic region, if the position, orientation and /or zoom of the camera 912 dynamically changes.
  • The video stream (sequence of input images 102) acquired by the camera 912 is sent to a processor 914 for further analysis in order to detect all those flying objects 904 (e.g. a drone in the example of Figure 9) which appear in the field of vision 906 of the camera 912 and are represented in the input image 102 as small objects 908 (with a size up to 16x16 pixels).
  • The processor 914 implements the image object detector 500 of Figure 5.
  • The processor 914 may be placed together with the camera 912 (i.e. locally) or remotely, for instance in a remote data center. In the latter case, the input images 102 are transmitted to the processor 914 using a broadband data connection, such as the Internet.
  • The camera 912 and the image object detector 500 implemented by the processor 914 form an object detection system 910 in charge of monitoring the airspace and detecting, in real time, any kind of flying objects 904 (such as aircraft, e.g. drones or airships, parachutists, or even meteorites).
  • The airspace surveillance system 900 is helpful in scenarios where monitoring the airspace for security and/or safety reasons is critical, such as airports (in order to detect flying objects that can pose a potential hazard for commercial or military aviation), nuclear plants, transportation hubs, government facilities, football stadiums and any other critical infrastructure.
  • The small object detection performed by the airspace surveillance system 900 is carried out as soon as possible (i.e. whenever the flying object 904 appears in the field of view 906 of the camera 912, regardless of its size), in order to take the contingency actions required.
  • the airspace surveillance system 900 may optionally include a decision module 920 to determine one or more actions 922 to be carried out, based on the object detection made by the object detection system 910.
  • Actions 922 may include, for instance, neutralizing a drone flying in the airspace near an airport, sending an alarm message, etc.
  • the airspace surveillance system 900 may also comprise means (not shown in the figure) for executing the actions 922 determined by the decision module 920. For example, if the action to be taken is neutralizing a detected drone, the airspace surveillance system 900 may include a missile launcher to destroy the drone.
  • FIG. 10 depicts the image object detector 500 applied to ground surveillance from aerial positions (e.g. from aerial vehicles or platforms).
  • the ground surveillance system 1000 comprises a camera 1012 (e.g. a video camera) mounted on an aerial vehicle 1001 (e.g. a drone), pointing downwards toward the ground 1002, either in a vertical or in an oblique direction, to monitor a ground region 1003.
  • The video stream captured by the camera 1012 is sent to a processor 1014 (either on-board the aerial vehicle 1001 or remotely located on an on-ground facility) in charge of analyzing the input images 102 to detect static or moving small terrestrial objects 1004 (e.g. people).
  • The ground surveillance system 1000 may be applied in different aerial-based surveillance scenarios.
  • the camera 1012 and the image object detector 500 implemented by the processor 1014 form an object detection system 1010 in charge of monitoring the ground region 1003 and detecting, in real time, any kind of terrestrial objects 1004 (e.g. people).
  • the ground surveillance system 1000 may also include a decision module 1020 to determine one or more actions 1022 to be performed based on the object detection made by the object detection system 1010. Actions 1022 may include, among others, sending a message to a remote station informing about the detected objects.
  • the ground surveillance system 1000 may also comprise means (not shown in the figure) for executing the actions 1022.
  • The detect and avoid system 1100 comprises a camera 1112 (e.g. a video camera) mounted onboard a vehicle 1101, such as the aircraft depicted in Figure 11 or any other type of vehicle (e.g. an autonomous car, a drone, etc.).
  • The camera 1112 is pointing forward, in the direction of movement of the vehicle 1101, either in a horizontal or in a slightly oblique direction, towards a dynamic region 1103 (in the embodiment, an airspace region).
  • The video stream of the camera 1112 is analyzed by an on-board processor 1114, in order to detect other small flying objects 1104 which appear in the field of vision 1106 of the camera 1112, represented in the input image 102 as small objects 1108 with a size up to 16x16 pixels, and that may involve a potential obstacle for the vehicle 1101.
  • The camera 1112 and the image object detector 500 implemented by the processor 1114 form an object detection system in charge of monitoring the airspace region 1103 and detecting, in real time, any kind of flying objects 1104 (e.g. drones, birds).
  • The detect and avoid system 1100 comprises a decision module 1120 that determines one or more actions 1122 to be performed based on the flying objects 1104 detected by the object detection system.
  • The actions 1122 determined by the decision module 1120 are aimed to avoid collision against the detected flying objects 1104.
  • A new trajectory may be computed by the decision module 1120 for execution by the vehicle 1101 (e.g. by the FMS of an aircraft or by an autonomous navigation module of a drone).
  • A vehicle 1101 comprising the detect and avoid system 1100 may also be part of the invention, the vehicle comprising means for executing the actions 1122 to avoid collision.
  • The small object detection performed by the detect and avoid system 1100 is performed as soon as possible (i.e. whenever the flying object 1104 appears in the field of view 1106 of the camera 1112, regardless of its size), in order to take the contingency actions required to avoid potential collisions.
  • The method and image object detector 500 of the present invention are especially advantageous, compared with the prior art, for detecting small objects (equal to or under 16x16 pixels) on an image.
  • the invention may also be applied for detecting bigger objects (i.e. above 16x16 pixels) on an image.


Abstract

A computer-implemented method and system for detecting small objects on an image using convolutional neural networks. The method comprises: applying convolution operations (210) to an input image (102) to obtain a first set of convolutional layers (212) and an input feature map (302); analyzing the input feature map (302) to determine a first set of candidate regions (222) containing candidate objects; arranging the first set of candidate regions (222) to form a reduced feature map (228); applying convolution operations (230) to the reduced feature map (228) to obtain a second set of convolutional layers (232) and an output feature map (502); applying a Region Proposal Network (240) to the output feature map (502) to obtain a second set of candidate regions (242) containing candidate objects; classifying and applying bounding box regression (250) to each candidate region of the second set (242) to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image (102).

Description

A COMPUTER-IMPLEMENTED METHOD AND SYSTEM FOR DETECTING SMALL OBJECTS ON AN IMAGE USING CONVOLUTIONAL NEURAL NETWORKS
DESCRIPTION
FIELD
The present disclosure is comprised in the field of image analysis, and more particularly, in the field of methods and systems for detecting small objects on an image.
BACKGROUND
Object detection has made great progress through deep convolutional neural networks (CNN). Initial approaches combined region proposal methods based on different techniques with deep convolutional networks that automatically extracted very deep features from those regions and, finally, generated a bounding box and the corresponding object category. Current solutions integrate feature extraction, region proposal, and bounding box and object category in the CNN, in some cases with a fully convolutional architecture.
Applications like sense and avoid on board of unmanned aerial vehicles (UAVs) or video surveillance over wide areas demand early detections of objects of interest to act quickly. This means detecting an object as far away, and therefore as small, as possible. Recent CNN object detectors provide high accuracy over a wide range of scales, from 32 x 32 pixels up to the image size. Nevertheless, there are no specific CNNs focused on small targets. Qualitatively, in the present disclosure the term "small targets" (or "small objects") refers to those objects without visual cues to assign them to a category or subcategory; quantitatively, "small targets" (or "small objects") refers to objects in an image having a size with a total number of pixels equal to or under 256 square pixels (e.g. 16 x 16 pixels, 25 x 10 pixels). Most of the state-of-the-art CNNs for object detection are unsuitable for the detection of such small objects, because the region proposal, the bounding box regression and the final classification all take the feature maps generated in the last convolutional layers as inputs. These feature maps have much lower resolution than the input image; in most cases the reduction in resolution is up to 16x. Thus, many small objects are represented in the last feature maps by only one pixel, which makes classification and bounding box regression very hard, if not impossible.
A straightforward solution for small object detection would be to modify a state-of-the-art CNN, keeping the resolution of the initial image in all the feature maps. Of course, this approach is not viable because, due to the size of the network, it would not fit on a GPU (Graphics Processing Unit) and, also, the forward pass would be very slow.
Modern object detectors are based on CNNs [2]. Faster-R-CNN [3] has become a milestone in CNNs for object detection thanks to the inclusion of a visual attention mechanism through the so-called Region Proposal Network (RPN). In Faster-R-CNN the input image goes through a number of convolutional layers for feature extraction up to the RPN. The RPN is based on anchors, which are predefined regions of different sizes and aspect ratios to cope with multiple scales. The anchors are centered at the sliding window and, for each position and anchor, a fixed-length feature vector is generated with a set of convolutional layers. The outputs of the RPN are the coordinates of the bounding boxes and their corresponding classes, namely object and background. Finally, given the output of the RPN and the last feature map of the feature extractor network, the bounding box and class of the object are determined through a fully-connected classification network.
The off-the-shelf Faster-R-CNN is not adequate for small object detection for two reasons. First, the sizes of the predefined anchors are very large for small objects. Second, and more important, the global effective stride (the downscaling of the input image with respect to the feature map that is the input to the RPN) is 16, which means that a 16 x 16 object is represented by just one pixel in that feature map. The detection of small objects requires a finer global effective stride in Faster-R-CNN. This leads to a very high increase in memory, making the implementation impossible on current GPUs.
In [4], a fully convolutional approach to object detection, called Region-based Fully Convolutional Network (R-FCN), is presented. The major difference with Faster-R-CNN is that R-FCN generates k x k x (C + 1) feature maps in the last convolutional layer, instead of only one. These maps are position-sensitive, i.e., each of the k x k x (C + 1) maps corresponds with a part of an object of one of the C object categories (+1 for background). This, however, limits the applicability of the R-FCN architecture to small object detection, as it is very hard to distinguish their parts. The capability of dealing with objects of different sizes in Faster-R-CNN and R-FCN is limited to a few scales produced with the anchors. Hence, more recent CNNs for object detection tackle the issue of scale invariance and small object detection through more elaborate solutions.
Li et al. [5] introduces a Perceptual Generative Adversarial Network for small object detection. The aim is to enhance the representation of small objects to be similar to that of large ones. This is done by looking for the structural correlations of the objects at different scales. This approach has two networks. First, the generator network transforms the original poor features of small objects to highly discriminative ones. Then, the discriminator network estimates the probability that the input representation belongs to a real large object and, finally, it classifies the proposal and runs bounding box regression. The proposal has been tested with two datasets: (i) traffic signs from the Tsinghua-Tencent 100k dataset [6], where they consider as small objects those with an area under 32 x 32 pixels; (ii) pedestrians over 50 pixels tall from the Caltech benchmark [7].
In [8] an approach for company logo detection is presented. This approach is based on Faster-R-CNN. As logos usually appear as small objects, Eggert et al. present an architecture with three RPNs to detect objects of different sizes. For instance, the RPN after conv3 has anchors for side lengths under 45 px. Both the RPNs and the final classification and bounding box regression receive as inputs the combination of the feature maps of the last three convolutions: high-level feature maps are upscaled through bilinear interpolation and then summed with the lower-level maps. This proposal was validated on the FlickrLogos dataset.
Also, in [9] an architecture with several RPNs is proposed. Each RPN is in a different branch of the net. Shallower RPNs are adequate for small objects, while deeper ones are appropriate for larger targets. In order to have a more informative RoI (Region of Interest) pooling, principally for small objects, the CNN applies upsampling to the last feature map of each of the branches of the net. In the experimental evaluation the smallest objects range from 25 to 50 pixels in height. Yang et al. [10] separate the detection of objects of different sizes into different branches. Their proposal relies on scale-dependent pooling (pooling for smaller objects uses only the shallower feature maps) and, also, on layer-wise cascaded rejection classifiers in several branches for the different object sizes. This approach considers objects of less than 64 pixels in height as small targets.
In [11] the authors propose a CNN in which the deeper feature maps are upsampled and combined with shallower feature maps. Object detection relies on these combined feature maps: the shallower ones for small objects and the deeper ones for larger objects.
All the previous approaches are based on single images. In [12] the detection of flying objects from a single moving camera is implemented taking into account spatio-temporal image cubes. This proposal has two main components, motion compensation and object detection, both based on CNNs. Motion compensation takes as input an image patch and returns the shift necessary to center the object in the patch. The CNN for object detection receives the motion compensated spatio-temporal image cubes, and returns whether or not there is an object.
The present disclosure introduces a new CNN architecture for small object detection that solves the aforementioned problems, allowing the detection of small targets equal to or under 256 square pixels. This makes a big difference with respect to the above prior art documents, as, firstly, the objects of interest do not feature definitive visual cues to classify them into a category and, secondly, the sizes of the targets considered in the present disclosure are significantly smaller than those considered in the prior art documents, making object detection more difficult. In order to detect such small objects, the global effective stride must be low, which requires a new architecture in order to keep a reasonable memory overhead. Besides, the proposed solution is an image object detector and, as such, it does not use temporal information as the video object detectors reported in [12] and [13] do.
References
[1] J. Redmon, A. Farhadi, Yolo9000: Better, faster, stronger, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 6517-6525.
[2] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al., Speed/accuracy trade-offs for modern convolutional object detectors, in: IEEE Computer Vision and Pattern Recognition (CVPR), 2017.
[3] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 91-99.
[4] J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully convolutional networks, in: Advances in Neural Information Processing Systems (NIPS), 2016, pp. 379-387.
[5] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, S. Yan, Perceptual generative adversarial networks for small object detection, in: IEEE Computer Vision and Pattern Recognition (CVPR), 2017.
[6] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, S. Hu, Traffic-sign detection and classification in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2110-2118.
[7] P. Dollár, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4) (2012) 743-761.
[8] C. Eggert, D. Zecha, S. Brehm, R. Lienhart, Improving small object proposals for company logo detection, in: ACM on International Conference on Multimedia Retrieval, ACM, 2017, pp. 167-174.
[9] Z. Cai, Q. Fan, R. S. Feris, N. Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, in: European Conference on Computer Vision (ECCV), Springer, 2016, pp. 354-370.
[10] F. Yang, W. Choi, Y. Lin, Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2129-2137.
[11] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: IEEE Computer Vision and Pattern Recognition (CVPR), Vol. 1, 2017, p. 4.
[12] A. Rozantsev, V. Lepetit, P. Fua, Detecting flying objects using a single moving camera, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (5) (2017) 879-892.
[13] C. Feichtenhofer, A. Pinz, A. Zisserman, Detect to track and track to detect, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3038-3046.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
[15] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.
[16] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conference on Computer Vision (ECCV), Springer, 2014, pp. 818-833.
[17] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: 27th International Conference on Machine Learning (ICML), 2010, pp. 807-814.
[18] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International Journal of Computer Vision 88 (2) (2010) 303-338.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, C. L. Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision (ECCV), Springer, 2014, pp. 740-755.
[20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation 1 (4) (1989) 541-551.
[21] W. Luo, Y. Li, R. Urtasun, R. Zemel, Understanding the effective receptive field in deep convolutional neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2016, pp. 4898-4906.
[22] S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware cnn model, in: IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1134-1142.
[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: 22nd ACM International Conference on Multimedia, ACM, 2014, pp. 675-678.
SUMMARY
The present disclosure introduces a new CNN architecture for small object detection. The proposed CNN architecture has a size that is significantly lower than its counterparts for the same resolution of the last feature map. The present invention considers the hypothesis that, after a few convolutional layers, the feature map contains enough information to decide which regions of the image contain candidate objects, but not enough data to classify the region or to perform bounding box regression. Given an intermediate feature map, the present invention applies a novel component, called Region Context Network (RCN), which is a filter that selects the most promising regions of the feature map (all of them with the same size), avoiding the processing of the remaining areas of the image. The RCN ends with a RoI (Region of Interest) Collection Layer (RCL), which builds a new and reduced filtered feature map by arranging all the regions selected by the RCN. Therefore, the memory overhead of the feature maps following the RCN is much lower, but with the same spatial resolution, as the reduction in size is due to the deletion of the least promising regions with small objects. Finally, the present invention applies a Region Proposal Network (RPN) to the last filtered feature map, classifies the regions and performs bounding box regression.
In accordance with one aspect of the present invention there is provided a computer- implemented method for detecting small objects on an image using convolutional neural networks. The method comprises the following steps:
- Applying one or more convolution operations to an input image to obtain a first set of convolutional layers and an input feature map corresponding to the last convolutional block of said first set.
- Analyzing the input feature map to determine a first set of candidate regions containing candidate objects.
- Arranging the first set of candidate regions to form a reduced feature map.
- Applying one or more convolution operations to the reduced feature map to obtain a second set of convolutional layers and an output feature map corresponding to the last convolutional block of said second set.
- Applying a Region Proposal Network to the output feature map to obtain a second set of candidate regions containing candidate objects.
- Classifying and applying bounding box regression to each candidate region of the second set to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image.
In an embodiment, the first set of candidate regions are determined by applying a first convolution operation to the input feature map to obtain an intermediate convolutional layer and an associated intermediate feature map; applying a second convolution operation to the intermediate feature map to obtain a class feature map including class scores as candidate objects; and selecting a determined number of regions in the input feature map according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions includes the selected regions.
The step of arranging the first set of candidate regions to form a reduced feature map may comprise concatenating the candidate regions and adding an inter region 0-padding between adjacent candidate regions.
The method may also comprise a preprocessing stage wherein the number and the size of the anchors used in the Region Proposal Network are automatically learned through k-means applied to a training set of ground truth boxes. The number of anchors is preferably automatically obtained by performing an iterative k-means with an increasing number of kernels until the maximum inter-kernel IoU ratio exceeds a certain threshold.
In accordance with a further aspect of the present invention there is provided an image object detector based on convolutional neural networks. The image object detector comprises:
- A feature extractor module configured to apply one or more convolution operations to an input image to obtain a first set of convolutional layers and an input feature map corresponding to the last convolutional block of said first set; and configured to apply one or more convolution operations to a reduced feature map to obtain a second set of convolutional layers and an output feature map corresponding to the last convolutional block of said second set.
- A region context network module configured to analyze the input feature map to determine a first set of candidate regions containing candidate objects.
- A Rol collection layer module configured to arrange the first set of candidate regions to form the reduced feature map.
- A Region Proposal Network module configured to obtain, from the output feature map, a second set of candidate regions containing candidate objects;
- A classifier module configured to classify and apply bounding box regression to each candidate region of the second set to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image.
According to an embodiment, the region context network module is configured to apply a first convolution operation to the input feature map to obtain an intermediate convolutional layer and an associated intermediate feature map; apply a second convolution operation to the intermediate feature map to obtain a class feature map including class scores as candidate objects; and select a determined number of regions in the input feature map according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions includes the selected regions.
The RoI collection layer module is preferably configured to form the reduced feature map by concatenating the candidate regions and adding an inter-region 0-padding between adjacent candidate regions. The image object detector may be implemented, for instance, in a processor or a GPU.
The present invention also refers to an object detection system for detecting small objects on an image using convolutional neural networks. The object detection system comprises an image object detector as previously defined and a camera configured to capture an input image.
In accordance with yet a further aspect of the present invention there is provided a vehicle comprising an object detection system as previously defined and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution by one or more vehicle systems of the vehicle. The vehicle may be, for instance, an unmanned aerial vehicle.
In accordance with another aspect of the present invention there is provided an airspace surveillance system comprising an object detection system as previously defined, wherein the camera of the object detection system is mounted on a ground location and is configured to monitor an airspace region; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution.
In accordance with yet another aspect of the present invention there is provided a ground surveillance system comprising an object detection system as previously defined, wherein the object detection system is installed on an aerial platform or vehicle and the camera of the object detection system is configured to monitor a ground region; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action for execution.
In accordance with another aspect of the present invention there is provided a detect and avoid system installed onboard a vehicle, comprising an object detection system as previously defined, wherein the camera of the object detection system is configured to monitor a region in front of the vehicle; and a decision module configured to determine, based on the object detection made by the object detection system, at least one action to avoid potential collisions.
The invention also refers to a computer program product for detecting small objects on an image using convolutional neural networks, comprising at least one computer-readable storage medium having recorded thereon computer code instructions that, when executed by a processor, cause the processor to perform the method as previously defined.
The main contributions of the present invention are:
A new CNN for small object detection that is able to work with high resolution feature maps in the deeper layers while having a size that is significantly lower than other CNNs. The present invention relies on a novel component, RCN, that selects the most promising regions of the image and generates a new and filtered feature map with these areas. Therefore, the filtered feature maps can keep the same resolution but with a lower memory overhead and a higher frame rate.
The present invention uses an RPN that works with anchors, wherein the number and sizes of the anchors can be automatically selected using a novel algorithm based on k-means. The automatic definition of the anchors with k-means improves the classical heuristic approach.
The fully convolutional network (CNN) of the present invention is focused on small targets equal to or under 256 square pixels. It includes an early visual attention mechanism, RCN, to choose the most promising regions with small objects and their context. RCN allows working with high-resolution feature maps with a reduced memory usage, as the regions with the least likely objects are deleted from the filtered feature maps. The filtered feature maps, which only contain the most likely regions with small objects, are forwarded across the network up to the ending Region Proposal Network (RPN), and then classified. RCN is key to increasing localization accuracy through finer spatial resolution due to finer global effective strides, smaller memory overhead and higher frame rates.
Experimental results over small object databases show that the present invention improves the average precision (AP) of the best state-of-the-art approach for small target detection from 52.7% to 60.1%.
BRIEF DESCRIPTION OF THE DRAWINGS
A series of drawings which aid in better understanding the invention and which are expressly related with an embodiment of said invention, presented as a non-limiting example thereof, are very briefly described below.
Figure 1 shows the structure of a CNN object detector according to the prior art.
Figure 2 depicts the steps performed by a CNN object detector according to the present invention.
Figure 3 depicts the RCN architecture of the present invention.
Figure 4 shows some examples of the feature maps obtained by the RCN.
Figure 5 is a schematic diagram of an image object detector according to an embodiment of the present invention.
Figure 6 depicts a vehicle with the image object detector installed onboard.
Figure 7 depicts the steps performed by the image object detector according to an embodiment of the invention.
Figure 8 depicts the steps performed by an ensemble of residual blocks from early or late convolutions of the image object detector in order to extract features from the input feature map.
Figure 9 shows an embodiment of the image object detector of Figure 5 applied to airspace surveillance.
Figure 10 depicts, according to another embodiment, the image object detector of Figure 5 applied to ground surveillance from an aerial position.
Figure 11 depicts, according to yet another embodiment, the image object detector of Figure 5 applied to detect and avoid applications.
DETAILED DESCRIPTION
The present disclosure refers to a system and a computer-implemented method for detecting objects on an image using convolutional neural networks.
Figure 1 schematically depicts, according to the prior art, the internal structure of an object detector using convolutional neural networks, CNN object detector 100, which receives and processes an input image 102 to obtain an object classification in the input image 104, thereby detecting the presence of objects in the input image 102.
A feature extractor 110 of the CNN object detector 100 sequentially applies N successive convolution operations (111, 113, 115), obtaining for each convolution operation a convolution layer and the associated feature maps (112, 114, 116) which will be used in the next convolution operation. A Region Proposal Network (RPN) 120, as described in the prior art (see for instance [3]), is then applied to the last feature maps 116 obtained by the feature extractor 110. A classifier 130 receives the output of the RPN 120 and the last feature maps 116 of the feature extractor 110 to determine the object classification in the input image 104, including the class of the object, using a fully-connected classification network. Along with the classification, a bounding box regression is also performed to obtain the bounding box in the input image 102 for the regions detected as objects.
The system of the present invention is a fully convolutional network that detects small objects. The system only considers regions of the feature maps containing most likely objects, deleting those regions of the feature maps with least likely objects and building filtered feature maps with the same resolution but lower memory requirements. This way, the system works with high resolution feature maps while keeping a low memory overhead.
Figure 2 schematically depicts the method performed by a CNN object detector, according to an embodiment of the present invention, to detect small objects on an input image using convolutional neural networks. The method 200 comprises receiving an input image 102 and applying one or more convolution operations (early convolutions 210) to the input image 102 to obtain a first set of convolutional layers 212. In the example of Figure 2 the first set of convolutional layers 212 is formed by two convolutional blocks (214, 216). The last feature map of the convolutional block 216 of said first set 212 is an input feature map 302 for the convolution operations applied in the next step of the process (shown in more detail in Figure 3), referred to as Region Context Network (RCN) 220 in Figure 2. The RCN 220 analyzes the input feature map 302 to determine a first set of candidate regions 222 in the input feature map 302 containing candidate objects.
The first set of candidate regions 222 is arranged to form a reduced feature map 228 (RoI collection layer). One or more convolution operations (late convolutions 230) are then applied to the reduced feature map 228 to obtain a second set of convolutional layers 232. In the embodiment shown in Figure 2 the second set of convolutional layers 232 comprises two convolutional blocks (234, 236). The last feature map of the last convolutional block 236 of said second set 232 is the output feature map of the late convolutions 230.
A Region Proposal Network (RPN) 240 is then applied to said output feature map to obtain a second set of candidate regions 242 (e.g. j candidate regions) in the output feature map containing candidate objects. A classifier 250 classifies and applies bounding box regression to each candidate region of the second set 242 to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image 102. In an embodiment, each of the selected candidate regions 242 may first be converted, prior to the classification and bounding box regression, to a fixed size feature map, obtaining j fixed size feature maps 248 (RoI pooling layers).
Figure 3 represents in more detail, according to an embodiment, the RCN 220 process to obtain the first set of candidate regions 222 in the input feature map 302. The RCN 220 receives the input feature map 302 and applies a first convolution operation to the input feature map 302 to obtain an intermediate convolutional layer 224 and an associated intermediate feature map. The first convolution operation is a convolution using a fixed kernel size that acts as a fixed size sliding window 304 that maps the input feature map 302. In the embodiment of Figure 3, the first convolution operation is a convolution with a 3x3 kernel size and 128 filters (i.e. a 128-d 3x3 convolution).
The RCN 220 applies a second convolution operation to the intermediate feature map to obtain a class feature map 226 (rcn-cls-layer) including class scores as candidate objects. In the embodiment of Figure 3, the second convolution operation is a convolution with a 1x1 kernel size and 2 filters (i.e. a 2-d 1x1 convolution). The RCN 220 forms the first set of candidate regions 222 by selecting a determined number of regions in the input feature map according to the scores as candidate objects of the class feature map 226 (for instance, selecting the first n regions with the highest score).
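The following minimal sketch illustrates the two RCN convolutions and the top-n selection just described. It is written in PyTorch for illustration only (the reference implementation uses Caffe [23]); the class name, the softmax-based comparison of the two class scores and the top_n default are assumptions, not part of the disclosure.

```python
import torch
import torch.nn as nn

class RCNHead(nn.Module):
    """Sketch of the RCN head: a 3x3, 128-filter convolution with ReLU
    followed by a 2-filter 1x1 convolution (rcn-cls-layer) that scores
    every sliding-window position as object / no-object."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(128, 2, kernel_size=1)  # "fg" / "bg" scores

    def forward(self, feature_map, top_n=50):
        x = self.relu(self.conv(feature_map))
        scores = self.cls(x)                      # (1, 2, H, W)
        fg = torch.softmax(scores, dim=1)[:, 1]   # foreground probability map
        flat = fg.flatten(start_dim=1)            # (1, H*W)
        _, idx = flat.topk(top_n, dim=1)          # top_n highest-scored positions
        ys, xs = idx // fg.shape[-1], idx % fg.shape[-1]
        return fg, ys, xs                         # centres of candidate regions
```

Each selected position would then be expanded to a fixed-size region on the feature map before being handed to the RoI collection layer described next.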
According to the embodiment of Figure 3, the first set of candidate regions 222 is arranged to form a reduced feature map 228 (RoI collection layer, RCL) by concatenating the candidate regions 222 and adding an inter-region 0-padding (shown as gaps in the figure) between candidate regions 222. Figure 4 depicts an illustrative example of the reduced feature map 228 (RoI collection layer) for a particular input image 102. Figure 4 shows only 4 filters, out of the total 256 filters used in the example, of the RCN input (i.e. the input feature map 302), and only 7 filters (a row for each filter) out of a total of 256 of the RoI Collection Layer output (i.e. the reduced feature map 228).
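A possible sketch of this arrangement is shown below: the selected regions are cropped from the input feature map and concatenated along the width, with zero-filled columns inserted between neighbours. The tensor layout, the 12x12 region size and the 1-pixel padding are taken from the example given later in the detailed description; the function itself is illustrative.

```python
import torch

def roi_collection_layer(feature_map, regions, region_size=12, pad=1):
    """Concatenate the selected regions of the feature map along the width,
    inserting `pad` columns of zeros between neighbouring regions so that
    later convolutions with kernels larger than 1x1 do not mix regions.

    feature_map : (C, H, W) tensor from the last early convolution
    regions     : list of (y, x) top-left corners on the feature map
    """
    channels = feature_map.shape[0]
    zero_gap = torch.zeros(channels, region_size, pad)
    pieces = []
    for i, (y, x) in enumerate(regions):
        pieces.append(feature_map[:, y:y + region_size, x:x + region_size])
        if i < len(regions) - 1:
            pieces.append(zero_gap)       # inter-region 0-padding
    return torch.cat(pieces, dim=2)       # (C, region_size, n*(r+pad)-pad)
```

With 50 regions of 12x12 and 256 channels this produces the 649x12x256 reduced feature map used in the worked example of the detailed description.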
Figure 5 is a schematic diagram showing the components of an image object detector 500 based on convolutional neural networks (i.e. CNN object detector) according to an embodiment of the present invention. The image object detector 500 of the present invention is a system (or part of a system) for detecting small objects on an image using convolutional neural networks. The system may be implemented in a processing device including a processor, a GPU or a combination thereof (or any other kind of data processing device) and a computer-readable medium having encoded thereon computer-executable instructions to cause the processor/GPU to execute the method for detecting small objects on an image using convolutional neural networks as previously described.
The image object detector 500 comprises a feature extractor module 510, a region context network module 520, a region of interest (RoI) collection layer module 530, a Region Proposal Network module 540 and a classifier module 550. The feature extractor module 510 is configured to apply one or more convolution operations 210 (early convolutions) to an input image 102 to obtain a first set of convolutional layers 212 and an input feature map 302 corresponding to the last convolutional block 216 of the first set of convolutional layers 212.
The region context network module 520 analyzes the input feature map 302, looking for and determining the most promising regions containing candidate objects (i.e. a first set of candidate regions 222). Regions are defined as areas of the image that might contain objects together with their context. The region context network module 520 assigns to each region a score, and the top scored regions (first set of candidate regions 222) are passed on to a RoI collection layer module 530, at the final stage of the RCN.
The region context network module 520 avoids forwarding the regions of the input image with least likely objects to the deepest convolutional layers, saving memory and increasing frame rate. Memory saving is key to increase spatial resolution through finer global effective strides across convolutional layers, mandatory in order not to miss the spatial localization of small objects.
The region context network module 520 selects the most likely candidate regions with one or more small objects together with their context, and returns them as a set of disjoint regions. As the goal at this stage is not to get accurate object localization, neither a box regression approach nor a set of anchors with different scales and aspect ratios is needed: a single anchor of a given size suffices to return the most likely candidate regions with small objects. The region context network module 520 first applies a 3 x 3 convolutional filter to each window of the input feature map 302, generating an intermediate 128-d layer followed by a ReLU (Rectified Linear Unit) [17]. This structure feeds a box-classification layer (rcn-cls-layer), represented by a 1 x 1 convolutional 2-d layer ("fg", i.e. object, and "bg", i.e. no object), which scores the regions obtained with sliding windows over the last early convolution (i.e. the input feature map 302).
To determine whether the anchor is a positive or a negative candidate in each sliding window region during the RCN's training phase, the ground truths of the objects are grown proportionally in all directions until equaling the anchor's defined size. Then, those anchors that have a considerable overlap with the modified ground truth (greater than 0.7 by default) are assigned positive labels, leaving as negative those regions that barely have an overlap (lower than 0.3 by default). As usual, the overlap is measured by the intersection-over-union (IoU) ratio. The objectness score of the candidate regions in RCN is minimized through:
L(\{p_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*})

where the summation is the object / not-object classification term, p_i is the predicted probability of the i-th anchor being an object in an RCN mini-batch, and p_i^{*} is the adapted ground-truth label. The term 1/N_{cls} normalizes the equation, N_{cls} being the size of the RCN's mini-batch. L_{cls} is a softmax loss over the object and not-object categories.
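The anchor labelling rule described above can be sketched as follows. The IoU computation is standard; growing the ground truth to the anchor's size is simplified here to a square of the anchor side centered on the ground-truth box, which is an assumption about a detail the text does not fully specify. Anchors labelled 1 or 0 feed the softmax loss L_cls; anchors returning None are ignored.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def grow_to_anchor(gt, anchor_side):
    """Illustrative helper: grow a ground-truth box to a square of the
    anchor's side length, keeping the ground-truth centre."""
    cx, cy = (gt[0] + gt[2]) / 2.0, (gt[1] + gt[3]) / 2.0
    half = anchor_side / 2.0
    return (cx - half, cy - half, cx + half, cy + half)

def rcn_label(anchor, ground_truths, anchor_side, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (object), 0 (background) or None (ignored) for one anchor."""
    overlaps = [box_iou(anchor, grow_to_anchor(gt, anchor_side))
                for gt in ground_truths]
    best = max(overlaps, default=0.0)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return None  # neither positive nor negative: excluded from the mini-batch
```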
The RCN 220 ends with the so-called RoI collection layer (RCL) (Figure 3), implemented by the RoI collection layer module 530, which is configured to arrange the first set of candidate regions 222 to form a reduced feature map 228. The RoI collection layer module 530 takes as inputs the feature map generated by the last early convolution and the top scored proposals from the RCN, and returns a single filtered feature map (reduced feature map 228) with the same information as that of the input feature map 302, but only for the set of selected regions. Successive convolutions with filters greater than 1x1 would affect the neighboring regions' outputs. To solve this problem, the RoI collection layer module 530 adds an inter-region 0-padding, shown by gaps between regions in Figure 3.
With this configuration, the dimensions of the feature map output are obtained as follows:
W_{RCL} = n (r_w + p_d) - p_d, \qquad H_{RCL} = r_h

where n is the number of regions from the RCN, r_w and r_h are the width and height of the regions in the RCL input feature map, and p_d is the size of the 0-padding between regions. For example, a 1280x720 input image has an RCL input feature map of 320x180, and the RCL generates a 649x12 output feature map: 50 regions of size 48x48 at the input image (12x12 at the RCL input feature map for stride 4) with 1-pixel 0-padding; i.e. a 7.4x reduction of GPU memory usage (86.5% saved memory).
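The figures quoted in this example can be reproduced directly from the formula; the snippet below is plain arithmetic and not part of the network.

```python
def rcl_output_width(n, r_w, pad):
    """Width of the RCL output: n regions of width r_w separated by
    `pad` columns of zero padding."""
    return n * (r_w + pad) - pad

full = 320 * 180                              # RCL input feature map, per channel
reduced = rcl_output_width(50, 12, 1) * 12    # 649 x 12
print(rcl_output_width(50, 12, 1))            # 649
print(round(full / reduced, 1))               # 7.4  -> ~7.4x less memory
print(round(100 * (1 - reduced / full), 1))   # 86.5 -> ~86.5% saved memory
```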
The feature extractor module 510 also applies one or more convolution operations 230 (late convolutions) to the reduced feature map 228 to obtain a second set of convolutional layers 232 and an output feature map 502 corresponding to the last convolutional block 236 of said second set 232. Late convolutions 230 act independently on each of the candidate regions 222 obtained by the region context network module 520, thanks to the inter-region 0-padding (displayed as gaps between the different candidate regions in Figure 3).
The feature extractor module 510 can be based on any of the most widely used state-of-the-art backbones found in the literature, e.g. ResNet [14], VGG [15], ZF [16], etc. ResNet-50 is preferably used, since it provides a good trade-off between accuracy, speed and GPU memory consumption [14].
The Region Proposal Network (RPN) module 540 is configured to obtain, using the output feature map 502, a second set of candidate regions 242 containing candidate objects. The RPN module 540 performs an initial bounding box regression and classification as object (fg) and background (bg) [3], which are finally refined in the classifying stage.
The RPN module 540 is based on the RPN presented in [3], but including a set of modifications in order to deal with the fact that the coordinates of its input feature map do not correspond with those of the input image, i.e., the RPN input contains unsorted regions. To map the regions on the input image to the RPN's training function, which is based on the IoU between anchors and ground truth, the region context network module 520 passes the 4 coordinates of every region as a parameter to the RPN module 540 to generate the anchors relative to those regions. Finally, the output of the bounding box regression is transformed to the input image coordinates.
The approaches that rely on RPNs define the number of anchors and their sizes heuristically. In the present invention, both the number and the size of the anchors are learned through k-means (i.e. automatic anchor initialization by k-means). This approach can be adopted by any other object detection network with anchors, e.g. Faster-R-CNN, regardless of the target size of the objects. The k-means anchor learning procedure is implemented as a preprocessing stage: k-means is applied to the training set of ground truth boxes' heights and widths. In order to obtain the number of kernels, which will be the number of anchors, an iterative k-means with an increasing number of kernels is performed until the maximum inter-kernel IoU exceeds a certain threshold. In an embodiment, the threshold is set to 0.5, which is the value used in well-known repositories, such as PASCAL VOC [18] or MS COCO [19], to check if a detection is positive or negative with respect to a ground truth. A similar contribution was presented in [1], where a k-means algorithm selects the anchors' sizes according to the dataset, but where the selection of the number of anchors is done manually, visualizing the best trade-off between the number of anchors and their average intersection with the dataset objects. The present approach makes the anchor selection completely automatic.
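A sketch of this automatic anchor initialization is given below. The use of scikit-learn's KMeans, and the choice of returning the clustering obtained just before the threshold is exceeded, are assumptions; the disclosure only states that the number of kernels is increased until the maximum inter-kernel IoU exceeds the threshold.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_iou(wh_a, wh_b):
    """IoU of two anchors of sizes (w, h), assumed to share the same centre."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def learn_anchors(gt_wh, iou_threshold=0.5, max_k=10):
    """gt_wh: (N, 2) array of ground-truth box widths and heights."""
    centers = None
    for k in range(1, max_k + 1):
        centers = KMeans(n_clusters=k, n_init=10).fit(gt_wh).cluster_centers_
        max_pairwise = max(
            (anchor_iou(centers[i], centers[j])
             for i in range(k) for j in range(i + 1, k)),
            default=0.0,
        )
        if max_pairwise > iou_threshold:
            # two kernels became too similar: keep the previous, smaller set
            return KMeans(n_clusters=k - 1, n_init=10).fit(gt_wh).cluster_centers_
    return centers

# example call with hypothetical ground-truth sizes (pixels)
anchors = learn_anchors(np.array([[12, 12], [16, 10], [14, 15], [9, 9]]))
```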
The classifier module 550 is configured to classify and apply bounding box regression to each candidate region of the second set of candidate regions 242 to obtain, for each candidate region, a class score 552 as a candidate object and a bounding box 554 in the input image 102.
Figure 6 depicts an exemplary embodiment of the image object detector 500 installed onboard a vehicle 600, such as a boat, a car, an aircraft, an unmanned aerial vehicle or a drone. In particular, the vehicle 600 includes an object detection system 610 for detecting small objects on an image using convolutional neural networks, wherein the object detection system 610 comprises a camera 612 (the term "camera" includes any device able to acquire an image or a set of images, such as a conventional camera or a video camera) configured to capture an input image 102, and the image object detector 500 as previously described in Figure 5. In this example, the image object detector 500 is implemented in a processor or a GPU 614.
The vehicle 600 may also comprise a decision module 620 that receives the output of the object detection system (the class scores 552 and bounding boxes 554 for the candidate regions selected in the input image 102), and determines, based on the objects detected on the input image 102, one or more actions 622 to be executed by one or more vehicle systems 630 (e.g. communications system 632, navigation system 634 with on-board sensors 635, propulsion system 638) of the vehicle 600.
For instance, as shown in the embodiment of Figure 6, the action may be sent to the navigation system 634 (continuous line) and/or to the communications system 632 (dotted line). In the first case, the action may include, as an example, guiding the vehicle towards one of the small detected objects, depending on the class score obtained for said object or the size of the bounding box. This could be the case, for instance:
- When the bounding box is so small that the vehicle 600 is required to confirm the class score 552 assigned to the region by getting closer.
- When the class score assigned is of a particular relevance to the vehicle. For example, if the vehicle 600 is a drone patrolling a vast secure geographic area, such as borders between countries, and is looking for people invading that secure geographic area.
In this first case, the navigation system 634 receives a displacement instruction 624 to move towards a determined location (e.g. a detected object) and computes an updated trajectory, which is executed by the propulsion system 638 (e.g. motors, etc.) of the vehicle 600.
In the second case, the actions 622 may include reporting the detected objects 626 to an external entity, such as a server, using the communications systems 632 of the vehicle 600.
Figure 7 depicts the steps performed by an image object detector 500 according to an exemplary embodiment of the invention (this is merely an example, different parameters may be employed in other embodiments):
Input: image object detector 500 takes an image or a video-frame as an input image 102. The input image is scaled to HD resolution, 1280x720x3 (width x height x number of RGB color channels), keeping its width and height ratio.
Early convolutions 210: This ensemble is composed of a first convolution layer 710, a max-pooling layer 712 and a second residual block 714.
• First convolution layer 710: gets the input image and applies a 7x7 kernel size with stride 2, padding 3 and 64 filters. This operation halves the width and height, returning a 640x360x64 feature map.
• Max-pooling layer 712: transforms the 640x360x64 feature map into a 320x180x64 feature map through a max-pooling operation with a 3x3 kernel size and stride 2. From this point until the end, the image object detector 500 keeps the current resolution, that is, a resolution four times smaller than that of the original input image.
• Second residual block 714: residual block (Figure 8 depicts the steps performed by an ensemble of residual blocks [14] in order to extract features from the input feature map) composed of three blocks that increases the number of filters from 64 to 256, returning a 320x180x256 feature map (i.e. input feature map 302 in Figure 3). The feature map sizes in this list can be checked with the short size helper shown right after it.
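The sizes above follow from the standard output-size formula for convolution and pooling layers; the max-pooling padding of 1 is an assumption needed to obtain exactly 320x180.

```python
def conv_out(size, kernel, stride, padding):
    """Standard output-size formula for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

w, h = 1280, 720
w, h = conv_out(w, 7, 2, 3), conv_out(h, 7, 2, 3)   # first convolution layer
print(w, h)                                          # 640 360
w, h = conv_out(w, 3, 2, 1), conv_out(h, 3, 2, 1)   # max-pooling (padding assumed)
print(w, h)                                          # 320 180
```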
Region Context Network (RCN) 220: RCN 220 consists of two convolutional layers (RCN convolution 720 and RCN class score convolution 722) and a layer for the proposal of regions (RCN proposal layer 724).
• RCN convolution 720: applies a 3x3 kernel size (stride 1, padding 1) that acts as a 3x3 sliding window, mapping the input feature map 302 information in a 128-d output (320x180x128).
• RCN class score convolution 722: a 1x1 convolution that learns the necessary characteristics to differentiate between object and non-object regions at each sliding window location (2-d). Each unit of the feature map decides whether the anchor centered in that unit contains an object or not. This is done by comparing the activation values of the two units at the same spatial location: one of them learns the foreground score and the other one the background score. Returns a 320x180x2 feature map.
• RCN proposal layer 724: a custom layer that gets the class (object or non-object) scores from the RCN class score convolution 722, calculates their regions' coordinates in the input image size and returns a first set of candidate regions 222 most likely to contain an object (50x4 rcn_rois, where 50 is the number of regions and 4 is the number of coordinates for each region).
- RoI Collection Layer (RCL) 228: RCL 228 is another custom layer that obtains the first set of candidate regions 222 (rcn_rois) from the RCN proposal layer 724 and the feature map information from the second residual block 714 (input feature map 302). With both inputs, it obtains the information from the feature map of the second residual block 714, but only within the selected regions. Then, it concatenates this information in a new output feature map of size RCL_output. Successive convolutions with filters greater than 1x1 would affect the neighboring regions' outputs. To solve this problem, the RCL adds an inter-region 0-padding. For this example, if we take the top 50 most likely regions with a region size of 48x48 pixels (12x12 on the feature map of the second residual block 714) and 1-pixel 0-padding, the output feature map size is 649x12x256.
Late convolutions 230: This ensemble is composed of two residual blocks (third residual block 730 and fourth residual block 732, obtained according to the flow diagram of Figure 8).
• Third residual block 730: composed by four blocks that take as input the output from RCL 228 and increase the number of filters from 256 to 512, returning a 649x12x512 feature map. Inside the residual block and after each 3x3 convolution, restore collection padding is applied (see Figure 8), an auxiliary layer which restores the padding between regions to zero.
• Fourth residual block 732: residual block composed of six blocks that increases the number of filters from 512 to 1024, returning a 649x12x1024 feature map. As in the previous case, restore collection padding is applied.
Region Proposal Network (RPN) 240: RPN 240 consists of three convolutional layers (RPN convolution 740, RPN class score convolution 744 and RPN bounding box regression convolution 746) and a layer for the proposal of regions (RPN proposal layer 748).
• RPN convolution 740: applies a 3x3 kernel size (stride 1, padding 1) which acts as a 3x3 sliding window that maps the input feature map information in a 256-d output (649x12x256). After this operation, an auxiliary layer (remove collection padding 742) eliminates the 0-padding between regions, since there are no more 3x3 convolutions to be applied on them, returning a 600x12x256 feature map.
• RPN class score convolution 744: a 1x1 convolution that learns the necessary characteristics to differentiate between object and non-object at each sliding window location and for each defined anchor (6-d since 3 anchors are used). Returns a 600x12x6 feature map.
• RPN bounding box regression convolution 746: a 1x1 convolution that learns the necessary characteristics to apply regression to each of the four coordinates of each anchor at each sliding window location (12-d since 3 anchors are used). Returns a 600x12x12 feature map.
• RPN proposal layer 748: a custom layer that gets the first set of candidate regions 222 (rcn_rois) from the RCN proposal layer 724, the class (object or non-object) scores for each anchor from the RPN class score convolution 744 and the coordinates for each anchor from the RPN bounding box regression convolution 746. With the first set of candidate regions 222 (rcn_rois), it maps the sliding window locations for each anchor to the coordinates of those regions in the original input image. Then, it sorts those most likely to contain an object by the scores from the RPN class score convolution 744. For all of them, the regression values learned by the RPN bounding box regression convolution 746 are applied, obtaining the top N final regions (in the example N=300) in the original input image (rois). Moreover, the RPN proposal layer 748 also returns the coordinates of the 300 regions relative to the RPN input, i.e. the unsorted map of regions (scaled_rois), the second set of candidate regions 242 in Figure 2.
RoI pooling layer 248: this layer takes the feature map information of the fourth residual block 732 (i.e. output feature map 502) and the unsorted map of 300 regions scaled_rois (i.e. the second set of candidate regions 242). The auxiliary layer remove collection padding eliminates the 0-padding between regions in the feature map of the fourth residual block 732, so that the size is 600x12x1024. Then, the RoI pooling layer 248 obtains the information from the feature map of the fourth residual block 732, but only within the selected regions, and converts them to a fixed size (14x14x1024) feature map. Also, the 300 regions go forward to the next stage.
- Classifier 250: Each region of interest from the RoI pooling layer 248 is classified independently by the last residual block (fifth residual block 750, Figure 8) and an average pooling 752.
• Fifth residual block 750: residual block composed of three blocks. The first one halves the width and height. In addition, the block increases the number of filters from 1024 to 2048, returning a 7x7x2048 feature map.
• Average pooling 752: an average pooling with 7x7 kernel size reduces the dimension to 1x1x2048, ready to be classified by fully connected layers.
Decision Function 760: for each region of interest, the final decision is taken based on two fully connected layers that transform the input 1x1x2048 array into the category of the object (class score fully connected layer 762) and its corresponding bounding box regression (bounding box regression fully connected layer 764). On the one hand, the value obtained by the class score fully connected layer 762 passes through a Softmax function 766 to normalize the score into the range [0, 1] and, on the other hand, a transformation function 768 applies the bounding box regression to the rois relative to the original input image obtained from the RPN proposal layer 748. This returns the final class score 552 and bounding box 554 for each region of interest.
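A sketch of the decision function for one region of interest is given below. The softmax is standard; the bounding box decoding follows the usual R-CNN (dx, dy, dw, dh) parameterization, which the text does not spell out, so it should be read as an assumption rather than as the exact transformation function 768.

```python
import numpy as np

def softmax(scores):
    """Normalize raw class scores into the [0, 1] range."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def apply_bbox_regression(roi, deltas):
    """Decode (dx, dy, dw, dh) regression outputs against a proposal
    given as (x1, y1, x2, y2) in input-image coordinates."""
    w, h = roi[2] - roi[0], roi[3] - roi[1]
    cx, cy = roi[0] + 0.5 * w, roi[1] + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * np.exp(dw), h * np.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```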
Regarding the training of the system, as all the learnable layers are convolutional and shared, both the network that acts as a backbone (ResNet-50) and the two modules of the network (RCN module 520 and RPN module 540) can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) [20]. In an embodiment, the approximate joint training [3] has been selected.
The RCN module 520 is trained in a similar way to the RPN module 540, except for bounding box regression, which does not exist in the RCN. The fact that the RCL keeps the same number of output images per mini-batch as that of input images makes the rest of the training identical to other RPN networks like Faster-R-CNN. The initialization of anchors by k-means does not affect training either, since it is performed prior to the training. In the same way as the RPN, the RCN module 520 obtains its mini-batch from a single image by selecting positive and negative anchors. The mini-batch used within the RCN has 64 examples, trying to maintain whenever possible a 1:1 ratio of positive and negative labels. The anchor's size is obtained by estimating the effective receptive field (ERF) which, in practice, follows a Gaussian distribution [21], so half of the theoretical receptive field of the convolutions between RCN and RPN is selected as ERF. In order to eliminate overlapping regions from those proposed by the RCN, an aggressive non-maximum suppression with a low threshold (0.3) is applied over the 2,000 best proposals before the RCL, resulting in a low number of scattered regions (around 200 on average). At test time, those regions with confidence higher than 0.3 are let through the RCN, up to a maximum of 50 regions.
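The aggressive non-maximum suppression mentioned above corresponds to ordinary greedy NMS with a 0.3 overlap threshold; a self-contained reference sketch follows (the box format and helper names are illustrative).

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def greedy_nms(boxes, scores, overlap_threshold=0.3, keep_top=2000):
    """Greedy non-maximum suppression over the keep_top best-scored proposals."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i],
                   reverse=True)[:keep_top]
    kept = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= overlap_threshold for j in kept):
            kept.append(i)
    return kept  # indices of the surviving, scattered regions
```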
RCN and RCL can be integrated in any object detection convolutional framework by just adapting the corresponding region proposal method to work with unsorted regions. In an embodiment, the method has been implemented over Faster-R-CNN. The hyperparameters for training and testing are the same as those used in Faster-R-CNN. The RPN module 540 is placed between the fourth residual block 732 and the fifth residual block 750 convolutional layers, as is done in [14] for Faster-R-CNN. Finally, at test time, a box voting scheme after non-maximum suppression is applied [22]. In this implementation, the Caffe framework is used [23].
Figures 9, 10 and 11 depict several possible applications of the method and the image object detector of the present invention. However, it is noted that the present invention could be applied to many other real case scenarios. Among the different use cases envisaged for the proposed invention, the following are highlighted:
- Airspace surveillance.
- Ground surveillance from an aerial position.
- Detect & avoid.
The first use case, airspace surveillance, is depicted in Figure 9. The airspace surveillance system 900 of Figure 9 comprises a camera 912 (implemented in this example as a video camera) located on the ground 902, mounted on a pole or on a terrestrial moving platform. The camera 912 is pointing towards the sky, either in a vertical or in an oblique direction. The camera 912 monitors a determined airspace region 903. The airspace region 903 monitored can be a static region, if the position and orientation of the camera is fixed, or a dynamic region, if the position, orientation and/or zoom of the camera 912 dynamically changes.
The video stream (sequence of input images 102) acquired by the camera 912 is sent to a processor 914 for further analysis in order to detect all those flying objects 904 (e.g. a drone in the example of Figure 9) which appear in the field of vision 906 of the camera 912 and are represented in the input image 102 as small objects 908 (with a size up to 16x16 pixels). To that end, the processor 914 implements the image object detector 500 of Figure 5. The processor 914 may be placed together with the camera 912 (i.e. locally) or remotely, for instance in a remote data center. In the latter case, the input images 102 are transmitted to the processor 914 using a broadband data connection, such as the Internet. The camera 912 and the image object detector 500 implemented by the processor 914 form an object detection system 910 in charge of monitoring the airspace and detecting, in real time, any kind of flying objects 904 (such as aircraft, e.g. drones or airships, parachutists, or even meteorites).
The airspace surveillance system 900 is helpful in scenarios where monitoring the airspace for security and/or safety reasons is critical, such as airports (in order to detect flying objects that can pose a potential hazard for commercial or military aviation), nuclear plants, transportation hubs, government facilities, football stadiums and any other critical infrastructure. The small object detection performed by the airspace surveillance system 900 is carried out as soon as possible (i.e. whenever the flying object 904 appears in the field of view 906 of the camera 912, regardless of its size), in order to take the contingency actions required. The airspace surveillance system 900 may optionally include a decision module 920 to determine one or more actions 922 to be carried out, based on the object detection made by the object detection system 910. Actions 922 may include, for instance, neutralizing a drone flying in the airspace near an airport, sending an alarm message, etc. The airspace surveillance system 900 may also comprise means (not shown in the figure) for executing the actions 922 determined by the decision module 920. For example, if the action to be taken is neutralizing a detected drone, the airspace surveillance system 900 may include a missile launcher to destroy the drone.
Figure 10 depicts the image object detector 500 applied to ground surveillance from aerial positions (e.g. from aerial vehicles or platforms). In this case, the ground surveillance system 1000 comprises a camera 1012 (e.g. a video camera) mounted on an aerial vehicle 1001 (e.g. a drone), pointing downwards toward the ground 1002, either in a vertical or in an oblique direction, to monitor a ground region 1003. The video stream captured by the camera 1012 is sent to a processor 1014 (either on-board the aerial vehicle 1001 or remotely located on an on-ground facility) in charge of analyzing the input images 102 to detect static or moving small terrestrial objects 1004 (e.g. people, as depicted in the example of Figure 10) on the ground 1002 (land, sea, river, lagoon, etc.) which appear in the field of vision 1006 of the camera 1012 and are represented in the input image 102 as small objects 1008 (with a size up to 16x16 pixels). The ground surveillance system 1000 may be applied in different aerial-based surveillance scenarios, such as:
- Search and rescue, e.g. in maritime environments, for locating either boats or people in the sea; in land environments, for detecting hikers (the small terrestrial objects 1004 in the example of Figure 10) who got lost.
- Security applications, for detecting specific targets (e.g. vehicles or people) approaching a given protected area (e.g. in Homeland Security applications: illegal immigrants approaching a border).
- Traffic surveillance, in order to detect vehicles, traffic jams, and other traffic management related events.
The camera 1012 and the image object detector 500 implemented by the processor 1014 form an object detection system 1010 in charge of monitoring the ground region 1003 and detecting, in real time, any kind of terrestrial objects 1004 (e.g. people).
The ground surveillance system 1000 may also include a decision module 1020 to determine one or more actions 1022 to be performed based on the object detection made by the object detection system 1010. Actions 1022 may include, among others, sending a message to a remote station informing about the detected objects. The ground surveillance system 1000 may also comprise means (not shown in the figure) for executing the actions 1022.
In the embodiment of Figure 11, an application of the image object detector 500 to avoid collisions (i.e. detect and avoid applications) is depicted. This is a particularly useful application for the vehicle 600 of Figure 6. In this embodiment, the detect and avoid system 1100 comprises a camera 1112 (e.g. a video camera) mounted onboard a vehicle 1101, such as the aircraft depicted in Figure 11 or any other type of vehicle (e.g. an autonomous car, a drone, etc.). The camera 1112 points forward, in the direction of movement of the vehicle 1101, either in a horizontal or in a slightly oblique direction, towards a dynamic region 1103 (in the embodiment, an airspace region). The video stream of the camera 1112 is analyzed by an on-board processor 1114, in order to detect other small flying objects 1104 which appear in the field of vision 1106 of the camera 1112, are represented in the input image 102 as small objects 1108 with a size of up to 16x16 pixels, and may pose a potential obstacle for the aerial vehicle 600.
The camera 1112 and the image object detector 500 implemented by the processor 1114 form an object detection system in charge of monitoring the airspace region 1103 and detecting, in real time, any kind of flying objects 1104 (e.g. drones, birds).
The detect and avoid system 1100 comprises a decision module 1120 that determines one or more actions 1122 to be performed based on the flying objects 1104 detected by the object detection system. The actions 1122 determined by the decision module 1120 are aimed at avoiding collisions with the detected flying objects 1104. For instance, a new trajectory may be computed by the decision module 1120 for execution by the vehicle 1101 (e.g. by the FMS of an aircraft or by an autonomous navigation module of a drone). A vehicle 1101 comprising the detect and avoid system 1100 may also be part of the invention, the vehicle comprising means for executing the actions 1122 to avoid collisions.
The small object detection performed by the detect and avoid system 1100 is carried out as soon as possible (i.e. as soon as the flying object 1104 appears in the field of view 1106 of the camera 1112, regardless of its size), in order to take the contingency actions required to avoid potential collisions.
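By way of illustration only, the Python sketch below shows how the components described above (a camera, the image object detector 500 and a decision module such as 1120) could be wired into a detect and avoid loop. The functions capture_frame, run_detector and plan_avoidance_trajectory are hypothetical stubs introduced for this example and do not represent the disclosed implementations.

```python
# Illustrative sketch only. The camera, detector and trajectory planner below are
# hypothetical stubs standing in for the camera 1112, the image object detector 500
# and the decision module 1120; they are not the disclosed implementations.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str      # e.g. "drone", "bird"
    score: float    # class score produced by the classifier module
    box: tuple      # (x, y, w, h) bounding box in the input image, in pixels

def capture_frame():
    """Stub for the on-board camera: returns a dummy frame."""
    return [[0] * 640 for _ in range(480)]

def run_detector(frame) -> List[Detection]:
    """Stub for the image object detector 500 (small objects up to 16x16 pixels)."""
    return [Detection("drone", 0.91, (312, 150, 12, 12))]

def plan_avoidance_trajectory(detections: List[Detection]):
    """Stub for the decision module: returns an action for the vehicle systems."""
    return {"action": "climb", "delta_altitude_m": 30}

def detect_and_avoid_loop(max_frames: int = 3, score_threshold: float = 0.5):
    for _ in range(max_frames):
        frame = capture_frame()                          # video stream of the camera
        detections = run_detector(frame)                 # object detection system
        hazards = [d for d in detections if d.score >= score_threshold]
        if hazards:
            action = plan_avoidance_trajectory(hazards)  # decision module
            print("avoidance action:", action)           # handed to the vehicle systems

if __name__ == "__main__":
    detect_and_avoid_loop()
```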
It is also important to remark that the method and image object detector 500 of the present invention are especially advantageous, compared with the prior art, for detecting small objects (equal to or smaller than 16x16 pixels) on an image. However, the invention may also be applied to detecting bigger objects (i.e. larger than 16x16 pixels) on an image.

Claims

1. A computer-implemented method for detecting small objects on an image using convolutional neural networks, comprising:
applying one or more convolution operations (210) to an input image (102) to obtain a first set of convolutional layers (212) and an input feature map (302) corresponding to the last convolutional block (216) of said first set (212);
analyzing the input feature map (302) to determine a first set of candidate regions (222) containing candidate objects;
arranging the first set of candidate regions (222) to form a reduced feature map (228);
applying one or more convolution operations (230) to the reduced feature map (228) to obtain a second set of convolutional layers (232) and an output feature map (502) corresponding to the last convolutional block (236) of said second set (232);
applying a Region Proposal Network (240) to the output feature map (502) to obtain a second set of candidate regions (242) containing candidate objects;
classifying and applying bounding box regression (250) to each candidate region of the second set (242) to obtain, for each candidate region, a class score as a candidate object and a bounding box in the input image (102).
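By way of illustration only, and not as part of the claimed subject-matter, the following Python/PyTorch sketch walks through the tensor shapes produced by the steps of claim 1. The input resolution, channel counts, stride, number K of candidate regions and region size are assumptions chosen for readability, not values prescribed by the claims.

```python
# Shape walk-through of the method steps under assumed sizes (all numbers illustrative).
import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 256)                        # input image (102)

# First set of convolution operations (210) -> input feature map (302)
conv1 = nn.Sequential(nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU())
fmap = conv1(x)                                        # (1, 64, 64, 64)

# Assume K candidate regions (222) of r x r positions are selected from fmap
K, r, C = 10, 8, fmap.shape[1]

# Reduced feature map (228): regions side by side with one 0-padding column in between
reduced = torch.zeros(1, C, r, K * r + (K - 1))        # (1, 64, 8, 89)

# Second set of convolution operations (230) -> output feature map (502)
conv2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
out = conv2(reduced)                                   # (1, 128, 8, 89)

# A Region Proposal Network (240) and the classifier/regressor (250) then operate on out
print(fmap.shape, reduced.shape, out.shape)
```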
2. The computer-implemented method of claim 1, wherein the first set of candidate regions (222) are determined by:
applying a first convolution operation to the input feature map (302) to obtain an intermediate convolutional layer (224) and an associated intermediate feature map;
applying a second convolution operation to the intermediate feature map to obtain a class feature map (226) including class scores as candidate objects;
selecting a determined number of regions in the input feature map (302) according to the class scores as candidate objects of the class feature map, wherein the first set of candidate regions (222) includes the selected regions.
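A minimal sketch of the two convolutions and the score-based selection of claim 2 is given below (Python/PyTorch); the channel counts, the number K of retained regions and the 3x3 region size are illustrative assumptions, not values taken from the claims.

```python
# Illustrative sketch of claim 2: two convolutions produce a class feature map and the
# K best-scoring positions of the input feature map are kept as candidate regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

fmap = torch.randn(1, 64, 64, 64)                  # input feature map (302)

inter_conv = nn.Conv2d(64, 64, 3, padding=1)       # first convolution -> intermediate layer (224)
score_conv = nn.Conv2d(64, 1, 1)                   # second convolution -> class feature map (226)

intermediate = F.relu(inter_conv(fmap))
class_map = score_conv(intermediate)               # one object score per spatial position

K, r = 10, 3                                       # assumed number and side of candidate regions
b, c, h, w = fmap.shape
top = class_map.flatten(1).topk(K, dim=1).indices[0]   # K highest-scoring positions

# Crop an r x r window of the input feature map around each selected position
pad = r // 2
fpad = F.pad(fmap, (pad, pad, pad, pad))
regions = torch.stack([
    fpad[0, :, int(i) // w: int(i) // w + r, int(i) % w: int(i) % w + r]
    for i in top
])
print(regions.shape)                               # (K, 64, 3, 3): first set of candidate regions (222)
```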
3. The computer-implemented method of any preceding claim, wherein the step of arranging the first set of candidate regions (222) to form a reduced feature map (228) comprises concatenating the candidate regions (222) and adding an inter-region 0-padding between adjacent candidate regions (222).
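The arrangement of claim 3 may be pictured with the short sketch below (Python/PyTorch): the candidate regions are concatenated side by side with a single column of 0-padding between adjacent regions. The region size and count are assumptions, and a real implementation would also keep track of the original position of each region in the input feature map.

```python
# Illustrative sketch of claim 3: concatenate K candidate regions into one reduced
# feature map, inserting a column of 0-padding between adjacent regions.
import torch

K, C, r = 10, 64, 3
regions = torch.randn(K, C, r, r)                  # first set of candidate regions (222)

pieces = []
zero_column = torch.zeros(C, r, 1)                 # inter-region 0-padding
for i, region in enumerate(regions):
    pieces.append(region)
    if i < K - 1:
        pieces.append(zero_column)

reduced = torch.cat(pieces, dim=2).unsqueeze(0)    # reduced feature map (228)
print(reduced.shape)                               # (1, 64, 3, K*r + K - 1) = (1, 64, 3, 39)
```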
4. The computer-implemented method of any preceding claim, further comprising a preprocessing stage wherein the number and the size of the anchors used in the Region Proposal Network (240) are automatically learned through k-means applied to a training set of ground truth boxes.
5. The computer-implemented method of claim 4, wherein the number of anchors is automatically obtained by performing an iterative k-means with an increasing number of kernels until the maximum inter-kernel IoU ratio exceeds a certain threshold.
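The pre-processing of claims 4 and 5 may be pictured with the sketch below (Python/NumPy). The toy ground-truth boxes, the IoU threshold, the plain Lloyd-style k-means and the stopping rule (keeping the last set of kernels obtained before the threshold is exceeded) are one possible reading given for illustration only; here the IoU is computed between centroid boxes aligned at a common corner.

```python
# Illustrative sketch of claims 4-5: learn anchor sizes by k-means on the widths and
# heights of ground-truth boxes, increasing the number of kernels until two centroids
# become too similar (their IoU exceeds a threshold). Data and threshold are toy values.
import numpy as np

def iou_wh(a, b):
    """IoU of two boxes given as (w, h), assuming they share the same top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_wh(boxes, k, iters=50, seed=0):
    """Plain Lloyd's k-means on (w, h) pairs with Euclidean distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(boxes[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = boxes[labels == j].mean(axis=0)
    return centroids

def learn_anchors(boxes, iou_threshold=0.85, max_k=10):
    anchors = kmeans_wh(boxes, 1)
    for k in range(2, max_k + 1):
        candidate = kmeans_wh(boxes, k)
        max_iou = max(iou_wh(candidate[i], candidate[j])
                      for i in range(k) for j in range(i + 1, k))
        if max_iou > iou_threshold:      # two kernels overlap too much: stop growing
            break
        anchors = candidate
    return anchors

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    gt_wh = rng.uniform(4, 16, size=(200, 2))   # toy ground-truth widths/heights (pixels)
    print(learn_anchors(gt_wh))
```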
6. An image object detector based on convolutional neural networks, comprising:
a feature extractor module (510) configured to:
apply one or more convolution operations (210) to an input image (102) to obtain a first set of convolutional layers (212) and an input feature map (302) corresponding to the last convolutional block (216) of said first set (212);
apply one or more convolution operations (230) to a reduced feature map (228) to obtain a second set of convolutional layers (232) and an output feature map (502) corresponding to the last convolutional block (236) of said second set (232);
a region context network module (520) configured to analyze the input feature map (302) to determine a first set of candidate regions (222) containing candidate objects;
a Rol collection layer module (530) configured to arrange the first set of candidate regions (222) to form the reduced feature map (228);
a Region Proposal Network module (540) configured to obtain, from the output feature map (502), a second set of candidate regions (242) containing candidate objects;
a classifier module (550) configured to classify and apply bounding box regression to each candidate region of the second set (242) to obtain, for each candidate region, a class score (552) as a candidate object and a bounding box (554) in the input image (102).
7. The image object detector of claim 6, wherein the region context network module (520) is configured to:
apply a first convolution operation to the input feature map (302) to obtain an intermediate convolutional layer (224) and an associated intermediate feature map;
apply a second convolution operation to the intermediate feature map to obtain a class feature map (226) including class scores as candidate objects;
select a determined number of regions in the input feature map (302) according to the class scores as candidate objects of the class feature map (226), wherein the first set of candidate regions (222) includes the selected regions.
8. The image object detector of any of claims 6 to 7, wherein the Rol collection layer module (530) is configured to form the reduced feature map (228) by concatenating the candidate regions (222) and adding an inter-region 0-padding between adjacent candidate regions (222).
9. The image object detector of any of claims 6 to 8, implemented in a processor (914; 1014) or a GPU (614).
10. An object detection system for detecting small objects on an image using convolutional neural networks, the object detection system (610; 910) comprising:
a camera (612; 912) configured to capture an input image (102), and
an image object detector (500) according to any of claims 6 to 9.
11. A vehicle (600; 1100), comprising:
an object detection system (610; 1110) according to claim 10; and
a decision module (620; 1120) configured to determine, based on the object detection made by the object detection system (610; 1110), at least one action (622; 1122) for execution by one or more vehicle systems (630; 1130) of the vehicle (600; 1100).
12. An airspace surveillance system (900), comprising:
an object detection system (910) according to claim 10, wherein the camera (912) of the object detection system (910) is mounted on a ground location and is configured to monitor an airspace region (903); and
a decision module (920) configured to determine, based on the object detection made by the object detection system (910), at least one action (922) for execution.
13. A ground surveillance system (1000), comprising:
an object detection system (1010) according to claim 10, wherein the object detection system (1010) is installed on an aerial platform or vehicle (1001 ) and the camera (1012) of the object detection system (1010) is configured to monitor a ground region (1003); and
a decision module (1020) configured to determine, based on the object detection made by the object detection system (1010), at least one action (1022) for execution.
14. A detect and avoid system (1100) installed onboard a vehicle (1101), comprising:
an object detection system (1110) according to claim 10, wherein the camera (1112) of the object detection system (1110) is configured to monitor a region (1103) in front of the vehicle (1101); and
a decision module (1120) configured to determine, based on the object detection made by the object detection system (1110), at least one action (1122) to avoid potential collisions.
15. A computer program product for detecting small objects on an image using convolutional neural networks, comprising at least one computer-readable storage medium having recorded thereon computer code instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1 to 5.
PCT/EP2018/072857 2018-07-24 2018-08-24 A computer-implemented method and system for detecting small objects on an image using convolutional neural networks Ceased WO2020020472A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
ES202190001A ES2908944B2 (en) 2018-07-24 2018-08-24 A COMPUTER IMPLEMENTED METHOD AND SYSTEM FOR DETECTING SMALL OBJECTS IN AN IMAGE USING CONVOLUTIONAL NEURAL NETWORKS

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ESP201830753 2018-07-24
ES201830753 2018-07-24

Publications (1)

Publication Number Publication Date
WO2020020472A1 true WO2020020472A1 (en) 2020-01-30

Family

ID=63557402

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/072857 Ceased WO2020020472A1 (en) 2018-07-24 2018-08-24 A computer-implemented method and system for detecting small objects on an image using convolutional neural networks

Country Status (2)

Country Link
ES (1) ES2908944B2 (en)
WO (1) WO2020020472A1 (en)

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368653A (en) * 2020-02-19 2020-07-03 杭州电子科技大学 A low-altitude small target detection method based on R-D graph and deep neural network
CN111401297A (en) * 2020-04-03 2020-07-10 天津理工大学 Triphibian robot target recognition system and method based on edge calculation and neural network
CN111415000A (en) * 2020-04-29 2020-07-14 Oppo广东移动通信有限公司 Convolutional neural network, and data processing method and device based on convolutional neural network
CN111428602A (en) * 2020-03-18 2020-07-17 浙江科技学院 Binocular saliency image detection method based on edge-assisted enhancement of convolutional neural network
CN111597945A (en) * 2020-05-11 2020-08-28 济南博观智能科技有限公司 Target detection method, device, equipment and medium
CN111611925A (en) * 2020-05-21 2020-09-01 重庆现代建筑产业发展研究院 Building detection and identification method and device
CN111626208A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Method and apparatus for detecting small targets
CN111666850A (en) * 2020-05-28 2020-09-15 浙江工业大学 Cell image detection and segmentation method for generating candidate anchor frame based on clustering
CN111797769A (en) * 2020-07-06 2020-10-20 东北大学 A Small Target Sensitive Vehicle Detection System
CN111916206A (en) * 2020-08-04 2020-11-10 重庆大学 A cascade-based CT image-aided diagnosis system
CN111950488A (en) * 2020-08-18 2020-11-17 山西大学 An Improved Faster-RCNN Remote Sensing Image Object Detection Method
CN112036455A (en) * 2020-08-19 2020-12-04 浙江大华技术股份有限公司 Image identification method, intelligent terminal and storage medium
CN112069907A (en) * 2020-08-11 2020-12-11 盛视科技股份有限公司 X-ray machine image recognition method, device and system based on example segmentation
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN112115847A (en) * 2020-09-16 2020-12-22 深圳印像数据科技有限公司 Method for judging face emotion joyfulness
CN112329861A (en) * 2020-11-06 2021-02-05 北京工业大学 Layered feature fusion method for multi-target detection of mobile robot
CN112364687A (en) * 2020-09-29 2021-02-12 上善智城(苏州)信息科技有限公司 Improved Faster R-CNN gas station electrostatic sign identification method and system
CN112733691A (en) * 2021-01-04 2021-04-30 北京工业大学 Multi-direction unmanned aerial vehicle aerial photography vehicle detection method based on attention mechanism
CN112949499A (en) * 2021-03-04 2021-06-11 北京联合大学 Improved MTCNN face detection method based on ShuffleNet
CN112966579A (en) * 2021-02-24 2021-06-15 湖南三湘绿谷生态科技有限公司 Large-area camellia oleifera forest rapid yield estimation method based on unmanned aerial vehicle remote sensing
CN113012220A (en) * 2021-02-02 2021-06-22 深圳市识农智能科技有限公司 Fruit counting method and device and electronic equipment
CN113011561A (en) * 2021-03-04 2021-06-22 中国人民大学 Method for processing data based on logarithm polar space convolution
CN113139540A (en) * 2021-04-02 2021-07-20 北京邮电大学 Backboard detection method and equipment
EP3905116A1 (en) * 2020-04-29 2021-11-03 FotoNation Limited Image processing system for identifying and tracking objects
CN113705387A (en) * 2021-08-13 2021-11-26 国网江苏省电力有限公司电力科学研究院 Method for detecting and tracking interferent for removing foreign matters on overhead line by laser
CN113780147A (en) * 2021-09-06 2021-12-10 西安电子科技大学 A lightweight dynamic fusion convolutional net-based hyperspectral object classification method and system
KR102344004B1 (en) * 2020-07-09 2021-12-27 정영규 Deep learning based real-time small target detection device for cpu only embedded board
CN113963265A (en) * 2021-09-13 2022-01-21 北京理工雷科电子信息技术有限公司 A fast detection and recognition method of small samples and small targets in complex remote sensing terrestrial environment
CN114120056A (en) * 2021-10-29 2022-03-01 中国农业大学 Small target identification method, small target identification device, electronic equipment, medium and product
WO2022074483A1 (en) * 2020-10-05 2022-04-14 International Business Machines Corporation Action-object recognition in cluttered video scenes using text
CN114596450A (en) * 2022-03-17 2022-06-07 四川邦辰信息科技有限公司 Image inclusion detection method and device
CN114611685A (en) * 2022-03-08 2022-06-10 安谋科技(中国)有限公司 Feature processing method, medium, device, and program product in neural network model
CN114627437A (en) * 2022-05-16 2022-06-14 科大天工智能装备技术(天津)有限公司 A kind of traffic target recognition method and system
EP4016473A1 (en) * 2020-12-16 2022-06-22 HERE Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module to identify objects of interest within an image utilizing digital signatures
CN114677611A (en) * 2021-03-22 2022-06-28 腾讯云计算(北京)有限责任公司 Data identification method, storage medium and device
CN114723733A (en) * 2022-04-26 2022-07-08 湖北工业大学 A Class Activation Mapping Method and Device Based on Axiom Interpretation
US11423252B1 (en) 2021-04-29 2022-08-23 International Business Machines Corporation Object dataset creation or modification using labeled action-object videos
WO2022213307A1 (en) * 2021-04-07 2022-10-13 Nokia Shanghai Bell Co., Ltd. Adaptive convolutional neural network for object detection
CN115203449A (en) * 2022-07-15 2022-10-18 中国人民解放军国防科技大学 Data processing method and device
CN115346170A (en) * 2022-08-11 2022-11-15 北京市燃气集团有限责任公司 Intelligent monitoring method and device for gas facility area
US20220391615A1 (en) * 2021-06-01 2022-12-08 Hummingbird Technologies Limited Tool for counting and sizing plants in a field
US11587253B2 (en) 2020-12-23 2023-02-21 Here Global B.V. Method, apparatus, and computer program product for displaying virtual graphical data based on digital signatures
CN115908874A (en) * 2022-11-29 2023-04-04 华中光电技术研究所(中国船舶集团有限公司第七一七研究所) A Siamese Network Based Target Tracking Model De-redundancy Method
CN115984846A (en) * 2023-02-06 2023-04-18 山东省人工智能研究院 An intelligent recognition method for small targets in high-resolution images based on deep learning
CN113469272B (en) * 2021-07-20 2023-05-19 东北财经大学 Object detection method for hotel scene pictures based on Faster R-CNN-FFS model
CN116597331A (en) * 2023-06-01 2023-08-15 北京联合大学 A Lightweight Object Detection Method for UAV Aerial Images
CN116824386A (en) * 2023-03-22 2023-09-29 齐鲁工业大学(山东省科学院) Method and system for detecting rotating targets in aerial remote sensing images
CN117132856A (en) * 2023-07-31 2023-11-28 南京信息工程大学 A small target detection method using asymmetric modulation fusion features
US11830103B2 (en) 2020-12-23 2023-11-28 Here Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module using augmented data
US11829192B2 (en) 2020-12-23 2023-11-28 Here Global B.V. Method, apparatus, and computer program product for change detection based on digital signatures
CN117292394A (en) * 2023-09-27 2023-12-26 自然资源部地图技术审查中心 Map review method and device
CN117442190A (en) * 2023-12-21 2024-01-26 山东第一医科大学附属省立医院(山东省立医院) Automatic wound surface measurement method and system based on target detection
CN117496132A (en) * 2023-12-29 2024-02-02 数据空间研究院 Scale sensing detection method for small-scale target detection
US11991295B2 (en) 2021-12-07 2024-05-21 Here Global B.V. Method, apparatus, and computer program product for identifying an object of interest within an image from a digital signature generated by a signature encoding module including a hypernetwork
CN118229964A (en) * 2024-05-24 2024-06-21 厦门大学 Small target detection method based on full pipeline improvement
US12073615B2 (en) 2020-12-16 2024-08-27 Here Global B.V. Method, apparatus, and computer program product for identifying objects of interest within an image captured by a relocatable image capture device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782319B (en) * 2022-03-24 2024-08-23 什维新智医疗科技(上海)有限公司 Method for identifying scale for ultrasonic image

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10354159B2 (en) * 2016-09-06 2019-07-16 Carnegie Mellon University Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network
CN108009509A (en) * 2017-12-12 2018-05-08 河南工业大学 Vehicle target detection method

Non-Patent Citations (27)

* Cited by examiner, † Cited by third party
Title
A. ROZANTSEV; V. LEPETIT; P. FUA: "Detecting flying objects using a single moving camera", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 39, no. 5, 2017, pages 879 - 892, XP011644552, DOI: doi:10.1109/TPAMI.2016.2564408
BRAIS BOSQUET ET AL: "STDnet: A ConvNet for Small Target Detection", 3 September 2018 (2018-09-03) - 3 September 2018 (2018-09-03), XP055570891, Retrieved from the Internet <URL:http://bmvc2018.org/contents/papers/0897.pdf> [retrieved on 20190319] *
C. EGGERT; D. ZECHA; S. BREHM; R. LIENHART: "ACM on International Conference on Multimedia Retrieval", 2017, ACM, article "Improving small object proposals for company logo detection", pages: 167 - 174
C. FEICHTENHOFER; A. PINZ; A. ZISSERMAN: "Detect to track and track to detect", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2017, pages 3038 - 3046
CHU MENGDIE ET AL: "Rich Features and Precise Localization with Region Proposal Network for Object Detection", 20 October 2017, PROC. INT.CONF. ADV. BIOMETRICS (ICB); [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, ISBN: 978-3-642-17318-9, pages: 605 - 614, XP047451367 *
F. YANG; W. CHOI; Y. LIN: "Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2016, pages 2129 - 2137, XP033021392, DOI: doi:10.1109/CVPR.2016.234
J. DAI; Y. LI; K. HE; J. SUN: "R-fcn: Object detection via region-based fully convolutional networks", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS (NIPS, 2016, pages 379 - 387
J. HUANG; V. RATHOD; C. SUN; M. ZHU; A. KORATTIKARA; A. FATHI; I. FISCHER; Z. WOJNA; Y. SONG; S. GUADARRAMA ET AL.: "Speed/accuracy trade-offs for modern convolutional object detectors", IEEE COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2017
J. LI; X. LIANG; Y. WEI; T. XU; J. FENG; S. YAN: "Perceptual generative adversarial networks for small object detection", IEEE COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2017
J. REDMON; A. FARHADI: "Yolo9000: Better, faster, stronger", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2017, pages 6517 - 6525, XP033250016, DOI: doi:10.1109/CVPR.2017.690
K. HE; X. ZHANG; S. REN; J. SUN: "Deep residual learning for image recognition", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2016, pages 770 - 778, XP033021254, DOI: doi:10.1109/CVPR.2016.90
K. SIMONYAN; A. ZISSERMAN: "Very deep convolutional networks for large-scale image recognition", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2015
M. D. ZEILER; R. FERGUS: "European Conference on Computer Vision (ECCV", 2014, SPRINGER, article "Visualizing and understanding convolutional networks", pages: 818 - 833
M. EVERINGHAM; L. VAN GOOL; C. K. WILLIAMS; J. WINN; A. ZISSERMAN: "The pascal visual object classes (voc) challenge", INTERNATIONAL JOURNAL OF COMPUTER VISION, vol. 88, no. 2, 2010, pages 303 - 338, XP019796004
P. DOLLAR; C. WOJEK; B. SCHIELE; P. PERONA: "Pedestrian detection: An evaluation of the state of the art", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 34, no. 4, 2012, pages 743 - 761, XP011490656, DOI: doi:10.1109/TPAMI.2011.155
S. GIDARIS; N. KOMODAKIS: "Object detection via a multi-region and semantic segmentation-aware cnn model", IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV, 2015, pages 1134 - 1142, XP032866440, DOI: doi:10.1109/ICCV.2015.135
S. REN; K. HE; R. GIRSHICK; J. SUN: "Faster r-cnn: Towards real-time object detection with region proposal networks", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS (NIPS, 2015, pages 91 - 99
SHAOQING REN ET AL: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 39, no. 6, 1 June 2017 (2017-06-01), USA, pages 1137 - 1149, XP055560008, ISSN: 0162-8828, DOI: 10.1109/TPAMI.2016.2577031 *
T.-Y. LIN; M. MAIRE; S. BELONGIE; J. HAYS; P. PERONA; D. RAMANAN; P. DOLLAR; C. L. ZITNICK: "European Conference on Computer Vision (ECCV", 2014, SPRINGER, article "Microsoft coco: Common objects in context", pages: 740 - 755
T.-Y. LIN; P. DOLLAR; R. GIRSHICK; K. HE; B. HARIHARAN; S. BELONGIE: "Feature pyramid networks for object detection", IEEE COMPUTER VISION AND PATTERN RECOGNITION (CVPR, vol. 1, 2017, pages 4
V. NAIR; G. E. HINTON: "Rectified linear units improve restricted boltzmann machines", 27TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML, 2010, pages 807 - 814, XP055398393
W. LUO; Y. LI; R. URTASUN; R. ZEMEL: "Understanding the effective receptive field in deep convolutional neural networks", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS (NIPS, 2016, pages 4898 - 4906
Y. JIA; E. SHELHAMER; J. DONAHUE; S. KARAYEV; J. LONG; R. GIRSHICK; S. GUADARRAMA; T. DARRELL: "22nd ACM International Conference on Multimedia", 2014, ACM, article "Caffe: Convolutional architecture for fast feature embedding", pages: 675 - 678
Y. LECUN; B. BOSER; J. S. DENKER; D. HENDERSON; R. E. HOWARD; W. HUBBARD; L. D. JACKEL: "Backpropagation applied to handwritten zip code recognition", NEURAL COMPUTATION, vol. 1, no. 4, 1989, pages 541 - 551, XP000789854
YAN CHAO ET AL: "A new two-stage object detection network without RoI-Pooling", 2018 CHINESE CONTROL AND DECISION CONFERENCE (CCDC), IEEE, 9 June 2018 (2018-06-09), pages 1680 - 1685, XP033370487, DOI: 10.1109/CCDC.2018.8407398 *
Z. CAI; Q. FAN; R. S. FERIS; N. VASCONCELOS: "European Conference on Computer Vision (ECCV", 2016, SPRINGER, article "A unified multi-scale deep convolutional neural network for fast object detection", pages: 354 - 370
Z. ZHU; D. LIANG; S. ZHANG; X. HUANG; B. LI; S. HU: "Traffic-sign detection and classification in the wild", IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2016, pages 2110 - 2118, XP033021390, DOI: doi:10.1109/CVPR.2016.232

Cited By (77)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368653A (en) * 2020-02-19 2020-07-03 杭州电子科技大学 A low-altitude small target detection method based on R-D graph and deep neural network
CN111368653B (en) * 2020-02-19 2023-09-08 杭州电子科技大学 A low-altitude small target detection method based on R-D graph and deep neural network
CN111428602A (en) * 2020-03-18 2020-07-17 浙江科技学院 Binocular saliency image detection method based on edge-assisted enhancement of convolutional neural network
CN111401297A (en) * 2020-04-03 2020-07-10 天津理工大学 Triphibian robot target recognition system and method based on edge calculation and neural network
CN111415000B (en) * 2020-04-29 2024-03-22 Oppo广东移动通信有限公司 Convolutional neural network, data processing method and device based on convolutional neural network
EP3905116A1 (en) * 2020-04-29 2021-11-03 FotoNation Limited Image processing system for identifying and tracking objects
CN111415000A (en) * 2020-04-29 2020-07-14 Oppo广东移动通信有限公司 Convolutional neural network, and data processing method and device based on convolutional neural network
CN111597945B (en) * 2020-05-11 2023-08-18 济南博观智能科技有限公司 A target detection method, device, equipment and medium
CN111597945A (en) * 2020-05-11 2020-08-28 济南博观智能科技有限公司 Target detection method, device, equipment and medium
CN111611925A (en) * 2020-05-21 2020-09-01 重庆现代建筑产业发展研究院 Building detection and identification method and device
CN111626208A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Method and apparatus for detecting small targets
CN111626208B (en) * 2020-05-27 2023-06-13 阿波罗智联(北京)科技有限公司 Method and device for detecting small objects
CN111666850A (en) * 2020-05-28 2020-09-15 浙江工业大学 Cell image detection and segmentation method for generating candidate anchor frame based on clustering
CN111797769A (en) * 2020-07-06 2020-10-20 东北大学 A Small Target Sensitive Vehicle Detection System
CN111797769B (en) * 2020-07-06 2023-06-30 东北大学 A Vehicle Detection System Sensitive to Small Objects
KR102344004B1 (en) * 2020-07-09 2021-12-27 정영규 Deep learning based real-time small target detection device for cpu only embedded board
CN111916206B (en) * 2020-08-04 2023-12-08 重庆大学 A CT image-assisted diagnosis system based on cascade
CN111916206A (en) * 2020-08-04 2020-11-10 重庆大学 A cascade-based CT image-aided diagnosis system
CN112069907A (en) * 2020-08-11 2020-12-11 盛视科技股份有限公司 X-ray machine image recognition method, device and system based on example segmentation
CN111950488B (en) * 2020-08-18 2022-07-19 山西大学 Improved Faster-RCNN remote sensing image target detection method
CN111950488A (en) * 2020-08-18 2020-11-17 山西大学 An Improved Faster-RCNN Remote Sensing Image Object Detection Method
CN112036455B (en) * 2020-08-19 2023-09-01 浙江大华技术股份有限公司 Image identification method, intelligent terminal and storage medium
CN112036455A (en) * 2020-08-19 2020-12-04 浙江大华技术股份有限公司 Image identification method, intelligent terminal and storage medium
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN112115847B (en) * 2020-09-16 2024-05-17 深圳印像数据科技有限公司 Face emotion pleasure degree judging method
CN112115847A (en) * 2020-09-16 2020-12-22 深圳印像数据科技有限公司 Method for judging face emotion joyfulness
CN112364687A (en) * 2020-09-29 2021-02-12 上善智城(苏州)信息科技有限公司 Improved Faster R-CNN gas station electrostatic sign identification method and system
GB2614170A (en) * 2020-10-05 2023-06-28 Ibm Action-object recognition in cluttered video scenes using text
WO2022074483A1 (en) * 2020-10-05 2022-04-14 International Business Machines Corporation Action-object recognition in cluttered video scenes using text
US11928849B2 (en) 2020-10-05 2024-03-12 International Business Machines Corporation Action-object recognition in cluttered video scenes using text
GB2614170B (en) * 2020-10-05 2023-12-13 Ibm Action-object recognition in cluttered video scenes using text
CN112329861B (en) * 2020-11-06 2024-05-28 北京工业大学 A hierarchical feature fusion method for multi-target detection in mobile robots
CN112329861A (en) * 2020-11-06 2021-02-05 北京工业大学 Layered feature fusion method for multi-target detection of mobile robot
US12073615B2 (en) 2020-12-16 2024-08-27 Here Global B.V. Method, apparatus, and computer program product for identifying objects of interest within an image captured by a relocatable image capture device
US11900662B2 (en) 2020-12-16 2024-02-13 Here Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module to identify objects of interest within an image utilizing digital signatures
EP4016473A1 (en) * 2020-12-16 2022-06-22 HERE Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module to identify objects of interest within an image utilizing digital signatures
US11830103B2 (en) 2020-12-23 2023-11-28 Here Global B.V. Method, apparatus, and computer program product for training a signature encoding module and a query processing module using augmented data
US11829192B2 (en) 2020-12-23 2023-11-28 Here Global B.V. Method, apparatus, and computer program product for change detection based on digital signatures
US11587253B2 (en) 2020-12-23 2023-02-21 Here Global B.V. Method, apparatus, and computer program product for displaying virtual graphical data based on digital signatures
US12094163B2 (en) 2020-12-23 2024-09-17 Here Global B.V. Method, apparatus, and computer program product for displaying virtual graphical data based on digital signatures
CN112733691A (en) * 2021-01-04 2021-04-30 北京工业大学 Multi-direction unmanned aerial vehicle aerial photography vehicle detection method based on attention mechanism
CN113012220A (en) * 2021-02-02 2021-06-22 深圳市识农智能科技有限公司 Fruit counting method and device and electronic equipment
CN112966579A (en) * 2021-02-24 2021-06-15 湖南三湘绿谷生态科技有限公司 Large-area camellia oleifera forest rapid yield estimation method based on unmanned aerial vehicle remote sensing
CN112949499A (en) * 2021-03-04 2021-06-11 北京联合大学 Improved MTCNN face detection method based on ShuffleNet
CN113011561A (en) * 2021-03-04 2021-06-22 中国人民大学 Method for processing data based on logarithm polar space convolution
CN113011561B (en) * 2021-03-04 2023-06-20 中国人民大学 A Method of Data Processing Based on Logarithmic Polar Space Convolution
CN114677611A (en) * 2021-03-22 2022-06-28 腾讯云计算(北京)有限责任公司 Data identification method, storage medium and device
CN113139540A (en) * 2021-04-02 2021-07-20 北京邮电大学 Backboard detection method and equipment
WO2022213307A1 (en) * 2021-04-07 2022-10-13 Nokia Shanghai Bell Co., Ltd. Adaptive convolutional neural network for object detection
US11423252B1 (en) 2021-04-29 2022-08-23 International Business Machines Corporation Object dataset creation or modification using labeled action-object videos
US20220391615A1 (en) * 2021-06-01 2022-12-08 Hummingbird Technologies Limited Tool for counting and sizing plants in a field
CN113469272B (en) * 2021-07-20 2023-05-19 东北财经大学 Object detection method for hotel scene pictures based on Faster R-CNN-FFS model
CN113705387A (en) * 2021-08-13 2021-11-26 国网江苏省电力有限公司电力科学研究院 Method for detecting and tracking interferent for removing foreign matters on overhead line by laser
CN113705387B (en) * 2021-08-13 2023-11-17 国网江苏省电力有限公司电力科学研究院 An interference detection and tracking method for laser removal of foreign objects on overhead lines
CN113780147A (en) * 2021-09-06 2021-12-10 西安电子科技大学 A lightweight dynamic fusion convolutional net-based hyperspectral object classification method and system
CN113963265A (en) * 2021-09-13 2022-01-21 北京理工雷科电子信息技术有限公司 A fast detection and recognition method of small samples and small targets in complex remote sensing terrestrial environment
CN114120056A (en) * 2021-10-29 2022-03-01 中国农业大学 Small target identification method, small target identification device, electronic equipment, medium and product
US11991295B2 (en) 2021-12-07 2024-05-21 Here Global B.V. Method, apparatus, and computer program product for identifying an object of interest within an image from a digital signature generated by a signature encoding module including a hypernetwork
CN114611685A (en) * 2022-03-08 2022-06-10 安谋科技(中国)有限公司 Feature processing method, medium, device, and program product in neural network model
CN114596450A (en) * 2022-03-17 2022-06-07 四川邦辰信息科技有限公司 Image inclusion detection method and device
CN114723733A (en) * 2022-04-26 2022-07-08 湖北工业大学 A Class Activation Mapping Method and Device Based on Axiom Interpretation
CN114627437A (en) * 2022-05-16 2022-06-14 科大天工智能装备技术(天津)有限公司 A kind of traffic target recognition method and system
CN114627437B (en) * 2022-05-16 2022-08-05 科大天工智能装备技术(天津)有限公司 Traffic target identification method and system
CN115203449A (en) * 2022-07-15 2022-10-18 中国人民解放军国防科技大学 Data processing method and device
CN115346170A (en) * 2022-08-11 2022-11-15 北京市燃气集团有限责任公司 Intelligent monitoring method and device for gas facility area
CN115908874A (en) * 2022-11-29 2023-04-04 华中光电技术研究所(中国船舶集团有限公司第七一七研究所) A Siamese Network Based Target Tracking Model De-redundancy Method
CN115984846A (en) * 2023-02-06 2023-04-18 山东省人工智能研究院 An intelligent recognition method for small targets in high-resolution images based on deep learning
CN115984846B (en) * 2023-02-06 2023-10-10 山东省人工智能研究院 An intelligent recognition method of small targets in high-resolution images based on deep learning
CN116824386A (en) * 2023-03-22 2023-09-29 齐鲁工业大学(山东省科学院) Method and system for detecting rotating targets in aerial remote sensing images
CN116597331A (en) * 2023-06-01 2023-08-15 北京联合大学 A Lightweight Object Detection Method for UAV Aerial Images
CN117132856A (en) * 2023-07-31 2023-11-28 南京信息工程大学 A small target detection method using asymmetric modulation fusion features
CN117292394A (en) * 2023-09-27 2023-12-26 自然资源部地图技术审查中心 Map review method and device
CN117292394B (en) * 2023-09-27 2024-04-30 自然资源部地图技术审查中心 Map auditing method and device
CN117442190B (en) * 2023-12-21 2024-04-02 山东第一医科大学附属省立医院(山东省立医院) An automatic wound measurement method and system based on target detection
CN117442190A (en) * 2023-12-21 2024-01-26 山东第一医科大学附属省立医院(山东省立医院) Automatic wound surface measurement method and system based on target detection
CN117496132A (en) * 2023-12-29 2024-02-02 数据空间研究院 Scale sensing detection method for small-scale target detection
CN118229964A (en) * 2024-05-24 2024-06-21 厦门大学 Small target detection method based on full pipeline improvement

Also Published As

Publication number Publication date
ES2908944R1 (en) 2022-05-13
ES2908944A2 (en) 2022-05-04
ES2908944B2 (en) 2023-01-09

Similar Documents

Publication Publication Date Title
WO2020020472A1 (en) A computer-implemented method and system for detecting small objects on an image using convolutional neural networks
Sudha et al. RETRACTED ARTICLE: An intelligent multiple vehicle detection and tracking using modified vibe algorithm and deep learning algorithm
Bosquet et al. STDnet: A ConvNet for Small Target Detection.
Mujtaba et al. UAV-Based road traffic monitoring via FCN segmentation and deepsort for smart cities
T'Jampens et al. Automatic detection, tracking and counting of birds in marine video content
Jeyabharathi et al. Vehicle tracking and speed measurement system (VTSM) based on novel feature descriptor: diagonal hexadecimal pattern (DHP)
Yusuf et al. Target detection and classification via EfficientDet and CNN over unmanned aerial vehicles
Dorbe et al. FCN and LSTM based computer vision system for recognition of vehicle type, license plate number, and registration country
Delleji et al. An improved YOLOv5 for real-time mini-UAV detection in no fly zones.
Mujtaba et al. Remote Sensing-based Vehicle Monitoring System using YOLOv10 and CrossViT
Ajith et al. Hybrid deep learning for object detection in drone imagery: A new metaheuristic based model
Yass et al. A comprehensive review of deep learning and machine learning techniques for real-time car detection and wrong-way vehicle tracking
Mudavath et al. Object detection challenges: Navigating through varied weather conditions—A comprehensive survey
Şengül et al. Detection of Military Aircraft Using YOLO and Transformer-Based Object Detection Models in Complex Environments
CN109493371A (en) A kind of quadrotor drone pedestrian tracting method of view-based access control model
Mukhopadhyay et al. Performance comparison of different cnn models for indian road dataset
Farhadmanesh et al. Implementing Haar cascade classifiers for automated rapid detection of light aircraft at local airports
Pramanik et al. Real-time detection of traffic anomalies near roundabouts
Talbi et al. An overview on computer vision analysis in the airport applications
Forczmański et al. Deep learning approach to detection of preceding vehicle in advanced driver assistance
Prito et al. Image processing and deep learning based road object detection system for safe transportation
Xing et al. Drone surveillance using detection, tracking and classification techniques
Dilawari et al. Toward generating human-centered video annotations
Cao et al. Visual attention accelerated vehicle detection in low-altitude airborne video of urban environment
Mittal et al. A feature pyramid based multi-stage framework for object detection in low-altitude UAV images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18769052

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18769052

Country of ref document: EP

Kind code of ref document: A1