Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and not restrictive of it. It should also be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the present method for detecting small targets or apparatus for detecting small targets may be applied.
As shown in fig. 1, the system architecture 100 may include a vehicle 101 and a traffic sign 102.
The vehicle 101 may be a regular vehicle or an unmanned vehicle. A controller 1011, a network 1012, and a sensor 1013 may be installed in the vehicle 101. The network 1012 provides a medium for communication links between the controller 1011 and the sensor 1013, and may include various types of connections, such as wired links, wireless communication links, or fiber optic cables.
The controller (also referred to as an on-board brain) 1011 is responsible for intelligent control of the vehicle 101. The controller 1011 may be a separately installed controller, such as a programmable logic controller (PLC), a single-chip microcomputer, or an industrial controller; a device composed of other electronic components that have input/output ports and an operation control function; or a computer device installed with a vehicle driving control application. The controller is provided with a trained segmentation network and a trained detection model.
The sensor 1013 may be various types of sensors, such as a camera, a gravity sensor, a wheel speed sensor, a temperature sensor, a humidity sensor, a laser radar, a millimeter wave radar, and the like. In some cases, the vehicle 101 may also include GNSS (Global Navigation Satellite System) equipment, SINS (Strap-down Inertial Navigation System), and the like.
The vehicle 101 captures images of the traffic sign 102 while traveling. Whether the images are captured from a long distance or a short distance, the traffic signs in them are small targets.
The vehicle 101 sends the captured original image including the traffic sign to the controller, which recognizes it and determines the position of the traffic sign. OCR can also be performed to recognize the content of the traffic sign, and the content can then be output in the form of voice or text.
It should be noted that the method for detecting a small target provided in the embodiment of the present application is generally performed by the controller 1011, and accordingly, the apparatus for detecting a small target is generally disposed in the controller 1011.
It should be understood that the number of controllers, networks, and sensors in fig. 1 is merely illustrative. There may be any number of controllers, networks, and sensors, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for detecting small targets according to the present disclosure is shown. The method for detecting the small target comprises the following steps:
step 201, an original image including a small target is acquired.
In the present embodiment, an execution subject of the method for detecting a small target (e.g., the controller shown in fig. 1) may collect a forward-view image via a vehicle-mounted camera; the collected original image includes the small target. A small target refers to an image of a target object whose length and width are both fewer than a predetermined number of pixels (e.g., 20).
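The definition above can be stated as a one-line predicate (the 20-pixel value is the example threshold given in the text):

```python
def is_small_target(width_px: int, height_px: int, thresh: int = 20) -> bool:
    """Per the definition above: a target counts as small when both its
    length and width are below the predetermined pixel value."""
    return width_px < thresh and height_px < thresh
```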
Step 202, the original image is reduced to a low resolution image.
In this embodiment, the original image may be reduced by a factor of 4 (or another factor) in each of the length and width directions to obtain a low-resolution image. The aspect ratio is kept unchanged during the reduction.
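A minimal sketch of this reduction, using block averaging over a grayscale array (a real pipeline would use an image library's resize with proper filtering):

```python
import numpy as np

def downscale(image: np.ndarray, factor: int = 4) -> np.ndarray:
    """Reduce an H x W image by an integer factor in each of the length
    and width directions via block averaging, preserving the aspect
    ratio as described above."""
    h, w = image.shape
    h2, w2 = h - h % factor, w - w % factor          # drop ragged edges
    blocks = image[:h2, :w2].reshape(h2 // factor, factor,
                                     w2 // factor, factor)
    return blocks.mean(axis=(1, 3))
```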
Step 203, a candidate region including a small target is identified from the low-resolution image by adopting a lightweight segmentation network.
In this embodiment, since the first stage of detection only needs to locate the approximate positions where targets may exist, and does not need accurate bounding boxes, it is implemented with a lightweight segmentation network: any point in the final output heatmap that exceeds a certain threshold is regarded as a point suspected of containing a target. A U-Net-like segmentation network can be used, with ShuffleNet as the backbone for light weight.
When making training samples for the segmentation network, pixels inside the rectangular boxes originally used for the detection task are set as positive samples, and pixels outside the boxes as negative samples. Because of the scaling in the length and width directions, to ensure recall on small targets, the rectangular box of any target whose length and width are both smaller than a preset value (e.g., 20 pixels) is doubled in size when the training samples are made, and all pixels inside the expanded box are then set as positive samples.
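The labeling rule above can be sketched as follows (box format and clipping behavior are assumptions for illustration):

```python
import numpy as np

def make_seg_mask(shape, boxes, small_thresh=20):
    """Rasterise detection boxes into a segmentation label: pixels inside
    each box are positive (1), everything else negative (0). Boxes whose
    length and width are both below `small_thresh` pixels are doubled in
    size about their centre before rasterisation, as described above."""
    mask = np.zeros(shape, dtype=np.uint8)
    h, w = shape
    for x0, y0, x1, y1 in boxes:
        bw, bh = x1 - x0, y1 - y0
        if bw < small_thresh and bh < small_thresh:
            cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
            x0, x1 = cx - bw, cx + bw        # doubled width
            y0, y1 = cy - bh, cy + bh        # doubled height
        xa, ya = max(0, int(x0)), max(0, int(y0))
        xb, yb = min(w, int(round(x1))), min(h, int(round(y1)))
        mask[ya:yb, xa:xb] = 1
    return mask
```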
And 204, taking the area of the original image corresponding to the candidate area as an interest area, running a pre-trained detection model on the interest area, and determining the position of the small target in the original image.
In this embodiment, after noise points are filtered out of the segmentation network's output, the minimum circumscribed rectangle surrounding all remaining suspected target points is formed, and the region corresponding to this rectangle in the unscaled high-resolution image is taken as the interest region. The detection model is then run on the interest region, so only a portion of the high-resolution picture needs to be processed, which reduces the amount of computation.
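The hand-off from the first stage to the second can be sketched as below (the threshold and scale values are illustrative; noise filtering is assumed to have already happened):

```python
import numpy as np

def roi_from_heatmap(heatmap, thresh=0.5, scale=4):
    """Threshold the segmentation heatmap, take the minimum circumscribed
    rectangle of all remaining suspected points, and map it back to
    original-image coordinates (the heatmap came from an image `scale`
    times smaller in each direction)."""
    ys, xs = np.nonzero(heatmap > thresh)
    if len(ys) == 0:
        return None                  # nothing suspected: skip stage two
    x0, x1 = int(xs.min()), int(xs.max()) + 1
    y0, y1 = int(ys.min()), int(ys.max()) + 1
    return (x0 * scale, y0 * scale, x1 * scale, y1 * scale)
```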
As described above, to detect small targets well, a picture needs to maintain a high resolution, but a large picture causes a multiplied increase in the amount of computation, making real-time processing in an in-vehicle environment difficult. On the other hand, the traffic sign occupies a small proportion of the picture; most of the picture is background, and computation on the background accounts for a large share of the total, so processing the background at high resolution is time-consuming and pointless. Therefore, the invention adopts a two-stage detection scheme: first, the approximate positions of suspected targets are located on the low-resolution picture by a lightweight segmentation network; then the minimum circumscribed rectangle containing all suspected targets is solved; finally, the detection model is run on the high-resolution image block corresponding to that rectangle, reducing the amount of computation while maintaining the detection rate for small targets.
After the above two-stage processing, the average computation of the detection model is reduced to about 25% of the original, and the combined average computation of the two models is about 45% of the original.
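The quoted figures imply a cost for the segmenter itself; the following bookkeeping is an inference from the stated percentages, not measured values:

```python
def total_relative_cost(detector_on_roi=0.25, segmenter=0.20):
    """Illustrative accounting with the detector on the full
    high-resolution image as the baseline of 1.0. The quoted ~45% total
    implies the lightweight segmenter accounts for roughly
    45% - 25% = 20% of the baseline."""
    return detector_on_roi + segmenter
```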
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for detecting a small target according to the present embodiment. In this scenario, the vehicle acquires forward-view images in real time while driving. The acquired original image is reduced by a factor of 4 in both length and width to obtain a low-resolution image. The low-resolution image is input into the lightweight segmentation network to identify a candidate region including the traffic sign. The region of the original image corresponding to the candidate region is then found and used as the interest region. The image of the interest region is extracted and input into the pre-trained detection model to determine the specific position of the traffic sign in the original image, as shown by the dashed-line box.
According to the method provided by this embodiment of the disclosure, the two-stage detection reduces the amount of computation and improves recognition speed and accuracy.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for detecting small targets is shown. The process 400 of the method for detecting small objects includes the steps of:
step 401, determining a network structure of the initial detection model and initializing a network parameter of the initial detection model.
In this embodiment, an electronic device on which the method for detecting small targets operates (e.g., the controller shown in fig. 1) may train the detection model. The detection model may also be trained by a third-party server and then installed into the controller of the vehicle. The detection model is a neural network model, and may be any existing neural network for target detection.
In some optional implementations of the present embodiment, the detection model is a deep neural network, e.g., a YOLO-series network. YOLO (You Only Look Once) is a target recognition and localization algorithm based on a deep neural network; its most notable characteristic is fast operation, which makes it usable in real-time systems. YOLO has been developed to version 3 (YOLOv3), with successive versions bringing continued improvements over the original. In the original structural design of YOLOv3, low-resolution feature maps are fused with high-resolution feature maps by upsampling. However, such fusion occurs only on the high-resolution feature maps, so features of different scales cannot be sufficiently fused.
To better fuse the features of different layers, the invention first selects the 8x-, 16x-, and 32x-downsampled features in the backbone network as basic features. Then, in order to predict targets of different sizes, the sizes of the prediction feature maps are set to the 8x-, 16x-, and 32x-downsampled sizes of the picture, respectively. The features of each prediction feature map come from all 3 basic feature layers, which are unified to the same size by downsampling or upsampling and then fused. Taking the 16x-downsampled prediction layer as an example: its features come from the 3 basic feature layers; to unify them to the same size, the 8x-downsampled basic feature layer is downsampled once, the 32x-downsampled basic feature layer is upsampled once, and the two are then fused with the 16x-downsampled basic feature layer.
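The 16x example above can be sketched in a few lines; a real network would use strided convolutions / deconvolutions and learned projections rather than the raw subsampling and nearest-neighbour upsampling shown here:

```python
import numpy as np

def fuse_to_stride16(f8, f16, f32):
    """Bring the 8x-, 16x-, and 32x-downsampled basic features (C, H, W)
    to the 16x scale and fuse them by summation: stride-2 subsampling
    for the 8x map, nearest-neighbour 2x upsampling for the 32x map."""
    f8_down = f8[:, ::2, ::2]                         # 8x  -> 16x
    f32_up = f32.repeat(2, axis=1).repeat(2, axis=2)  # 32x -> 16x
    return f8_down + f16 + f32_up
```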
If features of different scales were simply fused, the weights of the features would be the same in all 3 prediction layers, and the features could not be used with different emphasis according to the different prediction targets. Therefore, after the feature fusion of each prediction layer, an attention module is introduced to learn an appropriate weight for the features of different channels, so that each prediction layer can emphasize the fused features according to the characteristics of the targets it needs to predict. The network structure is shown in fig. 5. The method of learning the parameters of the attention module is known in the art and is therefore not described in detail here.
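The text does not fix a particular attention design, so the following is an assumed sketch using a common squeeze-and-excitation-style channel attention:

```python
import numpy as np

def channel_attention(features, w1, w2):
    """Channel attention over a (C, H, W) feature map: global-average-pool
    per channel, two small fully connected layers, then sigmoid gating,
    so each channel is reweighted by a learned scalar."""
    squeeze = features.mean(axis=(1, 2))              # (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)            # FC + ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # FC + sigmoid
    return features * weights[:, None, None]          # reweight channels
```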
This method can adopt YOLOv3 as the detection network. In anchor-based detection methods, the design and assignment of anchors are very important; the number of anchors that can be matched with small targets is small, so the model does not learn enough from small targets and cannot detect them well. Therefore, a dynamic anchor matching mechanism is adopted: the IoU (Intersection over Union) threshold used when matching an anchor with a ground truth box is adaptively selected according to the size of the ground truth. When the target is small, the IoU threshold is lowered, so that more small targets participate in training and the model's performance on small-target detection improves. When making a training sample, the size of the target is known, and an appropriate IoU threshold is then selected based on that size.
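One way to realize the size-adaptive threshold is a simple interpolation; the exact mapping and the constants below are assumptions for illustration, not values stated in the text:

```python
def match_iou_threshold(gt_w, gt_h, base=0.5, small=20, floor=0.3):
    """Size-adaptive IoU threshold for anchor-to-ground-truth matching:
    targets at or above `small` pixels use the standard `base` threshold,
    while smaller targets get a lower threshold (down to `floor`) so
    that more anchors can match them during training."""
    size = max(gt_w, gt_h)
    if size >= small:
        return base
    # interpolate from `floor` for a vanishing target up to `base`
    return floor + (base - floor) * size / small
```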
Step 402, a training sample set is obtained.
In this embodiment, the training sample includes a sample image and annotation information for characterizing the position of the small target in the sample image.
In step 403, the training samples are enhanced by at least one of the following methods: copying, multi-scale transformation, and editing.
In the present embodiment, these are strategies against the insufficient number of small targets in the training data. On the one hand, pictures containing small targets are copied within the data sets, directly increasing the number of small targets in the data. On the other hand, small targets in a picture are cropped out, subjected to operations such as scaling and rotation, and then pasted randomly at other positions in the picture; this not only increases the number of small targets but also introduces more variation, enriching the distribution of the training data.
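The crop-transform-paste augmentation can be sketched as below (a horizontal flip stands in for the scaling/rotation mentioned above, and overlap handling is left out for brevity):

```python
import numpy as np

def copy_paste(image, box, rng):
    """Copy-paste augmentation for small targets: crop the target patch
    given by `box` (x0, y0, x1, y1), apply a simple transform, and paste
    it at a random position. Returns the new image and the box of the
    pasted copy."""
    x0, y0, x1, y1 = box
    patch = image[y0:y1, x0:x1].copy()
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                        # horizontal flip
    h, w = image.shape[:2]
    ph, pw = patch.shape[:2]
    nx = int(rng.integers(0, w - pw + 1))
    ny = int(rng.integers(0, h - ph + 1))
    out = image.copy()
    out[ny:ny + ph, nx:nx + pw] = patch
    return out, (nx, ny, nx + pw, ny + ph)
```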
Optionally, the training pictures are scaled to different sizes for training, which enriches the target scale variation in the original data set and helps the model adapt to detecting targets of different scales.
And step 404, respectively taking the sample images and the labeling information in the training samples in the enhanced training sample set as the input and the expected output of the initial detection model, and training the initial detection model by using a machine learning method.
In this embodiment, the execution subject may input a sample image from the training sample set into the initial detection model to obtain position information of the small target in the sample image, and train the initial detection model by a machine learning method with the annotation information in the training sample as the expected output of the initial detection model. Specifically, the difference between the obtained position information and the annotation information in the training sample may first be calculated with a preset loss function; for example, the L2 norm may be used as the loss function. Then, the network parameters of the initial detection model may be adjusted based on the calculated difference, and training ends when a preset end condition is satisfied. The preset end condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the calculated difference is less than a preset difference threshold.
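The loop described above can be illustrated in miniature; a one-parameter linear model stands in for the detection network, and the learning rate and stopping constants are illustrative:

```python
import numpy as np

def train(xs, ys, lr=0.1, max_steps=500, tol=1e-10):
    """Predict, measure an L2-style loss against the labels, adjust the
    parameter by gradient descent, and stop on a step budget or a loss
    threshold, mirroring the end conditions described above."""
    w, loss = 0.0, float("inf")
    for _ in range(max_steps):
        pred = w * xs
        loss = float(np.mean((pred - ys) ** 2))        # L2-style loss
        if loss < tol:                                 # end condition
            break
        grad = float(np.mean(2.0 * (pred - ys) * xs))  # dLoss/dw
        w -= lr * grad                                 # parameter update
    return w, loss
```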
Here, various implementations may be employed to adjust the network parameters of the initial detection model based on the difference between the generated position information and the annotation information in the training sample. For example, a BP (Back Propagation) algorithm or an SGD (Stochastic Gradient Descent) algorithm may be used to adjust the network parameters of the initial detection model.
And step 405, determining the initial detection model obtained by training as a detection model trained in advance.
In this embodiment, the execution subject of the training step may determine the initial detection model trained in step 404 as the pre-trained detection model.
With further reference to fig. 6, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for detecting a small target, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 6, the apparatus 600 for detecting a small target of the present embodiment includes: an acquisition unit 601, a reduction unit 602, a first detection unit 603, and a second detection unit 604. Wherein, the acquiring unit 601 is configured to acquire an original image including a small target; a reduction unit 602 configured to reduce an original image into a low resolution image; a first detection unit 603 configured to identify a candidate region including a small target from the low resolution image using a lightweight segmentation network; and a second detection unit 604, configured to use the region of the original image corresponding to the candidate region as an interest region, run a pre-trained detection model on the interest region, and determine a position of the small target in the original image.
In this embodiment, the specific processing of the acquisition unit 601, the reduction unit 602, the first detection unit 603, and the second detection unit 604 of the apparatus 600 for detecting a small target may refer to step 201, step 202, step 203, and step 204 in the corresponding embodiment of fig. 2.
In some optional implementations of this embodiment, the apparatus 600 further comprises a training unit (not shown in the drawings) configured to: determine a network structure of an initial detection model and initialize its network parameters; acquire a training sample set, where a training sample comprises a sample image and annotation information used for characterizing the position of a small target in the sample image; enhance the training samples by at least one of the following methods: copying, multi-scale transformation, and editing; take the sample images and the annotation information in the training samples of the enhanced training sample set as the input and the expected output of the initial detection model, respectively, and train the initial detection model by a machine learning method; and determine the trained initial detection model as the pre-trained detection model.
In some optional implementations of this embodiment, the training unit is further configured to: crop small targets out of the sample image; and perform scaling and/or rotation on each small target and then paste it randomly at another position in the sample image to obtain a new sample image.
In some optional implementations of this embodiment, the first detection unit is further configured to: when making training samples for the segmentation network, set pixels inside the rectangular boxes originally used for the detection task as positive samples and pixels outside the boxes as negative samples; expand the rectangular box of any small target whose length and width are smaller than a preset number of pixels; and set the pixels inside the expanded rectangular box as positive samples.
In some optional implementations of the present embodiment, the detection model is a deep neural network.
In some optional implementations of this embodiment, an attention module is introduced after each prediction layer feature fusion to learn an appropriate weight for the features of different channels.
Referring now to fig. 7, a schematic diagram of an electronic device 700 (e.g., the controller of fig. 1) suitable for implementing embodiments of the present disclosure is shown. The controller shown in fig. 7 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring an original image including a small target; reducing an original image into a low-resolution image; identifying a candidate region comprising a small target from the low-resolution image by adopting a lightweight segmentation network; and taking the area of the original image corresponding to the candidate area as an interest area, running a pre-trained detection model on the interest area, and determining the position of the small target in the original image.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprising an acquisition unit, a reduction unit, a first detection unit, and a second detection unit. The names of these units do not in some cases limit the units themselves; for example, the acquisition unit may also be described as "a unit that acquires an original image including a small target".
The foregoing description is only a preferred embodiment of the disclosure and an illustration of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to technical solutions formed by the specific combination of the above features, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, a technical solution formed by replacing the above features with (but not limited to) features with similar functions disclosed in this disclosure.