
CN112801027B - Vehicle target detection method based on event camera - Google Patents

Vehicle target detection method based on event camera

Info

Publication number
CN112801027B
CN112801027B CN202110182127.XA
Authority
CN
China
Prior art keywords
dvs
aps
image
event
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110182127.XA
Other languages
Chinese (zh)
Other versions
CN112801027A (en)
Inventor
孙艳丰
刘萌允
齐娜
施云惠
尹宝才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110182127.XA priority Critical patent/CN112801027B/en
Publication of CN112801027A publication Critical patent/CN112801027A/en
Application granted granted Critical
Publication of CN112801027B publication Critical patent/CN112801027B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vehicle target detection method based on an event camera, applying deep learning technology to detection in extreme scenes. The event camera generates frames and event data asynchronously, which helps overcome motion blur and extreme lighting conditions. First, events are converted into event images; then the frame image and the event image are fed simultaneously into a fusion convolutional neural network, with added convolutional layers that extract features from the event image. A fusion module merges the features of the two streams in the middle layers of the network, and a redesigned loss function improves the effectiveness of vehicle target detection. The method compensates for the shortcoming of using only frame images for target detection in extreme scenes: by fusing event images with frame images in the fusion convolutional neural network, the effect of vehicle target detection in extreme scenes is enhanced.

Description

Vehicle target detection method based on event camera
Technical Field
The invention discloses a vehicle target detection method for extreme scenes, based on an event camera and deep learning technology. It belongs to the field of computer vision and particularly relates to deep learning, target detection, and related technologies.
Background
With the rapid development of the automotive industry, autonomous-driving technology has received extensive attention in recent years from both academia and industry. Vehicle target detection is a challenging task in autonomous driving and an important application in autonomous-vehicle technology and intelligent traffic systems, where it plays a key role. The purpose of vehicle target detection is to accurately locate the other vehicles in the surrounding environment, so that collisions with them are avoided.
A great deal of current target detection research uses deep neural networks to strengthen target detection systems. These studies largely use frame-based cameras known as Active Pixel Sensors (APS). Consequently, most detected objects are stationary or slowly moving, and lighting conditions are suitable. In practice, vehicles encounter a variety of complex and extreme scenarios. Under extreme illumination and motion blur, the image produced by a conventional frame-based camera may be overexposed or blurred, which poses a significant challenge for object detection.
Dynamic Vision Sensors (DVS) have the key features of high dynamic range and low latency. These features enable them to capture environmental information and generate images faster than standard cameras. At the same time, they are unaffected by motion blur, which complements frame cameras in extreme cases. Furthermore, their low latency and short response time can make an autonomous vehicle more responsive. The Dynamic and Active Pixel Vision Sensor (DAVIS) can output regular grayscale frames and asynchronous events through its APS and DVS channels, respectively. Regular grayscale frames provide the primary information for object detection, while asynchronous events provide information about rapid motion and illumination changes. Detection performance can therefore be improved by combining the two kinds of data.
In recent years, deep learning algorithms have achieved great success and have been widely used in image classification and target detection. Deep neural networks have excellent feature extraction capability and strong learning capability, and can identify target categories and locate target positions in recognition tasks. Convolutional Neural Networks (CNNs) based on boundary regression can regress directly from the input image to obtain the location and class of the target without searching for candidate regions. However, this requires that the objects to be discriminated in the image fed into the CNN are sharp, whereas objects in images generated in extreme scenes may be blurred. Using a CNN alone on frame images generated in extreme scenes therefore cannot meet the demand.
A CNN-based vehicle detection method is presented herein that fuses the two kinds of data, frames and events, output by a DAVIS camera. First, the event data are reconstructed into an image; then the frame image and the event image are fed simultaneously into a convolutional neural network, and the features extracted from the event image are fused with those extracted from the frame image in the network's middle layers through a fusion module. At the final detection layer, the loss function of the network is redesigned, adding a loss term for the DVS features. The experiments use a self-built vehicle target detection dataset (Dataset of APS and DVS, DAD). Comparison of different input modes shows that vehicle detection results are significantly improved under different environmental conditions. The proposed method also performs remarkably well compared with other approaches, including networks that input a single image and networks that input both kinds of data simultaneously.
Disclosure of Invention
The invention provides a vehicle target detection method based on an event camera, utilizing deep learning technology. Since an ordinary camera produces motion blur, overexposure, or underexposure in fast-moving and extreme-brightness scenes, the event data generated by the event camera are used to enhance the detection effect. The event camera asynchronously outputs events in response to intensity changes; each event comprises pixel coordinates, an intensity polarity, and a timestamp, so the events are first converted into images. This is done because image-based object detection techniques are now well established, and converting events to images makes detection on events feasible. Then the frame image (APS) and the event image (DVS) are fed simultaneously into a fusion convolutional network framework (ADF) for convolution operations; feature extraction and feature fusion are performed inside this framework. In this way the features of each image are extracted, while the finally extracted features carry effective information from both. Finally, the loss function of the model is modified: a DVS loss term is added to the original APS-only loss. The overall framework of the method is shown in FIG. 1 and can be divided into the following four steps: converting event data into event images, extracting features through the overall framework of the fusion convolutional neural network, fusing features through the fusion module, and performing target detection on the extracted features through the detection layer.
(1) Converting event data into event images
Considering that target detection algorithms for images are relatively mature, the event data of the DVS channel are converted into an image and then fed into the network together with the APS image for target detection. Each event consists of the pixel abscissa x, the pixel ordinate y, the brightness polarity (+1 for an increase, -1 for a decrease), and a timestamp. According to the coordinates and polarity of the pixels, the event data accumulated over a time interval are converted into an event image of the same size as the frame image.
(2) Integral frame for feature extraction
The invention uses Darknet-53 as the basic framework and adds convolutional layers for extracting features from DVS images alongside the convolution operations on APS images. Because the data of the DVS channel are sparse, features are extracted with fewer convolutional layers at each resolution. Following Darknet-53, the DVS channel still employs successive 3×3 and 1×1 convolutional layers. The specific numbers of convolutional layers are shown in Table 1.
(3) Fusion module
In the network architecture, a fusion module is designed with reference to ResNet. After extracting the main features of the DVS at each resolution, the fusion module fuses them with the APS features of the same size, guiding the network to learn more detailed features of the APS and the DVS simultaneously. The fusion module is shown in FIG. 2.
(4) Performing object detection on the extracted features through a detection layer
The loss function of the network is modified at the detection layer. The loss function for the APS features is a cross-entropy loss, including losses on coordinates, class, and confidence; a cross-entropy loss is likewise computed for the DVS features. Finally, the detection results of the APS and the DVS are combined. A detection produced by APS or DVS alone may still be correct, so taking only the intersection of the two would discard many correct detections. Combining (taking the union of) the two results reduces errors and improves accuracy.
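The union-based combination of APS and DVS detections described above can be sketched as follows. This is an illustrative sketch only: the box format (x1, y1, x2, y2, score), the helper names, and the IoU deduplication threshold are assumptions, not details given in the patent.

```python
# Hedged sketch: union of APS and DVS detection results, keeping every APS
# box and adding only those DVS boxes that do not duplicate an APS box.
# Box format (x1, y1, x2, y2, score) and iou_thresh are assumptions.

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def merge_detections(aps_boxes, dvs_boxes, iou_thresh=0.5):
    """Take the union: keep all APS boxes, add non-overlapping DVS boxes."""
    merged = list(aps_boxes)
    for d in dvs_boxes:
        if all(iou(d, a) < iou_thresh for a in merged):
            merged.append(d)
    return merged

aps = [(10, 10, 50, 50, 0.9)]   # stationary car, seen by the APS channel
dvs = [(12, 11, 52, 49, 0.8),   # same car, seen by DVS (deduplicated)
       (100, 100, 140, 140, 0.7)]  # fast-moving car, seen only by DVS
result = merge_detections(aps, dvs)
```

An intersection would have kept only the first car; the union keeps the fast mover as well, which is the behavior the text argues for.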
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
Based on the APS images and DVS data generated by the event camera, the invention adopts convolutional neural network technology to detect vehicle targets in extreme scenes. First, compared with using only conventional APS images, the event data are converted into event images so that mature deep learning techniques for images can be applied. A fusion module is then added to the convolutional neural network to fuse the two kinds of information at the feature level. Finally, by revising the loss function, the network's ability to identify targets is improved when the image suffers from problems such as target blur and unsuitable illumination, achieving good results in extreme scenes.
Drawings
FIG. 1 is a diagram of an overall network architecture;
FIG. 2 is a schematic diagram of a fusion module;
FIG. 3 is an experimental effect diagram;
Detailed Description
In light of the foregoing, the following is a specific implementation, but the scope of protection of this patent is not limited to this implementation.
Step 1: converting event data into event images
Based on the generation mechanism of events, there are three reconstruction methods to convert events into a frame: the fixed event number method, the leaky integrator method, and the fixed time interval method. In the present invention the goal is to detect fast-moving objects, so the fixed time interval method is used, setting event reconstruction to a fixed frame length of 10 ms. In each time interval, at the pixel position where an event occurs, an event with increased polarity is drawn as a white pixel and an event with decreased polarity as a black pixel, while the background of the image is gray. Finally, an event image of the same size as the APS image is generated.
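The fixed-time-interval reconstruction above can be sketched in a few lines. This is a minimal sketch under assumptions: the event tuple layout (x, y, polarity, timestamp in microseconds), the canvas values (255/0/128 for white/black/gray), and the function names are illustrative, not from the patent.

```python
# Hedged sketch: accumulate DVS events over a fixed 10 ms window into a
# grayscale event image, as described in Step 1. Event layout is assumed
# to be (x, y, polarity, timestamp_us); names are illustrative.
import numpy as np

def events_to_image(events, width, height, t_start_us, window_us=10_000):
    """ON events -> white (255), OFF events -> black (0), background gray (128)."""
    img = np.full((height, width), 128, dtype=np.uint8)
    t_end = t_start_us + window_us
    for x, y, polarity, t in events:
        if t_start_us <= t < t_end:
            img[y, x] = 255 if polarity > 0 else 0
    return img

# Two events inside the 10 ms window, one outside it:
events = [(10, 5, +1, 1_000), (20, 7, -1, 2_000), (3, 3, +1, 50_000)]
frame = events_to_image(events, width=64, height=48, t_start_us=0)
```

The resulting image has the same spatial size as the APS frame, so the two can be fed into the network together.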
Step 2: feature extraction via a network overall framework
The APS image and the DVS image are input into the network framework simultaneously, and features are extracted through their respective 3×3 and 1×1 convolutional layers; the difference is the number of convolutional layers used, with fewer for DVS than for APS. The network predicts on the input APS image and the DVS image at the same time. Both images are divided into S×S grids, each grid cell predicts B bounding boxes, and C classes are predicted. Each bounding box is modeled with a Gaussian, predicting 8 coordinate values μ_x, ε_x, μ_y, ε_y, μ_w, ε_w, μ_h, ε_h, together with a confidence score p. The tensor fed into the final detection layer therefore holds 2×S×S×B×(C+9) values. Tensors of three sizes from the APS channel and tensors of the same three sizes from the DVS channel are fed into the detection layer.
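The tensor bookkeeping in Step 2 can be checked with a short sketch: each box carries 8 Gaussian coordinate values, one confidence, and C class scores, giving S×S×B×(C+9) values per channel and twice that for the two channels. The example numbers (a 13×13 grid, 3 boxes per cell, one "vehicle" class) are illustrative assumptions, not values stated in the patent.

```python
# Hedged sketch of the detection-layer tensor size described above.
# S, B, C values below are illustrative examples.

def detection_tensor_size(S, B, C, channels=2):
    """Total prediction values for `channels` streams (APS + DVS)."""
    per_box = 8 + 1 + C          # 8 Gaussian coords + confidence + C classes
    return channels * S * S * B * per_box

# e.g. a 13x13 grid, 3 boxes per cell, 1 class ("vehicle"):
size = detection_tensor_size(S=13, B=3, C=1)
```

With C = 1 each box holds 10 values, matching the (C+9) factor in the text.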
Step 3: fusion module
After passing through their respective convolutional layers, the APS and DVS obtain features F_aps and F_dvs, which are sent into the fusion module. First, F_aps and F_dvs undergo a given transformation operation T_c: F → U, F ∈ R^{M′×N′×C′}, U ∈ R^{M×N×C}, U = [u_1, u_2, …, u_C], to obtain the transformed features U_aps and U_dvs, where u_c is the M×N feature matrix of the c-th of the C channels. For simplicity, the T_c operation is taken to be a convolution operation;
After obtaining the transformed feature U_dvs, we consider the global information of all channels in the feature and compress this global information per channel to obtain the aggregate information z_c. This is accomplished by a global average pooling operation T_sq(U_dvs), formally expressed as:
z_c = T_sq(u_c) = (1/(M×N)) Σ_{i=1}^{M} Σ_{j=1}^{N} u_c(i,j) #(1)
where u_c(i,j) is the (i,j)-th value in the feature matrix. To perform the excitation operation on the information z_c aggregated by the squeeze operation, the convolutional feature information of each channel is fused to obtain the channel dependency s, namely:
s = T_ex(z, E) = δ(E_2 σ(E_1 z)) #(2)
where σ represents the ReLU activation function, δ represents the sigmoid activation function, and E_1 and E_2 are two weight matrices; two fully connected layers are used to implement this.
The T_scale operation then rescales U_aps with the activation s to obtain the feature block U′:
U′ = T_scale(U_aps, s) = U_aps · s #(3)
Finally, the feature block is fused with the features of the APS to obtain the final fused feature F_aps′:
F_aps′ = U′ ⊕ U_aps #(4)
The concrete implementation adopts a splicing (concatenation) operation.
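The squeeze-excite-scale-concatenate pipeline of equations (1)-(4) can be sketched with NumPy. This is a minimal sketch under assumptions: the weight shapes, the reduction ratio r, and the random inputs are illustrative; a real implementation would use learned convolutional and fully connected layers.

```python
# Hedged NumPy sketch of the fusion module (equations (1)-(4)):
# squeeze the DVS feature, excite through two FC weights, rescale the
# APS feature, then concatenate. Shapes and weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(U_aps, U_dvs, E1, E2):
    # (1) squeeze: z_c = mean over the M x N spatial grid of channel c
    z = U_dvs.mean(axis=(0, 1))                  # shape (C,)
    # (2) excite: s = sigmoid(E2 @ relu(E1 @ z))
    s = sigmoid(E2 @ np.maximum(E1 @ z, 0.0))    # shape (C,)
    # (3) scale: U' = U_aps * s (per-channel reweighting)
    U_prime = U_aps * s                          # broadcast over M, N
    # (4) fuse: concatenate U' with U_aps along the channel axis
    return np.concatenate([U_prime, U_aps], axis=-1)

M, N, C, r = 4, 4, 8, 2
U_aps = rng.normal(size=(M, N, C))
U_dvs = rng.normal(size=(M, N, C))
E1 = rng.normal(size=(C // r, C))   # reduction weight
E2 = rng.normal(size=(C, C // r))   # expansion weight
F_fused = fuse(U_aps, U_dvs, E1, E2)
```

The fused feature doubles the channel count, consistent with the splicing (concatenation) operation the text names.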
Step 4: performing object detection on the extracted features through a detection layer
As in the APS part, a DVS detection result is added to the detection layer; binary cross-entropy losses are applied to the objects and classes detected by the DVS, and the negative log-likelihood (NLL) loss of the coordinate frame is as follows:
L_x = −Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{k=1}^{K} γ_ijk log( N(x_ijk^G | μ_x(b_ijk), ε_x(b_ijk)) + ζ ) #(5)
γ_ijk = ω_scale × δ_ijk^obj #(6)
where L_x is the NLL loss for the x-coordinate of the DVS. W and H are the numbers of grid cells along the width and height, and K is the number of prior frames. The outputs of the detection layer at the k-th prior frame of the (i, j) grid cell are μ_x(b_ijk), representing the x-coordinate, and ε_x(b_ijk), representing the uncertainty of the x-coordinate. x_ijk^G is the Ground Truth of the x-coordinate, calculated as in Gaussian YOLOv3 from the adjusted image width and height and the k-th prior frame. ζ is a fixed value of 10⁻⁹. The losses for the remaining coordinates y, w, and h take the same form as for x.
ω_scale = 2 − w_G × h_G #(7)
ω_scale provides different weights during training according to the object size (w_G, h_G). δ_ijk^obj in equation (6) is a parameter that is applied in the loss only when the prior frame contains the anchor most appropriate for the current object. Its value is 1 or 0, determined by the Intersection over Union (IOU) of the Ground Truth with the k-th prior frame in the (i, j) grid cell.
The value of C_ijk depends on whether the bounding box of the grid cell fits the predicted object: if it fits, C_ijk = 1; otherwise C_ijk = 0. τ_noobj indicates that the k-th prior frame of the grid cell does not fit a target. A further term represents the correct category, and an indicator marks the k-th prior frame of the grid cell as not responsible for predicting the target.
The category losses are as follows:
p_ij represents the probability that the currently detected target is the correct target.
The loss function of the DVS part is:
where L_DVS represents the sum of the DVS channel's coordinate loss, class loss, and confidence loss.
L_APS and L_DVS are formally identical. The loss function of the whole network is:
L = L_APS + L_DVS #(11)
Adding the loss function of the DVS channel makes the model more robust to data from extreme environments and improves the accuracy of the algorithm.
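The per-coordinate NLL term and the total loss of equation (11) can be illustrated numerically. This is a hedged sketch of a Gaussian-YOLOv3-style coordinate loss, not the patent's exact implementation: the function names and example values are assumptions, and only a single coordinate is shown, without the ω_scale and δ_obj weighting.

```python
# Hedged sketch: negative log-likelihood of a predicted Gaussian coordinate
# (mean mu, variance sigma_sq) at the ground truth gt, stabilized by
# zeta = 1e-9 as in the text. Single-coordinate, unweighted illustration.
import math

def gaussian_nll(mu, sigma_sq, gt, zeta=1e-9):
    """-log( N(gt | mu, sigma_sq) + zeta ) for one predicted coordinate."""
    density = math.exp(-(gt - mu) ** 2 / (2.0 * sigma_sq)) / math.sqrt(
        2.0 * math.pi * sigma_sq
    )
    return -math.log(density + zeta)

# A confident, accurate prediction costs less than a confident, wrong one:
good = gaussian_nll(mu=0.50, sigma_sq=0.01, gt=0.50)
bad = gaussian_nll(mu=0.50, sigma_sq=0.01, gt=0.90)

# Total network loss, as in equation (11): L = L_APS + L_DVS
def total_loss(l_aps, l_dvs):
    return l_aps + l_dvs
```

Because the network also predicts the uncertainty, a wrong prediction with a small predicted variance is penalized heavily, which is the mechanism that makes the extra DVS loss term informative.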
To verify the effectiveness of the proposed solution, experiments were first performed on a custom dataset. A comparison experiment was run across different input modes: only the APS image, only the DVS image, a pixel-level superposition of APS and DVS, and both images input simultaneously; the results are shown in Table 2. The effects of the different input modes are shown in FIG. 3, where each column corresponds to an input mode and each method is shown in four scenes (fast motion, light too strong, light too dark, and normal). In a scene with a fast-moving object, the DVS-only input detects the fast-moving vehicle but may miss a relatively stationary one; conversely, the APS-only input detects the relatively stationary vehicle but misses the fast-moving one. The pixel-superposition input performs about as well as the APS-only input. Inputting both images simultaneously yields good detection for both fast-moving and stationary vehicles. When the illumination is too strong or too dark, neither the APS-only input nor the pixel superposition detects well; by contrast, inputting the APS and DVS images simultaneously fuses the two sets of features well, and the DVS compensates for the shortcomings of the APS. DVS-only detection is worst in the normal scene, because only brightness changes produce information, and regions without brightness change correspond to background that cannot be recognized. Overall, inputting the two images and fusing them in the ADF network is significantly superior to the other methods.
Several state-of-the-art single-input networks and methods were also selected for comparison, as shown in Table 3; all single-image-input comparisons were run on the custom dataset. The table shows that the model of the invention is less effective than the other networks when only a single image is input, because the network itself is designed for dual inputs. When the model inputs the frame and the events simultaneously, the experimental results improve, which also demonstrates that using event data can improve recognition.
In addition, the invention is compared on the PKU-DDD17-CAR dataset with the JDF network, which also inputs both kinds of data; the results are shown in Table 4. The event data in the dataset are converted into images and then sent to the ADF network, and the results of inputting only the frame image versus inputting the frame image and event data simultaneously are compared. Although the network is inferior to JDF when only a frame image is input, it outperforms JDF when both kinds of data are input simultaneously.
Table 1 number of convolutional layers in network frame
TABLE 2 results of experiments on custom datasets
Table 3 results of comparison with Single image input network
Table 4 results of comparison with two different networks of data inputs

Claims (3)

1. The vehicle target detection method based on the event camera is characterized by comprising the following steps of: based on the APS images and DVS data generated by an event camera, adopting convolutional neural network technology to detect vehicle targets in extreme scenes, converting the event data into event images; according to the coordinates and polarity of the pixels, converting the event data accumulated over a time interval into an event image of the same size as the frame image; using a mature convolutional neural network on the basis of the Darknet-53 framework, adding convolutional layers for extracting features from the DVS image alongside the convolution operations on the APS image, the DVS channel still adopting successive 3×3 and 1×1 convolutional layers; then adding a fusion module in the convolutional neural network, which, after extracting the DVS features at each resolution, weights the APS features of the same size so as to guide the network to learn more detailed features of the APS and the DVS simultaneously; modifying the loss function of the network at the detection layer, wherein the loss function of the APS features adopts a cross-entropy loss comprising losses on coordinates, class, and confidence, and a cross-entropy loss likewise performs the loss calculation on the DVS features;
the two sets of features are effectively fused in the fusion module; after passing through their respective convolutional layers, the APS and DVS obtain features F_aps and F_dvs, which are sent into the fusion module; first, F_aps and F_dvs undergo a given transformation operation T_c: F → U, F ∈ R^{M′×N′×C′}, U ∈ R^{M×N×C}, U = [u_1, u_2, …, u_C], to obtain the transformed features U_aps and U_dvs, where u_c is the M×N feature matrix of the c-th of the C channels; for simplicity, the T_c operation is taken to be a convolution operation;
after the transformed feature U_dvs is obtained, the global information of all channels in the feature is considered and compressed per channel to obtain the aggregate information z_c; this is accomplished by a global average pooling operation T_sq(U_dvs), formally expressed as:
z_c = T_sq(u_c) = (1/(M×N)) Σ_{i=1}^{M} Σ_{j=1}^{N} u_c(i,j) (1)
wherein u_c(i,j) is the (i,j)-th value in the feature matrix; to perform the excitation operation on the information z_c aggregated by the squeeze operation, the convolutional feature information of each channel is fused to obtain the channel dependency s, namely:
s = T_ex(z, E) = δ(E_2 σ(E_1 z)) (2)
wherein σ represents the ReLU activation function, δ represents the sigmoid activation function, and E_1 and E_2 are two weight matrices; two fully connected layers are used to implement this;
the T_scale operation then rescales U_aps with the activation s to obtain the feature block U′:
U′ = T_scale(U_aps, s) = U_aps · s (3)
finally, the feature block is fused with the features of the APS to obtain the final fused feature F_aps′:
F_aps′ = U′ ⊕ U_aps (4)
splicing operation is adopted in the concrete implementation;
adding a loss term for the DVS features at the detection layer; as in the APS part, a DVS detection result is added to the detection layer, binary cross-entropy losses are applied to the objects and classes detected by the DVS, and the negative log-likelihood (NLL) loss of the coordinate frame is as follows:
L_x = −Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{k=1}^{K} γ_ijk log( N(x_ijk^G | μ_x(b_ijk), ε_x(b_ijk)) + ζ ) (5)
γ_ijk = ω_scale × δ_ijk^obj (6)
wherein L_x is the NLL loss for the x-coordinate of the DVS; W and H are the numbers of grid cells along the width and height, and K is the number of prior frames; the outputs of the detection layer at the k-th prior frame of the (i, j) grid cell are μ_x(b_ijk), representing the x-coordinate, and ε_x(b_ijk), representing the uncertainty of the x-coordinate; x_ijk^G is the Ground Truth of the x-coordinate, calculated as in Gaussian YOLOv3 from the adjusted image width and height and the k-th prior frame; ζ is a fixed value of 10⁻⁹; the losses for the remaining coordinates y, w, and h take the same form as for x;
ω_scale = 2 − w_G × h_G (7)
ω_scale provides different weights during training according to the object size (w_G, h_G); δ_ijk^obj in equation (6) is a parameter that is applied in the loss only when the prior frame contains the anchor most appropriate for the current object; its value is 1 or 0, determined by the Intersection over Union (IOU) of the Ground Truth with the k-th prior frame in the (i, j) grid cell;
the value of C_ijk depends on whether the bounding box of the grid cell fits the predicted object: if it fits, C_ijk = 1; otherwise C_ijk = 0; τ_noobj indicates that the k-th prior frame of the grid cell does not fit a target; a further term represents the correct category, and an indicator marks the k-th prior frame of the grid cell as not responsible for predicting the target;
the category losses are as follows:
p_ij represents the probability that the currently detected target is the correct target;
the loss function of the DVS part is:
wherein L_DVS represents the sum of the DVS channel's coordinate loss, class loss, and confidence loss;
L_APS and L_DVS are formally identical; the loss function of the whole network is:
L = L_APS + L_DVS (11).
2. The event camera-based vehicle target detection method according to claim 1, wherein converting events into an image employs the fixed time interval method; to achieve a detection speed of 100 frames per second (FPS), the frame reconstruction is set to a fixed frame length of 10 ms; in each time interval, at the pixel position where an event occurs, an event with increased polarity is drawn as a white pixel and an event with decreased polarity as a black pixel, while the background of the image is gray; finally, an event image of the same size as the APS image is generated.
3. The event camera-based vehicle object detection method of claim 1, wherein successive 3×3 and 1×1 convolutional layers are added to extract features from the DVS image; the APS image and the DVS image are input into the network framework simultaneously and features are extracted through their respective 3×3 and 1×1 convolutional layers, the difference being the number of convolutional layers used, with fewer for DVS than for APS; the network predicts on the input APS image and the DVS image at the same time; both the APS image and the DVS image are divided into S×S grids, each grid cell predicts B bounding boxes, and C classes are predicted; each bounding box is modeled with a Gaussian, predicting 8 coordinate values μ_x, ε_x, μ_y, ε_y, μ_w, ε_w, μ_h, ε_h; in addition, a confidence score p is predicted; the tensor fed into the final detection layer of the network therefore holds 2×S×S×B×(C+9) values; tensors of three sizes from the APS channel and tensors of the same three sizes from the DVS channel are fed into the detection layer.
CN202110182127.XA 2021-02-09 2021-02-09 Vehicle target detection method based on event camera Active CN112801027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110182127.XA CN112801027B (en) 2021-02-09 2021-02-09 Vehicle target detection method based on event camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110182127.XA CN112801027B (en) 2021-02-09 2021-02-09 Vehicle target detection method based on event camera

Publications (2)

Publication Number Publication Date
CN112801027A CN112801027A (en) 2021-05-14
CN112801027B true CN112801027B (en) 2024-07-12

Family

ID=75815068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110182127.XA Active CN112801027B (en) 2021-02-09 2021-02-09 Vehicle target detection method based on event camera

Country Status (1)

Country Link
CN (1) CN112801027B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115526814A (en) * 2021-06-25 2022-12-27 华为技术有限公司 Image prediction method and device
US20240348934A1 (en) * 2021-08-24 2024-10-17 The University Of Hong Kong Event-based auto-exposure for digital photography
CN113762409B (en) * 2021-09-17 2024-06-28 北京航空航天大学 A UAV target detection method based on event camera
CN116205838A (en) * 2021-11-30 2023-06-02 鸿海精密工业股份有限公司 Abnormal image detection method, system, terminal equipment and storage medium
CN117372941A (en) * 2022-06-30 2024-01-09 清华大学 An event data processing method and related equipment
CN115497028B (en) * 2022-10-10 2023-11-07 中国电子科技集团公司信息科学研究院 Event-driven-based dynamic hidden target detection and recognition method and device
CN117893856A (en) * 2022-10-14 2024-04-16 华为技术有限公司 Signal processing method, device, equipment, storage medium and computer program
CN115631407B (en) * 2022-11-10 2023-10-20 中国石油大学(华东) Underwater transparent biological detection based on fusion of event camera and color frame image
CN116416602B (en) * 2023-04-17 2024-05-24 江南大学 Moving object detection method and system based on combination of event data and image data
CN116206196B (en) * 2023-04-27 2023-08-08 吉林大学 A multi-target detection method and detection system in marine low-light environment
CN116682000B (en) * 2023-07-28 2023-10-13 吉林大学 Underwater frogman target detection method based on event camera
CN117274321A (en) * 2023-09-26 2023-12-22 北京理工大学 A multi-modal optical flow estimation method based on event cameras
CN120635428B (en) * 2025-08-12 2025-10-24 中国空气动力研究与发展中心计算空气动力研究所 Target detection method based on event data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461083A (en) * 2020-05-26 2020-07-28 青岛大学 A fast vehicle detection method based on deep learning
CN112163602A (en) * 2020-09-14 2021-01-01 湖北工业大学 Target detection method based on deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685152B (en) * 2018-12-29 2020-11-20 北京化工大学 An Image Object Detection Method Based on DC-SPP-YOLO

Also Published As

Publication number Publication date
CN112801027A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112801027B (en) Vehicle target detection method based on event camera
CN113052210B (en) A fast low-light target detection method based on convolutional neural network
CN115376108B (en) Obstacle detection method and device in complex weather conditions
CN111582201B (en) A Lane Line Detection System Based on Geometric Attention Perception
CN113762409B (en) A UAV target detection method based on event camera
CN111832453B (en) Real-time semantic segmentation method of unmanned driving scenes based on dual-channel deep neural network
CN104517103A (en) Traffic sign classification method based on deep neural network
CN115035298B (en) Urban streetscape semantic segmentation enhancement method based on multidimensional attention mechanism
CN114494934B (en) Unsupervised moving object detection method based on information reduction rate
CN119314141B (en) Lightweight parking detection method based on multi-scale attention mechanism
CN116311154A (en) A Vehicle Detection and Recognition Method Based on YOLOv5 Model Optimization
CN114998879B (en) Fuzzy license plate recognition method based on event camera
CN118314606B (en) Pedestrian detection method based on global-local characteristics
CN116245860B (en) A small target detection method based on super-resolution-yolo network
CN117115616A (en) A real-time low-light image target detection method based on convolutional neural network
CN119723374A (en) A method, system, device and medium for detecting small targets in unmanned aerial image
CN115497059A (en) A Vehicle Behavior Recognition Method Based on Attention Network
CN119152453A (en) Infrared expressway foreign matter detection method based on Mamba framework
CN115063704B (en) A UAV monitoring target classification method based on 3D feature fusion and semantic segmentation
CN114648755B (en) Text detection method for industrial container in lightweight moving state
CN113920455B (en) Night video coloring method based on deep neural network
CN119723045A (en) A lightweight nighttime target detection method and system based on deep learning
CN119888308A (en) Lightweight semi-supervised target detection method based on joint estimation and scale fusion
CN119942588A (en) Fingertip detection method, device, equipment and storage medium based on event camera
CN119379988A (en) An image target detection method and system based on improved LW-DETR algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant