CN111508002A - A small low-flying target visual detection and tracking system and method thereof - Google Patents
- Publication number
- CN111508002A (Application CN202010309617.7A)
- Authority
- CN
- China
- Prior art keywords
- target
- frame
- tracking
- unit
- detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a small low-flying target visual detection and tracking system and method. The system comprises: a video data input unit, a video preprocessing unit, a training data construction unit, a detection model training unit, a target comparison and screening unit, a detection correction unit, a reference frame initialization unit, a sample library dynamic construction unit, an online learning unit, a position refinement unit, a decision control unit and a tracking result output unit. The method includes: target detection network construction; target comparison and screening; online learning for target tracking; dynamic construction of a classifier training sample library; and target tracking position refinement. The advantages of the invention are that it can effectively mitigate tracking drift of the tracked target caused by occlusion, scale change, illumination and other factors, achieving robust target tracking. It can update the reference frame features promptly as the target changes, while the introduced feature-point matching algorithm avoids erroneous tracking caused by updating the reference frame features.
Description
Technical Field
The present invention relates to the technical field of flying target tracking, and in particular to a small low-flying target detection and tracking system based on neural networks and online learning, and to its detection and tracking method.
Background Art
At present, there already exist some methods for joint detection and tracking, as well as some tracking methods suited to low-altitude, slow-speed, small ("low-slow-small") targets. Existing methods achieve short-term target tracking with correlation filtering methods, and use neural-network-based target detection to relocate the target when tracking fails. Related patents and research are as follows:
Chinese invention patent "A robust long-term tracking method based on correlation filtering and target detection", application number CN201910306616.4. This invention realizes target tracking with a correlation filtering method and uses the one-stage detector YOLO for target detection. After obtaining the detection results, the SURF feature-point matching method selects the candidate box with the highest number of matched points as the target bounding box for re-initializing the tracker, finally achieving long-term tracking. However, this method does not consider the extreme imbalance between target and background in the detector, and it cannot be applied to long-term tracking scenarios for small targets.
Chinese invention patent "A low-altitude slow-speed UAV tracking method combining correlation filtering and visual saliency", application number 201910117155.6. It applies a correlation filtering method in a small search region to obtain a prediction response map that determines the center position of the predicted target, and uses a saliency detection method in a larger search region to determine the scale of the predicted target, realizing a tracking method adapted to low-altitude, slow-speed UAVs. However, this method performs no further processing after tracking failure, and its accuracy still needs improvement.
In summary, none of the above prior-art methods solves the extreme target-background imbalance that arises in single-target detection and tracking; network performance has not yet reached its optimum, and there remains room to further improve tracking accuracy for small targets.
Summary of the Invention
Aiming at the defects of the prior art, the present invention provides a small low-flying target visual detection and tracking system and method, which overcome the defects existing in the prior art.
To achieve the above purpose, the present invention adopts the following technical scheme:
A small low-flying target detection and tracking system, comprising: a video data input unit, a video preprocessing unit, a training data construction unit, a detection model training unit, a target comparison and screening unit, a detection correction unit, a reference frame initialization unit, a sample library dynamic construction unit, an online learning unit, a position refinement unit, a decision control unit and a tracking result output unit.
The video data input unit is used to input several video sequences containing the target and randomly divide them into two parts: one part for training the target detection model and the other for online testing of the target tracking model.
The video preprocessing unit completes the preliminary video preprocessing required by the detection and tracking units, specifically deleting video clips that contain no target for a long time and removing clips that clearly do not match the characteristics of small targets flying at slow speed in low-altitude airspace.
The training data construction unit ensures the completeness and richness of the training data: video frames are sampled at equal intervals to build the training and validation sets, and the data are annotated, i.e., the center position, width and height of the target in each image are determined for supervised training of the target detection model.
The detection model training unit creates a pyramid-structured target detection network and uses the focal loss to alleviate the target-background imbalance. Training stops once the training loss is observed to stabilize, and the model file with the best validation performance is saved; it provides reset-box information when target tracking fails.
The target comparison and screening unit uses the SURF feature matching algorithm to compare the ground-truth box of the first frame of the tracked video with the detection results, removing false alarms that are clearly not low-slow-small targets and further ensuring the robustness of long-term stable tracking.
The detection correction unit is started in two situations: first, when the confidence of the tracking box position falls below the set threshold, indicating that target tracking has failed in the current frame; second, automatically at a specified frame interval, ensuring that the features of the currently tracked target do not drift too far from those of the reference frame.
The reference frame initialization unit crops a search region five times the target size according to the received reference-box target position and scale information, rescales it to the specified size of 288x288, and feeds the resized search-region image patch together with the target position and scale information into the dynamically constructed classifier sample library.
The sample library dynamic construction unit performs data augmentation on the reference frame samples, including the basic operations of rotation, scaling, jittering and blurring, and receives new samples added during the tracking process.
The online learning unit uses the deep ResNet18 network to extract features from the samples stored in the sample library and obtains a predicted Gaussian response map after two fully connected layers. Meanwhile, a ground-truth Gaussian label is generated from the label information in the sample library, with the peak of the Gaussian distribution at the target center. Taking the reduction of the gap between the predicted Gaussian response map and the ground-truth Gaussian label as the optimization objective, the parameters of the two fully connected layers are adjusted online, so that the feature extraction network and the fully connected layers can predict the Gaussian response map and obtain the center of the predicted target position even without labels.
The position refinement unit refines the initial tracking result obtained by the online learning unit. Taking the predicted target center from the current frame's online learning unit and the target width and height from the previous frame as reference, several jittered boxes are generated and mapped onto the current-frame search region. The precise region-of-interest pooling layer extracts features for each jittered box, which are concatenated with the modulation features obtained from the reference frame; a fully connected layer then yields the predicted position confidence of each jittered box. The results of the three jittered boxes with the highest confidence are merged to give the refined tracking result of the current frame.
The decision control unit judges the tracking state during tracking from the relation between the predicted position confidence of the tracking box and the set threshold: if the target is being tracked stably, tracking continues with the next frame; if the target has been lost, the detection correction unit is activated to detect the target in the current frame and update the reference frame, achieving long-term stable tracking of low-slow-small targets.
The tracking result output unit outputs the position and scale information of each frame after all frames of the video have been traversed.
The invention also discloses a small low-flying target visual detection and tracking method, comprising the following steps:
Step 1: target detection network construction.
1) Constructing the network structure, including the backbone network, the classification subnet and the regression subnet.
2) Loss function: the focal loss is used to solve the severe imbalance between positive and negative samples in target detection and to reduce the weight of easy negative samples in training.
Step 2: target comparison and screening.
Before sending the result to the detection correction unit, a SURF feature-point matching is performed between the target in the first frame of the current video and the detection result. When the number of matched points is greater than the set value, the detection result is indeed the target that currently needs to be tracked and the detection is considered successful; the detection box is then sent to the detection correction unit for the subsequent process.
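The screening rule of step 2 can be sketched with generic nearest-neighbour descriptor matching plus Lowe's ratio test, as a stand-in for the SURF matcher; the function names, the ratio value and the descriptors being precomputed arrays are all illustrative assumptions, and descriptor extraction itself is not shown.

```python
import numpy as np

def count_matches(desc_ref, desc_det, ratio=0.7):
    """Count feature-point matches between the first-frame target
    descriptors and the detection-result descriptors, keeping a match
    only when the nearest neighbour is clearly better than the second."""
    matches = 0
    for d in desc_ref:
        dists = np.linalg.norm(desc_det - d, axis=1)
        order = np.argsort(dists)
        if len(order) >= 2 and dists[order[0]] < ratio * dists[order[1]]:
            matches += 1
    return matches

def detection_accepted(desc_ref, desc_det, min_matches):
    """Screening rule: accept the detection only when the number of
    matched points exceeds the set value."""
    return count_matches(desc_ref, desc_det) > min_matches
```

With a real SURF implementation, `desc_ref` and `desc_det` would be the descriptor matrices of the first-frame ground-truth box and of the detected box.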
Step 3: online learning for target tracking.
Predicting the target center position consists of two parts, initializing the classifier and the online classification process:
1) Initializing the classifier
For the data-augmented reference frame samples in the sample library, features are extracted with the feature extraction network; meanwhile a two-dimensional Gaussian ground-truth label ygt of the same size as the feature map is generated, with its peak at the reference-frame target center. The classifier is initialized from these features and labels, a least-squares optimization minimizes the distance between the predicted values and the true values, and the Gauss-Newton iteration method solves the nonlinear least-squares problem. The formula is as follows:

ygt(x, y) = exp(-((x - x0)^2 + (y - y0)^2) / (2σ^2)) (1)

In equation (1), x ∈ {1, …, M} is the horizontal coordinate on the feature map, with M the feature map width; y ∈ {1, …, N} is the vertical coordinate, with N the feature map height; (x0, y0) is the reference-frame target center point, and σ is the Gaussian bandwidth.
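Equation (1) translates directly to code; a minimal numpy sketch, where the 1-based coordinate convention and the peak at (x0, y0) follow the text and the (N, M) array layout is an assumption:

```python
import numpy as np

def gaussian_label(M, N, x0, y0, sigma):
    """2-D Gaussian ground-truth label of equation (1): an N x M map
    whose peak lies at the reference-frame target centre (x0, y0)."""
    xs = np.arange(1, M + 1)            # horizontal coordinates x = 1..M
    ys = np.arange(1, N + 1)            # vertical coordinates y = 1..N
    X, Y = np.meshgrid(xs, ys)          # X and Y have shape (N, M)
    return np.exp(-((X - x0) ** 2 + (Y - y0) ** 2) / (2.0 * sigma ** 2))
```

The label equals 1 exactly at the target centre and decays with distance at a rate set by the bandwidth σ.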
2) Online classification process
Based on the tracking result (x_{t-1}, y_{t-1}, w_{t-1}, h_{t-1}) of the previous frame (frame t-1), where (x_{t-1}, y_{t-1}) is the estimated target center and (w_{t-1}, h_{t-1}) the estimated target width and height, the target center of the previous frame is taken as the center of the target search region in the current frame (frame t), and the width and height are expanded by the specified ratio k, generating the current-frame search region (x_{t-1}, y_{t-1}, k*w_{t-1}, k*h_{t-1}). Then the feature extraction network extracts the search-region features f_t, and after two fully connected layers a predicted Gaussian response map of the same size as the search region is generated (weight1 and weight2 denote the weight coefficient matrices of the two fully connected layers, whose mapping function is applied to f_t); the position of the maximum response is the target center (x_t, y_t) estimated for the current frame. The online-trained classifier fully considers both the tracked target and the background region, and is continuously updated during this process to estimate the target position.
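The search-region expansion and the argmax read-out of the response map can be sketched as follows; this is a minimal illustration, with the (center x, center y, width, height) box convention taken from the text and the function names assumed:

```python
import numpy as np

def search_region(prev_box, k=5.0):
    """Expand the previous-frame box (x, y, w, h), kept centred at (x, y),
    by the specified ratio k to get the current-frame search region."""
    x, y, w, h = prev_box
    return (x, y, k * w, k * h)

def predict_center(response):
    """The estimated target centre (x_t, y_t) is the position of the
    maximum of the predicted Gaussian response map (rows = y, cols = x)."""
    iy, ix = np.unravel_index(np.argmax(response), response.shape)
    return ix, iy
```

In the full system the response map would come from the two fully connected layers applied to the search-region features f_t.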
Step 4: dynamically constructing the classifier training sample library.
1) The reference frame update interval is set to T. When the current frame number t is divisible by T, the target detection unit is called to update the reference frame, all outdated samples in the sample library are cleared, and the classifier is re-initialized with the updated reference frame. Meanwhile, as tracking proceeds, newly produced samples are added to the sample library in turn. The purpose is to keep the features of the targets in the sample library highly similar to the sample currently being tracked, which helps to accurately estimate the target center position.
2) When the predicted position confidence of the tracking box is smaller than the set threshold, target tracking has failed in the current frame. The decision control unit sends the detection correction unit a message that the reference frame needs to be re-initialized; after receiving the detection box information from the target detection unit, the detection correction unit sends it to the reference frame initialization unit for data augmentation and other operations, and finally into the dynamically constructed sample library.
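The two update conditions of step 4 can be condensed into a small bookkeeping sketch; the class and method names are hypothetical, and real samples would be augmented image patches with labels rather than plain values:

```python
class SampleLibrary:
    """Dynamically built classifier sample library: re-initialised from a
    fresh detection every T frames, or whenever the tracker's position
    confidence drops below the threshold; otherwise new samples accrue."""

    def __init__(self, update_interval_T, conf_threshold):
        self.T = update_interval_T
        self.conf_threshold = conf_threshold
        self.samples = []

    def reinitialize(self, reference_sample):
        # Clear all outdated samples and restart from the new reference.
        self.samples = [reference_sample]

    def step(self, frame_idx, new_sample, confidence, detect_fn):
        if frame_idx % self.T == 0:
            self.reinitialize(detect_fn())      # scheduled refresh
        elif confidence < self.conf_threshold:
            self.reinitialize(detect_fn())      # tracking failure
        else:
            self.samples.append(new_sample)     # stable tracking
```

Here `detect_fn` stands in for the detection correction unit returning a new reference-frame sample.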
Step 5: target tracking position refinement.
This step comprises two parts, a feature extraction network and a similarity evaluation network:
1) Feature extraction network
Feature extraction uses the ResNet18 network, balancing the retention of previous template information against updating with current reference-frame information; this provides the neural network with features combining the target's current and historical states and improves tracking stability. Search-region features are extracted for three parts: the reference frame, the current frame, and the image frame at the moment midway between them. Each is fed into the precise region-of-interest pooling layer for the similarity evaluation network to compute the predicted position confidence.
2) Similarity evaluation network
The core of the similarity evaluation network is the precise region-of-interest pooling layer. Its input consists of two parts. The first part is a bilinear interpolation, with interpolation coefficient IC, of the image feature map extracted by the network:
IC(x,y,i,j)=max(0,1-|x-i|)×max(0,1-|y-j|) (2)IC(x, y, i, j)=max(0,1-|x-i|)×max(0,1-|y-j|) (2)
which maps the discrete feature map to a continuous space, giving the continuous feature map f(x, y):

f(x, y) = Σi,j IC(x, y, i, j) × wi,j (3)
In equations (2) and (3), (x, y) is a continuous coordinate on the feature map, (i, j) is a coordinate index on the feature map, and wi,j is the weight (feature value) corresponding to position (i, j). The second part of the input is the top-left corner coordinate (x1, y1) and the bottom-right corner coordinate (x2, y2) of the rectangular box. The precise region-of-interest pooling operation is performed according to the obtained continuous feature map and the rectangle coordinates, preserving the target features in the image to the greatest extent and preparing for the further comparison of the similarity between the reference target and historical-frame targets. Finally, the feature map f(x, y) is doubly integrated over the rectangle and divided by the rectangle's area, giving the precise region-of-interest pooling (PrROI Pooling) result:

PrPool = ( ∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy ) / ( (x2 - x1) × (y2 - y1) ) (4)
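Equations (2) through (4) can be checked numerically. The sketch below evaluates the continuous feature map and approximates the double integral by midpoint sampling; the actual PrROI pooling layer integrates the bilinear form in closed form, so the numerical approximation here is only for illustration:

```python
import numpy as np

def ic(x, y, i, j):
    """Bilinear interpolation coefficient of equation (2)."""
    return max(0.0, 1.0 - abs(x - i)) * max(0.0, 1.0 - abs(y - j))

def f_cont(x, y, w):
    """Continuous feature map of equation (3): the sum over all (i, j)
    of IC(x, y, i, j) * w[i, j], with i horizontal and j vertical."""
    rows, cols = w.shape                # rows index j, columns index i
    total = 0.0
    for j in range(rows):
        for i in range(cols):
            total += ic(x, y, i, j) * w[j, i]
    return total

def prroi_pool(w, x1, y1, x2, y2, steps=64):
    """Equation (4): the average of f(x, y) over the box, approximated
    by midpoint sampling on a steps x steps grid."""
    xs = x1 + (np.arange(steps) + 0.5) * (x2 - x1) / steps
    ys = y1 + (np.arange(steps) + 0.5) * (y2 - y1) / steps
    vals = [f_cont(x, y, w) for y in ys for x in xs]
    return sum(vals) / len(vals)
```

On a constant feature map the pooled value equals that constant for any interior box, which is a quick sanity check of the normalization by the box area.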
After the features from the precise region-of-interest pooling layer are obtained, the three features of the reference frame, the intermediate frame and the current frame are concatenated and input to the fully connected layer, which outputs the final position confidence. The similarity between each candidate target and the historical targets is compared, and the most similar target is taken as the tracking result.
Compared with the prior art, the advantages of the present invention are:
1) The present invention combines detection and tracking methods, which can effectively mitigate tracking drift of the tracked target caused by occlusion, scale change, illumination and other factors, achieving robust target tracking.
2) It can update the reference frame features promptly as the target changes, while the introduced feature-point matching algorithm avoids erroneous tracking caused by updating the reference frame features.
3) It is suitable for long-term stable tracking of low-slow-small targets in optical air-surveillance remote sensing imagery.
Brief Description of the Drawings
Fig. 1 is a structural block diagram of the small low-flying target detection and tracking system according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the target detection network according to an embodiment of the present invention;
Fig. 3 is a flow chart of online learning according to an embodiment of the present invention;
Fig. 4 is a flow chart of position refinement according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, a small low-flying target visual detection and tracking system includes the following units:
(1) Video data input unit. Several video sequences containing the target are input and randomly divided into two parts: one for training the target detection model and one for online testing of the target tracking model.
(2) Video preprocessing unit. The preliminary video preprocessing required by the detection and tracking units is completed, specifically deleting video clips that contain no target for a long time and removing clips that clearly do not match the characteristics of low-slow-small targets.
(3) Training data construction unit. To ensure the completeness and richness of the training data, video frames are sampled at equal intervals to build the training and validation sets, and the data are annotated, i.e., the center position, width and height of the target in each image are determined for supervised training of the target detection model.
(4) Detection model training unit. Since low-slow-small flying targets exhibit large scale variation and the single-target training process suffers from extreme class imbalance, a pyramid-structured target detection network is designed, and the focal loss is used to alleviate the target-background imbalance. Training stops once the training loss is observed to stabilize, and the model file with the best validation performance is saved; it provides reset-box information when target tracking fails.
(5) Target comparison and screening unit. Since the target detection results may contain false alarms, the SURF feature matching algorithm is used to compare the ground-truth box of the first frame of the tracked video with the detection results, removing false alarms that are clearly not low-slow-small targets and further ensuring the robustness of long-term stable tracking.
(6) Detection correction unit. The detection correction unit is started in two situations: first, when the confidence of the tracking box position falls below the set threshold, indicating that target tracking has failed in the current frame; second, automatically at a specified frame interval, ensuring that the features of the currently tracked target do not drift too far from those of the reference frame.
(7) Reference frame initialization unit. According to the received reference-box target position and scale information, a search region five times the target size is cropped and rescaled to the specified size of 288x288; the resized search-region image patch and the target position and scale information are fed into the dynamically constructed classifier sample library.
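A minimal sketch of the crop geometry just described; the function name and return convention are illustrative, and image I/O and the actual pixel resampling are omitted:

```python
def init_search_region(cx, cy, w, h, factor=5.0, out_size=288):
    """Crop box of `factor` times the target size centred at (cx, cy),
    plus the per-axis scale factors that resize it to out_size x out_size."""
    cw, ch = factor * w, factor * h                      # crop width/height
    box = (cx - cw / 2.0, cy - ch / 2.0, cx + cw / 2.0, cy + ch / 2.0)
    return box, (out_size / cw, out_size / ch)
```

The returned scales would be applied when mapping coordinates between the original image and the 288x288 search-region patch.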
(8) Sample library dynamic construction unit. Data augmentation is performed on the reference frame samples, including basic operations such as rotation, scaling, jittering and blurring, and new samples added during tracking are received.
(9) Online learning unit. The deep ResNet18 network extracts features from the samples stored in the sample library, and a predicted Gaussian response map is obtained after two fully connected layers. Meanwhile, a ground-truth Gaussian label is generated from the label information in the sample library, with the peak of the Gaussian distribution at the target center. Taking the reduction of the gap between the predicted response map and the ground-truth label as the optimization objective, the parameters of the two fully connected layers are adjusted online, so that the feature extraction network and the fully connected layers can predict the Gaussian response map and obtain the center of the predicted target position even without labels.
(10) Position refinement unit. The initial tracking result obtained by the online learning unit is refined. Taking the predicted target center from the current frame's online learning unit and the target width and height from the previous frame as reference, several jittered boxes are generated and mapped onto the current-frame search region. The precise region-of-interest pooling layer extracts features for each jittered box, which are concatenated with the modulation features obtained from the reference frame; a fully connected layer then yields the predicted position confidence of each jittered box. The results of the three jittered boxes with the highest confidence are merged to give the refined tracking result of the current frame.
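The jitter-and-merge logic of the position refinement unit can be sketched as follows; the jitter fractions, sample count and function names are illustrative assumptions (the patent does not specify them), and the confidence scores would come from the similarity evaluation network:

```python
import random

def jitter_boxes(cx, cy, w, h, n=16, pos_frac=0.1, scale_frac=0.2, seed=0):
    """Generate n candidate boxes by jittering the predicted centre
    (cx, cy) and the previous-frame scale (w, h)."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        dx = rng.uniform(-pos_frac, pos_frac) * w
        dy = rng.uniform(-pos_frac, pos_frac) * h
        sw = w * (1.0 + rng.uniform(-scale_frac, scale_frac))
        sh = h * (1.0 + rng.uniform(-scale_frac, scale_frac))
        boxes.append((cx + dx, cy + dy, sw, sh))
    return boxes

def merge_top3(boxes, confidences):
    """Merge (average) the three boxes with the highest predicted
    position confidence to produce the refined tracking result."""
    top = sorted(zip(confidences, boxes), key=lambda t: t[0], reverse=True)[:3]
    k = len(top)
    return tuple(sum(b[i] for _, b in top) / k for i in range(4))
```

Averaging the top-scoring candidates smooths out the noise of any single jittered box.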
(11) Decision control unit. During tracking, the target tracking state is judged from the relation between the predicted position confidence of the tracking box and the set threshold. If the target is being tracked stably, tracking continues with the next frame; if the target has been lost, the detection correction unit is activated to detect the target in the current frame and update the reference frame, achieving long-term stable tracking of low-slow-small targets.
(12) Tracking result output unit. After all frames of the video have been traversed, the position and scale information of each frame is output.
A small low-flying target visual detection and tracking method comprises the following steps:
Step 1: target detection network construction.
The detection network for low-slow-small targets mainly comprises two parts: a neural network with a multi-level pyramid structure, and a loss function that alleviates the extreme target-background imbalance during training.
1) Network structure
Backbone network:
The backbone network for target detection uses feature pyramid levels P3 to P7. P3 and P4 are formed from two parts: one part is computed via lateral connections from the outputs of the corresponding feature extraction network ResNet stages (C3 and C4); the other uses a top-down pathway to upsample the deeper, smaller feature maps to the same size as the shallower ones, followed by addition and convolution so that the number of output channels remains 256. Although the channel count of the feature maps stays constant, the information they carry increases. P5 is computed via a lateral connection from the output of the corresponding ResNet stage (C5). P6 and P7 are obtained from C5 through convolution and activation layers. The resolution of P_l is 2^l times lower than the input image (l denotes the pyramid level), and all pyramid levels have C = 256 channels.
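The stated resolution relation for the pyramid levels can be written out directly; the 512x512 input size in the test is only an example value:

```python
def pyramid_resolutions(input_h, input_w, levels=range(3, 8)):
    """Spatial size of each pyramid level P_l: the resolution of P_l is
    2^l times lower than the input image; every level carries C = 256
    channels."""
    return {"P%d" % l: (input_h // 2 ** l, input_w // 2 ** l)
            for l in levels}
```

This makes it easy to confirm, for instance, that P7 of a 512x512 input is only 4x4, which is why small targets rely mainly on the finer levels.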
Classification subnet:
The object classification subnet predicts the probability of a target being present at each spatial position, for each of the A = 9 anchors and K target classes. This subnet is a small fully convolutional network attached to each pyramid level of the backbone, and all pyramid levels share its parameters. Given a C-channel input feature map from a pyramid level, the subnet applies four 3×3 convolutional layers, each with C = 256 filters and ReLU activation, followed by a 3×3 convolutional layer with KA filters; finally, a sigmoid activation is attached at each spatial position, yielding KA binary predictions.
Regression subnet:
The bounding-box regression subnet runs in parallel with the object classification subnet: another small fully convolutional network is attached after each pyramid level to regress the offset from each anchor box to any nearby ground-truth target. Its design is identical to the classification subnet, except that it produces 4A linear outputs at each spatial position. For the A anchor boxes centred at each position, the 4 outputs predict the relative offsets between the anchor box's top-left and bottom-right corner coordinates and the corresponding positions of the ground-truth box. Although the classification and regression subnets share a common structure, they use separate parameters. Figure 2 shows the overall structure of the target detection network.
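The output shapes of the two parallel heads can be sketched as follows. This is a NumPy stand-in, not the real network: zero logits replace the convolutions, and K = 1 (a single low-slow-small class) is an assumed value — the text leaves K general.

```python
import numpy as np

K, A = 1, 9  # assumed single target class; A = 9 anchors per location (from the text)

def head_outputs(H, W):
    # At each spatial position of one pyramid level, the classification head
    # produces K*A sigmoid scores and the regression head produces 4*A linear
    # outputs (top-left and bottom-right corner offsets per anchor).
    cls = 1.0 / (1.0 + np.exp(-np.zeros((K * A, H, W))))  # sigmoid of zero logits
    box = np.zeros((4 * A, H, W))                          # linear outputs
    return cls, box

cls, box = head_outputs(32, 32)
print(cls.shape, box.shape)  # (9, 32, 32) (36, 32, 32)
```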
2) Loss function
Focal Loss (FL) is used:
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)    (5)
This addresses the severe imbalance between positive and negative samples in object detection and reduces the weight of easy negative samples during training. In Eq. (5), FL(·) denotes the focal loss; p_t is the value of the target-classification probability discriminant function; α_t is a balancing factor compensating for the uneven ratio of positive to negative samples; and γ is a balancing factor for hard versus easy samples — setting γ > 0 reduces the loss contribution of easily classified samples, making the model focus on hard and misclassified examples. Eq. (6) gives the expression of the probability discriminant function, where y is the predicted output label after the activation function, with a value between 0 and 1, and p is the probability that the target belongs to the annotated class.
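Eq. (5) can be written directly in NumPy. The values α = 0.25 and γ = 2 below are the defaults commonly used with this loss; the patent leaves them as free balancing factors, so treat them as assumptions.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # Eq. (5): FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    # with p_t = p for positives (y = 1) and p_t = 1 - p for negatives.
    # alpha and gamma are assumed default values, not fixed by the patent.
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy negative (p = 0.1, correctly scored low) contributes far less
# than a hard positive scored at the same p = 0.1:
easy_neg = focal_loss(np.array([0.1]), np.array([0]))
hard_pos = focal_loss(np.array([0.1]), np.array([1]))
print(float(easy_neg[0]) < float(hard_pos[0]))  # True
```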
Step 2: target comparison and screening.
To prevent tracking failure caused by updating the reference frame with background information when no target is present in the image or the detection is a false alarm, SURF feature-point matching is performed before the result is sent to the detection correction unit: the target in the first frame of the current video is feature-matched against the detection result. When the number of matched points exceeds a preset value, the detection result is confirmed to be the target currently being tracked and the detection is deemed successful; the detection box is then sent to the detection correction unit for subsequent processing.
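The screening decision reduces to counting descriptor matches and comparing against a threshold. The sketch below uses a generic brute-force nearest-neighbour match with Lowe's ratio test in place of the full SURF pipeline; the ratio value, descriptor dimensionality, and threshold are all assumptions for illustration.

```python
import numpy as np

def count_good_matches(desc_ref, desc_det, ratio=0.75):
    # Brute-force matching of detection descriptors against first-frame
    # target descriptors, keeping a match only when the nearest neighbour
    # is clearly closer than the second nearest (ratio test; 0.75 assumed).
    good = 0
    for d in desc_det:
        dists = np.linalg.norm(desc_ref - d, axis=1)
        idx = np.argsort(dists)
        if dists[idx[0]] < ratio * dists[idx[1]]:
            good += 1
    return good

rng = np.random.default_rng(0)
ref = rng.normal(size=(20, 64))                    # first-frame descriptors
det = ref[:5] + 0.01 * rng.normal(size=(5, 64))    # near-copies should match
matches = count_good_matches(ref, det)
# Detection is accepted only when matches exceed the preset value:
print(matches, matches > 3)
```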
Step 3: online learning for target tracking.
As shown in Figure 3, online learning for target tracking provides real-time prediction of the target centre position during tracking. It comprises two parts: classifier initialisation and the online classification process.
1) Classifier initialisation
For the data-augmented reference frame in the sample library, features are extracted with the feature extraction network; at the same time, a two-dimensional Gaussian ground-truth label y_gt of the same size as the feature map is generated, with its peak at the target centre position of the reference frame.
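Generating such a Gaussian label is straightforward. A minimal sketch, with the bandwidth σ as an assumed parameter (the patent does not fix it):

```python
import numpy as np

def gaussian_label(h, w, cy, cx, sigma=2.0):
    # 2-D Gaussian ground-truth label y_gt of size (h, w), peaked at the
    # target centre (cy, cx); sigma is an assumed bandwidth.
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

y = gaussian_label(9, 9, 4, 4)
print(y.shape, float(y.max()), divmod(int(y.argmax()), 9))  # peak at (4, 4)
```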
The classifier is initialised from these features and labels. A least-squares optimisation minimises the distance between the predicted and true values, and the Gauss-Newton method is used to solve the resulting nonlinear least-squares problem. The basic idea of Gauss-Newton iteration is to approximate the nonlinear regression model with a Taylor series expansion and then, over multiple iterations, repeatedly revise the regression coefficients so that they converge to the optimal coefficients of the nonlinear model, finally minimising the residual sum of squares of the original model.
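The Gauss-Newton iteration described above can be demonstrated on a toy nonlinear least-squares problem. The exponential model and all numbers below are illustrative assumptions, not the classifier's actual objective:

```python
import numpy as np

def gauss_newton(residual, jac, x0, iters=20):
    # Gauss-Newton: linearise the residual via a first-order Taylor
    # expansion and solve the resulting linear least-squares problem at
    # each iteration, driving the residual sum of squares to a minimum.
    x = np.asarray(x0, float)
    for _ in range(iters):
        r, J = residual(x), jac(x)
        x = x - np.linalg.lstsq(J, r, rcond=None)[0]
    return x

# Toy problem: fit y = exp(a*t) to data generated with a = 0.5.
t = np.linspace(0, 1, 10)
y_data = np.exp(0.5 * t)
res = lambda p: np.exp(p[0] * t) - y_data
jac = lambda p: (t * np.exp(p[0] * t)).reshape(-1, 1)
a = gauss_newton(res, jac, [0.0])
print(round(float(a[0]), 3))  # 0.5
```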
2) Online classification process
As shown in Figure 3, given the previous-frame tracking result (x_{t-1}, y_{t-1}, w_{t-1}, h_{t-1}), where (x_{t-1}, y_{t-1}) is the estimated target centre and (w_{t-1}, h_{t-1}) the estimated target width and height, a search region (x_{t-1}, y_{t-1}, k·w_{t-1}, k·h_{t-1}) is generated from the current frame by centring on the previous-frame target centre and expanding the width and height by a specified ratio k. The feature extraction network then extracts the search-region features f_t, and after two fully connected layers a predicted Gaussian response map of the same size as the search region is produced; the position of maximum response is the estimated target centre (x_t, y_t) in the current frame. The online-trained classifier fully accounts for both the tracked target and the background region, and is continuously updated during tracking to estimate the target position.
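The search-region step can be sketched as a centred crop. Border clipping and integer rounding are assumptions here; the text only specifies the centre and the k-scaled width and height.

```python
import numpy as np

def crop_search_region(frame, cx, cy, w, h, k=2.0):
    # Crop the current frame around the previous target centre (cx, cy),
    # with width and height scaled by ratio k (k = 2 is an assumed value);
    # the crop is clipped to the image borders.
    H, W = frame.shape[:2]
    sw, sh = int(k * w), int(k * h)
    x0, y0 = max(0, int(cx - sw / 2)), max(0, int(cy - sh / 2))
    x1, y1 = min(W, x0 + sw), min(H, y0 + sh)
    return frame[y0:y1, x0:x1]

frame = np.zeros((480, 640), np.uint8)
patch = crop_search_region(frame, 320, 240, 40, 30)
print(patch.shape)  # (60, 80)
```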
Step 4: dynamic construction of the classifier training sample library.
Since the motion of low-slow-small flying targets is relatively complex, and both target scale and target appearance change considerably during tracking, using only the first frame of the video as the supervision frame clearly cannot adapt to the target's dynamic changes in real time. The following two measures are therefore taken to further alleviate this problem:
First, the reference-frame update interval is set to T. Whenever t is divisible by T, the target detection unit is invoked to update the reference frame, all outdated samples in the sample library are cleared, and the classifier is re-initialised with the updated reference frame; as tracking proceeds, newly generated samples are added to the library in turn. The aim is to keep the target features in the sample library highly similar to the sample currently being tracked, which aids accurate estimation of the target centre position.
Second, when the confidence of the tracking box's predicted position falls below a set threshold, target tracking in the current frame is deemed to have failed: the decision control unit notifies the detection correction unit that the reference frame must be re-initialised. After the detection correction unit receives the detection-box information from the target detection unit, that information is sent to the reference-frame initialisation unit for data augmentation and related operations, and is finally placed in the dynamically constructed sample library.
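The two update triggers combine into a single decision. A minimal sketch, where the interval T = 20 and the confidence threshold 0.5 are assumed values (the patent leaves both as settable parameters):

```python
def need_reference_update(t, confidence, T=20, conf_thresh=0.5):
    # The reference frame is refreshed either periodically (t divisible
    # by T) or when the predicted-position confidence drops below the
    # threshold, signalling a tracking failure in the current frame.
    return t % T == 0 or confidence < conf_thresh

print(need_reference_update(40, 0.9),   # periodic update -> True
      need_reference_update(7, 0.3),    # low confidence  -> True
      need_reference_update(7, 0.9))    # neither         -> False
```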
Step 5: target tracking position refinement.
As shown in Figure 4, the main function of target tracking position refinement is to determine the final position and scale of the current-frame tracking box from the target centre predicted by the classifier and the target width and height of the previous frame. It comprises two parts: a feature extraction network and a similarity evaluation network.
1) Feature extraction network
Feature extraction again uses the ResNet-18 network. To make full use of historical information, balance retaining previous template information against updating with the current reference frame, provide the neural network with features combining the target's current and historical states, and improve tracking stability, search-region features are extracted from three frames: the reference frame, the current frame, and the image frame at the midpoint between them. Each is fed into the precise region-of-interest pooling layer, for the similarity evaluation network to compute the predicted-position confidence.
2) Similarity evaluation network
The core of the similarity evaluation network is the precise region-of-interest pooling layer, whose input consists of two parts. The first is the image feature map extracted by the network, where (i, j) are coordinates on the feature map and w_{i,j} is the weight at position (i, j); bilinear interpolation with interpolation coefficient IC,
IC(x,y,i,j)=max(0,1-|x-i|)×max(0,1-|y-j|) (8)IC(x, y, i, j)=max(0,1-|x-i|)×max(0,1-|y-j|) (8)
maps the discrete feature map into a continuous space.
The second is the top-left and bottom-right corner coordinates of the rectangular box, (x_1, y_1) and (x_2, y_2). Precise region-of-interest pooling is then performed on the continuous-space feature map using these box coordinates,
preserving the target features in the image to the greatest extent and preparing for further comparison of the similarity between the reference target and historical-frame targets. After the features from the precise region-of-interest pooling layer are obtained, the three features from the reference frame, intermediate frame, and current frame are concatenated and fed into a fully connected layer, which outputs the final position confidence. The similarity between candidate targets and historical targets is compared, and the most similar candidate is taken as the tracking result.
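Eq. (8) and the continuous mapping it induces can be written directly. This sketch shows only the interpolation step that maps the discrete map w_{i,j} to a continuous feature f(x, y); the subsequent integration over the box that precise RoI pooling performs is omitted.

```python
def IC(x, y, i, j):
    # Interpolation coefficient from Eq. (8):
    # IC(x, y, i, j) = max(0, 1 - |x - i|) * max(0, 1 - |y - j|)
    return max(0.0, 1 - abs(x - i)) * max(0.0, 1 - abs(y - j))

def continuous_feature(w, x, y):
    # Continuous-space feature value f(x, y) = sum_{i,j} IC(x, y, i, j) * w[i][j],
    # mapping the discrete feature map w onto a continuous domain; only the
    # (at most) four cells adjacent to (x, y) have nonzero coefficients.
    return sum(IC(x, y, i, j) * w[i][j]
               for i in range(len(w)) for j in range(len(w[0])))

# Sampling midway between four cells averages them:
print(continuous_feature([[0, 1], [2, 3]], 0.5, 0.5))  # 1.5
```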
Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to help readers understand the method of implementing the present invention, and it should be understood that the protection scope of the present invention is not limited to such specific statements and embodiments. Those skilled in the art can make various other specific modifications and combinations that do not depart from the essence of the present invention based on the technical teachings disclosed herein, and such modifications and combinations still fall within the protection scope of the present invention.
Claims (2)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010309617.7A CN111508002B (en) | 2020-04-20 | 2020-04-20 | Small-sized low-flying target visual detection tracking system and method thereof |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010309617.7A CN111508002B (en) | 2020-04-20 | 2020-04-20 | Small-sized low-flying target visual detection tracking system and method thereof |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111508002A true CN111508002A (en) | 2020-08-07 |
| CN111508002B CN111508002B (en) | 2020-12-25 |
Family
ID=71869437
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010309617.7A Active CN111508002B (en) | 2020-04-20 | 2020-04-20 | Small-sized low-flying target visual detection tracking system and method thereof |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111508002B (en) |
Cited By (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112489081A (en) * | 2020-11-30 | 2021-03-12 | 北京航空航天大学 | Visual target tracking method and device |
| CN112633162A (en) * | 2020-12-22 | 2021-04-09 | 重庆大学 | Rapid pedestrian detection and tracking method suitable for expressway outfield shielding condition |
| CN112949480A (en) * | 2021-03-01 | 2021-06-11 | 浙江大学 | Rail elastic strip detection method based on YOLOV3 algorithm |
| CN113012203A (en) * | 2021-04-15 | 2021-06-22 | 南京莱斯电子设备有限公司 | High-precision multi-target tracking method under complex background |
| CN113449680A (en) * | 2021-07-15 | 2021-09-28 | 北京理工大学 | Knowledge distillation-based multimode small target detection method |
| CN113658222A (en) * | 2021-08-02 | 2021-11-16 | 上海影谱科技有限公司 | Vehicle detection tracking method and device |
| CN113724290A (en) * | 2021-07-22 | 2021-11-30 | 西北工业大学 | Multi-level template self-adaptive matching target tracking method for infrared image |
| CN114066936A (en) * | 2021-11-06 | 2022-02-18 | 中国电子科技集团公司第五十四研究所 | Target reliability tracking method in small target capturing process |
| CN114241008A (en) * | 2021-12-21 | 2022-03-25 | 北京航空航天大学 | Long-time region tracking method adaptive to scene and target change |
| CN114897820A (en) * | 2022-05-09 | 2022-08-12 | 珠海格力电器股份有限公司 | Visual detection method, device, terminal and storage medium |
| WO2022178833A1 (en) * | 2021-02-26 | 2022-09-01 | 京东方科技集团股份有限公司 | Target detection network training method, target detection method, and apparatus |
| CN115267762A (en) * | 2022-08-03 | 2022-11-01 | 电子科技大学重庆微电子产业技术研究院 | Low-altitude slow-speed small target tracking method integrating millimeter wave radar and visual sensor |
| CN115335860A (en) * | 2022-06-07 | 2022-11-11 | 香港应用科技研究院有限公司 | Method, apparatus and system for detecting and tracking objects in captured video using convolutional neural networks |
| CN116596958A (en) * | 2023-07-18 | 2023-08-15 | 四川迪晟新达类脑智能技术有限公司 | A target tracking method and device based on online sample augmentation |
| CN116862948A (en) * | 2023-06-09 | 2023-10-10 | 武汉大学 | Unmanned aerial vehicle target tracking prediction method and system |
| CN117058189A (en) * | 2023-07-05 | 2023-11-14 | 唐山学院 | A video tracking method and system based on feature self-learning |
| CN117292306A (en) * | 2023-11-27 | 2023-12-26 | 四川迪晟新达类脑智能技术有限公司 | A vehicle target detection optimization method and device for edge devices |
| CN117576164A (en) * | 2023-12-14 | 2024-02-20 | 中国人民解放军海军航空大学 | Remote sensing video sea and land moving target tracking method based on joint feature learning |
| CN118962659A (en) * | 2024-07-31 | 2024-11-15 | 华南理工大学 | Slow target detection and tracking method, system and storage medium based on wireless perception |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7379652B2 (en) * | 2005-01-14 | 2008-05-27 | Montana State University | Method and apparatus for detecting optical spectral properties using optical probe beams with multiple sidebands |
| JP2014059710A (en) * | 2012-09-18 | 2014-04-03 | Toshiba Corp | Object detection device and object detection method |
| CN107862705A (en) * | 2017-11-21 | 2018-03-30 | 重庆邮电大学 | A kind of unmanned plane small target detecting method based on motion feature and deep learning feature |
| CN108154118A (en) * | 2017-12-25 | 2018-06-12 | 北京航空航天大学 | A kind of target detection system and method based on adaptive combined filter with multistage detection |
| CN108230367A (en) * | 2017-12-21 | 2018-06-29 | 西安电子科技大学 | A kind of quick method for tracking and positioning to set objective in greyscale video |
| CN109584269A (en) * | 2018-10-17 | 2019-04-05 | 龙马智芯(珠海横琴)科技有限公司 | A kind of method for tracking target |
| CN110363789A (en) * | 2019-06-25 | 2019-10-22 | 电子科技大学 | A long-term visual tracking method for practical engineering applications |
| CN110533691A (en) * | 2019-08-15 | 2019-12-03 | 合肥工业大学 | Method for tracking target, equipment and storage medium based on multi-categorizer |
| CN110717934A (en) * | 2019-10-17 | 2020-01-21 | 湖南大学 | An anti-occlusion target tracking method based on STRCF |
- 2020-04-20 CN CN202010309617.7A patent/CN111508002B/en active Active
Non-Patent Citations (6)
| Title |
|---|
| BORUI JIANG 等: "Acquisition of Localization Confidence for Accurate Object Detection", 《HTTPS://ARXIV.ORG/ABS/1807.11590》 * |
| FAWEI YANG 等: "DDMA MIMO radar system for low, slow, and small target detection", 《THE JOURNAL OF ENGINEERING》 * |
| NUSSLER, D 等: "Detection of unmanned aerial vehicles (UAV) in urban environments", 《CONFERENCE ON EMERGING IMAGING AND SENSING TECHNOLOGIES FOR SECURITY AND DEFENCE III; AND UNMANNED SENSORS, SYSTEMS, AND COUNTERMEASURES》 * |
| TSUNG-YI LIN: "Focal Loss for Dense Object Detection", 《HTTPS://ARXIV.ORG/ABS/1708.02002》 * |
| 吴言枫 等: "复杂动背景下的"低小慢"目标检测技术", 《中国光学》 * |
| 李爱师: "基于相关滤波的目标跟踪算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (31)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112489081A (en) * | 2020-11-30 | 2021-03-12 | 北京航空航天大学 | Visual target tracking method and device |
| CN112633162A (en) * | 2020-12-22 | 2021-04-09 | 重庆大学 | Rapid pedestrian detection and tracking method suitable for expressway outfield shielding condition |
| CN112633162B (en) * | 2020-12-22 | 2024-03-22 | 重庆大学 | Pedestrian rapid detection and tracking method suitable for expressway external field shielding condition |
| US12002254B2 (en) | 2021-02-26 | 2024-06-04 | Boe Technology Group Co., Ltd. | Method and apparatus of training object detection network and object detection method and apparatus |
| WO2022178833A1 (en) * | 2021-02-26 | 2022-09-01 | 京东方科技集团股份有限公司 | Target detection network training method, target detection method, and apparatus |
| CN112949480A (en) * | 2021-03-01 | 2021-06-11 | 浙江大学 | Rail elastic strip detection method based on YOLOV3 algorithm |
| CN113012203A (en) * | 2021-04-15 | 2021-06-22 | 南京莱斯电子设备有限公司 | High-precision multi-target tracking method under complex background |
| CN113012203B (en) * | 2021-04-15 | 2023-10-20 | 南京莱斯电子设备有限公司 | A high-precision multi-target tracking method under complex backgrounds |
| CN113449680B (en) * | 2021-07-15 | 2022-08-30 | 北京理工大学 | Knowledge distillation-based multimode small target detection method |
| CN113449680A (en) * | 2021-07-15 | 2021-09-28 | 北京理工大学 | Knowledge distillation-based multimode small target detection method |
| CN113724290A (en) * | 2021-07-22 | 2021-11-30 | 西北工业大学 | Multi-level template self-adaptive matching target tracking method for infrared image |
| CN113724290B (en) * | 2021-07-22 | 2024-03-05 | 西北工业大学 | A multi-level template adaptive matching target tracking method for infrared images |
| CN113658222A (en) * | 2021-08-02 | 2021-11-16 | 上海影谱科技有限公司 | Vehicle detection tracking method and device |
| CN114066936A (en) * | 2021-11-06 | 2022-02-18 | 中国电子科技集团公司第五十四研究所 | Target reliability tracking method in small target capturing process |
| CN114066936B (en) * | 2021-11-06 | 2023-09-12 | 中国电子科技集团公司第五十四研究所 | A target reliability tracking method during small target acquisition process |
| CN114241008B (en) * | 2021-12-21 | 2023-03-07 | 北京航空航天大学 | A Long-term Region Tracking Method Adapting to Scene and Object Variations |
| CN114241008A (en) * | 2021-12-21 | 2022-03-25 | 北京航空航天大学 | Long-time region tracking method adaptive to scene and target change |
| CN114897820B (en) * | 2022-05-09 | 2025-09-09 | 珠海格力电器股份有限公司 | Visual detection method, visual detection device, terminal and storage medium |
| CN114897820A (en) * | 2022-05-09 | 2022-08-12 | 珠海格力电器股份有限公司 | Visual detection method, device, terminal and storage medium |
| CN115335860A (en) * | 2022-06-07 | 2022-11-11 | 香港应用科技研究院有限公司 | Method, apparatus and system for detecting and tracking objects in captured video using convolutional neural networks |
| CN115267762A (en) * | 2022-08-03 | 2022-11-01 | 电子科技大学重庆微电子产业技术研究院 | Low-altitude slow-speed small target tracking method integrating millimeter wave radar and visual sensor |
| CN116862948A (en) * | 2023-06-09 | 2023-10-10 | 武汉大学 | Unmanned aerial vehicle target tracking prediction method and system |
| CN116862948B (en) * | 2023-06-09 | 2025-10-31 | 武汉大学 | Unmanned aerial vehicle target tracking prediction method and system |
| CN117058189A (en) * | 2023-07-05 | 2023-11-14 | 唐山学院 | A video tracking method and system based on feature self-learning |
| CN116596958B (en) * | 2023-07-18 | 2023-10-10 | 四川迪晟新达类脑智能技术有限公司 | A target tracking method and device based on online sample augmentation |
| CN116596958A (en) * | 2023-07-18 | 2023-08-15 | 四川迪晟新达类脑智能技术有限公司 | A target tracking method and device based on online sample augmentation |
| CN117292306A (en) * | 2023-11-27 | 2023-12-26 | 四川迪晟新达类脑智能技术有限公司 | A vehicle target detection optimization method and device for edge devices |
| CN117576164A (en) * | 2023-12-14 | 2024-02-20 | 中国人民解放军海军航空大学 | Remote sensing video sea and land moving target tracking method based on joint feature learning |
| CN117576164B (en) * | 2023-12-14 | 2024-05-03 | 中国人民解放军海军航空大学 | Tracking method of moving targets on land and sea in remote sensing video based on joint feature learning |
| CN118962659B (en) * | 2024-07-31 | 2025-05-13 | 华南理工大学 | Slow target detection and tracking method, system and storage medium based on wireless perception |
| CN118962659A (en) * | 2024-07-31 | 2024-11-15 | 华南理工大学 | Slow target detection and tracking method, system and storage medium based on wireless perception |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111508002B (en) | 2020-12-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111508002B (en) | Small-sized low-flying target visual detection tracking system and method thereof | |
| Mukhoti et al. | Evaluating bayesian deep learning methods for semantic segmentation | |
| Yang et al. | Real-time face detection based on YOLO | |
| Wu et al. | Rapid target detection in high resolution remote sensing images using YOLO model | |
| CN110287826B (en) | Video target detection method based on attention mechanism | |
| CN114677554A (en) | Statistical filtering infrared small target detection tracking method based on YOLOv5 and Deepsort | |
| CN111626128A (en) | A pedestrian detection method in orchard environment based on improved YOLOv3 | |
| CN107609525A (en) | Remote Sensing Target detection method based on Pruning strategy structure convolutional neural networks | |
| CN109063549B (en) | A high-resolution aerial video moving target detection method based on deep neural network | |
| CN110569754A (en) | Image target detection method, device, storage medium and equipment | |
| CN114169425A (en) | Training target tracking model and target tracking method and device | |
| CN107633226A (en) | A kind of human action Tracking Recognition method and system | |
| Mujtaba et al. | UAV-Based road traffic monitoring via FCN segmentation and deepsort for smart cities | |
| CN110009060B (en) | A Robust Long-Term Tracking Method Based on Correlation Filtering and Object Detection | |
| WO2009152509A1 (en) | Method and system for crowd segmentation | |
| CN113379789B (en) | Moving target tracking method in complex environment | |
| CN111126278A (en) | A method for optimizing and accelerating object detection model for few-category scenes | |
| CN109323697A (en) | A Method for Rapid Particle Convergence When Indoor Robot Starts at Any Point | |
| CN110009663A (en) | A target tracking method, apparatus, device and computer-readable storage medium | |
| CN111354022A (en) | Target tracking method and system based on kernel correlation filtering | |
| CN109242019A (en) | A kind of water surface optics Small object quickly detects and tracking | |
| CN120427007A (en) | A dynamic visual SLAM method based on extended Bayesian model | |
| CN116580066B (en) | Pedestrian target tracking method under low frame rate scene and readable storage medium | |
| CN114742864A (en) | Belt deviation detection method and device | |
| CN111340765B (en) | A thermal infrared image reflection detection method based on background separation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |