CN114494736A - A method for outdoor location re-identification based on saliency region detection - Google Patents
- Publication number
- CN114494736A (application CN202210104480.0A)
- Authority
- CN
- China
- Prior art keywords
- feature
- image
- area
- visual
- feature map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/22 — Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- G06F18/23213 — Pattern recognition; Clustering techniques; Non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
- G06F40/216 — Handling natural language data; Natural language analysis; Parsing using statistical methods
- G06N3/045 — Neural networks; Architecture; Combinations of networks
- G06N3/08 — Neural networks; Learning methods
Abstract
Description
TECHNICAL FIELD
The present invention relates to the technical fields of computer vision and deep learning, and in particular to an outdoor place re-identification method based on salient region detection.
BACKGROUND
For an autonomously navigating robot, localization and mapping are the primary goals. For robots that rely on vision sensors, the localization problem is addressed by visual place recognition: given a scene image depicting a specified place, the robot must decide whether that place has been visited before, which requires similarity matching against the keyframes of the path trajectories stored in a database. Because scene images commonly suffer from interference such as illumination changes, viewpoint changes, and pedestrian occlusion, traditional feature-point extraction methods rely too heavily on hand-crafted features; although they perform well in stable indoor environments, their results degrade under the outdoor disturbances described above.
How to solve the above technical problems is the task addressed by the present invention.
SUMMARY OF THE INVENTION
The purpose of the present invention is to provide an outdoor place re-identification method based on salient region detection that extracts global image features more reliably under the various disturbances of outdoor scenes: a visual bag-of-words model built from deep-learning features fuses the local features of salient regions into a global feature, improving matching accuracy.
The inventive idea of the present invention is as follows. The overall pipeline is divided into two parts: the first detects salient regions; the second converts the region features into more robust bag-of-words vectors to obtain a global feature, which is then used for image similarity matching. The invention first analyses the features extracted from the image by a convolutional neural network and detects the salient regions of the image by judging the mean activation value of each region, then extracts the features of those salient regions. In addition, a collection of representative outdoor scene images is gathered and used to train a visual bag-of-words model over deep-learning features; the local features extracted from the salient regions are aggregated into a global feature through this model. The resulting feature is more robust to disturbances such as changes of viewpoint.
The present invention is realized by the following measures: an outdoor place re-identification method based on salient region detection, comprising the following steps.
Step 1: Extraction of the SE-ResNet feature map
In a convolutional neural network, much of the work of the convolution operation goes into enlarging the receptive field, fusing features spatially, or extracting multi-scale spatial information through multiple channels. A conventional convolution fuses all channels of the input feature map by default, whereas SE-Net attends to the relationships between channels, allowing the model to learn the importance of each channel's features automatically. The SE-Net architecture is shown in Figure 1; most current mainstream networks are constructed by repeatedly stacking these two kinds of similar units, so the SE module can be embedded into almost any network structure. Comparative experiments show that embedding SE-Net into ResNet works particularly well; therefore, for feature-map extraction, the present invention applies the SE-ResNet model to the image. For an input image I ∈ R^(W′×H′×3), the convolution yields a feature map F ∈ R^(W×H×C).
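A minimal PyTorch sketch of the SE unit described above (channel squeeze-and-excitation); the reduction ratio of 16 and the layer sizes are common defaults and an assumption here, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation unit: learns a weight in (0, 1) per channel
    from globally pooled features and rescales the feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pool -> (B, C)
        w = self.fc(w).view(b, c, 1, 1)  # excitation: per-channel weights
        return x * w                     # reweight channels of the feature map
```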
Step 2: Detection of salient regions
Analysing the characteristics of outdoor scene images reveals that whether two images depict the same outdoor place can be judged from objects such as landmark buildings and road signs. In the feature map F obtained after convolution, regions with high activation values tend to be particularly salient regions of the image, but the size of these objects in the image is not fixed. To accommodate salient regions of different sizes, the present invention determines their locations by detecting connected regions of non-zero values. The following operations are therefore performed on the feature map extracted in Step 1.
(1) Binarizing the feature map
After an image passes through the convolutional layers and activation functions of a convolutional neural network, its spatial texture features are preserved, and the activation values of the feature map reflect the texture strength of the corresponding image regions. To filter out the salient regions, the regions to be examined are first delimited in the feature map of each channel by binarizing it: regions with large activation values are marked 1, indicating regions worth attention, while regions with small activation values are marked 0, indicating low-texture regions not worth attention. During binarization, the present invention uses a threshold δ to decide whether each position is set to 0 or 1.
The binarized feature map F_B is obtained by the following formula, applied per channel c at each position (i, j):

F_B^c(i, j) = 1 if F^c(i, j) > δ, and F_B^c(i, j) = 0 otherwise.
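A one-line sketch of this thresholding, assuming the feature map is a PyTorch tensor of shape (C, W, H); the choice of δ is left open by the patent:

```python
import torch

def binarize_feature_map(F: torch.Tensor, delta: float) -> torch.Tensor:
    """Threshold a (C, W, H) feature map at delta: activations above delta
    become 1 (texture-rich, worth attention), the rest 0."""
    return (F > delta).to(F.dtype)
```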
(2) Dividing regions of interest (ROIs)
It is assumed that salient regions should be independent of one another, or at least non-overlapping, so each individual image region is represented by a connected region of non-zero values.
In the binary feature map F_B, for every position whose value is 1, the values of its 8 neighbouring positions are examined; neighbours that are also 1 are merged into the same region, and the search continues from the remaining elements of that region until every element of the region has been visited. This finally yields multiple regions of interest (ROIs); each channel contributes a varying number of ROIs, for a total of N ROIs. A sketch of this 8-connected labelling is given below.
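The labelling can be sketched with SciPy's `ndimage.label`; using SciPy is an assumed implementation choice, not one named by the patent:

```python
import numpy as np
from scipy import ndimage

def extract_rois(FB: np.ndarray):
    """Find 8-connected regions of ones per channel of the binarized
    feature map FB, shape (C, W, H). Returns (channel, bounding_slices)
    pairs, one per ROI; N is the total count over all channels."""
    eight = np.ones((3, 3), dtype=int)  # 8-neighbourhood structuring element
    rois = []
    for c in range(FB.shape[0]):
        labels, _ = ndimage.label(FB[c], structure=eight)
        for sl in ndimage.find_objects(labels):  # bounding box of each region
            rois.append((c, sl))
    return rois
```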
(3) Determining the locations of salient regions
For the regions of the feature map corresponding to the N ROIs, the mean activation value a_r of each region r is computed as

a_r = (1 / (i_r · j_r)) · Σ_{i=1}^{i_r} Σ_{j=1}^{j_r} F_r(i, j)

where i_r × j_r is the spatial extent of region r and F_r(i, j) its activation values.
The ROIs are then sorted by the value of a_r from high to low, and the m highest-scoring regions are selected as the final salient regions S = {s_i | i ∈ {1, ..., m}}; a sketch of this selection follows.
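A sketch of the ranking step, reusing the `(channel, slices)` ROI format from the previous sketch:

```python
import numpy as np

def top_m_salient_regions(F: np.ndarray, rois, m: int):
    """Score each ROI by the mean activation a_r of F over the region and
    keep the m highest-scoring ones as the salient regions S."""
    scored = sorted(rois, key=lambda roi: F[roi[0]][roi[1]].mean(), reverse=True)
    return scored[:m]
```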
(4) Extracting local features
For a selected salient region s_i with spatial extent W_s × H_s, where 0 < W_s, H_s < min(W, H), the position of s_i is located on the feature map F, and across all channels of that region a local feature D_L of dimension W_s × H_s × C is obtained. Finally, sum pooling yields the pooled local feature D_L ∈ R^(1×1×C):

D_L^c = Σ_{i=1}^{W_s} Σ_{j=1}^{H_s} F^c(i, j)

where D_L^c is the value of the c-th channel of the local feature D_L.
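A sketch of the sum-pooling step, again reusing the ROI format from above:

```python
import numpy as np

def region_descriptor(F: np.ndarray, roi) -> np.ndarray:
    """Sum-pool one salient region over all C channels of F (C, W, H),
    giving the C-dimensional local feature D_L."""
    _, sl = roi                                  # spatial extent W_s x H_s
    return F[:, sl[0], sl[1]].sum(axis=(1, 2))   # D_L^c = sum over the region
```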
Step 3: Training the visual bag-of-words model
A conventional visual bag-of-words model is trained on SIFT features extracted from images. The present invention instead uses the network layers of SE-ResNet to generate feature descriptors, preserving both convolutional information and local features. These descriptors outperform SIFT-like detectors, particularly when SIFT produces many outliers or cannot match a sufficient number of feature points.
In Step 3 of the present invention, training the visual bag-of-words model consists of three parts: image feature extraction, visual vocabulary tree generation, and visual word feature construction. The first part, image feature extraction, was already covered in Steps 1 and 2, so Step 3 mainly describes vocabulary tree generation and visual word feature construction. The main procedure is as follows.
(1) Collecting features for building the vocabulary tree
For vocabulary tree generation, the present invention uses the k-means method. As one of the most widely used clustering methods, k-means is intuitive and is commonly applied to cluster local image features. Before clustering, the present invention collects a certain number of representative outdoor scene images and extracts features from each image following Steps 1 and 2; m salient regions are selected per image, yielding the local features of all salient regions.
(2) Building the vocabulary tree T with k-means
First the root node is built: k-means clusters all features once, producing k classes and their class centers, such that intra-class similarity is high and inter-class similarity is low. The class centers become the children of the root node, completing the first level of the vocabulary tree. k-means clustering is then applied to the class at each first-level node, producing k classes whose centers become that node's children, and so on, until all features are assigned to leaf nodes; the vocabulary tree T is then complete. A recursive sketch follows.
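A sketch of the hierarchical k-means construction; the branching factor k, the depth limit, and the minimum leaf size are assumed hyperparameters, and scikit-learn's `KMeans` stands in for any k-means implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocab_tree(features: np.ndarray, k: int, max_depth: int, min_size: int = 50):
    """Recursively cluster features into a k-ary vocabulary tree; nodes that
    are too small or too deep become leaves (visual words)."""
    node = {"center": features.mean(axis=0), "children": []}
    if max_depth == 0 or len(features) < max(min_size, k):
        return node                      # leaf node = one visual word
    km = KMeans(n_clusters=k, n_init=10).fit(features)
    for i in range(k):
        child = build_vocab_tree(features[km.labels_ == i], k, max_depth - 1, min_size)
        child["center"] = km.cluster_centers_[i]  # class center becomes the child node
        node["children"].append(child)
    return node
```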
(3) Visual word feature vector V_bow
Each leaf node of the vocabulary tree represents a visual word. Assuming the vocabulary tree contains v visual words, the number of times s that each word of the vocabulary appears in the image is counted, representing the image as a vector V_bow of dimension v.
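A sketch of building V_bow; for brevity it assigns each descriptor to the nearest leaf center by a flat nearest-neighbour search rather than descending the tree, which is an implementation shortcut, not the patent's prescribed traversal:

```python
import numpy as np
from collections import Counter

def collect_leaves(node, out=None):
    """Gather the leaf centers of the vocabulary tree in a fixed order;
    each leaf is one visual word."""
    out = [] if out is None else out
    if not node["children"]:
        out.append(node["center"])
    for child in node["children"]:
        collect_leaves(child, out)
    return out

def bow_vector(descriptors, leaf_centers):
    """Count how often each visual word occurs among an image's region
    descriptors, giving the v-dimensional vector V_bow."""
    centers = np.stack(leaf_centers)  # (v, C)
    words = [int(np.argmin(np.linalg.norm(centers - d, axis=1))) for d in descriptors]
    vbow = np.zeros(len(leaf_centers))
    for w, s in Counter(words).items():
        vbow[w] = s  # s = occurrences of word w in this image
    return vbow
```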
(4) Weighted feature vector V_W
When a visual word appears in many or all images of the database, words without real discriminative meaning accumulate large counts, so simply counting how many times each vocabulary word occurs in the image is not enough: because the words differ in importance, the importance of each word must be computed, i.e. the weight of each visual word in the vocabulary. To solve this problem, the present invention uses the TF-IDF (term frequency–inverse document frequency) reweighting method, where TF is the frequency with which a visual word occurs and IDF is the inverse document frequency: the fewer images contain a visual word, the larger its IDF value and the stronger its discriminative power. A larger TF-IDF value indicates that the word is more important to the image. The computation is:

TF_w = s / v,  IDF_w = log(P / P_w),  TFIDF_w = TF_w × IDF_w

where s is the number of occurrences of a visual word in the image, v is the total number of visual-word occurrences (so TF_w is the frequency of word w among all words), P is the total number of images, and P_w is the number of images in which word w appears.
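A sketch of the reweighting, following the variable names of the formula above:

```python
import numpy as np

def tfidf_weight(vbow: np.ndarray, doc_freq: np.ndarray, num_images: int) -> np.ndarray:
    """TF-IDF reweighting of a raw bag-of-words histogram.

    vbow[w]     -- s, occurrences of visual word w in this image
    doc_freq[w] -- P_w, number of database images containing word w
    num_images  -- P, total number of database images
    """
    tf = vbow / max(vbow.sum(), 1)                      # TF_w = s / v
    idf = np.log(num_images / np.maximum(doc_freq, 1))  # IDF_w = log(P / P_w)
    return tf * idf                                     # V_W, the weighted vector
```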
Step 4: Similarity matching between images
For two images I_a and I_b, the global features V_a^W and V_b^W are obtained through the above steps. The present invention measures the distance between the two global feature vectors with the cosine similarity formula:

sim(I_a, I_b) = (V_a^W · V_b^W) / (‖V_a^W‖ · ‖V_b^W‖)
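The similarity measure as a short sketch:

```python
import numpy as np

def cosine_similarity(va: np.ndarray, vb: np.ndarray) -> float:
    """Cosine similarity of two global feature vectors; higher means the
    two images are more likely to depict the same place."""
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom > 0 else 0.0
```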
Compared with the prior art, the beneficial effects of the present invention are:
1. The present invention proposes a more robust feature extraction method that detects the salient regions in an image, effectively resisting interference from changes of scene viewpoint; in outdoor scenes it extracts more robust global image features and reduces mismatches.
2. The proposed method does not require large amounts of data for training the parameters of the convolutional neural network, saving computational resources and time.
3. Deep-learning features replace traditional features and are combined into the bag-of-words model, improving the accuracy of place re-identification while keeping the feature dimensionality unchanged.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings are provided for a further understanding of the present invention and constitute a part of the specification; together with the embodiments, they serve to explain the present invention and do not limit it.
Figure 1 is a flow chart of the SE-Net network in the present invention.
Figure 2 is a schematic diagram of the overall flow of the present invention.
Figure 3 shows the experimental results of the embodiment provided by the present invention.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here serve only to explain the present invention and do not limit it.
Embodiment 1
Referring to Figures 1 to 3, the present invention proposes an outdoor place re-identification method based on salient region detection. The visual place recognition problem resembles an image retrieval problem: for an input image, the image with the highest similarity to it is retrieved from an image database. In this embodiment, the experiments are run in PyTorch, with network training and testing on an NVIDIA 2070s GPU. During the experiments, the SE-ResNet model is pre-trained and the visual bag-of-words model is generated on the public Place365 dataset. On the Tokyo24/7 dataset, the accuracy of the proposed model is compared with that of the NetVLAD method and the SIFT feature matching method.
1. Model pre-training
Place365 is a dataset for training scene recognition models and contains a wide variety of scenes. Pre-training the SE-ResNet feature extraction network on this dataset makes the resulting model more sensitive to important outdoor information such as street signs and buildings, so the extracted features are more reliable.
2. Generation of the visual bag-of-words model
In the present invention, a certain number of images must be selected for feature extraction to form the dictionary tree of the visual bag-of-words model. In this embodiment, 2000 representative outdoor scene images are selected, all from the Place365 dataset. After the images are selected, the pre-trained SE-ResNet network extracts features from all of them; the features of each image are the output of the last convolutional layer of SE-ResNet. From these convolutional features, salient regions are extracted, giving a fixed number of local features: in this embodiment, the 10 regions with the highest activations are selected per image as its local features, yielding 20000 feature vectors in total. After all feature vectors are normalized, the k-means clustering algorithm is run recursively to build the dictionary tree, and weights are assigned to all leaf nodes according to the TF-IDF formula.
Comparative experiments:
This embodiment verifies the model accuracy on the Tokyo24/7 dataset, which contains 75,000 database images for retrieval and 315 query images, all taken with mobile phone cameras. The query images were taken during the day, in the evening, and at night, whereas the database images were taken only during the day; the illumination therefore differs greatly between the query images and the images in the retrieval database, making the comparison very difficult.
As the criterion for a correct query, this embodiment considers a query successful if, among the top n most similar retrieved images, at least one lies within 5 meters of the query image's position; the distance is computed from the GPS information supplied with each image in the dataset. The percentage of correctly identified queries (recall) is then plotted against different values of n; a sketch of this metric follows.
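A sketch of the recall@n evaluation; using the haversine formula to turn GPS pairs into metres is an assumption, since the patent only says the distance is obtained from GPS:

```python
import math

def dist_m(p, q):
    """Haversine distance in metres between two (lat, lon) pairs
    (an assumed helper, not named by the patent)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

def recall_at_n(query_gps, retrieved_gps, n, radius_m=5.0):
    """Fraction of queries whose top-n retrieved images include at least
    one image within radius_m metres of the query position."""
    hits = sum(
        any(dist_m(q, r) <= radius_m for r in results[:n])
        for q, results in zip(query_gps, retrieved_gps)
    )
    return hits / len(query_gps)
```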
In this embodiment, the NetVLAD method is first compared against the proposed model: the same query images are input, the same n values are set, the success of each query is recorded, and the recall at each n is computed; the resulting recall curves are shown in Figure 3. The figure shows that the recall of the present invention is higher than that of the NetVLAD method at every value of n.
In addition, this embodiment compares the retrieval accuracy of the traditional feature-point extraction method (SIFT) with the feature extraction method of the present invention. The same training image set is used, and SIFT features are extracted from all images; a visual bag-of-words model is then built over all SIFT features, yielding a SIFT dictionary tree T_SIFT. SIFT features extracted from a query image are converted through the dictionary tree into a visual bag-of-words vector and compared against all images in the database, and the top n most similar images are retrieved. The percentage of correctly identified queries (recall) is recorded for different values of n. Figure 3 shows that the features extracted by the convolutional neural network of the present invention perform better for outdoor place re-identification than the SIFT features extracted by the traditional method.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210104480.0A CN114494736B (en) | 2022-01-28 | 2022-01-28 | Outdoor place re-identification method based on salient region detection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114494736A true CN114494736A (en) | 2022-05-13 |
CN114494736B CN114494736B (en) | 2024-09-20 |
Family
ID=81476827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210104480.0A Active CN114494736B (en) | 2022-01-28 | 2022-01-28 | Outdoor place re-identification method based on salient region detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114494736B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150269191A1 (en) * | 2014-03-20 | 2015-09-24 | Beijing University Of Technology | Method for retrieving similar image based on visual saliencies and visual phrases |
CN107357834A (en) * | 2017-06-22 | 2017-11-17 | 浙江工业大学 | Image retrieval method based on visual saliency fusion |
CN112818790A (en) * | 2021-01-25 | 2021-05-18 | 浙江理工大学 | Pedestrian re-identification method based on attention mechanism and space geometric constraint |
Non-Patent Citations (1)
Title |
---|
WANG Lixin; JIANG Jiahe: "Research on image retrieval of salient regions based on deep learning", Applied Science and Technology, No. 06, 13 April 2018 (2018-04-13) *
Also Published As
Publication number | Publication date |
---|---|
CN114494736B (en) | 2024-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Jin Kim et al. | Learned contextual feature reweighting for image geo-localization | |
CN111126360B (en) | Cross-domain pedestrian re-identification method based on unsupervised combined multi-loss model | |
CN107679078B (en) | Bayonet image vehicle rapid retrieval method and system based on deep learning | |
CN111652934B (en) | Positioning method, map construction method, device, equipment and storage medium | |
EP2054855B1 (en) | Automatic classification of objects within images | |
Cummins et al. | Appearance-only SLAM at large scale with FAB-MAP 2.0 | |
Lynen et al. | Placeless place-recognition | |
WO2020125216A1 (en) | Pedestrian re-identification method, device, electronic device and computer-readable storage medium | |
CN110175615B (en) | Model training method, domain-adaptive visual position identification method and device | |
US20120301014A1 (en) | Learning to rank local interest points | |
Lee et al. | Place recognition using straight lines for vision-based SLAM | |
Derpanis et al. | Classification of traffic video based on a spatiotemporal orientation analysis | |
CN104615986B (en) | The method that pedestrian detection is carried out to the video image of scene changes using multi-detector | |
CN110070066A (en) | A kind of video pedestrian based on posture key frame recognition methods and system again | |
CN113988147B (en) | Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device | |
CN113407780B (en) | A target retrieval method, device and storage medium | |
CN109785387A (en) | Winding detection method, device and the robot of robot | |
CN111709317A (en) | A pedestrian re-identification method based on multi-scale features under saliency model | |
CN105320764A (en) | 3D model retrieval method and 3D model retrieval apparatus based on slow increment features | |
Dong et al. | A novel loop closure detection method using line features | |
CN116704490B (en) | License plate recognition method, license plate recognition device and computer equipment | |
CN115063831A (en) | A high-performance pedestrian retrieval and re-identification method and device | |
Lynen et al. | Trajectory-based place-recognition for efficient large scale localization | |
Wu et al. | Variant semiboost for improving human detection in application scenes | |
Han et al. | A novel loop closure detection method with the combination of points and lines based on information entropy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |