CN115357754A - Deep learning-based large-scale short video retrieval method, system and equipment - Google Patents
Deep learning-based large-scale short video retrieval method, system and equipment
- Publication number
- CN115357754A CN115357754A CN202210811333.7A CN202210811333A CN115357754A CN 115357754 A CN115357754 A CN 115357754A CN 202210811333 A CN202210811333 A CN 202210811333A CN 115357754 A CN115357754 A CN 115357754A
- Authority
- CN
- China
- Prior art keywords
- video
- layer
- short video
- short
- hash
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field
The invention belongs to the technical field of video retrieval, and relates to a short video retrieval method, system and device, in particular to a large-scale short video retrieval method, system and device based on deep learning.
Background Art
With the development of Internet technology and smart devices, diverse multimedia data such as text, speech, audio, images and video have emerged on social networks and other information platforms. Short videos in particular are favored by users; they rose with the explosion of online social media and are an important carrier of information dissemination, conveying a great deal of information in a very short time. According to the 2021 Research Report on China Network Audiovisual Development, the market size of the pan-network audiovisual field exceeds 600 billion yuan, with the short video field accounting for 30% of the market share at 205.13 billion yuan. Because of their fragmented and social nature, short videos can better meet users' personalized needs. At the same time, the emergence of all kinds of video media has created a wide variety of video demands. How to find, in a large-scale video database and with limited device resources, videos that are near-duplicates of a user-uploaded video or of interest to the user has become a meaningful topic in the era of big data.
In recent years, many researchers have applied deep learning methods to short video retrieval. A common practice is to manually annotate video data with labels and other information, and to complete the video search task indirectly through text search. Although existing short video retrieval methods have made some progress, two deficiencies remain: 1) short video content and manually annotated text are not fully consistent or in one-to-one correspondence, so manual annotation can hardly describe the video information completely; 2) the video features extracted by machine learning are still very complex, requiring a large amount of storage space and a large amount of computing resources for similarity calculation.
Summary of the Invention
To solve the above technical problems, the present invention proposes a large-scale short video retrieval method, system and device based on deep learning: it learns the semantic information of short video data, generates hash codes with an improved convolutional neural network model, and finally uses similarity calculation to retrieve a given number of short video items.
The technical solution adopted by the method of the present invention is a deep learning-based large-scale short video retrieval method comprising the following steps:
Step 1: Perform key frame extraction and video standardization on the query data and the short videos in the retrieval set;
Step 2: Feed the processed data into the short video semantic feature extraction network and use similarity calculation to obtain the T most similar videos, where T is a preset value;
The short video semantic feature extraction network consists of a feature extraction module and a feature-to-hash-code mapping module, and comprises a first 3×3×3 convolutional layer, a first 1×2×2 pooling layer, a second 3×3×3 convolutional layer, a second 2×2×2 pooling layer, a third 3×3×3 convolutional layer, a third 2×2×2 pooling layer, a fourth 3×3×3 convolutional layer, a fourth 2×2×2 pooling layer, a fifth 3×3×3 convolutional layer, a fifth 2×2×2 pooling layer, a first 1×4096 fully connected layer, a second 1×4096 fully connected layer, and a hash layer;
The first 3×3×3 convolutional layer, first 1×2×2 pooling layer, second 3×3×3 convolutional layer, second 2×2×2 pooling layer, third 3×3×3 convolutional layer, third 2×2×2 pooling layer, fourth 3×3×3 convolutional layer, fourth 2×2×2 pooling layer, fifth 3×3×3 convolutional layer and fifth 2×2×2 pooling layer are connected in sequence and together form the feature extraction module; a short video is input, and the feature extraction module produces the original feature map of the short video;
The first 1×4096 fully connected layer, second 1×4096 fully connected layer and hash layer are connected in sequence and together form the feature-to-hash-code mapping module; the original feature map of the short video is input, and the mapping module produces the hash code of the mapped short video features;
The hash layer is a 1×K fully connected layer activated by a sigmoid function, where K is the preset hash code length.
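The convolution and pooling stack above resembles the C3D family of 3D CNNs. As a sanity check on the stated window sizes, the sketch below traces the (frames, height, width) shape of a 16-frame 112×112 clip through the five stages, assuming stride-1 convolutions with padding 1 (the patent does not state the convolution strides, so that part is an assumption):

```python
def pool3d_shape(shape, window):
    """Floor-divide each of (T, H, W) by the pooling window."""
    return tuple(s // w for s, w in zip(shape, window))

def feature_extractor_shapes(clip_shape=(16, 112, 112)):
    """Trace (T, H, W) through the five conv+pool stages; 3x3x3
    convolutions with stride 1 and padding 1 preserve the shape,
    so only the pooling windows change it."""
    windows = [(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]
    shapes = [clip_shape]
    for w in windows:
        shapes.append(pool3d_shape(shapes[-1], w))
    return shapes

# Shapes after the input and each of the five pooling layers
print(feature_extractor_shapes())
```

The first 1×2×2 pooling window halves only the spatial dimensions, preserving all 16 frames of temporal information for the deeper layers; the later 2×2×2 windows halve time as well.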
The technical solution adopted by the system of the present invention is a deep learning-based large-scale short video retrieval system comprising a preprocessing module and a retrieval module;
The preprocessing module performs key frame extraction and video standardization on the query data and the short videos in the retrieval set;
The retrieval module feeds the processed data into the short video semantic feature extraction network and uses similarity calculation to obtain the T most similar videos, where T is a preset value;
The short video semantic feature extraction network consists overall of a feature extraction module and a feature-to-hash-code mapping module, and comprises a first 3×3×3 convolutional layer, a first 1×2×2 pooling layer, a second 3×3×3 convolutional layer, a second 2×2×2 pooling layer, a third 3×3×3 convolutional layer, a third 2×2×2 pooling layer, a fourth 3×3×3 convolutional layer, a fourth 2×2×2 pooling layer, a fifth 3×3×3 convolutional layer, a fifth 2×2×2 pooling layer, a first 1×4096 fully connected layer, a second 1×4096 fully connected layer, and a hash layer;
The first 3×3×3 convolutional layer, first 1×2×2 pooling layer, second 3×3×3 convolutional layer, second 2×2×2 pooling layer, third 3×3×3 convolutional layer, third 2×2×2 pooling layer, fourth 3×3×3 convolutional layer, fourth 2×2×2 pooling layer, fifth 3×3×3 convolutional layer and fifth 2×2×2 pooling layer are connected in sequence and together form the feature extraction module; a short video is input, and the feature extraction module produces the original feature map of the short video;
The first 1×4096 fully connected layer, second 1×4096 fully connected layer and hash layer are connected in sequence and together form the feature-to-hash-code mapping module; the original feature map of the short video is input, and the mapping module produces the hash code of the mapped short video features;
The hash layer is a 1×K fully connected layer activated by a sigmoid function, where K is the preset hash code length.
The technical solution adopted by the device of the present invention is a deep learning-based large-scale short video retrieval device comprising:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the deep learning-based large-scale short video retrieval method described above.
Compared with the prior art, the method proposed by the present invention not only captures the semantic correlation of videos in the time dimension and learns the relative semantic correlation of deep features, but also greatly reduces the time and space complexity of video retrieval.
Description of Drawings
Fig. 1 is a schematic framework diagram of an embodiment of the present invention.
Fig. 2 is a schematic diagram of the principle of an embodiment of the present invention.
Fig. 3 is a structure diagram of the short video semantic feature extraction network of an embodiment of the present invention.
Fig. 4 is a schematic diagram of the training process of the short video semantic feature extraction network of an embodiment of the present invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the examples described here serve only to illustrate and explain the present invention, and are not intended to limit it.
Referring to Fig. 1 and Fig. 2, the deep learning-based large-scale short video retrieval method provided by the present invention comprises the following steps:
Step 1: Perform key frame extraction and video standardization on the query data and the short videos in the retrieval set;
In this embodiment, key frame extraction exploits the characteristic of short video shots of being filmed in a single take, and extracts key frames at equal intervals. Video standardization depends on the video size: frames that are too large are downsampled and frames that are too small are interpolated, and finally the processed frames are stitched together to obtain a complete video.
In this embodiment, video standardization proceeds as follows. If a video frame is larger than the preset size, a downsampling operation is performed: an M×N image is downsampled by a factor of s to obtain an (M/s)×(N/s) image, according to the formula
DS(f) = P_avg({b_1, b_2, ..., b_n})
where f is a video frame, the b_i are the s×s blocks into which the frame is divided, and P_avg(·) is the average pooling operation.
If a video frame is smaller than the preset size, an interpolation operation is performed; here, to speed up computation, each missing pixel value is the mean of the nearest horizontal pixel value and the nearest vertical pixel value. Finally, the processed frames are stitched together to obtain a complete video.
For an input video V = {F_1, F_2, ..., F_N}, where F_i is the i-th video frame and N is the total number of frames of video V, the extracted key frames are K_t = F_{t×FPS}, where t denotes the video length in seconds, FPS denotes the frame rate, n is the number of frames selected from the video, and t = 1, 2, ..., n.
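The preprocessing above can be sketched in a few lines of numpy. `downsample` implements DS(f) by average-pooling non-overlapping s×s blocks; `keyframe_indices` is one reading of the equal-interval rule (one key frame per second of video). Both function names are illustrative rather than taken from the patent:

```python
import numpy as np

def downsample(frame, s):
    """s-fold downsampling DS(f) = P_avg: average-pool non-overlapping
    s x s blocks, turning an (M, N) frame into an (M//s, N//s) one."""
    M, N = frame.shape
    f = frame[:M - M % s, :N - N % s]  # crop so the blocks tile exactly
    return f.reshape(M // s, s, N // s, s).mean(axis=(1, 3))

def keyframe_indices(n, fps):
    """Equal-interval extraction, one frame per second: K_t = F_{t*FPS}."""
    return [t * fps for t in range(1, n + 1)]

frame = np.arange(16, dtype=float).reshape(4, 4)
print(downsample(frame, 2))      # each entry is a 2x2 block mean
print(keyframe_indices(3, 30))   # frame indices at 1 s, 2 s, 3 s of a 30 fps clip
```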
Step 2: Feed the processed data into the short video semantic feature extraction network and use similarity calculation to obtain the T most similar videos, where T is a preset value;
In this embodiment, the short video semantic feature extraction network computes the hash codes of the processed short videos; the samples in the retrieval set are ranked by the Hamming distance between their hash codes and that of the query (nearest first), and the precision over the top T entries of the ranking list is computed as the retrieval result.
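A minimal sketch of this retrieval step, assuming the binary codes are stored as 0/1 arrays; `top_t` is an illustrative name, and nearest-first ordering is the usual retrieval convention:

```python
import numpy as np

def top_t(query_code, db_codes, T):
    """Hamming distance between the query hash code and every database
    code, then the indices of the T nearest codes (ascending distance)."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable")[:T]

db = np.array([[0, 1, 1, 0],
               [1, 1, 1, 0],
               [0, 0, 0, 0],
               [0, 1, 1, 1]])
q = np.array([0, 1, 1, 0])
print(top_t(q, db, 2))  # indices of the two codes nearest to q
```

Because Hamming distance on packed binary codes reduces to XOR plus popcount, this comparison is far cheaper than distance computation on real-valued features, which is the source of the storage and speed gains claimed above.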
Referring to Fig. 3, the short video semantic feature extraction network of this embodiment consists overall of a feature extraction module and a feature-to-hash-code mapping module, comprising a first 3×3×3 convolutional layer, a first 1×2×2 pooling layer, a second 3×3×3 convolutional layer, a second 2×2×2 pooling layer, a third 3×3×3 convolutional layer, a third 2×2×2 pooling layer, a fourth 3×3×3 convolutional layer, a fourth 2×2×2 pooling layer, a fifth 3×3×3 convolutional layer, a fifth 2×2×2 pooling layer, a first 1×4096 fully connected layer, a second 1×4096 fully connected layer, and a hash layer. The five convolutional layers and five pooling layers are connected in sequence and together form the feature extraction module; a short video is input, and the feature extraction module produces the original feature map of the short video. The first 1×4096 fully connected layer, second 1×4096 fully connected layer and hash layer are connected in sequence and together form the feature-to-hash-code mapping module; the original feature map of the short video is input, and the mapping module produces the hash code of the mapped short video features. The hash layer is a 1×K fully connected layer activated by a sigmoid function, where K is the preset hash code length.
In this embodiment, the deep feature representation of the video is f(V) and the hash representation layer is H(V). The feature vector obtained from the fully connected layer has size [N, 4096, 1, 1, 1], i.e. the number of fully connected layer nodes num is 4096. Suppose the feature vector output by the fully connected layer is FC = {FC_1, FC_2, ..., FC_num}; the feature vector obtained after the hash layer is H = {h_1, h_2, ..., h_d}, where d is the number of hash layer nodes, i.e. the dimension of the hash layer output vector. To guarantee that each class label in the dataset has a unique corresponding hash code, at least d_min nodes are required; since d bits can distinguish at most 2^d codes, for a dataset with M classes the number of nodes is d_min = ⌈log₂ M⌉.
Let the maximum value of d be d_max; then d_max = num, i.e. the number of hash layer nodes must not exceed the number of nodes in the preceding fully connected layer. Since the purpose of the hash layer is to map the fully connected features into a compact binary-like code, the features output by the previous layer must be reduced in dimension. The value range of d is therefore d_min ≤ d ≤ d_max.
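A short sketch of the unique-code requirement: d bits can distinguish at most 2^d codes, so covering every class label needs d at least log₂ of the class count. The function name is illustrative:

```python
import math

def d_min(num_classes):
    """Smallest hash length giving every class a distinct binary code:
    need 2**d >= num_classes, hence d = ceil(log2(num_classes))."""
    return math.ceil(math.log2(num_classes))

# Class counts of the datasets used later in the text
print(d_min(101), d_min(51))  # UCF-101 and HMDB-51
```

This is why even the shortest 16-bit codes used in the experiments comfortably exceed the minimum for both datasets.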
The hash function is defined as follows:
[h_1, h_2, ..., h_d]^T = [Sign(W·FC + B)]^T;
where Sign is the sign function, W is the hash layer weight matrix, FC is the fully connected layer output, and B is the bias.
To map the hash features to binary codes, a threshold function is applied to the output values; for the sigmoid-activated outputs, the threshold function sets b_i = 1 if h_i ≥ 0.5 and b_i = 0 otherwise, for i = 1, 2, ..., d.
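A minimal numpy sketch of the sigmoid hash layer followed by thresholding; the 0.5 threshold is the natural choice for sigmoid outputs in (0, 1), and the names `binary_hash`, `W`, `fc` and `B` are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_hash(fc, W, B, threshold=0.5):
    """Sigmoid-activated hash layer followed by thresholding:
    bit i is 1 when sigmoid((W @ fc + B)_i) >= threshold."""
    h = sigmoid(W @ fc + B)
    return (h >= threshold).astype(int)

# Toy 2-bit example with hand-picked weights
W = np.array([[1.0, -1.0], [2.0, 0.5]])
fc = np.array([0.5, 1.0])
B = np.array([1.0, -2.0])
print(binary_hash(fc, W, B))
```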
Referring to Fig. 4, the short video semantic feature extraction network of this embodiment is a trained network; its training process comprises the following steps:
Step 2.1: Obtain a dataset and divide it into a training set and a test set;
In this embodiment, the HMDB51 and UCF-101 datasets are used; 70% of each dataset is selected as the training set I_train and the remaining 30% as the test set I_test;
Step 2.2: Perform key frame extraction and video standardization on the training set and the test set;
In this embodiment, key frame extraction exploits the characteristic of short video shots of being filmed in a single take, and extracts key frames at equal intervals. Video standardization depends on the video size: frames that are too large are downsampled and frames that are too small are interpolated, and finally the processed frames are stitched together to obtain a complete video.
For an input video V = {F_1, F_2, ..., F_N}, where F_i is the i-th video frame and N is the total number of frames of video V, the extracted key frames are K_t = F_{t×FPS}, where t denotes the video length in seconds, FPS denotes the frame rate, n is the number of frames selected from the video, and t = 1, 2, ..., n.
Step 2.3: Determine the training objective function, optimization algorithm, learning rate, momentum, weight decay, batch size, and number of training epochs;
In this embodiment, the objective function Loss consists of a triplet loss function L_Triplet, a cross-entropy loss function L_CE, and a smoothed average precision loss function L_AP:
Loss = αL_Triplet + βL_CE + γL_AP;
where α, β and γ are hyperparameters; the network weight parameters W and bias parameters B are obtained through backpropagation during model training;
In this embodiment, the triplet loss function is L_Triplet = Σ_{i=1}^{K} max(0, ‖f(V_i^a) − f(V_i^p)‖² − ‖f(V_i^a) − f(V_i^n)‖² + l), where V_i^a is the anchor sample, a selected video of a given type; V_i^p is a positive sample of the same type as the anchor; V_i^n is a negative sample of a different type from the anchor; f(·) maps sample data into a common vector space; l is the preset minimum margin between negative and positive samples; and K is the total number of samples;
In this embodiment, the multi-class cross-entropy loss function is L_CE = −(1/K) Σ_{i=1}^{K} Σ_{c=1}^{M} y_ic log(p_ic), where K is the total number of samples and M is the number of classes; y_ic is an indicator function taking the value 0 or 1, equal to 1 if the true class of sample i is c; and p_ic is the predicted probability that sample i belongs to class c;
In this embodiment, the smoothed average precision loss function is L_AP = 1 − (1/m) Σ_{k=1}^{m} AP_k, where m is the number of samples and AP_k is the smoothed average precision estimate over S_P, the set of videos in the dataset of the same type as the query video; the indicator function in AP is replaced by a differentiable sigmoid; D_ij is the ranking matrix, whose i-th row gives the ranking scores of the relevance between the i-th sample and the remaining samples; and τ is a smoothing coefficient, with different smoothing coefficients providing different gradient information;
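The first two terms of the objective can be sketched directly; the smoothed AP term needs the full ranking matrix D and the sigmoid relaxation, so it is left as a scalar input here. All function names are illustrative:

```python
import numpy as np

def triplet_loss(fa, fp, fn, margin):
    """Single-triplet term: max(0, ||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + margin)."""
    d_pos = float(np.sum((fa - fp) ** 2))
    d_neg = float(np.sum((fa - fn) ** 2))
    return max(0.0, d_pos - d_neg + margin)

def cross_entropy(y_onehot, p):
    """Multi-class CE: -(1/K) * sum_i sum_c y_ic * log(p_ic)."""
    return float(-np.mean(np.sum(y_onehot * np.log(p), axis=1)))

def total_loss(lt, lce, lap, alpha=1.0, beta=1.0, gamma=1.0):
    """Loss = alpha*L_Triplet + beta*L_CE + gamma*L_AP."""
    return alpha * lt + beta * lce + gamma * lap

fa, fp, fn = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([3.0, 0.0])
lt = triplet_loss(fa, fp, fn, margin=1.0)  # negative is far: clamped to 0
lce = cross_entropy(np.array([[1.0, 0.0]]), np.array([[0.5, 0.5]]))
print(lt, round(lce, 4), total_loss(lt, lce, 0.0))
```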
Step 2.4: Feed the training set into the short video semantic feature extraction network and train the network;
In this embodiment, the Adam algorithm is used for optimization, with the learning rate set to 10⁻⁵, the momentum to 0.9, the weight decay to 10⁻⁵, and the batch size to 32; the learning rate is decayed when the validation loss stops decreasing. The input frame size is 3×112×112, and every 16 frames are convolved as one clip. The initial weights of the convolutional neural network ResNet34_3D are initialized with pretrained weights. Training the model yields the network weight parameters W and bias parameters B.
Step 2.5: Use the trained short video semantic feature extraction network to compute the hash codes of the test set samples, rank the training set samples by the Hamming distance between their hash codes and that of each query sample, and compute the top-n precision of the ranking list to obtain the mean average precision (MAP) and the top-n retrieval results; a retrieval is correct if the retrieved video belongs to the same class as the input video, and a well-performing network is obtained once the loss values above stabilize and no longer decrease.
To evaluate the effectiveness of the proposed method, its retrieval performance was compared with several state-of-the-art methods, including FIHTV, VHSL, SVH, DH, DVH, DBNVH, BRVH, DSH, DLBHC and SRH. The experiments use hash codes of different lengths (16 bits, 32 bits, 64 bits) on the UCF-101 and HMDB-51 datasets. The DVH method uses a convolutional neural network to learn image feature representations and hash functions simultaneously, and then projects the corresponding features into a common representation space; the FIHTV, VHSL, SVH, DH, DBNVH, BRVH, DSH, DLBHC and SRH methods are run as described in their original papers. The experimental environment is a Platinum 8260 CPU @ 2.4 GHz with an NVIDIA A100 Tensor Core GPU running Ubuntu 20.04, and development uses Python with the open-source library PyTorch.
Table 1
Table 1 shows the comparative experimental results of the present invention and other methods on the video retrieval task on the UCF-101 dataset, where mAP denotes the mean average precision.
Table 2
Table 2 shows the comparative experimental results of the present invention and other methods on the video retrieval task on the HMDB-51 dataset, where mAP denotes the mean average precision.
Table 3
Table 3 compares the speed of the different video retrieval approaches of the present invention. Multi-level hash retrieval means using 16-bit, 32-bit and 64-bit binary hash codes in turn to compare against the video feature codes in the database, progressively narrowing the search range, and finally using real-valued feature codes for precise comparison in the database. The present invention makes full use of the spatio-temporal semantic information of videos, further improving retrieval performance.
It should be understood that the above description of preferred embodiments is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the present invention. Under the teaching of the present invention and without departing from the scope protected by the claims, those of ordinary skill in the art can also make substitutions or modifications, all of which fall within the protection scope of the present invention; the protection scope of the present invention shall be determined by the appended claims.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210811333.7A CN115357754B (en) | 2022-07-11 | 2022-07-11 | Deep learning-based large-scale short video retrieval method, system and equipment |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210811333.7A CN115357754B (en) | 2022-07-11 | 2022-07-11 | Deep learning-based large-scale short video retrieval method, system and equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115357754A true CN115357754A (en) | 2022-11-18 |
| CN115357754B CN115357754B (en) | 2025-07-08 |
Family
ID=84031377
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210811333.7A Active CN115357754B (en) | 2022-07-11 | 2022-07-11 | Deep learning-based large-scale short video retrieval method, system and equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115357754B (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111563184A (en) * | 2020-03-25 | 2020-08-21 | 中山大学 | Video hash retrieval representation conversion method based on deep learning |
| CN113177141A (en) * | 2021-05-24 | 2021-07-27 | 北湾科技(武汉)有限公司 | Multi-label video hash retrieval method and device based on semantic embedded soft similarity |
| CN113326392A (en) * | 2021-05-06 | 2021-08-31 | 武汉理工大学 | Remote sensing image audio retrieval method based on quadruple hash |
2022
- 2022-07-11: CN application CN202210811333.7A granted as patent CN115357754B (status: Active)
Non-Patent Citations (1)
| Title |
|---|
| Zeng Fanzhi et al., "A Video Spatio-Temporal Feature Extraction Algorithm and Its Application Research", Journal of Foshan University (Natural Science Edition), 5 September 2020 (2020-09-05), pages 16-23 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116644208A (en) * | 2023-05-30 | 2023-08-25 | 平安科技(深圳)有限公司 | Video retrieval method, device, electronic device, and computer-readable storage medium |
| CN116644208B (en) * | 2023-05-30 | 2025-10-17 | 平安科技(深圳)有限公司 | Video retrieval method, device, electronic device, and computer-readable storage medium |
| CN118069885A (en) * | 2024-04-19 | 2024-05-24 | 山东建筑大学 | A dynamic video content coding retrieval method and system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115357754B (en) | 2025-07-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN106845411B (en) | Video description generation method based on deep learning and probability map model | |
| CN110298037B (en) | Text Recognition Approach Based on Convolutional Neural Network Matching with Enhanced Attention Mechanism | |
| CN114358188B (en) | Feature extraction model processing, sample retrieval method, device and computer equipment | |
| CN113297369B (en) | Intelligent Question Answering System Based on Knowledge Graph Subgraph Retrieval | |
| WO2022068195A1 (en) | Cross-modal data processing method and device, storage medium and electronic device | |
| CN110309331A (en) | Self-supervised cross-modal deep hashing retrieval method | |
| CN111198964B (en) | Image retrieval method and system | |
| CN108647350A (en) | Image-text associated retrieval method based on two-channel network | |
| CN111400455A (en) | A relation detection method for question answering system based on knowledge graph | |
| CN110457514A (en) | A Multi-label Image Retrieval Method Based on Deep Hash | |
| CN114564943A (en) | A method, device and medium for text classification of maritime merchants based on fusion features | |
| CN107657008A (en) | Cross-media training and retrieval method based on deep discriminative ranking learning | |
| CN115292533B (en) | Cross-modal pedestrian retrieval method driven by visual positioning | |
| CN108595688A (en) | Latent-semantic cross-media hashing retrieval method based on online learning | |
| CN114548293B (en) | Video-text cross-modal retrieval method based on cross-granularity self-distillation | |
| CN116955730A (en) | Training method of feature extraction model, content recommendation method and device | |
| CN114911958B (en) | Semantic preference-based rapid image retrieval method | |
| CN115357754A (en) | Deep learning-based large-scale short video retrieval method, system and equipment | |
| CN117807259A (en) | Cross-modal hash retrieval method based on deep learning technology | |
| CN108595546B (en) | A semi-supervised cross-media feature learning retrieval method | |
| CN110196918B (en) | An unsupervised deep hashing method based on object detection | |
| CN117743614B (en) | Remote sensing image text retrieval method based on remote sensing multimodal basic model | |
| CN113139464B (en) | Power grid fault detection method | |
| CN112667797B (en) | Question-answer matching method, system and storage medium for adaptive transfer learning | |
| CN114647717A (en) | Intelligent question and answer method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||