CN115357754A - Deep learning-based large-scale short video retrieval method, system and equipment - Google Patents
Deep learning-based large-scale short video retrieval method, system and equipment
- Publication number
- CN115357754A CN115357754A CN202210811333.7A CN202210811333A CN115357754A CN 115357754 A CN115357754 A CN 115357754A CN 202210811333 A CN202210811333 A CN 202210811333A CN 115357754 A CN115357754 A CN 115357754A
- Authority
- CN
- China
- Prior art keywords
- video
- layer
- short video
- short
- hash
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field
The invention belongs to the technical field of video retrieval, and relates to a short video retrieval method, system and device, in particular to a large-scale short video retrieval method, system and device based on deep learning.
Background Art
With the development of Internet technology and smart devices, diverse multimedia data such as text, speech, audio, images and video have emerged on social networks and other information platforms. Short videos in particular are favored by users; they rose with the explosion of online social media and are an important carrier of information dissemination, conveying a great deal of information in a very short time. According to the 2021 Research Report on China Network Audiovisual Development, the market size of the pan-network audiovisual field exceeds 600 billion yuan, with the short video field accounting for 30% of the market share at 205.13 billion yuan. Because of their fragmented and social nature, short videos can better meet users' personalized needs. At the same time, the emergence of all kinds of video media has created a wide variety of video demands. How to find, in a large-scale video database and with limited device resources, videos that are near-duplicates of a user-uploaded video or of interest to the user has become a meaningful topic in the era of big data.
In recent years, many researchers have applied deep learning methods to short video retrieval. A common practice is to manually annotate video data with labels and other information, and to complete the video search task indirectly through text search. Although existing short video retrieval methods have made some progress, two deficiencies remain: 1) short video content and manually annotated text are not fully consistent or in one-to-one correspondence, so manual annotation can hardly describe the video information completely; 2) the video features extracted by machine learning are still very complex, requiring a large amount of storage space and a large amount of computing resources for similarity calculation.
Summary of the Invention
To solve the above technical problems, the present invention proposes a large-scale short video retrieval method, system and device based on deep learning: it learns the semantic information of short video data, generates hash codes with an improved convolutional neural network model, and finally uses similarity calculation to retrieve a given number of short video items.
The technical solution adopted by the method of the present invention is a deep learning-based large-scale short video retrieval method comprising the following steps:
Step 1: Perform key frame extraction and video standardization on the query data and the short videos in the retrieval set;
Step 2: Feed the processed data into the short video semantic feature extraction network and use similarity calculation to obtain the T most similar videos, where T is a preset value;
The short video semantic feature extraction network consists of a feature extraction module and a feature-to-hash-code mapping module, and comprises a first 3×3×3 convolutional layer, a first 1×2×2 pooling layer, a second 3×3×3 convolutional layer, a second 2×2×2 pooling layer, a third 3×3×3 convolutional layer, a third 2×2×2 pooling layer, a fourth 3×3×3 convolutional layer, a fourth 2×2×2 pooling layer, a fifth 3×3×3 convolutional layer, a fifth 2×2×2 pooling layer, a first 1×4096 fully connected layer, a second 1×4096 fully connected layer, and a hash layer;
The first 3×3×3 convolutional layer, first 1×2×2 pooling layer, second 3×3×3 convolutional layer, second 2×2×2 pooling layer, third 3×3×3 convolutional layer, third 2×2×2 pooling layer, fourth 3×3×3 convolutional layer, fourth 2×2×2 pooling layer, fifth 3×3×3 convolutional layer and fifth 2×2×2 pooling layer are connected in sequence and together form the feature extraction module; a short video is input, and the feature extraction module produces the original feature map of the short video;
The first 1×4096 fully connected layer, second 1×4096 fully connected layer and hash layer are connected in sequence and together form the feature-to-hash-code mapping module; the original feature map of the short video is input, and the mapping module produces the hash code of the mapped short video features;
The hash layer is a 1×K fully connected layer activated by a sigmoid function, where K is the preset hash code length.
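The convolution and pooling stack above resembles the C3D family of 3D CNNs. As a sanity check on the stated window sizes, the sketch below traces the (frames, height, width) shape of a 16-frame 112×112 clip through the five stages, assuming stride-1 convolutions with padding 1 (the patent does not state the convolution strides, so that part is an assumption):

```python
def pool3d_shape(shape, window):
    """Floor-divide each of (T, H, W) by the pooling window."""
    return tuple(s // w for s, w in zip(shape, window))

def feature_extractor_shapes(clip_shape=(16, 112, 112)):
    """Trace (T, H, W) through the five conv+pool stages; 3x3x3
    convolutions with stride 1 and padding 1 preserve the shape,
    so only the pooling windows change it."""
    windows = [(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]
    shapes = [clip_shape]
    for w in windows:
        shapes.append(pool3d_shape(shapes[-1], w))
    return shapes

# Shapes after the input and each of the five pooling layers
print(feature_extractor_shapes())
```

The first 1×2×2 pooling window halves only the spatial dimensions, preserving all 16 frames of temporal information for the deeper layers; the later 2×2×2 windows halve time as well.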
The technical solution adopted by the system of the present invention is a deep learning-based large-scale short video retrieval system comprising a preprocessing module and a retrieval module;
The preprocessing module performs key frame extraction and video standardization on the query data and the short videos in the retrieval set;
The retrieval module feeds the processed data into the short video semantic feature extraction network and uses similarity calculation to obtain the T most similar videos, where T is a preset value;
The short video semantic feature extraction network consists overall of a feature extraction module and a feature-to-hash-code mapping module, and comprises a first 3×3×3 convolutional layer, a first 1×2×2 pooling layer, a second 3×3×3 convolutional layer, a second 2×2×2 pooling layer, a third 3×3×3 convolutional layer, a third 2×2×2 pooling layer, a fourth 3×3×3 convolutional layer, a fourth 2×2×2 pooling layer, a fifth 3×3×3 convolutional layer, a fifth 2×2×2 pooling layer, a first 1×4096 fully connected layer, a second 1×4096 fully connected layer, and a hash layer;
The first 3×3×3 convolutional layer, first 1×2×2 pooling layer, second 3×3×3 convolutional layer, second 2×2×2 pooling layer, third 3×3×3 convolutional layer, third 2×2×2 pooling layer, fourth 3×3×3 convolutional layer, fourth 2×2×2 pooling layer, fifth 3×3×3 convolutional layer and fifth 2×2×2 pooling layer are connected in sequence and together form the feature extraction module; a short video is input, and the feature extraction module produces the original feature map of the short video;
The first 1×4096 fully connected layer, second 1×4096 fully connected layer and hash layer are connected in sequence and together form the feature-to-hash-code mapping module; the original feature map of the short video is input, and the mapping module produces the hash code of the mapped short video features;
The hash layer is a 1×K fully connected layer activated by a sigmoid function, where K is the preset hash code length.
The technical solution adopted by the device of the present invention is a deep learning-based large-scale short video retrieval device comprising:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the deep learning-based large-scale short video retrieval method described above.
Compared with the prior art, the method proposed by the present invention not only captures the semantic correlation of videos in the time dimension and learns the relative semantic correlation of deep features, but also greatly reduces the time and space complexity of video retrieval.
Description of Drawings
Fig. 1 is a schematic framework diagram of an embodiment of the present invention.
Fig. 2 is a schematic diagram of the principle of an embodiment of the present invention.
Fig. 3 is a structure diagram of the short video semantic feature extraction network of an embodiment of the present invention.
Fig. 4 is a schematic diagram of the training process of the short video semantic feature extraction network of an embodiment of the present invention.
Detailed Description
To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the examples described here serve only to illustrate and explain the present invention, and are not intended to limit it.
Referring to Fig. 1 and Fig. 2, the deep learning-based large-scale short video retrieval method provided by the present invention comprises the following steps:
Step 1: Perform key frame extraction and video standardization on the query data and the short videos in the retrieval set;
In this embodiment, key frame extraction exploits the characteristic of short video shots of being filmed in a single take, and extracts key frames at equal intervals. Video standardization depends on the video size: frames that are too large are downsampled and frames that are too small are interpolated, and finally the processed frames are stitched together to obtain a complete video.
In this embodiment, video standardization proceeds as follows. If a video frame is larger than the preset size, a downsampling operation is performed: an M×N image is downsampled by a factor of s to obtain an (M/s)×(N/s) image, according to the formula
DS(f) = P_avg({b_1, b_2, ..., b_n})
where f is a video frame, the b_i are the s×s blocks into which the frame is divided, and P_avg(·) is the average pooling operation.
If a video frame is smaller than the preset size, an interpolation operation is performed; here, to speed up computation, each missing pixel value is the mean of the nearest horizontal pixel value and the nearest vertical pixel value. Finally, the processed frames are stitched together to obtain a complete video.
For an input video V = {F_1, F_2, ..., F_N}, where F_i is the i-th video frame and N is the total number of frames of video V, the extracted key frames are K_t = F_{t×FPS}, where t denotes the video length in seconds, FPS denotes the frame rate, n is the number of frames selected from the video, and t = 1, 2, ..., n.
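The preprocessing above can be sketched in a few lines of numpy. `downsample` implements DS(f) by average-pooling non-overlapping s×s blocks; `keyframe_indices` is one reading of the equal-interval rule (one key frame per second of video). Both function names are illustrative rather than taken from the patent:

```python
import numpy as np

def downsample(frame, s):
    """s-fold downsampling DS(f) = P_avg: average-pool non-overlapping
    s x s blocks, turning an (M, N) frame into an (M//s, N//s) one."""
    M, N = frame.shape
    f = frame[:M - M % s, :N - N % s]  # crop so the blocks tile exactly
    return f.reshape(M // s, s, N // s, s).mean(axis=(1, 3))

def keyframe_indices(n, fps):
    """Equal-interval extraction, one frame per second: K_t = F_{t*FPS}."""
    return [t * fps for t in range(1, n + 1)]

frame = np.arange(16, dtype=float).reshape(4, 4)
print(downsample(frame, 2))      # each entry is a 2x2 block mean
print(keyframe_indices(3, 30))   # frame indices at 1 s, 2 s, 3 s of a 30 fps clip
```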
Step 2: Feed the processed data into the short video semantic feature extraction network and use similarity calculation to obtain the T most similar videos, where T is a preset value;
In this embodiment, the short video semantic feature extraction network computes the hash codes of the processed short videos; the samples in the retrieval set are ranked by the Hamming distance between their hash codes and that of the query (nearest first), and the precision over the top T entries of the ranking list is computed as the retrieval result.
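A minimal sketch of this retrieval step, assuming the binary codes are stored as 0/1 arrays; `top_t` is an illustrative name, and nearest-first ordering is the usual retrieval convention:

```python
import numpy as np

def top_t(query_code, db_codes, T):
    """Hamming distance between the query hash code and every database
    code, then the indices of the T nearest codes (ascending distance)."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable")[:T]

db = np.array([[0, 1, 1, 0],
               [1, 1, 1, 0],
               [0, 0, 0, 0],
               [0, 1, 1, 1]])
q = np.array([0, 1, 1, 0])
print(top_t(q, db, 2))  # indices of the two codes nearest to q
```

Because Hamming distance on packed binary codes reduces to XOR plus popcount, this comparison is far cheaper than distance computation on real-valued features, which is the source of the storage and speed gains claimed above.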
Referring to Fig. 3, the short video semantic feature extraction network of this embodiment consists overall of a feature extraction module and a feature-to-hash-code mapping module, comprising a first 3×3×3 convolutional layer, a first 1×2×2 pooling layer, a second 3×3×3 convolutional layer, a second 2×2×2 pooling layer, a third 3×3×3 convolutional layer, a third 2×2×2 pooling layer, a fourth 3×3×3 convolutional layer, a fourth 2×2×2 pooling layer, a fifth 3×3×3 convolutional layer, a fifth 2×2×2 pooling layer, a first 1×4096 fully connected layer, a second 1×4096 fully connected layer, and a hash layer. The five convolutional layers and five pooling layers are connected in sequence and together form the feature extraction module; a short video is input, and the feature extraction module produces the original feature map of the short video. The first 1×4096 fully connected layer, second 1×4096 fully connected layer and hash layer are connected in sequence and together form the feature-to-hash-code mapping module; the original feature map of the short video is input, and the mapping module produces the hash code of the mapped short video features. The hash layer is a 1×K fully connected layer activated by a sigmoid function, where K is the preset hash code length.
In this embodiment, the deep feature representation of the video is f(V) and the hash representation layer is H(V). The feature vector obtained from the fully connected layer has size [N, 4096, 1, 1, 1], i.e. the number of fully connected layer nodes num is 4096. Suppose the feature vector output by the fully connected layer is FC = {FC_1, FC_2, ..., FC_num}; the feature vector obtained after the hash layer is H = {h_1, h_2, ..., h_d}, where d is the number of hash layer nodes, i.e. the dimension of the hash layer output vector. To guarantee that each class label in the dataset has a unique corresponding hash code, at least d_min nodes are required; since d bits can distinguish at most 2^d codes, for a dataset with M classes the number of nodes is d_min = ⌈log₂ M⌉.
Let the maximum value of d be d_max; then d_max = num, i.e. the number of hash layer nodes must not exceed the number of nodes in the preceding fully connected layer. Since the purpose of the hash layer is to map the fully connected features into a compact binary-like code, the features output by the previous layer must be reduced in dimension. The value range of d is therefore d_min ≤ d ≤ d_max.
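A short sketch of the unique-code requirement: d bits can distinguish at most 2^d codes, so covering every class label needs d at least log₂ of the class count. The function name is illustrative:

```python
import math

def d_min(num_classes):
    """Smallest hash length giving every class a distinct binary code:
    need 2**d >= num_classes, hence d = ceil(log2(num_classes))."""
    return math.ceil(math.log2(num_classes))

# Class counts of the datasets used later in the text
print(d_min(101), d_min(51))  # UCF-101 and HMDB-51
```

This is why even the shortest 16-bit codes used in the experiments comfortably exceed the minimum for both datasets.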
The hash function is defined as follows:
[h_1, h_2, ..., h_d]^T = [Sign(W·FC + B)]^T;
where Sign is the sign function, W is the hash layer weight matrix, FC is the fully connected layer output, and B is the bias.
To map the hash features to binary codes, a threshold function is applied to the output values; for the sigmoid-activated outputs, the threshold function sets b_i = 1 if h_i ≥ 0.5 and b_i = 0 otherwise, for i = 1, 2, ..., d.
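A minimal numpy sketch of the sigmoid hash layer followed by thresholding; the 0.5 threshold is the natural choice for sigmoid outputs in (0, 1), and the names `binary_hash`, `W`, `fc` and `B` are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_hash(fc, W, B, threshold=0.5):
    """Sigmoid-activated hash layer followed by thresholding:
    bit i is 1 when sigmoid((W @ fc + B)_i) >= threshold."""
    h = sigmoid(W @ fc + B)
    return (h >= threshold).astype(int)

# Toy 2-bit example with hand-picked weights
W = np.array([[1.0, -1.0], [2.0, 0.5]])
fc = np.array([0.5, 1.0])
B = np.array([1.0, -2.0])
print(binary_hash(fc, W, B))
```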
Referring to Fig. 4, the short video semantic feature extraction network of this embodiment is a trained network; its training process comprises the following steps:
Step 2.1: Obtain a dataset and divide it into a training set and a test set;
In this embodiment, the HMDB51 and UCF-101 datasets are used; 70% of each dataset is selected as the training set I_train and the remaining 30% as the test set I_test;
Step 2.2: Perform key frame extraction and video standardization on the training set and the test set;
In this embodiment, key frame extraction exploits the characteristic of short video shots of being filmed in a single take, and extracts key frames at equal intervals. Video standardization depends on the video size: frames that are too large are downsampled and frames that are too small are interpolated, and finally the processed frames are stitched together to obtain a complete video.
For an input video V = {F_1, F_2, ..., F_N}, where F_i is the i-th video frame and N is the total number of frames of video V, the extracted key frames are K_t = F_{t×FPS}, where t denotes the video length in seconds, FPS denotes the frame rate, n is the number of frames selected from the video, and t = 1, 2, ..., n.
Step 2.3: Determine the training objective function, optimization algorithm, learning rate, momentum, weight decay, batch size, and number of training epochs;
In this embodiment, the objective function Loss consists of a triplet loss function L_Triplet, a cross-entropy loss function L_CE, and a smoothed average precision loss function L_AP:
Loss = αL_Triplet + βL_CE + γL_AP;
where α, β and γ are hyperparameters; the network weight parameters W and bias parameters B are obtained through backpropagation during model training;
In this embodiment, the triplet loss function is L_Triplet = Σ_{i=1}^{K} max(0, ‖f(V_i^a) − f(V_i^p)‖² − ‖f(V_i^a) − f(V_i^n)‖² + l), where V_i^a is the anchor sample, a selected video of a given type; V_i^p is a positive sample of the same type as the anchor; V_i^n is a negative sample of a different type from the anchor; f(·) maps sample data into a common vector space; l is the preset minimum margin between negative and positive samples; and K is the total number of samples;
In this embodiment, the multi-class cross-entropy loss function is L_CE = −(1/K) Σ_{i=1}^{K} Σ_{c=1}^{M} y_ic log(p_ic), where K is the total number of samples and M is the number of classes; y_ic is an indicator function taking the value 0 or 1, equal to 1 if the true class of sample i is c; and p_ic is the predicted probability that sample i belongs to class c;
In this embodiment, the smoothed average precision loss function is L_AP = 1 − (1/m) Σ_{k=1}^{m} AP_k, where m is the number of samples and AP_k is the smoothed average precision estimate over S_P, the set of videos in the dataset of the same type as the query video; the indicator function in AP is replaced by a differentiable sigmoid; D_ij is the ranking matrix, whose i-th row gives the ranking scores of the relevance between the i-th sample and the remaining samples; and τ is a smoothing coefficient, with different smoothing coefficients providing different gradient information;
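The first two terms of the objective can be sketched directly; the smoothed AP term needs the full ranking matrix D and the sigmoid relaxation, so it is left as a scalar input here. All function names are illustrative:

```python
import numpy as np

def triplet_loss(fa, fp, fn, margin):
    """Single-triplet term: max(0, ||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + margin)."""
    d_pos = float(np.sum((fa - fp) ** 2))
    d_neg = float(np.sum((fa - fn) ** 2))
    return max(0.0, d_pos - d_neg + margin)

def cross_entropy(y_onehot, p):
    """Multi-class CE: -(1/K) * sum_i sum_c y_ic * log(p_ic)."""
    return float(-np.mean(np.sum(y_onehot * np.log(p), axis=1)))

def total_loss(lt, lce, lap, alpha=1.0, beta=1.0, gamma=1.0):
    """Loss = alpha*L_Triplet + beta*L_CE + gamma*L_AP."""
    return alpha * lt + beta * lce + gamma * lap

fa, fp, fn = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([3.0, 0.0])
lt = triplet_loss(fa, fp, fn, margin=1.0)  # negative is far: clamped to 0
lce = cross_entropy(np.array([[1.0, 0.0]]), np.array([[0.5, 0.5]]))
print(lt, round(lce, 4), total_loss(lt, lce, 0.0))
```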
Step 2.4: Feed the training set into the short video semantic feature extraction network and train the network;
In this embodiment, the Adam algorithm is used for optimization, with the learning rate set to 10⁻⁵, the momentum to 0.9, the weight decay to 10⁻⁵, and the batch size to 32; the learning rate is decayed when the validation loss stops decreasing. The input frame size is 3×112×112, and every 16 frames are convolved as one clip. The initial weights of the convolutional neural network ResNet34_3D are initialized with pretrained weights. Training the model yields the network weight parameters W and bias parameters B.
Step 2.5: Use the trained short video semantic feature extraction network to compute the hash codes of the test set samples, rank the training set samples by the Hamming distance between their hash codes and that of each query sample, and compute the top-n precision of the ranking list to obtain the mean average precision (MAP) and the top-n retrieval results; a retrieval is correct if the retrieved video belongs to the same class as the input video, and a well-performing network is obtained once the loss values above stabilize and no longer decrease.
To evaluate the effectiveness of the proposed method, its retrieval performance was compared with several state-of-the-art methods, including FIHTV, VHSL, SVH, DH, DVH, DBNVH, BRVH, DSH, DLBHC and SRH. The experiments use hash codes of different lengths (16 bits, 32 bits, 64 bits) on the UCF-101 and HMDB-51 datasets. The DVH method uses a convolutional neural network to learn image feature representations and hash functions simultaneously, and then projects the corresponding features into a common representation space; the FIHTV, VHSL, SVH, DH, DBNVH, BRVH, DSH, DLBHC and SRH methods are run as described in their original papers. The experimental environment is a Platinum 8260 CPU @ 2.4 GHz with an NVIDIA A100 Tensor Core GPU running Ubuntu 20.04, and development uses Python with the open-source library PyTorch.
Table 1
Table 1 shows the comparative experimental results of the present invention and other methods on the video retrieval task on the UCF-101 dataset, where mAP denotes the mean average precision.
Table 2
Table 2 shows the comparative experimental results of the present invention and other methods on the video retrieval task on the HMDB-51 dataset, where mAP denotes the mean average precision.
Table 3
Table 3 compares the speed of the different video retrieval approaches of the present invention. Multi-level hash retrieval means using 16-bit, 32-bit and 64-bit binary hash codes in turn to compare against the video feature codes in the database, progressively narrowing the search range, and finally using real-valued feature codes for precise comparison in the database. The present invention makes full use of the spatio-temporal semantic information of videos, further improving retrieval performance.
It should be understood that the above description of preferred embodiments is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the present invention. Under the teaching of the present invention and without departing from the scope protected by the claims, those of ordinary skill in the art can also make substitutions or modifications, all of which fall within the protection scope of the present invention; the protection scope of the present invention shall be determined by the appended claims.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210811333.7A CN115357754B (en) | 2022-07-11 | 2022-07-11 | Deep learning-based large-scale short video retrieval method, system and equipment |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210811333.7A CN115357754B (en) | 2022-07-11 | 2022-07-11 | Deep learning-based large-scale short video retrieval method, system and equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115357754A true CN115357754A (en) | 2022-11-18 |
| CN115357754B CN115357754B (en) | 2025-07-08 |
Family
ID=84031377
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210811333.7A Active CN115357754B (en) | 2022-07-11 | 2022-07-11 | Deep learning-based large-scale short video retrieval method, system and equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115357754B (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111563184A (en) * | 2020-03-25 | 2020-08-21 | 中山大学 | Video hash retrieval representation conversion method based on deep learning |
| CN113177141A (en) * | 2021-05-24 | 2021-07-27 | 北湾科技(武汉)有限公司 | Multi-label video hash retrieval method and device based on semantic embedded soft similarity |
| CN113326392A (en) * | 2021-05-06 | 2021-08-31 | 武汉理工大学 | Remote sensing image audio retrieval method based on quadruple hash |
2022
- 2022-07-11: CN application CN202210811333.7A granted as patent CN115357754B (status: Active)
Non-Patent Citations (1)
| Title |
|---|
| Zeng Fanzhi et al., "A Video Spatio-Temporal Feature Extraction Algorithm and Its Application Research", Journal of Foshan University (Natural Science Edition), 5 September 2020 (2020-09-05), pages 16-23 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116644208A (en) * | 2023-05-30 | 2023-08-25 | 平安科技(深圳)有限公司 | Video retrieval method, device, electronic device, and computer-readable storage medium |
| CN116644208B (en) * | 2023-05-30 | 2025-10-17 | 平安科技(深圳)有限公司 | Video retrieval method, device, electronic device, and computer-readable storage medium |
| CN118069885A (en) * | 2024-04-19 | 2024-05-24 | 山东建筑大学 | A dynamic video content coding retrieval method and system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115357754B (en) | 2025-07-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN106845411B (en) | Video description generation method based on deep learning and probability map model | |
| CN110298037B (en) | Text Recognition Approach Based on Convolutional Neural Network Matching with Enhanced Attention Mechanism | |
| CN114358188B (en) | Feature extraction model processing, sample retrieval method, device and computer equipment | |
| CN113297369B (en) | Intelligent Question Answering System Based on Knowledge Graph Subgraph Retrieval | |
| WO2022068195A1 (en) | Cross-modal data processing method and device, storage medium and electronic device | |
| CN110309331A (en) | Self-supervised cross-modal deep hashing retrieval method | |
| CN111198964B (en) | Image retrieval method and system | |
| CN108647350A (en) | Image-text associated retrieval method based on two-channel network | |
| CN111400455A (en) | A relation detection method for question answering system based on knowledge graph | |
| CN110457514A (en) | A Multi-label Image Retrieval Method Based on Deep Hash | |
| CN114564943A (en) | A method, device and medium for text classification of maritime merchants based on fusion features | |
| CN107657008A (en) | Cross-media training and retrieval method based on deep discriminative ranking learning | |
| CN115292533B (en) | Cross-modal pedestrian retrieval method driven by visual positioning | |
| CN108595688A (en) | Latent-semantic cross-media hashing retrieval method based on online learning | |
| CN114548293B (en) | Video-text cross-modal retrieval method based on cross-granularity self-distillation | |
| CN116955730A (en) | Training method of feature extraction model, content recommendation method and device | |
| CN114911958B (en) | Semantic preference-based rapid image retrieval method | |
| CN115357754A (en) | Deep learning-based large-scale short video retrieval method, system and equipment | |
| CN117807259A (en) | Cross-modal hash retrieval method based on deep learning technology | |
| CN108595546B (en) | A semi-supervised cross-media feature learning retrieval method | |
| CN110196918B (en) | An unsupervised deep hashing method based on object detection | |
| CN117743614B (en) | Remote sensing image text retrieval method based on remote sensing multimodal basic model | |
| CN113139464B (en) | Power grid fault detection method | |
| CN112667797B (en) | Question-answer matching method, system and storage medium for adaptive transfer learning | |
| CN114647717A (en) | Intelligent question and answer method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||