CN111079601A

CN111079601A - Video content description method, system and device based on multi-mode attention mechanism

Info

Publication number: CN111079601A
Application number: CN201911243331.7A
Authority: CN
Inventors: 胡卫明; 孙亮; 李兵
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-04-28

Abstract

The invention belongs to the field of computer vision and natural language processing, and particularly relates to a video content description method, a system and a device based on a multi-mode attention mechanism, aiming at solving the problem that the accuracy of a generated description statement is low due to the fact that the video content description method only considers video characteristics and ignores high-level semantic attribute information. The method comprises the following steps: acquiring a video frame sequence of a video to be described; extracting multi-modal feature vectors of the video frame sequence, constructing the multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; obtaining semantic attribute vectors corresponding to the feature representations through a semantic attribute detection network; and expressing the concatenated vector and semantic attribute vector based on the features corresponding to each modal feature vector sequence, and obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism. The invention integrates visual features and high-level semantic attributes, and improves the accuracy of generating the video description sentences.

Description

Video content description method, system and device based on multimodal attention mechanism

技术领域technical field

本发明属于计算机视觉和自然语言处理领域，具体涉及一种基于多模态注意力机制的视频内容描述方法、系统、装置。The invention belongs to the fields of computer vision and natural language processing, and in particular relates to a video content description method, system and device based on a multimodal attention mechanism.

背景技术Background technique

人工智能大体可分为两个研究方向：感知智能和认知智能。感知智能研究进展讯速，比如图片分类、自然语言翻译，但认知智能发展速度有限，比如看图说话、视觉描述等。将自然语言和计算机视觉结合起来研究，有利于搭起人类和机器之间沟通的桥梁，促进认知智能的研究。Artificial intelligence can be roughly divided into two research directions: perceptual intelligence and cognitive intelligence. The research progress of perceptual intelligence is rapid, such as image classification and natural language translation, but the development speed of cognitive intelligence is limited, such as seeing pictures and speaking, visual description, etc. Combining natural language and computer vision research is beneficial to build a bridge of communication between humans and machines, and promote the research of cognitive intelligence.

视频内容描述不同于视频分类、物体检测等标签式的粗粒度视觉理解任务，而是需要用通顺准确的一句话来描述视频内容。这不仅需要识别出视频中的物体，还需要理解视频中物体之间的相互关系。同时由于视频内容描述风格多样，比如对场景的抽象描述，对各物体之间关系的描述，对视频中物体行为和运动的描述等，这将给视频内容描述研究带来很大的挑战性。传统的视频内容描述算法主要采用基于语言模板的方法或基于检索的方法。基于语言模板的方法，由于受到固定语言模板的限制，只能生成形式单一缺乏灵活性的句子。而基于检索的方法过于依赖检索视频库的大小，当数据库中缺少与待描述视频相似的视频时，生成的描述语句将和视频内容存在较大的偏差。同时这两种方法都需要在前期对视频进行复杂的预处理过程，而对后端的语言序列部分优化不足，从而导致生成的语句质量较差。Video content description is different from label-based coarse-grained visual understanding tasks such as video classification and object detection. Instead, it needs to describe the video content with a smooth and accurate sentence. This requires not only recognizing objects in the video, but also understanding the interrelationships between objects in the video. At the same time, due to the variety of video content description styles, such as the abstract description of the scene, the description of the relationship between various objects, the description of the behavior and motion of objects in the video, etc., which will bring great challenges to the research of video content description. Traditional video content description algorithms mainly use language template-based methods or retrieval-based methods. The methods based on language templates, due to the limitation of fixed language templates, can only generate sentences with a single form and lack of flexibility. The retrieval-based method relies too much on the size of the retrieved video database. When there is no video in the database similar to the video to be described, the generated description sentence will have a large deviation from the video content. At the same time, these two methods require complex preprocessing of the video in the early stage, and the language sequence part of the back-end is not optimized enough, resulting in poor quality of the generated sentences.

随着深度学习技术的进步，基于编码解码器的序列学习模型在视频内容描述问题中取得突破性的进展。本发明也是基于编码解码器模型，此类方法前期不需要对视频采取复杂的处理过程，直接通过网络实现端到端的训练，能够直接从大量的训练数据中学习到视频到语言的映射关系，从而产生内容更加精确、形式多样和语法灵活的视频描述。With the advancement of deep learning technology, codec-based sequence learning models have made breakthroughs in video content description problems. The present invention is also based on the codec model. Such methods do not require complex processing procedures for video in the early stage, and directly implement end-to-end training through the network, and can directly learn the mapping relationship from video to language from a large amount of training data, thereby Produce video descriptions with more precise content, diverse forms, and flexible syntax.

视频内容描述问题的关键首先在于视频特征的提取，由于视频中不同模态信息能够互相辅助，对视频多模态信息进行编码有助于挖掘更多的语义信息。同时由于通常的视频内容描述算法只考虑视频特征而忽略了视频高级语义属性信息，为了提高生成描述句子的质量，本发明还探讨了如何提取高层语义属性以及将语义属性运用到视频内容描述任务上来。本发明还对解码器端语言生成部分优化不足的问题进行分析与研究，当前大部分的视频内容描述算法都采用最大似然对语言序列建模，用交叉熵损失进行训练优化，这将带来两个明显的缺陷：一是曝光偏差问题，模型在训练的时候，解码器每个时刻的输入来自训练集中真实词，而模型测试的时候，每个时刻输入来自上一时刻预测到的单词。如果其中某一个单词预测不够准确，错误可能会向下传递，导致后面生成的单词质量越来越差。二是训练指标和评价准则不统一的问题，训练阶段采用交叉熵损失函数来最大化后验概率，而评价阶段采用BLEU、METEOR、CIDER等客观评价准则，这种不一致导致模型无法充分对视频内容描述生成的评价指标充分优化。The key to the problem of video content description is the extraction of video features. Since different modal information in the video can assist each other, encoding the multi-modal information of the video helps to mine more semantic information. At the same time, since the general video content description algorithm only considers the video features and ignores the video high-level semantic attribute information, in order to improve the quality of the generated description sentence, the present invention also discusses how to extract the high-level semantic attributes and apply the semantic attributes to the video content description task. . The invention also analyzes and studies the problem of insufficient optimization of the language generation part at the decoder side. Most of the current video content description algorithms use maximum likelihood to model language sequences, and use cross entropy loss for training optimization, which will bring There are two obvious defects: one is the exposure bias problem. When the model is training, the input of the decoder at each moment comes from the real word in the training set, while when the model is tested, the input at each moment comes from the word predicted at the previous moment. If one of the word predictions is not accurate enough, the error may be passed down, resulting in worse and worse quality of the words generated later. The second is the problem of inconsistent training indicators and evaluation criteria. In the training phase, the cross-entropy loss function is used to maximize the posterior probability, while in the evaluation phase, objective evaluation criteria such as BLEU, METEOR, and CIDER are used. This inconsistency makes the model unable to fully evaluate the video content. The evaluation metrics generated by the description are fully optimized.

发明内容SUMMARY OF THE INVENTION

为了解决现有技术中的上述问题，即为了解决现有视频内容描述方法只考虑视频特征而忽略了视频高级语义属性信息，导致生成的描述语句准确度较低的问题，本发明第一方面，提出了一种基于多模态注意力机制的视频内容描述方法，该方法包括：In order to solve the above problem in the prior art, that is, in order to solve the problem that the existing video content description method only considers the video features but ignores the video high-level semantic attribute information, resulting in the low accuracy of the generated description sentence, the first aspect of the present invention, A video content description method based on a multimodal attention mechanism is proposed, which includes:

步骤S100，获取待描述视频的视频帧序列，作为输入序列；Step S100, obtaining the video frame sequence of the video to be described as the input sequence;

步骤S200，提取所述输入序列的多模态特征向量，构建多模态特征向量序列，并通过循环神经网络得到各模态特征向量序列对应的特征表示；所述多模态特征向量序列包括视频帧特征向量序列、光流帧特征向量序列、视频片段特征向量序列；Step S200, extracting the multimodal feature vector of the input sequence, constructing a multimodal feature vector sequence, and obtaining the feature representation corresponding to each modal feature vector sequence through a recurrent neural network; the multimodal feature vector sequence includes video Frame feature vector sequence, optical flow frame feature vector sequence, video segment feature vector sequence;

步骤S300，基于各模态特征向量序列对应的特征表示，分别通过语义属性检测网络得到各特征表示对应的语义属性向量；Step S300, based on the feature representation corresponding to each modal feature vector sequence, obtain the semantic attribute vector corresponding to each feature representation through a semantic attribute detection network respectively;

步骤S400，将各模态特征向量序列对应的特征表示进行级联，得到初始编码向量；基于所述初始编码向量、各特征表示对应的语义属性向量，通过基于注意力机制的LSTM网络得到所述待描述视频的描述语句；Step S400, concatenate the feature representations corresponding to each modal feature vector sequence to obtain an initial coding vector; based on the initial coding vector and the semantic attribute vector corresponding to each feature representation, obtain the LSTM network based on the attention mechanism. The description sentence of the video to be described;

其中，in,

所述语义属性检测网络基于多层感知机构建，并基于训练样本进行训练，所述训练样本包括特征表示样本、对应的语义属性向量标签。The semantic attribute detection network is constructed based on a multi-layer perceptron, and is trained based on training samples, where the training samples include feature representation samples and corresponding semantic attribute vector labels.

在一些优选的实施方式中，步骤S200中“提取所述输入序列的多模态特征向量，构建多模态特征向量序列”，其方法为：In some preferred embodiments, in step S200, "extract the multimodal feature vector of the input sequence, and construct a multimodal feature vector sequence", and the method is:

基于深度残差网络对所述输入序列中每一帧RGB图像进行特征提取，得到视频帧特征向量序列；Perform feature extraction on each frame of RGB image in the input sequence based on a deep residual network to obtain a video frame feature vector sequence;

基于所述输入序列，通过Lucas-Kanade算法得到光流序列；通过深度残差网络对该光流序列进行特征提取，得到光流帧特征向量序列；Based on the input sequence, the optical flow sequence is obtained through the Lucas-Kanade algorithm; the feature extraction is performed on the optical flow sequence through the deep residual network to obtain the optical flow frame feature vector sequence;

将所述输入序列平分为T段，通过三维卷积深度神经网络分别提取各段序列的特征向量，得到视频片段特征向量序列。The input sequence is equally divided into T segments, and feature vectors of each segment are extracted respectively through a three-dimensional convolutional deep neural network to obtain a sequence of video segment feature vectors.

在一些优选的实施方式中，所述语义属性检测网络其训练方法为：In some preferred embodiments, the training method of the semantic attribute detection network is:

获取训练数据集，所述训练数据集包括视频及对应的描述语句；Obtain a training data set, the training data set includes a video and a corresponding description sentence;

提取所述训练数据集中描述语句的单词，并按照出现频率进行排序，选择前K个单词作为高层语义属性向量；根据所述描述语句是否包含所述高层语义属性向量，获取视频真实的语义属性向量标签；Extract the words describing the sentences in the training data set, and sort them according to the frequency of occurrence, and select the first K words as high-level semantic attribute vectors; according to whether the description sentences contain the high-level semantic attribute vectors, obtain the real semantic attribute vectors of the video Label;

获取所述训练数据集中视频的多模态特征向量序列对应的特征表示；Obtain the feature representation corresponding to the multimodal feature vector sequence of the video in the training data set;

基于所述特征表示、所述真实的语义属性向量标签，训练所述语义属性检测网络。The semantic attribute detection network is trained based on the feature representation and the true semantic attribute vector labels.

在一些优选的实施方式中，所述语义属性检测网络在训练过程中其损失函数loss₁为：In some preferred embodiments, the loss function loss ₁ of the semantic attribute detection network in the training process is:

其中，N为训练数据集中描述语句的数量，K为语义属性检测网络输出的预测语义属性向量标签的维度，s_ik为语义属性检测网络输出的预测语义属性向量标签，y_ik为真实的语义属性向量标签，i、k为下标，α为权重，W^encoder为循环神经网络、语义属性检测网络所有的权重矩阵和偏置矩阵参数的集合。Among them, N is the number of description sentences in the training data set, K is the dimension of the predicted semantic attribute vector label output by the semantic attribute detection network, s _ik is the predicted semantic attribute vector label output by the semantic attribute detection network, and y _ik is the real semantic attribute Vector label, i, k are subscripts, α is the weight, W ^encoder is the set of all weight matrix and bias matrix parameters of the recurrent neural network and the semantic attribute detection network.

在一些优选的实施方式中，步骤S400中“基于所述初始编码向量、各特征表示对应的语义属性向量，通过基于注意力机制的LSTM网络得到所述待描述视频的语句描述”，其方法为：In some preferred embodiments, in step S400, "based on the initial encoding vector and the semantic attribute vector corresponding to each feature representation, the sentence description of the video to be described is obtained through the LSTM network based on the attention mechanism", and the method is as follows: :

通过注意力机制对各特征表示对应的语义属性向量进行加权，获取多模态的语义属性向量；The semantic attribute vector corresponding to each feature representation is weighted by the attention mechanism to obtain the multi-modal semantic attribute vector;

基于所述初始编码向量、所述多模态的语义属性向量，通过LSTM网络生成所述待描述视频的语句描述。Based on the initial encoding vector and the multimodal semantic attribute vector, a sentence description of the video to be described is generated through an LSTM network.

在一些优选的实施方式中，所述基于注意力机制的LSTM网络在训练过程中采用因式分解的方法进行权重矩阵的计算。In some preferred embodiments, the LSTM network based on the attention mechanism adopts a factorization method to calculate the weight matrix in the training process.

本发明的第二方面，提出了一种基于多模态注意力机制的视频内容描述系统，该系统包括获取模块、提取特征表示模块、语义属性检测模块、生成视频描述模块；In a second aspect of the present invention, a video content description system based on a multimodal attention mechanism is proposed, the system includes an acquisition module, a feature extraction module, a semantic attribute detection module, and a video description generation module;

所述获取模块，配置为获取待描述视频的视频帧序列，作为输入序列；The obtaining module is configured to obtain the video frame sequence of the video to be described as the input sequence;

所述提取特征表示模块，配置为提取所述输入序列的多模态特征向量，构建多模态特征向量序列，并通过循环神经网络得到各模态特征向量序列对应的特征表示；所述多模态特征向量序列包括视频帧特征向量序列、光流帧特征向量序列、视频片段特征向量序列；The extraction feature representation module is configured to extract the multimodal feature vector of the input sequence, construct a multimodal feature vector sequence, and obtain the feature representation corresponding to each modal feature vector sequence through a recurrent neural network; the multimodal feature vector sequence The state feature vector sequence includes video frame feature vector sequence, optical flow frame feature vector sequence, and video segment feature vector sequence;

所述语义属性检测模块，配置为基于各模态特征向量序列对应的特征表示，分别通过语义属性检测网络得到各特征表示对应的语义属性向量；The semantic attribute detection module is configured to obtain a semantic attribute vector corresponding to each feature representation through a semantic attribute detection network based on the feature representation corresponding to each modal feature vector sequence;

所述生成视频描述模块，配置为将各模态特征向量序列对应的特征表示进行级联，得到初始编码向量；基于所述初始编码向量、各特征表示对应的语义属性向量，通过基于注意力机制的LSTM网络得到所述待描述视频的描述语句；The generating video description module is configured to concatenate the feature representations corresponding to each modal feature vector sequence to obtain an initial coding vector; based on the initial coding vector and the semantic attribute vector corresponding to each feature representation, the attention mechanism The LSTM network obtains the description sentence of the video to be described;

其中，in,

本发明的第三方面，提出了一种存储装置，其中存储有多条程序，所述程序应用由处理器加载并执行以实现上述的基于多模态注意力机制的视频内容描述方法。In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, and the program applications are loaded and executed by a processor to implement the above-mentioned video content description method based on a multimodal attention mechanism.

本发明的第四方面，提出了一种处理装置，包括处理器、存储装置；处理器，适用于执行各条程序；存储装置，适用于存储多条程序；所述程序适用于由处理器加载并执行以实现上述的基于多模态注意力机制的视频内容描述方法。In a fourth aspect of the present invention, a processing device is proposed, including a processor and a storage device; the processor is adapted to execute various programs; the storage device is adapted to store multiple programs; the programs are adapted to be loaded by the processor And execute to realize the above-mentioned video content description method based on multimodal attention mechanism.

本发明的有益效果：Beneficial effects of the present invention:

本发明融合视觉特征和高层语义属性，提高了生成视频描述语句的准确度。本发明从多模态信息出发，采用视频帧、光流帧和视频片段相结合的方法得到视频特征向量序列，同时检测和生成视频的语义属性向量标签。为了获得更有效的视觉特征和语义属性，将语义属性向量标签生成阶段的辅助分类损失和LSTM网络损失进行同时优化，可以保证句子中的上下文关系。在解码阶段，提出结合语义属性的注意力机制算法，将语义属性向量融入到传统的循环神经网络权重矩阵中，并且在生成句子单词的每一时刻，采用注意力机制来关注特定的语义属性，提高了视频内容描述的准确度。The invention integrates visual features and high-level semantic attributes, and improves the accuracy of generating video description sentences. Starting from multimodal information, the invention adopts the method of combining video frames, optical flow frames and video segments to obtain video feature vector sequences, and simultaneously detects and generates video semantic attribute vector labels. In order to obtain more effective visual features and semantic attributes, the auxiliary classification loss and LSTM network loss in the semantic attribute vector label generation stage are simultaneously optimized to ensure the contextual relationship in the sentence. In the decoding stage, an attention mechanism algorithm combining semantic attributes is proposed, the semantic attribute vector is integrated into the traditional RNN weight matrix, and the attention mechanism is used to pay attention to specific semantic attributes at every moment of generating sentence words. Improved the accuracy of video content descriptions.

附图说明Description of drawings

通过阅读参照以下附图所做的对非限制性实施例所做的详细描述，本申请的其他特征、目的和优点将会变得更明显。Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-limiting embodiments taken with reference to the following drawings.

图1是本发明一种实施例的基于多模态注意力机制的视频内容描述方法的流程示意图；1 is a schematic flowchart of a video content description method based on a multimodal attention mechanism according to an embodiment of the present invention;

图2本发明一种实施例的基于多模态注意力机制的视频内容描述系统的框架示意图；2 is a schematic diagram of the framework of a video content description system based on a multimodal attention mechanism according to an embodiment of the present invention;

图3是本发明一种实施例的基于多模态注意力机制的视频内容描述方法的训练过程的示意图；3 is a schematic diagram of a training process of a video content description method based on a multimodal attention mechanism according to an embodiment of the present invention;

图4是本发明一种实施例的语义属性检测网络的网络结构示意图。FIG. 4 is a schematic diagram of a network structure of a semantic attribute detection network according to an embodiment of the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚，下面将结合附图对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are part of the embodiments of the present invention, not All examples. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

下面结合附图和实施例对本申请作进一步的详细说明。可以理解的是，此处所描述的具体实施例仅用于解释相关发明，而非对该发明的限定。另外还需要说明的是，为了便于描述，附图中仅示出了与有关发明相关的部分。The present application will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the related invention, but not to limit the invention. In addition, it should be noted that, for the convenience of description, only the parts related to the related invention are shown in the drawings.

需要说明的是，在不冲突的情况下，本申请中的实施例及实施例中的特征可以相互组合。It should be noted that the embodiments in the present application and the features of the embodiments may be combined with each other in the case of no conflict.

本发明的基于多模态注意力机制的视频内容描述方法，如图1所示，包括以下步骤：The video content description method based on the multimodal attention mechanism of the present invention, as shown in Figure 1, includes the following steps:

其中，in,

为了更清晰地对本发明基于多模态注意力机制的视频内容描述方法进行说明，下面结合附图对本发明方法一种实施例中各步骤进行展开详述。In order to describe the video content description method based on the multimodal attention mechanism of the present invention more clearly, each step in an embodiment of the method of the present invention is described in detail below with reference to the accompanying drawings.

本发明的方法具体运行的编程语言并不受限制，用任何语言编写都可以实现本发明的方法。发明采用一台具有12G字节内存的4卡Titan Xp GPU服务器，并用Python语言编制了基于多模态注意力机制的视频内容描述方法的工作程序，实现了本发明的方法。具体实现步骤如下：The programming language for the specific operation of the method of the present invention is not limited, and the method of the present invention can be implemented by writing in any language. The invention adopts a 4-card Titan Xp GPU server with 12G-byte memory, and uses Python language to compile a working program of a video content description method based on a multimodal attention mechanism, thereby realizing the method of the present invention. The specific implementation steps are as follows:

步骤S100，获取待描述视频的视频帧序列，作为输入序列。Step S100, acquiring a video frame sequence of the video to be described as an input sequence.

在本实施例中，待描述视频可以是实时拍摄的视频，例如智能监控及行为分析场景中，需要对摄像头实时拍摄的视频进行描述，此时，待描述视频可以是摄像头实时拍摄的视频；或者待描述视频可以是从网络中获取的视频，例如视频内容预览场景中，需要将网络获取的视频通过自然语言进行描述，实现用户对视频内容的预览，此时，待描述视频可以是网络中获取的需要进行预览的视频；或者，待描述视频可以是本地存储的视频，例如视频分类存储场景中，需要对视频进行描述，并根据描述信息进行分类存储，此时待描述视频可以是本地存储的需要进行分类存储的视频。基于获取的待描述视频，提取视频帧序列，作为输入。In this embodiment, the video to be described may be a video shot in real time. For example, in an intelligent monitoring and behavior analysis scenario, the video shot in real time by a camera needs to be described. In this case, the video to be described may be a video shot in real time by a camera; or The video to be described can be a video obtained from the network. For example, in a video content preview scenario, the video obtained from the network needs to be described in natural language to enable users to preview the video content. In this case, the video to be described can be obtained from the network. The video that needs to be previewed; or, the video to be described can be a locally stored video. For example, in a video classification storage scenario, the video needs to be described and classified and stored according to the description information. At this time, the video to be described can be stored locally. Videos that need to be classified and stored. Based on the acquired video to be described, a sequence of video frames is extracted as input.

步骤S200，提取所述输入序列的多模态特征向量，构建多模态特征向量序列，并通过循环神经网络得到各模态特征向量序列对应的特征表示；所述多模态特征向量序列包括视频帧特征向量序列、光流帧特征向量序列、视频片段特征向量序列。Step S200, extracting the multimodal feature vector of the input sequence, constructing a multimodal feature vector sequence, and obtaining the feature representation corresponding to each modal feature vector sequence through a recurrent neural network; the multimodal feature vector sequence includes video Frame feature vector sequence, optical flow frame feature vector sequence, video segment feature vector sequence.

在本实例中，对待描述视频进行多模态视频特征的提取，分别为视频帧、光流帧和视频片段。具体步骤如下：In this example, multi-modal video features are extracted from the video to be described, which are video frames, optical flow frames and video clips respectively. Specific steps are as follows:

步骤S201，使用经过预训练的深度残差网络对待描述视频的视频帧序列每一帧进行特征提取，将网络第i层的输出作为对该帧的特征表示，得到视频帧特征向量序列

Step S201, using a pre-trained deep residual network to perform feature extraction on each frame of the video frame sequence of the video to be described, using the output of the i-th layer of the network as the feature representation of the frame, and obtaining a video frame feature vector sequence

将视频帧特征向量序列按顺序输入循环神经网络LSTM中，将网络最后时刻的隐藏状态h_t作为视频的视频帧特征向量的特征表示，记为v_f。The video frame feature vector sequence is input into the recurrent neural network LSTM in sequence, and the hidden state h _t at the last moment of the network is used as the feature representation of the video frame feature vector of the video, denoted as v _f .

步骤S202，待描述视频的视频帧序列通过Lucas-Kanade算法，生成视频的光流序列，经过预训练的深度残差网络对每一帧进行特征提取，将网络第i层的输出作为对该帧的特征表示，得到光流帧特征向量序列

Step S202, the video frame sequence of the video to be described generates the optical flow sequence of the video through the Lucas-Kanade algorithm, the pre-trained deep residual network performs feature extraction on each frame, and the output of the i-th layer of the network is used as the frame. The feature representation of the optical flow frame feature vector sequence is obtained

将光流帧特征向量序列按顺序输入循环神经网络LSTM中，将网络最后时刻的隐藏状态h_t作为视频的光流帧的特征表示，记为v_o The sequence of optical flow frame feature vectors is input into the recurrent neural network LSTM in sequence, and the hidden state h _t at the last moment of the network is used as the feature representation of the optical flow frame of the video, denoted as v _o

步骤S203，待描述视频的视频帧序列平均分成T段，每段使用三维卷积深度神经网络进行特征提取，将网络第i层的输出作为对第t段视频的特征表示，得到视频片段特征向量序列

Step S203, the video frame sequence of the video to be described is evenly divided into T segments, each segment uses a three-dimensional convolutional deep neural network to perform feature extraction, and the output of the i-th layer of the network is used as the feature representation for the t-th segment of the video to obtain the video segment feature vector. sequence

将视频片段特征向量序列按顺序输入循环神经网络LSTM中，将网络最后时刻的隐藏状态h_t作为视频的视频段的特征表示，记为v_c。The sequence of feature vectors of video segments is input into the recurrent neural network LSTM in sequence, and the hidden state h _t at the last moment of the network is used as the feature representation of the video segment of the video, denoted as _vc .

上述步骤提取视频的多模态特征表示的过程，如图3所示，输入Video(视频)，分为视频帧(Frame)、光流帧(Optical flow)和视频片段(Video clip)，其中视频帧输出的是Frame Feature(视频帧特征向量的特征表示)即静态特征，视频片段输出的是C3D Feature(视频片段的3D-CNN特征)，光流帧输出的是Motion Feature(动态特征)。图3中的其他步骤在下文中描述。The above steps extract the process of multi-modal feature representation of video, as shown in Figure 3, input Video (video), which is divided into video frame (Frame), optical flow frame (Optical flow) and video clip (Video clip), where video Frame output is Frame Feature (feature representation of video frame feature vector), that is, static feature, video clip output is C3D Feature (3D-CNN feature of video clip), and optical flow frame output is Motion Feature (dynamic feature). The other steps in Figure 3 are described below.

步骤S300，基于各模态特征向量序列对应的特征表示，分别通过语义属性检测网络得到各特征表示对应的语义属性向量。Step S300 , based on the feature representation corresponding to each modal feature vector sequence, obtain a semantic attribute vector corresponding to each feature representation through a semantic attribute detection network.

在本实施例中，先对语义属性检测网络的训练进行介绍，再对通过语义属性检测网络得到各特征表示对应的语义属性向量进行介绍。In this embodiment, the training of the semantic attribute detection network is introduced first, and then the semantic attribute vector corresponding to each feature representation obtained through the semantic attribute detection network is introduced.

语义属性检测网络基于多层感知机构建，结构如图4所示，包括输入层(InputLayer)、隐藏层(Hidden Layer)、输出层(Output layer)，输入的是一个视频(InputVideo)及对应的描述语句是“A Small child id playing the guitar(一个小孩子正在弹吉他)”，通过循环神经网络LSTM，得到多模态特征向量序列(v_i1,v_i2,...,v_in)，通过语义属性检测网络输出语义属性向量s_i1,s_i2,...,s_iK。语义属性检测网络的具体训练过程如下：The semantic attribute detection network is constructed based on a multi-layer perceptron. The structure is shown in Figure 4, including an input layer (InputLayer), a hidden layer (Hidden Layer), and an output layer (Output layer). The input is a video (InputVideo) and the corresponding The description sentence is "A Small child id playing the guitar (a small child is playing the guitar)", through the recurrent neural network LSTM, the multi-modal feature vector sequence (v _i1 ,v _i2 ,...,v _in ) is obtained, through The semantic attribute detection network outputs semantic attribute vectors s _i1 ,s _i2 ,...,s _iK . The specific training process of the semantic attribute detection network is as follows:

步骤A301，获取训练数据集，所述训练数据集包括视频及对应的描述语句。Step A301: Obtain a training data set, where the training data set includes videos and corresponding description sentences.

步骤A302，提取所述训练数据集中描述语句的单词，并按照出现频率进行排序，选择前K个单词作为高层语义属性向量；根据所述描述语句是否包含所述高层语义属性向量，获取视频真实的语义属性向量标签。Step A302, extracting the words describing the sentences in the training data set, and sorting them according to the frequency of occurrence, and selecting the first K words as high-level semantic attribute vectors; according to whether the description sentences include the high-level semantic attribute vectors, obtain the real video of the video. Semantic attribute vector labels.

提取训练数据集中描述语句的单词，按照单词出现频率对单词排序，移除虚词，然后选择出现概率最高的K个单词作为高层语义属性值Extract the words that describe the sentences in the training data set, sort the words according to the frequency of word occurrence, remove the function words, and then select the K words with the highest occurrence probability as the high-level semantic attribute value

假设训练数据集有N条语句，y_i＝[y_i1,y_i1,...y_il,...y_iK]是第i个视频的真实的语义属性向量标签。其中如果视频i对应的描述语句中包含属性单词l，则y_il＝1；否则y_il＝0。Assuming that the training data set has N sentences, y _i =[y _i1 , y _i1 ,...y _il ,...y _iK ] is the true semantic attribute vector label of the ith video. Wherein, if the description sentence corresponding to the video i contains the attribute word l, then y _il =1; otherwise, y _il =0.

步骤A303，通过步骤S200的方法，获取上述训练数据集中视频的多模态特征向量序列对应的特征表示。Step A303: Obtain the feature representation corresponding to the multimodal feature vector sequence of the video in the training data set by using the method of step S200.

步骤A304，基于特征表示、真实的语义属性向量标签，训练所述语义属性检测网络。Step A304: Train the semantic attribute detection network based on the feature representation and the real semantic attribute vector label.

让v_i∈{v_f，v_o，v_c}表示第i个视频学习到的特征表示，此时的训练样本为{v_i,y_i}。本发明中采用基于采用多层感知机构建的语义属性检测网络来学习函数f(·):R^m→R^K，其表示为将一个m维的空间映射为K维，其中，R^m为m维度的实数空间，R^K同理，m为输入特征表示的维度，K为输出语义属性向量的维度，这个维度和上述提取的语义属性值的个数(高层语义属性向量的维度)相等，多层感知机输出向量s_i＝[s_i1,...,s_iK]为第i个视频的预测语义属性向量标签。语义属性检测网络的分类损失函数loss₁如公式(1)所示：Let v _i ∈ {v _f , v _o , v _c } denote the feature representation learned from the ith video, and the training sample at this time is {v _i , y _i }. In the present invention, a semantic attribute detection network based on a multi-layer perceptron is used to learn the function f(·):R ^m →R ^K , which is expressed as mapping an m-dimensional space to K-dimension, where R ^m is m The real number space of dimensions, R ^K is the same, m is the dimension represented by the input feature, K is the dimension of the output semantic attribute vector, this dimension is equal to the number of semantic attribute values extracted above (the dimension of the high-level semantic attribute vector), and more The layer perceptron output vector s _i =[s _i1 ,...,s _iK ] is the predicted semantic attribute vector label of the ith video. The classification loss function loss ₁ of the semantic attribute detection network is shown in formula (1):

其中，W^encoder表示循环神经网络、语义属性检测网络所有权重矩阵和偏置矩阵参数的集合，α为权重，s_i＝α(f(v_i))学习得到K维向量，α(·)表示sigmoid函数，f(·)由多层感知机实现。Among them, W ^encoder represents the set of weight matrix and bias matrix parameters of the recurrent neural network and semantic attribute detection network, α is the weight, s _i =α(f(vi ₎ ) learns to obtain a K-dimensional vector, and α( ) represents The sigmoid function, f( ) is implemented by a multilayer perceptron.

将语义属性检测网络训练完成后，在实际的应用过程中，基于各模态特征向量序列对应的特征表示，分别通过语义属性检测网络得到各特征表示对应的语义属性向量。如图3中的Multimodal Semantic Detector模块。After the semantic attribute detection network is trained, in the actual application process, based on the feature representation corresponding to each modal feature vector sequence, the semantic attribute vector corresponding to each feature representation is obtained through the semantic attribute detection network. The Multimodal Semantic Detector module in Figure 3.

步骤S400，将各模态特征向量序列对应的特征表示进行级联，得到初始编码向量；基于所述初始编码向量、各特征表示对应的语义属性向量，通过基于注意力机制的LSTM网络得到所述待描述视频的描述语句。Step S400, concatenate the feature representations corresponding to each modal feature vector sequence to obtain an initial coding vector; based on the initial coding vector and the semantic attribute vector corresponding to each feature representation, obtain the LSTM network based on the attention mechanism. A description sentence for the video to be described.

在本实施例中，将{v_f，v_o，v_c}级联，作为初始编码向量v，如图3中的Concatenation模块。Attention Fusion为基于注意力模块。In this embodiment, {v _f , v _o , v _c } is concatenated as the initial coding vector v, such as the Concatenation module in FIG. 3 . Attention Fusion is based on the attention module.

下文先对基于注意力机制的LSTM网络的训练过程进行介绍，再对通过基于注意力机制的LSTM网络获取待描述视频的描述语句的方法进行介绍。The following first introduces the training process of the LSTM network based on the attention mechanism, and then introduces the method of obtaining the description sentences of the video to be described through the LSTM network based on the attention mechanism.

基于注意力机制的LSTM网络在训练时，输入的描述语句是“A Small child idplaying the guitar(一个小孩子正在弹吉他)”，具体的训练过程如下：When the LSTM network based on the attention mechanism is trained, the input description sentence is "A Small child idplaying the guitar". The specific training process is as follows:

当输出是一个句子的时候，用LSTM网络作为解码器，此时能够捕获句子的长期依赖性。假设当前时刻输入的单词为w_t，LSTM网络上一时刻隐藏状态为h_t-1，上一时刻细胞的记忆状态为c_t-1，则LSTM网络在t时刻的更新规则如公式(2)(3)(4)(5)(6)(7)(8)所示：When the output is a sentence, using the LSTM network as the decoder can capture the long-term dependencies of the sentence. Assuming that the word input at the current moment is w _t , the hidden state of the LSTM network at the last moment is h _t-1 , and the memory state of the cell at the last moment is c _t-1 , then the update rule of the LSTM network at time t is as formula (2) (3)(4)(5)(6)(7)(8):

i_t＝σ(W_iw_t+U_hih_t-1+z) (2)i _t =σ(W _i w _t +U _hi h _t-1 +z) (2)

f_t＝σ(W_fw_t+U_hfh_t-1+z) (3)f _t =σ(W _f w _t +U _hf h _t-1 +z) (3)

o_t＝σ(W_ow_t+U_hoh_t-1+z) (4)o _t =σ(W _o w _t +U _ho h _t-1 +z) (4)

h_t＝o_t⊙tanh(c_t) (7)h _t =o _t ⊙tanh(c _t ) (7)

z＝1(t＝1)·Cv (8)z=1(t=1)·Cv (8)

用*表示上述公式{i,f,o,c}中的某一个下标，其中，W_*、U_h*和C均为权重矩阵，i_t,f_t,o_t,c_t,

分别表示t时刻输入门、遗忘门、输出门、记忆单元和压缩输入的状态，tanh(·)表示双曲正切函数，1(t＝1)为指示函数，初始编码向量v在LSTM的初始时刻作为输入，z表示在t＝1初始时刻将视频向量作为输入。为了简化，上述公式中的偏置项均被省略。Use * to represent a subscript in the above formula {i,f,o,c}, where W _* , U _h* and C are weight matrices, i _t ,f _t ,o _t ,c _t ,

respectively represent the state of the input gate, forget gate, output gate, memory unit and compressed input at time t, tanh( ) represents the hyperbolic tangent function, 1(t=1) is the indicator function, and the initial encoding vector v is at the initial moment of LSTM As input, z represents the video vector as input at the initial time t=1. For simplicity, the bias terms in the above formulas are omitted.

为了更好的利用来自多个模态语义属性的辅助信息，我们提出结合语义属性的注意力机制来计算权重矩阵W_*和U_h*，将传统LSTM的每个权重矩阵扩展为与K个属性相关权重矩阵的集合，用于挖掘单个单词的含义以生成最终的描述语句。用W_*(S_t)/U_h*(S_t)替换初始权重矩阵W_*/U_h*，其中S_t∈R^K是一个多模态的语义属性向量，随时刻动态变化。特别地，定义两个权重矩阵

其中n_h是隐藏单元数目，n_x是词嵌入向量的维数，则W_*(S_t)/U_h*(S_t)的表达式如公式(9)(10)所示：To better utilize auxiliary information from multiple modal semantic attributes, we propose an attention mechanism combined with semantic attributes to calculate weight matrices W _* and U _h* , extending each weight matrix of traditional LSTM to be related to K attributes A collection of relevant weight matrices for mining the meaning of individual words to generate final descriptive sentences. Replace the initial weight matrix W _* /U _h* with W _* (S _t )/U _h* (S _t ), where S _t ∈ R ^K is a multimodal semantic attribute vector that changes dynamically over time. In particular, define two weight matrices

Where n _h is the number of hidden units, and n _x is the dimension of the word embedding vector, then the expression of W _* (S _t )/U _h* (S _t ) is shown in formula (9) (10):

其中，W_τ[k]、U_τ[k]分别表示为权重矩阵W_τ和U_τ的第k个2D切片，其与概率值S_t[k]相关联，S_t[k]是多模态的语义属性向量S_t的第k个元素。由于W_τ为三维向量，所以2D切片指的是W_τ的一个切片，是一个二维的向量。where W _τ [k], U _τ [k] are denoted as the kth 2D slice of the weight matrices W _τ and U _τ , respectively, which are associated with the probability value S _t [k], which _is the multimodal The kth element of the semantic attribute vector S _t of the state. Since W _τ is a three-dimensional vector, a 2D slice refers to a slice of W _τ , which is a two-dimensional vector.

S_t的计算过程如公式(11)(12)(13)所示：The calculation process of S _t is shown in formula (11) (12) (13):

e_ti＝w^Ttanh(W_ah_t-1+U_as_i) (13)e _ti =w ^T tanh(W _a h _t-1 +U _a s _i ) (13)

其中，l＝3表示学习到的三个语义属性向量{s_f，s_o，s_c}，注意力权重a_ti反映了生成当前时刻视频中第i个语义属性的重要性程度。可以看出，对于不同的时间步长t，语义属性S_t是不同的，这使得模型在每次产生单词时有选择性地关注视频中的不同语义属性部分，j表示下标，e_ti表示未正则化的注意力权重，w^T表示转换矩阵。Among them, l=3 represents the learned three semantic attribute vectors {s _f , s _o , s _c }, and the attention weight a _ti reflects the importance of generating the i-th semantic attribute in the video at the current moment. It can be seen that for different time steps t, the semantic attributes S _t are different, which makes the model selectively focus on different semantic attribute parts in the video each time a word is generated, j represents the subscript, and e _ti represents Unregularized attention weights, w ^T denotes the transformation matrix.

在基于注意力机制的LSTM网络的训练过程中，等同于联合训练K个LSTM，网络的参数量和K值成正比，当K值很大时，网络几乎不能完成训练，采取以下的因式分解方法，如公式(14)(15)所示：In the training process of the LSTM network based on the attention mechanism, it is equivalent to jointly training K LSTMs. The parameter amount of the network is proportional to the K value. When the K value is large, the network can hardly complete the training. The following factorization is adopted method, as shown in formula (14) (15):

W_*(S_t)＝W_a·diag(W_bS_t)·W_c (14)W _* (S _t )=W _a ·diag(W _b S _t )·W _c (14)

U_h*(S_t)＝U_a·diag(U_bS_t)·U_c (15)U _h* (S _t )=U _a ·diag(U _b S _t ) ·U _c (15)

其中，

和

n_f表示因子分解的相关超参数。in,

and

n _f denotes the relevant hyperparameters for factorization.

为什么采用因式分解网络的参数量大大减少，规避了原始网络的参数量和K值成正比的难题，下面对网络的参数量进行一个分析。在公式(9)(10)中，总的参数量为K·n_h·(n_x+n_h)，可以认为参数量和K成正比。在公式(14)(15)中，W_*(S_t)公式参数量为n_f·(n_h+K+n_x)，U_h*(S_t)公式参数量为n_f·(2n_h+K)，二者的参数量之和为n_f·(3n_h+2K+n_x)。当指定n_f＝n_h时，对于较大的K值，n_f·(3n_h+2K+n_x)要远远小于K·n_h·(n_x+n_h)。Why the parameter quantity of the factorization network is greatly reduced, and the problem that the parameter quantity of the original network is proportional to the K value is avoided. The following is an analysis of the parameter quantity of the network. In formulas (9) and (10), the total parameter quantity is K·n _h ·(n _x +n _h ), and it can be considered that the parameter quantity is proportional to K. In formulas (14) and (15), the parameter quantity of the W _* (S _t ) formula is n _f ·(n _h +K+n _x ), and the parameter quantity of the U _h* (S _t ) formula is n _f ·(2n _{h )} +K), the sum of the two parameters is n _f ·(3n _h +2K+n _x ). When specifying n _f =n _h , for larger values of K, n _f ·(3n _h +2K+n _x ) is much smaller than K ·n _h ·(n _x +n _h ).

把因式分解的公式(14)(15)带入LSTM网络更新规则中可得公式(16)(17)(18)：Bring the factorized formulas (14) (15) into the LSTM network update rule to obtain formulas (16) (17) (18):

其中，⊙表示逐元素乘法运算符，对于S_t中的每个元素值，参数矩阵W_a和U_a是共享的，这可以有效地捕获视频中共有的语言模式，而对角线矩阵diag(W_bS_t)和diag(U_bS_t)考虑了不同视频中的特定语义属性部分，

表示融入语义属性向量的输入，

表示融入语义属性向量的隐藏状态。同理可证，f_t,o_t,c_t的表达式和上面公式相似。where ⊙ denotes the element-wise multiplication operator, and for each element value in S _t , the parameter matrices W _a and U _a are shared, which can effectively capture language patterns common in videos, while the diagonal matrix diag( W _b S _t ) and diag(U _b S _t ) consider the specific semantic attribute parts in different videos,

represents the input into the semantic attribute vector,

Represents the hidden state incorporated into the semantic attribute vector. It can be proved by the same reason that the expressions of f _t , o _t , and c _t are similar to the above formulas.

从上述这些公式可得，网络充分训练之后，既可以有效的捕获视频中共有的语言模式部分，又可以关注于视频中特定语义属性部分，同时由于采用因式分解，网络的参数量大大减少，规避了原始网络的参数量和K值成正比的难题。It can be seen from the above formulas that after the network is fully trained, it can not only effectively capture the common language pattern part in the video, but also focus on the specific semantic attribute part in the video. It avoids the problem that the parameter amount of the original network is proportional to the K value.

在生成视频的描述语句过程中，先采用贪婪搜索，每时刻输出的单词如公式(19)所示：In the process of generating video description sentences, greedy search is used first, and the words output at each moment are shown in formula (19):

w_t＝softmax(Wh_t) (19)w _t =softmax(Wh _t ) (19)

其中，W为转换矩阵。Among them, W is the transformation matrix.

因此，设计网络的生成句子的损失函数loss₂，如公式(20)所示：Therefore, the loss function loss ₂ of the generated sentence of the network is designed, as shown in formula (20):

loss₂＝-logP(Y|v,s_f,s_c,s_o)＝-∑logP(w_t|w_0～t-1) (20)loss ₂ =-logP(Y|v,s _f ,s _c ,s _o )=-∑logP(w _t |w _0～t-1 ) (20)

其中，Y＝{w₁,w₂,.......w_N}，表示由N个单词组成的句子，w_0～t-1为t时刻之前生成的单词。Among them, Y={w ₁ , w ₂ , ...... w _N }, represents a sentence composed of N words, and w _{0 to t-1} are words generated before time t.

将生成高层语义属性的分类损失loss₁和生成描述句子的损失loss₂相加，同时进行优化，可以保证句子中的上下文关系。基于获取的损失值，采用反向传播算法对网络进行训练。如图3中的Classification Loss(分类损失)模块和Captioning Loss(句子损失)模块，通过相加得到Total Loss(总的损失或全局损失)模块。The classification loss loss ₁ for generating high-level semantic attributes and the loss loss ₂ for generating description sentences are added together and optimized at the same time, which can ensure the context relationship in the sentence. Based on the obtained loss values, the network is trained using a back-propagation algorithm. As shown in Figure 3, the Classification Loss (classification loss) module and the Captioning Loss (sentence loss) module are added to obtain the Total Loss (total loss or global loss) module.

基于注意力机制的LSTM网络训练完毕后，基于所述初始编码向量、各特征表示对应的语义属性向量，通过基于注意力机制的LSTM网络得到待描述视频的描述语句。After the training of the LSTM network based on the attention mechanism is completed, based on the initial encoding vector and the corresponding semantic attribute vector of each feature representation, the description sentence of the video to be described is obtained through the LSTM network based on the attention mechanism.

本发明第二实施例的一种基于多模态注意力机制的视频内容描述系统，如图2所示，包括：获取模块100、提取特征表示模块200、语义属性检测模块300、生成视频描述模块400；A video content description system based on a multimodal attention mechanism according to the second embodiment of the present invention, as shown in FIG. 2 , includes: an acquisition module 100 , an extraction feature representation module 200 , a semantic attribute detection module 300 , and a video description generation module 400;

所述获取模块100，配置为获取待描述视频的视频帧序列，作为输入序列；The obtaining module 100 is configured to obtain the video frame sequence of the video to be described as the input sequence;

所述提取特征表示模块200，配置为提取所述输入序列的多模态特征向量，构建多模态特征向量序列，并通过循环神经网络得到各模态特征向量序列对应的特征表示；所述多模态特征向量序列包括视频帧特征向量序列、光流帧特征向量序列、视频片段特征向量序列；The extraction feature representation module 200 is configured to extract the multimodal feature vector of the input sequence, construct a multimodal feature vector sequence, and obtain the feature representation corresponding to each modal feature vector sequence through a recurrent neural network; The modal feature vector sequence includes a video frame feature vector sequence, an optical flow frame feature vector sequence, and a video segment feature vector sequence;

所述语义属性检测模块300，配置为基于各模态特征向量序列对应的特征表示，分别通过语义属性检测网络得到各特征表示对应的语义属性向量；The semantic attribute detection module 300 is configured to obtain a semantic attribute vector corresponding to each feature representation through a semantic attribute detection network based on the feature representation corresponding to each modal feature vector sequence;

所述生成视频描述模块400，配置为将各模态特征向量序列对应的特征表示进行级联，得到初始编码向量；基于所述初始编码向量、各特征表示对应的语义属性向量，通过基于注意力机制的LSTM网络得到所述待描述视频的描述语句；The generating video description module 400 is configured to concatenate feature representations corresponding to each modal feature vector sequence to obtain an initial coding vector; The LSTM network of the mechanism obtains the description sentence of the video to be described;

其中，in,

所述技术领域的技术人员可以清楚的了解到，为描述的方便和简洁，上述描述的系统的具体的工作过程及有关说明，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the technical field can clearly understand that, for the convenience and brevity of description, for the specific working process and related description of the system described above, reference may be made to the corresponding process in the foregoing method embodiments, which will not be repeated here.

需要说明的是，上述实施例提供的基于多模态注意力机制的视频内容描述系统，仅以上述各功能模块的划分进行举例说明，在实际应用中，可以根据需要而将上述功能分配由不同的功能模块来完成，即将本发明实施例中的模块或者步骤再分解或者组合，例如，上述实施例的模块可以合并为一个模块，也可以进一步拆分成多个子模块，以完成以上描述的全部或者部分功能。对于本发明实施例中涉及的模块、步骤的名称，仅仅是为了区分各个模块或者步骤，不视为对本发明的不当限定。It should be noted that, the video content description system based on the multi-modal attention mechanism provided by the above embodiments is only illustrated by the division of the above functional modules. That is, the modules or steps in the embodiments of the present invention are decomposed or combined. For example, the modules in the above-mentioned embodiments can be combined into one module, or can be further split into multiple sub-modules, so as to complete all the above descriptions. or some functions. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing each module or step, and should not be regarded as an improper limitation of the present invention.

本发明第三实施例的一种存储装置，其中存储有多条程序，所述程序适用于由处理器加载并实现上述的基于多模态注意力机制的视频内容描述方法。A storage device according to a third embodiment of the present invention stores a plurality of programs, and the programs are suitable for being loaded by a processor and implementing the above-mentioned video content description method based on a multimodal attention mechanism.

本发明第四实施例的一种处理装置，包括处理器、存储装置；处理器，适于执行各条程序；存储装置，适于存储多条程序；所述程序适于由处理器加载并执行以实现上述的基于多模态注意力机制的视频内容描述方法。A processing device according to a fourth embodiment of the present invention includes a processor and a storage device; the processor is adapted to execute various programs; the storage device is adapted to store multiple programs; the programs are adapted to be loaded and executed by the processor In order to realize the above-mentioned video content description method based on multimodal attention mechanism.

所述技术领域的技术人员可以清楚的了解到，未描述的方便和简洁，上述描述的存储装置、处理装置的具体工作过程及有关说明，可以参考前述方法实例中的对应过程，在此不再赘述。Those skilled in the technical field can clearly understand that the undescribed convenience and brevity are not described. Repeat.

本领域技术人员应该能够意识到，结合本文中所公开的实施例描述的各示例的模块、方法步骤，能够以电子硬件、计算机软件或者二者的结合来实现，软件模块、方法步骤对应的程序可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。为了清楚地说明电子硬件和软件的可互换性，在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以电子硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。本领域技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本发明的范围。Those skilled in the art should be aware that the modules and method steps of each example described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, computer software or a combination of the two, and the programs corresponding to the software modules and method steps Can be placed in random access memory (RAM), internal memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or as known in the art in any other form of storage medium. In order to clearly illustrate the interchangeability of electronic hardware and software, the components and steps of each example have been described generally in terms of functionality in the foregoing description. Whether these functions are performed in electronic hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods of implementing the described functionality for each particular application, but such implementations should not be considered beyond the scope of the present invention.

术语“第一”、“第二”等是用于区别类似的对象，而不是用于描述或表示特定的顺序或先后次序。The terms "first," "second," etc. are used to distinguish between similar objects, and are not used to describe or indicate a particular order or sequence.

术语“包括”或者任何其它类似用语旨在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备/装置不仅包括那些要素，而且还包括没有明确列出的其它要素，或者还包括这些过程、方法、物品或者设备/装置所固有的要素。The term "comprising" or any other similar term is intended to encompass a non-exclusive inclusion such that a process, method, article or device/means comprising a list of elements includes not only those elements but also other elements not expressly listed, or Also included are elements inherent to these processes, methods, articles or devices/devices.

至此，已经结合附图所示的优选实施方式描述了本发明的技术方案，但是，本领域技术人员容易理解的是，本发明的保护范围显然不局限于这些具体实施方式。在不偏离本发明的原理的前提下，本领域技术人员可以对相关技术特征作出等同的更改或替换，这些更改或替换之后的技术方案都将落入本发明的保护范围之内。So far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the accompanying drawings, however, those skilled in the art can easily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principle of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after these changes or substitutions will fall within the protection scope of the present invention.

Claims

1. a video content description method based on multimodal attention mechanism, is characterized in that, this method comprises:

Step S100, obtaining the video frame sequence of the video to be described as the input sequence;

Step S200, extracting the multimodal feature vector of the input sequence, constructing a multimodal feature vector sequence, and obtaining the feature representation corresponding to each modal feature vector sequence through a recurrent neural network; the multimodal feature vector sequence includes video Frame feature vector sequence, optical flow frame feature vector sequence, video segment feature vector sequence;

Step S300, based on the feature representation corresponding to each modal feature vector sequence, obtain the semantic attribute vector corresponding to each feature representation through a semantic attribute detection network respectively;

Step S400, concatenate the feature representations corresponding to each modal feature vector sequence to obtain an initial coding vector; based on the initial coding vector and the semantic attribute vector corresponding to each feature representation, obtain the LSTM network based on the attention mechanism. The description sentence of the video to be described;

in,

The semantic attribute detection network is constructed based on a multi-layer perceptron, and is trained based on training samples, where the training samples include feature representation samples and corresponding semantic attribute vector labels.

2. The video content description method based on a multimodal attention mechanism according to claim 1, wherein in step S200, "extract the multimodal feature vector of the input sequence, and construct a multimodal feature vector sequence" , the method is:

Perform feature extraction on each frame of RGB image in the input sequence based on a deep residual network to obtain a video frame feature vector sequence;

Based on the input sequence, the optical flow sequence is obtained through the Lucas-Kanade algorithm; the feature extraction is performed on the optical flow sequence through the deep residual network to obtain the optical flow frame feature vector sequence;

The input sequence is equally divided into T segments, and feature vectors of each segment are extracted respectively through a three-dimensional convolutional deep neural network to obtain a sequence of video segment feature vectors.

3. the video content description method based on multimodal attention mechanism according to claim 1, is characterized in that, its training method of described semantic attribute detection network is:

Obtain a training data set, the training data set includes a video and a corresponding description sentence;

Extract the words describing the sentences in the training data set, and sort them according to the frequency of occurrence, and select the first K words as high-level semantic attribute vectors; according to whether the description sentences contain the high-level semantic attribute vectors, obtain the real semantic attribute vectors of the video Label;

Obtain the feature representation corresponding to the multimodal feature vector sequence of the video in the training data set;

The semantic attribute detection network is trained based on the feature representation and the true semantic attribute vector labels.

4. The video content description method based on multi-modal attention mechanism according to claim 3, is characterized in that, in the training process of described semantic attribute detection network, its loss function loss ₁ is:

Among them, N is the number of description sentences in the training data set, K is the dimension of the predicted semantic attribute vector label output by the semantic attribute detection network, s _ik is the predicted semantic attribute vector label output by the semantic attribute detection network, and y _ik is the real semantic attribute Vector label, i, k are subscripts, α is the weight, W ^encoder is the set of all weight matrix and bias matrix parameters of the recurrent neural network and the semantic attribute detection network.

5. The video content description method based on a multimodal attention mechanism according to claim 1, wherein in step S400 "based on the initial coding vector, each feature represents the corresponding semantic attribute vector, The LSTM network of the mechanism obtains the sentence description of the video to be described", and the method is:

The semantic attribute vector corresponding to each feature representation is weighted by the attention mechanism to obtain the multi-modal semantic attribute vector;

Based on the initial encoding vector and the multimodal semantic attribute vector, a sentence description of the video to be described is generated through an LSTM network.

6. The video content description method based on a multimodal attention mechanism according to claim 1, wherein the LSTM network based on the attention mechanism adopts the method of factoring to carry out the calculation of the weight matrix in the training process .

7. A video content description system based on a multimodal attention mechanism, characterized in that the system comprises an acquisition module, an extraction feature representation module, a semantic attribute detection module, and a generated video description module;

The obtaining module is configured to obtain the video frame sequence of the video to be described as the input sequence;

The extraction feature representation module is configured to extract the multimodal feature vector of the input sequence, construct a multimodal feature vector sequence, and obtain the feature representation corresponding to each modal feature vector sequence through a recurrent neural network; the multimodal feature vector sequence The state feature vector sequence includes video frame feature vector sequence, optical flow frame feature vector sequence, and video segment feature vector sequence;

The semantic attribute detection module is configured to obtain a semantic attribute vector corresponding to each feature representation through a semantic attribute detection network based on the feature representation corresponding to each modal feature vector sequence;

The generating video description module is configured to concatenate the feature representations corresponding to each modal feature vector sequence to obtain an initial coding vector; based on the initial coding vector and the semantic attribute vector corresponding to each feature representation, the attention mechanism The LSTM network obtains the description sentence of the video to be described;

in,

8. A storage device, wherein a plurality of programs are stored, wherein the program application is loaded and executed by a processor to realize the video based on the multimodal attention mechanism of any one of claims 1-6 Content description method.

9. A processing device, comprising a processor and a storage device; the processor is adapted to execute each program; the storage device is adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by the processor to The method for describing video content based on a multimodal attention mechanism according to any one of claims 1 to 6 is implemented.