
CN113011334A - Video description method based on graph convolution neural network - Google Patents

Video description method based on graph convolution neural network

Info

Publication number
CN113011334A
CN113011334A
Authority
CN
China
Prior art keywords
feature vector
vector
feature
video
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110295594.3A
Other languages
Chinese (zh)
Inventor
宫晓东
杨光
孟宪菊
梅海艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihuang Chinese Medicine Culture Development Foundation
Original Assignee
Beijing Qihuang Chinese Medicine Culture Development Foundation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihuang Chinese Medicine Culture Development Foundation filed Critical Beijing Qihuang Chinese Medicine Culture Development Foundation
Priority to CN202110295594.3A priority Critical patent/CN113011334A/en
Publication of CN113011334A publication Critical patent/CN113011334A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract



The present application discloses a video description method based on a graph convolutional neural network. The method includes: step 1, extracting video frames from a video clip according to a sampling interval, recording them as sample video frames, and extracting a scene feature vector and an object feature vector from the sample video frames respectively; step 2, performing feature enhancement on the scene feature vector and the object feature vector according to a scaling vector and a point-wise product operation, and performing feature fusion on the enhanced scene feature vector and the enhanced object feature vector, the result being recorded as a fusion feature vector; step 3, performing video description on the fusion feature vector using the language model LSTM. Through the technical solutions in the present application, the global feature information and the local feature information in the video frames are mined and utilized respectively, and the different feature information is fused to describe the video content, improving the description accuracy.


Description

Video description method based on graph convolution neural network
Technical Field
The application relates to the technical field of computer vision and natural language processing, in particular to a video description method based on a graph convolution neural network.
Background
With the continuous development of computer technology and the Internet, multimedia technology has also advanced greatly, and video has become the main form of media transmission. Massive amounts of video data exist on the network, and anyone can upload videos without constraint, so illegal content such as violence and pornography inevitably appears; it is difficult to review all video content by manual screening alone, and careless omissions can hardly be avoided. Besides the video auditing work mentioned above, video content understanding and description can also be applied to robot human-machine interaction, assistance for visually impaired people, and other areas. Therefore, how to process and understand the content of video is important; however, for large-scale video data it is difficult for a computer to thoroughly understand the content of the video information.
Deep learning technology enables a computer to interpret video content, and an end-to-end encoder-decoder algorithm framework built together with natural language processing technology enables the computer to understand video content and describe it in natural language. In the video description task, a deep convolutional neural network (CNN) can effectively extract spatial or temporal features of a video, so the CNN is often used as the encoder; meanwhile, the long short-term memory network (LSTM) can effectively capture the relations among words in text, effectively alleviate the gradient explosion problem and ease the long-term dependence problem, so the language model LSTM is often used as the decoder.
However, in the prior art, because the CNN extracts only global feature information, conventional video description methods use only the global information of the video frames: each frame is treated as a whole and fed into the CNN, while the local information of the frames is not well mined and utilized and local features are lost. As a result, existing video description methods often suffer from the following technical problems:
(1) the motion description of the subject is inaccurate, such as describing the playing of a ball as running;
(2) the description of the object in the video is inaccurate, such as the description of the gender of the person is wrong.
Therefore, in order to describe the video more accurately, the problem of acquiring and processing local information in the video and the problem of fusion between different features need to be solved.
Disclosure of Invention
The purpose of this application lies in: based on the graph convolution neural network, the global feature information and the local feature information in the video frame are respectively mined and utilized, and different feature information is fused to describe the video content, so that the description accuracy is improved.
The technical scheme of the application is as follows: a video description method based on a graph convolution neural network is provided, and the method comprises the following steps: step 1, extracting video frames in a video clip according to a sampling interval, recording the video frames as sample video frames, and respectively extracting scene characteristic vectors and object characteristic vectors in the sample video frames; step 2, performing feature enhancement on the scene feature vector and the object feature vector according to the scaling vector and point-by-point product operation, performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, and recording as a fusion feature vector; and 3, performing video description on the fusion feature vector by using a language model LSTM.
In any one of the above technical solutions, further, in step 1, extracting a scene feature vector in a sample video frame specifically includes: step 11, inputting the sample video frame into a CNN network for a feature extraction operation, and recording the output of the last pooling layer of the CNN network as a high-dimensional feature map; step 12, performing a two-dimensional average pooling operation on the high-dimensional feature map, and recording the pooling result as a first feature vector; and step 13, inputting the first feature vector into a frame GCN network for encoding and embedding operations, and generating a scene feature vector.
In any one of the above technical solutions, further, in step 1, extracting an object feature vector in a sample video frame specifically includes: step 14, inputting the sample video frame into a target detection model, screening the area in the sample video frame by using a non-maximum suppression method, determining the area position of an object, and recording the area position as an object area; step 15, performing area correspondence on the object area and the high-dimensional feature map, and performing clipping and ROIAlign operation on the object area to generate a second feature vector; step 16, performing two-dimensional average pooling operation on the second feature vector to generate a third feature vector; and step 17, inputting the third feature vector into a regional GCN network for encoding and embedding operation, and generating an object feature vector.
In any one of the above technical solutions, further, step 13 specifically includes:
step 131, performing a linear transformation on the first feature vector, calculating the first row-vector relationship between the row vectors of the linearly transformed first feature vector, and determining a first graph matrix according to the first row-vector relationship, wherein the first graph matrix G is calculated as follows:

G_ij = exp(F(x_i, x_j)) / Σ_{j=1}^{T} exp(F(x_i, x_j))

F(x_i, x_j) = φ(x_i)^T φ(x_j)

φ(x_i) = W′x_i + b

where F(x_i, x_j) is the first row-vector relationship between the i-th row vector x_i and the j-th row vector x_j of the linearly transformed first feature vector, φ(·) is the row-wise linear transformation function, W′ is the first learnable parameter matrix, and b is the learnable bias coefficient;
step 132, according to the second learning parameter matrix, performing linear spatial transformation on the first feature vector X, and performing feature embedding on the first feature vector after the linear spatial transformation by using the first graph matrix G to generate a scene feature vector, where the corresponding calculation formula is:
Y=GXW
in the formula, Y is a scene feature vector, W is a second learning parameter matrix, and X is a first feature vector.
In any one of the above technical solutions, further, step 17 specifically includes: 171, performing linear transformation on the third eigenvector, calculating a second row vector relationship between each row vector in the linearly transformed third eigenvector, and determining a second graph matrix according to the second row vector relationship; and 172, performing linear space transformation on the third feature vector according to the third learning parameter matrix, and performing feature embedding on the linearly-space-transformed third feature vector by using the second graph matrix to generate an object feature vector.
In any one of the above technical solutions, further, before step 2, the method further includes: and respectively carrying out one-dimensional average pooling operation on the scene feature vector and the object feature vector.
In any one of the above technical solutions, further, in step 2, performing feature enhancement on the scene feature vector according to the scaling vector and the point-by-point product operation, specifically including:
calculating a scaling vector according to a gating mechanism and the scene characteristic vector after one-dimensional average pooling, wherein a corresponding calculation formula is as follows:
α = σ(g(v_T)) = σ(W_2·δ(W_1·v_T + b_1) + b_2)

where v_T is the one-dimensionally average-pooled scene feature vector, W_1 and W_2 are scaling matrices, b_1 and b_2 are scaling parameters, σ(·) is the Sigmoid activation function, and δ(·) is the ReLU function.

The one-dimensionally average-pooled scene feature vector and the scaling vector are then multiplied point-wise to enhance the scene feature vector.
In any of the above technical solutions, further, the video clip is composed of consecutive multi-frame video frames, and the sampling interval is obtained from the ratio of the total number of video frames to a preset number of sampling frames through a floor (rounding-down) operation.
The beneficial effect of this application is:
according to the technical scheme, firstly, a CNN network and a frame GCN network are combined, and global feature information in a video frame is extracted to serve as a scene feature vector; secondly, extracting an accurate object region by using a target detection model, extracting local characteristic information in a video frame as an object characteristic vector by using a characteristic embedding function of a regional GCN network, and well highlighting the effect of a key object in the whole video so as to well dig out the local characteristic information, namely the information of the object in the video;
Then, feature enhancement is performed on the scene feature vector and the object feature vector respectively using a scaling vector and a point-wise product operation. This process is implemented by two SE modules, which re-encode the global and local features so that the global and local feature information each highlight their own contribution to the video description. The algorithm can therefore take both global and local feature information into account, the description of actions becomes more accurate, and the generated description is both global and targeted at key objects.
And finally, describing the fused feature vector by using a language model LSTM.
In addition, the description method in this application yields a certain improvement with different CNN models, is applicable to both 2D CNNs and 3D CNNs, has good robustness, allows techniques such as late fusion to be used directly, and therefore has high practicability and applicability.
Drawings
The advantages of the above and/or additional aspects of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow diagram of a video description method based on a graph convolutional neural network according to one embodiment of the present application;
FIG. 2 is a schematic flow diagram of a video segment feature extraction process according to one embodiment of the present application;
FIG. 3 is a schematic diagram of a SE module according to one embodiment of the present application;
FIG. 4 is a video screenshot in a video description process according to one embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the present application can be more clearly understood, the present application will be described in further detail with reference to the accompanying drawings and detailed description. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited by the specific embodiments disclosed below.
As shown in fig. 1 and fig. 2, the present embodiment provides a video description method based on a graph-convolution neural network, including:
step 1, extracting video frames in a video clip according to a sampling interval, recording the video frames as sample video frames, and respectively extracting scene characteristic vectors and object characteristic vectors in the sample video frames, wherein the video clip is composed of continuous multiple frames of the video frames, and the sampling interval is obtained by a ratio of the total number of the video frames to a preset number of sampling frames through a down-rounding operation.
Specifically, the video clip in this embodiment is composed of F consecutive video frames, and the preset number of sampling frames used for video description is set to T. A floor (rounding-down) operation is therefore adopted, and the sampling interval t is calculated from the ratio of the total number of video frames to the preset number of sampling frames:

t = ⌊F / T⌋

According to the calculated sampling interval t, equidistant sampling is performed within the video clip to obtain T video frames, which are recorded as the sample video frames.
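For illustration only, a minimal Python sketch of this equidistant sampling is given below; the function name and the assumption that the decoded frames are already held in a list are not part of the application.

```python
# Minimal sketch of the equidistant sampling in step 1 (illustrative only).
def sample_frames(frames, T):
    """frames: list of decoded video frames; T: preset number of sample frames."""
    F = len(frames)                        # total number of video frames
    t = max(F // T, 1)                     # sampling interval t = floor(F / T)
    return [frames[i * t] for i in range(T) if i * t < F]   # T sample video frames
```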
In the embodiment, feature extraction is mainly performed on a sample video frame through a CNN network and a GCN network, then the extracted features are fused, global and local feature information in the sample video frame is highlighted, and then the fused features are described by using a language model LSTM, so that description of video clips is further realized.
It should be noted that the implementation of the language model LSTM in this embodiment is not limited.
In this embodiment, a method for extracting a scene feature vector in the sample video frame is shown, which specifically includes:
step 11, inputting the sample video frame into a CNN network for feature extraction operation, and recording the output of the last layer of pooling layer of the CNN network as a high-dimensional feature map;
in this embodiment, the selected CNN network may be a 2D CNN network such as ResNet, inclusion, or a 3D CNN network such as 3D ResNet, Temporal Segment Networks, or the like. When a 2D CNN network is selected, pre-training can be performed by using a picture classification database (ImageNet); when a 3D CNN network is selected, it can be pre-trained with a video classification database (Kinetics).
After the CNN network finishes pre-training, the T sample video frames are input into the CNN network for feature extraction, and the output of the last pooling layer of the CNN network is recorded as the high-dimensional feature map, where h is the height of the high-dimensional feature map, w is its width, and d is its number of channels.
Step 12, performing a two-dimensional average pooling operation on the high-dimensional feature map of size T × h × w × d, and recording the pooling result as the first feature vector. The spatial dimensions h and w are pooled away, so the first feature vector consists of T feature vectors of dimension d, i.e., it has size T × d.
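As a sketch of steps 11 and 12, the following PyTorch snippet extracts the last-pooling-layer feature map from an ImageNet-pretrained ResNet-50 (one possible backbone mentioned above) and applies the two-dimensional average pooling; the truncation point, the frame resolution, and the tensor layout (PyTorch stores the map as T × d × h × w) are illustrative assumptions, and a recent torchvision is assumed.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Sketch of steps 11-12 with an ImageNet-pretrained ResNet-50 backbone (assumption).
backbone = models.resnet50(weights="IMAGENET1K_V1")
feature_map_extractor = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

frames = torch.randn(8, 3, 224, 224)          # T = 8 sample video frames (illustrative)
with torch.no_grad():
    fmap = feature_map_extractor(frames)      # high-dimensional feature map: (T, d, h, w) = (8, 2048, 7, 7)
first_feature = fmap.mean(dim=(2, 3))         # two-dimensional average pooling -> (T, d)
```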
And step 13, inputting the first feature vector into a frame GCN network for coding and embedding operation, and generating the scene feature vector.
In this embodiment, the frame GCN network and the area GCN network are both the graph convolution neural network GCN, and the implementation of the GCN network in this embodiment is not limited.
Further, the process of generating the scene feature vector by the frame GCN network according to the input first feature vector specifically includes:
step 131, performing a linear transformation on the first feature vector, calculating the first row-vector relationship between the row vectors of the linearly transformed first feature vector, and determining a first graph matrix according to the first row-vector relationship, wherein the first graph matrix G is calculated as follows:

G_ij = exp(F(x_i, x_j)) / Σ_{j=1}^{T} exp(F(x_i, x_j))

F(x_i, x_j) = φ(x_i)^T φ(x_j)

φ(x_i) = W′x_i + b

where F(x_i, x_j) is the first row-vector relationship between the i-th row vector x_i and the j-th row vector x_j of the linearly transformed first feature vector, φ(·) is the row-wise linear transformation function, W′ is the first learnable parameter matrix, and b is the learnable bias coefficient.

Specifically, the first feature vector X is linearly transformed according to the first learnable parameter matrix W′ and the learnable bias coefficient b, which maps the first feature vector X into another space. Let x_i denote the i-th row vector of the first feature vector X, and let φ(·) denote the linear transformation function, whose result is a vector of the same size as the row vector x_i:

φ(x_i) = W′x_i + b

Mapping the first feature vector into this space gives it clearer relationship information, so that the relation between the row vectors can be obtained by a dot-product operation. The first row-vector relationship between row vectors is recorded as F(x_i, x_j); its result is a scalar, and the corresponding formula is:

F(x_i, x_j) = φ(x_i)^T φ(x_j)
taking a frame GCN network as an example, the row vector xiFeature vector representing the image of the ith frame, F (x)i,xj) The result of (1) describes the relative relationship between the image of the ith frame and the image of the jth frame, which is abstracted into a scalar quantity whose value has no actual physical meaning and is used only as an auxiliary GCN calculation.
To facilitate the subsequent operation of the frame GCN network, a first graph matrix G of size T × T is introduced, where G_ij, the element in the i-th row and j-th column of the first graph matrix G, is calculated as:

G_ij = exp(F(x_i, x_j)) / Σ_{j=1}^{T} exp(F(x_i, x_j))

The purpose of this step is to apply a Softmax function to each row of the row-vector relationships F(x_i, x_j), which normalizes them and eliminates unnecessary errors for the feature-embedding step of the GCN.
Step 132, according to a second learning parameter matrix W, performing linear spatial transformation on the first feature vector X, and performing feature embedding on the first feature vector after the linear spatial transformation by using the first graph matrix G to generate the scene feature vector, where a corresponding calculation formula is:
Y=GXW
specifically, the first feature vector X is subjected to linear spatial transformation through the second learning parameter matrix W to realize the encoding process of the first feature vector X, so that the first feature vector X is transformed into a suitable linear space for subsequent calculation, and then the first feature vector XW after the linear spatial transformation is subjected to feature embedding through the constructed first graph matrix G.
Through encoding and feature embedding, the scene feature vector Y output by the frame GCN network can play a role in highlighting some key frames in the video compared with the first feature vector X, so that the output of the frame GCN network can better dig out information in the video.
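The frame GCN described above can be sketched as follows; the class wrapper, the feature dimension, and the use of nn.Linear layers to hold W′ and W are illustrative assumptions, while the graph matrix construction and Y = GXW follow the formulas above.

```python
import torch
import torch.nn as nn

class FrameGCN(nn.Module):
    """Sketch of the frame GCN: build graph matrix G from pairwise relations of
    the linearly transformed row vectors, then compute Y = G X W."""
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim)            # phi(x_i) = W' x_i + b
        self.W = nn.Linear(dim, dim, bias=False)  # second learning parameter matrix W

    def forward(self, X):              # X: (T, d) first feature vector
        Z = self.phi(X)                # row-wise linear transformation
        rel = Z @ Z.t()                # F(x_i, x_j) = phi(x_i)^T phi(x_j)
        G = torch.softmax(rel, dim=1)  # row-wise Softmax -> first graph matrix G
        return G @ self.W(X)           # feature embedding: scene feature vector Y = G X W

# e.g. frame_gcn = FrameGCN(2048); Y = frame_gcn(first_feature)   # (T, 2048)
```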
On the basis of the foregoing embodiment, this embodiment further shows a method for extracting an object feature vector in the sample video frame, which specifically includes:
step 14, inputting the sample video frame into a target detection model, screening the area in the sample video frame by using a non-maximum suppression method, determining the area position of an object, and recording the area position as an object area;
in this embodiment, the selected target detection model (RPN) may be a mainstream target detection model such as YOLO v3, SSD, Faster R-CNN, and Mask R-CNN, and is pre-trained in a target detection database (MSCOCO) before use.
The T sample video frames are detected using the target detection model (RPN), and the number of object region positions obtained is set to N, that is, N object regions. During region extraction, the candidate regions are screened using non-maximum suppression (NMS), and the N screened regions are taken as the object regions.
Step 15, establishing region correspondence between the N object regions and the high-dimensional feature map obtained in step 11, that is, mapping the N object regions onto the high-dimensional feature map of size T × h × w × d, and then performing clipping and ROIAlign operations on the portion of the high-dimensional feature map corresponding to each object region to generate a second feature vector of size N × 7 × 7 × d;
Step 16, performing a two-dimensional average pooling operation on the second feature vector to generate a third feature vector. The spatial dimensions are pooled away, so the third feature vector consists of N feature vectors of dimension d, i.e., it has size N × d.
And step 17, inputting the third feature vector into a regional GCN network for encoding and embedding operation, and generating the object feature vector.
In this embodiment, the process of outputting the object feature vector by the regional GCN network is substantially the same as the process of outputting the scene feature vector by the frame GCN network, and specifically includes:
step 171, performing linear transformation on the third eigenvector, calculating a second row vector relationship between each row vector in the linearly transformed third eigenvector, and determining a second graph matrix according to the second row vector relationship;
and 172, performing linear space transformation on the third feature vector according to a third learning parameter matrix, and performing feature embedding on the linearly and spatially transformed third feature vector by using the second graph matrix to generate the object feature vector.
In this embodiment, the regional GCN network can find the key information among the N object feature vectors, i.e., local feature information such as appearance details, action details, and the interaction relationships between objects in the sample video frames, and this local feature information is enhanced through the encoding and embedding operations. In other words, the local feature information in the sample video frames is first mined by the target detection model RPN and then enhanced by the regional GCN network, so that it can be better utilized by the language model LSTM.
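A sketch of the region feature extraction in steps 15 and 16 is given below, using torchvision's roi_align; the spatial_scale value, the box format, and the reuse of the FrameGCN structure for the regional GCN of step 17 are assumptions for illustration.

```python
import torch
from torchvision.ops import roi_align

# Sketch of steps 15-16: map detected object regions onto the high-dimensional
# feature map, apply ROIAlign, and average-pool to N object vectors of dimension d.
# `fmap` is the (T, d, h, w) feature map from step 11; `boxes` is a list of length T,
# each entry an (n_i, 4) tensor of NMS-filtered boxes in frame-pixel (x1, y1, x2, y2).
def extract_object_features(fmap, boxes, spatial_scale=1.0 / 32):   # scale is an assumption
    regions = roi_align(fmap, boxes, output_size=(7, 7),
                        spatial_scale=spatial_scale, aligned=True)  # second feature vector: (N, d, 7, 7)
    return regions.mean(dim=(2, 3))                                 # third feature vector: (N, d)

# Step 17 then feeds the (N, d) result through a regional GCN with the same
# structure as the FrameGCN sketched earlier, e.g. region_gcn = FrameGCN(2048).
```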
Step 2, performing feature enhancement on the scene feature vector and the object feature vector according to the scaling vector and point-by-point product operation, performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, and recording as a fusion feature vector;
in this embodiment, two independent SE modules are mainly used to perform feature enhancement on the scene feature vector and the object feature vector, and before performing the feature enhancement, step 2 further includes:
respectively performing a one-dimensional average pooling operation on the scene feature vector and the object feature vector, where the result of the one-dimensional average pooling of the scene feature vector is a feature vector v_T of dimension d, and the result of the one-dimensional average pooling of the object feature vector is a feature vector v_N of dimension d; the feature vectors v_T and v_N are then used for feature enhancement.
In this embodiment, the same method is used to perform feature enhancement on the scene feature vector and the object feature vector, and now, taking the scene feature vector as an example, a process of feature enhancement is described, where the specific process includes:
According to the one-dimensionally average-pooled scene feature vector v_T, the scaling vector α is calculated as:

α = σ(W_2·δ(W_1·v_T + b_1) + b_2)

where v_T is the one-dimensionally average-pooled scene feature vector, W_1 and W_2 are scaling matrices, b_1 and b_2 are scaling parameters (W_1, W_2, b_1 and b_2 are all learnable parameters), σ(·) is the Sigmoid activation function, and δ(·) is the ReLU function.

The one-dimensionally average-pooled scene feature vector and the scaling vector are then multiplied point-wise to enhance the scene feature vector.
Specifically, as shown in fig. 3, the one-dimensionally average-pooled scene feature vector v_T is input into an SE module for feature enhancement and re-encoding, and the corresponding calculation formulas are:

ṽ_T = α ⊙ v_T

α = σ(g(v_T)) = σ(W_2·δ(W_1·v_T + b_1) + b_2)

where α is the scaling vector, ⊙ denotes the point-wise product, and ṽ_T is the feature-enhanced scene feature vector output by the SE module.
The process of feature enhancement of the object feature vector is not repeated.
In this embodiment, by introducing two independent SE modules, the key information of the scene feature vector and the object feature vector is found among the d channels, and is activated by enhancement, thereby realizing feature enhancement.
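A minimal sketch of one such SE module is shown below; the reduction ratio r and the class name are assumptions, while the gating formula follows the one given above. The commented usage line also shows the additive fusion described in the next paragraph.

```python
import torch
import torch.nn as nn

class SEGate(nn.Module):
    """Sketch of one SE module: alpha = sigmoid(W2 * relu(W1 * v + b1) + b2),
    enhanced = alpha point-wise-multiplied with v. Reduction ratio r is an assumption."""
    def __init__(self, d, r=16):
        super().__init__()
        self.fc1 = nn.Linear(d, d // r)   # W1, b1
        self.fc2 = nn.Linear(d // r, d)   # W2, b2

    def forward(self, v):                 # v: (d,) pooled feature vector (v_T or v_N)
        alpha = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))  # scaling vector
        return alpha * v                  # point-wise product -> enhanced feature

# Two independent SE modules, one per feature type; fusion (next paragraph) is their sum:
# se_scene, se_object = SEGate(2048), SEGate(2048)
# fused = se_scene(v_T) + se_object(v_N)
```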
Thereafter, the two feature-enhancement results ṽ_T and ṽ_N are added together for feature fusion, and the fusion result is recorded as the fusion feature vector ṽ:

ṽ = ṽ_T + ṽ_N

The fusion feature vector highlights the most important features of the global feature information and the local feature information respectively, so that the global and local feature information are better fused, where ṽ_N is the feature-enhancement result obtained by applying the second SE module to the one-dimensionally average-pooled object feature vector v_N.
Step 3, using the language model LSTM, video description is performed on the fusion feature vector ṽ: the fusion feature vector ṽ is input into the language model LSTM to obtain the probability distribution over the description words of the video frames, and the word with the highest probability is selected as the current output, thereby realizing the description of the video clip.
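As an illustration of this greedy word selection, a minimal PyTorch-style decoder sketch is given below; the vocabulary handling, token ids, hidden and embedding sizes, and the way the fusion feature vector initializes the LSTM state are assumptions and are not fixed by the application.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Sketch of the step-3 decoder: the fusion feature vector conditions an LSTM,
    which emits a word probability distribution at each step; the most probable
    word is taken as the current output (greedy decoding)."""
    def __init__(self, vocab_size, d_feat, d_hidden=512, d_embed=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.init_h = nn.Linear(d_feat, d_hidden)   # condition the LSTM on the fused feature
        self.lstm = nn.LSTMCell(d_embed, d_hidden)
        self.out = nn.Linear(d_hidden, vocab_size)

    def greedy_decode(self, fused, bos_id, eos_id, max_len=20):
        h = torch.tanh(self.init_h(fused)).unsqueeze(0)   # (1, d_hidden)
        c = torch.zeros_like(h)
        word, caption = torch.tensor([bos_id]), []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(word), (h, c))
            probs = torch.softmax(self.out(h), dim=-1)    # word probability distribution
            word = probs.argmax(dim=-1)                   # word with the highest probability
            if word.item() == eos_id:
                break
            caption.append(word.item())
        return caption                                    # list of predicted token ids
```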
It should be noted that, in the model training process, cross entropy is used as the loss function and Adam is used as the optimization algorithm to train the model; the parameters of the CNN network and the target detection model RPN are fixed, and only the GCN networks and the language model LSTM are trained.
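A corresponding training-setup sketch is given below: it freezes the CNN and detector parameters and optimizes only the GCN, SE and LSTM modules with cross-entropy loss and Adam, matching the note above; the module names passed in and the learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the training setup: cross-entropy loss, Adam, CNN and detector
# parameters fixed, only the GCN / SE / LSTM modules updated.
def build_optimizer(cnn_backbone, detector, trainable_modules, lr=1e-4):  # lr is an assumption
    for p in cnn_backbone.parameters():
        p.requires_grad = False          # fix CNN network parameters
    for p in detector.parameters():
        p.requires_grad = False          # fix target detection model (RPN) parameters
    params = [p for m in trainable_modules for p in m.parameters()]
    return torch.optim.Adam(params, lr=lr)

criterion = nn.CrossEntropyLoss()        # cross-entropy loss over predicted words

# e.g. optimizer = build_optimizer(backbone, rpn,
#          [frame_gcn, region_gcn, se_scene, se_object, decoder])
# per step: loss = criterion(word_logits, target_ids); loss.backward(); optimizer.step()
```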
In order to verify the accuracy of the video description method in this embodiment, a network video as shown in FIG. 4 is selected for video description, where FIG. 4(A), (B), and (C) are screenshots of 3 sample video frames sampled at equal intervals, and the description results of different description methods are shown in Table 1.
TABLE 1

Method | Text description
Real situation (ground truth) | A man is speaking on a bench
Prior-art description method | One man is speaking
Description method of the present embodiment | One man is sitting on the chair and speaking
Therefore, by using the video description method in the embodiment, not only can the object in the video frame be accurately found, but also the mined local information and the global information can be well fused, so that the video can be more accurately described.
The technical scheme of the present application is described in detail above with reference to the accompanying drawings, and the present application provides a video description method based on a graph convolution neural network, the method including: step 1, extracting video frames in a video clip according to a sampling interval, recording the video frames as sample video frames, and respectively extracting scene characteristic vectors and object characteristic vectors in the sample video frames; step 2, performing feature enhancement on the scene feature vector and the object feature vector according to the scaling vector and point-by-point product operation, performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, and recording as a fusion feature vector; and 3, performing video description on the fusion feature vector by using a language model LSTM. According to the technical scheme, the global feature information and the local feature information in the video frame are mined and utilized respectively, and different feature information is fused to describe the video content, so that the description accuracy is improved.
The order of the steps in the present application may be adjusted, and steps may be combined or deleted according to actual requirements.
The units in the device can be merged, divided and deleted according to actual requirements.
Although the present application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and not restrictive of the application of the present application. The scope of the present application is defined by the appended claims and may include various modifications, adaptations, and equivalents of the invention without departing from the scope and spirit of the application.

Claims (8)

1. A video description method based on a graph convolutional neural network, characterized in that the method comprises:
Step 1, extracting video frames from a video clip according to a sampling interval, recording them as sample video frames, and respectively extracting a scene feature vector and an object feature vector from the sample video frames;
Step 2, performing feature enhancement on the scene feature vector and the object feature vector according to a scaling vector and a point-wise product operation, and performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, the result being recorded as a fusion feature vector;
Step 3, performing video description on the fusion feature vector using the language model LSTM.

2. The video description method based on a graph convolutional neural network according to claim 1, characterized in that, in step 1, extracting the scene feature vector from the sample video frames specifically comprises:
Step 11, inputting the sample video frames into a CNN network for feature extraction, and recording the output of the last pooling layer of the CNN network as a high-dimensional feature map;
Step 12, performing a two-dimensional average pooling operation on the high-dimensional feature map, and recording the pooling result as a first feature vector;
Step 13, inputting the first feature vector into a frame GCN network for encoding and embedding operations to generate the scene feature vector.

3. The video description method based on a graph convolutional neural network according to claim 2, characterized in that, in step 1, extracting the object feature vector from the sample video frames specifically comprises:
Step 14, inputting the sample video frames into a target detection model, screening regions in the sample video frames using a non-maximum suppression method, determining the region positions of objects, and recording them as object regions;
Step 15, establishing region correspondence between the object regions and the high-dimensional feature map, and performing clipping and ROIAlign operations on the object regions to generate a second feature vector;
Step 16, performing a two-dimensional average pooling operation on the second feature vector to generate a third feature vector;
Step 17, inputting the third feature vector into a regional GCN network for encoding and embedding operations to generate the object feature vector.

4. The video description method based on a graph convolutional neural network according to claim 2, characterized in that step 13 specifically comprises:
Step 131, performing a linear transformation on the first feature vector, calculating the first row-vector relationship between the row vectors of the linearly transformed first feature vector, and determining a first graph matrix according to the first row-vector relationship, wherein the first graph matrix G is calculated as:

G_ij = exp(F(x_i, x_j)) / Σ_{j=1}^{T} exp(F(x_i, x_j))

F(x_i, x_j) = φ(x_i)^T φ(x_j)

φ(x_i) = W′x_i + b

where F(x_i, x_j) is the first row-vector relationship between the i-th row vector x_i and the j-th row vector x_j of the linearly transformed first feature vector, φ(·) is the row-wise linear transformation function, W′ is the first learnable parameter matrix, and b is the learnable bias coefficient;
Step 132, performing a linear space transformation on the first feature vector X according to a second learning parameter matrix, and performing feature embedding on the linearly space-transformed first feature vector using the first graph matrix G to generate the scene feature vector, the corresponding formula being:

Y = GXW

where Y is the scene feature vector, W is the second learning parameter matrix, and X is the first feature vector.

5. The video description method based on a graph convolutional neural network according to claim 3, characterized in that step 17 specifically comprises:
Step 171, performing a linear transformation on the third feature vector, calculating the second row-vector relationship between the row vectors of the linearly transformed third feature vector, and determining a second graph matrix according to the second row-vector relationship;
Step 172, performing a linear space transformation on the third feature vector according to a third learning parameter matrix, and performing feature embedding on the linearly space-transformed third feature vector using the second graph matrix to generate the object feature vector.

6. The video description method based on a graph convolutional neural network according to claim 1, characterized in that, before step 2, the method further comprises:
respectively performing a one-dimensional average pooling operation on the scene feature vector and the object feature vector.

7. The video description method based on a graph convolutional neural network according to claim 6, characterized in that, in step 2, performing feature enhancement on the scene feature vector according to the scaling vector and the point-wise product operation specifically comprises:
calculating the scaling vector according to a gating mechanism and the one-dimensionally average-pooled scene feature vector, the corresponding formula being:

α = σ(g(v_T)) = σ(W_2·δ(W_1·v_T + b_1) + b_2)

where v_T is the one-dimensionally average-pooled scene feature vector, W_1 and W_2 are scaling matrices, b_1 and b_2 are scaling parameters, σ(·) is the Sigmoid activation function, and δ(·) is the ReLU function; and
performing a point-wise product operation on the one-dimensionally average-pooled scene feature vector and the scaling vector, thereby enhancing the scene feature vector.

8. The video description method based on a graph convolutional neural network according to claim 1, characterized in that the video clip is composed of consecutive multi-frame video frames, and
the sampling interval is obtained from the ratio of the total number of the video frames to a preset number of sampling frames through a floor (rounding-down) operation.
CN202110295594.3A 2021-03-19 2021-03-19 Video description method based on graph convolution neural network Pending CN113011334A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110295594.3A CN113011334A (en) 2021-03-19 2021-03-19 Video description method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110295594.3A CN113011334A (en) 2021-03-19 2021-03-19 Video description method based on graph convolution neural network

Publications (1)

Publication Number Publication Date
CN113011334A true CN113011334A (en) 2021-06-22

Family

ID=76403128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110295594.3A Pending CN113011334A (en) 2021-03-19 2021-03-19 Video description method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN113011334A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114386259A (en) * 2021-12-29 2022-04-22 桂林电子科技大学 A video description method, device and storage medium
CN114419525A (en) * 2022-03-30 2022-04-29 成都考拉悠然科技有限公司 Harmful video detection method and system
CN115455233A (en) * 2022-08-08 2022-12-09 中国科学院自动化研究所 Method, device, equipment and storage medium for generating video dynamic thumbnail

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259197A (en) * 2020-01-13 2020-06-09 清华大学 A video description generation method based on precoding semantic features
CN111488807A (en) * 2020-03-29 2020-08-04 复旦大学 Video description generation system based on graph convolution network
WO2020190112A1 (en) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Method, apparatus, device and medium for generating captioning information of multimedia data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020190112A1 (en) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Method, apparatus, device and medium for generating captioning information of multimedia data
CN111259197A (en) * 2020-01-13 2020-06-09 清华大学 A video description generation method based on precoding semantic features
CN111488807A (en) * 2020-03-29 2020-08-04 复旦大学 Video description generation system based on graph convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU Hang: "Research on Video Semantic Description Method Based on Deep Network and Multi-Feature Fusion" (基于深度网络与多特征融合的视频语义描述方法研究), China Master's Theses Full-text Database, 15 February 2020 (2020-02-15), pages 27-30 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114386259A (en) * 2021-12-29 2022-04-22 桂林电子科技大学 A video description method, device and storage medium
CN114386259B (en) * 2021-12-29 2025-04-15 桂林电子科技大学 Video description method, device and storage medium
CN114419525A (en) * 2022-03-30 2022-04-29 成都考拉悠然科技有限公司 Harmful video detection method and system
CN115455233A (en) * 2022-08-08 2022-12-09 中国科学院自动化研究所 Method, device, equipment and storage medium for generating video dynamic thumbnail

Similar Documents

Publication Publication Date Title
AU2019213369B2 (en) Non-local memory network for semi-supervised video object segmentation
CN111091045B (en) Sign language identification method based on space-time attention mechanism
CN111079532A (en) Video content description method based on text self-encoder
CN113011334A (en) Video description method based on graph convolution neural network
GB2579262A (en) Space-time memory network for locating target object in video content
CN113420179A (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
CN114693577B (en) A Fusion Method of Infrared Polarization Image Based on Transformer
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN114494297B (en) Adaptive video target segmentation method for processing multiple priori knowledge
CN116309890A (en) Model generation method, stylized image generation method, device and electronic device
CN118982597A (en) A remote sensing image generation method and device based on multi-condition controllable diffusion model
CN114661874A (en) A visual question answering method based on multi-angle semantic understanding and adaptive dual-channel
Lu et al. Tcnet: Continuous sign language recognition from trajectories and correlated regions
CN114241218A (en) Target significance detection method based on step-by-step attention mechanism
CN111325068A (en) Video description method and device based on convolutional neural network
CN117350910A (en) Image watermark protection method based on diffusion image editing model
CN116703857A (en) A video action quality evaluation method based on spatio-temporal domain perception
CN117788629B (en) Image generation method, device and storage medium with style personalization
CN118627176A (en) Garden landscape design auxiliary system and method based on virtual reality technology
Chen et al. Generative Multi-Modal Mutual Enhancement Video Semantic Communications.
CN114511813B (en) Video semantic description method and device
CN116320389A (en) No-reference video quality assessment method based on self-attention recurrent neural network
CN120163162A (en) A steganographic text detection technology based on text reconstruction and word order semantic features
CN116883902B (en) Action recognition method based on multi-scale space-time characteristic distillation
CN116403142B (en) Video processing method, device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210622