Disclosure of Invention
The purpose of this application is as follows: based on the graph convolution neural network, the global feature information and the local feature information in the video frame are mined and utilized respectively, and the different feature information is fused to describe the video content, so that the description accuracy is improved.
The technical scheme of the application is as follows: a video description method based on a graph convolution neural network is provided, and the method comprises the following steps: step 1, extracting video frames in a video clip according to a sampling interval, recording the video frames as sample video frames, and respectively extracting scene characteristic vectors and object characteristic vectors in the sample video frames; step 2, performing feature enhancement on the scene feature vector and the object feature vector according to the scaling vector and point-by-point product operation, performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, and recording as a fusion feature vector; and 3, performing video description on the fusion feature vector by using a language model LSTM.
In any one of the above technical solutions, further, in step 1, extracting a scene feature vector in a sample video frame specifically includes: step 11, inputting a sample video frame into a CNN network for feature extraction operation, and recording the output of the last pooling layer of the CNN network as a high-dimensional feature map; step 12, performing two-dimensional average pooling operation on the high-dimensional feature map, and recording a pooling result as a first feature vector; and step 13, inputting the first feature vector into a frame GCN network for coding and embedding operation, and generating a scene feature vector.
In any one of the above technical solutions, further, in step 1, extracting an object feature vector in a sample video frame specifically includes: step 14, inputting the sample video frame into a target detection model, screening the area in the sample video frame by using a non-maximum suppression method, determining the area position of an object, and recording the area position as an object area; step 15, performing area correspondence on the object area and the high-dimensional feature map, and performing clipping and ROIAlign operation on the object area to generate a second feature vector; step 16, performing two-dimensional average pooling operation on the second feature vector to generate a third feature vector; and step 17, inputting the third feature vector into a regional GCN network for encoding and embedding operation, and generating an object feature vector.
In any one of the above technical solutions, further, step 13 specifically includes:
step 131, performing linear transformation on the first feature vector, calculating a first row vector relationship between the row vectors in the linearly transformed first feature vector, and determining a first graph matrix according to the first row vector relationship, wherein the calculation formulas of the linear transformation, the first row vector relationship and the first graph matrix G are as follows:
φ(x_i) = W'x_i + b
F(x_i, x_j) = φ(x_i)·φ(x_j)
G_ij = exp(F(x_i, x_j)) / Σ_j exp(F(x_i, x_j))
in the formulas, F(x_i, x_j) is the first row vector relationship between the i-th row vector x_i and the j-th row vector x_j of the linearly transformed first feature vector, φ(·) is the row linear transformation function, W' is the first learnable parameter matrix, b is the learnable bias coefficient, and G_ij is the element in the i-th row and j-th column of the first graph matrix G;
step 132, according to the second learning parameter matrix, performing linear spatial transformation on the first feature vector X, and performing feature embedding on the first feature vector after the linear spatial transformation by using the first graph matrix G to generate a scene feature vector, where the corresponding calculation formula is:
Y=GXW
in the formula, Y is a scene feature vector, W is a second learning parameter matrix, and X is a first feature vector.
In any one of the above technical solutions, further, step 17 specifically includes: 171, performing linear transformation on the third feature vector, calculating a second row vector relationship between each row vector in the linearly transformed third feature vector, and determining a second graph matrix according to the second row vector relationship; and 172, performing linear space transformation on the third feature vector according to the third learning parameter matrix, and performing feature embedding on the linearly-space-transformed third feature vector by using the second graph matrix to generate an object feature vector.
In any one of the above technical solutions, further, before step 2, the method further includes: and respectively carrying out one-dimensional average pooling operation on the scene feature vector and the object feature vector.
In any one of the above technical solutions, further, in step 2, performing feature enhancement on the scene feature vector according to the scaling vector and the point-by-point product operation, specifically including:
calculating a scaling vector according to a gating mechanism and the scene characteristic vector after one-dimensional average pooling, wherein a corresponding calculation formula is as follows:
α = σ(g(v_T)) = σ(W_2 δ(W_1 v_T + b_1) + b_2)
in the formula, v_T is the scene feature vector after one-dimensional average pooling, W_1 and W_2 are scaling matrices, b_1 and b_2 are scaling bias parameters, σ(·) is the Sigmoid activation function, and δ(·) is the ReLU function;
and performing a point-by-point product operation on the scene feature vector after one-dimensional average pooling and the scaling vector, so as to carry out feature enhancement on the scene feature vector.
In any of the above technical solutions, further, the video segment is composed of consecutive multi-frame video frames, and the sampling interval is obtained by rounding down the ratio of the total number of video frames to a preset number of sampling frames.
The beneficial effect of this application is:
according to the technical scheme, firstly, a CNN network and a frame GCN network are combined to extract the global feature information in the video frames as the scene feature vector; secondly, a target detection model is used to extract accurate object regions, and the feature embedding function of the regional GCN network is used to extract the local feature information in the video frames as the object feature vector, which highlights the role of key objects in the whole video and thus mines the local feature information, namely the information of the objects in the video, well;
then, feature enhancement is performed on the scene feature vector and the object feature vector respectively by using the scaling vector and the point-by-point product operation. This process is realized by two SE modules, which recode the global and local features so that the global and local feature information can each highlight its own contribution to the video description. As a result, the algorithm takes both the global and the local feature information into account, the description of actions is more accurate, and the generated description is both global and targeted at key objects.
And finally, describing the fused feature vector by using a language model LSTM.
In addition, the description method in the application brings a certain performance improvement on different CNN models, is applicable to both 2D CNNs and 3D CNNs, has good robustness, allows techniques such as late fusion to be used directly, and has high practicability and applicability.
Detailed Description
In order that the above objects, features and advantages of the present application can be more clearly understood, the present application will be described in further detail with reference to the accompanying drawings and detailed description. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited by the specific embodiments disclosed below.
As shown in fig. 1 and fig. 2, the present embodiment provides a video description method based on a graph-convolution neural network, including:
step 1, extracting video frames in a video clip according to a sampling interval, recording the video frames as sample video frames, and respectively extracting scene characteristic vectors and object characteristic vectors in the sample video frames, wherein the video clip is composed of continuous multiple frames of the video frames, and the sampling interval is obtained by a ratio of the total number of the video frames to a preset number of sampling frames through a down-rounding operation.
Specifically, the video segment in this embodiment is composed of consecutive F video frames, and the preset sampling frame number used for video description is set to T. A rounding-down operation is therefore adopted, and the sampling interval t is calculated from the ratio of the total number of video frames to the preset sampling frame number, where the corresponding calculation formula is:
t = ⌊F/T⌋
According to the calculated sampling interval t, equidistant sampling is carried out in the video clip to obtain T video frames, which are recorded as the sample video frames.
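As an illustration only, a minimal sketch of this equidistant sampling is given below; it assumes the decoded frames are already available as an in-memory sequence, and all names are placeholders rather than part of the claimed method.

```python
# Hedged sketch of the equidistant sampling step: t = floor(F / T), then take every t-th frame.
# Assumes the decoded frames are already available as a Python sequence; names are illustrative.
def sample_frames(frames, num_samples):
    """Select num_samples frames at a fixed interval t = floor(F / num_samples)."""
    F = len(frames)
    t = max(F // num_samples, 1)               # sampling interval t, rounded down; at least 1
    return [frames[i * t] for i in range(num_samples) if i * t < F]

# Example: a 160-frame clip sampled down to T = 16 frames (interval t = 10).
clip = list(range(160))                         # stand-in for decoded video frames
sample_video_frames = sample_frames(clip, num_samples=16)
assert len(sample_video_frames) == 16
```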
In the embodiment, feature extraction is mainly performed on a sample video frame through a CNN network and a GCN network, then the extracted features are fused, global and local feature information in the sample video frame is highlighted, and then the fused features are described by using a language model LSTM, so that description of video clips is further realized.
It should be noted that the implementation of the language model LSTM in this embodiment is not limited.
In this embodiment, a method for extracting a scene feature vector in the sample video frame is shown, which specifically includes:
step 11, inputting the sample video frame into a CNN network for feature extraction operation, and recording the output of the last pooling layer of the CNN network as a high-dimensional feature map;
in this embodiment, the selected CNN network may be a 2D CNN network such as ResNet or Inception, or a 3D CNN network such as 3D ResNet or Temporal Segment Networks. When a 2D CNN network is selected, it can be pre-trained on a picture classification database (ImageNet); when a 3D CNN network is selected, it can be pre-trained on a video classification database (Kinetics).
After the CNN network finishes pre-training, the T sample video frames are input into the CNN network for feature extraction, and the output of the last pooling layer of the CNN network is recorded as the high-dimensional feature map, where h is the height of the high-dimensional feature map, w is its width, and d is its number of channels.
Step 12, performing a two-dimensional average pooling operation over the h × w spatial dimensions of the high-dimensional feature map of size T × h × w × d, and recording the pooling result as the first feature vector. Thus, the first feature vector is a feature vector of size T × d.
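A minimal PyTorch-style sketch of steps 11 and 12 follows, assuming a ResNet-50 backbone pre-trained on ImageNet and taking the last convolutional feature map (before the global average pool) as the high-dimensional feature map; the backbone choice and layer indexing are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative backbone: ResNet-50 pre-trained on ImageNet; the embodiment leaves the CNN choice open.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# Keep everything up to (and excluding) the global average pool and classifier, so the
# output is the per-frame feature map referred to above as the high-dimensional feature map.
feature_extractor = nn.Sequential(*list(backbone.children())[:-2]).eval()

T = 16
frames = torch.randn(T, 3, 224, 224)                # T sample video frames (dummy data)
with torch.no_grad():
    feat_map = feature_extractor(frames)             # shape: (T, d=2048, h=7, w=7)

# Step 12: two-dimensional average pooling over the spatial dimensions h x w
# gives the first feature vector of size T x d.
first_feature_vector = feat_map.mean(dim=(2, 3))     # shape: (T, 2048)
print(first_feature_vector.shape)
```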
And step 13, inputting the first feature vector into a frame GCN network for coding and embedding operation, and generating the scene feature vector.
In this embodiment, the frame GCN network and the regional GCN network are both graph convolution neural networks (GCN), and the specific implementation of the GCN network in this embodiment is not limited.
Further, the process of generating the scene feature vector by the frame GCN network according to the input first feature vector specifically includes:
step 131, performing linear transformation on the first feature vector, calculating a first row vector relationship between the row vectors in the linearly transformed first feature vector, and determining a first graph matrix G according to the first row vector relationship; the row linear transformation is calculated as:
φ(x_i) = W'x_i + b
in the formula, F(x_i, x_j) is the first row vector relationship between the i-th row vector x_i and the j-th row vector x_j of the linearly transformed first feature vector, φ(·) is the row linear transformation function, W' is the first learnable parameter matrix, and b is the learnable bias coefficient.
Specifically, according to the first learnable parameter matrix W' and the learnable bias coefficient b, the first feature vector X is subjected to a linear transformation that maps it to another space. Let x_i denote the i-th row vector of the first feature vector X, and let φ(x_i) denote the row linear transformation function, whose result is a vector of the same size as the row vector x_i; the corresponding calculation formula is:
φ(x_i) = W'x_i + b
Mapping the first feature vector to another space gives the mapped row vectors more explicit relationship information, so that the relationship between the row vectors can be obtained through a dot product operation. The first row vector relationship between the row vectors is recorded as F(x_i, x_j); its result is a scalar, and the corresponding calculation formula is:
F(x_i, x_j) = φ(x_i)·φ(x_j)
Taking the frame GCN network as an example, the row vector x_i represents the feature vector of the i-th frame image, and the result of F(x_i, x_j) describes the relative relationship between the i-th frame image and the j-th frame image; this relationship is abstracted into a scalar whose value has no actual physical meaning and is used only to assist the GCN calculation.
To facilitate the subsequent operation of the frame GCN network, a first graph matrix G of size T × T is introduced, where G_ij is the element in the i-th row and j-th column of the first graph matrix G, and the corresponding calculation formula is:
G_ij = exp(F(x_i, x_j)) / Σ_j exp(F(x_i, x_j))
This formula applies a Softmax function to each row of the row vector relationships F(x_i, x_j), so as to achieve normalization and eliminate unnecessary errors for the subsequent feature embedding step of the GCN.
Step 132, according to a second learning parameter matrix W, performing linear spatial transformation on the first feature vector X, and performing feature embedding on the first feature vector after the linear spatial transformation by using the first graph matrix G to generate the scene feature vector, where a corresponding calculation formula is:
Y=GXW
specifically, the first feature vector X is subjected to linear spatial transformation through the second learning parameter matrix W to realize the encoding process of the first feature vector X, so that the first feature vector X is transformed into a suitable linear space for subsequent calculation, and then the first feature vector XW after the linear spatial transformation is subjected to feature embedding through the constructed first graph matrix G.
Through encoding and feature embedding, the scene feature vector Y output by the frame GCN network, compared with the first feature vector X, plays a role in highlighting key frames in the video, so that the output of the frame GCN network mines the information in the video better.
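Under the assumption that the graph matrix is built from row-wise Softmax-normalized dot products of linearly transformed row vectors, as described above, the frame GCN can be sketched as a single module; dimensions and names below are illustrative only.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Sketch of the frame/regional GCN: build a graph matrix G from the input rows,
    then embed the features as Y = G X W (step 132)."""
    def __init__(self, dim: int):
        super().__init__()
        self.phi = nn.Linear(dim, dim, bias=True)    # phi(x_i) = W' x_i + b
        self.W = nn.Linear(dim, dim, bias=False)     # second learnable parameter matrix W

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (num_rows, dim), e.g. the first feature vector of size T x d.
        phi_X = self.phi(X)                           # linear mapping of each row vector
        F_rel = phi_X @ phi_X.t()                     # F(x_i, x_j): dot-product relationships
        G = torch.softmax(F_rel, dim=1)               # row-wise Softmax -> graph matrix G
        return G @ self.W(X)                          # Y = G X W

# Example: encode the T x d first feature vector into the scene feature vector.
frame_gcn = GCNLayer(dim=2048)
scene_feature_vector = frame_gcn(torch.randn(16, 2048))   # shape: (16, 2048)
```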
On the basis of the foregoing embodiment, this embodiment further shows a method for extracting an object feature vector in the sample video frame, which specifically includes:
step 14, inputting the sample video frame into a target detection model, screening the area in the sample video frame by using a non-maximum suppression method, determining the area position of an object, and recording the area position as an object area;
In this embodiment, the selected target detection model (RPN) may be a mainstream target detection model such as YOLO v3, SSD, Faster R-CNN, or Mask R-CNN, and it is pre-trained on a target detection database (MSCOCO) before use.
The T sample video frames are detected using the target detection model (RPN), and the number of object region positions obtained is set to N, that is, N object regions. In the process of extracting the object regions, the candidate regions are screened using a non-maximum suppression method (NMS), and the N screened regions are taken as the object regions.
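The region screening step can be sketched, for example, with torchvision's nms operator; the IoU threshold, the cap of N regions, and the dummy proposals below are assumptions, and the detector itself is treated as a black box.

```python
import torch
from torchvision.ops import nms

def select_object_regions(boxes: torch.Tensor, scores: torch.Tensor,
                          iou_thresh: float = 0.5, max_regions: int = 10) -> torch.Tensor:
    """Screen detector proposals with NMS and keep at most max_regions (= N) object regions.

    boxes:  (K, 4) candidate boxes in (x1, y1, x2, y2) format from the detection model
    scores: (K,) confidence scores for the candidate boxes
    """
    keep = nms(boxes, scores, iou_thresh)            # indices kept by non-maximum suppression
    keep = keep[:max_regions]                        # keep the N highest-scoring regions
    return boxes[keep]                               # the N object regions

# Example with dummy proposals: the two overlapping boxes collapse to one region.
boxes = torch.tensor([[0., 0., 50., 50.], [2., 2., 52., 52.], [100., 100., 180., 160.]])
scores = torch.tensor([0.9, 0.8, 0.75])
object_regions = select_object_regions(boxes, scores)
```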
Step 15, performing region correspondence between the N object regions and the high-dimensional feature map obtained in the step 11, that is, corresponding the N object regions to a high-dimensional feature map with a size of T × h × w × d, and then performing clipping and roiign operations on the high-dimensional feature map corresponding to each object region to generate a second feature vector with a size of N × 7 × 7 × d;
Step 16, performing a two-dimensional average pooling operation on the second feature vector over its 7 × 7 spatial dimensions to generate the third feature vector. Thus, the third feature vector is a feature vector of size N × d.
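Steps 15 and 16 can be sketched with torchvision's roi_align, assuming the object regions are given in input-image coordinates and the feature map has a stride of 32 relative to the input; these values, and the channels-first layout, are illustrative only.

```python
import torch
from torchvision.ops import roi_align

# High-dimensional feature map of one sample frame, shape (1, d, h, w) in channels-first layout.
feat_map = torch.randn(1, 2048, 7, 7)

# N object regions for that frame in image coordinates (x1, y1, x2, y2).
object_regions = torch.tensor([[10., 20., 120., 200.],
                               [30., 40., 180., 210.]])
rois = torch.cat([torch.zeros(len(object_regions), 1), object_regions], dim=1)  # prepend batch index

# Step 15: crop + ROIAlign -> second feature vector (N x d x 7 x 7 here).
second_feature_vector = roi_align(feat_map, rois, output_size=(7, 7),
                                  spatial_scale=1.0 / 32)   # assumed image-to-feature-map stride

# Step 16: two-dimensional average pooling over 7 x 7 -> third feature vector of size N x d.
third_feature_vector = second_feature_vector.mean(dim=(2, 3))   # shape: (2, 2048)
```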
And step 17, inputting the third feature vector into a regional GCN network for encoding and embedding operation, and generating the object feature vector.
In this embodiment, the process of outputting the object feature vector by the regional GCN network is substantially the same as the process of outputting the scene feature vector by the frame GCN network, and specifically includes:
step 171, performing linear transformation on the third feature vector, calculating a second row vector relationship between the row vectors in the linearly transformed third feature vector, and determining a second graph matrix according to the second row vector relationship;
and 172, performing linear space transformation on the third feature vector according to a third learning parameter matrix, and performing feature embedding on the linearly and spatially transformed third feature vector by using the second graph matrix to generate the object feature vector.
In this embodiment, the regional GCN network finds the key information among the N object feature vectors, mines local feature information such as appearance details, action details, and interaction relationships between objects in the sample video frame, and enhances this local feature information through the encoding and embedding operations. Therefore, the local feature information in the sample video frame is first mined by the target detection model RPN, and then enhanced by the regional GCN network so that it can be better utilized by the language model LSTM.
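Because the regional GCN has the same structure as the frame GCN, the GCNLayer sketch shown earlier could simply be reused on the N × d third feature vector, as in the illustrative snippet below.

```python
# Reuse of the GCNLayer sketch from the frame-GCN example (illustrative only): the regional
# GCN builds its second graph matrix over the N object rows and embeds them with its own
# (third) learnable parameter matrix.
region_gcn = GCNLayer(dim=2048)                              # class defined in the sketch above
object_feature_vector = region_gcn(torch.randn(10, 2048))    # N x d third feature vector -> N x d
```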
Step 2, performing feature enhancement on the scene feature vector and the object feature vector according to the scaling vector and point-by-point product operation, performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, and recording as a fusion feature vector;
in this embodiment, two independent SE modules are mainly used to perform feature enhancement on the scene feature vector and the object feature vector, and before performing the feature enhancement, step 2 further includes:
respectively carrying out a one-dimensional average pooling operation on the scene feature vector and the object feature vector, wherein the result of the one-dimensional average pooling of the scene feature vector is the feature vector v_T, the result of the one-dimensional average pooling of the object feature vector is the feature vector v_N, and the feature vectors v_T and v_N are used for feature enhancement.
In this embodiment, the same method is used to perform feature enhancement on the scene feature vector and the object feature vector, and now, taking the scene feature vector as an example, a process of feature enhancement is described, where the specific process includes:
The scaling vector α is calculated from the scene feature vector v_T after one-dimensional average pooling, where the corresponding calculation formula is:
α = σ(W_2 δ(W_1 v_T + b_1) + b_2)
in the formula, v_T is the scene feature vector after one-dimensional average pooling, W_1 and W_2 are scaling matrices, b_1 and b_2 are scaling bias parameters, W_1, W_2, b_1 and b_2 are all learnable parameters, σ(·) is the Sigmoid activation function, and δ(·) is the ReLU function.
And performing point-by-point product operation on the scene feature vector subjected to one-dimensional average pooling and the scaling vector, and performing feature enhancement on the scene feature vector.
Specifically, as shown in fig. 3, the scene feature vector v_T after one-dimensional average pooling is input into an SE module for feature enhancement and recoding, where the corresponding calculation formula is:
α = σ(g(v_T)) = σ(W_2 δ(W_1 v_T + b_1) + b_2)
The scaling vector α is then multiplied point by point with v_T inside the SE module, thereby carrying out feature enhancement on the scene feature vector.
The process of feature enhancement of the object feature vector is not repeated.
In this embodiment, by introducing two independent SE modules, the key information of the scene feature vector and the object feature vector is found among the d channels, and is activated by enhancement, thereby realizing feature enhancement.
Thereafter, the two feature enhancement results, namely the enhanced scene feature vector and the enhanced object feature vector obtained from the one-dimensional average pooled feature vectors v_T and v_N, are added together for feature fusion, and the fusion result is recorded as the fusion feature vector. In this way, the most important features of the global feature information and the local feature information are highlighted respectively, so that the global and local feature information are fused better.
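A minimal sketch of the two SE modules and the subsequent fusion is given below, following the gating formula and the point-by-point product described above; the channel-reduction ratio inside the gating and all variable names are assumptions.

```python
import torch
import torch.nn as nn

class SEEnhance(nn.Module):
    """Sketch of one SE module: compute the scaling vector alpha with a gating
    mechanism and multiply it point by point with the pooled feature vector."""
    def __init__(self, dim: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)       # W1, b1
        self.fc2 = nn.Linear(dim // reduction, dim)       # W2, b2

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))   # scaling vector
        return alpha * v                                           # point-by-point product

d = 2048
v_T = torch.randn(d)      # one-dimensional average-pooled scene feature vector (dummy)
v_N = torch.randn(d)      # one-dimensional average-pooled object feature vector (dummy)

# Two independent SE modules enhance the global and local features separately.
enhanced_scene = SEEnhance(d)(v_T)
enhanced_object = SEEnhance(d)(v_N)

# Feature fusion: element-wise addition of the two enhanced feature vectors.
fused_feature_vector = enhanced_scene + enhanced_object
```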
Step 3, performing video description on the fusion feature vector by using the language model LSTM: the fusion feature vector is input into the language model LSTM to obtain the probability distribution of the description words, and the word with the highest probability is selected as the current output, thereby realizing the description of the video clip.
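Since the embodiment leaves the LSTM implementation open, the following greedy-decoding sketch is only one possible realization; the vocabulary size, hidden size, and the way the fusion feature vector initializes the hidden state are assumptions.

```python
import torch
import torch.nn as nn

class CaptionLSTM(nn.Module):
    """Sketch of the language model: an LSTM decoder conditioned on the fusion feature vector."""
    def __init__(self, feat_dim: int = 2048, vocab_size: int = 10000, hidden: int = 512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)        # assumed conditioning scheme
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, fused: torch.Tensor, bos: int = 1, eos: int = 2, max_len: int = 20):
        h = torch.tanh(self.init_h(fused)).unsqueeze(0)  # hidden state from the fusion vector
        c = torch.zeros_like(h)
        word, caption = torch.tensor([bos]), []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(word), (h, c))
            probs = self.out(h).softmax(dim=-1)          # probability distribution over words
            word = probs.argmax(dim=-1)                  # take the highest-probability word
            if word.item() == eos:
                break
            caption.append(word.item())
        return caption

caption = CaptionLSTM().greedy_decode(torch.randn(2048))   # token ids of the generated description
```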
It should be noted that, in the model training process, cross entropy is used as the loss function, Adam is used as the optimization algorithm to train the model, the parameters of the CNN network and the target detection model RPN are fixed, and only the GCN networks and the language model LSTM are trained.
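The training setup described above (cross-entropy loss, Adam, frozen CNN and RPN, only the GCN networks and the LSTM updated) might look like the following sketch; all modules here are stand-ins and the learning rate is an illustrative choice.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Placeholder components; see the sketches above for GCNLayer / CaptionLSTM-style modules.
cnn_backbone = models.resnet50(weights=None)              # stands in for the pre-trained CNN
frame_gcn = nn.Linear(2048, 2048)                         # stands in for the frame GCN
region_gcn = nn.Linear(2048, 2048)                        # stands in for the regional GCN
caption_lstm = nn.LSTM(512, 512)                          # stands in for the language model LSTM

# Freeze the CNN (and, analogously, the target detection model RPN).
for p in cnn_backbone.parameters():
    p.requires_grad = False

criterion = nn.CrossEntropyLoss()                         # cross entropy over description words
optimizer = torch.optim.Adam(
    [p for m in (frame_gcn, region_gcn, caption_lstm) for p in m.parameters()],
    lr=1e-4,                                              # learning rate is an illustrative choice
)
```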
In order to verify the accuracy of the video description method in this embodiment, a network video as shown in fig. 4 is selected for video description, where fig. 4(A), (B), and (C) are screenshots of 3 sample video frames sampled at equal intervals, and the description results of different description methods are shown in Table 1.
TABLE 1

|  | Text description |
| Real situation | A man is speaking on a bench |
| Description of the prior art | One man is speaking |
| Description method of the present embodiment | One man is sitting on the chair and speaking |
Therefore, by using the video description method in the embodiment, not only can the object in the video frame be accurately found, but also the mined local information and the global information can be well fused, so that the video can be more accurately described.
The technical scheme of the present application is described in detail above with reference to the accompanying drawings, and the present application provides a video description method based on a graph convolution neural network, the method including: step 1, extracting video frames in a video clip according to a sampling interval, recording the video frames as sample video frames, and respectively extracting scene characteristic vectors and object characteristic vectors in the sample video frames; step 2, performing feature enhancement on the scene feature vector and the object feature vector according to the scaling vector and point-by-point product operation, performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, and recording as a fusion feature vector; and 3, performing video description on the fusion feature vector by using a language model LSTM. According to the technical scheme, the global feature information and the local feature information in the video frame are mined and utilized respectively, and different feature information is fused to describe the video content, so that the description accuracy is improved.
The steps in the present application may be reordered, combined, and deleted according to actual requirements.
The units in the device can be merged, divided and deleted according to actual requirements.
Although the present application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and not restrictive of the application of the present application. The scope of the present application is defined by the appended claims and may include various modifications, adaptations, and equivalents of the invention without departing from the scope and spirit of the application.