Disclosure of Invention
The purpose of this application is as follows: based on the graph convolution neural network, the global feature information and the local feature information in the video frame are mined and utilized respectively, and the different feature information is fused to describe the video content, so that the description accuracy is improved.
The technical scheme of the application is as follows: a video description method based on a graph convolution neural network is provided, and the method comprises the following steps: step 1, extracting video frames in a video clip according to a sampling interval, recording the video frames as sample video frames, and respectively extracting scene characteristic vectors and object characteristic vectors in the sample video frames; step 2, performing feature enhancement on the scene feature vector and the object feature vector according to the scaling vector and point-by-point product operation, performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, and recording as a fusion feature vector; and 3, performing video description on the fusion feature vector by using a language model LSTM.
In any one of the above technical solutions, further, in step 1, extracting a scene feature vector in a sample video frame specifically includes: step 11, inputting a sample video frame into a CNN network for feature extraction operation, and recording the output of the last pooling layer of the CNN network as a high-dimensional feature map; step 12, performing two-dimensional average pooling operation on the high-dimensional feature map, and recording a pooling result as a first feature vector; and step 13, inputting the first feature vector into a frame GCN network for coding and embedding operation, and generating a scene feature vector.
In any one of the above technical solutions, further, in step 1, extracting an object feature vector in a sample video frame specifically includes: step 14, inputting the sample video frame into a target detection model, screening the area in the sample video frame by using a non-maximum suppression method, determining the area position of an object, and recording the area position as an object area; step 15, performing area correspondence on the object area and the high-dimensional feature map, and performing clipping and ROIAlign operation on the object area to generate a second feature vector; step 16, performing two-dimensional average pooling operation on the second feature vector to generate a third feature vector; and step 17, inputting the third feature vector into a regional GCN network for encoding and embedding operation, and generating an object feature vector.
In any one of the above technical solutions, further, step 13 specifically includes:
step 131, performing linear transformation on the first feature vector, calculating a first row vector relationship between the row vectors in the linearly transformed first feature vector, and determining a first graph matrix according to the first row vector relationship, wherein the calculation formulas of the linear transformation, the first row vector relationship and the first graph matrix G are as follows:
φ(x_i) = W'x_i + b
F(x_i, x_j) = φ(x_i)·φ(x_j)
G_ij = exp(F(x_i, x_j)) / Σ_j exp(F(x_i, x_j))
in the formulas, F(x_i, x_j) is the first row vector relationship between the i-th row vector x_i and the j-th row vector x_j of the linearly transformed first feature vector, φ(·) is the row linear transformation function, W' is the first learnable parameter matrix, b is the learnable bias coefficient, and G_ij is the element in the i-th row and j-th column of the first graph matrix G;
step 132, according to the second learning parameter matrix, performing linear spatial transformation on the first feature vector X, and performing feature embedding on the first feature vector after the linear spatial transformation by using the first graph matrix G to generate a scene feature vector, where the corresponding calculation formula is:
Y=GXW
in the formula, Y is a scene feature vector, W is a second learning parameter matrix, and X is a first feature vector.
In any one of the above technical solutions, further, step 17 specifically includes: 171, performing linear transformation on the third feature vector, calculating a second row vector relationship between each row vector in the linearly transformed third feature vector, and determining a second graph matrix according to the second row vector relationship; and 172, performing linear space transformation on the third feature vector according to the third learning parameter matrix, and performing feature embedding on the linearly-space-transformed third feature vector by using the second graph matrix to generate an object feature vector.
In any one of the above technical solutions, further, before step 2, the method further includes: and respectively carrying out one-dimensional average pooling operation on the scene feature vector and the object feature vector.
In any one of the above technical solutions, further, in step 2, performing feature enhancement on the scene feature vector according to the scaling vector and the point-by-point product operation, specifically including:
calculating a scaling vector according to a gating mechanism and the scene characteristic vector after one-dimensional average pooling, wherein a corresponding calculation formula is as follows:
α = σ(g(v_T)) = σ(W_2 δ(W_1 v_T + b_1) + b_2)
in the formula, v_T is the scene feature vector after one-dimensional average pooling, W_1 and W_2 are scaling matrices, b_1 and b_2 are scaling bias parameters, σ(·) is the Sigmoid activation function, and δ(·) is the ReLU function;
and performing a point-by-point product operation on the scene feature vector after one-dimensional average pooling and the scaling vector, so as to carry out feature enhancement on the scene feature vector.
In any of the above technical solutions, further, the video segment is composed of consecutive multi-frame video frames, and the sampling interval is obtained by rounding down the ratio of the total number of video frames to a preset number of sampling frames.
The beneficial effect of this application is:
according to the technical scheme, firstly, a CNN network and a frame GCN network are combined to extract the global feature information in the video frames as the scene feature vector; secondly, a target detection model is used to extract accurate object regions, and the feature embedding function of the regional GCN network is used to extract the local feature information in the video frames as the object feature vector, which highlights the role of key objects in the whole video and thus mines the local feature information, namely the information of the objects in the video, well;
then, feature enhancement is performed on the scene feature vector and the object feature vector respectively by using the scaling vector and the point-by-point product operation. This process is realized by two SE modules, which recode the global and local features so that the global and local feature information can each highlight its own contribution to the video description. As a result, the algorithm takes both the global and the local feature information into account, the description of actions is more accurate, and the generated description is both global and targeted at key objects.
And finally, describing the fused feature vector by using a language model LSTM.
In addition, the description method in the application brings a certain performance improvement on different CNN models, is applicable to both 2D CNNs and 3D CNNs, has good robustness, allows techniques such as late fusion to be used directly, and has high practicability and applicability.
Detailed Description
In order that the above objects, features and advantages of the present application can be more clearly understood, the present application will be described in further detail with reference to the accompanying drawings and detailed description. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited by the specific embodiments disclosed below.
As shown in fig. 1 and fig. 2, the present embodiment provides a video description method based on a graph-convolution neural network, including:
step 1, extracting video frames in a video clip according to a sampling interval, recording the video frames as sample video frames, and respectively extracting scene characteristic vectors and object characteristic vectors in the sample video frames, wherein the video clip is composed of continuous multiple frames of the video frames, and the sampling interval is obtained by a ratio of the total number of the video frames to a preset number of sampling frames through a down-rounding operation.
Specifically, the video segment in this embodiment is composed of consecutive F video frames, and the preset sampling frame number used for video description is set to T. A rounding-down operation is therefore adopted, and the sampling interval t is calculated from the ratio of the total number of video frames to the preset sampling frame number, where the corresponding calculation formula is:
t = ⌊F/T⌋
According to the calculated sampling interval t, equidistant sampling is carried out in the video clip to obtain T video frames, which are recorded as the sample video frames.
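As an illustration only, a minimal sketch of this equidistant sampling is given below; it assumes the decoded frames are already available as an in-memory sequence, and all names are placeholders rather than part of the claimed method.

```python
# Hedged sketch of the equidistant sampling step: t = floor(F / T), then take every t-th frame.
# Assumes the decoded frames are already available as a Python sequence; names are illustrative.
def sample_frames(frames, num_samples):
    """Select num_samples frames at a fixed interval t = floor(F / num_samples)."""
    F = len(frames)
    t = max(F // num_samples, 1)               # sampling interval t, rounded down; at least 1
    return [frames[i * t] for i in range(num_samples) if i * t < F]

# Example: a 160-frame clip sampled down to T = 16 frames (interval t = 10).
clip = list(range(160))                         # stand-in for decoded video frames
sample_video_frames = sample_frames(clip, num_samples=16)
assert len(sample_video_frames) == 16
```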
In the embodiment, feature extraction is mainly performed on a sample video frame through a CNN network and a GCN network, then the extracted features are fused, global and local feature information in the sample video frame is highlighted, and then the fused features are described by using a language model LSTM, so that description of video clips is further realized.
It should be noted that the implementation of the language model LSTM in this embodiment is not limited.
In this embodiment, a method for extracting a scene feature vector in the sample video frame is shown, which specifically includes:
step 11, inputting the sample video frame into a CNN network for feature extraction operation, and recording the output of the last pooling layer of the CNN network as a high-dimensional feature map;
in this embodiment, the selected CNN network may be a 2D CNN network such as ResNet or Inception, or a 3D CNN network such as 3D ResNet or Temporal Segment Networks. When a 2D CNN network is selected, it can be pre-trained on a picture classification database (ImageNet); when a 3D CNN network is selected, it can be pre-trained on a video classification database (Kinetics).
After the CNN network finishes pre-training, the T sample video frames are input into the CNN network for feature extraction, and the output of the last pooling layer of the CNN network is recorded as the high-dimensional feature map, where h is the height of the high-dimensional feature map, w is its width, and d is its number of channels.
Step 12, performing a two-dimensional average pooling operation over the h × w spatial dimensions of the high-dimensional feature map of size T × h × w × d, and recording the pooling result as the first feature vector. Thus, the first feature vector is a feature vector of size T × d.
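A minimal PyTorch-style sketch of steps 11 and 12 follows, assuming a ResNet-50 backbone pre-trained on ImageNet and taking the last convolutional feature map (before the global average pool) as the high-dimensional feature map; the backbone choice and layer indexing are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative backbone: ResNet-50 pre-trained on ImageNet; the embodiment leaves the CNN choice open.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# Keep everything up to (and excluding) the global average pool and classifier, so the
# output is the per-frame feature map referred to above as the high-dimensional feature map.
feature_extractor = nn.Sequential(*list(backbone.children())[:-2]).eval()

T = 16
frames = torch.randn(T, 3, 224, 224)                # T sample video frames (dummy data)
with torch.no_grad():
    feat_map = feature_extractor(frames)             # shape: (T, d=2048, h=7, w=7)

# Step 12: two-dimensional average pooling over the spatial dimensions h x w
# gives the first feature vector of size T x d.
first_feature_vector = feat_map.mean(dim=(2, 3))     # shape: (T, 2048)
print(first_feature_vector.shape)
```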
And step 13, inputting the first feature vector into a frame GCN network for coding and embedding operation, and generating the scene feature vector.
In this embodiment, the frame GCN network and the regional GCN network are both graph convolution neural networks (GCN), and the specific implementation of the GCN network in this embodiment is not limited.
Further, the process of generating the scene feature vector by the frame GCN network according to the input first feature vector specifically includes:
step 131, performing linear transformation on the first feature vector, calculating a first row vector relationship between the row vectors in the linearly transformed first feature vector, and determining a first graph matrix G according to the first row vector relationship; the row linear transformation is calculated as:
φ(x_i) = W'x_i + b
in the formula, F(x_i, x_j) is the first row vector relationship between the i-th row vector x_i and the j-th row vector x_j of the linearly transformed first feature vector, φ(·) is the row linear transformation function, W' is the first learnable parameter matrix, and b is the learnable bias coefficient.
Specifically, according to the first learnable parameter matrix W' and the learnable bias coefficient b, the first feature vector X is subjected to a linear transformation that maps it to another space. Let x_i denote the i-th row vector of the first feature vector X, and let φ(x_i) denote the row linear transformation function, whose result is a vector of the same size as the row vector x_i; the corresponding calculation formula is:
φ(x_i) = W'x_i + b
Mapping the first feature vector to another space gives the mapped row vectors more explicit relationship information, so that the relationship between the row vectors can be obtained through a dot product operation. The first row vector relationship between the row vectors is recorded as F(x_i, x_j); its result is a scalar, and the corresponding calculation formula is:
F(x_i, x_j) = φ(x_i)·φ(x_j)
Taking the frame GCN network as an example, the row vector x_i represents the feature vector of the i-th frame image, and the result of F(x_i, x_j) describes the relative relationship between the i-th frame image and the j-th frame image; this relationship is abstracted into a scalar whose value has no actual physical meaning and is used only to assist the GCN calculation.
To facilitate the subsequent operation of the frame GCN network, a first graph matrix G of size T × T is introduced, where G_ij is the element in the i-th row and j-th column of the first graph matrix G, and the corresponding calculation formula is:
G_ij = exp(F(x_i, x_j)) / Σ_j exp(F(x_i, x_j))
This formula applies a Softmax function to each row of the row vector relationships F(x_i, x_j), so as to achieve normalization and eliminate unnecessary errors for the subsequent feature embedding step of the GCN.
Step 132, according to a second learning parameter matrix W, performing linear spatial transformation on the first feature vector X, and performing feature embedding on the first feature vector after the linear spatial transformation by using the first graph matrix G to generate the scene feature vector, where a corresponding calculation formula is:
Y=GXW
specifically, the first feature vector X is subjected to linear spatial transformation through the second learning parameter matrix W to realize the encoding process of the first feature vector X, so that the first feature vector X is transformed into a suitable linear space for subsequent calculation, and then the first feature vector XW after the linear spatial transformation is subjected to feature embedding through the constructed first graph matrix G.
Through encoding and feature embedding, the scene feature vector Y output by the frame GCN network, compared with the first feature vector X, plays a role in highlighting key frames in the video, so that the output of the frame GCN network mines the information in the video better.
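Under the assumption that the graph matrix is built from row-wise Softmax-normalized dot products of linearly transformed row vectors, as described above, the frame GCN can be sketched as a single module; dimensions and names below are illustrative only.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Sketch of the frame/regional GCN: build a graph matrix G from the input rows,
    then embed the features as Y = G X W (step 132)."""
    def __init__(self, dim: int):
        super().__init__()
        self.phi = nn.Linear(dim, dim, bias=True)    # phi(x_i) = W' x_i + b
        self.W = nn.Linear(dim, dim, bias=False)     # second learnable parameter matrix W

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (num_rows, dim), e.g. the first feature vector of size T x d.
        phi_X = self.phi(X)                           # linear mapping of each row vector
        F_rel = phi_X @ phi_X.t()                     # F(x_i, x_j): dot-product relationships
        G = torch.softmax(F_rel, dim=1)               # row-wise Softmax -> graph matrix G
        return G @ self.W(X)                          # Y = G X W

# Example: encode the T x d first feature vector into the scene feature vector.
frame_gcn = GCNLayer(dim=2048)
scene_feature_vector = frame_gcn(torch.randn(16, 2048))   # shape: (16, 2048)
```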
On the basis of the foregoing embodiment, this embodiment further shows a method for extracting an object feature vector in the sample video frame, which specifically includes:
step 14, inputting the sample video frame into a target detection model, screening the area in the sample video frame by using a non-maximum suppression method, determining the area position of an object, and recording the area position as an object area;
In this embodiment, the selected target detection model (RPN) may be a mainstream target detection model such as YOLO v3, SSD, Faster R-CNN, or Mask R-CNN, and it is pre-trained on a target detection database (MSCOCO) before use.
The T sample video frames are detected using the target detection model (RPN), and the number of object region positions obtained is set to N, that is, N object regions. In the process of extracting the object regions, the candidate regions are screened using a non-maximum suppression method (NMS), and the N screened regions are taken as the object regions.
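The region screening step can be sketched, for example, with torchvision's nms operator; the IoU threshold, the cap of N regions, and the dummy proposals below are assumptions, and the detector itself is treated as a black box.

```python
import torch
from torchvision.ops import nms

def select_object_regions(boxes: torch.Tensor, scores: torch.Tensor,
                          iou_thresh: float = 0.5, max_regions: int = 10) -> torch.Tensor:
    """Screen detector proposals with NMS and keep at most max_regions (= N) object regions.

    boxes:  (K, 4) candidate boxes in (x1, y1, x2, y2) format from the detection model
    scores: (K,) confidence scores for the candidate boxes
    """
    keep = nms(boxes, scores, iou_thresh)            # indices kept by non-maximum suppression
    keep = keep[:max_regions]                        # keep the N highest-scoring regions
    return boxes[keep]                               # the N object regions

# Example with dummy proposals: the two overlapping boxes collapse to one region.
boxes = torch.tensor([[0., 0., 50., 50.], [2., 2., 52., 52.], [100., 100., 180., 160.]])
scores = torch.tensor([0.9, 0.8, 0.75])
object_regions = select_object_regions(boxes, scores)
```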
Step 15, performing region correspondence between the N object regions and the high-dimensional feature map obtained in the step 11, that is, corresponding the N object regions to a high-dimensional feature map with a size of T × h × w × d, and then performing clipping and roiign operations on the high-dimensional feature map corresponding to each object region to generate a second feature vector with a size of N × 7 × 7 × d;
Step 16, performing a two-dimensional average pooling operation on the second feature vector over its 7 × 7 spatial dimensions to generate the third feature vector. Thus, the third feature vector is a feature vector of size N × d.
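Steps 15 and 16 can be sketched with torchvision's roi_align, assuming the object regions are given in input-image coordinates and the feature map has a stride of 32 relative to the input; these values, and the channels-first layout, are illustrative only.

```python
import torch
from torchvision.ops import roi_align

# High-dimensional feature map of one sample frame, shape (1, d, h, w) in channels-first layout.
feat_map = torch.randn(1, 2048, 7, 7)

# N object regions for that frame in image coordinates (x1, y1, x2, y2).
object_regions = torch.tensor([[10., 20., 120., 200.],
                               [30., 40., 180., 210.]])
rois = torch.cat([torch.zeros(len(object_regions), 1), object_regions], dim=1)  # prepend batch index

# Step 15: crop + ROIAlign -> second feature vector (N x d x 7 x 7 here).
second_feature_vector = roi_align(feat_map, rois, output_size=(7, 7),
                                  spatial_scale=1.0 / 32)   # assumed image-to-feature-map stride

# Step 16: two-dimensional average pooling over 7 x 7 -> third feature vector of size N x d.
third_feature_vector = second_feature_vector.mean(dim=(2, 3))   # shape: (2, 2048)
```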
And step 17, inputting the third feature vector into a regional GCN network for encoding and embedding operation, and generating the object feature vector.
In this embodiment, the process of outputting the object feature vector by the regional GCN network is substantially the same as the process of outputting the scene feature vector by the frame GCN network, and specifically includes:
step 171, performing linear transformation on the third feature vector, calculating a second row vector relationship between the row vectors in the linearly transformed third feature vector, and determining a second graph matrix according to the second row vector relationship;
and 172, performing linear space transformation on the third feature vector according to a third learning parameter matrix, and performing feature embedding on the linearly and spatially transformed third feature vector by using the second graph matrix to generate the object feature vector.
In this embodiment, the regional GCN network finds the key information among the N object feature vectors, mines local feature information such as appearance details, action details, and interaction relationships between objects in the sample video frame, and enhances this local feature information through the encoding and embedding operations. Therefore, the local feature information in the sample video frame is first mined by the target detection model RPN, and then enhanced by the regional GCN network so that it can be better utilized by the language model LSTM.
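Because the regional GCN has the same structure as the frame GCN, the GCNLayer sketch shown earlier could simply be reused on the N × d third feature vector, as in the illustrative snippet below.

```python
# Reuse of the GCNLayer sketch from the frame-GCN example (illustrative only): the regional
# GCN builds its second graph matrix over the N object rows and embeds them with its own
# (third) learnable parameter matrix.
region_gcn = GCNLayer(dim=2048)                              # class defined in the sketch above
object_feature_vector = region_gcn(torch.randn(10, 2048))    # N x d third feature vector -> N x d
```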
Step 2, performing feature enhancement on the scene feature vector and the object feature vector according to the scaling vector and point-by-point product operation, performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, and recording as a fusion feature vector;
in this embodiment, two independent SE modules are mainly used to perform feature enhancement on the scene feature vector and the object feature vector, and before performing the feature enhancement, step 2 further includes:
respectively carrying out a one-dimensional average pooling operation on the scene feature vector and the object feature vector, wherein the result of the one-dimensional average pooling of the scene feature vector is the feature vector v_T, the result of the one-dimensional average pooling of the object feature vector is the feature vector v_N, and the feature vectors v_T and v_N are used for feature enhancement.
In this embodiment, the same method is used to perform feature enhancement on the scene feature vector and the object feature vector, and now, taking the scene feature vector as an example, a process of feature enhancement is described, where the specific process includes:
The scaling vector α is calculated from the scene feature vector v_T after one-dimensional average pooling, where the corresponding calculation formula is:
α = σ(W_2 δ(W_1 v_T + b_1) + b_2)
in the formula, v_T is the scene feature vector after one-dimensional average pooling, W_1 and W_2 are scaling matrices, b_1 and b_2 are scaling bias parameters, W_1, W_2, b_1 and b_2 are all learnable parameters, σ(·) is the Sigmoid activation function, and δ(·) is the ReLU function.
And performing point-by-point product operation on the scene feature vector subjected to one-dimensional average pooling and the scaling vector, and performing feature enhancement on the scene feature vector.
Specifically, as shown in fig. 3, the scene feature vector v_T after one-dimensional average pooling is input into an SE module for feature enhancement and recoding, where the corresponding calculation formula is:
α = σ(g(v_T)) = σ(W_2 δ(W_1 v_T + b_1) + b_2)
The scaling vector α is then multiplied point by point with v_T inside the SE module, thereby carrying out feature enhancement on the scene feature vector.
The process of feature enhancement of the object feature vector is not repeated.
In this embodiment, by introducing two independent SE modules, the key information of the scene feature vector and the object feature vector is found among the d channels, and is activated by enhancement, thereby realizing feature enhancement.
Thereafter, the two feature enhancement results, namely the enhanced scene feature vector and the enhanced object feature vector obtained from the one-dimensional average pooled feature vectors v_T and v_N, are added together for feature fusion, and the fusion result is recorded as the fusion feature vector. In this way, the most important features of the global feature information and the local feature information are highlighted respectively, so that the global and local feature information are fused better.
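A minimal sketch of the two SE modules and the subsequent fusion is given below, following the gating formula and the point-by-point product described above; the channel-reduction ratio inside the gating and all variable names are assumptions.

```python
import torch
import torch.nn as nn

class SEEnhance(nn.Module):
    """Sketch of one SE module: compute the scaling vector alpha with a gating
    mechanism and multiply it point by point with the pooled feature vector."""
    def __init__(self, dim: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)       # W1, b1
        self.fc2 = nn.Linear(dim // reduction, dim)       # W2, b2

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.fc2(torch.relu(self.fc1(v))))   # scaling vector
        return alpha * v                                           # point-by-point product

d = 2048
v_T = torch.randn(d)      # one-dimensional average-pooled scene feature vector (dummy)
v_N = torch.randn(d)      # one-dimensional average-pooled object feature vector (dummy)

# Two independent SE modules enhance the global and local features separately.
enhanced_scene = SEEnhance(d)(v_T)
enhanced_object = SEEnhance(d)(v_N)

# Feature fusion: element-wise addition of the two enhanced feature vectors.
fused_feature_vector = enhanced_scene + enhanced_object
```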
Step 3, performing video description on the fusion feature vector by using the language model LSTM: the fusion feature vector is input into the language model LSTM to obtain the probability distribution of the description words, and the word with the highest probability is selected as the current output, thereby realizing the description of the video clip.
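Since the embodiment leaves the LSTM implementation open, the following greedy-decoding sketch is only one possible realization; the vocabulary size, hidden size, and the way the fusion feature vector initializes the hidden state are assumptions.

```python
import torch
import torch.nn as nn

class CaptionLSTM(nn.Module):
    """Sketch of the language model: an LSTM decoder conditioned on the fusion feature vector."""
    def __init__(self, feat_dim: int = 2048, vocab_size: int = 10000, hidden: int = 512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)        # assumed conditioning scheme
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, fused: torch.Tensor, bos: int = 1, eos: int = 2, max_len: int = 20):
        h = torch.tanh(self.init_h(fused)).unsqueeze(0)  # hidden state from the fusion vector
        c = torch.zeros_like(h)
        word, caption = torch.tensor([bos]), []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(word), (h, c))
            probs = self.out(h).softmax(dim=-1)          # probability distribution over words
            word = probs.argmax(dim=-1)                  # take the highest-probability word
            if word.item() == eos:
                break
            caption.append(word.item())
        return caption

caption = CaptionLSTM().greedy_decode(torch.randn(2048))   # token ids of the generated description
```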
It should be noted that, in the model training process, cross entropy is used as the loss function, Adam is used as the optimization algorithm to train the model, the parameters of the CNN network and the target detection model RPN are fixed, and only the GCN networks and the language model LSTM are trained.
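The training setup described above (cross-entropy loss, Adam, frozen CNN and RPN, only the GCN networks and the LSTM updated) might look like the following sketch; all modules here are stand-ins and the learning rate is an illustrative choice.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Placeholder components; see the sketches above for GCNLayer / CaptionLSTM-style modules.
cnn_backbone = models.resnet50(weights=None)              # stands in for the pre-trained CNN
frame_gcn = nn.Linear(2048, 2048)                         # stands in for the frame GCN
region_gcn = nn.Linear(2048, 2048)                        # stands in for the regional GCN
caption_lstm = nn.LSTM(512, 512)                          # stands in for the language model LSTM

# Freeze the CNN (and, analogously, the target detection model RPN).
for p in cnn_backbone.parameters():
    p.requires_grad = False

criterion = nn.CrossEntropyLoss()                         # cross entropy over description words
optimizer = torch.optim.Adam(
    [p for m in (frame_gcn, region_gcn, caption_lstm) for p in m.parameters()],
    lr=1e-4,                                              # learning rate is an illustrative choice
)
```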
In order to verify the accuracy of the video description method in this embodiment, a network video as shown in fig. 4 is selected for video description, where fig. 4(A), (B), and (C) are screenshots of 3 sample video frames sampled at equal intervals, and the description results of different description methods are shown in Table 1.
TABLE 1

|  | Text description |
| Real situation | A man is speaking on a bench |
| Description of the prior art | One man is speaking |
| Description method of the present embodiment | One man is sitting on the chair and speaking |
Therefore, by using the video description method in the embodiment, not only can the object in the video frame be accurately found, but also the mined local information and the global information can be well fused, so that the video can be more accurately described.
The technical scheme of the present application is described in detail above with reference to the accompanying drawings, and the present application provides a video description method based on a graph convolution neural network, the method including: step 1, extracting video frames in a video clip according to a sampling interval, recording the video frames as sample video frames, and respectively extracting scene characteristic vectors and object characteristic vectors in the sample video frames; step 2, performing feature enhancement on the scene feature vector and the object feature vector according to the scaling vector and point-by-point product operation, performing feature fusion according to the enhanced scene feature vector and the enhanced object feature vector, and recording as a fusion feature vector; and 3, performing video description on the fusion feature vector by using a language model LSTM. According to the technical scheme, the global feature information and the local feature information in the video frame are mined and utilized respectively, and different feature information is fused to describe the video content, so that the description accuracy is improved.
The steps in the present application may be reordered, combined, and deleted according to actual requirements.
The units in the device can be merged, divided and deleted according to actual requirements.
Although the present application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and not restrictive of the application of the present application. The scope of the present application is defined by the appended claims and may include various modifications, adaptations, and equivalents of the invention without departing from the scope and spirit of the application.