
CN115033736B - A natural language guided video summarization method - Google Patents

A natural language guided video summarization method

Info

Publication number
CN115033736B
Authority
CN
China
Prior art keywords
video
sequence
frame
natural language
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210652477.2A
Other languages
Chinese (zh)
Other versions
CN115033736A (en)
Inventor
金永刚
郑婧
马海钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210652477.2A
Publication of CN115033736A
Application granted
Publication of CN115033736B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 - Querying
    • G06F16/738 - Presentation of query results
    • G06F16/739 - Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a natural language guided video summarization method, which comprises the steps of decomposing a video file into a frame sequence, extracting frame image features, extracting frame semantic features and text semantic features, computing spatial cosine similarity to obtain attention weights, constructing a natural language guided video summarization model, training the overall network model, and selecting the video summary according to the frame importance score sequence. The invention creatively proposes a natural language guided attention mechanism for the video summarization task and introduces it into the video summarization framework; ablation experiments show that this mechanism significantly improves the performance of the summarization model. In addition, the proposed natural language guided attention mechanism is more objective, fully attends to the video segments related to the title text, and the title text is easy to obtain for Internet videos at no extra cost.

Description

Natural language guided video summarization method
Technical Field
The invention belongs to the technical field of video summarization, and particularly relates to a natural language guided video summarization method.
Background
With the rapid development of multimedia and network information technology in recent years, video has become an increasingly mainstream medium of information exchange. How to condense a lengthy video into minutes or even seconds while preserving its key information, i.e., video summarization, has gradually become an important research topic in the field of video technology. Video summarization techniques use computer algorithms to automatically select the important segments of a video as its summary, which reduces the storage space required for the video and allows users to browse video information quickly.
The idea of mimicking human attention originated in the field of computer vision: by introducing an attention mechanism that focuses on only part of an image rather than the whole image, the computational complexity of image processing can be reduced and performance improved. Similarly, each frame in a video has its own level of importance, and only some frames contain the key information of the video content that a summary should focus on. Therefore, many researchers have introduced attention mechanisms into video summarization methods to assign different importance weights to different frames of the input sequence, instead of treating all input frames equally, thereby establishing an intrinsic link between the input video sequence and the output importance scores.
If a video segment is interesting to the user, it is more likely to be an important segment of the whole sequence, and many researchers therefore model video importance based on user attention. For example, Ma et al., in [A user attention model for video summarization[C]//Proceedings of the tenth ACM international conference on Multimedia, 2002: 533-542], propose scoring the importance of video segments with low-level features such as motion changes and facial features, combining these scores into an attention curve and extracting the parts at the curve peaks as key shots to construct the summary. With the development of deep learning, several attention-based deep video summarization methods have been proposed. For example, Ji et al. propose the AVS video summarization model in [Video summarization with attention-based encoder–decoder networks[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 30(6): 1709-1717], which formulates supervised video summarization as a sequence-to-sequence learning problem, explores two attention-based decoding networks with additive and multiplicative objective functions, obtains the attention weights from the video sequence itself, and learns the attention mechanism in a supervised manner. Apostolidis et al., in [Combining Global and Local Attention with Positional Encoding for Video Summarization[C]//2021 IEEE International Symposium on Multimedia (ISM). IEEE, 2021: 226-234], propose combining global and local multi-head attention mechanisms to model frame dependencies at different granularities, where the attention mechanism incorporates components that encode the temporal position of video frames.
However, it is easy to see that existing attention-based deep video summarization methods generally use only the video image information to generate the summary and do not consider information from other modalities, even though descriptive text such as the title often contains key information highly relevant to the video content.
Disclosure of Invention
In view of the above, the invention provides a natural language guided video summarization method, which creatively proposes a natural language guided attention mechanism for the video summarization task, obtains attention weights by computing the similarity between the video sequence and the text, and introduces this mechanism into an Encoder-Decoder video summarization framework; ablation experiments show that the natural language guided attention mechanism significantly improves the performance of the summarization model.
A natural language guided video summarization method, comprising the steps of:
(1) decomposing the video files in the training set into frame sequences, and extracting image features of each frame sequence with a pre-trained deep image network to obtain the corresponding frame image feature sequence (f_1, …, f_n);
(2) extracting semantic features of the frame sequence with the image encoding network of a pre-trained multimodal model to obtain the frame semantic feature sequence (x_1, …, x_n);
(3) computing the spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t to obtain the attention weight sequence (α_1, …, α_n);
(4) constructing a video summarization model based on the natural language guided attention mechanism, which takes the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) as input and outputs the frame importance score sequence (y_1, …, y_n), and then training the video summarization model;
(5) inputting the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) of a test-set video file into the trained video summarization model and outputting the frame importance score sequence (y_1, …, y_n) of that video file;
(6) selecting key shots according to the frame importance score sequence (y_1, …, y_n) and synthesizing them into the video summary.
Furthermore, since the natural language text of a considerable proportion of the video files in the training set does not reflect the video content well, the natural language text needs to be optimized before its semantic features are extracted: several candidate titles are first drafted according to the video content, multiple users then watch the video through a questionnaire and select the most suitable title, and finally the title selected by the highest proportion of users is taken as the optimized natural language text.
Further, step (3) is specifically implemented by first computing the cosine similarity between the text semantic feature t and each feature vector in the frame semantic feature sequence (x_1, …, x_n) to obtain a similarity sequence, and then applying softmax normalization to the similarity sequence to obtain the attention weight sequence (α_1, …, α_n), which reflects the importance of each video segment relative to the title text and thus provides an effective attention mechanism to guide summary generation.
Further, the video summarization model is formed by sequentially connecting an Encoder module, an Attention module and a Decoder module: the Encoder module encodes the input frame image feature sequence (f_1, …, f_n) into a hidden sequence (h_1, …, h_n); the Attention module computes a weighted sum of the hidden sequence (h_1, …, h_n) using the attention weight sequence (α_1, …, α_n) to obtain a fusion variable h; and the Decoder module decodes the fusion variable h and outputs the frame importance score sequence (y_1, …, y_n).
Further, in step (4), the video summarization model is trained as follows: the model parameters are first initialized; the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) are then input into the model, which predicts and outputs the frame importance score sequence (y_1, …, y_n); the model parameters are iteratively updated with gradient descent and back-propagation according to a loss function L until the loss function L converges to a minimum or the maximum number of iterations is reached, at which point training is complete.
Further, the expression of the loss function L is as follows:
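A plausible form of the loss, assuming the mean squared error implied by the "mean loss" used in the detailed description, is:
L = (1/n) · Σ_{i=1}^{n} (y_i - s_i)²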
where y_i denotes the predicted importance score of the i-th video frame, s_i denotes the annotation score of the i-th video frame, and n is the total number of frames in the video file.
Further, the annotation score s_i of the i-th video frame is obtained by having multiple users label the importance of the video frame, with label 1 denoting important and label 0 denoting unimportant; the annotation score s_i is the proportion of users who assigned label 1 among all users.
Further, step (6) is specifically implemented by first combining visually continuous frames into shots and converting the frame importance score sequence (y_1, …, y_n) into a shot importance score sequence, where the importance score of a shot is the average of the importance scores of the frames it contains; key shots are then selected with a 0/1 knapsack algorithm according to the shot importance scores and shot lengths, such that the total length of the key shots does not exceed 15% of the whole video length; finally, all key shots are synthesized into the video summary.
Based on the technical scheme, the invention has the following beneficial technical effects:
1. With 150 epochs of training, the natural language guided video summarization model achieves an average five-fold cross-validation F1-score of 47.0% (the F1-score is the key-shot overlap measure sketched after this list). In an ablation experiment that removes the natural language guidance, the average five-fold cross-validation F1-score of the video summarization model drops to 44.5%, which shows that the natural language guidance improves model performance.
2. Compared with user-attention-based mechanisms, the natural language guided attention mechanism proposed by the invention is more objective and fully attends to the video segments related to the title text; moreover, the title text of an Internet video is easy to obtain at no cost, whereas user attention must be collected and analyzed and is not easy to obtain.
3. While learning how humans summarize videos, the summarization model fully attends to the important content related to the natural language text. When a user provides a natural language text together with a video, the summary generated by the video summarization method is strongly correlated with the text provided by the user and can fully reflect the user's interests.
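For reference, the F1-score cited in effect 1 is conventionally computed as the key-shot overlap between the generated summary and a user summary. This evaluation protocol is standard in the video summarization literature but is not spelled out in the patent; the sketch below is therefore an assumption about the metric, not part of the claimed method.

```python
# Illustrative sketch of the key-shot overlap F1-score commonly used to evaluate
# video summaries. Both summaries are 0/1 per-frame selection masks of equal length.
def f1_score(pred_mask, user_mask):
    overlap = sum(1 for p, u in zip(pred_mask, user_mask) if p and u)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_mask)   # fraction of predicted summary that matches
    recall = overlap / sum(user_mask)      # fraction of user summary that is covered
    return 2 * precision * recall / (precision + recall)
```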
Drawings
FIG. 1 is a schematic diagram of attention weight generation based on natural language guidance in accordance with the present invention.
FIG. 2 is a schematic diagram of an Encoder-Decoder video summarization model based on the attention mechanism according to the present invention.
Detailed Description
To describe the present invention more concretely, the technical scheme of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the natural language guided video summarization method of the present invention includes the following steps:
S1. Decompose the video file into a frame sequence and extract image features of the frame sequence with a pre-trained deep image network to obtain the frame image feature sequence (f_1, …, f_n).
In this embodiment, the video is sampled into a frame sequence at a sampling rate of 2 fps, and the image features of the frame sequence are extracted with the pool5 layer of a GoogLeNet model pre-trained on the large-scale image dataset ImageNet, yielding the frame image feature sequence (f_1, …, f_n), where n is the video length and the feature dimension of each video frame is 1024.
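As an illustrative sketch only (the patent provides no code), the 2 fps sampling and 1024-dimensional pool5 feature extraction could look roughly as follows; the video path, helper names and the torchvision weight identifier are assumptions, not taken from the patent.

```python
# Illustrative sketch: sample a video at 2 fps and extract 1024-d GoogLeNet
# "pool5" (global average pooling) features per frame, as described in step S1.
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

def sample_frames(path, fps=2.0):
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

# GoogLeNet with the classifier removed: the output after global average pooling
# is a 1024-dimensional vector per frame.
googlenet = models.googlenet(weights="IMAGENET1K_V1")
googlenet.fc = torch.nn.Identity()
googlenet.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    batch = torch.stack([preprocess(f) for f in frames])
    return googlenet(batch)  # shape (n, 1024): the sequence (f_1, ..., f_n)

features = frame_features(sample_frames("example.mp4"))  # "example.mp4" is a placeholder
```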
S2. Extract semantic features of the frame sequence and the natural language text with the image encoding network and text encoding network of a pre-trained multimodal model to obtain the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t, and compute the spatial cosine similarity to obtain the attention weight sequence (α_1, …, α_n).
In this embodiment, the image encoding network and text encoding network of a pre-trained CLIP model are used to extract semantic features of the frame sequence and the natural language text. During training, the CLIP model maps the feature spaces of its image encoder and text encoder into a common semantic space, and this property makes it possible to compute similarities between video frame images and text. The spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t extracted by the pre-trained CLIP model is computed to obtain a similarity sequence, and softmax normalization of this sequence yields the attention weight sequence (α_1, …, α_n), which reflects the importance of each video segment relative to the title text and thus provides an effective attention mechanism to guide summary generation.
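A minimal sketch of how these natural-language-guided attention weights could be computed with a public CLIP checkpoint is given below; the checkpoint name and function names are assumptions, since the patent only states that a pre-trained CLIP model is used.

```python
# Illustrative sketch: CLIP frame/text embeddings, cosine similarity and
# softmax normalization producing the attention weight sequence (alpha_1, ..., alpha_n).
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def language_guided_attention(frames, title):
    """frames: list of RGB frame images (PIL or numpy); title: the video's title text."""
    img_inputs = processor(images=frames, return_tensors="pt")
    txt_inputs = processor(text=[title], return_tensors="pt", padding=True)
    x = clip.get_image_features(**img_inputs)   # frame semantic features (x_1, ..., x_n)
    t = clip.get_text_features(**txt_inputs)    # text semantic feature t
    # cosine similarity between t and every x_i, then softmax over the sequence
    sim = torch.nn.functional.cosine_similarity(x, t, dim=-1)  # shape (n,)
    return torch.softmax(sim, dim=0)            # attention weights summing to 1
```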
S3. Construct an attention-based Encoder-Decoder video summarization model implemented with LSTM networks, which takes the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) as input and outputs the frame importance score sequence (y_1, …, y_n).
The Encoder and Decoder of the model can be implemented with convolutional neural networks, recurrent neural networks and the like; considering that LSTM handles long-range dependencies well and has advantages in long-sequence modeling, the invention chooses an LSTM implementation. As shown in FIG. 2, the Encoder-Decoder video summarization model with the natural language guided attention mechanism, implemented with LSTM networks, is divided into three modules, Encoder, Attention and Decoder, which are realized as follows:
The Encoder module is implemented with n stacked LSTM units; each LSTM unit takes the memory state c_{t-1} of the previous time step and the input f_t of the current time step, and outputs the hidden state h_t. Feeding the frame feature sequence (f_1, …, f_n) into the Encoder yields the hidden sequence (h_1, …, h_n):
(h_1, …, h_n) = Encoder(f_1, …, f_n)
The Attention module computes a weighted sum of the hidden sequence (h_1, …, h_n) with the attention weight sequence (α_1, …, α_n), producing the fusion variable h:
h = Σ_i α_i · h_i
The Decoder module is implemented with n stacked LSTM units; it takes the fusion variable h as input and outputs the importance score sequence (y_1, …, y_n):
(y_1, …, y_n) = Decoder(h)
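For illustration, the Encoder-Attention-Decoder structure described above might be sketched in PyTorch as follows. The hidden size, the linear-plus-sigmoid scoring head and the choice to feed the fusion variable h to the decoder at every time step are assumptions, since the patent does not specify these details.

```python
# Illustrative sketch of the Encoder / Attention / Decoder structure.
# The attention weights come from the natural-language guidance and are
# supplied externally rather than learned from the video alone.
import torch
import torch.nn as nn

class LanguageGuidedSummarizer(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.score = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, frame_feats, attn_weights):
        # frame_feats: (1, n, feat_dim); attn_weights: (1, n)
        h_seq, _ = self.encoder(frame_feats)                    # hidden sequence (h_1..h_n)
        fused = (attn_weights.unsqueeze(-1) * h_seq).sum(dim=1) # h = sum_i alpha_i * h_i
        # repeat the fusion variable h at each decoding step (one possible realization)
        dec_in = fused.unsqueeze(1).repeat(1, frame_feats.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return self.score(dec_out).squeeze(-1)                  # importance scores (y_1..y_n)
```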
S4. Train the whole network model.
In this embodiment, the SumMe video summarization dataset is selected for training; however, the titles of some videos in the SumMe dataset do not reflect the video content well (for example, "Jumps" conveys little about the video content), which affects the training effect. Therefore, the video title text of the SumMe dataset needs to be optimized: several candidate titles are drafted according to the video content, multiple users watch the video through a questionnaire and select the most suitable title, and finally, for each video, the title selected by the highest proportion of users, whether the original SumMe title or an optimized candidate proposed by this method, is used as the training text, as shown in Table 1:
TABLE 1
The SumMe dataset contains 25 videos; the dataset is small and well suited to training with cross-validation, which makes full use of the 25 videos and reduces the adverse effects of an unbalanced split. In this embodiment, 80% of the data is used as the training set and 20% as the test set, with five-fold cross-validation, so that each fold has 20 training videos and 5 test videos. The model is trained with a mean loss, where (y_1, …, y_n) are the model's predicted scores and (s_1, …, s_n) are the training-set annotation scores, using the loss function L defined above.
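A hedged sketch of the training loop implied by this paragraph is shown below; the optimizer, learning rate and the use of mean squared error are assumptions, since the patent only specifies a "mean loss" minimized with gradient descent and back-propagation until convergence or a maximum number of iterations.

```python
# Illustrative training loop for the video summarization model.
# Adam, lr and MSE ("mean loss") are assumptions, not specified by the patent.
import torch

def train(model, dataset, epochs=150, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(epochs):
        total = 0.0
        for frame_feats, attn_weights, labels in dataset:
            # labels: annotation scores (s_1, ..., s_n) in [0, 1]
            pred = model(frame_feats, attn_weights)  # predicted scores (y_1, ..., y_n)
            loss = loss_fn(pred, labels)
            opt.zero_grad()
            loss.backward()      # back-propagation
            opt.step()           # gradient descent update
            total += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total / len(dataset):.4f}")
```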
S5. Select the video summary according to the frame importance score sequence (y_1, …, y_n).
In this embodiment, visually continuous frames are combined into shots, and the frame importance score sequence is converted into a shot importance score sequence by averaging the importance scores of all frames within each shot; key shots are then selected with a 0/1 knapsack algorithm according to the shot importance scores and shot lengths, with the total length of the key shots limited to 15% of the original video length, and the key shots are combined into the video summary.
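To make the shot selection concrete, a small sketch of the 0/1 knapsack selection under the 15% length budget follows; the shot boundaries are assumed to be supplied by an external shot segmentation step, which the patent does not name.

```python
# Illustrative sketch of step S5: average frame scores per shot, then pick key
# shots with a 0/1 knapsack under a 15% length budget.
def select_key_shots(frame_scores, shot_bounds, budget_ratio=0.15):
    """frame_scores: per-frame importance scores (y_1..y_n);
    shot_bounds: list of (start, end) frame indices, end exclusive."""
    n = len(frame_scores)
    shots = []
    for start, end in shot_bounds:
        length = end - start
        value = sum(frame_scores[start:end]) / max(length, 1)  # shot importance score
        shots.append((start, end, length, value))

    capacity = int(n * budget_ratio)
    # classic 0/1 knapsack: maximize total shot importance subject to
    # total selected length <= capacity
    dp = [[0.0] * (capacity + 1) for _ in range(len(shots) + 1)]
    for i, (_, _, length, value) in enumerate(shots, start=1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if length <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - length] + value)

    # backtrack to recover the selected key shots
    selected, c = [], capacity
    for i in range(len(shots), 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            start, end, length, _ = shots[i - 1]
            selected.append((start, end))
            c -= length
    return sorted(selected)
```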
The embodiments described above are presented to facilitate the understanding and application of the present invention by those skilled in the art. It will be apparent to those skilled in the art that various modifications can be made to these embodiments and that the general principles described herein can be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the embodiments described above, and improvements and modifications made by those skilled in the art based on the present disclosure should fall within the protection scope of the present invention.

Claims (8)

1. A natural language guided video summarization method, comprising the steps of:
(1) decomposing the video files in the training set into frame sequences, and extracting image features of each frame sequence with a pre-trained deep image network to obtain the corresponding frame image feature sequence (f_1, …, f_n);
(2) extracting semantic features of the frame sequence with the image encoding network of a pre-trained multimodal model to obtain the frame semantic feature sequence (x_1, …, x_n);
(3) computing the spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t to obtain the attention weight sequence (α_1, …, α_n);
(4) constructing a video summarization model based on the natural language guided attention mechanism, which takes the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) as input and outputs the frame importance score sequence (y_1, …, y_n), and then training the video summarization model;
(5) inputting the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) of a test-set video file into the trained video summarization model and outputting the frame importance score sequence (y_1, …, y_n) of that video file;
(6) selecting key shots according to the frame importance score sequence (y_1, …, y_n) and synthesizing them into the video summary.
2. The video summarization method according to claim 1, wherein, since the natural language text of a considerable proportion of the video files in the training set does not reflect the video content well, the natural language text is optimized before its semantic features are extracted: several candidate titles are first drafted according to the video content, multiple users then watch the video through a questionnaire and select the most suitable title, and finally the title selected by the highest proportion of users is taken as the optimized natural language text.
3. The video summarization method according to claim 1, wherein step (3) is specifically implemented by first computing the cosine similarity between the text semantic feature t and each feature vector in the frame semantic feature sequence (x_1, …, x_n) to obtain a similarity sequence, and then applying softmax normalization to the similarity sequence to obtain the attention weight sequence (α_1, …, α_n).
4. The video summarization method according to claim 1, wherein the video summarization model is formed by sequentially connecting an Encoder module, an Attention module and a Decoder module: the Encoder module encodes the input frame image feature sequence (f_1, …, f_n) into a hidden sequence (h_1, …, h_n); the Attention module computes a weighted sum of the hidden sequence (h_1, …, h_n) using the attention weight sequence (α_1, …, α_n) to obtain a fusion variable h; and the Decoder module decodes the fusion variable h and outputs the frame importance score sequence (y_1, …, y_n).
5. The video summarization method according to claim 1, wherein in step (4) the video summarization model is trained by first initializing the model parameters, then inputting the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) into the model, which predicts and outputs the frame importance score sequence (y_1, …, y_n), and iteratively updating the model parameters with gradient descent and back-propagation according to a loss function L until the loss function L converges to a minimum or the maximum number of iterations is reached, at which point training is complete.
6. The video summarization method of claim 5, wherein the loss function L is expressed as follows:
where y_i denotes the predicted importance score of the i-th video frame, s_i denotes the annotation score of the i-th video frame, and n is the total number of frames in the video file.
7. The video summarization method according to claim 6, wherein the annotation score s_i of the i-th video frame is obtained by having multiple users label the importance of the video frame, with label 1 denoting important and label 0 denoting unimportant, and the annotation score s_i is the proportion of users who assigned label 1 among all users.
8. The video summarization method according to claim 1, wherein step (6) is specifically implemented by first combining visually continuous frames into shots, then converting the frame importance score sequence (y_1, …, y_n) into a shot importance score sequence, where the importance score of a shot is the average of the importance scores of the frames it contains, then selecting key shots with a 0/1 knapsack algorithm according to the shot importance scores and shot lengths such that the total length of the key shots does not exceed 15% of the whole video length, and finally synthesizing all key shots into the video summary.
CN202210652477.2A 2022-06-07 2022-06-07 A natural language guided video summarization method Active CN115033736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210652477.2A CN115033736B (en) 2022-06-07 2022-06-07 A natural language guided video summarization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210652477.2A CN115033736B (en) 2022-06-07 2022-06-07 A natural language guided video summarization method

Publications (2)

Publication Number Publication Date
CN115033736A (en) 2022-09-09
CN115033736B (en) 2025-04-15

Family

ID=83122726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210652477.2A Active CN115033736B (en) 2022-06-07 2022-06-07 A natural language guided video summarization method

Country Status (1)

Country Link
CN (1) CN115033736B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658963B (en) * 2022-10-09 2025-07-18 浙江大学 Pupil size-based man-machine cooperation video abstraction method
CN115665508B (en) * 2022-11-02 2025-03-18 阿里巴巴(中国)有限公司 Method, device, electronic device and storage medium for generating video summary
CN116208772A (en) * 2023-05-05 2023-06-02 浪潮电子信息产业股份有限公司 Data processing method, device, electronic device, and computer-readable storage medium
CN116567348A (en) * 2023-05-05 2023-08-08 合肥工业大学 Method for generating video abstract of minimally invasive surgery
CN117009576B (en) * 2023-07-07 2025-09-23 咪咕文化科技有限公司 Method, device, equipment and storage medium for generating text video summary
CN117835012B (en) * 2023-12-27 2024-07-26 北京智象未来科技有限公司 Controllable video generation method, device, equipment and storage medium
CN118172712B (en) * 2024-05-09 2024-10-25 百度在线网络技术(北京)有限公司 Video summarizing method, large model training method, device and electronic equipment
CN119862861B (en) * 2025-03-25 2025-07-15 山东大学 Visual-text collaborative abstract generation method and system based on multi-modal learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102519581A (en) * 2011-12-21 2012-06-27 浙江大学 Separation method of power transformer vibration signal
CN109409221A (en) * 2018-09-20 2019-03-01 中国科学院计算技术研究所 Video content description method and system based on frame selection

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6892193B2 (en) * 2001-05-10 2005-05-10 International Business Machines Corporation Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities
GB2558582A (en) * 2017-01-06 2018-07-18 Nokia Technologies Oy Method and apparatus for automatic video summarisation
CN110110140A (en) * 2019-04-19 2019-08-09 天津大学 Video summarization method based on attention expansion coding and decoding network
CN113869324A (en) * 2021-08-19 2021-12-31 北京大学 An implementation method of video commonsense knowledge reasoning based on multimodal fusion

Also Published As

Publication number Publication date
CN115033736A (en) 2022-09-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant