
CN115033736B - A natural language guided video summarization method - Google Patents

A natural language guided video summarization method

Info

Publication number
CN115033736B
Authority
CN
China
Prior art keywords
video
sequence
frame
natural language
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210652477.2A
Other languages
Chinese (zh)
Other versions
CN115033736A (en)
Inventor
金永刚
郑婧
马海钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210652477.2A
Publication of CN115033736A
Application granted
Publication of CN115033736B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 - Querying
    • G06F16/738 - Presentation of query results
    • G06F16/739 - Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a natural language guided video summarization method, which comprises the steps of decomposing a video file into a frame sequence, extracting frame image features, extracting frame semantic features and text semantic features, computing spatial cosine similarity to obtain attention weights, constructing a natural language guided video summarization model, training the overall network model, and selecting the video summary according to the frame importance score sequence. The invention creatively proposes a natural language guided attention mechanism for the video summarization task and introduces it into the video summarization framework; ablation experiments show that this mechanism significantly improves the performance of the summarization model. In addition, the proposed natural language guided attention mechanism is more objective, fully attends to the video segments related to the title text, and the title text is easy to obtain for Internet videos at no extra cost.

Description

Natural language guided video summarization method
Technical Field
The invention belongs to the technical field of video summarization, and particularly relates to a natural language guided video summarization method.
Background
With the rapid development of multimedia and network information technology in recent years, video has become an increasingly mainstream medium of information exchange. How to condense a lengthy video into minutes or even seconds while preserving its key information, i.e., video summarization, has gradually become an important research topic in the field of video technology. Video summarization techniques use computer algorithms to automatically select the important segments of a video as its summary, which reduces the storage space required for the video and allows users to browse video information quickly.
The idea of mimicking human attention originated in the field of computer vision: by introducing an attention mechanism that focuses on only part of an image rather than the whole image, the computational complexity of image processing can be reduced and performance improved. Similarly, each frame in a video has its own level of importance, and only some frames contain the key information of the video content that a summary should focus on. Therefore, many researchers have introduced attention mechanisms into video summarization methods to assign different importance weights to different frames of the input sequence, instead of treating all input frames equally, thereby establishing an intrinsic link between the input video sequence and the output importance scores.
If a video segment is interesting to the user, it is more likely to be an important segment of the whole sequence, and many researchers therefore model video importance based on user attention. For example, Ma et al., in [A user attention model for video summarization[C]//Proceedings of the tenth ACM international conference on Multimedia, 2002: 533-542], propose scoring the importance of video segments with low-level features such as motion changes and facial features, combining these scores into an attention curve and extracting the parts at the curve peaks as key shots to construct the summary. With the development of deep learning, several attention-based deep video summarization methods have been proposed. For example, Ji et al. propose the AVS video summarization model in [Video summarization with attention-based encoder–decoder networks[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 30(6): 1709-1717], which formulates supervised video summarization as a sequence-to-sequence learning problem, explores two attention-based decoding networks with additive and multiplicative objective functions, obtains the attention weights from the video sequence itself, and learns the attention mechanism in a supervised manner. Apostolidis et al., in [Combining Global and Local Attention with Positional Encoding for Video Summarization[C]//2021 IEEE International Symposium on Multimedia (ISM). IEEE, 2021: 226-234], propose combining global and local multi-head attention mechanisms to model frame dependencies at different granularities, where the attention mechanism incorporates components that encode the temporal position of video frames.
However, it is easy to see that existing attention-based deep video summarization methods generally use only the video image information to generate the summary and do not consider information from other modalities, even though descriptive text such as the title often contains key information highly relevant to the video content.
Disclosure of Invention
In view of the above, the invention provides a natural language guided video summarization method, which creatively proposes a natural language guided attention mechanism for the video summarization task, obtains attention weights by computing the similarity between the video sequence and the text, and introduces this mechanism into an Encoder-Decoder video summarization framework; ablation experiments show that the natural language guided attention mechanism significantly improves the performance of the summarization model.
A natural language guided video summarization method, comprising the steps of:
(1) decomposing the video files in the training set into frame sequences, and extracting image features of each frame sequence with a pre-trained deep image network to obtain the corresponding frame image feature sequence (f_1, …, f_n);
(2) extracting semantic features of the frame sequence with the image encoding network of a pre-trained multimodal model to obtain the frame semantic feature sequence (x_1, …, x_n);
(3) computing the spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t to obtain the attention weight sequence (α_1, …, α_n);
(4) constructing a video summarization model based on the natural language guided attention mechanism, which takes the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) as input and outputs the frame importance score sequence (y_1, …, y_n), and then training the video summarization model;
(5) inputting the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) of a test-set video file into the trained video summarization model and outputting the frame importance score sequence (y_1, …, y_n) of that video file;
(6) selecting key shots according to the frame importance score sequence (y_1, …, y_n) and synthesizing them into the video summary.
Furthermore, since the natural language text of a considerable proportion of the video files in the training set does not reflect the video content well, the natural language text needs to be optimized before its semantic features are extracted: several candidate titles are first drafted according to the video content, multiple users then watch the video through a questionnaire and select the most suitable title, and finally the title selected by the highest proportion of users is taken as the optimized natural language text.
Further, step (3) is specifically implemented by first computing the cosine similarity between the text semantic feature t and each feature vector in the frame semantic feature sequence (x_1, …, x_n) to obtain a similarity sequence, and then applying softmax normalization to the similarity sequence to obtain the attention weight sequence (α_1, …, α_n), which reflects the importance of each video segment relative to the title text and thus provides an effective attention mechanism to guide summary generation.
Further, the video summarization model is formed by sequentially connecting an Encoder module, an Attention module and a Decoder module: the Encoder module encodes the input frame image feature sequence (f_1, …, f_n) into a hidden sequence (h_1, …, h_n); the Attention module computes a weighted sum of the hidden sequence (h_1, …, h_n) using the attention weight sequence (α_1, …, α_n) to obtain a fusion variable h; and the Decoder module decodes the fusion variable h and outputs the frame importance score sequence (y_1, …, y_n).
Further, in step (4), the video summarization model is trained as follows: the model parameters are first initialized; the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) are then input into the model, which predicts and outputs the frame importance score sequence (y_1, …, y_n); the model parameters are iteratively updated with gradient descent and back-propagation according to a loss function L until the loss function L converges to a minimum or the maximum number of iterations is reached, at which point training is complete.
Further, the expression of the loss function L is as follows:
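A plausible form of the loss, assuming the mean squared error implied by the "mean loss" used in the detailed description, is:
L = (1/n) · Σ_{i=1}^{n} (y_i - s_i)²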
where y_i denotes the predicted importance score of the i-th video frame, s_i denotes the annotation score of the i-th video frame, and n is the total number of frames in the video file.
Further, the annotation score s_i of the i-th video frame is obtained by having multiple users label the importance of the video frame, with label 1 denoting important and label 0 denoting unimportant; the annotation score s_i is the proportion of users who assigned label 1 among all users.
Further, step (6) is specifically implemented by first combining visually continuous frames into shots and converting the frame importance score sequence (y_1, …, y_n) into a shot importance score sequence, where the importance score of a shot is the average of the importance scores of the frames it contains; key shots are then selected with a 0/1 knapsack algorithm according to the shot importance scores and shot lengths, such that the total length of the key shots does not exceed 15% of the whole video length; finally, all key shots are synthesized into the video summary.
Based on the technical scheme, the invention has the following beneficial technical effects:
1. With 150 epochs of training, the natural language guided video summarization model achieves an average five-fold cross-validation F1-score of 47.0% (the F1-score is the key-shot overlap measure sketched after this list). In an ablation experiment that removes the natural language guidance, the average five-fold cross-validation F1-score of the video summarization model drops to 44.5%, which shows that the natural language guidance improves model performance.
2. Compared with user-attention-based mechanisms, the natural language guided attention mechanism proposed by the invention is more objective and fully attends to the video segments related to the title text; moreover, the title text of an Internet video is easy to obtain at no cost, whereas user attention must be collected and analyzed and is not easy to obtain.
3. While learning how humans summarize videos, the summarization model fully attends to the important content related to the natural language text. When a user provides a natural language text together with a video, the summary generated by the video summarization method is strongly correlated with the text provided by the user and can fully reflect the user's interests.
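For reference, the F1-score cited in effect 1 is conventionally computed as the key-shot overlap between the generated summary and a user summary. This evaluation protocol is standard in the video summarization literature but is not spelled out in the patent; the sketch below is therefore an assumption about the metric, not part of the claimed method.

```python
# Illustrative sketch of the key-shot overlap F1-score commonly used to evaluate
# video summaries. Both summaries are 0/1 per-frame selection masks of equal length.
def f1_score(pred_mask, user_mask):
    overlap = sum(1 for p, u in zip(pred_mask, user_mask) if p and u)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_mask)   # fraction of predicted summary that matches
    recall = overlap / sum(user_mask)      # fraction of user summary that is covered
    return 2 * precision * recall / (precision + recall)
```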
Drawings
FIG. 1 is a schematic diagram of attention weight generation based on natural language guidance in accordance with the present invention.
FIG. 2 is a schematic diagram of an Encoder-Decoder video summarization model based on the attention mechanism according to the present invention.
Detailed Description
To describe the present invention more concretely, the technical scheme of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the natural language guided video summarization method of the present invention includes the following steps:
S1. Decompose the video file into a frame sequence and extract image features of the frame sequence with a pre-trained deep image network to obtain the frame image feature sequence (f_1, …, f_n).
In this embodiment, the video is sampled into a frame sequence at a sampling rate of 2 fps, and the image features of the frame sequence are extracted with the pool5 layer of a GoogLeNet model pre-trained on the large-scale image dataset ImageNet, yielding the frame image feature sequence (f_1, …, f_n), where n is the video length and the feature dimension of each video frame is 1024.
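As an illustrative sketch only (the patent provides no code), the 2 fps sampling and 1024-dimensional pool5 feature extraction could look roughly as follows; the video path, helper names and the torchvision weight identifier are assumptions, not taken from the patent.

```python
# Illustrative sketch: sample a video at 2 fps and extract 1024-d GoogLeNet
# "pool5" (global average pooling) features per frame, as described in step S1.
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

def sample_frames(path, fps=2.0):
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

# GoogLeNet with the classifier removed: the output after global average pooling
# is a 1024-dimensional vector per frame.
googlenet = models.googlenet(weights="IMAGENET1K_V1")
googlenet.fc = torch.nn.Identity()
googlenet.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    batch = torch.stack([preprocess(f) for f in frames])
    return googlenet(batch)  # shape (n, 1024): the sequence (f_1, ..., f_n)

features = frame_features(sample_frames("example.mp4"))  # "example.mp4" is a placeholder
```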
S2. Extract semantic features of the frame sequence and the natural language text with the image encoding network and text encoding network of a pre-trained multimodal model to obtain the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t, and compute the spatial cosine similarity to obtain the attention weight sequence (α_1, …, α_n).
In this embodiment, the image encoding network and text encoding network of a pre-trained CLIP model are used to extract semantic features of the frame sequence and the natural language text. During training, the CLIP model maps the feature spaces of its image encoder and text encoder into a common semantic space, and this property makes it possible to compute similarities between video frame images and text. The spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t extracted by the pre-trained CLIP model is computed to obtain a similarity sequence, and softmax normalization of this sequence yields the attention weight sequence (α_1, …, α_n), which reflects the importance of each video segment relative to the title text and thus provides an effective attention mechanism to guide summary generation.
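A minimal sketch of how these natural-language-guided attention weights could be computed with a public CLIP checkpoint is given below; the checkpoint name and function names are assumptions, since the patent only states that a pre-trained CLIP model is used.

```python
# Illustrative sketch: CLIP frame/text embeddings, cosine similarity and
# softmax normalization producing the attention weight sequence (alpha_1, ..., alpha_n).
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def language_guided_attention(frames, title):
    """frames: list of RGB frame images (PIL or numpy); title: the video's title text."""
    img_inputs = processor(images=frames, return_tensors="pt")
    txt_inputs = processor(text=[title], return_tensors="pt", padding=True)
    x = clip.get_image_features(**img_inputs)   # frame semantic features (x_1, ..., x_n)
    t = clip.get_text_features(**txt_inputs)    # text semantic feature t
    # cosine similarity between t and every x_i, then softmax over the sequence
    sim = torch.nn.functional.cosine_similarity(x, t, dim=-1)  # shape (n,)
    return torch.softmax(sim, dim=0)            # attention weights summing to 1
```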
S3. Construct an attention-based Encoder-Decoder video summarization model implemented with LSTM networks, which takes the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) as input and outputs the frame importance score sequence (y_1, …, y_n).
The Encoder and Decoder of the model can be implemented with convolutional neural networks, recurrent neural networks and the like; considering that LSTM handles long-range dependencies well and has advantages in long-sequence modeling, the invention chooses an LSTM implementation. As shown in FIG. 2, the Encoder-Decoder video summarization model with the natural language guided attention mechanism, implemented with LSTM networks, is divided into three modules, Encoder, Attention and Decoder, which are realized as follows:
The Encoder module is implemented with n stacked LSTM units; each LSTM unit takes the memory state c_{t-1} of the previous time step and the input f_t of the current time step, and outputs the hidden state h_t. Feeding the frame feature sequence (f_1, …, f_n) into the Encoder yields the hidden sequence (h_1, …, h_n):
(h_1, …, h_n) = Encoder(f_1, …, f_n)
The Attention module computes a weighted sum of the hidden sequence (h_1, …, h_n) with the attention weight sequence (α_1, …, α_n), producing the fusion variable h:
h = Σ_i α_i · h_i
The Decoder module is implemented with n stacked LSTM units; it takes the fusion variable h as input and outputs the importance score sequence (y_1, …, y_n):
(y_1, …, y_n) = Decoder(h)
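For illustration, the Encoder-Attention-Decoder structure described above might be sketched in PyTorch as follows. The hidden size, the linear-plus-sigmoid scoring head and the choice to feed the fusion variable h to the decoder at every time step are assumptions, since the patent does not specify these details.

```python
# Illustrative sketch of the Encoder / Attention / Decoder structure.
# The attention weights come from the natural-language guidance and are
# supplied externally rather than learned from the video alone.
import torch
import torch.nn as nn

class LanguageGuidedSummarizer(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.score = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, frame_feats, attn_weights):
        # frame_feats: (1, n, feat_dim); attn_weights: (1, n)
        h_seq, _ = self.encoder(frame_feats)                    # hidden sequence (h_1..h_n)
        fused = (attn_weights.unsqueeze(-1) * h_seq).sum(dim=1) # h = sum_i alpha_i * h_i
        # repeat the fusion variable h at each decoding step (one possible realization)
        dec_in = fused.unsqueeze(1).repeat(1, frame_feats.size(1), 1)
        dec_out, _ = self.decoder(dec_in)
        return self.score(dec_out).squeeze(-1)                  # importance scores (y_1..y_n)
```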
S4. Train the whole network model.
In this embodiment, the SumMe video summarization dataset is selected for training; however, the titles of some videos in the SumMe dataset do not reflect the video content well (for example, "Jumps" conveys little about the video content), which affects the training effect. Therefore, the video title text of the SumMe dataset needs to be optimized: several candidate titles are drafted according to the video content, multiple users watch the video through a questionnaire and select the most suitable title, and finally, for each video, the title selected by the highest proportion of users, whether the original SumMe title or an optimized candidate proposed by this method, is used as the training text, as shown in Table 1:
TABLE 1
The SumMe dataset contains 25 videos; the dataset is small and well suited to training with cross-validation, which makes full use of the 25 videos and reduces the adverse effects of an unbalanced split. In this embodiment, 80% of the data is used as the training set and 20% as the test set, with five-fold cross-validation, so that each fold has 20 training videos and 5 test videos. The model is trained with a mean loss, where (y_1, …, y_n) are the model's predicted scores and (s_1, …, s_n) are the training-set annotation scores, using the loss function L defined above.
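A hedged sketch of the training loop implied by this paragraph is shown below; the optimizer, learning rate and the use of mean squared error are assumptions, since the patent only specifies a "mean loss" minimized with gradient descent and back-propagation until convergence or a maximum number of iterations.

```python
# Illustrative training loop for the video summarization model.
# Adam, lr and MSE ("mean loss") are assumptions, not specified by the patent.
import torch

def train(model, dataset, epochs=150, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(epochs):
        total = 0.0
        for frame_feats, attn_weights, labels in dataset:
            # labels: annotation scores (s_1, ..., s_n) in [0, 1]
            pred = model(frame_feats, attn_weights)  # predicted scores (y_1, ..., y_n)
            loss = loss_fn(pred, labels)
            opt.zero_grad()
            loss.backward()      # back-propagation
            opt.step()           # gradient descent update
            total += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total / len(dataset):.4f}")
```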
S5. Select the video summary according to the frame importance score sequence (y_1, …, y_n).
In this embodiment, visually continuous frames are combined into shots, and the frame importance score sequence is converted into a shot importance score sequence by averaging the importance scores of all frames within each shot; key shots are then selected with a 0/1 knapsack algorithm according to the shot importance scores and shot lengths, with the total length of the key shots limited to 15% of the original video length, and the key shots are combined into the video summary.
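To make the shot selection concrete, a small sketch of the 0/1 knapsack selection under the 15% length budget follows; the shot boundaries are assumed to be supplied by an external shot segmentation step, which the patent does not name.

```python
# Illustrative sketch of step S5: average frame scores per shot, then pick key
# shots with a 0/1 knapsack under a 15% length budget.
def select_key_shots(frame_scores, shot_bounds, budget_ratio=0.15):
    """frame_scores: per-frame importance scores (y_1..y_n);
    shot_bounds: list of (start, end) frame indices, end exclusive."""
    n = len(frame_scores)
    shots = []
    for start, end in shot_bounds:
        length = end - start
        value = sum(frame_scores[start:end]) / max(length, 1)  # shot importance score
        shots.append((start, end, length, value))

    capacity = int(n * budget_ratio)
    # classic 0/1 knapsack: maximize total shot importance subject to
    # total selected length <= capacity
    dp = [[0.0] * (capacity + 1) for _ in range(len(shots) + 1)]
    for i, (_, _, length, value) in enumerate(shots, start=1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if length <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - length] + value)

    # backtrack to recover the selected key shots
    selected, c = [], capacity
    for i in range(len(shots), 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            start, end, length, _ = shots[i - 1]
            selected.append((start, end))
            c -= length
    return sorted(selected)
```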
The embodiments described above are presented to facilitate the understanding and application of the present invention by those skilled in the art. It will be apparent to those skilled in the art that various modifications can be made to these embodiments and that the general principles described herein can be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the embodiments described above, and improvements and modifications made by those skilled in the art based on the present disclosure should fall within the protection scope of the present invention.

Claims (8)

1. A natural language guided video summarization method, comprising the steps of:
(1) decomposing the video files in the training set into frame sequences, and extracting image features of each frame sequence with a pre-trained deep image network to obtain the corresponding frame image feature sequence (f_1, …, f_n);
(2) extracting semantic features of the frame sequence with the image encoding network of a pre-trained multimodal model to obtain the frame semantic feature sequence (x_1, …, x_n);
(3) computing the spatial cosine similarity between the frame semantic feature sequence (x_1, …, x_n) and the text semantic feature t to obtain the attention weight sequence (α_1, …, α_n);
(4) constructing a video summarization model based on the natural language guided attention mechanism, which takes the frame image feature sequence (f_1, …, f_n) and the attention weight sequence (α_1, …, α_n) as input and outputs the frame importance score sequence (y_1, …, y_n), and then training the video summarization model;
(5) inputting the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) of a test-set video file into the trained video summarization model and outputting the frame importance score sequence (y_1, …, y_n) of that video file;
(6) selecting key shots according to the frame importance score sequence (y_1, …, y_n) and synthesizing them into the video summary.
2. The video summarization method according to claim 1, wherein, since the natural language text of a considerable proportion of the video files in the training set does not reflect the video content well, the natural language text is optimized before its semantic features are extracted: several candidate titles are first drafted according to the video content, multiple users then watch the video through a questionnaire and select the most suitable title, and finally the title selected by the highest proportion of users is taken as the optimized natural language text.
3. The video summarization method according to claim 1, wherein step (3) is specifically implemented by first computing the cosine similarity between the text semantic feature t and each feature vector in the frame semantic feature sequence (x_1, …, x_n) to obtain a similarity sequence, and then applying softmax normalization to the similarity sequence to obtain the attention weight sequence (α_1, …, α_n).
4. The video summarization method according to claim 1, wherein the video summarization model is formed by sequentially connecting an Encoder module, an Attention module and a Decoder module: the Encoder module encodes the input frame image feature sequence (f_1, …, f_n) into a hidden sequence (h_1, …, h_n); the Attention module computes a weighted sum of the hidden sequence (h_1, …, h_n) using the attention weight sequence (α_1, …, α_n) to obtain a fusion variable h; and the Decoder module decodes the fusion variable h and outputs the frame importance score sequence (y_1, …, y_n).
5. The video summarization method according to claim 1, wherein in step (4) the video summarization model is trained by first initializing the model parameters, then inputting the frame image feature sequence (f_1, …, f_n) and attention weight sequence (α_1, …, α_n) into the model, which predicts and outputs the frame importance score sequence (y_1, …, y_n), and iteratively updating the model parameters with gradient descent and back-propagation according to a loss function L until the loss function L converges to a minimum or the maximum number of iterations is reached, at which point training is complete.
6. The video summarization method of claim 5, wherein the loss function L is expressed as follows:
where y_i denotes the predicted importance score of the i-th video frame, s_i denotes the annotation score of the i-th video frame, and n is the total number of frames in the video file.
7. The video summarization method according to claim 6, wherein the annotation score s_i of the i-th video frame is obtained by having multiple users label the importance of the video frame, with label 1 denoting important and label 0 denoting unimportant, and the annotation score s_i is the proportion of users who assigned label 1 among all users.
8. The video summarization method according to claim 1, wherein step (6) is specifically implemented by first combining visually continuous frames into shots, then converting the frame importance score sequence (y_1, …, y_n) into a shot importance score sequence, where the importance score of a shot is the average of the importance scores of the frames it contains, then selecting key shots with a 0/1 knapsack algorithm according to the shot importance scores and shot lengths such that the total length of the key shots does not exceed 15% of the whole video length, and finally synthesizing all key shots into the video summary.
CN202210652477.2A 2022-06-07 2022-06-07 A natural language guided video summarization method Active CN115033736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210652477.2A CN115033736B (en) 2022-06-07 2022-06-07 A natural language guided video summarization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210652477.2A CN115033736B (en) 2022-06-07 2022-06-07 A natural language guided video summarization method

Publications (2)

Publication Number Publication Date
CN115033736A (en) 2022-09-09
CN115033736B (en) 2025-04-15

Family

ID=83122726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210652477.2A Active CN115033736B (en) 2022-06-07 2022-06-07 A natural language guided video summarization method

Country Status (1)

Country Link
CN (1) CN115033736B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115658963B (en) * 2022-10-09 2025-07-18 浙江大学 Pupil size-based man-machine cooperation video abstraction method
CN115665508B (en) * 2022-11-02 2025-03-18 阿里巴巴(中国)有限公司 Method, device, electronic device and storage medium for generating video summary
CN116208772A (en) * 2023-05-05 2023-06-02 浪潮电子信息产业股份有限公司 Data processing method, device, electronic device, and computer-readable storage medium
CN116567348A (en) * 2023-05-05 2023-08-08 合肥工业大学 Method for generating video abstract of minimally invasive surgery
CN117009576B (en) * 2023-07-07 2025-09-23 咪咕文化科技有限公司 Method, device, equipment and storage medium for generating text video summary
CN117835012B (en) * 2023-12-27 2024-07-26 北京智象未来科技有限公司 Controllable video generation method, device, equipment and storage medium
CN118172712B (en) * 2024-05-09 2024-10-25 百度在线网络技术(北京)有限公司 Video summarizing method, large model training method, device and electronic equipment
CN119862861B (en) * 2025-03-25 2025-07-15 山东大学 Visual-text collaborative abstract generation method and system based on multi-modal learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102519581A (en) * 2011-12-21 2012-06-27 浙江大学 Separation method of power transformer vibration signal
CN109409221A (en) * 2018-09-20 2019-03-01 中国科学院计算技术研究所 Video content description method and system based on frame selection

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6892193B2 (en) * 2001-05-10 2005-05-10 International Business Machines Corporation Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities
GB2558582A (en) * 2017-01-06 2018-07-18 Nokia Technologies Oy Method and apparatus for automatic video summarisation
CN110110140A (en) * 2019-04-19 2019-08-09 天津大学 Video summarization method based on attention expansion coding and decoding network
CN113869324A (en) * 2021-08-19 2021-12-31 北京大学 An implementation method of video commonsense knowledge reasoning based on multimodal fusion

Also Published As

Publication number Publication date
CN115033736A (en) 2022-09-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant