WO2026011669A1 - Video information summary generation method and apparatus, and electronic apparatus and storage medium - Google Patents
Info
- Publication number
- WO2026011669A1 (PCT/CN2024/135756)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video information
- text
- initial video
- information summary
- initial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/738—Presentation of query results
- G06F16/739—Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
- G06V10/763—Non-hierarchical techniques, e.g. based on statistics of modelling distributions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Definitions
- This application relates to the field of computer vision technology, and in particular to video information summarization methods, apparatus, electronic devices, and storage media.
- Video summarization technology primarily extracts and generates keyframes from videos automatically or semi-automatically, helping users quickly understand video content, especially key information in long videos, without needing to watch the entire video.
- Traditional static video summarization techniques mainly rely on keyframe representation and clustering methods to extract representative keyframes from videos and combine them into a new video to achieve video summarization.
- Some new methods for extracting keyframes have emerged. For example, a CNN model can extract spatial features from video frames and an LSTM model can then process temporal features, so that the model learns to identify important feature change points and predict keyframes; alternatively, an object detection algorithm can identify keyframes containing the main object.
- This embodiment provides a video information summarization method, apparatus, electronic device, and storage medium to address the problem in the related art that text summaries correctly representing video content cannot be generated.
- this embodiment provides a video information summary generation method, including:
- the initial video and text description are input into the trained open-world object detection model to perform keyframe detection, and the keyframes containing the target object in the initial video are obtained.
- the initial video information summary is input into the image-text extraction unit to extract the text description, resulting in the image text description of the initial video information summary;
- the image text description of the initial video information summary and the initial video information summary are input into the video-text semantic alignment unit for semantic alignment to obtain the aligned video feature representation;
- the aligned video feature representation is input into the text generation unit to obtain the target video information summary.
- keyframes from several frames are clustered to obtain an initial video information summary, including:
- the feature representations of keyframes are input into the trained backbone network model. Based on the backbone network model, K-means clustering analysis is performed to obtain the initial video information summary.
- the backbone network model employs the Swin Transformer model.
- K-means clustering analysis is performed based on the backbone network model to obtain an initial video information summary, including:
- the cosine similarity between keyframes is used as a distance metric.
- K-means clustering analysis is then performed based on the distance metric to obtain an initial video information summary.
- the feature representations of keyframes are input into the trained backbone network model. Based on the backbone network model, K-means clustering analysis is performed to obtain an initial video information summary, including:
- the feature representation of keyframes containing the target object is input into the backbone network model, and K-means clustering analysis is performed based on the backbone network model to obtain multiple initial video information summaries.
- the K-nearest neighbor method is used to detect anomalous frames in the clustered initial video information summaries, and video frames whose anomaly score is greater than a preset threshold are removed to obtain the initial video information summary.
- the initial video information summary is input to the image-text extraction unit to obtain an image-text description of the initial video information summary, including:
- Feature representations of keyframes in the initial video information summary are extracted using the pre-trained BLIP-2 model
- the feature representations of keyframes in the initial video information summary are input into an autoregressive text generator to generate text, resulting in an image text description of the initial video information summary.
- the text generation unit is an autoregressive text generator.
- this embodiment provides a video information summarization generation device, including: an acquisition module, a keyframe extraction module, a keyframe clustering module, an image-to-text conversion module, an alignment module, and a generation module, wherein:
- the acquisition module is used to acquire the initial video and a preset text description of the target object
- the keyframe extraction module is used to input the initial video and text description into the trained open-world object detection model to perform keyframe detection and obtain keyframes containing the target object in the initial video.
- the keyframe clustering module is used to cluster key frames from a number of frames to obtain an initial video information summary
- the image-to-text conversion module is used to input the initial video information summary into the image-to-text extraction unit to extract the text description, and obtain the image-to-text description of the initial video information summary;
- the alignment module is used to input the image text description of the initial video information summary and the initial video information summary into the video-text semantic alignment unit for semantic alignment, so as to obtain the aligned video feature representation;
- the generation module is used to input the aligned video feature representation into the text generation unit to obtain the target video information summary.
- this embodiment provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the video information summary generation method described in the first aspect above.
- this embodiment provides a storage medium storing a computer program that, when executed by a processor, implements the video information summary generation method described in the first aspect above.
- the video information summary generation method obtains an initial video and a preset text description of the target object; inputs the initial video and text description into a trained open-world object detection model for keyframe detection to obtain keyframes containing the target object in the initial video; clusters the keyframes of several frames to obtain an initial video information summary; inputs the initial video information summary into an image-text extraction unit to extract the text description, obtaining an image-text description of the initial video information summary; inputs the image-text description of the initial video information summary and the initial video information summary into a video-text semantic alignment unit for semantic alignment to obtain an aligned video feature representation; and inputs the aligned video feature representation into a text generation unit to obtain the target video information summary.
- This achieves the generation of a text summary that correctly describes the video content, improving the accuracy of the text summary content.
- FIG. 2 is a flowchart of the video information summary generation method in this embodiment.
- FIG. 5 is a structural block diagram of the video information summary generation device in this embodiment.
- connection is not limited to physical or mechanical connections but may include electrical connections, whether direct or indirect.
- multiple used in this application refers to two or more.
- the "and/or" operator describes the relationship between related objects, indicating that three relationships can exist. For example, "A and/or B" can represent three cases: A alone, A and B simultaneously, and B alone. Typically, the character "/" indicates that the objects before and after it are in an "or" relationship.
- "first," "second," and "third," etc., used in this application are merely for distinguishing similar objects and do not represent a specific ordering of the objects.
- Figure 1 is a hardware structure block diagram of the terminal for the video information summary generation method of this embodiment.
- the terminal may include one or more processors 102 (only one is shown in Figure 1) and a memory 104 for storing data, wherein the processor 102 may be, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA).
- the terminal may also include a transmission device 106 for communication functions and an input/output device 108.
- the structure shown in Figure 1 is merely illustrative and does not limit the structure of the terminal.
- the terminal may include more or fewer components than shown in Figure 1, or have a different configuration than that shown in Figure 1.
- the memory 104 can be used to store computer programs, such as application software programs and modules, like the computer program corresponding to the video information summary generation method in this embodiment.
- the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, thereby implementing the above-described method.
- the memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
- the memory 104 may further include memory remotely located relative to the processor 102, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
- the transmission device 106 is used to receive or send data via a network.
- This network includes a wireless network provided by the terminal's communication provider.
- the transmission device 106 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet.
- the transmission device 106 can be a Radio Frequency (RF) module for wireless communication with the Internet.
- FIG. 2 is a flowchart of the video information summary generation method of this embodiment. As shown in Figure 2, the process includes the following steps:
- Step S201 Obtain the initial video and the preset text description for the target object.
- the initial video is acquired.
- This initial video can come from various sources, such as surveillance cameras, social media platforms, personal mobile phones, and professional shooting equipment. These videos may contain different resolutions, frame rates, and encoding formats.
- the initial video content can include various elements, such as people, animals, scenes, and objects.
- the specific initial video selected can be chosen based on actual application needs. This embodiment does not impose specific limitations on this.
- a text description of the target object is pre-defined as needed. This text description is used to extract the required video summary. The text description can include the target object's name, attributes, location, behavior, etc.
- Step S202 Input the initial video and text description into the trained open-world object detection model to perform keyframe detection, and obtain keyframes in the initial video that contain the target object.
- Traditional object detection algorithms are limited to predefined known categories, i.e., categories that have appeared in the training set. They cannot identify targets when faced with new scenes or labels. For example, if the model is trained with the label "lion," it will identify all targets labeled "lion," but it cannot detect horses or more precise targets, such as only the lion on the left side of the image.
- most traditional object detection models are supervised models, requiring the collection of a large amount of labeled data for training. Each time the set of identifiable objects needs to be expanded or changed, data must be collected and labeled, which is time-consuming, labor-intensive, and lacks flexibility. This application uses an open-world object detection model.
- the open-world object detection model is pre-trained to obtain a trained model. Pre-defined text descriptions of target objects and initial video are input into the trained open-world object detection model for keyframe detection, resulting in keyframes containing the target object.
- the open-world object detection model can cover multiple target objects described in text, improving object detection capability and flexibility.
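To illustrate the idea behind text-conditioned keyframe selection, here is a toy sketch: frames whose features are cosine-similar to the text-description embedding are kept as keyframes. All embeddings below are hypothetical random stand-ins; in the described method, a trained open-world detector produces the frame and text features.

```python
import numpy as np

def detect_keyframes(frame_feats, text_feat, threshold=0.8):
    """Keep frames whose feature vector is cosine-similar to the
    text-description embedding; a stand-in for open-world detection."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    scores = f @ t                      # cosine similarity per frame
    return np.flatnonzero(scores > threshold), scores

# Hypothetical embeddings: 5 frames, 4-dim features; frame 2 matches the text.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 4))
text = frames[2] + 0.05 * rng.normal(size=4)
keep, scores = detect_keyframes(frames, text)
```

In the real pipeline the threshold and embedding spaces come from the trained model; the sketch only shows how a free-form text description, rather than a fixed label set, drives which frames are retained.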
- Step S203 Cluster the key frames of several frames to obtain an initial video information summary.
- since the preset text descriptions may cover multiple target objects, the keyframes obtained from them can contain multiple preset target objects.
- analysis can be performed by identifying the scenes, actions, and events corresponding to different target objects in the keyframes, or by analyzing the colors, textures, and shapes of different target objects in the keyframes, or by analyzing the keyframes based on image quality, content similarity, and action changes, or by analyzing different target objects in the keyframes based on semantic space relevance.
- the main content, structure, and temporal relationships of different target objects in the keyframes can be understood.
- a coherent video summary containing multiple aggregates of different target objects can be obtained, which is the initial video information summary.
- Step S204 Input the initial video information summary into the image-text extraction unit to extract the text description, and obtain the image text description of the initial video information summary.
- the initial video information summary consists of a series of keyframes, which represent the main content of the video.
- Before being input into the image-to-text extraction unit, the keyframes can undergo some preprocessing, including scaling, cropping, and denoising, to improve the accuracy of subsequent image recognition and text extraction.
- the image-to-text extraction unit loads the model, algorithm, and parameters.
- the model can be a pre-trained image recognition model, object detection model, OCR (Optical Character Recognition) model, or natural language processing model, etc.
- the image-to-text extraction unit first performs image content recognition on each keyframe in the initial video information summary. This includes recognizing objects, scenes, actions, etc., in the image. By utilizing a pre-trained image recognition model, the image-to-text extraction unit identifies the main content in the keyframes and converts it into image-text descriptions.
- Step S205 Input the image text description of the initial video information summary and the initial video information summary into the video-text semantic alignment unit for semantic alignment to obtain the aligned video feature representation.
- the image text description contains key information and descriptions extracted from keyframes
- the initial video information summary contains the main content and structure of the video.
- the video-text semantic alignment unit semantically aligns the information in the image text description with the information in the initial video information summary to obtain the corresponding video feature representation.
- Specific alignment methods can include rule-based matching, statistical model-based matching, deep learning-based matching, etc.
- the specific alignment method can be determined according to the actual situation, and this application does not impose specific limitations on this. Through alignment, each part of the image text description is matched with the corresponding content in the initial video information summary, ensuring their semantic consistency and enhancing the feature representation of the video summary.
- Step S206 Input the aligned video feature representation into the text generation unit to obtain the target video information summary.
- the text generation unit is a module that converts the input data into a target video information summary.
- the aligned video feature representation is input into the text generation unit, which converts the aligned video feature representation into the corresponding text description to obtain the target video information summary.
- an initial video and a preset text description of the target object are obtained.
- the initial video and text description are input into a trained open-world object detection model for keyframe detection to obtain keyframes containing the target object in the initial video.
- the keyframes of several frames are aggregated to obtain an initial video information summary.
- the initial video information summary is input into an image-text extraction unit to extract text descriptions to obtain image text descriptions of the initial video information summary.
- the image text descriptions of the initial video information summary and the initial video information summary are input into a video-text semantic alignment unit for semantic alignment to obtain aligned video feature representations.
- the aligned video feature representations are input into a text generation unit to obtain the target video information summary.
- this application uses an open-world object detection model to extract keyframes from the initial video based on the preset text description of the target object, then aggregates the keyframes to obtain video information summaries of multiple target objects, then extracts image text descriptions from the video information summaries of multiple target objects, aligns the image text descriptions with the initial video information summary information, and converts them into text output to obtain the target video information summary. It enables the generation of text summaries that accurately represent video content, thereby improving the accuracy of the text summaries.
- keyframes from several frames are clustered to obtain an initial video information summary, including: extracting feature representations of the keyframes from several frames; inputting the feature representations of the keyframes into a trained backbone network model; and performing K-means clustering analysis based on the backbone network model to obtain the initial video information summary.
- In the K-means clustering analysis, the Swin Transformer model is used as the backbone network model; the cosine similarity between keyframes is used as the distance metric, and K-means clustering analysis is performed based on this distance metric to obtain the initial video information summary.
- This application uses the Swin Transformer model as the backbone network model.
- Through hierarchical feature representation and shifted-window computation, the model has stronger modeling capabilities at different scales and linear computational complexity, enhancing performance and accelerating computation. Since there may be multiple target objects in the preset text description of the target object, keyframes with the same features are clustered according to different target objects or needs to obtain initial video information summaries for different target objects.
- feature representations of several keyframes are extracted and input into the trained Swin Transformer backbone network model.
- the cosine similarity between keyframes is calculated, and K-means clustering is performed using the cosine similarity as a distance metric to obtain N/M (rounded up, where N is the number of video frames after clustering and M is the number of frames included in the initial video summary) clusters, which is the initial video information summary.
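The clustering step above can be sketched as follows. The tiny 2-D features are synthetic stand-ins for Swin Transformer keyframe representations, and K-means with cosine similarity is implemented by L2-normalising the features so that the highest-cosine centroid is the nearest one.

```python
import math
import numpy as np

def cosine_kmeans(feats, n_clusters, n_iter=20):
    """K-means with cosine similarity as the distance metric: features are
    L2-normalised, centers are initialised by farthest-point sampling, and
    each point is assigned to the centroid with the highest cosine score."""
    x = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    centers = [x[0]]
    for _ in range(1, n_clusters):              # farthest-point initialisation
        sims = np.max(np.stack([x @ c for c in centers]), axis=0)
        centers.append(x[np.argmin(sims)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmax(x @ centers.T, axis=1)   # assign by cosine similarity
        for k in range(n_clusters):
            pts = x[labels == k]
            if len(pts):
                c = pts.mean(axis=0)
                centers[k] = c / np.linalg.norm(c)  # re-normalised centroid
    return labels

# N keyframes grouped into ceil(N / M) clusters, as described above.
N, M = 6, 3
k = math.ceil(N / M)                            # 2 clusters
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                  [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]])
labels = cosine_kmeans(feats, k)
```

The farthest-point initialisation is one reasonable deterministic choice; the patent text does not specify the initialisation scheme.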
- the feature representations of keyframes are input into the trained backbone network model. Based on the backbone network model, K-means clustering analysis is performed to obtain an initial video information summary, including:
- the feature representation of keyframes containing the target object is input into the backbone network model.
- K-means clustering analysis is performed based on the backbone network model to obtain multiple initial video information summaries.
- the K-nearest neighbor method is used to detect anomalous frames in the initial video information summaries and remove video frames that exceed a preset threshold to obtain the initial video information summary.
- keyframes are clustered based on features to obtain an initial video information summary, which may contain anomalous frames.
- an anomalous frame detection module is set up to detect them.
- This embodiment uses the K-nearest neighbor method to detect anomalous frames in the initial video information summary.
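A minimal numerical sketch of the K-nearest-neighbor screening (toy 2-D features and an arbitrary threshold; the real module operates on backbone feature representations): a frame whose mean distance to its K nearest neighbors exceeds the preset threshold is treated as anomalous and dropped.

```python
import numpy as np

def remove_anomalous_frames(feats, k=2, threshold=1.0):
    """K-nearest-neighbor anomaly detection: compute each frame's mean
    distance to its k nearest neighbors and drop frames over threshold."""
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # ignore self-distance
    knn_dist = np.sort(d, axis=1)[:, :k].mean(axis=1)
    keep = knn_dist <= threshold
    return feats[keep], np.flatnonzero(~keep)

# Five hypothetical frame features: four in a tight cluster, one outlier.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
kept, dropped = remove_anomalous_frames(feats, k=2, threshold=1.0)
```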
- the initial video information summary is input to the image-text extraction unit to obtain an image-text description of the initial video information summary, including: extracting feature representations of key frames in the initial video information summary using a pre-trained BLIP-2 model; and inputting the feature representations of key frames in the initial video information summary to an autoregressive text generator to generate text, thereby obtaining an image-text description of the initial video information summary.
- BLIP-2: Bootstrapping Language-Image Pre-training 2.
- the BLIP-2 model can process both image and text data simultaneously. Through its visual-language interaction capabilities, it identifies keyframes in the video and then transforms these keyframes into feature representations. These feature representations contain the visual information of the keyframes, providing a foundation for subsequent text generation.
- the extracted keyframe feature representations are then input into an autoregressive text generator, which uses an autoregressive approach to generate text descriptions, resulting in the image-text description of the initial video information summary.
- the aligned video feature representation is input to a text generation unit to obtain a target video information summary, wherein the text generation unit is an autoregressive text generator.
- the text generation unit employs an autoregressive text generator, which takes the aligned video feature representation as input and generates the corresponding target video information summary based on the prediction and autoregressive function of the autoregressive text generator.
- FIG. 3 is a flowchart of another video information summary generation method according to this embodiment. As shown in Figure 3, the process includes the following steps:
- Step S301 Extract text descriptions using the image-text extraction unit
- the clustered videos in the initial video information summary obtained by clustering are used as input, the feature representations of keyframes in each video are extracted, and the feature representations are input into the autoregressive text generator to generate text descriptions of the keyframes.
- Step S302 The text description information is semantically aligned with the initial video information summary through the video-text semantic alignment unit to obtain the aligned video feature representation.
- textual description information and initial video information summaries are input into the video-text semantic alignment unit.
- Features from the textual description information are matched with features from the initial video information to obtain aligned video feature representations, ensuring semantic consistency and enhancing the feature representation of the video summary.
- the visual embedder is built based on the Video Swin Transformer, and the text embedder is built based on the CLIP Text Encoder.
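A toy sketch of the alignment step, assuming the frame and text features already lie in a shared semantic space (simple stand-ins for the Video Swin Transformer and CLIP Text Encoder outputs): each frame attends over the text descriptions, and the attention-weighted text features enhance the frame representation.

```python
import numpy as np

def align_features(video_feats, text_feats, temperature=0.1):
    """Toy video-text semantic alignment: each frame computes a softmax
    attention over the text descriptions by cosine similarity, then its
    representation is enhanced with the attention-weighted text features."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = v @ t.T / temperature                     # frame-to-text similarity
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)         # softmax over text entries
    return v + attn @ t                             # aligned representation

frames = np.eye(3)                                  # 3 hypothetical frame features
texts = np.eye(3)[[0, 2]]                           # 2 text features, same space
aligned = align_features(frames, texts)
```

The temperature and the additive fusion are illustrative choices; the patent text only specifies that alignment matches text features to video features to obtain an enhanced representation.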
- Step S303 The aligned video feature representation is input into the autoregressive text generator through the text generation unit to obtain the target video information summary.
- the text generation unit employs an autoregressive text generator, using aligned video feature representations as input. Based on the predictions and autoregressive function of the autoregressive text generator, it generates corresponding target video information summaries.
- This autoregressive text generator is constructed using a Transformer encoder structure.
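The autoregressive generation loop itself can be sketched with a hand-built linear next-token scorer standing in for the Transformer-based generator; the five-word vocabulary, the weights, and the video features below are all hypothetical.

```python
import numpy as np

VOCAB = ["<bos>", "a", "dog", "cat", "<eos>"]

def build_toy_weights():
    """Hand-set weights of a hypothetical linear next-token scorer whose
    input is [video feature (2 dims), previous-token one-hot (5 dims)]."""
    W = np.zeros((5, 7))
    W[1, 2 + 0] = 10.0        # <bos> -> "a"
    W[2, 2 + 1] = 5.0         # after "a", choose "dog"/"cat" ...
    W[3, 2 + 1] = 5.0
    W[2, 0] = 1.0             # ... with video dim 0 voting for "dog"
    W[3, 1] = 1.0             # ... and video dim 1 voting for "cat"
    W[4, 2 + 2] = 10.0        # "dog" -> <eos>
    W[4, 2 + 3] = 10.0        # "cat" -> <eos>
    return W

def autoregressive_generate(video_feat, W, max_len=6):
    """Greedy autoregressive decoding conditioned on the aligned video
    feature: append the argmax token each step until <eos>."""
    tokens = [0]                                   # start from <bos>
    for _ in range(max_len):
        prev = np.eye(len(VOCAB))[tokens[-1]]      # one-hot previous token
        nxt = int(np.argmax(W @ np.concatenate([video_feat, prev])))
        tokens.append(nxt)
        if VOCAB[nxt] == "<eos>":
            break
    return " ".join(VOCAB[t] for t in tokens if t not in (0, 4))

W = build_toy_weights()
summary = autoregressive_generate(np.array([2.0, 1.0]), W)
```

Changing the conditioning feature changes the generated text, which is the essence of generating a summary from the aligned video feature representation.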
- the image-text extraction unit obtains the text description of each frame in the initial video information summary; the video-text semantic alignment unit enhances the feature representation of the initial video information summary with the semantic information of the text description of each frame; finally, the text generation unit obtains the target video information summary using an autoregressive approach.
- the video information summary generation algorithm architecture proposed in this embodiment can reduce the semantic gap between video and text, map visual and linguistic representations to a shared semantic space, improve video description capabilities, and enhance the accuracy of the text summary content.
- FIG. 4 is a flowchart of a preferred video information summary generation method of this embodiment. As shown in Figure 4, the video information summary generation method includes the following steps:
- Step S401 Obtain the initial video and a preset text description for the target object
- Step S402 Input the initial video and text description into the trained open-world object detection model to perform keyframe detection, and obtain keyframes in the initial video that contain the target object;
- Step S403 Extract feature representations of keyframes from several frames
- Step S404 Input the feature representations of the keyframes into the trained Swin Transformer backbone network model, use the cosine similarity between keyframes as the distance metric, and perform K-means clustering analysis to obtain multiple clustered initial video information summaries.
- Step S405 Use the K-nearest neighbor method to detect anomalous frames in the clustered initial video information summaries, remove video frames that are greater than a preset threshold, and obtain the initial video information summary.
- Step S406 Extract feature representations of keyframes from the initial video information summary using the pre-trained BLIP-2 model
- Step S407 Input the feature representations of the keyframes in the initial video information summary into the autoregressive text generator to generate text, and obtain the image text description of the initial video information summary;
- Step S408 Input the image text description of the initial video information summary and the initial video information summary into the video-text semantic alignment unit for semantic alignment to obtain the aligned video feature representation;
- Step S409 Input the aligned video feature representation into the text generation unit to obtain the target video information summary.
- This embodiment also provides a video information summarization generation device, which is used to implement the above embodiments and preferred embodiments; details already described will not be repeated.
- the terms “module,” “unit,” “subunit,” etc., used below refer to combinations of software and/or hardware that perform predetermined functions. Although the device described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.
- FIG. 5 is a structural block diagram of the video information summary generation device of this embodiment.
- The device 50 includes: an acquisition module 51, a keyframe extraction module 52, a keyframe clustering module 53, an image-to-text conversion module 54, an alignment module 55, and a generation module 56, wherein:
- the acquisition module 51 is used to acquire the initial video and a preset text description of the target object;
- the keyframe extraction module 52 is used to input the initial video and the text description into the trained open-world object detection model to perform keyframe detection, obtaining keyframes in the initial video that contain the target object;
- the keyframe clustering module 53 is used to cluster the keyframes of several frames to obtain an initial video information summary;
- the image-to-text conversion module 54 is used to input the initial video information summary into the image-text extraction unit to extract a text description, obtaining the image text description of the initial video information summary;
- the alignment module 55 is used to input the image text description of the initial video information summary and the initial video information summary into the video-text semantic alignment unit for semantic alignment, obtaining the aligned video feature representation;
- the generation module 56 is used to input the aligned video feature representation into the text generation unit to obtain the target video information summary.
- This embodiment also provides an electronic device including a memory and a processor, the memory storing a computer program and the processor being configured to run the computer program to perform the steps in any of the above method embodiments.
- The electronic device may further include a transmission device and an input/output device, both of which are connected to the processor.
- The processor may be configured to perform the following steps via the computer program:
- obtain the initial video and a preset text description of the target object;
- input the initial video and the text description into the trained open-world object detection model to perform keyframe detection, obtaining keyframes in the initial video that contain the target object;
- cluster the keyframes of several frames to obtain an initial video information summary;
- input the initial video information summary into the image-text extraction unit to extract a text description, obtaining the image text description of the initial video information summary;
- input the image text description of the initial video information summary and the initial video information summary into the video-text semantic alignment unit for semantic alignment, obtaining the aligned video feature representation;
- input the aligned video feature representation into the text generation unit to obtain the target video information summary.
- This embodiment can also provide a storage medium for implementation.
- The storage medium stores a computer program; when executed by a processor, the computer program implements any of the video information summary generation methods described in the above embodiments.
- Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc.
- Volatile memory can include random access memory (RAM) or external cache memory, etc.
- RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
- The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases.
- The processors involved in the embodiments provided in this application may be, but are not limited to, general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, and the like.
Description
This application relates to the field of computer vision technology, and in particular to a video information summary generation method and apparatus, an electronic device, and a storage medium.
In video surveillance scenarios such as security, transportation, kitchens, and schools, large amounts of video data are generated. These videos are often long and contain a great deal of redundant and invalid information, so watching them directly to obtain useful information is time-consuming and inefficient.
Video summarization technology automatically or semi-automatically extracts and generates keyframes from a video, helping users quickly grasp its content, especially the key information in long videos, without watching the entire video. Traditional static video summarization techniques mainly rely on keyframe representation and clustering to extract representative keyframes from a video and combine them into a new video. With the development of deep learning, new keyframe extraction methods have emerged: for example, using a CNN model to extract spatial features from video frames and an LSTM model to process temporal features, so that the model learns to identify important feature change points and predict keyframes; or using an object detection algorithm to identify keyframes containing the main objects. However, static video summarization based on traditional object detection algorithms can only detect objects that the trained model recognizes. Applying such a technique in a new scenario requires collecting data and retraining the object detection model, which is time-consuming and laborious and limits the flexibility of the technique. Moreover, when video frames containing key objects identified by an object detection algorithm are simply taken as the keyframes of a summary, the resulting summary may contain multiple events, which increases the difficulty of video understanding and ultimately prevents generating a text summary that correctly describes the video content.
There is currently no effective solution to the problem in the related art that a text summary correctly describing the video content cannot be generated.
This embodiment provides a video information summary generation method and apparatus, an electronic device, and a storage medium to solve the problem in the related art that a text summary correctly describing the video content cannot be generated.
In a first aspect, this embodiment provides a video information summary generation method, including:
obtaining an initial video and a preset text description of a target object;
inputting the initial video and the text description into a trained open-world object detection model for keyframe detection to obtain keyframes in the initial video that contain the target object;
clustering the keyframes of several frames to obtain an initial video information summary;
inputting the initial video information summary into an image-text extraction unit to extract a text description, obtaining an image text description of the initial video information summary;
inputting the image text description of the initial video information summary and the initial video information summary into a video-text semantic alignment unit for semantic alignment to obtain an aligned video feature representation;
inputting the aligned video feature representation into a text generation unit to obtain a target video information summary.
In some of these embodiments, clustering the keyframes of several frames to obtain an initial video information summary includes:
extracting feature representations of the keyframes of several frames;
inputting the feature representations of the keyframes into a trained backbone network model, and performing K-means clustering analysis based on the backbone network model to obtain the initial video information summary.
In some of these embodiments, the backbone network model employs the Swin Transformer model.
In some of these embodiments, performing K-means clustering analysis based on the backbone network model to obtain the initial video information summary includes:
based on the backbone network model, using the cosine similarity between keyframes as the distance metric, and performing K-means clustering analysis according to the distance metric to obtain the initial video information summary.
In some of these embodiments, inputting the feature representations of the keyframes into the trained backbone network model and performing K-means clustering analysis based on the backbone network model to obtain the initial video information summary includes:
inputting the feature representations of the keyframes containing the target object into the backbone network model, and performing K-means clustering analysis based on the backbone network model to obtain multiple clustered initial video information summaries;
using the K-nearest neighbor method to detect abnormal frames in the clustered initial video information summaries, and removing video frames whose distance exceeds a preset threshold to obtain the initial video information summary.
In some of these embodiments, inputting the initial video information summary into the image-text extraction unit to obtain the image text description of the initial video information summary includes:
extracting feature representations of the keyframes in the initial video information summary using the pre-trained BLIP-2 model;
inputting the feature representations of the keyframes in the initial video information summary into an autoregressive text generator for text generation, obtaining the image text description of the initial video information summary.
In some of these embodiments, the text generation unit is an autoregressive text generator.
In a second aspect, this embodiment provides a video information summary generation device, including: an acquisition module, a keyframe extraction module, a keyframe clustering module, an image-to-text conversion module, an alignment module, and a generation module, wherein:
the acquisition module is used to acquire an initial video and a preset text description of a target object;
the keyframe extraction module is used to input the initial video and the text description into a trained open-world object detection model for keyframe detection, obtaining keyframes in the initial video that contain the target object;
the keyframe clustering module is used to cluster the keyframes of several frames to obtain an initial video information summary;
the image-to-text conversion module is used to input the initial video information summary into an image-text extraction unit to extract a text description, obtaining an image text description of the initial video information summary;
the alignment module is used to input the image text description of the initial video information summary and the initial video information summary into a video-text semantic alignment unit for semantic alignment, obtaining an aligned video feature representation;
the generation module is used to input the aligned video feature representation into a text generation unit to obtain a target video information summary.
In a third aspect, this embodiment provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the video information summary generation method described in the first aspect above.
In a fourth aspect, this embodiment provides a storage medium storing a computer program that, when executed by a processor, implements the video information summary generation method described in the first aspect above.
Compared with the related art, the video information summary generation method provided in this embodiment obtains an initial video and a preset text description of a target object; inputs the initial video and the text description into a trained open-world object detection model for keyframe detection to obtain keyframes in the initial video that contain the target object; clusters the keyframes of several frames to obtain an initial video information summary; inputs the initial video information summary into an image-text extraction unit to extract a text description, obtaining an image text description of the initial video information summary; inputs the image text description of the initial video information summary and the initial video information summary into a video-text semantic alignment unit for semantic alignment to obtain an aligned video feature representation; and inputs the aligned video feature representation into a text generation unit to obtain a target video information summary. This enables the generation of a text summary that correctly describes the video content, improving the accuracy of the text summary.
Details of one or more embodiments of this application are set forth in the following drawings and description to make other features, objects, and advantages of this application more readily apparent.
The accompanying drawings, which are included to provide a further understanding of this application and form a part of it, illustrate exemplary embodiments and are used to explain this application; they do not constitute an undue limitation of this application. In the drawings:
Figure 1 is a block diagram of the hardware structure of a terminal for the video information summary generation method of this embodiment.
Figure 2 is a flowchart of the video information summary generation method of this embodiment.
Figure 3 is a flowchart of another video information summary generation method of this embodiment.
Figure 4 is a preferred flowchart of the video information summary generation method of this embodiment.
Figure 5 is a structural block diagram of the video information summary generation device of this embodiment.
To provide a clearer understanding of the purpose, technical solutions, and advantages of this application, the application is described and illustrated below in conjunction with the accompanying drawings and embodiments.
Unless otherwise defined, the technical or scientific terms used in this application have the ordinary meaning understood by a person of ordinary skill in the art to which this application pertains. Words such as "a," "an," "the," and "these" used in this application do not indicate a quantitative limitation and may be singular or plural. The terms "comprising," "including," "having," and any variations thereof used in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or modules (units) is not limited to the listed steps or modules (units) but may include steps or modules (units) not listed, or other steps or modules (units) inherent to such processes, methods, products, or devices. The terms "connected," "linked," and "coupled" used in this application are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. The term "multiple" used in this application refers to two or more. "And/or" describes the relationship between associated objects and indicates that three relationships can exist; for example, "A and/or B" can represent three cases: A alone, both A and B, and B alone. Typically, the character "/" indicates an "or" relationship between the objects before and after it. The terms "first," "second," "third," etc., used in this application merely distinguish similar objects and do not represent a specific ordering of the objects.
The method embodiments provided in this example can be executed in a terminal, a computer, or a similar computing device. Taking execution on a terminal as an example, Figure 1 is a block diagram of the hardware structure of a terminal for the video information summary generation method of this embodiment. As shown in Figure 1, the terminal may include one or more processors 102 (only one is shown in Figure 1) and a memory 104 for storing data, wherein the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA). The terminal may also include a transmission device 106 for communication functions and an input/output device 108. Those of ordinary skill in the art will understand that the structure shown in Figure 1 is merely illustrative and does not limit the structure of the terminal. For example, the terminal may include more or fewer components than shown in Figure 1, or have a different configuration from that shown in Figure 1.
The memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the video information summary generation method in this embodiment. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, thereby implementing the method described above. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, and such remote memory can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. The network may include a wireless network provided by the terminal's communication provider. In one example, the transmission device 106 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 106 can be a Radio Frequency (RF) module for communicating with the Internet wirelessly.
This embodiment provides a video information summary generation method. Figure 2 is a flowchart of the video information summary generation method of this embodiment. As shown in Figure 2, the process includes the following steps:
Step S201: Obtain the initial video and a preset text description of the target object.
Specifically, the initial video can come from various sources, such as surveillance cameras, social media platforms, personal mobile phones, and professional shooting equipment, and may have different resolutions, frame rates, and encoding formats. The initial video content can include various elements, such as people, animals, scenes, and objects. The specific initial video can be chosen based on actual application needs; this embodiment does not specifically limit this. A text description of the target object is predefined as needed and is used to extract the required video summary; the text description can cover the target object's name, attributes, location, behavior, and so on.
Step S202: Input the initial video and the text description into the trained open-world object detection model to perform keyframe detection, and obtain keyframes in the initial video that contain the target object.
The detection range of traditional (close-set) object detection algorithms is limited to predefined known categories, i.e., categories that appeared in the training set; they cannot recognize targets in new scenes or with new labels. For example, if the label "lion" existed during training, the model will identify all targets labeled "lion," but it cannot be used to detect horses, nor can it recognize more precise targets, such as identifying only the lion on the left side of an image. Furthermore, most traditional object detection models are supervised models that require collecting a large amount of labeled data from the target scene for training; every time the set of recognizable objects needs to be expanded or changed, data must be collected and labeled, which is time-consuming, labor-intensive, and inflexible. This application uses an open-world object detection model as the object detection model: the open-world object detection model is first pre-trained to obtain a trained model, and the preset text description of the target object and the initial video are input into the trained open-world object detection model for keyframe detection, yielding keyframes containing the target object. The open-world object detection model can cover multiple target events, improving object detection capability and flexibility.
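The text-prompted keyframe selection of step S202 can be sketched as below. This is a hypothetical stand-in, not the claimed model: `detect_objects` is a stub replacing a trained open-world detector that scores a frame against the text prompt, and the toy "frames" are dictionaries rather than images.

```python
def detect_objects(frame, text_prompt):
    """Stub for an open-world detector: returns the detection confidence of
    the prompted target in the frame (a real model would return boxes too)."""
    return frame.get(text_prompt, 0.0)

def select_keyframes(frames, text_prompt, score_threshold=0.5):
    """Keep the indices of frames where the prompted target is detected."""
    return [i for i, frame in enumerate(frames)
            if detect_objects(frame, text_prompt) >= score_threshold]

# Toy "video": each frame maps a prompt to a detection confidence.
video = [{"lion": 0.9}, {"horse": 0.8}, {"lion": 0.3}, {"lion": 0.7}]
keyframes = select_keyframes(video, "lion")  # frames 0 and 3 contain a lion
```

Because the detector is conditioned on free text rather than a fixed label set, changing the target object only requires changing the prompt, not retraining.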
Step S203: Cluster the keyframes of several frames to obtain an initial video information summary.
Specifically, the keyframes obtained from the preset text descriptions may involve multiple preset target objects. When aggregating the keyframes, the analysis can identify the scenes, actions, and events corresponding to the different target objects in the keyframes; or analyze the colors, textures, and shapes of the different target objects; or analyze the keyframes based on image quality, content similarity, and motion changes; or analyze the different target objects based on semantic-space relevance. After this analysis, the main content, structure, and temporal relationships of the different target objects in the keyframes are understood, and keyframes showing the same target object are aggregated, yielding multiple aggregated, coherent video summaries containing the different target objects, i.e., the initial video information summary.
Step S204: Input the initial video information summary into the image-text extraction unit to extract the text description, and obtain the image text description of the initial video information summary.
The initial video information summary consists of a series of keyframes, which represent the main content of the video. Before being input into the image-text extraction unit, the keyframes can undergo preprocessing, including scaling, cropping, and denoising, to improve the accuracy of subsequent image recognition and text extraction.
Before extracting the text description, the image-text extraction unit loads its models, algorithms, and parameters. The models can be pre-trained image recognition models, object detection models, OCR (optical character recognition) models, natural language processing models, and the like.
The image-text extraction unit first performs image content recognition on each keyframe in the initial video information summary, including recognizing the objects, scenes, and actions in the image. Using a pre-trained image recognition model, the image-text extraction unit identifies the main content of each keyframe and converts it into an image text description.
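The per-keyframe captioning performed by the image-text extraction unit can be sketched as follows. `caption_frame` is a hypothetical stub standing in for a pre-trained captioner (such as a BLIP-2-style model); the feature strings and captions are toy values, not real model outputs.

```python
def caption_frame(frame_features):
    """Stub for a pre-trained captioner: maps visual features to a sentence.
    A real model would decode the caption from the features."""
    toy_captions = {"feat_lion": "a lion walking in the grass",
                    "feat_kitchen": "a chef cooking in a kitchen"}
    return toy_captions.get(frame_features, "an unrecognized scene")

def describe_summary(keyframe_features):
    """Image text description of the summary: one caption per keyframe."""
    return [caption_frame(f) for f in keyframe_features]

descriptions = describe_summary(["feat_lion", "feat_kitchen"])
```

The resulting list of sentences, one per keyframe, is the image text description passed on to the alignment unit in step S205.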
Step S205: Input the image text description of the initial video information summary and the initial video information summary into the video-text semantic alignment unit for semantic alignment to obtain the aligned video feature representation.
Specifically, the image text description contains the key information and descriptions extracted from the keyframes, while the initial video information summary contains the main content and structure of the video. The video-text semantic alignment unit semantically aligns the information in the image text description with the information in the initial video information summary to obtain the corresponding video feature representation. The alignment can be rule-based matching, statistical-model-based matching, deep-learning-based matching, and so on; the specific alignment method can be chosen according to the actual situation and is not specifically limited in this application. Through alignment, each part of the image text description is matched with the corresponding content in the initial video information summary, ensuring their semantic consistency and enhancing the feature representation of the video summary.
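One way to realize the semantic alignment of step S205 is embedding-similarity matching: pair each text embedding with its most similar frame embedding under cosine similarity. A minimal sketch under that assumption, using toy 2-D vectors in place of real model embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def align(frame_feats, text_feats):
    """Match each text embedding to its most similar frame embedding and
    return (frame index, text embedding) pairs as the aligned representation."""
    return [(max(range(len(frame_feats)),
                 key=lambda i: cosine(frame_feats[i], t)), t)
            for t in text_feats]

frames = [(1.0, 0.0), (0.0, 1.0)]   # toy frame embeddings
texts = [(0.9, 0.1), (0.1, 0.9)]    # toy caption embeddings
aligned = align(frames, texts)      # each caption pairs with its frame
</n```

The matched pairs constitute an aligned representation in which each text fragment is tied to the video content it actually describes.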
Step S206: Input the aligned video feature representation into the text generation unit to obtain the target video information summary.
Specifically, the text generation unit is a module that converts input data into a target video information summary: the aligned video feature representation is input into the text generation unit, which converts it into the corresponding text description, yielding the target video information summary.
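The autoregressive text generation used by the text generation unit can be sketched as a greedy decoding loop: each generated token is fed back as input for the next step. The lookup table below is a toy stand-in for a trained language model head, not the claimed generator.

```python
def generate(start_token, next_token_table, max_len=10):
    """Greedy autoregressive decoding: repeatedly predict the next token from
    the previous one until an end-of-sequence token or the length limit."""
    tokens = [start_token]
    while len(tokens) < max_len:
        nxt = next_token_table.get(tokens[-1])
        if nxt is None or nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens

# Toy "model": a lookup table of most likely next tokens.
table = {"a": "lion", "lion": "rests", "rests": "<eos>"}
summary_tokens = generate("a", table)
```

In a real system the table lookup would be a forward pass of the generator conditioned on the aligned video features, but the feed-back loop has the same shape.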
Through steps S201 to S206, an initial video and a preset text description of the target object are obtained; the initial video and the text description are input into a trained open-world object detection model for keyframe detection to obtain keyframes in the initial video that contain the target object; the keyframes of several frames are aggregated to obtain an initial video information summary; the initial video information summary is input into an image-text extraction unit to extract a text description, obtaining an image text description of the initial video information summary; the image text description of the initial video information summary and the initial video information summary are input into a video-text semantic alignment unit for semantic alignment to obtain an aligned video feature representation; and the aligned video feature representation is input into a text generation unit to obtain the target video information summary.
Compared with traditional static video summarization, which extracts representative keyframes from the video and combines them into a new video, this application uses an open-world object detection model to extract keyframes from the initial video according to the preset text description of the target object, aggregates the keyframes into video information summaries for multiple target objects, extracts image text descriptions from those summaries, aligns the image text descriptions with the initial video information summary, and converts the result into text output to obtain the target video information summary. This enables the generation of text summaries that correctly describe the video content, improving the accuracy of the text summaries.
In some embodiments, clustering the keyframes to obtain the initial video information summary includes: extracting feature representations of the keyframes; inputting the feature representations into a trained backbone network model; and performing K-means clustering analysis based on the backbone network model to obtain the initial video information summary. With a Swin Transformer as the backbone network model, the cosine similarity between keyframes is used as the distance metric, and K-means clustering analysis is performed according to this metric to obtain the initial video information summary.

Specifically, different backbone network models have different representation capabilities. This application adopts the Swin Transformer as the backbone network model: through hierarchical feature representation and shifted-window computation, the model gains stronger modeling capability at multiple scales with linear computational complexity, improving performance while also accelerating computation. Since the preset text description may name multiple target objects, keyframes with the same features are clustered according to the different target objects or requirements, so as to obtain an initial video information summary for each target object. An embedding module extracts the feature representations of the keyframes, which are input into the trained Swin Transformer backbone; the cosine similarity between keyframes is computed and used as the distance metric for K-means clustering, producing ⌈N/M⌉ clusters (rounded up, where N is the number of keyframes after clustering and M is the number of frames contained in the initial video summary), i.e., the initial video information summary.
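The clustering step described above can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation: the function name is hypothetical, the Swin Transformer feature extractor is assumed to have already produced the feature matrix, and cosine-similarity K-means is realized as spherical k-means (L2-normalized features with renormalized centroids), which is a standard way to cluster by cosine distance.

```python
import numpy as np

def cluster_keyframes(features, frames_per_summary, n_iter=20, seed=0):
    """Cluster keyframe feature vectors into ceil(N / M) clusters.

    With unit-norm features, maximizing cosine similarity to a centroid is
    equivalent to minimizing Euclidean distance, so this implements the
    cosine-similarity distance metric described in the text.
    """
    feats = np.asarray(features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    n = len(feats)
    k = -(-n // frames_per_summary)  # ceil(N / M) clusters
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(n, size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # assign each keyframe to its most cosine-similar centroid
        labels = np.argmax(feats @ centers.T, axis=1)
        for j in range(k):
            members = feats[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)  # renormalize centroid
    return labels, k
```

For example, 20 keyframes with M = 6 frames per summary yield ⌈20/6⌉ = 4 clusters.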
In another embodiment, inputting the feature representations of the keyframes into the trained backbone network model and performing K-means clustering analysis based on it to obtain the initial video information summary includes:

inputting the feature representations of the keyframes containing the target object into the backbone network model and performing K-means clustering analysis to obtain multiple clustered initial video information summaries; and applying the K-nearest-neighbor method to detect abnormal frames in the clustered initial video information summaries, removing video frames whose average distance exceeds a preset threshold, to obtain the initial video information summary.

Specifically, clustering the keyframes by feature yields clustered initial video information summaries that may contain abnormal frames, so an abnormal-frame detection module is provided to detect them. This embodiment applies the K-nearest-neighbor method to the clustered initial video information summaries as follows:
(1) While acquiring the keyframes, record each keyframe's frame number, and sort the keyframes in the clustered initial video information summary in ascending order of frame number;

(2) Set the number of nearest-neighbor points k = 5;

(3) For each frame number, compute the average distance to its k nearest points;

(4) Remove keyframes whose average distance exceeds the threshold (set to 5 in this embodiment); these keyframes are the abnormal frames.
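Steps (1) to (4) can be sketched as follows; the function name and exact tie handling are illustrative assumptions, while k = 5 and the threshold of 5 follow this embodiment.

```python
import numpy as np

def remove_abnormal_frames(frame_numbers, k=5, threshold=5.0):
    """Steps (1)-(4): sort the frame numbers, then drop any frame whose
    average distance to its k nearest frame numbers exceeds the threshold."""
    ids = np.sort(np.asarray(frame_numbers))  # step (1): ascending order
    kept = []
    for f in ids:
        # step (3): distances to the k nearest points, excluding the frame itself
        dists = np.sort(np.abs(ids - f))[1 : k + 1]
        if dists.mean() <= threshold:  # step (4): keep only non-abnormal frames
            kept.append(int(f))
    return kept
```

For a cluster of consecutive frames 10-15 plus a stray frame 100, the stray frame's average distance to its five nearest neighbors is far above 5, so it is removed as abnormal.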
In some embodiments, inputting the initial video information summary into the image-text extraction unit to obtain its image text description includes: extracting feature representations of the keyframes in the initial video information summary with a pre-trained BLIP-2 model; and inputting those feature representations into an autoregressive text generator for text generation, yielding the image text description of the initial video information summary.

Specifically, the BLIP-2 (Bootstrapping Language-Image Pre-training) model extracts the feature representations of the keyframes in the initial video information summary. The BLIP-2 model processes image and text data jointly; through its vision-language interaction capability, it identifies the keyframes in the video and converts them into feature representations. These feature representations capture the visual information of the keyframes and provide the basis for subsequent text generation. The extracted keyframe feature representations are then input into an autoregressive text generator, which generates text descriptions autoregressively, producing the image text description of the initial video information summary.
In some embodiments, the aligned video feature representation is input into the text generation unit to obtain the target video information summary, the text generation unit being an autoregressive text generator.

Specifically, the text generation unit employs an autoregressive text generator that takes the aligned video feature representation as input and, using its prediction and autoregressive capabilities, generates the corresponding target video information summary.
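The autoregressive mechanism referred to here, where each output token is predicted from the tokens generated so far, can be illustrated with a minimal greedy decoding loop. The toy step function and token ids below are purely hypothetical stand-ins for a real Transformer generator's logits conditioned on the aligned video features.

```python
import numpy as np

def greedy_autoregressive_decode(step_fn, bos=0, eos=1, max_len=16):
    """Autoregressive generation: each new token is predicted from the prefix
    of tokens generated so far, stopping at the end-of-sequence token."""
    tokens = [bos]
    for _ in range(max_len):
        logits = step_fn(tokens)      # model scores the next token given the prefix
        nxt = int(np.argmax(logits))  # greedy choice of the highest-scoring token
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens

def toy_step(prefix):
    """Hypothetical stand-in for a generator conditioned on video features:
    emits tokens 2, 3, 4 in order, then the end-of-sequence token."""
    schedule = {1: 2, 2: 3, 3: 4}
    target = schedule.get(len(prefix), 1)
    logits = np.zeros(6)
    logits[target] = 1.0
    return logits
```

Running `greedy_autoregressive_decode(toy_step)` produces the token sequence `[0, 2, 3, 4, 1]`, showing how each step conditions on the growing prefix until EOS.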
This embodiment also provides another video information summary generation method. Figure 3 is a flowchart of this method; as shown in Figure 3, the process includes the following steps:

Step S301: Extract text descriptions using the image-text extraction unit.

Specifically, the clustered videos in the initial video information summary obtained by clustering are used as input; the feature representations of the keyframes in each video are extracted and input into the autoregressive text generator to generate text descriptions of the keyframes.

Step S302: The text description information is semantically aligned with the initial video information summary by the video-text semantic alignment unit to obtain the aligned video feature representation.

Specifically, the text description information and the initial video information summary are input into the video-text semantic alignment unit, where the features of the text description information are matched against the features of the initial video to obtain aligned video feature representations; this ensures semantic consistency between the two and enhances the feature representation of the video summary. In the video-text semantic alignment unit, the visual embedder (Visual Embedder) is built on the Video Swin Transformer, and the text embedder (Text Embedder) is built on the CLIP Text Encoder.
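The matching described above, in which text features and video features are compared in a shared embedding space, can be sketched as follows. This is a minimal illustration under assumptions: the embedders (Video Swin Transformer, CLIP Text Encoder) are assumed to have already produced the feature matrices, the function name is hypothetical, and the fusion rule (averaging each matched pair) is an illustrative stand-in for the patent's enhancement step.

```python
import numpy as np

def align_text_to_frames(video_feats, text_feats):
    """Match each text-description embedding to its most cosine-similar
    video-frame embedding in a shared space, then fuse the matched pairs."""
    v = np.asarray(video_feats, dtype=float)
    t = np.asarray(text_feats, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sim = t @ v.T                  # (num_texts, num_frames) cosine similarities
    matches = sim.argmax(axis=1)   # best-matching frame for each text description
    fused = v.copy()
    for ti, fi in enumerate(matches):
        # illustrative fusion: enhance the matched frame with its text feature
        fused[fi] = (v[fi] + t[ti]) / 2.0
    return matches, fused
```

With three orthogonal frame features and two text features each close to one frame, `matches` picks out the semantically nearest frame for each description.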
Step S303: The aligned video feature representation is input into the autoregressive text generator by the text generation unit to obtain the target video information summary.

Specifically, the text generation unit employs an autoregressive text generator that takes the aligned video feature representation as input and, using its prediction and autoregressive capabilities, generates the corresponding target video information summary. This autoregressive text generator is built on a Transformer encoder structure.

Through steps S301 to S303, the image-text extraction unit obtains a text description for each frame of the initial video information summary; the video-text semantic alignment unit enhances the feature representation of the initial video information summary with the semantic information of those per-frame text descriptions; finally, the text generation unit produces the target video information summary autoregressively. The video information summary generation architecture proposed in this embodiment narrows the semantic gap between video and text, maps visual and linguistic representations into a shared semantic space, improves video description capability, and enhances the accuracy of the summary text.
Figure 4 is a preferred flowchart of the video information summary generation method of this embodiment. As shown in Figure 4, the method includes the following steps:

Step S401: Obtain the initial video and a preset text description of the target object;

Step S402: Input the initial video and the text description into the trained open-world object detection model for keyframe detection to obtain the keyframes of the initial video that contain the target object;

Step S403: Extract the feature representations of the keyframes;

Step S404: Input the keyframe feature representations into the trained Swin Transformer backbone network model and perform K-means clustering analysis with the cosine similarity between keyframes as the distance metric, obtaining multiple clustered initial video information summaries;

Step S405: Apply the K-nearest-neighbor method to the clustered initial video information summaries to detect abnormal frames, and remove video frames whose average distance exceeds the preset threshold to obtain the initial video information summary;

Step S406: Extract the feature representations of the keyframes in the initial video information summary with the pre-trained BLIP-2 model;

Step S407: Input the keyframe feature representations of the initial video information summary into the image-text extraction unit for text generation to obtain the image text description of the initial video information summary;

Step S408: Input the image text description and the initial video information summary into the video-text semantic alignment unit for semantic alignment to obtain the aligned video feature representation;

Step S409: Input the aligned video feature representation into the text generation unit to obtain the target video information summary.
This embodiment also provides a video information summary generation apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. The terms "module", "unit", "sub-unit", etc. used below may refer to combinations of software and/or hardware that implement predetermined functions. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Figure 5 is a structural block diagram of the video information summary generation apparatus of this embodiment. As shown in Figure 5, the apparatus 50 includes: an acquisition module 51, a keyframe extraction module 52, a keyframe clustering module 53, an image-text conversion module 54, an alignment module 55, and a generation module 56, wherein:

the acquisition module 51 is configured to obtain the initial video and a preset text description of the target object;

the keyframe extraction module 52 is configured to input the initial video and the text description into the trained open-world object detection model for keyframe detection to obtain the keyframes of the initial video that contain the target object;

the keyframe clustering module 53 is configured to cluster the keyframes to obtain the initial video information summary;

the image-text conversion module 54 is configured to input the initial video information summary into the image-text extraction unit to extract text descriptions, obtaining the image text description of the initial video information summary;

the alignment module 55 is configured to input the image text description and the initial video information summary into the video-text semantic alignment unit for semantic alignment to obtain the aligned video feature representation;

the generation module 56 is configured to input the aligned video feature representation into the text generation unit to obtain the target video information summary.
This embodiment also provides an electronic apparatus comprising a memory and a processor; the memory stores a computer program, and the processor is configured to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, both of which are connected to the processor.

Optionally, in this embodiment, the processor may be configured to perform the following steps via the computer program:

S1: Obtain the initial video and a preset text description of the target object;

S2: Input the initial video and the text description into the trained open-world object detection model for keyframe detection to obtain the keyframes of the initial video that contain the target object;

S3: Cluster the keyframes to obtain the initial video information summary;

S4: Input the initial video information summary into the image-text extraction unit to extract text descriptions, obtaining the image text description of the initial video information summary;

S5: Input the image text description and the initial video information summary into the video-text semantic alignment unit for semantic alignment to obtain the aligned video feature representation;

S6: Input the aligned video feature representation into the text generation unit to obtain the target video information summary.
It should be noted that for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.

Furthermore, in conjunction with the video information summary generation methods provided in the above embodiments, this embodiment may also provide a storage medium for implementation. The storage medium stores a computer program; when executed by a processor, the computer program implements any of the video information summary generation methods of the above embodiments.

It should be understood that the specific embodiments described herein merely illustrate this application and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art from the embodiments provided in this application without inventive effort fall within the scope of protection of this application.

Obviously, the accompanying drawings are merely some examples or embodiments of this application; those of ordinary skill in the art can also apply this application to other similar situations based on these drawings without inventive effort. In addition, although the work done in this development process may be complex and lengthy, certain design, manufacturing, or production modifications made by those of ordinary skill in the art based on the technical content disclosed in this application are merely conventional technical means and should not be regarded as insufficient disclosure of this application.

The term "embodiment" in this application means that a specific feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an embodiment that is independent of, mutually exclusive with, or alternative to other embodiments. Those of ordinary skill in the art will understand, explicitly or implicitly, that the embodiments described in this application may be combined with other embodiments where no conflict arises.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, databases, or other media used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum-computing-based data processing logic devices, and the like, without limitation.

The embodiments described above express only several implementations of this application; while their description is relatively specific and detailed, it should not be construed as limiting the scope of patent protection. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, all of which fall within its scope of protection. Therefore, the scope of protection of this application shall be determined by the appended claims.
Claims (10)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410920291.X | 2024-07-10 | ||
| CN202410920291.XA CN118467778B (en) | 2024-07-10 | 2024-07-10 | Video information summary generation method, device, electronic device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2026011669A1 true WO2026011669A1 (en) | 2026-01-15 |
Family
ID=92154370
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/135756 Pending WO2026011669A1 (en) | 2024-07-10 | 2024-11-29 | Video information summary generation method and apparatus, and electronic apparatus and storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN118467778B (en) |
| WO (1) | WO2026011669A1 (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118467778B (en) * | 2024-07-10 | 2024-10-18 | 天翼视联科技有限公司 | Video information summary generation method, device, electronic device and storage medium |
| CN119169507A (en) * | 2024-09-12 | 2024-12-20 | 武汉天楚云计算有限公司 | Image data storage method, image server and cloud storage system |
| CN119296510B (en) * | 2024-10-16 | 2025-12-02 | 思必驰科技股份有限公司 | A semantically guided video-audio generation method and system |
| CN119169434B (en) * | 2024-11-19 | 2025-07-15 | 浙江大华技术股份有限公司 | Diffusion model-based text-to-text graph and diffusion model training method, device and equipment |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070214418A1 (en) * | 2006-03-10 | 2007-09-13 | National Cheng Kung University | Video summarization system and the method thereof |
| CN110866510A (en) * | 2019-11-21 | 2020-03-06 | 山东浪潮人工智能研究院有限公司 | Video description system and method based on key frame detection |
| CN111967302A (en) * | 2020-06-30 | 2020-11-20 | 北京百度网讯科技有限公司 | Video tag generation method and device and electronic equipment |
| CN115757867A (en) * | 2022-12-06 | 2023-03-07 | 天翼数字生活科技有限公司 | Video information abstract generation method, device, storage medium and computer equipment |
| CN116682046A (en) * | 2023-06-16 | 2023-09-01 | 平安科技(深圳)有限公司 | Video summary generation method and device, electronic device and storage medium |
| CN118467778A (en) * | 2024-07-10 | 2024-08-09 | 天翼视联科技有限公司 | Video information abstract generation method, device, electronic device and storage medium |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN100559376C (en) * | 2008-06-30 | 2009-11-11 | 腾讯科技(深圳)有限公司 | Method, system and device for generating video summary |
| US9628673B2 (en) * | 2010-04-28 | 2017-04-18 | Microsoft Technology Licensing, Llc | Near-lossless video summarization |
| CN103646094B (en) * | 2013-12-18 | 2017-05-31 | 上海紫竹数字创意港有限公司 | Realize that audiovisual class product content summary automatically extracts the system and method for generation |
| US10459975B1 (en) * | 2016-12-20 | 2019-10-29 | Shutterstock, Inc. | Method and system for creating an automatic video summary |
| US11538248B2 (en) * | 2020-10-27 | 2022-12-27 | International Business Machines Corporation | Summarizing videos via side information |
| CN114245232B (en) * | 2021-12-14 | 2023-10-31 | 推想医疗科技股份有限公司 | Video abstract generation method and device, storage medium and electronic equipment |
| CN114780791A (en) * | 2022-04-18 | 2022-07-22 | 深圳大学 | A method, device, device and storage medium for generating a video abstract |
| CN114880521B (en) * | 2022-05-31 | 2025-04-04 | 井冈山大学 | Video description method and medium based on autonomous optimization alignment of vision and language semantics |
| CN117150076B (en) * | 2023-08-30 | 2025-07-04 | 北京理工大学 | A self-supervised video summarization method |
Application events (2024):
- 2024-07-10: CN application CN202410920291.XA (CN118467778B), status Active
- 2024-11-29: PCT application PCT/CN2024/135756 (WO2026011669A1), status Pending
Also Published As
| Publication number | Publication date |
|---|---|
| CN118467778B (en) | 2024-10-18 |
| CN118467778A (en) | 2024-08-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110472090B (en) | Image retrieval method based on semantic tags, related device and storage medium | |
| WO2026011669A1 (en) | Video information summary generation method and apparatus, and electronic apparatus and storage medium | |
| WO2022068196A1 (en) | Cross-modal data processing method and device, storage medium, and electronic device | |
| WO2023179429A1 (en) | Video data processing method and apparatus, electronic device, and storage medium | |
| CN110309331A (en) | A kind of cross-module state depth Hash search method based on self-supervisory | |
| CN112820071A (en) | Behavior identification method and device | |
| CN114550053A (en) | A method, device, computer equipment and storage medium for determining responsibility for a traffic accident | |
| CN117336525A (en) | Video processing method, device, computer equipment and storage medium | |
| CN112364204A (en) | Video searching method and device, computer equipment and storage medium | |
| CN113239159A (en) | Cross-modal retrieval method of videos and texts based on relational inference network | |
| CN118861211B (en) | Multi-mode data retrieval method and device based on measurement index | |
| Abdul-Rashid et al. | Shrec’18 track: 2d image-based 3d scene retrieval | |
| CN106649440A (en) | Approximate repeated video retrieval method incorporating global R features | |
| CN112199531A (en) | Cross-modal retrieval method and device based on Hash algorithm and neighborhood map | |
| CN115359492A (en) | Text image matching model training method, picture labeling method, device and equipment | |
| CN114092746A (en) | Multi-attribute identification method and device, storage medium and electronic equipment | |
| CN114969422B (en) | Asymmetric image retrieval method, system, device and storage medium | |
| CN109255098B (en) | A matrix factorization hashing method based on reconstruction constraints | |
| CN110019763B (en) | Text filtering method, system, equipment and computer readable storage medium | |
| CN116049468A (en) | Feature extraction model training method, image search method, device and equipment | |
| Zhang et al. | Social images tag ranking based on visual words in compressed domain | |
| CN108596068B (en) | A method and device for motion recognition | |
| CN116720517B (en) | Search word component recognition model construction method and search word component recognition method | |
| CN114333065A (en) | Behavior recognition method, system and related device applied to surveillance video | |
| CN114329065A (en) | Processing method of video label prediction model, video label prediction method and device |