
CN115665508A - Method, device, electronic device and storage medium for video abstract generation - Google Patents

Method, device, electronic device and storage medium for video abstract generation

Info

Publication number
CN115665508A
CN115665508A
Authority
CN
China
Prior art keywords
video
segment
abstract
attention information
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211364555.5A
Other languages
Chinese (zh)
Other versions
CN115665508B (en)
Inventor
王建国
李鹏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211364555.5A priority Critical patent/CN115665508B/en
Publication of CN115665508A publication Critical patent/CN115665508A/en
Application granted granted Critical
Publication of CN115665508B publication Critical patent/CN115665508B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Television Signal Processing For Recording (AREA)

Abstract



The present application provides a method, apparatus, electronic device, and storage medium for video summary generation. The method may include the following steps: determining features of video segments, the video segments being obtained by splitting a video; determining features of the video image frames in each video segment, each segment containing multiple frames; determining features of attention information, the attention information characterizing attention to the video along different dimensions; and generating a video summary of the video using the features of the video segments, the features of the video image frames, and the features of the attention information. According to embodiments of the present application, video summaries are generated adaptively for content matching different user preferences by means of the attention information, so that a single set of models can meet the personalized needs of different users.


Description

Method, apparatus, electronic device and storage medium for video summary generation

Technical Field

The present application relates to the technical field of video processing, and in particular to a method and apparatus for generating a video summary, an electronic device, and a storage medium.

Background Art

As a way of condensing a video, video summarization is often used in scenarios such as video preview, video editing, or video clip search. Among existing video summary generation approaches, one targets specific categories of objects in a specific scene, for example generating summaries about people or vehicles in traffic scenes; the other targets general scenes, collecting and annotating large amounts of data from various scenes to learn scene-appropriate summaries. Existing approaches therefore suffer from drawbacks such as being limited to a single scene or requiring large amounts of annotated data. Moreover, they produce the same summary for every user, which gives users little sense of novelty and a poor experience.

Summary of the Invention

Embodiments of the present application provide a method, apparatus, electronic device, and storage medium for video summary generation, so that video summaries can be generated from different angles according to attention information.

In a first aspect, an embodiment of the present application provides a method for generating a video summary, which may include the following steps:

determining features of video segments, the video segments being obtained by splitting a video;

determining features of the video image frames in each video segment, each video segment containing multiple video image frames;

determining features of attention information, the attention information being used to characterize attention to the video along different dimensions;

generating a video summary of the video using the features of the video segments, the features of the video image frames in the segments, and the features of the attention information.

In a second aspect, an embodiment of the present application provides a method for generating a video summary, which may include the following steps:

sending received attention information about a video to a video summary generation end;

receiving, from the video summary generation end, a video summary of the video generated in response to the attention information, the video summary being generated by the generation end using the features of video segments, the features of the video image frames in the segments, and the features of the attention information, the video segments being obtained by splitting the video;

displaying the video summary of the video in a video preview window.

In a third aspect, an embodiment of the present application provides an apparatus for generating a video summary, which may include:

a video feature determination module, configured to determine features of video segments, the video segments being obtained by splitting a video;

an image feature determination module, configured to determine features of the video image frames in each video segment, each video segment containing multiple video image frames;

an attention information determination module, configured to determine features of attention information, the attention information being used to characterize attention to the video along different dimensions;

a video summary generation module, configured to generate a video summary of the video using the features of the video segments, the features of the video image frames in the segments, and the features of the attention information.

In a fourth aspect, an embodiment of the present application provides an apparatus for generating a video summary, which may include:

an attention information sending module, configured to send received attention information about a video to a video summary generation end;

a video summary acquisition module, configured to receive, from the video summary generation end, a video summary of the video generated in response to the attention information, the video summary being generated by the generation end using the features of video segments, the features of the video image frames in the segments, and the features of the attention information, the video segments being obtained by splitting the video;

a video summary display module, configured to display the video summary of the video in a video preview window.

In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor implements any one of the methods described above when executing the computer program.

In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the methods described above.

Compared with the prior art, the present application has the following advantages:

According to embodiments of the present application, video summaries are generated adaptively for content matching different user preferences by means of the attention information. With a single set of models, different users can obtain summary videos that match their preferences, thereby meeting the personalized needs of different users. Moreover, the summary generation process can draw on information from multiple dimensions, such as the features of the video segments, the features of the video image frames, and the features of the attention information; weighing these dimensions together yields higher-quality summary results.

The above description is only an overview of the technical solution of the present application. In order to understand the technical means of the present application more clearly so that it can be implemented according to the contents of this specification, and in order to make the above and other objects, features, and advantages of the present application more apparent and comprehensible, specific embodiments of the present application are set forth below.

Brief Description of the Drawings

In the drawings, unless otherwise specified, the same reference numerals designate the same or similar parts or elements throughout the several drawings. The drawings are not necessarily drawn to scale. It should be understood that these drawings depict only some embodiments of the present application and should not be regarded as limiting its scope.

Fig. 1 is a schematic diagram of an application scenario of the video summary generation method provided by the present application;

Fig. 2 is the first flowchart of a video summary generation method according to an embodiment of the present application;

Fig. 3 is the first flowchart of a way of determining video segments according to an embodiment of the present application;

Fig. 4 is the second flowchart of a way of determining video segments according to an embodiment of the present application;

Fig. 5 is the first flowchart of a specific process of generating a video summary according to an embodiment of the present application;

Fig. 6 is the second flowchart of a specific process of generating a video summary according to an embodiment of the present application;

Fig. 7 is the third flowchart of a specific process of generating a video summary according to an embodiment of the present application;

Fig. 8 is the first flowchart of a way of generating attention information according to an embodiment of the present application;

Fig. 9 is the second flowchart of a way of generating attention information according to an embodiment of the present application;

Fig. 10 is the second flowchart of a video summary generation method according to an embodiment of the present application;

Fig. 11 is the third flowchart of a video summary generation method according to an embodiment of the present application;

Fig. 12 is the fourth flowchart of a video summary generation method according to an embodiment of the present application;

Fig. 13 is the first structural block diagram of a video summary generation apparatus according to an embodiment of the present application;

Fig. 14 is the second structural block diagram of a video summary generation apparatus according to an embodiment of the present application; and

Fig. 15 is a block diagram of an electronic device used to implement an embodiment of the present application.

Detailed Description

In the following, only some exemplary embodiments are briefly described. As those skilled in the art will realize, the described embodiments may be modified in various ways without departing from the spirit or scope of the present application. Accordingly, the drawings and descriptions are to be regarded as illustrative in nature and not restrictive.

To facilitate understanding of the technical solutions of the embodiments of the present application, related technologies are described below. The following related technologies may be combined in any manner with the technical solutions of the embodiments of the present application as optional solutions, and all such combinations fall within the protection scope of the embodiments of the present application.

First, the terms involved in this application are explained.

Chinese multimodal pre-trained model (M6, Multi-Modality to Multi-Modality Multitask Mega-Transformer): this model takes the Transformer as its base model and is obtained by pre-training on multiple tasks. Pre-training gives the model both unimodal and multimodal understanding and generation capabilities. The Chinese multimodal pre-trained model can be applied to a range of downstream applications, for example object description generation, visual question answering, and Chinese poetry generation.

Video feature extraction model (Video Transformer): based on a self-attention mechanism, it extracts features from an input video segment. The output of the video feature extraction model is the feature of the video segment.

Video shot segmentation algorithm (Video Shot-Segmentation): relies on pixel changes, histogram changes, edge changes, and the like between consecutive frames of the video to split the video into segments. Each segment can correspond to one camera-shot switch in the video.

Video scene segmentation algorithm (Video Scene-Segmentation): as a subtask of video understanding, its main goal is to split a long video into several segments using scene content as the segmentation cue. One such video segment consists of multiple consecutive shot segments.

Video caption generation algorithm (Video Captioning): applies deep learning to the fields of computer vision and natural language processing. Specifically, given a video, the video caption generation algorithm outputs text describing that video.

Fig. 1 is a schematic diagram of an exemplary application scenario for implementing the method of the embodiments of the present application. In generating video summaries, different users often have different requirements. For example, for the same video, a first user may prefer the actors' clothing, while a second user prefers the food clips in the video. To handle such differing preferences, the usual approach is to re-annotate training samples and fine-tune model parameters to obtain a new model, so that different models serve the differentiated needs of users. The drawback of re-annotating training samples, however, is the increased cost of human annotation and model training time.

In the present application scenario, by contrast, the target video for which a summary is to be generated can be split into multiple video segments according to specified rules. In addition, the user's preference information can be received. The preference information may be received text, audio, or video, and characterizes the user's preferences for the target video along different dimensions. For example, the clothing and food clips shown in Fig. 1 can represent the preferences of two different users. The multiple video segments of the target video are then matched against the preference information; matching may include feature extraction from the video segments, feature extraction from the preference information, and similarity comparison between the segment features and the preference features. The target video segments related to the user's preference can thus be determined. Based on the determined target segments, after processes such as combining the segments, adjusting image quality, or adjusting segment duration, a video summary matching the user's preference information is obtained. For example, in Fig. 1, a summary related to clothing content can be generated for the first user's preference, and a food-related summary for the second user's preference. This avoids the need to re-annotate training data and fine-tune for different scenarios and different user preferences, greatly reducing the resource consumption of repeated training, and allows a single set of models to adaptively generate video summaries based on preference information.
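The similarity comparison between segment features and preference features described above can be sketched with cosine similarity. This is a minimal sketch, assuming both have already been embedded into a shared feature space by the (unspecified) feature extractors; the vectors below are toy placeholders, not real embeddings.

```python
import numpy as np

def rank_segments(segment_feats, preference_feat):
    """Rank video segments by cosine similarity to a preference embedding.

    segment_feats: (n_segments, d) array of segment feature vectors.
    preference_feat: (d,) feature vector of the preference information.
    Returns (indices sorted from most to least relevant, per-segment scores).
    """
    seg = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    pref = preference_feat / np.linalg.norm(preference_feat)
    scores = seg @ pref                      # cosine similarity per segment
    return np.argsort(-scores), scores

# Toy example: 3 segments in a 4-dimensional feature space.
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0]])
order, scores = rank_segments(feats, np.array([1.0, 0.0, 0.0, 0.0]))
```

Normalizing before the dot product makes the ranking invariant to the magnitude of the feature vectors, so only direction (content) matters.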

An embodiment of the present application provides a method for generating a video summary. Fig. 2 is a flowchart of the video summary generation method according to an embodiment of the present application, which may include:

Step S201: determine the features of the video segments, the video segments being obtained by splitting the video.

The execution subject of this embodiment may be a cloud end or a client. The video segments may be obtained by splitting the target video, for which a summary is to be generated, according to a specified rule. The specified rule may be a time rule; for example, each segment may be one minute or 30 seconds long. Alternatively, the rule may be a shot rule; for example, each switch of the camera shot in the target video may correspond to one segment. The rule may also be a scene rule; for example, each scene may correspond to one segment, where a scene includes at least one shot. When a scene includes multiple shots, the content captured by those shots may be similar, for instance shots taken at the same location, or shots of the same actor.
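The time rule above can be sketched as a fixed-window split over the video's timeline; the 30-second window is just the example value from the text.

```python
def split_by_time(duration_s, segment_s=30.0):
    """Split a video of `duration_s` seconds into (start, end) windows
    of at most `segment_s` seconds each (the "time rule" above)."""
    bounds = []
    t = 0.0
    while t < duration_s:
        bounds.append((t, min(t + segment_s, duration_s)))
        t += segment_s
    return bounds

# A 95-second video with 30-second segments yields four windows,
# the last one shorter than the rest.
windows = split_by_time(95.0, 30.0)
```

The shot rule and scene rule replace these fixed boundaries with boundaries detected from the video content itself, as described in steps S301 and S401 below.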

The features of a video segment can be extracted with a pre-trained video feature extraction model. Feeding the segment into the pre-trained model yields the features corresponding to that segment, which can be represented in the form of a vector.

Step S202: determine the features of the video image frames in each video segment, each segment containing multiple video image frames.

Each video segment may include multiple video image frames, and the number of frames is proportional to the duration of the segment. The features of a video image frame may include pixel-dimension features and text-dimension features.

Pixel-dimension features can be determined with the pre-trained Chinese multimodal model, which contains an image encoder and a text encoder. The image encoder determines the features of a video image frame directly from the pixels. Text-dimension features can be determined with the video caption generation algorithm together with the Chinese multimodal model: the captioning algorithm produces a text description of the specified video image frames in the segment, and the text encoder derives text-dimension features from that description.

The specified video image frames may be all frames of the segment, or frames sampled at a fixed time or frame interval. For example, one frame may be taken every 0.1 seconds; or only odd-numbered or even-numbered frames may be taken; or one frame may be taken every 5 frames, and so on.
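The sampling schemes just mentioned (every fixed time interval, or every N frames) can both be sketched as striding over frame indices. The frame rate here is an assumed parameter, not one given in the text.

```python
def sample_frames(n_frames, every_n=None, every_s=None, fps=25.0):
    """Return indices of the frames to keep from a segment of `n_frames`
    frames, sampled either every `every_n` frames or every `every_s`
    seconds (converted to a frame stride via the assumed `fps`)."""
    if every_n is None:
        every_n = max(1, int(every_s * fps))
    return list(range(0, n_frames, every_n))
```

For example, `sample_frames(50, every_n=5)` keeps every fifth frame, matching the "one frame every 5 frames" scheme above.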

Step S203: determine the features of the attention information, the attention information characterizing attention to the video along different dimensions.

The attention information may be input by different users, or it may be preference information obtained by collecting data about different users; it characterizes attention to the video along different dimensions. Take a football match between team A and team B as an example: user X is a fan of team A, and user Y is a fan of player C on team B. User X's attention information may then be team A's highlights or a collection of team B's mistakes, while user Y's attention information may be player C's highlights. Take a restaurant review video as another example: user X's attention information may be the introduction of the dishes, for instance the restaurant's signature grilled meat and how the meat is selected and marinated, while user Y's attention information may be the restaurant's address and environment, such as the business district it is located in, whether it is a popular online store, or whether it appears on a well-known ranking list.

The features of the attention information can be determined with the pre-trained Chinese multimodal model. If the received preference information is speech data, it can first be converted into text, and the text encoder included in the Chinese multimodal model then determines the features of the attention information. If the received preference information is already text, the text encoder can be used directly.

Step S204: generate a video summary of the video using the features of the video segments, the features of the video image frames in the segments, and the features of the attention information.

Starting from two dimensions, each video image frame and the video segment it belongs to, one or more frames highly correlated with the attention information are determined. "Highly correlated" may mean frames whose correlation is not below a corresponding threshold. A pre-trained scoring model can be used, whose inputs are the features of the video segments, the features of the video image frames in the segments, and the features of the attention information, and whose output is a score for each video image frame.

If multiple frames are selected according to the scores, it can further be determined whether they all belong to the same segment. If so, that single segment can be used to generate the summary. If they belong to multiple segments, those segments can be used to generate the summary. For example, suppose 3 frames belong to a first segment and 7 frames belong to a second segment. Since the first segment contributes relatively few frames, the summary may be generated from the second segment only. Alternatively, the first and second segments may simply be spliced together to form the summary. Or the summary may be generated from partial, contiguous portions of the two segments around the selected frames; for example, the portion of the first segment may consist of each of its 3 selected frames together with several adjacent frames before and after it.
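The selection logic above (threshold the per-frame scores, then keep segments that contribute enough selected frames) can be sketched as follows. The scoring model itself is not specified in the text, so the scores, threshold, and minimum frame count here are illustrative placeholders.

```python
from collections import Counter

def pick_segments(frame_scores, frame_to_segment, threshold=0.5, min_frames=5):
    """Pick summary segments from per-frame relevance scores.

    frame_scores: relevance score per frame (from an assumed scoring model).
    frame_to_segment: id of the segment that owns each frame.
    Keeps frames scoring at least `threshold`, counts them per segment,
    and returns segments with at least `min_frames` kept frames.
    """
    kept = [seg for score, seg in zip(frame_scores, frame_to_segment)
            if score >= threshold]
    counts = Counter(kept)
    return [seg for seg, n in counts.most_common() if n >= min_frames]

# Mirrors the 3-frame vs 7-frame example above: segment 0 contributes
# 3 high-scoring frames, segment 1 contributes 7.
scores = [0.9, 0.8, 0.7, 0.2, 0.2, 0.2] + [0.9] * 7
segs   = [0, 0, 0, 0, 0, 0] + [1] * 7
```

With the default `min_frames=5`, only the second segment survives; lowering `min_frames` keeps both, ordered by how many frames each contributes.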

Through the above process, video summaries are generated adaptively for content matching different user preferences by means of the attention information, so that a single set of models meets the needs of different users. Moreover, matching the video segments can draw on the features of the segments, the features of the video image frames, and the features of the attention information; associating and matching these multi-dimensional features yields higher-quality summary results.

As shown in Fig. 3, in a possible implementation, the way of determining the video segments involved in step S201 may include:

Step S301: split the video into multiple shot segments according to the video's shot switches.

During video shooting, multiple camera shots may be used, and the final video is obtained by cutting and splicing the footage captured by these shots. For example, in a scene between two actors, a first shot may frame actor A and a second shot may frame actor B, and the video may switch back and forth between the two. Accordingly, the video can be split by treating each shot switch as a segmentation point. Splitting the video into multiple shot segments according to shot switches can be implemented with the video shot segmentation algorithm.
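The histogram cue named in the definition of the shot segmentation algorithm can be sketched as a greyscale-histogram difference between consecutive frames. The bin count and cut threshold are assumed values, and real shot detectors typically combine several cues (pixel, histogram, and edge changes).

```python
import numpy as np

def shot_boundaries(frames, cut_threshold=0.5):
    """Detect shot cuts from grey-level histogram change between
    consecutive frames.

    frames: list of 2-D uint8 arrays (greyscale frames).
    Returns indices i at which a new shot starts.
    """
    cuts = []
    prev = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=16, range=(0, 256))
        hist = hist / hist.sum()             # normalized histogram
        # Half the L1 distance is in [0, 1]; a large jump signals a cut.
        if prev is not None and 0.5 * np.abs(hist - prev).sum() > cut_threshold:
            cuts.append(i)
        prev = hist
    return cuts
```

For instance, three all-dark frames followed by three all-bright frames produce a single boundary at the fourth frame, where the histogram mass jumps from the lowest bin to the highest.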

步骤S302:将镜头片段作为确定出的视频片段。Step S302: Take the shot segment as the determined video segment.

根据镜头切换情况,可以将视频切分成多个镜头片段。基于此,每个镜头片段都可以作为确定出的视频片段。According to the camera switching situation, the video can be divided into multiple camera clips. Based on this, each shot segment can be used as a determined video segment.

如图4所示,在一种可能的实现方式中,步骤S201中所涉及的视频片段的确定方式,还可以包括:As shown in Figure 4, in a possible implementation, the method of determining the video segment involved in step S201 may also include:

步骤S401:根据镜头片段的内容的相似情况,对镜头片段进行合并,得到至少一个内容片段。Step S401: According to the similarity of the contents of the shot clips, merge the shot clips to obtain at least one content clip.

相似的镜头片段可以是在同一个场景下的多个镜头片段。同一个场景可以包括拍摄地相同，拍摄风格相同或近似、或者拍摄内容相同或近似等。以美食探店节目为示例，在进入美食店之前的外景镜头，可以归类为第一场景；在美食店就餐区域拍摄的镜头，可以归类为第二场景；在美食店后厨拍摄的镜头，可以归类为第三场景。以不同题材的影视剧为示例，文戏的镜头可以归类为第一场景、武戏的镜头可以归类为第二场景。每个内容片段可以由至少一个镜头片段组成。Similar shot segments may be multiple shot segments belonging to the same scene. The same scene may mean the same shooting location, the same or a similar shooting style, or the same or similar shooting content. Taking a restaurant-exploration show as an example, the exterior shots before entering the restaurant can be classified as the first scene, shots taken in the dining area as the second scene, and shots taken in the kitchen as the third scene. Taking dramas of different genres as an example, dialogue scenes can be classified as the first scene and action scenes as the second scene. Each content segment may consist of at least one shot segment.

内容片段可以是在确定出镜头片段后，由多个镜头片段进行合并所得到的。另外，内容片段还可以利用视频场景切割算法直接得到。A content segment may be obtained by merging multiple shot segments after the shot segments have been determined. Alternatively, content segments can also be obtained directly with a video scene segmentation algorithm.
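The merging of adjacent shot segments into content segments by content similarity could be sketched as follows; the cosine-similarity measure and the threshold value are assumptions for illustration.

```python
# Sketch of step S401: group consecutive shot segments into content
# segments while their features stay similar.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_shots(shot_features, threshold=0.9):
    """Return content segments as lists of shot indices.

    Each content segment consists of at least one shot segment, matching
    the description in the text.
    """
    if not shot_features:
        return []
    segments = [[0]]
    for i in range(1, len(shot_features)):
        if cosine(shot_features[i - 1], shot_features[i]) >= threshold:
            segments[-1].append(i)   # same scene: extend current segment
        else:
            segments.append([i])     # dissimilar: start a new segment
    return segments

shots = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0]]
print(merge_shots(shots))  # [[0, 1], [2]]
```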

步骤S402:将内容片段作为确定出的视频片段;或,将内容片段和镜头片段同时作为确定出的视频片段。Step S402: Use the content segment as the determined video segment; or, use the content segment and the shot segment as the determined video segment at the same time.

一种情况下,可以将内容片段作为确定出的视频片段。有益效果在于可以有效的降低数据量。另一种情况下,可以将内容片段和镜头片段同时作为确定出的视频片段。有益效果在于可以从更多维度标注视频的特征。In one case, the content segment may be used as the determined video segment. The beneficial effect is that the amount of data can be effectively reduced. In another case, the content clip and the shot clip can be used as the determined video clip at the same time. The beneficial effect is that the features of the video can be marked from more dimensions.

如图5所示,在一种可能的实现方式中,步骤S204所涉及的利用视频片段的特征、视频片段中视频图像帧的特征和关注信息的特征,生成视频的视频摘要,可以包括:As shown in FIG. 5, in a possible implementation manner, the use of the features of the video clip, the features of the video image frame in the video clip and the features of the attention information involved in step S204 to generate a video summary of the video may include:

步骤S501:利用视频片段的特征与关注信息的特征的第一关联程度,以及视频图像帧的特征与关注信息的特征的第二关联程度,在视频图像帧中确定至少一个关键帧。Step S501: Determine at least one key frame in the video image frame by using the first degree of association between the feature of the video segment and the feature of the information of interest, and the second degree of association between the feature of the video image frame and the feature of the information of interest.

关注信息的特征可以作为参考信息。从而基于视频图像帧的特征和视频片段的特征等多维度的特征与关注信息的特征进行匹配,确定出与关注信息的匹配程度较高的关键帧。所谓的与关注信息的特征的匹配程度较高可以是匹配程度不低于对应阈值的视频图像帧。The characteristics of the attention information can be used as reference information. Therefore, based on the multi-dimensional features such as the features of the video image frame and the features of the video clips, the features of the attention information are matched, and a key frame with a high degree of matching with the attention information is determined. The so-called high degree of matching with features of the information of interest may be video image frames whose matching degree is not lower than a corresponding threshold.

例如，可以首先确定视频片段的特征与关注信息的特征之间的第一关联程度。其次，可以确定视频图像帧的特征与关注信息的特征之间的第二关联程度。最终可以基于第一关联程度和第二关联程度，确定每个视频图像帧的评分。可以将评分最高的视频图像帧作为关键帧，也可以按评分由高到低排序，选择指定数量的视频图像帧作为关键帧。For example, the first degree of association between the features of the video segment and the features of the attention information can be determined first, and then the second degree of association between the features of each video image frame and the features of the attention information. Finally, based on the first and second degrees of association, a score can be determined for each video image frame. The frame with the highest score can be taken as the key frame, or the frames can be sorted by score from high to low and a specified number of them selected as key frames.
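A minimal sketch of this scoring step, assuming both association degrees are cosine similarities combined with equal weights; the actual scoring model in the patent is learned, so the weights and similarity measure here are illustrative assumptions.

```python
# Sketch of step S501: score each frame by combining the segment-query
# association (first degree) and the frame-query association (second degree).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def score_frames(segment_feat, frame_feats, query_feat, w_seg=0.5, w_frame=0.5):
    first = cosine(segment_feat, query_feat)                 # first degree
    return [w_seg * first + w_frame * cosine(f, query_feat)  # + second degree
            for f in frame_feats]

def top_k_keyframes(scores, k=1):
    """Indices of the k highest-scoring frames, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

seg = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
scores = score_frames(seg, frames, query)
print(top_k_keyframes(scores, k=2))  # [0, 2]
```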

步骤S502:根据关键帧所对应的视频片段生成视频的视频摘要。Step S502: Generate a video summary of the video according to the video segment corresponding to the key frame.

可以根据视频摘要的时长需求,对关键帧所对应的视频片段进行对应处理后,得到视频摘要。例如,可以根据时长需求,对关键帧所对应的视频片段进行抽帧处理、压缩处理等,得到满足时长需求的视频摘要。The video summary can be obtained after corresponding processing is performed on the video segment corresponding to the key frame according to the duration requirement of the video summary. For example, according to the duration requirement, the video segment corresponding to the key frame may be subjected to frame extraction processing, compression processing, etc., to obtain a video summary meeting the duration requirement.

如图6所示,在一种可能的实现方式中,步骤S502所涉及的根据关键帧所对应的视频片段生成视频的视频摘要,可以包括:As shown in FIG. 6, in a possible implementation manner, generating a video summary of the video according to the video segment corresponding to the key frame involved in step S502 may include:

步骤S601:获取关键帧所对应的视频片段的时长。Step S601: Obtain the duration of the video segment corresponding to the key frame.

可以首先判断每个关键帧所对应的视频片段是否相同。在存在至少两个视频片段的情况下,需要分别获取每个视频片段的时长。It may first be judged whether the video segments corresponding to each key frame are the same. In the case that there are at least two video clips, the duration of each video clip needs to be obtained separately.

步骤S602:根据确定出的关键帧的评分和关键帧所对应的视频片段的时长,对关键帧所对应的视频片段进行筛选,得到筛选结果。Step S602: According to the determined score of the key frame and the duration of the video segment corresponding to the key frame, filter the video segment corresponding to the key frame to obtain a screening result.

在存在至少两个视频片段的情况下,可以同时将每个视频片段中关键帧的评分,以及视频片段的时长作为参考信息,筛选出至少一个视频片段。例如,可以采用背包算法,依据关键帧的评分选择出时长适合的视频片段作为筛选结果。In the case that there are at least two video clips, at least one video clip can be screened out by taking the score of the key frame in each video clip and the duration of the video clip as reference information at the same time. For example, the knapsack algorithm can be used to select a video segment with a suitable duration as the screening result according to the score of the key frame.
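The knapsack-based selection mentioned above can be sketched as a standard 0/1 knapsack over (key-frame score, segment duration) pairs; the integer-second durations and the dynamic-programming formulation are simplifying assumptions for illustration.

```python
# Sketch of step S602: choose the subset of video segments that maximizes
# total key-frame score while the total duration stays within the budget.

def select_segments(scores, durations, budget):
    """Return indices of segments maximizing total score within `budget` (seconds)."""
    n = len(scores)
    # dp[t] = (best score, chosen indices) for total duration <= t
    dp = [(0.0, [])] * (budget + 1)
    for i in range(n):
        new_dp = dp[:]
        for t in range(durations[i], budget + 1):
            cand = dp[t - durations[i]][0] + scores[i]
            if cand > new_dp[t][0]:
                new_dp[t] = (cand, dp[t - durations[i]][1] + [i])
        dp = new_dp
    return max(dp, key=lambda s: s[0])[1]

# Three candidate segments, 60-second summary budget: the two shorter
# segments together beat the single highest-scoring one.
print(select_segments([0.9, 0.8, 0.5], [40, 30, 30], 60))  # [1, 2]
```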

步骤S603:利用筛选结果生成视频的视频摘要。Step S603: Generate a video summary of the video by using the filtering result.

在得到筛选结果后，可以利用直接拼接的方式生成视频摘要。或者，也可以先对筛选结果中的视频片段进行时长调整，利用时长调整后的视频片段生成视频摘要。时长调整可以根据关键帧的评分进行。例如，评分相对高的关键帧所在的视频片段，可以进行相对少的调整；评分相对低的关键帧所在的视频片段，可以进行相对多的调整。After the screening result is obtained, the video summary can be generated by direct splicing. Alternatively, the durations of the video segments in the screening result can be adjusted first, and the summary generated from the adjusted segments. The duration adjustment can be guided by the key-frame scores: segments containing higher-scoring key frames are adjusted relatively little, while segments containing lower-scoring key frames are adjusted relatively more.

如图7所示，在一种可能的实现方式中，在筛选结果所对应的时长超过对应阈值的情况下，步骤S502所涉及的根据关键帧所对应的视频片段生成视频的视频摘要，可以包括：As shown in FIG. 7, in a possible implementation manner, when the duration corresponding to the screening result exceeds the corresponding threshold, generating the video summary of the video according to the video segments corresponding to the key frames in step S502 may include:

步骤S701:基于关键帧所对应的视频片段中的视频图像帧的特征,对关键帧所对应的视频片段中的视频图像帧进行过滤处理。Step S701: Based on the features of the video image frames in the video segment corresponding to the key frame, filter the video image frames in the video segment corresponding to the key frame.

可以首先确定视频摘要的时长需求。如果视频摘要的时长需求为1分钟,筛选结果存在多个视频片段,多个视频片段的总时长超过1分钟,可以对多个视频片段中视频图像帧进行过滤处理。例如,可以基于在先已经确定出的视频图像帧的特征,将多个视频片段中的视频图像帧进行相似性比较。在参与比较的视频图像帧的差异不大于差异阈值的情况下,即可认为参与比较的视频图像帧是相似的。进而根据比较结果进行视频图像帧的过滤。例如,存在连续10帧视频图像帧均相似,则可以计算10帧视频图像帧的特征均值。进而将10帧视频图像帧分别与特征均值进行比较,保留与特征均值差异最大的视频图像帧,或者保留差异超过对应阈值的视频图像帧。另外,也可以直接选择首帧、尾帧和中间帧作为保留下来的视频图像帧。The duration requirement of the video summary may be determined first. If the duration requirement of the video summary is 1 minute, and there are multiple video clips in the filtering result, and the total duration of the multiple video clips exceeds 1 minute, the video image frames in the multiple video clips can be filtered. For example, video image frames in multiple video clips may be compared for similarity based on previously determined features of video image frames. When the difference between the video image frames participating in the comparison is not greater than the difference threshold, it can be considered that the video image frames participating in the comparison are similar. Further, the video image frame is filtered according to the comparison result. For example, if there are 10 consecutive video image frames that are all similar, then the feature mean value of the 10 video image frames may be calculated. Then compare the 10 video image frames with the feature mean value respectively, and keep the video image frame with the largest difference from the feature mean value, or keep the video image frame with the difference exceeding the corresponding threshold. In addition, it is also possible to directly select the first frame, the last frame and the middle frame as the reserved video image frames.
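The rule of keeping, from a run of similar consecutive frames, the frame that differs most from the run's feature mean could be sketched as follows; the run is assumed to be given, and the L1 distance is a simplifying assumption.

```python
# Sketch of step S701: within a run of similar frames, keep only the
# frame farthest from the run's feature mean (the most distinctive one).

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def keep_representative(run):
    """Return the index of the frame farthest from the run's feature mean."""
    dims = len(run[0])
    mean = [sum(f[d] for f in run) / len(run) for d in range(dims)]
    return max(range(len(run)), key=lambda i: l1(run[i], mean))

run = [[0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
print(keep_representative(run))  # 2
```

The alternative mentioned in the text, keeping the first, middle, and last frames, needs no feature comparison at all and trades quality for speed.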

步骤S702:利用过滤处理后的视频图像帧生成视频的视频摘要。Step S702: Using the filtered video image frames to generate a video summary of the video.

由于过滤处理的目的是进行时间压缩,因此过滤后保留下来的视频图像帧的数量是符合要求的。即,将过滤处理后的视频图像帧进行组合,得到的视频摘要的时长可以满足视频摘要的时长需求。Since the purpose of the filtering process is to perform time compression, the number of video image frames retained after filtering meets the requirements. That is, the duration of the video abstract obtained by combining the filtered video image frames can meet the duration requirement of the video abstract.

如图8所示,在一种可能的实现方式中,关注信息的生成方式,可以包括:As shown in FIG. 8, in a possible implementation manner, the generating manner of paying attention to information may include:

步骤S801:从获取到的视频摘要生成指令中确定指令内容和参照内容;指令内容是利用语音数据、动作数据中的至少一种生成的;参照内容包括文字、视频或图像中的至少一种。Step S801: Determine the instruction content and reference content from the obtained video summary generation instruction; the instruction content is generated by using at least one of voice data and action data; the reference content includes at least one of text, video or image.

获取到的视频摘要生成指令可以是用户输入的信息，或者可以是执行主体采集到的信息。示例性地，获取或采集到的信息可以是：用户手持一张（文字）图像、或者手持正在播放视频的视频播放设备等，用户指着图像中的某个局部内容说"生成与这个人物状态类似的视频片段"，用户指向或者圈选图像中某段文字说"我需要这样的内容"，指向某个画面说"我想要一个这样的视频片段"等。在上述场景下，采集到的用户手指的动作、圈选的动作以及用户的语音，都可以作为指令内容，而用户手持的图像、文字或者正在播放的视频，可以作为参照内容。The acquired video summary generation instruction may be information input by the user or information collected by the execution subject. For example, the user may hold a (text) image, or a playback device that is playing a video, point at some local part of the image and say "generate a video clip similar to this character's state", point at or circle a passage of text and say "I need content like this", or point at a frame and say "I want a video clip like this". In these scenarios, the captured pointing gestures, circling gestures, and the user's voice can all serve as the instruction content, while the image, text, or playing video the user is holding can serve as the reference content.

基于此,可以对获取到的视频摘要生成指令进行解析。对于用户的语音数据和动作数据,可以直接确定为指令内容。对于用户手持的物品、指向的内容、圈选的内容等,可以确定为参照内容。上述解析可以利用语义模型确定,通过对采集的内容进行语义识别,以确定出指令内容和参照内容。Based on this, the acquired video summary generation instruction can be parsed. For the user's voice data and action data, it can be directly determined as the instruction content. For the item held by the user, the content pointed to, the content circled, etc., it may be determined as the reference content. The above-mentioned analysis can be determined by using a semantic model, and the instruction content and reference content can be determined by performing semantic recognition on the collected content.

在确定出指令内容和参照内容后,还可以基于指令内容对用户进行识别。例如,通过图像或声音等信息对用户进行识别,从而为该用户设置标识。在后续进行视频摘要的展示过程中,可以基于该用户的视频摘要展示指令对该用户进行识别。如果识别结果为已输出过关注信息的用户,则可以直接展示与该用户对应的关注信息生成的视频摘要。After the instruction content and reference content are determined, the user can also be identified based on the instruction content. For example, the user is identified by information such as image or sound, so as to set an identification for the user. During the subsequent presentation of the video abstract, the user may be identified based on the user's video abstract display instruction. If the recognition result is a user who has output attention information, the video summary generated by the attention information corresponding to the user may be directly displayed.

步骤S802:利用指令内容,在参照内容中确定关注信息。Step S802: Use the instruction content to determine the attention information in the reference content.

可以对指令内容进行识别,以确定指令意图。例如,在指令内容为语音数据的情况下,可以利用语音识别技术确定指令的意图。在指令内容为动作数据的情况下,可以对动作进行识别确定指令的意图。其中,动作可以包括前述示例中指的动作、圈选的动作等。在此情况下,可以将手指指向的位置,或者被圈选选择的位置所对应的内容作为选中的参照内容。例如,选中了图像中的一个人,或者一个建筑,或者圈选了一段文字等。图像中被选中的人、建筑,或者被圈选的文字都可以作为关注信息。进而可以通过图像编码器、文本编码器等进行关注信息的特征确定。The content of the instruction can be identified to determine the intent of the instruction. For example, in the case that the content of the instruction is voice data, voice recognition technology may be used to determine the intent of the instruction. When the content of the instruction is action data, the action can be identified to determine the intent of the instruction. Wherein, the action may include the action of pointing, the action of circling in the foregoing example, and the like. In this case, the position pointed by the finger, or the content corresponding to the position selected by the circle can be used as the selected reference content. For example, a person in the image is selected, or a building is selected, or a piece of text is circled. The selected people, buildings, or circled text in the image can be used as attention information. Furthermore, the features of the attention information can be determined through an image encoder, a text encoder, and the like.

如图9所示,在一种可能的实现方式中,在指令内容包括语音数据、动作数据的情况下,步骤S802所涉及的利用指令内容,在参照内容中确定关注信息,可以包括:As shown in FIG. 9, in a possible implementation manner, when the instruction content includes voice data and action data, the use of the instruction content involved in step S802 to determine the attention information in the reference content may include:

步骤S901:确定出现语音数据的第一时刻以及出现动作数据的第二时刻。Step S901: Determine the first moment when voice data appears and the second moment when motion data appears.

在确定出指令内容的情况下,还可以对出现指令内容的时刻进行记录。通过对指令内容进行解析,可以区分出语音数据和动作数据。解析原理可以采用现有技术实现,不再赘述。例如,用户对着正在播放的视频一边说话一边圈选目标,则可以对应确定出现语音数据的第一时刻以及出现动作数据的第二时刻。在当前实施方式中,语音数据和动作数据可以是表征有指令含义的内容。When the instruction content is determined, the time when the instruction content appears can also be recorded. By analyzing the content of the instruction, the voice data and the motion data can be distinguished. The parsing principle can be realized by adopting the existing technology, and will not be repeated here. For example, if the user circles the target while talking in front of the video being played, the first moment when the voice data appears and the second moment when the action data appears can be determined correspondingly. In the current implementation manner, the voice data and action data may be contents representing instruction meanings.

步骤S902:利用第一时刻和第二时刻,将语音数据和动作数据进行关联,得到关联结果。Step S902: Using the first time point and the second time point, correlate the voice data with the action data, and obtain a correlation result.

可以预先设置时差阈值。如果第一时刻和第二时刻之间的时差不大于时差阈值，可以确定语音数据和动作数据是同时发生的。即，同时发生可以是语音数据和动作数据在时间维度存在重叠情况。基于此，可以将同时发生的语音数据和动作数据进行关联，得到关联结果。关联结果的表现形式可以是：{起始时刻t1、结束时刻t2、语音数据、动作数据}；或者，关联结果的表现形式可以是：{起始时刻t1、结束时刻t2、语音数据}、{起始时刻t1、结束时刻t3、动作数据}。在当前示例中，t1时刻至t2时刻的第一时段与t1时刻至t3时刻的第二时段存在时间重合。不难理解，对于不存在时间重合的指令内容，可以基于时序依次记录。A time-difference threshold can be set in advance. If the difference between the first moment and the second moment is not greater than this threshold, the voice data and the action data can be regarded as occurring simultaneously, i.e., overlapping in the time dimension. On this basis, the simultaneous voice data and action data can be associated to obtain an association result. The association result may take the form {start time t1, end time t2, voice data, action data}, or the two forms {start time t1, end time t2, voice data} and {start time t1, end time t3, action data}. In the current example, the first period from t1 to t2 overlaps with the second period from t1 to t3. For instruction content that does not overlap in time, the records can simply be kept in chronological order.
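The association of voice and action data by their occurrence times could be sketched as follows; the dictionary layout loosely follows the {start time, end time, data} form in the text, and the field names and threshold value are assumptions.

```python
# Sketch of steps S901-S902: a voice event and an action event are
# associated when the gap between their start times is within a threshold.

def associate(voice_events, action_events, max_gap=1.0):
    """Pair voice events with action events that start within `max_gap` seconds.

    Events are dicts like {"start": t1, "end": t2, "data": ...}; events
    with no partner would be recorded separately in chronological order
    (not shown here).
    """
    results = []
    for v in voice_events:
        for a in action_events:
            if abs(v["start"] - a["start"]) <= max_gap:
                results.append({"start": min(v["start"], a["start"]),
                                "end": max(v["end"], a["end"]),
                                "voice": v["data"], "action": a["data"]})
    return results

voice = [{"start": 3.0, "end": 5.0, "data": "I want a clip like this"}]
action = [{"start": 3.2, "end": 6.0, "data": "circle-select"}]
print(associate(voice, action))
```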

步骤S903:利用关联结果,确定参照内容的范围;范围包括视频时长、图像页数和图像有效内容中的至少一种。Step S903: Using the association result, determine the scope of the reference content; the scope includes at least one of the duration of the video, the number of pages of the image, and the effective content of the image.

如前述示例中，关联结果的起始时刻为t1，则可以记录在t1时刻的参照内容。在参照内容为视频的情况下，可以记录t1时刻视频的画面，从而可以将t1时刻作为视频的起始时刻。为了保证画面的连续，也可以将t1时刻前1秒或者前2秒等作为视频的起始时刻。As in the foregoing example, if the start time of the association result is t1, the reference content at time t1 can be recorded. When the reference content is a video, the video frame at time t1 can be recorded, so that t1 serves as the start time of the video. To keep the footage continuous, 1 or 2 seconds before t1 may instead be used as the start time.

在t1时刻至t2时刻的第一时段内，或者在t1时刻至t3时刻的第二时段内，可以对指令内容进行解析。例如，语音内容可以是"我需要像这部分一样的构图"或"我想要这名球员的精彩表现"等；动作内容包括在播放的视频中进行圈选、指等动作。基于此，可以利用语音数据和动作数据的时间标签，对参照内容在对应的时间段所展示的内容加载相同的时间标签。对应的时间段所展示的内容可以是对应时长的视频片段、单张或多张图像等。或者，对应的时间段所展示的内容还可以是在该时间段内，语音内容所指向的内容和/或动作内容所指向的内容。以语音内容为"我需要像这部分一样的构图"、"我想要这名球员的精彩表现"进行说明，则视频或图像的构图方式为有效内容，图像中的球员为有效内容。During the first period from t1 to t2, or the second period from t1 to t3, the instruction content can be parsed. For example, the voice content may be "I need a composition like this part" or "I want this player's highlights", and the action content may include circling or pointing in the playing video. On this basis, the time tags of the voice data and action data can be used to attach the same time tags to whatever the reference content displays during the corresponding period. The content displayed in that period may be a video clip of the corresponding length, one or more images, and so on; alternatively, it may be the content that the voice and/or the action points to within that period. Taking the voice content "I need a composition like this part" and "I want this player's highlights" as examples, the composition of the video or image is the effective content in the first case, and the player in the image is the effective content in the second.

步骤S904:利用参照内容的范围,在参照内容中确定关注信息。Step S904: Use the scope of the reference content to determine the attention information in the reference content.

在当前实施方式中,关注信息的表现形式可以是文字。在确定出参照内容的范围后,对于参照内容为视频或图像,可以利用视频描述生成算法得到视频或图像的文字表达。对于参照内容为文本,可以直接利用文本表达。In the current implementation manner, the expression form of the concerned information may be text. After determining the scope of the reference content, if the reference content is a video or an image, the textual expression of the video or image can be obtained by using a video description generation algorithm. If the reference content is text, the text expression can be used directly.

如图10所示,在一种可能的实现方式中,在执行主体为客户端的情况下,还可以包括:As shown in Figure 10, in a possible implementation, when the execution subject is the client, it may also include:

步骤S1001:将视频摘要与视频进行关联。Step S1001: Associating video summaries with videos.

根据不同的关注信息,可以得到与关注信息对应的视频摘要。如前,关注信息可以是服装搭配和美食片段。对此,同一个目标视频,可以得到与服装搭配相关的第一视频摘要,以及得到与美食相关的第二视频摘要。对于两个视频摘要,可以均与目标视频进行关联,以表征于两个视频摘要都是由目标视频得来的。According to different attention information, a video summary corresponding to the attention information can be obtained. As before, the attention information can be clothing collocations and food clips. In this regard, for the same target video, the first video summary related to clothing matching and the second video summary related to food can be obtained. For the two video summaries, both may be associated with the target video, to represent that the two video summaries are obtained from the target video.

步骤S1002:在接收到视频摘要展示指令的情况下,视频的视频预览窗口展示视频摘要。Step S1002: In the case of receiving the video abstract display instruction, the video preview window of the video displays the video abstract.

视频摘要展示指令可以是语音形式、动作形式等。例如,语音形式可以是“播放某某视频的视频摘要”。动作形式可以是通过动作选择的待播放摘要的视频。进一步的,还可以对发出视频摘要展示指令的用户进行识别。例如,可以通过图像识别或声音识别等方式确定发出视频摘要展示指令的用户。进而可以判断是否已存在与该用户匹配的关注信息。若存在,则可以在视频的视频预览窗口,展示与关注信息对应的视频摘要。反之,若不存在,则可以选择热度较高的视频摘要进行展示。The video summary display instruction may be in the form of speech or action. For example, the voice form could be "Play the video summary of XX video". The form of the action may be the video of the summary to be played selected through the action. Further, it is also possible to identify the user who issued the video summary display instruction. For example, the user who issued the video summary display instruction may be determined by means of image recognition or voice recognition. Furthermore, it can be judged whether there is already concerned information matching the user. If it exists, a video abstract corresponding to the attention information may be displayed in a video preview window of the video. On the contrary, if it does not exist, you can select a more popular video abstract for display.

上述视频摘要生成的方法可以是用户终端本地的一个应用程序（APP），或者是一个APP中的一个功能模块，也可以是云端提供的一种服务：用户调用该服务对应的调用接口，将关注信息上传至云端，并接收云端反馈的结果，例如视频的视频摘要。另外，云端可以从视频数据库中获取到与视频相关的内容。The above method for generating a video summary may be a local application (APP) on the user terminal, a functional module within an APP, or a service provided by the cloud: the user invokes the call interface corresponding to the service, uploads the attention information to the cloud, and receives the result fed back by the cloud, such as the video summary of the video. In addition, the cloud can obtain video-related content from a video database.

示例性地，在云端可以部署有若干分布式计算节点，每个计算节点中都具有计算、存储等处理资源。在云端，可以组织由多个计算节点来提供视频摘要生成的方法中的某一个或某几个服务。示例性地，服务可以包括：对视频进行切分得到视频片段；确定视频片段的特征；确定视频片段中视频图像帧的特征；确定关注信息的特征；利用视频片段的特征、视频片段中视频图像帧的特征和关注信息的特征，生成视频的视频摘要等。当然，一个计算节点也可以提供一种或多种服务。云端提供该服务的方式可以是对外提供服务接口，用户调用该服务接口以使用相应的服务。Exemplarily, several distributed computing nodes may be deployed in the cloud, each with processing resources such as computing and storage. In the cloud, multiple computing nodes may be organized to provide one or several of the services of the video summary generation method. Exemplarily, the services may include: splitting the video into video segments; determining the features of the segments; determining the features of the video image frames in the segments; determining the features of the attention information; and generating the video summary using the features of the segments, the frames, and the attention information. Of course, a single computing node may also provide one or more services. The cloud may provide the services through an externally exposed service interface, which the user invokes to use the corresponding service.

针对本发明实施例提供的方案,云端可以提供有信息识别服务的服务接口,称为目标服务接口。当用户需要查看视频摘要的时候,通过用户设备调用该目标服务接口,以向云端触发调用该目标服务接口的请求,在该请求中携带有关注信息。云端确定响应该请求的计算节点,利用该计算节点中的处理资源执行本申请实施例所提供的各步骤。For the solutions provided by the embodiments of the present invention, the cloud may provide a service interface for information identification services, which is called a target service interface. When the user needs to view the video summary, the user device invokes the target service interface to trigger a request to the cloud for invoking the target service interface, and the request carries attention information. The cloud determines the computing node that responds to the request, and uses the processing resources in the computing node to execute the steps provided in the embodiments of the present application.

本申请实施例提供了一种视频摘要生成的方法,如图11所示为本申请一实施例的视频摘要生成的方法的流程图,可以包括:An embodiment of the present application provides a method for generating a video abstract, as shown in FIG. 11 , which is a flowchart of a method for generating a video abstract in an embodiment of the present application, which may include:

步骤S1101:将接收到的对视频的关注信息发送给视频摘要的生成端。Step S1101: Send the received attention information on the video to the generator of the video summary.

本申请实施例的执行主体可以是客户端。关注信息可以是用户发出的对视频的偏好信息。发出关注信息的场景可以包括视频预览场景、视频剪辑场景、视频片段搜索场景等。偏好信息的形式可以是文字信息、声音信息或者视频信息等，用户的偏好信息用于表征用户对目标视频在不同维度的偏好情况。示例性地，关注信息可以是：用户手持一张（文字）图像、或者手持正在播放视频的视频播放设备等，用户指着图像中的某个局部内容说"生成与这个人物状态类似的视频片段"，用户指向或者圈选图像中某段文字说"我需要这样的内容"，指向某个画面说"我想要一个这样的视频片段"等。The execution subject of this embodiment may be a client. The attention information may be the user's preference information about a video, and may be issued in scenarios such as video preview, video editing, or video clip search. The preference information may take the form of text, voice, or video, and represents the user's preferences for the target video along different dimensions. For example, the user may hold a (text) image, or a playback device that is playing a video, point at some local part of the image and say "generate a video clip similar to this character's state", point at or circle a passage of text and say "I need content like this", or point at a frame and say "I want a video clip like this".

步骤S1102：接收视频摘要的生成端响应关注信息生成的视频的视频摘要；视频的视频摘要是视频摘要的生成端利用视频片段的特征、视频片段中视频图像帧的特征和关注信息的特征生成的；视频片段是对视频进行切分得到的。Step S1102: Receive the video summary generated by the video-summary generation end in response to the attention information; the video summary is generated by the generation end using the features of the video segments, the features of the video image frames in the segments, and the features of the attention information; the video segments are obtained by splitting the video.

视频摘要的生成端用于基于对视频的关注信息,生成视频的视频摘要。在视频摘要的生成过程中,还可以同时参考视频片段的特征,以及视频片段中视频图像帧的特征。The generating end of the video summary is used to generate a video summary of the video based on attention information on the video. During the generation of the video summary, the features of the video clip and the features of the video image frames in the video clip can also be referred to at the same time.

视频片段可以是利用指定规则，从待生成视频摘要的视频切分得到的。视频片段的特征可以利用预先训练好的视频特征提取模型提取。将视频片段输入预先训练好的视频特征提取模型，可以得到该视频片段对应的特征。该特征可以以向量的形式表示。The video segments can be obtained by splitting the video for which a summary is to be generated, according to specified rules. The features of a video segment can be extracted with a pre-trained video feature extraction model: feeding the segment into the model yields its corresponding feature, which can be represented as a vector.

每个视频片段中可以包括多个视频图像帧。视频图像帧的数量与视频片段的时长成正比。视频图像帧的特征可以包括像素维度的特征和文字维度的特征。Each video segment may include multiple video image frames, the number of which is proportional to the segment's duration. The features of a video image frame may include pixel-dimension features and text-dimension features.

像素维度的特征可以利用预先训练好的中文多模态预训练模型确定。在中文多模态预训练模型中，包含图像编码器和文本编码器。利用图像编码器可以直接从像素维度确定视频图像帧的特征。文字维度的特征可以利用视频描述生成算法和中文多模态预训练模型确定。利用视频描述生成算法可以得到视频片段中指定视频图像帧的文字表述。利用文本编码器可以基于文字表述得到文字维度的特征。The pixel-dimension features can be determined with a pre-trained Chinese multimodal pre-training model, which contains an image encoder and a text encoder. The image encoder determines the features of a video image frame directly from the pixel dimension. The text-dimension features can be determined with a video description generation algorithm together with the multimodal model: the description algorithm produces a textual description of a specified frame in the segment, and the text encoder derives the text-dimension features from that description.

关注信息的特征可以利用预先训练好的中文多模态预训练模型确定。在接收到的关注信息为语音数据的情况下，可以先将语音数据转为文字信息，进而利用中文多模态预训练模型中包含的文本编码器确定关注信息的特征。在接收到的关注信息为文字信息的情况下，可以直接利用中文多模态预训练模型中包含的文本编码器确定关注信息的特征。在接收到的关注信息为动作信息的情况下，可以基于动作识别技术，将动作信息转为文字信息，进而利用中文多模态预训练模型中包含的文本编码器确定关注信息的特征。The features of the attention information can be determined with the pre-trained Chinese multimodal pre-training model. If the received attention information is speech data, the speech is first converted into text, and the text encoder in the multimodal model then determines its features. If it is text, the text encoder is used directly. If it is action information, action recognition technology is used to convert it into text, after which the text encoder determines its features.
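The modality dispatch described above (speech or action converted to text, then encoded by a text encoder) could be sketched as follows; `speech_to_text`, `action_to_text`, and `text_encoder` are hypothetical stand-ins for real ASR, action-recognition, and multimodal-encoder models.

```python
# Sketch of encoding attention information of different modalities: every
# modality is first normalized to text, then passed through a text encoder.

def encode_attention(info, kind, speech_to_text, action_to_text, text_encoder):
    if kind == "speech":
        info = speech_to_text(info)   # ASR: audio -> text
    elif kind == "action":
        info = action_to_text(info)   # action recognition: gesture -> text
    # kind == "text" is encoded directly
    return text_encoder(info)

# Toy stand-ins for the real models, only to show the data flow:
feat = encode_attention("audio-bytes", "speech",
                        speech_to_text=lambda a: "food clips",
                        action_to_text=lambda a: "",
                        text_encoder=lambda t: [float(len(t))])
print(feat)  # [10.0]
```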

采用预先训练的评分模型，该模型的输入为视频片段的特征、视频片段中视频图像帧的特征和关注信息的特征，输出为每个视频图像帧的评分。根据视频图像帧的评分以及视频图像帧所在视频片段的时长，利用背包算法可以确定目标视频片段。即，视频摘要的生成端响应关注信息，生成视频的视频摘要，并将视频摘要发送给客户端。视频摘要的生成端的功能以及具体工作流程可以参见上述方法中的对应描述，并具备相应的有益效果，在此不再赘述。A pre-trained scoring model is used, whose inputs are the features of the video segments, the features of the video image frames in those segments, and the features of the attention information, and whose output is a score for each video image frame. From the frame scores and the durations of the segments the frames belong to, the target video segments can be determined with a knapsack algorithm. That is, the video-summary generation end generates the video summary in response to the attention information and sends it to the client. The functions and detailed workflow of the generation end are described above, with the corresponding beneficial effects, and are not repeated here.

步骤S1103:在视频预览窗口展示视频的视频摘要。Step S1103: displaying the video summary of the video in the video preview window.

客户端在接收到视频摘要的生成端响应关注信息发送过来的视频的视频摘要后,可以在视频预览窗口展示视频的视频摘要。After receiving the video summary sent by the video summary generator in response to the attention information, the client can display the video summary in the video preview window.

通过关注信息实现了对不同用户偏好的内容自适应生成视频摘要，使同一套模型能够满足不同用户的个性化需求。Through the attention information, video summaries are generated adaptively for content matching different users' preferences, so that a single set of models can meet the personalized needs of different users.

本申请实施例提供了一种视频摘要生成的方法,如图12所示为本申请一实施例的视频摘要生成的方法的流程图,可以包括:An embodiment of the present application provides a method for generating a video abstract, as shown in FIG. 12 , which is a flowchart of a method for generating a video abstract in an embodiment of the present application, which may include:

利用图像特征提取模型,确定目标视频中图像帧的像素维度特征F_j^pixel,其中j可以用于表示图像帧的序号。Using the image feature extraction model, determine the pixel-dimension feature F_j^pixel of the image frames in the target video, where j can be used to represent the sequence number of the image frame.

利用视频描述生成算法,确定目标视频中图像帧的文字表示T。Using a video description generation algorithm, determine the textual representation T of the image frame in the target video.

利用文本特征提取模型,得到图像帧的文字表示T的特征F_j^text。像素维度特征F_j^pixel和文字表示T的特征F_j^text可以共同作为图像特征。Using the text feature extraction model, obtain the feature F_j^text of the textual representation T of the image frame. The pixel-dimension feature F_j^pixel and the text feature F_j^text can together serve as the image features.

关注信息可以是用户的偏好信息。将关注信息转换为文本后,利用文本特征提取模型,可以得到关注信息的特征F^att。The attention information may be user preference information. After converting the attention information into text, the feature F^att of the attention information can be obtained using the text feature extraction model.

利用视频镜头切割算法,可以将目标视频切分为多个镜头片段,镜头片段的时长可以表示为T_shot。利用视频特征提取模型,可以得到镜头片段的特征F_i^shot,其中i可以用于表示镜头片段的序号。Using the video shot cutting algorithm, the target video can be divided into multiple shot segments, and the duration of a shot segment can be expressed as T_shot. Using the video feature extraction model, the feature F_i^shot of each shot segment can be obtained, where i can be used to indicate the sequence number of the shot segment.
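镜头切割的一种常见思路是比较相邻帧的直方图差异,差异超过阈值即判定发生镜头切换。下面是一个示意性实现,其中阈值与直方图bin数均为假设值,并非专利所指的具体视频镜头切割算法。A common way to cut shots is to compare histogram differences between adjacent frames and declare a shot change when the difference exceeds a threshold; the sketch below uses assumed threshold and bin values, not the specific video shot cutting algorithm referred to here.

```python
# 示意性片段:基于相邻帧灰度直方图差异的简易镜头切割。
def histogram(frame, bins=8):
    """frame: 灰度像素值列表(0~255),返回归一化直方图。"""
    hist = [0.0] * bins
    for p in frame:
        hist[min(p * bins // 256, bins - 1)] += 1.0
    total = len(frame) or 1
    return [h / total for h in hist]


def cut_shots(frames, threshold=0.5):
    """返回各镜头片段的 (起始帧号, 结束帧号) 列表(闭区间)。"""
    if not frames:
        return []
    boundaries = [0]
    prev = histogram(frames[0])
    for j in range(1, len(frames)):
        cur = histogram(frames[j])
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:  # 直方图差异大,判定发生镜头切换
            boundaries.append(j)
        prev = cur
    boundaries.append(len(frames))
    return [(boundaries[k], boundaries[k + 1] - 1)
            for k in range(len(boundaries) - 1)]
```

每个返回的区间即对应一个镜头片段,其时长可由帧数与帧率换算得到。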

利用视频场景切割算法,可以将目标视频切分为多个场景片段,场景片段中可以包括至少一个镜头片段。场景片段的时长可以表示为T_scene。利用视频特征提取模型,可以得到场景片段的特征F_q^scene,其中q可以用于表示场景片段的序号。Using the video scene cutting algorithm, the target video can be divided into multiple scene segments, and a scene segment can include at least one shot segment. The duration of a scene segment can be expressed as T_scene. Using the video feature extraction model, the feature F_q^scene of each scene segment can be obtained, where q can be used to indicate the sequence number of the scene segment.
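场景片段由内容相似的相邻镜头合并而来,这一步可以用镜头特征之间的余弦相似度示意,其中相似度阈值为假设值,仅演示合并逻辑。Scene segments are formed by merging adjacent shots with similar content; this step can be sketched with cosine similarity over shot features, where the similarity threshold is an assumed value used only to demonstrate the merging logic.

```python
# 示意性片段:相邻镜头特征相似度高于阈值时并入同一场景。
def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0


def merge_into_scenes(shot_feats, threshold=0.9):
    """返回每个场景片段所包含的镜头序号列表。"""
    if not shot_feats:
        return []
    scenes = [[0]]
    for i in range(1, len(shot_feats)):
        if cosine(shot_feats[i - 1], shot_feats[i]) >= threshold:
            scenes[-1].append(i)  # 与上一镜头相似,归入当前场景
        else:
            scenes.append([i])    # 内容变化,开启新场景
    return scenes
```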

将像素维度特征F_j^pixel、文字表示T的特征F_j^text、关注信息的特征F^att、镜头片段的特征F_i^shot、场景片段的特征F_q^scene输入预先训练好的评分模型,可以得到第j帧图像帧的评分S_j。预先训练好的评分模型可以是基于翻译模型(Transformer)结构的神经网络模型。该神经网络模型可以包括输入层、编码层和输出层。输入层可以是各项特征。编码层可以采用翻译模型结构的编码器。翻译模型结构具有自注意力机制(self-attention),自注意力机制可以直接建模任意两个特征之间的相互影响。输出层基于编码层的结果,可以对每个图像帧生成一个分数,即第j帧图像帧的评分S_j。基于分数进行排序,评分较高的一帧图像或者指定数量的多帧图像可以用于视频摘要的生成。Input the pixel-dimension feature F_j^pixel, the text feature F_j^text of the textual representation T, the attention information feature F^att, the shot segment feature F_i^shot and the scene segment feature F_q^scene into the pre-trained scoring model to obtain the score S_j of the j-th image frame. The pre-trained scoring model may be a neural network model based on the Transformer structure, and may include an input layer, an encoding layer and an output layer. The input layer takes the various features. The encoding layer may adopt a Transformer encoder. The Transformer structure has a self-attention mechanism, which can directly model the interaction between any two features. Based on the results of the encoding layer, the output layer generates a score for each image frame, i.e., the score S_j of the j-th image frame. After sorting by score, the frame with the highest score, or a specified number of top-scoring frames, can be used to generate the video summary.
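评分模型内部"帧特征与关注信息特征相互作用"的思想,可以用一个不含可学习参数的点积注意力示意。真实评分由训练好的Transformer编码器给出,此处的打分方式完全是本示例的假设。The idea inside the scoring model, namely letting frame features interact with the attention information feature, can be sketched with parameter-free dot-product attention; the real scores come from a trained Transformer encoder, and the scoring formula here is entirely an assumption of this example.

```python
# 示意性片段:点积注意力打分,叠加镜头/场景与关注信息的相关度。
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score_frames(frame_feats, shot_feat, scene_feat, attn_feat):
    """对每帧:与关注信息特征做点积注意力,并叠加所在镜头/场景
    与关注信息的相关度,得到每帧的评分。"""
    frame_rel = [dot(f, attn_feat) for f in frame_feats]
    shot_rel = dot(shot_feat, attn_feat)
    scene_rel = dot(scene_feat, attn_feat)
    weights = softmax(frame_rel)  # 注意力权重:与关注信息越相关权重越大
    return [w * (r + shot_rel + scene_rel)
            for w, r in zip(weights, frame_rel)]
```

与关注信息相关的帧得到更高的评分,从而在后续排序与片段筛选中被优先保留。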

利用背包算法,基于图像帧的评分S_j和镜头片段的时长T_shot,可以得到适合作为视频摘要的镜头片段。最终,基于视频摘要的时长需求,生成视频摘要。Using the knapsack algorithm, based on the frame scores S_j and the shot segment durations T_shot, the shot segments suitable for the video summary can be obtained. Finally, the video summary is generated based on its duration requirement.
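在时长约束下选择镜头片段,可以建模为0/1背包问题并用动态规划求解。下面是一个示意性实现,其中以整数秒为时长单位、镜头评分取片段内帧评分之和均为本示例的假设。Selecting shot segments under a duration constraint can be modeled as a 0/1 knapsack problem solved with dynamic programming; in the sketch below, integer-second durations and shot scores taken as the sum of in-segment frame scores are assumptions of this example.

```python
# 示意性片段:0/1 背包,在摘要时长上限内使选中镜头的总评分最大。
def select_shots(scores, durations, budget):
    """scores[i] 为第 i 个镜头片段的评分,durations[i] 为其时长
    (整数秒),budget 为摘要总时长上限。返回选中的镜头序号列表。"""
    n = len(scores)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for t in range(budget + 1):
            dp[i][t] = dp[i - 1][t]
            if durations[i - 1] <= t:
                cand = dp[i - 1][t - durations[i - 1]] + scores[i - 1]
                if cand > dp[i][t]:
                    dp[i][t] = cand
    # 回溯得到选中的片段序号
    chosen, t = [], budget
    for i in range(n, 0, -1):
        if dp[i][t] != dp[i - 1][t]:
            chosen.append(i - 1)
            t -= durations[i - 1]
    return sorted(chosen)
```

选出的镜头按原始时间顺序拼接,即得到满足时长需求的视频摘要。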

与本申请实施例提供的方法的应用场景以及方法相对应地,本申请实施例还提供一种视频摘要生成的装置。如图13所示为本申请一实施例的视频摘要生成的装置的结构框图,该视频摘要生成的装置可以包括:Corresponding to the application scenario and method of the method provided in the embodiment of the present application, the embodiment of the present application further provides an apparatus for generating a video summary. As shown in Figure 13, it is a structural block diagram of a device for generating a video summary according to an embodiment of the present application. The device for generating a video summary may include:

视频特征确定模块1301,用于确定视频片段的特征,视频片段是对视频进行切分得到的。The video feature determination module 1301 is configured to determine the feature of the video segment, and the video segment is obtained by segmenting the video.

图像特征确定模块1302,用于确定视频片段中视频图像帧的特征,每个视频片段中视频图像帧的数量为多个。The image feature determining module 1302 is configured to determine features of video image frames in the video segment, and the number of video image frames in each video segment is multiple.

关注信息确定模块1303,用于确定关注信息的特征,关注信息用于表征对视频在不同维度的关注情况。The attention information determination module 1303 is configured to determine the characteristics of the attention information, and the attention information is used to represent the attention to the video in different dimensions.

视频摘要生成模块1304,用于利用视频片段的特征、视频片段中视频图像帧的特征和关注信息的特征,生成视频的视频摘要。The video summary generation module 1304 is configured to generate a video summary of the video by using the features of the video clip, the features of the video image frame in the video clip and the features of the attention information.

在一种可能的实现方式中,视频特征确定模块1301,可以包括:In a possible implementation, the video feature determination module 1301 may include:

切分子模块,用于根据视频的镜头切换情况,将视频切分成多个镜头片段。The cutting sub-module is used to cut the video into multiple shot segments according to the camera switching situation of the video.

视频片段确定子模块,用于将镜头片段作为确定出的视频片段。The video segment determination sub-module is used to use the shot segment as the determined video segment.

在一种可能的实现方式中,视频特征确定模块1301,还可以包括:In a possible implementation, the video feature determination module 1301 may also include:

内容片段确定子模块,用于根据镜头片段的内容的相似情况,对镜头片段进行合并,得到至少一个内容片段;A content segment determining submodule, configured to merge the shot segments according to the similarity of the contents of the shot segments to obtain at least one content segment;

视频片段确定子模块,还用于将内容片段作为确定出的视频片段;或,The video segment determining submodule is also used to use the content segment as the determined video segment; or,

将内容片段和镜头片段同时作为确定出的视频片段。The content segment and the lens segment are simultaneously used as determined video segments.

在一种可能的实现方式中,视频摘要生成模块1304,可以包括:In a possible implementation, the video summary generating module 1304 may include:

关键帧确定子模块,用于利用视频片段的特征与关注信息的特征的第一关联程度,以及视频图像帧的特征与关注信息的特征的第二关联程度,在视频图像帧中确定至少一个关键帧;The key frame determination submodule is configured to determine at least one key frame among the video image frames by using the first degree of association between the features of the video segment and the features of the attention information, and the second degree of association between the features of the video image frames and the features of the attention information;

视频摘要生成执行子模块,用于根据关键帧所对应的视频片段生成视频的视频摘要。The video summary generation execution sub-module is configured to generate a video summary of the video according to the video segment corresponding to the key frame.

在一种可能的实现方式中,视频摘要生成执行子模块,可以包括:In a possible implementation, the execution submodule of video summary generation may include:

时长获取单元,用于获取关键帧所对应的视频片段的时长;The duration acquisition unit is used to acquire the duration of the video segment corresponding to the key frame;

筛选结果确定单元,用于根据确定出的关键帧的评分和关键帧所对应的视频片段的时长,对关键帧所对应的视频片段进行筛选,得到筛选结果;A filtering result determination unit is used to filter the video clips corresponding to the key frames according to the determined scoring of the key frames and the duration of the video clips corresponding to the key frames to obtain a screening result;

视频摘要生成单元,用于利用筛选结果生成视频的视频摘要。The video summary generation unit is used to generate a video summary of the video by using the filtering result.

在一种可能的实现方式中,在关键帧所对应的视频片段的时长超过对应阈值的情况下,视频摘要生成执行子模块,可以包括:In a possible implementation, when the duration of the video segment corresponding to the key frame exceeds the corresponding threshold, the execution submodule of video summary generation may include:

过滤单元,用于基于关键帧所对应的视频片段中的视频图像帧的特征,对关键帧所对应的视频片段中的视频图像帧进行过滤处理;A filter unit, configured to filter the video image frames in the video segment corresponding to the key frame based on the features of the video image frame in the video segment corresponding to the key frame;

视频摘要生成单元,用于利用过滤处理后的视频图像帧生成视频的视频摘要。The video summary generation unit is used to generate a video summary of the video using the filtered video image frame.

在一种可能的实现方式中,关注信息确定模块1303,可以包括:In a possible implementation manner, the concerned information determining module 1303 may include:

内容获取子模块,用于从获取到的视频摘要生成指令中确定指令内容和参照内容;指令内容是利用语音数据、动作数据中的至少一种生成的;参照内容包括文字、视频或图像中的至少一种;The content acquisition sub-module is configured to determine the instruction content and the reference content from the obtained video summary generation instruction; the instruction content is generated by using at least one of voice data and action data; the reference content includes at least one of text, video, or image;

关注信息确定执行子模块,用于利用指令内容,在参照内容中确定关注信息。The attention information determination execution submodule is used to determine the attention information in the reference content by using the instruction content.

在一种可能的实现方式中,在指令内容包括语音数据、动作数据的情况下,关注信息确定执行子模块,可以包括:In a possible implementation, when the instruction content includes voice data and action data, the attention information determination execution submodule may include:

时间确定单元,用于确定出现语音数据的第一时刻以及出现动作数据的第二时刻;A time determination unit is used to determine the first moment when the voice data appears and the second moment when the action data appears;

关联单元,用于利用第一时刻和第二时刻,将语音数据和动作数据进行关联,得到关联结果;An associating unit, configured to use the first moment and the second moment to associate the speech data with the action data to obtain an association result;

范围确定单元,用于利用关联结果,确定参照内容的范围;范围包括视频时长、图像页数和图像有效内容中的至少一种;A range determination unit, configured to determine the range of the reference content by using the association result; the range includes at least one of the duration of the video, the number of pages of the image, and the effective content of the image;

关注信息确定单元,用于利用参照内容的范围,在参照内容中确定关注信息。The attention information determining unit is configured to determine the attention information in the reference content by using the scope of the reference content.
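上述将语音数据与动作数据按出现时刻关联的过程,可以用一个简单的时间窗匹配示意,其中时间窗大小为假设值。The above process of associating voice data with action data by their moments of occurrence can be sketched with a simple time-window match, where the window size is an assumed value.

```python
# 示意性片段:将时间差不超过 window 秒的语音与动作配对。
def associate(voice_events, action_events, window=2.0):
    """voice_events / action_events: (时刻, 内容) 列表。
    返回关联结果列表,每项记录配对的语音、动作及各自时刻。"""
    pairs = []
    for t1, v in voice_events:
        for t2, a in action_events:
            if abs(t1 - t2) <= window:
                pairs.append({"voice": v, "action": a,
                              "time": (t1, t2)})
    return pairs
```

得到关联结果后,即可据此确定参照内容的范围(如视频时长、图像页数等),再在该范围内确定关注信息。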

在一种可能的实现方式中,还包括展示模块,展示模块可以包括:In a possible implementation manner, a display module is also included, and the display module may include:

关联子单元,用于将视频摘要与视频进行关联;an associating subunit for associating the video abstract with the video;

展示执行子单元,用于在接收到视频摘要展示指令的情况下,视频的视频预览窗口展示视频摘要。The display execution subunit is configured to display the video summary in the video preview window of the video when a video summary display instruction is received.

与本申请实施例提供的方法的应用场景以及方法相对应地,本申请实施例还提供一种视频摘要生成的装置。如图14所示为本申请一实施例的视频摘要生成的装置的结构框图,该视频摘要生成的装置可以包括:Corresponding to the application scenario and method of the method provided in the embodiment of the present application, the embodiment of the present application further provides an apparatus for generating a video summary. As shown in FIG. 14, it is a structural block diagram of a device for generating a video summary according to an embodiment of the present application. The device for generating a video summary may include:

关注信息发送模块1401,用于将接收到的对视频的关注信息发送给视频摘要的生成端;The attention information sending module 1401 is configured to send the received attention information about the video to the video summary generation end;

视频摘要获取模块1402,用于接收视频摘要的生成端响应关注信息生成的视频的视频摘要;视频的视频摘要是视频摘要的生成端利用视频片段的特征、视频片段中视频图像帧的特征和关注信息的特征生成的;视频片段是对视频进行切分得到的;The video summary acquisition module 1402 is configured to receive the video summary of the video generated by the video summary generation end in response to the attention information; the video summary of the video is generated by the video summary generation end using the features of the video segments, the features of the video image frames in the video segments, and the features of the attention information; the video segments are obtained by segmenting the video;

视频摘要展示模块1403,用于在视频预览窗口展示视频的视频摘要。The video abstract display module 1403 is configured to display the video abstract of the video in the video preview window.

本申请实施例各装置中的各模块的功能可以参见上述方法中的对应描述,并具备相应的有益效果,在此不再赘述。The functions of each module in each device in the embodiment of the present application can refer to the corresponding description in the above method, and have corresponding beneficial effects, and will not be repeated here.

图15为用来实现本申请实施例的电子设备的框图。如图15所示,该电子设备包括:存储器1510和处理器1520,存储器1510内存储有可在处理器1520上运行的计算机程序。处理器1520执行该计算机程序时实现上述实施例中的方法。存储器1510和处理器1520的数量可以为一个或多个。FIG. 15 is a block diagram of an electronic device used to implement an embodiment of the present application. As shown in FIG. 15 , the electronic device includes: a memory 1510 and a processor 1520 , and the memory 1510 stores computer programs that can run on the processor 1520 . The processor 1520 implements the methods in the foregoing embodiments when executing the computer program. The number of memory 1510 and processor 1520 may be one or more.

该电子设备还包括:This electronic device also includes:

通信接口1530,用于与外界设备进行通信,进行数据交互传输。The communication interface 1530 is used to communicate with external devices for interactive data transmission.

如果存储器1510、处理器1520和通信接口1530独立实现,则存储器1510、处理器1520和通信接口1530可以通过总线相互连接并完成相互间的通信。该总线可以是工业标准体系结构(Industry Standard Architecture,ISA)总线、外部设备互连(Peripheral Component Interconnect,PCI)总线或扩展工业标准体系结构(Extended Industry Standard Architecture,EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图15中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。If the memory 1510, the processor 1520, and the communication interface 1530 are implemented independently, the memory 1510, the processor 1520, and the communication interface 1530 may be connected to each other through a bus to complete mutual communication. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. The bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 15, but it does not mean that there is only one bus or one type of bus.

可选的,在具体实现上,如果存储器1510、处理器1520及通信接口1530集成在一块芯片上,则存储器1510、处理器1520及通信接口1530可以通过内部接口完成相互间的通信。Optionally, in a specific implementation, if the memory 1510, the processor 1520, and the communication interface 1530 are integrated on one chip, the memory 1510, the processor 1520, and the communication interface 1530 may communicate with each other through the internal interface.

本申请实施例提供了一种计算机可读存储介质,其存储有计算机程序,该程序被处理器执行时实现本申请实施例中提供的方法。The embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and implements the method provided in the embodiment of the present application when the program is executed by a processor.

本申请实施例还提供了一种芯片,该芯片包括处理器,用于从存储器中调用并运行存储器中存储的指令,使得安装有芯片的通信设备执行本申请实施例提供的方法。The embodiment of the present application also provides a chip, the chip includes a processor, configured to call and execute instructions stored in the memory from the memory, so that the communication device installed with the chip executes the method provided in the embodiment of the present application.

本申请实施例还提供了一种芯片,包括:输入接口、输出接口、处理器和存储器,输入接口、输出接口、处理器以及存储器之间通过内部连接通路相连,处理器用于执行存储器中的代码,当代码被执行时,处理器用于执行本申请实施例提供的方法。The embodiment of the present application also provides a chip, including: an input interface, an output interface, a processor, and a memory, where the input interface, the output interface, the processor, and the memory are connected through an internal connection path, and the processor is configured to execute the code in the memory; when the code is executed, the processor executes the method provided by the embodiments of the present application.

应理解的是,上述处理器可以是中央处理器(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(FieldProgrammable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者是任何常规的处理器等。值得说明的是,处理器可以是支持进阶精简指令集机器(Advanced RISC Machines,ARM)架构的处理器。It should be understood that the above-mentioned processor may be a central processing unit (Central Processing Unit, CPU), and may also be other general-purpose processors, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), Field Programmable Gate Array (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It should be noted that the processor may be a processor supporting Advanced RISC Machines (ARM) architecture.

进一步地,可选的,上述存储器可以包括只读存储器和随机访问存储器。该存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以包括只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以包括随机访问存储器(RandomAccess Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM均可用。例如,静态随机访问存储器(Static RAM,SRAM)、动态随机访问存储器(Dynamic RandomAccess Memory,DRAM)、同步动态随机访问存储器(SynchronousDRAM,SDRAM)、双倍数据速率同步动态随机访问存储器(Double Data Rate SDRAM,DDRSDRAM)、增强型同步动态随机访问存储器(Enhanced SDRAM,ESDRAM)、同步链接动态随机访问存储器(Sync link DRAM,SLDRAM)和直接内存总线随机访问存储器(Direct RambusRAM,DR RAM)。Further, optionally, the foregoing memory may include a read-only memory and a random access memory. The memory can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. Among them, the non-volatile memory can include read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically programmable Erase Programmable Read-Only Memory (Electrically EPROM, EEPROM) or Flash. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, many forms of RAM are available. For example, Static Random Access Memory (Static RAM, SRAM), Dynamic Random Access Memory (Dynamic Random Access Memory, DRAM), Synchronous Dynamic Random Access Memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (Double Data Rate SDRAM) , DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (Enhanced SDRAM, ESDRAM), Synchronous Link Dynamic Random Access Memory (Sync link DRAM, SLDRAM) and Direct Memory Bus Random Access Memory (Direct RambusRAM, DR RAM).

在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机程序指令时,全部或部分地产生依照本申请的流程或功能。计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输。In the above embodiments, all or part of them may be implemented by software, hardware, firmware or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the present application are produced in whole or in part. A computer can be a general purpose computer, special purpose computer, computer network, or other programmable device. Computer instructions may be stored in, or transmitted from, one computer-readable storage medium to another computer-readable storage medium.

在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包括于本申请的至少一个实施例或示例中。而且,描述的具体特征、结构、材料或者特点可以在任一个或多个实施例或示例中以合适的方式结合。此外,在不相互矛盾的情况下,本领域的技术人员可以将本说明书中描述的不同实施例或示例以及不同实施例或示例的特征进行结合和组合。In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific examples", or "some examples" mean that specific features described in connection with the embodiment or example , structure, material or characteristic is included in at least one embodiment or example of the present application. Furthermore, the described specific features, structures, materials or characteristics may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine and combine different embodiments or examples and features of different embodiments or examples described in this specification without conflicting with each other.

此外,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或隐含地包括至少一个该特征。在本申请的描述中,“多个”的含义是两个或两个以上,除非另有明确具体的限定。In addition, the terms "first" and "second" are used for descriptive purposes only, and cannot be interpreted as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features. Thus, the features defined as "first" and "second" may explicitly or implicitly include at least one of these features. In the description of the present application, "plurality" means two or more, unless otherwise specifically defined.

流程图中描述的或在此以其他方式描述的任何过程或方法可以被理解为,表示包括一个或更多个用于实现特定逻辑功能或过程的步骤的可执行指令的代码的模块、片段或部分。并且本申请的优选实施方式的范围包括另外的实现,其中可以不按所示出或讨论的顺序,包括根据所涉及的功能按基本同时的方式或按相反的顺序,来执行功能。Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment, or code comprising one or more executable instructions for implementing a specific logical function or step of the process part. Also, the scope of preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order as the functions involved are involved.

在流程图中描述的或在此以其他方式描述的逻辑和/或步骤,例如,可以被认为是用于实现逻辑功能的可执行指令的定序列表,可以具体实现在任何计算机可读介质中,以供指令执行系统、装置或设备(如基于计算机的系统、包括处理器的系统或其他可以从指令执行系统、装置或设备取指令并执行指令的系统)使用,或结合这些指令执行系统、装置或设备而使用。The logic and/or steps described in the flowcharts or otherwise described herein, for example, can be considered as a sequenced listing of executable instructions for implementing logical functions, which can be embodied in any computer-readable medium , for use with an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or other system that can fetch instructions from an instruction execution system, apparatus, or device and execute instructions), or in conjunction with such an instruction execution system, device or equipment.

应理解的是,本申请的各部分可以用硬件、软件、固件或它们的组合来实现。在上述实施方式中,多个步骤或方法可以用存储在存储器中且由合适的指令执行系统执行的软件或固件来实现。上述实施例方法的全部或部分步骤是可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,该程序在执行时,包括方法实施例的步骤之一或其组合。It should be understood that each part of the present application may be realized by hardware, software, firmware or a combination thereof. In the embodiments described above, various steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the method in the above embodiments can be completed by instructing related hardware through a program. The program can be stored in a computer-readable storage medium. When the program is executed, it includes one of the steps of the method embodiment or its combination.

此外,在本申请各个实施例中的各功能单元可以集成在一个处理模块中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。上述集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读存储介质中。该存储介质可以是只读存储器,磁盘或光盘等。In addition, each functional unit in each embodiment of the present application may be integrated into one processing module, each unit may exist separately physically, or two or more units may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules. If the above-mentioned integrated modules are implemented in the form of software function modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk or an optical disk, and the like.

以上所述,仅为本申请的示例性实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请记载的技术范围内,可轻易想到其各种变化或替换,这些都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。The above is only an exemplary embodiment of the application, but the scope of protection of the application is not limited thereto, and any skilled person familiar with the technical field can easily think of its various changes within the technical scope of the application Or replacement, all of these should be covered within the scope of protection of this application. Therefore, the protection scope of the present application should be based on the protection scope of the claims.

Claims (14)

1. A method for video summary generation, comprising:
determining the characteristics of a video clip, wherein the video clip is obtained by segmenting a video;
determining the characteristics of video image frames in the video segment, wherein the number of the video image frames in the video segment is multiple;
determining the characteristics of attention information, wherein the attention information is used for representing attention conditions of the video in different dimensions;
and generating a video abstract of the video by using the characteristics of the video clip, the characteristics of the video image frame in the video clip and the characteristics of the attention information.
2. The method of claim 1, wherein determining the video segment comprises:
according to the shot switching condition of the video, cutting the video into a plurality of shot segments;
and taking the shot segment as the determined video segment.
3. The method of claim 2, wherein determining the video segment further comprises:
according to the similarity condition of the contents of the shot segments, merging the shot segments to obtain at least one content segment;
taking the content segment as the determined video segment; or the like, or, alternatively,
and simultaneously taking the content segment and the shot segment as the determined video segment.
4. The method of claim 1, wherein the generating a video summary of the video using the features of the video segments, the features of the video image frames, and the features of the information of interest comprises:
determining at least one key frame in the video image frame by utilizing a first correlation degree of the characteristics of the video segment and the characteristics of the attention information and a second correlation degree of the characteristics of the video image frame and the characteristics of the attention information;
and generating a video abstract of the video according to the video segment corresponding to the key frame.
5. The method according to claim 4, wherein the generating the video summary of the video according to the video segment corresponding to the key frame comprises:
acquiring the duration of a video clip corresponding to the key frame;
screening the video clips corresponding to the key frames according to the determined scores of the key frames and the duration of the video clips corresponding to the key frames to obtain a screening result;
and generating a video abstract of the video by using the screening result.
6. The method according to claim 4 or 5, wherein the generating the video summary of the video according to the video segment corresponding to the key frame when the duration of the video segment corresponding to the key frame exceeds the corresponding threshold value comprises:
based on the characteristics of the video image frames in the video clips corresponding to the key frames, filtering the video image frames in the video clips corresponding to the key frames;
and generating a video abstract of the video by using the video image frame after the filtering processing.
7. The method according to claim 1, wherein the manner of generating the attention information includes:
determining instruction content and reference content from the acquired video abstract generating instruction; the instruction content is generated by at least one of voice data and motion data; the reference content comprises at least one of characters, videos or images;
and determining attention information in the reference content by using the instruction content.
8. The method according to claim 7, wherein in a case where the instruction content includes voice data and motion data, the determining, with the instruction content, attention information in the reference content includes:
determining a first time instant at which the speech data occurs and a second time instant at which the motion data occurs;
associating the voice data with the action data by using the first time and the second time to obtain an association result;
determining a range of the reference content using the correlation result; the range comprises at least one of video duration, image page number and effective image content;
and determining attention information in the reference content by using the range of the reference content.
9. The method of any of claims 1 to 8, further comprising:
associating the video summary with the video;
and under the condition that a video abstract display instruction is received, displaying the video abstract in a video preview window of the video.
10. A method for video summary generation, comprising:
sending the received attention information to the video to a video abstract generating end;
receiving the video abstract of the video generated by the generation end of the video abstract responding to the attention information; the video abstract of the video is generated by a generation end of the video abstract by utilizing the characteristics of a video segment, the characteristics of a video image frame in the video segment and the characteristics of the attention information; the video clip is obtained by segmenting the video;
and displaying the video abstract of the video in a video preview window.
11. An apparatus for video summary generation, comprising:
the video characteristic determining module is used for determining the characteristics of a video clip, wherein the video clip is obtained by segmenting a video;
the image characteristic determining module is used for determining the characteristics of video image frames in the video segments, and the number of the video image frames in each video segment is multiple;
the attention information determining module is used for determining the characteristics of attention information, wherein the attention information is used for representing attention conditions of the video in different dimensions;
and the video abstract generating module is used for generating a video abstract of the video by utilizing the characteristics of the video clip, the characteristics of the video image frame in the video clip and the characteristics of the attention information.
12. An apparatus for video summary generation, comprising:
an attention information sending module, configured to send received attention information of a video to a video summary generation end;
a video summary acquiring module, configured to receive a video summary of the video generated by the generation end in response to the attention information, wherein the video summary is generated by the generation end using features of video segments, features of video image frames in the video segments, and features of the attention information, the video segments being obtained by segmenting the video; and
a video summary display module, configured to display the video summary of the video in a video preview window.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the method of any one of claims 1-10 when executing the computer program.
14. A computer-readable storage medium, having stored therein a computer program which, when executed by a processor, implements the method of any one of claims 1-10.
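The pipeline recited in claims 11 and 12 fuses three feature streams: per-segment features, per-frame features within each segment, and a feature of the user's attention information, then generates the summary from them. A minimal sketch of that idea follows; the mean-pooling fusion and cosine-similarity scoring are illustrative assumptions for exposition, not the patented model:

```python
import numpy as np

def summarize(segment_feats, frame_feats, attention_feat, k=2):
    """Score each video segment by fusing its segment-level feature with the
    mean of its frame-level features, then ranking segments by cosine
    similarity to the attention (user-preference) feature.

    segment_feats:  (S, D) array, one feature vector per video segment
    frame_feats:    list of S arrays, each (F_i, D), frames of segment i
    attention_feat: (D,) array encoding the attention information
    Returns the indices of the top-k segments in temporal order.
    """
    scores = []
    for seg, frames in zip(segment_feats, frame_feats):
        # Fuse segment feature with pooled frame features (illustrative choice)
        fused = 0.5 * seg + 0.5 * frames.mean(axis=0)
        # Relevance to the attention feature via cosine similarity
        sim = fused @ attention_feat / (
            np.linalg.norm(fused) * np.linalg.norm(attention_feat) + 1e-8
        )
        scores.append(sim)
    top = np.argsort(scores)[::-1][:k]           # highest-scoring segments
    return sorted(top.tolist())                  # keep original temporal order
```

With a different attention feature, the same video yields a different ranking, which is how one model can serve different user preferences.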
CN202211364555.5A 2022-11-02 2022-11-02 Method, device, electronic device and storage medium for generating video summary Active CN115665508B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211364555.5A CN115665508B (en) 2022-11-02 2022-11-02 Method, device, electronic device and storage medium for generating video summary

Publications (2)

Publication Number Publication Date
CN115665508A true CN115665508A (en) 2023-01-31
CN115665508B CN115665508B (en) 2025-03-18

Family

ID=84994448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211364555.5A Active CN115665508B (en) 2022-11-02 2022-11-02 Method, device, electronic device and storage medium for generating video summary

Country Status (1)

Country Link
CN (1) CN115665508B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117119143A (en) * 2023-06-07 2023-11-24 青岛尘元科技信息有限公司 Video investigation system, method, equipment and storage medium based on holographic video

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804578A (en) * 2018-05-24 2018-11-13 南京理工大学 Unsupervised video summarization method based on consistent segment generation
US20200285859A1 (en) * 2018-10-19 2020-09-10 Shenzhen Sensetime Technology Co., Ltd. Video summary generation method and apparatus, electronic device, and computer storage medium
CN113727200A (en) * 2021-08-27 2021-11-30 游艺星际(北京)科技有限公司 Video abstract information determination method and device, electronic equipment and storage medium
CN114780791A (en) * 2022-04-18 2022-07-22 深圳大学 A method, apparatus, device and storage medium for generating a video abstract
CN115033736A (en) * 2022-06-07 2022-09-09 浙江大学 A Natural Language Guidance Method for Video Summarization
CN115190357A (en) * 2022-07-05 2022-10-14 三星电子(中国)研发中心 A method and device for generating a video summary



Also Published As

Publication number Publication date
CN115665508B (en) 2025-03-18

Similar Documents

Publication Publication Date Title
CN114342353B (en) Method and system for video segmentation
WO2023011094A1 (en) Video editing method and apparatus, electronic device, and storage medium
CN109218629B (en) Video generation method, storage medium and device
CN113709561B (en) Video editing method, device, equipment and storage medium
KR102148392B1 (en) Video metadata tagging system and method thereof
CN111279709B (en) Providing video recommendations
US20190253474A1 (en) Media production system with location-based feature
JP5510167B2 (en) Video search system and computer program therefor
JP2022523606A (en) Gating model for video analysis
CN113709384A (en) Video editing method based on deep learning, related equipment and storage medium
CN111277910B (en) Bullet screen display method, device, electronic device and storage medium
US20210073272A1 (en) Digital image classification and annotation
WO2018177139A1 (en) Method and apparatus for generating video abstract, server and storage medium
US11057457B2 (en) Television key phrase detection
CN112738557A (en) Video processing method and device
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN109408672B (en) An article generating method, device, server and storage medium
CN113766268B (en) Video processing method and device, electronic equipment and readable medium
CN114845149B (en) Video clip method, video recommendation method, device, equipment and medium
CN114363714B (en) Title generation method, title generation device and storage medium
CN116389849A (en) Video generation method, device, equipment and storage medium
CN103984778B (en) Video retrieval method and system
CN114339360B (en) Video processing method, related device and equipment
CN117953898A (en) Speech recognition method, server and storage medium for video data
CN113992973A (en) Video summary generation method, device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant