CN118400575A - Video processing method and related device - Google Patents
Video processing method and related device
- Publication number
- CN118400575A (application CN202410821581.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- vector
- text
- text information
- clip
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000003672 processing method Methods 0.000 title claims abstract description 17
- 239000013598 vector Substances 0.000 claims abstract description 222
- 238000000034 method Methods 0.000 claims description 31
- 238000004364 calculation method Methods 0.000 claims description 23
- 230000008569 process Effects 0.000 claims description 14
- 238000003062 neural network model Methods 0.000 claims description 10
- 230000015654 memory Effects 0.000 claims description 9
- 238000003860 storage Methods 0.000 claims description 3
- 239000011159 matrix material Substances 0.000 description 9
- 238000004519 manufacturing process Methods 0.000 description 5
- 230000008901 benefit Effects 0.000 description 3
- 238000001514 detection method Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000002457 bidirectional effect Effects 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 238000005520 cutting process Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
The present invention discloses a video processing method and a related device, applied in the field of multimedia. A target video can be divided into a plurality of video clips; for any one of the video clips, a video vector corresponding to the video clip is calculated from a group of video frames in the video clip; pre-established text information is processed according to a video requirement to obtain a corresponding text set; a text vector corresponding to each piece of text information in the text set is calculated, and a video vector matching the text vector is determined, wherein the text set includes a plurality of pieces of text information, each piece of text information corresponds to one text vector, and each text vector matches one video vector; and the corresponding video is obtained by editing the video clips corresponding to the determined video vectors. The present invention can match suitable video clips based on the video vectors and the video requirement and then automatically edit them into the corresponding short video; the matching quality is high, no manual participation is needed, and a large number of high-quality short videos can be produced quickly and efficiently.
Description
Technical Field
The present invention relates to the field of multimedia, and in particular to a video processing method and a related device.
Background Art
In most existing short-video production methods, a user selects material and edits it in existing editing software. This approach depends on the user's experience and proficiency, so short videos are produced inefficiently.
Summary of the Invention
In view of the above problems, the present invention provides a video processing method and a related device that overcome, or at least partially solve, the above problems.
In a first aspect, a video processing method includes:
dividing a target video into a plurality of video clips;
for any one of the video clips, calculating a video vector corresponding to the video clip from a group of video frames in the video clip, wherein one video clip corresponds to one video vector;
processing pre-established text information according to a video requirement to obtain a corresponding text set;
calculating a text vector corresponding to each piece of text information in the text set, and determining a video vector that matches the text vector, wherein the text set includes a plurality of pieces of text information, each piece of text information corresponds to one text vector, and each text vector matches one video vector;
editing the video clips corresponding to the determined video vectors to obtain the corresponding video.
Optionally, in some optional implementations, calculating, for any one of the video clips, the video vector corresponding to the video clip from a group of video frames in the video clip includes:
for any one of the video clips, extracting a group of video frames from the video clip, wherein a group of video frames includes a plurality of video frames;
for any group of video frames, calculating an image vector corresponding to each video frame in the group;
for any group of video frames, calculating the corresponding video vector from the corresponding image vectors.
Optionally, in some optional implementations, calculating, for any group of video frames, the image vector corresponding to each video frame includes:
for any group of video frames, using a pre-established image neural network model to compute each video frame separately, obtaining the image vector corresponding to each video frame, wherein one video frame corresponds to one image vector.
Optionally, in some optional implementations, calculating, for any group of video frames, the corresponding video vector from the corresponding image vectors includes:
for any group of video frames, calculating the average of the corresponding image vectors to obtain the corresponding video vector.
Optionally, in some optional implementations, processing the pre-established text information according to the video requirement to obtain the corresponding text set includes:
performing at least one of splitting and duplication on the pre-established text information according to the video requirement to obtain a text set corresponding to the text information.
Optionally, in some optional implementations, calculating the text vector corresponding to each piece of text information in the text set and determining the video vector matching the text vector includes:
for each piece of text information in the text set, using a pre-established text neural network model to calculate the text vector corresponding to that piece of text information;
for any one of the text vectors, calculating the similarity between each video vector and the text vector, and determining the video vector with the highest similarity as the video vector matching the text vector.
Optionally, in some optional implementations, editing the video clips corresponding to the determined video vectors to obtain the corresponding video includes:
splicing the video clips corresponding to the determined video vectors in order and adding the corresponding audio and subtitles to obtain the corresponding video, wherein the audio and subtitles of any video clip correspond to the corresponding text information.
Optionally, in some optional implementations, dividing the target video into a plurality of video clips includes:
dividing the target video into a plurality of video clips according to storyboard information, wherein the storyboard information is identified in advance, and each piece of storyboard information corresponds to one video clip.
In a second aspect, a video processing device includes a video splitting unit, a video vector calculation unit, a text processing unit, a vector matching unit, and a video editing unit;
the video splitting unit is configured to divide a target video into a plurality of video clips;
the video vector calculation unit is configured to, for any one of the video clips, calculate a video vector corresponding to the video clip from a group of video frames in the video clip, wherein one video clip corresponds to one video vector;
the text processing unit is configured to process pre-established text information according to a video requirement to obtain a corresponding text set;
the vector matching unit is configured to calculate a text vector corresponding to each piece of text information in the text set and determine a video vector matching the text vector, wherein the text set includes a plurality of pieces of text information, each piece of text information corresponds to one text vector, and each text vector matches one video vector;
the video editing unit is configured to edit the video clips corresponding to the determined video vectors to obtain the corresponding video.
In a third aspect, a computer-readable storage medium stores a program which, when executed by a processor, implements any of the video processing methods described above.
In a fourth aspect, an electronic device includes at least one processor, and at least one memory and a bus connected to the processor, wherein the processor and the memory communicate with each other via the bus, and the processor is configured to call program instructions in the memory to execute any of the video processing methods described above.
Through the above technical solution, the video processing method and related device provided by the present invention can divide a target video into a plurality of video clips; for any one of the video clips, calculate a video vector corresponding to the video clip from a group of video frames in the video clip, wherein one video clip corresponds to one video vector; process pre-established text information according to a video requirement to obtain a corresponding text set; calculate a text vector corresponding to each piece of text information in the text set and determine a video vector matching the text vector, wherein the text set includes a plurality of pieces of text information, each piece of text information corresponds to one text vector, and each text vector matches one video vector; and edit the video clips corresponding to the determined video vectors to obtain the corresponding video. It can therefore be seen that the present invention can match and select suitable video clips based on the video vectors and the video requirement and then automatically edit them into the corresponding short video; the matching quality is high, no manual participation is needed, and a large number of high-quality short videos can be produced quickly and efficiently.
The above description is only an overview of the technical solution of the present invention. In order to understand the technical means of the present invention more clearly, it can be implemented according to the contents of the specification; and in order to make the above and other objects, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention are set forth below.
Brief Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be regarded as limiting the present invention. The same reference symbols denote the same components throughout the drawings. In the drawings:
FIG. 1 is a flowchart of a first video processing method provided by the present invention;
FIG. 2 is a flowchart of a second video processing method provided by the present invention;
FIG. 3 is a flowchart of a third video processing method provided by the present invention;
FIG. 4 is a schematic structural diagram of a video processing device provided by the present invention;
FIG. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present invention, it should be understood that the present invention can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present invention will be understood more thoroughly and its scope will be fully conveyed to those skilled in the art.
As shown in FIG. 1, the present invention provides a video processing method including S100, S200, S300, S400, and S500.
S100: divide the target video into a plurality of video clips.
Optionally, the target video of the present invention may be any long video. That is, the present invention may divide a long video into a plurality of video clips, and the video clips are ordered according to the order in which they are played in the original long video; the present invention places no limitation on this.
Optionally, dividing the video into a plurality of video clips makes it easier to later select the required video clips for editing according to the video requirement and obtain a short video; the present invention places no limitation on this.
Optionally, the present invention places no specific limitation on the length of each video clip, which may be set according to actual needs. For example, as shown in FIG. 2, in some optional implementations, S100 includes S110.
S110: divide the target video into a plurality of video clips according to storyboard information.
The storyboard information is identified in advance, and each piece of storyboard information corresponds to one video clip.
Optionally, a long video may be edited together from the videos of multiple shots, and the switching time points between shots serve as the cut points between two shots. For example, if the camera faces actor A in one second and switches to actor B in the next second, these are two shots. The present invention can use the continuity of video frames to identify the storyboard information in advance: if the previous frame and the current frame are not continuous (not similar), a shot switch is assumed. Specifically, the present invention can use an existing video segmenter (for example, PySceneDetect) to obtain the storyboard information of the long video in advance; the present invention places no limitation on this.
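As a non-authoritative illustration of the shot-detection step above, the following sketch uses the open-source PySceneDetect package named in the passage; the input file name and the default content-detector threshold are assumptions, not values specified by the patent.

```python
# Hedged sketch: pre-computing shot (storyboard) boundaries with PySceneDetect.
# "long_video.mp4" is a placeholder file name.
from scenedetect import detect, ContentDetector

# Each (start, end) pair marks one shot; a shot switch is declared where adjacent
# frames differ beyond the detector's content threshold.
scene_list = detect("long_video.mp4", ContentDetector())
for start, end in scene_list:
    print(f"shot from {start.get_timecode()} to {end.get_timecode()}")
```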
Optionally, after identifying the storyboard information, the present invention can divide the target video into a plurality of video clips according to the video frame information in the storyboard information (recording the video frame that separates two adjacent shots) or the time information (recording the moment that separates two adjacent shots); the present invention places no limitation on this.
S200: for any one of the video clips, calculate the video vector corresponding to the video clip from a group of video frames in the video clip.
One video clip corresponds to one video vector.
Optionally, for any video clip, the present invention can extract a group of video frames from it and then calculate the video vector. That is, as shown in FIG. 3, in some optional implementations, S200 includes S210, S220, and S230.
S210: for any one of the video clips, extract a group of video frames from the video clip.
A group of video frames includes a plurality of video frames.
Optionally, the present invention can set the frame-sampling scheme according to actual needs. For example, the present invention can sample one video frame every N seconds, where N can be set according to actual needs; the present invention places no limitation on this.
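A minimal sketch of the frame-sampling scheme just described, assuming OpenCV as the decoder; the file path and the sampling interval N are illustrative placeholders.

```python
# Hedged sketch: sample one frame every N seconds from a clip with OpenCV.
import cv2

def sample_frames(path: str, every_n_seconds: float = 1.0):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if FPS is unreported
    step = max(int(round(fps * every_n_seconds)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                  # BGR frame as a NumPy array
        index += 1
    cap.release()
    return frames
```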
S220: for any group of video frames, calculate the image vector corresponding to each video frame in the group.
For example, in some optional implementations, S220 includes:
for any group of video frames, using a pre-established image neural network model to compute each video frame separately, obtaining the image vector corresponding to each video frame, wherein one video frame corresponds to one image vector.
Optionally, as mentioned above, one video clip corresponds to one group of video frames. To calculate the video vector of the video clip, the present invention can first calculate the image vectors of the corresponding video frames and then calculate the corresponding video vector from those image vectors; the present invention places no limitation on this.
Optionally, the image neural network model of the present invention can process multiple video frames at the same time and obtain the image vectors of those frames. Of course, the image neural network model can also process one video frame at a time to obtain that frame's image vector and handle the frames in turn; the present invention places no limitation on this.
Optionally, the present invention can use a deep convolutional network model (Visual Geometry Group, VGG) to calculate the image vectors of the video frames; the present invention places no limitation on this.
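As a sketch of per-frame feature extraction with the VGG backbone mentioned above, using a pretrained torchvision model as a stand-in; the choice of VGG16 and the ImageNet preprocessing values are assumptions, not requirements of the patent.

```python
# Hedged sketch: one image vector per frame from a pretrained VGG16 backbone.
import torch
from torchvision import models, transforms

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Identity()          # drop the classifier head, keep features
vgg.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_to_vector(frame_bgr):
    rgb = frame_bgr[:, :, ::-1].copy()        # OpenCV frames are BGR; VGG expects RGB
    batch = preprocess(rgb).unsqueeze(0)
    return vgg(batch).squeeze(0).numpy()      # the frame's image vector
```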
S230: for any group of video frames, calculate the corresponding video vector from the corresponding image vectors.
Optionally, as mentioned above, a corresponding image vector is calculated for every sampled video frame, and one video clip corresponds to the multiple video frames of one group. Therefore, for any group of video frames, the present invention can calculate the corresponding video vector from the image vectors of the frames in that group, with one video clip corresponding to one video vector; the present invention places no limitation on this.
For example, in some optional implementations, S230 includes:
for any group of video frames, calculating the average of the corresponding image vectors to obtain the corresponding video vector.
Optionally, as mentioned above, a group of video frames includes a plurality of video frames, and one video frame corresponds to one image vector; that is, a group of video frames corresponds to a plurality of image vectors. The present invention can calculate the average of these image vectors and use it as the video vector of the corresponding video clip; the present invention places no limitation on this.
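The averaging step is a single mean-pooling operation; a minimal sketch follows, with the frame vectors stacked into a NumPy array and the shape shown only as an example.

```python
import numpy as np

def clip_video_vector(frame_vectors: np.ndarray) -> np.ndarray:
    # frame_vectors: one row per sampled frame of a single clip, e.g. shape (T, 512).
    # The clip-level video vector is the element-wise mean of its frame vectors.
    return frame_vectors.mean(axis=0)
```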
S300: process the pre-established text information according to the video requirement to obtain a corresponding text set.
For example, in some optional implementations, S300 includes:
performing at least one of splitting and duplication on the pre-established text information according to the video requirement to obtain a text set corresponding to the text information.
Optionally, the video requirement of the present invention can be characterized by a parameter. For example, a sweet video and a commentary video can be represented by 0 and 1 respectively, and other types of videos, such as empty-shot (scenery) videos and fight videos, can be represented by 2 and 3 respectively, and so on; the present invention places no limitation on this. Different video requirements impose different demands on the content and format of the text information. The present invention can provide the corresponding text information in advance as needed, so that the text information can later be processed correctly according to the video requirement (for example, by splitting or duplication).
For example, if the video clips that make up the short video all share the same attribute, such as sweet scenes of the male and female leads, then words related to sweetness need to be used to match these scenes, so duplication can be performed. Conversely, if the video clips that make up the short video are not similar to one another, for example in a commentary video where the clips are logically related, then the required text information is a logically connected commentary script rather than a set of similar words; in that case the text information can be split, and the split texts are used to match the shots, as sketched below. The present invention places no limitation on this.
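An illustrative sketch of the two text-preparation modes; the requirement codes (0 for "sweet", 1 for "commentary") follow the parameter convention described above, and the number of copies is an assumption chosen only for the example.

```python
import re

def build_text_set(text: str, requirement: int, copies: int = 8) -> list[str]:
    if requirement == 0:
        # Similar clips (e.g. "sweet"): duplicate the keyword text several times.
        return [text] * copies
    # Commentary: split the narration into sentences on common end-of-sentence marks.
    return [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]
```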
S400: calculate the text vector corresponding to each piece of text information in the text set, and determine the video vector matching each text vector.
The text set includes a plurality of pieces of text information; each piece of text information corresponds to one text vector, and each text vector matches one video vector.
For example, in some optional implementations, S400 includes step 1.1 and step 1.2.
Step 1.1: for each piece of text information in the text set, use a pre-established text neural network model to calculate the text vector corresponding to that piece of text information.
Step 1.2: for any one of the text vectors, calculate the similarity between each video vector and the text vector, and determine the video vector with the highest similarity as the video vector matching the text vector.
Optionally, one piece of text information can correspond to one text vector. For each text vector, the present invention can find the most similar video vector by matching and then store the video clip corresponding to that video vector. When searching for the most similar video clip, the present invention can specify a search range (short videos produced for different video requirements have different search ranges), use matrix multiplication to quickly compute the similarity between each text vector and all video vectors, and then select the video vector with the highest similarity; the present invention places no limitation on this.
Optionally, when calculating the text vectors, the present invention can use models such as a long short-term memory network (Long Short-Term Memory, LSTM), a Transformer, a bidirectional language representation model (Bidirectional Encoder Representations from Transformers, BERT), a generative pre-trained model (Generative Pre-Trained, GPT), or a multimodal pre-trained neural network model (Contrastive Language-Image Pre-training, CLIP); the specific choice can balance performance and efficiency, and the present invention places no limitation on this.
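A hedged sketch of the text-vector step with the CLIP text encoder from the list above, via the Hugging Face transformers library; the checkpoint name and the example prompts are assumptions, not values fixed by the patent.

```python
# Hedged sketch: 512-dimensional text vectors from a CLIP text encoder.
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["kiss", "embrace", "lean close", "snuggle"]   # illustrative "sweet" prompts
inputs = processor(text=texts, return_tensors="pt", padding=True)
text_vectors = model.get_text_features(**inputs)        # shape (len(texts), 512)
```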
S500: edit the video clips corresponding to the determined video vectors to obtain the corresponding video.
For example, in some optional implementations, S500 includes:
splicing the video clips corresponding to the determined video vectors in order and adding the corresponding audio and subtitles to obtain the corresponding video, wherein the audio and subtitles of any video clip correspond to the corresponding text information.
Optionally, as mentioned above, a plurality of video clips have already been determined through similarity matching; the present invention can splice and edit these video clips in a certain order to obtain the corresponding short video, and the present invention places no limitation on this.
It should be noted that when adding subtitles, the present invention can extract the subtitle content from the corresponding text information or directly use the corresponding text information as the subtitles; the present invention places no limitation on this. As for the audio, the present invention can use audio converted from the corresponding text information, or it can match suitable audio according to the text information.
Optionally, the present invention can configure parameters to edit and process all stored video clips and finally integrate them into one short video. The parameters here can configure editing options for the short video, such as whether to remove the original sound, whether to add background music, the type of background music, whether to add subtitles, the position and font size of the subtitles, and whether to apply a video filter; the parameters depend on the type of short video to be produced and are not limited here.
Optionally, to explain the solution of the present invention more clearly, the production of a sweet-type short video and of a plot-commentary short video are described below as examples.
1. Producing a sweet-type short video:
(1) Video vector calculation. Each video is divided into a plurality of video clips according to the storyboard information, and one frame per second is sampled from each video clip to obtain groups of video frames. The CLIP image model is used to extract a 512-dimensional image vector from each frame, and the mean of the image vectors of each group of video frames is used as the video vector of the video clip corresponding to that group and stored in a vector library. To support subsequent clip matching, the people appearing in each video clip can also be detected here and the detection results stored in a database; this face detection is an extra operation added to improve the quality of the short videos produced in this example.
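A sketch of the per-frame 512-dimensional vectors and the clip-level mean with the Hugging Face CLIP image encoder; the checkpoint name and frame file names are assumptions, and the face-detection step mentioned above is not shown.

```python
# Hedged sketch: 512-dimensional frame vectors and a clip-level mean with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.open("frame_0001.jpg"), Image.open("frame_0002.jpg")]  # sampled frames
inputs = processor(images=frames, return_tensors="pt")
image_vectors = model.get_image_features(**inputs)   # shape (num_frames, 512)
video_vector = image_vectors.mean(dim=0)              # the clip's video vector
```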
(2) Text vector calculation. The text information preset around the theme of sweetness is duplicated several times to obtain a text set, and the CLIP text model extracts a 512-dimensional text vector from every text in the text set, producing a text vector set. Here the preset text information around sweetness can be text related to or descriptive of sweetness, such as "kiss", "embrace", "lean close", and "snuggle".
(3) Video clip matching. The video clips in the repository in which the male and female leads both appear are taken as the pre-selected range for matching. The N text vectors are assembled into an (N, 512) matrix and the M pre-selected video vectors into a (512, M) matrix; matrix multiplication yields an (N, M) matrix in which the value in row i, column j represents the similarity between the i-th text vector and the j-th video vector. Starting from the first row, the column number with the highest similarity in each row is recorded; if a column has already been recorded, the column with the highest similarity after excluding it is chosen instead. This finally yields N column numbers corresponding to N video vectors. The video clips corresponding to these N video vectors are found in the database and stored, as sketched below.
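A minimal NumPy sketch of the row-wise greedy matching just described, assuming the text and video vectors are already L2-normalised so that the matrix product acts as a cosine similarity.

```python
import numpy as np

def greedy_match(text_vecs: np.ndarray, video_vecs: np.ndarray) -> list[int]:
    # text_vecs: (N, D); video_vecs: (M, D) -> similarity matrix of shape (N, M)
    sim = text_vecs @ video_vecs.T
    used, chosen = set(), []
    for row in sim:
        order = np.argsort(row)[::-1]            # column indices by descending similarity
        col = next(int(c) for c in order if int(c) not in used)
        used.add(col)
        chosen.append(col)                        # index of the clip matched to this text
    return chosen
```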
(4) Short video production. All matched video clips are collected, their original sound is removed, transition effects are randomly selected from GL-Transition to splice all the clips together, a filter is applied to the video using a color look-up table (Look Up Table, LUT), and background music with the corresponding lyric subtitles is added to generate the final short video. A minimal assembly sketch follows.
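The assembly sketch below uses the classic moviepy 1.x editor API as one possible toolchain, not the one prescribed by the patent; file names, the music track, and the subtitle line are placeholders, and the GL-Transition effects and LUT filter from the example are omitted.

```python
# Hedged sketch: mute the matched clips, splice them, and add music and a subtitle.
from moviepy.editor import (AudioFileClip, CompositeVideoClip, TextClip,
                            VideoFileClip, concatenate_videoclips)

clips = [VideoFileClip(p).without_audio() for p in ["clip_001.mp4", "clip_002.mp4"]]
video = concatenate_videoclips(clips, method="compose")

music = AudioFileClip("background_music.mp3").subclip(0, video.duration)
subtitle = (TextClip("placeholder lyric line", fontsize=40, color="white")
            .set_duration(video.duration)
            .set_position(("center", "bottom")))

final = CompositeVideoClip([video, subtitle]).set_audio(music)
final.write_videofile("sweet_short_video.mp4", fps=25)
```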
This example mainly describes one situation; the sweet type can also be replaced by an empty-shot (scenery) type, a smile type, a fight type, a comedy type, and so on, and the present invention places no limitation on this.
2. Producing a plot-commentary short video:
(1) Video vector calculation. Each video is divided into a plurality of video clips according to the storyboard information, and one frame per second is sampled from each video clip to obtain groups of video frames. The EfficientNet model is used to extract a 256-dimensional image vector from each frame, and the mean of the image vectors of each group of video frames is used as the video vector of the video clip corresponding to that group and stored in a vector library.
(2) Text vector calculation. The acquired text information, namely the video's commentary script, is split into sentences according to sentence-ending punctuation to obtain a text set in which each text is one commentary sentence. A Transformer model extracts a 256-dimensional text vector from every sentence in the text set, producing a text vector set.
(3) Video clip matching. All video clips in the repository are taken as the pre-selected range for matching. The N text vectors are assembled into an (N, 256) matrix and the M pre-selected video vectors into a (256, M) matrix; matrix multiplication yields an (N, M) matrix in which the value in row i, column j represents the similarity between the i-th text vector and the j-th video vector. Starting from the first row, the column number with the highest similarity in each row is recorded; if a column has already been recorded, the column with the highest similarity after excluding it is chosen instead. This finally yields N column numbers corresponding to N video vectors. The video clips corresponding to these N video vectors are found in the database and stored.
(4) Short video production. All matched video clips are collected; they correspond one-to-one to the commentary sentences in the text set. The sentences in the text set are converted into audio using text-to-speech (Text To Speech, TTS) technology, the converted audio replaces the original sound of the corresponding video clip, and subtitles are added to form commentary clips. All commentary clips are spliced in order to obtain the final plot-commentary video. During production, settings such as keeping part of the original sound or applying filters can also be chosen, depending on the specific requirements of the commentary video; no mandatory restriction is imposed, so that the produced content can take richer forms.
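A hedged sketch of the TTS step with the offline pyttsx3 engine, one of many possible TTS back ends; the sentence and output file name are placeholders.

```python
# Hedged sketch: synthesize one commentary sentence to a WAV file.
import pyttsx3

engine = pyttsx3.init()
engine.save_to_file("Placeholder commentary sentence.", "narration_001.wav")
engine.runAndWait()
# narration_001.wav would then replace the original audio track of the matched clip.
```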
As shown in FIG. 4, the present invention provides a video processing device including a video splitting unit 100, a video vector calculation unit 200, a text processing unit 300, a vector matching unit 400, and a video editing unit 500.
The video splitting unit 100 is configured to divide a target video into a plurality of video clips.
The video vector calculation unit 200 is configured to, for any one of the video clips, calculate a video vector corresponding to the video clip from a group of video frames in the video clip, wherein one video clip corresponds to one video vector.
The text processing unit 300 is configured to process pre-established text information according to a video requirement to obtain a corresponding text set.
The vector matching unit 400 is configured to calculate a text vector corresponding to each piece of text information in the text set and determine a video vector matching the text vector, wherein the text set includes a plurality of pieces of text information, each piece of text information corresponds to one text vector, and each text vector matches one video vector.
The video editing unit 500 is configured to edit the video clips corresponding to the determined video vectors to obtain the corresponding video.
Optionally, in some optional implementations, the video vector calculation unit 200 includes a video frame extraction subunit, an image vector calculation subunit, and a video vector calculation subunit.
The video frame extraction subunit is configured to, for any one of the video clips, extract a group of video frames from the video clip, wherein a group of video frames includes a plurality of video frames.
The image vector calculation subunit is configured to, for any group of video frames, calculate the image vector corresponding to each video frame in the group.
The video vector calculation subunit is configured to, for any group of video frames, calculate the corresponding video vector from the corresponding image vectors.
Optionally, in some optional implementations, the image vector calculation subunit includes a first calculation subunit.
The first calculation subunit is configured to, for any group of video frames, use a pre-established image neural network model to compute each video frame separately and obtain the image vector corresponding to each video frame, wherein one video frame corresponds to one image vector.
Optionally, in some optional implementations, the video vector calculation subunit includes a second calculation subunit.
The second calculation subunit is configured to, for any group of video frames, calculate the average of the corresponding image vectors to obtain the corresponding video vector.
Optionally, in some optional implementations, the text processing unit 300 includes a text processing subunit.
The text processing subunit is configured to perform at least one of splitting and duplication on the pre-established text information according to the video requirement to obtain a text set corresponding to the text information.
Optionally, in some optional implementations, the vector matching unit 400 includes a third calculation subunit and a fourth calculation subunit.
The third calculation subunit is configured to, for each piece of text information in the text set, use a pre-established text neural network model to calculate the text vector corresponding to that piece of text information.
The fourth calculation subunit is configured to, for any one of the text vectors, calculate the similarity between each video vector and the text vector and determine the video vector with the highest similarity as the video vector matching the text vector.
Optionally, in some optional implementations, the video editing unit 500 includes a video editing subunit.
The video editing subunit is configured to splice the video clips corresponding to the determined video vectors in order and add the corresponding audio and subtitles to obtain the corresponding video, wherein the audio and subtitles of any video clip correspond to the corresponding text information.
Optionally, in some optional implementations, the video splitting unit 100 includes a video splitting subunit.
The video splitting subunit is configured to divide the target video into a plurality of video clips according to storyboard information, wherein the storyboard information is identified in advance and each piece of storyboard information corresponds to one video clip.
The present invention provides a computer-readable storage medium storing a program which, when executed by a processor, implements any of the video processing methods described above.
As shown in FIG. 5, the present invention provides an electronic device 70 that includes at least one processor 701, and at least one memory 702 and a bus 703 connected to the processor 701, wherein the processor 701 and the memory 702 communicate with each other via the bus 703, and the processor 701 is configured to call program instructions in the memory 702 to execute any of the video processing methods described above.
In the present invention, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further restriction, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
The embodiments in this specification are described in a related manner; identical or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the system embodiment is described relatively simply because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention are included in the scope of protection of the present invention.
Claims (11)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410821581.9A CN118400575B (en) | 2024-06-24 | 2024-06-24 | Video processing method and related device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410821581.9A CN118400575B (en) | 2024-06-24 | 2024-06-24 | Video processing method and related device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118400575A true CN118400575A (en) | 2024-07-26 |
| CN118400575B CN118400575B (en) | 2024-09-10 |
Family
ID=91987467
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410821581.9A Active CN118400575B (en) | 2024-06-24 | 2024-06-24 | Video processing method and related device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118400575B (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118646834A (en) * | 2024-08-13 | 2024-09-13 | 浩神科技(北京)有限公司 | A video data acquisition method and system for intelligent video generation |
| CN119155484A (en) * | 2024-08-02 | 2024-12-17 | 北京中科大洋科技发展股份有限公司 | Intelligent video editing method based on large language model |
Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6363380B1 (en) * | 1998-01-13 | 2002-03-26 | U.S. Philips Corporation | Multimedia computer system with story segmentation capability and operating program therefor including finite automation video parser |
| WO2004036574A1 (en) * | 2002-10-15 | 2004-04-29 | Samsung Electronics Co., Ltd. | Information storage medium containing subtitle data for multiple languages using text data and downloadable fonts and apparatus therefor |
| US20150310096A1 (en) * | 2014-04-29 | 2015-10-29 | International Business Machines Corporation | Comparing document contents using a constructed topic model |
| CN113220940A (en) * | 2021-05-13 | 2021-08-06 | 北京小米移动软件有限公司 | Video classification method and device, electronic equipment and storage medium |
| CN114222196A (en) * | 2022-01-04 | 2022-03-22 | 阿里巴巴新加坡控股有限公司 | A method, device, and electronic device for generating a short video of plot commentary |
| CN115967833A (en) * | 2021-10-09 | 2023-04-14 | 北京字节跳动网络技术有限公司 | Video generation method, device and equipment meter storage medium |
| WO2023173539A1 (en) * | 2022-03-16 | 2023-09-21 | 平安科技(深圳)有限公司 | Video content processing method and system, and terminal and storage medium |
| WO2023184636A1 (en) * | 2022-03-29 | 2023-10-05 | 平安科技(深圳)有限公司 | Automatic video editing method and system, and terminal and storage medium |
| CN117793483A (en) * | 2023-12-27 | 2024-03-29 | 携程旅游网络技术(上海)有限公司 | Video tag extraction methods, systems, equipment and media |
| CN117830910A (en) * | 2024-03-05 | 2024-04-05 | 沈阳云翠通讯科技有限公司 | Automatic mixed video cutting method, system and storage medium for video retrieval |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118400575B (en) | 2024-09-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN118400575B (en) | Video processing method and related device | |
| US9396758B2 (en) | Semi-automatic generation of multimedia content | |
| JP4873018B2 (en) | Data processing apparatus, data processing method, and program | |
| JP2007534235A (en) | Method for generating a content item having a specific emotional impact on a user | |
| CN118381971B (en) | Video generation method, device, storage medium, and program product | |
| CN102280104A (en) | File phoneticization processing method and system based on intelligent indexing | |
| WO2023173539A1 (en) | Video content processing method and system, and terminal and storage medium | |
| WO2024108981A1 (en) | Video editing method and apparatus | |
| CN112004137A (en) | Intelligent video creation method and device | |
| CN115442540B (en) | Music video generation method, device, computer equipment and storage medium | |
| CN113676772A (en) | Video generation method and device | |
| CN118631952B (en) | Multi-source video data intelligent selection method and system for intelligent video generation | |
| Matsuo et al. | Mining video editing rules in video streams | |
| CN117082293B (en) | Automatic video generation method and device based on text creative | |
| CN118828090A (en) | Method and system for automatically generating video content driven by susceptible music signals | |
| JP7179387B1 (en) | HIGHLIGHT MOVIE GENERATION SYSTEM, HIGHLIGHT MOVIE GENERATION METHOD, AND PROGRAM | |
| CN119155484B (en) | An intelligent video editing method based on large language model | |
| CN114925223A (en) | A method and system for inserting audio or video | |
| JP3816901B2 (en) | Stream data editing method, editing system, and program | |
| CN116680440A (en) | Segment division processing device, method and storage medium | |
| JP2008084021A (en) | Movie scenario generation method, program, and apparatus | |
| Lin et al. | Semantic based background music recommendation for home videos | |
| JP7133367B2 (en) | MOVIE EDITING DEVICE, MOVIE EDITING METHOD, AND MOVIE EDITING PROGRAM | |
| US12327575B2 (en) | Method and apparatus of generating audio and video materials | |
| CN119316631B (en) | Short video production method, electronic device and computer program product |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |