
HK40067617B - Method and apparatus for recognizing video segment, device and storage medium - Google Patents


Info

Publication number
HK40067617B
Authority
HK
Hong Kong
Prior art keywords
video
segment
candidate
video frame
video segment
Prior art date
Application number
HK42022057504.7A
Other languages
Chinese (zh)
Other versions
HK40067617A (en)
Inventor
郭卉
Original Assignee
腾讯科技(深圳)有限公司
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of HK40067617A
Publication of HK40067617B


Description

Method, Apparatus, Device, and Storage Medium for Recognizing Video Segments

Technical Field

This application relates to the field of computer technology, and in particular to a method, apparatus, device, and storage medium for recognizing video segments.

Background

With the development of computer technology, the volume of video content has grown enormously, and more and more users watch videos online. Such videos include TV series, which typically have opening and ending credits. To make viewing more convenient, video platforms offer a function for skipping the opening and ending credits; the basis for skipping them is determining where the opening and ending credits are located within each episode.

In the related art, the positions of the opening and ending credits of a TV series are determined by manual annotation: a person watches the TV series and then marks the positions of the opening and ending credits.

However, manual annotation consumes a great deal of time and manpower, making the determination of the positions of the opening and ending credits inefficient.

Summary

The embodiments of this application provide a method, apparatus, device, and storage medium for recognizing video segments, which can improve the efficiency of determining the positions of the opening and ending credits of a TV series. The technical solution is as follows:

In one aspect, a method for recognizing video segments is provided, the method including:

determining multiple video frame pairs based on video frame features of a first video and video frame features of at least one second video, where each video frame pair includes a first video frame and a second video frame whose similarity meets a similarity condition, the first video frame belongs to the first video, and the second video frame belongs to the at least one second video;

fusing the first video frames in the multiple video frame pairs based on the occurrence time differences of the multiple video frame pairs to obtain at least one candidate video segment in the first video, where an occurrence time difference is the difference between the occurrence times, in their respective videos, of the two video frames of a video frame pair; and

determining at least one target video segment in the first video based on the at least one candidate video segment and a target time range, where the target video segment falls within the target time range of the first video.
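As a rough illustration of the first step above (pairing frames whose feature similarity meets the similarity condition), the sketch below compares every frame of the first video against every frame of a second video. The dictionary-of-features representation, the cosine similarity measure, and the `threshold` value are illustrative assumptions, not details taken from this application:

```python
# Hypothetical sketch of the frame-pairing step; features are assumed to be
# plain numeric vectors keyed by occurrence time (seconds) in each video.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def find_frame_pairs(first_feats, second_feats, threshold=0.9):
    """first_feats / second_feats: {occurrence_time: feature_vector}.
    Returns (t1, t2) pairs whose feature similarity meets the condition."""
    pairs = []
    for t1, f1 in first_feats.items():
        for t2, f2 in second_feats.items():
            if cosine_similarity(f1, f2) >= threshold:
                pairs.append((t1, t2))  # t1 in first video, t2 in second video
    return pairs
```

A real system would use an index rather than this O(n·m) scan, but the pairing criterion is the same.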

In one aspect, an apparatus for recognizing video segments is provided, the apparatus including:

a video frame pair determination module, configured to determine multiple video frame pairs based on video frame features of a first video and video frame features of at least one second video, where each video frame pair includes a first video frame and a second video frame whose similarity meets a similarity condition, the first video frame belongs to the first video, and the second video frame belongs to the at least one second video;

a fusion module, configured to fuse the first video frames in the multiple video frame pairs based on the occurrence time differences of the multiple video frame pairs to obtain at least one candidate video segment in the first video, where an occurrence time difference is the difference between the occurrence times, in their respective videos, of the two video frames of a video frame pair; and

a target video segment determination module, configured to determine at least one target video segment in the first video based on the at least one candidate video segment and a target time range, where the target video segment falls within the target time range of the first video.

In one possible implementation, the fusion module is configured to divide the multiple video frame pairs into multiple video frame groups based on their occurrence time differences, where the video frame pairs in the same video frame group correspond to the same occurrence time difference; and, for any one of the multiple video frame groups, fuse the first video frames of the video frame pairs in that group into one candidate video segment according to the occurrence times of those first video frames in the first video.
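The grouping-and-fusing idea in this implementation might be sketched as follows. Representing occurrence times as numbers and fusing a group into its earliest-to-latest span are simplifying assumptions; the intuition is that frame pairs sharing the same time offset (t2 − t1) likely belong to the same repeated segment, such as an opening that starts at a different point in each episode:

```python
def group_by_time_diff(pairs):
    """pairs: [(t1, t2), ...] with t1 in the first video, t2 in a second video.
    Returns {occurrence_time_difference: sorted t1 values}."""
    groups = {}
    for t1, t2 in pairs:
        groups.setdefault(t2 - t1, []).append(t1)
    for diff in groups:
        groups[diff].sort()
    return groups

def fuse_group(first_frame_times):
    """Fuse a group's first-video frame times into one candidate segment."""
    times = sorted(first_frame_times)
    return (times[0], times[-1])  # (start, end) in the first video
```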

In one possible implementation, the fusion module is configured to divide video frame pairs with the same occurrence time difference into one initial video frame group, and to fuse multiple initial video frame groups based on their corresponding occurrence time differences to obtain the multiple video frame groups.

In one possible implementation, the fusion module is configured to sort the multiple initial video frame groups in a target order to obtain multiple candidate video frame groups; and, when the matching time difference between any two adjacent candidate video frame groups meets a matching time difference condition, fuse the two adjacent candidate video frame groups into one video frame group, where the matching time difference is the difference between the occurrence time differences corresponding to the two adjacent candidate video frame groups.
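One plausible reading of this sorting-and-merging step is sketched below: initial groups are ordered by their occurrence time difference, and adjacent groups whose differences are within a hypothetical `match_threshold` are fused, which tolerates small offset jitter between episodes. The threshold value and data layout are assumptions:

```python
def merge_adjacent_groups(initial_groups, match_threshold=1):
    """initial_groups: {time_diff: [frame pairs]}.
    Returns a list of merged pair lists (the video frame groups)."""
    diffs = sorted(initial_groups)
    merged = []
    current = list(initial_groups[diffs[0]])
    for prev, cur in zip(diffs, diffs[1:]):
        if cur - prev <= match_threshold:
            current.extend(initial_groups[cur])  # fuse into one group
        else:
            merged.append(current)
            current = list(initial_groups[cur])
    merged.append(current)
    return merged
```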

In one possible implementation, the two adjacent candidate video frame groups include a first candidate video frame group and a second candidate video frame group, and the fusion module is configured to add the video frame pairs in the first candidate video frame group to the second candidate video frame group to obtain the video frame group when the matching time difference between the occurrence time difference corresponding to the first candidate video frame group and that corresponding to the second candidate video frame group is less than or equal to a matching difference threshold.

In one possible implementation, the fusion module is configured to add the video frame pairs in the first candidate video frame group to the second candidate video frame group; and, based on the occurrence time difference corresponding to the second candidate video frame group, replace a target second video frame with a reference second video frame to obtain the video frame group, where the target second video frame is a second video frame newly added to the second candidate video frame group, the reference second video frame is the second video frame in the second video whose occurrence time difference from a target first video frame equals the occurrence time difference corresponding to the second candidate video frame group, and the target first video frame is the first video frame in the video frame pair to which the target second video frame belongs.

In one possible implementation, the fusion module is configured to compare the occurrence times, in the first video, of the first video frames of any two adjacent video frame pairs in the video frame group; add the two adjacent video frame pairs to a temporary frame list when the difference between those occurrence times meets an occurrence time condition; fuse the video frame pairs in the temporary frame list into a reference video segment when the difference does not meet the occurrence time condition; and determine the at least one candidate video segment based on multiple reference video segments.
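The temporary-frame-list logic described here can be approximated by a gap-based split: first-video frame times close enough together accumulate in a temporary list, and the list is flushed as one reference segment whenever the occurrence time condition fails. The `max_gap` value is a placeholder, not a figure from this application:

```python
def split_into_segments(frame_times, max_gap=2):
    """Split sorted first-video frame times into (start, end) reference
    segments, breaking wherever consecutive times are more than max_gap apart."""
    segments, temp = [], []
    for t in sorted(frame_times):
        if not temp or t - temp[-1] <= max_gap:
            temp.append(t)                       # occurrence time condition met
        else:
            segments.append((temp[0], temp[-1])) # flush the temporary list
            temp = [t]
    if temp:
        segments.append((temp[0], temp[-1]))
    return segments
```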

In one possible implementation, the multiple reference video segments include a first overlapping video segment and/or a second overlapping video segment, where the first overlapping video segment is a reference video segment fully contained in a first reference video segment among the multiple reference video segments, and the second overlapping video segment is a reference video segment that partially overlaps a second reference video segment among the multiple reference video segments; the fusion module is configured to perform at least one of the following:

when the multiple reference video segments include the first overlapping video segment, deleting the first overlapping video segment to obtain the at least one candidate video segment; and

when the multiple reference video segments include the second overlapping video segment, deleting the overlapping portion between the second overlapping video segment and the second reference video segment to obtain the at least one candidate video segment.
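A minimal sketch of these two overlap rules, assuming segments are (start, end) pairs on the first video's timeline and are compared against the earlier segment they overlap: a fully contained segment is deleted, and a partially overlapping segment has its overlapping portion trimmed away:

```python
def deduplicate(segments):
    """Apply the containment-delete and overlap-trim rules to (start, end)
    segments, processed in ascending order of start time."""
    segments = sorted(segments)
    result = []
    for start, end in segments:
        if result:
            _, prev_end = result[-1]
            if end <= prev_end:       # fully contained: delete the segment
                continue
            if start < prev_end:      # partial overlap: trim the overlap
                start = prev_end
        result.append((start, end))
    return result
```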

In one possible implementation, the fusion module is further configured to: compare the duration of a third-type reference video segment with a target duration, where the third-type reference video segment is the second overlapping video segment with its overlapping portion deleted; retain the third-type reference video segment when its duration is greater than or equal to the target duration; and delete the third-type reference video segment when its duration is less than the target duration.
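The duration check might look like the following, with `target_duration` standing in for whatever threshold an implementation chooses; segments left too short after trimming are weak evidence and are dropped:

```python
def filter_by_duration(segments, target_duration=5):
    """Keep only (start, end) segments at least target_duration long."""
    return [(s, e) for s, e in segments if e - s >= target_duration]
```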

In one possible implementation, the target video segment determination module is configured to determine at least one target candidate video segment based on the at least one candidate video segment, where the number of times the target candidate video segment appears in the at least one candidate video segment meets a count condition; and,

when the occurrence time of any target candidate video segment in the first video falls within the target time range, determine that target candidate video segment as a target video segment in the first video.
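Combining the two conditions in this implementation (an appearance-count condition and the target time range) might look like the sketch below. The `min_count` value and the `time_range` of the first 120 seconds are illustrative placeholders, e.g. for locating an opening near the start of an episode:

```python
from collections import Counter

def select_target_segments(candidates, min_count=2, time_range=(0, 120)):
    """candidates: list of (start, end) segments, possibly repeated across
    comparisons with different second videos. Keep segments that recur at
    least min_count times and lie entirely inside time_range."""
    counts = Counter(candidates)
    lo, hi = time_range
    return sorted({seg for seg, n in counts.items()
                   if n >= min_count and lo <= seg[0] and seg[1] <= hi})
```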

In one possible implementation, the target video segment determination module is configured to determine at least one reference candidate video segment based on the at least one candidate video segment; determine the number of times each reference candidate video segment appears in the at least one reference candidate video segment; and determine a reference candidate video segment whose appearance count meets the count condition as a target candidate video segment.

In one possible implementation, the at least one candidate video segment includes a third overlapping video segment and/or a fourth overlapping video segment, where the third overlapping video segment is a candidate video segment fully contained in a first candidate video segment among the at least one candidate video segment, and the fourth overlapping video segment is a candidate video segment that partially overlaps a second candidate video segment among the at least one candidate video segment; the target video segment determination module is configured to perform at least one of the following:

when the at least one candidate video segment includes the third overlapping video segment, deleting the third overlapping video segment to obtain the at least one reference candidate video segment;

when the at least one candidate video segment includes the fourth overlapping video segment and the degree of overlap between the fourth overlapping video segment and the second candidate video segment meets an overlap condition, determining the appearance count of the fourth overlapping video segment, and determining the at least one reference candidate video segment based on that appearance count;

when the at least one candidate video segment includes the fourth overlapping video segment and the degree of overlap between the fourth overlapping video segment and the second candidate video segment does not meet the overlap condition, deleting the fourth overlapping video segment to obtain the at least one reference candidate video segment; and

when the at least one candidate video segment includes the fourth overlapping video segment and the duration of the fourth overlapping video segment is less than that of the second candidate video segment, deleting the fourth overlapping video segment to obtain the at least one reference candidate video segment.

In one possible implementation, the target video segment determination module is configured to perform either of the following:

when the appearance count of the fourth overlapping video segment is greater than or equal to a first appearance count threshold, fusing the fourth overlapping video segment with the second candidate video segment to obtain the at least one reference candidate video segment; and

when the appearance count of the fourth overlapping video segment is less than the first appearance count threshold, deleting the fourth overlapping video segment to obtain the at least one reference candidate video segment.
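These two branches can be condensed into one hypothetical helper. The function name and the fuse-as-union behavior (taking the combined span of the two segments) are assumptions rather than details from this application:

```python
def resolve_partial_overlap(count, threshold, overlap_seg, anchor_seg):
    """If the partially overlapping segment recurs at least `threshold`
    times, fuse it with the anchor segment (union of the two spans);
    otherwise drop it (return None)."""
    if count >= threshold:
        return (min(overlap_seg[0], anchor_seg[0]),
                max(overlap_seg[1], anchor_seg[1]))
    return None
```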

In one possible implementation, the apparatus further includes:

a feature extraction module, configured to perform feature extraction on multiple target video frames of a target video to be recognized, to obtain video frame features of the multiple target video frames; and

the target video segment determination module is further configured to determine at least one target video segment of the target video based on the video frame features of the multiple target video frames, the video frame features of the first video, and the video frame features of the at least one second video.

In one aspect, a computer device is provided, the computer device including one or more processors and one or more memories, where at least one computer program is stored in the one or more memories, and the computer program is loaded and executed by the one or more processors to implement the method for recognizing video segments.

In one aspect, a computer-readable storage medium is provided, where at least one computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the method for recognizing video segments.

In one aspect, a computer program product is provided, including a computer program that, when executed by a processor, implements the above method for recognizing video segments.

With the technical solution provided by the embodiments of this application, video frame pairs containing similar video frames are determined based on the similarity between video frame features; the first video frames in the video frame pairs are fused based on occurrence time differences to obtain at least one candidate video segment; and a target video segment within the target time range is finally determined from the at least one candidate video segment. Determining the target segment requires no manual involvement: a computer device can perform it automatically, based directly on the first video and the at least one second video, which makes the process highly efficient.

Brief Description of the Drawings

To explain the technical solutions in the embodiments of this application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and those of ordinary skill in the art can derive other drawings from them without creative effort.

Figure 1 is a schematic diagram of the implementation environment of a method for recognizing video segments provided by an embodiment of this application;

Figure 2 is a flowchart of a method for recognizing video segments provided by an embodiment of this application;

Figure 3 is a flowchart of a method for recognizing video segments provided by an embodiment of this application;

Figure 4 is a flowchart of a method for extracting video frame features provided by an embodiment of this application;

Figure 5 is a schematic diagram of a first sub-segment and a second sub-segment provided by an embodiment of this application;

Figure 6 is a schematic diagram of first sub-segments with different overlap patterns provided by an embodiment of this application;

Figure 7 is a schematic diagram of candidate video segment fusion provided by an embodiment of this application;

Figure 8 is a flowchart of a method for recognizing video segments provided by an embodiment of this application;

Figure 9 is a flowchart of a segment mining system provided by an embodiment of this application;

Figure 10 is a flowchart of a method for obtaining the opening and ending credits of a TV series provided by an embodiment of this application;

Figure 11 is a schematic diagram of a storage scheme of a segment database provided by an embodiment of this application;

Figure 12 is a flowchart of a method for obtaining the opening and ending credits of a TV series provided by an embodiment of this application;

Figure 13 is a flowchart of a method for identifying infringing videos provided by an embodiment of this application;

Figure 14 is a flowchart of a method for recognizing video segments provided by an embodiment of this application;

Figure 15 is a schematic structural diagram of an apparatus for recognizing video segments provided by an embodiment of this application;

Figure 16 is a schematic structural diagram of a terminal provided by an embodiment of this application;

Figure 17 is a schematic structural diagram of a server provided by an embodiment of this application.

Detailed Description

To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the drawings.

In this application, the terms "first", "second", and so on are used to distinguish identical or similar items whose roles and functions are essentially the same. It should be understood that "first", "second", and "nth" carry no logical or temporal dependency and place no restriction on quantity or execution order.

Artificial Intelligence (AI) comprises the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines so that the machines can perceive, reason, and make decisions.

AI technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine Learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge sub-models so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications span all areas of AI. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

Hamming distance: a measure of the distance between binary features, computed by counting the number of feature bits whose values differ. For example, the Hamming distance between (1000) and (0011) is 3.
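The worked example in the definition above can be checked directly:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the bit positions at which two equal-length binary strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))
```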

It should be noted that the information (including but not limited to user device information and user personal information), data (including but not limited to data used for analysis, stored data, and displayed data), and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.

Figure 1 is a schematic diagram of the implementation environment of a method for recognizing video segments provided by an embodiment of this application. Referring to Figure 1, the implementation environment may include a terminal 110 and a server 140.

The terminal 110 is connected to the server 140 via a wireless or wired network. Optionally, the terminal 110 is a vehicle-mounted terminal, a smartphone, a tablet computer, a laptop, a desktop computer, a smart speaker, a smartwatch, a smart TV, or the like, but is not limited to these. The terminal 110 has installed, and runs, an application that supports video segment recognition.

The server 140 is a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The server 140 provides background services for the application running on the terminal 110.

The embodiments of this application place no limit on the number of terminals 110 and servers 140.

Having introduced the implementation environment of the embodiments of this application, the application scenarios of the embodiments are introduced below in conjunction with that environment. In the following description, the terminal is the terminal 110 in the above implementation environment, and the server is the server 140 in the above implementation environment.

The method for recognizing video segments provided by the embodiments of this application can be applied to scenarios of recognizing the opening and ending credits of videos, for example, of TV series, documentaries, or short video collections.

Taking the application of the method to recognizing the opening and ending credits of a TV series as an example, a technician selects, via a terminal, the TV series whose credits need to be recognized; the TV series includes multiple videos, each of which is one episode. When the TV series is selected via the terminal, the server can use the technical solution provided by the embodiments of this application to process the multiple videos of the TV series and obtain the opening and ending credits in those videos. During this processing, the server determines multiple video frame pairs based on the video frame features of a first video and the video frame features of at least one second video, where each video frame pair includes a first video frame and a second video frame whose similarity meets a similarity condition, the first video frame belongs to the first video, and the second video frame belongs to the at least one second video; that is, each video frame pair includes one video frame from the first video and one video frame from a second video, and the first video and the at least one second video all belong to the multiple videos. Based on the occurrence time differences of the multiple video frame pairs, the server fuses the first video frames in the pairs to obtain at least one candidate video segment in the first video, where an occurrence time difference is the difference between the occurrence times of the two video frames of a pair, that is, between the occurrence time of the first video frame in the first video and the occurrence time of the second video frame in the second video. Based on the at least one candidate video segment and a target time range, the server determines at least one target video segment in the first video. Since the method is applied to recognizing the opening and ending credits of a TV series, the target time range is the time period in which the opening or ending credits appear, and the determined target video segment is the opening or ending credits of the first video.

需要说明的是,上述是以本申请实施例提供的视频片段的识别方法应用在识别电视剧片头和片尾的场景下为例进行说明的,上述其他应用场景的实施过程与上述说明属于同一发明构思,实施过程不再赘述。It should be noted that the above description is based on the application of the video segment recognition method provided in the embodiments of this application to the scenario of recognizing the opening and closing credits of a TV series. The implementation process of the other application scenarios described above belongs to the same inventive concept as the above description, and the implementation process will not be repeated.

另外,本申请实施例提供的视频片段的识别方法除了能够应用在上述识别电视剧的片头和片尾的场景、识别纪录片的片头和片尾的场景以及识别短视频集合的片头和片尾的场景之外,也能够应用在识别其他类型视频的片头和片尾的场景中,本申请实施例对此不做限定。In addition, the video segment recognition method provided in the embodiments of this application can be applied not only to the above scenarios of recognizing the opening and closing credits of TV series, documentaries, and short video collections, but also to scenarios of recognizing the opening and closing credits of other types of videos, which the embodiments of this application do not limit.

介绍完本申请实施例的实施环境和应用场景之后,下面对本申请实施例提供的视频片段的识别方法进行说明,参见图2,本申请实施例提供的技术方案可以由终端或服务器执行,也可以由终端和服务器共同执行,在本申请实施例中,以执行主体为服务器为例进行说明,方法包括:After introducing the implementation environment and application scenarios of the embodiments of this application, the video segment recognition method provided by the embodiments of this application will be described below. Referring to Figure 2, the technical solution provided by the embodiments of this application can be executed by a terminal or a server, or by both a terminal and a server. In this embodiment of the application, the execution subject is the server as an example for description, and the method includes:

201、服务器基于第一视频的视频帧特征以及至少一个第二视频的视频帧特征,确定多个视频帧对,该视频帧对包括相似度符合相似度条件的第一视频帧和第二视频帧,该第一视频帧属于该第一视频,该第二视频帧属于该至少一个第二视频。201. The server determines multiple video frame pairs based on the video frame features of a first video and the video frame features of at least one second video. The video frame pairs include a first video frame and a second video frame whose similarity meets the similarity condition. The first video frame belongs to the first video, and the second video frame belongs to the at least one second video.

其中,第一视频和至少一个第二视频属于同一个视频集合,比如,第一视频和第二视频为同一部电视剧的不同集。视频帧特征为视频帧的嵌入特征,比如为深度哈希特征。第一视频帧和第二视频帧之间的相似度通过第一视频帧的视频帧特征以及第二视频帧的视频帧特征来确定。每个视频帧对包括一个第一视频帧和一个第二视频帧,且视频帧对中的第一视频帧和第二视频帧之间的相似度符合相似度条件,也即是视频帧对中的第一视频帧和第二视频帧为相似度较高的两个视频帧。In this setup, the first video and at least one second video belong to the same video set; for example, the first video and the second video may be different episodes of the same TV series. Video frame features are the embedding features of the video frames, such as deep hash features. The similarity between the first video frame and the second video frame is determined using the video frame features of both the first and second video frames. Each video frame pair includes one first video frame and one second video frame, and the similarity between the first and second video frames in the pair meets a similarity condition, meaning that the first and second video frames in the pair are highly similar.

202、服务器基于该多个视频帧对的出现时间差值,将该多个视频帧对中的第一视频帧进行融合,得到该第一视频中的至少一个候选视频片段,该出现时间差值是指该视频帧对中的两个视频帧在视频中的出现时间之间的差值。202. Based on the occurrence time difference of the multiple video frame pairs, the server fuses the first video frame in the multiple video frame pairs to obtain at least one candidate video segment in the first video. The occurrence time difference refers to the difference between the occurrence times of the two video frames in the video frame pair in the video.

其中,视频帧对中的第一视频帧是与第二视频帧之间相似度较高的视频帧,而候选视频片段是由多个视频帧对中的第一视频帧融合得到的,那么候选视频片段也即是第一视频中与至少一个第二视频具有重合内容的视频片段。出现时间差值能够反映第一视频帧和第二视频帧在第一视频和第二视频中出现时间的偏差。In this context, the first video frame in a video frame pair is one that has a high similarity to the second video frame, and the candidate video segment is obtained by fusing the first video frames from multiple video frame pairs; therefore, the candidate video segment is a segment in the first video that has overlapping content with at least one second video. The occurrence time difference reflects the discrepancy between the occurrence times of the first and second video frames in the first and second videos, respectively.

203、服务器基于该至少一个候选视频片段以及目标时间范围,确定该第一视频中的至少一个目标视频片段,该目标视频片段处于该第一视频的该目标时间范围内。203. Based on the at least one candidate video segment and the target time range, the server determines at least one target video segment in the first video, wherein the target video segment is within the target time range of the first video.

其中,目标时间范围是指视频中的一段时间范围,目标时间范围由技术人员根据实际情况进行设置,本申请实施例对此不做限定。The target time range refers to a time range within the video; it is set by technicians according to the actual situation, and the embodiments of this application do not limit it.

通过本申请实施例提供的技术方案,基于视频帧特征之间的相似度,确定包含相似视频帧的视频帧对。基于出现时间差值来对视频帧对中的第一视频帧进行融合,得到至少一个候选视频片段。最终从至少一个候选视频片段中确定出处于目标时间范围的目标视频片段。确定目标片段的过程无需人工参与,由计算机设备直接基于第一视频和至少一个第二视频就能够自动进行,效率较高。The technical solution provided in this application determines video frame pairs containing similar video frames based on the similarity between video frame features. The first video frame in the video frame pair is fused based on the time difference of its occurrence to obtain at least one candidate video segment. Finally, a target video segment within a target time range is determined from the at least one candidate video segment. The process of determining the target segment requires no manual intervention; it can be automatically performed by computer equipment directly based on the first video and at least one second video, resulting in high efficiency.

上述步骤201-203是对本申请实施例提供的视频片段的识别方法的简单介绍,下面将结合一些例子,对本申请实施例提供的视频片段的识别方法进行更加详细的说明,参见图3,本申请实施例提供的技术方案可以由终端或服务器执行,也可以由终端和服务器共同执行,在本申请实施例中,以执行主体为服务器为例进行说明,方法包括:Steps 201-203 above are a brief introduction to the video segment recognition method provided in the embodiments of this application. The following will provide a more detailed description of the video segment recognition method provided in the embodiments of this application, using some examples. Referring to Figure 3, the technical solution provided in the embodiments of this application can be executed by a terminal or a server, or jointly by a terminal and a server. In this embodiment, the execution subject is taken as a server as an example, and the method includes:

301、服务器对第一视频和至少一个第二视频进行特征提取,得到第一视频的视频帧特征以及至少一个第二视频的视频帧特征。301. The server performs feature extraction on the first video and at least one second video to obtain video frame features of the first video and video frame features of at least one second video.

在一种可能的实施方式中,服务器将第一视频和至少一个第二视频输入特征提取模型,通过该特征提取模型对该第一视频和该至少一个第二视频进行特征提取,得到该第一视频的视频帧特征以及该至少一个第二视频的视频帧特征。In one possible implementation, the server inputs a first video and at least one second video into a feature extraction model, and performs feature extraction on the first video and the at least one second video through the feature extraction model to obtain the video frame features of the first video and the video frame features of the at least one second video.

其中,服务器通过特征提取模型对第一视频和至少一个第二视频进行特征提取的过程,也即是对第一视频的第一视频帧以及第二视频的第二视频帧进行特征提取的过程,在这种情况下,该特征提取模型为一个图像特征提取模型。The process of the server extracting features from the first video and at least one second video using a feature extraction model is the process of extracting features from the first video frame of the first video and the second video frame of the second video. In this case, the feature extraction model is an image feature extraction model.

在这种实施方式下,通过特征提取模型对该第一视频和该至少一个第二视频进行特征提取,得到第一视频的视频帧特征以及至少一个第二视频的视频帧特征,从而实现对第一视频和至少一个第二视频进行抽象表达,提高后续的运算效率。In this implementation, feature extraction is performed on the first video and the at least one second video using a feature extraction model to obtain the video frame features of the first video and the video frame features of the at least one second video, thereby enabling an abstract representation of the first video and the at least one second video and improving the efficiency of subsequent operations.

为了对上述实施方式进行说明,下面通过三个例子对上述实施方式进行说明。To illustrate the above implementation methods, three examples are provided below.

例1、服务器将该第一视频和该至少一个第二视频输入特征提取模型,通过特征提取模型对多个第一视频帧和多个第二视频帧进行卷积和池化,得到该多个第一视频帧的视频帧特征以及多个第二视频帧的视频帧特征,其中,多个第一视频帧为第一视频的视频帧,多个第二视频帧为至少一个第二视频的视频帧。Example 1: The server inputs the first video and the at least one second video into a feature extraction model. The feature extraction model performs convolution and pooling on multiple first video frames and multiple second video frames to obtain the video frame features of the multiple first video frames and the video frame features of the multiple second video frames. The multiple first video frames are video frames of the first video, and the multiple second video frames are video frames of at least one second video.

下面以服务器对第一视频进行特征提取的方法进行说明,服务器将该第一视频的多个第一视频帧输入特征提取模型,通过该特征提取模型的卷积层,对该多个第一视频帧进行卷积,得到该多个第一视频帧的特征图。服务器通过该特征提取模型的池化层,对该多个第一视频帧的特征图进行最大池化或者平均池化中的任一项,得到该多个第一视频帧的视频帧特征。在一些实施例中,服务器以矩阵的形式来表示第一视频帧,以向量的形式来表示视频帧特征,在对第一视频帧进行卷积的过程中,采用卷积核在第一视频帧上进行滑动的方式来实现。The following describes a method for feature extraction from a first video by the server. The server inputs multiple first video frames into a feature extraction model. The convolutional layer of this model convolves the multiple first video frames to obtain feature maps. The server then uses the pooling layer of the feature extraction model to perform either max pooling or average pooling on the feature maps of the multiple first video frames to obtain the video frame features. In some embodiments, the server represents the first video frame as a matrix and the video frame features as vectors. During the convolution process, the convolution kernel slides across the first video frame.

在一些实施例中,该特征提取模型为基于卷积神经网络(Convolutional Neural Networks,CNN)的特征提取器,比如为采用大规模开源数据集imagenet(图网)上预训练的神经网络Resnet-101(残差网络101),该神经网络Resnet-101的结构参见表1。该神经网络Resnet-101的池化(Pooling)层的输出结果为视频帧特征,其中,101是指模型的层数,该视频帧特征为一个1×2048的向量。In some embodiments, the feature extraction model is a feature extractor based on Convolutional Neural Networks (CNNs), such as a ResNet-101 neural network (Residual Network 101) pre-trained on the large-scale open-source dataset ImageNet. The structure of the ResNet-101 neural network is shown in Table 1. The output of the pooling layer of the ResNet-101 neural network is the video frame feature, where 101 refers to the number of layers in the model, and the video frame feature is a 1×2048 vector.

表1Table 1

其中,Layer name为特征提取模型ResNet-101中各个层面的名称,Output size为输出的特征图的尺寸,max pool指最大值池化,stride是指步长,blocks是指层,一层可能包括多个卷积核,Conv是指卷积层,Pool是指池化层,Class是指分类层,full connection是指全连接,在上述提取视频帧特征的过程中,不使用Class层。In this context, Layer name refers to the name of each layer in the ResNet-101 feature extraction model, Output size refers to the size of the output feature map, max pool refers to max pooling, stride refers to the stride, blocks refers to layers (a layer may contain multiple convolutional kernels), Conv refers to convolutional layers, Pool refers to pooling layers, Class refers to classification layers, and full connection refers to fully connected layers. In the process of extracting video frame features, the Class layer is not used.

需要说明的是,上述是以特征提取模型为ResNet-101为例进行说明的,在其他可能的实施方式中,该特征提取模型还可以为其他结构,本申请实施例对此不做限定。It should be noted that the above description uses ResNet-101 as an example for feature extraction model. In other possible implementations, the feature extraction model can also be other structures, and this application embodiment does not limit this.
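为便于理解,上述池化步骤可以用如下示意代码说明。The pooling step described above can be sketched as follows. This is an illustrative Python sketch and not part of the patent disclosure: it assumes a 2048×7×7 feature map (the output shape described for the ResNet-101 convolutional stage) and shows how average pooling collapses it into a 1×2048 frame feature vector; the function name is hypothetical.

```python
import numpy as np

def global_average_pool(feature_map: np.ndarray) -> np.ndarray:
    """Collapse a C x H x W feature map into a 1 x C frame feature
    by averaging over the spatial dimensions (the Pool layer's role)."""
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w).mean(axis=1).reshape(1, c)

# A dummy feature map with the shapes from Table 1: 2048 channels, 7x7 spatial grid.
fmap = np.random.rand(2048, 7, 7)
feature = global_average_pool(fmap)
print(feature.shape)  # (1, 2048)
```

若改用最大池化,只需将 `mean(axis=1)` 替换为 `max(axis=1)`;实际部署中该步骤由预训练网络的池化层完成。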

另外,上述特征提取过程是基于卷积来实现的,得到的视频帧特征用于表达视频帧的图像纹理的特征,这样的视频帧特征也被称为视频帧的底层特征。在其他可能的实施方式中,该特征提取模型还能够提取视频帧的语义特征,得到的视频帧特征能够反映视频帧的语义,下面对服务器通过该特征提取模型来提取视频帧的语义特征的方法进行说明。Furthermore, the above feature extraction process is based on convolution, and the resulting video frame features are used to express the image texture features of the video frame. Such video frame features are also called the low-level features of the video frame. In other possible implementations, this feature extraction model can also extract the semantic features of the video frame. The resulting video frame features can reflect the semantics of the video frame. The following describes the method by which the server extracts the semantic features of the video frame using this feature extraction model.

例2、服务器将该第一视频和该至少一个第二视频输入特征提取模型,通过特征提取模型,基于注意力机制对多个第一视频帧和多个第二视频帧进行编码,得到该多个第一视频帧的视频帧特征以及多个第二视频帧的视频帧特征,其中,多个第一视频帧为第一视频的视频帧,多个第二视频帧为至少一个第二视频的视频帧,通过该特征提取模型获取的视频帧特征也即是对应视频帧的语义特征。在这种实施方式下,该特征提取模型为语义特征编码器,比如为Transformer编码器。Example 2: The server inputs the first video and the at least one second video into a feature extraction model. The feature extraction model encodes multiple first video frames and multiple second video frames using an attention mechanism, obtaining video frame features for the multiple first video frames and multiple second video frames. Here, the multiple first video frames are video frames of the first video, and the multiple second video frames are video frames of at least one second video. The video frame features obtained through this feature extraction model are also the semantic features of the corresponding video frames. In this implementation, the feature extraction model is a semantic feature encoder, such as a Transformer encoder.

下面以服务器对第一视频进行特征提取的方法进行说明,服务器将该第一视频的多个第一视频帧输入特征提取模型,通过该特征提取模型,对该多个第一视频帧进行嵌入编码,得到多个嵌入向量,一个嵌入向量对应于一个第一视频帧,嵌入向量用于表示第一视频帧在第一视频中的位置以及第一视频帧的内容。服务器将多个嵌入向量输入特征提取模型,通过特征提取模型的三个线性变换矩阵,对多个嵌入向量进行线性变换,得到每个第一视频帧对应的查询(Query)向量、键(Key)向量以及值(Value)向量。服务器通过特征提取模型,基于多个第一视频帧对应的查询向量以及键向量,获取多个第一视频帧的注意力权重。服务器通过特征提取模型,基于每个第一视频帧的注意力权重和每个第一视频帧的值向量,获取每个第一视频帧的注意力编码向量,注意力编码向量也即是第一视频帧的视频帧特征。The following describes the method of feature extraction from the first video by the server. The server inputs multiple first video frames from the first video into a feature extraction model. This model performs embedding encoding on the multiple first video frames, obtaining multiple embedding vectors. Each embedding vector corresponds to one first video frame, representing the position and content of the first video frame within the first video. The server inputs these embedding vectors into the feature extraction model and performs linear transformations using three linear transformation matrices, obtaining a query vector, a key vector, and a value vector corresponding to each first video frame. Based on the query and key vectors corresponding to the multiple first video frames, the server obtains the attention weights for each first video frame using the feature extraction model. Finally, based on the attention weights and value vectors of each first video frame, the server obtains the attention encoding vector for each first video frame. This attention encoding vector is essentially the video frame feature of the first video frame.

比如,服务器通过特征提取模型,将每个嵌入向量分别与三个线性变换矩阵相乘,得到每个第一视频帧分别对应的查询向量、键向量以及值向量。对于多个第一视频帧中的第一个第一视频帧,服务器通过特征提取模型,基于第一个第一视频帧的查询向量,与多个第一视频帧的键向量,确定多个第一视频帧对第一个第一视频帧之间的多个注意力权重。对于多个第一视频帧中的第一个第一视频帧,服务器通过特征提取模型,将多个第一视频帧对第一个第一视频帧的注意力权重,与多个第一视频帧的值向量进行加权求和,得到第一个第一视频帧的注意力编码向量,也即是第一个第一视频帧的视频帧特征。For example, the server uses a feature extraction model to multiply each embedding vector by three linear transformation matrices to obtain the query vector, key vector, and value vector corresponding to each first video frame. For the first video frame among multiple first video frames, the server uses the feature extraction model to determine multiple attention weights between the first video frame and the key vectors of the multiple first video frames, based on the query vector of the first video frame. For the first video frame among multiple first video frames, the server uses the feature extraction model to perform a weighted sum of the attention weights between the multiple first video frames and the value vectors of the multiple first video frames to obtain the attention encoding vector of the first video frame, which is the video frame feature of the first video frame.
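上述查询向量、键向量、值向量的加权求和过程可以用如下示意代码说明。The query/key/value computation above can be sketched with a minimal single-head self-attention example. This is an illustrative sketch under toy dimensions, not the patent's actual Transformer encoder; all names and sizes are hypothetical.

```python
import numpy as np

def self_attention(embeddings, w_q, w_k, w_v):
    """Single-head self-attention over frame embeddings (n x d).
    Each output row is one frame's attention encoding vector."""
    q, k, v = embeddings @ w_q, embeddings @ w_k, embeddings @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])          # query-key dot products
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax -> attention weights
    return weights @ v                              # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d = 6, 8  # six frame embeddings of dimension 8 (illustrative sizes only)
emb = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(emb, w_q, w_k, w_v)
print(out.shape)  # (6, 8)
```

每一行输出即对应一个第一视频帧的注意力编码向量,也即该帧的语义特征。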

上述例1和例2分别以该特征提取模型提取视频帧的底层特征以及语义特征为例进行说明的,在其他可能的实施方式中,服务器还能够通过该特征提取模型同时获取视频帧的底层特征以及语义特征,下面通过例3进行说明。Examples 1 and 2 above illustrate the extraction of low-level features and semantic features of video frames by the feature extraction model. In other possible implementations, the server can also obtain low-level features and semantic features of video frames simultaneously through the feature extraction model, as illustrated in Example 3 below.

例3、服务器将该第一视频和该至少一个第二视频输入特征提取模型,通过特征提取模型对多个第一视频帧和多个第二视频帧进行卷积和池化,得到该多个第一视频帧的底层特征以及多个第二视频帧的底层特征,其中,多个第一视频帧为第一视频的视频帧,多个第二视频帧为至少一个第二视频的视频帧。服务器通过该特征提取模型,基于注意力机制对多个第一视频帧和多个第二视频帧进行编码,得到该多个第一视频帧的语义特征以及多个第二视频帧的语义特征。服务器将各个第一视频帧的底层特征和语义特征进行融合,得到各个第一视频帧的视频帧特征。服务器将各个第二视频帧的底层特征和语义特征进行融合,得到各个第二视频帧的视频帧特征。Example 3: The server inputs the first video and at least one second video into a feature extraction model. The model performs convolution and pooling on multiple first and second video frames to obtain the low-level features of the first and second video frames. The first video frames are video frames of the first video, and the second video frames are video frames of at least one second video. The server then encodes the multiple first and second video frames using an attention mechanism based on the feature extraction model to obtain the semantic features of the first and second video frames. The server fuses the low-level and semantic features of each first video frame to obtain the video frame features of each first video frame. Similarly, the server fuses the low-level and semantic features of each second video frame to obtain the video frame features of each second video frame.

举例来说,该特征提取模型包括第一子模型和第二子模型,该第一子模型用于提取视频帧的底层特征,该第二子模型用于提取视频帧的语义特征。服务器将该第一视频和该至少一个第二视频输入特征提取模型之后,通过该第一子模型来获取该多个第一视频帧的底层特征以及多个第二视频帧的底层特征,通过第二子模型来获取该多个第一视频帧的语义特征以及多个第二视频帧的语义特征。服务器将各个视频帧的底层特征和语义特征进行融合时,可以采用加权求和的方式,加权求和的权重由技术人员根据实际情况进行设置,比如设置为0.5,本申请实施例对此不做限定。服务器通过该第一子模型和该第二子模型获取视频帧的底层特征和语义特征的方法分别与上述例1和例2同理,实现过程在此不再赘述。For example, the feature extraction model includes a first sub-model and a second sub-model. The first sub-model is used to extract the low-level features of video frames, and the second sub-model is used to extract the semantic features of video frames. After the server inputs the first video and the at least one second video into the feature extraction model, it obtains the low-level features of the multiple first video frames and the low-level features of the multiple second video frames through the first sub-model, and obtains the semantic features of the multiple first video frames and the multiple second video frames through the second sub-model. When the server fuses the low-level features and semantic features of each video frame, it can use a weighted summation method. The weight of the weighted summation can be set by the technician according to the actual situation, such as 0.5. This application embodiment does not limit this. The method by which the server obtains the low-level features and semantic features of video frames through the first sub-model and the second sub-model is the same as in Examples 1 and 2 above, and the implementation process will not be described in detail here.
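上述加权求和式的特征融合可以用如下示意代码说明。The weighted-sum fusion above can be sketched as follows; this is an illustrative sketch with a hypothetical function name, assuming both features share the same dimension.

```python
import numpy as np

def fuse_features(low_level, semantic, alpha=0.5):
    """Weighted sum of a frame's low-level and semantic features;
    alpha is the fusion weight (0.5 in the example above)."""
    return alpha * low_level + (1.0 - alpha) * semantic

low = np.array([1.0, 2.0, 3.0])
sem = np.array([3.0, 2.0, 1.0])
print(fuse_features(low, sem))  # [2. 2. 2.]
```

权重 alpha 可按实际场景调节,以偏重纹理特征或语义特征。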

需要说明的是,上述是以特征提取模型提取视频帧的底层特征和语义特征为例进行说明的,随着科学技术的发展,服务器还能够采用其他结构的特征提取模型来获取视频帧特征,本申请实施例对此不做限定。It should be noted that the above description uses the feature extraction model to extract the low-level features and semantic features of video frames as an example. With the development of science and technology, servers can also use feature extraction models with other structures to obtain video frame features. This application does not limit this.

在一些实施例中,第一视频和至少一个第二视频是属于同一个视频集合中的视频,其中,第一视频是待确定目标视频片段的视频,该至少一个第二视频为该视频集合中除该第一视频以外的全部视频,或者,该至少一个第二视频为从该视频集合中抽取的视频,抽取时屏蔽该第一视频。在该至少一个第二视频为从视频集合中抽取的视频的情况下,服务器从该视频集合中随机抽取目标视频数量个第二视频,在抽取过程中,屏蔽该第一视频,也即是抽取出的目标视频数量个第二视频中不包括该第一视频,该目标视频数量由技术人员根据实际情况进行设置,本申请实施例对此不做限定。服务器将该第一视频和该至少一个第二视频分别组成至少一个视频对,每个视频对包括该第一视频和该至少一个第二视频中的一个第二视频。In some embodiments, the first video and the at least one second video belong to the same video set. The first video is the video for which the target video segment is to be determined, and the at least one second video is all videos in the video set except for the first video. Alternatively, the at least one second video is a video extracted from the video set, with the first video masked during extraction. When the at least one second video is a video extracted from the video set, the server randomly extracts a target number of second videos from the video set. During the extraction process, the first video is masked; that is, the extracted target number of second videos does not include the first video. The number of target videos is set by a technician according to actual conditions, and this application embodiment does not limit this. The server groups the first video and the at least one second video into at least one video pair, and each video pair includes the first video and one of the at least one second video.

比如,在该视频集合包括46个视频的情况下,对于每个第一视频i,服务器从该视频集合的剩余视频中随机抽取10个第二视频r,将该第一视频i和该10个第二视频r分别组成10个视频对,在后续处理过程中,以视频对为单位来进行,其中,10也即是目标视频数量。For example, if the video set includes 46 videos, for each first video i, the server randomly selects 10 second videos r from the remaining videos in the video set, and forms 10 video pairs with the first video i and the 10 second videos r respectively. In subsequent processing, the processing is carried out in units of video pairs, where 10 is the target number of videos.
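上述随机抽取并组成视频对的过程可以用如下示意代码说明。The sampling-and-pairing step above can be sketched as follows; this is an illustrative sketch with hypothetical names, where video ids stand in for the videos in the collection.

```python
import random

def sample_video_pairs(video_ids, first_id, target_count, seed=None):
    """Randomly draw `target_count` second videos from the collection,
    masking the first video, and pair each one with the first video."""
    rng = random.Random(seed)
    candidates = [v for v in video_ids if v != first_id]  # mask the first video
    seconds = rng.sample(candidates, target_count)        # without replacement
    return [(first_id, r) for r in seconds]

collection = list(range(46))          # a collection of 46 videos
pairs = sample_video_pairs(collection, first_id=3, target_count=10, seed=42)
print(len(pairs))  # 10
```

后续处理即以这些 (i, r) 视频对为单位进行。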

另外,在一些实施例中,服务器对该第一视频和该至少一个第二视频进行特征提取之前,对该第一视频和该至少一个第二视频进行抽帧,得到该第一视频的多个第一视频帧和各个第二视频的多个第二视频帧。通过对视频进行抽帧,能够减少后续特征提取过程的运算量,能够提升特征提取的效率。In some embodiments, before performing feature extraction on the first video and the at least one second video, the server performs frame extraction on the first video and the at least one second video to obtain multiple first video frames of the first video and multiple second video frames of each second video. By extracting frames from the video, the computational load of the subsequent feature extraction process can be reduced, thereby improving the efficiency of feature extraction.

以第一视频为例,服务器以目标间隔从第一视频中进行抽帧,得到该第一视频的多个第一视频帧,其中,目标间隔是指第一视频的目标播放时长,比如1s,或者,该目标间隔是指目标数量的帧间隔,比如25帧。在该目标间隔是指第一视频的目标播放时长的情况下,服务器每隔目标播放时长从该第一视频中抽取一帧作为第一视频帧。在第一视频为6s,目标播放时长为1s的情况下,服务器从该第一视频中抽取6个第一视频帧。在该目标间隔是指目标数量的帧间隔的情况下,服务器每隔目标数量的视频帧从该第一视频中进行抽取,得到多个第一视频帧。在第一视频包括100个视频帧,目标数量为10的情况下,服务器从该第一视频中抽取10个第一视频帧。比如,参见图4,服务器以目标间隔从第一视频400中进行抽帧,得到该第一视频的多个第一视频帧401。服务器将该第一视频的多个第一视频帧401输入特征提取模型402,通过该特征提取模型402输出该多个第一视频帧401的视频帧特征403。Taking the first video as an example, the server extracts frames from the first video at target intervals to obtain multiple first video frames. The target interval refers to the target playback duration of the first video, such as 1 second, or it refers to a target number of frame intervals, such as 25 frames. When the target interval refers to the target playback duration of the first video, the server extracts one frame from the first video every target playback duration. If the first video is 6 seconds long and the target playback duration is 1 second, the server extracts 6 first video frames from the first video. When the target interval refers to the target number of frame intervals, the server extracts frames from the first video every target number of video frames to obtain multiple first video frames. If the first video includes 100 video frames and the target number is 10, the server extracts 10 first video frames from the first video. For example, referring to Figure 4, the server extracts frames from the first video 400 at target intervals to obtain multiple first video frames 401. The server inputs multiple first video frames 401 of the first video into the feature extraction model 402, and outputs video frame features 403 of the multiple first video frames 401 through the feature extraction model 402.
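上述按帧间隔抽帧的过程可以用如下示意代码说明。The frame-interval variant of the sampling above can be sketched as follows; this illustrative sketch only computes the indices of the extracted frames, with a hypothetical function name.

```python
def sample_frame_indices(total_frames, frame_interval):
    """Pick one frame every `frame_interval` frames (the 'target number
    of frame intervals' variant of the target interval)."""
    return list(range(0, total_frames, frame_interval))

# 100 frames sampled every 10 frames -> 10 extracted first video frames
print(len(sample_frame_indices(100, 10)))  # 10
```

按目标播放时长抽帧时,同理可先将时长换算为帧数(时长乘以帧率)再调用该函数。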

需要说明的是,上述步骤301为可选步骤,既可以是服务器提前执行的,也可以是服务器在执行本申请实施例提供的技术方案时执行的,本申请实施例对此不做限定。It should be noted that step 301 above is an optional step. It can be executed by the server in advance or by the server when executing the technical solution provided in the embodiments of this application. The embodiments of this application do not limit this step.

302、服务器基于第一视频的视频帧特征以及至少一个第二视频的视频帧特征,确定多个视频帧对,该视频帧对包括相似度符合相似度条件的第一视频帧和第二视频帧,该第一视频帧属于该第一视频,该第二视频帧属于该至少一个第二视频。302. The server determines multiple video frame pairs based on the video frame features of the first video and the video frame features of at least one second video. The video frame pairs include a first video frame and a second video frame whose similarity meets the similarity condition. The first video frame belongs to the first video, and the second video frame belongs to the at least one second video.

在一种可能的实施方式中,服务器确定多个第一视频帧的视频帧特征与多个第二视频帧的视频帧特征之间的相似度。服务器将相似度符合目标条件的第一视频帧和第二视频帧确定为一个视频帧对,每个视频帧对包括一个第一视频帧和一个第二视频帧。In one possible implementation, the server determines the similarity between video frame features of a plurality of first video frames and video frame features of a plurality of second video frames. The server identifies first and second video frames whose similarity meets a target condition as a video frame pair, and each video frame pair includes one first video frame and one second video frame.

其中,视频帧特征之间的相似度通过欧氏距离或者余弦相似度来确定,本申请实施例对此不做限定。The similarity between video frame features is determined by Euclidean distance or cosine similarity; the embodiments of this application do not limit the specific method used.

在这种实施方式下,服务器能够基于第一视频帧和第二视频帧之间的相似度来确定多个视频帧对,由于视频帧对中的视频帧为不同视频中相似度较高的视频帧,后续基于视频帧对就能够快捷地确定出相似的视频片段,从而最终确定出目标视频片段,效率较高。In this implementation, the server can determine multiple video frame pairs based on the similarity between the first video frame and the second video frame. Since the video frames in the video frame pair are video frames with high similarity from different videos, similar video segments can be quickly determined based on the video frame pair, thereby finally determining the target video segment, which is highly efficient.

在相似度为欧式距离的情况下,服务器确定多个第一视频帧的视频帧特征与多个第二视频帧的视频帧特征之间的欧式距离。服务器将欧式距离小于或等于距离阈值的第一视频帧和第二视频帧确定为一个视频帧对。其中,距离阈值由技术人员根据实际情况进行设置,本申请实施例对此不做限定。在距离阈值为0.5的情况下,在任一第一视频帧的视频帧特征与任一第二视频帧的视频帧特征之间的欧式距离小于或等于0.5的情况下,服务器将该第一视频帧和该第二视频帧确定为一个视频帧对。When the similarity is determined by Euclidean distance, the server determines the Euclidean distance between the video frame features of multiple first video frames and the video frame features of multiple second video frames. The server identifies first and second video frames whose Euclidean distance is less than or equal to a distance threshold as a video frame pair. The distance threshold is set by a technician based on actual conditions, and this embodiment does not limit its setting. When the distance threshold is 0.5, if the Euclidean distance between the video frame features of any first video frame and the video frame features of any second video frame is less than or equal to 0.5, the server identifies the first and second video frames as a video frame pair.

在相似度为余弦相似度的情况下,服务器确定多个第一视频帧的视频帧特征与多个第二视频帧的视频帧特征之间的余弦相似度。服务器将余弦相似度大于或等于相似度阈值的第一视频帧和第二视频帧确定为一个视频帧对。在相似度阈值为0.8的情况下,在任一第一视频帧的视频帧特征与任一第二视频帧的视频帧特征之间的余弦相似度大于或等于0.8的情况下,服务器将该第一视频帧和该第二视频帧确定为一个视频帧对。When the similarity is cosine similarity, the server determines the cosine similarity between the video frame features of multiple first video frames and the video frame features of multiple second video frames. The server identifies first and second video frames with a cosine similarity greater than or equal to a similarity threshold as a video frame pair. When the similarity threshold is 0.8, if the cosine similarity between the video frame features of any first video frame and the video frame features of any second video frame is greater than or equal to 0.8, the server identifies that first and second video frame as a video frame pair.
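上述基于欧式距离阈值确定视频帧对的过程可以用如下示意代码说明。The Euclidean-distance pairing above can be sketched as follows; this is an illustrative sketch with hypothetical names, where each row of the arrays is one frame's feature vector.

```python
import numpy as np

def match_frame_pairs(first_feats, second_feats, dist_threshold=0.5):
    """Form (j, k) frame pairs whenever the Euclidean distance between
    first-video frame j and second-video frame k is <= the threshold."""
    pairs = []
    for j, fj in enumerate(first_feats):
        for k, fk in enumerate(second_feats):
            if np.linalg.norm(fj - fk) <= dist_threshold:
                pairs.append((j, k))
    return pairs

first = np.array([[0.0, 0.0], [1.0, 1.0]])   # two first-video frame features
second = np.array([[0.1, 0.0], [5.0, 5.0]])  # two second-video frame features
print(match_frame_pairs(first, second))  # [(0, 0)]
```

若采用余弦相似度,只需将距离比较改为相似度大于或等于相似度阈值(如 0.8)即可。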

在一些实施例中,在服务器将第一视频和至少一个第二视频组成至少一个视频对的情况下,服务器以视频对为单位来确定视频对中第一视频的视频帧特征以及第二视频的视频帧特征之间的相似度,从而确定视频对下的多个视频帧对。比如,对于视频对(i,r)来说,服务器确定第一视频i的视频帧特征与第二视频r的视频帧特征之间的相似度。服务器将相似度符合目标条件的第一视频帧和第二视频帧确定为一个视频帧对。也即是,对于第一视频i中每个视频帧j,确定第一视频帧j与第二视频r中每个第二视频帧的视频帧特征之间的欧式距离。服务器将欧式距离小于t0的第二视频帧作为第一视频帧j的相似帧,该第一视频帧j与该相似帧组成一个视频帧对。服务器将获取到的第一视频帧j的相似帧存储在第一列表中,该第一列表也被称为相似帧列表(sim-id-list)。在一些实施例中,服务器将帧的标识存储在该第一列表中,帧的标识用于指示帧所属的视频以及帧在视频中的位置。比如,对于j=1的第一视频帧,相似帧列表sim-id-list为[1,2,3],表示与第二视频r的第1、2、3秒对应的视频帧为相似帧,其中,j=1表示第一视频中第1秒对应的视频帧。In some embodiments, when the server forms at least one video pair with a first video and at least one second video, the server determines the similarity between the video frame features of the first video and the video frame features of the second video in the video pair on a per-video-pair basis, thereby determining multiple video frame pairs under the video pair. For example, for video pair (i, r), the server determines the similarity between the video frame features of the first video i and the video frame features of the second video r. The server determines the first video frame and the second video frame whose similarity meets the target condition as a video frame pair. That is, for each video frame j in the first video i, the Euclidean distance between the first video frame j and the video frame features of each second video frame in the second video r is determined. The server takes the second video frame with an Euclidean distance less than t0 as a similar frame of the first video frame j, and the first video frame j and the similar frame form a video frame pair. The server stores the obtained similar frames of the first video frame j in a first list, which is also called a similar frame list (sim-id-list). In some embodiments, the server stores the frame identifier in the first list, and the frame identifier is used to indicate the video to which the frame belongs and the position of the frame in the video. 
For example, for the first video frame j=1, the similar frame list sim-id-list is [1, 2, 3], which means that the video frames corresponding to the 1st, 2nd, and 3rd seconds of the second video r are similar frames, where j=1 means the video frame corresponding to the 1st second in the first video.

可选地,在步骤302之后,在确定出的视频帧对的数量为0的情况下,服务器确定该第一视频中不存在目标视频片段。Optionally, after step 302, if the number of determined video frame pairs is 0, the server determines that the target video segment does not exist in the first video.

303、服务器确定多个视频帧对的出现时间差值。303. The server determines the time difference between the occurrence of multiple video frame pairs.

在一种可能的实施方式中,服务器将该多个视频帧对中第二视频帧在第二视频中的出现时间与该视频帧对中第一视频帧在第一视频中的出现时间相减,得到该多个视频帧对的出现时间差值。在一些实施例中,服务器将该多个视频帧对的出现时间差值存储在第二列表中,该第二列表也被称为出现时间差值列表(diff-time-list),在后续处理过程中,能够直接从该第二列表中调用对应的出现时间差值。比如,对于j=1的第一视频帧,相似帧列表sim-id-list为[1,2,3],那么对应的出现时间差值列表diff-time-list为[0,1,2]。In one possible implementation, the server subtracts the occurrence time of the first video frame in the first video from the occurrence time of the second video frame in the second video to obtain the occurrence time differences of the multiple video frame pairs. In some embodiments, the server stores the occurrence time differences of the multiple video frame pairs in a second list, also known as the occurrence time difference list (diff-time-list), from which the corresponding occurrence time difference can be directly retrieved during subsequent processing. For example, for the first video frame with j=1, if the similar frame list sim-id-list is [1, 2, 3], then the corresponding occurrence time difference list diff-time-list is [0, 1, 2].
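上述由相似帧列表得到出现时间差值列表的过程可以用如下示意代码说明。The diff-time-list computation above can be sketched as follows; this is an illustrative sketch assuming the difference is the second frame's time minus the first frame's time, which matches the example (j=1 with sim-id-list [1, 2, 3] giving [0, 1, 2]).

```python
def diff_time_list(j, sim_id_list):
    """Occurrence-time differences for the similar frames of first-video
    frame j: second-frame time minus first-frame time."""
    return [sim - j for sim in sim_id_list]

print(diff_time_list(1, [1, 2, 3]))  # [0, 1, 2]
```

该列表与 sim-id-list 一一对应,后续分组步骤直接按差值聚合视频帧对。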

304、服务器基于该多个视频帧对的出现时间差值,将该多个视频帧对划分为多个视频帧组,同一个该视频帧组中的视频帧对对应于同一个出现时间差值,该出现时间差值是指该视频帧对中的两个视频帧在视频中的出现时间之间的差值。304. Based on the occurrence time difference of the multiple video frame pairs, the server divides the multiple video frame pairs into multiple video frame groups. The video frame pairs in the same video frame group correspond to the same occurrence time difference, which refers to the difference between the occurrence times of the two video frames in the video frame pair in the video.

在一种可能的实施方式中,服务器将出现时间差值相同的视频帧对划分为一个初始视频帧组。服务器基于多个初始视频帧组对应的出现时间差值,将该多个初始视频帧组进行融合,得到该多个视频帧组。In one possible implementation, the server groups video frame pairs with the same occurrence time difference into an initial video frame group. Based on the occurrence time difference corresponding to multiple initial video frame groups, the server merges these multiple initial video frame groups to obtain the multiple video frame groups.

其中,初始视频帧组包括多个出现时间差值相同的视频帧对,不同初始视频帧组对应于不同的出现时间差值,其中,初始视频帧组对应的出现时间差值是指该初始视频帧组中视频帧对的出现时间差值。The initial video frame group includes multiple pairs of video frames with the same occurrence time difference. Different initial video frame groups correspond to different occurrence time differences. The occurrence time difference of the initial video frame group refers to the occurrence time difference of the video frame pairs in the initial video frame group.

在这种实施方式下,出现时间差值相同的视频帧对中的视频帧可能会构成完整的视频片段,通过将视频帧对聚合为视频帧组,便于后续确定相似的视频片段。In this implementation, video frames in a pair of video frames with the same time difference may constitute a complete video segment. By aggregating video frame pairs into video frame groups, it is easier to identify similar video segments later.
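作为示意，下述Python代码草拟了将出现时间差值相同的视频帧对划分为初始视频帧组的过程，函数名为示意性假设。As an illustrative sketch of grouping video frame pairs with the same occurrence time difference into initial video frame groups (the function name is hypothetical):

```python
from collections import defaultdict

def group_by_diff(video_frame_pairs):
    # Each pair is [i, r]: the occurrence times of the first and second
    # video frames; pairs with the same difference r - i form one
    # initial video frame group, keyed by that difference.
    groups = defaultdict(list)
    for i, r in video_frame_pairs:
        groups[r - i].append([i, r])
    return dict(groups)
```

例如，对[3,5]和[2,4]来说，二者的出现时间差值均为2，因此被划入同一初始视频帧组。For example, [3, 5] and [2, 4] both have an occurrence time difference of 2 and therefore fall into the same initial video frame group.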

举例来说,服务器按照目标顺序对该多个初始视频帧组进行排序,得到多个候选视频帧组。在该多个候选视频帧组中任两个相邻的候选视频帧组之间的匹配时间差值符合匹配时间差值条件的情况下,服务器将该两个相邻的候选视频帧组融合为一个视频帧组,该匹配时间差值是指该两个相邻的候选视频帧组对应的出现时间差值之间的差值。For example, the server sorts the initial video frame groups according to the target order to obtain multiple candidate video frame groups. If the matching time difference between any two adjacent candidate video frame groups meets the matching time difference condition, the server merges the two adjacent candidate video frame groups into one video frame group. The matching time difference refers to the difference between the occurrence time differences of the corresponding two adjacent candidate video frame groups.

为了对上述举例中描述的技术过程进行更加清楚的说明,下面将分为两个部分对上述举例进行进一步说明。To provide a clearer explanation of the technical processes described in the examples above, the examples will be further explained in two parts below.

第一部分、服务器按照目标顺序对该多个初始视频帧组进行排序,得到多个候选视频帧组。Part 1: The server sorts the multiple initial video frame groups according to the target order to obtain multiple candidate video frame groups.

在一种可能的实施方式中，服务器按照对应出现时间差值从小至大的顺序对该多个初始视频帧组进行排序，得到多个候选视频帧组。在这种情况下，目标顺序是指出现时间差值从小至大的顺序。在一些实施例中，在任一初始视频帧组中，服务器按照视频帧对中第一视频帧在第一视频中出现时间的先后进行排序。In one possible implementation, the server sorts the multiple initial video frame groups in ascending order of their corresponding occurrence time differences to obtain multiple candidate video frame groups. In this case, the target order refers to the order of occurrence time differences from smallest to largest. In some embodiments, within any initial video frame group, the server sorts the video frame pairs according to the order of the occurrence times of their first video frames in the first video.

在这种实施方式下，服务器按照从小至大的顺序对该多个初始视频帧组进行排序，在得到的多个候选视频帧组中，任两个相邻的候选视频帧组对应的出现时间差值均较为接近，便于后续的融合过程。In this implementation, the server sorts the multiple initial video frame groups in ascending order. Among the multiple candidate video frame groups obtained, the occurrence time differences corresponding to any two adjacent candidate video frame groups are relatively close, which facilitates the subsequent fusion process.

举例来说,以多个初始视频帧组为[3,5],[11,12],[2,4],[4,6],[6,9],[7,10],[10,11],其中,每个括号代表一个视频帧对[i,r],括号中的前一个数字为第一视频帧i的标识,第二个数字为第二视频帧r的标识,该标识为视频帧在视频中的出现时间。对于视频帧对[3,5]来说,出现时间差值为5-3=2,对于视频帧对[6,9]来说,出现时间差值为9-6=3。服务器按照对应出现时间差值从小到大的顺序对该多个初始视频帧组进行排序,得到多个候选视频帧组[10,11],[11,12],[2,4],[3,5],[4,6],[6,9],[7,10]。For example, consider multiple initial video frame groups: [3, 5], [11, 12], [2, 4], [4, 6], [6, 9], [7, 10], [10, 11]. Each set of parentheses represents a video frame pair [i, r]. The first number in the parentheses is the identifier of the first video frame i, and the second number is the identifier of the second video frame r, which represents the appearance time of the video frame in the video. For the video frame pair [3, 5], the appearance time difference is 5 - 3 = 2, and for the video frame pair [6, 9], the appearance time difference is 9 - 6 = 3. The server sorts these initial video frame groups in ascending order of their respective appearance time differences, resulting in multiple candidate video frame groups: [10, 11], [11, 12], [2, 4], [3, 5], [4, 6], [6, 9], [7, 10].
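作为示意，下述Python代码草拟了上述排序规则，先按出现时间差值从小至大，再按第一视频帧出现时间先后排序，函数名为示意性假设。As an illustrative sketch of the sorting rule above (ascending by occurrence time difference, then by the first video frame's occurrence time; the function name is hypothetical):

```python
def sort_into_candidates(video_frame_pairs):
    # Primary key: occurrence time difference r - i (ascending);
    # secondary key: occurrence time of the first video frame in the
    # first video, so pairs inside one group stay in temporal order.
    return sorted(video_frame_pairs, key=lambda p: (p[1] - p[0], p[0]))
```

该草图可复现上述示例的排序结果。This sketch reproduces the ordering of the example above.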

在一种可能的实施方式中，服务器按照对应出现时间差值从大至小的顺序对该多个初始视频帧组进行排序，得到多个候选视频帧组。在这种情况下，目标顺序是指出现时间差值从大至小的顺序。在一些实施例中，在任一初始视频帧组中，服务器按照视频帧对中第一视频帧在第一视频中出现时间的先后进行排序。In another possible implementation, the server sorts the multiple initial video frame groups in descending order of their corresponding occurrence time differences to obtain multiple candidate video frame groups. In this case, the target order refers to the order of occurrence time differences from largest to smallest. In some embodiments, within any initial video frame group, the server sorts the video frame pairs according to the order of the occurrence times of their first video frames in the first video.

在这种实施方式下，服务器按照从大至小的顺序对该多个初始视频帧组进行排序，在得到的多个候选视频帧组中，任两个相邻的候选视频帧组对应的出现时间差值均较为接近，便于后续的融合过程。In this implementation, the server sorts the multiple initial video frame groups in descending order. Among the multiple candidate video frame groups obtained, the occurrence time differences corresponding to any two adjacent candidate video frame groups are relatively close, which facilitates the subsequent fusion process.

在一些实施例中，在采用第一列表来存储视频帧对，采用第二列表存储出现时间差值的情况下，服务器基于第一列表和第二列表生成第三列表，该第三列表用于存储视频帧对以及出现时间差值，该第三列表能够存储多个初始视频帧组，比如，该第三列表(match-dt-list)的形式为{d:{count,start-id,match-id-list},…}，其中，d为出现时间差值，d:{count,start-id,match-id-list}表示出现时间差值为d的初始视频帧组，count为该初始视频帧组中视频帧对的数量，start-id为第一视频帧的最小标识，match-id-list为该初始视频帧组中的视频帧对。In some embodiments, when a first list is used to store the video frame pairs and a second list is used to store the occurrence time differences, the server generates a third list based on the first list and the second list. The third list is used to store the video frame pairs and the occurrence time differences, and can store multiple initial video frame groups. For example, the third list (match-dt-list) takes the form {d: {count, start-id, match-id-list}, ...}, where d is an occurrence time difference, d: {count, start-id, match-id-list} represents the initial video frame group whose occurrence time difference is d, count is the number of video frame pairs in that initial video frame group, start-id is the minimum identifier of the first video frames, and match-id-list is the video frame pairs in that initial video frame group.
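作为示意，下述Python代码草拟了第三列表match-dt-list的一种构造方式，函数名为示意性假设。As an illustrative sketch of one way to build the third list (match-dt-list) described above (the function name is hypothetical):

```python
def build_match_dt_list(video_frame_pairs):
    # Third list: {d: {count, start-id, match-id-list}} where d is the
    # occurrence time difference, count the number of pairs with that d,
    # start-id the minimum first-frame identifier, and match-id-list the pairs.
    match_dt_list = {}
    for i, r in video_frame_pairs:
        d = r - i
        entry = match_dt_list.setdefault(
            d, {"count": 0, "start-id": i, "match-id-list": []})
        entry["count"] += 1
        entry["start-id"] = min(entry["start-id"], i)
        entry["match-id-list"].append([i, r])
    return match_dt_list
```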

第二部分、在该多个候选视频帧组中任两个相邻的候选视频帧组之间的匹配时间差值符合匹配时间差值条件的情况下,服务器将该两个相邻的候选视频帧组融合为一个视频帧组。Part Two: If the matching time difference between any two adjacent candidate video frame groups in the multiple candidate video frame groups meets the matching time difference condition, the server merges the two adjacent candidate video frame groups into one video frame group.

在一种可能的实施方式中,该两个相邻的候选视频帧组包括第一候选视频帧组和第二候选视频帧组,在该第一候选视频帧组对应的出现时间差值与该第二候选视频帧组对应的出现时间差值之间的匹配时间差值小于或等于匹配差值阈值的情况下,服务器将该第一候选视频帧组中的视频帧对添加至该第二候选视频帧组,得到该视频帧组。In one possible implementation, the two adjacent candidate video frame groups include a first candidate video frame group and a second candidate video frame group. When the matching time difference between the occurrence time difference corresponding to the first candidate video frame group and the occurrence time difference corresponding to the second candidate video frame group is less than or equal to the matching difference threshold, the server adds the video frame pair in the first candidate video frame group to the second candidate video frame group to obtain the video frame group.

其中，将该多个候选视频帧组融合为多个视频帧组包括多个迭代过程，在将该第一候选视频帧组和该第二候选视频帧组融合为一个视频帧组之后，服务器还能够确定新融合的该视频帧组与后一个候选视频帧组之间的匹配时间差值，在该匹配时间差值符合匹配时间差值条件的情况下，将该新融合的该视频帧组与后一个候选视频帧组进行再一次融合，融合过程与融合该第一候选视频帧组与该第二候选视频帧组的过程属于同一发明构思，实现过程不再赘述。当然，在该匹配时间差值不符合匹配时间差值条件的情况下，服务器再确定该后一个候选视频帧组与再后一个候选视频帧组之间的匹配时间差值，从而基于匹配时间差值进行进一步地处理。匹配差值阈值由技术人员根据实际情况进行设置，本申请实施例对此不做限定。The process of fusing the multiple candidate video frame groups into multiple video frame groups involves multiple iterations. After fusing the first candidate video frame group and the second candidate video frame group into one video frame group, the server can further determine the matching time difference between the newly fused video frame group and the next candidate video frame group. If that matching time difference meets the matching time difference condition, the newly fused video frame group is fused again with the next candidate video frame group. This fusion process is based on the same inventive concept as the fusion of the first and second candidate video frame groups, and its implementation will not be repeated here. Of course, if the matching time difference does not meet the matching time difference condition, the server then determines the matching time difference between that next candidate video frame group and the candidate video frame group after it, and performs further processing based on that matching time difference. The matching difference threshold is set by a technician according to the actual situation, which is not limited in the embodiments of this application.

在这种实施方式下,通过基于出现时间差值来融合候选视频帧组,能够减少候选视频帧组的数量,从而减少后续处理的运算量,提高运算效率。In this implementation, by fusing candidate video frame groups based on the time difference of occurrence, the number of candidate video frame groups can be reduced, thereby reducing the amount of computation in subsequent processing and improving computational efficiency.

举例来说，服务器确定该第一候选视频帧组和该第二候选视频帧组的匹配时间差值。在该匹配时间差值小于或等于匹配差值阈值的情况下，服务器基于该第二候选视频帧组对应的出现时间差值，采用参考第二视频帧替换目标第二视频帧，得到该视频帧组，该目标第二视频帧为新添加至该第二候选视频帧组中的第二视频帧，该参考第二视频帧为该第二视频中与目标第一视频帧之间的出现时间差值为该第二候选视频帧组对应的出现时间差值的第二视频帧，该目标第一视频帧为该目标第二视频帧所属视频帧对中的第一视频帧。For example, the server determines the matching time difference between the first candidate video frame group and the second candidate video frame group. If the matching time difference is less than or equal to the matching difference threshold, the server replaces the target second video frame with a reference second video frame based on the occurrence time difference corresponding to the second candidate video frame group, thus obtaining the video frame group. The target second video frame is a second video frame newly added to the second candidate video frame group; the reference second video frame is a second video frame in the second video whose occurrence time difference from the target first video frame equals the occurrence time difference corresponding to the second candidate video frame group; and the target first video frame is the first video frame in the video frame pair to which the target second video frame belongs.

在这种实施方式下，在将第一候选视频帧组中的视频帧对添加至第二候选视频帧组之后，服务器还能够根据第二候选视频帧组的出现时间差值对新添加至第二候选视频帧组中的视频帧对进行调整，以使得调整之后的视频帧对的出现时间差值与该第二候选视频帧组相同，保持视频帧对的出现时间差值与视频帧组的出现时间差值之间的一致性。In this implementation, after adding the video frame pairs of the first candidate video frame group to the second candidate video frame group, the server can also adjust the newly added video frame pairs according to the occurrence time difference of the second candidate video frame group, so that the occurrence time difference of the adjusted video frame pairs is the same as that of the second candidate video frame group, thus maintaining the consistency between the occurrence time difference of the video frame pairs and that of the video frame group.

为了更加清楚的进行说明,下面以第一候选视频帧组对应的出现时间差值为3,包括[6,9],[7,10]两个视频帧对,第二候选视频帧组对应的出现时间差值为2,包括[2,4],[3,5],[4,6]三个视频帧对,匹配差值阈值为3为例进行说明。由于第一候选视频帧组与第二候选视频帧组之间的匹配时间差值为1,那么服务器确定该匹配时间差值小于该匹配差值阈值,需要对该第一候选视频帧组和该第二候选视频帧组进行合并。服务器将第一候选视频帧组中的两个视频帧对[6,9]和[7,10]添加至第二候选视频帧组,该第二候选视频帧组变为[2,4],[3,5],[4,6],[6,9],[7,10],由于第二候选视频帧组对应的出现时间差值为2,那么服务器基于该出现时间差值2,将添加至第二候选视频帧组中两个视频帧对[6,9]和[7,10]中的第二视频帧进行调整,得到两个新的视频帧对[6,8]和[7,9]。对新加入第二候选视频帧组的第二视频帧进行调整之后,该第二候选视频帧组变为[2,4],[3,5],[4,6],[6,8],[7,9],每个视频帧对的出现时间差值均为2。To illustrate this more clearly, let's take the following example: the first candidate video frame group has an occurrence time difference of 3, including two video frame pairs [6, 9] and [7, 10]; the second candidate video frame group has an occurrence time difference of 2, including three video frame pairs [2, 4], [3, 5], and [4, 6]; and the matching difference threshold is 3. Since the matching time difference between the first and second candidate video frame groups is 1, the server determines that this matching time difference is less than the matching difference threshold and therefore needs to merge the first and second candidate video frame groups. The server adds two video frame pairs [6, 9] and [7, 10] from the first candidate video frame group to the second candidate video frame group, making the second candidate video frame group [2, 4], [3, 5], [4, 6], [6, 9], [7, 10]. Since the occurrence time difference of the corresponding second candidate video frame group is 2, the server adjusts the second video frame in the two video frame pairs [6, 9] and [7, 10] added to the second candidate video frame group based on this occurrence time difference of 2, resulting in two new video frame pairs [6, 8] and [7, 9]. After adjusting the second video frame newly added to the second candidate video frame group, the second candidate video frame group becomes [2, 4], [3, 5], [4, 6], [6, 8], [7, 9], with the occurrence time difference of each video frame pair being 2.
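作为示意，下述Python代码草拟了上述融合及调整过程，函数名与参数名均为示意性假设。As an illustrative sketch of the fusion and adjustment described above (the function and parameter names are hypothetical):

```python
def merge_into_second_group(first_group, second_group, second_diff, threshold):
    # first_group / second_group: lists of [i, r] video frame pairs;
    # second_diff: the occurrence time difference of the second candidate group.
    first_diff = first_group[0][1] - first_group[0][0]
    if abs(first_diff - second_diff) > threshold:
        return None  # matching time difference condition not met; no fusion
    # Add the first group's pairs to the second group, replacing each target
    # second video frame so every pair keeps the second group's difference.
    return second_group + [[i, i + second_diff] for i, _ in first_group]
```

以上述示例验证：融合[6,9],[7,10]（差值3）与[2,4],[3,5],[4,6]（差值2，阈值3）得到[2,4],[3,5],[4,6],[6,8],[7,9]。Checked against the example above: fusing [6, 9], [7, 10] (difference 3) into [2, 4], [3, 5], [4, 6] (difference 2, threshold 3) yields [2, 4], [3, 5], [4, 6], [6, 8], [7, 9].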

需要说明的是，上述是以服务器将第一候选视频帧组中的视频帧对添加至第二候选视频帧组中为例进行说明的，在其他可能的实施方式中，服务器也能够将第二候选视频帧组中的视频帧对添加至第一候选视频帧组。It should be noted that the above description takes the server adding the video frame pairs of the first candidate video frame group to the second candidate video frame group as an example. In other possible implementations, the server can also add the video frame pairs of the second candidate video frame group to the first candidate video frame group.

在一些实施例中,服务器基于第一候选视频帧组和第二候选视频帧组中视频帧对的数量来确定将第一候选视频帧组中的视频帧对添加至第二候选视频帧组,还是将第二候选视频帧组中的视频帧对添加至第一候选视频帧组。比如,在第一候选视频帧组中视频帧对的数量大于第二候选视频帧组中视频帧对的数量的情况下,服务器将该第二候选视频帧组中的视频帧对添加至该第一候选视频帧组。在第二候选视频帧组中视频帧对的数量大于第一候选视频帧组中视频帧对的数量的情况下,服务器将该第一候选视频帧组中的视频帧对添加至该第二候选视频帧组。在第二候选视频帧组中视频帧对的数量等于第一候选视频帧组中视频帧对的数量的情况下,服务器将该第一候选视频帧组中的视频帧对添加至该第二候选视频帧组。或者,在第二候选视频帧组中视频帧对的数量等于第一候选视频帧组中视频帧对的数量的情况下,服务器将该第二候选视频帧组中的视频帧对添加至该第一候选视频帧组。In some embodiments, the server determines whether to add video frame pairs from the first candidate video frame group to the second candidate video frame group or vice versa, based on the number of video frame pairs in the first candidate video frame group and the second candidate video frame group. For example, if the number of video frame pairs in the first candidate video frame group is greater than the number of video frame pairs in the second candidate video frame group, the server adds the video frame pairs from the second candidate video frame group to the first candidate video frame group. If the number of video frame pairs in the second candidate video frame group is greater than the number of video frame pairs in the first candidate video frame group, the server adds the video frame pairs from the first candidate video frame group to the second candidate video frame group. If the number of video frame pairs in the second candidate video frame group is equal to the number of video frame pairs in the first candidate video frame group, the server adds the video frame pairs from the first candidate video frame group to the second candidate video frame group. Alternatively, if the number of video frame pairs in the second candidate video frame group is equal to the number of video frame pairs in the first candidate video frame group, the server adds the video frame pairs from the second candidate video frame group to the first candidate video frame group.

在这种情况下,服务器能够根据候选视频帧组中视频帧对的数量来确定合并候选视频帧组的方式,将包括视频帧数量较少的候选视频帧组添加至包括视频帧数量较多的视频帧组,以减少运算量,提高效率。In this scenario, the server can determine how to merge candidate video frame groups based on the number of video frame pairs in each group, adding candidate video frame groups with fewer video frames to video frame groups with more video frames, thereby reducing computational load and improving efficiency.
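作为示意，下述Python代码草拟了按视频帧对数量选择合并方向的规则，数量相等时默认将group_a的视频帧对添加至group_b（即上述两种等数量处理方式之一），函数名为示意性假设。As an illustrative sketch of the merge-direction rule above (on a tie, group_a's pairs are added to group_b, one of the two equal-count options described; the function name is hypothetical):

```python
def merge_by_pair_count(group_a, group_b):
    # Add the pairs of the group with fewer video frame pairs to the
    # group with more pairs, keeping the bigger group's time difference.
    if len(group_a) > len(group_b):
        big, small = group_a, group_b
    else:
        big, small = group_b, group_a
    diff = big[0][1] - big[0][0]  # occurrence time difference of the bigger group
    return big + [[i, i + diff] for i, _ in small]
```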

305、对于该多个视频帧组中的任一视频帧组,服务器按照该视频帧组中视频帧对的第一视频帧在该第一视频中的出现时间,将该视频帧组中视频帧对的第一视频帧融合为一个候选视频片段。305. For any video frame group among the multiple video frame groups, the server merges the first video frames of the video frame pair in the video frame group into a candidate video segment according to the appearance time of the first video frame in the first video.

在一种可能的实施方式中,服务器比较该视频帧组中任两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间。在该两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间之间的差值符合出现时间条件的情况下,服务器将该两个相邻的视频帧对添加至临时帧列表。在该两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间之间的差值不符合出现时间条件的情况下,服务器将该临时帧列表中的视频帧对融合为一个参考视频片段。服务器基于多个参考视频片段,确定该至少一个候选视频片段。In one possible implementation, the server compares the occurrence times of the first video frames of any two adjacent video frame pairs in the video frame group within the first video. If the difference between the occurrence times of the first video frames of the two adjacent video frame pairs in the first video meets the occurrence time condition, the server adds the two adjacent video frame pairs to a temporary frame list. If the difference between the occurrence times of the first video frames of the two adjacent video frame pairs in the first video does not meet the occurrence time condition, the server merges the video frame pairs in the temporary frame list into a reference video segment. The server determines the at least one candidate video segment based on multiple reference video segments.

其中,临时帧列表用于存储出现时间之间的差值符合出现时间条件的视频帧对。在一些实施例中,出现时间之间的差值符合出现时间条件是指,出现时间之间的差值小于或等于出现时间差值阈值,出现时间差值阈值由技术人员根据实际情况进行设置,比如设置为8s,本申请实施例对此不做限定。The temporary frame list is used to store video frame pairs whose occurrence time differences meet the occurrence time condition. In some embodiments, the occurrence time difference meeting the occurrence time condition means that the occurrence time difference is less than or equal to an occurrence time difference threshold. The occurrence time difference threshold is set by technicians according to actual conditions, such as 8 seconds. This embodiment of the application does not limit this.

为了对上述实施方式进行更加清楚的说明,下面将分为四个部分对上述实施方式进行说明。To provide a clearer explanation of the above embodiments, the following description will be divided into four parts.

第一部分、服务器比较该视频帧组中任两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间。Part 1: The server compares the occurrence times of the first video frame in any two adjacent video frame pairs in the video frame group within the first video.

在一些实施例中，服务器以第一视频帧在第一视频中的出现时间作为第一视频帧的标识，以第二视频帧在第二视频中的出现时间作为第二视频帧的标识。在这种情况下，服务器比较任两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间时，比较这两个第一视频帧的标识即可。比如，该视频帧组包括视频帧对[2,4],[3,5],[4,6],[6,8],[7,9]，服务器依次比较视频帧对的第一视频帧在第一视频中的出现时间。在第一次比较过程中，服务器比较第一个视频帧对[2,4]的第一视频帧2与第二个视频帧对[3,5]的第一视频帧3在第一视频中的出现时间。In some embodiments, the server uses the occurrence time of a first video frame in the first video as the identifier of the first video frame, and the occurrence time of a second video frame in the second video as the identifier of the second video frame. In this case, when comparing the occurrence times in the first video of the first video frames of any two adjacent video frame pairs, the server only needs to compare the identifiers of these two first video frames. For example, if the video frame group includes the video frame pairs [2, 4], [3, 5], [4, 6], [6, 8], [7, 9], the server successively compares the occurrence times of the first video frames of the video frame pairs in the first video. In the first comparison, the server compares the occurrence time in the first video of the first video frame 2 of the first video frame pair [2, 4] with that of the first video frame 3 of the second video frame pair [3, 5].

第二部分、在该两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间之间的差值符合出现时间条件的情况下,服务器将该两个相邻的视频帧对添加至临时帧列表。Part Two: If the difference between the occurrence times of the first video frame in the first video of the two adjacent video frame pairs meets the occurrence time condition, the server adds the two adjacent video frame pairs to the temporary frame list.

在一种可能的实施方式中,在该两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间之间的差值小于或等于出现时间差值阈值的情况下,服务器将该两个相邻的视频帧对添加至临时帧列表。比如,还是以该视频帧组视频帧对[2,4],[3,5],[4,6],[6,8],[7,9]为例,对于视频帧对[2,4]和[3,5]来说,在出现时间差值阈值为3的情况下,由于[2,4]和[3,5]中的第一视频帧在第一视频中的出现时间差值为3-2=1,因此服务器将该两个视频帧对添加至临时帧列表(Tmplist),Tmplist=[[2,4],[3,5]]。In one possible implementation, if the difference between the occurrence times of the first video frames in the first video of two adjacent video frame pairs is less than or equal to an occurrence time difference threshold, the server adds the two adjacent video frame pairs to a temporary frame list. For example, taking the video frame group [2,4], [3,5], [4,6], [6,8], [7,9] as an example, for video frame pairs [2,4] and [3,5], when the occurrence time difference threshold is 3, since the occurrence time difference of the first video frames in [2,4] and [3,5] in the first video is 3-2=1, the server adds the two video frame pairs to a temporary frame list (Tmplist), Tmplist=[[2,4], [3,5]].

服务器确定将视频帧对添加至临时帧列表包括多个迭代过程，在任一迭代过程中，服务器比较当前视频帧对的第一视频帧与上一个视频帧对的第一视频帧在第一视频中的出现时间差值，这里当前视频帧对是指当前正在处理的视频帧对，上一个视频帧对是指上一次迭代过程中处理的视频帧对。比如，服务器在将视频帧对[2,4]和[3,5]添加至临时帧列表之后，进一步确定视频帧对[3,5]和[4,6]的第一视频帧在第一视频中的出现时间差值与出现时间差值阈值之间的关系，由于[3,5]和[4,6]中的第一视频帧在第一视频中的出现时间差值为4-3=1，因此服务器将视频帧对[4,6]添加至临时帧列表(Tmplist)，Tmplist=[[2,4],[3,5],[4,6]]。通过多个迭代过程，得到临时帧列表Tmplist=[[2,4],[3,5],[4,6],[6,8],[7,9]]。The server determines that adding video frame pairs to the temporary frame list involves multiple iterations. In any iteration, the server compares the difference between the occurrence times in the first video of the first video frame of the current video frame pair and that of the previous video frame pair. Here, the current video frame pair refers to the video frame pair currently being processed, and the previous video frame pair refers to the video frame pair processed in the previous iteration. For example, after adding video frame pairs [2, 4] and [3, 5] to the temporary frame list, the server further determines the relationship between the occurrence time difference in the first video of the first video frames of video frame pairs [3, 5] and [4, 6] and the occurrence time difference threshold. Since the occurrence time difference of the first video frames of [3, 5] and [4, 6] in the first video is 4-3=1, the server adds video frame pair [4, 6] to the temporary frame list (Tmplist), where Tmplist = [[2, 4], [3, 5], [4, 6]]. Through multiple iterations, the temporary frame list Tmplist = [[2, 4], [3, 5], [4, 6], [6, 8], [7, 9]] is obtained.

第三部分、在该两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间之间的差值不符合出现时间条件的情况下,服务器将该临时帧列表中的视频帧对融合为参考视频片段。Part Three: If the difference between the occurrence times of the first video frame in the first video of two adjacent video frame pairs does not meet the occurrence time condition, the server merges the video frame pairs in the temporary frame list into a reference video segment.

其中,参考视频片段包括第一子片段和第二子片段,第一子片段是由视频帧对中的第一视频帧构成的,第二子片段是由视频帧对中的第二视频帧构成的。The reference video segment includes a first sub-segment and a second sub-segment. The first sub-segment is composed of the first video frame in the video frame pair, and the second sub-segment is composed of the second video frame in the video frame pair.

在一种可能的实施方式中,在该两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间之间的差值大于出现时间差值阈值的情况下,服务器将该临时帧列表中的第一视频帧融合为第一子片段,将该临时帧列表中的第二视频帧融合为第二子片段,该第一子片段和该第二子片段构成该参考视频片段。由于视频帧对中的第一视频帧和第二视频帧是相似度较高的视频帧,那么第一子片段和第二子片段也即是相似度较高的片段。比如,参见图5,示出了第一子片段501和第二子片段502的形式,第一子片段501开头的第一个视频帧和第二子片段502开头的第一个视频帧构成一个视频帧对,第一子片段501结尾的第一个视频帧和第二子片段502结尾的第一个视频帧构成另一个视频帧对。在一些实施例中,一个参考视频片段中的第一子片段和第二子片段也被称为匹配段。In one possible implementation, if the difference in the occurrence time between the first video frames of two adjacent video frame pairs in the first video is greater than a threshold for occurrence time difference, the server merges the first video frames in the temporary frame list into a first sub-segment and the second video frames in the temporary frame list into a second sub-segment. The first and second sub-segments together constitute the reference video segment. Since the first and second video frames in the video frame pair are highly similar, the first and second sub-segments are also highly similar segments. For example, referring to Figure 5, which illustrates the form of first sub-segment 501 and second sub-segment 502, the first video frame at the beginning of first sub-segment 501 and the first video frame at the beginning of second sub-segment 502 constitute one video frame pair, and the first video frame at the end of first sub-segment 501 and the first video frame at the end of second sub-segment 502 constitute another video frame pair. In some embodiments, the first and second sub-segments in a reference video segment are also referred to as matching segments.

比如，该两个相邻的视频帧对为[9,11]和[2,4]，该两个视频帧对的第一视频帧在该第一视频中的出现时间之间的差值为9-2=7，那么服务器将临时帧列表中的视频帧对融合为一个参考视频片段。比如，临时帧列表Tmplist=[[2,4],[3,5],[4,6],[6,8],[7,9]]，那么服务器将该临时帧列表中的第一视频帧[2,],[3,],[4,],[6,],[7,]融合为第一子片段(2,7)，将该临时帧列表中的第二视频帧[,4],[,5],[,6],[,8],[,9]融合为第二子片段(4,9)，该第一子片段(2,7)和该第二子片段(4,9)构成该参考视频片段(2,7,4,9)，其中，该参考视频片段的格式为(src-startTime,src-endTime,ref-startTime,ref-endTime)，其中，src-startTime是指第一子片段的开头，也即是临时帧列表中序号最小的第一视频帧，src-endTime是指第一子片段的结尾，也即是临时帧列表中序号最大的第一视频帧，ref-startTime是指第二子片段的开头，也即是临时帧列表中序号最小的第二视频帧，ref-endTime是指第二子片段的结尾，也即是临时帧列表中序号最大的第二视频帧，序号是指视频帧的标识，表示视频帧在视频中的位置，序号越小，表示视频帧在视频中的位置越靠前，序号越大，表示视频帧在视频中的位置越靠后。在一些实施例中，服务器将参考视频片段存储在匹配段列表match-duration-list中。由于确定视频帧对时遍历了第一视频和第二视频的所有视频帧，可能会出现某一视频帧与多个视频帧相似的情况，从而会出现match-duration-list中存在两个参考视频片段的时间有交叠。For example, if the two adjacent video frame pairs are [9, 11] and [2, 4], the difference between the occurrence times in the first video of the first video frames of the two video frame pairs is 9-2=7, so the server fuses the video frame pairs in the temporary frame list into a reference video segment. For example, for the temporary frame list Tmplist = [[2, 4], [3, 5], [4, 6], [6, 8], [7, 9]], the server fuses the first video frames [2,], [3,], [4,], [6,], [7,] in the temporary frame list into the first sub-segment (2, 7), and fuses the second video frames [,4], [,5], [,6], [,8], [,9] in the temporary frame list into the second sub-segment (4, 9). The first sub-segment (2, 7) and the second sub-segment (4, 9) constitute the reference video segment (2, 7, 4, 9), where the format of the reference video segment is (src-startTime, src-endTime, ref-startTime, ref-endTime): src-startTime refers to the beginning of the first sub-segment, that is, the first video frame with the smallest sequence number in the temporary frame list; src-endTime refers to the end of the first sub-segment, that is, the first video frame with the largest sequence number in the temporary frame list; ref-startTime refers to the beginning of the second sub-segment, that is, the second video frame with the smallest sequence number in the temporary frame list; and ref-endTime refers to the end of the second sub-segment, that is, the second video frame with the largest sequence number in the temporary frame list. The sequence number is the identifier of a video frame and indicates its position in the video: a smaller sequence number indicates an earlier position in the video, and a larger sequence number indicates a later position in the video. In some embodiments, the server stores the reference video segments in a matching segment list (match-duration-list). Because all video frames of the first video and the second video are traversed when determining the video frame pairs, a single video frame may be similar to multiple video frames, so the times of two reference video segments in the match-duration-list may overlap.

在一些实施例中,该参考视频片段还能够携带第一子片段对应的出现时间差值、第一子片段的时长以及第一子片段包括的视频帧的数量等信息,以便于服务器调用。In some embodiments, the reference video segment may also carry information such as the occurrence time difference of the first sub-segment, the duration of the first sub-segment, and the number of video frames included in the first sub-segment, so that the server can call it.

另外,除了上述第三部分提供的方式之外,本申请实施例提供了另一种触发将临时帧列表中的视频帧对融合为参考视频片段的方法。In addition to the methods provided in Part III above, this application provides another method for triggering the fusion of video frame pairs in a temporary frame list into a reference video segment.

在一种可能的实施方式中，在当前处理的视频帧对为该视频帧组中最后一个视频帧对的情况下，服务器将视频帧对添加至临时帧列表，将该临时帧列表中的视频帧对融合为参考视频片段。比如，该视频帧组包括[2,4],[3,5],[4,6],[6,8],[7,9]五个视频帧对，在服务器处理视频帧对[7,9]时，由于该视频帧对[7,9]是该视频帧组中的最后一个视频帧对，服务器将该视频帧对[7,9]加入临时帧列表，将该临时帧列表中的视频帧对融合为参考视频片段，融合过程参见上一种实施方式的描述，在此不再赘述。In one possible implementation, when the currently processed video frame pair is the last video frame pair in the video frame group, the server adds the video frame pair to the temporary frame list and fuses the video frame pairs in the temporary frame list into a reference video segment. For example, if the video frame group includes five video frame pairs [2, 4], [3, 5], [4, 6], [6, 8], and [7, 9], when the server processes video frame pair [7, 9], since [7, 9] is the last video frame pair in the video frame group, the server adds video frame pair [7, 9] to the temporary frame list and fuses the video frame pairs in the temporary frame list into a reference video segment. The fusion process is described in the previous implementation and will not be repeated here.
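作为示意，下述Python代码草拟了步骤305中临时帧列表与参考视频片段融合的完整流程，同时覆盖两种触发融合的情况（出现时间差值超过阈值，以及处理到最后一个视频帧对），函数名为示意性假设。As an illustrative sketch of the whole temporary-frame-list fusion flow of step 305, covering both fusion triggers (gap exceeding the threshold, and reaching the last video frame pair; the function name is hypothetical):

```python
def fuse_reference_segments(video_frame_group, gap_threshold):
    # Fuse a video frame group into reference video segments of the form
    # (src-startTime, src-endTime, ref-startTime, ref-endTime), splitting
    # wherever consecutive first video frames are more than gap_threshold apart.
    segments, tmplist = [], [video_frame_group[0]]
    for pair in video_frame_group[1:]:
        if pair[0] - tmplist[-1][0] <= gap_threshold:
            tmplist.append(pair)  # occurrence time condition met: keep accumulating
        else:
            segments.append((tmplist[0][0], tmplist[-1][0],
                             tmplist[0][1], tmplist[-1][1]))
            tmplist = [pair]      # condition not met: fuse and start a new list
    # the last video frame pair also triggers fusion of the remaining list
    segments.append((tmplist[0][0], tmplist[-1][0],
                     tmplist[0][1], tmplist[-1][1]))
    return segments
```

以上述示例验证：对视频帧组[2,4],[3,5],[4,6],[6,8],[7,9]（阈值3），得到参考视频片段(2,7,4,9)。Checked against the example above: for the group [2, 4], [3, 5], [4, 6], [6, 8], [7, 9] with threshold 3, the reference video segment (2, 7, 4, 9) is obtained.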

第四部分、服务器基于多个参考视频片段,确定该至少一个候选视频片段。Part Four: The server determines at least one candidate video segment based on multiple reference video segments.

其中,该多个参考视频片段包括第一重合视频片段和/或第二重合视频片段,该第一重合视频片段是指属于该多个参考视频片段中第一参考视频片段的参考视频片段,该第二重合视频片段是指与该多个参考视频片段中第二参考视频片段部分重合的参考视频片段。The plurality of reference video segments includes a first overlapping video segment and/or a second overlapping video segment. The first overlapping video segment refers to a reference video segment that belongs to the first reference video segment among the plurality of reference video segments, and the second overlapping video segment refers to a reference video segment that partially overlaps with the second reference video segment among the plurality of reference video segments.

其中，第一重合视频片段属于该第一参考视频片段是指，第一重合视频片段的内容被该第一参考视频片段完全包含，或者说，第一参考视频片段完全包含了该第一重合视频片段。Wherein, the first overlapping video segment belonging to the first reference video segment means that the content of the first overlapping video segment is completely contained by the first reference video segment, in other words, the first reference video segment completely contains the first overlapping video segment.

为了对上述第四部分的内容进行更加清楚的说明,下面对服务器从多个参考视频片段中确定第一重合视频片段的方法进行说明。To provide a clearer explanation of the content in Part Four above, the method by which the server determines the first overlapping video segment from multiple reference video segments will be described below.

在一种可能的实施方式中,服务器基于该多个参考视频片段中第一子片段在第一视频中的出现时间,从该多个参考视频片段中确定第一重合视频片段。In one possible implementation, the server determines the first overlapping video segment from the plurality of reference video segments based on the occurrence time of the first sub-segment in the first video.

其中,第一子片段也即是第一视频帧构成的视频片段,出现时间包括第一子片段在第一视频中的开始时间和结束时间。The first sub-segment is a video segment composed of the first video frame, and its occurrence time includes the start and end times of the first sub-segment in the first video.

举例来说,对于该多个参考视频片段中的参考视频片段A1和参考视频片段B1,服务器比较该参考视频片段A1的第一子片段在第一视频中的出现时间以及该参考视频片段B1的第一子片段在第一视频中的出现时间,在该参考视频片段B1的第一子片段在第一视频中的出现时间是该参考视频片段A1的第一子片段在第一视频中的出现时间的子集的情况下,确定该参考视频片段B1为第一重合视频片段。比如,参见图6,该多个参考视频片段包括参考视频片段A1和参考视频片段B1,服务器比较该参考视频片段A1的第一子片段m1在该第一视频中的出现时间以及该参考视频片段B1的第一子片段n1在该第一视频中的出现时间。在该第一子片段n1的开始时间在第一子片段m1之后,且该第一子片段n1的结束时间在第一子片段m1之前的情况下,服务器将该参考视频片段B1确定为第一重合视频片段,该参考视频片段A1也即是上述第一参考视频片段。For example, for reference video segment A1 and reference video segment B1 among the multiple reference video segments, the server compares the occurrence time of the first sub-segment of reference video segment A1 in the first video with the occurrence time of the first sub-segment of reference video segment B1 in the first video. If the occurrence time of the first sub-segment of reference video segment B1 in the first video is a subset of the occurrence time of the first sub-segment of reference video segment A1 in the first video, then reference video segment B1 is determined to be the first overlapping video segment. For example, referring to Figure 6, the multiple reference video segments include reference video segment A1 and reference video segment B1 . The server compares the occurrence time of the first sub-segment m1 of reference video segment A1 in the first video with the occurrence time of the first sub-segment n1 of reference video segment B1 in the first video. If the start time of the first sub-segment n1 is after the start time of the first sub-segment m1 and the end time of the first sub-segment n1 is before the end time of the first sub-segment m1 , the server determines the reference video segment B1 as the first overlapping video segment, and the reference video segment A1 is the aforementioned first reference video segment.
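作为示意，下述Python代码草拟了第一重合视频片段的判定，出于简化，端点相等也视为包含，这一点是示意性假设（正文描述的是严格的之前/之后关系），函数名亦为假设。As an illustrative sketch of the first-overlapping-segment check (for simplicity, boundary equality is treated as containment, a hypothetical simplification of the strict before/after relation in the text; the function name is also hypothetical):

```python
def is_first_overlap(segment_a, segment_b):
    # Segment format: (src-startTime, src-endTime, ref-startTime, ref-endTime).
    # segment_b is a first overlapping video segment if the occurrence time
    # span of its first sub-segment lies within that of segment_a's.
    return segment_b[0] >= segment_a[0] and segment_b[1] <= segment_a[1]
```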

对服务器从多个参考视频片段中确定第一重合视频片段的方法进行说明之后,下面对服务器从多个参考视频片段中确定第二重合视频片段的方法进行说明。After explaining the method by which the server determines the first overlapping video segment from multiple reference video segments, the method by which the server determines the second overlapping video segment from multiple reference video segments will now be explained.

在一种可能的实施方式中,服务器基于该多个参考视频片段中第一子片段在第一视频中的出现时间,从该多个参考视频片段中确定第二重合视频片段。In one possible implementation, the server determines the second overlapping video segment from the plurality of reference video segments based on the occurrence time of the first sub-segment in the first video.

举例来说,对于该多个参考视频片段中的参考视频片段A2和参考视频片段B2,服务器比较该参考视频片段A2的第一子片段在第一视频中的出现时间以及该参考视频片段B2的第一子片段在第一视频中的出现时间,在该参考视频片段B2的第一子片段在第一视频中的出现时间与该参考视频片段A2的第一子片段在第一视频中的出现时间存在交集的情况下,将参考视频片段A2和参考视频片段B2中时长较短的参考视频片段确定为第二重合视频片段。比如,参见图6,该多个参考视频片段包括参考视频片段A2和参考视频片段B2,服务器比较该参考视频片段A2的第一子片段m2在该第一视频中的出现时间以及该参考视频片段B2的第一子片段n2在该第一视频中的出现时间。For example, for reference video segment A2 and reference video segment B2 among the multiple reference video segments, the server compares the occurrence time of the first sub-segment of reference video segment A2 in the first video with the occurrence time of the first sub-segment of reference video segment B2 in the first video. If there is an intersection between the occurrence time of the first sub-segment of reference video segment B2 in the first video and the occurrence time of the first sub-segment of reference video segment A2 in the first video, the reference video segment with the shorter duration between reference video segments A2 and B2 is determined to be the second overlapping video segment. For example, referring to Figure 6, the multiple reference video segments include reference video segment A2 and reference video segment B2. The server compares the occurrence time of the first sub-segment m2 of reference video segment A2 in the first video with the occurrence time of the first sub-segment n2 of reference video segment B2 in the first video.
If the start time of the first sub-segment n2 is after the start time of the first sub-segment m2 and before its end time, and the end time of the first sub-segment n2 is after the first sub-segment m2 , or if the start time of the first sub-segment n2 is before the first sub-segment m2 and the end time of the first sub-segment n2 is before the end time of the first sub-segment m2 and after its start time, and the duration of the reference video segment B2 is less than that of the reference video segment A2 , then the server determines the reference video segment B2 as the second overlapping video segment, and the reference video segment A2 is the aforementioned second reference video segment.
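The partial-overlap test and shorter-duration selection described above can be sketched as follows; the function names, the (start, end) interval representation, and the boundary handling are illustrative assumptions.

```python
def intersects(a, b):
    # Intersection test on (start, end) occurrence times.
    return a[0] < b[1] and b[0] < a[1]

def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def second_overlap(a, b):
    # Return the shorter of two segments when they partially overlap
    # (they intersect but neither contains the other), else None.
    if not intersects(a, b) or contains(a, b) or contains(b, a):
        return None
    return b if (b[1] - b[0]) < (a[1] - a[0]) else a

# m2 = (10, 40) and n2 = (30, 50) partially overlap; n2 is shorter, so it
# would be taken as the second overlapping video segment.
assert second_overlap((10.0, 40.0), (30.0, 50.0)) == (30.0, 50.0)
```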

在介绍完服务器确定第一重合视频片段和第二重合视频片段的方法之后,下面对上述第四部分提供的步骤进行说明。After introducing the method by which the server determines the first and second overlapping video segments, the steps provided in Part IV above will be explained below.

在一种可能的实施方式中,在该多个参考视频片段包括该第一重合视频片段的情况下,服务器将该第一重合视频片段删除,得到该至少一个候选视频片段。In one possible implementation, if the plurality of reference video segments include the first overlapping video segment, the server deletes the first overlapping video segment to obtain the at least one candidate video segment.

在这种实施方式下,服务器能够将重复的第一重合视频片段从多个参考视频片段中删除,以减少得到的候选视频片段的数量,减少运算量,提高运算效率。In this implementation, the server can remove duplicate first overlapping video segments from multiple reference video segments to reduce the number of candidate video segments obtained, reduce the amount of computation, and improve computational efficiency.

在一种可能的实施方式中,在该多个参考视频片段包括该第二重合视频片段的情况下,服务器将该第二重合视频片段与该第二参考视频片段之间的重合部分删除,得到该至少一个候选视频片段。In one possible implementation, if the plurality of reference video segments include the second overlapping video segment, the server deletes the overlapping portion between the second overlapping video segment and the second reference video segment to obtain the at least one candidate video segment.

在上述实施方式的基础上,可选地,服务器还能够执行下述步骤:Based on the above implementation method, the server may optionally also perform the following steps:

在一些实施例中,将该第二重合视频片段与该第二参考视频片段之间的重合部分删除之后,服务器比较第三类参考视频片段的时长与目标时长,该第三类参考视频片段是指删除重合部分的该第二重合视频片段。在该第三类参考视频片段的时长大于或等于该目标时长的情况下,服务器保留该第三类参考视频片段。在该第三类参考视频片段的时长小于该目标时长的情况下,服务器删除该第三类参考视频片段。In some embodiments, after deleting the overlapping portion between the second overlapping video segment and the second reference video segment, the server compares the duration of a third type of reference video segment with a target duration. The third type of reference video segment refers to the second overlapping video segment after the overlapping portion has been deleted. If the duration of the third type of reference video segment is greater than or equal to the target duration, the server retains the third type of reference video segment. If the duration of the third type of reference video segment is less than the target duration, the server deletes the third type of reference video segment.

其中,目标时长由技术人员根据实际情况进行设置,本申请实施例对此不做限定。在服务器保留该第三类参考视频片段的情况下,也即是采用该第三类参考视频片段替换了原本的第二重合视频片段。The target duration is set by technicians according to the actual situation, and this application embodiment does not limit it. When the server retains the third type of reference video segment, that is, the third type of reference video segment replaces the original second overlapping video segment.

下面通过两个例子对上述实施方式进行说明:The above implementation method is illustrated below with two examples:

例1、对于该多个参考视频片段中的参考视频片段A2和参考视频片段B2,在该参考视频片段A2的第一子片段m2和该参考视频片段B2的第一子片段n2具有部分重合,且第一子片段m2的开始时间早于第一子片段n2的情况下,服务器将第一子片段n2的开始时间移动到第一子片段m2的结束时间,得到子片段l1,该子片段l1为第三类参考视频片段的第一子片段。在该子片段l1的时长小于该目标时长的情况下,服务器删除该子片段l1,同时删除该子片段l1所属的第三类参考视频片段。在该子片段l1的时长大于或等于该目标时长的情况下,服务器保留该子片段l1,同时保留该子片段l1所属的第三类参考视频片段。Example 1: For reference video segment A2 and reference video segment B2 among the multiple reference video segments, if the first sub-segment m2 of reference video segment A2 and the first sub-segment n2 of reference video segment B2 partially overlap, and the start time of the first sub-segment m2 is earlier than that of the first sub-segment n2, the server moves the start time of the first sub-segment n2 to the end time of the first sub-segment m2, resulting in sub-segment l1. Sub-segment l1 is the first sub-segment of the third type of reference video segment. If the duration of sub-segment l1 is less than the target duration, the server deletes sub-segment l1 and the third type of reference video segment to which sub-segment l1 belongs. If the duration of sub-segment l1 is greater than or equal to the target duration, the server retains sub-segment l1 and the third type of reference video segment to which sub-segment l1 belongs.

例2、对于该多个参考视频片段中的参考视频片段A2和参考视频片段B2,在该参考视频片段A2的第一子片段m2和该参考视频片段B2的第一子片段n2具有部分重合,且第一子片段n2的开始时间早于第一子片段m2的情况下,服务器将第一子片段n2的结束时间移动到第一子片段m2的开始时间,得到子片段l2,该子片段l2为第三类参考视频片段的第一子片段。在该子片段l2的时长小于该目标时长的情况下,服务器删除该子片段l2,同时删除该子片段l2所属的第三类参考视频片段。在该子片段l2的时长大于或等于该目标时长的情况下,服务器保留该子片段l2,同时保留该子片段l2所属的第三类参考视频片段。Example 2: For reference video segment A2 and reference video segment B2 among the multiple reference video segments, if the first sub-segment m2 of reference video segment A2 and the first sub-segment n2 of reference video segment B2 partially overlap, and the start time of the first sub-segment n2 is earlier than that of the first sub-segment m2, the server moves the end time of the first sub-segment n2 to the start time of the first sub-segment m2, resulting in sub-segment l2. Sub-segment l2 is the first sub-segment of the third type of reference video segment. If the duration of sub-segment l2 is less than the target duration, the server deletes sub-segment l2 and the third type of reference video segment to which sub-segment l2 belongs. If the duration of sub-segment l2 is greater than or equal to the target duration, the server retains sub-segment l2 and the third type of reference video segment to which sub-segment l2 belongs.
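The trimming performed in examples 1 and 2, together with the target-duration filter, can be sketched as follows; the interval representation and the equality-boundary handling (the remainder is retained when its duration reaches the target duration) are illustrative assumptions.

```python
def trim_overlap(m, n, target_duration):
    # Remove from n the part that overlaps m, as in examples 1 and 2:
    # if m starts first, n's start moves to m's end; otherwise n's end
    # moves to m's start.  Return the remainder, or None when it is too
    # short to keep.
    if m[0] <= n[0]:
        start, end = max(n[0], m[1]), n[1]
    else:
        start, end = n[0], min(n[1], m[0])
    if end > start and end - start >= target_duration:
        return (start, end)
    return None

# Example 1 shape: m starts first, so n's start is moved to m's end.
assert trim_overlap((10.0, 30.0), (20.0, 50.0), 5.0) == (30.0, 50.0)
# Example 2 shape: n starts first, so n's end is moved to m's start.
assert trim_overlap((20.0, 50.0), (10.0, 30.0), 5.0) == (10.0, 20.0)
```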

306、服务器基于该至少一个候选视频片段,确定该至少一个目标候选视频片段,该目标候选视频片段在该至少一个候选视频片段中的出现次数符合次数条件。306. Based on the at least one candidate video segment, the server determines the at least one target candidate video segment, wherein the number of times the target candidate video segment appears in the at least one candidate video segment meets the frequency condition.

在一种可能的实施方式中,服务器基于该至少一个候选视频片段,确定至少一个参考候选视频片段。服务器确定每个参考候选视频片段在该至少一个参考候选视频片段的出现次数。服务器将出现次数符合该出现次数条件的参考候选视频片段确定为目标候选视频片段。In one possible implementation, the server determines at least one reference candidate video segment based on the at least one candidate video segment. The server determines the number of times each reference candidate video segment appears in the at least one reference candidate video segment. The server determines the reference candidate video segments whose appearance count meets the appearance count condition as target candidate video segments.

其中,参考候选视频片段在该至少一个参考候选视频片段的出现次数是指,该至少一个参考候选视频片段中该参考候选视频片段的数量。比如,该至少一个参考候选视频片段为1,2,3,1,4,5,那么对于参考候选视频片段1来说,出现次数就为2。The number of times a reference candidate video segment appears in the at least one reference candidate video segment refers to the number of that reference candidate video segment among the at least one reference candidate video segments. For example, if the at least one reference candidate video segment is 1, 2, 3, 1, 4, 5, then the number of times reference candidate video segment 1 appears is 2.
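The occurrence count in this example can be reproduced with a standard counter; the segment ids are those of the example above, and the use of Python's Counter is an illustrative assumption.

```python
from collections import Counter

# The reference candidate video segments from the example above, by id.
reference_candidates = [1, 2, 3, 1, 4, 5]
occurrences = Counter(reference_candidates)
assert occurrences[1] == 2  # segment 1 appears twice, as stated above
```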

为了对上述实施方式进行说明,下面将分为三个部分对上述实施方式进行说明。To illustrate the above embodiments, the following description will be divided into three parts.

第一部分、服务器基于该至少一个候选视频片段,确定至少一个参考候选视频片段。Part 1: Based on the at least one candidate video segment, the server determines at least one reference candidate video segment.

其中,该至少一个候选视频片段包括第三重合视频片段和/或第四重合视频片段,该第三重合视频片段是指属于该至少一个候选视频片段中第一候选视频片段的候选视频片段,该第四重合视频片段是指与该至少一个候选视频片段中第二候选视频片段部分重合的候选视频片段。The at least one candidate video segment includes a third overlapping video segment and/or a fourth overlapping video segment. The third overlapping video segment refers to a candidate video segment that belongs to the first candidate video segment among the at least one candidate video segments, and the fourth overlapping video segment refers to a candidate video segment that partially overlaps with the second candidate video segment among the at least one candidate video segments.

为了对上述第一部分的内容进行更加清楚的说明,下面对服务器从至少一个候选视频片段中确定第三重合视频片段的方法进行说明。To provide a clearer explanation of the content in the first part above, the following describes the method by which the server determines a third overlapping video segment from at least one candidate video segment.

在一种可能的实施方式中,服务器基于该至少一个候选视频片段中第一子片段在第一视频中的出现时间,从该至少一个候选视频片段中确定第三重合视频片段。In one possible implementation, the server determines a third overlapping video segment from the at least one candidate video segment based on the occurrence time of the first sub-segment in the first video.

其中,候选视频片段包括第一子片段和第二子片段,第一子片段是由视频帧对中的第一视频帧构成的,第二子片段是由视频帧对中的第二视频帧构成的。The candidate video segments include a first sub-segment and a second sub-segment. The first sub-segment is composed of the first video frame in the video frame pair, and the second sub-segment is composed of the second video frame in the video frame pair.

举例来说,该至少一个候选视频片段为两个候选视频片段,对于该至少一个候选视频片段中的候选视频片段C1和候选视频片段D1,服务器比较该候选视频片段C1的第一子片段在第一视频中的出现时间以及该候选视频片段D1的第一子片段在第一视频中的出现时间,在该候选视频片段D1的第一子片段在第一视频中的出现时间是该候选视频片段C1的第一子片段在第一视频中的出现时间的子集的情况下,确定该候选视频片段D1为第三重合视频片段。For example, if the at least one candidate video segment is two candidate video segments, for candidate video segment C1 and candidate video segment D1 in the at least one candidate video segment, the server compares the occurrence time of the first sub-segment of candidate video segment C1 in the first video with the occurrence time of the first sub-segment of candidate video segment D1 in the first video. If the occurrence time of the first sub-segment of candidate video segment D1 in the first video is a subset of the occurrence time of the first sub-segment of candidate video segment C1 in the first video, then candidate video segment D1 is determined to be the third overlapping video segment.

比如,该至少一个候选视频片段为两个候选视频片段,包括候选视频片段C1和候选视频片段D1,服务器比较该候选视频片段C1的第一子片段o1在该第一视频中的出现时间以及该候选视频片段D1的第一子片段p1在该第一视频中的出现时间。在该第一子片段p1的开始时间在第一子片段o1之后,且该第一子片段p1的结束时间在第一子片段o1之前的情况下,服务器将该候选视频片段D1确定为第三重合视频片段,该候选视频片段C1也即是上述第一候选视频片段。For example, the at least one candidate video segment comprises two candidate video segments, namely candidate video segment C1 and candidate video segment D1. The server compares the occurrence time of the first sub-segment o1 of candidate video segment C1 in the first video with the occurrence time of the first sub-segment p1 of candidate video segment D1 in the first video. If the start time of the first sub-segment p1 is after the start time of the first sub-segment o1 and the end time of the first sub-segment p1 is before the end time of the first sub-segment o1, the server determines candidate video segment D1 as the third overlapping video segment, and candidate video segment C1 is the aforementioned first candidate video segment.

对服务器从至少一个候选视频片段中确定第三重合视频片段的方法进行说明之后,下面对服务器从至少一个候选视频片段中确定第四重合视频片段的方法进行说明。After describing the method by which the server determines the third overlapping video segment from at least one candidate video segment, the method by which the server determines the fourth overlapping video segment from at least one candidate video segment will now be described.

在一种可能的实施方式中,服务器基于该至少一个候选视频片段中第一子片段在第一视频中的出现时间,从该至少一个候选视频片段中确定第四重合视频片段。In one possible implementation, the server determines a fourth overlapping video segment from the at least one candidate video segment based on the occurrence time of the first sub-segment in the first video.

举例来说,该至少一个候选视频片段为两个候选视频片段,对于该至少一个候选视频片段中的候选视频片段C2和候选视频片段D2,服务器比较该候选视频片段C2的第一子片段在第一视频中的出现时间以及该候选视频片段D2的第一子片段在第一视频中的出现时间,在该候选视频片段D2的第一子片段在第一视频中的出现时间与该候选视频片段C2的第一子片段在第一视频中的出现时间的存在交集的情况下,将候选视频片段C2和候选视频片段D2中时长较短的候选视频片段确定为第四重合视频片段。For example, the at least one candidate video segment may be two candidate video segments. For candidate video segment C2 and candidate video segment D2 , the server compares the occurrence time of the first sub-segment of candidate video segment C2 in the first video with the occurrence time of the first sub-segment of candidate video segment D2 in the first video. If there is an intersection between the occurrence time of the first sub-segment of candidate video segment D2 and the occurrence time of the first sub-segment of candidate video segment C2 in the first video, the candidate video segment with shorter duration between candidate video segment C2 and candidate video segment D2 is determined as the fourth overlapping video segment.

比如,该至少一个候选视频片段为两个候选视频片段,包括候选视频片段C2和候选视频片段D2,服务器比较该候选视频片段C2的第一子片段o2在该第一视频中的出现时间以及该候选视频片段D2的第一子片段p2在该第一视频中的出现时间。在该第一子片段p2的开始时间在第一子片段o2的开始时间之后,结束时间之前,且该第一子片段p2的结束时间在第一子片段o2之后,或者该第一子片段p2的开始时间在第一子片段o2之前,且该第一子片段p2的结束时间在第一子片段o2的结束时间之前,开始时间之后,候选视频片段D2的时长小于候选视频片段C2的情况下,服务器将该候选视频片段D2确定为第四重合视频片段,该候选视频片段C2也即是上述第二候选视频片段。For example, the at least one candidate video segment comprises two candidate video segments, namely candidate video segment C2 and candidate video segment D2. The server compares the occurrence time of the first sub-segment o2 of candidate video segment C2 in the first video with the occurrence time of the first sub-segment p2 of candidate video segment D2 in the first video. If the start time of the first sub-segment p2 is after the start time of the first sub-segment o2 and before its end time, and the end time of the first sub-segment p2 is after the first sub-segment o2, or if the start time of the first sub-segment p2 is before the first sub-segment o2 and the end time of the first sub-segment p2 is before the end time of the first sub-segment o2 and after its start time, and the duration of candidate video segment D2 is less than that of candidate video segment C2, the server determines candidate video segment D2 as the fourth overlapping video segment, and candidate video segment C2 is the aforementioned second candidate video segment.

在介绍完服务器确定第三重合视频片段和第四重合视频片段的方法之后,下面对上述第一部分提供的步骤进行说明。After introducing the method by which the server determines the third and fourth overlapping video segments, the steps provided in the first part above will be explained below.

在一种可能的实施方式中,在该至少一个候选视频片段包括该第三重合视频片段的情况下,服务器将该第三重合视频片段删除,得到该至少一个参考候选视频片段。在一些实施例中,在删除该第三重合视频片段之前,服务器将该第三重合视频片段的出现次数叠加到该第一候选视频片段上。由于第三重合视频片段被第一候选视频片段完全包含,那么将该第三重合视频片段的出现次数叠加到该第一候选视频片段上能够提高该第一候选视频片段在后续处理中的权重。In one possible implementation, if the at least one candidate video segment includes the third overlapping video segment, the server deletes the third overlapping video segment to obtain the at least one reference candidate video segment. In some embodiments, before deleting the third overlapping video segment, the server adds the occurrence count of the third overlapping video segment to the first candidate video segment. Since the third overlapping video segment is completely contained by the first candidate video segment, adding the occurrence count of the third overlapping video segment to the first candidate video segment can increase the weight of the first candidate video segment in subsequent processing.

在这种实施方式下,服务器能够将重复的第三重合视频片段从至少一个候选视频片段中删除,以减少得到的参考候选视频片段的数量,减少运算量,提高运算效率。In this implementation, the server can remove duplicate third overlapping video segments from at least one candidate video segment, thereby reducing the number of reference candidate video segments obtained, reducing the amount of computation, and improving computational efficiency.

下面通过一个具体的例子来进行说明。The following is a specific example to illustrate this.

在该候选视频片段D1的第一子片段p1是该候选视频片段C1的第一子片段o1的子集,且第一子片段p1的时长>0.5*第一子片段o1的时长的情况下,服务器删除第一子片段p1,同时删除该候选视频片段D1,将该候选视频片段D1的出现次数叠加到该候选视频片段C1上。If the first sub-segment p1 of candidate video segment D1 is a subset of the first sub-segment o1 of candidate video segment C1, and the duration of the first sub-segment p1 is greater than 0.5 * the duration of the first sub-segment o1, the server deletes the first sub-segment p1 and the candidate video segment D1, and adds the occurrence count of candidate video segment D1 to candidate video segment C1.

在上述实施方式的基础上,可选地,服务器将第三重合视频片段的出现次数叠加到该第一候选视频片段上之前,还能够确定该第三重合视频片段的时长和第一候选视频片段的时长,基于该第三重合视频片段的时长和第一候选视频片段的时长来确定是否将该第三重合视频片段的出现次数叠加到该第一候选视频片段上。Based on the above implementation, optionally, before the server adds the occurrence count of the third overlapping video segment to the first candidate video segment, it can also determine the duration of the third overlapping video segment and the duration of the first candidate video segment, and determine whether to add the occurrence count of the third overlapping video segment to the first candidate video segment based on the duration of the third overlapping video segment and the duration of the first candidate video segment.

比如,服务器确定该第三重合视频片段的时长和第一候选视频片段的时长。服务器确定该第三重合视频片段的时长与第一候选视频片段的时长之间的第一比值,在该第一比值大于或等于比值阈值的情况下,服务器将该第三重合视频片段的出现次数叠加到该第一候选视频片段上;在该第一比值小于比值阈值的情况下,服务器不将该第三重合视频片段的出现次数叠加到该第一候选视频片段上,其中,比值阈值由技术人员根据实际情况进行设置,比如设置为0.5,本申请实施例对此不做限定。For example, the server determines the duration of the third overlapping video segment and the duration of the first candidate video segment. The server determines a first ratio between the duration of the third overlapping video segment and the duration of the first candidate video segment. If the first ratio is greater than or equal to a ratio threshold, the server adds the occurrence count of the third overlapping video segment to the first candidate video segment; if the first ratio is less than the ratio threshold, the server does not add the occurrence count of the third overlapping video segment to the first candidate video segment. The ratio threshold is set by a technician according to the actual situation, for example, it is set to 0.5. This application embodiment does not limit this.
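The ratio-threshold decision described above can be sketched as follows; the dict-based count bookkeeping and the (start, end) interval representation are illustrative assumptions, and 0.5 is the example threshold mentioned in the text.

```python
def fold_count(contained, container, counts, ratio_threshold=0.5):
    # `contained` and `container` are (start, end) segments; `counts`
    # maps a segment to its occurrence count.  The contained segment's
    # count is folded into the container's only when the first ratio
    # (contained duration / container duration) reaches the threshold.
    ratio = (contained[1] - contained[0]) / (container[1] - container[0])
    if ratio >= ratio_threshold:
        counts[container] += counts[contained]

counts = {(0.0, 10.0): 1, (2.0, 8.0): 2}
fold_count((2.0, 8.0), (0.0, 10.0), counts)
assert counts[(0.0, 10.0)] == 3  # ratio 6/10 = 0.6 >= 0.5, so counts fold
```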

在一种可能的实施方式中,在该至少一个候选视频片段包括该第四重合视频片段,且该第四重合视频片段与该第二候选视频片段之间的重合度符合重合度条件的情况下,服务器确定该第四重合视频片段的出现次数。服务器基于该第四重合视频片段的出现次数,确定该至少一个参考候选视频片段。In one possible implementation, if the at least one candidate video segment includes the fourth overlapping video segment, and the overlap between the fourth overlapping video segment and the second candidate video segment meets the overlap condition, the server determines the number of occurrences of the fourth overlapping video segment. Based on the number of occurrences of the fourth overlapping video segment, the server determines the at least one reference candidate video segment.

其中,重合度是指,重合的视频片段的时长与被比较的视频片段的时长之间的比值。比如,对于第四重合视频片段和第二候选视频片段,第二候选视频片段为被比较的视频片段,确定第四重合视频片段和第二候选视频片段之间的重合度时,将第四重合视频片段与第二候选视频片段之间重合的视频片段的时长与第二候选视频片段的时长相除即可得到。重合度符合重合度条件是指,重合度大于或等于重合度阈值。The overlap ratio refers to the ratio between the duration of the overlapping video segments and the duration of the video segment being compared. For example, considering a fourth overlapping video segment and a second candidate video segment (the second candidate segment being compared), determining the overlap ratio between them involves dividing the duration of the overlapping video segment by the duration of the second candidate video segment. A successful overlap condition is defined as an overlap ratio greater than or equal to an overlap ratio threshold.

下面通过两种实施方式对上述实施方式中服务器基于该第四重合视频片段的出现次数,确定该至少一个参考候选视频片段的方法进行说明。The following describes, through two implementation methods, the method by which the server determines the at least one reference candidate video segment based on the occurrence frequency of the fourth overlapping video segment in the above implementation.

实施方式1、在该第四重合视频片段的出现次数大于或等于第一出现次数阈值的情况下,服务器将该第四重合视频片段与第二候选视频片段进行融合,得到该至少一个参考候选视频片段。在一些实施例中,在该第四重合视频片段与第二候选视频片段进行融合之前,服务器将该第四重合视频片段的出现次数叠加到该第二候选视频片段上。Implementation Method 1: If the number of occurrences of the fourth overlapping video segment is greater than or equal to a first occurrence threshold, the server merges the fourth overlapping video segment with the second candidate video segment to obtain the at least one reference candidate video segment. In some embodiments, before merging the fourth overlapping video segment with the second candidate video segment, the server adds the occurrence count of the fourth overlapping video segment to the second candidate video segment.

其中,第一出现次数阈值由技术人员根据实际情况进行设置,比如设置为3,本申请实施例对此不做限定。出现次数大于或等于第一出现次数阈值表示该第四重合视频片段不可忽略,需要进行进一步处理以提高获取目标视频片段的准确性。The first occurrence threshold is set by technicians according to the actual situation, such as 3, but this application embodiment does not limit this. An occurrence count greater than or equal to the first occurrence threshold indicates that the fourth overlapping video segment cannot be ignored and requires further processing to improve the accuracy of obtaining the target video segment.

下面对上述实施方式中服务器将该第四重合视频片段与第二候选视频片段进行融合的方法进行说明。The method by which the server merges the fourth overlapping video segment with the second candidate video segment in the above embodiment will be described below.

在一些实施例中,以第四重合视频片段的时长小于该第二候选视频片段为例,服务器从该第四重合视频片段中将与该第二候选视频片段之间的重复部分删除,将剩余部分添加到该第二候选视频片段上,得到一个候选视频片段。比如,参见图7,第四重合视频片段701的时长小于该第二候选视频片段702,第四重合视频片段704的时长小于该第二候选视频片段705。在该第四重合视频片段701的结束时间晚于该第二候选视频片段702的情况下,服务器将该第四重合视频片段701与该第二候选视频片段702融合,得到一个候选视频片段703。在该第四重合视频片段704的开始时间早于该第二候选视频片段705的情况下,服务器将该第四重合视频片段704与该第二候选视频片段705融合,得到一个候选视频片段706。In some embodiments, taking the example where the duration of the fourth overlapping video segment is shorter than that of the second candidate video segment, the server deletes the duplicated portion between the fourth overlapping video segment and the second candidate video segment, and adds the remaining portion to the second candidate video segment to obtain a candidate video segment. For example, referring to Figure 7, the duration of the fourth overlapping video segment 701 is shorter than that of the second candidate video segment 702, and the duration of the fourth overlapping video segment 704 is shorter than that of the second candidate video segment 705. If the end time of the fourth overlapping video segment 701 is later than that of the second candidate video segment 702, the server merges the fourth overlapping video segment 701 with the second candidate video segment 702 to obtain a candidate video segment 703. If the start time of the fourth overlapping video segment 704 is earlier than that of the second candidate video segment 705, the server merges the fourth overlapping video segment 704 with the second candidate video segment 705 to obtain a candidate video segment 706.
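The fusion described above, which deletes the duplicated portion and appends the remainder, amounts to taking the union of the two partially overlapping intervals; the sketch below and its sample values are illustrative assumptions, not part of the embodiments.

```python
def fuse(fourth, second):
    # Merging a partially overlapping shorter segment into the second
    # candidate: drop the duplicated part, keep the non-overlapping
    # remainder, i.e. take the union of the two (start, end) intervals.
    return (min(fourth[0], second[0]), max(fourth[1], second[1]))

# Like segments 701/702 in Figure 7: the overlapping segment ends later
# than the candidate, so fusion extends the candidate's end.
assert fuse((30.0, 50.0), (10.0, 40.0)) == (10.0, 50.0)
# Like segments 704/705: it starts earlier, so fusion moves the start.
assert fuse((5.0, 15.0), (10.0, 40.0)) == (5.0, 40.0)
```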

实施方式2、在该第四重合视频片段的出现次数小于该第一出现次数阈值的情况下,服务器将该第四重合视频片段删除,得到该至少一个参考候选视频片段。服务器将该第四重合视频片段的出现次数叠加到该第二候选视频片段上。Implementation Method 2: If the number of occurrences of the fourth overlapping video segment is less than the first occurrence threshold, the server deletes the fourth overlapping video segment and obtains the at least one reference candidate video segment. The server then adds the occurrence count of the fourth overlapping video segment to the second candidate video segment.

其中,出现次数小于第一出现次数阈值表示该第四重合视频片段可以忽略,服务器将该第四重合视频片段删除即可。If the number of occurrences is less than the first occurrence threshold, it means that the fourth overlapping video segment can be ignored, and the server can delete the fourth overlapping video segment.

在一种可能的实施方式中,在该至少一个候选视频片段包括该第四重合视频片段,且该第四重合视频片段与该第二候选视频片段之间的重合度不符合该重合度条件的情况下,服务器将该第四重合视频片段删除,得到该至少一个参考候选视频片段。在一些实施例中,在将该第四重合视频片段删除之前,服务器将该第四重合视频片段的出现次数叠加到该第二候选视频片段上。In one possible implementation, if the at least one candidate video segment includes the fourth overlapping video segment, and the overlap between the fourth overlapping video segment and the second candidate video segment does not meet the overlap condition, the server deletes the fourth overlapping video segment to obtain the at least one reference candidate video segment. In some embodiments, before deleting the fourth overlapping video segment, the server adds the occurrence count of the fourth overlapping video segment to the second candidate video segment.

在一种可能的实施方式中,在该至少一个候选视频片段包括该第四重合视频片段,且该第四重合视频片段的时长小于该第二候选视频片段的情况下,服务器将该第四重合视频片段删除,得到该至少一个参考候选视频片段。在一些实施例中,在将该第四重合视频片段删除之前,服务器将该第四重合视频片段的出现次数叠加到该第二候选视频片段上。In one possible implementation, if the at least one candidate video segment includes the fourth overlapping video segment, and the duration of the fourth overlapping video segment is less than that of the second candidate video segment, the server deletes the fourth overlapping video segment to obtain the at least one reference candidate video segment. In some embodiments, before deleting the fourth overlapping video segment, the server adds the occurrence count of the fourth overlapping video segment to the second candidate video segment.

在一些实施例中,至少一个参考候选视频片段被服务器存储在匹配列表(match-list)中以便调用。In some embodiments, at least one reference candidate video segment is stored in a match list by the server for retrieval.

第二部分、服务器确定参考候选视频片段在该至少一个参考候选视频片段的出现次数。Part Two: The server determines the number of times a reference candidate video segment appears in at least one reference candidate video segment.

通过上述第一部分的处理过程,服务器基于至少一个候选视频片段,确定至少一个参考候选视频片段,确定过程中涉及出现次数的合并和删除,服务器重新确定该至少一个参考候选视频片段的出现次数。在一些实施例中,服务器能够将该至少一个参考候选视频片段的出现次数存储在出现次数列表(count-list)中以便调用。Through the processing described in the first part, the server determines at least one reference candidate video segment based on at least one candidate video segment. This determination process involves merging and deleting occurrence counts, and the server then re-determines the occurrence count of the at least one reference candidate video segment. In some embodiments, the server can store the occurrence count of the at least one reference candidate video segment in a count list for later retrieval.

比如,在确定第一视频中的目标视频片段时,服务器采用三个第二视频来进行挖掘,为了方便进行说明,将该第一视频命名为i,将该三个第二视频分别命名为vid1、vid2以及vid3。采用上述各个步骤之后,服务器基于该第一视频i和第二视频vid1确定了两个候选视频片段[(2,7,4,9),(10,11,11,12)],基于该第一视频i和第二视频vid2确定了一个候选视频片段[(2,7,4,9)],基于该第一视频i和第二视频vid3确定了一个候选视频片段[(2,7,4,10)]。服务器对这四个候选视频片段进行统计,确定候选视频片段(2,7,4,9)的出现次数为2次,(2,7,4,10)的出现次数为1次,(10,11,11,12)的出现次数为1次。通过上述第一部分的方式融合这四个候选视频片段之后,得到两个参考候选视频片段[(2,7,4,9),(10,11,11,12)],且参考候选视频片段(2,7,4,9)的出现次数为3,参考候选视频片段(10,11,11,12)的出现次数为1,以次数列表(count-list)来进行存储时,即count-list=[3,1]。For example, when determining the target video segment in the first video, the server uses three second videos for mining. For ease of explanation, the first video is named i, and the three second videos are named vid1, vid2, and vid3, respectively. After the above steps, the server determines two candidate video segments [(2, 7, 4, 9), (10, 11, 11, 12)] based on the first video i and the second video vid1, one candidate video segment [(2, 7, 4, 9)] based on the first video i and the second video vid2, and one candidate video segment [(2, 7, 4, 10)] based on the first video i and the second video vid3. The server counts these four candidate video segments and determines that the candidate video segment (2, 7, 4, 9) appears twice, (2, 7, 4, 10) appears once, and (10, 11, 11, 12) appears once. After fusing these four candidate video segments using the method described in Part 1 above, two reference candidate video segments are obtained [(2, 7, 4, 9), (10, 11, 11, 12)]. The reference candidate video segment (2, 7, 4, 9) appears 3 times, and the reference candidate video segment (10, 11, 11, 12) appears 1 time. When stored as a count list, the count list is [3, 1].

第三部分、服务器将出现次数符合该出现次数条件的参考候选视频片段确定为目标候选视频片段。Part Three: The server identifies the reference candidate video segments whose occurrence count meets the occurrence count condition as target candidate video segments.

在一种可能的实施方式中,服务器将出现次数大于或等于第二出现次数阈值的参考候选视频片段确定为目标候选视频片段。In one possible implementation, the server identifies reference candidate video segments that appear more than or equal to a second occurrence threshold as target candidate video segments.

其中,第二出现次数阈值与该至少一个参考候选视频片段的数量正相关,也即是该至少一个参考候选视频片段的数量越多,该第二出现次数阈值也就越大;该至少一个参考候选视频片段的数量越少,该第二出现次数阈值也就越小。在一些实施例中,该第二出现次数阈值为目标比值与该至少一个参考候选视频片段的数量的乘积,该目标比值为小于1的正数。The second occurrence threshold is positively correlated with the number of the at least one reference candidate video segment; that is, the more the number of the at least one reference candidate video segment, the larger the second occurrence threshold; and the fewer the number of the at least one reference candidate video segment, the smaller the second occurrence threshold. In some embodiments, the second occurrence threshold is the product of a target ratio and the number of the at least one reference candidate video segment, where the target ratio is a positive number less than 1.

比如,在得到两个参考候选视频片段[(2,7,4,9),(10,11,11,12)],且参考候选视频片段(2,7,4,9)的出现次数为3,参考候选视频片段(10,11,11,12)的出现次数为1,第二出现次数阈值为3的情况下,服务器将参考候选视频片段(10,11,11,12)删除,最终保留参考候选视频片段(2,7,4,9),以及出现次数3。以匹配列表(match-list)和次数列表(count-list)来进行存储时,即match-list=(2,7,4,9);count-list=[3]。For example, given two reference candidate video segments [(2, 7, 4, 9), (10, 11, 11, 12)], where reference candidate video segment (2, 7, 4, 9) appears 3 times and reference candidate video segment (10, 11, 11, 12) appears once, and the second occurrence threshold is 3, the server deletes reference candidate video segment (10, 11, 11, 12) and ultimately retains reference candidate video segment (2, 7, 4, 9) with an occurrence count of 3. When storing the data in a match-list and a count-list, match-list = (2, 7, 4, 9); count-list = [3].
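The thresholding in this example can be sketched directly on the match-list and count-list; the Python list representation and the fixed threshold of 3 follow the example above and are illustrative assumptions.

```python
match_list = [(2, 7, 4, 9), (10, 11, 11, 12)]  # reference candidates
count_list = [3, 1]                            # their occurrence counts
threshold = 3                                  # second occurrence threshold

# Keep only the reference candidates whose occurrence count reaches the
# second occurrence threshold.
kept = [(seg, n) for seg, n in zip(match_list, count_list) if n >= threshold]
match_list = [seg for seg, _ in kept]
count_list = [n for _, n in kept]
assert match_list == [(2, 7, 4, 9)] and count_list == [3]
```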

307、在任一目标候选视频片段在该第一视频中的出现时间处于目标时间范围的情况下,服务器将该目标候选视频片段确定为该第一视频中的目标视频片段。307. If the appearance time of any target candidate video segment in the first video is within the target time range, the server determines the target candidate video segment as the target video segment in the first video.

其中,该目标时间范围由技术人员根据实际情况进行设置,比如,在本申请实施例提供的技术方案应用在识别视频的片头和片尾的场景下时,该目标时间范围为视频的片头和片尾可能存在的时间范围,在这种情况下,目标时间范围包括第一时间范围和第二时间范围,第一时间范围为片头可能存在的范围,第二时间范围为片尾可能存在的范围。比如,将视频前1/5时间设置为片头时间,也即是第一时间范围,后1/5时间为片尾时间,也即是第二时间范围,对于10分钟的视频,设定片头仅可能出现在前2分钟,片尾在后2分钟。其中,1/5是技术人员根据实际情况设置的,针对不同类型的视频可以进行相应的调整,比如,针对15分钟左右的少儿动漫可取1/5,针对电视剧45分钟长视频可取1/8。The target time range is set by technicians based on actual conditions. For example, when the technical solution provided in this application is applied to the scenario of identifying the intro and outro of a video, the target time range is the possible time range of the intro and outro of the video. In this case, the target time range includes a first time range and a second time range. The first time range is the range in which the intro may exist, and the second time range is the range in which the outro may exist. For example, the first 1/5 of the video time can be set as the intro time, which is the first time range, and the last 1/5 of the video time can be set as the outro time, which is the second time range. For a 10-minute video, the intro can be set to appear only in the first 2 minutes, and the outro in the last 2 minutes. The 1/5 is set by technicians based on actual conditions and can be adjusted accordingly for different types of videos. For example, 1/5 can be used for a children's animation of about 15 minutes, and 1/8 can be used for a 45-minute long TV series.
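The target-time-range check with the example 1/5 head and tail ratios can be sketched as follows; the function signature and the closed-boundary comparisons are illustrative assumptions.

```python
def in_target_range(segment, video_duration, head_ratio=0.2, tail_ratio=0.2):
    # A candidate qualifies when it falls inside the possible intro range
    # (the first 1/5 of the video) or the possible outro range (the last
    # 1/5); the ratios are the example values given in the text and can
    # be adjusted per video type (e.g. 1/8 for 45-minute episodes).
    start, end = segment
    return (end <= video_duration * head_ratio
            or start >= video_duration * (1.0 - tail_ratio))

# For a 10-minute (600 s) video the intro may only occur in the first
# 2 minutes and the outro in the last 2 minutes.
assert in_target_range((0.0, 90.0), 600.0)        # within the intro range
assert in_target_range((500.0, 590.0), 600.0)     # within the outro range
assert not in_target_range((200.0, 300.0), 600.0) # mid-video, rejected
```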

需要说明的是,上述步骤301-307是以服务器确定第一视频中的目标视频片段为例进行说明的,在该第一视频和该至少一个第二视频属于同一个视频集合的情况下,服务器能够采用与上述步骤301-307同理的方法来确定该视频集合中其他视频的目标视频片段,其他视频是指除第一视频以外的视频。It should be noted that steps 301-307 above are illustrated using the example of the server determining the target video segment in the first video. When the first video and the at least one second video belong to the same video set, the server can use the same method as steps 301-307 above to determine the target video segments of other videos in the video set. Other videos refer to videos other than the first video.

下面结合图8对本申请实施例提供的技术方案进行说明。The technical solutions provided in the embodiments of this application will be described below with reference to Figure 8.

参见图8,在本申请实施例中,服务器基于视频帧特征之间的相似度进行匹配,得到多个视频帧对。服务器基于出现时间差值将该多个视频帧对划分为多个初始视频帧组。服务器基于出现时间差值将该多个初始视频帧组融合为多个候选视频帧组。服务器将该多个候选视频帧组融合为多个视频帧组。服务器基于该多个视频帧组,输出第一视频的目标视频片段。Referring to Figure 8, in this embodiment, the server matches video frame features based on similarity to obtain multiple video frame pairs. The server divides these multiple video frame pairs into multiple initial video frame groups based on their occurrence time differences. The server merges these multiple initial video frame groups into multiple candidate video frame groups based on their occurrence time differences. The server merges these multiple candidate video frame groups into multiple video frame groups. Based on these multiple video frame groups, the server outputs the target video segment of the first video.

在一些实施例中,上述步骤301-307可以由一个片段挖掘系统来实现,在本申请实施例提供的技术方案应用在识别视频片头片尾的场景下时,该片段挖掘系统也即是片头片尾挖掘系统。参见图9,该片段挖掘系统提供了如下功能:提取多个视频的视频帧特征;对于每个视频,将该视频与该多个视频中的其他视频组成视频对;基于多个视频对来进行匹配,得到多个视频帧对;将多个视频帧对进行融合,得到多个视频帧组;基于该多个视频帧组,确定目标视频片段在该视频中的位置;基于该目标视频片段在该视频中的位置,获取该目标视频片段。在本申请实施例提供的技术方案应用在识别视频片头片尾的场景下时,该目标视频片段也即是该视频的片头或者片尾。In some embodiments, steps 301-307 above can be implemented by a segment mining system. When the technical solution provided in the embodiments of this application is applied to the scenario of identifying video intros and outros, the segment mining system is an intro/outro mining system. Referring to Figure 9, the segment mining system provides the following functions: extracting video frame features from multiple videos; for each video, forming video pairs between that video and the other videos among the multiple videos; matching based on the multiple video pairs to obtain multiple video frame pairs; fusing the multiple video frame pairs to obtain multiple video frame groups; determining the position of the target video segment in the video based on the multiple video frame groups; and obtaining the target video segment based on its position in the video. When the technical solution provided in the embodiments of this application is applied to the scenario of identifying video intros and outros, the target video segment is the intro or outro of the video.

参见图10,在本申请实施例提供的技术方案应用在识别电视剧片头片尾的场景下时,获取电视剧,该电视剧包括多个视频。将该多个视频输入片段挖掘系统,通过该片段挖掘系统输出该多个视频的片头和片尾。在一些实施例中,该片段挖掘系统能够输出该多个视频的片头和片尾的时间戳。Referring to Figure 10, when the technical solution provided in the embodiments of this application is applied to the scenario of identifying the opening and closing credits of a TV series, the TV series is obtained, which includes multiple videos. The multiple videos are input into the segment mining system, which outputs the intros and outros of the multiple videos. In some embodiments, the segment mining system can output the timestamps of the intros and outros of the multiple videos.

308、服务器将第一视频中的目标视频片段存储在片段数据库中。308. The server stores the target video segment from the first video in the segment database.

在一种可能的实施方式中,服务器对该第一视频的目标视频片段进行特征提取,得到该目标视频片段的视频帧特征。服务器将该目标视频片段的视频帧特征存储在该片段数据库中。在一些实施例中,服务器将该目标视频片段的视频帧特征关联到该第一视频,比如,服务器将该目标视频片段的视频帧特征的标识设置为第一视频的标识。在该第一视频属于某个视频集合的情况下,服务器将该第一视频的标识关联到该视频集合的标识,以便于后续的查询过程。In one possible implementation, the server extracts features from a target video segment of the first video to obtain the video frame features of the target video segment. The server stores these video frame features in a segment database. In some embodiments, the server associates the video frame features of the target video segment with the first video; for example, the server sets the identifier of the video frame features of the target video segment as the identifier of the first video. If the first video belongs to a video set, the server associates the identifier of the first video with the identifier of the video set to facilitate subsequent querying.

其中,对目标视频片段进行特征提取得到该目标视频片段的视频帧特征与上述步骤301属于同一发明构思,实现过程参见上述步骤301的描述,在此不再赘述。The process of extracting features from the target video segment to obtain the video frame features of the target video segment belongs to the same inventive concept as step 301 above. The implementation process is described in step 301 above and will not be repeated here.

比如,目标视频片段为(2,7),服务器从该第一视频中获取2-7秒对应的目标视频片段,从该目标视频片段中抽取多个参考视频帧。服务器对该多个参考视频帧进行特征提取,得到该多个参考视频帧的视频帧特征。服务器将该多个参考视频帧的视频帧特征存储在片段数据库中。服务器将该多个参考视频帧的视频帧特征与第一视频的标识Vid1相关联,将第一视频的标识Vid1与该第一视频所属的视频集合的标识Cid1相关联。图11示出了一种片段数据库的存储形式,参见图11,在数据库1100中,em1-emN是视频帧特征,vid1-vidK是不同视频的标识,N和K均为正整数。For example, if the target video segment is (2, 7), the server obtains the target video segment corresponding to seconds 2-7 from the first video, and extracts multiple reference video frames from the target video segment. The server performs feature extraction on these multiple reference video frames to obtain the video frame features of these multiple reference video frames. The server stores the video frame features of these multiple reference video frames in a segment database. The server associates the video frame features of these multiple reference video frames with the identifier Vid1 of the first video, and associates the identifier Vid1 of the first video with the identifier Cid1 of the video set to which the first video belongs. Figure 11 shows a storage format of a segment database. Referring to Figure 11, in database 1100, em1-emN are video frame features, vid1-vidK are identifiers of different videos, and N and K are both positive integers.
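图11所示的存储形式可用如下内存字典草稿示意(标识、字段名与特征取值均为说明性假设)。The storage format shown in Figure 11 can be sketched with an in-memory dictionary as follows (identifiers, field names, and feature values are illustrative assumptions).

```python
# video id -> list of frame feature vectors, and video id -> video set id,
# mirroring Figure 11 (em1-emN are features, vid1-vidK are video ids).
segment_db = {
    "features": {},
    "video_to_set": {},
}

def store_segment_features(db, video_id, set_id, frame_features):
    """Store a target segment's frame features under the video id and
    link the video id to its video set id for later lookup."""
    db["features"].setdefault(video_id, []).extend(frame_features)
    db["video_to_set"][video_id] = set_id

# Illustrative feature vectors for video Vid1 in set Cid1.
store_segment_features(segment_db, "Vid1", "Cid1", [[0.1, 0.2], [0.3, 0.4]])
```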

服务器将第一视频中的目标视频片段存储在片段数据库中之后,还能够利用该片段数据库进行视频片段检索,方法如下:After the server stores the target video segment from the first video in the segment database, it can also use this segment database to retrieve video segments, as follows:

在一种可能的实施方式中,服务器对待识别的目标视频的多个目标视频帧进行特征提取,得到该多个目标视频帧的视频帧特征。服务器基于该多个目标视频帧的视频帧特征、该第一视频的视频帧特征以及该至少一个第二视频的视频帧特征,确定该目标视频的至少一个目标视频片段。In one possible implementation, the server extracts features from multiple target video frames of the target video to be identified, obtaining the video frame features of the multiple target video frames. Based on the video frame features of the multiple target video frames, the video frame features of the first video, and the video frame features of the at least one second video, the server determines at least one target video segment of the target video.

其中,服务器对目标视频的多个目标视频帧进行特征提取,得到该多个目标视频帧的视频帧特征的过程与上述步骤301属于同一发明构思,实现过程参见上述步骤301的描述,在此不再赘述。服务器基于该多个目标视频帧的视频帧特征、该第一视频的视频帧特征以及该至少一个第二视频的视频帧特征,确定该目标视频的至少一个目标视频片段的过程,与上述步骤302-307属于同一发明构思,实现过程参见上述步骤302-307的描述,在此不再赘述。在一些实施例中,在该片段数据库进行视频片段检索是由视频检索系统实现的。在一些实施例中,该第一视频的视频帧特征以及该至少一个第二视频的视频帧特征存储在片段数据库中。The process by which the server extracts features from multiple target video frames of the target video to obtain the video frame features of the multiple target video frames belongs to the same inventive concept as step 301 above; for the implementation, see the description of step 301, which will not be repeated here. The process by which the server determines at least one target video segment of the target video based on the video frame features of the multiple target video frames, the video frame features of the first video, and the video frame features of the at least one second video belongs to the same inventive concept as steps 302-307 above; for the implementation, see the description of steps 302-307, which will not be repeated here. In some embodiments, video segment retrieval in the segment database is implemented by a video retrieval system. In some embodiments, the video frame features of the first video and the video frame features of the at least one second video are stored in the segment database.

上述视频片段的识别方法能够应用在识别视频片头片尾的场景下,还能够应用在识别侵权视频的场景下,下面将分别对这两种应用场景进行介绍。The video segment recognition method described above can be applied to the scenario of identifying video intros and outros, and can also be applied to the scenario of identifying infringing videos. These two application scenarios are introduced below.

在该视频片段的检索方法应用在检索视频片头片尾的场景下时,将待识别的目标视频输入该视频检索系统,由该视频检索系统对该目标视频进行特征提取,得到该多个目标视频帧的视频帧特征。通过该视频检索系统,基于该多个目标视频帧的视频帧特征在片段数据库中进行匹配,得到该目标视频中的目标视频片段,该目标视频片段也即是该目标视频的片头或者片尾。When the video segment retrieval method is applied to the scenario of retrieving video intros and outros, the target video to be identified is input into the video retrieval system, which extracts features from the target video to obtain the video frame features of the multiple target video frames. Through the video retrieval system, matching is performed in the segment database based on the video frame features of the multiple target video frames to obtain the target video segment in the target video; the target video segment is the intro or outro of the target video.

以识别电视剧中新更新视频的片头和片尾为例,比如,该电视剧已经更新了10集,通过上述步骤301-307获取了这10集的片头和片尾,通过上述步骤308将这10集的片头和片尾存储在片段数据库中。在该电视剧更新第11集时,将该第11集作为该目标视频,将该目标视频输入该视频检索系统,由该视频检索系统对该目标视频进行特征提取,得到该多个目标视频帧的视频帧特征。通过该视频检索系统,基于该多个目标视频帧的视频帧特征在片段数据库中进行匹配,得到该目标视频中的目标视频片段,该目标视频片段也即是该目标视频的片头或者片尾。在该片段数据库中将视频帧特征与视频的标识和视频集合的标识关联的情况下,能够基于视频集合的标识在有限的范围内进行匹配,从而提高确定目标视频片段的效率,其中,该视频集合也即是该电视剧。Taking the identification of the opening and closing credits of newly updated videos in a TV series as an example, suppose the TV series has already released 10 episodes. The opening and closing credits of these 10 episodes are obtained through steps 301-307 above, and stored in the segment database through step 308. When the TV series releases its 11th episode, this 11th episode is used as the target video. The target video is input into the video retrieval system, which extracts features from the target video to obtain the video frame features of multiple target video frames. The video retrieval system then matches these video frame features against the segment database to obtain the target video segment, which is either the opening or closing credits of the target video. By associating the video frame features with the video identifier and the video set identifier in the segment database, matching can be performed within a limited range based on the video set identifier, thereby improving the efficiency of identifying the target video segment. The video set is, in this case, the TV series itself.
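基于视频集合的标识限定匹配范围的做法可用如下Python草稿示意(标识与函数名均为说明性假设)。Restricting the matching range by the video set identifier can be sketched in Python as follows (the identifiers and function name are illustrative assumptions).

```python
def candidate_videos_for_set(video_to_set, set_id):
    """Return only the video ids stored under the given video set id,
    so a newly updated episode is matched within a limited range
    (its own series) instead of the whole database."""
    return [vid for vid, sid in video_to_set.items() if sid == set_id]

# Vid1/Vid2 belong to series Cid1; VidX belongs to an unrelated set.
mapping = {"Vid1": "Cid1", "Vid2": "Cid1", "VidX": "Cid9"}
```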

下面将结合图12进一步说明。The following will provide further explanation with reference to Figure 12.

确定待进行片头片尾识别的电视剧。获取该电视剧中的多个视频。将该多个视频输入片段挖掘系统1201,由该片段挖掘系统1201输出该多个视频的片头和片尾。将该多个视频的片头和片尾存储在片段数据库1202中。在该电视剧更新了目标视频的情况下,将该目标视频输入视频检索系统1203,由视频检索系统1203采用该目标视频在该片段数据库1202中进行检索,得到该目标视频的片头和片尾。本申请实施例提供的技术方案对同一视频集合中的视频挖掘片头片尾采用视频相同时间段检索的方法,即对同一视频集合,通过检索以及时序定位找到相同的视频片段,作为挖掘到的片头片尾。交叉排重,是指视频集合内部的视频经过相互检索找到重复的视频片段。视频排重检索的目的是对第一视频,检索出其与库存视频相同的视频片段。The TV series for which intro/outro recognition is to be performed is determined. Multiple videos in the TV series are obtained. The multiple videos are input into the segment mining system 1201, which outputs the intros and outros of the multiple videos. The intros and outros of the multiple videos are stored in the segment database 1202. When the TV series is updated with a target video, the target video is input into the video retrieval system 1203, which uses the target video to search the segment database 1202 to obtain the intro and outro of the target video. The technical solution provided in the embodiments of this application mines intros and outros for videos in the same video set by retrieving identical time segments across those videos: for the same video set, identical video segments are found through retrieval and temporal positioning and taken as the mined intros and outros. Cross deduplication means that the videos within a video set are retrieved against each other to find duplicate video segments. The purpose of video deduplication retrieval is, for the first video, to retrieve the video segments it shares with the stored videos.

需要说明的是,一个视频可能有多个片头或片尾满足上述要求,这属于正常情况。对于"片头曲+本集花絮+同一广告植入+正片"类型的电视剧,片头曲、广告植入是多个视频中可匹配到的,但花絮由于每集都不同,故不会被匹配到,所以会出现2个片头。It should be noted that a video may have multiple intros or outros that meet the above requirements, which is normal. For a TV series of the "opening theme + per-episode highlights + same product placement + main feature" type, the opening theme and the product placement can be matched across multiple videos, but the highlights differ from episode to episode and therefore are not matched, so two intros appear.

在该视频片段的检索方法应用在识别侵权视频的场景下时,将待识别的目标视频输入该视频检索系统,由该视频检索系统对该目标视频进行特征提取,得到该多个目标视频帧的视频帧特征,其中,该目标视频也即是待进行侵权识别的视频。通过该视频检索系统,基于该多个目标视频帧的视频帧特征在片段数据库中进行匹配,得到该目标视频中的目标视频片段,该目标视频片段也即是该目标视频的片头或者片尾。将该目标视频片段从该目标视频中删除,基于删除目标视频片段后的目标视频来进行侵权识别,侵权识别的目的是确定删除目标视频片段后的目标视频与指定视频的内容是否相同。其中,侵权识别由侵权识别系统来实现,侵权识别系统能够对查询视频在侵权保护视频数据库中进行排重,如果发现重复,则表示侵权。然而由于仅需要保护正片内容,常规影视剧的片头片尾不在侵权排重范围内,采用本申请实施例提供的技术方案能够实现对影视剧进行片头片尾识别。When the video segment retrieval method is applied to the scenario of identifying infringing videos, the target video to be identified is input into the video retrieval system, which extracts features from the target video to obtain the video frame features of the multiple target video frames, where the target video is the video on which infringement recognition is to be performed. Through the video retrieval system, matching is performed in the segment database based on the video frame features of the multiple target video frames to obtain the target video segment in the target video; the target video segment is the intro or outro of the target video. The target video segment is deleted from the target video, and infringement recognition is performed based on the target video with the target video segment deleted; the purpose of infringement recognition is to determine whether the content of the target video with the target video segment deleted is the same as that of a specified video. Infringement recognition is implemented by an infringement recognition system, which deduplicates the query video against the infringement-protected video database; if duplicates are found, infringement is indicated. However, since only the main feature content needs to be protected, the intros and outros of regular films and TV dramas are outside the scope of infringement deduplication; the technical solution provided in the embodiments of this application enables intro/outro recognition for films and TV dramas.

下面将结合图13进一步说明。The following will provide further explanation with reference to Figure 13.

确定待进行侵权识别的电视剧。获取该电视剧中的多个视频,将该多个视频存储在侵权保护视频数据库1301中。将该多个视频输入片段挖掘系统1302,由该片段挖掘系统1302输出该多个视频的片头和片尾。将该多个视频的片头和片尾存储在片段数据库1303中。在需要对目标视频进行侵权识别的情况下,将该目标视频输入视频检索系统1304,由视频检索系统1304采用该目标视频在该片段数据库1303中进行检索,得到该目标视频的片头和片尾。将该目标视频的片头和片尾删除,通过侵权识别系统1305来输出该目标视频的侵权结果,侵权结果包括侵权和不侵权。The TV series to be identified for infringement identification is determined. Multiple videos from the TV series are obtained and stored in the infringement protection video database 1301. These videos are then input into a segment mining system 1302, which outputs the intro and outro of each video. The intro and outro of these videos are stored in the segment database 1303. If infringement identification of the target video is required, the target video is input into a video retrieval system 1304, which searches the segment database 1303 using the target video to obtain its intro and outro. The intro and outro of the target video are then deleted, and the infringement identification system 1305 outputs the infringement result for the target video, which includes whether it infringes or not.

在一些实施例中,基于上述方式基于目标视频在片段数据库中进行查询之后,在得到该目标视频的多个目标视频片段的情况下,服务器将该多个目标视频片段中最长的目标视频片段确定为最终的目标视频片段,在本申请实施例提供的技术方案应用在识别视频的片头和片尾的情况下,该目标视频片段也即是该目标视频的片头和片尾,该过程参见图14。In some embodiments, after querying the segment database based on the target video in the above manner, if multiple target video segments of the target video are obtained, the server determines the longest target video segment among the multiple target video segments as the final target video segment. When the technical solution provided in this application embodiment is applied to identify the beginning and end of a video, the target video segment is also the beginning and end of the target video. The process is shown in Figure 14.
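从多个目标视频片段中保留最长片段的做法可用如下Python草稿示意(函数名为说明性假设,片段以(起始,结束)元组表示)。Keeping the longest of multiple target video segments can be sketched in Python as follows (the function name is an illustrative assumption; segments are (start, end) tuples).

```python
def pick_final_segment(segments):
    """Among several retrieved target segments, keep the longest one
    (by end - start), matching the choice described for Figure 14."""
    return max(segments, key=lambda s: s[1] - s[0]) if segments else None
```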

另外,视频检索系统与片段挖掘系统可以同时提供对外接口,即检索入库、挖掘入库以同时开放由用户指定要使用的具体功能。也可以仅提供一个识别接口,后台根据库存中是否已经有该视频标识对应电视剧的片头片尾进行检索还是挖掘的判断,由后台触发要使用的具体功能,该具体功能包括检索和挖掘。In addition, the video retrieval system and the segment mining system can simultaneously provide external interfaces, allowing for both retrieval and mining to be performed simultaneously, with users specifying the specific functions they wish to use. Alternatively, only one identification interface can be provided, with the backend determining whether to perform a retrieval or mining based on whether the corresponding TV series' opening and closing credits already exist in the database. The backend then triggers the specific functions to be used, which include both retrieval and mining.

上述所有可选技术方案,可以采用任意结合形成本申请的可选实施例,在此不再一一赘述。All of the above-mentioned optional technical solutions can be combined in any way to form the optional embodiments of this application, and will not be described in detail here.

通过本申请实施例提供的技术方案,基于视频帧特征之间的相似度,确定包含相似视频帧的视频帧对。基于出现时间差值来对视频帧对中的第一视频帧进行融合,得到至少一个候选视频片段。最终从至少一个候选视频片段中确定出处于目标时间范围的目标视频片段。确定目标片段的过程无需人工参与,由计算机设备直接基于第一视频和至少一个第二视频就能够自动进行,效率较高。The technical solution provided in this application determines video frame pairs containing similar video frames based on the similarity between video frame features. The first video frame in the video frame pair is fused based on the time difference of its occurrence to obtain at least one candidate video segment. Finally, a target video segment within a target time range is determined from the at least one candidate video segment. The process of determining the target segment requires no manual intervention; it can be automatically performed by computer equipment directly based on the first video and at least one second video, resulting in high efficiency.

通过上述视频段匹配算法设计,实现基于视频帧特征的相似视频片段匹配方法,可支持变长(体现在匹配逻辑中,相同出现时间差值下合并视频帧对时并不要求合并的帧必须前后连续)、位置变化(体现在匹配逻辑中,当出现时间差值为0,则位置无变化,出现时间差值大于0则位置可以有变化)的相似视频段匹配。该方法耗时小,性能优。Based on the above video segment matching algorithm design, a similar video segment matching method based on video frame features is implemented. This method supports similar video segments with variable length (reflected in the matching logic, when merging video frame pairs with the same occurrence time difference, the merged frames do not need to be consecutive) and positional changes (reflected in the matching logic, when the occurrence time difference is 0, the position remains unchanged; when the occurrence time difference is greater than 0, the position can change). This method is time-efficient and has excellent performance.

图15是本申请实施例提供的一种视频片段的识别装置的结构示意图,参见图15,装置包括:视频帧对确定模块1501、融合模块1502以及目标视频片段确定模块1503。Figure 15 is a schematic diagram of a video segment recognition device provided in an embodiment of this application. Referring to Figure 15, the device includes: a video frame pair determination module 1501, a fusion module 1502, and a target video segment determination module 1503.

视频帧对确定模块1501,用于基于第一视频的视频帧特征以及至少一个第二视频的视频帧特征,确定多个视频帧对,该视频帧对包括相似度符合相似度条件的第一视频帧和第二视频帧,该第一视频帧属于该第一视频,该第二视频帧属于该至少一个第二视频。The video frame pair determination module 1501 is used to determine multiple video frame pairs based on the video frame features of a first video and the video frame features of at least one second video. The video frame pair includes a first video frame and a second video frame whose similarity meets the similarity condition. The first video frame belongs to the first video, and the second video frame belongs to the at least one second video.

融合模块1502,用于基于该多个视频帧对的出现时间差值,将该多个视频帧对中的第一视频帧进行融合,得到该第一视频中的至少一个候选视频片段,该出现时间差值是指该视频帧对中的两个视频帧在视频中的出现时间之间的差值。The fusion module 1502 is used to fuse the first video frame in the plurality of video frame pairs based on the occurrence time difference of the plurality of video frame pairs to obtain at least one candidate video segment in the first video. The occurrence time difference refers to the difference between the occurrence times of two video frames in the video frame pair in the video.

目标视频片段确定模块1503,用于基于该至少一个候选视频片段以及目标时间范围,确定该第一视频中的至少一个目标视频片段,该目标视频片段处于该第一视频的该目标时间范围内。The target video segment determination module 1503 is used to determine at least one target video segment in the first video based on the at least one candidate video segment and the target time range, wherein the target video segment is within the target time range of the first video.

在一种可能的实施方式中,该融合模块1502,用于基于该多个视频帧对的出现时间差值,将该多个视频帧对划分为多个视频帧组,同一个该视频帧组中的视频帧对对应于同一个出现时间差值。对于该多个视频帧组中的任一视频帧组,按照该视频帧组中视频帧对的第一视频帧在该第一视频中的出现时间,将该视频帧组中视频帧对的第一视频帧融合为一个该候选视频片段。In one possible implementation, the fusion module 1502 is used to divide the plurality of video frame pairs into a plurality of video frame groups based on the occurrence time difference of the plurality of video frame pairs, wherein video frame pairs in the same video frame group correspond to the same occurrence time difference. For any video frame group in the plurality of video frame groups, the first video frames of the video frame pairs in the video frame group are fused into a candidate video segment according to the occurrence time of the first video frame of the video frame pair in the first video.
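按出现时间差值划分视频帧组的步骤可用如下Python草稿示意(时间以秒计,函数名为说明性假设)。The step of dividing video frame pairs into groups by occurrence time difference can be sketched in Python as follows (times are in seconds; the function name is an illustrative assumption).

```python
from collections import defaultdict

def group_pairs_by_offset(frame_pairs):
    """Group (t1, t2) video frame pairs by the occurrence time
    difference t2 - t1; pairs in the same group share one offset."""
    groups = defaultdict(list)
    for t1, t2 in frame_pairs:
        groups[t2 - t1].append((t1, t2))
    return dict(groups)
```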

在一种可能的实施方式中,该融合模块1502,用于将出现时间差值相同的视频帧对划分为一个初始视频帧组。基于多个初始视频帧组对应的出现时间差值,将该多个初始视频帧组进行融合,得到该多个视频帧组。In one possible implementation, the fusion module 1502 is used to divide video frame pairs with the same occurrence time difference into an initial video frame group. Based on the occurrence time difference corresponding to multiple initial video frame groups, the multiple initial video frame groups are fused to obtain the multiple video frame groups.

在一种可能的实施方式中,该融合模块1502,用于按照目标顺序对该多个初始视频帧组进行排序,得到多个候选视频帧组。在该多个候选视频帧组中任两个相邻的候选视频帧组之间的匹配时间差值符合匹配时间差值条件的情况下,将该两个相邻的候选视频帧组融合为一个视频帧组,该匹配时间差值是指该两个相邻的候选视频帧组对应的出现时间差值之间的差值。In one possible implementation, the fusion module 1502 is used to sort the multiple initial video frame groups according to a target order to obtain multiple candidate video frame groups. If the matching time difference between any two adjacent candidate video frame groups meets a matching time difference condition, the two adjacent candidate video frame groups are fused into one video frame group. The matching time difference refers to the difference between the occurrence time differences of the corresponding two adjacent candidate video frame groups.
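相邻候选视频帧组的融合可用如下简化的Python草稿示意(省略了对第二视频帧的替换处理,阈值与函数名均为说明性假设)。The fusion of adjacent candidate video frame groups can be sketched with the following simplified Python draft (the replacement of second video frames is omitted; the threshold and function name are illustrative assumptions).

```python
def merge_adjacent_groups(groups, match_diff_threshold=1):
    """Sort groups by offset; when two adjacent groups' offsets differ
    by at most the threshold, fold the first group's pairs into the
    second, keeping the second group's offset."""
    ordered = sorted(groups.items())  # list of (offset, pairs)
    merged = []
    for offset, pairs in ordered:
        if merged and offset - merged[-1][0] <= match_diff_threshold:
            merged[-1] = (offset, merged[-1][1] + pairs)
        else:
            merged.append((offset, list(pairs)))
    return merged
```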

在一种可能的实施方式中,该两个相邻的候选视频帧组包括第一候选视频帧组和第二候选视频帧组,该融合模块1502,用于在该第一候选视频帧组对应的出现时间差值与该第二候选视频帧组对应的出现时间差值之间的匹配时间差值小于或等于匹配差值阈值的情况下,将该第一候选视频帧组中的视频帧对添加至该第二候选视频帧组,得到该视频帧组。In one possible implementation, the two adjacent candidate video frame groups include a first candidate video frame group and a second candidate video frame group. The fusion module 1502 is used to add the video frame pairs in the first candidate video frame group to the second candidate video frame group to obtain the video frame group when the matching time difference between the occurrence time difference corresponding to the first candidate video frame group and the occurrence time difference corresponding to the second candidate video frame group is less than or equal to the matching difference threshold.

在一种可能的实施方式中,该融合模块1502,用于将该第一候选视频帧组中的视频帧对添加至该第二候选视频帧组。基于该第二候选视频帧组对应的出现时间差值,采用参考第二视频帧替换目标第二视频帧,得到该视频帧组,该目标第二视频帧为新添加至该第二候选视频帧组中的第二视频帧,该参考第二视频帧为该第二视频中与目标第一视频帧之间的出现时间差值为该第二候选视频帧组对应的出现时间差值的第二视频帧,该目标第一视频帧为该目标第二视频帧所属视频帧对中的第一视频帧。In one possible implementation, the fusion module 1502 is used to add video frame pairs from the first candidate video frame group to the second candidate video frame group. Based on the occurrence time difference corresponding to the second candidate video frame group, a reference second video frame is used to replace the target second video frame to obtain the video frame group. The target second video frame is the second video frame newly added to the second candidate video frame group, and the reference second video frame is the second video frame in the second video frame whose occurrence time difference with the target first video frame is the occurrence time difference corresponding to the second candidate video frame group. The target first video frame is the first video frame in the video frame pair to which the target second video frame belongs.

在一种可能的实施方式中,该融合模块1502,用于比较该视频帧组中任两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间。在该两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间之间的差值符合出现时间条件的情况下,将该两个相邻的视频帧对添加至临时帧列表。在该两个相邻的视频帧对的第一视频帧在该第一视频中的出现时间之间的差值不符合出现时间条件的情况下,将该临时帧列表中的视频帧对融合为参考视频片段。基于多个参考视频片段,确定该至少一个候选视频片段。In one possible implementation, the fusion module 1502 is used to compare the occurrence times of the first video frames of any two adjacent video frame pairs in the video frame group within the first video. If the difference between the occurrence times of the first video frames of the two adjacent video frame pairs in the first video meets the occurrence time condition, the two adjacent video frame pairs are added to a temporary frame list. If the difference between the occurrence times of the first video frames of the two adjacent video frame pairs in the first video does not meet the occurrence time condition, the video frame pairs in the temporary frame list are fused into a reference video segment. Based on multiple reference video segments, at least one candidate video segment is determined.
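基于临时帧列表将视频帧对融合为参考视频片段的过程可用如下Python草稿示意(出现时间条件简化为固定间隔阈值,函数名为说明性假设)。Fusing video frame pairs into reference video segments via a temporary frame list can be sketched in Python as follows (the occurrence time condition is simplified to a fixed gap threshold; the function name is an illustrative assumption).

```python
def fuse_group_into_segments(pairs, gap_threshold=2):
    """Walk pairs sorted by first-frame time; while the gap between
    adjacent first-frame times stays within the threshold, collect
    them in a temporary list; otherwise flush the list into one
    reference segment (start1, end1, start2, end2)."""
    segments, temp = [], []
    for t1, t2 in sorted(pairs):
        if temp and t1 - temp[-1][0] > gap_threshold:
            segments.append((temp[0][0], temp[-1][0], temp[0][1], temp[-1][1]))
            temp = []
        temp.append((t1, t2))
    if temp:
        segments.append((temp[0][0], temp[-1][0], temp[0][1], temp[-1][1]))
    return segments
```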

在一种可能的实施方式中,该多个参考视频片段包括第一重合视频片段和/或第二重合视频片段,该第一重合视频片段是指属于该多个参考视频片段中第一参考视频片段的参考视频片段,该第二重合视频片段是指与该多个参考视频片段中第二参考视频片段部分重合的参考视频片段,该融合模块1502,用于执行下述至少一项:In one possible implementation, the plurality of reference video segments includes a first overlapping video segment and/or a second overlapping video segment. The first overlapping video segment refers to a reference video segment belonging to the first reference video segment among the plurality of reference video segments, and the second overlapping video segment refers to a reference video segment that partially overlaps with the second reference video segment among the plurality of reference video segments. The fusion module 1502 is configured to perform at least one of the following:

在该多个参考视频片段包括该第一重合视频片段的情况下,将该第一重合视频片段删除,得到该至少一个候选视频片段。If the multiple reference video segments include the first overlapping video segment, the first overlapping video segment is deleted to obtain the at least one candidate video segment.

在该多个参考视频片段包括该第二重合视频片段的情况下,将该第二重合视频片段与该第二参考视频片段之间的重合部分删除,得到该至少一个候选视频片段。If the plurality of reference video segments include the second overlapping video segment, the overlapping portion between the second overlapping video segment and the second reference video segment is deleted to obtain the at least one candidate video segment.

在一种可能的实施方式中,该融合模块1502还用于:比较第三类参考视频片段的时长与目标时长,该第三类参考视频片段是指删除重合部分的该第二重合视频片段。在该第三类参考视频片段的时长大于或等于该目标时长的情况下,保留该第三类参考视频片段。在该第三类参考视频片段的时长小于该目标时长的情况下,删除该第三类参考视频片段。In one possible implementation, the fusion module 1502 is further configured to: compare the duration of a third type of reference video segment with a target duration, wherein the third type of reference video segment refers to the second overlapping video segment after deleting overlapping portions. If the duration of the third type of reference video segment is greater than or equal to the target duration, the third type of reference video segment is retained. If the duration of the third type of reference video segment is less than the target duration, the third type of reference video segment is deleted.
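按目标时长过滤第三类参考视频片段的判断可用如下Python草稿示意(函数名为说明性假设,片段以(起始1,结束1,起始2,结束2)元组表示)。Filtering the third type of reference video segments by the target duration can be sketched in Python as follows (the function name is an illustrative assumption; segments are (start1, end1, start2, end2) tuples).

```python
def keep_if_long_enough(segments, target_duration):
    """Keep a segment only if its remaining duration in the first
    video, end1 - start1, reaches the target duration."""
    return [s for s in segments if s[1] - s[0] >= target_duration]
```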

在一种可能的实施方式中,该目标视频片段确定模块1503,用于基于该至少一个候选视频片段,确定至少一个目标候选视频片段,该目标候选视频片段在该至少一个候选视频片段中的出现次数符合次数条件。In one possible implementation, the target video segment determination module 1503 is configured to determine at least one target candidate video segment based on the at least one candidate video segment, where the number of times the target candidate video segment appears in the at least one candidate video segment meets a count condition.

在任一该目标候选视频片段在该第一视频中的出现时间处于该目标时间范围的情况下,将该目标候选视频片段确定为该第一视频中的目标视频片段。If the appearance time of any target candidate video segment in the first video falls within the target time range, the target candidate video segment is determined as the target video segment in the first video.

在一种可能的实施方式中,该目标视频片段确定模块1503,用于基于该至少一个候选视频片段,确定至少一个参考候选视频片段。确定每个参考候选视频片段在该至少一个参考候选视频片段中的出现次数。将出现次数符合出现次数条件的参考候选视频片段确定为目标候选视频片段。In one possible implementation, the target video segment determination module 1503 is configured to determine at least one reference candidate video segment based on the at least one candidate video segment, determine the number of occurrences of each reference candidate video segment among the at least one reference candidate video segment, and determine the reference candidate video segments whose occurrence counts meet the occurrence count condition as target candidate video segments.

在一种可能的实施方式中,该至少一个候选视频片段包括第三重合视频片段和/或第四重合视频片段,该第三重合视频片段是指属于该至少一个候选视频片段中第一候选视频片段的候选视频片段,该第四重合视频片段是指与该至少一个候选视频片段中第二候选视频片段部分重合的候选视频片段,该目标视频片段确定模块1503,用于执行下述至少一项:In one possible implementation, the at least one candidate video segment includes a third overlapping video segment and/or a fourth overlapping video segment. The third overlapping video segment refers to a candidate video segment that belongs to the first candidate video segment among the at least one candidate video segments, and the fourth overlapping video segment refers to a candidate video segment that partially overlaps with the second candidate video segment among the at least one candidate video segments. The target video segment determination module 1503 is configured to perform at least one of the following:

在该至少一个候选视频片段包括该第三重合视频片段的情况下,将该第三重合视频片段删除,得到该至少一个参考候选视频片段。If the at least one candidate video segment includes the third overlapping video segment, the third overlapping video segment is deleted to obtain the at least one reference candidate video segment.

在该至少一个候选视频片段包括该第四重合视频片段,且该第四重合视频片段与该第二候选视频片段之间的重合度符合重合度条件的情况下,确定该第四重合视频片段的出现次数。基于该第四重合视频片段的出现次数,确定该至少一个参考候选视频片段。If at least one candidate video segment includes the fourth overlapping video segment, and the overlap between the fourth overlapping video segment and the second candidate video segment meets the overlap condition, the occurrence count of the fourth overlapping video segment is determined. Based on the occurrence count of the fourth overlapping video segment, the at least one reference candidate video segment is determined.

在该至少一个候选视频片段包括该第四重合视频片段,且该第四重合视频片段与该第二候选视频片段之间的重合度不符合该重合度条件的情况下,将该第四重合视频片段删除,得到该至少一个参考候选视频片段。If the at least one candidate video segment includes the fourth overlapping video segment, and the overlap between the fourth overlapping video segment and the second candidate video segment does not meet the overlap condition, the fourth overlapping video segment is deleted to obtain the at least one reference candidate video segment.

在该至少一个候选视频片段包括该第四重合视频片段,且该第四重合视频片段的时长小于该第二候选视频片段的情况下,将该第四重合视频片段删除,得到该至少一个参考候选视频片段。If the at least one candidate video segment includes the fourth overlapping video segment, and the duration of the fourth overlapping video segment is less than that of the second candidate video segment, the fourth overlapping video segment is deleted to obtain the at least one reference candidate video segment.

在一种可能的实施方式中,该目标视频片段确定模块1503,用于执行下述任一项:In one possible implementation, the target video segment determination module 1503 is configured to perform any of the following:

在该第四重合视频片段的出现次数大于或等于第一出现次数阈值的情况下,将该第四重合视频片段与该第二候选视频片段进行融合,得到该至少一个参考候选视频片段。If the number of occurrences of the fourth overlapping video segment is greater than or equal to the first occurrence count threshold, the fourth overlapping video segment is fused with the second candidate video segment to obtain the at least one reference candidate video segment.

在该第四重合视频片段的出现次数小于该第一出现次数阈值的情况下,将该第四重合视频片段删除,得到该至少一个参考候选视频片段。If the number of occurrences of the fourth overlapping video segment is less than the first occurrence threshold, the fourth overlapping video segment is deleted to obtain the at least one reference candidate video segment.
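The rules above for keeping, fusing, or deleting a fourth overlapping video segment can be sketched as follows. This is a hedged illustration, not the patented implementation: the `(start, end)` segment representation, the IoU-style overlap measure, and the thresholds `min_overlap` and `min_count` are all assumptions introduced for the example, standing in for the unspecified overlap condition and first occurrence threshold.

```python
def overlap_ratio(a, b):
    """IoU-style overlap between two segments given as (start, end) seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def resolve_fourth_segment(seg, ref, count, min_overlap=0.5, min_count=2):
    """One reading of the rules: a segment fully contained in `ref` is
    deleted; a partially overlapping one is fused with `ref` when it both
    overlaps enough and recurs often enough, otherwise it is deleted.
    Returns the fused segment, or None when `seg` is dropped."""
    if ref[0] <= seg[0] and seg[1] <= ref[1]:
        return None  # contained in the second candidate segment: delete
    if overlap_ratio(seg, ref) >= min_overlap and count >= min_count:
        # overlap condition met and occurrence count high enough: fuse
        return (min(seg[0], ref[0]), max(seg[1], ref[1]))
    return None  # overlap or occurrence count insufficient: delete
```

For example, with `ref = (10, 20)`, a partially overlapping segment `(5, 18)` that appears 3 times is fused into `(5, 20)`, while the same segment appearing only once is discarded.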

在一种可能的实施方式中,该装置还包括:In one possible implementation, the device further includes:

特征提取模块,用于对待识别的目标视频的多个目标视频帧进行特征提取,得到该多个目标视频帧的视频帧特征。The feature extraction module is used to extract features from multiple target video frames of the target video to be identified, thereby obtaining the video frame features of the multiple target video frames.

该目标视频片段确定模块1503,还用于基于该多个目标视频帧的视频帧特征、该第一视频帧的视频帧特征以及该至少一个第二视频的视频帧特征,确定该目标视频的至少一个目标视频片段。The target video segment determination module 1503 is further configured to determine at least one target video segment of the target video based on the video frame features of the plurality of target video frames, the video frame features of the first video frame, and the video frame features of the at least one second video.
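As an illustration of the feature-extraction step, the sketch below computes a toy per-frame feature and the similarity test used to form video frame pairs. It is only a stand-in: the application does not prescribe a concrete extractor (production systems would typically use a CNN embedding), and the grayscale-histogram feature, the bin count, and the 0.9 threshold are assumptions made for the example.

```python
from math import sqrt

def frame_feature(pixels, bins=16):
    """Toy video-frame feature: an L2-normalized grayscale histogram.
    `pixels` is a flat sequence of 0-255 intensity values."""
    hist = [0.0] * bins
    for p in pixels:
        hist[min(int(p) * bins // 256, bins - 1)] += 1.0
    norm = sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm else hist

def meets_similarity_condition(f1, f2, threshold=0.9):
    """Cosine similarity between two unit-norm features, playing the role
    of the similarity condition that defines a video frame pair."""
    return sum(a * b for a, b in zip(f1, f2)) >= threshold
```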

需要说明的是:上述实施例提供的视频片段的识别装置在识别视频片段时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将计算机设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的视频片段的识别装置与视频片段的识别方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。It should be noted that the video segment recognition device provided in the above embodiments is only illustrated by the division of the above functional modules. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the computer device can be divided into different functional modules to complete all or part of the functions described above. In addition, the video segment recognition device and the video segment recognition method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process can be found in the method embodiments, which will not be repeated here.

通过本申请实施例提供的技术方案,基于视频帧特征之间的相似度,确定包含相似视频帧的视频帧对。基于出现时间差值来对视频帧对中的第一视频帧进行融合,得到至少一个候选视频片段。最终从至少一个候选视频片段中确定出处于目标时间范围的目标视频片段。确定目标片段的过程无需人工参与,由计算机设备直接基于第一视频和至少一个第二视频就能够自动进行,效率较高。The technical solution provided in this application determines video frame pairs containing similar video frames based on the similarity between video frame features. The first video frame in the video frame pair is fused based on the time difference of its occurrence to obtain at least one candidate video segment. Finally, a target video segment within a target time range is determined from the at least one candidate video segment. The process of determining the target segment requires no manual intervention; it can be automatically performed by computer equipment directly based on the first video and at least one second video, resulting in high efficiency.
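The pipeline summarized above — group matching frame pairs by their occurrence-time difference, fuse consecutive first video frames into candidate segments, then keep candidates inside the target time range — can be sketched as follows. This is a simplified illustration rather than the claimed implementation: the integer rounding of time offsets, the `max_gap` tolerance, and the default 120-second target range are assumptions chosen for the example.

```python
from collections import defaultdict

def candidate_segments(frame_pairs, max_gap=2.0):
    """frame_pairs: (t1, t2) occurrence times in seconds of matching frames
    in the first and second video. Pairs sharing a time offset are grouped,
    and consecutive first-video times are fused into (start, end) segments."""
    by_offset = defaultdict(list)
    for t1, t2 in frame_pairs:
        by_offset[round(t1 - t2)].append(t1)  # group by occurrence time difference
    segments = []
    for times in by_offset.values():
        times.sort()
        start = prev = times[0]
        for t in times[1:]:
            if t - prev > max_gap:  # gap too large: close the current segment
                segments.append((start, prev))
                start = t
            prev = t
        segments.append((start, prev))
    return segments

def segments_in_target_range(segments, target_range=(0.0, 120.0)):
    """Keep candidates whose occurrence time lies in the target range,
    e.g. the first two minutes of an episode for opening credits."""
    lo, hi = target_range
    return [s for s in segments if lo <= s[0] and s[1] <= hi]
```

With pairs such as `[(1, 11), (2, 12), (3, 13), (300, 310), (301, 311)]`, all sharing the offset -10, the two runs fuse into the candidates `(1, 3)` and `(300, 301)`, and only the first falls inside the default target range.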

本申请实施例提供了一种计算机设备,用于执行上述方法,该计算机设备可以实现为终端或者服务器,下面先对终端的结构进行介绍:This application provides a computer device for performing the above-described method. This computer device can be implemented as a terminal or a server. The structure of the terminal will be described below:

图16是本申请实施例提供的一种终端的结构示意图。Figure 16 is a schematic diagram of the structure of a terminal provided in an embodiment of this application.

通常,终端1600包括有:一个或多个处理器1601和一个或多个存储器1602。Typically, terminal 1600 includes one or more processors 1601 and one or more memories 1602.

处理器1601可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器1601可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器1601也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器1601可以集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。在一些实施例中,处理器1601还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。Processor 1601 may include one or more processing cores, such as a quad-core processor, an octa-core processor, etc. Processor 1601 may be implemented using at least one hardware form selected from DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). Processor 1601 may also include a main processor and a coprocessor. The main processor, also known as a CPU (Central Processing Unit), is used to process data in the wake-up state; the coprocessor is a low-power processor used to process data in the standby state. In some embodiments, processor 1601 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, processor 1601 may also include an AI (Artificial Intelligence) processor, which is used to handle computational operations related to machine learning.

存储器1602可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1602还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1602中的非暂态的计算机可读存储介质用于存储至少一个计算机程序,该至少一个计算机程序用于被处理器1601所执行以实现本申请中方法实施例提供的视频片段的识别方法。The memory 1602 may include one or more computer-readable storage media, which may be non-transitory. The memory 1602 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In some embodiments, the non-transitory computer-readable storage media in the memory 1602 are used to store at least one computer program, which is executed by the processor 1601 to implement the video segment recognition method provided in the method embodiments of this application.

在一些实施例中,终端1600还可选包括有:外围设备接口1603和至少一个外围设备。处理器1601、存储器1602和外围设备接口1603之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口1603相连。具体地,外围设备包括:射频电路1604、显示屏1605、摄像头组件1606、音频电路1607和电源1608中的至少一种。In some embodiments, the terminal 1600 may also optionally include a peripheral device interface 1603 and at least one peripheral device. The processor 1601, memory 1602, and peripheral device interface 1603 can be connected via a bus or signal line. Each peripheral device can be connected to the peripheral device interface 1603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes at least one of the following: a radio frequency circuit 1604, a display screen 1605, a camera assembly 1606, an audio circuit 1607, and a power supply 1608.

外围设备接口1603可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器1601和存储器1602。在一些实施例中,处理器1601、存储器1602和外围设备接口1603被集成在同一芯片或电路板上;在一些其他实施例中,处理器1601、存储器1602和外围设备接口1603中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。Peripheral interface 1603 can be used to connect at least one I/O (Input/Output) related peripheral device to processor 1601 and memory 1602. In some embodiments, processor 1601, memory 1602 and peripheral interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of processor 1601, memory 1602 and peripheral interface 1603 can be implemented on separate chips or circuit boards, which is not limited in this embodiment.

射频电路1604用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路1604通过电磁信号与通信网络以及其他通信设备进行通信。射频电路1604将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路1604包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。The radio frequency (RF) circuit 1604 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The RF circuit 1604 communicates with communication networks and other communication devices via electromagnetic signals. The RF circuit 1604 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals back into electrical signals. Optionally, the RF circuit 1604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, etc.

显示屏1605用于显示UI(User Interface,用户界面)。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏1605是触摸显示屏时,显示屏1605还具有采集在显示屏1605的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器1601进行处理。此时,显示屏1605还可以用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。Display screen 1605 is used to display a user interface (UI). This UI may include graphics, text, icons, video, and any combination thereof. When display screen 1605 is a touch display screen, it also has the ability to collect touch signals on or above its surface. These touch signals can be input as control signals to processor 1601 for processing. In this case, display screen 1605 can also be used to provide virtual buttons and/or a virtual keyboard, also known as soft buttons and/or a soft keyboard.

摄像头组件1606用于采集图像或视频。可选地,摄像头组件1606包括前置摄像头和后置摄像头。通常,前置摄像头设置在终端的前面板,后置摄像头设置在终端的背面。The camera assembly 1606 is used to capture images or videos. Optionally, the camera assembly 1606 includes a front-facing camera and a rear-facing camera. Typically, the front-facing camera is located on the front panel of the terminal, and the rear-facing camera is located on the back of the terminal.

音频电路1607可以包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器1601进行处理,或者输入至射频电路1604以实现语音通信。The audio circuit 1607 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, and convert the sound waves into electrical signals that are input to the processor 1601 for processing, or input to the radio frequency circuit 1604 to realize voice communication.

电源1608用于为终端1600中的各个组件进行供电。电源1608可以是交流电、直流电、一次性电池或可充电电池。Power supply 1608 is used to supply power to the various components in terminal 1600. Power supply 1608 can be AC power, DC power, a disposable battery, or a rechargeable battery.

在一些实施例中,终端1600还包括有一个或多个传感器1609。该一个或多个传感器1609包括但不限于:加速度传感器1610、陀螺仪传感器1611、压力传感器1612、光学传感器1613以及接近传感器1614。In some embodiments, the terminal 1600 further includes one or more sensors 1609. The one or more sensors 1609 include, but are not limited to: an accelerometer 1610, a gyroscope 1611, a pressure sensor 1612, an optical sensor 1613, and a proximity sensor 1614.

加速度传感器1610可以检测以终端1600建立的坐标系的三个坐标轴上的加速度大小。Accelerometer 1610 can detect the magnitude of acceleration on the three coordinate axes of a coordinate system established with terminal 1600.

陀螺仪传感器1611可以检测终端1600的机体方向及转动角度,陀螺仪传感器1611可以与加速度传感器1610协同采集用户对终端1600的3D动作。The gyroscope sensor 1611 can detect the orientation and rotation angle of the terminal 1600. The gyroscope sensor 1611 can work in conjunction with the accelerometer sensor 1610 to collect the user's 3D movements on the terminal 1600.

压力传感器1612可以设置在终端1600的侧边框和/或显示屏1605的下层。当压力传感器1612设置在终端1600的侧边框时,可以检测用户对终端1600的握持信号,由处理器1601根据压力传感器1612采集的握持信号进行左右手识别或快捷操作。当压力传感器1612设置在显示屏1605的下层时,由处理器1601根据用户对显示屏1605的压力操作,实现对UI界面上的可操作性控件进行控制。The pressure sensor 1612 can be installed on the side bezel of the terminal 1600 and/or on the lower layer of the display screen 1605. When the pressure sensor 1612 is installed on the side bezel of the terminal 1600, it can detect the user's grip signal on the terminal 1600, and the processor 1601 can perform left/right hand recognition or quick operation based on the grip signal collected by the pressure sensor 1612. When the pressure sensor 1612 is installed on the lower layer of the display screen 1605, the processor 1601 can control the operable controls on the UI interface based on the user's pressure operation on the display screen 1605.

光学传感器1613用于采集环境光强度。在一个实施例中,处理器1601可以根据光学传感器1613采集的环境光强度,控制显示屏1605的显示亮度。An optical sensor 1613 is used to collect ambient light intensity. In one embodiment, a processor 1601 can control the display brightness of a display screen 1605 based on the ambient light intensity collected by the optical sensor 1613.

接近传感器1614用于采集用户与终端1600的正面之间的距离。The proximity sensor 1614 is used to detect the distance between the user and the front of the terminal 1600.

本领域技术人员可以理解,图16中示出的结构并不构成对终端1600的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。Those skilled in the art will understand that the structure shown in FIG16 does not constitute a limitation on the terminal 1600, and may include more or fewer components than shown, or combine certain components, or employ different component arrangements.

上述计算机设备还可以实现为服务器,下面对服务器的结构进行介绍:The aforementioned computer equipment can also be implemented as a server. The structure of a server is described below:

图17是本申请实施例提供的一种服务器的结构示意图,该服务器1700可因配置或性能不同而产生比较大的差异,可以包括一个或多个处理器(Central Processing Units,CPU)1701和一个或多个存储器1702,其中,所述一个或多个存储器1702中存储有至少一条计算机程序,所述至少一条计算机程序由所述一个或多个处理器1701加载并执行以实现上述各个方法实施例提供的方法。当然,该服务器1700还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器1700还可以包括其他用于实现设备功能的部件,在此不做赘述。Figure 17 is a schematic diagram of a server structure provided in an embodiment of this application. The server 1700 can vary significantly due to different configurations or performance. It may include one or more Central Processing Units (CPUs) 1701 and one or more memories 1702. The one or more memories 1702 store at least one computer program, which is loaded and executed by the one or more processors 1701 to implement the methods provided in the various method embodiments described above. Of course, the server 1700 may also have wired or wireless network interfaces, a keyboard, and input/output interfaces for input and output. The server 1700 may also include other components for implementing device functions, which will not be elaborated upon here.

在示例性实施例中,还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有至少一条计算机程序,该计算机程序由处理器加载并执行以实现上述实施例中的视频片段的识别方法。例如,该计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。In an exemplary embodiment, a computer-readable storage medium is also provided, which stores at least one computer program that is loaded and executed by a processor to implement the video segment recognition method described in the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), magnetic tape, floppy disk, or optical data storage device, etc.

在示例性实施例中,还提供了一种计算机程序产品,包括计算机程序,该计算机程序被处理器执行时实现上述视频片段的识别方法。In an exemplary embodiment, a computer program product is also provided, including a computer program that, when executed by a processor, implements the above-described method for recognizing video segments.

在一些实施例中,本申请实施例所涉及的计算机程序可被部署在一个计算机设备上执行,或者在位于一个地点的多个计算机设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算机设备上执行,分布在多个地点且通过通信网络互连的多个计算机设备可以组成区块链系统。In some embodiments, the computer program involved in the present application embodiments may be deployed and executed on a computer device, or executed on multiple computer devices located in one location, or executed on multiple computer devices distributed in multiple locations and interconnected through a communication network. Multiple computer devices distributed in multiple locations and interconnected through a communication network may constitute a blockchain system.

本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.

上述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。The above are merely optional embodiments of this application and are not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims (23)

1.一种视频片段的识别方法,其特征在于,所述方法包括:1. A method for identifying video segments, characterized in that the method comprises: 基于第一视频的视频帧特征以及至少一个第二视频的视频帧特征,确定多个视频帧对,所述视频帧对包括相似度符合相似度条件的第一视频帧和第二视频帧,所述第一视频帧属于所述第一视频,所述第二视频帧属于所述至少一个第二视频;Based on the video frame features of the first video and the video frame features of at least one second video, multiple video frame pairs are determined. The video frame pairs include a first video frame and a second video frame whose similarity meets the similarity condition. The first video frame belongs to the first video, and the second video frame belongs to the at least one second video. 将出现时间差值相同的视频帧对划分为一个初始视频帧组,所述出现时间差值是指所述视频帧对中的两个视频帧在视频中的出现时间之间的差值;Video frame pairs with the same occurrence time difference are divided into an initial video frame group, where the occurrence time difference refers to the difference between the occurrence times of the two video frames in the video frame pair. 按照目标顺序对多个初始视频帧组进行排序,得到多个候选视频帧组;Multiple initial video frame groups are sorted according to the target order to obtain multiple candidate video frame groups; 在所述多个候选视频帧组中任两个相邻的候选视频帧组之间的匹配时间差值符合匹配时间差值条件的情况下,将所述两个相邻的候选视频帧组融合为一个目标视频帧组,所述匹配时间差值是指所述两个相邻的候选视频帧组对应的出现时间差值之间的差值;If the matching time difference between any two adjacent candidate video frame groups in the plurality of candidate video frame groups meets the matching time difference condition, the two adjacent candidate video frame groups are merged into a target video frame group. The matching time difference refers to the difference between the occurrence time differences of the corresponding two adjacent candidate video frame groups. 对于多个所述目标视频帧组中的任一目标视频帧组,按照所述目标视频帧组中视频帧对的第一视频帧在所述第一视频中的出现时间,将所述目标视频帧组中视频帧对的第一视频帧融合为一个候选视频片段;For any one of the multiple target video frame groups, the first video frame of the video frame pair in the target video frame group is merged into a candidate video segment according to the occurrence time of the first video frame of the video frame pair in the first video. 
基于至少一个候选视频片段,确定至少一个目标候选视频片段,所述目标候选视频片段在所述至少一个候选视频片段中的出现次数符合次数条件;Based on at least one candidate video segment, at least one target candidate video segment is determined, wherein the number of times the target candidate video segment appears in the at least one candidate video segment meets a frequency condition; 在任一所述目标候选视频片段在所述第一视频中的出现时间处于目标时间范围的情况下,将所述目标候选视频片段确定为所述第一视频中的目标视频片段,所述目标视频片段为片头或者片尾。If the appearance time of any of the target candidate video segments in the first video is within the target time range, the target candidate video segment is determined as the target video segment in the first video, and the target video segment is the beginning or the end of the video. 2.根据权利要求1所述的方法,其特征在于,所述两个相邻的候选视频帧组包括第一候选视频帧组和第二候选视频帧组,所述将所述两个相邻的候选视频帧组融合为一个目标视频帧组包括:2. The method according to claim 1, wherein the two adjacent candidate video frame groups include a first candidate video frame group and a second candidate video frame group, and the step of fusing the two adjacent candidate video frame groups into a target video frame group includes: 在所述第一候选视频帧组对应的出现时间差值与所述第二候选视频帧组对应的出现时间差值之间的匹配时间差值小于或等于匹配差值阈值的情况下,将所述第一候选视频帧组中的视频帧对添加至所述第二候选视频帧组,得到所述目标视频帧组。If the matching time difference between the occurrence time difference corresponding to the first candidate video frame group and the occurrence time difference corresponding to the second candidate video frame group is less than or equal to the matching difference threshold, the video frame pairs in the first candidate video frame group are added to the second candidate video frame group to obtain the target video frame group. 3.根据权利要求2所述的方法,其特征在于,所述将所述第一候选视频帧组中的视频帧对添加至所述第二候选视频帧组,得到所述目标视频帧组包括:3. 
The method according to claim 2, wherein adding the video frame pairs from the first candidate video frame group to the second candidate video frame group to obtain the target video frame group comprises: 将所述第一候选视频帧组中的视频帧对添加至所述第二候选视频帧组;Add the video frame pairs from the first candidate video frame group to the second candidate video frame group; 基于所述第二候选视频帧组对应的出现时间差值,采用参考第二视频帧替换目标第二视频帧,得到所述目标视频帧组,所述目标第二视频帧为新添加至所述第二候选视频帧组中的第二视频帧,所述参考第二视频帧为所述第二视频中与目标第一视频帧之间的出现时间差值为所述第二候选视频帧组对应的出现时间差值的第二视频帧,所述目标第一视频帧为所述目标第二视频帧所属视频帧对中的第一视频帧。Based on the occurrence time difference value corresponding to the second candidate video frame group, the target second video frame is replaced by a reference second video frame to obtain the target video frame group. The target second video frame is the second video frame newly added to the second candidate video frame group. The reference second video frame is the second video frame in the second video whose occurrence time difference with the target first video frame is the occurrence time difference value corresponding to the second candidate video frame group. The target first video frame is the first video frame in the video frame pair to which the target second video frame belongs. 4.根据权利要求1所述的方法,其特征在于,所述按照所述目标视频帧组中视频帧对的第一视频帧在所述第一视频中的出现时间,将所述目标视频帧组中视频帧对的第一视频帧融合为一个候选视频片段包括:4. The method according to claim 1, wherein fusing the first video frames of the video frame pairs in the target video frame group into a candidate video segment according to the occurrence time of the first video frame of the video frame pair in the first video includes: 比较所述目标视频帧组中任两个相邻的视频帧对的第一视频帧在所述第一视频中的出现时间;Compare the occurrence times of the first video frame in the first video of any two adjacent video frame pairs in the target video frame group. 
在所述两个相邻的视频帧对的第一视频帧在所述第一视频中的出现时间之间的差值符合出现时间条件的情况下,将所述两个相邻的视频帧对添加至临时帧列表;If the difference between the occurrence times of the first video frame of the two adjacent video frame pairs in the first video meets the occurrence time condition, the two adjacent video frame pairs are added to the temporary frame list. 在所述两个相邻的视频帧对的第一视频帧在所述第一视频中的出现时间之间的差值不符合出现时间条件的情况下,将所述临时帧列表中的视频帧对融合为参考视频片段;If the difference between the occurrence times of the first video frame in the first video of two adjacent video frame pairs does not meet the occurrence time condition, the video frame pairs in the temporary frame list are merged into a reference video segment. 基于多个参考视频片段,确定所述至少一个候选视频片段。Based on multiple reference video segments, at least one candidate video segment is determined. 5.根据权利要求4所述的方法,其特征在于,所述多个参考视频片段包括第一重合视频片段和/或第二重合视频片段,所述第一重合视频片段是指属于所述多个参考视频片段中第一参考视频片段的参考视频片段,所述第二重合视频片段是指与所述多个参考视频片段中第二参考视频片段部分重合的参考视频片段,所述基于多个参考视频片段,确定所述至少一个候选视频片段包括下述至少一项:5. The method according to claim 4, wherein the plurality of reference video segments includes a first overlapping video segment and/or a second overlapping video segment, the first overlapping video segment being a reference video segment belonging to a first reference video segment among the plurality of reference video segments, the second overlapping video segment being a reference video segment partially overlapping with a second reference video segment among the plurality of reference video segments, and determining the at least one candidate video segment based on the plurality of reference video segments includes at least one of the following: 在所述多个参考视频片段包括所述第一重合视频片段的情况下,将所述第一重合视频片段删除,得到所述至少一个候选视频片段;If the plurality of reference video segments include the first overlapping video segment, the first overlapping video segment is deleted to obtain the at least one candidate video segment; 在所述多个参考视频片段包括所述第二重合视频片段的情况下,将所述第二重合视频片段与所述第二参考视频片段之间的重合部分删除,得到所述至少一个候选视频片段。If the plurality of reference video segments include the second overlapping 
video segment, the overlapping portion between the second overlapping video segment and the second reference video segment is deleted to obtain the at least one candidate video segment. 6.根据权利要求5所述的方法,其特征在于,所述在所述多个参考视频片段包括所述第二重合视频片段的情况下,将所述第二重合视频片段与所述第二参考视频片段之间的重合部分删除之后,所述方法还包括:6. The method according to claim 5, characterized in that, after deleting the overlapping portion between the second overlapping video segment and the second reference video segment when the plurality of reference video segments include the second overlapping video segment, the method further includes: 比较第三类参考视频片段的时长与目标时长,所述第三类参考视频片段是指删除重合部分的所述第二重合视频片段;Compare the duration of the third type of reference video segment with the target duration, wherein the third type of reference video segment refers to the second overlapping video segment after deleting the overlapping part; 在所述第三类参考视频片段的时长大于或等于所述目标时长的情况下,保留所述第三类参考视频片段;If the duration of the third type of reference video segment is greater than or equal to the target duration, the third type of reference video segment shall be retained. 在所述第三类参考视频片段的时长小于所述目标时长的情况下,删除所述第三类参考视频片段。If the duration of the third type of reference video segment is less than the target duration, the third type of reference video segment shall be deleted. 7.根据权利要求1所述的方法,其特征在于,所述基于至少一个候选视频片段,确定至少一个目标候选视频片段包括:7. The method according to claim 1, wherein determining at least one target candidate video segment based on at least one candidate video segment comprises: 基于所述至少一个候选视频片段,确定至少一个参考候选视频片段;Based on the at least one candidate video segment, at least one reference candidate video segment is determined; 确定每个所述参考候选视频片段在所述至少一个参考候选视频片段的出现次数;Determine the number of times each of the reference candidate video segments appears in the at least one reference candidate video segment; 将出现次数符合所述次数条件的参考候选视频片段确定为目标候选视频片段。Reference candidate video segments whose occurrence frequency meets the stated frequency condition are identified as target candidate video segments. 
8.根据权利要求7所述的方法,其特征在于,所述至少一个候选视频片段包括第三重合视频片段和/或第四重合视频片段,所述第三重合视频片段是指属于所述至少一个候选视频片段中第一候选视频片段的候选视频片段,所述第四重合视频片段是指与所述至少一个候选视频片段中第二候选视频片段部分重合的候选视频片段,所述基于所述至少一个候选视频片段,确定至少一个参考候选视频片段包括下述至少一项:8. The method according to claim 7, wherein the at least one candidate video segment includes a third overlapping video segment and/or a fourth overlapping video segment, the third overlapping video segment being a candidate video segment belonging to a first candidate video segment among the at least one candidate video segments, the fourth overlapping video segment being a candidate video segment partially overlapping with a second candidate video segment among the at least one candidate video segments, and determining at least one reference candidate video segment based on the at least one candidate video segment includes at least one of the following: 在所述至少一个候选视频片段包括所述第三重合视频片段的情况下,将所述第三重合视频片段删除,得到所述至少一个参考候选视频片段;If the at least one candidate video segment includes the third overlapping video segment, the third overlapping video segment is deleted to obtain the at least one reference candidate video segment; 在所述至少一个候选视频片段包括所述第四重合视频片段,且所述第四重合视频片段与所述第二候选视频片段之间的重合度符合重合度条件的情况下,确定所述第四重合视频片段的出现次数;基于所述第四重合视频片段的出现次数,确定所述至少一个参考候选视频片段;If the at least one candidate video segment includes the fourth overlapping video segment, and the overlap between the fourth overlapping video segment and the second candidate video segment meets the overlap condition, the occurrence count of the fourth overlapping video segment is determined; based on the occurrence count of the fourth overlapping video segment, the at least one reference candidate video segment is determined. 
在所述至少一个候选视频片段包括所述第四重合视频片段,且所述第四重合视频片段与所述第二候选视频片段之间的重合度不符合所述重合度条件的情况下,将所述第四重合视频片段删除,得到所述至少一个参考候选视频片段;If the at least one candidate video segment includes the fourth overlapping video segment, and the overlap between the fourth overlapping video segment and the second candidate video segment does not meet the overlap condition, the fourth overlapping video segment is deleted to obtain the at least one reference candidate video segment. 在所述至少一个候选视频片段包括所述第四重合视频片段,且所述第四重合视频片段的时长小于所述第二候选视频片段的情况下,将所述第四重合视频片段删除,得到所述至少一个参考候选视频片段。If the at least one candidate video segment includes the fourth overlapping video segment, and the duration of the fourth overlapping video segment is less than that of the second candidate video segment, the fourth overlapping video segment is deleted to obtain the at least one reference candidate video segment. 9.根据权利要求8所述的方法,其特征在于,所述基于所述第四重合视频片段的出现次数,确定所述至少一个参考候选视频片段包括下述任一项:9. The method according to claim 8, wherein determining the at least one reference candidate video segment based on the occurrence count of the fourth overlapping video segment includes any one of the following: 在所述第四重合视频片段的出现次数大于或等于第一出现次数阈值的情况下,将所述第四重合视频片段与第二候选视频片段进行融合,得到所述至少一个参考候选视频片段;If the number of occurrences of the fourth overlapping video segment is greater than or equal to the first occurrence threshold, the fourth overlapping video segment is fused with the second candidate video segment to obtain the at least one reference candidate video segment. 在所述第四重合视频片段的出现次数小于所述第一出现次数阈值的情况下,将所述第四重合视频片段删除,得到所述至少一个参考候选视频片段。If the number of occurrences of the fourth overlapping video segment is less than the first occurrence threshold, the fourth overlapping video segment is deleted to obtain the at least one reference candidate video segment. 10.根据权利要求1所述的方法,其特征在于,所述方法还包括:10. 
The method according to claim 1, wherein the method further comprises: 对待识别的目标视频的多个目标视频帧进行特征提取,得到所述多个目标视频帧的视频帧特征;Feature extraction is performed on multiple target video frames of the target video to be identified to obtain the video frame features of the multiple target video frames; 基于所述多个目标视频帧的视频帧特征、所述第一视频帧的视频帧特征以及所述至少一个第二视频的视频帧特征,确定所述目标视频的至少一个目标视频片段。Based on the video frame features of the plurality of target video frames, the video frame features of the first video frame, and the video frame features of the at least one second video, at least one target video segment of the target video is determined. 11.一种视频片段的识别装置,其特征在于,所述装置包括:11. A video clip recognition device, characterized in that the device comprises: 视频帧对确定模块,用于基于第一视频的视频帧特征以及至少一个第二视频的视频帧特征,确定多个视频帧对,所述视频帧对包括相似度符合相似度条件的第一视频帧和第二视频帧,所述第一视频帧属于所述第一视频,所述第二视频帧属于所述至少一个第二视频;A video frame pair determination module is used to determine multiple video frame pairs based on video frame features of a first video and video frame features of at least one second video. The video frame pairs include a first video frame and a second video frame whose similarity meets a similarity condition. The first video frame belongs to the first video, and the second video frame belongs to the at least one second video. 融合模块,用于将出现时间差值相同的视频帧对划分为一个初始视频帧组,所述出现时间差值是指所述视频帧对中的两个视频帧在视频中的出现时间之间的差值;The fusion module is used to divide video frame pairs with the same occurrence time difference into an initial video frame group, wherein the occurrence time difference refers to the difference between the occurrence times of the two video frames in the video frame pair in the video. 
所述融合模块,还用于按照目标顺序对多个初始视频帧组进行排序,得到多个候选视频帧组;在所述多个候选视频帧组中任两个相邻的候选视频帧组之间的匹配时间差值符合匹配时间差值条件的情况下,将所述两个相邻的候选视频帧组融合为一个目标视频帧组,所述匹配时间差值是指所述两个相邻的候选视频帧组对应的出现时间差值之间的差值;The fusion module is further configured to sort multiple initial video frame groups according to the target order to obtain multiple candidate video frame groups; if the matching time difference between any two adjacent candidate video frame groups meets the matching time difference condition, the two adjacent candidate video frame groups are fused into a target video frame group, wherein the matching time difference refers to the difference between the occurrence time differences of the corresponding two adjacent candidate video frame groups. 所述融合模块,还用于对于多个所述目标视频帧组中的任一目标视频帧组,按照所述目标视频帧组中视频帧对的第一视频帧在所述第一视频中的出现时间,将所述目标视频帧组中视频帧对的第一视频帧融合为一个候选视频片段;The fusion module is further configured to, for any target video frame group among the plurality of target video frame groups, fuse the first video frame of the video frame pair in the target video frame group into a candidate video segment according to the occurrence time of the first video frame of the video frame pair in the first video. 目标视频片段确定模块,用于基于至少一个候选视频片段,确定至少一个目标候选视频片段,所述目标候选视频片段在所述至少一个候选视频片段中的出现次数符合次数条件;The target video segment determination module is used to determine at least one target candidate video segment based on at least one candidate video segment, wherein the number of times the target candidate video segment appears in the at least one candidate video segment meets a frequency condition; 所述目标视频片段确定模块,还用于在任一所述目标候选视频片段在所述第一视频中的出现时间处于目标时间范围的情况下,将所述目标候选视频片段确定为所述第一视频中的目标视频片段,所述目标视频片段为片头或者片尾。The target video segment determination module is further configured to determine the target candidate video segment as the target video segment in the first video when the appearance time of any target candidate video segment in the first video is within a target time range, wherein the target video segment is the beginning or the end of a video. 
12.根据权利要求11所述的装置,其特征在于,所述两个相邻的候选视频帧组包括第一候选视频帧组和第二候选视频帧组,所述融合模块,用于在所述第一候选视频帧组对应的出现时间差值与所述第二候选视频帧组对应的出现时间差值之间的匹配时间差值小于或等于匹配差值阈值的情况下,将所述第一候选视频帧组中的视频帧对添加至所述第二候选视频帧组,得到所述目标视频帧组。12. The apparatus according to claim 11, wherein the two adjacent candidate video frame groups include a first candidate video frame group and a second candidate video frame group, and the fusion module is used to add video frame pairs in the first candidate video frame group to the second candidate video frame group to obtain the target video frame group when the matching time difference between the occurrence time difference corresponding to the first candidate video frame group and the occurrence time difference corresponding to the second candidate video frame group is less than or equal to a matching difference threshold. 13.根据权利要求12所述的装置,其特征在于,所述融合模块,用于将所述第一候选视频帧组中的视频帧对添加至所述第二候选视频帧组;基于所述第二候选视频帧组对应的出现时间差值,采用参考第二视频帧替换目标第二视频帧,得到所述目标视频帧组,所述目标第二视频帧为新添加至所述第二候选视频帧组中的第二视频帧,所述参考第二视频帧为所述第二视频中与目标第一视频帧之间的出现时间差值为所述第二候选视频帧组对应的出现时间差值的第二视频帧,所述目标第一视频帧为所述目标第二视频帧所属视频帧对中的第一视频帧。13. The apparatus according to claim 12, wherein the fusion module is configured to add video frame pairs from the first candidate video frame group to the second candidate video frame group; based on the occurrence time difference corresponding to the second candidate video frame group, replace the target second video frame with a reference second video frame to obtain the target video frame group, wherein the target second video frame is a second video frame newly added to the second candidate video frame group, the reference second video frame is a second video frame in the second video whose occurrence time difference with the target first video frame is the occurrence time difference corresponding to the second candidate video frame group, and the target first video frame is the first video frame in the video frame pair to which the target second video frame belongs. 
14.根据权利要求11所述的装置,其特征在于,所述融合模块,用于比较所述目标视频帧组中任两个相邻的视频帧对的第一视频帧在所述第一视频中的出现时间;在所述两个相邻的视频帧对的第一视频帧在所述第一视频中的出现时间之间的差值符合出现时间条件的情况下,将所述两个相邻的视频帧对添加至临时帧列表;在所述两个相邻的视频帧对的第一视频帧在所述第一视频中的出现时间之间的差值不符合出现时间条件的情况下,将所述临时帧列表中的视频帧对融合为参考视频片段;基于多个参考视频片段,确定所述至少一个候选视频片段。
14. The apparatus according to claim 11, wherein the fusion module is configured to: compare the occurrence times, in the first video, of the first video frames of any two adjacent video frame pairs in the target video frame group; add the two adjacent video frame pairs to a temporary frame list when the difference between the occurrence times of their first video frames in the first video meets an occurrence time condition; fuse the video frame pairs in the temporary frame list into a reference video segment when that difference does not meet the occurrence time condition; and determine the at least one candidate video segment based on multiple reference video segments.

15.根据权利要求14所述的装置,其特征在于,所述多个参考视频片段包括第一重合视频片段和/或第二重合视频片段,所述第一重合视频片段是指属于所述多个参考视频片段中第一参考视频片段的参考视频片段,所述第二重合视频片段是指与所述多个参考视频片段中第二参考视频片段部分重合的参考视频片段,所述融合模块,用于执行下述至少一项:
15. The apparatus according to claim 14, wherein the plurality of reference video segments include a first overlapping video segment and/or a second overlapping video segment, the first overlapping video segment being a reference video segment belonging to a first reference video segment among the plurality of reference video segments, and the second overlapping video segment being a reference video segment partially overlapping a second reference video segment among the plurality of reference video segments, and the fusion module is configured to perform at least one of the following:

在所述多个参考视频片段包括所述第一重合视频片段的情况下,将所述第一重合视频片段删除,得到所述至少一个候选视频片段;
if the plurality of reference video segments include the first overlapping video segment, deleting the first overlapping video segment to obtain the at least one candidate video segment;

在所述多个参考视频片段包括所述第二重合视频片段的情况下,将所述第二重合视频片段与所述第二参考视频片段之间的重合部分删除,得到所述至少一个候选视频片段。
if the plurality of reference video segments include the second overlapping video segment, deleting the overlapping portion between the second overlapping video segment and the second reference video segment to obtain the at least one candidate video segment.

16.根据权利要求15所述的装置,其特征在于,所述融合模块,还用于比较第三类参考视频片段的时长与目标时长,所述第三类参考视频片段是指删除重合部分的所述第二重合视频片段;在所述第三类参考视频片段的时长大于或等于所述目标时长的情况下,保留所述第三类参考视频片段;在所述第三类参考视频片段的时长小于所述目标时长的情况下,删除所述第三类参考视频片段。
16. The apparatus according to claim 15, wherein the fusion module is further configured to compare the duration of a third type of reference video segment with a target duration, the third type of reference video segment being the second overlapping video segment with its overlapping portion deleted; retain the third type of reference video segment when its duration is greater than or equal to the target duration; and delete the third type of reference video segment when its duration is less than the target duration.
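The temporary-frame-list fusion of claim 14 can be sketched as follows; this is an illustrative sketch, not the claimed implementation. Adjacent frame pairs whose first-video occurrence times are close are buffered in a temporary list, and the buffer is flushed into a reference segment whenever the gap grows too large. The input layout `[(t_first, t_second), ...]` sorted by `t_first` and the threshold `gap_tol` (seconds) are assumptions standing in for the occurrence time condition.

```python
def fuse_into_reference_segments(pairs, gap_tol=2.0):
    """Fuse adjacent matched frame pairs into (start, end) reference
    segments over the first video's timeline."""
    segments, temp = [], []
    for pair in pairs:
        if not temp or pair[0] - temp[-1][0] <= gap_tol:
            temp.append(pair)  # gap meets the condition: keep buffering
        else:
            # gap too large: flush the temporary list into a segment
            segments.append((temp[0][0], temp[-1][0]))
            temp = [pair]
    if temp:
        segments.append((temp[0][0], temp[-1][0]))
    return segments
```

With pairs at first-video times 0, 1, 2 and then 50, 51, the sketch yields two reference segments, (0, 2) and (50, 51), since the jump from 2 to 50 breaks the occurrence time condition.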
17.根据权利要求11所述的装置,其特征在于,所述目标视频片段确定模块,用于基于所述至少一个候选视频片段,确定至少一个参考候选视频片段;确定每个所述参考候选视频片段在所述至少一个参考候选视频片段的出现次数;将出现次数符合所述次数条件的参考候选视频片段确定为目标候选视频片段。
17. The apparatus according to claim 11, wherein the target video segment determination module is configured to determine at least one reference candidate video segment based on the at least one candidate video segment; determine the number of times each reference candidate video segment appears in the at least one reference candidate video segment; and determine a reference candidate video segment whose number of appearances meets the count condition as a target candidate video segment.

18.根据权利要求17所述的装置,其特征在于,所述至少一个候选视频片段包括第三重合视频片段和/或第四重合视频片段,所述第三重合视频片段是指属于所述至少一个候选视频片段中第一候选视频片段的候选视频片段,所述第四重合视频片段是指与所述至少一个候选视频片段中第二候选视频片段部分重合的候选视频片段,所述目标视频片段确定模块,用于执行下述至少一项:
18. The apparatus according to claim 17, wherein the at least one candidate video segment includes a third overlapping video segment and/or a fourth overlapping video segment, the third overlapping video segment being a candidate video segment belonging to a first candidate video segment among the at least one candidate video segment, and the fourth overlapping video segment being a candidate video segment partially overlapping a second candidate video segment among the at least one candidate video segment, and the target video segment determination module is configured to perform at least one of the following:

在所述至少一个候选视频片段包括所述第三重合视频片段的情况下,将所述第三重合视频片段删除,得到所述至少一个参考候选视频片段;
if the at least one candidate video segment includes the third overlapping video segment, deleting the third overlapping video segment to obtain the at least one reference candidate video segment;

在所述至少一个候选视频片段包括所述第四重合视频片段,且所述第四重合视频片段与所述第二候选视频片段之间的重合度符合重合度条件的情况下,确定所述第四重合视频片段的出现次数;基于所述第四重合视频片段的出现次数,确定所述至少一个参考候选视频片段;
if the at least one candidate video segment includes the fourth overlapping video segment and the overlap between the fourth overlapping video segment and the second candidate video segment meets an overlap condition, determining the number of occurrences of the fourth overlapping video segment, and determining the at least one reference candidate video segment based on the number of occurrences of the fourth overlapping video segment;

在所述至少一个候选视频片段包括所述第四重合视频片段,且所述第四重合视频片段与所述第二候选视频片段之间的重合度不符合所述重合度条件的情况下,将所述第四重合视频片段删除,得到所述至少一个参考候选视频片段;
if the at least one candidate video segment includes the fourth overlapping video segment and the overlap between the fourth overlapping video segment and the second candidate video segment does not meet the overlap condition, deleting the fourth overlapping video segment to obtain the at least one reference candidate video segment;

在所述至少一个候选视频片段包括所述第四重合视频片段,且所述第四重合视频片段的时长小于所述第二候选视频片段的情况下,将所述第四重合视频片段删除,得到所述至少一个参考候选视频片段。
if the at least one candidate video segment includes the fourth overlapping video segment and the duration of the fourth overlapping video segment is less than that of the second candidate video segment, deleting the fourth overlapping video segment to obtain the at least one reference candidate video segment.

19.根据权利要求18所述的装置,其特征在于,所述目标视频片段确定模块,用于执行下述任一项:
19. The apparatus according to claim 18, wherein the target video segment determination module is configured to perform any one of the following:

在所述第四重合视频片段的出现次数大于或等于第一出现次数阈值的情况下,将所述第四重合视频片段与第二候选视频片段进行融合,得到所述至少一个参考候选视频片段;
if the number of occurrences of the fourth overlapping video segment is greater than or equal to a first occurrence count threshold, fusing the fourth overlapping video segment with the second candidate video segment to obtain the at least one reference candidate video segment;

在所述第四重合视频片段的出现次数小于所述第一出现次数阈值的情况下,将所述第四重合视频片段删除,得到所述至少一个参考候选视频片段。
if the number of occurrences of the fourth overlapping video segment is less than the first occurrence count threshold, deleting the fourth overlapping video segment to obtain the at least one reference candidate video segment.

20.根据权利要求11所述的装置,其特征在于,所述装置还包括:
20. The apparatus according to claim 11, further comprising:

特征提取模块,用于对待识别的目标视频的多个目标视频帧进行特征提取,得到所述多个目标视频帧的视频帧特征;
a feature extraction module configured to perform feature extraction on multiple target video frames of a target video to be recognized, to obtain video frame features of the multiple target video frames;

所述目标视频片段确定模块,还用于基于所述多个目标视频帧的视频帧特征、所述第一视频帧的视频帧特征以及所述至少一个第二视频的视频帧特征,确定所述目标视频的至少一个目标视频片段。
wherein the target video segment determination module is further configured to determine at least one target video segment of the target video based on the video frame features of the multiple target video frames, the video frame features of the first video frames, and the video frame features of the at least one second video.

21.一种计算机设备,其特征在于,所述计算机设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条计算机程序,所述计算机程序由所述一个或多个处理器加载并执行以实现如权利要求1至权利要求10任一项所述的视频片段的识别方法。
21. A computer device, comprising one or more processors and one or more memories, the one or more memories storing at least one computer program, the computer program being loaded and executed by the one or more processors to implement the method for recognizing a video segment according to any one of claims 1 to 10.

22.一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有至少一条计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至权利要求10任一项所述的视频片段的识别方法。
22. A computer-readable storage medium, storing at least one computer program, the computer program being loaded and executed by a processor to implement the method for recognizing a video segment according to any one of claims 1 to 10.

23.一种计算机程序产品,包括计算机程序,其特征在于,该计算机程序被处理器执行时实现权利要求1至权利要求10任一项所述的视频片段的识别方法。
23. A computer program product, comprising a computer program which, when executed by a processor, implements the method for recognizing a video segment according to any one of claims 1 to 10.
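The determination stage described in claims 11 and 17-19 can be sketched as follows; this is an illustrative sketch under stated assumptions, not the claimed implementation. Candidate segments recovered from several episode comparisons are counted, with two segments treated as the same occurrence when they overlap enough, and a segment whose count meets the count condition and whose occurrence time lies within the target range is reported as the target segment (e.g. the first 120 s for opening credits). The function names, the intersection-over-shorter overlap measure, and all thresholds are assumptions.

```python
def overlap_ratio(a, b):
    """Intersection-over-shorter for two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0

def pick_target_segment(candidates, min_count=2, window=(0.0, 120.0),
                        min_overlap=0.8):
    """Count recurring candidate segments and keep those whose count meets
    the count condition and whose times fall in the target range."""
    counted = []  # list of (representative segment, occurrence count)
    for seg in candidates:
        for i, (rep, n) in enumerate(counted):
            if overlap_ratio(seg, rep) >= min_overlap:
                counted[i] = (rep, n + 1)  # same occurrence: bump the count
                break
        else:
            counted.append((seg, 1))       # a new, distinct segment
    return [rep for rep, n in counted
            if n >= min_count and window[0] <= rep[0] and rep[1] <= window[1]]
```

With candidates (0, 90), (1, 89), (0, 91), and (400, 500), the first three are counted as one segment appearing three times and, lying inside the first 120 s, that segment is reported; the singleton (400, 500) fails the count condition.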
HK42022057504.7A 2022-07-28 Method and apparatus for recognizing video segment, device and storage medium HK40067617B (en)

Publications (2)

Publication Number Publication Date
HK40067617A (en) 2022-09-09
HK40067617B (en) 2024-12-06


Similar Documents

Publication Publication Date Title
CN114550070B (en) Video clip identification method, device, equipment and storage medium
Hu et al. Learning supervised scoring ensemble for emotion recognition in the wild
EP3796189A1 (en) Video retrieval method, and method and apparatus for generating video retrieval mapping relationship
CN113591530B (en) Video detection method, device, electronic device and storage medium
CN112183435B (en) A two-stage hand object detection method
CN114329013A (en) A data processing method, device and computer-readable storage medium
CN116601626A (en) Personal knowledge map construction method, device and related equipment
CN111339369A (en) Video retrieval method, system, computer equipment and storage medium based on depth features
CN115171014A (en) Video processing method and device, electronic equipment and computer readable storage medium
CN110381392A (en) A kind of video abstraction extraction method and its system, device, storage medium
CN110909817B (en) Distributed clustering method and system, processor, electronic device and storage medium
CN116645624A (en) Video content understanding method and system, computer device, and storage medium
US20220335566A1 (en) Method and apparatus for processing point cloud data, device, and storage medium
CN111046232B (en) Video classification method, device and system
CN115278300B (en) Video processing method, device, electronic device, storage medium and program product
CN116883740A (en) Similar picture identification method, device, electronic equipment and storage medium
HK40067617B (en) Method and apparatus for recognizing video segment, device and storage medium
Cheng et al. CNN retrieval based unsupervised metric learning for near-duplicated video retrieval
CN118779478A (en) Video text retrieval method, device, electronic device and storage medium
CN116597185A (en) Template updating method and electronic device
CN116975735A (en) Training methods, devices, equipment, and storage media for correlation prediction models
CN114911939A (en) Hotspot mining method and device, electronic equipment, storage medium and program product
HK40067617A (en) Method and apparatus for recognizing video segment, device and storage medium
CN120744147B (en) Intelligent processing method and system for product appearance information based on dialogue context
CN116680439B (en) Main angle fingerprint-based association identification method and device and computer equipment