Disclosure of Invention
Based on the defects of the prior art, the present application provides a video highlight clip clipping method and apparatus, an electronic device, and a storage medium, so as to solve the problem that the prior art cannot accurately clip high-quality video highlight clips.
In order to achieve the above object, the present application provides the following technical solutions:
A first aspect of the present application provides a video highlight clip clipping method, comprising the following steps:
acquiring audio data of a target video;
detecting the event category to which each frame of audio frame data of the audio data of the target video belongs, through a multi-scale convolutional neural network and an adaptive attention mechanism;
determining a target audio event of the target video based on the event category to which each frame of audio frame data belongs, and marking a timestamp of the target audio event, wherein a target audio event is an audio event that reflects video highlight content;
aligning the target audio event to the video data of each camera position of the target video based on the timestamp of the target audio event;
analyzing the optimal camera position for each time period within the range of the timestamp of the target audio event, based on the audio features of the audio data of each camera position of the target video and the video features of the video data of each camera position;
and clipping the audio data and video data of the optimal camera position for each time period within the range of the timestamp of the target audio event, to obtain the highlight video clip corresponding to the target audio event.
Optionally, in the above video highlight clip clipping method, before detecting the event category to which each frame of audio frame data of the audio data of the target video belongs, the method further includes:
preprocessing the audio data of the target video;
sliding a time window over the preprocessed audio data of the target video with a preset sliding step length to obtain each frame of audio frame data of the audio data of the target video, wherein the audio data contained in the time window after one slide constitutes one frame of audio frame data;
and extracting frequency-domain features of each frame of audio frame data using the short-time Fourier transform, extracting at least the Mel-frequency cepstral coefficients and a time-frequency spectrogram of the audio frame data, and constructing a multidimensional feature matrix for each frame of audio frame data.
Optionally, in the above video highlight clip clipping method, detecting, through a multi-scale convolutional neural network and an adaptive attention mechanism, the event category to which each frame of audio frame data of the audio data of the target video belongs includes:
for each frame of audio frame data, inputting the multidimensional feature matrix extracted from the audio frame data into the multi-scale convolutional neural network;
performing multi-scale convolution on the multidimensional feature matrix of the audio frame data through the multi-scale convolutional neural network to obtain a high-level feature matrix of the audio frame data;
optimizing the high-level feature matrix of the audio frame data using the adaptive attention mechanism;
and calculating, through a classifier, the probability that the audio frame data belongs to each event category based on the optimized high-level feature matrix, and determining the event category with the highest probability as the event category to which the audio frame data belongs.
Optionally, in the above video highlight clip clipping method, determining the target audio event of the target video based on the event category to which each frame of audio frame data belongs, and marking the timestamp of the target audio event, includes:
analyzing whether a target event category exists based on the probabilities with which N consecutive frames of audio frame data belong to their respective event categories, wherein the target event category is the category of the target audio event, and the sum of the probabilities of the frames among the N consecutive frames that belong to the target event category is greater than a probability threshold;
if it is determined from these probabilities that a target event category exists, determining that a target audio event of the target video belonging to the target event category exists within the N consecutive frames of audio frame data;
and determining the timestamp of the first frame among the N consecutive frames that belongs to the target event category as the start timestamp of the target audio event, and the timestamp of the last frame that belongs to the target event category as the end timestamp of the target audio event.
Optionally, in the above video highlight clip clipping method, aligning the target audio event to the video data of each camera position of the target video based on the timestamp of the target audio event includes:
searching the video data of any camera position of the target video for the key frame whose time differs least from the timestamp of the target audio event;
aligning the target audio event with the video data of that camera position according to the difference between the timestamp of the target audio event and the timestamp of the found key frame;
and aligning the video data of the camera positions of the target video with one another using a dynamic time warping algorithm.
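The dynamic time warping step above can be sketched as follows. This is an illustrative implementation, not the patented one: it aligns two feature sequences (for example the per-frame audio-energy envelopes of two camera feeds, which are hypothetical inputs here) by the textbook DTW recurrence and backtracks the warping path.

```python
def dtw_align(seq_a, seq_b):
    """Return the total DTW cost and the frame-to-frame warping path."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of seq_b
                                 cost[i][j - 1],      # skip a frame of seq_a
                                 cost[i - 1][j - 1])  # match both frames
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return cost[n][m], path[::-1]
```

Identical content at different tempos aligns with zero cost, which is what makes DTW suitable for reconciling feeds whose clocks drift.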
Optionally, in the above video highlight clip clipping method, analyzing the optimal camera position for each time period within the range of the timestamp of the target audio event based on the audio features of the audio data of each camera position of the target video and the video features of the video data of each camera position includes:
subtracting a first adjustment time from the start timestamp of the target audio event and adding a second adjustment time to its end timestamp, to obtain the minimum and maximum of the time range corresponding to the target audio event;
for each camera position of the target video, weighting the volume intensity and sound-source direction of that camera position's audio data for each unit time within the time range corresponding to the target audio event, to obtain an audio feature score for each unit time of that camera position;
weighting the character motion amplitude and emotional-response information of that camera position's video data for each unit time within the time range, to obtain a video feature score for each unit time of that camera position;
adding the audio feature score and the video feature score of each unit time of that camera position, to obtain a fused feature score for each unit time of that camera position;
and selecting, for each unit time within the time range corresponding to the target audio event, the camera position with the highest fused feature score as the optimal camera position for that unit time.
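The scoring and fusion steps above can be sketched as follows. The weights and the per-second feature values are hypothetical — the application does not fix concrete numbers — but the structure (weighted audio score plus weighted video score, then an argmax per unit time) follows the steps listed.

```python
def audio_score(volume, direction, w_vol=0.7, w_dir=0.3):
    # Weighted combination of volume intensity and sound-source direction.
    return w_vol * volume + w_dir * direction

def video_score(motion, emotion, w_mot=0.6, w_emo=0.4):
    # Weighted combination of character motion amplitude and emotional response.
    return w_mot * motion + w_emo * emotion

def best_camera_per_unit(features):
    """features: {camera: [(volume, direction, motion, emotion), ...]},
    one tuple per unit time. Returns the camera with the highest fused
    score for each unit time."""
    n = len(next(iter(features.values())))
    best = []
    for t in range(n):
        fused = {}
        for cam, vals in features.items():
            v, d, m, e = vals[t]
            fused[cam] = audio_score(v, d) + video_score(m, e)
        best.append(max(fused, key=fused.get))
    return best
```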
Optionally, in the above video highlight clip clipping method, clipping the audio data and video data of the optimal camera position of each time period within the range of the timestamp of the target audio event to obtain the highlight video clip corresponding to the target audio event includes:
polling each unit time within the time range corresponding to the target audio event in sequence;
if the optimal camera position for the currently polled unit time differs from the camera position currently being clipped, and the time elapsed since switching to the current clipping camera position is greater than a preset time difference, switching the clipping camera position to the optimal camera position for the currently polled unit time;
and clipping the audio data and video data of the current clipping camera position until every unit time within the time range corresponding to the target audio event has been polled.
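The polling-and-switching rule above — switch only when the polled optimum differs from the current camera and the current camera has been held longer than a preset minimum — can be sketched as follows. `min_hold` and the per-second camera list are illustrative stand-ins.

```python
def plan_cuts(best_per_unit, min_hold=2):
    """best_per_unit: the optimal camera for each unit time, in order.
    Returns a cut list of (start_unit, camera) pairs, switching only when
    the new optimum differs and the current camera was held > min_hold
    units (this avoids jarring rapid-fire cuts)."""
    cuts = []
    current_cam, switched_at = None, 0
    for t, cam in enumerate(best_per_unit):
        if current_cam is None:
            current_cam, switched_at = cam, t
            cuts.append((t, cam))
        elif cam != current_cam and t - switched_at > min_hold:
            current_cam, switched_at = cam, t
            cuts.append((t, cam))
    return cuts
```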
A second aspect of the present application provides a video highlight clip clipping apparatus, comprising:
a data acquisition unit, configured to acquire audio data of a target video;
a category detection unit, configured to detect the event category to which each frame of audio frame data of the audio data of the target video belongs, through a multi-scale convolutional neural network and an adaptive attention mechanism;
a time marking unit, configured to determine a target audio event of the target video based on the event category to which each frame of audio frame data belongs, and to mark the timestamp of the target audio event, wherein a target audio event is an audio event that reflects video highlight content;
an alignment unit, configured to align the target audio event to the video data of each camera position of the target video based on the timestamp of the target audio event;
a camera analysis unit, configured to analyze the optimal camera position for each time period within the range of the timestamp of the target audio event, based on the audio features of the audio data of each camera position of the target video and the video features of the video data of each camera position;
and a clipping unit, configured to clip the audio data and video data of the optimal camera position for each time period within the range of the timestamp of the target audio event, to obtain the highlight video clip corresponding to the target audio event.
Optionally, the video highlight clip clipping apparatus described above further includes:
a preprocessing unit, configured to preprocess the audio data of the target video;
a sliding unit, configured to slide a time window over the preprocessed audio data of the target video with a preset sliding step length to obtain each frame of audio frame data of the audio data of the target video, wherein the audio data contained in the time window after one slide constitutes one frame of audio frame data;
and a feature extraction unit, configured to extract frequency-domain features of each frame of audio frame data using the short-time Fourier transform, extract at least the Mel-frequency cepstral coefficients and a time-frequency spectrogram of the audio frame data, and construct a multidimensional feature matrix for each frame of audio frame data.
Optionally, in the video highlight clip clipping apparatus described above, the category detection unit includes:
an input unit, configured to input, for each frame of audio frame data, the multidimensional feature matrix extracted from the audio frame data into the multi-scale convolutional neural network;
a convolution unit, configured to perform multi-scale convolution on the multidimensional feature matrix of the audio frame data through the multi-scale convolutional neural network, to obtain a high-level feature matrix of the audio frame data;
an optimization unit, configured to optimize the high-level feature matrix of the audio frame data using the adaptive attention mechanism;
and a category determination unit, configured to calculate, through the classifier, the probability that the audio frame data belongs to each event category based on the optimized high-level feature matrix, and to determine the event category with the highest probability as the event category to which the audio frame data belongs.
Optionally, in the video highlight clip clipping apparatus described above, the time marking unit includes:
an event judgment unit, configured to analyze whether a target event category exists based on the probabilities with which N consecutive frames of audio frame data belong to their respective event categories, wherein the target event category is the category of the target audio event, and the sum of the probabilities of the frames among the N consecutive frames that belong to the target event category is greater than a probability threshold;
an event determination unit, configured to determine, when it is determined from these probabilities that a target event category exists, that a target audio event of the target video belonging to the target event category exists within the N consecutive frames of audio frame data;
and a time determination unit, configured to determine the timestamp of the first frame among the N consecutive frames that belongs to the target event category as the start timestamp of the target audio event, and the timestamp of the last frame that belongs to the target event category as the end timestamp of the target audio event.
Optionally, in the video highlight clip clipping apparatus described above, the alignment unit includes:
a key frame search unit, configured to search the video data of any camera position of the target video for the key frame whose time differs least from the timestamp of the target audio event;
a video data alignment unit, configured to align the target audio event with the video data of that camera position according to the difference between the timestamp of the target audio event and the timestamp of the found key frame;
and a camera data alignment unit, configured to align the video data of the camera positions of the target video with one another using a dynamic time warping algorithm.
Optionally, in the video highlight clip clipping apparatus described above, the camera analysis unit includes:
a time adjustment unit, configured to subtract a first adjustment time from the start timestamp of the target audio event and add a second adjustment time to its end timestamp, to obtain the minimum and maximum of the time range corresponding to the target audio event;
an audio scoring unit, configured to weight, for each camera position of the target video, the volume intensity and sound-source direction of that camera position's audio data for each unit time within the time range corresponding to the target audio event, to obtain an audio feature score for each unit time of that camera position;
a video scoring unit, configured to weight the character motion amplitude and emotional-response information of that camera position's video data for each unit time within the time range, to obtain a video feature score for each unit time of that camera position;
a score fusion unit, configured to add the audio feature score and the video feature score of each unit time of that camera position, to obtain a fused feature score for each unit time of that camera position;
and a camera selection unit, configured to select, for each unit time within the time range corresponding to the target audio event, the camera position with the highest fused feature score as the optimal camera position for that unit time.
Optionally, in the video highlight clip clipping apparatus described above, the clipping unit includes:
a polling unit, configured to poll each unit time within the time range corresponding to the target audio event in sequence;
a camera switching unit, configured to switch the clipping camera position to the optimal camera position for the currently polled unit time when that optimal camera position differs from the camera position currently being clipped and the time elapsed since switching to the current clipping camera position is greater than a preset time difference;
and a data clipping unit, configured to clip the audio data and video data of the current clipping camera position until every unit time within the time range corresponding to the target audio event has been polled.
A third aspect of the present application provides an electronic device, comprising:
a memory and a processor;
wherein the memory is configured to store a program;
and the processor is configured to execute the program, the program being specifically configured, when executed, to implement the video highlight clip clipping method according to any one of the foregoing aspects.
A fourth aspect of the present application provides a computer storage medium storing a computer program which, when executed by a processor, implements the video highlight clip clipping method according to any one of the foregoing aspects.
The present application provides a video highlight clip clipping method. First, the audio data of a target video is acquired. The event category to which each frame of audio frame data of the audio data belongs is detected through a multi-scale convolutional neural network and an adaptive attention mechanism, so that the method can adapt to a variety of complex scenes and accurately determine whether the audio frame data contains an audio event reflecting video highlight content. The target audio event of the target video is then determined based on the event category to which each frame of audio frame data belongs, and the timestamp of the target audio event is marked. Next, the target audio event is aligned to the video data of each camera position of the target video based on its timestamp, correcting both the deviation between the audio data and the video data and the deviations among the video data of the different camera positions, so that the corresponding video content can be clipped accurately from each camera position. Then, based on the audio features of the audio data and the video features of the video data of each camera position, the optimal camera position for each time period within the range of the timestamp of the target audio event is analyzed; through this integrated analysis of the video and audio data of each camera position, the camera position that most accurately captured the highest-quality highlight content in each time period is determined, facilitating the subsequent accurate clipping of high-quality highlight video clips.
Finally, the audio data and video data of the optimal camera position for each time period within the range of the timestamp of the target audio event are clipped to obtain the highlight video clip corresponding to the target audio event, thereby realizing a method for accurately clipping high-quality highlight video clips from a video.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the present application, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Likewise, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The embodiment of the application provides a video highlight clip clipping method, as shown in fig. 1, comprising the following steps:
s101, acquiring audio data of a target video.
The target video is a video from which highlight clips are to be extracted.
Differences in a video's audio data reflect differences in its content, so the positions of highlight video segments can be located by analyzing the audio associated with highlight content, for example the occurrence of events such as applause, laughter, and cheering, and the highlight video content can thereby be determined. It is therefore necessary to acquire the audio data of the target video for analysis.
S102, detecting the event category to which each frame of audio frame data of the audio data of the target video belongs, through a multi-scale convolutional neural network and an adaptive attention mechanism.
In order to analyze the specific situation of the audio data, and to do so in real time, in the embodiment of the present application the audio data is divided into multiple frames of audio frame data.
Optionally, in order to accurately analyze each complete highlight segment and prevent the analysis results from being too fragmented or from missing content, the embodiment of the application does not split the audio directly at fixed time intervals; instead, a sliding-window approach is used to logically divide the audio into multiple frames of audio frame data. The size of the sliding window is set first; the window then slides from the start position of the audio data of the target video by a set step length, that is, each slide advances by the set step length. After each slide, the audio data within the current window is taken as one frame of audio frame data. The step length is usually set smaller than the window size, so that consecutive frames of audio frame data overlap, allowing more accurate analysis.
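The sliding-window framing just described can be sketched as follows — a minimal illustration with hypothetical window and step sizes, operating on samples rather than real audio.

```python
def frame_audio(samples, win_len, hop_len):
    """Split samples into overlapping frames: a window of win_len samples
    advanced by hop_len each slide. With hop_len < win_len, consecutive
    frames share audio, which is what keeps events from being split
    awkwardly at frame boundaries."""
    frames = []
    start = 0
    while start + win_len <= len(samples):
        frames.append(samples[start:start + win_len])
        start += hop_len
    return frames
```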
In order to analyze the audio data at multiple scales and improve the accuracy of the analysis results, the embodiment of the application uses the multi-scale convolutional neural network to extract features at multiple scales, and the adaptive attention mechanism to adaptively adjust the weights of the different features so as to adapt to a variety of complex scenes, thereby ensuring the accuracy of the final analysis results.
When analyzing audio frame data, the main question is whether it contains an audio event corresponding to highlight content, i.e., a target audio event reflecting the audio of video highlight content, such as applause, cheering, or singing. Specifically, the event category to which the audio frame data belongs is analyzed. Optionally, the event categories may include non-target audio events and target audio events, and target audio events may be further divided into multiple categories by event type.
In the embodiment of the application, the target video has video data from multiple camera positions; to obtain high-quality highlight clips, the video data of each camera position is analyzed later so that the video data of the optimal camera position can be clipped. Each camera position may have corresponding audio data, and the audio data of different camera positions may differ due to the distance of the capture device, the environment, the device settings, and so on.
Therefore, optionally, when the audio data of the target video is analyzed, the audio data of each camera position may be analyzed separately, and the per-camera analysis results then fused into a final result, for example by a weighted average or a majority voting strategy.
Alternatively, the camera audio with the best sound quality and clearest pickup may be selected as a global audio source and analyzed. Or the audio data with the highest signal-to-noise ratio or the highest capture quality, for example that collected by a dedicated main microphone, may be selected for analysis, and its analysis result taken as the final detection result.
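The two fusion strategies mentioned — weighted averaging of per-camera class probabilities, and majority voting over per-camera predictions — can be sketched as follows. The camera names, class labels, and weights are hypothetical.

```python
def fuse_weighted(per_camera_probs, weights):
    """Weighted average of each camera's class-probability dict."""
    classes = next(iter(per_camera_probs.values())).keys()
    return {c: sum(weights[cam] * probs[c]
                   for cam, probs in per_camera_probs.items())
            for c in classes}

def majority_vote(per_camera_labels):
    """Pick the event category predicted by the most cameras."""
    return max(set(per_camera_labels), key=per_camera_labels.count)
```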
When analyzing audio frame data, the analysis is mainly performed on its features. In another embodiment of the present application, the features of the audio frame data are extracted before step S102 is performed. As shown in fig. 2, a method for extracting features of audio frame data according to an embodiment of the present application includes:
S201, preprocessing the audio data of the target video.
In order to improve the quality of the audio data, the embodiment of the application performs noise reduction, filtering, normalization, and other processing on the audio data to remove background noise, redundant information, and the like.
S202, sliding a time window on the preprocessed audio data of the target video through a preset sliding step length to obtain audio frame data of each frame of the audio data of the target video.
The audio data contained in the time window after one slide constitutes one frame of audio frame data.
S203, extracting frequency-domain features of each frame of audio frame data using the short-time Fourier transform, extracting at least the Mel-frequency cepstral coefficients and a time-frequency spectrogram of the audio frame data, and constructing a multidimensional feature matrix for each frame of audio frame data.
In the embodiment of the application, the short-time Fourier transform is mainly used to convert each frame of audio frame data into frequency-domain features; to meet the recognition requirements of different types of events, the Mel-frequency cepstral coefficients and a time-frequency spectrogram are further extracted, and finally the extracted features are assembled into a multidimensional feature matrix. Of course, other features, such as pitch features, may be further extracted.
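As a minimal sketch of the frequency-domain step, the following computes the magnitude spectrum of one audio frame with a naive DFT. A real pipeline would apply a window function, use an FFT, and pass the result through a Mel filterbank plus DCT to obtain the MFCCs; those stages are omitted here, so this is illustrative only.

```python
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for one frame (bins 0 .. N/2), standing in for
    the per-frame column of a short-time Fourier transform."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(-2 * math.pi * k * i / n)
                 for i, x in enumerate(frame))
        im = sum(x * math.sin(-2 * math.pi * k * i / n)
                 for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags
```

Stacking these spectra over all frames yields the time-frequency representation the embodiment refers to.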
Optionally, in another embodiment of the present application, a specific implementation of step S102, as shown in fig. 3, includes the following steps:
S301, for each frame of audio frame data, inputting the multidimensional feature matrix extracted from the audio frame data into the multi-scale convolutional neural network.
S302, performing multi-scale convolution on the multidimensional feature matrix of the audio frame data through the multi-scale convolutional neural network to obtain a high-level feature matrix of the audio frame data.
Specifically, the multi-scale convolutional neural network can perform convolution operations on the multidimensional feature matrix using convolution kernels of different sizes, thereby realizing multi-scale feature extraction, efficiently capturing the temporal features of both short and sustained events, and improving the system's temporal sensitivity to audio events. The convolution can be expressed as:

Y_s = X * k_s

where k_s is a convolution kernel of size s, * denotes the convolution operation, Y_s is the feature map extracted at scale s, and X represents the multidimensional feature matrix of the input audio frame data.
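A minimal one-dimensional sketch of this multi-scale convolution: one feature map per kernel size s (the k_s defined above), on a toy input rather than a real feature matrix. Real CNN layers would add channels, padding, nonlinearities, and learned kernels.

```python
def conv1d(x, kernel):
    """Valid-mode 1-D convolution in the cross-correlation form used by CNNs."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def multi_scale_conv(x, kernels):
    """Apply kernels of different sizes; short kernels catch brief events,
    long kernels catch sustained ones."""
    return [conv1d(x, k) for k in kernels]
```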
S303, optimizing the high-level feature matrix of the audio frame data using the adaptive attention mechanism.
Specifically, the extracted high-level feature matrix of the audio frame data is projected into query, key, and value spaces to obtain the corresponding matrices. These matrices are then used to calculate the attention scores of the features, determining a matrix of attention weights and thus the weight of each feature. The attention score is calculated as:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where Q is the query matrix, K is the key matrix, V is the value matrix, and d_k is the dimension of the key matrix.
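Scaled dot-product attention over the Q, K, V matrices defined above can be sketched in plain Python as follows (matrices as lists of row vectors; a real implementation would be vectorized and include the learned projection layers, which are omitted here).

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, row by row of Q."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```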
S304, calculating, through a classifier, the probability that the audio frame data belongs to each event category based on the optimized high-level feature matrix, and determining the event category with the highest probability as the event category to which the audio frame data belongs.
Specifically, the classifier calculates the probability that the audio frame data belongs to event category C_i by the following formula:

P(C_i | x) = exp(w_i · x) / Σ_j exp(w_j · x)

where w_i is the weight vector corresponding to event category C_i, the w_j are the weight vectors of all event categories, and x is the input feature vector.
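This softmax classifier can be sketched directly from the variable definitions above; the weight vectors and feature vector below are toy values, and bias terms are omitted.

```python
import math

def class_probabilities(x, weight_vectors):
    """P(C_i | x) = exp(w_i . x) / sum_j exp(w_j . x)."""
    logits = [sum(wi * xi for wi, xi in zip(w, x)) for w in weight_vectors]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(x, weight_vectors):
    """Return the index of the most probable event category."""
    probs = class_probabilities(x, weight_vectors)
    return max(range(len(probs)), key=probs.__getitem__)
```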
S103, determining the target audio event of the target video based on the event category to which each frame of audio frame data belongs, and marking the timestamp of the target audio event.
A target audio event is an audio event that reflects video highlight content.
It should be noted that a target audio event reflecting video highlight content may last for some time, and the event category to which the audio frame data belongs indicates whether a target audio event is present. Each occurring target audio event can therefore be determined from the event categories of the individual frames of audio frame data, and the timestamp of each target audio event, that is, its time range, can then be marked.
Optionally, in another embodiment of the present application, a specific implementation of step S103, as shown in fig. 4, includes the following steps:
S401, analyzing whether a target event category exists based on the probabilities with which N consecutive frames of audio frame data belong to their respective event categories.
The target event category is the category of a target audio event, and the sum of the probabilities of the frames among the N consecutive frames that belong to the target event category is greater than a probability threshold.
It should be noted that one frame of audio frame data is short, which makes it easy to accurately analyze the event category to which it belongs; but when highlight video content appears it can last for a while, so an occurring target event correspondingly lasts longer as well and spans multiple frames of audio frame data. The embodiment of the application therefore performs an aggregate analysis over consecutive frames: when, within N consecutive frames of audio frame data, the sum of the probabilities of the frames belonging to one target event category is greater than the probability threshold, the existence of that target event category is determined. The selection of the N frames to analyze may also proceed by a certain step size. If it is determined from these probabilities that a target event category exists, step S402 is performed.
S402, determining that a target audio event of a target video belonging to a target event category exists in continuous N frames of audio frame data.
S403, determining the time stamp of the audio frame data of the first frame belonging to the target event category in the continuous N frames of audio frame data as the starting time stamp of the target audio event, and determining the time stamp of the audio frame data of the last frame belonging to the target event category as the ending time stamp of the target audio event.
It should be noted that, since the target audio event is a continuous event, the time stamp of the marked target audio event includes a start time stamp and an end time stamp thereof.
Alternatively, the time stamp of a frame of audio frame data may specifically be the start time of the frame of audio frame data.
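As an illustrative sketch only (the function name and the parameters N, the probability threshold, and the per-frame duration are assumptions for demonstration, not details fixed by this application), the S401–S403 logic of detecting a target audio event from per-frame classification results and deriving its start and end timestamps might look like:

```python
# Hypothetical sketch of S401-S403: scan N consecutive frames, sum the
# probabilities of frames classified as the target category, and when the
# sum exceeds the threshold, mark the start/end timestamps of the event.
# A frame's timestamp is taken as its start time (per the text above).

def detect_target_event(frame_probs, target_category, n=5,
                        prob_threshold=2.5, frame_duration=0.5):
    """frame_probs: list of (category, probability) per audio frame.
    Returns (start_ts, end_ts) of the first detected event, else None."""
    for start in range(len(frame_probs) - n + 1):
        window = frame_probs[start:start + n]
        hits = [i for i, (cat, _) in enumerate(window) if cat == target_category]
        prob_sum = sum(p for cat, p in window if cat == target_category)
        if hits and prob_sum > prob_threshold:
            # First/last frame of the target category give the start/end.
            first, last = start + hits[0], start + hits[-1]
            return first * frame_duration, last * frame_duration
    return None
```

With frames of 0.5 s, an event whose frames 1 through 4 are classified as the target category with probabilities summing above the threshold would be marked as starting at 0.5 s and ending at 2.0 s.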
S104, aligning the target audio event to the video data of each machine position of the target video based on the time stamp of the target audio event.
Considering that the rhythm of an audio event differs from the rhythm of visual change, and that when the picture changes frequently the original alignment precision between audio frames and video frames may degrade, event positioning errors and thus incorrect clipped content can result. Therefore, in the embodiment of the application, the target audio event is aligned to the video data of the target video based on its timestamp, that is, the corresponding time of the target audio event on the video data is determined so that accurate clipping can be performed later, rather than clipping directly according to the timestamp of the target audio event. In order to facilitate clipping from the video data of each machine position, the target audio event needs to be aligned to the video data of each machine position of the target video.
Optionally, in another embodiment of the present application, a specific implementation of step S104, as shown in fig. 5, includes the following steps:
S501, searching out a key frame with the smallest difference value between the time and the time stamp of the target audio event from video data of any machine position of the target video.
It should be noted that video data includes key frames and predicted frames. A key frame is a video frame that contains complete picture information and can be decoded independently. A predicted frame is a video frame that depends on the information of preceding and following frames and can only be decoded with reference to a key frame. Consequently, a precise cut can only be made at a key frame during clipping, which is why alignment with the key frames of the video data is particularly required.
Specifically, the difference between the timestamp of each video frame in the video data and the timestamp of the target audio event may be calculated; more specifically, the difference between the start time in the timestamp of the video frame data and the start time in the timestamp of the target audio event may be calculated, and the key frame with the minimum difference is then selected.
S502, aligning the target audio event with the video data of the machine position according to the difference value of the time stamp of the target audio event and the time stamp of the searched key frame.
The difference between the timestamp of the target audio event and the timestamp of the found key frame is the deviation between the two, so the timestamp of the target audio event can be aligned to the video data based on this deviation; that is, the timestamp of the target audio event is converted from an audio timestamp to the timestamp of the corresponding video data.
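A minimal sketch of the S501–S502 idea, assuming the key-frame timestamps have already been extracted (the function name and return convention are illustrative, not prescribed by this application):

```python
# Hypothetical sketch: snap an audio event timestamp to the nearest key
# frame of one machine position's video data, returning both the aligned
# (video) timestamp and the deviation that was applied.

def align_event_to_keyframes(event_ts, keyframe_times):
    """Find the key frame closest in time to the audio event timestamp."""
    nearest = min(keyframe_times, key=lambda t: abs(t - event_ts))
    offset = nearest - event_ts  # deviation between audio and video time
    return nearest, offset
```

For example, an event at 3.5 s with key frames every 2 s would be aligned to the key frame at 4.0 s, with a recorded deviation of 0.5 s.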
S503, utilizing a dynamic time warping algorithm to align video data of each machine position of the target video.
Since there may be delay differences among the video data of the machine positions, aligning the target audio event to the video data of one machine position is not sufficient; the video data of the machine positions also need to be aligned with one another. In the embodiment of the present application, a dynamic time warping algorithm is used for the alignment, which can be specifically expressed by the recurrence:
D(i, j) = |t_xi − t_yj| + min{D(i−1, j), D(i, j−1), D(i−1, j−1)}
where T_x and T_y represent the video-frame timestamp sequences of machine position x and machine position y, respectively, and t_xi and t_yj represent the timestamps of the i-th video frame of machine position x and the j-th video frame of machine position y, respectively.
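The dynamic time warping alignment of S503 can be sketched as follows (a textbook DTW over two timestamp sequences with absolute-difference cost; the function name and the choice to return only the cumulative deviation are illustrative assumptions):

```python
import math

# Hypothetical sketch of S503: classic dynamic time warping between the
# frame-timestamp sequences of two machine positions. D(i, j) accumulates
# the minimal total timestamp deviation, with cost |t_xi - t_yj|.

def dtw_align(tx, ty):
    n, m = len(tx), len(ty)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(tx[i - 1] - ty[j - 1])
            # Standard DTW step choices: insertion, deletion, match.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Two identical sequences yield zero cumulative deviation; a constant delay between machine positions appears as a proportional warping cost, which a full implementation would use (via the backtracked path) to shift one sequence onto the other.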
S105, analyzing the optimal machine position of each period of time in the range of the timestamp of the target audio event based on the audio characteristics of the audio data of each machine position of the target video and the video characteristics of the video data of each machine position.
It should be noted that the content and content quality photographed by different machine positions at different times differ, and the corresponding audio data also differ. Therefore, in order to obtain accurate, high-quality highlight content, when clipping the video data and audio data within the range of the timestamp of the target audio event, it is necessary to continuously switch to the content photographed by the optimal machine position. To that end, the optimal machine position for each period of time within the range of the timestamp of the target audio event must first be determined.
Because both the video data and the audio data can reflect the content and quality of the video, in the embodiment of the application the audio features of the audio data and the video features of the video data of each machine position are analyzed for each time period within the range of the timestamp of the target audio event; a score for each time period is then evaluated based on the audio features and video features, and the optimal machine position for each time period is selected according to the score.
Alternatively, in another embodiment of the present application, a specific implementation of step S105, as shown in fig. 6, includes the following steps:
S601, subtracting the first adjustment time from the time stamp of the target audio event and adding the second adjustment time to obtain the minimum value and the maximum value of the time range corresponding to the target audio event.
In order to give the clipped video clip better cut-in and cut-out points, that is, a video of a more complete scene rather than one that suddenly starts from the middle of some content, and in order to make the length of the clipped video clip match the user's preference, in the embodiment of the application the time range of the target audio event is expanded by the set first adjustment time and second adjustment time. That is, the first adjustment time Tp is subtracted from the timestamp of the target audio event and the second adjustment time Tq is added to it, yielding the minimum and maximum of the time range corresponding to the target audio event, so that the range of the timestamp Te of the target audio event is [Te−Tp, Te+Tq]. Specifically, the first adjustment time may be subtracted from the minimum value in the timestamp of the target audio event, and the second adjustment time may be added to the maximum value in the timestamp of the target audio event.
Therefore, optionally, with the user's authorization, the user's video watching behavior data can be collected, the user's preference analyzed from the behavior data, and real-time feedback received; the first adjustment time and the second adjustment time are then dynamically adjusted according to the user's preference and the real-time feedback, so that highlight clips meeting the user's personalized requirements are clipped.
S602, respectively weighting the volume intensity and the source direction of the audio data of each unit time of each machine position in the time range corresponding to the target audio event aiming at each machine position of the target video to obtain the corresponding audio feature score of each unit time of the machine position.
It should be noted that the volume intensity can reflect the distance between the machine position and the photographed scene, and the source direction can reflect the shooting angle of the machine position; both therefore reflect the importance of the machine position relative to the currently photographed scene, that is, whether it is the best machine position.
Wherein, the unit time is set to be a shorter time so as to analyze the change of the optimal machine position in time.
Specifically, the volume intensity and the source direction are weighted by their corresponding weights to obtain the audio feature score, which can be specifically expressed as:
S_audio = α·A + β·D
where A is the volume intensity, D is the source direction, and α and β are the weights corresponding to the volume intensity and the source direction, respectively. Optionally, the two weights can be continuously optimized through machine learning.
S603, weighting the person action amplitude and the emotion response information of the video data of the machine position in each unit time within the time range corresponding to the target audio event, to obtain the video feature score of the machine position corresponding to each unit time.
Similarly, the person action amplitude and the emotion response information are weighted by their corresponding weights to obtain the video feature score, which can be expressed as:
S_video = γ·M + δ·E
where M is the person action amplitude, E is the emotion response information, and γ and δ are the weights corresponding to the person action amplitude and the emotion response information, respectively.
S604, adding the audio feature score and the video feature score of each unit time corresponding to the machine position to obtain the fusion feature score of each unit time corresponding to the machine position.
S605, selecting the machine position with the highest fusion characteristic score corresponding to each unit time in the time range corresponding to the target audio event as the optimal machine position of each unit time.
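The S601–S605 scoring and selection can be sketched end to end as follows. The weight values, the feature tuple layout, and the function name are illustrative assumptions; only the structure (weighted audio score plus weighted video score, highest fused score wins per unit time) follows the steps above:

```python
# Hypothetical sketch of S602-S605: per machine position and unit time,
# fuse the audio score (alpha*A + beta*D) and the video score
# (gamma*M + delta*E), then pick the highest-scoring machine position.

def best_positions(features, alpha=0.6, beta=0.4, gamma=0.5, delta=0.5):
    """features[pos] is a list, one entry per unit time, of tuples
    (A, D, M, E): volume intensity, source-direction score, person action
    amplitude, emotion response score. Returns the best machine position
    for every unit time."""
    n_units = len(next(iter(features.values())))
    best = []
    for t in range(n_units):
        def fused(pos):
            a, d, m, e = features[pos][t]
            return (alpha * a + beta * d) + (gamma * m + delta * e)
        best.append(max(features, key=fused))
    return best
```

With two machine positions whose scores dominate in alternating unit times, the function returns an alternating sequence of best machine positions, which the clipping step then consumes.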
S106, respectively clipping the audio data and the video data of the optimal machine position of each time period in the range of the timestamp of the target audio event to obtain the highlight video clip corresponding to the target audio event.
Optionally, after determining the best machine position of each time period, audio data and video data of the best machine position can be continuously switched to be clipped in the clipping process, so that a highlight video clip corresponding to a target audio event formed by the data of the best machine position of each time period can be obtained.
It should be noted that the target video may contain a plurality of target audio events. Optionally, if adjacent target audio events overlap, or if the time interval between them is less than a certain threshold, they may be combined into one target audio event and then processed. During clipping, a Bezier-curve smooth transition may be applied to the data in which no target audio event exists.
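The merging of overlapping or nearby target audio events mentioned above is a standard interval-merge; a minimal sketch (the gap threshold value and function name are assumptions):

```python
# Hypothetical sketch: combine target audio events that overlap or whose
# gap is below a threshold (in seconds) into a single event.

def merge_events(events, gap_threshold=2.0):
    """events: list of (start, end) timestamps; returns merged intervals."""
    merged = []
    for start, end in sorted(events):
        if merged and start - merged[-1][1] <= gap_threshold:
            # Extend the previous event to cover this one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```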
Optionally, in order to avoid too frequent switching of the machine to affect the viewing experience of the user, in another embodiment of the present application, a specific implementation of step S106, as shown in fig. 7, includes the following steps:
S701, sequentially polling each unit time in a time range corresponding to the target audio event.
S702, judging whether the optimal machine position of the unit time currently polled is different from the current machine position of the clipping machine, and whether the time difference between the optimal machine position and the time point of switching to the current machine position of the clipping machine is larger than a preset time difference.
That is, after the optimal machine position of the currently polled unit time is determined, it is judged whether it is consistent with the clipping machine position currently in use, that is, whether the optimal machine position has changed. If they are consistent, the current clipping machine position is not changed and clipping continues with it, that is, step S704 is executed. If they are not consistent, it is judged whether the time difference from the time point of switching to the current clipping machine position is greater than the preset time difference, that is, whether sufficient time has passed since the last machine position switch. If the time difference is not greater than the preset time difference, the machine position is not switched and clipping continues with the current clipping machine position, that is, step S704 is executed. If the time difference is greater than the preset time difference, step S703 is executed, and the currently polled optimal machine position of the unit time becomes the new current clipping machine position.
S703, switching the current clipping machine position to the current polled optimal machine position in unit time.
S704, clipping the audio data and the video data of the current clipping machine position until all unit time in the time range corresponding to the target audio event is polled.
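The S701–S704 polling logic, which suppresses overly frequent machine position switching via a minimum dwell time, can be sketched as follows (the function name and the dwell value are illustrative assumptions):

```python
# Hypothetical sketch of S701-S704: walk each unit time in order; switch
# to that unit's best machine position only if it differs from the current
# clipping position AND more than `min_dwell` units have elapsed since the
# last switch. Returns the machine position actually used per unit time.

def plan_switches(best_per_unit, min_dwell=3):
    current, last_switch, plan = None, 0, []
    for t, best in enumerate(best_per_unit):
        if current is None:
            current, last_switch = best, t          # first unit: adopt best
        elif best != current and t - last_switch > min_dwell:
            current, last_switch = best, t          # dwell satisfied: switch
        plan.append(current)
    return plan
```

A change in the best machine position that occurs one unit after the last switch is thus held back until the dwell requirement is met, preventing rapid back-and-forth cuts that would hurt the viewing experience.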
Optionally, in order to facilitate subsequent searching, management, etc., in another embodiment of the present application, after performing step S106, the method may further include:
And marking labels of the highlight clips of the target videos based on event categories of the target audio events and the highlight video clips corresponding to the target audio events respectively.
After marking the tag, the user can then retrieve according to the tag, and can perform personalized pushing and the like for the user according to the tag.
Optionally, in another embodiment of the present application, after the highlight video clips corresponding to the target audio events are obtained, they may further be spliced together to obtain a combined highlight video of the target video.
The embodiment of the application provides a video highlight clip clipping method. First, the audio data of a target video is acquired. The event category to which each frame of audio frame data of the audio data belongs is detected through a multi-scale convolutional neural network and an adaptive attention mechanism, so that it can be determined whether an audio event reflecting the video highlight content exists in the audio frame data. The target audio event of the target video is then determined based on the event category to which each frame of audio frame data belongs, and the timestamp of the target audio event is marked. The target audio event is then aligned to the video data of each machine position of the target video based on its timestamp, so as to correct the deviation between the audio data and the video data as well as the deviations among the video data of the machine positions, making it convenient to accurately clip the corresponding video content from each machine position. Then, based on the audio features of the audio data and the video features of the video data of each machine position, the optimal machine position of each period of time within the range of the timestamp of the target audio event is analyzed; through integrated analysis of the video data and audio data of each machine position, the machine position that most accurately shoots the highest-quality highlight video content in each period is determined, facilitating subsequent accurate clipping of high-quality highlight video clips.
And respectively clipping the audio data and the video data of the optimal machine position of each time period in the range of the timestamp of the target audio event to obtain a highlight video fragment corresponding to the target audio event, thereby realizing a method for accurately clipping the high-quality highlight video fragment in the video.
Another embodiment of the present application provides a video highlight clip clipping apparatus, as shown in fig. 8, including:
a data acquisition unit 801 for acquiring audio data of a target video.
The category detection unit 802 is configured to detect, through the multi-scale convolutional neural network and the adaptive attention mechanism, an event category to which each frame of audio frame data of the audio data of the target video belongs.
The time marking unit 803 is configured to determine a target audio event of the target video based on the event category to which each frame of audio frame data belongs, and mark a time stamp of the target audio event. Wherein the target audio event refers to an event of audio reflecting video highlight content.
An alignment unit 804, configured to align the target audio event to the video data of each machine position of the target video based on the time stamp of the target audio event.
The machine position analysis unit 805 is configured to analyze an optimal machine position of each period of time within a range of a time stamp of the target audio event based on an audio feature of audio data of each machine position of the target video and a video feature of video data of each machine position.
And a clipping unit 806, configured to clip the audio data and the video data of the best machine position of each period of time within the range of the timestamp of the target audio event, so as to obtain a highlight video clip corresponding to the target audio event.
Optionally, in the video highlight clip apparatus provided in another embodiment of the present application, the video highlight clip apparatus further includes:
and the preprocessing unit is used for preprocessing the audio data of the target video.
And the sliding unit is used for sliding a time window on the preprocessed audio data of the target video through a preset sliding step length to obtain audio frame data of each frame of the audio data of the target video. The audio data contained in the time window sliding once is one frame of audio frame data.
The feature extraction unit is used for extracting frequency domain features of each frame of audio frame data by utilizing short-time Fourier transform respectively, extracting at least a Mel frequency cepstrum coefficient and a time-frequency diagram of the audio frame data, and constructing a multi-dimensional feature matrix of each frame of audio frame data.
Optionally, in the video highlight clip apparatus provided in another embodiment of the present application, the category detection unit includes:
And the input unit is used for inputting the multidimensional feature matrix of the audio frame data extracted from the audio frame data into the multiscale convolutional neural network for each frame of the audio frame data respectively.
The convolution unit is used for carrying out multi-scale convolution on the multi-dimensional feature matrix of the audio frame data through the multi-scale convolution neural network to obtain an advanced feature matrix of the audio frame data.
And the optimizing unit is used for optimizing the advanced feature matrix of the audio frame data by utilizing the self-adaptive attention mechanism.
The class determining unit is used for calculating the probability that the audio frame data belongs to each event class based on the high-level feature matrix of the optimized audio frame data through the classifier, and determining the event class with the highest probability as the event class to which the audio frame data belongs.
Optionally, in the video highlight clip apparatus provided in another embodiment of the present application, the time stamp unit includes:
And the event judging unit is used for analyzing whether the target event category exists or not based on the probability that the continuous N frames of audio frame data belong to the event category to which the continuous N frames of audio frame data belong. The target event category is a category of a target audio event, and a probability sum of the target event categories to which each frame of audio frame data belonging to the target event category belongs in the continuous N frames of audio frame data is larger than a probability threshold.
The event determining unit is used for determining that a target audio event of a target video belonging to a target event category exists in the continuous N frames of audio frame data when the existence of the target event category is analyzed based on the probability that the continuous N frames of audio frame data belong to the event category to which the continuous N frames of audio frame data belong.
And the time determining unit is used for determining the time stamp of the audio frame data of the first frame belonging to the target event category in the continuous N frames of audio frame data as the starting time stamp of the target audio event and determining the time stamp of the audio frame data of the last frame belonging to the target event category as the ending time stamp of the target audio event.
Optionally, in the video highlight clip apparatus provided in another embodiment of the present application, the alignment unit includes:
And the key frame searching unit is used for searching the key frame with the smallest difference value between the time and the time stamp of the target audio event from the video data of any machine position of the target video.
And the video data alignment unit is used for aligning the target audio event with the video data of the machine position according to the difference value between the time stamp of the target audio event and the time stamp of the searched key frame.
And the machine bit data alignment unit is used for aligning the video data of each machine bit of the target video by utilizing a dynamic time warping algorithm.
Optionally, in the video highlight clip device provided in another embodiment of the present application, the machine analysis unit includes:
And the time adjustment unit is used for subtracting the first adjustment time from the time stamp of the target audio event and adding the second adjustment time to obtain the minimum value and the maximum value of the time range corresponding to the target audio event.
The audio scoring unit is used for respectively weighting the volume intensity and the source direction of the audio data of each unit time of the machine position in the time range corresponding to the target audio event aiming at each machine position of the target video to obtain the corresponding audio feature score of each unit time of the machine position.
And the video scoring unit is used for weighting the person action amplitude and the emotion response information of the video data of the machine position in each unit time within the time range corresponding to the target audio event, to obtain the video feature score of the machine position corresponding to each unit time.
And the grading fusion unit is used for adding the corresponding audio feature grading and the video feature grading of each unit time of the machine position to obtain the corresponding fusion feature grading of each unit time of the machine position.
The machine position selecting unit is used for selecting the machine position with the highest fusion characteristic score corresponding to each unit time in the time range corresponding to the target audio event as the optimal machine position of each unit time.
Optionally, in the video highlight clip clipping apparatus provided in another embodiment of the present application, a clipping unit includes:
And the polling unit is used for sequentially polling each unit time in the time range corresponding to the target audio event.
And the machine position switching unit is used for switching the current machine position to the current polled optimal machine position in unit time when the current polled optimal machine position in unit time is different from the current machine position and the time difference between the current machine position and the time point of switching to the current machine position is larger than the preset time difference.
And the data clipping unit is used for clipping the audio data and the video data of the current clipping machine position until all unit time in the time range corresponding to the target audio event is polled.
The specific working process of each unit provided in the foregoing embodiment of the present application may refer to the specific implementation manner provided in the foregoing method embodiment, and will not be described herein.
Another embodiment of the present application provides an electronic device, as shown in fig. 9, including:
A memory 901 and a processor 902.
The memory 901 is used for storing a program.
The processor 902 is configured to execute a program stored in the memory 901, where the program is executed, and specifically configured to implement a video highlight clip method provided in any one of the embodiments described above.
Another embodiment of the present application provides a computer storage medium storing a computer program for implementing the video highlight clip clipping method as provided in any one of the above embodiments when the computer program is executed by a processor.
Computer storage media, including both volatile and non-volatile, removable and non-removable media, may be implemented in any method or technology for storage of information. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by the computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.