CN114266778B

CN114266778B - Video processing method, device, equipment and storage medium

Info

Publication number: CN114266778B
Application number: CN202111580218.5A
Authority: CN
Inventors: 胡可飞; 邓帆; 朱思
Original assignee: Beijing Eswin Computing Technology Co Ltd
Current assignee: Beijing Eswin Computing Technology Co Ltd
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2025-06-20
Anticipated expiration: 2041-12-22
Also published as: CN114266778A

Abstract

The embodiment of the present application discloses a video processing method, device, equipment and storage medium. The method includes: performing image segmentation processing on the initial video frame sequence to determine the image segmentation result; based on the image segmentation result, determining the target sub-video frame sequence in the initial video frame sequence, the target sub-video frame sequence is composed of continuous video frames including the target object, and the number of frames of the target sub-video frame sequence is less than the number of frames of the initial video frame sequence; based on the target sub-video frame sequence, generating a target video frame sequence, each video frame in the target video frame sequence includes the target object, and the number of frames of the target video frame sequence is greater than the number of frames of the target sub-video frame sequence. By adopting the embodiment of the present application, a target video frame sequence in which each video frame includes the target object can be generated based on a target sub-video frame sequence composed of continuous video frames including the target object in the initial video frame sequence, which has high applicability.

Description

Video processing method, device, equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence, and in particular, to a video processing method, apparatus, device, and storage medium.

Background

The unsupervised video object segmentation (Video Object Segmentation, VOS) problem requires that major objects, such as animals, persons, etc., in each video frame of a sequence of video frames be identified and located without providing any additional input, and that the same object needs to be tracked between different video frames. The video frame sequences are mostly from video clips such as movie drama, sports, dance, street beats and the like, and the various scenes can cause problems such as shot switching, target shielding, rapid movement, target midway appearance or disappearance and the like.

The existing target segmentation method mainly comprises the steps of firstly determining a target in a first video frame in a video frame sequence, and further determining a target in a subsequent video frame based on the target in the first video frame. Based on the method, an error accumulation phenomenon occurs along with the increase of the frame number, so that the recognition and tracking effects on the target object in the subsequent video frame are poor.

Disclosure of Invention

The embodiment of the application provides a video processing method, which can generate a target video frame sequence of each video frame comprising a target object based on a target sub-video frame sequence formed by a section of continuous video frames comprising the target object in an initial video frame sequence, and has high applicability.

In a first aspect, an embodiment of the present application provides a video processing method, including:

performing image segmentation processing on the initial video frame sequence, and determining an image segmentation result;

Determining a target sub-video frame sequence in the initial video frame sequence based on the image segmentation result, wherein the target sub-video frame sequence consists of continuous video frames comprising a target object, and the frame number of the target sub-video frame sequence is smaller than that of the initial video frame sequence;

generating a target video frame sequence based on the target sub-video frame sequence, wherein each video frame in the target video frame sequence comprises the target object, and the frame number of the target video frame sequence is larger than the frame number of the target sub-video frame sequence.

In a second aspect, an embodiment of the present application provides a video processing apparatus, including:

The image processing module is used for carrying out image segmentation processing on the initial video frame sequence and determining an image segmentation result;

a sequence determining module, configured to determine a target sub-video frame sequence in the initial video frame sequence based on the image segmentation result, where the target sub-video frame sequence is composed of continuous video frames including a target object, and a frame number of the target sub-video frame sequence is smaller than a frame number of the initial video frame sequence;

And the sequence generating module is used for generating a target video frame sequence based on the target sub-video frame sequence, wherein each video frame in the target video frame sequence comprises the target object, and the frame number of the target video frame sequence is larger than that of the target sub-video frame sequence.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the processor and the memory are connected to each other;

The memory is used for storing a computer program;

the processor is configured to execute the video processing method provided by the embodiment of the application when the computer program is called.

In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program that is executed by a processor to implement the video processing method provided by the embodiments of the present application.

In the embodiment of the application, a section of target sub-video frame sequence in the initial video frame sequence can be determined based on the image segmentation result corresponding to the initial video frame sequence, and each video frame in the target sub-video frame sequence comprises a target object. The target video frame sequence which also comprises the target object can be determined based on the target sub-video frame sequence, and the target video frame sequence is also composed of continuous video frames comprising the target object and has more video frames than the target sub-video frame sequence, so that the target object in most video frames in the initial video frame sequence is tracked, and the applicability is high.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of a video processing method according to an embodiment of the present application;

FIG. 2 is a schematic view of a scenario in which mask features are optimized according to an embodiment of the present application;

FIG. 3 is a schematic view of a scenario for determining a predicted target mask feature provided by an embodiment of the present application;

FIG. 4 is a schematic view of a scenario featuring a determination of attention as provided by an embodiment of the present application;

fig. 5 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Referring to fig. 1, fig. 1 is a flowchart of a video processing method according to an embodiment of the present application. As shown in fig. 1, the video processing method provided by the embodiment of the present application may include the following steps:

and S11, performing image segmentation processing on the initial video frame sequence, and determining an image segmentation result.

In some possible embodiments, the initial video frame sequence may be a video frame sequence corresponding to any video segment such as a movie, a sports video, etc., and may be specifically determined based on actual application scene requirements, which is not limited herein.

Specifically, when the image segmentation processing is performed on the initial video frame sequence, the image segmentation processing may be performed on each video frame in the initial video frame sequence, so as to obtain an image segmentation result corresponding to each video frame in the initial video frame sequence.

When each video frame in the initial video frame sequence is subjected to image segmentation processing, initial image characteristics corresponding to the video frame can be determined, and then an image segmentation result corresponding to the video frame is obtained based on the initial image characteristics corresponding to the video frame.

When the image segmentation processing is performed on the initial video frame sequence, the image segmentation processing may be directly performed on each video frame in the initial video frame sequence based on an image segmentation algorithm, for example, SOLOv algorithm, specifically, may be determined based on the actual application scene requirement, which is not limited herein.

In some possible embodiments, the image segmentation result corresponding to the initial video frame sequence may include mask features for a plurality of objects included in each video frame in the initial video frame sequence. For each video frame, the image segmentation result corresponding to the video frame may further include an object included in the video frame determined based on each mask feature corresponding to the video frame.

For example, the image segmentation result corresponding to the initial video frame sequence includes a mask feature corresponding to each video frame, and an object, such as a person, an animal, or the like, included in each video frame determined based on the mask feature corresponding to each video frame.

Step S12, determining a target sub-video frame sequence in the initial video frame sequence based on the image segmentation result.

In some possible implementations, the target sub-video frame sequence is composed of consecutive video frames including the target object, and the number of frames of the target sub-video frame sequence is less than the number of frames of the initial video frame sequence. That is, the target sub-video frame sequence is a video frame sequence segment in which each video frame in the initial video frame sequence includes a target object, and the target object is any one of the objects corresponding to each video frame in the initial video frame sequence.

Specifically, all objects included in each video frame in the initial video frame sequence may be determined based on the image segmentation result corresponding to the initial video frame sequence, and the first object and the second object may be further determined therefrom.

The first object is an object included in each video frame in the initial video frame sequence, and the second object is an object in the initial video frame sequence, where at least one video frame does not include the object.

Further, for each second object, since at least one video frame in the initial video frame sequence does not include the second object, at least one sub video frame sequence corresponding to the second object may be determined from the initial video frame sequence, each sub video frame sequence being composed of consecutive video frames including the second object.

For example, the initial video frame sequence has a frame number of 100 frames, wherein the 33 th, 34 th and 55 th video frames do not include the second object, and three video frame sequences of 1 st to 32 th, 35 th to 54 th and 56 th to 100 th video frames in the initial video frame sequence may be determined as sub-video frame sequences corresponding to the second object.

Further, any second object may be determined as a target object, and any sub-video frame sequence in the at least one sub-video frame sequence corresponding to the second object may be determined as a target sub-video frame sequence. For example, a sub-video frame sequence having the largest number of frames (i.e., the longest sequence) among at least one sub-video frame sequence corresponding to the second object may be determined as the target sub-video frame sequence.

The specific determination mode of the target object and the specific mode of determining the target sub-video frame sequence from at least one sub-video frame sequence corresponding to the target object may be determined based on the actual application scene requirement, which is not limited herein.

Alternatively, in determining the target sub-video frame sequence in the initial video frame sequence, at least one sub-video frame sequence of the initial video frame sequence corresponding to each object may be determined by a tracking algorithm based on the image segmentation result of the initial video frame sequence. That is, for each object, at least one sub-video frame sequence consisting of consecutive video frames comprising the object may be determined from the initial video frame sequence based on a tracking algorithm.

The tracking algorithm may be a Simple Online and real-time tracking (Simple Online AND REALTIME TRACKING, SORT) algorithm or other algorithms, which are not limited herein.

Further, for each object, if at least one sub-video frame sequence including the object is determined from the initial video frame sequence and the number of frames of any one sub-video frame sequence is smaller than the number of frames of the initial video frame sequence, the object may be determined as a second object.

Optionally, the target object may be determined from objects included in each video frame in the initial video frame sequence, for example, any object is determined as the target object. Further, a mask feature of a target object is determined from mask features of a plurality of objects included in each video frame in the initial video frame sequence (for convenience of description, the mask feature of the target object is hereinafter referred to as a target mask feature), and a target sub-video frame sequence in the initial video frame sequence is determined based on the mask feature of the target object.

If the video frames corresponding to the target mask features are determined from the initial video frame sequence, at least one sub-video frame sequence is obtained based on the video frames corresponding to the target mask features, and any sub-video frame sequence is determined as the target sub-video frame sequence in the initial video frame sequence.

If the frame number of the target sub-video frame sequence determined from the initial video frame sequence is the same as that of the initial video frame, determining the target object from other objects again, and determining the target sub-video frame sequence consisting of continuous video frames comprising the new target object based on the mode.

In some possible embodiments, after determining the mask features of the objects included in each video frame in the initial video frame sequence, each mask feature may be further optimized to further improve the segmentation accuracy of the mask feature and optimize the edge details of the mask feature to improve the integrity and accuracy of the mask feature.

Specifically, for each mask feature, a video frame corresponding to the mask feature may be determined, and an object corresponding to the mask feature may be determined from the video frame. Further, the image characteristics of the object can be determined, and the image characteristics of the object and the mask characteristics are fused to obtain fusion target fusion characteristics, so that the optimized mask characteristics corresponding to the mask characteristics can be obtained based on the fusion target fusion characteristics. After determining the optimized mask features for each mask feature, a target sub-video frame sequence in the initial video frame sequence may be determined based on each optimized mask feature based on any of the above embodiments.

As an example, a target object is first determined from objects included in each video frame in the initial video frame sequence, and a mask feature (hereinafter referred to as a target mask feature for convenience of description) of the target object is determined from each mask feature.

For each target mask feature, an image feature of a target object (hereinafter referred to as a third image feature for convenience of description) included in a video frame corresponding to the target mask feature may be determined, and an optimized mask feature corresponding to the target mask feature may be determined based on the target mask feature and the third image feature.

After determining the optimized mask features corresponding to each target mask feature, a target sub-video frame sequence in the initial video frame sequence may be determined based on the optimized mask features corresponding to the target object.

The determining of the optimized mask feature corresponding to any mask feature (e.g., any target mask feature) may be performed based on a neural network model, for example, edge details of the mask feature may be optimized based on REFINENET network models, and the selection of a specific neural network model may be determined based on actual application scene requirements, which is not limited herein.

Referring to fig. 2, fig. 2 is a schematic view of a scenario in which mask features are optimized according to an embodiment of the present application. The video frame shown in fig. 2 is any video frame in the initial video frame sequence, and the building in the video frame is the target object. Based on the above, the image features of the target object and the target mask features of the target object may be fused, and the fused features may be input REFINENET into a network model, and finally the optimized mask features of the target mask features may be obtained based on the network model.

And step S13, generating a target video frame sequence based on the target sub-video frame sequence.

In some possible embodiments, based on the target sub-video frame sequence, a target video frame sequence having a greater number of frames than the target sub-video frame sequence may be generated, i.e., a target video frame sequence having a longer sequence length. Wherein each video frame in the target video frame sequence comprises a target object, i.e. the target video frame sequence consists of consecutive video frames comprising the target object.

In particular, the target mask features of the target object may be sequentially determined for other video frames in the initial video frame sequence than the target sub-video frame sequence based on the target mask features of the target object included in the plurality of video frames in the target sub-video frame sequence.

The target mask feature corresponding to each video frame after the target sub-video frame sequence is determined based on the target mask feature corresponding to the previous video frame of the video frame, and the target mask feature corresponding to the first video frame (hereinafter referred to as the first video frame for convenience of description) after the target sub-video frame sequence is determined based on the target mask feature corresponding to the last video frame in the target sub-video frame sequence.

The target mask feature corresponding to each video frame before the target sub-video frame sequence is determined based on the target mask feature corresponding to the next video frame of the video frame, and the target mask feature corresponding to the last video frame before the target sub-video frame sequence is determined based on the target mask feature corresponding to the first video frame in the target sub-video frame sequence.

For example, the initial video frame sequence includes 20 frames, and the target sub-video frame sequence is a video frame sequence corresponding to the 3 rd to 18 th video frames including the target object. Then the target mask feature for the target object may be determined based on the target mask feature for the last video frame of the sequence of target sub-video frames (i.e., the 18 th video frame of the sequence of initial video frames), the 19 th video frame of the sequence of initial video frames that follows the sequence of target sub-video frames corresponds to the target object, and the target mask feature for the target object may be determined based on the target mask feature for the 19 th video frame of the sequence of initial video frames that corresponds to the target object.

Similarly, the target mask feature of the target object may be determined based on the target mask feature of the target object included in the first video frame of the target sub-video frame sequence (i.e., the 3 rd video frame of the initial video frame sequence), the 2 nd video frame of the initial video frame sequence preceding the target sub-video frame sequence corresponding to the target object, and the target mask feature of the target object may be determined based on the target mask feature of the 2 nd video frame of the initial video frame sequence corresponding to the target object.

In this manner, the other video frames in the sequence of forward predicted initial video frames may correspond to the target mask features of the target object based on the target mask features of the target object included in the first video frame in the sequence of target sub-video frames, and the other video frames in the sequence of backward predicted initial video frames may correspond to the target mask features of the target object based on the target mask features of the target object included in the last video frame in the sequence of target sub-video frames.

Further, a mask feature sequence may be generated based on target mask features corresponding to each video frame in the target sub-video frame sequence and target mask features corresponding to other video frames in the initial video frame sequence. The mask feature sequence is obtained when each video frame in the mask feature sequence is subjected to image segmentation processing, and the mask features corresponding to other video frames in the mask feature sequence are obtained based on the prediction of the mask feature corresponding to the first or last video frame in the mask feature sequence.

For any video frame except for the target sub-video frame sequence in the initial video frame sequence, in the case that the image segmentation processing does not determine the target mask characteristics of the target object in the video frame or in the case that the target mask characteristics of the target object in the video frame are omitted when the target sub-video frame sequence is determined based on an algorithm such as SORT, the target mask characteristics of the video frame corresponding to the target object can be predicted based on the above mode.

Further, based on the determined mask feature sequence, a target video frame sequence may be generated in which each video frame includes a target object and the number of frames is longer than the number of frames of the target sub-video frame sequence. If the target sub-video frame sequence is a sequence of 20 consecutive frames comprising the target object, a sequence of target video frames comprising more than 20 consecutive frames of the target object may be generated based on the above-described mask feature sequence. If the mask feature sequence corresponding to the target object is determined based on the mask feature sequence, the target tracking of the target object in the initial video frame sequence can be realized based on the generated target video frame sequence.

The target object included in any video frame in the target video frame sequence may be determined based on a target mask feature corresponding to the video frame in the mask feature sequence.

In some possible embodiments, different second objects in the second objects included in each video frame in the initial video frame sequence may be determined as target objects, and a target video frame sequence corresponding to the different second objects may be obtained, so as to implement target tracking on each second object in the initial video frame sequence, and predict the second object in the video frame in the initial video frame sequence that does not include the second object.

The first object is an object included in each video frame in the initial video frame sequence, so that the first corresponding target tracking in the initial video frame sequence is realized based on the first object included in each video frame in the initial video frame sequence or a mask feature corresponding to the first object.

In some possible embodiments, when determining that any video frame in the initial video frame sequence other than the target sub-video frame sequence corresponds to the target mask feature of the target object, the target mask feature of the video frame corresponding to the target object may be determined based on the video frame and the target mask feature of the video frame corresponding to the target object, and at least one video frame (hereinafter referred to as a third video frame for convenience of description) in the target sub-video frame sequence and the target mask feature of the target object included in the video frame sequence.

When it is determined that any two other video frames except the target sub-video frame sequence in the initial video frame sequence correspond to the target mask feature of the target object, at least one third video frame selected from the target sub-video frame sequence corresponding to the two video frames may be completely identical or partially identical or completely different, and may be specifically determined based on the actual application scene requirement, which is not limited herein.

When determining that any one of the other video frames except the target sub-video frame sequence in the initial video frame sequence corresponds to the target mask feature of the target object, any one of the third video frames selected from the target sub-video frame sequence corresponding to the video frame is any one of the video frames in the target sub-video frame sequence, which can be determined specifically based on the actual application scene requirement, and is not limited herein.

When it is determined that the last video frame (the second video frame) of the initial video frame sequence before the target sub-video frame sequence or the first video frame (the first video frame) of the initial video frame sequence after the target sub-video frame sequence corresponds to the target mask feature of the target object, the last video frame in the target sub-video frame sequence may be included in at least one third video frame selected from the target sub-video frame sequence corresponding to the first video frame, and the first video frame in the target sub-video frame sequence may be included in at least one third video frame selected from the target sub-video frame sequence corresponding to the second video frame, which may be specifically determined based on the actual application scene requirement, without limitation.

Taking the first video frame (i.e., the first video frame) after the target sub-video frame sequence in the initial video frame sequence as an example, any one or more video frames in the target sub-video frame sequence may be determined to be third video frames, and the first video frame may be determined to correspond to the target mask feature of the target object based on the last video frame in the target sub-video frame sequence, the target mask feature of the target object included in the last video frame in the target sub-video frame sequence, each third video frame, and the target mask feature of the target object included in each third video frame.

Specifically, a predicted target mask feature of the first video frame corresponding to the target object may be determined based on the last video frame in the sequence of target sub-video frames, the target mask feature of the target object included in the last video frame in the sequence of target sub-video frames, the target mask feature of each third video frame, and the target object included in each third video frame, and the predicted target mask feature may be determined as the target mask feature of the first video frame corresponding to the target object.

Optionally, after determining that the first video frame corresponds to the predicted target mask feature of the target object, an intersection ratio of the target mask feature corresponding to the last video frame in the sequence of target sub-video frames and the predicted target mask feature may be determined. I.e. determining the intersection and union of the target mask feature corresponding to the last video frame in the sequence of target sub-video frames with the predicted target mask feature and determining the ratio of the intersection and union.

If the intersection ratio of the target mask feature corresponding to the last video frame in the target sub-video frame sequence and the predicted target mask feature is smaller than a preset threshold value, the difference between the target mask feature corresponding to the last video frame in the target sub-video frame sequence and the predicted target mask feature is larger, and the predicted target predicted mask feature can be determined to be the target mask feature of the target object included in the first video frame.

If the intersection ratio of the target mask feature corresponding to the last video frame in the target sub-video frame sequence and the predicted target mask feature is greater than or equal to a preset threshold, the difference between the target mask feature corresponding to the last video frame in the target sub-video frame sequence and the predicted target mask feature is smaller, and at this time, the target mask feature corresponding to the last video frame in the target sub-video frame sequence can be determined as the target mask feature corresponding to the first video frame.

Based on the above manner, the target mask characteristics corresponding to the finally determined first video frame can be made to be closer to the actual mask characteristics of the target object included in the first video frame. The specific value of the preset threshold may be determined based on the actual application scene requirement, for example, may be 0.9, and the like, which is not limited herein.

It should be specifically noted that, for any video frame in the initial video frame sequence other than the target sub-video frame sequence, after determining the predicted target mask feature corresponding to the video frame, the intersection ratio of the target predicted mask feature of the previous video frame or the next video frame to the predicted target mask feature of the video frame may be determined, so as to determine the target mask feature of the video frame corresponding to the target object based on the intersection ratio.

In some possible implementations, when determining that the first video frame corresponds to the predicted target mask feature of the target object based on the last video frame in the sequence of target sub-video frames, the target mask feature of the target object included in each third video frame, and the target mask feature of the target object included in each third video frame, the attention feature may be determined based on the target mask feature of the target object included in the last video frame in the sequence of target sub-video frames, the target mask feature of the target object included in each third video frame, and the target mask feature of the target object included in the last video frame in the sequence of target sub-video frames, and the predicted target mask feature of the first video frame corresponding to the target object is determined based on the attention feature and the target mask feature of the target object included in the last video frame in the sequence of target sub-video frames.

Specifically, for each third video frame, an image feature (hereinafter referred to as a first image feature for convenience of description) and a context feature (hereinafter referred to as a first context feature for convenience of description) corresponding to the third video frame may be determined based on the third video frame and its corresponding object mask feature.

For each third video frame, the target object included in the third video frame may be replaced with a corresponding target mask feature to obtain a new third video frame, and then feature processing is performed on the new third video frame to obtain a first image feature and a first context feature corresponding to the third video frame. Based on the first image feature and the first context feature corresponding to each third video frame can be obtained.

Further, the first image features corresponding to each third video frame are fused to obtain a fused image feature (hereinafter referred to as a first fused image feature for convenience of description). For example, the feature values of the first image features corresponding to the third video frames in each channel may be fused, or the feature values of the first image features in each channel may be averaged to obtain a first fused image feature, which is not limited herein.

Similarly, the first context features corresponding to the third video frames can be fused to obtain fused context features. For example, the first context features corresponding to each third video frame may be fused or subjected to mean processing, to obtain fused context features.

Further, the gaussian blur processing may be performed on the object mask corresponding to the last video frame in the object sub-video frame sequence to obtain a blur mask feature, and the feature processing may be performed on the last video frame in the object sub-video frame sequence to obtain an image feature (for convenience of description, hereinafter referred to as a second image feature) and a context feature (for convenience of description, hereinafter referred to as a second context feature) corresponding to the last video frame in the object sub-video frame sequence. And determining the attention feature based on the first fused image feature, the fused context feature, the blur mask feature, the second image feature, and the second context feature.

The first fused image feature, the fused context feature, the blur mask feature, the second image feature, and the second context feature may be input into an attention network, through which the attention feature is ultimately obtained.

After determining the attention feature, feature processing may be performed on the attention feature and the second image feature to obtain a prediction mask feature of the first video frame corresponding to the target object.

With reference to fig. 3, fig. 3 is a schematic view of a scenario for determining a predicted target mask feature according to an embodiment of the present application. Fig. 3 is used to determine that a first video frame (first video frame) in the initial video frame sequence that follows the target sub-video frame sequence corresponds to a predicted target mask feature for a target object, and wherein the target object is a person in the video frame.

Fig. 3 determines two third video frames from the target sub-video frame sequence, after replacing the target object (character) included in each third video frame with a corresponding target mask feature, encodes each third video frame to obtain first image features k1 and k2 corresponding to each third video frame and first context features v1 and v2 corresponding to each third video frame, further fuses k1 and k2 to obtain a first fused image feature km, and fuses v1 and v2 to obtain a fused context feature vm.

And simultaneously, for the last video frame in the target sub-video frame sequence, encoding the video frame to obtain a second image characteristic kq and a second context characteristic vq corresponding to the last video frame in the target sub-video frame sequence. And carrying out Gaussian blur processing on the target mask feature corresponding to the last video frame in the target sub-video frame sequence to obtain a blur mask feature p.

Further, the first fused image feature km, the fused context feature vm, the fuzzy Gaussian feature p, the second image feature kq and the second context feature vq are input into an attention network to obtain an attention feature y.

The second image feature kq and the attention feature y are input to a decoder, resulting in a predicted target mask feature of the first video frame corresponding to the target object.

After determining the attention features, the attention features can be sampled in parallel by hole convolution with different sampling rates based on a hole space convolution pooling pyramid (AtrousSpatial Pyramid Pooling, ASPP), and further fusion processing is carried out on the attention features according to a plurality of proportions, so that the processed attention features are obtained. And further, the processed attention feature and the second image feature can be subjected to feature processing to obtain a prediction mask feature of the first video frame corresponding to the target object.

The attention network may be a Motion-Guided space-time Memory network (SPACE TIME Memory STM) network, or may be another neural network, which is not limited herein.

In some possible embodiments, when the first fused image feature, the fused context feature, the blur mask feature, the second image feature, and the second context feature are input into the attention network, and the attention feature is finally obtained through the attention network, the blur mask may be further processed based on the second image feature, so that the blur mask further covers related information of a last video frame in the target sub-video frame sequence, and the processed blur mask feature is obtained.

Specifically, the second image feature and the fuzzy mask feature can be fused to obtain a fusion target fusion feature, the bias parameter and the weight parameter corresponding to the fuzzy mask feature are obtained based on the fusion target fusion feature, and then the fuzzy mask feature is processed based on the bias parameter and the weight parameter, so that the processed fuzzy mask feature is obtained. For example, the blur mask feature may be processed by different convolution layers and activation functions, respectively, to obtain corresponding weight parameters and bias parameters.

Further, the first image feature and the second image feature may be further fused to obtain a corresponding fused image feature (hereinafter referred to as a second fused image feature for convenience of description), so as to determine the attention feature based on the second image fusion feature, the second context feature, the processed blur mask feature, and the context fusion feature.

With reference to fig. 4, fig. 4 is a schematic view of a scenario for determining attention features provided by an embodiment of the present application. Wherein fig. 4 is a network structure diagram of the attention network shown in fig. 3. After the first fused image feature km, the fused context feature vm, the blur mask feature p, the second image feature kq, and the second context feature vq are obtained based on fig. 3, the second image feature kq and the blur mask feature p may be fused to obtain the target fused feature.

The method comprises the steps of processing target fusion features through a convolution layer and an activation function (Sigmoid) to obtain bias parameters w corresponding to fuzzy mask features, processing target fusion features through another convolution layer and an activation parameter Sigmoid) to obtain weight parameters b corresponding to fuzzy mask features p, and determining the bias parameters w and the weight parameters b to be different convolution layers. After obtaining the bias parameter w and the weight parameter b, the fuzzy mask feature p and the weight parameter w may be subjected to a dot product operation, and added to the bias parameter over the operation result to obtain a processed mask feature p'.

Alternatively, a vector product of the second image feature kq and the first fused image feature km may be determined and processed by an activation function (softmax) to obtain the second fused image feature. And performing dot multiplication operation on the second fused image feature and the processed mask feature p', determining a vector product of an operation result and the fused context feature vm, and further fusing the vector product, the second context feature and the second fused image feature to obtain the attention feature.

It should be specifically noted that, the implementation manner of determining the predicted target mask feature of the first video frame and determining the target mask feature of the first video frame may be applicable to other video frames in the initial video frame sequence except for the target sub-video frame sequence, which is not described herein.

In some possible embodiments, in the process of determining the target mask features corresponding to other video frames in the initial video frame sequence except for the target sub-video frame sequence, after determining the target mask feature corresponding to one video frame, a mask feature sequence may be generated based on the target mask feature corresponding to the video frame in the target sub-video frame sequence and the newly determined target mask feature, so as to generate a new video frame sequence based on the mask feature sequence.

That is, in the process of forward predicting the target mask feature of the target object included in the first video frame in the initial video frame sequence based on the target mask feature of the target object included in the first video frame in the initial video frame sequence, backward predicting the target mask feature of the target object included in the last video frame in the initial video frame sequence based on the target mask feature of the target object included in the last video frame in the initial video frame sequence, a frame number corresponding to the newly determined target mask feature may be determined to be greater than the target video frame sequence of the target sub video frame sequence once a target mask feature is predicted.

In this case, since the target object may correspond to a plurality of sub-video frame sequences, and the target sub-video frame sequence is one of the plurality of sub-video frame sequences corresponding to the target object, in the above-described process, there may be a plurality of sub-video frame sequences composed of consecutive video frames including the target object. At this time, if there are a preset number of overlapping video frames in the two sub-video frame sequences, a new video frame sequence may be regenerated based on the two sub-video frame sequences, so as to avoid repeatedly predicting the target mask features of the target objects included in the other sub-video frame sequences.

That is, if a first video frame sequence and a second video frame sequence composed of consecutive video frames including a target object are obtained, for example, the target sub-video frame sequence or a new video frame sequence generated based on the target sub-video frame sequence is regarded as the first video frame sequence, and any other sub-video frame sequence corresponding to the target object is regarded as the second video frame sequence. If there are a preset number of video frames overlapping with each other in the first video frame sequence and the second video frame sequence, it may be determined that the first video frame sequence and the second video frame sequence include the same target object, and there are video frames having the same partial frame numbers, and at this time, a third video frame sequence may be generated based on the first video frame sequence and the second video frame sequence.

The overlapping preset number of video frames may be regarded as the same frame number and include the same target object in the initial video frame sequence, and the preset number may be determined based on the actual application scene requirement, which is not limited herein.

Further, if the number of frames of the third video frame sequence is less than the number of frames in the initial video frame sequence, indicating that the target mask features corresponding to the other video frames in the initial video frame sequence except the third video frame sequence are not yet determined, based on the target mask features of the target object included in the first video frame in the third video frame sequence, the other video frames in the initial video frame sequence except the third video frame sequence are forward predicted to correspond to the target mask features of the target object, and based on the target mask features of the target object included in the last video frame in the third video frame sequence, the other video frames in the initial video frame sequence except the third video frame sequence are backward predicted to correspond to the target mask features of the target object. And the method is circulated until two video frame sequences of the preset number of the video frames which are overlapped do not exist, the target mask characteristics corresponding to the rest individual video frames can be continuously determined based on the first video frame and/or the last video frame in the finally obtained video frame sequences, the final video frame sequence is obtained based on the final mask characteristic sequence, and the final video frame sequence is determined to be the target object tracking video frame sequence corresponding to the initial video frame sequence.

For example, the initial video frame sequence includes 20 frames, and the sub-video frame sequence corresponding to the target object is a continuous video frame sequence including the target object composed of 3 rd to 8 th video frames, and a continuous video frame sequence including the target object composed of 10 th to 19 th video frames.

Assuming that the previous sub-video frame sequence is determined as the target sub-video frame sequence, the target mask features corresponding to the 1 st frame video frame and the 2 nd frame video frame can be obtained by starting continuous prediction based on the mask features corresponding to the 3 rd frame video frame in the target sub-video frame sequence, and the target mask features corresponding to the 9 th frame and a plurality of subsequent video frames can be obtained by starting continuous prediction based on the mask features corresponding to the 8 th frame in the target sub-video frame sequence. And along with the continuous progress of the predicted target mask feature, a target mask feature sequence can be obtained in real time and a target video frame sequence can be generated in real time.

When there is a preset number of overlapping video frames in the sub-video frame sequence (assumed to be the first video frame sequence) corresponding to the 10 th to 19 th video frames in the initial video frame sequence, if the last video frame in the first video frame sequence is the 12 th video frame in the initial video frame sequence, the 10 th to 12 th video frames in the first video frame sequence corresponding to the initial video frame sequence and the 10 th to 12 th video frames in the second video frame sequence corresponding to the initial video frame sequence each include a target object, a third video frame sequence can be generated based on the first video frame sequence and the second video frame sequence, and the third video frame sequence is composed of the 1 st to 19 th video frames corresponding to the initial video frame sequence, and each video frame includes a target object.

Further, based on the target mask feature of the last video frame in the third video frame sequence, the target mask feature corresponding to the 20 th video frame can be determined, and further, a final target video frame sequence is obtained based on the latest obtained mask feature sequence, and the final video frame sequence is determined to be the target object tracking video frame sequence corresponding to the initial video frame sequence, so that target tracking of the target object in the initial video frame sequence is achieved.

Based on the implementation manner, the object tracking video frame sequence corresponding to each object in the initial video frame sequence can be obtained, so that the target tracking of each object in the initial video frame sequence can be realized. Meanwhile, each video frame corresponding to the same video frame (hereinafter referred to as a target video frame for convenience of description) in the initial video frame sequence in the object tracking video frame sequence corresponding to each object can be fused to obtain a fused video frame corresponding to the target video frame.

The fusion video frame comprises all objects in the initial video frame sequence, so that the content of each video frame in the initial video frame sequence can be modified and the definition of a target object in each video frame can be improved. And arranging the fused video frames according to the frame numbers of the corresponding target video frames to obtain a fused video frame sequence, namely an optimized video frame sequence corresponding to the initial video frame sequence, so that each object in the initial video frame sequence can be subjected to target tracking based on the optimized video frame sequence.

In the embodiment of the application, by determining the mask characteristics of other video frames except the target sub-video frame sequence in the initial video frame sequence corresponding to the target object, the target video frame sequences which all comprise the target object can be determined based on the mask characteristic sequences under the condition that the target mask characteristics of the target object in the video frame are not determined in the image segmentation process or the target object in partial video frames is omitted when the target object is tracked based on the target tracking algorithm, so that the target tracking of the target object in the initial video frame sequence is realized, the accuracy and the continuity of the target tracking are improved, and the applicability is high.

Referring to fig. 5, fig. 5 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application. The video processing device provided by the embodiment of the application comprises:

an image processing module 51, configured to perform image segmentation processing on the initial video frame sequence, and determine an image segmentation result;

A sequence determining module 52, configured to determine a target sub-video frame sequence in the initial video frame sequence based on the image segmentation result, where the target sub-video frame sequence is composed of continuous video frames including a target object, and a frame number of the target sub-video frame sequence is smaller than a frame number of the initial video frame sequence;

The sequence generating module 53 is configured to generate a target video frame sequence based on the target sub-video frame sequence, where each video frame in the target video frame sequence includes the target object, and a frame number of the target video frame sequence is greater than a frame number of the target sub-video frame sequence.

In some possible embodiments, the image segmentation result includes mask features of a plurality of objects included in each video frame in the initial video frame sequence;

the sequence determining module 52 is configured to:

Determining a target mask feature of a target object from mask features of a plurality of objects included in each video frame in the initial video frame sequence;

and determining a target sub-video frame sequence in the initial video frame sequence based on the target mask characteristics of the target object.

In some possible embodiments, the sequence generating module 53 is configured to:

Sequentially determining that other video frames in the initial video frame sequence except the target sub-video frame sequence correspond to the target mask features of the target object based on the target mask features of the target object included in the plurality of video frames in the target sub-video frame sequence;

Wherein the target mask feature corresponding to each video frame following the target sub-video frame sequence is determined based on the target mask feature corresponding to the previous video frame of the video frame, and the target mask feature corresponding to the first video frame is determined based on the target mask feature corresponding to the last video frame in the target sub-video frame sequence, and the first video frame is the first video frame following the target sub-video frame sequence;

The target mask feature corresponding to each video frame before the target sub-video frame sequence is determined based on the target mask feature corresponding to the next video frame of the video frame, and the target mask feature corresponding to the second video frame is determined based on the target mask feature corresponding to the first video frame in the target sub-video frame sequence, and the second video frame is the last video frame before the target sub-video frame sequence;

Generating a mask feature sequence based on the target mask features corresponding to each video frame in the target sub-video frame sequence and the target mask features corresponding to other video frames in the initial video frame sequence except the target sub-video frame sequence;

And generating a target video frame sequence based on each video frame corresponding to the mask feature sequence.

determining at least one third video frame in the target sub-video frame sequence, wherein each third video frame is any video frame in the target sub-video frame sequence;

determining a predicted target mask feature corresponding to the first video frame based on a last video frame in the sequence of target sub-video frames and a target mask feature corresponding to the last video frame and each third video frame and a target mask feature corresponding to the third video frame;

And determining the target mask characteristic corresponding to the first video frame based on the target mask characteristic corresponding to the last video frame in the target sub-video frame sequence and the predicted target mask characteristic.

for each third video frame, determining a first image feature and a first context feature corresponding to the third video frame based on the third video frame and the corresponding target mask feature;

fusing the first image features to obtain first fused image features, and fusing the first context features to obtain fused context features;

Carrying out Gaussian blur processing on the target mask feature corresponding to the last video frame in the target sub-video frame sequence to obtain a blur mask feature;

determining a second image feature and a second context feature corresponding to the last video frame in the target sub-video frame sequence;

determining an attention feature based on the first fused image feature, the fused context feature, the blur mask feature, the second context feature, and the second image feature;

And determining a prediction target mask characteristic corresponding to the first video frame based on the attention characteristic and the second image characteristic.

Processing the fuzzy mask features based on the second image features to obtain processed fuzzy mask features;

Determining a second fused image feature based on the second image feature and the first fused image feature;

and determining an attention feature based on the second fused image feature, the second context feature, the processed blur mask feature, and the fused context feature.

fusing the second image features and the fuzzy mask features to obtain fusion target fusion features;

Obtaining bias parameters and weight parameters corresponding to the fuzzy mask features based on the fusion target fusion features;

And obtaining the processed fuzzy mask characteristic based on the bias parameter, the weight parameter and the fuzzy mask characteristic.

determining the cross ratio of the target mask feature corresponding to the last video frame in the target sub-video frame sequence and the predicted target mask feature;

if the intersection ratio is smaller than a preset threshold value, determining the predicted target mask characteristic as a target mask characteristic corresponding to the first video frame;

and if the merging ratio is greater than or equal to the preset threshold, determining the target mask characteristic corresponding to the last video frame in the target sub-video frame sequence as the target mask characteristic corresponding to the first video frame.

In some possible embodiments, the sequence determining module 52 is configured to:

For each target mask feature, determining a third image feature of a target object included in the video frame corresponding to the target mask feature, and determining an optimized mask feature corresponding to the target mask feature based on the third image feature and the target mask feature;

A target sub-video frame sequence in the initial video frame sequence is determined based on each of the optimization mask features.

In some possible embodiments, the sequence determining module 52 is further configured to:

If a first video frame sequence and a second video frame sequence which are formed by continuous video frames comprising the target object are obtained, and a preset number of overlapped video frames exist in the first video frame sequence and the second target sub-video frame, a third target sub-video frame sequence is generated based on the first video frame sequence and the second video frame sequence, and the third target sub-video frame sequence is formed by continuous video frames comprising the target object.

In a specific implementation, the video processing apparatus may execute, through each functional module built in the video processing apparatus, an implementation manner provided by each step in fig. 1, and specifically, the implementation manner provided by each step may be referred to, which is not described herein again.

Referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device 600 in this embodiment may include a processor 601, a network interface 604, and a memory 605, and the electronic device 600 may further include a user interface 603, and at least one communication bus 602. Wherein the communication bus 602 is used to enable connected communications between these components. The user interface 603 may include a Display screen (Display), a Keyboard (Keyboard), and the optional user interface 603 may further include a standard wired interface, a wireless interface, among others. The network interface 604 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 604 may be a high-speed RAM memory or a non-volatile memory (NVM), such as at least one disk memory. The memory 605 may also optionally be at least one storage device located remotely from the processor 601. As shown in fig. 6, an operating system, a network communication module, a user interface module, and a device control application may be included in the memory 605, which is one type of computer-readable storage medium.

In the electronic device 600 shown in fig. 6, the network interface 604 may provide network communication functions, while the user interface 603 is primarily an interface for providing input to a user, while the processor 601 may be used to invoke the device control application stored in the memory 605 to implement:

the processor 601 is configured to:

In some possible embodiments, the processor 601 is configured to:

In some possible embodiments, the above processor 601 is further configured to:

It should be appreciated that in some possible embodiments, the above-described processor 601 may be a central processing unit (central processing unit, CPU), which may also be other general purpose processors, digital signal processors (DIGITAL SIGNAL processors, DSPs), application Specific Integrated Circuits (ASICs), off-the-shelf programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The memory may include read only memory and random access memory and provide instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.

In a specific implementation, the electronic device 600 may execute, through each functional module built in the electronic device, an implementation manner provided by each step in fig. 1, and specifically, the implementation manner provided by each step may be referred to, which is not described herein again.

The embodiment of the present application further provides a computer readable storage medium, where a computer program is stored and executed by a processor to implement the method provided by each step in fig. 1, and specifically, the implementation manner provided by each step may be referred to, which is not described herein.

The computer readable storage medium may be the video processing apparatus or the internal storage unit of the electronic device provided in any one of the foregoing embodiments, for example, a hard disk or a memory of the electronic device. The computer readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash memory card (FLASH CARD), or the like, which are provided on the electronic device. The computer readable storage medium may also include a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (randomaccess memory, RAM), or the like. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.

Embodiments of the present application provide a computer program product comprising a computer program or computer instructions which, when executed by a processor, provides a method according to the steps of the embodiment of the present application as shown in fig. 1.

The terms first, second and the like in the claims and in the description and drawings are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or electronic device that comprises a list of steps or elements is not limited to the list of steps or elements but may, alternatively, include other steps or elements not listed or inherent to such process, method, article, or electronic device. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments. The term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims

1. A video processing method, characterized in that the method comprises:

Performing image segmentation processing on the initial video frame sequence to determine the image segmentation result;

Based on the image segmentation result, determining a target sub-video frame sequence in the initial video frame sequence, wherein the target sub-video frame sequence is composed of continuous video frames including a target object, and the number of frames of the target sub-video frame sequence is less than the number of frames of the initial video frame sequence;

Based on the target mask features of the target object included in the multiple video frames in the target sub-video frame sequence, sequentially determining the target mask features of the target object corresponding to the other video frames in the initial video frame sequence except the target sub-video frame sequence;

Among them, the target mask features corresponding to each video frame after the target sub-video frame sequence are determined based on the target mask features corresponding to the previous video frame of the video frame, the target mask features corresponding to the first video frame are determined based on the target mask features corresponding to the last video frame in the target sub-video frame sequence, and the first video frame is the first video frame after the target sub-video frame sequence; the target mask features corresponding to each video frame before the target sub-video frame sequence are determined based on the target mask features corresponding to the next video frame of the video frame, the target mask features corresponding to the second video frame are determined based on the target mask features corresponding to the first video frame in the target sub-video frame sequence, and the second video frame is the last video frame before the target sub-video frame sequence;

Generate a mask feature sequence based on the target mask features corresponding to each video frame in the target sub-video frame sequence and the target mask features corresponding to other video frames in the initial video frame sequence except the target sub-video frame sequence;

A target video frame sequence is generated based on each video frame corresponding to the mask feature sequence, each video frame in the target video frame sequence includes the target object, and the number of frames of the target video frame sequence is greater than the number of frames of the target sub-video frame sequence.

2. The method according to claim 1, characterized in that the image segmentation result comprises mask features of multiple objects included in each video frame in the initial video frame sequence;

The step of determining a target sub-video frame sequence in the initial video frame sequence based on the image segmentation result includes:

Based on the target mask feature of the target object, a target sub-video frame sequence in the initial video frame sequence is determined.

3. The method according to claim 1, characterized in that determining the target mask feature corresponding to the first video frame based on the target mask feature corresponding to the last video frame in the target sub-video frame sequence comprises:

Determine at least one third video frame in the target sub-video frame sequence, each of the third video frames being any video frame in the target sub-video frame sequence;

Determining a predicted target mask feature corresponding to the first video frame based on the last video frame in the target sub-video frame sequence and its corresponding target mask feature, and each of the third video frames and its corresponding target mask feature;

Based on the target mask feature corresponding to the last video frame in the target sub-video frame sequence and the predicted target mask feature, the target mask feature corresponding to the first video frame is determined.

4. The method according to claim 3, characterized in that determining the predicted target mask feature corresponding to the first video frame based on the last video frame in the target sub-video frame sequence and its corresponding target mask feature, and each of the third video frames and its corresponding target mask feature, comprises:

For each of the third video frames, based on the third video frame and its corresponding target mask feature, determining a first image feature and a first context feature corresponding to the third video frame;

fusing the first image features to obtain a first fused image feature, and fusing the first context features to obtain a fused context feature;

Performing Gaussian blur processing on the target mask feature corresponding to the last video frame in the target sub-video frame sequence to obtain a blurred mask feature;

Determine a second image feature and a second context feature corresponding to the last video frame in the target sub-video frame sequence;

determining an attention feature based on the first fused image feature, the fused context feature, the fuzzy mask feature, the second context feature, and the second image feature;

Based on the attention feature and the second image feature, a predicted target mask feature corresponding to the first video frame is determined.

5. The method according to claim 4, characterized in that the determining of the attention feature based on the first fused image feature, the fused context feature, the fuzzy mask feature, the second context feature and the second image feature comprises:

Processing the fuzzy mask feature based on the second image feature to obtain a processed fuzzy mask feature;

Determine a second fused image feature based on the second image feature and the first fused image feature;

An attention feature is determined based on the second fused image feature, the second context feature, the processed blur mask feature, and the fused context feature.

6. The method according to claim 5, characterized in that the processing of the fuzzy mask feature based on the second image feature to obtain the processed fuzzy mask feature comprises:

Fusing the second image feature and the fuzzy mask feature to obtain a fusion target fusion feature;

Obtaining a bias parameter and a weight parameter corresponding to the fuzzy mask feature based on the fusion target fusion feature;

Based on the bias parameter, the weight parameter and the fuzzy mask feature, a processed fuzzy mask feature is obtained.

7. The method according to claim 3, characterized in that the determining the target mask feature corresponding to the first video frame based on the target mask feature corresponding to the last video frame in the target sub-video frame sequence and the predicted target mask feature comprises:

Determine an intersection-over-union ratio of a target mask feature corresponding to a last video frame in the target sub-video frame sequence and the predicted target mask feature;

If the intersection-over-union ratio is less than a preset threshold, determining the predicted target mask feature as the target mask feature corresponding to the first video frame;

If the intersection-over-union ratio is greater than or equal to the preset threshold, the target mask feature corresponding to the last video frame in the target sub-video frame sequence is determined as the target mask feature corresponding to the first video frame.

8. The method according to claim 2, characterized in that the determining of the target sub-video frame sequence in the initial video frame sequence based on the target mask feature of the target object comprises:

For each of the target mask features, determining a third image feature of the target object included in the video frame corresponding to the target mask feature, and determining an optimized mask feature corresponding to the target mask feature based on the third image feature and the target mask feature;

Based on each of the optimized mask features, a target sub-video frame sequence in the initial video frame sequence is determined.

9. The method according to claim 1, characterized in that the method further comprises:

If a first video frame sequence and a second video frame sequence consisting of continuous video frames including the target object are obtained, and there are a preset number of overlapping video frames between the first video frame sequence and the second target sub-video frame, a third target sub-video frame sequence is generated based on the first video frame sequence and the second video frame sequence, and the third target sub-video frame sequence is composed of continuous video frames including the target object.

10. A video processing device, characterized in that the device comprises:

An image processing module is used to perform image segmentation processing on the initial video frame sequence and determine the image segmentation result;

A sequence determination module, configured to determine a target sub-video frame sequence in the initial video frame sequence based on the image segmentation result, wherein the target sub-video frame sequence is composed of continuous video frames including a target object, and the number of frames of the target sub-video frame sequence is less than the number of frames of the initial video frame sequence;

A sequence generation module, configured to sequentially determine target mask features of the target object corresponding to other video frames in the initial video frame sequence except the target sub-video frame sequence based on target mask features of the target object included in a plurality of video frames in the target sub-video frame sequence;

The sequence generation module is used to generate a mask feature sequence based on the target mask features corresponding to each video frame in the target sub-video frame sequence and the target mask features corresponding to other video frames in the initial video frame sequence except the target sub-video frame sequence;

The sequence generation module is used to generate a target video frame sequence based on each video frame corresponding to the mask feature sequence, each video frame in the target video frame sequence includes the target object, and the number of frames in the target video frame sequence is greater than the number of frames in the target sub-video frame sequence.

11. An electronic device, comprising a processor and a memory, wherein the processor and the memory are connected to each other;

The memory is used to store computer programs;

The processor is configured to execute the method according to any one of claims 1 to 9 when calling the computer program.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method according to any one of claims 1 to 9.