
CN118784939B - Controllable generation type video frame inserting method based on diffusion model - Google Patents


Info

Publication number
CN118784939B
Authority
CN
China
Prior art keywords
frame
point
video
track
diffusion model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411255207.3A
Other languages
Chinese (zh)
Other versions
CN118784939A (en)
Inventor
陈昊
王文
陈哲恺
沈春华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202411255207.3A
Publication of CN118784939A
Application granted
Publication of CN118784939B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g 3D video
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/34Scalability techniques involving progressive bit-plane based encoding of the enhancement layer, e.g. fine granular scalability [FGS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/14Picture signal circuitry for video frequency region
    • H04N5/144Movement detection
    • H04N5/145Movement estimation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/01Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0127Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/01Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0135Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level involving interpolation processes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract


The present invention discloses a controllable generative video frame interpolation method based on a diffusion model, comprising: introducing a tail-frame control condition into an image-to-video diffusion model to realize video frame interpolation; introducing a trajectory control scheme based on user dragging, allowing the user to achieve controllable interpolation through simple interaction; when the user does not provide a trajectory, obtaining matching information of key points between the first and last frames through a feature point matching algorithm and using this information to obtain a temporally consistent interpolation result; updating the coordinates of trajectory points using the similarity between features in the model; and ensuring the accuracy of the updated point coordinates by checking the consistency of the update points obtained from two nearest-neighbour searches. The method improves the accuracy and controllability of video frame interpolation, enables user-interactive video interpolation generation, and provides more comprehensive performance guidance.

Description

Controllable generation type video frame inserting method based on diffusion model
Technical Field
The invention belongs to the technical field of video model applications, and particularly relates to a controllable generative video frame interpolation method based on a diffusion model.
Background
In the current fields of multimedia technology and artificial intelligence, video processing technology has made significant progress, particularly in video frame interpolation (Video Frame Interpolation). Video frame interpolation is an important task in computer vision and video processing whose purpose is to synthesize intermediate frames from two consecutive video frames. Most previous approaches treat it as a low-level vision task and assume that the motion between frames is small; these approaches can be broadly divided into flow-based and kernel-based methods. Specifically, flow-based methods use estimated optical flow for frame synthesis and may therefore be affected by inaccurate optical flow estimation, whereas kernel-based methods rely on spatially adaptive kernels to synthesize interpolated pixels and are often limited by the kernel size. To combine the strengths of both, some approaches integrate flow- and kernel-based techniques into end-to-end video frame interpolation methods.
Recently, inspired by the generative capability of large-scale pre-trained video diffusion models, some approaches attempt to solve the video frame interpolation problem from a generative perspective. For example, LDM-VFI (Video Frame Interpolation with Latent Diffusion Models) models video interpolation as a conditional generation problem and uses diffusion models for perceptually oriented video interpolation, and VIDIM (Video Interpolation with Diffusion Models) uses cascaded diffusion models to generate high-fidelity interpolated video with non-linear motion. Although these methods have advanced the field, they still have difficulty producing reliable interpolation results when the difference between the starting and ending frames is large; furthermore, they focus on generating a single feasible solution and provide no controllability over the interpolation result.
In the field of text-to-video generation, large-scale pre-trained diffusion models have demonstrated the ability to generate high-quality, diverse, and realistic videos; however, these models are limited in terms of precise text control and user interactivity. Traditional video control methods such as VideoComposer and SparseCtrl use structural information such as sketches and depth maps. Although they provide a certain degree of control, the control signals are complex to obtain and difficult for users to manipulate, which limits their wide adoption in practical applications. In contrast, motion control methods such as MotionCtrl offer a more intuitive form of control, for example over object motion trajectories and camera motion; they can be driven by simple user input and greatly improve the interactivity and practicality of video generation.
Nevertheless, effectively integrating motion control into the video frame interpolation process of a diffusion model, so that the generated video both conforms to the textual description and accurately follows user control, remains an unsolved technical problem.
Disclosure of Invention
In view of the above, the invention provides a controllable generative video frame interpolation method based on a diffusion model, which improves the accuracy and controllability of video frame interpolation through an innovative motion control strategy and an optimized diffusion model architecture, adapts to various complex motion scenes, and meets users' customized requirements, thereby promoting the further development of video processing technology.
A controllable generative video frame interpolation method based on a diffusion model comprises the following steps:
(1) Selecting a pre-trained image-to-video diffusion model;
(2) Inputting the first frame and the last frame provided by the user into the diffusion model;
(3) Introducing drag-based trajectory control into the diffusion model to generate a video frame interpolation result that conforms to the user's intent;
(4) When the user does not provide a drag trajectory, generating trajectories through feature point matching and interpolation and performing automatic point trajectory tracking, thereby improving the quality and consistency of the video frame interpolation result.
Further, the diffusion model in step (1) adopts the SVD (Stable Video Diffusion) model to generate the video frame interpolation result and comprises a variational autoencoder (Variational Auto-Encoder, VAE), a CLIP (Contrastive Language-Image Pre-training) image encoder, a 3D U-Net, and a cross-attention mechanism. The variational autoencoder extracts the latent-space features of a video frame, and the CLIP image encoder extracts its semantic features. The latent-space features are concatenated with the noisy latent variables and fed into the 3D U-Net; the cross-attention mechanism takes the semantic features as key and value inputs and the internal features of the 3D U-Net as query inputs, and its output is used to update the internal features of the 3D U-Net. After multiple rounds of iterative denoising, the output of the 3D U-Net yields the video frame interpolation result. The conventional SVD model takes a single frame as input, treats it as the first frame, and generates a video result by inference.
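For illustration only, the following is a minimal Python sketch of how these components could interact at inference time; the `vae`, `clip_enc`, `unet3d`, and `scheduler` objects and their call signatures are hypothetical stand-ins, not the patent's actual implementation.

```python
import torch


def svd_denoise(first_frame, vae, clip_enc, unet3d, scheduler, n_frames=16):
    """Minimal sketch of the SVD-style inference loop described above.

    `vae`, `clip_enc`, `unet3d`, and `scheduler` are hypothetical stand-ins for
    the VAE, the CLIP image encoder, the 3D U-Net, and a diffusion noise
    scheduler; their call signatures are assumptions.
    """
    # Latent-space condition: VAE features of the conditioning frame, repeated for all frames.
    z_cond = vae.encode(first_frame)                               # (B, C, h, w)
    z_cond = z_cond.unsqueeze(1).repeat(1, n_frames, 1, 1, 1)      # (B, T, C, h, w)
    # Semantic condition: CLIP image embedding, used as cross-attention keys/values.
    context = clip_enc(first_frame)                                # (B, L, d)

    latents = torch.randn_like(z_cond)                             # start from pure noise
    for t in scheduler.timesteps:
        unet_in = torch.cat([latents, z_cond], dim=2)              # channel-wise concatenation
        pred = unet3d(unet_in, t, context=context)                 # denoising prediction
        latents = scheduler.step(pred, t, latents)                 # one iterative denoising update

    # Decode every denoised latent frame back to pixel space.
    return torch.stack([vae.decode(latents[:, k]) for k in range(n_frames)], dim=1)
```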
Further, in step (2), when the first frame and the last frame are input to the diffusion model, their respective latent-space features and semantic features are generated by the variational autoencoder and the CLIP image encoder; the latent-space features of the first and last frames are concatenated with the noisy latent variables and then input into the 3D U-Net, and the semantic features of the first and last frames are concatenated and used as the key and value inputs of the cross-attention mechanism.
Further, in step (3), to facilitate user interaction, the user is allowed to control the video frame interpolation result by dragging: the trajectory of a key point during the drag is obtained and converted into a Gaussian heat map, the Gaussian heat map is input into an encoding module to obtain trajectory features of the key point, and these features are injected into the 3D U-Net of the diffusion model. By introducing trajectory control, video frame interpolation becomes more controllable and users can satisfy their own requirements through simple interaction; experimental results show that introducing trajectory control can further improve interpolation performance.
Before introducing drag-based trajectory control, the trajectory control conditions of the points are obtained as follows: some sampling points are randomly initialized around a fixed sparse grid in the first frame, and Co-Tracker is used to obtain the trajectories of these sampling points over the whole video; during training, trajectories that are invisible in more than half of the video frames are removed, and points whose trajectories exhibit large motion changes are sampled from the remaining trajectories with higher probability; after sampling, only a small number of trajectory points are kept and their coordinates are converted into a Gaussian heat map, which is then used as the input of an encoding module that replicates the encoder part of the 3D U-Net; finally, the features output by the encoding module are injected into the 3D U-Net of the diffusion model through zero convolutions.
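As an illustration, the following sketch shows one way the sampled trajectory coordinates could be rasterized into per-frame Gaussian heat maps before being fed to the encoding module; the spread parameter `sigma` and the tensor layout are assumptions, not values specified by the patent.

```python
import torch


def trajectories_to_heatmaps(tracks, height, width, sigma=3.0):
    """Convert point trajectories into per-frame Gaussian heat maps.

    tracks: (N, T, 2) tensor of (x, y) coordinates for the N kept control
            points (e.g. the 1 to 5 points retained during training) over T frames.
    Returns a (T, N, height, width) heat map tensor used as input of the
    trajectory encoding module. `sigma` is an assumed spread parameter.
    """
    n_points, n_frames, _ = tracks.shape
    ys = torch.arange(height).view(1, 1, height, 1).float()
    xs = torch.arange(width).view(1, 1, 1, width).float()
    heatmaps = torch.zeros(n_frames, n_points, height, width)
    for t in range(n_frames):
        cx = tracks[:, t, 0].view(1, n_points, 1, 1)
        cy = tracks[:, t, 1].view(1, n_points, 1, 1)
        # Gaussian bump centered at each point's coordinate on frame t.
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmaps[t] = g.squeeze(0)
    return heatmaps
```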
Further, in step (4), matching information of key points between the first frame and the last frame provided by the user is obtained through a feature point matching algorithm; any matched key point pair in the first and last frames is denoted as p0 and pn, the corresponding trajectory is estimated by interpolation, and all estimated trajectories are obtained by traversing the other matched key point pairs. This avoids frame skipping when the head and tail frames differ greatly, generates more coherent interpolation results, and improves the quality and consistency of video frame interpolation.
Further, the automatic point trajectory tracking in step (4) updates the point coordinates of the trajectories in the intermediate frames using the similarity between features in the 3D U-Net of the diffusion model. Specifically, the trajectories generated by feature point matching and interpolation are first converted into Gaussian heat maps and encoded to obtain trajectory features, which are injected into the 3D U-Net of the diffusion model; the features of the last upsampling module in the 3D U-Net are then bilinearly interpolated to obtain the image features of each frame; for any trajectory point k on an intermediate frame, the point set Ω whose distance to it is smaller than a threshold r1 is obtained by searching; from the set Ω, the nearest-neighbour algorithm is used to compute the point pk,0 whose features are closest to those of the first-frame key point p0 and the point pk,n whose features are closest to those of the tail-frame key point pn; if the distance between pk,0 and pk,n is smaller than a threshold r2, the coordinate of trajectory point k is updated to the midpoint of pk,0 and pk,n, and the updated trajectory is used as the input of the next denoising step. By further improving the accuracy of the point trajectory update, this method improves the accuracy and consistency of video frame interpolation.
A computer device comprises a memory and a processor, wherein the memory stores a computer program and the processor is configured to execute the computer program to implement the above diffusion-model-based controllable generative video frame interpolation method.
A computer readable storage medium stores a computer program which, when executed by a processor, implements the above diffusion-model-based controllable generative video frame interpolation method.
The controllable generative video frame interpolation method of the invention mainly comprises the following key technical points:
1. The invention is based on an image-to-video diffusion model and introduces a tail-frame control condition to realize video frame interpolation; by injecting the tail-frame condition into both the latent space and the semantic space, a user can generate multiple feasible interpolation schemes given the head and tail frames. This effectively exploits the prior knowledge of the pre-trained model and improves the quality and diversity of video frame interpolation.
2. The invention introduces a trajectory control scheme based on user dragging, allowing the user to achieve controllable interpolation through simple interaction; by randomly initializing sampling points in the first frame and using a video point trajectory tracking algorithm to obtain their trajectories over the whole video, the result of the video frame interpolation can be controlled.
3. When the user does not provide a trajectory, the invention obtains matching information of key points between the head and tail frames through a feature point matching algorithm and uses this information to obtain a temporally consistent interpolation result; this effectively alleviates frame skipping and improves the quality and consistency of video frame interpolation.
4. The invention uses the similarity between features in the 3D U-Net to update the point coordinates as a trajectory update, thereby ensuring the accuracy of the point trajectories; this effectively improves the accuracy and robustness of point trajectory tracking and the quality and reliability of video frame interpolation.
5. Trajectory consistency check: the accuracy of the updated point coordinates is ensured by checking the consistency of the update points obtained from two nearest-neighbour searches; this effectively improves the accuracy and reliability of the point trajectory update and the quality and reliability of video frame interpolation.
In addition, the method of the invention provides better controllability: motion control is a more intuitive control mode, and the user can control the video frame interpolation result through simple input, which greatly improves the interactivity and practicality of video generation.
Drawings
FIG. 1 is a schematic diagram of the controllable generative video frame interpolation process using a diffusion model according to the present invention.
FIG. 2 is a schematic diagram of automatic point trajectory tracking using the similarity between features in the 3D U-Net according to the present invention.
FIG. 3 shows video frame interpolation results of the method of the present invention applied to a novel view synthesis scene.
FIG. 4 shows video frame interpolation results of the method of the present invention applied to a cartoon in-betweening scene.
FIG. 5 shows video frame interpolation results of the method of the present invention applied to time-lapse photography and slow-motion video generation scenes.
FIG. 6 shows video frame interpolation results of the method of the present invention applied to an image morphing scene.
Detailed Description
In order to describe the present invention more specifically, the technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a controllable generative video frame interpolation method based on a diffusion model, which aims to realize user-interactive video frame interpolation and covers the following key technologies. First, based on an image-to-video diffusion model, the invention introduces the model structure and how the tail-frame condition is added to realize video frame interpolation, allowing a user to generate multiple feasible interpolation schemes given the head and tail frames. Second, the invention introduces a trajectory control scheme based on user dragging, which allows the user to achieve controllable interpolation through simple interaction and further enhances the controllability of video frame interpolation. Finally, we find that when the head and tail frames differ greatly, the model tends to generate frame-skipping video results; therefore, the invention introduces explicit correlation modeling on key points in the video to obtain a temporally coherent interpolation result. Through these techniques, the invention realizes user-interactive video frame interpolation, improves its stability and controllability, and has broad application prospects. The specific implementation process is as follows:
(1) SVD is a widely used image-to-video diffusion model that can generate high-quality video results from a single input first-frame image. SVD follows the Latent Diffusion paradigm: the video $x$ is compressed into a lower-dimensional latent space by a VAE codec, which can be expressed as $z=\mathcal{E}(x)$. In the reverse process, SVD adopts a 3D U-Net as the denoiser and introduces the first-frame condition with two strategies: first, the VAE-encoded latent features of the first frame $z^{\mathrm{cond}}$ are concatenated along the channel dimension with the noisy latent variables of every video frame; second, semantic features of the first-frame image are extracted by the CLIP image encoder and injected into the model through the cross-attention mechanism. The 3D U-Net is trained with DSM (Denoising Score Matching), and the training objective can be expressed as
$\mathbb{E}_{z,c,\sigma,\epsilon\sim\mathcal{N}(0,I)}\big[\lambda(\sigma)\,\|D_\theta(z+\sigma\epsilon;\sigma,c)-z\|_2^2\big],$
where $D_\theta$ denotes the denoiser, $c$ the first-frame condition, $\sigma$ the noise level, and $\lambda(\sigma)$ a weighting function.
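As a concrete, hypothetical reading of the objective above, the sketch below computes one denoising-score-matching training step; the EDM-style weighting lambda(sigma) = 1/sigma**2 and the argument layout of `unet3d` are assumptions.

```python
import torch


def dsm_loss(unet3d, z0, cond_latent, context, sigma):
    """One denoising-score-matching step: perturb the clean latent z0 with noise
    of level sigma and regress the denoiser output back to z0.

    The weighting lambda(sigma) = 1 / sigma**2 is an assumed (EDM-style) choice;
    `unet3d` is a hypothetical denoiser D_theta(z; sigma, c).
    """
    noise = torch.randn_like(z0)
    z_noisy = z0 + sigma * noise                              # z + sigma * eps
    unet_in = torch.cat([z_noisy, cond_latent], dim=2)        # channel-wise frame condition
    denoised = unet3d(unet_in, sigma, context=context)        # D_theta(z + sigma*eps; sigma, c)
    weight = 1.0 / sigma ** 2                                 # lambda(sigma)
    return (weight * (denoised - z0) ** 2).mean()
```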
(2) Introducing the tail-frame condition and fine-tuning. Based on the image-to-video diffusion model, tail-frame control is additionally introduced to realize video frame interpolation, so the invention introduces the tail-frame control condition into SVD. To preserve the prior of the pre-trained SVD as much as possible, the tail-frame condition is injected into both the latent space and the semantic space. Specifically, the VAE-encoded latent features of the first frame $z_0^{\mathrm{cond}}$ are concatenated with the noisy latent variable of the first frame, the VAE-encoded latent features of the tail frame $z_n^{\mathrm{cond}}$ are concatenated with the noisy latent variable of the tail frame, and for the intermediate frames the noisy latent variables are simply concatenated with a learnable conditional token after broadcasting. In addition, the invention extracts the CLIP image embeddings of the first and last frames, concatenates them, and uses them as the keys and values in the cross-attention mechanism. The training objective after fine-tuning the model can be expressed as
$\mathbb{E}_{z,\sigma,\epsilon}\big[\lambda(\sigma)\,\|D_\theta(z+\sigma\epsilon;\sigma,c_0,c_n)-z\|_2^2\big],$
where $c_0$ and $c_n$ denote the head-frame and tail-frame conditions.
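The per-frame latent condition described above could be assembled as in the following sketch, where the first-frame latent conditions frame 0, the tail-frame latent conditions the last frame, and a broadcast learnable token fills the intermediate frames; the module and parameter names are placeholders, not the patent's implementation.

```python
import torch
import torch.nn as nn


class FrameCondition(nn.Module):
    """Sketch of the per-frame latent condition: first-frame VAE latent for frame 0,
    tail-frame VAE latent for frame T-1, and a broadcast learnable token for all
    intermediate frames. `latent_ch` is an assumed channel size."""

    def __init__(self, latent_ch=4):
        super().__init__()
        self.cond_token = nn.Parameter(torch.zeros(latent_ch, 1, 1))

    def forward(self, z_first, z_last, n_frames):
        # z_first, z_last: (B, C, h, w) VAE latents of the head and tail frames.
        b, c, h, w = z_first.shape
        middle = self.cond_token.expand(c, h, w).unsqueeze(0).expand(b, c, h, w)
        frames = [z_first] + [middle] * (n_frames - 2) + [z_last]
        # (B, T, C, h, w); concatenated channel-wise with the noisy latents downstream.
        return torch.stack(frames, dim=1)
```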
(3) Controllable video frame interpolation. Given the head and tail frames, the goal is to sample a video from the conditional distribution $p(x_{1:n-1}\mid x_0,x_n)$; multiple feasible interpolation results are possible, especially when the head and tail frames differ greatly. To facilitate user interaction, the invention adopts dragging (drags) as the control mode. To obtain the trajectory control conditions of points for training the model, we randomly initialize some sampling points around a fixed sparse grid in the first frame and use Co-Tracker to obtain the trajectories of these points over the whole video; in training we remove trajectories that are invisible in more than half of the video frames and sample, with higher probability, points whose trajectories exhibit large motion changes. In addition, considering that user-provided control points are usually sparse, we keep only 1 to 5 control points during training. After sampling the trajectory points, we convert the point coordinates into a Gaussian heat map, expressed as
$G_k(x,y)=\exp\!\Big(-\tfrac{(x-x_k)^2+(y-y_k)^2}{2\sigma^2}\Big),$
where $(x_k,y_k)$ is the point coordinate on frame $k$, and use it as the input of the control module. The invention adds trajectory control with a mechanism similar to ControlNet: specifically, a copy of the 3D U-Net encoder encodes the trajectory map and injects it into the U-Net decoder through zero convolutions, as shown in FIG. 1. The training objective of the model after introducing trajectory control can be expressed as
$\mathbb{E}_{z,\sigma,\epsilon}\big[\lambda(\sigma)\,\|D_\theta(z+\sigma\epsilon;\sigma,c_0,c_n,c_{\mathrm{traj}})-z\|_2^2\big],$
where $c_{\mathrm{traj}}$ denotes the trajectory control condition.
This loss function takes the trajectory control condition into account so that the generated interpolation result conforms to the user's intent; trajectory control makes video frame interpolation more controllable, and the user's requirements can be met through simple interaction. In addition, we found in experiments that introducing trajectory control can further improve the performance of video frame interpolation.
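The ControlNet-like injection can be sketched as follows: a copy of the U-Net encoder consumes the trajectory heat maps, and its multi-scale outputs pass through zero-initialized convolutions before being added to the decoder features. The class and argument names are hypothetical; the patent does not prescribe this exact interface.

```python
import torch
import torch.nn as nn


def zero_conv(channels):
    """1x1x1 convolution initialized to zero, so the control branch has no effect
    at the start of fine-tuning and its influence is learned gradually."""
    conv = nn.Conv3d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class TrajectoryControlBranch(nn.Module):
    """Hypothetical ControlNet-style branch: a copy of the 3D U-Net encoder
    consumes the trajectory heat maps, and its multi-scale features are passed
    through zero convolutions before being added to the decoder features of the
    main U-Net (the addition itself happens inside the main model)."""

    def __init__(self, encoder_copy, feature_channels):
        super().__init__()
        self.encoder = encoder_copy                       # duplicated 3D U-Net encoder
        self.zero_convs = nn.ModuleList(zero_conv(c) for c in feature_channels)

    def forward(self, heatmaps):
        # `encoder_copy` is assumed to return a list of (B, C_i, T, H_i, W_i) features.
        feats = self.encoder(heatmaps)
        return [zc(f) for zc, f in zip(self.zero_convs, feats)]
```

The zero initialization is the usual design choice for such branches: the fine-tuned model starts out identical to the pre-trained one, and the control signal is introduced gradually during training.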
(4) Explicit correlation modeling based on matching points. When the user does not provide a trajectory, the method already achieves a good interpolation effect if the difference between the head and tail frames is moderate or small; however, we find that when the difference is large, frame skipping sometimes occurs: the first half of the generated video is highly correlated with the first frame, the second half is highly correlated with the last frame, and in the middle frames the content abruptly changes from first-frame-related content to last-frame-related content.
Given the head and tail frames of the input video, the invention obtains matching information of key points between them through a feature point matching algorithm, denoted as $\{(p_0^i,p_n^i)\}_{i=1}^{M}$, where $M$ is the number of matching points and $p_0^i,p_n^i$ are the known points on the i-th trajectory (at initialization, a pair of matched key points on the head and tail frames). Although different key point matching algorithms are possible, this embodiment uses classical SIFT feature point matching because it is simple and we have found empirically that it works well.
Subsequently, we interpolate between the known trajectory points $(p_0^i,p_n^i)$ to obtain the i-th trajectory $\tau^i=\{p_k^i\}_{k=0}^{n}$; in this way all estimated trajectories $\{\tau^i\}_{i=1}^{M}$ can be obtained.
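A hypothetical sketch of this matching-and-interpolation step using OpenCV's SIFT implementation is given below; the Lowe ratio threshold and the linear interpolation between matched points are assumptions consistent with the description above.

```python
import cv2
import numpy as np


def match_and_interpolate(first_img, last_img, n_frames=16, ratio=0.75):
    """Match SIFT key points between the head and tail frames and linearly
    interpolate each matched pair into an initial trajectory of n_frames points.
    The Lowe ratio threshold of 0.75 is an assumed value."""
    to_gray = lambda im: cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) if im.ndim == 3 else im
    sift = cv2.SIFT_create()
    kp0, des0 = sift.detectAndCompute(to_gray(first_img), None)
    kpn, desn = sift.detectAndCompute(to_gray(last_img), None)

    pairs = cv2.BFMatcher().knnMatch(des0, desn, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]

    trajectories = []
    for m in good:
        p0 = np.array(kp0[m.queryIdx].pt)                      # matched point in the first frame
        pn = np.array(kpn[m.trainIdx].pt)                      # matched point in the last frame
        alphas = np.linspace(0.0, 1.0, n_frames)[:, None]
        trajectories.append((1 - alphas) * p0 + alphas * pn)   # (n_frames, 2) straight-line track
    return np.stack(trajectories) if trajectories else np.empty((0, n_frames, 2))
```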
(5) Automatic point trajectory tracking. Although the initial trajectories provide temporally consistent key point correspondences, the trajectories obtained by interpolating between the head and tail key frames are not necessarily accurate; for this reason, in each denoising step the invention updates the point coordinates as a trajectory update using the similarity between features in the 3D U-Net.
As shown in FIG. 2, we interpolate the features of the 3D U-Net to image resolution to obtain the image feature F; here we use the features of the penultimate upsampling module in the 3D U-Net because they have higher resolution and better discriminability. We use $F(p)$ to denote the feature at coordinate p on the image feature F, which is obtained by bilinear interpolation because the coordinate p is not necessarily an integer.
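Bilinear sampling of the feature map at a sub-pixel coordinate can be sketched with torch.nn.functional.grid_sample as follows; the normalization convention (align_corners=True) is an assumption.

```python
import torch
import torch.nn.functional as F


def sample_feature(feat, xy):
    """Bilinearly sample a per-frame feature map at a sub-pixel coordinate.

    feat: (C, H, W) feature map already interpolated to image resolution.
    xy:   (2,) tensor holding the (x, y) coordinate, not necessarily integer.
    Returns the (C,) feature vector at that coordinate.
    """
    _, h, w = feat.shape
    # grid_sample expects coordinates normalized to [-1, 1] (align_corners=True).
    gx = 2.0 * xy[0] / (w - 1) - 1.0
    gy = 2.0 * xy[1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy]).view(1, 1, 1, 2).to(feat.dtype)
    out = F.grid_sample(feat.unsqueeze(0), grid, mode="bilinear", align_corners=True)
    return out.view(-1)                                   # (C,)
```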
We search for the updated coordinate point around the interpolated intermediate-frame coordinate. We use $\Omega=\{q:\|q-p\|<r_1\}$ to denote the set of points whose distance to the point p is less than $r_1$. The nearest-neighbour algorithm is then used to obtain, on the k-th frame, the update point closest in feature space to the key point on the first frame:
$p_{k,0}=\arg\min_{q\in\Omega}\|F_k(q)-F_0(p_0)\|_2.$
(6) To further improve the accuracy of the point trajectory update, the coordinates of the matching point on the last frame are also used, so that the update point on the k-th frame closest in feature space to the key point on the tail frame can be obtained:
$p_{k,n}=\arg\min_{q\in\Omega}\|F_k(q)-F_n(p_n)\|_2.$
To ensure the accuracy of the updated point coordinates, we check the consistency of the update points obtained by the two nearest-neighbour searches: when the distance between them is smaller than $r_2$, i.e. $\|p_{k,0}-p_{k,n}\|<r_2$, we use the midpoint of these two points as the updated point coordinate, i.e. $p_k=(p_{k,0}+p_{k,n})/2$.
We then add this point to the known trajectory points and interpolate again to obtain the updated trajectory, which serves as the input to the next denoising step.
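Putting steps (5) and (6) together, the following sketch updates a single trajectory point on an intermediate frame using the feature-similarity nearest-neighbour search and the two-sided consistency check; the radii r1 and r2 and the rounding of candidate pixels to integer coordinates are assumptions for illustration.

```python
import torch


def update_track_point(p_k, feat_k, f0, fn, r1=8, r2=4):
    """Consistency-checked update of one trajectory point on intermediate frame k.

    p_k:    (2,) current interpolated (x, y) coordinate on frame k (float tensor).
    feat_k: (C, H, W) image-resolution feature map of frame k.
    f0, fn: (C,) features of the matched key points p_0 (first frame) and
            p_n (last frame). r1, r2 are the search and consistency radii
            (assumed values). Returns the updated coordinate, or p_k if the check fails.
    """
    c, h, w = feat_k.shape
    x0, y0 = int(round(float(p_k[0]))), int(round(float(p_k[1])))
    # Candidate set Omega: all pixels within distance r1 of p_k (clamped to the image).
    xs = torch.arange(max(0, x0 - r1), min(w, x0 + r1 + 1))
    ys = torch.arange(max(0, y0 - r1), min(h, y0 + r1 + 1))
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    cand = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1).float()   # (M, 2)
    cand = cand[((cand - p_k) ** 2).sum(dim=1).sqrt() < r1]

    feats = feat_k[:, cand[:, 1].long(), cand[:, 0].long()].T             # (M, C)
    p_k0 = cand[torch.argmin(((feats - f0) ** 2).sum(dim=1))]             # nearest to head feature
    p_kn = cand[torch.argmin(((feats - fn) ** 2).sum(dim=1))]             # nearest to tail feature

    # Accept the update only if the two nearest-neighbour results agree within r2.
    if torch.norm(p_k0 - p_kn) < r2:
        return (p_k0 + p_kn) / 2.0
    return p_k
```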
Based on the above technical scheme, the controllable generative video frame interpolation method of the invention has broad application value, including:
1. Novel view synthesis: from sparse view inputs, novel views of static or dynamic scenes can be generated; for example, pictures taken from different viewpoints can be used as the head and tail frames of a video, and video frame interpolation then generates the intermediate views, as shown in FIG. 3.
2. Cartoon in-betweening: an animation is obtained by interpolating between cartoon pictures, simplifying the animation production process; interpolation of both color cartoons and line-art cartoons is supported, as shown in FIG. 4.
3. Video editing: video frame interpolation enables video editing, including action editing and video in-painting, and provides new ideas; for example, the actions of people in a video can be modified through interpolation, and video in-painting can remove certain objects from a video.
4. Time-lapse photography: only a few images at key moments are needed, and the whole video can be obtained through interpolation to show a slow change process; for example, a time-lapse video showing ice melting, plants growing, or the moon waxing and waning can be generated, as shown in FIG. 5 (a).
5. Slow-motion video generation: slow-motion video can be generated by video frame interpolation, for example converting normal-speed video into slow motion and highlighting key frames and actions, as shown in FIG. 5 (b).
6. Interactive video: videos can be edited interactively through frame interpolation; for example, a user can control the playback speed, playback direction, and interpolation effect of a video with a mouse or touch screen, enabling a more flexible and interactive video editing experience.
7. Image morphing: a gradual transition from one image content to another can be generated by video frame interpolation, for example gradually changing a young photo of a person into an aged appearance; this technique can be applied to advertising, film special effects, and other fields to create unique visual effects, as shown in FIG. 6.
The embodiments described above are presented to facilitate the understanding and application of the present invention by those skilled in the art. It will be apparent to those skilled in the art that various modifications may be made to the above embodiments and that the general principles described herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments, and improvements and modifications made by those skilled in the art based on the present disclosure should fall within the protection scope of the present invention.

Claims (7)

1. A controllable generative video frame interpolation method based on a diffusion model, comprising the following steps:
(1) Selecting a pre-trained image-to-video diffusion model, wherein the diffusion model adopts an SVD (Stable Video Diffusion) model to generate the video frame interpolation result and comprises a variational autoencoder, a CLIP image encoder, a 3D U-Net, and a cross-attention mechanism, wherein the variational autoencoder is used for extracting latent-space features of a video frame, the CLIP image encoder is used for extracting semantic features of the video frame, the latent-space features are concatenated with noisy latent variables and then input into the 3D U-Net, the cross-attention mechanism takes the semantic features as key and value inputs and the internal features of the 3D U-Net as query inputs, the output of the cross-attention mechanism is used for updating the internal features of the 3D U-Net, and the output of the 3D U-Net after multiple rounds of iterative denoising yields the video frame interpolation result;
(2) Inputting the first frame and the last frame provided by the user into the diffusion model;
(3) Introducing drag-based trajectory control into the diffusion model to generate a video frame interpolation result that conforms to the user's intent;
(4) When the user does not provide a drag trajectory, generating trajectories through feature point matching and interpolation and automatically tracking the point trajectories.
2. The method according to claim 1, wherein in step (2), when the first frame and the last frame are input to the diffusion model, their respective latent-space features and semantic features are generated by the variational autoencoder and the CLIP image encoder; the latent-space features of the first and last frames are concatenated with the noisy latent variables and then input into the 3D U-Net, and the semantic features of the first and last frames are concatenated and used as the key and value inputs of the cross-attention mechanism.
3. The method according to claim 1, wherein in step (3), to facilitate user interaction, the user is allowed to control the video frame interpolation result by dragging: the trajectory of a key point during the drag is obtained and converted into a Gaussian heat map, the Gaussian heat map is input into an encoding module to obtain trajectory features of the key point, and these features are injected into the 3D U-Net of the diffusion model.
4. The method according to claim 1, wherein in step (4), matching information of key points between the first frame and the last frame provided by the user is obtained through a feature point matching algorithm; any matched key point pair in the first and last frames is denoted as p0 and pn, the corresponding trajectory is estimated by interpolation, and all estimated trajectories are obtained by traversing the other matched key point pairs.
5. The method according to claim 1, wherein the automatic point trajectory tracking in step (4) updates the point coordinates of the trajectories in the intermediate frames using the similarity between features in the 3D U-Net of the diffusion model; specifically, the trajectories generated by feature point matching and interpolation are first converted into Gaussian heat maps and encoded to obtain trajectory features, which are injected into the 3D U-Net of the diffusion model; then the features of the last upsampling module in the 3D U-Net are bilinearly interpolated to obtain the image features of each frame; for any trajectory point k on an intermediate frame, the set Ω of points whose distance to it is smaller than a threshold r1 is obtained by searching; from the set Ω, the nearest-neighbour algorithm is used to compute the point pk,0 whose features are closest to those of the first-frame key point p0 and the point pk,n whose features are closest to those of the tail-frame key point pn; if the distance between pk,0 and pk,n is smaller than a threshold r2, the coordinate of trajectory point k is updated to the midpoint of pk,0 and pk,n; the updated trajectory is then used as the input for the next denoising step.
6. A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to implement the diffusion-model-based controllable generative video frame interpolation method according to any one of claims 1 to 5.
7. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the diffusion-model-based controllable generative video frame interpolation method according to any one of claims 1 to 5.
CN202411255207.3A 2024-09-09 2024-09-09 Controllable generation type video frame inserting method based on diffusion model Active CN118784939B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411255207.3A CN118784939B (en) 2024-09-09 2024-09-09 Controllable generation type video frame inserting method based on diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411255207.3A CN118784939B (en) 2024-09-09 2024-09-09 Controllable generation type video frame inserting method based on diffusion model

Publications (2)

Publication Number Publication Date
CN118784939A (en) 2024-10-15
CN118784939B (en) 2024-12-20

Family

ID=92979194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411255207.3A Active CN118784939B (en) 2024-09-09 2024-09-09 Controllable generation type video frame inserting method based on diffusion model

Country Status (1)

Country Link
CN (1) CN118784939B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119421011B (en) * 2024-10-24 2025-08-22 智子引擎(北京)科技有限公司 A video generation method, device, equipment and medium based on diffusion model
CN119762632B (en) * 2024-10-25 2025-10-03 杭州电子科技大学 Diffusion model video generation method based on optical flow information
CN119693507B (en) * 2025-02-26 2025-04-18 深圳市灵图闪创科技有限公司 Animation coloring method and equipment based on video generation model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010062955A (en) * 2008-09-04 2010-03-18 Japan Science & Technology Agency System for converting video signal

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7108192B2 (en) * 1999-09-17 2006-09-19 Silverbrook Research Pty Ltd Rotationally symmetric tags
CN100458358C (en) * 2007-07-10 2009-02-04 浙江大学 Converse measuring method and device based on axial direction stereovision
US20230344962A1 (en) * 2021-03-31 2023-10-26 Meta Platforms, Inc. Video frame interpolation using three-dimensional space-time convolution
WO2024073092A1 (en) * 2022-09-29 2024-04-04 Meta Platforms Technologies, Llc Text to video generation
CN116524898A (en) * 2023-03-23 2023-08-01 中国科学院自动化研究所 Audio and video generation method, device, electronic device and storage medium
CN116962593A (en) * 2023-04-17 2023-10-27 腾讯科技(深圳)有限公司 Video frame inserting method, device, equipment and storage medium
CN117314733A (en) * 2023-08-14 2023-12-29 清华大学深圳国际研究生院 Video filling method, device, equipment and storage medium based on diffusion model
CN117319582A (en) * 2023-09-28 2023-12-29 上海数珩信息科技股份有限公司 Method and device for human action video acquisition and fluent synthesis
CN117793375A (en) * 2023-12-11 2024-03-29 同济大学 Video frame supplementing method based on image diffusion model
CN117729370A (en) * 2023-12-12 2024-03-19 南京邮电大学 A method and system for text generation video based on latent diffusion model
CN118509549A (en) * 2023-12-29 2024-08-16 中国科学院深圳先进技术研究院 Method for constructing a diffusion model for animation video interpolation and method for generating intermediate frames using the diffusion model
CN118505867A (en) * 2024-05-30 2024-08-16 中国科学院深圳先进技术研究院 Method, device and equipment for constructing enhanced time sequence consistency animation frame insertion diffusion model
CN118555461B (en) * 2024-07-29 2024-10-15 浙江天猫技术有限公司 Video generation method, device, equipment, system and computer program product

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010062955A (en) * 2008-09-04 2010-03-18 Japan Science & Technology Agency System for converting video signal

Also Published As

Publication number Publication date
CN118784939A (en) 2024-10-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant