
CN118784939B - Controllable generation type video frame inserting method based on diffusion model - Google Patents


Info

Publication number
CN118784939B
Authority
CN
China
Prior art keywords
frame
point
video
track
diffusion model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411255207.3A
Other languages
Chinese (zh)
Other versions
CN118784939A (en)
Inventor
陈昊
王文
陈哲恺
沈春华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202411255207.3A
Publication of CN118784939A
Application granted
Publication of CN118784939B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g 3D video
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/34Scalability techniques involving progressive bit-plane based encoding of the enhancement layer, e.g. fine granular scalability [FGS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/14Picture signal circuitry for video frequency region
    • H04N5/144Movement detection
    • H04N5/145Movement estimation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/01Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0127Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/01Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0135Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level involving interpolation processes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract


The present invention discloses a controllable generative video frame interpolation method based on a diffusion model, comprising: introducing a tail-frame control condition into an image-to-video diffusion model to realize video frame interpolation; introducing a trajectory control scheme based on user dragging, allowing the user to achieve controllable interpolation through simple interaction; when the user does not provide a trajectory, obtaining matching information of key points between the first and last frames through a feature point matching algorithm and using this information to obtain a temporally consistent interpolation result; updating the coordinates of trajectory points using the similarity between features in the model; and ensuring the accuracy of the updated point coordinates by checking the consistency of the update points obtained from two nearest-neighbour searches. The method improves the accuracy and controllability of video frame interpolation, enables user-interactive video interpolation generation, and provides more comprehensive performance guidance.

Description

Controllable generation type video frame inserting method based on diffusion model
Technical Field
The invention belongs to the technical field of video model applications, and particularly relates to a controllable generative video frame interpolation method based on a diffusion model.
Background
In the current fields of multimedia technology and artificial intelligence, video processing technology has made significant progress, particularly in video frame interpolation (Video Frame Interpolation). Video frame interpolation is an important task in computer vision and video processing whose purpose is to synthesize intermediate frames from two consecutive video frames. Most previous approaches treat it as a low-level vision task and assume that the motion between frames is small; these approaches can be broadly divided into flow-based and kernel-based methods. Specifically, flow-based methods use estimated optical flow for frame synthesis and may therefore be affected by inaccurate optical flow estimation, whereas kernel-based methods rely on spatially adaptive kernels to synthesize interpolated pixels and are often limited by the kernel size. To combine the strengths of both, some approaches integrate flow- and kernel-based techniques into end-to-end video frame interpolation methods.
Recently, inspired by the generative capability of large-scale pre-trained video diffusion models, some approaches attempt to solve the video frame interpolation problem from a generative perspective. For example, LDM-VFI (Video Frame Interpolation with Latent Diffusion Models) models video interpolation as a conditional generation problem and uses diffusion models for perceptually oriented video interpolation, and VIDIM (Video Interpolation with Diffusion Models) uses cascaded diffusion models to generate high-fidelity interpolated video with non-linear motion. Although these methods have advanced the field, they still have difficulty producing reliable interpolation results when the difference between the starting and ending frames is large; furthermore, they focus on generating a single feasible solution and provide no controllability over the interpolation result.
In the field of text-to-video generation, large-scale pre-trained diffusion models have demonstrated the ability to generate high-quality, diverse, and realistic videos; however, these models are limited in terms of precise text control and user interactivity. Traditional video control methods such as VideoComposer and SparseCtrl use structural information such as sketches and depth maps. Although they provide a certain degree of control, the control signals are complex to obtain and difficult for users to manipulate, which limits their wide adoption in practical applications. In contrast, motion control methods such as MotionCtrl offer a more intuitive form of control, for example over object motion trajectories and camera motion; they can be driven by simple user input and greatly improve the interactivity and practicality of video generation.
Nevertheless, effectively integrating motion control into the video frame interpolation process of a diffusion model, so that the generated video both conforms to the textual description and accurately follows user control, remains an unsolved technical problem.
Disclosure of Invention
In view of the above, the invention provides a controllable generative video frame interpolation method based on a diffusion model, which improves the accuracy and controllability of video frame interpolation through an innovative motion control strategy and an optimized diffusion model architecture, adapts to various complex motion scenes, and meets users' customized requirements, thereby promoting the further development of video processing technology.
A controllable generative video frame interpolation method based on a diffusion model comprises the following steps:
(1) Selecting a pre-trained image-to-video diffusion model;
(2) Inputting the first frame and the last frame provided by the user into the diffusion model;
(3) Introducing drag-based trajectory control into the diffusion model to generate a video frame interpolation result that conforms to the user's intent;
(4) When the user does not provide a drag trajectory, generating trajectories through feature point matching and interpolation and performing automatic point trajectory tracking, thereby improving the quality and consistency of the video frame interpolation result.
Further, the diffusion model in step (1) adopts the SVD (Stable Video Diffusion) model to generate the video frame interpolation result and comprises a variational autoencoder (Variational Auto-Encoder, VAE), a CLIP (Contrastive Language-Image Pre-training) image encoder, a 3D U-Net, and a cross-attention mechanism. The variational autoencoder extracts the latent-space features of a video frame, and the CLIP image encoder extracts its semantic features. The latent-space features are concatenated with the noisy latent variables and fed into the 3D U-Net; the cross-attention mechanism takes the semantic features as key and value inputs and the internal features of the 3D U-Net as query inputs, and its output is used to update the internal features of the 3D U-Net. After multiple rounds of iterative denoising, the output of the 3D U-Net yields the video frame interpolation result. The conventional SVD model takes a single frame as input, treats it as the first frame, and generates a video result by inference.
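For illustration only, the following is a minimal Python sketch of how these components could interact at inference time; the `vae`, `clip_enc`, `unet3d`, and `scheduler` objects and their call signatures are hypothetical stand-ins, not the patent's actual implementation.

```python
import torch


def svd_denoise(first_frame, vae, clip_enc, unet3d, scheduler, n_frames=16):
    """Minimal sketch of the SVD-style inference loop described above.

    `vae`, `clip_enc`, `unet3d`, and `scheduler` are hypothetical stand-ins for
    the VAE, the CLIP image encoder, the 3D U-Net, and a diffusion noise
    scheduler; their call signatures are assumptions.
    """
    # Latent-space condition: VAE features of the conditioning frame, repeated for all frames.
    z_cond = vae.encode(first_frame)                               # (B, C, h, w)
    z_cond = z_cond.unsqueeze(1).repeat(1, n_frames, 1, 1, 1)      # (B, T, C, h, w)
    # Semantic condition: CLIP image embedding, used as cross-attention keys/values.
    context = clip_enc(first_frame)                                # (B, L, d)

    latents = torch.randn_like(z_cond)                             # start from pure noise
    for t in scheduler.timesteps:
        unet_in = torch.cat([latents, z_cond], dim=2)              # channel-wise concatenation
        pred = unet3d(unet_in, t, context=context)                 # denoising prediction
        latents = scheduler.step(pred, t, latents)                 # one iterative denoising update

    # Decode every denoised latent frame back to pixel space.
    return torch.stack([vae.decode(latents[:, k]) for k in range(n_frames)], dim=1)
```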
Further, in step (2), when the first frame and the last frame are input to the diffusion model, their respective latent-space features and semantic features are generated by the variational autoencoder and the CLIP image encoder; the latent-space features of the first and last frames are concatenated with the noisy latent variables and then input into the 3D U-Net, and the semantic features of the first and last frames are concatenated and used as the key and value inputs of the cross-attention mechanism.
Further, in step (3), to facilitate user interaction, the user is allowed to control the video frame interpolation result by dragging: the trajectory of a key point during the drag is obtained and converted into a Gaussian heat map, the Gaussian heat map is input into an encoding module to obtain trajectory features of the key point, and these features are injected into the 3D U-Net of the diffusion model. By introducing trajectory control, video frame interpolation becomes more controllable and users can satisfy their own requirements through simple interaction; experimental results show that introducing trajectory control can further improve interpolation performance.
Before introducing drag-based trajectory control, the trajectory control conditions of the points are obtained as follows: some sampling points are randomly initialized around a fixed sparse grid in the first frame, and Co-Tracker is used to obtain the trajectories of these sampling points over the whole video; during training, trajectories that are invisible in more than half of the video frames are removed, and points whose trajectories exhibit large motion changes are sampled from the remaining trajectories with higher probability; after sampling, only a small number of trajectory points are kept and their coordinates are converted into a Gaussian heat map, which is then used as the input of an encoding module that replicates the encoder part of the 3D U-Net; finally, the features output by the encoding module are injected into the 3D U-Net of the diffusion model through zero convolutions.
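As an illustration, the following sketch shows one way the sampled trajectory coordinates could be rasterized into per-frame Gaussian heat maps before being fed to the encoding module; the spread parameter `sigma` and the tensor layout are assumptions, not values specified by the patent.

```python
import torch


def trajectories_to_heatmaps(tracks, height, width, sigma=3.0):
    """Convert point trajectories into per-frame Gaussian heat maps.

    tracks: (N, T, 2) tensor of (x, y) coordinates for the N kept control
            points (e.g. the 1 to 5 points retained during training) over T frames.
    Returns a (T, N, height, width) heat map tensor used as input of the
    trajectory encoding module. `sigma` is an assumed spread parameter.
    """
    n_points, n_frames, _ = tracks.shape
    ys = torch.arange(height).view(1, 1, height, 1).float()
    xs = torch.arange(width).view(1, 1, 1, width).float()
    heatmaps = torch.zeros(n_frames, n_points, height, width)
    for t in range(n_frames):
        cx = tracks[:, t, 0].view(1, n_points, 1, 1)
        cy = tracks[:, t, 1].view(1, n_points, 1, 1)
        # Gaussian bump centered at each point's coordinate on frame t.
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmaps[t] = g.squeeze(0)
    return heatmaps
```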
Further, in step (4), matching information of key points between the first frame and the last frame provided by the user is obtained through a feature point matching algorithm; any matched key point pair in the first and last frames is denoted as p0 and pn, the corresponding trajectory is estimated by interpolation, and all estimated trajectories are obtained by traversing the other matched key point pairs. This avoids frame skipping when the head and tail frames differ greatly, generates more coherent interpolation results, and improves the quality and consistency of video frame interpolation.
Further, the automatic point trajectory tracking in step (4) updates the point coordinates of the trajectories in the intermediate frames using the similarity between features in the 3D U-Net of the diffusion model. Specifically, the trajectories generated by feature point matching and interpolation are first converted into Gaussian heat maps and encoded to obtain trajectory features, which are injected into the 3D U-Net of the diffusion model; the features of the last upsampling module in the 3D U-Net are then bilinearly interpolated to obtain the image features of each frame; for any trajectory point k on an intermediate frame, the point set Ω whose distance to it is smaller than a threshold r1 is obtained by searching; from the set Ω, the nearest-neighbour algorithm is used to compute the point pk,0 whose features are closest to those of the first-frame key point p0 and the point pk,n whose features are closest to those of the tail-frame key point pn; if the distance between pk,0 and pk,n is smaller than a threshold r2, the coordinate of trajectory point k is updated to the midpoint of pk,0 and pk,n, and the updated trajectory is used as the input of the next denoising step. By further improving the accuracy of the point trajectory update, this method improves the accuracy and consistency of video frame interpolation.
A computer device comprises a memory and a processor, wherein the memory stores a computer program and the processor is configured to execute the computer program to implement the above diffusion-model-based controllable generative video frame interpolation method.
A computer readable storage medium stores a computer program which, when executed by a processor, implements the above diffusion-model-based controllable generative video frame interpolation method.
The controllable generative video frame interpolation method of the invention mainly comprises the following key technical points:
1. The invention is based on an image-to-video diffusion model and introduces a tail-frame control condition to realize video frame interpolation; by injecting the tail-frame condition into both the latent space and the semantic space, a user can generate multiple feasible interpolation schemes given the head and tail frames. This effectively exploits the prior knowledge of the pre-trained model and improves the quality and diversity of video frame interpolation.
2. The invention introduces a trajectory control scheme based on user dragging, allowing the user to achieve controllable interpolation through simple interaction; by randomly initializing sampling points in the first frame and using a video point trajectory tracking algorithm to obtain their trajectories over the whole video, the result of the video frame interpolation can be controlled.
3. When the user does not provide a trajectory, the invention obtains matching information of key points between the head and tail frames through a feature point matching algorithm and uses this information to obtain a temporally consistent interpolation result; this effectively alleviates frame skipping and improves the quality and consistency of video frame interpolation.
4. The invention uses the similarity between features in the 3D U-Net to update the point coordinates as a trajectory update, thereby ensuring the accuracy of the point trajectories; this effectively improves the accuracy and robustness of point trajectory tracking and the quality and reliability of video frame interpolation.
5. Trajectory consistency check: the accuracy of the updated point coordinates is ensured by checking the consistency of the update points obtained from two nearest-neighbour searches; this effectively improves the accuracy and reliability of the point trajectory update and the quality and reliability of video frame interpolation.
In addition, the method of the invention provides better controllability: motion control is a more intuitive control mode, and the user can control the video frame interpolation result through simple input, which greatly improves the interactivity and practicality of video generation.
Drawings
FIG. 1 is a schematic diagram of the controllable generative video frame interpolation process using a diffusion model according to the present invention.
FIG. 2 is a schematic diagram of automatic point trajectory tracking using the similarity between features in the 3D U-Net according to the present invention.
FIG. 3 shows video frame interpolation results of the method of the present invention applied to a novel view synthesis scene.
FIG. 4 shows video frame interpolation results of the method of the present invention applied to a cartoon in-betweening scene.
FIG. 5 shows video frame interpolation results of the method of the present invention applied to time-lapse photography and slow-motion video generation scenes.
FIG. 6 shows video frame interpolation results of the method of the present invention applied to an image morphing scene.
Detailed Description
In order to describe the present invention more specifically, the technical solution of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses a controllable generative video frame interpolation method based on a diffusion model, which aims to realize user-interactive video frame interpolation and covers the following key technologies. First, based on an image-to-video diffusion model, the invention introduces the model structure and how the tail-frame condition is added to realize video frame interpolation, allowing a user to generate multiple feasible interpolation schemes given the head and tail frames. Second, the invention introduces a trajectory control scheme based on user dragging, which allows the user to achieve controllable interpolation through simple interaction and further enhances the controllability of video frame interpolation. Finally, we find that when the head and tail frames differ greatly, the model tends to generate frame-skipping video results; therefore, the invention introduces explicit correlation modeling on key points in the video to obtain a temporally coherent interpolation result. Through these techniques, the invention realizes user-interactive video frame interpolation, improves its stability and controllability, and has broad application prospects. The specific implementation process is as follows:
(1) SVD is a widely used image-to-video diffusion model that can generate high-quality video results from a single input first-frame image. SVD follows the Latent Diffusion paradigm: the video $x$ is compressed into a lower-dimensional latent space by a VAE codec, which can be expressed as $z=\mathcal{E}(x)$. In the reverse process, SVD adopts a 3D U-Net as the denoiser and introduces the first-frame condition with two strategies: first, the VAE-encoded latent features of the first frame $z^{\mathrm{cond}}$ are concatenated along the channel dimension with the noisy latent variables of every video frame; second, semantic features of the first-frame image are extracted by the CLIP image encoder and injected into the model through the cross-attention mechanism. The 3D U-Net is trained with DSM (Denoising Score Matching), and the training objective can be expressed as
$\mathbb{E}_{z,c,\sigma,\epsilon\sim\mathcal{N}(0,I)}\big[\lambda(\sigma)\,\|D_\theta(z+\sigma\epsilon;\sigma,c)-z\|_2^2\big],$
where $D_\theta$ denotes the denoiser, $c$ the first-frame condition, $\sigma$ the noise level, and $\lambda(\sigma)$ a weighting function.
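As a concrete, hypothetical reading of the objective above, the sketch below computes one denoising-score-matching training step; the EDM-style weighting lambda(sigma) = 1/sigma**2 and the argument layout of `unet3d` are assumptions.

```python
import torch


def dsm_loss(unet3d, z0, cond_latent, context, sigma):
    """One denoising-score-matching step: perturb the clean latent z0 with noise
    of level sigma and regress the denoiser output back to z0.

    The weighting lambda(sigma) = 1 / sigma**2 is an assumed (EDM-style) choice;
    `unet3d` is a hypothetical denoiser D_theta(z; sigma, c).
    """
    noise = torch.randn_like(z0)
    z_noisy = z0 + sigma * noise                              # z + sigma * eps
    unet_in = torch.cat([z_noisy, cond_latent], dim=2)        # channel-wise frame condition
    denoised = unet3d(unet_in, sigma, context=context)        # D_theta(z + sigma*eps; sigma, c)
    weight = 1.0 / sigma ** 2                                 # lambda(sigma)
    return (weight * (denoised - z0) ** 2).mean()
```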
(2) Introducing the tail-frame condition and fine-tuning. Based on the image-to-video diffusion model, tail-frame control is additionally introduced to realize video frame interpolation, so the invention introduces the tail-frame control condition into SVD. To preserve the prior of the pre-trained SVD as much as possible, the tail-frame condition is injected into both the latent space and the semantic space. Specifically, the VAE-encoded latent features of the first frame $z_0^{\mathrm{cond}}$ are concatenated with the noisy latent variable of the first frame, the VAE-encoded latent features of the tail frame $z_n^{\mathrm{cond}}$ are concatenated with the noisy latent variable of the tail frame, and for the intermediate frames the noisy latent variables are simply concatenated with a learnable conditional token after broadcasting. In addition, the invention extracts the CLIP image embeddings of the first and last frames, concatenates them, and uses them as the keys and values in the cross-attention mechanism. The training objective after fine-tuning the model can be expressed as
$\mathbb{E}_{z,\sigma,\epsilon}\big[\lambda(\sigma)\,\|D_\theta(z+\sigma\epsilon;\sigma,c_0,c_n)-z\|_2^2\big],$
where $c_0$ and $c_n$ denote the head-frame and tail-frame conditions.
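The per-frame latent condition described above could be assembled as in the following sketch, where the first-frame latent conditions frame 0, the tail-frame latent conditions the last frame, and a broadcast learnable token fills the intermediate frames; the module and parameter names are placeholders, not the patent's implementation.

```python
import torch
import torch.nn as nn


class FrameCondition(nn.Module):
    """Sketch of the per-frame latent condition: first-frame VAE latent for frame 0,
    tail-frame VAE latent for frame T-1, and a broadcast learnable token for all
    intermediate frames. `latent_ch` is an assumed channel size."""

    def __init__(self, latent_ch=4):
        super().__init__()
        self.cond_token = nn.Parameter(torch.zeros(latent_ch, 1, 1))

    def forward(self, z_first, z_last, n_frames):
        # z_first, z_last: (B, C, h, w) VAE latents of the head and tail frames.
        b, c, h, w = z_first.shape
        middle = self.cond_token.expand(c, h, w).unsqueeze(0).expand(b, c, h, w)
        frames = [z_first] + [middle] * (n_frames - 2) + [z_last]
        # (B, T, C, h, w); concatenated channel-wise with the noisy latents downstream.
        return torch.stack(frames, dim=1)
```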
(3) Controllable video frame interpolation. Given the head and tail frames, the goal is to sample a video from the conditional distribution $p(x_{1:n-1}\mid x_0,x_n)$; multiple feasible interpolation results are possible, especially when the head and tail frames differ greatly. To facilitate user interaction, the invention adopts dragging (drags) as the control mode. To obtain the trajectory control conditions of points for training the model, we randomly initialize some sampling points around a fixed sparse grid in the first frame and use Co-Tracker to obtain the trajectories of these points over the whole video; in training we remove trajectories that are invisible in more than half of the video frames and sample, with higher probability, points whose trajectories exhibit large motion changes. In addition, considering that user-provided control points are usually sparse, we keep only 1 to 5 control points during training. After sampling the trajectory points, we convert the point coordinates into a Gaussian heat map, expressed as
$G_k(x,y)=\exp\!\Big(-\tfrac{(x-x_k)^2+(y-y_k)^2}{2\sigma^2}\Big),$
where $(x_k,y_k)$ is the point coordinate on frame $k$, and use it as the input of the control module. The invention adds trajectory control with a mechanism similar to ControlNet: specifically, a copy of the 3D U-Net encoder encodes the trajectory map and injects it into the U-Net decoder through zero convolutions, as shown in FIG. 1. The training objective of the model after introducing trajectory control can be expressed as
$\mathbb{E}_{z,\sigma,\epsilon}\big[\lambda(\sigma)\,\|D_\theta(z+\sigma\epsilon;\sigma,c_0,c_n,c_{\mathrm{traj}})-z\|_2^2\big],$
where $c_{\mathrm{traj}}$ denotes the trajectory control condition.
This loss function takes the trajectory control condition into account so that the generated interpolation result conforms to the user's intent; trajectory control makes video frame interpolation more controllable, and the user's requirements can be met through simple interaction. In addition, we found in experiments that introducing trajectory control can further improve the performance of video frame interpolation.
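The ControlNet-like injection can be sketched as follows: a copy of the U-Net encoder consumes the trajectory heat maps, and its multi-scale outputs pass through zero-initialized convolutions before being added to the decoder features. The class and argument names are hypothetical; the patent does not prescribe this exact interface.

```python
import torch
import torch.nn as nn


def zero_conv(channels):
    """1x1x1 convolution initialized to zero, so the control branch has no effect
    at the start of fine-tuning and its influence is learned gradually."""
    conv = nn.Conv3d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class TrajectoryControlBranch(nn.Module):
    """Hypothetical ControlNet-style branch: a copy of the 3D U-Net encoder
    consumes the trajectory heat maps, and its multi-scale features are passed
    through zero convolutions before being added to the decoder features of the
    main U-Net (the addition itself happens inside the main model)."""

    def __init__(self, encoder_copy, feature_channels):
        super().__init__()
        self.encoder = encoder_copy                       # duplicated 3D U-Net encoder
        self.zero_convs = nn.ModuleList(zero_conv(c) for c in feature_channels)

    def forward(self, heatmaps):
        # `encoder_copy` is assumed to return a list of (B, C_i, T, H_i, W_i) features.
        feats = self.encoder(heatmaps)
        return [zc(f) for zc, f in zip(self.zero_convs, feats)]
```

The zero initialization is the usual design choice for such branches: the fine-tuned model starts out identical to the pre-trained one, and the control signal is introduced gradually during training.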
(4) Explicit correlation modeling based on matching points. When the user does not provide a trajectory, the method already achieves a good interpolation effect if the difference between the head and tail frames is moderate or small; however, we find that when the difference is large, frame skipping sometimes occurs: the first half of the generated video is highly correlated with the first frame, the second half is highly correlated with the last frame, and in the middle frames the content abruptly changes from first-frame-related content to last-frame-related content.
Given the head and tail frames of the input video, the invention obtains matching information of key points between them through a feature point matching algorithm, denoted as $\{(p_0^i,p_n^i)\}_{i=1}^{M}$, where $M$ is the number of matching points and $p_0^i,p_n^i$ are the known points on the i-th trajectory (at initialization, a pair of matched key points on the head and tail frames). Although different key point matching algorithms are possible, this embodiment uses classical SIFT feature point matching because it is simple and we have found empirically that it works well.
Subsequently, we interpolate between the known trajectory points $(p_0^i,p_n^i)$ to obtain the i-th trajectory $\tau^i=\{p_k^i\}_{k=0}^{n}$; in this way all estimated trajectories $\{\tau^i\}_{i=1}^{M}$ can be obtained.
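A hypothetical sketch of this matching-and-interpolation step using OpenCV's SIFT implementation is given below; the Lowe ratio threshold and the linear interpolation between matched points are assumptions consistent with the description above.

```python
import cv2
import numpy as np


def match_and_interpolate(first_img, last_img, n_frames=16, ratio=0.75):
    """Match SIFT key points between the head and tail frames and linearly
    interpolate each matched pair into an initial trajectory of n_frames points.
    The Lowe ratio threshold of 0.75 is an assumed value."""
    to_gray = lambda im: cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) if im.ndim == 3 else im
    sift = cv2.SIFT_create()
    kp0, des0 = sift.detectAndCompute(to_gray(first_img), None)
    kpn, desn = sift.detectAndCompute(to_gray(last_img), None)

    pairs = cv2.BFMatcher().knnMatch(des0, desn, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]

    trajectories = []
    for m in good:
        p0 = np.array(kp0[m.queryIdx].pt)                      # matched point in the first frame
        pn = np.array(kpn[m.trainIdx].pt)                      # matched point in the last frame
        alphas = np.linspace(0.0, 1.0, n_frames)[:, None]
        trajectories.append((1 - alphas) * p0 + alphas * pn)   # (n_frames, 2) straight-line track
    return np.stack(trajectories) if trajectories else np.empty((0, n_frames, 2))
```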
(5) Automatic point trajectory tracking. Although the initial trajectories provide temporally consistent key point correspondences, the trajectories obtained by interpolating between the head and tail key frames are not necessarily accurate; for this reason, in each denoising step the invention updates the point coordinates as a trajectory update using the similarity between features in the 3D U-Net.
As shown in FIG. 2, we interpolate the features of the 3D U-Net to image resolution to obtain the image feature F; here we use the features of the penultimate upsampling module in the 3D U-Net because they have higher resolution and better discriminability. We use $F(p)$ to denote the feature at coordinate p on the image feature F, which is obtained by bilinear interpolation because the coordinate p is not necessarily an integer.
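Bilinear sampling of the feature map at a sub-pixel coordinate can be sketched with torch.nn.functional.grid_sample as follows; the normalization convention (align_corners=True) is an assumption.

```python
import torch
import torch.nn.functional as F


def sample_feature(feat, xy):
    """Bilinearly sample a per-frame feature map at a sub-pixel coordinate.

    feat: (C, H, W) feature map already interpolated to image resolution.
    xy:   (2,) tensor holding the (x, y) coordinate, not necessarily integer.
    Returns the (C,) feature vector at that coordinate.
    """
    _, h, w = feat.shape
    # grid_sample expects coordinates normalized to [-1, 1] (align_corners=True).
    gx = 2.0 * xy[0] / (w - 1) - 1.0
    gy = 2.0 * xy[1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy]).view(1, 1, 1, 2).to(feat.dtype)
    out = F.grid_sample(feat.unsqueeze(0), grid, mode="bilinear", align_corners=True)
    return out.view(-1)                                   # (C,)
```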
We search for the updated coordinate point around the interpolated intermediate-frame coordinate. We use $\Omega=\{q:\|q-p\|<r_1\}$ to denote the set of points whose distance to the point p is less than $r_1$. The nearest-neighbour algorithm is then used to obtain, on the k-th frame, the update point closest in feature space to the key point on the first frame:
$p_{k,0}=\arg\min_{q\in\Omega}\|F_k(q)-F_0(p_0)\|_2.$
(6) To further improve the accuracy of the point trajectory update, the coordinates of the matching point on the last frame are also used, so that the update point on the k-th frame closest in feature space to the key point on the tail frame can be obtained:
$p_{k,n}=\arg\min_{q\in\Omega}\|F_k(q)-F_n(p_n)\|_2.$
To ensure the accuracy of the updated point coordinates, we check the consistency of the update points obtained by the two nearest-neighbour searches: when the distance between them is smaller than $r_2$, i.e. $\|p_{k,0}-p_{k,n}\|<r_2$, we use the midpoint of these two points as the updated point coordinate, i.e. $p_k=(p_{k,0}+p_{k,n})/2$.
We then add this point to the known trajectory points and interpolate again to obtain the updated trajectory, which serves as the input to the next denoising step.
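Putting steps (5) and (6) together, the following sketch updates a single trajectory point on an intermediate frame using the feature-similarity nearest-neighbour search and the two-sided consistency check; the radii r1 and r2 and the rounding of candidate pixels to integer coordinates are assumptions for illustration.

```python
import torch


def update_track_point(p_k, feat_k, f0, fn, r1=8, r2=4):
    """Consistency-checked update of one trajectory point on intermediate frame k.

    p_k:    (2,) current interpolated (x, y) coordinate on frame k (float tensor).
    feat_k: (C, H, W) image-resolution feature map of frame k.
    f0, fn: (C,) features of the matched key points p_0 (first frame) and
            p_n (last frame). r1, r2 are the search and consistency radii
            (assumed values). Returns the updated coordinate, or p_k if the check fails.
    """
    c, h, w = feat_k.shape
    x0, y0 = int(round(float(p_k[0]))), int(round(float(p_k[1])))
    # Candidate set Omega: all pixels within distance r1 of p_k (clamped to the image).
    xs = torch.arange(max(0, x0 - r1), min(w, x0 + r1 + 1))
    ys = torch.arange(max(0, y0 - r1), min(h, y0 + r1 + 1))
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    cand = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1).float()   # (M, 2)
    cand = cand[((cand - p_k) ** 2).sum(dim=1).sqrt() < r1]

    feats = feat_k[:, cand[:, 1].long(), cand[:, 0].long()].T             # (M, C)
    p_k0 = cand[torch.argmin(((feats - f0) ** 2).sum(dim=1))]             # nearest to head feature
    p_kn = cand[torch.argmin(((feats - fn) ** 2).sum(dim=1))]             # nearest to tail feature

    # Accept the update only if the two nearest-neighbour results agree within r2.
    if torch.norm(p_k0 - p_kn) < r2:
        return (p_k0 + p_kn) / 2.0
    return p_k
```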
Based on the above technical scheme, the controllable generative video frame interpolation method of the invention has broad application value, including:
1. Novel view synthesis: from sparse view inputs, novel views of static or dynamic scenes can be generated; for example, pictures taken from different viewpoints can be used as the head and tail frames of a video, and video frame interpolation then generates the intermediate views, as shown in FIG. 3.
2. Cartoon in-betweening: an animation is obtained by interpolating between cartoon pictures, simplifying the animation production process; interpolation of both color cartoons and line-art cartoons is supported, as shown in FIG. 4.
3. Video editing: video frame interpolation enables video editing, including action editing and video in-painting, and provides new ideas; for example, the actions of people in a video can be modified through interpolation, and video in-painting can remove certain objects from a video.
4. Time-lapse photography: only a few images at key moments are needed, and the whole video can be obtained through interpolation to show a slow change process; for example, a time-lapse video showing ice melting, plants growing, or the moon waxing and waning can be generated, as shown in FIG. 5 (a).
5. Slow-motion video generation: slow-motion video can be generated by video frame interpolation, for example converting normal-speed video into slow motion and highlighting key frames and actions, as shown in FIG. 5 (b).
6. Interactive video: videos can be edited interactively through frame interpolation; for example, a user can control the playback speed, playback direction, and interpolation effect of a video with a mouse or touch screen, enabling a more flexible and interactive video editing experience.
7. Image morphing: a gradual transition from one image content to another can be generated by video frame interpolation, for example gradually changing a young photo of a person into an aged appearance; this technique can be applied to advertising, film special effects, and other fields to create unique visual effects, as shown in FIG. 6.
The embodiments described above are presented to facilitate the understanding and application of the present invention by those skilled in the art. It will be apparent to those skilled in the art that various modifications may be made to the above embodiments and that the general principles described herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above embodiments, and improvements and modifications made by those skilled in the art based on the present disclosure should fall within the protection scope of the present invention.

Claims (7)

1. A controllable generative video frame interpolation method based on a diffusion model, comprising the following steps:
(1) Selecting a pre-trained image-to-video diffusion model, wherein the diffusion model adopts an SVD (Stable Video Diffusion) model to generate the video frame interpolation result and comprises a variational autoencoder, a CLIP image encoder, a 3D U-Net, and a cross-attention mechanism, wherein the variational autoencoder is used for extracting latent-space features of a video frame, the CLIP image encoder is used for extracting semantic features of the video frame, the latent-space features are concatenated with noisy latent variables and then input into the 3D U-Net, the cross-attention mechanism takes the semantic features as key and value inputs and the internal features of the 3D U-Net as query inputs, the output of the cross-attention mechanism is used for updating the internal features of the 3D U-Net, and the output of the 3D U-Net after multiple rounds of iterative denoising yields the video frame interpolation result;
(2) Inputting the first frame and the last frame provided by the user into the diffusion model;
(3) Introducing drag-based trajectory control into the diffusion model to generate a video frame interpolation result that conforms to the user's intent;
(4) When the user does not provide a drag trajectory, generating trajectories through feature point matching and interpolation and automatically tracking the point trajectories.
2. The method according to claim 1, wherein in step (2), when the first frame and the last frame are input to the diffusion model, their respective latent-space features and semantic features are generated by the variational autoencoder and the CLIP image encoder; the latent-space features of the first and last frames are concatenated with the noisy latent variables and then input into the 3D U-Net, and the semantic features of the first and last frames are concatenated and used as the key and value inputs of the cross-attention mechanism.
3. The method according to claim 1, wherein in step (3), to facilitate user interaction, the user is allowed to control the video frame interpolation result by dragging: the trajectory of a key point during the drag is obtained and converted into a Gaussian heat map, the Gaussian heat map is input into an encoding module to obtain trajectory features of the key point, and these features are injected into the 3D U-Net of the diffusion model.
4. The method according to claim 1, wherein in step (4), matching information of key points between the first frame and the last frame provided by the user is obtained through a feature point matching algorithm; any matched key point pair in the first and last frames is denoted as p0 and pn, the corresponding trajectory is estimated by interpolation, and all estimated trajectories are obtained by traversing the other matched key point pairs.
5. The method according to claim 1, wherein the automatic point trajectory tracking in step (4) updates the point coordinates of the trajectories in the intermediate frames using the similarity between features in the 3D U-Net of the diffusion model; specifically, the trajectories generated by feature point matching and interpolation are first converted into Gaussian heat maps and encoded to obtain trajectory features, which are injected into the 3D U-Net of the diffusion model; then the features of the last upsampling module in the 3D U-Net are bilinearly interpolated to obtain the image features of each frame; for any trajectory point k on an intermediate frame, the set Ω of points whose distance to it is smaller than a threshold r1 is obtained by searching; from the set Ω, the nearest-neighbour algorithm is used to compute the point pk,0 whose features are closest to those of the first-frame key point p0 and the point pk,n whose features are closest to those of the tail-frame key point pn; if the distance between pk,0 and pk,n is smaller than a threshold r2, the coordinate of trajectory point k is updated to the midpoint of pk,0 and pk,n; the updated trajectory is then used as the input for the next denoising step.
6. A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to implement the diffusion-model-based controllable generative video frame interpolation method according to any one of claims 1 to 5.
7. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the diffusion-model-based controllable generative video frame interpolation method according to any one of claims 1 to 5.
CN202411255207.3A 2024-09-09 2024-09-09 Controllable generation type video frame inserting method based on diffusion model Active CN118784939B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411255207.3A CN118784939B (en) 2024-09-09 2024-09-09 Controllable generation type video frame inserting method based on diffusion model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411255207.3A CN118784939B (en) 2024-09-09 2024-09-09 Controllable generation type video frame inserting method based on diffusion model

Publications (2)

Publication Number Publication Date
CN118784939A (en) 2024-10-15
CN118784939B (en) 2024-12-20

Family

ID=92979194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411255207.3A Active CN118784939B (en) 2024-09-09 2024-09-09 Controllable generation type video frame inserting method based on diffusion model

Country Status (1)

Country Link
CN (1) CN118784939B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119421011B (en) * 2024-10-24 2025-08-22 智子引擎(北京)科技有限公司 A video generation method, device, equipment and medium based on diffusion model
CN119762632B (en) * 2024-10-25 2025-10-03 杭州电子科技大学 Diffusion model video generation method based on optical flow information
CN119693507B (en) * 2025-02-26 2025-04-18 深圳市灵图闪创科技有限公司 Animation coloring method and equipment based on video generation model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010062955A (en) * 2008-09-04 2010-03-18 Japan Science & Technology Agency System for converting video signal

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7108192B2 (en) * 1999-09-17 2006-09-19 Silverbrook Research Pty Ltd Rotationally symmetric tags
CN100458358C (en) * 2007-07-10 2009-02-04 浙江大学 Converse measuring method and device based on axial direction stereovision
US20230344962A1 (en) * 2021-03-31 2023-10-26 Meta Platforms, Inc. Video frame interpolation using three-dimensional space-time convolution
WO2024073092A1 (en) * 2022-09-29 2024-04-04 Meta Platforms Technologies, Llc Text to video generation
CN116524898A (en) * 2023-03-23 2023-08-01 中国科学院自动化研究所 Audio and video generation method, device, electronic device and storage medium
CN116962593A (en) * 2023-04-17 2023-10-27 腾讯科技(深圳)有限公司 Video frame inserting method, device, equipment and storage medium
CN117314733A (en) * 2023-08-14 2023-12-29 清华大学深圳国际研究生院 Video filling method, device, equipment and storage medium based on diffusion model
CN117319582A (en) * 2023-09-28 2023-12-29 上海数珩信息科技股份有限公司 Method and device for human action video acquisition and fluent synthesis
CN117793375A (en) * 2023-12-11 2024-03-29 同济大学 Video frame supplementing method based on image diffusion model
CN117729370A (en) * 2023-12-12 2024-03-19 南京邮电大学 A method and system for text generation video based on latent diffusion model
CN118509549A (en) * 2023-12-29 2024-08-16 中国科学院深圳先进技术研究院 Method for constructing a diffusion model for animation video interpolation and method for generating intermediate frames using the diffusion model
CN118505867A (en) * 2024-05-30 2024-08-16 中国科学院深圳先进技术研究院 Method, device and equipment for constructing enhanced time sequence consistency animation frame insertion diffusion model
CN118555461B (en) * 2024-07-29 2024-10-15 浙江天猫技术有限公司 Video generation method, device, equipment, system and computer program product

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010062955A (en) * 2008-09-04 2010-03-18 Japan Science & Technology Agency System for converting video signal

Also Published As

Publication number Publication date
CN118784939A (en) 2024-10-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant