CN117201837A - Video generation method, device, electronic equipment and storage medium

Info

Publication number: CN117201837A
Application number: CN202311385864.5A
Authority: CN
Inventors: 刘羽; 张毅; 袁伦喜
Original assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Current assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2023-10-24
Filing date: 2023-10-24
Publication date: 2023-12-08

Abstract

Embodiments of the present application disclose a video generation method, device, electronic device, and storage medium. The method includes: obtaining a target detection result corresponding to target multimedia data, where the target detection result indicates a target object included in the target multimedia data; determining a target template video according to the target detection result, where the target template video corresponds to a target style; determining a target video generation model according to the target template video, where the target video generation model is obtained by adjusting a pre-trained video generation model based on the target template video; and processing the target template video according to the target detection result through the target video generation model to generate a target video, where the video frames of the target video correspond to the target style and include the target object. Implementing the embodiments of the present application can improve the efficiency of generating style videos.

Description

Video generation method, device, electronic equipment and storage medium

Technical field

This application relates to the field of multimedia technology, and specifically to a video generation method, device, electronic device, and storage medium.

Background

With the spread of the Internet and social media, video has become a primary way for people to share information, stories, and experiences. People are keen to produce videos in various styles to meet different expressive needs, and these videos usually have distinct stylistic characteristics in their visual, auditory, and aesthetic aspects. Currently, users generally refer to ready-made video templates and produce videos in various styles by hand. However, manual video production has a high barrier to entry and is time-consuming. How to improve the efficiency of generating style videos has therefore become a technical problem in urgent need of a solution.

Summary of the invention

Embodiments of the present application disclose a video generation method, device, electronic device, and storage medium, which can improve both the video quality of generated style videos and the efficiency of generating them.

An embodiment of the present application discloses a video generation method, the method including:

obtaining a target detection result corresponding to target multimedia data, where the target detection result is used to indicate a target object included in the target multimedia data;

determining a target template video according to the target detection result, where the target template video corresponds to a target style;

determining a target video generation model according to the target template video, where the target video generation model is obtained by adjusting a pre-trained video generation model according to the target template video;

processing the target template video according to the target detection result through the target video generation model to generate a target video, where the video frames of the target video correspond to the target style and include the target object.

An embodiment of the present application discloses a video generation method applied to a terminal device, the method including:

determining, in response to a selection operation, the selected target multimedia data;

performing target detection on the target multimedia data through a target detection model to obtain a target detection result, where the target detection result is used to indicate a target object included in the target multimedia data;

obtaining a target video generated based on the target detection result, where the video frames of the target video correspond to a target style and include the target object;

where the target video is generated by a target video generation model processing a target template video according to the target detection result; the target template video corresponds to the target style and is determined according to the target detection result; and the target video generation model is obtained by adjusting a pre-trained video generation model according to the target template video.

An embodiment of the present application discloses a video generation device, the device including:

an acquisition module, configured to obtain a target detection result corresponding to target multimedia data, where the target detection result is used to indicate a target object included in the target multimedia data;

a template determination module, configured to determine a target template video according to the target detection result, where the target template video corresponds to a target style;

a model determination module, configured to determine a target video generation model according to the target template video, where the target video generation model is obtained by adjusting a pre-trained video generation model according to the target template video;

a video generation module, configured to process the target template video according to the target detection result through the target video generation model to generate a target video, where the video frames of the target video correspond to the target style and include the target object.

An embodiment of the present application discloses a video generation device applied to a terminal device, the device including:

a response module, configured to determine, in response to a selection operation, the selected target multimedia data;

a detection module, configured to perform target detection on the target multimedia data through a target detection model to obtain a target detection result, where the target detection result is used to indicate a target object included in the target multimedia data;

an acquisition module, configured to obtain a target video generated based on the target detection result, where the video frames of the target video correspond to a target style and include the target object; where the target video is generated by a target video generation model processing a target template video according to the target detection result; the target template video corresponds to the target style and is determined according to the target detection result; and the target video generation model is obtained by adjusting a pre-trained video generation model according to the target template video.

An embodiment of the present application discloses an electronic device including a memory and a processor. The memory stores a computer program which, when executed by the processor, causes the processor to implement the method of any embodiment disclosed in the present application.

An embodiment of the present application discloses a computer-readable storage medium storing a computer program, where the computer program causes a computer to execute the method of any embodiment disclosed in the present application.

Compared with the related art, the video generation method, device, electronic device, and storage medium disclosed in the embodiments of the present application have the following beneficial effects:

A target detection result corresponding to target multimedia data is obtained, the target detection result being used to indicate a target object included in the target multimedia data; a target template video corresponding to a target style is determined according to the target detection result; a target video generation model is determined according to the target template video, the target video generation model being obtained by adjusting a pre-trained video generation model according to the target template video; and the target template video is processed according to the target detection result through the target video generation model to generate a target video whose video frames correspond to the target style and include the target object. In the embodiments of the present application, the corresponding target template video can be matched based on the target detection result of the target multimedia data, and the target video generation model corresponding to that template video then processes it according to the target detection result to generate the target video. Because the target video is generated by a model that has been adjusted with the target template video, the generated style video is more targeted, which can improve both the video quality of the generated style video and the efficiency of generating it.

Description of the drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic diagram of an application scenario of a video generation method in one embodiment;

Figure 2 is a schematic flowchart of a video generation method in one embodiment;

Figure 3 is a schematic flowchart of a video generation method in another embodiment;

Figure 4 is a schematic flowchart of a video generation method in another embodiment;

Figure 5 is a sequence diagram of a video generation method in one embodiment;

Figure 6 is a sequence diagram of a video generation method in another embodiment;

Figure 7 is a sequence diagram of a video generation method in another embodiment;

Figure 8 is a schematic structural diagram of a video generation device in one embodiment;

Figure 9 is a schematic structural diagram of a video generation device in another embodiment;

Figure 10 is a schematic structural diagram of an electronic device in one embodiment.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.

It should be noted that the terms "including" and "having", and any variations thereof, in the embodiments and drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or other steps or units inherent to such a process, method, product, or device.

Embodiments of the present application disclose a video generation method, device, electronic device, and storage medium, which can improve the efficiency of generating style videos. Each is described in detail below.

Please refer to Figure 1, which is a schematic diagram of an application scenario of a video generation method in one embodiment. As shown in Figure 1, the server 10 can be communicatively connected with one or more terminal devices 20.

The server 10 may be a computer that provides resource services such as computing, storage, and networking. Optionally, the server 10 may be a cloud server. In some embodiments, the terminal device 20 may communicate with the cloud server through the Hypertext Transfer Protocol (HTTP) or Hypertext Transfer Protocol Secure (HTTPS), or through a Representational State Transfer application programming interface (REST API). The terminal device 20 can thus communicate with the cloud server through a web browser or an application to obtain data from the cloud server or upload data to it.

The terminal device 20 may include, but is not limited to, electronic devices such as mobile phones, wearable devices, tablet computers, and vehicle-mounted terminals, which is not limited in the embodiments of this application.

In some embodiments, the terminal device 20 may determine, in response to a selection operation, the selected target multimedia data; perform target detection on the target multimedia data through a target detection model to obtain a target detection result, where the target detection result is used to indicate a target object included in the target multimedia data; and obtain a target video generated based on the target detection result, where the video frames of the target video correspond to a target style and include the target object. The target video is generated by a target video generation model processing a target template video according to the target detection result; the target template video corresponds to the target style and is determined according to the target detection result; and the target video generation model is obtained by adjusting a pre-trained video generation model according to the target template video.

It should be noted that the method in the embodiments of the present application can be applied to an electronic device, which may be a server or a terminal device, and is not specifically limited.

In some embodiments, the terminal device 20 may send the target detection result to the server 10, so that the server 10 processes the target detection result and generates the target video. The server 10 receives the target detection result sent by the terminal device 20 and determines the target template video according to it, the target template video corresponding to the target style. The server 10 then determines the target video generation model according to the target template video, the target video generation model being obtained by adjusting a pre-trained video generation model according to the target template video. The server 10 processes the target template video according to the target detection result through the target video generation model to generate the target video, whose video frames correspond to the target style and include the target object. The server 10 may send the target video to the terminal device 20, and the terminal device 20 may display the received target video.
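As a rough illustration of the terminal-to-server exchange described above, the following Python sketch builds an HTTPS POST request that carries a target detection result as JSON. The endpoint URL, the JSON field names, and the helper name `build_detection_request` are hypothetical; the application does not specify a wire format. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; the application does not name one.
GENERATE_URL = "https://example.com/api/v1/videos"


def build_detection_request(category: str, box: tuple) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a detection result."""
    payload = {
        "category": category,  # object category, e.g. "cat"
        "box": list(box),      # assumed (x, y, width, height) layout
    }
    return urllib.request.Request(
        GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_detection_request("cat", (40, 32, 128, 96))
```

A terminal device could pass such a request to `urllib.request.urlopen` and read the generated video, or a reference to it, from the response; how the server returns the video is likewise left open by the text.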

In other embodiments, the terminal device 20 can process the target detection result and generate the target video directly. After obtaining the target detection result corresponding to the target multimedia data, the terminal device 20 determines the target template video according to the target detection result, the target template video corresponding to the target style. The terminal device 20 then determines the target video generation model according to the target template video, the target video generation model being obtained by adjusting a pre-trained video generation model according to the target template video. The terminal device 20 processes the target template video according to the target detection result through the target video generation model to generate the target video, whose video frames correspond to the target style and include the target object. The terminal device 20 can display the generated target video.

Please further refer to Figure 2, which is a schematic flowchart of a video generation method in one embodiment. The video generation method can be applied to an electronic device, which may be the above-mentioned terminal device or the above-mentioned server. As shown in Figure 2, the video generation method may include the following steps:

201. Obtain a target detection result corresponding to target multimedia data, where the target detection result is used to indicate a target object included in the target multimedia data.

In some embodiments, the terminal device may determine, in response to a selection operation, the selected target multimedia data, and perform target detection on the target multimedia data through a target detection model to obtain the target detection result. The terminal device may send the target detection result to the server, so that the server processes the target detection result and generates the target video; alternatively, the terminal device may process the target detection result and generate the target video directly.

Optionally, the terminal device 20 may display one or more items of multimedia data. The selection operation may be a user operation that triggers the virtual control corresponding to an item of multimedia data, such as clicking or double-clicking that control, thereby determining the selected target multimedia data.

In some embodiments, the target detection model may include, but is not limited to, Region-based Convolutional Neural Network (R-CNN) models, Single Shot Detector (SSD) models, YOLO (You Only Look Once) models, and the like. The terminal device 20 performs target detection on the target multimedia data through the target detection model, identifies the target object in the target multimedia data, and generates a bounding box for it. The bounding box indicates the position of the target object in the target multimedia data. A category label can also be generated for the target object, indicating which object category it belongs to. Object categories may include people, vehicles, animals, and so on; the animal category may be further divided into dogs, cats, rabbits, and so on.

The target detection result indicates the target object included in the target multimedia data. Optionally, the target detection result may refer to the content corresponding to the target object in the target multimedia data, or to the content inside the bounding box.

In some embodiments, the target detection result may also include, but is not limited to, the object category of the target object and the position of the target object in the target multimedia data. Alternatively, the target detection result may include an object image corresponding to the target object, that is, the image region of the target object cropped from the target multimedia data.
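One way to picture such a target detection result is as a small record holding the category label, the bounding box, and optionally the cropped object image. The sketch below is illustrative only: the class and field names (`BoundingBox`, `DetectionResult`, `object_image`) and the nested-list image representation are assumptions, not structures defined by the application.

```python
from dataclasses import dataclass
from typing import List, Optional

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass(frozen=True)
class BoundingBox:
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


@dataclass
class DetectionResult:
    category: str                         # e.g. "cat", "dog"
    box: BoundingBox                      # position of the object
    object_image: Optional[Image] = None  # cropped region, if included


def crop(image: Image, box: BoundingBox) -> Image:
    """Return the pixels inside `box`, clipped to the image bounds."""
    top, left = max(box.y, 0), max(box.x, 0)
    bottom = max(box.y + box.height, 0)
    right = max(box.x + box.width, 0)
    return [row[left:right] for row in image[top:bottom]]


# A 3x4 toy image whose pixel value encodes its (row, column) position.
image = [[r * 10 + c for c in range(4)] for r in range(3)]
box = BoundingBox(x=1, y=1, width=2, height=2)
result = DetectionResult("cat", box, object_image=crop(image, box))
```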

The target multimedia data may include, but is not limited to, image data and video data. For example, if the target multimedia data selected by the user is a pet image, the terminal device can perform target detection on the pet image to obtain a target detection result that indicates the pet included in the target multimedia data. If the target multimedia data selected by the user is a pet video, the terminal device can perform frame-by-frame target detection on the pet video to obtain a target detection result for each video frame. Because the pet in the video moves and its angle changes, the target detection results for the individual video frames can capture the pet from multiple angles, such as the front, back, and side. Frame-by-frame target detection on video data therefore still detects and classifies the target object, and it also enriches the available angle information about the target object to some extent, so that the target video generation model can use the angle information contained in the target detection results to generate a style video that is more accurate and closer to the real target object.
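The frame-by-frame case can be sketched as follows. The detector is passed in as a callable because the application does not fix a particular model; the stub detector, the dictionary-based frame representation, and the `view` field used to stand in for angle information are all assumptions made for this illustration.

```python
from typing import Callable, Dict, List, Sequence

Frame = Dict[str, str]      # stand-in for a decoded video frame
Detection = Dict[str, str]  # e.g. {"category": "dog", "view": "front"}
Detector = Callable[[Frame], List[Detection]]


def detect_per_frame(frames: Sequence[Frame], detector: Detector) -> List[List[Detection]]:
    """Run the detector on every frame and keep the per-frame results."""
    return [detector(frame) for frame in frames]


def collect_views(per_frame: List[List[Detection]], category: str) -> List[str]:
    """List the distinct views of one category, in order of first appearance."""
    views: List[str] = []
    for detections in per_frame:
        for det in detections:
            if det["category"] == category and det["view"] not in views:
                views.append(det["view"])
    return views


def stub_detector(frame: Frame) -> List[Detection]:
    """Stub: each toy frame already carries its own label."""
    return [dict(frame)]


frames = [
    {"category": "dog", "view": "front"},
    {"category": "dog", "view": "side"},
    {"category": "dog", "view": "front"},
]
per_frame = detect_per_frame(frames, stub_detector)
views = collect_views(per_frame, "dog")
```

Collecting the distinct views mirrors the point made above: a video yields several angles of the same object, whereas a single image yields one.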

202. Determine a target template video according to the target detection result, where the target template video corresponds to a target style.

The target style may refer to the video style of the target template video. Optionally, the target style may include, but is not limited to, the video's color tone style, filter effect style, animation effect style, and action style, which is not specifically limited.

In some embodiments, the target detection result includes the object category of the target object, and the electronic device determines the target template video corresponding to that object category. The electronic device may store in advance the correspondence between multiple object categories and multiple template videos, together with the template videos themselves, and select the target template video from the stored template videos according to the object category of the target object. For example, if the object category is cat, the corresponding template video may be a cat dancing video whose style is a dancing action; if the object category is dog, the corresponding template video may be a dog singing video whose style is a singing action.
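The stored correspondence between object categories and template videos amounts to a lookup table. A minimal sketch, using the cat and dog examples from the text; the identifiers (`TEMPLATES`, `video_id`, `select_template`) are hypothetical, and returning `None` for an unmatched category is an assumption, since the application does not say what happens in that case.

```python
from typing import Dict, Optional

# Hypothetical catalogue: object category -> template video and its style.
TEMPLATES: Dict[str, Dict[str, str]] = {
    "cat": {"video_id": "cat_dancing", "style": "dancing"},
    "dog": {"video_id": "dog_singing", "style": "singing"},
}


def select_template(category: str) -> Optional[Dict[str, str]]:
    """Return the template entry for a category, or None if none is stored."""
    return TEMPLATES.get(category)


cat_template = select_template("cat")
```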

203. Determine a target video generation model according to the target template video, where the target video generation model is obtained by adjusting a pre-trained video generation model according to the target template video.

The target video generation model is a trained video generation model obtained by adjusting a pre-trained video generation model according to the target template video.

In some embodiments, the video generation model is trained on large video datasets in the pre-training stage, after which it is able to generate videos. To make the generated style videos more targeted and of higher quality, multiple different template videos can be used to adjust the parameters of the pre-trained video generation model, producing multiple different video generation models. Each of these models is adjusted with its own template video and can generate videos in the same style as that template video.

The target template video is any one of the multiple template videos, and the target video generation model is the video generation model obtained by adjusting the pre-trained video generation model with the target template video. The adjusted target video generation model can generate videos in the same style as the target template video.

The electronic device may store in advance the correspondence between multiple template videos and multiple video generation models. For example, video generation model A is obtained by adjustment with template video A, and video generation model B is obtained by adjustment with template video B. After the target template video is determined, the target video generation model corresponding to it can therefore be determined from this correspondence.
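The template-to-model correspondence can be pictured as a registry keyed by template video. In the sketch below a "model" is just a callable that returns a text description, standing in for a real fine-tuned network; `make_model`, `MODEL_FOR_TEMPLATE`, and the error raised for an unregistered template are illustrative assumptions.

```python
from typing import Callable, Dict

# Toy stand-in: a model maps (template ID, object category) to a description.
VideoModel = Callable[[str, str], str]


def make_model(name: str) -> VideoModel:
    """Create a placeholder for a model fine-tuned on one template video."""
    def model(template_id: str, category: str) -> str:
        return f"{name}: {category} rendered in the style of {template_id}"
    return model


# Model A was adjusted with template A, model B with template B.
MODEL_FOR_TEMPLATE: Dict[str, VideoModel] = {
    "template_a": make_model("model_a"),
    "template_b": make_model("model_b"),
}


def select_model(template_id: str) -> VideoModel:
    """Look up the model that was adjusted with the given template video."""
    try:
        return MODEL_FOR_TEMPLATE[template_id]
    except KeyError:
        raise LookupError(f"no model registered for {template_id!r}") from None


output = select_model("template_a")("template_a", "cat")
```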

In some embodiments, the electronic device uses the multiple video frames contained in the target template video to adjust the parameters of the pre-trained video generation model. Specifically, it computes the training loss between the predicted video frames generated by the pre-trained video generation model and the video frames contained in the target template video, and adjusts the parameters of the pre-trained video generation model according to that loss to obtain the target video generation model.
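The adjustment loop can be illustrated with a deliberately tiny stand-in. Here the "video generation model" is a per-pixel affine map with two parameters, the training loss is the mean squared error against the template frames, and the update is plain gradient descent. None of these choices comes from the application, which names neither the loss function, the optimizer, nor the model inputs; they are assumptions chosen so that the example runs with the standard library alone.

```python
from typing import List, Tuple

Frame = List[float]  # toy stand-in for a flattened video frame


def predict(weight: float, bias: float, frame: Frame) -> Frame:
    """Toy 'video generation model': a per-pixel affine map."""
    return [weight * p + bias for p in frame]


def training_loss(weight: float, bias: float,
                  inputs: List[Frame], template_frames: List[Frame]) -> float:
    """Mean squared error between predicted frames and template frames."""
    total, count = 0.0, 0
    for x_frame, t_frame in zip(inputs, template_frames):
        for p, t in zip(predict(weight, bias, x_frame), t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def fine_tune(weight: float, bias: float, inputs: List[Frame],
              template_frames: List[Frame], lr: float = 0.1,
              steps: int = 200) -> Tuple[float, float]:
    """Adjust the parameters by gradient descent on the training loss."""
    count = sum(len(f) for f in template_frames)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x_frame, t_frame in zip(inputs, template_frames):
            for x, t in zip(x_frame, t_frame):
                err = weight * x + bias - t
                grad_w += 2 * err * x / count
                grad_b += 2 * err / count
        weight -= lr * grad_w
        bias -= lr * grad_b
    return weight, bias


# Toy data: the template frames are a dimmed, offset copy of the inputs.
inputs = [[0.0, 0.5, 1.0], [0.2, 0.8, 0.4]]
template_frames = [[0.5 * x + 0.1 for x in frame] for frame in inputs]

pretrained = (1.0, 0.0)  # parameters before adjustment
loss_before = training_loss(*pretrained, inputs, template_frames)
tuned = fine_tune(*pretrained, inputs, template_frames)
loss_after = training_loss(*tuned, inputs, template_frames)
```

In a real system the affine map would be replaced by the pre-trained network and the hand-written gradients by automatic differentiation, but the structure (predict, compare with template frames, update parameters) is the one the paragraph describes.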

在一些实施例中，目标视频生成模型可以是人工智能模型，可包括但不限于生成对抗网络(Generative Adversarial Networks，GAN)模型、变分自编码器(VariationalAuto-Encoder，VAE)模型、扩散模型(Diffusion Models，DM)等。In some embodiments, the target video generation model may be an artificial intelligence model, which may include but is not limited to a Generative Adversarial Networks (GAN) model, a Variational Auto-Encoder (VAE) model, a diffusion model ( Diffusion Models, DM), etc.

通过上述实施例，利用模板视频对预训练的视频生成模型进行调整，得到模板视频对应的视频生成模型，使得生成的风格视频更具有针对性，能够提高生成的风格视频的视频质量，并提高生成风格视频的效率。Through the above embodiments, the template video is used to adjust the pre-trained video generation model to obtain a video generation model corresponding to the template video, so that the generated style video is more targeted, the video quality of the generated style video can be improved, and the generation of the style video can be improved. Style video efficiency.

204、通过目标视频生成模型根据目标检测结果对目标模板视频进行处理，以生成目标视频。204. Use the target video generation model to process the target template video according to the target detection result to generate the target video.

电子设备将目标检测结果及目标模板视频作为目标视频生成模型的输入，通过目标视频生成模型根据目标检测结果对目标模板视频进行处理，输出对应的目标视频，目标视频的视频帧对应目标风格，且目标视频的视频帧中包括目标对象。The electronic device takes the target detection result and the target template video as inputs to the target video generation model, which processes the target template video according to the target detection result and outputs the corresponding target video. The video frames of the target video correspond to the target style and include the target object.

在一些实施例中，目标风格可包括目标色调，目标视频的视频帧的色调与目标色调一致，且目标视频的视频帧中包括目标对象。In some embodiments, the target style may include a target tone, the tone of the video frame of the target video is consistent with the target tone, and the video frame of the target video includes the target object.

在一些实施例中，目标风格可包括目标动作，目标视频的视频帧中包括执行目标动作的目标对象。比如，目标检测结果指示目标多媒体数据中包括的目标对象为狗，目标动作为跳舞，则通过目标视频生成模型输出的目标视频的视频帧中会包含跳舞的狗。In some embodiments, the target style may include a target action, and the video frames of the target video include the target object performing the target action. For example, if the target detection result indicates that the target object included in the target multimedia data is a dog and the target action is dancing, then the video frames of the target video output by the target video generation model will contain a dancing dog.

在一些实施例中，电子设备可对目标模板视频进行特征提取，得到目标模板视频对应的视频特征数据；将视频特征数据输入至控制模型，通过控制模型根据视频特征数据生成视频生成控制信息，并将视频生成控制信息输入至目标视频生成模型；通过目标视频生成模型根据目标检测结果及视频生成控制信息，对目标模板视频进行处理，以生成目标视频。In some embodiments, the electronic device can perform feature extraction on the target template video to obtain the video feature data corresponding to the target template video; input the video feature data into the control model, which generates video generation control information from the video feature data; and input the video generation control information into the target video generation model. The target video generation model then processes the target template video according to the target detection result and the video generation control information to generate the target video.

本申请实施例中，可基于目标多媒体数据对应的目标检测结果，匹配对应的目标模板视频，从而通过目标模板视频对应的目标视频生成模型根据目标检测结果对目标模板视频进行处理，生成目标视频，通过利用目标模板视频调整过的目标视频生成模型生成目标视频，使得生成的风格视频更具有针对性，能够提高生成的风格视频的视频质量，并提高生成风格视频的效率。In the embodiment of the present application, the corresponding target template video can be matched based on the target detection result corresponding to the target multimedia data, so that the target template video is processed according to the target detection result through the target video generation model corresponding to the target template video, and the target video is generated. By using the target video generation model adjusted by the target template video to generate the target video, the generated style video is more targeted, the video quality of the generated style video can be improved, and the efficiency of generating the style video can be improved.

请进一步参阅图3，图3是另一个实施例中视频生成方法的流程示意图，如图3所示，该视频生成方法可以包括以下步骤：Please further refer to Figure 3. Figure 3 is a schematic flow chart of a video generation method in another embodiment. As shown in Figure 3, the video generation method may include the following steps:

301、获取目标多媒体数据对应的目标检测结果，目标检测结果用于指示目标多媒体数据中包括的目标对象。301. Obtain the target detection result corresponding to the target multimedia data. The target detection result is used to indicate the target object included in the target multimedia data.

302、根据目标检测结果，确定目标模板视频，目标模板视频对应目标风格。302. According to the target detection results, determine the target template video, and the target template video corresponds to the target style.

303、根据目标模板视频，确定目标视频生成模型，目标视频生成模型是根据目标模板视频对预训练的视频生成模型进行调整得到的。303. Determine the target video generation model based on the target template video. The target video generation model is obtained by adjusting the pre-trained video generation model based on the target template video.

304、对目标模板视频进行特征提取，得到目标模板视频对应的视频特征数据。304. Extract features from the target template video to obtain video feature data corresponding to the target template video.

在一些实施例中，电子设备可对目标模板视频包含的各个视频帧进行特征提取，得到各个视频帧对应的视频特征数据。特征提取的方式可包括但不限于通过卷积神经网络(CNN)、循环神经网络(RNN)等深度学习模型进行特征提取的方式。在一些实施例中，各个视频帧对应的视频特征数据可包括但不限于各个视频帧的边缘信息、深度信息、语义分割信息等，从而可提取得到各个视频帧中的关键信息，减少冗余信息的干扰。In some embodiments, the electronic device can perform feature extraction on each video frame included in the target template video to obtain the video feature data corresponding to each video frame. Feature extraction methods may include, but are not limited to, extraction through deep learning models such as convolutional neural networks (CNN) and recurrent neural networks (RNN). In some embodiments, the video feature data corresponding to each video frame may include, but is not limited to, the edge information, depth information, and semantic segmentation information of that frame, so that the key information in each video frame is extracted and interference from redundant information is reduced.

控制模型可以是控制目标视频生成模型生成视频的神经网络模型。控制模型可引入各种自定义条件，从而控制目标视频生成模型生成符合自定义条件的视频。具体地，控制模型可根据视频特征数据生成视频生成控制信息。The control model may be a neural network model that controls the video generated by the target video generation model. The control model can introduce various custom conditions to control the target video generation model to generate videos that meet the custom conditions. Specifically, the control model may generate video generation control information according to the video feature data.

视频生成控制信息是用于控制目标视频生成模型生成目标风格的控制信息，视频生成控制信息可包括目标风格对应的关键信息。比如，若目标风格包括目标动作，视频生成控制信息可包括目标动作对应的姿态信息；比如，若目标风格包括目标色调，视频生成控制信息可包括目标色调对应的色调信息。The video generation control information is control information used to make the target video generation model generate the target style, and it may include key information corresponding to the target style. For example, if the target style includes a target action, the video generation control information may include posture information corresponding to the target action; if the target style includes a target tone, the video generation control information may include tone information corresponding to the target tone.

在一些实施例中，控制模型可包括但不限于边缘提取模块、深度估计模块、姿态识别模块等，控制模型可通过这些模块从视频特征数据中提取得到目标风格对应的关键信息，并根据目标风格对应的关键信息生成视频生成控制信息，从而目标视频生成模型可以根据目标检测结果及视频生成控制信息，对目标模板视频进行处理，生成目标视频，且目标视频的视频帧对应目标风格，且目标视频的视频帧中包括目标对象。In some embodiments, the control model may include, but is not limited to, an edge extraction module, a depth estimation module, and a posture recognition module. The control model can use these modules to extract the key information corresponding to the target style from the video feature data and generate the video generation control information from that key information. The target video generation model can then process the target template video according to the target detection result and the video generation control information to generate the target video, whose video frames correspond to the target style and include the target object.

示例性的，控制模型包括姿态识别模块，姿态识别模块可通过姿态识别算法从各个视频帧对应的视频特征数据中识别得到目标模板视频中各个视频帧包含的目标动作对应的姿态信息，姿态估计算法可包括但不限于OpenPose算法、DeepCut算法等。控制模型根据目标模板视频中各个视频帧对应的姿态信息生成视频生成控制信息，从而控制目标视频生成模型根据目标检测结果及视频生成控制信息，对目标模板视频进行处理，以生成视频帧中包含执行目标动作的目标对象的目标视频。Exemplarily, the control model includes a posture recognition module, which can use a posture recognition algorithm to identify, from the video feature data corresponding to each video frame, the posture information corresponding to the target action contained in each video frame of the target template video. The posture estimation algorithm may include, but is not limited to, the OpenPose algorithm and the DeepCut algorithm. The control model generates the video generation control information from the posture information corresponding to each video frame of the target template video, so that the target video generation model processes the target template video according to the target detection result and the video generation control information and generates a target video whose video frames contain the target object performing the target action.

可见，通过控制模型根据视频特征数据生成视频生成控制信息，并将视频生成控制信息输入至目标视频生成模型，以使目标视频生成模型能够根据视频生成控制信息生成符合目标风格的目标视频，有助于提高生成视频的可控性以及准确性。It can be seen that having the control model generate video generation control information from the video feature data and inputting that information into the target video generation model enables the target video generation model to generate a target video that conforms to the target style, which helps improve the controllability and accuracy of the generated video.

305、将视频特征数据输入至控制模型，通过控制模型根据视频特征数据生成视频生成控制信息，并将视频生成控制信息输入至目标视频生成模型。305. Input the video feature data into the control model, generate video generation control information according to the video feature data through the control model, and input the video generation control information into the target video generation model.

步骤301～步骤305的实施方式可参考上述实施例，具体不作赘述。For the implementation of steps 301 to 305, reference may be made to the above embodiments, and details will not be described again.

306、获取与目标检测结果及目标模板视频对应的文字描述信息。306. Obtain text description information corresponding to the target detection result and the target template video.

目标模板视频对应的文字描述信息可以指目标模板视频对应的目标风格的文字描述信息，该文字描述信息可用于描述目标对象及目标风格。示例性的，若目标风格是目标动作，对应的文字描述信息可以是描述目标模板视频中目标对象的动作流程，比如“一只狗戴着耳机，摇晃着脑袋在听歌”；若目标风格是目标色调，对应的文字描述信息可以是目标模板视频中的色调风格，比如“明亮鲜艳”。The text description information corresponding to the target template video may refer to text describing the target style of the target template video, and it can be used to describe the target object and the target style. For example, if the target style is a target action, the corresponding text description information can describe the action sequence of the target object in the target template video, such as "a dog wearing headphones, shaking its head while listening to music"; if the target style is a target tone, the corresponding text description information can be the tone style of the target template video, such as "bright and vivid".

在一些实施例中，文字描述信息可以是提示词(prompt)，prompt是用于指引人工智能模型完成任务的文本输入信息，prompt可用于为人工智能模型提示输入的上下文以及输入的参数信息，帮助人工智能模型更好地理解输入的意图并作出相应的响应，帮助人工智能模型更好地理解和完成任务。In some embodiments, the text description information may be a prompt. A prompt is text input used to guide an artificial intelligence model in completing a task. It can supply the model with the context and parameter information of the input, helping the model better understand the intent of the input, respond accordingly, and complete the task.

作为一种可选的实施方式，文字描述信息可以是提示词(prompt)模板，prompt模板可包括关键词，比如，“一只xxx戴着耳机，摇晃着脑袋在听歌”，其中的“xxx”为关键词，在根据目标检测结果确定目标对象的对象类别之后，可将提示词(prompt)模板中的关键词“xxx”替换成具体的对象类别，比如，若对象类别是狗，则将提示词(prompt)模板替换成“一只狗戴着耳机，摇晃着脑袋在听歌”。As an optional implementation, the text description information may be a prompt template that includes a keyword, for example, "a xxx wearing headphones, shaking its head while listening to music", where "xxx" is the keyword. After the object category of the target object is determined from the target detection result, the keyword "xxx" in the prompt template can be replaced with the specific object category. For example, if the object category is dog, the prompt template becomes "a dog wearing headphones, shaking its head while listening to music".
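The keyword replacement described above amounts to a plain string substitution. The sketch below is illustrative only; the constant `PROMPT_TEMPLATE`, the function `fill_prompt`, and the English wording of the template are assumptions, with "xxx" kept as the placeholder used in the text.

```python
# Hypothetical prompt template; "xxx" is the keyword placeholder from the description.
PROMPT_TEMPLATE = "a xxx wearing headphones, shaking its head while listening to music"


def fill_prompt(template, object_category, keyword="xxx"):
    """Replace the keyword placeholder with the detected object category."""
    if keyword not in template:
        raise ValueError(f"template does not contain the keyword {keyword!r}")
    return template.replace(keyword, object_category)
```

Because only the keyword changes, the same template can be reused for any object category that the target detection result reports.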

目标检测结果包括目标对象对应的对象图像，对象图像可以包括目标对象对应的图像内容，可以是从目标多媒体数据中截取得到的内容。The target detection result includes an object image corresponding to the target object. The object image may include image content corresponding to the target object, and may be content intercepted from the target multimedia data.

307、通过目标视频生成模型分别对对象图像及文字描述信息进行解析，得到解析结果，并根据解析结果及视频生成控制信息对目标模板视频进行处理，以生成目标视频，目标视频的视频帧对应目标风格，且目标视频的视频帧中包括目标对象。307. Analyze the object image and text description information respectively through the target video generation model to obtain the analysis results, and process the target template video according to the analysis results and video generation control information to generate the target video. The video frames of the target video correspond to the target style, and the target object is included in the video frames of the target video.

电子设备通过目标视频生成模型分别对对象图像及文字描述信息进行解析，得到解析结果，可选的，解析结果可包括对象图像对应的对象类别，以及文字描述信息指示的模型任务。The electronic device analyzes the object image and text description information respectively through the target video generation model to obtain an analysis result. Optionally, the analysis result may include the object category corresponding to the object image and the model task indicated by the text description information.

示例性的，对象图像中的目标对象为狗，目标模板视频为描述“一只xxx戴着耳机，摇晃着脑袋在听歌”的视频，目标风格为目标动作，目标动作为“戴着耳机，摇晃着脑袋在听歌”，视频生成控制信息可包括该目标动作对应的姿态信息，解析结果可指示对象图像对应的对象类别为狗，并且，文字描述信息指示的模型任务为“生成一段一只狗戴着耳机，摇晃着脑袋在听歌的视频”，因此，根据上述解析结果及视频生成控制信息对目标模板视频进行处理，可生成目标视频。For example, the target object in the object image is a dog, the target template video depicts "a xxx wearing headphones, shaking its head while listening to music", the target style is a target action, and the target action is "wearing headphones, shaking its head while listening to music". The video generation control information may include the posture information corresponding to this target action. The parsing result may indicate that the object category corresponding to the object image is dog and that the model task indicated by the text description information is "generate a video of a dog wearing headphones, shaking its head while listening to music". The target video can therefore be generated by processing the target template video according to this parsing result and the video generation control information.

可见，向目标视频生成模型输入文字描述信息，有助于指导模型生成与用户需求相关的响应，明确模型的任务，保证确定模型生成的视频符合目标风格，提高了生成视频的准确性，并且可及时对文字描述信息进行关键词替换，可适应不同应用场景的需求，也提高了通过人工智能模型生成视频的灵活性。It can be seen that inputting text description information into the target video generation model helps guide the model toward responses relevant to the user's needs, makes the model's task explicit, and ensures that the generated video conforms to the target style, which improves the accuracy of the generated video. In addition, the keyword in the text description information can be replaced promptly to suit different application scenarios, which makes video generation through the artificial intelligence model more flexible.

并且，通过控制模型根据视频特征数据生成视频生成控制信息，并将视频生成控制信息输入至目标视频生成模型，以使目标视频生成模型能够根据视频生成控制信息生成符合目标风格的目标视频，有助于提高生成视频的可控性以及准确性。Moreover, having the control model generate video generation control information from the video feature data and inputting that information into the target video generation model enables the target video generation model to generate a target video that conforms to the target style, which helps improve the controllability and accuracy of the generated video.

以目标视频生成模型是扩散模型为例，控制模型可以是ControlNet模型，其中，扩散模型可以是稳定扩散(Stable Diffusion，SD)模型，具体不作限定。电子设备可根据目标模板视频以及与目标模板视频对应的文字描述信息对预训练的扩散模型进行训练，得到与目标风格对应的扩散模型；对目标模板视频进行特征提取得到的视频特征数据可包括边缘信息、深度信息、语义分割信息，将视频特征数据输入至ControlNet模型，ControlNet模型可根据视频特征数据生成视频生成控制信息，并将视频生成控制信息输入至扩散模型。Taking the case where the target video generation model is a diffusion model as an example, the control model may be a ControlNet model, and the diffusion model may be a Stable Diffusion (SD) model, which is not specifically limited. The electronic device can train the pre-trained diffusion model on the target template video and the text description information corresponding to the target template video to obtain a diffusion model corresponding to the target style. The video feature data obtained by feature extraction from the target template video may include edge information, depth information, and semantic segmentation information. The video feature data is input into the ControlNet model, which can generate video generation control information from it and input that control information into the diffusion model.

ControlNet模型定义了一组输入条件，可以用来控制扩散模型的输出。这些条件可能包括颜色方案、对象类别或其他特定任务参数，ControlNet模型可以将输入条件作为额外的输入信息传递给扩散模型，使得扩散模型可以根据这些输入条件来调整输出结果，并更好地适应特定任务。可选的，ControlNet可提供包括Canny边缘、语义分割图、关键点、涂鸦等多种输入条件，拓展了扩散模型的能力边界，使得扩散模型生成视频的可控性大幅提高。The ControlNet model defines a set of input conditions that can be used to control the output of the diffusion model. These conditions may include color schemes, object categories, or other task-specific parameters. The ControlNet model can pass the input conditions to the diffusion model as additional input, so that the diffusion model adjusts its output according to these conditions and adapts better to the specific task. Optionally, ControlNet can provide a variety of input conditions, including Canny edges, semantic segmentation maps, key points, and scribbles, which extends the capabilities of the diffusion model and greatly improves the controllability of the videos it generates.

ControlNet模型可创建扩散模型的两个副本，其中一个是“锁定”的，不能被修改，而另一个是“可训练”的，可以在特定任务上进行微调，ControlNet模型使用“权重共享”技术，该技术可以将扩散模型的权重复制到两个不同的神经网络中，这样，在微调扩散模型时，锁定副本仍然保留着通用知识，可以提供更好的初始状态。The ControlNet model can create two copies of the diffusion model: one is "locked" and cannot be modified, while the other is "trainable" and can be fine-tuned on a specific task. The ControlNet model uses a "weight sharing" technique that copies the weights of the diffusion model into two different neural networks, so that when the diffusion model is fine-tuned, the locked copy still retains the general knowledge and provides a better initial state.
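The locked and trainable copies can be illustrated with a toy weight dictionary. This is a simplified sketch under stated assumptions, not the ControlNet implementation: the weight layout, the `sgd_step` helper, and the gradient values are all hypothetical, and a real model would hold tensors rather than Python floats.

```python
import copy

# Toy stand-in for the pre-trained diffusion model's weights (hypothetical values).
pretrained = {"w": [0.5, -1.0], "b": 0.25}

locked = copy.deepcopy(pretrained)      # never updated; keeps the general knowledge
trainable = copy.deepcopy(pretrained)   # starts from the same weights, then fine-tuned


def sgd_step(weights, grads, lr=0.1):
    """Apply one plain gradient-descent update in place to the given weights."""
    weights["w"] = [w - lr * g for w, g in zip(weights["w"], grads["w"])]
    weights["b"] = weights["b"] - lr * grads["b"]


# Fine-tuning touches only the trainable copy; the locked copy stays as it was.
sgd_step(trainable, {"w": [1.0, 1.0], "b": 1.0})
```

Because the two copies are independent after the deep copy, updates to `trainable` cannot alter `locked`, which is the property the paragraph above relies on.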

在本申请实施例中，可基于目标多媒体数据对应的目标检测结果，匹配对应的目标模板视频，并对目标模板视频进行特征提取，得到输入至控制模型的视频特征数据，以通过控制模型根据视频特征数据生成视频生成控制信息，提高了生成视频的可控性以及准确性，然后通过目标视频生成模型根据目标检测结果、目标模板视频对应的文字描述信息以及视频生成控制信息对目标模板视频进行处理，生成目标视频，目标视频的视频帧对应目标风格且包含目标多媒体数据中的目标对象，通过利用目标模板视频调整过的目标视频生成模型生成目标视频，使得生成的风格视频更具有针对性，能够提高生成的风格视频的视频质量，并提高生成风格视频的效率。In the embodiments of the present application, the corresponding target template video can be matched based on the target detection result corresponding to the target multimedia data, and feature extraction is performed on the target template video to obtain the video feature data that is input to the control model, so that the control model generates video generation control information from the video feature data, which improves the controllability and accuracy of the generated video. The target video generation model then processes the target template video according to the target detection result, the text description information corresponding to the target template video, and the video generation control information to generate the target video, whose video frames correspond to the target style and contain the target object from the target multimedia data. Because the target video is generated by a target video generation model that has been adjusted using the target template video, the generated style video is more targeted, its video quality is improved, and the efficiency of generating style videos is improved.

请进一步参阅图4，图4是另一个实施例中视频生成方法的流程示意图，如图4所示，该视频生成方法可以包括以下步骤：Please further refer to Figure 4. Figure 4 is a schematic flow chart of a video generation method in another embodiment. As shown in Figure 4, the video generation method may include the following steps:

401、获取目标多媒体数据对应的目标检测结果，目标检测结果用于指示目标多媒体数据中包括的目标对象。401. Obtain the target detection result corresponding to the target multimedia data. The target detection result is used to indicate the target object included in the target multimedia data.

在一些实施例中，目标多媒体数据满足视频生成条件，视频生成条件包括以下中的一种或多种：目标多媒体数据中的目标对象对应的对象类别符合预设对象类别；目标多媒体数据中的目标对象的数量小于或等于预设数量；目标多媒体数据中的目标对象对应的图像尺寸大于或等于预设尺寸。In some embodiments, the target multimedia data satisfies video generation conditions, which include one or more of the following: the object category corresponding to the target object in the target multimedia data conforms to a preset object category; the number of target objects in the target multimedia data is less than or equal to a preset number; the image size corresponding to the target object in the target multimedia data is greater than or equal to a preset size.

终端设备响应于选择操作，确定选择的目标多媒体数据。比如，用户可以在相册应用程序中选择任一照片或视频作为目标多媒体数据，也可以打开相机应用程序，进行拍摄，得到目标多媒体数据。The terminal device determines the selected target multimedia data in response to the selection operation. For example, the user can select any photo or video as the target multimedia data in the photo album application, or can open the camera application, shoot, and obtain the target multimedia data.

终端设备可判断目标多媒体数据是否满足视频生成条件。可选的，终端设备可以在上述视频生成条件均不满足，或者存在一个不满足的情况下，输出提示用户重新选择目标多媒体数据的提示信息。The terminal device can determine whether the target multimedia data meets the video generation conditions. Optionally, the terminal device can output prompt information prompting the user to reselect the target multimedia data when none of the above video generation conditions are met, or one of them is not met.

示例性的，若预设对象类别为动物，则视频生成条件可以包括：目标对象的对象类别为动物、检测出的目标对象唯一、目标检测生成的边界框的短边大于512像素。需要说明的是，目标对象的对象类别为狗、猫、兔子等，都可符合预设对象类别。若存在任一条件不满足，则提示用户重新选择目标多媒体数据。For example, if the preset object category is animal, the video generation conditions may include: the object category of the target object is animal, exactly one target object is detected, and the short side of the bounding box produced by target detection is greater than 512 pixels. Note that object categories such as dog, cat, and rabbit all conform to the preset object category. If any condition is not met, the user is prompted to reselect the target multimedia data.
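The example conditions above can be checked with a few comparisons. The following is a minimal sketch under assumptions: the `Detection` class, the category set, and the function name are hypothetical, and the bounding box is assumed to be given as pixel corner coordinates.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    category: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


# Hypothetical stand-in for the preset object category "animal".
ANIMAL_CATEGORIES = {"dog", "cat", "rabbit"}


def unmet_conditions(detections, allowed=ANIMAL_CATEGORIES, max_objects=1, min_short_side=512):
    """Return the names of the video generation conditions that are not satisfied."""
    problems = []
    if not detections or any(d.category not in allowed for d in detections):
        problems.append("category")
    if len(detections) > max_objects:
        problems.append("count")
    for d in detections:
        x_min, y_min, x_max, y_max = d.box
        # The example requires the short side to be strictly greater than 512 pixels.
        if min(x_max - x_min, y_max - y_min) <= min_short_side:
            problems.append("size")
            break
    return problems
```

An empty list means the target multimedia data can be used; a non-empty list is the case in which the terminal device would prompt the user to reselect.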

通过视频生成条件的限制，有助于获取到针对性的、与视频生成模型适配的目标多媒体数据，从而能够提高生成的风格视频的有效性和效率，并且能够获取到高质量的、有效的目标多媒体数据，从而提高生成的风格视频的视频质量。Restricting the input through the video generation conditions helps obtain targeted target multimedia data that suits the video generation model, which improves the effectiveness and efficiency of style video generation. It also ensures that the target multimedia data is valid and of high quality, which improves the video quality of the generated style video.

402、根据目标检测结果，确定目标模板视频，目标模板视频对应目标风格。402. According to the target detection results, determine the target template video, and the target template video corresponds to the target style.

403、根据目标模板视频，确定目标视频生成模型，目标视频生成模型是根据目标模板视频对预训练的视频生成模型进行调整得到的。403. Determine the target video generation model based on the target template video. The target video generation model is obtained by adjusting the pre-trained video generation model based on the target template video.

404、对目标模板视频进行特征提取，得到目标模板视频对应的视频特征数据。404. Perform feature extraction on the target template video to obtain video feature data corresponding to the target template video.

405、将视频特征数据输入至控制模型，通过控制模型根据当前视频帧对应的视频特征数据，生成当前视频帧对应的视频生成控制信息，并将当前视频帧对应的视频生成控制信息输入至目标视频生成模型。405. Input the video feature data into the control model, use the control model to generate the video generation control information corresponding to the current video frame from the video feature data corresponding to the current video frame, and input the video generation control information corresponding to the current video frame into the target video generation model.

在一些实施例中，电子设备是对目标模板视频的各个视频帧进行特征提取，得到目标模板视频中各个视频帧对应的视频特征数据；将各个视频帧对应的视频特征数据输入至控制模型，可通过控制模型根据当前视频帧对应的视频特征数据，生成当前视频帧对应的视频生成控制信息，并将当前视频帧对应的视频生成控制信息输入至目标视频生成模型。In some embodiments, the electronic device performs feature extraction on each video frame of the target template video to obtain the video feature data corresponding to each video frame. The video feature data corresponding to each video frame is input into the control model, which can generate the video generation control information corresponding to the current video frame from the video feature data corresponding to the current video frame and input that control information into the target video generation model.

当前视频帧可以是各个视频帧中，控制模型当前正在处理的视频帧。The current video frame can be the video frame currently being processed by the control model among each video frame.
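The frame-by-frame handling in steps 404 and 405 can be sketched as a loop that processes one "current" frame at a time. The function name and the two callables are hypothetical placeholders for the feature extractor and the control model, which the text does not specify further.

```python
def control_info_per_frame(template_frames, extract_features, control_model):
    """Yield (frame_index, control_info) for each template frame, one frame at a time."""
    for index, frame in enumerate(template_frames):
        features = extract_features(frame)      # step 404, applied to this frame
        yield index, control_model(features)    # step 405, for the current frame only
```

Each yielded pair would then be passed to the target video generation model together with the inputs for the same frame.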

406、获取与目标检测结果及目标模板视频对应的文字描述信息。406. Obtain text description information corresponding to the target detection result and the target template video.

407、对目标模板视频的各个视频帧进行图像分割，以得到各个视频帧对应的分割数据。407. Perform image segmentation on each video frame of the target template video to obtain segmentation data corresponding to each video frame.

电子设备可对目标模板视频进行逐帧主体检测和分割，以将各个视频帧中的前景(主体)图像区域和背景图像区域分割，可选的，对各个视频帧进行图像分割的方法可包括但不限于Mask R-CNN算法、GrabCut算法、全卷积网络(FCN)等方法。The electronic device can perform frame-by-frame subject detection and segmentation on the target template video to separate the foreground (subject) image area from the background image area in each video frame. Optionally, the image segmentation method applied to each video frame may include, but is not limited to, the Mask R-CNN algorithm, the GrabCut algorithm, and fully convolutional networks (FCN).

需要说明的是，各个视频帧的前景图像区域和背景图像区域的景深不同，因此，可以基于各个视频帧中像素的深度信息，对各个视频帧包含的前景图像区域和背景图像区域进行分割。各个视频帧对应的分割数据可包括各个视频帧对应的前景数据和背景数据。It should be noted that the depth of field of the foreground image area and the background image area of each video frame is different. Therefore, the foreground image area and the background image area contained in each video frame can be segmented based on the depth information of the pixels in each video frame. The segmentation data corresponding to each video frame may include foreground data and background data corresponding to each video frame.
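As one simple way to picture depth-based segmentation, the sketch below thresholds a per-pixel depth map. It rests on assumptions the text does not state: that smaller depth values mean nearer pixels, that a single fixed threshold separates the subject from the background, and that the depth map is a nested list; the function name is hypothetical.

```python
def split_by_depth(depth_map, threshold):
    """Return a binary mask: 1 for pixels nearer than the threshold (foreground), else 0."""
    return [[1 if depth < threshold else 0 for depth in row] for row in depth_map]
```

The resulting mask plays the role of the foreground data, and its complement plays the role of the background data.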

可见，对目标模板视频进行逐帧主体检测和分割，有助于准确地在目标模板视频的视频帧中确定用于生成目标对象的图像内容的区域，有助于提高风格视频生成的效率和准确性。It can be seen that frame-by-frame subject detection and segmentation of the target template video helps accurately determine, in each video frame of the target template video, the area in which the image content of the target object is to be generated, which helps improve the efficiency and accuracy of style video generation.

408、通过目标视频生成模型分别对对象图像及文字描述信息进行解析，得到解析结果，并根据解析结果、当前视频帧对应的视频生成控制信息以及当前视频帧对应的分割数据，在当前视频帧的前景图像区域中生成目标对象对应的图像内容，并将图像内容与当前视频帧的背景图像区域融合，得到当前视频帧对应的目标视频帧。408. Analyze the object image and text description information respectively through the target video generation model to obtain the analysis results, and based on the analysis results, the video generation control information corresponding to the current video frame, and the segmentation data corresponding to the current video frame, in the current video frame Generate image content corresponding to the target object in the foreground image area, and fuse the image content with the background image area of the current video frame to obtain the target video frame corresponding to the current video frame.

当前视频帧对应的分割数据可包括当前视频帧对应的前景数据和背景数据，前景数据可用于指示当前视频帧的前景图像区域，背景数据可用于指示当前视频帧的背景图像区域，并且，对象图像可包括目标对象对应的图像内容。The segmentation data corresponding to the current video frame may include foreground data and background data corresponding to the current video frame, the foreground data may be used to indicate the foreground image area of the current video frame, the background data may be used to indicate the background image area of the current video frame, and, the object image Image content corresponding to the target object may be included.

因此，电子设备可根据解析结果、当前视频帧对应的视频生成控制信息以及当前视频帧对应的分割数据，在当前视频帧的前景图像区域中生成目标对象对应的图像内容，并将图像内容与当前视频帧的背景图像区域融合，融合的方式可包括但不限于Alpha混合、基于像素的融合等方法。Therefore, the electronic device can generate the image content corresponding to the target object in the foreground image area of the current video frame according to the parsing result, the video generation control information corresponding to the current video frame, and the segmentation data corresponding to the current video frame, and then fuse that image content with the background image area of the current video frame. The fusion method may include, but is not limited to, alpha blending and pixel-based fusion.
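Alpha blending, one of the fusion methods named above, combines two images pixel by pixel using a weight between 0 and 1. The sketch below is illustrative only and assumes single-channel images stored as nested lists of floats; the function name is hypothetical.

```python
def alpha_blend(generated, background, alpha):
    """Blend per pixel: out = alpha * generated + (1 - alpha) * background."""
    return [
        [a * g + (1.0 - a) * b for g, b, a in zip(g_row, b_row, a_row)]
        for g_row, b_row, a_row in zip(generated, background, alpha)
    ]
```

With alpha equal to 1 inside the foreground image area and 0 in the background image area, the generated content replaces the subject while the template background is kept; intermediate values soften the boundary.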

示例性的，对象图像中的目标对象为狗，目标模板视频为描述“一只xxx戴着耳机，摇晃着脑袋在听歌”的视频，目标风格为目标动作，目标动作为“戴着耳机，摇晃着脑袋在听歌”，视频生成控制信息可包括该目标动作对应的姿态信息，解析结果可指示对象图像对应的对象类别为狗，并且，文字描述信息指示的模型任务为“生成一段一只狗戴着耳机，摇晃着脑袋在听歌的视频”，电子设备根据解析结果、当前视频帧对应的视频生成控制信息以及当前视频帧对应的分割数据，可在当前视频帧的前景图像区域中生成一只狗戴着耳机，摇晃着脑袋在听歌的图像内容，并将该图像内容与当前视频帧的背景图像区域融合，得到当前视频帧对应的目标视频帧。For example, the target object in the object image is a dog, the target template video depicts "a xxx wearing headphones, shaking its head while listening to music", the target style is a target action, and the target action is "wearing headphones, shaking its head while listening to music". The video generation control information may include the posture information corresponding to this target action. The parsing result may indicate that the object category corresponding to the object image is dog and that the model task indicated by the text description information is "generate a video of a dog wearing headphones, shaking its head while listening to music". According to the parsing result, the video generation control information corresponding to the current video frame, and the segmentation data corresponding to the current video frame, the electronic device can generate, in the foreground image area of the current video frame, image content showing a dog wearing headphones and shaking its head while listening to music, and fuse that image content with the background image area of the current video frame to obtain the target video frame corresponding to the current video frame.

当前视频帧对应的目标视频帧为将当前视频帧的前景图像区域替换为目标对象对应的图像内容的视频帧。The target video frame corresponding to the current video frame is a video frame in which the foreground image area of the current video frame is replaced with the image content corresponding to the target object.

可见，在当前视频帧的前景图像区域中生成目标对象对应的图像内容，并将图像内容与当前视频帧的背景图像区域融合，得到当前视频帧对应的目标视频帧，可以在目标模板视频的基础上生成对应目标风格、且包含目标对象的视频，提高了风格视频生成的效率和准确性。It can be seen that generating the image content corresponding to the target object in the foreground image area of the current video frame and fusing it with the background image area of the current video frame to obtain the corresponding target video frame makes it possible to generate, on the basis of the target template video, a video that corresponds to the target style and contains the target object, which improves the efficiency and accuracy of style video generation.

409、基于目标模板视频的各个视频帧对应的目标视频帧，生成目标视频。409. Generate a target video based on the target video frame corresponding to each video frame of the target template video.

电子设备可将目标模板视频的各个视频帧对应的目标视频帧，按照各个视频帧的时间顺序组合起来，得到目标视频。The electronic device can combine the target video frames corresponding to each video frame of the target template video according to the time sequence of each video frame to obtain the target video.
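Combining the target video frames in time order is essentially a sort by timestamp. The sketch below assumes each target video frame is paired with the timestamp of the template frame it was generated from; the function name and the pair representation are hypothetical.

```python
def assemble_target_video(timestamped_frames):
    """Order (timestamp, frame) pairs by timestamp and return the frames as a list."""
    ordered = sorted(timestamped_frames, key=lambda item: item[0])
    return [frame for _, frame in ordered]
```

Sorting by timestamp keeps the result correct even if the per-frame generation finished out of order.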

In the embodiments of the present application, a target template video can be matched based on the target detection result corresponding to the target multimedia data, and feature extraction can be performed on the target template video to obtain the video feature data that is input to the control model. The control model generates video generation control information from the video feature data, which improves the controllability and accuracy of the generated video. Image segmentation is then performed on each video frame of the target template video to obtain the segmentation data corresponding to each video frame. The target video generation model processes the target template video according to the target detection result, the text description information corresponding to the target template video, the segmentation data corresponding to the current video frame, and the video generation control information: it generates the image content corresponding to the target object in the foreground image area of the current video frame and fuses that image content with the background image area of the current video frame to obtain the target video frame corresponding to the current video frame. The target video is generated from the target video frames corresponding to the video frames of the target template video; its video frames correspond to the target style and contain the target object from the target multimedia data. Because the target video is generated by a target video generation model that has been adjusted with the target template video, the generated style video is more targeted, which can improve both the video quality of the generated style video and the efficiency of generating it.
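The foreground/background fusion described above can be illustrated with a hard-mask composite. This is only a sketch under stated assumptions: the application does not say how fusion is performed, so the per-pixel selection below, the nested-list frame representation, and the `fuse_frame` name are all illustrative choices rather than the disclosed method.

```python
from typing import List

Frame = List[List[int]]  # assumed representation: rows of grayscale pixel values


def fuse_frame(generated: Frame, template: Frame, foreground_mask: Frame) -> Frame:
    """Keep generated content where the mask is 1 and the template background where it is 0."""
    if not (len(generated) == len(template) == len(foreground_mask)):
        raise ValueError("all inputs must have the same number of rows")
    fused = []
    for gen_row, tpl_row, mask_row in zip(generated, template, foreground_mask):
        fused.append([g if m else t for g, t, m in zip(gen_row, tpl_row, mask_row)])
    return fused


template = [[10, 10, 10], [10, 10, 10]]   # background pixels from the template frame
generated = [[99, 99, 99], [99, 99, 99]]  # content generated for the target object
mask = [[0, 1, 0], [1, 1, 0]]             # segmentation data: 1 marks the foreground
fused = fuse_frame(generated, template, mask)
print(fused)  # [[10, 99, 10], [99, 99, 10]]
```

A production system would more likely blend along the mask boundary instead of switching abruptly, but the hard mask keeps the role of the segmentation data easy to see.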

请进一步参阅图5，图5是一个实施例中视频生成方法的时序图，如图5所示，该视频生成方法可以包括以下步骤：Please further refer to Figure 5. Figure 5 is a timing diagram of a video generation method in an embodiment. As shown in Figure 5, the video generation method may include the following steps:

510、终端设备响应于选择操作，确定选择的目标多媒体数据。510. In response to the selection operation, the terminal device determines the selected target multimedia data.

520、终端设备通过目标检测模型对目标多媒体数据进行目标检测，得到目标检测结果。520. The terminal device performs target detection on the target multimedia data through the target detection model, and obtains the target detection result.

目标检测结果用于指示目标多媒体数据中包括的目标对象。The target detection result is used to indicate the target object included in the target multimedia data.

530，终端设备获取基于目标检测结果生成的目标视频。530. The terminal device obtains the target video generated based on the target detection result.

目标视频的视频帧对应目标风格，且目标视频的视频帧中包括目标对象。The video frames of the target video correspond to the target style, and the video frames of the target video include the target object.

The target video is generated by the target video generation model processing the target template video according to the target detection result; the target template video corresponds to the target style and is determined according to the target detection result; the target video generation model is obtained by adjusting a pre-trained video generation model according to the target template video.

540，服务器接收终端设备发送的目标检测结果，目标检测结果为终端设备响应于选择操作，确定选择的目标多媒体数据，并通过目标检测模型对目标多媒体数据进行目标检测得到的。540. The server receives the target detection result sent by the terminal device. The target detection result is obtained by the terminal device responding to the selection operation, determining the selected target multimedia data, and performing target detection on the target multimedia data through the target detection model.

550，服务器获取目标多媒体数据对应的目标检测结果，目标检测结果用于指示目标多媒体数据中包括的目标对象。550. The server obtains the target detection result corresponding to the target multimedia data, and the target detection result is used to indicate the target object included in the target multimedia data.

560，服务器根据目标检测结果，确定目标模板视频。560. The server determines the target template video based on the target detection results.

目标模板视频对应目标风格。The target template video corresponds to the target style.

570，服务器根据目标模板视频，确定目标视频生成模型。570. The server determines the target video generation model based on the target template video.

目标视频生成模型是根据目标模板视频对预训练的视频生成模型进行调整得到的。The target video generation model is obtained by adjusting the pre-trained video generation model based on the target template video.

580，服务器通过目标视频生成模型根据目标检测结果对目标模板视频进行处理，以生成目标视频。580. The server processes the target template video according to the target detection result through the target video generation model to generate the target video.

590. The server sends the target video to the terminal device.

步骤510～步骤590的实施方式可参考上述实施例，具体不作赘述。For the implementation of steps 510 to 590, reference may be made to the above embodiments, and details will not be described again.
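The division of work between the terminal device and the server in steps 510 to 590 can be sketched as follows. Everything here is a stand-in: the `Terminal` and `Server` classes, the dictionary-based template lookup, and the strings that represent the adjusted model and the generated video are hypothetical and only show which side performs which step.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DetectionResult:
    category: str        # object category of the detected target
    object_image: bytes  # image of the target object (placeholder)


class Server:
    """Cloud side: steps 540 to 590."""

    def __init__(self, templates: Dict[str, str]):
        self.templates = templates  # hypothetical mapping: category -> template video id

    def handle(self, result: DetectionResult) -> str:
        template = self.templates[result.category]  # 560: determine the template video
        model = f"model_for_{template}"             # 570: model adjusted on that template
        # 580: stand-in for running the generation model; 590: return it to the terminal
        return f"{model}:{result.category}"


class Terminal:
    """Device side: steps 510 to 530."""

    def __init__(self, server: Server, detect: Callable[[bytes], DetectionResult]):
        self.server = server
        self.detect = detect

    def on_select(self, media: bytes) -> str:
        result = self.detect(media)        # 520: run target detection on the device
        return self.server.handle(result)  # 530: obtain the generated target video


server = Server({"cat": "cat_dance"})
terminal = Terminal(server, lambda media: DetectionResult("cat", media))
video = terminal.on_select(b"photo-bytes")
print(video)  # model_for_cat_dance:cat
```

Only the detection result crosses the device/cloud boundary in this sketch, which mirrors step 540, where the server receives the detection result from the terminal.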

Please further refer to Figure 6, which is a sequence diagram of a video generation method in another embodiment. As shown in Figure 6, by selecting a pet image, a pet video with a specific action and style can be generated. The terminal device (device side) selects a pet photo from the album, performs target detection on the picture, and determines whether it contains a single target pet. Target detection results that meet the requirements are uploaded to the server (cloud side) for video template matching. The preset prompt and the target foreground image (object image) are parsed as text and image respectively; frame-by-frame subject detection and segmentation are performed on the target template video, and the relevant features of the video frames are extracted; the video is generated by a diffusion model controlled through ControlNet and is finally delivered to the device side.

Please further refer to Figure 7, which is a sequence diagram of a video generation method in another embodiment. As shown in Figure 7, the method may specifically include the following steps:

(1) Image selection and target detection: the user opens the image album and selects an image; target detection is performed on the image with the target detection model to obtain the pet information of the specified categories and the number of targets.

(2) Image screening: the user must select again if the picture has any of the following problems: multiple pet categories are detected, multiple targets are detected, or the short side of the detected target box is less than 512.

(3) Model fine-tuning: based on the built-in video template, feature information is extracted for the control model ControlNet, image segmentation is performed on the video frames, and fine-tuning produces a model for the specific action template.

(4) Prompt template replacement and parsing: the keywords in the prompt are replaced based on the detection result, and the prompt is input to the corresponding model after text parsing.

(5) Inputting the reference image to the corresponding action model to generate the result: the reference image is input to the model for the corresponding action and, combined with the parsed prompt, a video animation of the reference target performing the specified action is generated.
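Steps (2) and (4) can be sketched as two small helpers. The threshold of 512 comes from the text; the `Box` type, the `{pet}` placeholder syntax, and the decision to reject images with no detections at all are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

MIN_SHORT_SIDE = 512  # threshold stated in step (2)


@dataclass
class Box:
    category: str  # detected pet category
    width: int     # width of the detected target box
    height: int    # height of the detected target box


def needs_reselection(boxes: List[Box]) -> bool:
    """Return True when the user should pick another image."""
    # More than one box covers both "multiple targets" and "multiple categories".
    # Rejecting an image with zero detections is an assumption, not stated in the text.
    if len(boxes) != 1:
        return True
    box = boxes[0]
    return min(box.width, box.height) < MIN_SHORT_SIDE


def fill_prompt(template: str, category: str) -> str:
    """Replace the keyword placeholder in the prompt template with the detected category."""
    return template.replace("{pet}", category)


print(needs_reselection([Box("cat", 600, 700)]))  # False
print(needs_reselection([Box("cat", 600, 400)]))  # True
print(fill_prompt("a {pet} dancing, cartoon style", "corgi"))  # a corgi dancing, cartoon style
```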

Please refer to Figure 8, which is a schematic structural diagram of a video generation device in one embodiment. As shown in Figure 8, the video generation device 800 may include: an acquisition module 810, a template determination module 820, a model determination module 830, and a video generation module 840.

获取模块810，用于获取目标多媒体数据对应的目标检测结果，目标检测结果用于指示目标多媒体数据中包括的目标对象；The acquisition module 810 is used to obtain the target detection result corresponding to the target multimedia data, and the target detection result is used to indicate the target object included in the target multimedia data;

模板确定模块820，用于根据目标检测结果，确定目标模板视频；目标模板视频对应目标风格；The template determination module 820 is used to determine the target template video according to the target detection result; the target template video corresponds to the target style;

模型确定模块830，用于根据目标模板视频，确定目标视频生成模型；目标视频生成模型是根据目标模板视频对预训练的视频生成模型进行调整得到的；The model determination module 830 is used to determine the target video generation model based on the target template video; the target video generation model is obtained by adjusting the pre-trained video generation model based on the target template video;

视频生成模块840，用于通过目标视频生成模型根据目标检测结果对目标模板视频进行处理，以生成目标视频，目标视频的视频帧对应目标风格，且目标视频的视频帧中包括目标对象。The video generation module 840 is used to process the target template video according to the target detection results through the target video generation model to generate a target video, the video frames of the target video correspond to the target style, and the video frames of the target video include the target object.

在一个实施例中，视频生成装置800还包括：控制模块；In one embodiment, the video generation device 800 further includes: a control module;

The control module is used to perform feature extraction on the target template video to obtain the video feature data corresponding to the target template video; to input the video feature data to the control model and generate video generation control information from the video feature data through the control model; and to input the video generation control information to the target video generation model;

视频生成模块840，还用于通过目标视频生成模型根据目标检测结果及视频生成控制信息，对目标模板视频进行处理，以生成目标视频。The video generation module 840 is also used to process the target template video according to the target detection results and video generation control information through the target video generation model to generate the target video.

In one embodiment, the target detection result includes an object image corresponding to the target object. The video generation module 840 is also used to: obtain text description information corresponding to the target detection result and the target template video; parse the object image and the text description information separately through the target video generation model to obtain a parsing result; and process the target template video according to the parsing result and the video generation control information to generate the target video.

在一个实施例中，视频生成装置800还包括：分割模块；In one embodiment, the video generation device 800 further includes: a segmentation module;

分割模块，用于对目标模板视频的各个视频帧进行图像分割，以得到各个视频帧对应的分割数据；The segmentation module is used to perform image segmentation on each video frame of the target template video to obtain the segmentation data corresponding to each video frame;

控制模块，还用于通过控制模型根据当前视频帧对应的视频特征数据，生成当前视频帧对应的视频生成控制信息，并将当前视频帧对应的视频生成控制信息输入至目标视频生成模型；The control module is also used to generate video generation control information corresponding to the current video frame according to the video feature data corresponding to the current video frame through the control model, and input the video generation control information corresponding to the current video frame to the target video generation model;

The video generation module 840 is also used to generate, according to the parsing result, the video generation control information corresponding to the current video frame, and the segmentation data corresponding to the current video frame, the image content corresponding to the target object in the foreground image area of the current video frame, and to fuse the image content with the background image area of the current video frame to obtain the target video frame corresponding to the current video frame; and to generate the target video based on the target video frames corresponding to the video frames of the target template video.

在一个实施例中，目标检测结果包括目标对象对应的对象类别；模板确定模块820，还用于根据目标对象对应的对象类别，确定对象类别对应的目标模板视频。In one embodiment, the target detection result includes the object category corresponding to the target object; the template determination module 820 is also configured to determine the target template video corresponding to the object category according to the object category corresponding to the target object.
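One simple way to realize category-based template selection is a lookup table keyed by object category. The table contents, the file names, and the `select_template` helper are hypothetical; the application only states that the template is determined from the object category.

```python
from typing import Dict, Optional

# Hypothetical template library: object category -> template video file
TEMPLATES: Dict[str, str] = {
    "cat": "cat_dance.mp4",
    "dog": "dog_wave.mp4",
}


def select_template(category: str, templates: Dict[str, str] = TEMPLATES) -> Optional[str]:
    """Return the template video for the category, or None if no template exists."""
    return templates.get(category)


print(select_template("cat"))     # cat_dance.mp4
print(select_template("rabbit"))  # None
```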

In one embodiment, the target multimedia data satisfies video generation conditions, and the video generation conditions include one or more of the following: the object category corresponding to the target object in the target multimedia data conforms to a preset object category; the number of target objects in the target multimedia data is less than or equal to a preset number; the image size corresponding to the target object in the target multimedia data is greater than or equal to a preset size.

在一个实施例中，目标风格包括目标动作，目标视频的视频帧中包括执行目标动作的目标对象。In one embodiment, the target style includes a target action, and the video frames of the target video include a target object performing the target action.

In one embodiment, the acquisition module 810 is also used to receive the target detection result sent by the terminal device. The target detection result is obtained by the terminal device determining the selected target multimedia data in response to a selection operation and performing target detection on the target multimedia data through the target detection model.

Please refer to Figure 9, which is a schematic structural diagram of a video generation device in another embodiment. As shown in Figure 9, the video generation device 900 can be applied to the above-mentioned terminal device and may include: a response module 910, a detection module 920, and an acquisition module 930;

响应模块910，用于响应于选择操作，确定选择的目标多媒体数据；Response module 910, configured to determine the selected target multimedia data in response to the selection operation;

检测模块920，用于通过目标检测模型对目标多媒体数据进行目标检测，得到目标检测结果；目标检测结果用于指示目标多媒体数据中包括的目标对象；The detection module 920 is used to perform target detection on the target multimedia data through the target detection model to obtain a target detection result; the target detection result is used to indicate the target object included in the target multimedia data;

获取模块930，用于获取基于目标检测结果生成的目标视频，目标视频的视频帧对应目标风格，且目标视频的视频帧中包括目标对象；其中，目标视频是通过目标视频生成模型根据目标检测结果对目标模板视频进行处理生成的；目标模板视频对应目标风格，目标模板视频根据目标检测结果确定；目标视频生成模型是根据目标模板视频对预训练的视频生成模型进行调整得到的。The acquisition module 930 is used to acquire a target video generated based on the target detection result, the video frame of the target video corresponds to the target style, and the video frame of the target video includes the target object; wherein, the target video is generated based on the target detection result through the target video generation model. The target template video is processed and generated; the target template video corresponds to the target style, and the target template video is determined based on the target detection results; the target video generation model is obtained by adjusting the pre-trained video generation model based on the target template video.

Please refer to Figure 10, which is a schematic structural diagram of an electronic device in one embodiment. As shown in Figure 10, the electronic device 1000 may include: a memory 1010 storing executable program code; and a processor 1020 coupled to the memory 1010; wherein the processor 1020 calls the executable program code stored in the memory 1010 to execute any of the video generation methods disclosed in the embodiments of the present application.

本申请实施例公开一种计算机可读存储介质，其存储计算机程序，其中，计算机程序被所述处理器执行时，使得所述处理器实现本申请实施例公开的任意一种视频生成方法。An embodiment of the present application discloses a computer-readable storage medium that stores a computer program. When the computer program is executed by the processor, the processor implements any video generation method disclosed in the embodiment of the present application.

应理解，说明书通篇中提到的“一个实施例”或“一实施例”意味着与实施例有关的特定特征、结构或特性包括在本申请的至少一个实施例中。因此，在整个说明书各处出现的“在一个实施例中”或“在一实施例中”未必一定指相同的实施例。此外，这些特定特征、结构或特性可以以任意适合的方式结合在一个或多个实施例中。本领域技术人员也应该知悉，说明书中所描述的实施例均属于可选实施例，所涉及的动作和模块并不一定是本申请所必须的。It will be understood that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic associated with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily necessary for this application.

In the various embodiments of the present application, it should be understood that the sequence numbers of the above processes do not imply a necessary order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application. The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated units may be implemented in the form of hardware or in the form of software functional units.

If the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-accessible memory. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes several requests for causing a computer device (which may be a personal computer, a server, a network device, or the like, and specifically may be a processor in the computer device) to execute some or all of the steps of the above methods of the embodiments of the present application.

Those of ordinary skill in the art can understand that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium includes read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), one-time programmable read-only memory (One-time Programmable Read-Only Memory, OTPROM), electrically erasable programmable read-only memory (Electrically-Erasable Programmable Read-Only Memory, EEPROM), compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.

The video generation method, device, electronic device, and storage medium disclosed in the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims

1. A video generation method, characterized in that the method includes:

Obtaining a target detection result corresponding to the target multimedia data, where the target detection result is used to indicate a target object included in the target multimedia data;

According to the target detection result, a target template video is determined; the target template video corresponds to the target style;

Determine a target video generation model according to the target template video; the target video generation model is obtained by adjusting a pre-trained video generation model according to the target template video;

The target template video is processed according to the target detection result by the target video generation model to generate a target video, the video frames of the target video correspond to the target style, and the video frames of the target video include the target object.

2. The method according to claim 1, characterized in that, before processing the target template video according to the target detection result through the target video generation model to generate a target video, the method further includes:

Perform feature extraction on the target template video to obtain video feature data corresponding to the target template video;

inputting the video feature data to a control model, generating video generation control information according to the video feature data through the control model, and inputting the video generation control information to the target video generation model;

The target video generation model processes the target template video according to the target detection result to generate a target video, including:

The target video generation model processes the target template video according to the target detection result and the video generation control information to generate a target video.

3. The method according to claim 2, characterized in that the target detection result includes an object image corresponding to the target object; and processing the target template video according to the target detection result and the video generation control information through the target video generation model to generate a target video includes:

Obtain text description information corresponding to the target detection result and the target template video;

The object image and the text description information are respectively parsed through the target video generation model to obtain a parsing result, and the target template video is processed according to the parsing result and the video generation control information to generate a target video.

4. The method according to claim 3, characterized in that, before analyzing the object image and the text description information respectively through the target video generation model, the method further includes:

Perform image segmentation on each video frame of the target template video to obtain segmentation data corresponding to each video frame;

Generating video generation control information according to the video feature data through the control model, and inputting the video generation control information to the target video generation model includes:

Generate video generation control information corresponding to the current video frame according to the video feature data corresponding to the current video frame through the control model, and input the video generation control information corresponding to the current video frame into the target video generation model;

Processing the target template video according to the parsing result and the video generation control information to generate a target video includes:

According to the parsing result, the video generation control information corresponding to the current video frame, and the segmentation data corresponding to the current video frame, image content corresponding to the target object is generated in the foreground image area of the current video frame, and the image content is fused with the background image area of the current video frame to obtain a target video frame corresponding to the current video frame;

A target video is generated based on target video frames corresponding to each video frame of the target template video.

5. The method according to claim 1, wherein the target detection result includes an object category corresponding to the target object; and determining the target template video according to the target detection result includes:

According to the object category corresponding to the target object, the target template video corresponding to the object category is determined.

6. The method according to claim 1, characterized in that the target multimedia data satisfies video generation conditions, and the video generation conditions include one or more of the following:

The object category corresponding to the target object in the target multimedia data conforms to the preset object category;

The number of target objects in the target multimedia data is less than or equal to a preset number;

The image size corresponding to the target object in the target multimedia data is greater than or equal to the preset size.

7. The method according to any one of claims 1 to 6, wherein the target style includes a target action, and the video frame of the target video includes the target object performing the target action.

8. The method according to claim 1, characterized in that the method is applied to a server; the obtaining the target detection result corresponding to the target multimedia data includes:

Receive a target detection result sent by the terminal device. The target detection result is obtained by the terminal device determining the selected target multimedia data in response to the selection operation, and performing target detection on the target multimedia data through a target detection model.

9. A video generation method, characterized in that it is applied to a terminal device, and the method includes:

In response to the selection operation, determining the selected target multimedia data;

Perform target detection on the target multimedia data through a target detection model to obtain a target detection result; the target detection result is used to indicate the target object included in the target multimedia data;

Obtain a target video generated based on the target detection result, the video frames of the target video correspond to the target style, and the video frames of the target video include the target object;

Wherein, the target video is generated by processing a target template video according to the target detection result through a target video generation model; the target template video corresponds to the target style, and the target template video is determined according to the target detection result; the target video generation model is obtained by adjusting a pre-trained video generation model based on the target template video.

10. A video generation device, characterized in that the device includes:

An acquisition module, configured to acquire a target detection result corresponding to the target multimedia data, where the target detection result is used to indicate a target object included in the target multimedia data;

A template determination module, configured to determine a target template video according to the target detection result; the target template video corresponds to the target style;

A model determination module, configured to determine a target video generation model based on the target template video; the target video generation model is obtained by adjusting a pre-trained video generation model based on the target template video;

A video generation module, configured to process the target template video according to the target detection result through the target video generation model to generate a target video, the video frames of the target video correspond to the target style, and the video frames of the target video include the target object.

11. A video generation device, characterized in that it is applied to terminal equipment, and the device includes:

a response module, configured to determine the selected target multimedia data in response to the selection operation;

A detection module, configured to perform target detection on the target multimedia data through a target detection model to obtain a target detection result; the target detection result is used to indicate a target object included in the target multimedia data;

An acquisition module, configured to acquire a target video generated based on the target detection result, the video frames of the target video correspond to the target style, and the video frames of the target video include the target object; wherein the target video is generated by processing a target template video according to the target detection result through a target video generation model; the target template video corresponds to the target style, and the target template video is determined according to the target detection result; the target video generation model is obtained by adjusting a pre-trained video generation model according to the target template video.

12. An electronic device, characterized by comprising a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor implements the method according to any one of claims 1 to 8 or claim 9.

13. A computer-readable storage medium with a computer program stored thereon, characterized in that when the computer program is executed by a processor, the method according to any one of claims 1 to 8 or claim 9 is implemented.