
CN118433466A - Video generation method, apparatus, electronic device, storage medium, and program product

Info

Publication number: CN118433466A
Application number: CN202410749339.5A
Authority: CN (China)
Prior art keywords: text, video, image, target, determining
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 曹旭阳, 安山
Current assignee: Beijing Jingdong Tuoxian Technology Co Ltd
Original assignee: Beijing Jingdong Tuoxian Technology Co Ltd
Application filed by Beijing Jingdong Tuoxian Technology Co Ltd; priority to CN202410749339.5A

Classifications

    • H04N21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • G06F40/194 - Handling natural language data; text processing; calculation of difference between files
    • G06F40/30 - Handling natural language data; semantic analysis
    • G06V20/40 - Scenes; scene-specific elements in video content
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V30/19093 - Character recognition using electronic means; proximity measures, i.e. similarity or distance measures
    • H04N21/435 - Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/816 - Monomedia components involving special video data, e.g. 3D video


Abstract

The present disclosure provides a video generation method, apparatus, electronic device, storage medium, and program product. The video generation method includes the following steps: determining text features according to received text content for describing a target object; determining the identification of a target material image from an image library according to the text features, where the image library includes a plurality of material images and image features of the plurality of material images, and the similarity between the image feature of the target material image and the text features is greater than a predetermined threshold; inputting the text content, the identification, and a predetermined script template as prompt information into a generative language model to generate a video generation script; and generating a first target video by running the video generation script.

Description

Video generation method, apparatus, electronic device, storage medium, and program product
Technical Field
The present disclosure relates to the field of information processing technology, and more particularly, to a video generation method, apparatus, electronic device, storage medium, and program product.
Background
In the process of generating a commodity display video in an e-commerce scenario, a related example first generates a video script by manually designing each shot of the entire commodity display video, and then generates a video template according to the video script. The video template is similar to a slide template: when a new commodity display video needs to be made, the videos, pictures, and text in the video template must be replaced with the videos, pictures, and text related to the new commodity.
In implementing the concepts of the present disclosure, the inventors have found that the related examples have at least the following problem: they generate commodity display videos from given materials, a fixed video generation script, and a fixed video template in combination with a video generation tool, so the flexibility of commodity display video generation is poor.
Disclosure of Invention
In view of this, the present disclosure provides a video generation method, apparatus, electronic device, storage medium, and program product.
One aspect of the present disclosure provides a video generation method, including: determining text features according to received text content for describing a target object; determining the identification of a target material image from an image library according to the text features, where the image library includes a plurality of material images and image features of the plurality of material images, and the similarity between the image feature of the target material image and the text features is greater than a predetermined threshold; inputting the text content, the identification, and a predetermined script template as prompt information into a generative language model to generate a video generation script; and generating a first target video by running the video generation script.
According to an embodiment of the present disclosure, determining the text features according to the received text content for describing the target object includes: splitting the text of the text content to obtain a plurality of text paragraphs, and performing feature extraction on the plurality of text paragraphs to determine the text features.
According to an embodiment of the present disclosure, splitting the text content to obtain the plurality of text paragraphs includes: performing semantic analysis on the text content to obtain a semantic analysis result, and splitting the text content based on the semantic analysis result to obtain the plurality of text paragraphs.
According to an embodiment of the present disclosure, the video generation script includes a first correspondence between a first text paragraph and the identification of the target material image. Generating the first target video by running the video generation script includes: reading, by running the video generation script, the identification of the target material image recorded in a material image identification field of the video generation script; calling the target material image from the image library according to the identification; taking the first text paragraph corresponding to the identification, determined according to the first correspondence, as the subtitle of the target material image; and generating the first target video according to the target material image and the subtitle.
According to an embodiment of the present disclosure, the video generation script further includes a second correspondence between a second text paragraph and a target field, where the target field is used to call a digital human engine to narrate the text content. The method further includes: calling the digital human engine pointed to by the target field, controlling the digital human engine to broadcast, in the form of a digital human, the second text paragraph corresponding to the target field determined according to the second correspondence, and generating a second target video.
According to an embodiment of the present disclosure, determining the identification of the target material image from the image library according to the text features includes: calculating similarities between the text features and the plurality of image features respectively; determining a target material image feature from the plurality of material image features according to the plurality of similarities; and determining the target material image corresponding to the target material image feature and determining the identification of the target material image.
According to an embodiment of the present disclosure, the image library is constructed by: performing scene detection on a material video to obtain a detection result, where the material video includes a plurality of display scenes corresponding to a material object; segmenting the material video based on the detection result to obtain a plurality of video clips; performing frame extraction processing on each video clip to obtain a material image of a single scene displaying the material object; processing the material image to obtain a material image feature; and constructing the image library according to the material images and the material image features.
Another aspect of the present disclosure provides a video generating apparatus, including a first determining module, a second determining module, a first generating module, and a second generating module. The first determining module is used for determining text features according to received text content for describing a target object. The second determining module is used for determining the identification of a target material image from an image library according to the text features, where the image library includes a plurality of material images and image features of the plurality of material images, and the similarity between the image feature of the target material image and the text features is greater than a predetermined threshold. The first generating module is used for inputting the text content, the identification, and a predetermined script template as prompt information into a generative language model to generate a video generation script. The second generating module is used for generating a first target video by running the video generation script.
Another aspect of the present disclosure provides an electronic device, comprising: one or more processors. And a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement a method as described above.
Another aspect of the present disclosure provides a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
According to the embodiments of the present disclosure, the video generation method provided by the present disclosure adopts a generative language model and no longer depends on a fixed video generation script and video template. It can flexibly process different text content as input, flexibly match the text content with target material images, and generate diversified first target videos, thereby improving the flexibility of commodity display video generation.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:
FIG. 1 schematically illustrates an application scenario diagram of a video generation method, apparatus, electronic device, storage medium and program product according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a video generation method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a system framework diagram of a video generation method according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart of a first target video generation method according to an embodiment of the disclosure;
FIG. 5 schematically shows a block diagram of a video generating apparatus according to an embodiment of the present disclosure; and
FIG. 6 schematically illustrates a block diagram of an electronic device adapted to implement a video generation method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B, and C, etc." is used, such a convention should generally be interpreted as one skilled in the art would understand it (e.g., "a system having at least one of A, B, and C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
In embodiments of the present disclosure, the collection, updating, analysis, processing, use, transmission, provision, disclosure, storage, etc. of the data involved (including, but not limited to, user personal information) all comply with relevant legal regulations, are used for legal purposes, and do not violate public order and good morals. In particular, necessary measures are taken for the user's personal information to prevent illegal access to the user's personal information data and to maintain the user's personal information security and network security.
In embodiments of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
In the process of generating a commodity display video in an e-commerce scenario, the related art first generates a video script by manually designing each shot of the entire commodity display video, and then generates a video template according to the video script. The video template in the related art is similar to a slide template: when a new commodity display video needs to be made, the videos, pictures, and text in the video template must be replaced with the videos, pictures, and text related to the new commodity.
The method for generating commodity display videos in the related art has at least the following problems. Commodity display videos are generated from given materials, a fixed video generation script, and a fixed video template in combination with a video generation tool, and the text content, video format, and picture format related to the commodity display (such as the length of the text, the size of the video, and the aspect ratio of the pictures) must be provided strictly according to the requirements of the video template; otherwise the commodity display video cannot be generated, so the flexibility of commodity display video generation is poor. Moreover, when the display videos of different commodities all use the same fixed video template, the display videos of different commodities suffer from homogeneity, which degrades the viewing experience when watching them. Although manually designing different video templates can alleviate the homogeneity of commodity display videos and the poor user experience to a certain extent, the number of video templates and their timeliness limit the diversity of commodity display video production.
In view of this, the embodiments of the present disclosure provide a video generation method. The video generation method includes: determining text features according to received text content for describing a target object; determining the identification of a target material image from an image library according to the text features, where the image library includes a plurality of material images and image features of the plurality of material images, and the similarity between the image feature of the target material image and the text features is greater than a predetermined threshold; inputting the text content, the identification, and a predetermined script template as prompt information into a generative language model to generate a video generation script; and generating a first target video by running the video generation script.
Fig. 1 schematically illustrates an application scenario diagram of a video generation method, apparatus, electronic device, storage medium and program product according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the first terminal device 101, the second terminal device 102, or the third terminal device 103, to receive or send messages and the like. Various communication client applications may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the video generating method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the video generating apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The video generation method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the video generating apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The video generating method according to the embodiment of the present disclosure will be described in detail below with reference to fig. 2 to 4 based on the scenario described in fig. 1.
Fig. 2 schematically shows a flowchart of a video generation method according to an embodiment of the present disclosure.
As shown in FIG. 2, the video generating method 200 includes operations S210-S240.
In operation S210, a text feature is determined according to the received text content for describing the target object.
In operation S220, the identification of the target material image is determined from the image library according to the text features. The image library includes a plurality of material images and image features of the plurality of material images, and the similarity between the image feature of the target material image and the text features is greater than a predetermined threshold.
In operation S230, the text content, the identification, and the predetermined script template are input as prompt information into the generative language model to generate a video generation script.
In operation S240, a first target video is generated by running a video generation script.
According to embodiments of the present disclosure, the target object may be a product, a brand, or a commodity of a specific subject that needs to be displayed. The text content is the driving text input by the user for describing the target object; the text content is processed, and key information or representative features extracted from the text content for describing the target object are determined as the text features.
According to an embodiment of the present disclosure, the image library includes a plurality of material images and a plurality of image features of the plurality of material images. The material images may be image resources used in the video generation process, such as images prepared in advance that relate to a specific subject or commodity, and may include commodity pictures, commodity usage scene pictures, background pictures, and the like. The identification of the target material image may be a specific number, file name, or the like of the target material image in the image library, used for accurately locating, retrieving, and referencing the target material image. The predetermined threshold may be a value set in advance for the similarity comparison between the image feature of the target material image and the text features, used for determining whether that similarity is sufficiently high to ensure that the selected target material image matches the text features of the text content.
According to the embodiment of the present disclosure, the identification of the target material image is determined from the image library based on the text features, and the target material image matched with the text features is obtained, so that the target material image retrieved by its identification when running the video generation script matches the text features, improving the quality of the generated video.
According to an embodiment of the present disclosure, the predetermined script template may be a script template predefined for the video generation process. The predetermined script template may include text placeholders, which identify where text content should appear and are replaced by specific text content when the video is generated, and image placeholders, which identify where the target material image should be inserted and are replaced by a specific target material image when the video is generated. A generative language model is an artificial intelligence model whose primary task is to generate text or other forms of data having a language structure.
According to the embodiment of the present disclosure, the generative language model can generate the video generation script according to the received prompt information, such as the text content describing the target object, the identification of the target material image, and the predetermined script template.
According to an embodiment of the present disclosure, a video generation script is generated by calling the generative language model and inputting the text content, the identification of the target material image, and the predetermined script template into the generative language model as prompt information. For example, the subtitle content, i.e., the first text paragraph, may be: "Brand A is well known; a big brand is trustworthy and its quality is guaranteed. The elderly need to supplement multiple vitamins; with Brand A multivitamin and multi-mineral tablets, one tablet a day supplements daily needs and is better suited to the physique of Chinese people." The prompt for the generative language model may be: "Please write a video generation script based on the following copy: 'Brand A is well known; a big brand is trustworthy and its quality is guaranteed. The elderly need to supplement multiple vitamins; with Brand A multivitamin and multi-mineral tablets, one tablet a day supplements daily needs and is better suited to the physique of Chinese people.' The copy must appear as subtitles on the first target video. Return the video generation script in json format, where each scene includes the following fields: a stock keeping unit field (sku) used for material retrieval; a content field (Content) used for the subtitle of each video clip in the first target video; and a description field (Description) used for describing the target material image of each scene. Return only the json result; no other content is required."
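The following is a minimal sketch of this prompting step. It assumes an OpenAI-compatible chat-completion endpoint; the model name, prompt wording, and helper names are illustrative placeholders and are not specified by the present disclosure.

```python
# Sketch of operation S230: build the prompt from the text content, the target
# material image identifications, and a predetermined script template, then ask
# a generative language model for a json video generation script.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCRIPT_TEMPLATE = (
    "Return the video generation script in json format as a list of scenes. "
    "Each scene must contain the fields: sku (used for material retrieval), "
    "Content (the subtitle of the video clip), and Description (a description "
    "of the scene's target material image). Return only the json result."
)

def generate_video_script(text_content: str, image_ids: list) -> list:
    prompt = (
        f"Please write a video generation script based on the following copy: "
        f'"{text_content}". The copy must appear as subtitles on the video. '
        f"Candidate material image identifications: {image_ids}. "
        f"{SCRIPT_TEMPLATE}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # the prompt instructs the model to return bare json
    return json.loads(response.choices[0].message.content)
```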
According to an embodiment of the present disclosure, by running the video generation script, subtitles are generated from the text content, and a commodity display video conforming to the description of the target object, i.e., the first target video, is generated with the target material image as material.
According to the embodiment of the present disclosure, the generative language model can flexibly process different text content as input, flexibly match the text content with target material images, and generate diversified first target videos, improving the flexibility and diversity of commodity display video generation. Compared with the related art, which requires materials and formats to be provided strictly according to fixed requirements, the video generation method provided by the present disclosure adopts a generative language model instead of relying on a fixed video generation script and video template, and can therefore generate commodity display videos of different styles and forms more flexibly, improving the flexibility of video generation.
According to an embodiment of the present disclosure, the image library is constructed by: performing scene detection on a material video to obtain a detection result, where the material video includes a plurality of display scenes corresponding to a material object; segmenting the material video based on the detection result to obtain a plurality of video clips; performing frame extraction processing on each video clip to obtain a material image of a single scene displaying the material object; processing the material image to obtain a material image feature; and constructing the image library according to the material images and the material image features.
According to an embodiment of the present disclosure, the material video includes a plurality of display scenes corresponding to the material object. Scene detection is performed on the material video through a video scene detection algorithm to obtain a detection result, so that the material video can be divided into smaller video clips, each of which contains no scene switching.
According to the embodiment of the disclosure, the material video is segmented according to the detection result, and one long video can be segmented into different video segments according to the video content to obtain a plurality of video segments. Each of the plurality of video clips contains only one scene.
According to the embodiment of the present disclosure, the dimensions of the material video and the material image differ: the material image has a length and a width and is two-dimensional, while the material video has not only a length and a width but also a time dimension and is three-dimensional. The input of the multimodal pre-training model can only be two-dimensional material images, so after the material video is segmented into a plurality of video clips, frame extraction processing needs to be performed on each of the video clips to obtain material images.
According to the embodiment of the present disclosure, a material image of a single scene displaying the material object may be obtained by sampling frames from each video clip at regular intervals. Alternatively, the intermediate frame of each video clip may be selected for frame extraction, and the material image corresponding to the intermediate frame is used as the input of the multimodal pre-training model for feature extraction, obtaining a material image feature that can represent the material video.
According to the embodiment of the disclosure, processing the material images includes performing quality evaluation processing on the material images, that is, evaluating the quality of the material images to remove the material images with lower quality.
According to embodiments of the present disclosure, processing the material images further includes performing deduplication processing on the material images: visually similar or duplicate material images may be identified and removed by an image deduplication algorithm, ensuring that only one copy of each group of similar material images remains in the image library.
According to the embodiment of the present disclosure, for the material images that have undergone quality evaluation processing and deduplication processing, image feature extraction is performed through the multimodal pre-training model to obtain the material image features. When the resolution of a material image is 1920×1080 and data compression is not considered, 1920×1080 = 2073600 floating point numbers need to be stored for the material image. After feature extraction through the multimodal pre-training model, only 512 floating point numbers need to be stored. The multimodal pre-training model can therefore reduce the dimensionality of the material image and improve storage efficiency.
According to the embodiment of the disclosure, an image library is constructed based on storage of material images and material image features, extraction of the material image features, correspondence between identification of the material image features and the material images, and establishment of an index structure of a material object.
According to the embodiment of the present disclosure, scene detection identifies the multiple display scenes in the material video, which facilitates understanding of the video content, and segmentation divides the material video into smaller units, namely video clips, which can be processed and edited flexibly. The construction of the image library facilitates subsequent retrieval and calling of target material images and improves the efficiency of generating the first target video.
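The following is a sketch of this image-library pipeline: scene detection, segmentation, middle-frame extraction, and 512-dimensional feature extraction with a multimodal pre-training model. The library choices (PySceneDetect, OpenCV, CLIP ViT-B/32) are illustrative assumptions; the patent does not name concrete tools, and quality evaluation is only indicated by a comment.

```python
# Sketch of image library construction: detect scenes, take the middle frame
# of each clip as the single-scene material image, and embed it as a 512-dim
# feature vector (matching the 512 floats mentioned in the description).
import cv2
import torch
from PIL import Image
from scenedetect import detect, ContentDetector
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def build_image_library(video_path: str) -> list:
    library = []
    scenes = detect(video_path, ContentDetector())  # scene boundary detection
    cap = cv2.VideoCapture(video_path)
    for i, (start, end) in enumerate(scenes):
        # middle frame of the clip as the material image for this scene
        mid = (start.get_frames() + end.get_frames()) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, mid)
        ok, frame = cap.read()
        if not ok:
            continue
        # quality evaluation and deduplication of the frame would go here
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            feat = model.get_image_features(
                **processor(images=image, return_tensors="pt"))
        feat = torch.nn.functional.normalize(feat, dim=-1)[0]  # 512 floats
        library.append({"id": f"{video_path}#scene{i}",
                        "image": image, "feature": feat})
    cap.release()
    return library
```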
According to an embodiment of the present disclosure, determining the identification of the target material image from the image library according to the text feature includes: calculating similarities between the text feature and the plurality of image features respectively; determining a target material image feature from the plurality of material image features according to the plurality of similarities; and determining the target material image corresponding to the target material image feature and determining the identification of the target material image.
According to embodiments of the present disclosure, a plurality of similarities between the text feature and the plurality of image features may be calculated based on a similarity function, where the similarity function measures the degree of similarity between the text feature and each of the plurality of image features. The similarity function can be constructed based on various similarity measures such as cosine similarity, Euclidean distance, and Manhattan distance.
According to the embodiments of the present disclosure, according to the plurality of similarities, the image feature with the highest similarity to the text feature may be selected as the target material image feature, or a plurality of image features whose similarity to the text feature is greater than a predetermined threshold may be selected as target material image features. The predetermined threshold may be a similarity threshold set in advance, which ensures that image features with higher similarity are screened out, so that the target material image features are sufficiently similar to the text feature to meet the requirement of video generation.
According to the embodiments of the present disclosure, the cosine distance between the text feature and each of the plurality of image features may be calculated based on cosine similarity, and the image feature with the smallest cosine distance to the text feature may be determined as the target material image feature.
According to the embodiment of the present disclosure, after the target material image feature is determined, the target material image matching the target object is determined from the plurality of material images according to the target material image feature, and its identification is determined, which ensures that the target material image can be accurately retrieved and called from the plurality of material images.
According to the embodiment of the disclosure, based on the determination of the image characteristics of the target material, the association of the text characteristics and the image characteristics is realized, an effective image retrieval and matching mechanism is provided for the generation of the first target video, the accuracy and efficiency of video generation are improved, and the generated first target video is ensured to be matched with the text content input by the user.
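A minimal sketch of this retrieval step follows, computing cosine similarity between one text feature and all stored image features and returning the identification of the best match above a predetermined threshold. The threshold value and function name are illustrative assumptions.

```python
# Sketch of operation S220: cosine-similarity retrieval over the image library
# built above (each entry holds an L2-normalized 512-dim "feature" tensor).
import torch

def retrieve_material_id(text_feature: torch.Tensor,
                         library: list,
                         threshold: float = 0.25):
    feats = torch.stack([item["feature"] for item in library])  # (N, 512)
    text = torch.nn.functional.normalize(text_feature, dim=-1)
    sims = feats @ text            # dot product of unit vectors = cosine similarity
    best = int(sims.argmax())
    if sims[best] < threshold:
        return None                # no material image is similar enough
    return library[best]["id"]
```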
Fig. 3 schematically illustrates a system frame diagram of a video generation method according to an embodiment of the present disclosure.
As shown in fig. 3, the system framework 300 of the video generation method mainly includes three branches: an image library branch 310, a commodity description branch 320, and a first target video generation branch 330.
According to an embodiment of the present disclosure, the image library branch 310 is used to screen high-quality image and video materials from a huge amount of commodity materials and to extract target material image features for matching with text content. The image library branch 310 includes the material video; a plurality of video clips obtained by scene detection and segmentation of the material video; material images obtained by frame extraction from each of the video clips; a video screening engine for performing quality evaluation and deduplication on the material images; a multimodal pre-training model for extracting image features from the screened material images; the material image features obtained by image feature extraction; the target material image feature determined from the plurality of material image features; and the identification of the target material image.
According to an embodiment of the present disclosure, the commodity description branch 320 is used to extract text features from the text content describing the target object entered by the user and to retrieve the target material image features corresponding to the text features. The commodity description branch 320 includes the text content entered by the user to describe the target object, a plurality of text paragraphs resulting from splitting the text content, a multimodal pre-training model that encodes the text in the text paragraphs into text features, and the text features.
According to an embodiment of the present disclosure, the first target video generation branch 330 is used for automated video production from the retrieved target material images. The first target video generation branch 330 includes a material retrieval engine, a retrieval result, a video production engine, a video generation script, and the first target video. The material retrieval engine takes the text features from the commodity description branch 320 and the target material image features from the image library branch 310, determines the target material image matched with each text feature based on the similarity function, and stores the matched target material image in the retrieval result. By calling the material retrieval engine in a loop, target material images matching all the text features in the text content can be found; when the loop ends, a number of target material images equal to the number of text features is obtained. The video production engine includes the generative language model; the text content, the identifications of the target material images, and the predetermined script template are input into the generative language model as prompt information, and a video generation script matching the text content and the retrieved target material images is generated. The video generation script is run, the first text paragraph among the text paragraphs is used as the subtitle of the target material image, the retrieved target material image is used as the production material of the first target video, and the first target video is generated.
According to embodiments of the present disclosure, target material images matching all the text features in the text content may be found through the multimodal pre-training model. The multimodal pre-training model may be a multimodal image-text pre-training model: it aligns the target material image features with the text features so that they can be matched, improving the degree of matching between them. The model can also be trained and fine-tuned on private or public data sets to better retrieve target material images matching all the text features in the text content.
According to embodiments of the present disclosure, a digital human may also be generated as material for the first target video generation branch 330 based on existing digital human technology, using the text content in combination with a digital human engine.
According to an embodiment of the present disclosure, determining the text features from the received text content for describing the target object includes: splitting the text of the text content to obtain a plurality of text paragraphs, and performing feature extraction on the plurality of text paragraphs to determine the text features.
According to embodiments of the present disclosure, the text content may be split by detecting paragraph separators, by end-of-sentence punctuation, by identifying keywords in the text content, or by semantic analysis. Splitting the text content yields a plurality of text paragraphs.
According to the embodiment of the disclosure, the feature extraction of the key information can be performed on each text paragraph in the plurality of text paragraphs, and the key information extracted from the text paragraphs is integrated into the text feature.
According to embodiments of the present disclosure, text in a text paragraph may also be encoded as text features by a text encoder in a multimodal pre-training model.
According to the embodiment of the disclosure, the text content is split into the plurality of text paragraphs by splitting the text content, so that the text content is processed more finely, and each text paragraph is processed more flexibly and accurately. By extracting the characteristics of a plurality of text paragraphs, determining the text characteristics is beneficial to extracting key information in text contents, so that the identification of a subsequent determined target material image can be performed based on the text characteristics, and the effectiveness and accuracy of describing a target object are improved.
According to an embodiment of the present disclosure, splitting the text content to obtain the plurality of text paragraphs includes: performing semantic analysis on the text content to obtain a semantic analysis result, and splitting the text content based on the semantic analysis result to obtain the plurality of text paragraphs.
According to embodiments of the present disclosure, the text content may be semantically analyzed by invoking a generative language model. The text to be analyzed, i.e., the text content, is input into the generative language model, which has learned from a large amount of language data to understand the semantics of the input text; semantic analysis is performed on the text content to obtain a semantic analysis result.
According to the embodiment of the present disclosure, based on the semantic analysis result, the generative language model is called to split the text content, obtaining a plurality of text paragraphs. The plurality of text paragraphs may include a first text paragraph and a second text paragraph.
According to the embodiment of the disclosure, the semantics of the text content can be further understood through semantic analysis, so that the understanding degree of the text content is improved, and the key information can be captured more accurately.
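A sketch of determining text features follows. The punctuation-based split is a simple stand-in for the semantic splitting done with a generative language model above, and the CLIP text encoder is an illustrative choice of multimodal pre-training model; both are assumptions, not the patent's mandated implementation.

```python
# Sketch of operation S210: split the text content into paragraphs, then
# encode each paragraph into a text feature with the text encoder of a
# multimodal pre-training model (one feature per paragraph).
import re
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_features(text_content: str):
    # split on Chinese or Western sentence-ending punctuation
    paragraphs = [p.strip() for p in re.split(r"[。！？.!?]", text_content)
                  if p.strip()]
    inputs = processor(text=paragraphs, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return paragraphs, feats  # feats[i] is the text feature of paragraphs[i]
```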
Fig. 4 schematically illustrates a flowchart of a first target video generation method according to an embodiment of the disclosure.
As shown in FIG. 4, the first target video generation method 400 includes operations S410-S440.
In operation S410, the identification of the target material image recorded in the material image identification field in the video generation script is read by running the video generation script.
In operation S420, the target material image is called from the image library according to the identification.
In operation S430, the first text paragraph corresponding to the identifier, which is determined according to the first correspondence relationship, is used as a subtitle of the target material image.
In operation S440, a first target video is generated from the target material image and the subtitle.
According to an embodiment of the present disclosure, the video generation script includes a first correspondence between the first text paragraph and the identification of the target material image. By running the video generation script, the identification of the target material image recorded in the script is read, and the target material image is called from the image library according to the identification. According to the first correspondence between the first text paragraph and the identification of the target material image, the first text paragraph can be used as the subtitle of the target material image through a computer-language tool for video production and editing, and the target material image is used as the production material of the first target video, thereby generating the first target video.
According to the embodiment of the present disclosure, an audio library may also be prepared in advance, and music may be randomly selected from the audio library and added to the first target video as its soundtrack.
According to the embodiment of the present disclosure, the first correspondence between the first text paragraph and the identification of the target material image ensures that the generated first target video matches the target material image. Based on this first correspondence, the first text paragraph is embedded into the generated first target video as the subtitle of the target material image, adding text information to the first target video and helping users better understand its content.
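The following sketch covers operations S410-S440: walk the generated script, fetch each target material image by its identification, overlay the first text paragraph as a subtitle, and concatenate the clips, with an optional randomly selected soundtrack. moviepy 1.x is an illustrative video-production tool (TextClip additionally requires ImageMagick); clip durations, fonts, and the dictionary-based image lookup are assumptions.

```python
# Sketch of running the video generation script to produce the first target
# video: one composited clip per scene, then concatenation and a soundtrack.
import random
from moviepy.editor import (AudioFileClip, CompositeVideoClip, ImageClip,
                            TextClip, concatenate_videoclips)

def produce_video(script: list, image_paths: dict,
                  audio_library: list, out_path: str = "target.mp4"):
    clips = []
    for scene in script:
        base = ImageClip(image_paths[scene["sku"]]).set_duration(3)
        subtitle = (TextClip(scene["Content"], fontsize=36, color="white")
                    .set_duration(3)
                    .set_position(("center", "bottom")))
        clips.append(CompositeVideoClip([base, subtitle]))
    video = concatenate_videoclips(clips)
    if audio_library:  # soundtrack randomly selected from the prepared library
        music = AudioFileClip(random.choice(audio_library)).subclip(0, video.duration)
        video = video.set_audio(music)
    video.write_videofile(out_path, fps=24)
```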
According to an embodiment of the present disclosure, the video generation method further includes: calling the digital human engine pointed to by the target field, controlling the digital human engine to broadcast, in the form of a digital human, the second text paragraph corresponding to the target field determined according to the second correspondence, and generating a second target video.
According to the embodiment of the present disclosure, the digital human engine can be called through the target field based on the second correspondence between the second text paragraph and the target field. The second text paragraph is used as the input of the digital human engine, the speech synthesis technology of the digital human engine converts the text into speech, and speech of the digital human broadcasting the second text paragraph is generated. The generated speech is part of the video generation and may be combined with the target material image to generate the second target video.
According to embodiments of the present disclosure, the digital human engine may be implemented by a variety of digital human driving techniques, and the digital human itself may be generated by digital human rendering techniques.
According to the embodiment of the present disclosure, the speech synthesis technology of the digital human engine converts the text information of the second text paragraph into speech, adding a voice element to the second target video and enriching its multimodal presentation. By calling the digital human engine, the speech is integrated into the second target video in the form of a digital human broadcast, which is more vivid and natural, adds a humanized element to the second target video, and improves the user's viewing experience.
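A sketch of the second-target-video path follows. pyttsx3 is used as a plain text-to-speech stand-in for the digital human engine described above; a real digital human engine would additionally render a speaking avatar, which is out of scope for this sketch.

```python
# Sketch of the digital-human broadcast path: synthesize speech for the
# second text paragraph and attach it to a material image as the video track.
import pyttsx3
from moviepy.editor import AudioFileClip, ImageClip

def broadcast_paragraph(second_paragraph: str, image_path: str,
                        out_path: str = "second_target.mp4"):
    engine = pyttsx3.init()
    engine.save_to_file(second_paragraph, "speech.wav")
    engine.runAndWait()                       # speech synthesis: text -> voice
    voice = AudioFileClip("speech.wav")
    clip = (ImageClip(image_path)
            .set_duration(voice.duration)
            .set_audio(voice))                # voice as the broadcast track
    clip.write_videofile(out_path, fps=24)
```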
Fig. 5 schematically shows a block diagram of a video generating apparatus according to an embodiment of the present disclosure.
As shown in fig. 5, the video generating apparatus 500 includes a first determining module 510, a second determining module 520, a first generating module 530, and a second generating module 540.
A first determining module 510 is configured to determine a text feature according to the text content received to describe the target object.
The second determining module 520 is configured to determine, according to the text features, the identification of the target material image from the image library. The image library includes a plurality of material images and image features of the plurality of material images, and the similarity between the image feature of the target material image and the text features is greater than a predetermined threshold.
The first generating module 530 is configured to input the text content, the identification, and the predetermined script template as prompt information into the generative language model to generate a video generation script.
The second generating module 540 is configured to generate the first target video by running a video generating script.
According to an embodiment of the present disclosure, the first determination module 510 includes a split sub-module and an extraction sub-module. The splitting submodule is used for splitting the text of the text content to obtain a plurality of text paragraphs. The extraction submodule is used for extracting characteristics of a plurality of text paragraphs and determining text characteristics.
According to an embodiment of the present disclosure, the splitting submodule includes a semantic analysis unit and a splitting processing unit. The semantic analysis unit is used for carrying out semantic analysis on the text content to obtain a semantic analysis result. The splitting processing unit is used for splitting the text content based on the semantic analysis result to obtain a plurality of text paragraphs.
According to an embodiment of the present disclosure, the second determination module 520 includes a calculation sub-module, a first determination sub-module, and a second determination sub-module. The computing submodule is used for computing similarity between the text features and the plurality of image features respectively. The first determining submodule is used for determining target material image characteristics from a plurality of material image characteristics according to the plurality of similarities. The second determining submodule is used for determining a target material image corresponding to the characteristics of the target material image and determining the identification of the target material image.
According to an embodiment of the present disclosure, the second generation module 540 includes a reading sub-module, a calling sub-module, a caption sub-module, and a first generation sub-module. The reading submodule reads the identification of the target material image recorded in the material image identification field in the video generation script by running the video generation script. And the calling sub-module is used for calling the target material image from the image library according to the identification. The subtitle submodule is used for taking the first text paragraph corresponding to the identifier, which is determined according to the first corresponding relation, as the subtitle of the target material image. The first generation sub-module is used for generating a first target video according to the target material image and the subtitle.
According to an embodiment of the present disclosure, the video generating apparatus 500 may further include a third generating module. The third generating module is used for calling the digital human engine pointed to by the target field, controlling the digital human engine to broadcast, in the form of a digital human, the second text paragraph corresponding to the target field determined according to the second correspondence, and generating the second target video.
According to an embodiment of the present disclosure, the video generating apparatus 500 may further include a construction module. The construction module includes a detection sub-module, a segmentation sub-module, a frame extraction sub-module, a processing sub-module, and a construction sub-module. The detection sub-module is used for performing scene detection on the material video to obtain a detection result, where the material video includes a plurality of display scenes corresponding to the material object. The segmentation sub-module is used for segmenting the material video based on the detection result to obtain a plurality of video clips. The frame extraction sub-module is used for performing frame extraction processing on each video clip to obtain a material image of a single scene displaying the material object. The processing sub-module is used for processing the material image to obtain the material image feature. The construction sub-module is used for constructing the image library according to the material images and the material image features.
Any number of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure, or at least part of the functionality of any number of them, may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system on package, or an application specific integrated circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, one or more of the modules, sub-modules, units, and sub-units according to embodiments of the present disclosure may be at least partially implemented as computer program modules which, when executed, may perform the corresponding functions.
For example, any number of the first determination module 510, the second determination module 520, the first generation module 530, and the second generation module 540 may be combined into one module/unit/sub-unit for implementation, or any one of the modules/units/sub-units may be split into multiple modules/units/sub-units. Alternatively, at least part of the functionality of one or more of these modules/units/sub-units may be combined with at least part of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to embodiments of the present disclosure, at least one of the first determination module 510, the second determination module 520, the first generation module 530, and the second generation module 540 may be implemented at least in part as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system on package, or an application specific integrated circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the first determination module 510, the second determination module 520, the first generation module 530, and the second generation module 540 may be at least partially implemented as a computer program module which, when executed, may perform the corresponding functions.
It should be noted that the video generating apparatus portion in the embodiments of the present disclosure corresponds to the video generating method portion in the embodiments of the present disclosure; for details of the video generating apparatus portion, reference may be made to the video generating method portion, which is not repeated herein.
Fig. 6 schematically illustrates a block diagram of an electronic device adapted to implement a video generation method according to an embodiment of the disclosure. The electronic device shown in fig. 6 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 6, an electronic device 600 according to an embodiment of the present disclosure includes a processor 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 601 may also include on-board memory for caching purposes. The processor 601 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.
In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. The processor 601 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 602 and/or the RAM 603. Note that the program may be stored in one or more memories other than the ROM 602 and the RAM 603. The processor 601 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in one or more memories.
According to an embodiment of the present disclosure, the electronic device 600 may also include an input/output (I/O) interface 605, which is also connected to the bus 604. The electronic device 600 may also include one or more of the following components connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 601. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
The present disclosure also provides a computer-readable storage medium, which may be included in the apparatus/device/system described in the above embodiments, or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to the embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 602 and/or RAM 603 and/or one or more memories other than ROM 602 and RAM 603 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program, the computer program comprising program code for performing the method provided by the embodiments of the present disclosure; when the computer program product runs on an electronic device, the program code causes the electronic device to implement the video generation method provided by the embodiments of the present disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed through the communication portion 609, and/or installed from the removable medium 611. The program code contained in the computer program may be transmitted by any appropriate medium, including but not limited to: wireless, wired, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, program code for carrying out the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions. Those skilled in the art will appreciate that the features recited in the various embodiments of the present disclosure may be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly recited in the present disclosure. In particular, features recited in various embodiments of the present disclosure may be combined and/or integrated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (11)

1. A video generation method, comprising:
determining text features according to received text content for describing a target object;
determining, from an image library, an identification of a target material image according to the text features, wherein the image library comprises a plurality of material images and image features of the plurality of material images, and a similarity between the image feature of the target material image and the text features is larger than a preset threshold;
inputting the text content, the identification, and a preset script template as prompt information into a generative language model to generate a video generation script; and
generating a first target video by running the video generation script.
2. The method of claim 1, wherein the determining text features according to the received text content for describing the target object comprises:
splitting the text content to obtain a plurality of text paragraphs; and
extracting features of the plurality of text paragraphs to determine the text features.
3. The method of claim 2, wherein the splitting the text content to obtain a plurality of text paragraphs comprises:
performing semantic analysis on the text content to obtain a semantic analysis result; and
splitting the text content based on the semantic analysis result to obtain the plurality of text paragraphs.
4. The method according to any one of claims 1-3, wherein the video generation script comprises a first correspondence between a first text paragraph and the identification of the target material image; and the generating a first target video by running the video generation script comprises:
reading the identification of the target material image recorded in a material image identification field in the video generation script by running the video generation script;
calling the target material image from the image library according to the identification;
taking the first text paragraph corresponding to the identification, which is determined according to the first correspondence, as a subtitle of the target material image; and
generating the first target video according to the target material image and the subtitle.
5. The method of claim 4, wherein the video generation script further comprises a second correspondence between a second text paragraph and a target field, the target field being used for calling a digital human engine to describe the text content; and the method further comprises:
calling the digital human engine pointed to by the target field, and controlling the digital human engine to broadcast, in the form of a digital human, the second text paragraph corresponding to the target field determined according to the second correspondence, to generate a second target video.
6. The method according to any one of claims 1-3 and 5, wherein the determining, from the image library, the identification of the target material image according to the text features comprises:
calculating similarities between the text features and the plurality of image features, respectively;
determining a target material image feature from the plurality of material image features according to the plurality of similarities; and
determining the target material image corresponding to the target material image feature, and determining the identification of the target material image.
7. The method according to any one of claims 1-3 and 5, wherein the image library is constructed by:
performing scene detection on a material video to obtain a detection result, wherein the material video comprises a plurality of presentation scenes corresponding to a material object;
segmenting the material video based on the detection result to obtain a plurality of video clips;
performing frame extraction processing on each video clip to obtain a material image of a single scene presenting the material object;
processing the material image to obtain a feature of the material image; and
constructing the image library according to the material images and the features of the material images.
8. A video generating apparatus comprising:
a first determination module for determining text features according to received text content for describing a target object;
a second determination module for determining, from an image library, an identification of a target material image according to the text features, wherein the image library comprises a plurality of material images and image features of the plurality of material images, and a similarity between the image feature of the target material image and the text features is larger than a preset threshold;
a first generation module for inputting the text content, the identification, and a preset script template as prompt information into a generative language model to generate a video generation script; and
a second generation module for generating a first target video by running the video generation script.
9. An electronic device, comprising:
One or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to implement the method of any one of claims 1 to 7.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202410749339.5A 2024-06-11 2024-06-11 Video generation method, apparatus, electronic device, storage medium, and program product Pending CN118433466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410749339.5A CN118433466A (en) 2024-06-11 2024-06-11 Video generation method, apparatus, electronic device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410749339.5A CN118433466A (en) 2024-06-11 2024-06-11 Video generation method, apparatus, electronic device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN118433466A true CN118433466A (en) 2024-08-02

Family

ID=92335207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410749339.5A Pending CN118433466A (en) 2024-06-11 2024-06-11 Video generation method, apparatus, electronic device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN118433466A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118784942A (en) * 2024-09-10 2024-10-15 深圳市明源云客电子商务有限公司 Video generation method, electronic device, storage medium and product


Similar Documents

Publication Publication Date Title
CN113010703B (en) Information recommendation method and device, electronic equipment and storage medium
US10650861B2 (en) Video summarization and collaboration systems and methods
JP7123122B2 (en) Navigating Video Scenes Using Cognitive Insights
CN111798879B (en) Method and apparatus for generating video
US10277946B2 (en) Methods and systems for aggregation and organization of multimedia data acquired from a plurality of sources
EP3477506B1 (en) Video detection method, server and storage medium
CN111368141B (en) Video tag expansion method, device, computer equipment and storage medium
US8959071B2 (en) Videolens media system for feature selection
CN111866610B (en) Method and apparatus for generating information
US20130006625A1 (en) Extended videolens media engine for audio recognition
CN113811884A (en) Retrieval Aggregation of Cognitive Video and Audio
US20170255625A1 (en) Computer-implemented method for providing multimedia content and device
US20250239094A1 (en) Image caption generation method, device, and computer storage medium
US20120191692A1 (en) Semantic matching by content analysis
JP2013525916A (en) Enhanced online video through content detection, search, and information aggregation
CN103207917B (en) The method of mark content of multimedia, the method and system of generation content recommendation
US20210117471A1 (en) Method and system for automatically generating a video from an online product representation
CN114845149B (en) Video clip method, video recommendation method, device, equipment and medium
US20180157657A1 (en) Method, apparatus, client terminal, and server for associating videos with e-books
CN112287168A (en) Method and apparatus for generating video
US11941885B2 (en) Generating a highlight video from an input video
CN118433466A (en) Video generation method, apparatus, electronic device, storage medium, and program product
US10123090B2 (en) Visually representing speech and motion
US20250258865A1 (en) Video data processing method and apparatus, device, and readable storage medium
CN118916519A (en) Data processing method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination