CN119169157A - A method, device, equipment and storage medium for generating virtual human video - Google Patents
A method, device, equipment and storage medium for generating virtual human video
- Publication number: CN119169157A
- Application number: CN202411180274.3A
- Authority
- CN
- China
- Prior art keywords
- video
- lower body
- movement
- animation
- moving
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/205—Three-dimensional [3D] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/40—Three-dimensional [3D] animation of characters, e.g. humans, animals or virtual beings
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Processing Or Creating Images (AREA)
Abstract
The embodiments of the invention disclose a method, an apparatus, a device, and a storage medium for generating a virtual human video. The method includes: obtaining input content, wherein the input content includes a speech text and a video script and is used to generate the dubbing and pictures of the video; determining at least one moving time point based on the speech text and the video script; adding at least one lower body moving animation at each moving time point to obtain an animation set corresponding to each moving time point; determining a target lower body moving animation sequence based on the animation set corresponding to each moving time point; and generating a virtual human whole-body moving video according to the target lower body moving animation sequence and the input content. The virtual human video generation method provided by the embodiments of the invention can prevent problems such as uncontrollable walking and foot sliding of the virtual human in the generated video, thereby improving the generation quality and display effect of the virtual human video.
Description
Technical Field
The embodiment of the invention relates to the technical field of video generation, in particular to a method, a device, equipment and a storage medium for generating a virtual person video.
Background
With the rise of the "metaverse" concept, virtual humans have attracted increasing attention in fields such as e-commerce, games, and animation, and their influence continues to grow. For example, in live-streaming video, a virtual human can replace a real person to introduce and explain products. This not only saves labor costs, but also binds the virtual anchor to the product, avoiding the loss to product sales caused by a real anchor leaving the job. In product recommendation, having a virtual human introduce the product instead of a real person not only attracts the attention of on-site viewers, but also enables sustained, high-quality, repeated presentations over long periods.
For a virtual human to replace a real person and win the trust of the audience, the virtual human's speech should be as close to a real person's as possible. Besides producing the same voice as a real person, the virtual human also needs to make corresponding gestures, expressions, and actions, and to move or walk within the video.
In the related art, a virtual human video can be generated automatically from audio input, but the walking or movement of the virtual human in the video is controlled entirely by the video generation software. This movement often differs from real human walking habits, and foot sliding easily occurs when the virtual human moves or walks, i.e., the virtual human's feet slide unnaturally over the ground, which seriously degrades video quality and display effect.
Disclosure of Invention
The embodiments of the invention provide a method, an apparatus, a device, and a storage medium for generating a virtual human video, which can prevent problems such as uncontrollable walking and foot sliding of the virtual human in the generated video, thereby improving the generation quality and display effect of the virtual human video.
In a first aspect, an embodiment of the present invention provides a method for generating a virtual person video, including:
obtaining input content, wherein the input content includes a speech text and a video script and is used to generate the dubbing and pictures of the video;
determining at least one moving time point based on the speech text and the video script;
adding at least one lower body moving animation at each moving time point to obtain an animation set corresponding to each moving time point, wherein the moving time point is the starting time point at which the lower body moving animation is played;
determining a target lower body moving animation sequence based on the animation set corresponding to each moving time point; and
generating a virtual human whole-body moving video according to the target lower body moving animation sequence and the input content.
In a second aspect, an embodiment of the present invention further provides a device for generating a virtual person video, including:
an input content acquisition module, configured to obtain input content, wherein the input content includes a speech text and a video script and is used to generate the dubbing and pictures of the video;
A moving time point determining module, configured to determine at least one moving time point based on the speech text and the video script;
The lower body moving animation adding module is used for adding at least one lower body moving animation at each moving time point to obtain an animation set corresponding to each moving time point, wherein the moving time point is the starting time point of the lower body moving animation;
The target lower body moving animation sequence determining module is used for determining a target lower body moving animation sequence based on the animation set corresponding to each moving time point;
and the virtual person whole body movement video generation module is used for generating a virtual person whole body movement video according to the target lower body movement animation sequence and the input content.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, enables the at least one processor to perform the virtual human video generation method according to the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer instructions, wherein the computer instructions, when executed, cause a processor to implement the virtual human video generation method according to the embodiments of the present invention.
In a fifth aspect, embodiments of the present invention further provide a computer program product, including a computer program, which when executed by a processor implements a method for generating a virtual person video according to embodiments of the present invention.
The embodiments of the invention disclose a method, an apparatus, a device, and a storage medium for generating a virtual human video. The method includes: obtaining input content, wherein the input content includes a speech text and a video script and is used to generate the dubbing and pictures of the video; determining at least one moving time point based on the speech text and the video script; adding at least one lower body moving animation at each moving time point to obtain an animation set corresponding to each moving time point; determining a target lower body moving animation sequence based on the animation set corresponding to each moving time point; and generating a virtual human whole-body moving video according to the target lower body moving animation sequence and the input content. The method first determines at least one moving time point based on the speech text and the video script, then determines the target lower body moving animation sequence from the lower body animations added at each moving time point, and finally generates the virtual human whole-body moving video based on the target sequence and the input content. This realizes precise control over the time points of the virtual human's movement, the lower body moving animations, and the whole-body motion, so that problems such as uncontrollable walking and foot sliding of the virtual human in the generated video can be prevented, improving the generation quality and display effect of the virtual human video.
Drawings
FIG. 1 is a flowchart of a method for generating a virtual human video in an embodiment of the present invention;
FIG. 2 is an exemplary diagram of dividing the time axis into movable periods, immovable periods, and movement-pending periods in this embodiment;
FIG. 3 is an exemplary diagram of adding lower body moving animations at moving time points in this embodiment;
FIG. 4 is a schematic structural diagram of a virtual human video generating apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
FIG. 1 is a flowchart of a method for generating a virtual human video according to a first embodiment of the present invention. The method is applicable to generating videos in which a virtual human moves or walks, and may be performed by a virtual human video generating apparatus, which may be implemented in the form of software and/or hardware and, optionally, by an electronic device such as a mobile terminal, a PC, or a server.
A virtual human may also be referred to as a virtual digital human, for example, a virtual news broadcaster, a virtual teacher, or a virtual anchor; this embodiment does not limit the application scenario of the virtual human. The virtual human in this embodiment may be a two-dimensional character image, a three-dimensional character model, or a character model of more dimensions, which is not limited here. With the scheme of this embodiment, after a user inputs at least one text and audio is generated from the text, a video of a virtual human explaining the text can be generated quickly, for example a news broadcast video, a product introduction video, or a popular science video. In addition, to enhance the display effect of the video, this embodiment also allows the user to input some video materials, such as images or videos, together with the text; these images or videos serve as display content while the virtual human in the generated video gives the explanation. Further, for convenience of user operation, the aforementioned video materials may be input in the form of still pictures, animated pictures (such as GIF images), videos, PPT documents, or PDF documents. It should be appreciated that each PPT document contains at least one slide, each slide corresponds to one picture, and text, WordArt, pictures, videos, etc. may be inserted into a slide. Similarly, each PDF document contains at least one PDF page, each page corresponds to one picture, and PDF pages contain elements such as text and pictures. In one application scenario, a user inputs a PPT document together with the corresponding speech text and video script, where the PPT document contains text, images, and videos; this embodiment can then generate a video of a virtual human explaining the PPT document based on the input content.
It should be noted that each slide in a PPT document has corresponding speaker notes, so the speech text and video script corresponding to the PPT document may be input as independent text, or as the speaker notes of the PPT document; this embodiment does not limit this.
As shown in FIG. 1, the method specifically includes the following steps:
S110, acquiring input content.
The input content includes a speech text and a video script and is used to generate the dubbing and pictures of the video. The speech text may be directly input by the user through a client, obtained from the client's local storage, or downloaded from a server; the way of obtaining the speech text is not limited here. The speech text may be the manuscript that the virtual human narrates, and may be expressed in a document format such as a TXT document, a WORD document, or a PDF document. The speech text is used to generate the dubbing of the video, i.e., the virtual human's audio. A video script may be understood as instructional content for video production, i.e., the video script is used to generate the pictures of the video. It at least includes the pause marks corresponding to the speech text, as well as display content marks, shot scale marks, shot switching marks, camera movement marks, and preset action marks. The video script is edited in advance by the user, i.e., its content is designed according to the user's personalized requirements.
Optionally, after the input content is obtained, the method further includes: generating the dubbing of the video based on the speech text and the corresponding pause marks, and using the time axis of the dubbing as the time axis of the video.
The pause marks can be used to mark blank silent periods between sentences or between phrases. A pause mark corresponding to the speech text may include the position of the pause and the pause duration. Generating the dubbing of the video based on the speech text and the corresponding pause marks means converting the speech text into audio containing the pauses and using that audio as the virtual human's dubbing. It should be appreciated that if audio is generated from the speech text alone, the pauses in the audio are determined mainly by the punctuation marks in the text, and the resulting dubbing has a strong "machine feel" that does not match human speaking habits. To make the generated dubbing more realistic, this embodiment allows pause marks to be set for the corresponding speech text. For example, a pause of 0.5 seconds is set where there is no punctuation but a short pause is needed, and a pause of 2 seconds is set where there is already a period but a longer pause is needed. Thus, on top of the punctuation-based pauses of the prior art, additional pause marks adjust the pace and pauses of the dubbing, making the dubbing as a whole more lifelike. Moreover, by adjusting the speaking speed and the pauses, the dubbing can achieve emotional expression.
In this embodiment, after the speech text and its corresponding pause marks are obtained, the speech text may be converted, according to a voice template selected by the user (e.g., male mid-range, male bass, female treble, etc.), into audio that matches the voice template and contains the pauses. The voice templates may be pre-designed voices available for the user to choose from, or may be generated from voices recorded by the user, and the voice template selected by the user should match the virtual human's image. That is, choosing a suitable timbre for the dubbing keeps the virtual human's image consistent in both vision and hearing, making the overall character richer and fuller.
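As a concrete illustration of this step, the following is a minimal Python sketch of pause-aware dubbing generation. It assumes pause marks are given as (character offset, seconds) pairs and uses a hypothetical tts_synthesize function standing in for any text-to-speech engine; neither the mark format nor that function is an interface specified by this disclosure.

```python
from pydub import AudioSegment  # audio container with millisecond-based operations

def tts_synthesize(text: str, voice_template: str) -> AudioSegment:
    """Placeholder for any TTS engine returning an AudioSegment (assumed interface)."""
    raise NotImplementedError

def generate_dubbing(speech_text, pause_marks, voice_template):
    """Split the speech text at each pause mark, synthesize the pieces, and join
    them with explicit silences so pause positions and durations are controllable."""
    segments, cursor = [], 0
    for offset, pause_seconds in sorted(pause_marks):
        segments.append(tts_synthesize(speech_text[cursor:offset], voice_template))
        segments.append(AudioSegment.silent(duration=int(pause_seconds * 1000)))
        cursor = offset
    segments.append(tts_synthesize(speech_text[cursor:], voice_template))
    # AudioSegment supports "+" as concatenation, so sum() joins the pieces in order
    dubbing = sum(segments[1:], segments[0])
    return dubbing  # the dubbing's duration then defines the video's time axis
```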
It should be noted that the main content of the virtual human video generated in this embodiment is the virtual human's explanation, so the dubbing generated from the speech text and the corresponding pause marks runs through the entire video: the dubbing and the video have the same duration and share the same time axis. Optionally, the input content may further include images, which may be pictures or videos; they appear in the generated virtual human video as presentation content that assists the virtual human's explanation.
It will be appreciated that on some formal occasions, such as product launch events and educational training classrooms, purely verbal explanation is monotonous, and pictures or videos are needed to support the explanation, deepen the audience's understanding, and leave a strong impression. Therefore, this embodiment also allows the user to input pictures or videos as explanation materials, which are combined with the virtual human's speech and actions to form a segment of high-quality video.
Based on the above description, this embodiment can generate the dubbing of the video from the speech text and the corresponding pause marks, and then generate the corresponding virtual human video from the dubbing. Thus, one possible implementation is to use the dubbing directly, instead of the speech text, as the input content, and to use the time axis of the dubbing as the time axis of the video.
Correspondingly, if the input content is the speech text, the video script may be marked within the speech text or on the time axis; if the input content is the dubbing, the video script is marked directly on the time axis.
S120, determining at least one moving time point based on the speech text and the video script.
A moving time point is a time point on the dubbing time axis at which the virtual human switches from a stopped state to a moving state (also called a walking state), i.e., the starting time point of a movement. Determining multiple moving time points on the dubbing time axis allows the virtual human to alternate between stopping and moving while presenting the explanation content.
As described in the Background, the related art can automatically generate a virtual human video from audio input, but the walking or movement of the virtual human in the video is controlled entirely by the video generation software, cannot be precisely controlled, and is prone to foot sliding. To solve this problem, this embodiment first controls the starting time points at which the virtual human walks or moves by determining the moving time points.
Based on the foregoing description of the dubbing, the whole virtual human video is generated by adding corresponding video pictures on top of the dubbing, where the video pictures specifically include elements such as the picture background and the virtual human image.
After the dubbing is generated, the time point and period corresponding to each word of the speech text can be located precisely on the time axis. Therefore, controlling the virtual human to perform corresponding actions at specified time points and periods can make the virtual human's speech more vivid.
In daily work and life, when a real person explains content, he or she usually walks or moves during pauses in the explanation or at certain intervals.
Correspondingly, in this embodiment, the moving time points are likewise determined based on sentence breaks in the speech text, the speaking rhythm of the dubbing, or a certain frequency.
Optionally, determining at least one moving time point based on the speech text and the video script may be done by marking at least one moving time point on the time axis based on the speech text and the video script.
Marking at least one moving time point on the time axis includes: marking the time point of a preset type of punctuation mark of the speech text on the time axis as a moving time point; or marking the time point of a breathing point (air port) in the dubbing on the time axis as a moving time point; or marking a moving time point every set duration on the time axis.
The preset types of punctuation marks may be marks that indicate the end of a sentence, such as periods and exclamation marks. As described above, the speech corresponding to each word of the speech text can be located at its time point and period on the time axis. Similarly, each punctuation mark of the speech text corresponds to a time point on the time axis, and the time points corresponding to the preset types of punctuation marks are marked as moving time points.
An air port may also be referred to as a breathing point, i.e., the time point at which the speaker takes a breath, and can be determined from the speaking rhythm of the dubbing. In this embodiment, the speaking rhythm of the dubbing is analyzed to obtain the time points of the breathing points on the time axis, and these time points are determined as moving time points. It should be noted that recognition of preset punctuation marks is performed on the speech text, whereas recognition of breathing points is performed on the dubbing: a pause or silence in the dubbing is determined as a breathing point, and the time point corresponding to the breathing point is marked as a moving time point.
The set duration may be a preset value and is not limited here. Specifically, one moving time point is marked every set duration on the time axis.
In addition to the ways listed above, moving time points may also be marked according to other preset rules, for example, by marking the time point corresponding to a specified word in the speech text as a moving time point; this embodiment does not limit this. A small sketch combining the three listed rules follows.
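The following Python sketch illustrates the three marking rules. It assumes character-level timestamps from a forced aligner and silence spans detected in the dubbing are available as inputs; their formats, the 0.3-second breathing-point threshold, and the integer interval are illustrative assumptions, not part of this disclosure.

```python
import re

SENTENCE_END = "。！？.!?"  # preset punctuation types: sentence-ending marks

def mark_moving_time_points(char_times, speech_text, silences, interval=None):
    """char_times: {char_offset: time_sec} from forced alignment (assumed input).
    silences: [(start_sec, end_sec)] silent spans detected in the dubbing.
    interval: optional integer number of seconds for the fixed-interval rule.
    Returns candidate moving time points on the dubbing time axis."""
    points = []
    # Rule 1: sentence-ending punctuation in the speech text
    for m in re.finditer(f"[{re.escape(SENTENCE_END)}]", speech_text):
        if m.start() in char_times:
            points.append(char_times[m.start()])
    # Rule 2: breathing points, i.e. sufficiently long pauses in the dubbing
    points.extend(start for start, end in silences if end - start >= 0.3)
    # Rule 3: one point every fixed interval, if such a rule is configured
    if interval is not None:
        total = max(char_times.values())
        points.extend(float(t) for t in range(interval, int(total), interval))
    return sorted({round(p, 2) for p in points})
```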
Optionally, after the input content is obtained, the method further includes: dividing the time axis into movable periods, immovable periods, and movement-pending periods based on the video script.
Besides the pause marks, the video script further includes at least one of a display content mark, a shot scale mark, a shot switching mark, a camera movement mark, and a preset action mark. A movable period is a period in which the virtual human may move; an immovable period is a period in which the virtual human is prohibited from moving; and a movement-pending period is a period in which the virtual human may move only if certain conditions are met, and is otherwise prohibited from moving. In the application scenario, while the virtual human is introducing content, there are preset conditions under which it need not or must not move, so the time axis needs to be divided into movable periods, immovable periods, and movement-pending periods.
The display content mark marks how content is displayed in a certain period, e.g., full-screen or non-full-screen; the shot scale mark marks the shot scale in a certain period, such as panorama, long shot, medium shot, or close-up; the shot switching mark marks a shot switch at a certain moment; the camera movement mark marks the camera movement mode in a certain period, such as pushing in, pulling out, craning, or panning; and the preset action mark marks a preset action performed by the virtual human in a certain period.
Dividing the time axis into movable periods, immovable periods, and movement-pending periods based on the video script may be done as follows: periods in which content is displayed full-screen, periods corresponding to shot switching marks, and periods corresponding to preset action marks are determined as immovable periods; periods whose shot scale is marked as panorama are determined as movement-pending periods; and the remaining periods on the time axis are determined as movable periods. For example, FIG. 2 is an exemplary diagram of dividing the time axis into movable periods, immovable periods, and movement-pending periods in this embodiment. As shown in FIG. 2, the overall duration of the time axis is 300 seconds, containing two immovable periods, from second 50 to second 80 and from second 200 to second 259, one movement-pending period from second 90 to second 100, and movable periods for the remaining time.
In this embodiment, the time axis is divided into movable periods, immovable periods, and movement-pending periods so that the virtual human cannot move during the immovable periods, strengthening control over when the virtual human walks or moves.
Specifically, during periods of full-screen display, the entire video picture displays the explanation material (pictures or videos) input by the user in full screen, the virtual human image is no longer shown, and the virtual human has no need to walk or move. During periods corresponding to shot switching marks, the video picture undergoes a shot switch; if the virtual human were walking or moving then, its position could jump instantaneously across the switch. During periods corresponding to preset action marks, the virtual human's upper body needs to perform preset actions, such as waving, making a finger heart, cupping the hands, or raising a hand; if the virtual human walked or moved while performing these actions, the whole-body limb movements would be uncoordinated. Therefore, this embodiment classifies the aforementioned periods as immovable periods and prohibits the virtual human from walking or moving during them.
During periods with a panoramic shot, the video picture shows the virtual human and the picture background comprehensively from a distance; the panoramic shot mainly presents the whole picture to the audience rather than focusing on the virtual human. If the virtual human walked or moved for a long time during such a period, the audience's attention would be drawn to the virtual human and away from other elements of the picture, such as the explanation material (pictures or videos) input by the user. Therefore, this embodiment classifies panoramic-shot periods as movement-pending periods, allowing the virtual human to walk or move briefly during them, but not for a long time.
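To make the division concrete, here is a small Python sketch, assuming the video script marks have already been parsed into tagged intervals; the mark names are illustrative assumptions, not identifiers defined by this disclosure.

```python
from dataclasses import dataclass

@dataclass
class ScriptMark:
    kind: str     # e.g. "fullscreen", "shot_switch", "preset_action", "panorama"
    start: float  # seconds on the time axis
    end: float

IMMOVABLE_KINDS = {"fullscreen", "shot_switch", "preset_action"}

def complement(intervals, total):
    """Gaps on [0, total] not covered by the given intervals."""
    gaps, cursor = [], 0.0
    for start, end in sorted(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total:
        gaps.append((cursor, total))
    return gaps

def divide_time_axis(marks, total_duration):
    immovable = [(m.start, m.end) for m in marks if m.kind in IMMOVABLE_KINDS]
    pending = [(m.start, m.end) for m in marks if m.kind == "panorama"]
    movable = complement(immovable + pending, total_duration)  # the rest is movable
    return movable, immovable, pending
```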
S130, adding at least one lower body moving animation at each moving time point to obtain an animation set corresponding to each moving time point.
A moving time point is the starting time point of a lower body moving animation, i.e., the added lower body moving animation starts playing at the moving time point. A lower body moving animation may be a pre-made animation of the virtual human's lower body moving or walking, stored in a material library. Each lower body moving animation carries information such as the starting position, target position, and duration of the movement. Different lower body moving animations may differ in the starting and/or target positions of the movement. For example, assuming the lower body moving animations cover movement of the virtual human among three position points a, b, and c, they may include a→b (5 seconds), a→c (7 seconds), b→a (10 seconds), b→c (8 seconds), c→a (7 seconds), and c→b (9 seconds).
It should be noted that in the related art, the walking or moving motion of the virtual human is generated directly by the video generation software from the audio input. Besides controlling the starting time points of walking or movement through the moving time points, this embodiment also prevents foot sliding by pre-making the lower body moving animations. It should be understood that multiple lower body moving animations are made in advance and stored in a material library, and when generating the virtual human video, an animation can only be selected from the material library to serve as the movement of the virtual human's lower body, so the starting position, target position, and duration of the movement are deterministic. Moreover, the prefabricated lower body moving animations have been carefully designed and repeatedly tested by designers, giving them high quality and good robustness.
That is, in the related art the walking or moving motion of the virtual human is generated directly from the audio input, whereas in this embodiment a prefabricated lower body moving animation is selected from the material library as the lower body motion of the virtual human's walking or movement, which effectively prevents foot sliding.
In this embodiment, when adding a lower body moving animation at a moving time point, whether the animation is added to that moving time point may be determined by whether the animation overlaps an immovable period or a movement-pending period.
Obtaining the animation set corresponding to each moving time point then includes: obtaining the movement duration of at least one lower body moving animation; determining the moving period of the lower body moving animation on the time axis from the movement duration and the moving time point; determining the overlapping results of the moving period corresponding to each lower body moving animation with the immovable periods and the movement-pending periods; and adding at least one lower body moving animation for each moving time point based on the overlapping results to obtain the animation set corresponding to each moving time point.
It should be appreciated that each lower body moving animation lasts for a certain time, i.e., corresponds to a movement duration, so the duration of the animation needs to be marked on the time axis to determine whether adding it at a given moving time point would cause the virtual human to move during an immovable period or a movement-pending period.
The moving period of a lower body moving animation on the time axis may be determined from the movement duration and the moving time point as follows: the moving time point is taken as the playback start of the animation on the time axis, and the movement duration is added to the moving time point to obtain the playback end of the animation on the time axis, which together give the animation's moving period on the time axis. For example, if the moving time point is second 50 and the movement duration of a lower body moving animation is 10 seconds, the animation's moving period on the time axis is from second 50 to second 60.
The overlapping results include overlapping and non-overlapping. In this embodiment, if the moving period of a lower body moving animation intersects an immovable period, the two overlap; if there is no intersection, they do not overlap. Likewise, if the moving period intersects a movement-pending period, the two overlap; otherwise they do not. Whether a lower body animation is added at a moving time point is determined based on the overlapping results of its moving period with the immovable periods and the movement-pending periods. For example, if an animation's moving period on the time axis is from second 50 to second 60 and an immovable period runs from second 57 to second 80, the two periods overlap, with the overlap running from second 57 to second 60, i.e., 3 seconds in total.
Specifically, adding at least one lower body moving animation for each moving time point based on the overlapping results may be done by adding, to the animation set corresponding to the moving time point, each lower body moving animation that satisfies either of the following conditions: its moving period overlaps neither an immovable period nor a movement-pending period; or its moving period overlaps a movement-pending period and the overlap duration is smaller than a set threshold.
The set threshold may be preset, for example, to any value between 5 and 10 seconds. In this embodiment, if the moving period of a lower body moving animation on the time axis overlaps neither an immovable period nor a movement-pending period, the animation is added to the animation set corresponding to the moving time point; if its moving period overlaps an immovable period, the animation is filtered out, i.e., not added to the animation set. If its moving period overlaps a movement-pending period, the overlap duration is obtained: if the overlap duration is smaller than the set threshold, the animation is added to the animation set corresponding to the moving time point; if the overlap duration is greater than or equal to the set threshold, the animation is filtered out.
As the foregoing makes clear, whenever a moving period overlaps an immovable period, that movement is not allowed, i.e., it cannot be performed at that moving time point. If the moving period overlaps a movement-pending period, whether the movement is allowed must be further determined from the overlap duration.
If the animation set corresponding to a moving time point is empty, that moving time point is not suitable for lower body movement and is deleted.
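A minimal Python sketch of this filtering step follows; the Animation record and the 5-second default threshold are illustrative assumptions rather than structures defined by this disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Animation:
    name: str
    duration: float   # movement duration in seconds
    start_pos: str    # starting position, e.g. "a"
    target_pos: str   # target position, e.g. "b"

def overlap_seconds(a, b):
    """Length of the intersection of two [start, end] intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def build_animation_sets(move_points, animations, immovable, pending, threshold=5.0):
    sets = {}
    for t in move_points:
        allowed = []
        for anim in animations:
            period = (t, t + anim.duration)
            if any(overlap_seconds(period, p) > 0 for p in immovable):
                continue  # overlaps an immovable period: filtered out
            if any(overlap_seconds(period, p) >= threshold for p in pending):
                continue  # overlaps a movement-pending period too long: filtered out
            allowed.append(anim)
        if allowed:  # an empty set means the point is unsuitable and is deleted
            sets[t] = allowed
    return sets
```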
FIG. 3 is an exemplary diagram of adding lower body moving animations at moving time points in this embodiment. As shown in FIG. 3, the overall duration of the time axis is 300 seconds, containing two immovable periods, from second 50 to second 80 and from second 200 to second 259, one movement-pending period from second 90 to second 100, and movable periods for the remaining time. Five moving time points are determined on the time axis, and all lower body moving animations matching the current picture scene and picture style are found in the preset material library; the moving periods corresponding to these animations are then determined at each of the five moving time points in turn. With each moving time point taken as the playback start on the time axis, adding each animation's movement duration to the moving time point determines that animation's moving period at that point. Repeating this at every moving time point yields, on the time axis, the moving period of each lower body moving animation at each of the five moving time points. In FIG. 3, none of the five moving periods overlaps an immovable period; the second moving period overlaps the movement-pending period, but the overlap duration is smaller than the preset threshold, so the lower body moving animations corresponding to the five moving periods are each added to the animation sets of their respective moving time points.
S140, determining a target lower body moving animation sequence based on the animation set corresponding to each moving time point.
The target lower body moving animation sequence is a sequence formed by selecting one lower body moving animation from the animation set of each moving time point, and is the animation sequence finally used to generate the virtual human moving video. In this embodiment, since the animation set of each moving time point contains one or more lower body moving animations, multiple lower body moving animation sequences can be formed, and the target sequence is finally determined from among them.
Optionally, the target lower body moving animation sequence may be determined based on the animation set corresponding to each moving time point as follows: one lower body moving animation is selected from the animation set corresponding to each moving time point and combined, based on movement continuity, to obtain at least one candidate lower body moving animation sequence, and the target sequence is determined from the at least one candidate sequence.
Movement continuity means that, for two adjacent moving animations, the target position of the earlier animation is the same as the starting position of the later one. Selecting one lower body moving animation from each animation set based on movement continuity proceeds as follows: traverse each lower body moving animation in the animation set of the first moving time point and record the traversed animation's target position; from the animation set of the second moving time point, select an animation whose starting position equals that target position; from the animation set of the third moving time point, select an animation whose starting position equals the target position of the animation selected at the second moving time point; and so on, until an animation has been selected from the animation set of the last moving time point, forming one candidate lower body moving animation sequence. After all animations in the first moving time point's set have been traversed, multiple candidate lower body moving animation sequences are obtained.
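This continuity-constrained enumeration can be sketched as a short backtracking routine, reusing the Animation record and the animation sets from the previous sketch; it is an illustrative sketch, not a definitive implementation.

```python
def candidate_sequences(sets):
    """Enumerate sequences picking one animation per moving time point such that
    each animation starts where the previous one ended (movement continuity).
    sets: {moving_time_point: [Animation]} as built by build_animation_sets."""
    order = sorted(sets)
    results = []

    def extend(index, last_pos, chosen):
        if index == len(order):
            results.append(list(chosen))
            return
        for anim in sets[order[index]]:
            # the first animation is unconstrained; later ones must chain on
            if last_pos is None or anim.start_pos == last_pos:
                chosen.append(anim)
                extend(index + 1, anim.target_pos, chosen)
                chosen.pop()

    extend(0, None, [])
    return results
```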
In this embodiment, the target sequence may be determined from the at least one candidate lower body moving animation sequence either by randomly selecting one candidate sequence as the target, or, to improve the quality of the generated video, by evaluating each candidate sequence and determining the target sequence from the evaluation results.
Optionally, the target sequence is determined from the at least one candidate sequence by evaluating each candidate lower body moving animation sequence against set evaluation indices to obtain its score, and determining the candidate sequence with the highest score as the target lower body moving animation sequence.
The set evaluation index comprises at least one of a first movement time, a time interval of two adjacent movements, a movement frequency and a movement end position.
The first movement time may be the starting time point of the first lower body moving animation. Specifically, evaluating a candidate sequence on the first movement time may be done by obtaining, for each candidate sequence, the deviation between its first movement time and a reference first movement time, and deriving the corresponding score from the deviation. The reference first movement time may be preset, and the score is negatively correlated with the deviation: the larger the deviation, the lower the score, and the smaller the deviation, the higher the score. The specific relationship between the first-movement-time score and the deviation is not limited in this embodiment.
The time interval between two adjacent movements is the interval between the end time of the earlier lower body moving animation and the start time of the later one. Evaluating a candidate sequence on this index may be done by obtaining the time interval between every two adjacent lower body moving animations in the sequence, determining the deviation of each interval from a reference time interval, deriving a score for each interval from its deviation, and finally taking the mean of these scores as the sequence's score for the adjacent-movement time interval. The reference time interval may be preset, and each interval's score is negatively correlated with its deviation: the larger the deviation, the lower the score. The specific relationship between the score and the deviation is not limited in this embodiment.
The movement frequency is the number of movements per unit time. Evaluating a candidate sequence on the movement frequency may be done by obtaining, for each candidate sequence, the movement frequency as the quotient of the number of movements and the total dubbing duration, determining the deviation of this frequency from a reference movement frequency, and deriving the sequence's movement-frequency score from the deviation. The reference movement frequency is preset, and the score is negatively correlated with the deviation: the larger the deviation, the lower the score. The specific relationship between the movement-frequency score and the deviation is not limited in this embodiment.
The end position is the position where the virtual human finally stops, i.e., the target position of the last lower body moving animation in the candidate sequence. Specifically, evaluating a candidate sequence on the end position may be done by obtaining, for each candidate sequence, the deviation of its end position from a reference end position, and deriving the corresponding score from the deviation. The reference end position may be preset, and the score is negatively correlated with the deviation: the larger the deviation, the lower the score. The specific relationship between the end-position score and the deviation is not limited in this embodiment.
In this embodiment, after the score for each set evaluation index is obtained, the scores are combined by weighted summation to obtain the score of the candidate lower body moving animation sequence, and finally the candidate sequence with the highest score is determined as the target lower body moving animation sequence.
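The weighted scoring can be sketched as follows, again reusing the Animation record from the earlier sketches. The reference values, equal weights, the 1/(1+deviation) sub-score, and the binary end-position score are illustrative assumptions; the disclosure only requires that each sub-score decrease with its deviation.

```python
def score_sequence(seq, move_points, total_duration,
                   ref_first=10.0, ref_interval=30.0, ref_freq=0.05,
                   ref_end_pos="a", weights=(0.25, 0.25, 0.25, 0.25)):
    """Score one candidate sequence; each sub-score is 1/(1+deviation) so that
    larger deviations from the reference yield lower scores."""
    def closeness(value, ref):
        return 1.0 / (1.0 + abs(value - ref))

    starts = sorted(move_points)
    ends = [t + a.duration for t, a in zip(starts, seq)]
    s_first = closeness(starts[0], ref_first)                 # first movement time
    gaps = [s - e for e, s in zip(ends, starts[1:])] or [ref_interval]
    s_gap = sum(closeness(g, ref_interval) for g in gaps) / len(gaps)
    s_freq = closeness(len(seq) / total_duration, ref_freq)   # movements per second
    s_end = 1.0 if seq[-1].target_pos == ref_end_pos else 0.0  # binary stand-in
    scores = (s_first, s_gap, s_freq, s_end)
    return sum(w * s for w, s in zip(weights, scores))

# the target sequence is then the highest-scoring candidate:
# target = max(candidates, key=lambda s: score_sequence(s, points, total))
```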
S150, generating a virtual human whole-body moving video according to the target lower body moving animation sequence and the input content.
The virtual human whole-body moving video is a whole-body video comprising a virtual human upper-body moving animation and a virtual human lower-body moving animation.
Generating the virtual human whole-body moving video from the target lower body moving animation sequence and the input content includes: generating the virtual human lower body moving video from the target sequence, and then generating the whole-body moving video from the lower body moving video and the input content.
Generating the virtual human lower body moving video from the target lower body moving animation sequence may be done by concatenating all the lower body moving animations in the target sequence in order, and, during concatenation, filling the video segment between two adjacent moving animations with several pre-designed lower-body in-place standing animations. Specifically, different in-place standing animations differ subtly from one another, and their durations also differ. When filling a segment, an in-place standing animation is chosen at random and concatenated after the preceding lower body moving animation; if the segment is still not filled, another in-place standing animation is chosen at random and concatenated after the previous one, and so on until the segment is filled. If the total duration of the concatenated in-place standing animations exceeds the segment's duration, the last standing animation can be trimmed so that the segment is filled exactly; in this way segments of arbitrary duration are filled and adjacent lower body moving animations are joined. In the generated lower body moving video, between two moving animations the virtual human stops at the target position of the earlier animation and holds an in-place standing posture until the next moving animation begins.
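The gap-filling logic can be sketched in a few lines of Python; the IdleClip record is an illustrative assumption for the pre-designed in-place standing animations.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class IdleClip:
    name: str
    duration: float  # seconds of in-place standing animation

def fill_gap(gap_seconds, idle_clips):
    """Fill the gap between two adjacent lower body moving animations with
    randomly chosen in-place standing clips, trimming the last clip so the
    gap is filled exactly (the clips differ subtly, so variety looks natural)."""
    filled, chosen = 0.0, []
    while filled < gap_seconds:
        clip = random.choice(idle_clips)
        take = min(clip.duration, gap_seconds - filled)  # trims only the final clip
        chosen.append((clip, take))  # (clip, seconds of it to play)
        filled += take
    return chosen
```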
In this embodiment, the method for generating the virtual human whole body moving video according to the virtual human lower body moving video and the input content may be that the virtual human lower body moving video and the input content are input into a first generation model trained in advance, and the virtual human whole body moving video is output.
Optionally, the method for generating the whole-body motion video of the virtual person according to the lower-body motion video of the virtual person and the input content can be that the upper-body motion video of the virtual person is generated according to the lower-body motion video of the virtual person and the input content, and the lower-body motion video of the virtual person and the upper-body motion video of the virtual person are fused to obtain the whole-body motion video of the virtual person.
The video frames of the virtual human upper body moving video correspond one-to-one with the video frames of the lower body moving video. In this embodiment, the virtual human upper body moving video may be generated from the lower body moving video and the input content by inputting the lower body moving video and the video's dubbing into a pre-trained second generation model, which outputs the upper body moving video.
In this embodiment, the virtual human lower body moving video and upper body moving video may be fused by inputting both into a pre-trained fusion model that outputs the virtual human whole-body moving video. Alternatively, the lower body and upper body moving videos are each decoded into lower body video frames and upper body video frames, each lower body frame is fused with its corresponding upper body frame to obtain a whole-body frame, and finally the whole-body frames are encoded to obtain the virtual human whole-body moving video.
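The decode-fuse-encode alternative can be sketched with OpenCV as follows. The fixed waist-line seam is a deliberately naive stand-in for the fusion step; the trained fusion model described above would blend the two frames far more carefully.

```python
import cv2
import numpy as np

def fuse_videos(lower_path, upper_path, out_path, waist_row=360):
    """Frame-by-frame fusion sketch: keep the upper body region from the upper
    body video and the lower body region from the lower body video, joined at
    an assumed fixed waist line (illustrative only)."""
    lower, upper = cv2.VideoCapture(lower_path), cv2.VideoCapture(upper_path)
    fps = lower.get(cv2.CAP_PROP_FPS)
    w = int(lower.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(lower.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok1, lo = lower.read()   # frames correspond one-to-one by construction
        ok2, up = upper.read()
        if not (ok1 and ok2):
            break
        frame = np.vstack([up[:waist_row], lo[waist_row:]])  # naive seam at waist
        writer.write(frame)
    for handle in (lower, upper, writer):
        handle.release()
    return out_path
```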
FIG. 4 is a schematic structural diagram of a virtual human video generating apparatus according to an embodiment of the present invention. As shown in FIG. 4, the apparatus includes:
An input content acquisition module 410, configured to acquire input content, where the input content includes a speech text and a video script, and is configured to generate dubbing and pictures of a video;
A moving time point determining module 420 for determining at least one moving time point based on the speech text and the video script;
The lower body moving animation adding module 430 is configured to add at least one lower body moving animation at each moving time point, and obtain an animation set corresponding to each moving time point, where the moving time point is a starting time point of the lower body moving animation;
a target lower body moving animation sequence determining module 440, configured to determine a target lower body moving animation sequence based on the animation set corresponding to each moving time point;
a virtual human whole body movement video generation module 450, configured to generate a virtual human whole-body moving video according to the target lower body moving animation sequence and the input content.
Optionally, the video script includes pause marks corresponding to the speech text, and the apparatus further includes a video time axis acquisition module configured to:
Generating dubbing of the video based on the speech text and the corresponding pause mark;
The dubbing time axis is taken as the video time axis.
Optionally, the moving time point determining module 420 is further configured to:
marking at least one moving time point on a time axis based on the speech text and the video script;
where marking at least one moving time point on the time axis includes at least one of:
marking the time point of a preset type of punctuation mark of the speech text on the time axis as a moving time point; or
marking the time point of a breathing point (air port) in the dubbing on the time axis as a moving time point; or
marking a moving time point every set duration on the time axis.
Optionally, the apparatus further includes a period dividing module configured to:
dividing the time axis into movable periods, immovable periods, and movement-pending periods based on the video script, where the video script further includes at least one of a display content mark, a shot scale mark, a shot switching mark, a camera movement mark, and a preset action mark.
Optionally, the lower body moving animation adding module 430 is further configured to:
acquire the movement duration of each of at least one lower body moving animation;
determine, according to the movement duration and the moving time point, the movement period of each lower body moving animation on the time axis;
determine the overlap between the movement period of each lower body moving animation and the immovable periods and movement-pending periods; and
add at least one lower body moving animation at each moving time point based on the overlap results, obtaining the animation set corresponding to each moving time point.
Optionally, the lower body moving animation adding module 430 is further configured to:
add to the animation set corresponding to a moving time point each lower body moving animation whose movement period overlaps neither an immovable period nor a movement-pending period, or whose movement period overlaps a movement-pending period by less than a set threshold, as in the sketch below.
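A sketch of this overlap test follows, with an assumed 0.5-second threshold:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def admissible_animations(animations, move_time, immovable, pending,
                          threshold=0.5):
    """Build the animation set for one moving time point.

    animations: list of dicts with at least a "duration" key;
    immovable / pending: lists of (start, end) periods;
    threshold: maximum tolerated overlap with a pending period
    (the 0.5 s value is an assumption).
    """
    selected = []
    for anim in animations:
        start, end = move_time, move_time + anim["duration"]
        # Never overlap an immovable period.
        if any(overlap(start, end, s, e) > 0 for s, e in immovable):
            continue
        # No pending overlap, or pending overlap below the threshold.
        pending_overlap = sum(overlap(start, end, s, e) for s, e in pending)
        if pending_overlap == 0 or pending_overlap < threshold:
            selected.append(anim)
    return selected
```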
Optionally, the target lower body moving animation sequence determining module 440 is further configured to:
select one lower body moving animation from the animation set corresponding to each moving time point based on moving continuity, and combine the selections into at least one candidate lower body moving animation sequence, where moving continuity means that, for two adjacent moving animations, the target position of the former is the same as the initial position of the latter (see the sketch after this list); and
determine a target lower body moving animation sequence from the at least one candidate lower body moving animation sequence.
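A depth-first enumeration under the continuity constraint might look as follows; allowing a time point to be skipped when no animation continues the current position is an added assumption that keeps the search from dead-ending.

```python
def candidate_sequences(animation_sets, move_points):
    """Enumerate candidate lower body moving animation sequences.

    animation_sets: one list of animations per moving time point; each
    animation is a dict with "name", "start_pos" and "end_pos" (positions
    are assumed to be comparable labels such as named stage spots).
    Moving continuity: the target position of the previous animation must
    equal the initial position of the next one.
    """
    sequences = []

    def extend(index, seq):
        if index == len(animation_sets):
            if seq:
                sequences.append(list(seq))
            return
        extend(index + 1, seq)  # assumed option: skip this time point
        for anim in animation_sets[index]:
            if seq and seq[-1][1]["end_pos"] != anim["start_pos"]:
                continue  # continuity violated; try another animation
            seq.append((move_points[index], anim))
            extend(index + 1, seq)
            seq.pop()

    extend(0, [])
    return sequences  # each element: list of (time, animation) pairs
```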
Optionally, the target lower body moving animation sequence determining module 440 is further configured to:
evaluate the at least one candidate lower body moving animation sequence based on set evaluation indices to obtain a score for each candidate sequence, where the set evaluation indices include at least one of the first movement time, the time interval between two adjacent movements, the movement frequency and the movement end position; and
determine the candidate lower body moving animation sequence with the highest score as the target lower body moving animation sequence. One possible scoring scheme is sketched below.
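Continuing the enumeration sketch above, the scoring terms below mirror the four listed evaluation indices, but the formulas, weights, the 15-second target spacing and the "center" end spot are all illustrative assumptions.

```python
def score_sequence(seq, timeline_end, weights=(1.0, 1.0, 1.0, 1.0)):
    """Score one candidate sequence of (time, animation) pairs; higher is better."""
    w_first, w_gap, w_freq, w_end = weights
    times = [t for t, _ in seq]
    # Earlier first movement keeps the avatar from looking static too long.
    first_term = w_first / (1.0 + times[0])
    # Even spacing between adjacent movements is rewarded.
    gaps = [b - a for a, b in zip(times, times[1:])]
    gap_term = w_gap / (1.0 + (max(gaps) - min(gaps))) if gaps else 0.0
    # A moderate movement frequency; assumed target: one move per 15 s.
    freq_term = w_freq / (1.0 + abs(len(times) / timeline_end - 1.0 / 15.0))
    # Ending near a preferred spot, here assumed to be stage centre.
    end_term = w_end if seq[-1][1]["end_pos"] == "center" else 0.0
    return first_term + gap_term + freq_term + end_term

def pick_target_sequence(candidates, timeline_end):
    """Select the highest-scoring candidate as the target sequence."""
    return max(candidates, key=lambda s: score_sequence(s, timeline_end))
```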
Optionally, the virtual human whole body moving video generation module 450 is further configured to:
generate a virtual human lower body moving video according to the target lower body moving animation sequence; and
generate the virtual human whole body moving video according to the virtual human lower body moving video and the input content.
Optionally, the virtual human whole body moving video generation module 450 is further configured to:
generate a virtual human upper body moving video according to the virtual human lower body moving video and the input content; and
fuse the virtual human lower body moving video with the virtual human upper body moving video to obtain the virtual human whole body moving video.
The apparatus can execute the methods provided by all the foregoing embodiments of the invention and has the corresponding functional modules and beneficial effects. For technical details not described in this embodiment, refer to the methods provided in those embodiments.
Fig. 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. They may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches) and other similar computing devices. The components shown here, their connections and relationships, and their functions are exemplary only and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in Fig. 5, the electronic device 10 includes at least one processor 11 and memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 can perform various appropriate actions and processes according to the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to one another via a bus 14; an input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including an input unit 16, such as a keyboard, mouse, etc., an output unit 17, such as various types of displays, speakers, etc., a storage unit 18, such as a magnetic disk, optical disk, etc., and a communication unit 19, such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller or microcontroller. The processor 11 performs the various methods and processes described above, such as the virtual human video generation method.
In some embodiments, the method of generating a virtual person video may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the virtual person video generation method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the method of generating the virtual person video in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or a middleware component (e.g., an application server), or a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), a blockchain network, and the Internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network; the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server (also called a cloud computing server or cloud host), a host product in the cloud computing service system that overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services.
The embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements a method of generating a virtual person video as provided by any of the embodiments of the present application.
When implementing the computer program product, the computer program code for carrying out operations of the present invention may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (14)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411180274.3A CN119169157B (en) | 2024-08-27 | 2024-08-27 | A method, device, equipment and storage medium for generating virtual human video |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119169157A true CN119169157A (en) | 2024-12-20 |
| CN119169157B CN119169157B (en) | 2025-03-25 |
Family
ID=93879813
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411180274.3A Active CN119169157B (en) | 2024-08-27 | 2024-08-27 | A method, device, equipment and storage medium for generating virtual human video |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119169157B (en) |
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7293235B1 (en) * | 1998-08-14 | 2007-11-06 | British Telecommunications Public Limited Company | Predicting avatar movement in a distributed virtual environment |
| CN116708920A (en) * | 2022-06-30 | 2023-09-05 | Beijing Shengshu Technology Co., Ltd. | Video processing method, device and storage medium for synthesizing virtual images |
| WO2024027661A1 (en) * | 2022-08-01 | 2024-02-08 | Alibaba (China) Co., Ltd. | Digital human driving method and apparatus, device and storage medium |
| US20240087199A1 (en) * | 2022-09-08 | 2024-03-14 | Samsung Electronics Co., Ltd. | Avatar UI with Multiple Speaking Actions for Selected Text |
| CN115797513A (en) * | 2023-02-01 | 2023-03-14 | Tencent Technology (Shenzhen) Co., Ltd. | An animation processing method, device, equipment, storage medium and program product |
| CN117711042A (en) * | 2023-11-22 | 2024-03-15 | China Mobile (Suzhou) Software Technology Co., Ltd. | Method and device for generating broadcast video of digital person based on driving text |
Non-Patent Citations (2)
| Title |
|---|
| XINYU SHI et al.: "Accurate and Fast Classification of Foot Gestures for Virtual Locomotion", IEEE, 30 December 2019 (2019-12-30) * |
| YI Pengfei: "Research on Human Motion Synthesis Methods Driven by Local Features of Data", China Doctoral Dissertations Full-text Database, Information Science and Technology, 15 May 2014 (2014-05-15) * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119835499A (en) * | 2024-12-30 | 2025-04-15 | 央视国际网络有限公司 | A video production method for digital human to automatically explain static PPT |
| CN119835499B (en) * | 2024-12-30 | 2026-02-13 | 央视国际网络有限公司 | A method for creating videos of digital humans automatically presenting static PPT slides. |
| CN120935407A (en) * | 2025-08-01 | 2025-11-11 | 魔珐(上海)信息科技有限公司 | Method, device, equipment, medium and product for updating digital human video |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119169157B (en) | 2025-03-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN119169157B (en) | A method, device, equipment and storage medium for generating virtual human video | |
| US20200125920A1 (en) | Interaction method and apparatus of virtual robot, storage medium and electronic device | |
| US20160110922A1 (en) | Method and system for enhancing communication by using augmented reality | |
| WO2022001593A1 (en) | Video generation method and apparatus, storage medium and computer device | |
| WO2021104242A1 (en) | Video processing method, electronic device, and storage medium | |
| US9898850B2 (en) | Support and complement device, support and complement method, and recording medium for specifying character motion or animation | |
| US20120276504A1 (en) | Talking Teacher Visualization for Language Learning | |
| CN119440254A (en) | A digital human real-time interaction system and a digital human real-time interaction method | |
| CN118229815B (en) | Video generation method, deep learning model training method, device, equipment and storage medium | |
| CN113282791B (en) | Video generation method and device | |
| CN113395569B (en) | Video generation method and device | |
| CN113806570B (en) | Image generation method and device, electronic device and storage medium | |
| CN112073749A (en) | Sign language video synthesis method, sign language translation system, medium and electronic equipment | |
| CN112528936B (en) | Video sequence arrangement method, device, electronic equipment and storage medium | |
| CN118138854A (en) | Video generation method, device, computer equipment and medium | |
| CN113920229A (en) | Method, device and storage medium for processing virtual character | |
| CN115499613A (en) | Video call method and device, electronic equipment and storage medium | |
| CN114140560B (en) | Animation generation method, device, equipment and storage medium | |
| CN119131647A (en) | Video processing method, device, electronic device, storage medium and program product | |
| CN119402727B (en) | Video generation method, device, equipment, storage medium and program product | |
| CN114500851A (en) | Video recording method and device, storage medium and electronic equipment | |
| CN119761503B (en) | Story generation method and apparatus, electronic device and storage medium | |
| CN113537162B (en) | A video processing method, device and electronic device | |
| CN120640095A (en) | Digital human video generation method, device, electronic device and storage medium | |
| CN121644929A (en) | Interactive digital person presenting method and related device based on bottom plate video |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||