CN116030166A - Animation data generation method and device, medium and computer equipment - Google Patents
- Publication number: CN116030166A
- Application number: CN202310094834.2A
- Authority: CN (China)
- Prior art keywords: sample, animation, sequence, initial, motion
- Legal status: Pending (assumed by Google; not a legal conclusion)
Abstract
The present disclosure provides an animation data generation method and apparatus, a medium, and a computer device. The method comprises: acquiring first animation data corresponding to a first video frame in a video sequence; and inputting the first animation data into a pre-trained motion prediction model to obtain second animation data, output by the motion prediction model, corresponding to the video frame following the first video frame in the video sequence. The motion prediction model is trained based on a plurality of sample animation sequences and first motion cycle sample features of the sample animation sequences; the first motion cycle sample feature of a sample animation sequence is extracted from its first sample feature, which is obtained by encoding the sample animation sequence through an encoder.
Description
Technical Field
The present disclosure relates to the field of animation technology, and in particular, to an animation data generation method and apparatus, a medium, and a computer device.
Background
Currently, animation is generally produced by generating an animation sequence from RGB video and driving an avatar with the resulting animation data. Because the quality of animation sequences generated this way is poor, a common approach to improving animation quality is to obtain a set of high-quality animation sequences, map them into a feature space to obtain high-quality animation features, and likewise map the animation sequence generated from the RGB video into the same feature space to obtain driving animation features. The high-quality animation feature most similar to the driving animation feature is then retrieved and inversely mapped back into an animation sequence, and the avatar is driven based on the animation sequence obtained from this inverse mapping. However, this approach takes a whole animation sequence as input, has poor real-time performance, and is not suitable for real-time scenarios.
Disclosure of Invention
In a first aspect of the embodiments of the present disclosure, there is provided an animation data generation method, comprising: acquiring first animation data corresponding to a first video frame in a video sequence; and inputting the first animation data and a first motion cycle feature corresponding to the first animation data into a pre-trained motion prediction model to obtain second animation data, output by the motion prediction model, corresponding to the video frame following the first video frame in the video sequence; wherein the motion prediction model is trained based on a plurality of sample animation sequences and first motion cycle sample features of the sample animation sequences, the first motion cycle sample feature of a sample animation sequence is extracted from the first sample feature of that sequence, and the first sample feature is obtained by encoding the sample animation sequence through an encoder.
In a second aspect of the embodiments of the present disclosure, there is provided an animation data generation apparatus, comprising: a first acquisition module configured to acquire first animation data corresponding to a first video frame in a video sequence; and a second acquisition module configured to input the first animation data and a first motion cycle feature corresponding to the first animation data into a pre-trained motion prediction model and obtain second animation data, output by the motion prediction model, corresponding to the video frame following the first video frame in the video sequence; wherein the motion prediction model is trained based on a plurality of sample animation sequences and first motion cycle sample features of the sample animation sequences, the first motion cycle sample feature of a sample animation sequence is extracted from the first sample feature of that sequence, and the first sample feature is obtained by encoding the sample animation sequence through an encoder.
A third aspect of embodiments of the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect described above.
In a fourth aspect of the disclosed embodiments, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect described above when executing the program.
In the embodiments of the present disclosure, a sample animation sequence is encoded by an encoder to obtain a first sample feature, and a first motion cycle sample feature is extracted from the first sample feature to train the motion prediction model. The trained model can then take as input only the first animation data corresponding to a single first video frame, together with the first motion cycle feature corresponding to that data, and predict the second animation data corresponding to the following video frame. No animation sequence needs to be input during prediction, which improves the real-time performance of animation data generation.
Drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present disclosure; other drawings can be derived from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic diagram of an animation data generation process in the related art.
Fig. 2 is a flowchart of an animation data generation method of an embodiment of the present disclosure.
FIG. 3 is a schematic diagram of a sliding window and sample animation sequence of an embodiment of the disclosure.
Fig. 4 is a schematic diagram of feature space, frequency shift, frequency, amplitude, and offset parameters of an embodiment of the present disclosure.
Fig. 5A is a schematic diagram of a process of acquiring a motion cycle feature space according to an embodiment of the present disclosure.
Fig. 5B is a schematic diagram of the process of training an action prediction network according to an embodiment of the present disclosure.
Fig. 6 is a block diagram of an animation data generation device of an embodiment of the present disclosure.
FIG. 7 is a schematic diagram of a computer device in one embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Current schemes for driving an avatar (i.e., generating animation) in real time based on RGB video commonly suffer from the following problems: 1) because the target object is estimated separately in each video frame, the global displacement of the target object is difficult to obtain accurately and inter-frame constraints are lacking; 2) the accuracy of the pose estimation model is insufficient; 3) problems derived from inaccurate global displacement and pose prediction errors, such as foot sliding, i.e., inaccurate poses at the ends of the avatar's skeleton. These problems result in low animation quality and high noise when driving in real time from RGB video.
To address these problems, high-quality animation (for example, animation obtained through motion capture) can be used to optimize the animation obtained from the RGB video driver. The underlying idea is that although the animation obtained by RGB video driving is of low quality and high noise, it still embodies the essential features of the motion, which can be learned by an encoder in a deep learning model. Referring to fig. 1, a high-quality motion library containing high-quality animation sequences can be acquired, where the sequences may be obtained through motion capture. An encoder is trained on this motion library and serves as motion prior information for the target object in a video frame; the encoder maps each high-quality animation sequence into a high-dimensional space to obtain its essential features there (called motion capture sequence features). Because the high-dimensional space is continuous and smooth, the smoothness of any animation sequence re-mapped from it can be guaranteed. Similarly, the animation sequence obtained by RGB video driving is mapped into the high-dimensional space by the encoder to obtain its essential features (called driving sequence features). The motion capture sequence feature with the highest similarity to the driving sequence feature is then retrieved in the high-dimensional space to replace the driving sequence feature, and the animation sequence re-mapped from that most similar motion capture sequence feature is output as the optimized animation sequence, thereby achieving smoothing and denoising.
However, the above scheme requires an animation sequence as input, has poor real-time performance, and is not suitable for real-time scenarios.
Based on this, an embodiment of the present disclosure provides an animation data generation method, referring to fig. 2, including:
step S201: acquiring first animation data corresponding to a first video frame in a video sequence;
step S202: inputting the first animation data and the first motion cycle characteristics corresponding to the first animation data into a pre-trained motion prediction model, and obtaining second animation data corresponding to a next frame video frame of the first video frame in the video sequence output by the motion prediction model;
the motion prediction model is obtained based on a plurality of sample animation sequences and first motion period sample characteristics of the sample animation sequences, the first motion period sample characteristics of the sample animation sequences are extracted from the first sample characteristics of the sample animation sequences, and the first sample characteristics of the sample animation sequences are obtained by encoding the sample animation sequences through an encoder.
In step S201, the video sequence may be video captured in real time by a capture device such as a camera, or video stored in advance on the electronic device. The video sequence may be an action video of a target object, where the target object may be a person, an animal, or any other object having joints, the joints including but not limited to ankles, knees, hips, shoulders, elbows, and wrists. An action video is a video of the target object performing an action, where the action includes but is not limited to walking, running, playing ball, dancing, and the like. The video sequence may include a plurality of video frames, and the first video frame may be any frame in the video sequence that contains the target object. The first animation data corresponding to the first video frame can be obtained by estimating the pose of the target object in the first video frame. In some embodiments, the first animation data may include the rotation angle and position information of each joint of the target object in the first video frame: the rotation angle of a joint may be represented by the joint's rotation matrix, and the position information may be the joint's global position in the world coordinate system, represented by three-dimensional coordinates (x, y, z). The first animation data may further include the skeletal lengths of the target object in the first video frame.
In step S202, the first animation data may be used as input to predict the second animation data corresponding to the video frame following the first video frame. The next video frame may then be treated as the first video frame and the second animation data as the first animation data, returning to step S201. In this way, the animation data of every frame from the second frame onward in the video sequence can be predicted iteratively. For example, assuming the first video frame is the i-th frame of the video sequence, the animation data corresponding to the i-th frame is used as the input of the motion prediction model to predict the animation data corresponding to the (i+1)-th frame; the animation data of the (i+1)-th frame is then used as the input of the motion prediction model to predict the animation data of the (i+2)-th frame, and so on, where i is a positive integer.
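A minimal sketch of this iterative prediction, assuming the motion prediction model is a callable mapping (X_i, P_i) to (X_{i+1}, P_{i+1}); this interface is an illustration based on the description, not the patent's fixed API:

```python
def predict_sequence(model, first_anim, first_phase, num_frames):
    # Iteratively predict animation data for frames 2..num_frames, feeding
    # each prediction back in as the next input (autoregressive rollout).
    anim, phase = first_anim, first_phase
    predictions = []
    for _ in range(num_frames - 1):
        anim, phase = model(anim, phase)  # X_{i+1}, P_{i+1} from (X_i, P_i)
        predictions.append(anim)
    return predictions
```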
In the above embodiment, the motion prediction model is trained based on a plurality of sample animation sequences and the first motion cycle sample features of those sequences, where the first motion cycle sample feature of a sample animation sequence may be extracted from its first sample feature. A sample animation sequence may likewise be an action video of a target object, and the target object in the sample animation sequence may be of the same category as the target object in the video sequence.
In some embodiments, each sample animation sequence may include consecutive multi-frame sample animation data from an initial sample animation sequence. For example, the multi-frame sample animation data within a preset sliding window centered on a given frame of the initial sample animation sequence may be taken. As shown in fig. 3, assuming the sliding window has length 5, the 5 frames centered on the v-th frame of the initial sample animation sequence (the (v-2)-th through (v+2)-th frames) may be taken as one sample animation sequence (denoted the v-th sample animation sequence), the 5 frames centered on the (v+1)-th frame (the (v-1)-th through (v+3)-th frames) as another sample animation sequence (denoted the (v+1)-th sample animation sequence), and so on.
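A minimal sketch of this sliding-window slicing; the array layout is an assumption for illustration:

```python
import numpy as np

def slice_windows(initial_sequence, window=5):
    # initial_sequence: (H, ...) array holding H frames of animation data.
    # Returns one sample animation sequence per valid center frame v, each
    # containing `window` consecutive frames centered on frame v.
    half = window // 2
    return np.stack([initial_sequence[v - half: v + half + 1]
                     for v in range(half, len(initial_sequence) - half)])
```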
In some embodiments, each frame of sample animation data in the initial sample animation sequence may be pre-processed. Optionally, preprocessing includes translating the position information of each joint of the target object in a frame of sample animation data based on the position information of a root joint in the frame of sample animation data, so as to remove the influence of global position transformation of the target object on the whole motion.
Specifically, for each frame of sample animation data in the initial sample animation sequence, the relative position information between any target joint and the root joint can be determined from the difference between the position information of the target joint and that of the root joint in the frame, and the position information of the target joint updated based on this relative position information. The root joint may be designated in advance from among the plurality of joints of the target object; for example, the hip joint may be designated as the root joint. The position information of a joint may be its global position information, for example its position in the world coordinate system.
Assume the initial sample animation sequence includes several frames of sample animation data at 60 frames per second (FPS), and that its dimension is $\mathbb{R}^{3\times J\times H}$, where H is the number of frames of sample animation data in the initial sample animation sequence, J is the number of joints of the target object, and 3 corresponds to the three spatial dimensions X, Y, Z of each joint's position information. X, Y, Z may be obtained by applying forward kinematics to the target object's animation and describe the spatial position of each joint in the world coordinate system.
Denote the position information and rotation matrix of the root joint in the i-th frame of sample animation data of the initial sample animation sequence as $p_{root}$ and $r_{root}$, and the global position of the j-th joint as $p_j$. The updated position information of the j-th joint, $\hat{p}_j$, is the product of the inverse of the root joint's rotation matrix and the difference between the j-th joint's global position and the root joint's global position:

$\hat{p}_j = \mathrm{inv}(r_{root}) \cdot (p_j - p_{root})$

where inv denotes matrix inversion. The position information of every joint in every frame of sample animation data can be processed in this way. With this processing, when the target object performs the same action at different spatial positions, the updated position information remains unchanged, which reduces the influence of global position changes of the target object on the overall motion.
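A minimal sketch of this root-relative update, implementing the formula above for all joints of one frame (shapes are assumptions):

```python
import numpy as np

def to_root_relative(positions, p_root, r_root):
    # positions: (J, 3) global joint positions of one frame;
    # p_root: (3,) root joint position; r_root: (3, 3) root rotation matrix.
    # Computes inv(r_root) @ (p_j - p_root) for every joint j at once.
    return (positions - p_root) @ np.linalg.inv(r_root).T
```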
After updating the position information of each joint in each frame of sample animation data of the initial sample animation sequence, a sample animation sequence can be generated based on multi-frame sample animation data in a preset sliding window, wherein the v-th sample animation sequence comprises multi-frame sample animation data centering on the v-th frame of sample animation data in the initial sample animation sequence, and the position information of each joint in the sample animation data in each sample animation sequence is updated position information. The process of generating a sample animation sequence based on a sliding window described above may also be referred to as sliding window slicing. The number of frames of sample animation data in a sample animation sequence can be reduced by sliding window slicing, so that the processing efficiency is improved, and the data volume is reduced.
Optionally, the preprocessing includes de-centering each frame of sample animation data in the sample animation sequence. Specifically, for each joint (referred to as a target joint) among the plurality of joints, the average position information of the target joint over the sample animation sequence may be determined based on its position information in each frame of the sequence; the position information of the target joint in the target sample animation data of the sequence is then updated based on the target joint's position information in that target sample animation data and its average position information over the sequence.
Assume the sample animation sequence is centered on the v-th frame of sample animation data and its sliding window includes the 60 frames before and the 60 frames after the v-th frame, 121 frames of sample animation data in total; the sample animation sequence then lies in $\mathbb{R}^{D\times N}$, where $D = 3\times J$ and $N = 121$. Denote the average position information of a target joint over the sample animation sequence as $\bar{p}_j$; the updated position information of the target joint can then be written as $\tilde{p}_j = p_j - \bar{p}_j$. The position information of the same target joint across the frames of the sample animation sequence forms a motion curve for that joint, which characterizes how the joint's position changes while the target object moves. The de-centering ensures that the motion curve is distributed evenly on both sides of the coordinate origin.
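A minimal sketch of the de-centering step (shapes are assumptions):

```python
import numpy as np

def decenter(window_frames):
    # window_frames: (N, J, 3) - N frames in the sliding window, J joints.
    # Subtract each joint's mean position over the window so that every
    # joint's motion curve is balanced around the coordinate origin.
    return window_frames - window_frames.mean(axis=0, keepdims=True)
```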
In some embodiments, if the motion amplitude is to be retained as a feature, the motion curves may be left unnormalized; if the motion amplitude need not be considered, the preprocessing may also include normalizing the motion curves.
Through the above preprocessing, a single motion curve sample describing the essence of the motion is finally obtained, lying in $\mathbb{R}^{D\times N}$ with $D = 3\times J$ and $N = 121$. All sample animation sequences may be preprocessed in the manner described above.
After the sample animation sequences are acquired, each sample animation sequence may be encoded by an encoder such as a variational autoencoder (VAE) to obtain the first sample feature. The first sample feature may be a high-dimensional feature vector, denoted $L \in \mathbb{R}^{M\times N}$. A mapping between each sample animation sequence and its first sample feature can be established in advance, and the first sample feature retrieved through this mapping, which improves the efficiency of obtaining first sample features and thus of generating animation data.
The present disclosure observes that the motion of the target object may exhibit explicit or implicit periodicity. Explicit periodicity means that a complete motion cycle can be clearly observed in the action performed by the target object, such as walking or running; implicit periodicity means that a complete cycle is difficult to observe, but a single action can still be regarded as a segment of a periodic action, such as dancing or random motion. Therefore, after the first sample feature is obtained, a first motion cycle sample feature may be extracted from it.
It will be appreciated that any non-periodic signal satisfying the Dirichlet conditions can be approximated as a linear combination of periodic signals on several different channels, i.e., by the first motion cycle sample features. In some embodiments, a non-periodic signal (i.e., implicitly periodic motion such as dance) may be decomposed into a series of periodic signals. In general, more channels fit complex motion better, but if the number of channels is too large, the differences between the periodic signals on the individual channels become small; the number of channels may therefore be chosen by jointly considering the fitting quality and the differences between the per-channel periodic signals. In some embodiments, the number of channels may be 6, 8, or 10.
In particular, referring to fig. 5A, a first motion period sample characteristic of a sample animation sequence may be determined based on amplitude and frequency shift parameters of periodic signals of a plurality of channels that are decomposed from the first sample characteristic of the sample animation sequence.
The amplitude of the periodic signal of each channel is obtained by applying a fast Fourier transform to the first sample feature of the sample animation sequence. Deep learning frameworks such as PyTorch provide a differentiable fast Fourier transform (FFT), by which the amplitude of the first sample feature on each channel can be obtained. The frequency shift parameter of the periodic signal of each channel is fitted from the first sample feature of the sample animation sequence by a pre-trained fully connected (FC) layer.
Specifically, the Fourier coefficients of the first sample feature on each channel may be obtained by the FFT; the power spectrum of a channel can be calculated from the channel's Fourier coefficients, and the amplitude of the channel from its power spectrum.
The Fourier coefficients of the first sample feature can be obtained by applying the FFT to it, denoted $c = \mathrm{FFT}(L) \in \mathbb{C}^{M\times (K+1)}$, where $K = \lfloor N/2 \rfloor$ ($\lfloor\cdot\rfloor$ denoting rounding down) and c contains, for each of the M channels, the Fourier coefficients at the $K+1$ frequency components. From the Fourier coefficients of the first sample feature, the power spectrum $p \in \mathbb{R}^{M\times (K+1)}$ of the periodic signals can be calculated, where the power at the j-th frequency component of the i-th channel may be written as:

$p_{i,j} = |c_{i,j}|^2$

where $c_{i,j}$ is the Fourier coefficient of the j-th frequency component of the i-th channel. From the power spectrum, the amplitude $A_i$ and frequency $F_i$ of the i-th channel are obtained as:

$A_i = \frac{2}{N}\sqrt{\sum_{j=1}^{K} p_{i,j}}, \qquad F_i = \frac{\sum_{j=1}^{K} f_j\, p_{i,j}}{\sum_{j=1}^{K} p_{i,j}}$

The offset $B_i$ of the i-th channel can also be calculated from the Fourier coefficient of the 0-th frequency component of the i-th channel:

$B_i = \frac{\mathrm{Re}(c_{i,0})}{N}$

where the j-th frequency component $f_j$ is the j-th element of the frequency vector $(0, 1/T, 2/T, \ldots, K/T)$, T being the duration of the sliding window.
Inputting the first sample feature of the i-th channel into the fully connected layer yields two output parameters $s_x$ and $s_y$ for that channel, and the frequency shift parameter $S_i$ of the i-th channel can be written as:

$S_i = \mathrm{atan2}(s_y, s_x)$
based on the obtained amplitude A i Frequency F i Offset B i And a frequency shift parameter S i Can acquire the periodic signal under the ith channelThe method is characterized by comprising the following steps:
In the related art, the feature space is generated directly from the features produced by the encoder; such a space is continuous and smooth but poorly interpretable. The motion cycle feature space of the embodiments of the present disclosure can instead describe even actions without an obvious period (i.e., implicitly periodic motion) well, through the local periodicity of the actions corresponding to the sample animation data within each sliding window. In some embodiments, the first motion cycle feature can be expressed by the following formula (1), where t indexes the sliding window centered on the t-th frame of sample animation data and i indexes the feature channel:

$P_t = \left(A_i^{(t)} \sin S_i^{(t)},\; A_i^{(t)} \cos S_i^{(t)}\right)_{i=1,\ldots,M} \in \mathbb{R}^{2M} \quad (1)$

P uses the frequency shift parameter to describe how the motion feature varies periodically over time (for example, which stage of the action the t-th frame of sample animation data is in), giving it a definite directionality, and uses the amplitude to distinguish actions of different intensities by their different magnitudes, such as jogging versus sprinting. Because the offset of a periodic signal represents a direct-current component that is irrelevant to the periodicity of the motion, the per-channel offsets are not used to generate the motion cycle feature space; and because the frequency lacks interpretability, the per-channel frequencies are not used either, so that the generated motion cycle feature space has better interpretability. Fig. 4 visualizes, for a single motion curve, the feature space and the frequency shift, frequency, amplitude, and offset parameters on multiple channels.
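A minimal sketch of formula (1); the (sin, cos) embedding is inferred from the 2M-dimensional motion cycle feature used later and from the per-channel rotation applied during prediction, so treat it as an assumption:

```python
import torch

def phase_feature(A, S):
    # A, S: (M,) per-channel amplitude and frequency shift.
    # Per channel i, the 2-D vector (A_i sin S_i, A_i cos S_i); concatenated
    # over channels this gives the feature P in R^{2M}.
    return torch.cat([A * torch.sin(S), A * torch.cos(S)])
```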
In some embodiments, the encoder, the fully connected layer, and the decoder corresponding to the encoder may be obtained through joint training. Specifically, for each of the plurality of sample animation sequences: the sample animation sequence is encoded by an initial encoder to obtain a second sample feature of the sequence; a fast Fourier transform is applied to the second sample feature to obtain the amplitude, frequency, and offset of the periodic signal corresponding to the second sample feature on each channel; the second sample feature is fitted by an initial fully connected layer to obtain the frequency shift parameter of the periodic signal corresponding to the second sample feature on each channel; the periodic signal corresponding to the second sample feature is generated based on these per-channel amplitude, frequency, offset, and frequency shift parameters; the periodic signal is decoded by an initial decoder corresponding to the initial encoder to obtain a first predicted animation sequence corresponding to the sample animation sequence; and the initial encoder, initial fully connected layer, and initial decoder are jointly trained based on the plurality of sample animation sequences and their corresponding first predicted animation sequences, yielding the trained encoder, the trained fully connected layer, and the decoder corresponding to the encoder.
The Fourier transform of the second sample feature and the fitting of the second sample feature by the fully connected layer during training are performed in the same manner as the transform and fitting of the first sample feature in the foregoing embodiments, and are not repeated here.
Alternatively, a mean square error (MSE) loss function may be established based on the plurality of sample animation sequences and the first predicted animation sequence corresponding to each sample animation sequence, and the joint training performed based on this loss function. After training, each sample animation sequence is processed by the trained encoder and fully connected layer to obtain a motion cycle feature space comprising the first motion cycle sample features of all sample animation sequences.
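A hedged sketch of one joint-training step, reusing periodic_parameters and ShiftHead from the sketch above; the module names and optimizer handling are illustrative assumptions:

```python
import math
import torch

def train_step(encoder, shift_head, decoder, optimizer, X, dt=1.0 / 60):
    # Encode, extract periodic parameters, rebuild the periodic signals,
    # decode, and apply the MSE reconstruction loss.
    L = encoder(X)                               # second sample feature (M, N)
    A, F, B = periodic_parameters(L, dt)         # FFT-derived parameters
    S = shift_head(L)                            # fitted frequency shifts
    t = torch.arange(L.shape[-1]) * dt
    # Per-channel periodic signal: A_i sin(2*pi*F_i*t + S_i) + B_i
    L_hat = (A[:, None] * torch.sin(2 * math.pi * F[:, None] * t + S[:, None])
             + B[:, None])
    X_hat = decoder(L_hat)                       # first predicted animation sequence
    loss = torch.mean((X_hat - X) ** 2)          # MSE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```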
In some embodiments, the action prediction network is obtained by weighting the network parameters of a plurality of expert systems, and the weights of the expert systems are obtained through a gating network. The motion cycle feature $P_i \in \mathbb{R}^{2M}$ of the first video frame is taken as the input of the gating network $\Omega$, which processes it to obtain the weights $\omega = \{\omega_1, \omega_2, \omega_3, \ldots, \omega_Q\}$ of the expert systems, where $\omega_u$ is the weight of the u-th expert system and Q is the number of expert systems. The network parameters of the corresponding expert systems are then weighted by these weights to obtain the action prediction network. Inputting the first animation data into the action prediction network for processing yields the second animation data output by the network.
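A minimal sketch of this gated mixture-of-experts blending; the gating architecture, hidden size, and activation are assumptions, and only the parameter-weighting scheme follows the description:

```python
import torch

class GatedExpertLayer(torch.nn.Module):
    def __init__(self, num_experts, in_dim, out_dim, phase_dim):
        super().__init__()
        # Gating network Omega: motion cycle feature -> expert weights omega.
        self.gate = torch.nn.Sequential(
            torch.nn.Linear(phase_dim, 64), torch.nn.ELU(),
            torch.nn.Linear(64, num_experts), torch.nn.Softmax(dim=-1))
        # One weight matrix and bias per expert (alpha_u).
        self.W = torch.nn.Parameter(torch.randn(num_experts, out_dim, in_dim) * 0.01)
        self.b = torch.nn.Parameter(torch.zeros(num_experts, out_dim))

    def forward(self, x, phase):
        omega = self.gate(phase)                      # (Q,) expert weights
        W = torch.einsum('q,qoi->oi', omega, self.W)  # blended parameters
        b = torch.einsum('q,qo->o', omega, self.b)
        return W @ x + b                              # prediction with blended layer
```

Blending parameters rather than expert outputs keeps a single forward pass per frame while still letting the gating network specialize the layer to the current phase of the motion.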
The motion prediction network $\Theta$ comprises a plurality of expert systems, each of which handles inputs well under particular conditions. Compared with a motion prediction network with fixed network parameters, the embodiments of the present disclosure can dynamically adjust the network parameters of $\Theta$ based on the motion cycle feature of the first video frame, which broadens the range of conditions the network can handle and makes the generated motion more natural and realistic.
As shown in fig. 5B, during training, the first motion period sample feature of the sample animation sequence may be input into an initial gating network, and initial weights of each initial expert system output by the initial gating network may be obtained; weighting network parameters corresponding to the initial expert systems by the initial weights of the initial expert systems to obtain an initial action prediction network; inputting the sample animation sequence into the initial motion prediction network, and obtaining a second prediction animation sequence output by the initial motion prediction network and the predicted motion period sample characteristics, frequency and amplitude of the second prediction animation sequence; and training the initial gating network and each initial expert system based on the sample animation sequence, the first motion cycle sample characteristic of the sample animation sequence, the second prediction animation sequence and the predicted motion cycle sample characteristic, frequency and amplitude of the second prediction animation sequence to obtain the trained gating network and each expert system.
Before training, the network parameters of the Q expert systems may first be initialized as $\alpha = \{\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_Q\}$; the actual network parameters are then computed as:

$\Theta = \sum_{u=1}^{Q} \omega_u\, \alpha_u$

The initial motion prediction network with these parameters takes a frame of the sample animation sequence $X_i \in \mathbb{R}^{D\times 1}$ as input, with its network parameters controlled by the expert systems and the gating network, and outputs the predicted second predicted animation sequence $\hat{X}_{i+1}$, the predicted motion cycle sample feature $\hat{P}_{i+1}$, the amplitude $\hat{A}_{i+1}$, and the frequency $\hat{F}_{i+1}$. The motion cycle sample feature is then updated as:

$P_{i+1} = I\!\left(R(2\pi\, \Delta t\, \hat{F}_{i+1})\, P_i,\; \hat{P}_{i+1};\; 0.5\right)$

where $R(\theta)$ is the rotation matrix corresponding to $\theta$ (applied per channel to the two-dimensional feature vector), $I(\cdot)$ is linear interpolation with weight 0.5, and $\Delta t$ is the time interval between two frames of sample animation data. Updating the motion cycle sample feature in this way ensures that the predicted motion cycle sample features output by the motion prediction network change periodically and unidirectionally along the time axis, avoiding the stiffness that would appear if directly predicted motion cycle sample features lacked periodicity and unidirectionality. The gating network and the action prediction network are trained in an autoregressive manner, and the input and output of the whole network can be expressed as:

$(\hat{X}_{i+1}, \hat{P}_{i+1}, \hat{A}_{i+1}, \hat{F}_{i+1}) = \Theta(X_i, P_i)$
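A sketch of this motion cycle feature update, treating each channel's feature as a 2-D vector rotated by theta = 2*pi*dt*F and interpolated with the prediction at weight 0.5; the per-channel 2-D layout matches the assumption made for formula (1):

```python
import math
import torch

def advance_phase(P, P_pred, F_pred, dt):
    # P, P_pred: (2M,) motion cycle features laid out as [sin-part, cos-part];
    # F_pred: (M,) predicted per-channel frequencies; dt: frame interval.
    M = F_pred.shape[0]
    x, y = P[:M], P[M:]
    theta = 2 * math.pi * dt * F_pred
    cos_t, sin_t = torch.cos(theta), torch.sin(theta)
    rotated = torch.cat([cos_t * x - sin_t * y, sin_t * x + cos_t * y])
    return 0.5 * rotated + 0.5 * P_pred   # I(R(theta) P_i, P_pred; 0.5)
```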
the loss expression is as follows (2), wherein X i ,X i+1 ,P i ,P i+1 For the input and corresponding labels, ω and α are network parameters to be optimized, the error uses the L2 penalty to calculate the error between the first motion cycle sample feature and the predicted motion cycle sample feature of the sample animation sequence, and weights are used. Training is continuously carried out on the motion cycle characteristic space by using two adjacent frames of sample animation data, and finally a trained gating network and each expert system are obtained.
Wherein mu 1 Sum mu 2 The weights of the sample characteristics of the predicted motion period and the weights of the sample animation sequence can be set according to actual needs.
After training, single-frame animation data obtained by video-sequence driving is optimized as follows. Let the i-th frame of animation data obtained by video-sequence driving be $X_i \in \mathbb{R}^{D\times 1}$, with corresponding motion cycle feature $P_i \in \mathbb{R}^{2M}$. Inputting $X_i, P_i$ into the motion prediction network yields the predicted next-frame animation data $\hat{X}_{i+1}$, with the associated motion cycle feature updated as described above. Let $X_{i+1}$ be the (i+1)-th frame of animation data obtained directly by video-sequence driving; the final optimized (i+1)-th frame of animation data $\tilde{X}_{i+1}$ is then obtained by fusing $\hat{X}_{i+1}$ with $X_{i+1}$. In this way, the animation data generated by the prediction network is prevented from deviating too far from the animation data obtained by actual driving. $\tilde{X}_{i+1}$ is then taken as the new input, and the output $\tilde{X}_{i+2}$ is computed in the same way, thereby realizing continuous real-time optimization during driving.
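A hedged sketch of this real-time optimization loop, reusing advance_phase from the sketch above; the fusion rule and blend weight are assumptions, since the description states only that the predicted and driven data are combined:

```python
def optimize_stream(model, driven_frames, first_phase, dt=1.0 / 60, blend=0.5):
    # driven_frames: per-frame animation data X_i from the RGB-video driver.
    X, P = driven_frames[0], first_phase
    optimized = [X]
    for X_driven in driven_frames[1:]:
        X_pred, P_pred, A_pred, F_pred = model(X, P)   # network outputs
        P = advance_phase(P, P_pred, F_pred, dt)       # update motion cycle feature
        X = blend * X_pred + (1 - blend) * X_driven    # fuse prediction and driver
        optimized.append(X)
    return optimized
```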
In the above embodiments, if the first video frame is the first frame of the video sequence, the target animation data with the highest similarity to the initial animation data corresponding to the first video frame may be determined from multi-frame animation data pre-stored in an action library; the target animation data is then determined as the first animation data corresponding to the first video frame, and the motion cycle feature corresponding to the target animation data as the first motion cycle feature.
A mapping between each frame of animation data in the action library and its first motion cycle feature can be established in advance; after the target animation data is determined, its motion cycle feature is looked up through this mapping.
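A minimal sketch of this first-frame lookup; cosine similarity is an assumption, as the description does not fix the similarity metric:

```python
import torch

def lookup_first_frame(initial_anim, library_anims, library_phases):
    # library_anims: (E, D) pre-stored animation data; library_phases: (E, 2M)
    # pre-computed motion cycle features. Returns the most similar entry and
    # its motion cycle feature.
    sims = torch.nn.functional.cosine_similarity(
        library_anims, initial_anim.unsqueeze(0), dim=-1)
    best = torch.argmax(sims)
    return library_anims[best], library_phases[best]
```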
It will be appreciated that the solutions described in the above embodiments can be freely combined to obtain new solutions without conflicts, and for reasons of brevity, will not be described further herein.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Referring to fig. 6, an embodiment of the present disclosure further provides an animation data generation apparatus, including:
a first obtaining module 601, configured to obtain first animation data corresponding to a first video frame in a video sequence;
a second obtaining module 602, configured to input the first animation data and a first motion cycle characteristic corresponding to the first animation data into a pre-trained motion prediction model, and obtain second animation data corresponding to a video frame next to the first video frame in the video sequence output by the motion prediction model;
The motion prediction model is obtained by training based on a plurality of sample animation sequences and first motion cycle sample features of the sample animation sequences; the first motion cycle sample feature of a sample animation sequence is extracted from the first sample feature of that sequence, and the first sample feature is obtained by encoding the sample animation sequence through an encoder.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The embodiments of the present disclosure also provide a computer device at least including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of the preceding embodiments when executing the program.
FIG. 7 illustrates a more specific hardware architecture diagram of a computing device provided by embodiments of the present description, which may include: a processor 702, a memory 704, an input/output interface 706, a communication interface 708, and a bus 710. Wherein the processor 702, the memory 704, the input/output interface 706 and the communication interface 708 enable communication connections between each other within the device via a bus 710.
The processor 702 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided in the embodiments of the present disclosure. The processor 702 may also include a graphics card, such as an Nvidia Titan X or 1080Ti graphics card.
The Memory 704 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), static storage device, dynamic storage device, or the like. Memory 704 may store an operating system and other application programs, and when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in memory 704 and executed by processor 702.
The input/output interface 706 is used to connect with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 708 is used to connect communication modules (not shown) to enable communication interactions of the device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
It should be noted that although the above-described device only shows the processor 702, the memory 704, the input/output interface 706, the communication interface 708, and the bus 710, in a specific implementation, the device may also include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the previous embodiments.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
From the foregoing description of embodiments, it will be apparent to those skilled in the art that the present embodiments may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be embodied in essence or what contributes to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present specification.
The system, apparatus, module, or unit set forth in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative, in which the modules illustrated as separate components may or may not be physically separate, and the functions of the modules may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present disclosure. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The foregoing is merely a specific implementation of the embodiments of the present disclosure. It should be noted that a person skilled in the art may make several improvements and modifications without departing from the principles of the embodiments of the present disclosure, and such improvements and modifications should also fall within the protection scope of the embodiments of the present disclosure.
Claims (10)
1. A method of generating animation data, the method comprising:
acquiring first animation data corresponding to a first video frame in a video sequence;
inputting the first animation data and the first motion cycle characteristics corresponding to the first animation data into a pre-trained motion prediction model, and obtaining second animation data corresponding to a next frame video frame of the first video frame in the video sequence output by the motion prediction model;
wherein the motion prediction model is obtained by training based on a plurality of sample animation sequences and first motion cycle sample features of the sample animation sequences, the first motion cycle sample features of the sample animation sequences are extracted from first sample features of the sample animation sequences, and the first sample features of the sample animation sequences are obtained by encoding the sample animation sequences through an encoder.
2. The method of claim 1, wherein if the first video frame is a first frame video frame in the video sequence, the method further comprises:
determining, from multi-frame animation data pre-stored in an action library, target animation data with the highest similarity to initial animation data corresponding to the first video frame;
And determining the target animation data as first animation data corresponding to the first video frame, and determining the motion cycle characteristic corresponding to the target animation data as the first motion cycle characteristic.
3. The method of claim 1, wherein the first motion cycle sample characteristic of the sample animation sequence is determined based on amplitude and frequency shift parameters of periodic signals of a plurality of channels, the periodic signals of the plurality of channels being decomposed from the first sample characteristic of the sample animation sequence;
the amplitude of the periodic signal of any channel is obtained by performing fast Fourier transform on the first sample characteristic of the sample animation sequence, and the frequency shift parameter of the periodic signal of any channel is obtained by fitting the first sample characteristic of the sample animation sequence through a pre-trained full-connection layer.
4. A method according to claim 3, characterized in that the encoder and the fully connected layer are trained on the basis of:
for each sample animation sequence in the plurality of sample animation sequences, encoding the sample animation sequence by an initial encoder to obtain a second sample characteristic of the sample animation sequence;
Performing fast Fourier transform on the second sample characteristic to obtain the amplitude, the frequency and the offset of the periodic signal corresponding to the second sample characteristic on each channel;
fitting the second sample characteristics through an initial full-connection layer to obtain frequency shift parameters of periodic signals corresponding to the second sample characteristics on each channel;
generating a periodic signal corresponding to the second sample feature based on the amplitude, frequency, offset and frequency shift parameters of the periodic signal corresponding to the second sample feature on each channel;
decoding the periodic signal corresponding to the second sample characteristic through an initial decoder corresponding to the initial encoder to obtain a first prediction animation sequence corresponding to the sample animation sequence;
and based on the plurality of sample animation sequences and the first prediction animation sequence corresponding to each sample animation sequence, carrying out joint training on the initial encoder, the initial full-connection layer and the initial decoder to obtain the trained encoder, the trained full-connection layer and the decoder corresponding to the encoder.
5. The method of claim 1, wherein the motion prediction network is obtained by weighting the network parameters of a plurality of expert systems, and the weight of each expert system is obtained through a gating network; inputting the first animation data into a pre-trained motion prediction model and obtaining the second animation data output by the motion prediction model, corresponding to the video frame next to the first video frame in the video sequence, comprises:
acquiring the motion cycle characteristic of the first video frame;
processing the motion cycle characteristic of the first video frame through the gating network to obtain the weight of each expert system in the plurality of expert systems;
weighting the network parameters of the corresponding expert systems by the weights of the expert systems to obtain the motion prediction network;
and inputting the first animation data into the motion prediction network, and acquiring the second animation data output by the motion prediction network.
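A compact sketch of the expert-blending step, with each expert reduced to a single linear layer purely for illustration (a real prediction network would blend every layer's parameters the same way); all names here are hypothetical:

```python
import torch
import torch.nn as nn

def blended_prediction(first_animation: torch.Tensor,
                       cycle_feature: torch.Tensor,
                       gating: nn.Module,
                       expert_weights: list,
                       expert_biases: list) -> torch.Tensor:
    """Blend expert parameters with gating weights, then run the blended
    network on the first animation data to predict the next frame."""
    alpha = torch.softmax(gating(cycle_feature), dim=-1)   # one weight per expert
    W = sum(a * w for a, w in zip(alpha, expert_weights))  # blended weight matrix
    b = sum(a * c for a, c in zip(alpha, expert_biases))   # blended bias
    return first_animation @ W.T + b                       # second animation data
```

Blending parameters rather than expert outputs keeps inference to a single forward pass through one network.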
6. The method of claim 5, wherein the motion prediction network is trained based on:
inputting the first motion cycle sample characteristic of a sample animation sequence into an initial gating network, and acquiring the initial weight of each initial expert system output by the initial gating network;
weighting the network parameters of the corresponding initial expert systems by the initial weights of the initial expert systems to obtain an initial motion prediction network;
inputting the sample animation sequence into the initial motion prediction network, and obtaining a second predicted animation sequence output by the initial motion prediction network, together with the predicted motion cycle sample characteristic, frequency and amplitude of the second predicted animation sequence;
and training the initial gating network and the initial motion prediction network based on the sample animation sequence, the first motion cycle sample characteristic of the sample animation sequence, the second predicted animation sequence, and the predicted motion cycle sample characteristic, frequency and amplitude of the second predicted animation sequence, to obtain the trained gating network and the trained expert systems.
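The training objective of claim 6 plausibly combines a sequence-reconstruction term with consistency terms on the predicted periodic quantities; the following loss, including its weights and the target frequency/amplitude arguments, is a hypothetical reading, not the patent's stated formula:

```python
import torch.nn.functional as F

def prediction_loss(sample_seq, predicted_seq,
                    sample_cycle, predicted_cycle,
                    predicted_freq, target_freq,
                    predicted_amp, target_amp,
                    w_cycle=1.0, w_freq=0.1, w_amp=0.1):
    """Joint loss over the predicted sequence and its periodic
    quantities; the weighting factors are illustrative only."""
    loss = F.mse_loss(predicted_seq, sample_seq)
    loss = loss + w_cycle * F.mse_loss(predicted_cycle, sample_cycle)
    loss = loss + w_freq * F.mse_loss(predicted_freq, target_freq)
    loss = loss + w_amp * F.mse_loss(predicted_amp, target_amp)
    return loss
```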
7. The method of claim 1, wherein each sample animation sequence comprises multiple frames of sample animation data within a sliding window centered on target sample animation data in an initial sample animation sequence, and each frame of sample animation data in the initial sample animation sequence comprises position information for a plurality of joints; before training the motion prediction model based on the plurality of sample animation sequences and the first motion cycle sample characteristics of the plurality of sample animation sequences, the method further comprises:
performing the following steps for each target joint of the plurality of joints:
determining the average position information of the target joint in the sample animation sequence based on the position information of the target joint in each frame of animation data of the sample animation sequence;
and updating the position information of the target joint in the target sample animation data of the sample animation sequence based on that position information and on the average position information of the target joint in the sample animation sequence.
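A minimal sketch of the windowing and averaging preprocessing in claim 7; the window half-size, the array layout and the exact update rule (simple mean-centering here) are assumptions:

```python
import numpy as np

def preprocess_window(initial_sequence: np.ndarray,
                      center: int, half: int = 30) -> np.ndarray:
    """initial_sequence: (frames, joints, 3) joint positions.
    Extract the sliding window centered on the target frame, then update
    the target frame's joint positions with the per-joint window average."""
    window = initial_sequence[center - half : center + half + 1].copy()
    mean_pos = window.mean(axis=0)           # (joints, 3) average per joint
    window[half] = window[half] - mean_pos   # update the target sample animation data
    return window
```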
8. The method of claim 7, wherein the plurality of joints comprises a root joint, and before determining the average position information of the target joint in the sample animation sequence, the method further comprises:
determining relative position information between the target joint and the root joint based on the difference between the position information of the target joint and the position information of the root joint;
and updating the position information of the target joint based on the relative position information.
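And the root-relative update of claim 8, again under an assumed array layout:

```python
import numpy as np

def to_root_relative(positions: np.ndarray, root_index: int = 0) -> np.ndarray:
    """positions: (frames, joints, 3). Replace each joint's position with
    its per-frame offset from the root joint (e.g. the hip)."""
    root = positions[:, root_index : root_index + 1, :]   # (frames, 1, 3)
    return positions - root
```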
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 8.
10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310094834.2A CN116030166A (en) | 2023-02-07 | 2023-02-07 | Animation data generation method and device, medium and computer equipment |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116030166A (en) | 2023-04-28 |
Family
ID=86079378
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310094834.2A (CN116030166A, Pending) | Animation data generation method and device, medium and computer equipment | 2023-02-07 | 2023-02-07 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116030166A (en) |
- 2023-02-07: application CN202310094834.2A filed in China; published as CN116030166A (status: Pending)
Similar Documents
| Publication | Title |
|---|---|
| US11100314B2 (en) | Device, system and method for improving motion estimation using a human motion model |
| US11557085B2 (en) | Neural network processing for multi-object 3D modeling |
| CN109166144B (en) | An Image Depth Estimation Method Based on Generative Adversarial Networks |
| CN115244495B (en) | Real-time style for virtual environment motion |
| CN107077738B (en) | System and method for tracking object |
| US10977549B2 (en) | Object animation using generative neural networks |
| CN111899320B (en) | Data processing method, training method and device of dynamic capture denoising model |
| CN114565655B (en) | A depth estimation method and device based on pyramid segmentation attention |
| CN114550282A (en) | Multi-person three-dimensional attitude estimation method and device and electronic equipment |
| CN107657627A (en) | Space-time contextual target tracking based on human brain memory mechanism |
| CN109902588A (en) | Gesture identification method, device and computer readable storage medium |
| CN114401446B (en) | Human body posture migration method, device and system, electronic equipment and storage medium |
| CN118644411A (en) | Video generation method, device, electronic device and storage medium |
| CN119666033A (en) | Dynamic calibration method and device for motion sensor used in intelligent terminal equipment |
| Kalampokas et al. | Performance benchmark of deep learning human pose estimation for UAVs |
| CN114240869B (en) | A Rotational Target Detection Method Based on Transformer |
| CN116030166A (en) | Animation data generation method and device, medium and computer equipment |
| CN115223192A (en) | Key point detection method and key point detection model training method |
| CN113361656A (en) | Feature model generation method, system, device and storage medium |
| CN119007301A (en) | Vestibular illusion training method and device and electronic equipment |
| CN119152056A (en) | Method for generating hand gesture action sequence by music based on diffusion model |
| US20240282028A1 (en) | Reducing domain shift in neural motion controllers |
| CN120599668B (en) | Three-dimensional human body posture estimation method and system based on prior of diffusion transformer |
| CN114782592A (en) | Cartoon animation generation method, device and equipment based on image and storage medium |
| CN114627176B (en) | Scene depth reasoning method and device based on historical information and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |