Disclosure of Invention
The present invention mainly aims to provide a digital person driving method, apparatus, device, and storage medium, so as to solve the technical problem in the prior art that driving a digital person through real-person performance consumes a large amount of human resources and incurs high cost.
In order to achieve the above object, the present invention provides a digital person driving method including:
generating digital person audio and phoneme sequence information corresponding to the digital person to be driven according to the language to be broadcasted;
determining the digital person limb language corresponding to the digital person to be driven based on the phoneme sequence information and an action state machine;
generating a digital human image sequence according to the digital human limb language;
generating a digital human video based on the digital human audio and the digital human image sequence;
and presenting the digital person video through the digital person to be driven.
Optionally, the step of determining the digital person limb language corresponding to the digital person to be driven based on the phoneme sequence information and the action state machine includes:
determining a human face key point sequence based on the phoneme sequence information and a preset digital human visual mapping table, wherein the human face key point sequence is a sequence formed by the face key feature points of the digital human to be driven;
determining a body key point sequence based on the digital human behavior description information and an action state machine, wherein the body key point sequence is a sequence formed by body key feature points of the digital human to be driven;
acquiring an overall key point sequence according to the human face key point sequence and the body key point sequence;
and determining the limb language of the digital person corresponding to the digital person to be driven according to the integral key point sequence.
Optionally, the step of determining the face key point sequence based on the phoneme sequence information and a preset digital human visual mapping table includes:
determining a face key point offset sequence based on the phoneme sequence information and a preset digital human visual mapping table;
performing time sequence smoothing on the face key point offset sequence to obtain a smoothed face key point offset sequence;
and determining a face key point sequence based on the smoothed face key point offset sequence and the digital human neutral face key points.
Optionally, the step of determining the body key point sequence based on the digital human behavior description information and the action state machine comprises the following steps:
determining a body key point offset sequence based on the digital human behavior description information and the action state machine;
performing time sequence smoothing on the body key point offset sequence to obtain a smoothed body key point offset sequence;
a body keypoint sequence is determined based on the body keypoint offset sequence and the digital human neutral body keypoints.
Optionally, the step of presenting the digital person video by the digital person to be driven includes:
converting the digital human audio and the digital human video into an audio stream and a video stream respectively through a target stream pusher;
synchronizing the audio stream and the video stream according to the target time stamp to obtain a synchronized audio and video stream;
pushing the synchronized audio and video stream to a target server, and presenting the synchronized audio and video stream through the target server and the digital person to be driven.
Optionally, before the step of converting the digital human audio and the digital human video into an audio stream and a video stream by the target stream pusher, the method further includes:
carrying out segmentation processing on the digital human video to determine a static video segment in the digital human video;
constructing a target corpus based on the static video segments;
generating a fused digital human video based on the dynamic video segments and the static video segments in the target corpus when an inference request is received;
the step of converting the digital human audio and the digital human video into an audio stream and a video stream respectively by a target stream pusher includes:
converting the digital human audio and the fused digital human video into an audio stream and a video stream respectively through the target stream pusher.
Optionally, before the step of generating the digital person audio and the phoneme sequence information corresponding to the digital person to be driven according to the language to be broadcasted, the method further includes:
constructing a state node action behavior corresponding to an action state machine by a human body posture estimation method, wherein the action state machine is a model for controlling the action of the digital person to be driven;
constructing transition action behaviors among nodes corresponding to the action state machine;
and modeling the behavior of the digital person to be driven based on the state node action behavior and the inter-node transition action behavior.
In addition, in order to achieve the above object, the present invention also proposes a digital person driving apparatus, the apparatus comprising:
The voice generating module is used for generating digital person audio frequency and phoneme sequence information corresponding to the digital person to be driven according to the language to be broadcasted;
The limb language determining module is used for determining the limb language of the digital person corresponding to the digital person to be driven based on the phoneme sequence information and the action state machine;
The image sequence generating module is used for generating a digital human image sequence according to the digital human limb language;
a video generation module for generating a digital human video based on the digital human audio and the digital human image sequence;
And the video presentation module is used for presenting the digital person video through the digital person to be driven.
In addition, to achieve the above object, the present invention also proposes a digital person driving device, the device comprising: a memory, a processor, and a digital person driving program stored on the memory and executable on the processor, the digital person driving program being configured to implement the steps of the digital person driving method as described above.
In addition, to achieve the above object, the present invention also proposes a storage medium having stored thereon a digital person driving program which, when executed by a processor, implements the steps of the digital person driving method as described above.
According to the invention, digital person audio and phoneme sequence information corresponding to the digital person to be driven are generated according to the language to be broadcasted; the digital person limb language corresponding to the digital person to be driven is determined based on the phoneme sequence information and the action state machine; a digital person image sequence is generated according to the digital person limb language; a digital person video is generated based on the digital person audio and the digital person image sequence; and the digital person video is presented through the digital person to be driven. Compared with the prior art, in which driving a digital person through real-person performance usually requires a large amount of support during modeling and driving and thus incurs high cost, the present invention drives the digital person automatically from the language to be broadcasted, thereby reducing the consumption of human resources and lowering the cost.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a digital person driving device in a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the digital person driving device may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and optionally may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable nonvolatile memory (Non-Volatile Memory, NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not limit the digital person driving device, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
As shown in fig. 1, the memory 1005, as one type of storage medium, may include an operating system, a network communication module, a user interface module, and a digital person driving program.
In the digital person driving device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. In the digital person driving device of the present invention, the processor 1001 invokes the digital person driving program stored in the memory 1005 and performs the digital person driving method provided by the embodiments of the present invention.
An embodiment of the present invention provides a digital person driving method, referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the digital person driving method of the present invention.
In this embodiment, the digital person driving method includes the steps of:
Step S10: and generating digital person audio and phoneme sequence information corresponding to the digital person to be driven according to the language to be broadcasted.
It should be noted that the execution body of the method of this embodiment may be a digital person driving device for driving a digital person in real time, for example, a mobile phone, a tablet computer, or a personal computer, or another digital person driving system that can implement the same or similar functions and includes such a digital person driving device. The digital person driving method provided in this embodiment and the following embodiments is specifically described herein with a digital person driving system (hereinafter referred to as the system). A digital person refers to a virtual character with a digitized appearance that exists depending on a display device.
It should be noted that, the language to be broadcasted may be any natural language that needs a digital person to broadcast. The language type of the language to be broadcasted in this embodiment may include, but is not limited to, english, chinese, and the like.
It can be understood that the digital person to be driven is the digital person to be driven in this scheme; correspondingly, the digital human audio may be the audio for driving the digital person to be driven; the phoneme sequence information may be sequence information consisting of the prosody boundaries in the language to be broadcasted together with the initials, finals, and tones of the Pinyin.
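As a toy illustration of what such phoneme sequence information can look like (an assumption for illustration only, not the patent's synthesis network), tone-annotated Pinyin syllables can be split into (initial, final, tone) triples:

```python
# Toy splitter: decompose tone-annotated Pinyin syllables into
# (initial, final, tone) phoneme triples. Illustrative only.

INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h", "j", "q", "x",
     "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)  # match two-letter initials ("zh") before "z"

def split_syllable(syllable):
    """Split e.g. 'zhong1' into ('zh', 'ong', '1')."""
    tone = syllable[-1] if syllable[-1].isdigit() else ""
    body = syllable[:-1] if tone else syllable
    for ini in INITIALS:
        if body.startswith(ini):
            return (ini, body[len(ini):], tone)
    return ("", body, tone)  # zero-initial syllable such as 'an4'

def to_phoneme_sequence(pinyin_text):
    return [split_syllable(s) for s in pinyin_text.split()]

print(to_phoneme_sequence("ni3 hao3"))  # [('n', 'i', '3'), ('h', 'ao', '3')]
```

A real system would additionally carry prosody-boundary markers alongside these triples, as described above.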
In this embodiment, referring to fig. 3, fig. 3 is a schematic flow chart of generating digital human audio in the first embodiment of the digital human driving method of the present invention. As shown in fig. 3, the system may input the natural language T_n to be broadcasted into a deep learning neural network (such as the speech synthesis network in fig. 3) to generate the digital human audio (i.e., the target audio Audio_dst) and the phoneme sequence information {ph_1, ph_2, …, ph_n} corresponding to the digital person to be driven through the speech synthesis network, where the deep learning neural network may be a VITS network (Variational Inference with adversarial learning for end-to-end Text-to-Speech).
Step S20: and determining the limb language of the digital person corresponding to the digital person to be driven based on the phoneme sequence information and the action state machine.
It should be understood that the above-mentioned action state machine is a common programming concept (widely used, for example, in embedded control systems, communication protocols, and user interfaces). A state machine abstracts the states of a system and the transitions between those states into a state diagram, thereby simplifying the design and implementation of a program. In practice, an action state machine generally comprises elements such as states, transitions, events, and actions.
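The state/transition/event/action model described above can be sketched minimally as follows; the concrete states and actions ("idle", "talking", "raise_hand") are hypothetical examples, not taken from the patent:

```python
# Minimal action state machine sketch: states, events, transitions, and
# per-transition actions, following the generic state/transition/event/action model.
class ActionStateMachine:
    def __init__(self, initial):
        self.state = initial
        self.transitions = {}  # (state, event) -> (next_state, action)

    def add_transition(self, state, event, next_state, action=None):
        self.transitions[(state, event)] = (next_state, action)

    def fire(self, event):
        key = (self.state, event)
        if key not in self.transitions:
            return self.state  # ignore events with no transition defined
        next_state, action = self.transitions[key]
        if action:
            action()  # execute the action attached to this transition
        self.state = next_state
        return self.state

# Hypothetical digital-person states: idle posture vs. talking gesture.
fsm = ActionStateMachine("idle")
log = []
fsm.add_transition("idle", "speech_start", "talking", lambda: log.append("raise_hand"))
fsm.add_transition("talking", "speech_end", "idle", lambda: log.append("rest_pose"))
fsm.fire("speech_start")
print(fsm.state, log)  # talking ['raise_hand']
```

A real implementation would attach keypoint animation clips to the actions instead of log entries.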
It should be noted that the digital human limb language, that is, the body language of the digital human, includes: digital person facial expressions, gesture actions, body gestures, etc., which are not limited in this embodiment.
Step S30: and generating a digital human image sequence according to the digital human limb language.
It will be appreciated that the above-described sequence of digital human images may be used to characterize a sequence of images in the body language of a digital human. In this embodiment, the corresponding digital human image sequence may be generated by a trained deep network model.
Step S40: a digital human video is generated based on the digital human audio and the digital human image sequence.
The digital person video may be a video showing information such as voice, mouth shape, expression, and motion of the digital person.
Step S50: and presenting the digital person video through the digital person to be driven.
It should be noted that, in this embodiment, the system may use stream pushing technology to enable the digital person to be driven to present the generated digital person video in real time, thereby implementing the driving of the digital person to be driven.
In a specific implementation, the scheme may be provided with a natural language synthesis digital human voice module, a digital human limb language generation module, and an audiovisual linkage digital human video generation module; for the functions of these modules, refer to fig. 4, which is a schematic flow diagram of driving a digital human in the digital human driving method of the present invention. As shown in fig. 4, the natural language T_n to be broadcasted is input to the natural language synthesis digital human voice module, which generates the digital human audio Audio_dst and the phoneme sequence information {ph_1, ph_2, …, ph_n}. The limb language generation module then takes the phoneme sequence {ph_1, ph_2, …, ph_n} as input to obtain a facial human body posture model and, combining this with the body human body posture model obtained from the action state machine, registers and fuses the two models to obtain the limb language accompanying action T_b. Finally, the audiovisual linkage digital human video generation module combines the limb language accompanying action T_b and the digital human audio Audio_dst to generate the digital human video Video_dst.
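The three-stage module flow just described can be sketched with stub functions; every function name, module boundary, and data shape below is an illustrative assumption, not the patent's actual interfaces:

```python
# Sketch of the three-module flow of Fig. 4, with stub implementations.
# All names and data shapes are illustrative assumptions.

def synthesize_voice(text):
    """Stub for the natural-language voice synthesis module."""
    audio = b"\x00" * 16          # placeholder PCM bytes
    phonemes = [("n", "i", "3")]  # placeholder phoneme sequence
    return audio, phonemes

def generate_limb_language(phonemes, state="idle"):
    """Stub for limb-language generation from phonemes + action state machine."""
    return [{"frame": i, "pose": state} for i in range(len(phonemes))]

def generate_video(limb_language, audio):
    """Stub for the audiovisual-linkage video generation module."""
    return {"frames": len(limb_language), "audio_bytes": len(audio)}

def drive_digital_person(text):
    audio, phonemes = synthesize_voice(text)
    limb = generate_limb_language(phonemes)
    return generate_video(limb, audio)

print(drive_digital_person("ni3 hao3"))  # {'frames': 1, 'audio_bytes': 16}
```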
Further, before the step S10, the method further includes:
constructing a state node action behavior corresponding to an action state machine by a human body posture estimation method, wherein the action state machine is a model for controlling the action of a digital person to be driven; constructing transition action behaviors among nodes corresponding to the action state machine; and modeling the behavior of the digital person to be driven based on the state node action behavior and the inter-node transition action behavior.
It should be noted that, the above human body posture estimation method may be a method of estimating a position and posture of a human body by a computer vision technique.
It should be understood that the state node behavior may be a node action behavior corresponding to a state element in the action state machine. Correspondingly, the transition action behavior between nodes can be the transition action behavior corresponding to the transition element between nodes in the action state machine.
In this embodiment, referring to fig. 5, fig. 5 is an overall framework diagram of digital human behavior modeling in the digital human driving method of the present invention. As shown in fig. 5, the system first converts any original video Video_src used for modeling digital human behavior into a corresponding image sequence {I_1, I_2, …, I_n}, and then models the node actions and transition actions of the digital human based on a two-dimensional pose model and a three-dimensional pose model respectively. Node action behavior construction based on the two-dimensional pose can be realized with the DWPose model; node action behavior construction based on the three-dimensional pose model can be realized with the HybrIK model; transition action behavior construction based on the two-dimensional pose can be realized with LERP (linear interpolation); and transition action behavior construction based on the three-dimensional pose model can be realized with SLERP (spherical linear interpolation).
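LERP and SLERP are standard interpolation techniques; a generic sketch (not the patent's specific implementation) of both, interpolating 2D keypoints linearly and 3D rotations as unit quaternions, is:

```python
import math

def lerp(p0, p1, t):
    """Linear interpolation between 2D keypoints (2D-pose transitions)."""
    return tuple(a + (b - a) * t for a, b in zip(p0, p1))

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                       # take the shorter arc
        q1, dot = tuple(-x for x in q1), -dot
    if dot > 0.9995:                    # nearly parallel: lerp then renormalize
        q = tuple(a + (b - a) * t for a, b in zip(q0, q1))
        n = math.sqrt(sum(x * x for x in q))
        return tuple(x / n for x in q)
    theta = math.acos(dot)
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return tuple(s0 * a + s1 * b for a, b in zip(q0, q1))

print(lerp((0.0, 0.0), (2.0, 4.0), 0.5))  # (1.0, 2.0)
# halfway between identity and a 90-degree rotation about z is 45 degrees
print(slerp((1, 0, 0, 0), (math.cos(math.pi / 4), 0, 0, math.sin(math.pi / 4)), 0.5))
```

SLERP keeps a constant angular velocity along the rotation arc, which is why it is preferred over LERP for 3D pose transitions.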
It should be noted that DWPose is a model based on RTMPose that applies a two-stage distillation algorithm as a lightweight improvement, greatly increasing inference speed; RTMPose itself is a real-time multi-person key point detection algorithm for human pose estimation. In practical application, the DWPose model can achieve effective whole-body skeleton pose prediction through the two-stage distillation method:
The goal of the first-stage distillation in the DWPose model is to have the student network learn the teacher network's features F_t and logits T_i. For feature distillation, a 1×1 convolution layer maps the student features to the same number of channels as the teacher network, and the distance between the student feature F_s and the teacher feature F_t is calculated with an MSE loss, which can be expressed as:
$$L_{fea} = \mathrm{MSE}\big(f(F_{s}),\, F_{t}\big)$$
where f(·) denotes the 1×1 convolution mapping.
For logits distillation, DWPose improves on RTMPose by giving up the target weight mask when calculating the KL divergence, so that invisible key points also contribute to the distillation; the loss function can be expressed as:
$$L_{logit} = \frac{1}{N K L}\sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{i=1}^{L}\mathrm{KL}\big(T_{i}\,\|\,S_{i}\big)$$
where N is the number of samples in one batch, K is the number of key points, L is the length of the bounding box, and T_i and S_i are the teacher's and student's predicted distributions. (W_{n,k} is the target weight mask used in the original task loss to distinguish invisible key points, and V_i is the label value.)
In addition, DWPose devises a weight-decay strategy such that the weights of the feature loss and the logits loss in the loss function decrease over training time; the final student network loss function is as follows:
$$L = L_{ori} + r\left(L_{fea} + L_{logit}\right), \qquad r = 1 - \frac{t}{t_{\max}}$$
where L_ori is the original task loss, t is the current training epoch, and t_max is the total number of epochs.
The second stage fixes the trained student backbone and fine-tunes only the head through the student model's self-distillation, adjusting the feature distribution so that the head achieves better performance. The following equations represent the second-stage loss function and the overall loss function, where the second-stage loss considers only the logits loss and γ is a hyper-parameter representing the loss scale:
$$L_{s2} = \gamma\, L_{logit}, \qquad L = L_{ori} + L_{s2}$$
In addition, node action behavior construction based on the three-dimensional pose model uses HybrIK. HybrIK is a neural network combined with inverse kinematics that largely solves the problem of unrealistic human body structures in three-dimensional human pose estimation; it can extract a three-dimensional human skeleton from an image and convert it into an SMPL parameterized human pose model for representation.
Specifically, HybrIK first predicts, with a CNN, the preliminary three-dimensional key points P of the human body, the SMPL shape parameter β, the twist rotation angle Φ, and the initial pose T; these are input into the HybrIK module, which combines inverse kinematics with the deep neural network, to infer the SMPL pose parameter θ. Finally, the obtained β and θ parameters are input into the SMPL model through forward kinematics and regression to obtain the final three-dimensional key points Q. This approach allows the three-dimensional key point estimation and the three-dimensional mesh estimation to form a closed loop.
It should be noted that inverse kinematics is the mathematical process of calculating the relative rotations R from the body joint positions P, for which no unique solution exists. HybrIK decomposes each bone rotation into a swing part R_sw and a twist part R_tw, i.e. $R = R_{sw}\, R_{tw}$, where $\vec{p}$ and $\vec{t}$ denote the template vector and the target vector, respectively.
The swing part R_sw of the bone rotation can be calculated analytically and expressed by the Rodrigues formula:
$$R_{sw} = \mathcal{I} + \sin\alpha\,[\hat{n}]_{\times} + (1-\cos\alpha)\,[\hat{n}]_{\times}^{2}$$
where $\mathcal{I}$ is the 3×3 identity matrix, $[\hat{n}]_{\times}$ is the skew-symmetric matrix of the unit axis $\hat{n} = \frac{\vec{p}\times\vec{t}}{\|\vec{p}\times\vec{t}\|}$, and $\hat{n}$ and $\alpha$ (with $\cos\alpha = \frac{\vec{p}\cdot\vec{t}}{\|\vec{p}\|\|\vec{t}\|}$) are the rotation axis and rotation angle determined by the given vectors $\vec{p}$ and $\vec{t}$.
The twist part R_tw of the bone rotation is obtained through prediction by a neural network; with the twist angle φ learned by the network, R_tw can be expressed by the following formula:
$$R_{tw} = \mathcal{I} + \frac{\sin\phi}{\|\vec{p}\|}\,[\vec{p}]_{\times} + \frac{1-\cos\phi}{\|\vec{p}\|^{2}}\,[\vec{p}]_{\times}^{2}$$
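The analytic swing computation above can be checked numerically; the sketch below (assuming non-zero, non-parallel template and target vectors) builds R_sw via the Rodrigues formula and verifies that it rotates the template vector onto the target vector:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def swing_rotation(p, t):
    """Analytic swing R_sw aligning template vector p with target vector t via
    R = I + sin(a)[n]_x + (1 - cos(a))[n]_x^2 (vectors must not be parallel)."""
    p, t = normalize(p), normalize(t)
    n = cross(p, t)                         # rotation axis (unnormalized)
    sin_a = math.sqrt(sum(x * x for x in n))
    cos_a = sum(a * b for a, b in zip(p, t))
    n = [x / sin_a for x in n]              # unit axis
    K = [[0, -n[2], n[1]], [n[2], 0, -n[0]], [-n[1], n[0], 0]]  # skew matrix [n]_x
    return [[(1 if i == j else 0) + sin_a * K[i][j]
             + (1 - cos_a) * sum(K[i][k] * K[k][j] for k in range(3))
             for j in range(3)] for i in range(3)]

R = swing_rotation([1, 0, 0], [0, 1, 0])  # 90-degree swing about +z
# applying R to the template vector should yield the target vector
print([sum(R[i][k] * [1, 0, 0][k] for k in range(3)) for i in range(3)])
```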
The loss function integrates the 3D key point estimation loss, the twist angle estimation loss, and the SMPL parameter loss. The 3D key point estimation takes ResNet as the backbone with 3 deconvolution layers, and an l1 loss on the estimated pose positions is employed:
$$L_{pose} = \frac{1}{K}\sum_{k=1}^{K}\left\|P_{k} - \hat{P}_{k}\right\|_{1}$$
where P_k represents the GT joint coordinates and $\hat{P}_{k}$ the predicted ones.
φ_k is represented by a two-dimensional vector (cos φ_k, sin φ_k) to avoid discontinuity problems in the twist angle estimation, and an l2 loss is used:
$$L_{tw} = \frac{1}{K}\sum_{k=1}^{K}\left\|\left(\cos\phi_{k}, \sin\phi_{k}\right) - \left(\cos\hat{\phi}_{k}, \sin\hat{\phi}_{k}\right)\right\|_{2}$$
where $\hat{\phi}_{k}$ indicates the GT twist angle.
The loss calculation of the SMPL shape parameter β and pose parameter θ is shown in the following formula:
$$L_{smpl} = \left\|\beta - \hat{\beta}\right\|_{2} + \left\|\theta - \hat{\theta}\right\|_{2}$$
where $\hat{\beta}$ and $\hat{\theta}$ denote the corresponding GT parameters.
To sum up, the overall loss function can be expressed as the following equation, where μ_1, μ_2, μ_3 are the weights of the respective loss terms:
$$L = \mu_{1} L_{pose} + \mu_{2} L_{tw} + \mu_{3} L_{smpl}$$
In addition, the system can construct the transition conditions of the action state machine according to the digital human behavior description B and the state transition table STT, and use the action state machine to complete the conversion and transition among the various animations, thereby realizing the behavior modeling of the digital human. In this way, the scheme can establish a mathematical model of digital human behavior from the five aspects of state, transition, event, action, and extension, and respectively construct the node action behaviors and transition action behaviors of the digital human, so as to facilitate the subsequent driving of the digital person to be driven.
The embodiment discloses generating digital person audio and phoneme sequence information corresponding to the digital person to be driven according to the language to be broadcasted; determining the digital person limb language corresponding to the digital person to be driven based on the phoneme sequence information and the action state machine; generating a digital human image sequence according to the digital human limb language; generating a digital human video based on the digital human audio and the digital human image sequence; and presenting the digital person video through the digital person to be driven. Compared with the prior art, in which driving a digital person through real-person performance usually requires a large amount of support during modeling and driving and thus incurs high cost, this embodiment drives the digital person automatically from the language to be broadcasted, thereby reducing the consumption of human resources and lowering the cost.
Referring to fig. 6, fig. 6 is a flowchart of a second embodiment of the digital man driving method according to the present invention.
Based on the first embodiment, in this embodiment, the step S20 includes:
Step S201: and determining a human face key point sequence based on the phoneme sequence information and a preset digital human visual mapping table, wherein the human face key point sequence is a sequence formed by the face key feature points of the digital human to be driven.
It should be noted that the preset digital human visual mapping table may be a mapping table storing the correspondence between the pronunciation actions of syllables in natural language and the visual appearance of the face or lips (i.e., visemes). In this embodiment, the system may obtain the preset digital human visual mapping table from a template mouth shape and the specific digital human's neutral face key points through expression redirection.
It should be understood that the above-mentioned face key point sequence may be a sequence composed of face key feature points of the digital person to be driven, where the face key feature points of the digital person to be driven may include: the present embodiment is not limited to the feature points on the facial organs such as eyebrows, eyes, mouth, nose, and the like.
Further, the step S201 includes:
step S201a: and determining a face key point offset sequence based on the phoneme sequence information and a preset digital human visual mapping table.
It should be noted that the above-mentioned face key point offset sequence may be a sequence formed by a position offset between a current face key point of a digital person to be driven and a target face key point, where the position of the target face key point may be a position where a face key point of the digital person to be driven is located when the digital person to be driven broadcasts digital human audio.
Step S201b: and carrying out time sequence smoothing processing on the face key point offset sequence to obtain a smoothed face key point offset sequence.
Step S201c: and determining a face key point sequence based on the smoothed face key point offset sequence and the digital human neutral face key points.
It should be understood that the above-mentioned digital human neutral face key points may be the face key points corresponding to the digital human's neutral (resting) expression.
In practical application, the system can look up the preset digital human visual mapping table through the phoneme sequence information and obtain a face key point offset sequence according to a blink rule, then perform time sequence smoothing processing on the face key point offset sequence to obtain a smoothed face key point offset sequence, and integrate the smoothed face key point offset sequence with the digital human neutral face key points to obtain the face key point sequence.
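The look-up, temporal-smoothing, and neutral-integration steps above can be sketched as follows; the viseme table, the single mouth-opening coordinate, and the moving-average window are illustrative assumptions, not the patent's actual data:

```python
# Sketch of the face key point pipeline: look up per-phoneme offsets in a
# (hypothetical) viseme table, smooth them over time with a moving average,
# and add them to the neutral face key point.

def moving_average(seq, window=3):
    """Temporal smoothing of a sequence of scalar offsets."""
    half = window // 2
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

def keypoints_from_offsets(offsets, neutral):
    """Neutral keypoint plus smoothed offsets -> final keypoint sequence."""
    return [neutral + d for d in moving_average(offsets)]

# Hypothetical 1-D mouth-opening coordinate driven by three phonemes.
viseme_table = {"a": 4.0, "i": 1.0, "u": 2.0}   # offset per phoneme (illustrative)
offsets = [viseme_table[p] for p in ["a", "i", "u"]]
print(keypoints_from_offsets(offsets, neutral=10.0))
```

A real system smooths a whole vector of 2D/3D keypoint offsets per frame; the scalar case is just the smallest version of the same operation.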
Step S202: and determining a body key point sequence based on the digital human behavior description information and an action state machine, wherein the body key point sequence is a sequence formed by body key feature points of the digital human to be driven.
It should be noted that the above digital person behavior description information may be information for describing the corresponding action behaviors of the digital person in different states.
It can be understood that the above-mentioned body key point sequence is a sequence composed of body key feature points of the digital person to be driven, wherein the body key feature points of the digital person to be driven may include: the present embodiment is not limited to the feature points on the limbs such as the arms, legs, buttocks, etc.
Further, the step S202 includes:
Step S202a: a body keypoint offset sequence is determined based on the digital human behavior descriptive information and the action state machine.
It should be noted that the body keypoint offset sequence may be a sequence consisting of a position offset between a current body keypoint of the digital person to be driven and a target body keypoint, where the position of the target body keypoint may be a position where the body keypoint of the digital person to be driven is located when the digital person to be driven broadcasts digital human audio.
Step S202b: and carrying out time sequence smoothing on the body key point offset sequence to obtain a smoothed body key point offset sequence.
Step S202c: a body keypoint sequence is determined based on the body keypoint offset sequence and the digital human neutral body keypoints.
It should be understood that the above-mentioned digital human neutral body key points may be key points of the body corresponding to the usual digital human actions.
In practical application, the system can obtain a body key point offset sequence based on digital human behavior description information through an action state machine, then can carry out time sequence smoothing processing on the body key point offset sequence to obtain a smoothed body key point offset sequence, and integrates the smoothed body key point offset sequence and the digital human neutral body key points to obtain a body key point sequence.
Step S203: and acquiring an overall key point sequence according to the human face key point sequence and the body key point sequence.
It should be noted that the above-mentioned integral key point sequence may be a sequence obtained by integrating a face key point sequence and a body key point sequence.
Step S204: and determining the limb language of the digital person corresponding to the digital person to be driven according to the integral key point sequence.
In a specific implementation, referring to fig. 7, fig. 7 is a schematic flow chart of generating the digital human limb language in the second embodiment of the digital human driving method of the present invention. As shown in fig. 7, the system may look up the preset digital human visual mapping table through the phoneme sequence information {ph_1, ph_2, …, ph_n} and obtain a face key point offset sequence ΔL_f according to a blink rule, then perform time sequence smoothing on the face key point offset sequence ΔL_f to obtain a smoothed face key point offset sequence, and integrate the smoothed sequence with the digital human neutral face key points L_N^f to obtain the face key point sequence L_f. Meanwhile, the system can obtain a body key point offset sequence ΔL_b from the digital human behavior description information B through the action state machine, then perform time sequence smoothing on the body key point offset sequence ΔL_b to obtain a smoothed body key point offset sequence, and integrate the smoothed sequence with the digital human neutral body key points L_N^b to obtain the body key point sequence L_b. Finally, the system may integrate the face key point sequence L_f and the body key point sequence L_b to obtain the overall key point sequence L, so as to generate the digital human limb language T_b corresponding to the digital person to be driven based on the overall key point sequence L.
The embodiment discloses determining a face key point sequence based on the phoneme sequence information and a preset digital human visual mapping table; determining a body key point sequence based on the digital human behavior description information and the action state machine; acquiring an overall key point sequence according to the face key point sequence and the body key point sequence; and determining the digital person limb language corresponding to the digital person to be driven according to the overall key point sequence. The embodiment integrates the face key point sequence and the body key point sequence of the digital person to be driven into an overall key point sequence, so that the digital human limb language corresponding to the digital person to be driven can be accurately determined, improving the driving accuracy of the digital person.
Referring to fig. 8, fig. 8 is a flowchart of a third embodiment of the digital person driving method according to the present invention.
Based on the above embodiments, in this embodiment, the step S50 includes:
Step S501: respectively converting the digital human audio and the digital human video into an audio stream and a video stream through a target pusher.
It should be noted that the target pusher may be a device or software for implementing real-time audio/video stream pushing. The target pusher in this embodiment may include, but is not limited to, an RTMP (Real-Time Messaging Protocol) pusher, where RTMP is a real-time audio/video streaming media transmission protocol.
In a specific implementation, referring to fig. 9, fig. 9 is a schematic flow chart of RTMP protocol pushing in a third embodiment of the digital human driving method according to the present invention. As shown in fig. 9, when the RTMP pusher is used for pushing, the digital human audio and the digital human video are converted into the corresponding audio stream and video stream.
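As an illustration of how such an RTMP pusher could be invoked, the sketch below builds an FFmpeg command line that muxes a local video and audio file into an RTMP stream. The file names and the push URL are hypothetical placeholders, not values from the text.

```python
def build_rtmp_push_command(video_path, audio_path, rtmp_url):
    """Build an FFmpeg command that muxes audio and video into an RTMP stream.

    FFmpeg's flv muxer is the container conventionally used with RTMP.
    """
    return [
        "ffmpeg",
        "-re",                 # read input at native frame rate (live pacing)
        "-i", video_path,      # digital human video
        "-i", audio_path,      # digital human audio
        "-c:v", "libx264",     # H.264 video for RTMP compatibility
        "-c:a", "aac",         # AAC audio for RTMP compatibility
        "-f", "flv",           # FLV container required by RTMP
        rtmp_url,
    ]

# Hypothetical inputs and push address, for illustration only.
cmd = build_rtmp_push_command(
    "digital_person.mp4", "digital_person.wav",
    "rtmp://localhost/live/digital_person",
)
```

In practice the command would be launched with `subprocess.run(cmd)` or via FFmpeg's C API, as the later description of the scheme suggests.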
Step S502: and carrying out synchronous processing on the audio stream and the video stream according to the target time stamp to obtain the synchronized audio and video stream.
It should be noted that the target timestamp may be a pre-specified timestamp for the generated audio/video streams.
Step S503: pushing the synchronized audio and video stream to a target server, and presenting the synchronized audio and video stream through the target server and the digital person to be driven.
It should be understood that the target server may be a server for processing concurrent requests, such as an Nginx server, which is not limited in this embodiment.
In a specific implementation, referring to fig. 10, fig. 10 is a schematic diagram of a digital human video push presentation in a third embodiment of the digital human driving method of the present invention. As shown in fig. 10, in the pre-generation stage, the digital human voice and phoneme sequence information generated by the natural language synthesis digital human voice module, the digital human limb language generated by the voice-driven limb language generation module, and the digital human video generated by the audio-visual linkage digital human video generation module are all stored in the RabbitMQ process pool. In the online reasoning stage, when no reasoning request is received, the pusher circularly plays the statically read video of the default posture; upon receiving a reasoning request, the pusher obtains video from a queue of images generated in real time.
After the digital human video including voice, mouth shape, expression and motion is generated, the scheme can design a message format for the request message that carries the driving text, so that the system can respond to local or remote request messages. Meanwhile, in the scheme, the API of FFmpeg (an open-source multimedia processing framework) can be used to implement an RTMP pusher, and an Nginx server can be used to build an RTMP live push server, so that the video stream acquired by FFmpeg is pushed to the Nginx server, a user can watch the digital human video through a stream pulling tool, and live broadcasting of the digital person is realized. In the data flow of the present solution, a carrier is needed for the driving text to access the system. Therefore, the system designs a message format that serves both as the carrier of the driving text and as the request for generating digital human video by online reasoning; after the service of the system is started, a message request file can be read locally, or a request message can be obtained from the network through a WebSocket connection. The messages are mainly divided into three types: initInference, inference and closeInference, and the format of each request message type is defined in the system. Referring to fig. 11, fig. 11 is a partial field display diagram of an inference message in a third embodiment of the digital person driving method according to the present invention. In practical application, after receiving a request message, the system can search the corpus according to the sentenceId field and parse the text into the actual input text of the reasoning process. The text then sequentially passes through the natural language synthesis digital human voice module, the voice-driven limb language generation module and the audio-visual linkage digital human video generation module to generate a digital human speaking video.
The system designates the inference model of the digital person through the modelId field, uses RabbitMQ as the message queue middleware responsible for data transfer between the modules, and manages each module in the driving flow with a process pool. After the digital human video is generated, the system can convert the generated audio and video into audio and video streams through the FFmpeg pusher, perform audio/video synchronization according to the timestamp of the audio and video streams designated by the timestamp field, and finally forward them to the push address corresponding to the rtmpUrl field through the Nginx server, so that a user can subsequently watch the digital human video through a stream pulling tool, realizing live broadcasting of the digital person.
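The fields mentioned above (sentenceId, modelId, timestamp, rtmpUrl) and the three message types suggest a request message along the following lines. The JSON layout is an assumption for illustration; only the field names and type names come from the description above.

```python
import json

VALID_TYPES = {"initInference", "inference", "closeInference"}

def parse_request(raw):
    """Parse one request message and validate the fields the text mentions.

    The JSON schema here is assumed; the field names are taken from the
    description of the inference message (sentenceId, modelId, timestamp,
    rtmpUrl).
    """
    msg = json.loads(raw)
    if msg.get("type") not in VALID_TYPES:
        raise ValueError("unknown message type: %r" % msg.get("type"))
    if msg["type"] == "inference":
        # An inference request must name the corpus sentence and the model.
        for field in ("sentenceId", "modelId", "timestamp", "rtmpUrl"):
            if field not in msg:
                raise ValueError("missing field: " + field)
    return msg

# Hypothetical request values, for illustration only.
request = parse_request(json.dumps({
    "type": "inference",
    "sentenceId": "s-001",                    # looked up in the corpus
    "modelId": "m-01",                        # selects the inference model
    "timestamp": 1700000000,                  # used for audio/video sync
    "rtmpUrl": "rtmp://localhost/live/demo",  # push address for Nginx
}))
```

A parsed inference request would then be handed to the three generation modules in sequence, as the text describes.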
Further, in order to optimize the digital person driving process in real time and improve the accuracy and efficiency of the digital person driving, before step S501, the method further includes: segmenting the digital human video to determine the static video segments in the digital human video; constructing a target corpus based on the static video segments; and, when a reasoning request is received, generating a fused digital human video based on the dynamic video segments and the static video segments in the target corpus.
It should be noted that a static video segment may be a segment of the digital human video that does not change; correspondingly, a dynamic video segment may be a dynamically changing video segment generated in real time.
It should be appreciated that the target corpus described above may be a database for storing static video segments.
It can be understood that the above-mentioned fused digital human video may be a video obtained by fusing the dynamic video segments and the static video segments.
In practical applications, real-time optimization of the digital person driving process may be achieved in combination with pre-computation. Specifically, the content of many video segments is similar: for example, the sentences "Good afternoon, Mr. Wang." and "Good afternoon, Mr. Zhang." are identical except for the names "Wang" and "Zhang". Therefore, in practical application, the video can be segmented, the unchanged static parts (i.e. the static video segments) can be pre-computed offline, and a pre-generated target corpus can be constructed. In real-time rendering, only the dynamic parts (i.e. the dynamic video segments) are rendered and replaced into the pre-generated video. Meanwhile, since only the mouth shape needs to be replaced, the body model can reuse the pre-computed content, and only the facial NMFC image of the digital person needs to be rendered in real time.
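The static/dynamic fusion just described can be sketched as a corpus lookup with a real-time fallback. The segment representation (plain strings) and the example phrases are purely illustrative assumptions.

```python
def fuse_video(sentence, corpus, render_dynamic):
    """Fuse precomputed static segments with real-time dynamic segments.

    corpus maps a static phrase to its precomputed video segment; any
    phrase not in the corpus is treated as dynamic and rendered on the
    fly. Segments are plain strings here purely for illustration.
    """
    segments = []
    for phrase in sentence:
        if phrase in corpus:
            segments.append(corpus[phrase])          # reuse precomputed video
        else:
            segments.append(render_dynamic(phrase))  # real-time rendering
    return segments

# Hypothetical corpus: the unchanged part of the greeting is precomputed
# offline, so only the name portion must be rendered per request.
corpus = {"good afternoon": "static[good afternoon]"}
video = fuse_video(
    ["mr. wang", "good afternoon"],
    corpus,
    render_dynamic=lambda p: "dynamic[%s]" % p,
)
```

The benefit is exactly the one the text claims: the more of a sentence that matches precomputed static segments, the less real-time rendering work remains.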
The step S501 includes: respectively converting the digital human audio and the fused digital human video into an audio stream and a video stream through the target pusher.
In this embodiment, when no reasoning request is received, the target pusher circularly plays the statically read video of the default posture; when a reasoning request is received, the pusher obtains video from a queue of images generated in real time, i.e. the dynamic video segments, so that the dynamic video segments can subsequently be fused with the static video segments in the target corpus, after which the digital human audio and the fused digital human video are converted into an audio stream and a video stream. The target pusher can load and push the offline-generated video segments in advance, reducing the waiting time of the user and improving the efficiency of digital person driving.
In addition, the scheme can realize real-time optimization of the digital person driving flow based on the message queue. Specifically, in the online reasoning process each module can run as a process, the message queue is used to transfer data between processes in the form of chunks, and a process pool is used to manage the processes. In this way, the data communication delay between modules can be reduced, and the parallelism of the front-end and back-end modules can be improved.
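The chunked hand-off pattern between modules can be sketched as follows. The real system uses RabbitMQ between processes; this sketch substitutes Python's in-process `queue.Queue` purely to show the pattern, and the chunk size is an arbitrary illustrative choice.

```python
from queue import Queue

CHUNK_SIZE = 4

def produce_chunks(data, q):
    """Split module output into chunks and put them on the queue, so the
    downstream module can start consuming before the whole result exists."""
    for i in range(0, len(data), CHUNK_SIZE):
        q.put(data[i:i + CHUNK_SIZE])
    q.put(None)  # sentinel marking end of stream

def consume_chunks(q):
    """Reassemble chunks on the consumer side."""
    parts = []
    while True:
        chunk = q.get()
        if chunk is None:
            break
        parts.append(chunk)
    return "".join(parts)

q = Queue()
produce_chunks("digital person audio bytes", q)
result = consume_chunks(q)
```

Streaming chunks rather than whole results is what lets the downstream module overlap its work with the upstream module, which is the parallelism gain the text describes.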
The purpose of the real-time optimization combined with pre-computation is to reduce the time consumption of the audio-visual linkage digital human video generation module, while the purpose of the real-time optimization based on the message queue is to reduce the time consumption of data IO operations between the modules and to increase the parallelism between the modules.
The embodiment discloses converting the digital human audio and the digital human video into an audio stream and a video stream respectively through the target pusher; synchronizing the audio stream and the video stream according to the target timestamp to obtain a synchronized audio/video stream; and pushing the synchronized audio/video stream to the target server and presenting it through the target server and the digital person to be driven, so that a user can watch the digital human video through a stream pulling tool, realizing live broadcasting of the digital person.
In addition, the embodiment of the invention also provides a storage medium, on which a digital person driving program is stored, and the digital person driving program, when executed by a processor, implements the steps of the digital person driving method described above.
Referring to fig. 12, fig. 12 is a block diagram showing the structure of a first embodiment of the digital person driving device of the present invention.
As shown in fig. 12, a digital person driving device according to an embodiment of the present invention includes:
The voice generation module 501 is configured to generate digital person audio and phoneme sequence information corresponding to a digital person to be driven according to a language to be broadcasted;
The limb language determining module 502 is configured to determine a limb language of the digital person corresponding to the digital person to be driven based on the phoneme sequence information and the action state machine;
An image sequence generating module 503, configured to generate a digital human image sequence according to the digital human limb language;
a video generation module 504 for generating a digital human video based on the digital human audio and the digital human image sequence;
a video presenting module 505, configured to present the digital person video through the digital person to be driven.
Further, the digital person driving device further includes a model building module, wherein:
The model construction module is used for constructing the state node action behaviors corresponding to the action state machine by a human body posture estimation method, where the action state machine is a model for controlling the actions of the digital person to be driven; constructing the inter-node transition action behaviors corresponding to the action state machine; and modeling the behavior of the digital person to be driven based on the state node action behaviors and the inter-node transition action behaviors.
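A minimal sketch of an action state machine with state node actions and inter-node transition actions is shown below. The state names and action names are hypothetical; the text does not enumerate the actual nodes.

```python
class ActionStateMachine:
    """Minimal action state machine: state nodes carry an action behavior,
    and edges between nodes carry a transition action behavior."""

    def __init__(self, initial):
        self.state = initial
        self.transitions = {}  # (from_state, to_state) -> transition action

    def add_transition(self, src, dst, transition_action):
        self.transitions[(src, dst)] = transition_action

    def step(self, dst):
        """Move to a new state, returning the transition action to play on
        the way; raises KeyError if no such transition was modeled."""
        action = self.transitions[(self.state, dst)]
        self.state = dst
        return action

# Hypothetical states and transition actions, for illustration only.
sm = ActionStateMachine("idle")
sm.add_transition("idle", "greeting", "raise_hand")
sm.add_transition("greeting", "idle", "lower_hand")
played = sm.step("greeting")
```

Modeling transitions explicitly, rather than jumping between poses, is what allows smooth connecting motions between the digital person's behaviors.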
The digital person driving device of the embodiment discloses generating digital person audio and phoneme sequence information corresponding to the digital person to be driven according to the language to be broadcasted; determining the digital human limb language corresponding to the digital person to be driven based on the phoneme sequence information and the action state machine; generating a digital human image sequence according to the digital human limb language; generating a digital human video based on the digital human audio and the digital human image sequence; and presenting the digital human video through the digital person to be driven. Compared with the prior art, in which driving a digital person by the real person performance driving mode usually requires a large amount of support in the modeling and driving process at high cost, the device drives the digital person automatically, reducing the consumption of human resources and lowering the cost.
Based on the above-described first embodiment of the digital person driving device of the present invention, a second embodiment of the digital person driving device of the present invention is presented.
In this embodiment, the limb language determining module 502 is further configured to determine a face key point sequence based on the phoneme sequence information and a preset digital human visual mapping table, where the face key point sequence is a sequence composed of the facial key feature points of the digital person to be driven; determine a body key point sequence based on the digital human behavior description information and the action state machine, where the body key point sequence is a sequence composed of the body key feature points of the digital person to be driven; acquire an overall key point sequence from the face key point sequence and the body key point sequence; and determine the digital human limb language corresponding to the digital person to be driven according to the overall key point sequence.
Further, the limb language determining module 502 is further configured to determine a face key point offset sequence based on the phoneme sequence information and the preset digital human visual mapping table; perform temporal smoothing on the face key point offset sequence to obtain a smoothed face key point offset sequence; and determine the face key point sequence based on the smoothed face key point offset sequence and the digital human neutral face key points.
Further, the limb language determining module 502 is further configured to determine a body key point offset sequence based on the digital human behavior description information and the action state machine; perform temporal smoothing on the body key point offset sequence to obtain a smoothed body key point offset sequence; and determine the body key point sequence based on the smoothed body key point offset sequence and the digital human neutral body key points.
The embodiment discloses determining a face key point sequence based on the phoneme sequence information and a preset digital human visual mapping table; determining a body key point sequence based on the digital human behavior description information and the action state machine; acquiring an overall key point sequence from the face key point sequence and the body key point sequence; and determining the digital human limb language corresponding to the digital person to be driven according to the overall key point sequence. The embodiment integrates the face key point sequence and the body key point sequence of the digital person to be driven into an overall key point sequence, so that the digital human limb language corresponding to the digital person to be driven can be accurately determined from the overall key point sequence, improving the driving accuracy of the digital person.
Based on the above-described device embodiments, a third embodiment of the digital person driving device of the present invention is presented.
In this embodiment, the video presenting module 505 is further configured to convert the digital human audio and the digital human video into an audio stream and a video stream respectively through the target pusher; synchronize the audio stream and the video stream according to the target timestamp to obtain a synchronized audio/video stream; and push the synchronized audio/video stream to the target server and present it through the target server and the digital person to be driven.
Further, the video presenting module 505 is further configured to segment the digital human video and determine the static video segments in the digital human video; construct a target corpus based on the static video segments; generate a fused digital human video based on the dynamic video segments and the static video segments in the target corpus when a reasoning request is received; and convert the digital human audio and the fused digital human video into an audio stream and a video stream respectively through the target pusher.
The embodiment discloses converting the digital human audio and the digital human video into an audio stream and a video stream respectively through the target pusher; synchronizing the audio stream and the video stream according to the target timestamp to obtain a synchronized audio/video stream; and pushing the synchronized audio/video stream to the target server and presenting it through the target server and the digital person to be driven, so that a user can watch the digital human video through a stream pulling tool, realizing live broadcasting of the digital person.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.