
CN1320497C - Statistics and rule combination based phonetic driving human face carton method - Google Patents

Statistics and rule combination based phonetic driving human face carton method Download PDF

Info

Publication number
CN1320497C
CN1320497C CNB021402868A CN02140286A CN1320497C CN 1320497 C CN1320497 C CN 1320497C CN B021402868 A CNB021402868 A CN B021402868A CN 02140286 A CN02140286 A CN 02140286A CN 1320497 C CN1320497 C CN 1320497C
Authority
CN
China
Prior art keywords
video
audio
face
voice
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CNB021402868A
Other languages
Chinese (zh)
Other versions
CN1466104A (en)
Inventor
陈益强
高文
王兆其
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CNB021402868A priority Critical patent/CN1320497C/en
Publication of CN1466104A publication Critical patent/CN1466104A/en
Application granted granted Critical
Publication of CN1320497C publication Critical patent/CN1320497C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract



A speech-driven face animation method based on the combination of statistics and rules, comprising the steps of: using an audio-video synchronous segmentation method to obtain corresponding audio and video data streams; obtaining the corresponding feature vectors through audio-video analysis; learning an audio-video synchronization mapping model with a statistical learning method; and using the statistically learned model together with rules to obtain the face motion parameters corresponding to a user-given speech sequence and drive the face animation model. The invention uses video capture, speech analysis and image processing to record the speech and facial feature-point motion data of a real speaking face, and statistically learns the association patterns between the speech and the facial feature points. When new speech is given, the learned model and a set of rules yield the facial feature-point motion parameters corresponding to that speech, which drive the face animation model.


Description

Speech-driven face animation method combining statistics and rules
Technical field
The present invention relates to a speech-driven face animation method combining statistics and rules. In particular, it uses video capture, speech analysis and image processing to record the speech and facial feature-point motion data of a real speaking face and build an initial speech-image database. The displacement of the speech analysis window is computed from the video capture frame rate and the speech sampling rate so that the speech windows stay aligned with the video frames, and a statistical learning method is then used to obtain a model of the synchronized correspondence between speech and video. With this model, plus rules, the face motion parameters corresponding to arbitrary speech can be obtained and used to drive a face animation model.
Background technology
Now that recovering a realistic three-dimensional face from one or a few images or from a video sequence has become practical, current research focuses on simulating realistic three-dimensional facial behavior. As with speech synthesis, acquiring large amounts of real face motion video and face synthesis units is not particularly difficult; the difficulty lies in how to edit and reuse the existing face animation data. One approach provides a set of convenient editing tools and generates animation sequences by interpolating edited key frames; this is the most direct method, but it requires experts familiar with animation and a great deal of production time. A second approach adopts control techniques, driving the face animation from other related signals such as text, sound, video, or sensors. With text control, the output sound is synthesized speech, and synchronization is hard to manage. With video control, tracking and feature extraction from video images is a difficult problem. With sensors, the equipment is too expensive, and the variation of some detailed feature points can only be estimated. Therefore, what is currently feasible, and what many researchers are working on, is speech-driven face animation. People are very sensitive to facial behavior: they easily judge whether it looks realistic, and they also readily infer the corresponding facial motion from the speech signal. For speech-driven face animation, synthesizing the association between speech, lip motion and facial expression is crucial for the realism and believability of the character.
Cognitive scientists and psychologists have observed that a great deal of correlated information exists between speech and facial behavior. Facial information increases the observer's understanding of both the content and the form of speech, and has been exploited by many systems based on speech interfaces. Conversely, synthesizing a believable face is considered a major obstacle to generating acceptable virtual humans and animated characters. People interpret human motion behavior with high sensitivity, and an unnatural animated face can disturb or even interrupt the understanding of speech. Current speech-driven approaches can be divided into two classes: those based on speech recognition and those that are not. The first class segments speech into linguistic units such as phonemes, visemes and syllables, maps these units directly to lip postures, and then synthesizes by concatenation. This method is straightforward and easy to implement, but it ignores dynamic factors and the synchronization problem: the interaction between the underlying speech segments and the motion of the muscle model is hard to handle. Up to now, nearly all effort on the synchronization problem has concentrated on heuristic rules and classical smoothing methods. For example, Baldy is a 3D talking-head system driven by speech primitives that handles synchronization with a hand-designed model approved by psychologists. The Video Rewrite method obtains new video by rearranging the video segments corresponding to triphones; the result is more natural than generated animation models, but it is worth noting that a triphone represents the transition between phones, not the motion between face frames, and the quality of such a system depends on the number of triphone samples available and on the smoothing technique. When the elementary units of audio and video are represented with discrete speech or visual primitives, much necessary information is lost. In fact, the design of speech primitives only satisfies the need to distinguish pronunciations and convey linguistic content. Speech primitives are very effective for recognition but not optimal for synthesis, mainly because it is difficult to predict from them the relations between prosody and facial expression, between acoustic energy and gesture amplitude, and between speech segments and the timing of lip motion. The second class of methods bypasses speech primitives, finds the mapping between the speech signal and control parameters, and then drives lip motion directly. A neural network can be trained to predict the control parameters from five frames of speech signal before and after the current frame. However, the common practice of manually labeling the control parameters of the corresponding speech segments, while avoiding the difficulty of obtaining facial feature points automatically, also makes it hard for the system to describe the complex variation of the face. Some work places 3D position trackers around the lips and cheeks; accurate face motion data can be obtained this way, but changes in other facial regions, such as the eyes and eyebrows, are still not captured. It has also been proposed to use a hidden Markov model (HMM) to predict control signals from correlated signals and apply it to speech-driven face animation, but the HMM simplifies the problem when handling complex speech data. Moreover, all of the above processing is based on statistical learning: it can handle strongly correlated mappings such as speech to lip motion, but weakly correlated relations, such as speech to eye blinks or head nods, are hard to obtain through learning.
Summary of the invention
The purpose of the present invention is to provide a method that realizes the mapping from speech to face motion by combining statistics and rules.
To achieve the above object, the method provided by the present invention comprises the steps of:
using an audio-video synchronous segmentation method to obtain corresponding audio and video data streams;
obtaining the corresponding feature vectors through audio-video analysis;
learning an audio-video synchronization mapping model with a statistical learning method;
using the statistically learned model and rules to obtain the face motion parameters corresponding to new speech, and driving the face animation model.
The present invention uses video capture, speech analysis and image processing to record the speech and facial feature-point motion data of a real speaking face and build an initial speech-image database. Speech analysis yields the speech features, including linear prediction coefficients and prosodic parameters (energy, zero-crossing rate and fundamental frequency); the feature points corresponding to the MPEG-4 face animation parameters are extracted from the video frames, and the face animation parameters are obtained by computing frame differences and relative displacements. Clustering, statistics and neural networks are used to learn the mapping from speech features to face animation parameters. After learning, when new speech arrives, its speech features are obtained by analysis and mapped to face animation parameters; on this basis, the face motion knowledge base adds rule constraints to the result, realizing a realistic animation.
Description of drawings
Fig. 1 is a schematic diagram of the learning-phase framework;
Fig. 2 is a schematic diagram of facial feature point tracking;
Fig. 3 is a schematic diagram of feature point detection and regions of influence;
Fig. 4 shows some corresponding FDP/FAP points and FAPUs in MPEG-4;
Fig. 5 shows the 29 FAP patterns;
Fig. 6 is a schematic diagram of the application-stage framework;
Fig. 7 compares the statistical visual model method with the neural-network-only method (comparison of the lip height parameter);
Fig. 8 is an example of speech-driven face animation: the upper row shows the real audio-video, and the lower row shows the face motion sequence obtained from the audio according to the present invention.
Embodiment
First, unsupervised cluster analysis of the video yields classes of face animation parameter (FAP) feature vectors. Then a face dynamic model that occurs synchronously with speech events is estimated statistically (in essence, a FAP class transition matrix); we call it the statistical visual model, and its principle is similar to the statistical language model in natural language processing. Finally, several neural networks (ANNs) are trained to map speech patterns to face animation patterns. For new speech data, the trained networks produce several candidate face animation pattern sequences; the statistical visual model selects the best face animation parameter (FAP) sequence among them, the face motion rules then revise and supplement that FAP sequence, and after smoothing these FAPs can directly drive the face mesh model. This strategy has the following distinctive features:
1) The whole process can be described with the classical Bayes rule:
argmax_L Pr(L|A) = argmax_L Pr(A|L)·Pr(L)/Pr(A),
where A is the speech signal. The maximum-likelihood term Pr(A|L) measures how accurately the speech signal is modeled, and the prior model Pr(L) encodes the background knowledge about real face motion, i.e. the statistical visual model.
2) The cluster analysis of the speech signal is built on the classification of face postures, which is better than classifying according to hypothesized speech perception categories. At the same time, because the same lip shape can correspond to very different speech features, training a neural network on the speech signals within each class improves the robustness of the prediction.
3) The statistical visual model allows us to find a face motion trajectory optimized over the whole utterance, making full use of contextual information and avoiding the difficulty of training context-sensitive neural networks.
4) The video signal only needs to be analyzed once to train the correspondence between speech and face animation parameters (FAP); the resulting model can then be used to synthesize other faces.
5) The introduction of face motion rules makes the animation of parts that are weakly correlated with speech, such as blinking and head motion, more realistic.
6) The whole framework can also be used for correlation prediction, control or synthesis between other kinds of signals.
The speech-driven face animation method combining statistics and rules described above comprises two parts: a learning phase and an application phase.
1) The learning phase comprises the following steps (Fig. 1):
A) Synchronous audio-video recording and segmentation
Speech and video data can be recorded synchronously with a video camera to form an AVI file, but for later analysis the audio-video signal must be split into audio and video streams on separate channels. Traditional methods usually fix these settings empirically for a particular camera; the present invention proposes an audio-video synchronous segmentation method that works with video captured by any camera.
Suppose the video capture frame rate is Videoframecount/msec, the audio sample rate is Audiosamplecount/msec, the speech analysis window displacement is Windowmove, the speech analysis window size is Windowsize, the required number of speech windows is m, and the ratio of the speech analysis window to the window displacement is n; then
Windowmove=Audiosamplecount/(Videoframecount*m) (1);
Windowsize=Windowmove*n (2);
where m and n are adjustable parameters set according to the actual situation. Synchronization parameters set in this way make the audio-video synchronization accurate to the sample level.
To cover as many different pronunciations as possible, the method selects the text material summarized in the 863 Chinese speech synthesis corpus CoSS-1 as the reading material for the speaker. CoSS-1 contains the pronunciations of all 1268 isolated Chinese syllables, a large number of 2-4 character words, and the speech of 200 sentences. A synchronized audio-video database of individual characters, words and sentences is recorded. By marking feature points, the motion data of the lips, cheeks, eyelids and other regions can be obtained. The camera is set to 10 frames per second, the captured video is converted to images, and a tracking program processes them to obtain the image feature sequences. Suppose m=6, n=2/3 and a speech sampling rate of 8040 Hz; then the speech analysis window length is 8040/(10*6)=134 samples and the frame shift is 134*2/3≈89 samples. A sketch of this computation follows.
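A minimal sketch of the window computation in formulas (1) and (2), checked against the worked example above; the function name and the use of per-second rates are illustrative assumptions rather than part of the patent.

```python
def sync_window_params(audio_rate, video_fps, m, n):
    """Formulas (1)-(2): derive the speech-analysis window quantities from the
    audio and video rates so that m speech windows line up with one video frame.
    In the worked example the first value (134) is used as the window length and
    the second (89) as the frame shift."""
    windowmove = audio_rate / (video_fps * m)   # formula (1)
    windowsize = windowmove * n                 # formula (2)
    return windowmove, windowsize

# Worked example from the description: 8040 Hz audio, 10 fps video, m=6, n=2/3.
move, size = sync_window_params(8040, 10, 6, 2 / 3)
print(round(move), round(size))   # 134 89
```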
B) Audio and video feature extraction
For audio, the linear prediction parameters and the prosodic parameters (energy, zero-crossing rate and fundamental frequency) of the speech data within a Hamming window are extracted as the speech feature vector.
For video, the facial feature points consistent with MPEG-4 are extracted; the differences Vel = {V1, V2, ..., Vn} between each feature point's coordinates and the standard-frame coordinates are computed, and then the scale reference quantities P = {P1, P2, ..., Pn} corresponding to each feature point of the specific face, as defined by MPEG-4, are computed. The face motion parameters are obtained from formula (3):
Fap_i = (V_i(x|y) / P_i(x|y)) * 1024    (3)
where Fap_i denotes the face motion parameter corresponding to the i-th feature point, V_i(x|y) denotes the x or y coordinate of V_i, and P_i(x|y) denotes the scale reference quantity corresponding to V_i(x|y).
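The mapping from feature-point displacements to FAPs in formula (3) can be sketched as follows; the array names and the use of NumPy are illustrative assumptions, not part of the patent.

```python
import numpy as np

def faps_from_displacements(displacements, scale_refs):
    """Formula (3): FAP_i = (V_i / P_i) * 1024, applied elementwise.

    displacements: per-feature-point x or y offsets from the neutral (standard) frame.
    scale_refs:    the MPEG-4 FAPU-style scale reference for each feature point.
    """
    displacements = np.asarray(displacements, dtype=float)
    scale_refs = np.asarray(scale_refs, dtype=float)
    return displacements / scale_refs * 1024.0

# Example: a vertical displacement of 20 with a scale reference of 200.
print(faps_from_displacements([20.0], [200.0]))   # [102.4]
```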
For the speech features, a conventional Hamming window is used in the speech analysis, so each frame yields 16th-order LPC and RASTA-PLP mixed coefficients plus some prosodic parameters.
For the face motion features, a face animation representation based on MPEG-4 is used. MPEG-4 uses FDPs (face definition parameters) and FAPs (face animation parameters) to specify the face model and its animation, and uses FAPUs (face animation parameter units) to express the displacement range of the FAPs. On this basis, obtaining facial expression and lip motion data amounts to obtaining the corresponding FDP and FAP parameters. To obtain the face motion data, a computer vision system was developed that can synchronously track multiple facial features, such as the mouth corners and lip contour, the eyes and the nose. Fig. 2 shows the feature points that can be tracked. Because accurate feature-point motion data matter more for synthesis than the tracking algorithm itself, the data are obtained by marking particular colors on the face and asking the speaker to reduce head motion as much as possible; Fig. 3 shows the feature points finally obtained and their regions of influence.
The data produced by feature point extraction are absolute coordinates, and because of the speaker's head or body motion, the coordinates obtained by simple image processing contain considerable noise, so preprocessing is required. We assume that the feature points not affected by any FAP do not move relative to one another, and use this invariance to transform image coordinates into the face model's relative coordinates, thereby removing the rotation and scaling caused by the speaker's motion. Among the MPEG-4 feature points of Fig. 4 we choose P0 (11.2), P1 (11.3), P2 (11.1) and P3 (an added point on the nose tip) to form an orthogonal coordinate system (X axis P0P1, Y axis P2P3); from this coordinate system the rotation angle and the scale factor can be computed as follows. Suppose the reference points are P0 (x0, y0), P1 (x1, y1), P2 (x2, y2) and P3 (x3, y3). The origin of the new coordinate system, P(xnew, ynew), is the intersection of the two lines they define, and the rotation angle θ of the new coordinate system relative to the standard coordinates can also be computed. The value (x', y') of an arbitrary point (x, y) in the new coordinate system is then:
x′ = x×cos(θ) − y×sin(θ) + P(x_new)    (4)
y′ = x×sin(θ) + y×cos(θ) + P(y_new)    (5)
To avoid the influence of scaling, the added point on the nose bridge is assumed not to move relative to the first frame; the displacement of any other point relative to this point can then be computed according to formulas (6) and (7), converting image coordinates into face model coordinates and yielding accurate feature point motion data:
x_k″ = (x_k′ − x_k3) − (x_1′ − x_13)    (6)
y_k″ = (y_k′ − y_k3) − (y_1′ − y_13)    (7)
where (x_13, y_13) are the coordinates of the nose tip point in frame 1, (x_1′, y_1′) are the coordinates of the other feature points in frame 1, (x_k3, y_k3) are the coordinates of the nose tip point in frame k, (x_k′, y_k′) are the coordinates of the other feature points in frame k, and (x_k″, y_k″) are the final computed coordinates of the other feature points in frame k. After filtering, the FAP values can be computed from each feature point's coordinates with reference to the face animation parameter units (FAPU) defined in Fig. 4. Suppose the ES0 and ENS0 defined in Fig. 4 are 200 and 140 respectively; then the two FAP values corresponding to point 5.3 at (X, Y) are computed as:
FAP39=X×1024/200 (8)
FAP41=Y×1024/140 (9)
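A sketch of the head-motion normalization and FAP conversion in formulas (4)-(9); the function names and the use of NumPy are assumptions for illustration, and the rotation sign convention follows the reconstruction above.

```python
import numpy as np

def to_face_coords(pt, origin, theta):
    """Formulas (4)-(5): rotate by theta and translate into the face coordinate
    system defined by the reference points P0..P3 (a sketch; the patent's exact
    sign convention may differ)."""
    x, y = pt
    xn = x * np.cos(theta) - y * np.sin(theta) + origin[0]
    yn = x * np.sin(theta) + y * np.cos(theta) + origin[1]
    return xn, yn

def relative_displacement(pt_k, nose_k, pt_1, nose_1):
    """Formulas (6)-(7): displacement of a feature point relative to the nose-tip
    point, measured against the first (neutral) frame."""
    return ((pt_k[0] - nose_k[0]) - (pt_1[0] - nose_1[0]),
            (pt_k[1] - nose_k[1]) - (pt_1[1] - nose_1[1]))

def faps_for_point_5_3(dx, dy, es0=200.0, ens0=140.0):
    """Formulas (8)-(9): convert the normalized displacement of point 5.3 into
    FAP39 and FAP41, using the FAPU values ES0=200 and ENS0=140 from the
    description's example."""
    fap39 = dx * 1024.0 / es0
    fap41 = dy * 1024.0 / ens0
    return fap39, fap41
```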
C) Statistical learning of the mapping from audio features to video features
1. First segment the audio and video synchronously into the feature sets Audio and Video as described in A) and B);
2. Perform unsupervised cluster analysis on the Video set to obtain the basic face motion patterns, denoted class i;
3. Use statistical methods to obtain the transition probabilities between two or more classes, called the statistical visual model, evaluate the model quality with entropy, and repeat step 2 until the entropy is minimal;
4. Divide the data in the speech feature set Audio corresponding to the same basic face motion pattern into subsets Audio(i), where i denotes the class;
5. Train each subset Audio(i) with a neural network whose input is the speech features F(Audio(i)) in the subset and whose output is the degree of membership P(Video(i)) in that class.
A skeleton of this learning pipeline is sketched directly below; the individual steps are detailed in the following subsections.
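A skeleton of steps 1-5, assuming the video frames have already been segmented and clustered into per-frame class labels; the function and argument names are illustrative assumptions, and train_net stands for any per-class trainer such as the network sketched later in this section.

```python
import numpy as np

def learn_mapping(audio_feats, labels, n_classes, train_net):
    """Steps 3-5: build the statistical visual model (class transition matrix)
    from the per-frame cluster labels, then train one network per class on its
    speech-feature subset Audio(i)."""
    # Step 3: relative-frequency estimate of the class transition probabilities.
    trans = np.full((n_classes, n_classes), 1e-6)
    for a, b in zip(labels[:-1], labels[1:]):
        trans[a, b] += 1.0
    trans /= trans.sum(axis=1, keepdims=True)

    # Steps 4-5: split the speech features by face-pattern class and train
    # one network per subset.
    nets = [train_net(audio_feats[labels == i]) for i in range(n_classes)]
    return trans, nets
```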
1. The basic face motion pattern clustering method in step 2
For basic face patterns, cognitive scientists have provided some research results, but these are generally a qualitative set of 6 or more basic expressions, and the realism of results synthesized from such qualitative expressions is poor. Some researchers discover patterns by clustering real data, but at present most cluster analysis is carried out at the phoneme level, ignoring the dynamics of face motion at the sentence level. We want to find, from a large number of real sentences, a set of patterns that effectively express face motion; the discovered patterns may have a clear meaning, such as the 14 lip visemes defined by MPEG-4, or they may simply be basic patterns that are effective for face synthesis. Pattern discovery not only helps the convergence of neural network training, but also lays a foundation for explaining and understanding the complex process of lip and face synthesis. In the clustering process, since the number of basic patterns is unknown, unsupervised clustering is generally adopted.
Clustering algorithms have many parameters, and the parameter settings strongly influence the clustering result. For clustering the basic lip and face motion patterns there is no labeled sample set available for error-rate evaluation, and the geometric properties of the high-dimensional space cannot be observed directly, so evaluating the clustering result is difficult. Although the within-class and between-class distances of the clustered data can guide the evaluation, they cannot describe the effect the clustering achieves in a real system, and for an animation system the perceived quality is what matters. We therefore directly measure whether the clustering result describes the main motion patterns by computing the variance between the clustered (class-mapped) data and the true data. Different clustering results are obtained by adjusting the algorithm parameters, such as the desired number of clusters, the maximum number of training iterations, the minimum number of samples per class, the split parameter P and the merge parameter C; the variance of these results is computed with formula (10), and the results are shown in Table 1:
ErrorSquare(X, Y) = (X − Y)(X − Y)^T / ||X||    (10)
where X is the true data matrix, Y is the matrix of the true data after the class mapping, and ||X|| denotes the size of the matrix.
No.   Min. samples per class   Split parameter P / merge parameter C   Number of clusters   Variance ratio
1     32                       P=0.5-1, C=1-1.5                        18                   3.559787
2     20                       P=0.5-1, C=1-1.5                        21                   4.813459
3     10                       P=0.5-1, C=1-1.5                        23                   2.947106
4     5                        P=0.5-1, C=1-1.5                        29                   2.916784
5     3                        P=0.5-1, C=1-1.5                        33                   2.997993
Table 1: Comparison of clustering results
The above clustering was carried out on 6200 data samples; the desired number of clusters was set to 64, the maximum number of training iterations to 200, and the remaining parameters were adjusted manually, where P denotes the split parameter and C the merge parameter, varying over the ranges shown in Table 1. We find that the variance ratio does not decrease smoothly but shows some jitter, mainly because different clustering parameter choices, such as the initial class centers and the deletion step of the clustering algorithm, affect the result. The variance evaluation shows that the clustering results in rows 3, 4 and 5 differ little and can be considered to have stabilized, so the number of basic facial expression patterns is set to 29. Fig. 5 shows the result. A sketch of the variance evaluation of formula (10) is given below.
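A minimal sketch of the variance evaluation in formula (10), assuming X holds the true FAP vectors and Y the same frames replaced by their cluster centers; the NumPy formulation and the interpretation of ||X|| as the number of entries are illustrative assumptions.

```python
import numpy as np

def error_square(X, Y):
    """Formula (10): squared difference between the true data X and the
    class-mapped data Y, normalized by the size of X (taken here as the number
    of entries; the patent's ||X|| may denote the number of samples)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    diff = X - Y
    return float(np.sum(diff * diff)) / X.size

def class_mapped(X, labels, centers):
    """Replace every frame of X by the center of the cluster it was assigned to."""
    return np.asarray(centers)[np.asarray(labels)]

# Usage: compute error_square(faps, class_mapped(faps, labels, centers)) for each
# clustering-parameter setting and keep the setting whose variance stabilizes.
```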
2. The statistical visual model construction method in step 3
The purpose of building the statistical visual model is to find the face motion trajectory that is optimal over the whole utterance, making full use of contextual information and avoiding the difficulty that a single neural network has in exploiting context. The statistical visual model computes the probability that a face animation sequence occurs. Suppose F is the face animation sequence of a particular utterance,
F = f_1 f_2 ... f_Q
Then P(F) can be computed by the following formula:
P(F) = P(f_1 f_2 ... f_Q) = P(f_1) P(f_2|f_1) ... P(f_Q|f_1 f_2 ... f_{Q-1})    (11)
However, for arbitrary face postures and the sequences composed of them, estimating all possible conditional probabilities P(f_j | f_1 f_2 ... f_{j-1}) is impossible; in practice the N-gram approximation is generally adopted, and P(F) can be approximated as
P(F) = Π_{i=1}^{Q} P(f_i | f_{i-1} f_{i-2} ... f_{i-N+1})    (12)
The conditional probability P(f_i | f_{i-1} f_{i-2} ... f_{i-N+1}) can be obtained by simple relative frequency counting:
P(f_i | f_{i-1} f_{i-2} ... f_{i-N+1}) = F(f_i, f_{i-1}, ..., f_{i-N+1}) / F(f_{i-1}, ..., f_{i-N+1})    (13)
where F(·) is the number of co-occurrences of the given face postures in the training video database. After the statistical visual model is built, perplexity is used to evaluate the quality of the trained model. Suppose θ_i is the center of cluster i obtained by the cluster analysis; for θ = {θ_1, θ_2, ..., θ_n} we want to find an optimized visual model. The perplexity of model θ is defined as:
pp = 2^{H(S,θ)} ≈ 2^{-(1/n) log p(S|θ)}    (14)
where S = s_1, s_2, ..., s_n is the face animation parameter sequence of an utterance and p(S|θ) = Π_i p(s_{i+1} | s_i ... s_1) is the probability of the sequence S under model θ. p(θ) in fact represents our background knowledge of face motion and can be obtained with the statistical method above, for example the bigram or trigram models commonly used in natural language processing. Table 2 compares the perplexity of the statistical visual models obtained from different clustering results:
No.   Number of states   Bi-gram (PP)   Tri-gram (PP)
1     18                 8.039958       2.479012
2     21                 6.840446       2.152096
3     26                 5.410093       1.799709
4     29                 4.623306       1.623896
5     33                 4.037879       1.478828
Table 2: Perplexity comparison
Through the statistical visual model we obtain a set of state transition probabilities; when several candidate face animation sequences are available, the Viterbi algorithm can be used to obtain the face animation sequence that is most probable. A sketch of the transition-probability estimation and the Viterbi decoding is given below.
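A minimal sketch of bigram transition-probability estimation (formula (13) with N=2) and Viterbi decoding over the per-frame class scores produced by the neural networks; the data layout and function names are assumptions for illustration.

```python
import numpy as np

def bigram_transitions(label_sequences, n_states, eps=1e-6):
    """Formula (13) for N=2: relative-frequency estimate of P(state_j | state_i)
    from the clustered training sequences."""
    counts = np.full((n_states, n_states), eps)
    for seq in label_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def viterbi(frame_scores, trans, eps=1e-12):
    """Most probable state path given per-frame class scores (e.g. the ANN
    membership outputs, shape (T, n_states)) and the bigram transition matrix."""
    T, n = frame_scores.shape
    logp = np.log(frame_scores + eps)
    logt = np.log(trans + eps)
    delta = np.zeros((T, n))
    back = np.zeros((T, n), dtype=int)
    delta[0] = logp[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logt + logp[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```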
3. The network learning method in step 5
If the mapping from speech to FAP patterns is regarded as a pattern recognition task, many learning algorithms can be used, such as hidden Markov models (HMM), support vector machines (SVM) and neural networks. Because neural networks show strong efficiency and robustness for learning input-output mappings, a neural network (a BP network) is chosen to learn the large number of recorded sentences. Each cluster node is trained with two neural networks: one marks the state, with value 0 or 1, and the other marks the velocity. Both networks can be described uniformly as:
y_k = f_2( Σ_{j=0}^{n2} w_kj^(2) · f_1( Σ_{i=0}^{n1} w_ji^(1) · x_i ) )    (15)
where x ∈ Φ is the audio feature, w^(1) and w^(2) are the weights and thresholds of each layer, and f_1 and f_2 are sigmoid functions. Training is simple: given the data set, the Levenberg-Marquardt optimization algorithm is used to adjust the weights and thresholds.
For each speech frame, a 16-dimensional LPC and RASTA-PLP mixed vector plus 2 prosodic parameters are computed, forming an 18-dimensional speech feature vector; 6 neighboring frames (before and after) are combined into one input vector, so the input of each neural network has 108 dimensions. For the state network the number of output nodes is 1, representing 0 or 1, the number of hidden nodes is 30, the learning rate is 0.001 and the network error is 0.005. For the velocity network the number of output nodes is 18, representing the 18-dimensional FAP feature vector, the number of hidden nodes is 80, the learning rate is 0.001 and the network error is 0.005.
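A sketch of the per-class state and velocity networks described above (108-dimensional input, 30 or 80 hidden units, 1 or 18 outputs); the use of NumPy and random initialization is illustrative, and the Levenberg-Marquardt training step is only indicated in the lead-in, not reproduced.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TwoLayerNet:
    """Formula (15): y = f2(W2 · f1(W1 · x)) with sigmoid activations.

    The state network uses n_hidden=30, n_out=1; the velocity network uses
    n_hidden=80, n_out=18, as described above.
    """
    def __init__(self, n_in=108, n_hidden=30, n_out=1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_in + 1))   # +1 for bias
        self.W2 = rng.normal(scale=0.1, size=(n_out, n_hidden + 1))

    def forward(self, x):
        h = sigmoid(self.W1 @ np.append(x, 1.0))
        return sigmoid(self.W2 @ np.append(h, 1.0))

# One 108-dimensional input: 6 frames of 18-dimensional speech features.
x = np.zeros(108)
state_net = TwoLayerNet(108, 30, 1)
velocity_net = TwoLayerNet(108, 80, 18)
print(state_net.forward(x).shape, velocity_net.forward(x).shape)   # (1,) (18,)
```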
2) The application phase comprises the following steps (Fig. 6):
1) Audio recording:
Speech data can be obtained directly with a microphone or other recording equipment.
2) Audio feature extraction
Speech features are extracted with the audio feature extraction method of the learning phase.
3) Mapping from audio features to video features based on the statistically learned model
The speech features are fed as input into the neural network corresponding to each face pattern, and each state network gives an output, the degree of membership in that class. After a sentence is finished, the statistical visual model and the Viterbi decoding algorithm are used to obtain the maximum-probability class transition path; connecting the states gives the face animation pattern sequence corresponding to the speech.
Although the information provided by the speech plays the main role, the Viterbi algorithm guarantees that the generated sequence conforms to natural face motion. Representing each state of the sequence directly by its cluster center can already drive the face mesh, but because the basic patterns are a simplification, jitter appears in the animation. Classical methods generally solve this with interpolation, which eliminates jitter but does not match the dynamics of face animation. Since each state now has two neural network predictions, one of which predicts velocity, the final sequence obtained with the transition matrix contains enough information to generate animation consistent with natural face motion, and the formulation is concise. Let T = (t_1, t_2, ..., t_n) be the predicted face motion state points and V = {v_1, v_2, ..., v_n} the velocities at each state point.
Y_{(t·i/m)→t+1} = Y_t + ((Y_{t+1} − Y_t)/m) · v_t · i         if i <= m/2    (16)
Y_{(t·i/m)→t+1} = Y_{t+1} − ((Y_{t+1} − Y_t)/(i·m)) · v_{t+1}  if i > m/2     (17)
where Y_{(t·i/m)→t+1} denotes the i-th inserted frame from state t to state t+1, and m is the number of frames to be inserted between state t and state t+1. Because of the velocity parameter, the generated face animation matches the variability of face motion better than plain interpolation. A sketch of this interpolation is given below.
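A sketch of the velocity-aware interpolation in formulas (16)-(17), inserting m frames between consecutive state points; the variable names follow the description, and the handling of the i > m/2 branch copies the formula as printed.

```python
import numpy as np

def insert_frames(Y_t, Y_t1, v_t, v_t1, m):
    """Velocity-aware interpolation between the state FAP vectors Y_t and Y_t1.

    Formula (16) is used for the first half of the inserted frames and
    formula (17), as printed in the description, for the second half.
    """
    Y_t, Y_t1 = np.asarray(Y_t, float), np.asarray(Y_t1, float)
    frames = []
    for i in range(1, m + 1):
        if i <= m / 2:
            frames.append(Y_t + ((Y_t1 - Y_t) / m) * v_t * i)        # (16)
        else:
            frames.append(Y_t1 - ((Y_t1 - Y_t) / (i * m)) * v_t1)    # (17)
    return frames

# Example: 4 intermediate frames between two one-dimensional FAP states.
print(insert_frames([0.0], [10.0], v_t=1.0, v_t1=1.0, m=4))
```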
4) Revision of the video feature stream based on face motion rules
After the face motion parameter sequence is obtained from the statistical model, small prediction errors can lower the realism of the whole animation sequence, and some face motions, such as blinking and nodding, are only weakly correlated with the speech features. Therefore, on the basis of the statistical learning, the rules in the face motion knowledge base are applied to revise the sequence, improving the output and making the animation more realistic. A sketch of such a rule pass is given below.
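A minimal sketch of a rule-based post-processing pass; the specific rules (a periodic blink and simple smoothing), the FAP index and all thresholds are illustrative assumptions, since the patent only states that knowledge-base rules revise the FAP sequence.

```python
import numpy as np

def apply_rules(fap_frames, fps=25, blink_period_s=4.0, blink_len=3,
                blink_fap_index=19, smooth_window=3):
    """Revise a predicted FAP sequence with simple face-motion rules.

    fap_frames: array of shape (T, n_faps) predicted by the statistical model.
    The blink rule and the moving-average smoothing stand in for the
    knowledge-base rules mentioned in the description.
    """
    frames = np.asarray(fap_frames, dtype=float).copy()
    T = len(frames)

    # Rule 1 (assumed): insert a short eyelid closure every blink_period_s seconds,
    # on an assumed eyelid FAP index.
    step = int(blink_period_s * fps)
    for start in range(step, T, step):
        frames[start:start + blink_len, blink_fap_index] = 1024.0

    # Rule 2 (assumed): light temporal smoothing to suppress residual jitter.
    kernel = np.ones(smooth_window) / smooth_window
    for j in range(frames.shape[1]):
        frames[:, j] = np.convolve(frames[:, j], kernel, mode="same")
    return frames
```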
5) Synchronized audio-video playback
The resulting speech and animation files can be played directly on separate channels; since the data themselves are strictly synchronized, the playback is synchronized as well.
Comparison of experimental results
Both qualitative and quantitative evaluation methods were applied to the system. Quantitative testing is based on computing the error between predicted data and true data, and should be adopted for most machine learning systems. Qualitative testing judges by perception whether the synthesized face motion looks real, which is very important for synthesis. In the quantitative test, the error between predicted and true data was measured for both a closed set (the training data are the test data) and an open set (the test data were not used in training). Fig. 7 shows the test result for the upper lip height parameter in two utterances, compared with a single neural network method; in the upper two plots the test data are training data, and in the lower two plots they are non-training data. By testing all FAP parameters and computing the mean square error between predicted and true data with formula (10), the results in Table 3 are obtained.
Data set         Mean square error (VM+ANN)   Mean square error (ANN)
Training data    2.859213                     3.863582
Test data        4.097657                     5.558253
Table 3: Variance between the FAP predictions and the true data
There is still no unified standard for evaluating multimodal systems. For a speech-driven face animation system, when the face analysis data corresponding to a user's speech cannot be obtained, the error between predicted and true data cannot be computed, so quantitative results alone cannot represent the practical performance of the system. For speaker-independent speech tests, only qualitative methods can generally be adopted: in the experiment, five people watched and listened to the system and assessed it in terms of intelligibility, naturalness, friendliness and the acceptability of the face motion. Because the system not only handles the dynamic changes of the face but also uses the original recorded speech, effectively solving the synchronization problem, it received a high evaluation.
With the system described here, given a person's speech, the neural networks can predict in real time the FAP pattern corresponding to each frame of speech features, and after smoothing these parameters can directly drive the MPEG-4-based face mesh. Fig. 8 shows some frames of the speech-driven face animation.

Claims (5)

1. A speech-driven face animation method based on the combination of statistics and rules, comprising the steps of:
using an audio-video synchronous segmentation method to obtain corresponding audio and video data streams;
obtaining the corresponding feature vectors through audio-video analysis;
learning an audio-video synchronization mapping model with a statistical learning method;
using the statistically learned model plus rules to obtain the face motion parameters corresponding to new speech.

2. The method according to claim 1, wherein said audio-video synchronous segmentation method comprises the steps of:
a. assuming the video capture frame rate is Videoframecount/msec, the audio sample rate is Audiosamplecount/msec, the speech analysis window displacement is Windowmove, the speech analysis window size is Windowsize, the required number of speech windows is m, and the ratio of the speech analysis window to the window displacement is n;
b. Windowmove = Audiosamplecount/(Videoframecount*m)
   Windowsize = Windowmove*n
where m and n are adjustable parameters set according to the actual situation.

3. The method according to claim 1, wherein said audio-video analysis and feature extraction method comprises the steps of:
a. for audio, extracting the linear prediction parameters and prosodic parameters of the speech data within a Hamming window as the speech feature vector;
b. for video, extracting the facial feature points consistent with MPEG-4, then computing the differences Vel = {V1, V2, ..., Vn} between the coordinates of each feature point and the standard-frame coordinates, then computing the scale reference quantities P = {P1, P2, ..., Pn} corresponding to each feature point of the specific face as defined by MPEG-4, and obtaining the face motion parameters by the following formula:
Fap_i = (V_i(x|y) / P_i(x|y)) * 1024
where Fap_i denotes the face motion parameter corresponding to the i-th feature point, V_i(x|y) denotes the x or y coordinate of V_i, and P_i(x|y) denotes the scale reference quantity corresponding to V_i(x|y).

4. The method according to claim 1, wherein the statistical learning method of said audio-video synchronization mapping model comprises the steps of:
a) first obtaining the synchronously segmented feature sets Audio and Video;
b) performing unsupervised cluster analysis on the Video set to obtain the basic face motion patterns, denoted class i;
c) using statistical methods to obtain the transition probabilities between two or more classes, called the statistical visual model, evaluating the model with entropy, and repeating b) until the entropy is minimal;
d) dividing the data in the speech feature set Audio corresponding to the same basic face motion pattern into corresponding subsets Audio(i), where i denotes the class;
e) training each subset Audio(i) with a neural network whose input is the speech features F(Audio(i)) in the subset and whose output is the degree of membership P(Video(i)) in that class.

5. The method according to claim 1, wherein said obtaining the face motion parameters corresponding to the speech features comprises the steps of:
a) for given new speech, extracting the speech features;
b) feeding the speech features as input into the neural network corresponding to each face pattern, and obtaining the output degree of membership in that class;
c) when a sentence is finished, using the statistical visual model and the Viterbi decoding algorithm to obtain the maximum-probability class transition path, which when connected forms the face animation pattern sequence corresponding to the speech;
d) revising the predicted face animation pattern sequence with the rules in the face motion knowledge base to make the result more realistic and natural.
CNB021402868A 2002-07-03 2002-07-03 Statistics and rule combination based phonetic driving human face carton method Expired - Lifetime CN1320497C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB021402868A CN1320497C (en) 2002-07-03 2002-07-03 Statistics and rule combination based phonetic driving human face carton method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB021402868A CN1320497C (en) 2002-07-03 2002-07-03 Statistics and rule combination based phonetic driving human face carton method

Publications (2)

Publication Number Publication Date
CN1466104A CN1466104A (en) 2004-01-07
CN1320497C true CN1320497C (en) 2007-06-06

Family

ID=34147542

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB021402868A Expired - Lifetime CN1320497C (en) 2002-07-03 2002-07-03 Statistics and rule combination based phonetic driving human face carton method

Country Status (1)

Country Link
CN (1) CN1320497C (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100343874C (en) * 2005-07-11 2007-10-17 北京中星微电子有限公司 Voice-based colored human face synthesizing method and system, coloring method and apparatus
CN100369469C (en) * 2005-08-23 2008-02-13 王维国 Method for synthesizing video and audio file by voice-driven head image
CN100476877C (en) * 2006-11-10 2009-04-08 中国科学院计算技术研究所 Speech and text-driven cartoon face animation generation method
CN101482976B (en) * 2009-01-19 2010-10-27 腾讯科技(深圳)有限公司 Method for driving change of lip shape by voice, method and apparatus for acquiring lip cartoon
CN101488346B (en) * 2009-02-24 2011-11-02 深圳先进技术研究院 Speech visualization system and speech visualization method
CN102820030B (en) * 2012-07-27 2014-03-26 中国科学院自动化研究所 Vocal organ visible speech synthesis system
GB2510200B (en) * 2013-01-29 2017-05-10 Toshiba Res Europe Ltd A computer generated head
CN103279970B (en) * 2013-05-10 2016-12-28 中国科学技术大学 A kind of method of real-time voice-driven human face animation
US10586368B2 (en) * 2017-10-26 2020-03-10 Snap Inc. Joint audio-video facial animation system
CN109409307B (en) * 2018-11-02 2022-04-01 深圳龙岗智能视听研究院 Online video behavior detection method based on space-time context analysis
CN110072047B (en) * 2019-01-25 2020-10-09 北京字节跳动网络技术有限公司 Image deformation control method, device and hardware device
CN110599573B (en) * 2019-09-03 2023-04-11 电子科技大学 Method for realizing real-time human face interactive animation based on monocular camera
CN110610534B (en) * 2019-09-19 2023-04-07 电子科技大学 Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN111243626B (en) * 2019-12-30 2022-12-09 清华大学 Method and system for generating speaking video
CN115100329B (en) * 2022-06-27 2023-04-07 太原理工大学 Multi-mode driving-based emotion controllable facial animation generation method
CN116912373B (en) * 2023-05-23 2024-04-16 苏州超次元网络科技有限公司 Animation processing method and system
CN118349640A (en) * 2024-01-08 2024-07-16 河北软件职业技术学院 A cloud computing virtual teaching aid digital human construction system based on AIGC

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0992933A2 (en) * 1998-10-09 2000-04-12 Mitsubishi Denki Kabushiki Kaisha Method for generating realistic facial animation directly from speech utilizing hidden markov models
US20020024519A1 (en) * 2000-08-20 2002-02-28 Adamsoft Corporation System and method for producing three-dimensional moving picture authoring tool supporting synthesis of motion, facial expression, lip synchronizing and lip synchronized voice of three-dimensional character
US20020054047A1 (en) * 2000-11-08 2002-05-09 Minolta Co., Ltd. Image displaying apparatus

Also Published As

Publication number Publication date
CN1466104A (en) 2004-01-07

Similar Documents

Publication Publication Date Title
CN1320497C (en) Statistics and rule combination based phonetic driving human face carton method
Ferstl et al. Multi-objective adversarial gesture generation
CN103218842B (en) A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
CN104361620B (en) A kind of mouth shape cartoon synthetic method based on aggregative weighted algorithm
Ferstl et al. Adversarial gesture generation with realistic gesture phasing
CN1162839C (en) Method and apparatus for generating an acoustic model
CN101101752B (en) A lip-reading recognition system for monosyllabic languages based on visual features
Zeng et al. Audio-visual affect recognition
CN110516696A (en) A dual-modal fusion emotion recognition method with adaptive weight based on speech and expression
CN111967272B (en) Visual dialogue generating system based on semantic alignment
CN106919251A (en) A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition
CN111126280B (en) Gesture recognition fusion-based aphasia patient auxiliary rehabilitation training system and method
CN105551071A (en) Method and system of face animation generation driven by text voice
CN105390133A (en) Tibetan TTVS system realization method
CN103258340A (en) Pronunciation method of three-dimensional visual Chinese mandarin pronunciation dictionary with pronunciation being rich in emotion expression ability
CN118897887B (en) An efficient digital human interaction system integrating multimodal information
CN104538025A (en) Method and device for converting gestures to Chinese and Tibetan bilingual voices
CN118519524A (en) Virtual digital person system and method for learning disorder group
CN119992624A (en) An intelligent children&#39;s education method based on emotion recognition using a reading pen
Windle et al. The uea digital humans entry to the genea challenge 2023
CN1499484A (en) Recognition system of Chinese continuous speech
Wen et al. 3D Face Processing: Modeling, Analysis and Synthesis
Li et al. A novel speech-driven lip-sync model with CNN and LSTM
CN118968579A (en) Audio-driven digital human generation system and method based on user prompts
Braffort Research on computer science and sign language: Ethical aspects

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term

Granted publication date: 20070606

CX01 Expiry of patent term