Embodiment
First, unsupervised cluster analysis is applied to the video to obtain classes of facial animation parameter (FAP) feature vectors. Then the facial dynamic model that occurs in synchrony with speech events (in essence a FAP class transition matrix) is built statistically; we call it the statistical visual model, and it is similar in principle to the statistical language models used in natural language processing. Finally, several artificial neural networks (ANNs) are trained to map speech patterns to facial animation patterns. For new speech data, the trained networks produce a number of candidate facial animation pattern sequences; the statistical visual model selects the best facial animation parameter (FAP) sequence among them; facial motion rules then revise and supplement the FAP sequence; and after smoothing, the resulting FAPs can directly drive the face wireframe model. This strategy has the following distinctive features:
1) The whole process can be described by the classical Bayes rule:

Pr(L|A) ∝ Pr(A|L)·Pr(L)

where A denotes the speech signal and L the facial motion sequence. The likelihood Pr(A|L) measures how accurately the speech signal is modeled, and the prior Pr(L) encodes background knowledge about real facial motion, i.e. the statistical visual model.
2) The cluster analysis of the speech signal is built on the classification of facial postures, which works better than the alternative of classifying by speech perception. At the same time, because the same lip shape corresponds to quite different phonetic features, training a neural network on the speech signals of each class improves the robustness of the prediction.
3) The statistical visual model lets us find the facial motion trajectory that is optimal over the whole sentence, making full use of contextual information and thereby avoiding the defect that neural network training has difficulty capturing context.
4) The video signal only needs to be analyzed once, to train the correspondence between speech and facial animation parameters (FAP); the resulting model can then be used to synthesize other faces.
5) The introduction of facial motion rules makes the animation of parts only weakly correlated with speech, such as blinking and head movement, more realistic.
6) The whole framework can be used for correlation prediction and control or synthesis between other pairs of signals.
The above voice-driven facial animation method, which combines statistical learning with rules, comprises the following two stages: a statistics-based learning stage and an application stage.
1) The learning stage comprises the following steps (Fig. 1):
A) Synchronized audio-video recording and segmentation
Speech and video data can be recorded synchronously with a camera, forming an AVI file; for later analysis, however, the audio-video signal must be split into audio and video streams on different channels. Traditional methods usually rely on fixed, empirically chosen settings for a particular camera; the present invention proposes a synchronized audio-video segmentation method that works for video captured by any camera.
Suppose the video capture frame rate is Videoframecount per msec, the audio sample rate is Audiosamplecount per msec, the speech analysis window shift is Windowmove, the speech analysis window size is Windowsize, the number of speech analysis windows needed per video frame is m, and the ratio of window shift to window size is n. Then:

Windowsize = Audiosamplecount/(Videoframecount×m) (1)

Windowmove = Windowsize×n (2)

where m and n are adjustable parameters set according to the actual situation. Synchronization parameters set by this method make the audio-video synchronization accurate to the sample.
To cover as complete a range of pronunciations as possible, the method selects the text of the 863-program Chinese speech synthesis corpus CoSS-1 as the speaker's reading material. CoSS-1 contains the pronunciations of all 1268 individual Chinese syllables, a large number of 2-4 character words, and 200 sentences. A synchronized audio/video database of individual characters, words, and sentences was recorded. By marking feature points, motion data for the lips, cheeks, eyelids, and other regions can be obtained. The camera is set to 10 frames/s, and the captured video is converted to images that a tracking program processes into image feature sequences. With m=6, n=2/3, and a speech sampling rate of 8040 Hz, the speech analysis window length is 8040/(10×6)=134 samples and the frame shift is 134×2/3≈89 samples.
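A minimal sketch of equations (1) and (2); the function and variable names are ours, and the 8040 Hz / 10 frames-per-second figures are the example values above:

```python
def speech_window_params(audio_rate_hz, video_fps, m, n):
    """Window size and shift so that m analysis windows align exactly
    with one video frame (equations (1) and (2))."""
    window_size = audio_rate_hz / (video_fps * m)  # samples per analysis window
    window_move = window_size * n                  # window shift (shift/size ratio n)
    return int(round(window_size)), int(round(window_move))

# Example from the text: 8040 Hz audio, 10 video frames/s, m=6, n=2/3
size, move = speech_window_params(8040, 10, 6, 2 / 3)
print(size, move)  # 134 89
```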
B) Audio and video feature extraction.
For audio, the linear prediction coefficients and prosodic parameters (energy, zero-crossing rate, and fundamental frequency) of the speech data within each Hamming window are extracted as the speech feature vector.
For video, the facial feature points consistent with MPEG-4 are extracted; the difference between each feature point coordinate and the standard-frame coordinate is computed, V = {V1, V2, …, Vn}, and the person-specific scale reference quantity of each feature point under the MPEG-4 definition is computed, P = {P1, P2, …, Pn}. The facial animation parameters are then obtained by formula (3):

FAPi = (Vi(x|y)/Pi(x|y))×1024 (3)

where FAPi denotes the facial animation parameter corresponding to the i-th feature point, Vi(x|y) denotes the x or y coordinate of Vi, and Pi(x|y) denotes the scale reference quantity corresponding to Vi(x|y).
For speech features, a traditional Hamming window is used in the analysis; each frame thus yields 16th-order LPC and RASTA-PLP mixed coefficients and several prosodic parameters.
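The sketch below computes per-frame energy, zero-crossing rate, and autocorrelation-method LPC via the Levinson-Durbin recursion; the RASTA-PLP mixing and pitch extraction used in the method are omitted, so this is an illustrative subset rather than the exact feature set:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve for LPC coefficients from the autocorrelation sequence r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        err *= (1.0 - k * k)
    return a[1:]  # prediction coefficients a_1 .. a_order

def frame_features(frame, order=16):
    """Energy, zero-crossing rate and 16th-order LPC for one Hamming-windowed
    analysis frame (RASTA-PLP mixing and pitch are omitted in this sketch)."""
    w = frame * np.hamming(len(frame))
    energy = float(np.sum(w * w))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    r = np.correlate(w, w, mode="full")[len(w) - 1:]
    return np.concatenate(([energy, zcr], levinson_durbin(r, order)))

frame = np.random.default_rng(0).normal(size=134)  # one 134-sample window
print(frame_features(frame).shape)  # (18,)
```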
For facial motion features, a facial animation representation scheme based on MPEG-4 is used. MPEG-4 specifies the face model and its animation with FDPs (facial definition parameters) and FAPs (facial animation parameters), and indicates the displacement range of each FAP with FAPUs (facial animation parameter units). On this basis, obtaining facial expression and lip motion data amounts to obtaining the corresponding FDP and FAP parameters. To acquire facial motion data, a computer vision system was developed that can synchronously track many facial features, such as the mouth corners and lip line, the eyes, and the nose. Fig. 2 shows the feature points that can be tracked and captured. Because accurate feature point motion data matter more for our synthesis than an elaborate tracking algorithm, we obtain the data by placing markers of a particular color on the face and asking the speaker to reduce head movement as far as possible; Fig. 3 shows the feature points finally obtained and their regions of influence.
The data produced by feature point extraction are absolute coordinates, and because of the speaker's head or body movement, the coordinate values obtained by simple image processing carry considerable noise, so preprocessing is needed. We assume that the feature points not influenced by any FAP do not move relative to one another, and use this invariance to convert image coordinates into the face model's relative coordinates, thereby removing the effect of the rotation and scaling caused by the speaker's movement. Among the MPEG-4 feature points in Fig. 4, we chose P0 (11.2), P1 (11.3), P2 (11.1), and P3 (an added point on the nose tip) to form an orthogonal coordinate system (X axis P0P1, Y axis P2P3); from this coordinate system the rotation angle and the scaling factor can be calculated as follows. Suppose the reference point coordinates are P0 (x0, y0), P1 (x1, y1), P2 (x2, y2), and P3 (x3, y3). The origin of the new coordinate system, assumed to be P(xnew, ynew), is the intersection of the two straight lines they define, and the rotation angle θ of the new coordinate system relative to the normal coordinates can also be calculated. The value (x′, y′) of an arbitrary point (x, y) under the new coordinate system can then be calculated as:
x′ = x×cos(θ) − y×sin(θ) + xnew (4)

y′ = x×sin(θ) + y×cos(θ) + ynew (5)
To avoid the effect of scaling, the added point on the nose is assumed not to move relative to the first frame; the displacement of any other point relative to this point can then be computed by formulas (6) and (7), converting image coordinates into face model coordinates and yielding accurate feature point motion data:
xk″ = (xk′ − xk3) − (x1′ − x13) (6)

yk″ = (yk′ − yk3) − (y1′ − y13) (7)

where (x13, y13) denotes the nose tip coordinate of the 1st frame, (x1′, y1′) the other feature point coordinates of the 1st frame, (xk3, yk3) the nose tip coordinate of frame k, (xk′, yk′) the other feature point coordinates of frame k, and (xk″, yk″) the final computed coordinates of the other feature points of frame k. After filtering, each feature point coordinate can be converted into a FAP value with reference to the facial animation parameter units (FAPU) defined in Fig. 4. Suppose the ESO and ENSO defined in Fig. 4 are 200 and 140 respectively; then the two FAP values corresponding to point 5.3 at (x, y) can be computed as:

FAP39 = x×1024/200 (8)

FAP41 = y×1024/140 (9)
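A numpy sketch of the normalization chain in equations (4) through (9), assuming P0-P3 are the tracked axis points given as 2-vectors; the rotation direction is chosen here to map image coordinates back into the face-model frame:

```python
import numpy as np

def cross2(a, b):
    return a[0] * b[1] - a[1] * b[0]

def pose_normalize(points, p0, p1, p2, p3):
    """Map image coordinates into the face-model frame whose X axis is the
    line P0P1 and whose Y axis is the line P2P3 (cf. equations (4), (5)).
    The origin is the intersection of the two lines."""
    d1, d2 = p1 - p0, p3 - p2
    t = cross2(p2 - p0, d2) / cross2(d1, d2)  # assumes the axes intersect
    origin = p0 + t * d1
    theta = np.arctan2(d1[1], d1[0])          # rotation of the new frame
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, s], [-s, c]])         # rotate points by -theta
    return (points - origin) @ rot.T

def relative_displacement(pts_k, nose_k, pts_1, nose_1):
    """Equations (6), (7): displacement of each point relative to the
    nose-tip point, measured against the first frame."""
    return (pts_k - nose_k) - (pts_1 - nose_1)

def to_fap(displacement, fapu):
    """Equations (8), (9): scale a displacement by its FAPU."""
    return displacement * 1024.0 / fapu

# e.g. FAP39 / FAP41 for point 5.3 with ESO=200, ENSO=140
d = np.array([3.0, -2.0])
print(to_fap(d[0], 200.0), to_fap(d[1], 140.0))
```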
C) Statistical learning from audio features to video features.
① First segment the audio and video synchronously into the feature sets Audio and Video as described in a) and b);
② Apply unsupervised cluster analysis to the Video set to obtain the basic facial motion patterns, giving I classes;
③ Use statistical methods to obtain the transition probabilities between two or more classes, called the statistical visual model; evaluate the model quality with entropy, returning to step ② until the entropy is minimal;
④ Divide the data in the speech feature set Audio corresponding to each basic facial motion pattern into subsets Audio(i), where i indexes the class;
⑤ Train one neural network per subset Audio(i), whose input is the speech features F(Audio(i)) in the subset and whose output is the degree of membership P(Video(i)) of that class.
1. The clustering method for the basic facial motion patterns in step ②
For basic facial patterns, cognitive scientists have provided some research results, but these generally give, qualitatively, six or so basic expressions, and synthesis based on such qualitative expressions is not very realistic. Some researchers have discovered patterns by clustering real data, but at present most cluster analysis is carried out at the phoneme level, ignoring the dynamics of facial motion at the sentence level. We want to find a set of patterns that effectively express facial motion from a large number of real sentences; the discovered patterns may have a clear meaning, like the 14 lip patterns defined by MPEG-4, or they may simply be basic patterns that are effective for face synthesis. Pattern discovery not only aids the convergence of neural network training but also lays a foundation for the later explanation and understanding of the complex process of lip motion and face synthesis. In the clustering process, because the number of such basic patterns is uncertain, unsupervised clustering is generally adopted.
Clustering algorithms have many parameters to set, and the parameter settings strongly influence the clustering result. For clustering the basic lip and facial motion patterns, there is no test sample set of known classes for estimating the error rate, and the geometric properties of the high-dimensional space cannot be observed directly, so evaluating the clustering result is difficult. Although within-class and between-class distances can be used to guide the evaluation, they cannot describe the effect the clustering will achieve in a real system, and for an animation system that effect is what matters. We therefore directly measure whether the clustering result captures the main motion patterns by computing the variance between the class-mapped data and the true data:

D = Σ(X − Y)²/‖X‖ (10)

where X is the true data matrix, Y is the matrix of the true data after class mapping (each sample replaced by its cluster center), and ‖X‖ denotes the matrix size. Different clustering results can be obtained by adjusting the algorithm parameters, such as the desired number of clusters, the maximum number of training iterations, the minimum number of samples per class, the split parameter P, and the merge parameter C; computing the variance of these results by formula (10) gives the comparison shown in Table 1:
| | Minimum samples per class | Split parameter P / merge parameter C | Number of clusters | Variance ratio |
| 1 | 32 | P=0.5-1, C=1-1.5 | 18 | 3.559787 |
| 2 | 20 | P=0.5-1, C=1-1.5 | 21 | 4.813459 |
| 3 | 10 | P=0.5-1, C=1-1.5 | 23 | 2.947106 |
| 4 | 5 | P=0.5-1, C=1-1.5 | 29 | 2.916784 |
| 5 | 3 | P=0.5-1, C=1-1.5 | 33 | 2.997993 |
Table 1: Comparison of clustering results
The above clustering was carried out on 6200 data samples, with the desired number of clusters set to 64 and the maximum number of training iterations set to 200; the remaining parameters were adjusted manually, with the split parameter P varying in [0.5, 1] and the merge parameter C in [1, 1.5]. We find that the variance ratio does not decline smoothly but shows some jitter, mainly because different parameter choices affect the selection of the initial class centers and the deletion steps of the clustering algorithm. From the variance evaluation, the clustering results of rows 3, 4, and 5 differ little and can be considered to have stabilized; the number of basic facial expression patterns is therefore set to 29. Fig. 5 shows the result.
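The clustering described above is an ISODATA-style algorithm with split parameter P and merge parameter C; the sketch below substitutes plain k-means (so P and C are not modeled) and evaluates the result with the variance measure of formula (10):

```python
import numpy as np

def kmeans(X, k, iters=200, seed=0):
    """Plain k-means as a simplified stand-in for the ISODATA-style
    clustering; split/merge parameters P and C are not modeled here."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

def variance_ratio(X, centers, labels):
    """Formula (10): deviation between the true data X and the class-mapped
    data Y, i.e. each sample replaced by its cluster center."""
    Y = centers[labels]
    return float(((X - Y) ** 2).sum() / X.size)

# e.g. 6200 FAP feature vectors clustered into 29 basic patterns
X = np.random.default_rng(1).normal(size=(6200, 18))
centers, labels = kmeans(X, 29)
print(variance_ratio(X, centers, labels))
```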
2. The method for building the statistical visual model in step ③
The purpose of building the statistical visual model is to find the facial motion trajectory that is optimal over the whole sentence, making full use of contextual information and thereby avoiding the defect that a single neural network has difficulty exploiting context in training. The statistical visual model can compute the probability of a video sequence occurring. Suppose F is the facial animation sequence of a particular sentence, say

F = f1 f2 … fQ

Then P(F) can be calculated by the chain rule:

P(F) = P(f1 f2 … fQ) = P(f1)P(f2|f1)…P(fQ|f1 f2 … fQ−1) (11)
However, for sequences composed of arbitrary facial postures, estimating every possible conditional probability P(fj|f1 f2 … fj−1) is impossible; in practice the N-gram approximation is generally adopted, so that P(F) can be approximated as

P(F) ≈ ∏i P(fi|fi−N+1 … fi−1)

The conditional probability P(fi|fi−1 fi−2 … fi−N+1) can be obtained by simple relative frequency:

P(fi|fi−N+1 … fi−1) = C(fi−N+1 … fi)/C(fi−N+1 … fi−1)

where C(·) denotes the number of occurrences of the given facial posture sequence in the training video database. After building the statistical visual model, we adopt perplexity to evaluate the performance of the trained model. Suppose θi is the cluster center of cluster i obtained by the cluster analysis; for θ = {θ1, θ2, …, θn} we wish to find an optimized visual model. The perplexity of a model θ can be defined as

PP(θ) = P(S|θ)^(−1/n)

where S = s1, s2, …, sn denotes the facial animation parameter sequence of a sentence and P(S|θ) denotes the probability of the sequence S under model θ. P(θ) in fact represents our background knowledge of facial motion and can be obtained by the statistical method above, for example with the bigram or trigram methods common in natural language processing. Table 2 compares the perplexity of the statistical visual models obtained from different clustering results:
| | Number of states | Bi-gram (PP) | Tri-gram (PP) |
| 1 | 18 | 8.039958 | 2.479012 |
| 2 | 21 | 6.840446 | 2.152096 |
| 3 | 26 | 5.410093 | 1.799709 |
| 4 | 29 | 4.623306 | 1.623896 |
| 5 | 33 | 4.037879 | 1.478828 |
Table 2: Perplexity comparison
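A sketch of the bigram form of the statistical visual model: transition probabilities are estimated by relative frequency over training pattern sequences, and perplexity is computed as the inverse geometric mean probability of a sequence. The add-one smoothing is our assumption, added so that unseen transitions keep the perplexity finite:

```python
import numpy as np

def train_bigram(sequences, n_states):
    """Relative-frequency estimate of P(f_i | f_{i-1}), add-one smoothed."""
    counts = np.ones((n_states, n_states))  # add-one smoothing (assumption)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def perplexity(trans, seq):
    """PP = P(S)^(-1/n): inverse geometric mean transition probability."""
    logp = sum(np.log(trans[a, b]) for a, b in zip(seq, seq[1:]))
    return float(np.exp(-logp / (len(seq) - 1)))

train = [[0, 1, 2, 1, 0], [1, 2, 2, 0, 1]]  # toy pattern sequences
model = train_bigram(train, n_states=3)
print(perplexity(model, [0, 1, 2, 0]))
```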
Through the statistical visual model we obtain a set of state transition probabilities; when several candidate facial animation sequences are available, the Viterbi algorithm can be used to obtain the facial animation sequence that is most probable, as sketched below.
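A sketch of the Viterbi decoding step: per-frame class scores (e.g. from the state neural networks) are combined with the transition probabilities of the statistical visual model to recover the most probable pattern sequence; all probabilities are assumed strictly positive, e.g. smoothed as above:

```python
import numpy as np

def viterbi(obs_prob, trans, prior):
    """obs_prob: (T, K) per-frame class scores; trans: (K, K) statistical
    visual model; prior: (K,) initial probabilities. Returns the most
    probable facial pattern sequence (log domain to avoid underflow)."""
    T, K = obs_prob.shape
    logd = np.log(prior) + np.log(obs_prob[0])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = logd[:, None] + np.log(trans)   # (prev state, next state)
        back[t] = np.argmax(cand, axis=0)
        logd = cand[back[t], np.arange(K)] + np.log(obs_prob[t])
    path = [int(np.argmax(logd))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

obs = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(obs, trans, prior=np.array([0.5, 0.5])))  # [0, 0, 0]
```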
3. The network learning method in step ⑤
If the mapping from speech to FAP patterns is regarded as a pattern recognition task, many learning algorithms can be applied, such as hidden Markov models (HMM), support vector machines (SVM), and neural networks. Because neural networks show strong efficiency and robustness in learning input-output mappings, we select a back-propagation (BP) neural network to learn from the large number of recorded sentences. Each cluster node is trained with two neural networks: one marks the state, with value 0 or 1, and the other marks the velocity. Both networks can be described uniformly as

y = f2(w(2)·f1(w(1)·x + b(1)) + b(2))

where x ∈ Φ is an audio feature vector, w(1) and w(2) are the weights and b(1) and b(2) the thresholds of each layer, and f1 and f2 are sigmoid functions. Training is simple: given the data set, the Levenberg-Marquardt optimization algorithm is used to adjust the weights and thresholds.
For each speech frame a 16-dimensional LPC and RASTA-PLP mixed vector plus 2 prosodic parameters are computed, forming an 18-dimensional speech feature vector; the 6 surrounding frames are combined into one input vector, so each neural network has a 108-dimensional input. For the state network, the number of output nodes is 1, representing 0 or 1; the number of hidden nodes is 30, the learning rate is 0.001, and the network error target is 0.005. For the velocity network, the number of output nodes is 18, representing the 18-dimensional FAP feature vector; the number of hidden nodes is 80, the learning rate is 0.001, and the network error target is 0.005.
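A sketch of the state and velocity networks for one cluster using scikit-learn; note that scikit-learn trains with Adam or L-BFGS rather than the Levenberg-Marquardt algorithm named above, and the training data here are random placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

# 18-dim features x 6 context frames = 108 inputs per network.
# State net: 30 hidden units, a single 0/1 output.
state_net = MLPClassifier(hidden_layer_sizes=(30,), activation="logistic",
                          learning_rate_init=0.001, tol=0.005, max_iter=2000)
# Velocity net: 80 hidden units, 18 FAP-velocity outputs.
speed_net = MLPRegressor(hidden_layer_sizes=(80,), activation="logistic",
                         learning_rate_init=0.001, tol=0.005, max_iter=2000)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 108))          # speech feature vectors (placeholder)
y_state = rng.integers(0, 2, size=500)   # class membership 0/1 (placeholder)
y_speed = rng.normal(size=(500, 18))     # 18-dim FAP velocities (placeholder)
state_net.fit(X, y_state)
speed_net.fit(X, y_speed)
```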
2) The application stage comprises the following steps (Fig. 6):
1) Audio recording:
Speech data can be obtained directly with a microphone or other recording equipment.
2) Audio feature extraction
Speech features are extracted by the audio feature extraction method of the learning stage.
3) Mapping audio features to video features based on the statistically learned model
The speech features are fed as input into the neural network corresponding to each facial pattern; each state network produces an output, the degree of membership of that class. After a sentence is finished, the statistical visual model and the Viterbi decoding algorithm yield the class transition route of maximum probability, which, connected together, is the facial animation pattern sequence corresponding to the speech.
Although the information provided by the speech plays the main role, the Viterbi algorithm guarantees that the generated sequence conforms to natural facial motion. Each state of the sequence could drive the face mesh directly through its cluster center, but because the basic patterns are a simplified selection, the animation would jitter. Classical methods generally solve this with interpolation, which eliminates the jitter but does not match the dynamic characteristics of facial animation. Here, each state has predictions from two neural networks, one of which predicts velocity, so the final sequence obtained through the transition matrix carries enough information to generate animation consistent with natural facial motion, and the formulation is very concise. Let T = (t1, t2, …, tn) be the predicted facial motion state points and V = {v1, v2, …, vn} the velocities at those state points; yi then denotes the i-th frame inserted between state t and state t+1, and m the number of frames to be inserted from state t to state t+1. Because of the velocity parameter, the generated facial animation matches the variability of facial motion better than interpolation alone; one plausible form is sketched below.
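The original interpolation formula did not survive reproduction, so the sketch below uses cubic Hermite interpolation between successive state points with their predicted end-point velocities as one plausible reading of the scheme; the specific form is an assumption:

```python
import numpy as np

def insert_frames(t0, t1, v0, v1, m):
    """Cubic Hermite interpolation between state points t0 and t1 with
    end-point velocities v0 and v1; returns the m inserted frames.
    (One plausible reading of the velocity-based scheme; an assumption.)"""
    frames = []
    for i in range(1, m + 1):
        s = i / (m + 1)                  # normalized position in (0, 1)
        h00 = 2 * s**3 - 3 * s**2 + 1    # Hermite basis functions
        h10 = s**3 - 2 * s**2 + s
        h01 = -2 * s**3 + 3 * s**2
        h11 = s**3 - s**2
        frames.append(h00 * t0 + h10 * v0 + h01 * t1 + h11 * v1)
    return np.array(frames)

t0, t1 = np.zeros(18), np.ones(18)            # two FAP state points
v0, v1 = np.full(18, 0.5), np.full(18, 0.5)   # predicted velocities
print(insert_frames(t0, t1, v0, v1, m=3).shape)  # (3, 18)
```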
4) Revision of the video feature stream based on facial motion rules
After the facial motion parameter sequence is obtained from the statistical model, even small prediction errors can reduce the realism of the whole animation sequence, and some facial motions, such as blinking and nodding, correlate only weakly with the speech features. For this reason, on top of the statistical learning, rules from a facial motion knowledge base are applied to revise the sequence, improving the output and making the animation more realistic.
5) Synchronized audio-video playback
The speech and animation files obtained can be played directly on different channels; because the underlying data are strictly synchronized, the playback is synchronized as well.
4) Comparison of experimental results
Two kinds of evaluation, quantitative and qualitative, were applied to the system. Quantitative testing computes the error between the predicted data and the true data, as should be done for most machine learning systems. Qualitative testing judges by perception whether the synthesized facial motion looks real, which is essential for synthesis. In the quantitative tests, the error between predicted and true data was measured on two sets: a closed set (the test data are the training data) and an open set (the test data were not used in training). Fig. 7 shows the test results for the upper lip height parameter in two utterances, compared against a single neural network method; in the upper two plots the test data are training data, and in the lower two they are not. Testing all FAP parameters and computing the mean square deviation between predicted and true data by formula (10) gives the results in Table 3.
| Test data | Mean square deviation (VM+ANN) | Mean square deviation (ANN) |
| Training data (closed set) | 2.859213 | 3.863582 |
| Test data (open set) | 4.097657 | 5.558253 |
Table 3: Mean square deviation between predicted FAP parameters and true data
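The closed-set and open-set numbers above are mean square deviations per formula (10); a minimal sketch:

```python
import numpy as np

def mean_square_deviation(pred, true):
    """Formula (10): mean squared deviation between predicted and true FAPs."""
    pred, true = np.asarray(pred), np.asarray(true)
    return float(((pred - true) ** 2).sum() / true.size)

# closed set: test on training data; open set: test on held-out data
print(mean_square_deviation([[1.0, 2.0]], [[0.5, 2.5]]))  # 0.25
```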
No unified standard yet exists for evaluating multimodal systems. For a voice-driven facial animation system, true facial analysis data corresponding to arbitrary input speech cannot be obtained, so the error between predicted and true data cannot be computed, and simple quantitative results alone cannot represent the system's practical performance. For speaker-independent tests, only qualitative methods are generally available: in the experiment, five people watched and listened to the system and assessed it for intelligibility, naturalness, friendliness, and the acceptability of the facial motion. Because the system not only handles the dynamic changes of the face but also uses the original recorded speech, effectively solving the synchronization problem, it received high ratings.
With this system, given a person's speech, the neural networks predict in real time the FAP pattern corresponding to each speech frame; after smoothing, the FAPs directly drive the MPEG-4 based face mesh. Fig. 8 shows some frames of the voice-driven facial animation.