Embodiment
First, unsupervised cluster analysis is applied to the video to obtain classes of facial animation parameter (FAP) feature vectors. Then the facial dynamic model that occurs in synchrony with speech events (in essence a FAP class transition matrix) is built statistically; we call it the statistical visual model, and it is similar in principle to the statistical language models used in natural language processing. Finally, several artificial neural networks (ANNs) are trained to map speech patterns to facial animation patterns. For new speech data, the trained networks produce a number of candidate facial animation pattern sequences; the statistical visual model selects the best facial animation parameter (FAP) sequence among them; facial motion rules then revise and supplement the FAP sequence; and after smoothing, the resulting FAPs can directly drive the face wireframe model. This strategy has the following distinctive features:
1) The whole process can be described by the classical Bayes rule:

Pr(L|A) ∝ Pr(A|L)·Pr(L)

where A denotes the speech signal and L the facial motion sequence. The likelihood Pr(A|L) measures how accurately the speech signal is modeled, and the prior Pr(L) encodes background knowledge about real facial motion, i.e. the statistical visual model.
2) The cluster analysis of the speech signal is built on the classification of facial postures, which works better than the alternative of classifying by speech perception. At the same time, because the same lip shape corresponds to quite different phonetic features, training a neural network on the speech signals of each class improves the robustness of the prediction.
3) The statistical visual model lets us find the facial motion trajectory that is optimal over the whole sentence, making full use of contextual information and thereby avoiding the defect that neural network training has difficulty capturing context.
4) The video signal only needs to be analyzed once, to train the correspondence between speech and facial animation parameters (FAP); the resulting model can then be used to synthesize other faces.
5) The introduction of facial motion rules makes the animation of parts only weakly correlated with speech, such as blinking and head movement, more realistic.
6) The whole framework can be used for correlation prediction and control or synthesis between other pairs of signals.
The above voice-driven facial animation method, which combines statistical learning with rules, comprises the following two stages: a statistics-based learning stage and an application stage.
1) The learning stage comprises the following steps (Fig. 1):
A) Synchronized audio-video recording and segmentation
Speech and video data can be recorded synchronously with a camera, forming an AVI file; for later analysis, however, the audio-video signal must be split into audio and video streams on different channels. Traditional methods usually rely on fixed, empirically chosen settings for a particular camera; the present invention proposes a synchronized audio-video segmentation method that works for video captured by any camera.
Suppose the video capture frame rate is Videoframecount per msec, the audio sample rate is Audiosamplecount per msec, the speech analysis window shift is Windowmove, the speech analysis window size is Windowsize, the number of speech analysis windows needed per video frame is m, and the ratio of window shift to window size is n. Then:

Windowsize = Audiosamplecount/(Videoframecount×m) (1)

Windowmove = Windowsize×n (2)

where m and n are adjustable parameters set according to the actual situation. Synchronization parameters set by this method make the audio-video synchronization accurate to the sample.
To cover as complete a range of pronunciations as possible, the method selects the text of the 863-program Chinese speech synthesis corpus CoSS-1 as the speaker's reading material. CoSS-1 contains the pronunciations of all 1268 individual Chinese syllables, a large number of 2-4 character words, and 200 sentences. A synchronized audio/video database of individual characters, words, and sentences was recorded. By marking feature points, motion data for the lips, cheeks, eyelids, and other regions can be obtained. The camera is set to 10 frames/s, and the captured video is converted to images that a tracking program processes into image feature sequences. With m=6, n=2/3, and a speech sampling rate of 8040 Hz, the speech analysis window length is 8040/(10×6)=134 samples and the frame shift is 134×2/3≈89 samples.
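A minimal sketch of equations (1) and (2); the function and variable names are ours, and the 8040 Hz / 10 frames-per-second figures are the example values above:

```python
def speech_window_params(audio_rate_hz, video_fps, m, n):
    """Window size and shift so that m analysis windows align exactly
    with one video frame (equations (1) and (2))."""
    window_size = audio_rate_hz / (video_fps * m)  # samples per analysis window
    window_move = window_size * n                  # window shift (shift/size ratio n)
    return int(round(window_size)), int(round(window_move))

# Example from the text: 8040 Hz audio, 10 video frames/s, m=6, n=2/3
size, move = speech_window_params(8040, 10, 6, 2 / 3)
print(size, move)  # 134 89
```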
B) Audio and video feature extraction.
For audio, the linear prediction coefficients and prosodic parameters (energy, zero-crossing rate, and fundamental frequency) of the speech data within each Hamming window are extracted as the speech feature vector.
For video, the facial feature points consistent with MPEG-4 are extracted; the difference between each feature point coordinate and the standard-frame coordinate is computed, V = {V1, V2, …, Vn}, and the person-specific scale reference quantity of each feature point under the MPEG-4 definition is computed, P = {P1, P2, …, Pn}. The facial animation parameters are then obtained by formula (3):

FAPi = (Vi(x|y)/Pi(x|y))×1024 (3)

where FAPi denotes the facial animation parameter corresponding to the i-th feature point, Vi(x|y) denotes the x or y coordinate of Vi, and Pi(x|y) denotes the scale reference quantity corresponding to Vi(x|y).
For speech features, a traditional Hamming window is used in the analysis; each frame thus yields 16th-order LPC and RASTA-PLP mixed coefficients and several prosodic parameters.
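The sketch below computes per-frame energy, zero-crossing rate, and autocorrelation-method LPC via the Levinson-Durbin recursion; the RASTA-PLP mixing and pitch extraction used in the method are omitted, so this is an illustrative subset rather than the exact feature set:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve for LPC coefficients from the autocorrelation sequence r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        err *= (1.0 - k * k)
    return a[1:]  # prediction coefficients a_1 .. a_order

def frame_features(frame, order=16):
    """Energy, zero-crossing rate and 16th-order LPC for one Hamming-windowed
    analysis frame (RASTA-PLP mixing and pitch are omitted in this sketch)."""
    w = frame * np.hamming(len(frame))
    energy = float(np.sum(w * w))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    r = np.correlate(w, w, mode="full")[len(w) - 1:]
    return np.concatenate(([energy, zcr], levinson_durbin(r, order)))

frame = np.random.default_rng(0).normal(size=134)  # one 134-sample window
print(frame_features(frame).shape)  # (18,)
```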
For facial motion features, a facial animation representation scheme based on MPEG-4 is used. MPEG-4 specifies the face model and its animation with FDPs (facial definition parameters) and FAPs (facial animation parameters), and indicates the displacement range of each FAP with FAPUs (facial animation parameter units). On this basis, obtaining facial expression and lip motion data amounts to obtaining the corresponding FDP and FAP parameters. To acquire facial motion data, a computer vision system was developed that can synchronously track many facial features, such as the mouth corners and lip line, the eyes, and the nose. Fig. 2 shows the feature points that can be tracked and captured. Because accurate feature point motion data matter more for our synthesis than an elaborate tracking algorithm, we obtain the data by placing markers of a particular color on the face and asking the speaker to reduce head movement as far as possible; Fig. 3 shows the feature points finally obtained and their regions of influence.
The data produced by feature point extraction are absolute coordinates, and because of the speaker's head or body movement, the coordinate values obtained by simple image processing carry considerable noise, so preprocessing is needed. We assume that the feature points not influenced by any FAP do not move relative to one another, and use this invariance to convert image coordinates into the face model's relative coordinates, thereby removing the effect of the rotation and scaling caused by the speaker's movement. Among the MPEG-4 feature points in Fig. 4, we chose P0 (11.2), P1 (11.3), P2 (11.1), and P3 (an added point on the nose tip) to form an orthogonal coordinate system (X axis P0P1, Y axis P2P3); from this coordinate system the rotation angle and the scaling factor can be calculated as follows. Suppose the reference point coordinates are P0 (x0, y0), P1 (x1, y1), P2 (x2, y2), and P3 (x3, y3). The origin of the new coordinate system, assumed to be P(xnew, ynew), is the intersection of the two straight lines they define, and the rotation angle θ of the new coordinate system relative to the normal coordinates can also be calculated. The value (x′, y′) of an arbitrary point (x, y) under the new coordinate system can then be calculated as:
x′ = x×cos(θ) − y×sin(θ) + xnew (4)

y′ = x×sin(θ) + y×cos(θ) + ynew (5)
To avoid the effect of scaling, the added point on the nose is assumed not to move relative to the first frame; the displacement of any other point relative to this point can then be computed by formulas (6) and (7), converting image coordinates into face model coordinates and yielding accurate feature point motion data:
xk″ = (xk′ − xk3) − (x1′ − x13) (6)

yk″ = (yk′ − yk3) − (y1′ − y13) (7)

where (x13, y13) denotes the nose tip coordinate of the 1st frame, (x1′, y1′) the other feature point coordinates of the 1st frame, (xk3, yk3) the nose tip coordinate of frame k, (xk′, yk′) the other feature point coordinates of frame k, and (xk″, yk″) the final computed coordinates of the other feature points of frame k. After filtering, each feature point coordinate can be converted into a FAP value with reference to the facial animation parameter units (FAPU) defined in Fig. 4. Suppose the ESO and ENSO defined in Fig. 4 are 200 and 140 respectively; then the two FAP values corresponding to point 5.3 at (x, y) can be computed as:

FAP39 = x×1024/200 (8)

FAP41 = y×1024/140 (9)
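A numpy sketch of the normalization chain in equations (4) through (9), assuming P0-P3 are the tracked axis points given as 2-vectors; the rotation direction is chosen here to map image coordinates back into the face-model frame:

```python
import numpy as np

def cross2(a, b):
    return a[0] * b[1] - a[1] * b[0]

def pose_normalize(points, p0, p1, p2, p3):
    """Map image coordinates into the face-model frame whose X axis is the
    line P0P1 and whose Y axis is the line P2P3 (cf. equations (4), (5)).
    The origin is the intersection of the two lines."""
    d1, d2 = p1 - p0, p3 - p2
    t = cross2(p2 - p0, d2) / cross2(d1, d2)  # assumes the axes intersect
    origin = p0 + t * d1
    theta = np.arctan2(d1[1], d1[0])          # rotation of the new frame
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, s], [-s, c]])         # rotate points by -theta
    return (points - origin) @ rot.T

def relative_displacement(pts_k, nose_k, pts_1, nose_1):
    """Equations (6), (7): displacement of each point relative to the
    nose-tip point, measured against the first frame."""
    return (pts_k - nose_k) - (pts_1 - nose_1)

def to_fap(displacement, fapu):
    """Equations (8), (9): scale a displacement by its FAPU."""
    return displacement * 1024.0 / fapu

# e.g. FAP39 / FAP41 for point 5.3 with ESO=200, ENSO=140
d = np.array([3.0, -2.0])
print(to_fap(d[0], 200.0), to_fap(d[1], 140.0))
```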
C) Statistical learning from audio features to video features.
① First segment the audio and video synchronously into the feature sets Audio and Video as described in a) and b);
② Apply unsupervised cluster analysis to the Video set to obtain the basic facial motion patterns, giving I classes;
③ Use statistical methods to obtain the transition probabilities between two or more classes, called the statistical visual model; evaluate the model quality with entropy, returning to step ② until the entropy is minimal;
④ Divide the data in the speech feature set Audio corresponding to each basic facial motion pattern into subsets Audio(i), where i indexes the class;
⑤ Train one neural network per subset Audio(i), whose input is the speech features F(Audio(i)) in the subset and whose output is the degree of membership P(Video(i)) of that class.
1. The clustering method for the basic facial motion patterns in step ②
For basic facial patterns, cognitive scientists have provided some research results, but these generally give, qualitatively, six or so basic expressions, and synthesis based on such qualitative expressions is not very realistic. Some researchers have discovered patterns by clustering real data, but at present most cluster analysis is carried out at the phoneme level, ignoring the dynamics of facial motion at the sentence level. We want to find a set of patterns that effectively express facial motion from a large number of real sentences; the discovered patterns may have a clear meaning, like the 14 lip patterns defined by MPEG-4, or they may simply be basic patterns that are effective for face synthesis. Pattern discovery not only aids the convergence of neural network training but also lays a foundation for the later explanation and understanding of the complex process of lip motion and face synthesis. In the clustering process, because the number of such basic patterns is uncertain, unsupervised clustering is generally adopted.
Clustering algorithms have many parameters to set, and the parameter settings strongly influence the clustering result. For clustering the basic lip and facial motion patterns, there is no test sample set of known classes for estimating the error rate, and the geometric properties of the high-dimensional space cannot be observed directly, so evaluating the clustering result is difficult. Although within-class and between-class distances can be used to guide the evaluation, they cannot describe the effect the clustering will achieve in a real system, and for an animation system that effect is what matters. We therefore directly measure whether the clustering result captures the main motion patterns by computing the variance between the class-mapped data and the true data:

D = Σ(X − Y)²/‖X‖ (10)

where X is the true data matrix, Y is the matrix of the true data after class mapping (each sample replaced by its cluster center), and ‖X‖ denotes the matrix size. Different clustering results can be obtained by adjusting the algorithm parameters, such as the desired number of clusters, the maximum number of training iterations, the minimum number of samples per class, the split parameter P, and the merge parameter C; computing the variance of these results by formula (10) gives the comparison shown in Table 1:
| | Minimum samples per class | Split parameter P / merge parameter C | Number of clusters | Variance ratio |
| 1 | 32 | P=0.5-1, C=1-1.5 | 18 | 3.559787 |
| 2 | 20 | P=0.5-1, C=1-1.5 | 21 | 4.813459 |
| 3 | 10 | P=0.5-1, C=1-1.5 | 23 | 2.947106 |
| 4 | 5 | P=0.5-1, C=1-1.5 | 29 | 2.916784 |
| 5 | 3 | P=0.5-1, C=1-1.5 | 33 | 2.997993 |
Table 1: Comparison of clustering results
The above clustering was carried out on 6200 data samples, with the desired number of clusters set to 64 and the maximum number of training iterations set to 200; the remaining parameters were adjusted manually, with the split parameter P varying in [0.5, 1] and the merge parameter C in [1, 1.5]. We find that the variance ratio does not decline smoothly but shows some jitter, mainly because different parameter choices affect the selection of the initial class centers and the deletion steps of the clustering algorithm. From the variance evaluation, the clustering results of rows 3, 4, and 5 differ little and can be considered to have stabilized; the number of basic facial expression patterns is therefore set to 29. Fig. 5 shows the result.
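The clustering described above is an ISODATA-style algorithm with split parameter P and merge parameter C; the sketch below substitutes plain k-means (so P and C are not modeled) and evaluates the result with the variance measure of formula (10):

```python
import numpy as np

def kmeans(X, k, iters=200, seed=0):
    """Plain k-means as a simplified stand-in for the ISODATA-style
    clustering; split/merge parameters P and C are not modeled here."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

def variance_ratio(X, centers, labels):
    """Formula (10): deviation between the true data X and the class-mapped
    data Y, i.e. each sample replaced by its cluster center."""
    Y = centers[labels]
    return float(((X - Y) ** 2).sum() / X.size)

# e.g. 6200 FAP feature vectors clustered into 29 basic patterns
X = np.random.default_rng(1).normal(size=(6200, 18))
centers, labels = kmeans(X, 29)
print(variance_ratio(X, centers, labels))
```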
2. The method for building the statistical visual model in step ③
The purpose of building the statistical visual model is to find the facial motion trajectory that is optimal over the whole sentence, making full use of contextual information and thereby avoiding the defect that a single neural network has difficulty exploiting context in training. The statistical visual model can compute the probability of a video sequence occurring. Suppose F is the facial animation sequence of a particular sentence, say

F = f1 f2 … fQ

Then P(F) can be calculated by the chain rule:

P(F) = P(f1 f2 … fQ) = P(f1)P(f2|f1)…P(fQ|f1 f2 … fQ−1) (11)
However, for sequences composed of arbitrary facial postures, estimating every possible conditional probability P(fj|f1 f2 … fj−1) is impossible; in practice the N-gram approximation is generally adopted, so that P(F) can be approximated as

P(F) ≈ ∏i P(fi|fi−N+1 … fi−1)

The conditional probability P(fi|fi−1 fi−2 … fi−N+1) can be obtained by simple relative frequency:

P(fi|fi−N+1 … fi−1) = C(fi−N+1 … fi)/C(fi−N+1 … fi−1)

where C(·) denotes the number of occurrences of the given facial posture sequence in the training video database. After building the statistical visual model, we adopt perplexity to evaluate the performance of the trained model. Suppose θi is the cluster center of cluster i obtained by the cluster analysis; for θ = {θ1, θ2, …, θn} we wish to find an optimized visual model. The perplexity of a model θ can be defined as

PP(θ) = P(S|θ)^(−1/n)

where S = s1, s2, …, sn denotes the facial animation parameter sequence of a sentence and P(S|θ) denotes the probability of the sequence S under model θ. P(θ) in fact represents our background knowledge of facial motion and can be obtained by the statistical method above, for example with the bigram or trigram methods common in natural language processing. Table 2 compares the perplexity of the statistical visual models obtained from different clustering results:
| | Number of states | Bi-gram (PP) | Tri-gram (PP) |
| 1 | 18 | 8.039958 | 2.479012 |
| 2 | 21 | 6.840446 | 2.152096 |
| 3 | 26 | 5.410093 | 1.799709 |
| 4 | 29 | 4.623306 | 1.623896 |
| 5 | 33 | 4.037879 | 1.478828 |
Table 2: Perplexity comparison
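A sketch of the bigram form of the statistical visual model: transition probabilities are estimated by relative frequency over training pattern sequences, and perplexity is computed as the inverse geometric mean probability of a sequence. The add-one smoothing is our assumption, added so that unseen transitions keep the perplexity finite:

```python
import numpy as np

def train_bigram(sequences, n_states):
    """Relative-frequency estimate of P(f_i | f_{i-1}), add-one smoothed."""
    counts = np.ones((n_states, n_states))  # add-one smoothing (assumption)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def perplexity(trans, seq):
    """PP = P(S)^(-1/n): inverse geometric mean transition probability."""
    logp = sum(np.log(trans[a, b]) for a, b in zip(seq, seq[1:]))
    return float(np.exp(-logp / (len(seq) - 1)))

train = [[0, 1, 2, 1, 0], [1, 2, 2, 0, 1]]  # toy pattern sequences
model = train_bigram(train, n_states=3)
print(perplexity(model, [0, 1, 2, 0]))
```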
Through the statistical visual model we obtain a set of state transition probabilities; when several candidate facial animation sequences are available, the Viterbi algorithm can be used to obtain the facial animation sequence that is most probable, as sketched below.
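A sketch of the Viterbi decoding step: per-frame class scores (e.g. from the state neural networks) are combined with the transition probabilities of the statistical visual model to recover the most probable pattern sequence; all probabilities are assumed strictly positive, e.g. smoothed as above:

```python
import numpy as np

def viterbi(obs_prob, trans, prior):
    """obs_prob: (T, K) per-frame class scores; trans: (K, K) statistical
    visual model; prior: (K,) initial probabilities. Returns the most
    probable facial pattern sequence (log domain to avoid underflow)."""
    T, K = obs_prob.shape
    logd = np.log(prior) + np.log(obs_prob[0])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = logd[:, None] + np.log(trans)   # (prev state, next state)
        back[t] = np.argmax(cand, axis=0)
        logd = cand[back[t], np.arange(K)] + np.log(obs_prob[t])
    path = [int(np.argmax(logd))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

obs = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(obs, trans, prior=np.array([0.5, 0.5])))  # [0, 0, 0]
```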
3. The network learning method in step ⑤
If the mapping from speech to FAP patterns is regarded as a pattern recognition task, many learning algorithms can be applied, such as hidden Markov models (HMM), support vector machines (SVM), and neural networks. Because neural networks show strong efficiency and robustness in learning input-output mappings, we select a back-propagation (BP) neural network to learn from the large number of recorded sentences. Each cluster node is trained with two neural networks: one marks the state, with value 0 or 1, and the other marks the velocity. Both networks can be described uniformly as

y = f2(w(2)·f1(w(1)·x + b(1)) + b(2))

where x ∈ Φ is an audio feature vector, w(1) and w(2) are the weights and b(1) and b(2) the thresholds of each layer, and f1 and f2 are sigmoid functions. Training is simple: given the data set, the Levenberg-Marquardt optimization algorithm is used to adjust the weights and thresholds.
For each speech frame a 16-dimensional LPC and RASTA-PLP mixed vector plus 2 prosodic parameters are computed, forming an 18-dimensional speech feature vector; the 6 surrounding frames are combined into one input vector, so each neural network has a 108-dimensional input. For the state network, the number of output nodes is 1, representing 0 or 1; the number of hidden nodes is 30, the learning rate is 0.001, and the network error target is 0.005. For the velocity network, the number of output nodes is 18, representing the 18-dimensional FAP feature vector; the number of hidden nodes is 80, the learning rate is 0.001, and the network error target is 0.005.
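A sketch of the state and velocity networks for one cluster using scikit-learn; note that scikit-learn trains with Adam or L-BFGS rather than the Levenberg-Marquardt algorithm named above, and the training data here are random placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

# 18-dim features x 6 context frames = 108 inputs per network.
# State net: 30 hidden units, a single 0/1 output.
state_net = MLPClassifier(hidden_layer_sizes=(30,), activation="logistic",
                          learning_rate_init=0.001, tol=0.005, max_iter=2000)
# Velocity net: 80 hidden units, 18 FAP-velocity outputs.
speed_net = MLPRegressor(hidden_layer_sizes=(80,), activation="logistic",
                         learning_rate_init=0.001, tol=0.005, max_iter=2000)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 108))          # speech feature vectors (placeholder)
y_state = rng.integers(0, 2, size=500)   # class membership 0/1 (placeholder)
y_speed = rng.normal(size=(500, 18))     # 18-dim FAP velocities (placeholder)
state_net.fit(X, y_state)
speed_net.fit(X, y_speed)
```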
2) The application stage comprises the following steps (Fig. 6):
1) Audio recording:
Speech data can be obtained directly with a microphone or other recording equipment.
2) Audio feature extraction
Speech features are extracted by the audio feature extraction method of the learning stage.
3) Mapping audio features to video features based on the statistically learned model
The speech features are fed as input into the neural network corresponding to each facial pattern; each state network produces an output, the degree of membership of that class. After a sentence is finished, the statistical visual model and the Viterbi decoding algorithm yield the class transition route of maximum probability, which, connected together, is the facial animation pattern sequence corresponding to the speech.
Although the information provided by the speech plays the main role, the Viterbi algorithm guarantees that the generated sequence conforms to natural facial motion. Each state of the sequence could drive the face mesh directly through its cluster center, but because the basic patterns are a simplified selection, the animation would jitter. Classical methods generally solve this with interpolation, which eliminates the jitter but does not match the dynamic characteristics of facial animation. Here, each state has predictions from two neural networks, one of which predicts velocity, so the final sequence obtained through the transition matrix carries enough information to generate animation consistent with natural facial motion, and the formulation is very concise. Let T = (t1, t2, …, tn) be the predicted facial motion state points and V = {v1, v2, …, vn} the velocities at those state points; yi then denotes the i-th frame inserted between state t and state t+1, and m the number of frames to be inserted from state t to state t+1. Because of the velocity parameter, the generated facial animation matches the variability of facial motion better than interpolation alone; one plausible form is sketched below.
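The original interpolation formula did not survive reproduction, so the sketch below uses cubic Hermite interpolation between successive state points with their predicted end-point velocities as one plausible reading of the scheme; the specific form is an assumption:

```python
import numpy as np

def insert_frames(t0, t1, v0, v1, m):
    """Cubic Hermite interpolation between state points t0 and t1 with
    end-point velocities v0 and v1; returns the m inserted frames.
    (One plausible reading of the velocity-based scheme; an assumption.)"""
    frames = []
    for i in range(1, m + 1):
        s = i / (m + 1)                  # normalized position in (0, 1)
        h00 = 2 * s**3 - 3 * s**2 + 1    # Hermite basis functions
        h10 = s**3 - 2 * s**2 + s
        h01 = -2 * s**3 + 3 * s**2
        h11 = s**3 - s**2
        frames.append(h00 * t0 + h10 * v0 + h01 * t1 + h11 * v1)
    return np.array(frames)

t0, t1 = np.zeros(18), np.ones(18)            # two FAP state points
v0, v1 = np.full(18, 0.5), np.full(18, 0.5)   # predicted velocities
print(insert_frames(t0, t1, v0, v1, m=3).shape)  # (3, 18)
```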
4) Revision of the video feature stream based on facial motion rules
After the facial motion parameter sequence is obtained from the statistical model, even small prediction errors can reduce the realism of the whole animation sequence, and some facial motions, such as blinking and nodding, correlate only weakly with the speech features. For this reason, on top of the statistical learning, rules from a facial motion knowledge base are applied to revise the sequence, improving the output and making the animation more realistic.
5) Synchronized audio-video playback
The speech and animation files obtained can be played directly on different channels; because the underlying data are strictly synchronized, the playback is synchronized as well.
4) Comparison of experimental results
Two kinds of evaluation, quantitative and qualitative, were applied to the system. Quantitative testing computes the error between the predicted data and the true data, as should be done for most machine learning systems. Qualitative testing judges by perception whether the synthesized facial motion looks real, which is essential for synthesis. In the quantitative tests, the error between predicted and true data was measured on two sets: a closed set (the test data are the training data) and an open set (the test data were not used in training). Fig. 7 shows the test results for the upper lip height parameter in two utterances, compared against a single neural network method; in the upper two plots the test data are training data, and in the lower two they are not. Testing all FAP parameters and computing the mean square deviation between predicted and true data by formula (10) gives the results in Table 3.
| Test data | Mean square deviation (VM+ANN) | Mean square deviation (ANN) |
| Training data (closed set) | 2.859213 | 3.863582 |
| Test data (open set) | 4.097657 | 5.558253 |
Table 3: Mean square deviation between predicted FAP parameters and true data
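The closed-set and open-set numbers above are mean square deviations per formula (10); a minimal sketch:

```python
import numpy as np

def mean_square_deviation(pred, true):
    """Formula (10): mean squared deviation between predicted and true FAPs."""
    pred, true = np.asarray(pred), np.asarray(true)
    return float(((pred - true) ** 2).sum() / true.size)

# closed set: test on training data; open set: test on held-out data
print(mean_square_deviation([[1.0, 2.0]], [[0.5, 2.5]]))  # 0.25
```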
No unified standard yet exists for evaluating multimodal systems. For a voice-driven facial animation system, true facial analysis data corresponding to arbitrary input speech cannot be obtained, so the error between predicted and true data cannot be computed, and simple quantitative results alone cannot represent the system's practical performance. For speaker-independent tests, only qualitative methods are generally available: in the experiment, five people watched and listened to the system and assessed it for intelligibility, naturalness, friendliness, and the acceptability of the facial motion. Because the system not only handles the dynamic changes of the face but also uses the original recorded speech, effectively solving the synchronization problem, it received high ratings.
With this system, given a person's speech, the neural networks predict in real time the FAP pattern corresponding to each speech frame; after smoothing, the FAPs directly drive the MPEG-4 based face mesh. Fig. 8 shows some frames of the voice-driven facial animation.