Goal of the invention
The present invention proposes a fast audio fingerprint search method based on the cross entropy distance, which has better robustness to distortion and admits a wider range of applicable distance metrics.
The invention is characterized in that it is realized in a computer according to the following steps in sequence:
Step (1): computer initialization.
Three modules are provided: a common component Gaussian mixture model (CCGMM) generation module, an audio fingerprint extraction module based on the CCGMM, and a generalized dynamic sequential comparison module, wherein:
The CCGMM generation module uses about 100 hours of audio data collected in advance to perform maximum likelihood parameter estimation and create a common component Gaussian mixture model;
The audio fingerprint extraction module extracts audio fingerprints based on the CCGMM, and measures the distance between fingerprints with the cross entropy distance;
The generalized dynamic sequential comparison module compares the fingerprint of the user-specified audio segment against the input audio stream in a sliding-window manner, and judges whether the audio stream contains the specified audio segment.
Step (2): create a common component Gaussian mixture model according to the following steps:
Step (2.1): collect about 100 hours of audio data in advance. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame.
Step (2.2): using the set of cepstral feature vectors obtained in step (2.1), perform maximum likelihood parameter estimation to create a common component Gaussian mixture model. The model comprises M Gaussian distributions as its components, together with M weight coefficients, where M = 512:

    {ω_i^(u), μ_i^(u), Σ_i^(u)}, i = 1, …, M

where μ_i^(u) and Σ_i^(u) are the mean vector and covariance matrix of the i-th Gaussian component, ω_i^(u) is the weight coefficient of the i-th Gaussian component, i = 1, …, M, and the superscript u identifies this common component Gaussian mixture model.
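The maximum likelihood estimation of step (2.2) is ordinarily carried out with the EM algorithm. As an illustrative sketch only (1-D data, two components, a fixed iteration count, deterministic initialization — the actual module trains M = 512 multivariate Gaussians on 100 hours of cepstral vectors):

```python
import math

def em_gmm_1d(data, M, iters=50):
    """Toy EM for a 1-D Gaussian mixture: illustrative stand-in for
    CCGMM training; returns weights, means, variances."""
    lo, hi = min(data), max(data)
    w = [1.0 / M] * M
    mu = [lo + (i + 0.5) * (hi - lo) / M for i in range(M)]  # spread initial means
    var = [1.0] * M
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample
        resp = []
        for x in data:
            p = [w[i] * math.exp(-(x - mu[i]) ** 2 / (2 * var[i]))
                 / math.sqrt(2 * math.pi * var[i]) for i in range(M)]
            s = sum(p)
            resp.append([pi / s for pi in p])
        # M-step: re-estimate weights, means, variances from responsibilities
        for i in range(M):
            ni = sum(r[i] for r in resp)
            w[i] = ni / len(data)
            mu[i] = sum(r[i] * x for r, x in zip(resp, data)) / ni
            var[i] = max(sum(r[i] * (x - mu[i]) ** 2
                             for r, x in zip(resp, data)) / ni, 1e-6)
    return w, mu, var
```

With two well-separated clusters the means converge to the cluster centers and the weights to the cluster proportions.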
Step (3): preprocess the user-specified audio segment according to the following steps:
Step (3.1): input the user-specified audio segment c, whose time span is several seconds, into the computer. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame. The audio segment c is thus represented by a cepstral feature vector sequence {x_n^(c)}, n = 1, …, W, where W is the number of frames of audio segment c, n = 1, …, W is the index of each frame, and the superscript c identifies audio segment c.
Step (3.2): compute the weight coefficient ω_{i,n}^(c) of the i-th Gaussian component of the CCGMM at the n-th frame of audio segment c, for n = 1, …, W:

    ω_{i,n}^(c) = ω_i^(u) N_i(x_n^(c) | μ_i^(u), Σ_i^(u)) / Σ_{j=1}^{M} ω_j^(u) N_j(x_n^(c) | μ_j^(u), Σ_j^(u))     (1)

where i = 1, …, M and j = 1, …, M index the Gaussian components of the CCGMM, and N_i(x | μ_i^(u), Σ_i^(u)) denotes the Gaussian probability density function with mean vector μ_i^(u) and covariance matrix Σ_i^(u).
Compute the arithmetic mean, denoted ω_i^(c), of the weight coefficients of the i-th Gaussian component over all frames of audio segment c:

    ω_i^(c) = (1/W) Σ_{n=1}^{W} ω_{i,n}^(c)     (2)

The arithmetic means of the weight coefficients of all Gaussian components form a vector {ω_i^(c)}, i = 1, …, M, and this vector serves as the low-dimensional audio fingerprint of audio segment c.
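Formulas (1) and (2) can be sketched directly in code, assuming a toy diagonal-covariance model (the function names and the tiny model are illustrative, not the patent's actual M = 512 parameters):

```python
import math

def gauss_pdf(x, mu, var):
    """N(x | mu, diag(var)) for a diagonal-covariance Gaussian."""
    expo = sum((xk - mk) ** 2 / (2 * vk) for xk, mk, vk in zip(x, mu, var))
    norm = math.prod(2 * math.pi * vk for vk in var) ** -0.5
    return norm * math.exp(-expo)

def frame_weights(x, model):
    """Formula (1): posterior weight of each Gaussian component at one frame.
    model is a list of (weight, mean, variance) triples."""
    p = [w * gauss_pdf(x, mu, var) for w, mu, var in model]
    s = sum(p)
    return [pi / s for pi in p]

def fingerprint(frames, model):
    """Formula (2): arithmetic mean of per-frame weights -> M-dim fingerprint."""
    acc = [0.0] * len(model)
    for x in frames:
        for i, wi in enumerate(frame_weights(x, model)):
            acc[i] += wi
    return [a / len(frames) for a in acc]
```

Since each frame's weights lie on the probability simplex, the fingerprint does too, which is what makes the cross entropy distance of the later steps applicable.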
Step (4): compare the fingerprint of the user-specified audio segment c against the tested audio stream s in a sliding-window manner:
Step (4.1): input the tested audio stream s, of a set number of hours, into the computer. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame. The tested audio stream s is thus represented by a cepstral feature vector sequence {x_t^(s)}, t = 1, …, T, where T is the number of frames of audio stream s, t = 1, …, T is the index of each frame, and the superscript s identifies audio stream s.
Step (4.2): compute the weight coefficient ω_{i,t}^(s) of the i-th Gaussian component of the CCGMM at the t-th frame of audio stream s, for t = 1, …, T:

    ω_{i,t}^(s) = ω_i^(u) N_i(x_t^(s) | μ_i^(u), Σ_i^(u)) / Σ_{j=1}^{M} ω_j^(u) N_j(x_t^(s) | μ_j^(u), Σ_j^(u))     (3)

where i = 1, …, M and j = 1, …, M index the Gaussian components of the CCGMM, and N_i(x | μ_i^(u), Σ_i^(u)) denotes the Gaussian probability density function with mean vector μ_i^(u) and covariance matrix Σ_i^(u).
Step (4.3): set l = 1.
Step (4.4): if l + W − 1 > T, exit.
Step (4.5): take the audio segment of stream s inside the window of length W starting at frame l, {x_t^(s)}, t = l, …, l+W−1 (hereinafter audio segment s^(l)), and compute its fingerprint distance to audio segment c.
First, compute the fingerprint of audio segment s^(l):

    ω_i^(s,l) = (1/W) Σ_{t=l}^{l+W−1} ω_{i,t}^(s)

that is, the arithmetic mean of the weight coefficients of the i-th Gaussian component over the frames of audio segment s^(l) serves as the i-th dimension of the fingerprint of s^(l).
Then, compute the cross entropy distance between the fingerprint {ω_i^(s,l)}, i = 1, …, M, of audio segment s^(l) and the fingerprint {ω_i^(c)}, i = 1, …, M, of audio segment c:

    d_KL(l) = Σ_{i=1}^{M} ω_i^(c) ln( ω_i^(c) / ω_i^(s,l) )
If d_KL(l) ≤ θ, judge that audio stream s contains audio segment c starting from frame l, where θ is a preset detection threshold, taken as 0.01. Then set l = l + 1 and return to step (4.4) to continue searching the remaining part of audio stream s for further occurrences of audio segment c.
If d_KL(l) > θ, compute a skip step size τ_KL-skip according to the skip-step formula, where Δ is a preset offset, taken as 0.001 or 0.005, and ⌊·⌋ denotes rounding down. Then set l = l + τ_KL-skip and return to step (4.4) to continue searching the remaining part of audio stream s for further occurrences of audio segment c.
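The loop of steps (4.3)-(4.5) can be sketched as follows. The skip step is conservatively replaced here by τ = 1 (equivalent to exhaustive sliding), since the exact skip formula appears later; the function names and 0-based indexing are illustrative choices:

```python
import math

def kl_distance(fp_c, fp_s):
    """Cross entropy distance d_KL = sum_i w_i^(c) * ln(w_i^(c) / w_i^(s,l))."""
    return sum(wc * math.log(wc / ws) for wc, ws in zip(fp_c, fp_s) if wc > 0)

def search(stream_weights, fp_c, W, theta=0.01):
    """Slide a W-frame window over the stream's per-frame weight vectors
    (the output of formula (3)) and report every start frame whose window
    fingerprint is within theta of fp_c. 0-based frames; the text uses l = 1."""
    hits = []
    l = 0
    while l + W <= len(stream_weights):
        window = stream_weights[l:l + W]
        fp_s = [sum(col) / W for col in zip(*window)]  # window fingerprint
        if kl_distance(fp_c, fp_s) <= theta:
            hits.append(l)
        l += 1  # conservative skip; the patent's tau_KL-skip jumps farther
    return hits
```

Replacing the `l += 1` with the skip step of the comparison module is what turns this exhaustive scan into the generalized dynamic sequential comparison.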
Distinguishing features of the present invention:
1. A common component Gaussian mixture model (CCGMM, Common Component Gaussian Mixture Model) is established to describe audio fingerprints, and the cross entropy distance is selected to measure the distance between fingerprints. The cross entropy distance metric has better robustness to distortions such as low-bitrate audio compression, echo interference, and D/A and A/D conversion.
2. A generalized dynamic sequential comparison technique is proposed, applicable not only to the traditional L1 distance but also to the cross entropy distance, and to any other distance metric whose distance function is convex.
Tests show that, with an offset of 0.005, the present invention's cross entropy distance comparison reduces the number of distance computations by 93.47% compared with linear frame-by-frame computation, and under multiple distortions its error rate is 31.7% lower than comparison using the L1 distance.
Embodiment
The present invention specifically comprises three modules: the common component Gaussian mixture model generation module, the audio fingerprint extraction module based on the CCGMM, and the generalized dynamic sequential comparison module. Each module is specified as follows.
The common component Gaussian mixture model generation module
First, collect about 100 hours of audio data in advance. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame.
Then, using the set of cepstral feature vectors obtained above, perform maximum likelihood parameter estimation to create a common component Gaussian mixture model. The model comprises M Gaussian distributions as its components, together with M weight coefficients, where M = 512:

    {ω_i^(u), μ_i^(u), Σ_i^(u)}, i = 1, …, M

where μ_i^(u) and Σ_i^(u) are the mean vector and covariance matrix of the i-th Gaussian component, ω_i^(u) is the weight coefficient of the i-th Gaussian component, i = 1, …, M, and the superscript u identifies this common component Gaussian mixture model.
The audio fingerprint extraction module
The prior art usually obtains a feature vector or a histogram as the audio fingerprint by simple vector quantization. The present invention instead uses, as the fingerprint of an audio segment, the arithmetic means of the per-frame weight coefficients computed from the common component Gaussian mixture model. In the fingerprint search system, the audio fingerprint extraction module is called in two places.
Fingerprint extraction for the user-specified audio segment
For the user-specified audio segment c, whose time span is several seconds, extract through short-time Fourier analysis one cepstral feature vector per 10-millisecond frame. The audio segment c is thus represented by a cepstral feature vector sequence {x_n^(c)}, n = 1, …, W, where W is the number of frames of audio segment c, n = 1, …, W is the index of each frame, and the superscript c identifies audio segment c.
Compute the weight coefficient ω_{i,n}^(c) of the i-th Gaussian component of the CCGMM at the n-th frame of audio segment c, for n = 1, …, W:

    ω_{i,n}^(c) = ω_i^(u) N_i(x_n^(c) | μ_i^(u), Σ_i^(u)) / Σ_{j=1}^{M} ω_j^(u) N_j(x_n^(c) | μ_j^(u), Σ_j^(u))     (1)

where i = 1, …, M and j = 1, …, M index the Gaussian components of the CCGMM, and N_i(x | μ_i^(u), Σ_i^(u)) denotes the Gaussian probability density function with mean vector μ_i^(u) and covariance matrix Σ_i^(u).
Compute the arithmetic mean, denoted ω_i^(c), of the weight coefficients of the i-th Gaussian component over all frames of audio segment c:

    ω_i^(c) = (1/W) Σ_{n=1}^{W} ω_{i,n}^(c)     (2)

The arithmetic means of the weight coefficients of all Gaussian components form a vector {ω_i^(c)}, i = 1, …, M, and this vector serves as the low-dimensional audio fingerprint of audio segment c.
Fingerprint extraction for the tested audio stream
Input the tested audio stream s, of a set number of hours, into the computer. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame. The tested audio stream s is thus represented by a cepstral feature vector sequence {x_t^(s)}, t = 1, …, T, where T is the number of frames of audio stream s, t = 1, …, T is the index of each frame, and the superscript s identifies audio stream s.
Compute the weight coefficient ω_{i,t}^(s) of the i-th Gaussian component of the CCGMM at the t-th frame of audio stream s, for t = 1, …, T:

    ω_{i,t}^(s) = ω_i^(u) N_i(x_t^(s) | μ_i^(u), Σ_i^(u)) / Σ_{j=1}^{M} ω_j^(u) N_j(x_t^(s) | μ_j^(u), Σ_j^(u))     (3)

where i = 1, …, M and j = 1, …, M index the Gaussian components of the CCGMM, and N_i(x | μ_i^(u), Σ_i^(u)) denotes the Gaussian probability density function with mean vector μ_i^(u) and covariance matrix Σ_i^(u).
In fingerprint search, the audio segment of stream s inside the window of length W starting at frame l, {x_t^(s)}, t = l, …, l+W−1 (hereinafter audio segment s^(l)), must be compared with audio segment c by fingerprint distance. To this end, compute the fingerprint of audio segment s^(l):

    ω_i^(s,l) = (1/W) Σ_{t=l}^{l+W−1} ω_{i,t}^(s)

that is, the arithmetic mean of the weight coefficients of the i-th Gaussian component over the frames of audio segment s^(l) serves as the i-th dimension of the fingerprint of s^(l).
The generalized dynamic sequential comparison module
Audio fingerprint comparison means comparing the fingerprint of the user-specified audio segment c against the tested audio stream s in a sliding-window manner, with the sliding-window length set to the length of audio segment c. If the fingerprint distance between a fragment at some position of audio stream s and audio segment c is less than the preset detection threshold, it is judged that audio segment c has been found in audio stream s.
Note that since the sliding window applied to the audio stream advances step by step in time (with the frame as the time unit), consecutive windows exhibit a kind of continuity. Dynamic sequential comparison exploits exactly this continuity: it computes a lower bound on the distance metric as a function of the time step, and skips the unneeded distance computations up to the point where this lower bound drops below the detection threshold, thereby greatly improving retrieval efficiency. In the original dynamic sequential retrieval, the distance metric was limited to the L1 distance. The present invention extends it to any distance metric whose distance function is convex, yielding the generalized dynamic sequential comparison. The concrete steps are as follows:
As described above, the user-specified audio segment c is processed to obtain a cepstral feature vector sequence {x_n^(c)}, n = 1, …, W, where W is the number of frames of audio segment c and n = 1, …, W is the index of each frame; the superscript c marks features of audio segment c. For n = 1, …, W and i = 1, …, M, compute by formula (1) the weight coefficient ω_{i,n}^(c) of the i-th Gaussian component at the n-th frame of audio segment c, then compute by formula (2) the audio fingerprint of audio segment c, namely the M-dimensional vector {ω_i^(c)}, i = 1, …, M.
As described above, the tested audio stream is processed to obtain a cepstral feature vector sequence {x_t^(s)}, t = 1, …, T, where T is the number of frames of audio stream s and t = 1, …, T is the index of each frame; the superscript s marks features of audio stream s. For t = 1, …, T and i = 1, …, M, compute by formula (3) the weight coefficient ω_{i,t}^(s) of the i-th Gaussian component at the t-th frame of audio stream s.
Set l = 1.
If l + W − 1 > T, exit.
Take the audio segment of stream s inside the window of length W starting at frame l, {x_t^(s)}, t = l, …, l+W−1 (hereinafter audio segment s^(l)), and compute its fingerprint distance to audio segment c.
First, compute the fingerprint of audio segment s^(l):

    ω_i^(s,l) = (1/W) Σ_{t=l}^{l+W−1} ω_{i,t}^(s)

that is, the arithmetic mean of the weight coefficients of the i-th Gaussian component over the frames of audio segment s^(l) serves as the i-th dimension of the fingerprint of s^(l).
Then, compute the cross entropy distance between the fingerprint {ω_i^(s,l)}, i = 1, …, M, of audio segment s^(l) and the fingerprint {ω_i^(c)}, i = 1, …, M, of audio segment c:

    d_KL(l) = Σ_{i=1}^{M} ω_i^(c) ln( ω_i^(c) / ω_i^(s,l) )     (7)

where the contribution of the i-th dimension is denoted d_KL,i(l) = ω_i^(c) ln( ω_i^(c) / ω_i^(s,l) ), so that d_KL(l) = Σ_{i=1}^{M} d_KL,i(l).
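Formula (7) and its per-dimension decomposition transcribe directly into code (illustrative names; when both fingerprints sum to 1 this is the KL divergence, which is zero exactly when the fingerprints coincide and positive otherwise):

```python
import math

def d_kl_i(wc, ws):
    """Per-dimension contribution d_KL,i(l) = w_i^(c) ln(w_i^(c)/w_i^(s,l))."""
    return wc * math.log(wc / ws)

def d_kl(fp_c, fp_s):
    """Formula (7): sum of the per-dimension contributions over i = 1..M."""
    return sum(d_kl_i(wc, ws) for wc, ws in zip(fp_c, fp_s))
```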
If d_KL(l) ≤ θ, judge that audio stream s contains audio segment c starting from frame l, where θ is a preset detection threshold (typically 0.01). Then set l = l + 1 and return to step (5.3.4) to continue searching the remaining part of audio stream s for further occurrences of audio segment c.
If d_KL(l) > θ, compute a skip step size τ_KL-skip by formula (9), which uses the gap d_KL(l) − θ together with the partial derivative obtained by regarding d_KL,i(l) as a function of ω_i^(s,l) and differentiating with respect to ω_i^(s,l) (formula (10): ∂d_KL,i(l)/∂ω_i^(s,l) = −ω_i^(c)/ω_i^(s,l)); ⌊·⌋ denotes rounding down. Then set l = l + τ_KL-skip and return to step (5.3.4) to continue searching the remaining part of audio stream s for further occurrences of audio segment c.
It can be proved that the skip step size computed by formula (9) guarantees that no comparison with distance below the threshold θ is missed, so retrieval is accelerated by avoiding unnecessary computations. The larger τ_KL-skip is, the higher the retrieval efficiency.
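For the L1 distance the guarantee can be verified directly; the following derivation is supplied for intuition (it reproduces the classical sliding-average bound and sketches the idea behind the general convex case, not necessarily the exact form of formula (9)):

```latex
% Window fingerprints are averages of W per-frame weight vectors, each
% lying on the probability simplex (its entries sum to 1). Shifting the
% window by \tau frames swaps \tau of the W averaged vectors, so
\sum_{i=1}^{M}\bigl|\omega_i^{(s,l+\tau)}-\omega_i^{(s,l)}\bigr|
  \le \frac{1}{W}\sum_{i=1}^{M}\left(
        \sum_{t=l+W}^{l+W+\tau-1}\omega_{i,t}^{(s)}
      + \sum_{t=l}^{l+\tau-1}\omega_{i,t}^{(s)}\right)
  = \frac{2\tau}{W}.
% Hence, by the triangle inequality,
d_{L1}(l+k) \;\ge\; d_{L1}(l) - \frac{2k}{W} \;>\; \theta
\quad\text{for all } k < \frac{W\,\bigl(d_{L1}(l)-\theta\bigr)}{2},
% so jumping from l to l+\tau with
\tau \;=\; \left\lfloor \frac{W\,\bigl(d_{L1}(l)-\theta\bigr)}{2} \right\rfloor
% (clamped to at least 1) cannot skip over any window whose distance
% is \le \theta.
```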
Discussion:
1. The above generalized dynamic sequential comparison technique can be applied to any distance metric of the form d(l) = Σ_{i=1}^{M} d_i(l) in which each d_i(l) is a convex function of ω_i^(s,l). This includes, but is not limited to, the cross entropy distance defined by formula (7) and the L1 distance given by formula (12):

    d_L1(l) = Σ_{i=1}^{M} | ω_i^(c) − ω_i^(s,l) |     (12)

where the contribution of the i-th dimension is denoted d_L1,i(l) = | ω_i^(c) − ω_i^(s,l) |.
2. When the cross entropy distance is adopted, the skip step formulas (9) and (10) need some adjustment in actual computation. Note that if the weight ω_i^(s,l) of some Gaussian component is close to 0, the denominator in formula (10) becomes very large, making the skip step often less than 1. To address this problem, a small positive offset (typically 0.001 or 0.005) can be added to the weight coefficients, and the cross entropy distance is then computed from the offset weights. The generalized dynamic sequential comparison technique still applies in this case.
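One way to realize this offset adjustment in code (the renormalization after adding the offset Δ is our assumption — the text only states that a positive offset is added to the weights):

```python
import math

def smooth(fp, delta=0.005):
    """Add offset delta to every weight and renormalize, so no weight is
    near 0 and the log ratios in the cross entropy distance stay bounded."""
    s = sum(fp) + delta * len(fp)
    return [(w + delta) / s for w in fp]

def d_kl_smoothed(fp_c, fp_s, delta=0.005):
    """Cross entropy distance computed on the offset weights."""
    a, b = smooth(fp_c, delta), smooth(fp_s, delta)
    return sum(wc * math.log(wc / ws) for wc, ws in zip(a, b))
```

Even a fingerprint containing exact zeros now yields a finite distance, so the skip step derived from formula (10) stays usable.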
3. Since the common component Gaussian mixture model is used, only the weight coefficient of each Gaussian component needs to be estimated during fingerprint extraction, which avoids the mis-estimation problems that insufficient data might otherwise cause; this is significant for accurate fingerprint extraction from short segments lasting only 2 to 3 seconds. At the same time, using the common component Gaussian mixture model greatly simplifies the computation of the KL distance between two Gaussian mixture models, which is what makes the subsequent dynamic sequential comparison feasible.
Test results
First, about 100 hours of audio data collected in advance (from CCTV, VOA, etc.) were used for maximum likelihood parameter estimation to create a common component Gaussian mixture model.
As the tested audio stream, we recorded 10 hours of VOA broadcast news, and extracted from it 1000 audio segments of 3 seconds each as user-specified audio segments. We applied three different distortions to the tested audio stream:
1) low-quality VBR MP3 compression (40-50 Kbps);
2) room echo added with CoolEdit;
3) playback through a low-quality loudspeaker, re-recorded with a low-quality microphone.
The search task is to find the user-specified audio segments in the 10-hour distorted audio stream. Table 1 lists the experimental results, where CPU time is the average time required to find a 3-second audio segment in the 10-hour audio stream (experiments used a 1.4 GHz Pentium 4 processor).
Tests show that, with an offset of 0.005, cross entropy distance comparison reduces the number of distance computations by 93.47% compared with linear frame-by-frame computation, and under multiple distortions its error rate is 31.7% lower than comparison using the L1 distance.
Table 1: test results