Goal of the invention
The present invention proposes a fast audio fingerprint search method based on the cross entropy distance, which has better robustness to distortion and admits a wider range of applicable distance metrics.
The invention is characterized in that it is realized in a computer according to the following steps in sequence:
Step (1): computer initialization.
Three modules are provided: a common component Gaussian mixture model (CCGMM) generation module, an audio fingerprint extraction module based on the CCGMM, and a generalized dynamic sequential comparison module, wherein:
The CCGMM generation module uses about 100 hours of audio data collected in advance to perform maximum likelihood parameter estimation and create a common component Gaussian mixture model;
The audio fingerprint extraction module extracts audio fingerprints based on the CCGMM, and measures the distance between fingerprints with the cross entropy distance;
The generalized dynamic sequential comparison module compares the fingerprint of the user-specified audio segment against the input audio stream in a sliding-window manner, and judges whether the audio stream contains the specified audio segment.
Step (2): create a common component Gaussian mixture model according to the following steps:
Step (2.1): collect about 100 hours of audio data in advance. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame.
Step (2.2): using the set of cepstral feature vectors obtained in step (2.1), perform maximum likelihood parameter estimation to create a common component Gaussian mixture model. The model comprises M Gaussian distributions as its components, together with M weight coefficients, where M = 512:

    {ω_i^(u), μ_i^(u), Σ_i^(u)}, i = 1, …, M

where μ_i^(u) and Σ_i^(u) are the mean vector and covariance matrix of the i-th Gaussian component, ω_i^(u) is the weight coefficient of the i-th Gaussian component, i = 1, …, M, and the superscript u identifies this common component Gaussian mixture model.
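The maximum likelihood estimation of step (2.2) is ordinarily carried out with the EM algorithm. As an illustrative sketch only (1-D data, two components, a fixed iteration count, deterministic initialization — the actual module trains M = 512 multivariate Gaussians on 100 hours of cepstral vectors):

```python
import math

def em_gmm_1d(data, M, iters=50):
    """Toy EM for a 1-D Gaussian mixture: illustrative stand-in for
    CCGMM training; returns weights, means, variances."""
    lo, hi = min(data), max(data)
    w = [1.0 / M] * M
    mu = [lo + (i + 0.5) * (hi - lo) / M for i in range(M)]  # spread initial means
    var = [1.0] * M
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample
        resp = []
        for x in data:
            p = [w[i] * math.exp(-(x - mu[i]) ** 2 / (2 * var[i]))
                 / math.sqrt(2 * math.pi * var[i]) for i in range(M)]
            s = sum(p)
            resp.append([pi / s for pi in p])
        # M-step: re-estimate weights, means, variances from responsibilities
        for i in range(M):
            ni = sum(r[i] for r in resp)
            w[i] = ni / len(data)
            mu[i] = sum(r[i] * x for r, x in zip(resp, data)) / ni
            var[i] = max(sum(r[i] * (x - mu[i]) ** 2
                             for r, x in zip(resp, data)) / ni, 1e-6)
    return w, mu, var
```

With two well-separated clusters the means converge to the cluster centers and the weights to the cluster proportions.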
Step (3): preprocess the user-specified audio segment according to the following steps:
Step (3.1): input the user-specified audio segment c, whose time span is several seconds, into the computer. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame. The audio segment c is thus represented by a cepstral feature vector sequence {x_n^(c)}, n = 1, …, W, where W is the number of frames of audio segment c, n = 1, …, W is the index of each frame, and the superscript c identifies audio segment c.
Step (3.2): compute the weight coefficient ω_{i,n}^(c) of the i-th Gaussian component of the CCGMM at the n-th frame of audio segment c, for n = 1, …, W:

    ω_{i,n}^(c) = ω_i^(u) N_i(x_n^(c) | μ_i^(u), Σ_i^(u)) / Σ_{j=1}^{M} ω_j^(u) N_j(x_n^(c) | μ_j^(u), Σ_j^(u))     (1)

where i = 1, …, M and j = 1, …, M index the Gaussian components of the CCGMM, and N_i(x | μ_i^(u), Σ_i^(u)) denotes the Gaussian probability density function with mean vector μ_i^(u) and covariance matrix Σ_i^(u).
Compute the arithmetic mean, denoted ω_i^(c), of the weight coefficients of the i-th Gaussian component over all frames of audio segment c:

    ω_i^(c) = (1/W) Σ_{n=1}^{W} ω_{i,n}^(c)     (2)

The arithmetic means of the weight coefficients of all Gaussian components form a vector {ω_i^(c)}, i = 1, …, M, and this vector serves as the low-dimensional audio fingerprint of audio segment c.
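Formulas (1) and (2) can be sketched directly in code, assuming a toy diagonal-covariance model (the function names and the tiny model are illustrative, not the patent's actual M = 512 parameters):

```python
import math

def gauss_pdf(x, mu, var):
    """N(x | mu, diag(var)) for a diagonal-covariance Gaussian."""
    expo = sum((xk - mk) ** 2 / (2 * vk) for xk, mk, vk in zip(x, mu, var))
    norm = math.prod(2 * math.pi * vk for vk in var) ** -0.5
    return norm * math.exp(-expo)

def frame_weights(x, model):
    """Formula (1): posterior weight of each Gaussian component at one frame.
    model is a list of (weight, mean, variance) triples."""
    p = [w * gauss_pdf(x, mu, var) for w, mu, var in model]
    s = sum(p)
    return [pi / s for pi in p]

def fingerprint(frames, model):
    """Formula (2): arithmetic mean of per-frame weights -> M-dim fingerprint."""
    acc = [0.0] * len(model)
    for x in frames:
        for i, wi in enumerate(frame_weights(x, model)):
            acc[i] += wi
    return [a / len(frames) for a in acc]
```

Since each frame's weights lie on the probability simplex, the fingerprint does too, which is what makes the cross entropy distance of the later steps applicable.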
Step (4): compare the fingerprint of the user-specified audio segment c against the tested audio stream s in a sliding-window manner:
Step (4.1): input the tested audio stream s, of a set number of hours, into the computer. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame. The tested audio stream s is thus represented by a cepstral feature vector sequence {x_t^(s)}, t = 1, …, T, where T is the number of frames of audio stream s, t = 1, …, T is the index of each frame, and the superscript s identifies audio stream s.
Step (4.2): compute the weight coefficient ω_{i,t}^(s) of the i-th Gaussian component of the CCGMM at the t-th frame of audio stream s, for t = 1, …, T:

    ω_{i,t}^(s) = ω_i^(u) N_i(x_t^(s) | μ_i^(u), Σ_i^(u)) / Σ_{j=1}^{M} ω_j^(u) N_j(x_t^(s) | μ_j^(u), Σ_j^(u))     (3)

where i = 1, …, M and j = 1, …, M index the Gaussian components of the CCGMM, and N_i(x | μ_i^(u), Σ_i^(u)) denotes the Gaussian probability density function with mean vector μ_i^(u) and covariance matrix Σ_i^(u).
Step (4.3): set l = 1.
Step (4.4): if l + W − 1 > T, exit.
Step (4.5): take the audio segment of stream s inside the window of length W starting at frame l, {x_t^(s)}, t = l, …, l+W−1 (hereinafter audio segment s^(l)), and compute its fingerprint distance to audio segment c.
First, compute the fingerprint of audio segment s^(l):

    ω_i^(s,l) = (1/W) Σ_{t=l}^{l+W−1} ω_{i,t}^(s)

that is, the arithmetic mean of the weight coefficients of the i-th Gaussian component over the frames of audio segment s^(l) serves as the i-th dimension of the fingerprint of s^(l).
Then, compute the cross entropy distance between the fingerprint {ω_i^(s,l)}, i = 1, …, M, of audio segment s^(l) and the fingerprint {ω_i^(c)}, i = 1, …, M, of audio segment c:

    d_KL(l) = Σ_{i=1}^{M} ω_i^(c) ln( ω_i^(c) / ω_i^(s,l) )
If d_KL(l) ≤ θ, judge that audio stream s contains audio segment c starting from frame l, where θ is a preset detection threshold, taken as 0.01. Then set l = l + 1 and return to step (4.4) to continue searching the remaining part of audio stream s for further occurrences of audio segment c.
If d_KL(l) > θ, compute a skip step size τ_KL-skip according to the skip-step formula, where Δ is a preset offset, taken as 0.001 or 0.005, and ⌊·⌋ denotes rounding down. Then set l = l + τ_KL-skip and return to step (4.4) to continue searching the remaining part of audio stream s for further occurrences of audio segment c.
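The loop of steps (4.3)-(4.5) can be sketched as follows. The skip step is conservatively replaced here by τ = 1 (equivalent to exhaustive sliding), since the exact skip formula appears later; the function names and 0-based indexing are illustrative choices:

```python
import math

def kl_distance(fp_c, fp_s):
    """Cross entropy distance d_KL = sum_i w_i^(c) * ln(w_i^(c) / w_i^(s,l))."""
    return sum(wc * math.log(wc / ws) for wc, ws in zip(fp_c, fp_s) if wc > 0)

def search(stream_weights, fp_c, W, theta=0.01):
    """Slide a W-frame window over the stream's per-frame weight vectors
    (the output of formula (3)) and report every start frame whose window
    fingerprint is within theta of fp_c. 0-based frames; the text uses l = 1."""
    hits = []
    l = 0
    while l + W <= len(stream_weights):
        window = stream_weights[l:l + W]
        fp_s = [sum(col) / W for col in zip(*window)]  # window fingerprint
        if kl_distance(fp_c, fp_s) <= theta:
            hits.append(l)
        l += 1  # conservative skip; the patent's tau_KL-skip jumps farther
    return hits
```

Replacing the `l += 1` with the skip step of the comparison module is what turns this exhaustive scan into the generalized dynamic sequential comparison.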
Distinguishing features of the present invention:
1. A common component Gaussian mixture model (CCGMM, Common Component Gaussian Mixture Model) is established to describe audio fingerprints, and the cross entropy distance is selected to measure the distance between fingerprints. The cross entropy distance metric has better robustness to distortions such as low-bitrate audio compression, echo interference, and D/A and A/D conversion.
2. A generalized dynamic sequential comparison technique is proposed, applicable not only to the traditional L1 distance but also to the cross entropy distance, and to any other distance metric whose distance function is convex.
Tests show that, with an offset of 0.005, the present invention's cross entropy distance comparison reduces the number of distance computations by 93.47% compared with linear frame-by-frame computation, and under multiple distortions its error rate is 31.7% lower than comparison using the L1 distance.
Embodiment
The present invention specifically comprises three modules: the common component Gaussian mixture model generation module, the audio fingerprint extraction module based on the CCGMM, and the generalized dynamic sequential comparison module. Each module is specified as follows.
The common component Gaussian mixture model generation module
First, collect about 100 hours of audio data in advance. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame.
Then, using the set of cepstral feature vectors obtained above, perform maximum likelihood parameter estimation to create a common component Gaussian mixture model. The model comprises M Gaussian distributions as its components, together with M weight coefficients, where M = 512:

    {ω_i^(u), μ_i^(u), Σ_i^(u)}, i = 1, …, M

where μ_i^(u) and Σ_i^(u) are the mean vector and covariance matrix of the i-th Gaussian component, ω_i^(u) is the weight coefficient of the i-th Gaussian component, i = 1, …, M, and the superscript u identifies this common component Gaussian mixture model.
The audio fingerprint extraction module
The prior art usually obtains a feature vector or a histogram as the audio fingerprint by simple vector quantization. The present invention instead uses, as the fingerprint of an audio segment, the arithmetic means of the per-frame weight coefficients computed from the common component Gaussian mixture model. In the fingerprint search system, the audio fingerprint extraction module is called in two places.
Fingerprint extraction for the user-specified audio segment
For the user-specified audio segment c, whose time span is several seconds, extract through short-time Fourier analysis one cepstral feature vector per 10-millisecond frame. The audio segment c is thus represented by a cepstral feature vector sequence {x_n^(c)}, n = 1, …, W, where W is the number of frames of audio segment c, n = 1, …, W is the index of each frame, and the superscript c identifies audio segment c.
Compute the weight coefficient ω_{i,n}^(c) of the i-th Gaussian component of the CCGMM at the n-th frame of audio segment c, for n = 1, …, W:

    ω_{i,n}^(c) = ω_i^(u) N_i(x_n^(c) | μ_i^(u), Σ_i^(u)) / Σ_{j=1}^{M} ω_j^(u) N_j(x_n^(c) | μ_j^(u), Σ_j^(u))     (1)

where i = 1, …, M and j = 1, …, M index the Gaussian components of the CCGMM, and N_i(x | μ_i^(u), Σ_i^(u)) denotes the Gaussian probability density function with mean vector μ_i^(u) and covariance matrix Σ_i^(u).
Compute the arithmetic mean, denoted ω_i^(c), of the weight coefficients of the i-th Gaussian component over all frames of audio segment c:

    ω_i^(c) = (1/W) Σ_{n=1}^{W} ω_{i,n}^(c)     (2)

The arithmetic means of the weight coefficients of all Gaussian components form a vector {ω_i^(c)}, i = 1, …, M, and this vector serves as the low-dimensional audio fingerprint of audio segment c.
Fingerprint extraction for the tested audio stream
Input the tested audio stream s, of a set number of hours, into the computer. Through short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame. The tested audio stream s is thus represented by a cepstral feature vector sequence {x_t^(s)}, t = 1, …, T, where T is the number of frames of audio stream s, t = 1, …, T is the index of each frame, and the superscript s identifies audio stream s.
Compute the weight coefficient ω_{i,t}^(s) of the i-th Gaussian component of the CCGMM at the t-th frame of audio stream s, for t = 1, …, T:

    ω_{i,t}^(s) = ω_i^(u) N_i(x_t^(s) | μ_i^(u), Σ_i^(u)) / Σ_{j=1}^{M} ω_j^(u) N_j(x_t^(s) | μ_j^(u), Σ_j^(u))     (3)

where i = 1, …, M and j = 1, …, M index the Gaussian components of the CCGMM, and N_i(x | μ_i^(u), Σ_i^(u)) denotes the Gaussian probability density function with mean vector μ_i^(u) and covariance matrix Σ_i^(u).
In fingerprint search, the audio segment of stream s inside the window of length W starting at frame l, {x_t^(s)}, t = l, …, l+W−1 (hereinafter audio segment s^(l)), must be compared with audio segment c by fingerprint distance. To this end, compute the fingerprint of audio segment s^(l):

    ω_i^(s,l) = (1/W) Σ_{t=l}^{l+W−1} ω_{i,t}^(s)

that is, the arithmetic mean of the weight coefficients of the i-th Gaussian component over the frames of audio segment s^(l) serves as the i-th dimension of the fingerprint of s^(l).
The generalized dynamic sequential comparison module
Audio fingerprint comparison means comparing the fingerprint of the user-specified audio segment c against the tested audio stream s in a sliding-window manner, with the sliding-window length set to the length of audio segment c. If the fingerprint distance between a fragment at some position of audio stream s and audio segment c is less than the preset detection threshold, it is judged that audio segment c has been found in audio stream s.
Note that since the sliding window applied to the audio stream advances step by step in time (with the frame as the time unit), consecutive windows exhibit a kind of continuity. Dynamic sequential comparison exploits exactly this continuity: it computes a lower bound on the distance metric as a function of the time step, and skips the unneeded distance computations up to the point where this lower bound drops below the detection threshold, thereby greatly improving retrieval efficiency. In the original dynamic sequential retrieval, the distance metric was limited to the L1 distance. The present invention extends it to any distance metric whose distance function is convex, yielding the generalized dynamic sequential comparison. The concrete steps are as follows:
As described above, the user-specified audio segment c is processed to obtain a cepstral feature vector sequence {x_n^(c)}, n = 1, …, W, where W is the number of frames of audio segment c and n = 1, …, W is the index of each frame; the superscript c marks features of audio segment c. For n = 1, …, W and i = 1, …, M, compute by formula (1) the weight coefficient ω_{i,n}^(c) of the i-th Gaussian component at the n-th frame of audio segment c, then compute by formula (2) the audio fingerprint of audio segment c, namely the M-dimensional vector {ω_i^(c)}, i = 1, …, M.
As described above, the tested audio stream is processed to obtain a cepstral feature vector sequence {x_t^(s)}, t = 1, …, T, where T is the number of frames of audio stream s and t = 1, …, T is the index of each frame; the superscript s marks features of audio stream s. For t = 1, …, T and i = 1, …, M, compute by formula (3) the weight coefficient ω_{i,t}^(s) of the i-th Gaussian component at the t-th frame of audio stream s.
Set l = 1.
If l + W − 1 > T, exit.
Take the audio segment of stream s inside the window of length W starting at frame l, {x_t^(s)}, t = l, …, l+W−1 (hereinafter audio segment s^(l)), and compute its fingerprint distance to audio segment c.
First, compute the fingerprint of audio segment s^(l):

    ω_i^(s,l) = (1/W) Σ_{t=l}^{l+W−1} ω_{i,t}^(s)

that is, the arithmetic mean of the weight coefficients of the i-th Gaussian component over the frames of audio segment s^(l) serves as the i-th dimension of the fingerprint of s^(l).
Then, compute the cross entropy distance between the fingerprint {ω_i^(s,l)}, i = 1, …, M, of audio segment s^(l) and the fingerprint {ω_i^(c)}, i = 1, …, M, of audio segment c:

    d_KL(l) = Σ_{i=1}^{M} ω_i^(c) ln( ω_i^(c) / ω_i^(s,l) )     (7)

where the contribution of the i-th dimension is denoted d_KL,i(l) = ω_i^(c) ln( ω_i^(c) / ω_i^(s,l) ), so that d_KL(l) = Σ_{i=1}^{M} d_KL,i(l).
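Formula (7) and its per-dimension decomposition transcribe directly into code (illustrative names; when both fingerprints sum to 1 this is the KL divergence, which is zero exactly when the fingerprints coincide and positive otherwise):

```python
import math

def d_kl_i(wc, ws):
    """Per-dimension contribution d_KL,i(l) = w_i^(c) ln(w_i^(c)/w_i^(s,l))."""
    return wc * math.log(wc / ws)

def d_kl(fp_c, fp_s):
    """Formula (7): sum of the per-dimension contributions over i = 1..M."""
    return sum(d_kl_i(wc, ws) for wc, ws in zip(fp_c, fp_s))
```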
If d_KL(l) ≤ θ, judge that audio stream s contains audio segment c starting from frame l, where θ is a preset detection threshold (typically 0.01). Then set l = l + 1 and return to step (5.3.4) to continue searching the remaining part of audio stream s for further occurrences of audio segment c.
If d_KL(l) > θ, compute a skip step size τ_KL-skip by formula (9), which uses the gap d_KL(l) − θ together with the partial derivative obtained by regarding d_KL,i(l) as a function of ω_i^(s,l) and differentiating with respect to ω_i^(s,l) (formula (10): ∂d_KL,i(l)/∂ω_i^(s,l) = −ω_i^(c)/ω_i^(s,l)); ⌊·⌋ denotes rounding down. Then set l = l + τ_KL-skip and return to step (5.3.4) to continue searching the remaining part of audio stream s for further occurrences of audio segment c.
It can be proved that the skip step size computed by formula (9) guarantees that no comparison with distance below the threshold θ is missed, so retrieval is accelerated by avoiding unnecessary computations. The larger τ_KL-skip is, the higher the retrieval efficiency.
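For the L1 distance the guarantee can be verified directly; the following derivation is supplied for intuition (it reproduces the classical sliding-average bound and sketches the idea behind the general convex case, not necessarily the exact form of formula (9)):

```latex
% Window fingerprints are averages of W per-frame weight vectors, each
% lying on the probability simplex (its entries sum to 1). Shifting the
% window by \tau frames swaps \tau of the W averaged vectors, so
\sum_{i=1}^{M}\bigl|\omega_i^{(s,l+\tau)}-\omega_i^{(s,l)}\bigr|
  \le \frac{1}{W}\sum_{i=1}^{M}\left(
        \sum_{t=l+W}^{l+W+\tau-1}\omega_{i,t}^{(s)}
      + \sum_{t=l}^{l+\tau-1}\omega_{i,t}^{(s)}\right)
  = \frac{2\tau}{W}.
% Hence, by the triangle inequality,
d_{L1}(l+k) \;\ge\; d_{L1}(l) - \frac{2k}{W} \;>\; \theta
\quad\text{for all } k < \frac{W\,\bigl(d_{L1}(l)-\theta\bigr)}{2},
% so jumping from l to l+\tau with
\tau \;=\; \left\lfloor \frac{W\,\bigl(d_{L1}(l)-\theta\bigr)}{2} \right\rfloor
% (clamped to at least 1) cannot skip over any window whose distance
% is \le \theta.
```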
Discussion:
1. The above generalized dynamic sequential comparison technique can be applied to any distance metric of the form d(l) = Σ_{i=1}^{M} d_i(l) in which each d_i(l) is a convex function of ω_i^(s,l). This includes, but is not limited to, the cross entropy distance defined by formula (7) and the L1 distance given by formula (12):

    d_L1(l) = Σ_{i=1}^{M} | ω_i^(c) − ω_i^(s,l) |     (12)

where the contribution of the i-th dimension is denoted d_L1,i(l) = | ω_i^(c) − ω_i^(s,l) |.
2. When the cross entropy distance is adopted, the skip step formulas (9) and (10) need some adjustment in actual computation. Note that if the weight ω_i^(s,l) of some Gaussian component is close to 0, the denominator in formula (10) becomes very large, making the skip step often less than 1. To address this problem, a small positive offset (typically 0.001 or 0.005) can be added to the weight coefficients, and the cross entropy distance is then computed from the offset weights. The generalized dynamic sequential comparison technique still applies in this case.
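One way to realize this offset adjustment in code (the renormalization after adding the offset Δ is our assumption — the text only states that a positive offset is added to the weights):

```python
import math

def smooth(fp, delta=0.005):
    """Add offset delta to every weight and renormalize, so no weight is
    near 0 and the log ratios in the cross entropy distance stay bounded."""
    s = sum(fp) + delta * len(fp)
    return [(w + delta) / s for w in fp]

def d_kl_smoothed(fp_c, fp_s, delta=0.005):
    """Cross entropy distance computed on the offset weights."""
    a, b = smooth(fp_c, delta), smooth(fp_s, delta)
    return sum(wc * math.log(wc / ws) for wc, ws in zip(a, b))
```

Even a fingerprint containing exact zeros now yields a finite distance, so the skip step derived from formula (10) stays usable.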
3. Since the common component Gaussian mixture model is used, only the weight coefficient of each Gaussian component needs to be estimated during fingerprint extraction, which avoids the mis-estimation problems that insufficient data might otherwise cause; this is significant for accurate fingerprint extraction from short segments lasting only 2 to 3 seconds. At the same time, using the common component Gaussian mixture model greatly simplifies the computation of the KL distance between two Gaussian mixture models, which is what makes the subsequent dynamic sequential comparison feasible.
Test results
First, about 100 hours of audio data collected in advance (from CCTV, VOA, etc.) were used for maximum likelihood parameter estimation to create a common component Gaussian mixture model.
As the tested audio stream, we recorded 10 hours of VOA broadcast news, and extracted from it 1000 audio segments of 3 seconds each as user-specified audio segments. We applied three different distortions to the tested audio stream:
1) low-quality VBR MP3 compression (40-50 Kbps);
2) room echo added with CoolEdit;
3) playback through a low-quality loudspeaker, re-recorded with a low-quality microphone.
The search task is to find the user-specified audio segments in the 10-hour distorted audio stream. Table 1 lists the experimental results, where CPU time is the average time required to find a 3-second audio segment in the 10-hour audio stream (experiments used a 1.4 GHz Pentium 4 processor).
Tests show that, with an offset of 0.005, cross entropy distance comparison reduces the number of distance computations by 93.47% compared with linear frame-by-frame computation, and under multiple distortions its error rate is 31.7% lower than comparison using the L1 distance.
Table 1: test results