
CN101853262A - A fast search method for audio fingerprints based on cross-entropy - Google Patents


Info

Publication number: CN101853262A
Application number: CN200910241366A
Authority: CN (China)
Prior art keywords: audio, omega, frame, fingerprint, divided
Legal status: Pending
Original language: Chinese (zh)
Inventors: 欧智坚 (Ou Zhijian), 林晖 (Lin Hui)
Original and current assignee: Tsinghua University
Priority / filing date: 2009-12-07
Publication date: 2010-10-06
Application filed by Tsinghua University; priority to CN200910241366A


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

A fast search method for audio fingerprints based on cross-entropy, belonging to the field of audio fingerprint search technology. The method describes audio fingerprints with a common-component Gaussian mixture model and measures the distance between fingerprints with cross-entropy; a generalized dynamic time-series matching method then compares, in a sliding-window manner, the fingerprint of a user-designated audio segment against an input audio stream, and decides whether the stream contains the designated segment. The invention greatly reduces the number of distance computations, adapts to a variety of audio distortions, and lowers the error rate by one third relative to the L1 distance. It has been implemented in software and tested in simulation experiments.

Description

Fast search method for audio fingerprints based on cross-entropy
Technical field
The present invention relates to a fast search method for audio fingerprints based on cross-entropy that enables principled skipping comparisons, achieving efficient and accurate search over large volumes of audio data.
Background art
An audio fingerprint is a compact, content-based, low-dimensional representation of the audio itself (a segment of audio is expressed as a low-dimensional feature vector), such that audio with different content has different fingerprints. Audio fingerprinting has a wide range of applications, including content-based audio identification and retrieval as well as copyright management.
Many different algorithms exist today. They differ mainly in:
1. The choice of audio features, such as the mean and variance of the short-term spectrum, the spectral centroid, or the fundamental frequency.
2. The structure of the fingerprint model, such as a vector quantization codebook or a Gaussian mixture model.
3. The distance measure used to compare fingerprints, such as the Euclidean distance or the Mahalanobis distance.
4. The search algorithm, such as linear comparison or nearest-neighbor comparison.
An outstanding audio fingerprint search method should still compare fingerprints correctly after the audio signal has suffered some distortion from compression, transmission, and the like, while satisfying the time-complexity requirements of the computation.
Object of the invention
The present invention proposes a fast cross-entropy-based audio fingerprint search method that is more robust to distortion and whose matching technique applies to a wider class of distance measures.
The invention is characterized in that it is carried out in a computer by the following steps in sequence:
Step (1): computer initialization.
The following modules are set up: a common-component Gaussian mixture model (CCGMM) generation module, a CCGMM-based audio fingerprint extraction module, and a generalized dynamic time-series matching module, wherein:
the CCGMM generation module uses roughly 100 hours of pre-collected audio data to perform maximum-likelihood parameter estimation and create a CCGMM;
the audio fingerprint extraction module extracts audio fingerprints based on the CCGMM and measures the distance between fingerprints with cross-entropy;
the generalized dynamic time-series matching module compares fingerprints of the user-designated audio segment against the input audio stream in a sliding-window manner, and decides whether the stream contains the designated segment.
Step (2): create a CCGMM as follows.
Step (2.1): collect roughly 100 hours of audio data in advance. Using short-time Fourier analysis, extract one cepstral feature vector per frame, with one frame every 10 milliseconds.
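As an illustration, here is a minimal sketch of the framing and cepstral analysis of step (2.1). The 25 ms frame length, Hamming window, and 13 coefficients are assumptions not fixed by the patent, which specifies only the 10 ms frame rate.

```python
import numpy as np

def cepstral_features(signal, sample_rate, hop_ms=10, frame_ms=25, n_ceps=13):
    """One real-cepstrum vector per 10 ms frame via short-time Fourier analysis."""
    hop = int(sample_rate * hop_ms / 1000)
    frame_len = int(sample_rate * frame_ms / 1000)
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # avoid log(0)
        feats.append(np.fft.irfft(log_mag)[:n_ceps])          # real cepstrum
    return np.array(feats)  # shape (num_frames, n_ceps)
```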
Step (2.2): using the set of cepstral feature vectors obtained in step (2.1), perform maximum-likelihood parameter estimation to create a CCGMM. The model comprises M Gaussian distributions as its components together with M weight coefficients, with M = 512:

$$\{\omega_i^{(u)}, \mu_i^{(u)}, \Sigma_i^{(u)}\}_{i=1,\dots,M}$$

where $\mu_i^{(u)}$ and $\Sigma_i^{(u)}$ denote the mean vector and covariance matrix of the i-th Gaussian component, $\omega_i^{(u)}$ denotes the weight coefficient of the i-th Gaussian component, $i = 1, \dots, M$, and the superscript u identifies the CCGMM.
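A minimal sketch of this maximum-likelihood step, using scikit-learn's EM-based GaussianMixture as a stand-in estimator (the patent does not name one); the diagonal covariances and the synthetic stand-in data are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

M = 512
# Stand-in for cepstral vectors pooled from ~100 hours of audio:
training_features = np.random.randn(200_000, 13)

# Maximum-likelihood fit of the M shared Gaussian components via EM.
ccgmm = GaussianMixture(n_components=M, covariance_type="diag", max_iter=100)
ccgmm.fit(training_features)

weights_u = ccgmm.weights_      # omega_i^(u)
means_u = ccgmm.means_          # mu_i^(u)
covars_u = ccgmm.covariances_   # diagonal of Sigma_i^(u)
```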
Step (3): preprocess the user-designated audio segment as follows.
Step (3.1): input the user-designated audio segment c, whose duration is a few seconds, to the computer. Using short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame. Segment c is thus represented by a cepstral feature vector sequence $\{x_n^{(c)}\}_{n=1,\dots,W}$, where W is the number of frames of segment c, $n = 1, \dots, W$ is the index of each frame of segment c, and the superscript c identifies segment c.
Step (3.2): compute the weight coefficient $\omega_{i,n}^{(c)}$ of the i-th Gaussian component of the CCGMM at frame n of segment c, for $n = 1, \dots, W$:

$$\omega_{i,n}^{(c)} = \frac{\omega_i^{(u)}\, N_i\!\left(x_n^{(c)} \mid \mu_i^{(u)}, \Sigma_i^{(u)}\right)}{\sum_{j=1}^{M} \omega_j^{(u)}\, N_j\!\left(x_n^{(c)} \mid \mu_j^{(u)}, \Sigma_j^{(u)}\right)}$$

where $i = 1, \dots, M$ and $j = 1, \dots, M$ index the Gaussian components of the CCGMM, and $N_i(x \mid \mu_i^{(u)}, \Sigma_i^{(u)})$ denotes the Gaussian probability density function with mean vector $\mu_i^{(u)}$ and covariance matrix $\Sigma_i^{(u)}$.
Then compute the arithmetic mean over the frames of segment c of the weight coefficient of the i-th Gaussian component, denoted $\omega_i^{(c)}$:

$$\omega_i^{(c)} = \frac{1}{W} \sum_{n=1}^{W} \omega_{i,n}^{(c)}$$

The arithmetic means of the weight coefficients of all Gaussian components form a vector $\{\omega_i^{(c)}\}_{i=1,\dots,M}$, which serves as the low-dimensional audio fingerprint of segment c.
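A minimal sketch of step (3), assuming the fitted ccgmm from the previous sketch; frame_posteriors and fingerprint are hypothetical helper names. Note that predict_proba computes exactly the normalized weighted densities of the formula above.

```python
import numpy as np

def frame_posteriors(features, ccgmm):
    """Per-frame component weights omega_{i,n}; shape (num_frames, M)."""
    return ccgmm.predict_proba(features)

def fingerprint(features, ccgmm):
    """M-dimensional fingerprint: arithmetic mean of the per-frame weights."""
    return frame_posteriors(features, ccgmm).mean(axis=0)
```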
Step (4): compare fingerprints of the user-designated segment c against the tested audio stream s in a sliding-window manner.
Step (4.1): input the tested audio stream s, of a preset number of hours, to the computer. Using short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame. Stream s is thus represented by a cepstral feature vector sequence $\{x_t^{(s)}\}_{t=1,\dots,T}$, where T is the number of frames of stream s, $t = 1, \dots, T$ is the index of each frame of stream s, and the superscript s identifies stream s.
Step (4.2): compute the weight coefficient $\omega_{i,t}^{(s)}$ of the i-th Gaussian component of the CCGMM at frame t of stream s, for $t = 1, \dots, T$:

$$\omega_{i,t}^{(s)} = \frac{\omega_i^{(u)}\, N_i\!\left(x_t^{(s)} \mid \mu_i^{(u)}, \Sigma_i^{(u)}\right)}{\sum_{j=1}^{M} \omega_j^{(u)}\, N_j\!\left(x_t^{(s)} \mid \mu_j^{(u)}, \Sigma_j^{(u)}\right)}$$

where $i = 1, \dots, M$ and $j = 1, \dots, M$ index the Gaussian components of the CCGMM, and $N_i(x \mid \mu_i^{(u)}, \Sigma_i^{(u)})$ denotes the Gaussian probability density function with mean vector $\mu_i^{(u)}$ and covariance matrix $\Sigma_i^{(u)}$.
Step (4.3): set l = 1.
Step (4.4): if l + W - 1 > T, exit.
Step (4.5): take the W-frame window of stream s starting at frame l, i.e. the segment $\{x_t^{(s)}\}_{t=l,\dots,l+W-1}$, hereinafter segment $s^{(l)}$, and compute its fingerprint distance to segment c.
First, compute the fingerprint of segment $s^{(l)}$:

$$\omega_i^{(s,l)} = \frac{1}{W} \sum_{t=l}^{l+W-1} \omega_{i,t}^{(s)}$$

That is, the arithmetic mean over the frames of segment $s^{(l)}$ of the weight coefficient of the i-th Gaussian component serves as the i-th dimension of the fingerprint of $s^{(l)}$.
Then compute the cross-entropy distance between the fingerprint $\{\omega_i^{(s,l)}\}_{i=1,\dots,M}$ of segment $s^{(l)}$ and the fingerprint $\{\omega_i^{(c)}\}_{i=1,\dots,M}$ of segment c:

$$d_{KL}(l) = \sum_{i=1}^{M} \left(\omega_i^{(s,l)} - \omega_i^{(c)}\right) \log \frac{\omega_i^{(s,l)}}{\omega_i^{(c)}}$$

If $d_{KL}(l) \le \theta$, decide that stream s contains segment c starting at frame l, where θ is a preset detection threshold, taken as 0.01. Then set l = l + 1 and return to step (4.4) to continue searching whether the remainder of stream s also contains segment c.
If $d_{KL}(l) > \theta$, compute a skip step size $\tau_{KL\text{-}skip}$:

[Formula for the skip step, given only as an image in the original filing; see formula (9) in the embodiments.]

where Δ is a preset offset, taken as 0.001 or 0.005, and $\lfloor \cdot \rfloor$ denotes rounding down. Then set $l = l + \tau_{KL\text{-}skip}$ and return to step (4.4) to continue searching whether the remainder of stream s also contains segment c.
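The following is a minimal sketch of this sliding-window search. Because the patent's exact skip formula (9) survives only as an image, the skip below is a conservative reconstruction from the convexity argument given in the embodiments; treat it, along with all helper names, as assumptions rather than the patented formula.

```python
import numpy as np

def d_kl(w_s, w_c, delta=0.005):
    # Offset cross-entropy distance, formula (14) in the embodiments.
    return float(np.sum((w_s - w_c) * np.log((w_s + delta) / (w_c + delta))))

def search(post_s, fp_c, W, theta=0.01, delta=0.005):
    """post_s: (T, M) per-frame weights of stream s; fp_c: (M,) fingerprint of c.
    Yields 0-based frame indices l where the stream is judged to contain c."""
    T = post_s.shape[0]
    l = 0
    while l + W <= T:
        fp_sl = post_s[l:l + W].mean(axis=0)   # window fingerprint, formula (5)
        d = d_kl(fp_sl, fp_c, delta)
        if d <= theta:
            yield l
            l += 1
        else:
            # Derivative of each per-dimension distance (offset version of (10));
            # the tangent lower bound plus |window-mean change| <= tau/W gives a
            # safe skip -- a hedged reconstruction, not the patent's exact (9).
            grad = np.abs(np.log((fp_sl + delta) / (fp_c + delta))
                          + 1.0 - (fp_c + delta) / (fp_sl + delta))
            l += max(int(np.floor(W * (d - theta) / grad.sum())), 1)
```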
The distinguishing features of the present invention are:
1. A common-component Gaussian mixture model (CCGMM) is established to describe audio fingerprints, and the cross-entropy distance is selected to measure the distance between fingerprints. The cross-entropy distance is more robust to distortions such as low-bitrate audio compression, reverberation, and D/A and A/D conversion.
2. A generalized dynamic time-series matching technique is proposed that applies not only to the traditional L1 distance but also to the cross-entropy distance and to any other distance measure that is a convex function.
Tests show that, with an offset of 0.005, the present invention's cross-entropy distance reduces the number of distance computations by 93.47% relative to linear frame-by-frame comparison, and under multiple distortions the present invention's error rate is 31.7% lower than when the L1 distance is used.
Brief description of the drawings
Fig. 1: schematic diagram of the cross-entropy-based audio fingerprint search method.
Fig. 2: program flowchart of the cross-entropy-based audio fingerprint search method.
Embodiments
The invention comprises three modules: the CCGMM generation module, the CCGMM-based audio fingerprint extraction module, and the generalized dynamic time-series matching module. Each module is specified below.
The CCGMM generation module
First, roughly 100 hours of audio data are collected in advance. Using short-time Fourier analysis, one cepstral feature vector is extracted per 10-millisecond frame.
Then, using the set of cepstral feature vectors so obtained, maximum-likelihood parameter estimation is performed to create a CCGMM. The model comprises M Gaussian distributions as its components together with M weight coefficients, with M = 512:

$$\{\omega_i^{(u)}, \mu_i^{(u)}, \Sigma_i^{(u)}\}_{i=1,\dots,M}$$

where $\mu_i^{(u)}$ and $\Sigma_i^{(u)}$ denote the mean vector and covariance matrix of the i-th Gaussian component, $\omega_i^{(u)}$ denotes the weight coefficient of the i-th Gaussian component, $i = 1, \dots, M$, and the superscript u identifies the CCGMM.
The audio fingerprint extraction module
The prior art usually obtains a feature vector or a histogram as the audio fingerprint by simple vector quantization. The present invention instead uses as the fingerprint of an audio segment the arithmetic mean, over the segment's frames, of the component weight coefficients computed from the CCGMM. The fingerprint extraction module is invoked at two places in the fingerprint search system.
Fingerprint extraction of the user-designated audio segment
The user-designated audio segment c, whose duration is a few seconds, is analyzed by short-time Fourier analysis, and one cepstral feature vector is extracted per 10-millisecond frame. Segment c is thus represented by a cepstral feature vector sequence $\{x_n^{(c)}\}_{n=1,\dots,W}$, where W is the number of frames of segment c and $n = 1, \dots, W$ is the frame index. The superscript c identifies segment c.
The weight coefficient $\omega_{i,n}^{(c)}$ of the i-th Gaussian component of the CCGMM at frame n of segment c is computed for $n = 1, \dots, W$:

$$\omega_{i,n}^{(c)} = \frac{\omega_i^{(u)}\, N_i\!\left(x_n^{(c)} \mid \mu_i^{(u)}, \Sigma_i^{(u)}\right)}{\sum_{j=1}^{M} \omega_j^{(u)}\, N_j\!\left(x_n^{(c)} \mid \mu_j^{(u)}, \Sigma_j^{(u)}\right)} \tag{1}$$

where $i = 1, \dots, M$ and $j = 1, \dots, M$ index the Gaussian components of the CCGMM, and $N_i(x \mid \mu_i^{(u)}, \Sigma_i^{(u)})$ denotes the Gaussian probability density function with mean vector $\mu_i^{(u)}$ and covariance matrix $\Sigma_i^{(u)}$.
The arithmetic mean over the frames of segment c of the weight coefficient of the i-th Gaussian component, denoted $\omega_i^{(c)}$, is computed as

$$\omega_i^{(c)} = \frac{1}{W} \sum_{n=1}^{W} \omega_{i,n}^{(c)} \tag{2}$$

The arithmetic means of the weight coefficients of all Gaussian components form a vector $\{\omega_i^{(c)}\}_{i=1,\dots,M}$, which serves as the low-dimensional audio fingerprint of segment c.
Fingerprint extraction of the tested audio stream
The tested audio stream s, of a preset number of hours, is input to the computer and analyzed by short-time Fourier analysis, and one cepstral feature vector is extracted per 10-millisecond frame. Stream s is thus represented by a cepstral feature vector sequence $\{x_t^{(s)}\}_{t=1,\dots,T}$, where T is the number of frames of stream s and $t = 1, \dots, T$ is the frame index. The superscript s identifies stream s.
The weight coefficient $\omega_{i,t}^{(s)}$ of the i-th Gaussian component of the CCGMM at frame t of stream s is computed for $t = 1, \dots, T$:

$$\omega_{i,t}^{(s)} = \frac{\omega_i^{(u)}\, N_i\!\left(x_t^{(s)} \mid \mu_i^{(u)}, \Sigma_i^{(u)}\right)}{\sum_{j=1}^{M} \omega_j^{(u)}\, N_j\!\left(x_t^{(s)} \mid \mu_j^{(u)}, \Sigma_j^{(u)}\right)} \tag{3}$$

where i, j, and $N_i(\cdot)$ are as in formula (1).
In fingerprint search, the W-frame window of stream s starting at frame l, i.e. the segment $\{x_t^{(s)}\}_{t=l,\dots,l+W-1}$ (hereinafter segment $s^{(l)}$), must be compared with segment c by a fingerprint distance computation. For this purpose, the fingerprint of segment $s^{(l)}$ is computed as

$$\omega_i^{(s,l)} = \frac{1}{W} \sum_{t=l}^{l+W-1} \omega_{i,t}^{(s)} \tag{4}$$

That is, the arithmetic mean over the frames of segment $s^{(l)}$ of the weight coefficient of the i-th Gaussian component serves as the i-th dimension of the fingerprint of $s^{(l)}$.
The generalized dynamic time-series matching module
Audio fingerprint comparison means comparing, in a sliding-window manner, the fingerprint of the user-designated segment c against the tested audio stream s, with the sliding-window length set to the length of segment c. If the fingerprint distance between the windowed fragment of stream s at some position and segment c is below the preset detection threshold, it is decided that segment c has been found in stream s.
Note that because the sliding window applied to the audio stream advances over time step by step (with the frame as the time unit), consecutive windows exhibit a continuity. Dynamic time-series matching exploits exactly this continuity: by computing a lower bound on the distance measure as a function of the time step, it skips the unneeded distance computations up to the point where this lower bound falls below the detection threshold, thereby greatly improving retrieval efficiency. In the original dynamic time-series retrieval, the distance measure was restricted to the L1 distance. The present invention extends it to any distance measure that is a convex function, proposing generalized dynamic time-series matching. The concrete steps are as follows:
As above, the user-designated segment c is processed to obtain a cepstral feature vector sequence $\{x_n^{(c)}\}_{n=1,\dots,W}$, where W is the number of frames of segment c and $n = 1, \dots, W$ is the frame index; the superscript c marks quantities of segment c. For $n = 1, \dots, W$ and $i = 1, \dots, M$, the weight coefficient $\omega_{i,n}^{(c)}$ of the i-th Gaussian component at frame n of segment c is computed by formula (1), and the audio fingerprint of segment c, i.e. the M-dimensional vector $\{\omega_i^{(c)}\}_{i=1,\dots,M}$, is computed by formula (2).
As above, the tested audio stream is processed to obtain a cepstral feature vector sequence $\{x_t^{(s)}\}_{t=1,\dots,T}$, where T is the number of frames of stream s and $t = 1, \dots, T$ is the frame index; the superscript s marks quantities of stream s. For $t = 1, \dots, T$ and $i = 1, \dots, M$, the weight coefficient $\omega_{i,t}^{(s)}$ of the i-th Gaussian component at frame t of stream s is computed by formula (3).
Set l = 1.
If l + W - 1 > T, exit.
Take the W-frame window of stream s starting at frame l, i.e. the segment $\{x_t^{(s)}\}_{t=l,\dots,l+W-1}$ (hereinafter segment $s^{(l)}$), and compute its fingerprint distance to segment c.
First, compute the fingerprint of segment $s^{(l)}$:

$$\omega_i^{(s,l)} = \frac{1}{W} \sum_{t=l}^{l+W-1} \omega_{i,t}^{(s)} \tag{5}$$

That is, the arithmetic mean over the frames of segment $s^{(l)}$ of the weight coefficient of the i-th Gaussian component serves as the i-th dimension of the fingerprint of $s^{(l)}$.
Then compute the cross-entropy distance between the fingerprint $\{\omega_i^{(s,l)}\}_{i=1,\dots,M}$ of segment $s^{(l)}$ and the fingerprint $\{\omega_i^{(c)}\}_{i=1,\dots,M}$ of segment c:

$$d_{KL}(l) = \sum_{i=1}^{M} \left(\omega_i^{(s,l)} - \omega_i^{(c)}\right) \log \frac{\omega_i^{(s,l)}}{\omega_i^{(c)}} \tag{6}$$

The cross-entropy distance contributed by the i-th dimension is denoted

$$d_{KL,i}(l) = \left(\omega_i^{(s,l)} - \omega_i^{(c)}\right) \log \frac{\omega_i^{(s,l)}}{\omega_i^{(c)}} \tag{7}$$

so that

$$d_{KL}(l) = \sum_{i=1}^{M} d_{KL,i}(l) \tag{8}$$
If $d_{KL}(l) \le \theta$, decide that stream s contains segment c starting at frame l, where θ is a preset detection threshold (typically 0.01). Then set l = l + 1 and return to the exit check above to continue searching whether the remainder of stream s also contains segment c.
If $d_{KL}(l) > \theta$, compute a skip step size $\tau_{KL\text{-}skip}$ according to formula (9):

[Formula (9), given only as an image in the original filing.]

where $d_{KL,i}(l)$ is regarded as a function of $\omega_i^{(s,l)}$ and its partial derivative with respect to $\omega_i^{(s,l)}$ is

$$\frac{\partial d_{KL,i}}{\partial \omega_i^{(s,l)}} = 1 + \log \frac{\omega_i^{(s,l)}}{\omega_i^{(c)}} - \frac{\omega_i^{(c)}}{\omega_i^{(s,l)}} \tag{10}$$

and $\lfloor \cdot \rfloor$ denotes rounding down. Then set $l = l + \tau_{KL\text{-}skip}$ and return to the exit check above to continue searching whether the remainder of stream s also contains segment c.
It can be proved that the skip step computed by formula (9) never skips past any comparison whose distance is below the threshold θ, so retrieval is accelerated by avoiding unnecessary computations. The larger $\tau_{KL\text{-}skip}$ is, the higher the retrieval efficiency.
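Although formula (9) itself is not recoverable here, the bound it rests on can be sketched. The following derivation is a reconstruction under the assumption that per-frame weights lie in [0, 1]; it yields one valid skip rule, not necessarily the patent's exact formula (9).

```latex
% Tangent lower bound from convexity of each d_i in \omega_i^{(s,l)}:
%   d_i(l+\tau) \ge d_i(l) + d_i'(l)\,(\omega_i^{(s,l+\tau)} - \omega_i^{(s,l)}).
% A window shifted by \tau swaps at most \tau frames, and each per-frame
% weight lies in [0,1], so |\omega_i^{(s,l+\tau)} - \omega_i^{(s,l)}| \le \tau/W.
% Summing over i:
\[
  d_{KL}(l+\tau) \;\ge\; d_{KL}(l) \;-\; \frac{\tau}{W}
  \sum_{i=1}^{M} \left| \frac{\partial d_{KL,i}}{\partial \omega_i^{(s,l)}} \right|,
\]
% so every window within
\[
  \tau \;<\; W\,\bigl(d_{KL}(l) - \theta\bigr) \Big/
  \sum_{i=1}^{M} \left| \frac{\partial d_{KL,i}}{\partial \omega_i^{(s,l)}} \right|
\]
% still has distance above \theta and can safely be skipped.
```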
Discussion:
1. The generalized dynamic time-series matching technique above applies to any distance measure of the form

$$d(l) = \sum_{i=1}^{M} d_i(l) \tag{11}$$

in which each $d_i(l)$ is a convex function of $\omega_i^{(s,l)}$. This includes, but is not limited to, the cross-entropy distance defined by formula (7) and the L1 distance given by formula (12) below:

$$d_{L1}(l) = \sum_{i=1}^{M} \left| \omega_i^{(s,l)} - \omega_i^{(c)} \right| \tag{12}$$

where the L1 distance contributed by the i-th dimension is denoted

$$d_{L1,i}(l) = \left| \omega_i^{(s,l)} - \omega_i^{(c)} \right| \tag{13}$$

2. When the cross-entropy distance is used, the skip-step formulas (9) and (10) need some adjustment in actual computation. Note that if the weight $\omega_i^{(s,l)}$ of some Gaussian component is near 0, the denominator in formula (10) becomes very large, so the skip step is often less than 1. To solve this problem, a positive offset Δ (typically 0.001 or 0.005) is added to the weight coefficients and the cross-entropy distance formula below is adopted; generalized dynamic time-series matching still applies:

$$d_{KL}(l) = \sum_{i=1}^{M} \left(\omega_i^{(s,l)} - \omega_i^{(c)}\right) \log \frac{\omega_i^{(s,l)} + \Delta}{\omega_i^{(c)} + \Delta} \tag{14}$$

3. Because a common-component Gaussian mixture model is used, only the weight coefficients of the Gaussian components need to be estimated during fingerprint extraction. This avoids the estimation errors that insufficient data might otherwise cause, which is significant for accurately fingerprinting short segments lasting only 2 to 3 seconds. At the same time, using a CCGMM greatly simplifies the computation of the KL distance between two Gaussian mixture models, which is what makes the subsequent dynamic time-series matching feasible.
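For orientation, the sketches above compose roughly as follows; the audio inputs are synthetic stand-ins, and all function names are the assumed helpers defined earlier, not names from the patent.

```python
import numpy as np

sr = 16_000
query_audio = np.random.randn(3 * sr)    # stand-in for the designated segment c
stream_audio = np.random.randn(60 * sr)  # stand-in for the tested stream s

query_feats = cepstral_features(query_audio, sr)
stream_feats = cepstral_features(stream_audio, sr)

fp_c = fingerprint(query_feats, ccgmm)            # {omega_i^(c)}
post_s = frame_posteriors(stream_feats, ccgmm)    # {omega_{i,t}^(s)}

for l in search(post_s, fp_c, W=len(query_feats)):
    print(f"designated segment detected at frame {l} ({l * 10} ms)")
```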
Test results
First, roughly 100 hours of pre-collected audio data (from CCTV, VOA, etc.) were used for maximum-likelihood parameter estimation to create a CCGMM.
As the tested audio stream, we recorded 10 hours of VOA broadcast news, and from it extracted 1000 audio segments, each 3 seconds long, as user-designated segments. Three different distortions were applied to the tested stream:
1) low-quality VBR MP3 compression (40-50 kbps);
2) room echo added with CoolEdit;
3) playback through a low-quality loudspeaker, re-recorded with a low-quality microphone.
The search task is to find the user-designated segments in the 10-hour distorted audio stream. Table 1 lists the experimental results, where CPU time is the average time required to find a 3-second segment in the 10-hour stream (measured on a 1.4 GHz Pentium 4 processor).
Tests show that, with an offset of 0.005, the cross-entropy distance reduces the number of distance computations by 93.47% relative to linear frame-by-frame comparison, and under multiple distortions its error rate is 31.7% lower than with the L1 distance.

[Table 1, given only as an image in the original filing.]

Table 1: test results

Claims (1)

1. A fast search method for audio fingerprints based on cross-entropy, characterized in that it is carried out in a computer by the following steps in sequence:
Step (1): computer initialization.
The following modules are set up: a common-component Gaussian mixture model (CCGMM) generation module, a CCGMM-based audio fingerprint extraction module, and a generalized dynamic time-series matching module, wherein:
the CCGMM generation module uses roughly 100 hours of pre-collected audio data to perform maximum-likelihood parameter estimation and create a CCGMM;
the audio fingerprint extraction module extracts audio fingerprints based on the CCGMM and measures the distance between fingerprints with cross-entropy;
the generalized dynamic time-series matching module compares fingerprints of the user-designated audio segment against the input audio stream in a sliding-window manner, and decides whether the stream contains the designated segment.
Step (2): create a CCGMM as follows.
Step (2.1): collect roughly 100 hours of audio data in advance; using short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame.
Step (2.2): using the set of cepstral feature vectors obtained in step (2.1), perform maximum-likelihood parameter estimation to create a CCGMM, the model comprising M Gaussian distributions as its components together with M weight coefficients, with M = 512:

$$\{\omega_i^{(u)}, \mu_i^{(u)}, \Sigma_i^{(u)}\}_{i=1,\dots,M}$$

where $\mu_i^{(u)}$ and $\Sigma_i^{(u)}$ denote the mean vector and covariance matrix of the i-th Gaussian component, $\omega_i^{(u)}$ denotes the weight coefficient of the i-th Gaussian component, $i = 1, \dots, M$, and the superscript u identifies the CCGMM.
Step (3): preprocess the user-designated audio segment as follows.
Step (3.1): input the user-designated audio segment c, whose duration is a few seconds, to the computer; using short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame, so that segment c is represented by a cepstral feature vector sequence $\{x_n^{(c)}\}_{n=1,\dots,W}$, where W is the number of frames of segment c, $n = 1, \dots, W$ is the frame index, and the superscript c identifies segment c.
Step (3.2): compute the weight coefficient $\omega_{i,n}^{(c)}$ of the i-th Gaussian component of the CCGMM at frame n of segment c, for $n = 1, \dots, W$:

$$\omega_{i,n}^{(c)} = \frac{\omega_i^{(u)}\, N_i\!\left(x_n^{(c)} \mid \mu_i^{(u)}, \Sigma_i^{(u)}\right)}{\sum_{j=1}^{M} \omega_j^{(u)}\, N_j\!\left(x_n^{(c)} \mid \mu_j^{(u)}, \Sigma_j^{(u)}\right)}$$

where $i = 1, \dots, M$ and $j = 1, \dots, M$ index the Gaussian components of the CCGMM, and $N_i(x \mid \mu_i^{(u)}, \Sigma_i^{(u)})$ denotes the Gaussian probability density function with mean vector $\mu_i^{(u)}$ and covariance matrix $\Sigma_i^{(u)}$;
then compute the arithmetic mean over the frames of segment c of the weight coefficient of the i-th Gaussian component, denoted $\omega_i^{(c)}$:

$$\omega_i^{(c)} = \frac{1}{W} \sum_{n=1}^{W} \omega_{i,n}^{(c)}$$

the arithmetic means of the weight coefficients of all Gaussian components forming a vector $\{\omega_i^{(c)}\}_{i=1,\dots,M}$ that serves as the low-dimensional audio fingerprint of segment c.
Step (4): compare fingerprints of the user-designated segment c against the tested audio stream s in a sliding-window manner.
Step (4.1): input the tested audio stream s, of a preset number of hours, to the computer; using short-time Fourier analysis, extract one cepstral feature vector per 10-millisecond frame, so that stream s is represented by a cepstral feature vector sequence $\{x_t^{(s)}\}_{t=1,\dots,T}$, where T is the number of frames of stream s, $t = 1, \dots, T$ is the frame index, and the superscript s identifies stream s.
Step (4.2): compute the weight coefficient $\omega_{i,t}^{(s)}$ of the i-th Gaussian component of the CCGMM at frame t of stream s, for $t = 1, \dots, T$:

$$\omega_{i,t}^{(s)} = \frac{\omega_i^{(u)}\, N_i\!\left(x_t^{(s)} \mid \mu_i^{(u)}, \Sigma_i^{(u)}\right)}{\sum_{j=1}^{M} \omega_j^{(u)}\, N_j\!\left(x_t^{(s)} \mid \mu_j^{(u)}, \Sigma_j^{(u)}\right)}$$

where $i = 1, \dots, M$ and $j = 1, \dots, M$ index the Gaussian components of the CCGMM, and $N_i(x \mid \mu_i^{(u)}, \Sigma_i^{(u)})$ denotes the Gaussian probability density function with mean vector $\mu_i^{(u)}$ and covariance matrix $\Sigma_i^{(u)}$.
Step (4.3): set l = 1.
Step (4.4): if l + W - 1 > T, exit.
Step (4.5): take the W-frame window of stream s starting at frame l, i.e. the segment $\{x_t^{(s)}\}_{t=l,\dots,l+W-1}$, hereinafter segment $s^{(l)}$, and compute its fingerprint distance to segment c.
First, compute the fingerprint of segment $s^{(l)}$:

$$\omega_i^{(s,l)} = \frac{1}{W} \sum_{t=l}^{l+W-1} \omega_{i,t}^{(s)}$$

that is, the arithmetic mean over the frames of segment $s^{(l)}$ of the weight coefficient of the i-th Gaussian component serves as the i-th dimension of the fingerprint of $s^{(l)}$.
Then compute the cross-entropy distance between the fingerprint $\{\omega_i^{(s,l)}\}_{i=1,\dots,M}$ of segment $s^{(l)}$ and the fingerprint $\{\omega_i^{(c)}\}_{i=1,\dots,M}$ of segment c:

$$d_{KL}(l) = \sum_{i=1}^{M} \left(\omega_i^{(s,l)} - \omega_i^{(c)}\right) \log \frac{\omega_i^{(s,l)}}{\omega_i^{(c)}}$$

If $d_{KL}(l) \le \theta$, decide that stream s contains segment c starting at frame l, where θ is a preset detection threshold, taken as 0.01; then set l = l + 1 and return to step (4.4) to continue searching whether the remainder of stream s also contains segment c.
If $d_{KL}(l) > \theta$, compute a skip step size $\tau_{KL\text{-}skip}$:

[Formula for the skip step, given only as an image in the original filing.]

where Δ is a preset offset, taken as 0.001 or 0.005, and $\lfloor \cdot \rfloor$ denotes rounding down; then set $l = l + \tau_{KL\text{-}skip}$ and return to step (4.4) to continue searching whether the remainder of stream s also contains segment c.
CN200910241366A (priority date 2009-12-07, filed 2009-12-07): A fast search method for audio fingerprints based on cross-entropy. Published as CN101853262A (en); status: pending.

Priority application (1)

CN200910241366A, priority date 2009-12-07, filing date 2009-12-07: A fast search method for audio fingerprints based on cross-entropy.

Publication (1)

CN101853262A, published 2010-10-06.

Family: ID=42804757 (one family application, CN200910241366A, pending)

Country: CN (China)

Cited by (4)

  • CN102622353A (priority 2011-01-27, published 2012-08-01): Fixed audio retrieval method; granted as CN102622353B (2013-10-16).
  • CN108205550A (priority 2016-12-16, published 2018-06-26): Method and device for generating audio fingerprints.
  • CN106934242A (priority 2017-03-16, published 2017-07-07): Cross-entropy-method-based health assessment method and system for equipment operating under multiple modes.


Legal Events

  • C06 / PB01: Publication (open date: 2010-10-06)
  • C10 / SE01: Entry into substantive examination / entry into force of request for substantive examination
  • C02 / WD01: Invention patent application deemed withdrawn after publication (Patent Law 2001)