
HK1076660B - Method for comparing audio using characterizations based on auditory events - Google Patents


Info

Publication number
HK1076660B
Authority
HK
Hong Kong
Prior art keywords
auditory
audio signal
audio
time segments
successive time
Prior art date
Application number
HK05108591.1A
Other languages
Chinese (zh)
Other versions
HK1076660A1 (en)
Inventor
Brett G. Crockett
Michael J. Smithers
Original Assignee
Dolby Laboratories Licensing Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corporation
Priority claimed from PCT/US2002/005329 external-priority patent/WO2002097790A1/en
Publication of HK1076660A1 publication Critical patent/HK1076660A1/en
Publication of HK1076660B publication Critical patent/HK1076660B/en


Description

Method for comparing audio using auditory event based characterization
Technical Field
The present invention relates to audio signals. More particularly, the invention relates to characterizing audio signals and using the characterization to determine whether one audio signal originates from another audio signal or whether two audio signals originate from the same audio signal.
Background
The separation of sounds into units perceived as separate is sometimes referred to as "auditory event analysis" or "auditory scene analysis" ("ASA"). An extensive discussion of auditory scene analysis is set forth by Bregman in his book Auditory Scene Analysis - The Perceptual Organization of Sound, Massachusetts Institute of Technology, 1991, Fourth printing, 2001, Second MIT Press paperback edition. In addition, U.S. Patent 6,002,776 to Bhadkamkar et al. (December 14, 1999) cites publications dating back to 1976 as "prior art related to sound separation by auditory scene analysis." However, the Bhadkamkar et al. patent discourages the practical use of auditory scene analysis, concluding that "techniques involving auditory scene analysis, although interesting from a scientific point of view as models of human auditory processing, are currently too computationally demanding and specialized to be considered practical techniques for sound separation until fundamental progress is made."
Bregman notes in one passage that "we hear discrete units when the sound changes abruptly in timbre, pitch, loudness, or (to a lesser extent) location in space" (Auditory Scene Analysis - The Perceptual Organization of Sound, supra, at page 469). Bregman also discusses the perception of multiple simultaneous sound streams when they are separated in frequency.
There are many different methods for extracting features or characteristics from audio. If the features or characteristics are suitably defined, their extraction can be performed by automated processes. For example, "ISO/IEC JTC 1/SC 29/WG 11" (MPEG) is currently standardizing a variety of audio descriptors as part of the MPEG-7 standard. A common drawback of these methods is that they ignore ASA. Such methods periodically measure certain "traditional" signal processing parameters such as pitch, amplitude, power, harmonic structure, and spectral flatness. These parameters, while providing useful information, do not analyze and characterize the audio signal in terms of elements perceived as separate according to human cognition.
Auditory scene analysis attempts to characterize an audio signal in a manner similar to human perception by identifying elements that are separate according to human cognition. By developing this approach, automated processes can accurately accomplish tasks that heretofore would have required human assistance.
The identification of separately perceived elements allows an audio signal to be identified uniquely using far less information than the complete signal itself. For example, a compact and unique identification based on auditory events may be used to determine whether a signal was copied from another signal (or whether two signals were copied from the same original signal).
Disclosure of Invention
A method for generating a unique, reduced-information characterization of an audio signal, which may be used to identify the audio signal, is described. The characterization may be considered a "signature" or "fingerprint" of the audio signal. According to the present invention, an auditory scene analysis (ASA) is performed to identify auditory events as the basis for characterizing an audio signal. Ideally, the auditory scene analysis identifies auditory events that a listener would perceive even after the audio has undergone processing, such as low bit rate coding or acoustic transmission through a loudspeaker. The audio signal may be characterized by the boundary locations of the auditory events and, optionally, by the dominant frequency sub-band of each auditory event. The resulting information pattern constitutes a compact audio fingerprint or signature that may be compared to one or more other such audio fingerprints or signatures. A determination that at least a portion of the respective signatures is the same (to a desired degree of confidence) indicates that the related portions of the audio signals from which they were derived are the same, or were derived from the same audio signal.
The auditory scene analysis method according to the present invention provides a fast and accurate way of comparing two audio signals, particularly music, by comparing signatures based on auditory event information. Unlike traditional feature extraction methods, which extract features that are less fundamental to perceiving similarities between audio signals (such as pitch, amplitude, power, and harmonic structure), ASA extracts information or features underlying the perception of similarity itself. The use of ASA improves the chance of finding similarity in material that has undergone significant processing, such as low bit rate coding or acoustic transmission through a loudspeaker.
While the present invention may in fact be practiced in the analog or digital domains (or some combination thereof), in a practical embodiment of the invention, the audio signal is represented by groups of samples of data and processed in the digital domain.
Referring to fig. 1A, auditory scene analysis 2 is applied to an audio signal in order to produce a "signature" or "fingerprint" related to that signal. In this case, there are two audio signals of interest. They may be similar in that one may be derived from the other, or both may have been derived earlier from the same original signal, but this is not known in advance. Thus, auditory scene analysis is applied to both signals. For simplicity, fig. 1A shows the application of ASA to only one signal. As shown in fig. 1B, the signatures of the two audio signals, Signature 1 and Signature 2, are applied to a correlator or correlation subroutine 4 that produces a correlation score. A user may set a minimum correlation score that specifies the desired confidence that at least a portion of the two signatures is the same. In practice, the two signatures may be stored data. In one particular application, one of the signatures may be derived from, for example, an unauthorized copy of a musical work, and the other signature may be one of a large number of signatures in a database (each derived from a copyright owner's musical work), against which the unauthorized copy's signature is compared until a match to the desired confidence, if any, is obtained. This may be done automatically by a machine, the details of which are beyond the scope of the present invention.
Because the signatures are substantially shorter than the audio signals from which they are derived (i.e., they are more compact, having fewer bits), the similarity (or lack of similarity) between two signatures can be determined much more quickly than the similarity between two audio signals.
Additional details of FIGS. 1A and 1B are described below.
According to an aspect of the present invention, a computationally efficient method is provided for dividing audio into time segments, or "auditory events," that are perceived as separate.
A powerful indicator of the beginning or end of a perceived auditory event is a change in spectral content. In order to detect changes in timbre and pitch (spectral content) and, as an ancillary result, certain changes in amplitude, the audio event detection method according to an aspect of the invention detects changes in spectral content with respect to time. Optionally, according to a further aspect of the invention, the method may also detect changes in amplitude with respect to time that would not be detected by detecting changes in spectral content with respect to time.
In its least computationally demanding implementation, the method divides the audio into time segments by analyzing the entire frequency band of the audio signal (full bandwidth audio) or substantially the entire frequency band (in practical implementations, band-limiting filtering at the ends of the spectrum is often employed) and giving the greatest weight to the loudest audio signal components. This approach takes advantage of a psychoacoustic phenomenon in which, at smaller time scales (20 milliseconds and less), the ear tends to focus on a single auditory event at a given time. This implies that while multiple events may be occurring at the same time, one component tends to be perceptually most prominent and may be processed by itself as though it were the only event taking place. Taking advantage of this effect also allows the auditory event detection to scale with the complexity of the audio being processed. For example, if the input audio signal being processed is a solo instrument, the audio events identified will likely be the individual notes being played. Similarly, for an input speech signal, the individual components of the speech, the vowels and consonants for example, will likely be identified as individual audio elements. As the complexity of the audio increases, such as music with a drumbeat or multiple instruments and voice, the auditory event detection identifies the most prominent (i.e., the loudest) audio element at any given moment. Alternatively, the "most prominent" audio element may be determined by taking hearing threshold and frequency response into consideration.
Optionally, according to another aspect of the invention, the method may also take into account the variation of the spectral content with respect to time in discrete frequency bands (fixed or dynamically determined frequency bands, or fixed and dynamically determined frequency bands) instead of the entire bandwidth, at the cost of greater computational complexity. This alternative approach would consider more than one audio stream in different frequency bands rather than assuming that only a single audio stream is perceived at a particular time.
Even a simple, computationally efficient audio segmentation method according to an aspect of the present invention can be used to identify auditory events.
The auditory event detection method of the present invention may be implemented by dividing the time domain audio waveform into time intervals or groups, and then converting the data in each group to the frequency domain using a filter bank, or a time-frequency transform, such as a Discrete Fourier Transform (DFT) (implemented as a Fast Fourier Transform (FFT) in view of speed).
To minimize computational complexity, only a single band of the time-domain audio waveform may be processed, preferably the entire band of the spectrum (about 50 Hz-15 kHz for typical quality music systems) or substantially the entire band (e.g., band-limiting filters may exclude high or low frequency extremes).
The frequency domain data is preferably normalized, as described below. The degree to which the frequency domain data needs to be normalized gives an indication of amplitude. Hence, if a change in this degree exceeds a predetermined threshold, it too may be taken to indicate an event boundary. Event start and end points resulting from spectral changes and from amplitude changes may be ORed together, so that event boundaries resulting from either type of change are identified.
In practical embodiments where the audio is represented by grouped samples, each auditory event time start and end point must coincide with the boundaries of the groups into which the time domain audio waveform is divided. There is a tradeoff between real-time processing requirements (since larger groups require less processing overhead) and resolution of event locations (smaller groups provide more detailed location information of auditory events).
Alternatively, as described above, instead of processing the spectral content of the time-domain waveform in a single frequency band, the spectrum of the time-domain waveform may be divided into two or more frequency bands prior to frequency-domain conversion, at the cost of greater computational complexity. Each band is then converted to the frequency domain and processed as though it were an independent channel. The resulting event boundaries for the several bands are then ORed together to define the event boundaries for the audio signal. The multiple frequency bands may be fixed bands, adaptive bands, or a combination of fixed and adaptive bands. For example, the adaptive bands may be determined using tracking filter techniques employed in audio noise reduction and elsewhere (e.g., simultaneous dominant sinusoids at 800 Hz and 2 kHz could result in two adaptively determined bands centered on those two frequencies).
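The ORing of per-band event boundaries described above might be sketched as follows, representing each band's boundaries as a 0/1 array with one entry per sample group. This is a minimal illustration, not the patent's implementation; the band arrays and their values are hypothetical.

```python
import numpy as np

def combine_band_boundaries(band_boundaries):
    """OR together per-band event-boundary arrays (1 = boundary at that
    group, 0 = no boundary) to form the combined boundary array.
    `band_boundaries` is a list of equal-length 0/1 arrays, one per band."""
    combined = np.zeros_like(band_boundaries[0])
    for b in band_boundaries:
        combined = np.logical_or(combined, b)
    return combined.astype(int)

# Two hypothetical frequency bands, 8 groups each
band1 = np.array([1, 0, 0, 1, 0, 0, 0, 0])
band2 = np.array([1, 0, 1, 0, 0, 0, 1, 0])
print(combine_band_boundaries([band1, band2]).tolist())
```

A boundary detected in any band becomes a boundary of the combined signature.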
Other techniques for providing auditory scene analysis may also be employed to identify auditory events in the present invention.
Drawings
Fig. 1A is a flow chart illustrating the extraction of a signature from an audio signal according to the present invention. The audio signal may represent, for example, music (e.g., a musical composition or "song").
FIG. 1B is a flow chart illustrating correlating two signatures in accordance with the present invention.
Fig. 2 is a flow chart illustrating the extraction of audio event locations, and optionally major sub-bands, from an audio signal in accordance with the present invention.
Fig. 3 is a schematic diagram illustrating the spectral analysis steps according to the present invention.
Fig. 4A and 4B are idealized audio waveforms representing multiple audio event locations or event boundaries, in accordance with the present invention.
Fig. 5 is a flow chart showing in more detail the correlation 4 of fig. 1B, which correlates two signatures in accordance with the present invention.
Fig. 6A-D are schematic diagrams of signals illustrating examples of signature alignment according to the present invention. FIGS. 6A-D are not drawn to scale. In the case where the digital audio signal is represented by samples, the horizontal axis represents the order of discrete data stored in each signature array.
Detailed Description
In a practical embodiment of the invention, the audio signal is represented by samples processed in groups (blocks) of 512 samples, which at a sampling frequency of 44.1 kHz corresponds to about 11.6 milliseconds of input audio. A group length less than the duration of the shortest perceivable auditory event (about 20 milliseconds) is desirable. It will be understood that aspects of the invention are not limited to such a practical embodiment. The principles of the invention require neither that the audio be arranged into sample groups before the auditory events are determined, nor, if it is, that the groups have a constant length. However, to minimize complexity, a fixed group length of 512 samples (or some other power-of-two number of samples) is beneficial for three main reasons. First, it provides acceptably low latency for real-time processing applications. Second, a power-of-two number of samples is advantageous for Fast Fourier Transform (FFT) analysis. Third, it provides a suitably sized window for performing useful auditory scene analysis.
In the following discussion, the input signal is assumed to be data with amplitude values in the range [−1, +1].
Auditory scene analysis 2 (fig. 1A)
After the input audio data has been arranged into sample groups (not shown), the input audio signal is divided in process 2 of fig. 1A ("auditory scene analysis") into auditory events, each of which tends to be perceived as separate. The auditory scene analysis may be accomplished by the auditory scene analysis (ASA) process discussed above. Although one suitable process for performing auditory scene analysis is described in further detail below, other useful techniques for performing ASA may be employed instead.
Fig. 2 outlines a process that may be used as the auditory scene analysis process of fig. 1A in accordance with the present invention. The ASA step or process 2 is composed of three general processing sub-steps. The first sub-step 2-1 ("perform spectral analysis") takes the audio signal, divides it into groups, and calculates a spectral profile or spectral content for each group. Spectral analysis transforms the audio signal into the short-term frequency domain. This can be performed with any filterbank, either based on transforms or on banks of band-pass filters, and in either linear frequency space or warped frequency space (such as the Bark scale or critical bands, which better approximate the characteristics of the human ear). With any filterbank there is a trade-off between time and frequency. Greater time resolution, and hence shorter time intervals, leads to lower frequency resolution. Greater frequency resolution, and hence narrower sub-bands, leads to longer time intervals.
The first sub-step 2-1 calculates the spectral content of successive time segments of the audio signal. In a practical embodiment, described below, the ASA group size is 512 samples of the input audio signal (fig. 3). In the second sub-step 2-2, the differences in spectral content from group to group are determined ("perform spectral curve difference measurements"). Thus, the second sub-step calculates the difference in spectral content between successive time segments of the audio signal. In the third sub-step 2-3 ("identify locations of auditory event boundaries"), when the spectral difference between one spectral curve group and the next is greater than a threshold, the group boundary is taken to be an auditory event boundary. Thus, the third sub-step sets an auditory event boundary between successive time segments when the difference in spectral curve content between such successive time segments exceeds a threshold. As discussed above, a change in spectral content is taken to be a powerful indicator of the beginning or end of a perceived auditory event. The locations of the event boundaries are stored as the signature. The optional processing step 2-4 ("identify dominant sub-bands") uses the spectral analysis to identify dominant frequency sub-bands, which may also be stored as part of the signature.
In this embodiment, the auditory event boundaries determine auditory events having lengths that are integer multiples of a spectral curve group (the minimum length being one spectral curve group (512 samples in this example)). In principle, the event boundaries need not be so limited.
Overlapping or non-overlapping segments of audio can be windowed (windowed) and used to calculate the spectral curve of the input audio. The overlap results in better positional resolution of auditory events and makes it less likely that an event, such as a transient, will be missed. However, as the time resolution increases, the frequency resolution decreases. Overlapping also increases computational complexity. Thus, the overlap may be ignored. Fig. 3 shows a schematic diagram of a non-overlapping 512-sample set being windowed and converted to the frequency domain by means of a Discrete Fourier Transform (DFT). Each sample group may be windowed and transformed into the frequency domain using a DFT, preferably implemented as a Fast Fourier Transform (FFT) for speed considerations.
The following variables may be used to calculate the spectral curves of the input groups:
N = number of samples in the input signal
M = number of windowed samples used to compute a spectral curve
P = number of samples of overlap used in the spectral computation
Q = number of spectral windows/regions computed
In general, any integers may be used for the variables above. However, the implementation will be more efficient if M is set to a power of 2, so that a standard FFT can be used for the spectral curve calculations. In a practical embodiment of the auditory scene analysis process, the parameters listed may be set to:
M = 512 samples (or 11.6 ms at 44.1 kHz)
P = 0 samples (no overlap)
The values listed above were determined experimentally and were found generally to identify the location and duration of auditory events with sufficient accuracy. However, setting the value of P to 256 samples (50% overlap) has been found helpful in identifying some hard-to-find events. While many different types of windows may be used to minimize spectral artifacts due to windowing, the window used in the spectral curve calculations is an M-point Hanning, Kaiser-Bessel, or other suitable window, preferably non-rectangular. The values indicated above and a Hanning window were selected after extensive experimental analysis, as they were shown to provide excellent results across a wide range of audio material. Non-rectangular windowing is preferred for the processing of audio signals with predominantly low-frequency content. Rectangular windowing produces spectral artifacts that may cause incorrect detection of events. Unlike certain codec applications, where the overall overlap/add process must provide a constant level, no such constraint applies here, and the window may be selected for characteristics such as its time/frequency resolution and stop-band rejection.
In sub-step 2-1 (fig. 2), the spectrum of each M-sample group may be calculated by windowing the data with an M-point Hanning, Kaiser-Bessel, or other suitable window, converting to the frequency domain with an M-point fast Fourier transform, and calculating the magnitudes of the FFT coefficients. The resulting data is normalized so that the largest magnitude is set to one, and the normalized array of M numbers is converted to the log domain. The array need not be converted to the log domain, but the conversion simplifies the calculation of the difference measure in sub-step 2-2. Furthermore, the log domain more closely matches the log-domain amplitude nature of the human auditory system. The resulting log domain values have a range of minus infinity to zero. In a practical embodiment, a lower limit may be imposed on the range of values; the limit may be fixed, for example −60 dB, or may be frequency-dependent to reflect the lower audibility of quiet sounds at low and very high frequencies. (Note that the size of the array may be reduced to M/2 because the FFT represents negative as well as positive frequencies.)
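Sub-step 2-1 might be sketched as follows, assuming M = 512, a Hanning window, normalization of the peak magnitude to one (0 dB), a fixed −60 dB floor, and retention of only the M/2 positive-frequency bins. The function name and the 1 kHz test tone are illustrative, not from the patent.

```python
import numpy as np

def block_log_spectrum(block, floor_db=-60.0):
    """Compute the normalized log-magnitude spectrum of one M-sample block,
    as in sub-step 2-1. `block` is a 1-D array of M samples in [-1, +1].
    Returns M/2 values in dB, normalized so the peak is 0 dB and clipped
    at `floor_db`."""
    M = len(block)
    windowed = block * np.hanning(M)             # M-point Hanning window
    mag = np.abs(np.fft.fft(windowed))[:M // 2]  # keep positive frequencies only
    mag /= max(mag.max(), 1e-12)                 # normalize peak magnitude to 1
    log_mag = 20.0 * np.log10(np.maximum(mag, 1e-12))
    return np.maximum(log_mag, floor_db)         # impose the fixed lower limit

# One 512-sample block of a 1 kHz tone at a 44.1 kHz sampling rate
fs, M = 44100, 512
t = np.arange(M) / fs
spec = block_log_spectrum(np.sin(2 * np.pi * 1000 * t))
print(len(spec), round(float(spec.max()), 1))
```

The peak value is 0 dB by construction, and every value sits at or above the −60 dB floor.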
Sub-step 2-2 calculates a measure of the difference between the spectra of adjacent groups. For each group, each of the M (log) spectral coefficients obtained in sub-step 2-1 is subtracted from the corresponding coefficient of the preceding group, and the magnitude of the difference is taken (the sign is ignored). The M differences are then summed to a single number. Thus, for the whole audio signal, the result is a set of Q positive numbers; the larger the number, the greater the difference in spectrum between a sample group and the preceding one. The difference measure may also be expressed as an average difference per spectral coefficient by dividing the difference measure by the number of spectral coefficients used in the sum (in this case, M coefficients).
Sub-step 2-3 determines the locations of the auditory event boundaries by applying a threshold to the set of difference measures from sub-step 2-2. When a difference measure exceeds the threshold, the change in spectrum is deemed sufficient to signal a new event, and the group number of the change is recorded as an event boundary. For the values of M and P given above, and for log domain values expressed in dB (in sub-step 2-1), the threshold may be set to 2500 if the entire magnitude FFT (including the mirrored portion) is compared, or to 1250 if half the FFT is compared (as noted above, the FFT represents negative as well as positive frequencies; for the magnitude of the FFT, one half is the mirror image of the other). These values were chosen experimentally and provide good auditory event boundary detection. The parameter values may be changed to reduce (by increasing the threshold) or increase (by decreasing the threshold) the detection of events. The details of this practical embodiment are not critical. Other ways of calculating the spectral content of successive time segments of the audio signal, calculating the differences between successive time segments, and setting auditory event boundaries at the respective boundaries between successive time segments when the difference in spectral curve content between such successive time segments exceeds a threshold may also be employed.
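Sub-steps 2-2 and 2-3 might be sketched as follows. The toy three-coefficient spectra and the threshold of 20 are illustrative only; the thresholds suggested in the text (2500 or 1250) apply to full-size dB spectra.

```python
import numpy as np

def event_boundaries(log_specs, threshold):
    """Sub-steps 2-2 and 2-3: given a (Q, K) array of per-group log
    spectra, sum the absolute coefficient differences between each group
    and its predecessor, and mark a boundary wherever the sum exceeds
    `threshold`. Returns the boundary array B(q); group 0 always starts
    an event."""
    Q = len(log_specs)
    B = np.zeros(Q, dtype=int)
    B[0] = 1
    for q in range(1, Q):
        diff = np.sum(np.abs(log_specs[q] - log_specs[q - 1]))
        if diff > threshold:
            B[q] = 1
    return B

# Hypothetical spectra in dB: groups 0-2 are similar, group 3 differs sharply
specs = np.array([[0, -10, -20],
                  [0, -11, -21],
                  [0, -10, -19],
                  [-30, 0, -5]])
print(event_boundaries(specs, threshold=20).tolist())
```

Only the jump into group 3 exceeds the threshold, so a boundary is marked there.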
The output of the auditory scene analysis process of subroutine 2 of fig. 1A is an array B(q), q = 0, 1, ..., Q−1, of information representing the locations of the auditory event boundaries, for an audio signal composed of Q groups of M samples each. For a group size of M = 512 samples, an overlap of P = 0 samples, and a signal sampling rate of 44.1 kHz, the auditory scene analysis subroutine 2 outputs approximately 86 values per second. The array B(q) is preferably stored as the signature, so that, in its basic form (without the optional dominant sub-band frequency information), the signature of the audio signal is an array B(q) representing a string of auditory event boundaries.
Examples of the results of auditory scene analysis for two different signals are shown in fig. 4A and 4B. The upper graph (fig. 4A) represents the result of auditory scene processing, with auditory event boundaries determined at samples 1024 and 1536. The lower graph (fig. 4B) shows the identification of event boundaries at samples 1024, 2048, and 3072.
Identify major sub-bands (optional)
For each group, an optional additional step in the ASA process (shown in fig. 2) is to extract from the audio signal information indicative of the dominant frequency "sub-band" of the group (the conversion of the data in each group to the frequency domain results in information being separated into frequency sub-bands). The group-based information may be converted into auditory event-based information, identifying the dominant sub-band for each auditory event. This information for each auditory event provides additional information for correlation processing (described below) in addition to auditory event boundary information.
The main (largest amplitude) sub-band may be selected from a plurality of sub-bands, e.g. 3 or 4 sub-bands, located in the frequency range or band to which the human ear is most sensitive. Alternatively, other criteria may be used to select the sub-bands. For example, the spectrum may be divided into three sub-bands. The preferred frequency ranges for the sub-bands are:
Sub-band 1: 301 Hz to 560 Hz
Sub-band 2: 560 Hz to 1938 Hz
Sub-band 3: 1938 Hz to 9948 Hz
To determine the dominant sub-band, the sum of the squared spectral magnitudes (or the power magnitude spectrum) is calculated for each sub-band. The resulting sum for each sub-band is calculated and the largest is selected. The sub-bands may also be weighted before the largest sum is selected. The weighting may take the form of dividing the sum for each sub-band by the number of spectral values in that sub-band, or may take the form of an addition or multiplication that emphasizes the importance of one band over another. This can be useful where some sub-bands have, on average, more energy than other sub-bands but are perceptually less significant.
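A sketch of the dominant sub-band selection, assuming the three sub-band ranges given above and a 44.1 kHz sampling rate; the function name and the unweighted sums are illustrative choices.

```python
import numpy as np

# Sub-band edges in Hz, taken from the ranges given in the text
SUBBAND_EDGES = [(301, 560), (560, 1938), (1938, 9948)]

def dominant_subband(mag_spectrum, fs=44100):
    """Return the 1-based index of the dominant sub-band for one group:
    the band with the largest sum of squared spectral magnitudes.
    `mag_spectrum` holds the M/2 positive-frequency magnitudes."""
    M2 = len(mag_spectrum)
    freqs = np.arange(M2) * fs / (2 * M2)      # bin center frequencies in Hz
    sums = []
    for lo, hi in SUBBAND_EDGES:
        mask = (freqs >= lo) & (freqs < hi)
        sums.append(np.sum(mag_spectrum[mask] ** 2))
    return int(np.argmax(sums)) + 1

# A group dominated by a 1 kHz tone should fall in sub-band 2 (560-1938 Hz)
fs, M = 44100, 512
t = np.arange(M) / fs
mag = np.abs(np.fft.fft(np.sin(2 * np.pi * 1000 * t) * np.hanning(M)))[:M // 2]
print(dominant_subband(mag, fs))
```

A weighting step (division by bin count, or a per-band gain) could be applied to `sums` before the argmax, as the text describes.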
Considering an audio signal consisting of Q groups, the output of the dominant sub-band processing is an array DS(q) (q = 0, 1, ..., Q−1) of information representing the dominant sub-band in each group. The array DS(q) is preferably stored in the signature along with the array B(q). Thus, with the optional dominant sub-band information, the signature of the audio signal is two arrays B(q) and DS(q), representing, respectively, a string of auditory event boundaries and the dominant frequency sub-band within each group. Thus, in an idealized example, the two arrays could have the following values (for a case in which there are three possible dominant sub-bands).
10100010010000010 (event boundary)
11222211133333311 (Main sub-band)
In most cases, the main sub-band remains the same within each auditory event, as shown in this example, or has an average value if it is not uniform for all groups within an event. Thus, a main sub-band may be determined for each auditory event, and the array DS (q) may be modified to ensure that the same main sub-band is assigned to each group within an event.
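The per-event unification of DS(q) described above might be sketched as follows. This is one possible reading of the text: within each event, the most common (modal) sub-band is assigned to every group, rather than an average; the function name is illustrative. Applied to the idealized arrays above, the output equals the input, since each event is already uniform.

```python
import numpy as np

def unify_subbands(B, DS):
    """Assign one dominant sub-band per auditory event: within each run of
    groups delimited by the boundaries in B, replace DS with the modal
    (most common) sub-band of that run."""
    B, DS = np.asarray(B), np.asarray(DS).copy()
    starts = [0] + [q for q in range(1, len(B)) if B[q] == 1] + [len(B)]
    for s, e in zip(starts[:-1], starts[1:]):
        vals, counts = np.unique(DS[s:e], return_counts=True)
        DS[s:e] = vals[np.argmax(counts)]       # modal sub-band for the event
    return DS

# Arrays from the idealized example in the text
B  = [1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
DS = [1, 1, 2, 2, 2, 2, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 1]
print(unify_subbands(B, DS).tolist())
```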
Correlation
By means of a correlation subroutine or process, it can be determined whether one signature is the same as, or similar to, another stored signature. The correlation subroutine or process compares two signatures to determine their similarity. As shown in fig. 5, this may be done in two steps: step 5-1 removes or minimizes the effect of time shift or delay on the signatures, and step 5-2 then calculates a measure of similarity between the signatures.
The first mentioned step 5-1 minimizes the effect of any delay between the two signatures. Such a delay may be deliberately added to the audio signal or may be the result of signal processing and/or low bit rate audio coding. The output of this step is the two modified signatures in a form suitable for computing their similarity measure.
The second-mentioned step 5-2 compares the two modified signatures to find a quantitative measure of their similarity (a correlation score). This measure of similarity can then be compared against a threshold to determine, to a desired confidence, whether the signatures are the same or different. Two suitable correlation processes or subroutines are described. Either of them, or some other suitable correlation process or subroutine, may be employed as part of the present invention.
First correlation process or subroutine
Elimination of time delay effects
The correlation subroutine or process separates a single region or portion from each signature so that the two regions are the most similar portions of the respective signatures and have the same length. The isolation region may be the full overlap region between the two signatures, as shown in the example in fig. 6A-D, or the isolation region may be smaller than the overlap region.
The preferred method uses the entire overlapping region of the two signatures. Some examples are shown in figs. 6A-D. The overlapping region of the two signatures may consist of the tail of one signature and the head of the other (figs. 6B and 6C). If one signature is smaller than the other, the overlapping region may be all of the smaller signature and a portion of the larger signature (figs. 6A and 6D).
There are many different ways to isolate the common region from the two data arrays. The standard mathematical method involves finding the time lag or delay between the data arrays using cross-correlation. When the starts of the two data arrays are aligned, the lag or delay is considered to be 0; when they are misaligned, the lag or delay is non-zero. The cross-correlation calculates a measure for each possible lag or delay between the two data arrays; the measures are saved as an array (the output of the cross-correlation subroutine). The lag or delay corresponding to the peak in the cross-correlation array is taken to be the lag or delay of one data array relative to the other. The following paragraphs express this correlation method mathematically.
Suppose S1 (of length N1) is the array from signature 1 and S2 (of length N2) is the array from signature 2. First, the cross-correlation array R_S1S2 is calculated (see, for example, John G. Proakis, Dimitris G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Macmillan Publishing Company, 1992, ISBN 0-02-396815-X):

R_S1S2(l) = Σ_n S1(n)·S2(n − l)    (1)
The cross-correlation is preferably performed using standard FFT-based techniques to reduce the execution time.
Since S1 and S2 are of finite length, R_S1S2 has length N1 + N2 − 1. Assuming S1 and S2 are similar, the lag l of the largest element of R_S1S2 represents the relative delay between the two signatures:

l_peak = argmax_l R_S1S2(l)    (2)

With the definition in equation 1, a positive l_peak indicates that S1 is delayed relative to S2. Since this lag represents a delay, the overlapping portions of signatures S1 and S2 are denoted S1' and S2'; both have the same length N12.

Expressed as equations, the overlapping portions S1' and S2' of signatures S1 and S2 are defined as:

S1'(n) = S1(n + l_peak), S2'(n) = S2(n)  for l_peak ≥ 0
S1'(n) = S1(n), S2'(n) = S2(n − l_peak)  for l_peak < 0    (3)

and the common length N12 of S1' and S2' is:

N12 = min(N1 − l_peak, N2)  for l_peak ≥ 0
N12 = min(N1, N2 + l_peak)  for l_peak < 0    (4)
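The lag search and overlap extraction described above can be sketched in numpy as follows. This is an illustrative sketch, not part of the patent disclosure; the function name is an assumption, and the sign convention is that a positive lag means the first array is delayed relative to the second.

```python
import numpy as np

def overlap_regions(s1, s2):
    """Estimate the lag between two signature arrays via cross-correlation,
    then return their overlapping portions (S1', S2') of equal length N12."""
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    # Full cross-correlation; index (len(s2) - 1) corresponds to zero lag.
    r = np.correlate(s1, s2, mode="full")
    lag = int(np.argmax(r)) - (len(s2) - 1)  # delay of s1 relative to s2
    if lag >= 0:
        a, b = s1[lag:], s2
    else:
        a, b = s1, s2[-lag:]
    n12 = min(len(a), len(b))                # length of the common region
    return a[:n12], b[:n12], lag
```

For speed, the full cross-correlation would in practice be computed with FFT-based techniques, as the text notes; `np.correlate` is used here only for clarity.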
first associated procedure or subroutine
Similarity measure
This step compares the two signatures to find a quantitative measure of their similarity. The preferred method uses the correlation coefficient (equation 5). This is a standard textbook method (William Mendenhall, Dennis D. Wackerly, Richard L. Scheaffer, Mathematical Statistics with Applications: Fourth Edition, Duxbury Press, 1990, ISBN 0-534-):

ρ = λ12 / (σ1·σ2)    (5)

where σ1 and σ2 are the standard deviations of S1' and S2', respectively.

The covariance λ12 of S1' and S2' is defined as:

λ12 = (1/N12) Σ_{n=0…N12−1} (S1'(n) − μ1)·(S2'(n) − μ2)    (6)

where μ1 and μ2 are the mean values of S1' and S2', respectively.
The correlation coefficient ρ lies in the range −1 ≤ ρ ≤ 1, where −1 and 1 represent perfect correlation. Preferably, a threshold is applied to the absolute value of this measure to indicate a match.

In practice, the value of the threshold (derived from a large set of training signatures) can be adjusted to ensure acceptable false-rejection and false-detection rates.
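The similarity measure amounts to the standard Pearson correlation coefficient of the two overlapping portions. A minimal sketch follows; the `is_match` helper and the 0.85 threshold value are illustrative assumptions, since the patent only says the threshold is tuned on a set of training signatures.

```python
import numpy as np

def similarity(s1p, s2p):
    """Correlation coefficient rho: the covariance of S1' and S2' divided
    by the product of their standard deviations."""
    s1p, s2p = np.asarray(s1p, float), np.asarray(s2p, float)
    cov = np.mean((s1p - s1p.mean()) * (s2p - s2p.mean()))
    return cov / (s1p.std() * s2p.std())

def is_match(s1p, s2p, threshold=0.85):
    """Threshold the absolute value of rho; 0.85 is an illustrative value."""
    return abs(similarity(s1p, s2p)) >= threshold
```

Note that thresholding |ρ| treats a perfect inverse correlation (ρ = −1) the same as a perfect direct correlation, as the text specifies.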
The first correlation process or subroutine is preferred for signatures with large lags (misalignments) or delays, and for cases where one signature is significantly shorter than the other.
Second correlation process or subroutine
Elimination of time delay effects
The second correlation process or subroutine transforms the signatures from the time domain into a domain that is independent of time-delay effects. The method yields two modified signatures of the same length, which can then be directly correlated or compared.
There are many ways to transform data in this manner. The preferred method uses the Discrete Fourier Transform (DFT). The DFT of a signal can be separated into magnitude and phase. A time shift or delay of the signal (the input to the DFT) changes the phase of the DFT but not its magnitude. The magnitude of the DFT of a signal can therefore be regarded as a time-invariant representation of the signal.
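This invariance property is easy to verify numerically. Strictly, the DFT magnitude is exactly invariant only to circular shifts; for a linear delay of a padded signal the invariance is approximate. A small demonstration (illustrative, not from the patent):

```python
import numpy as np

sig = np.random.default_rng(0).standard_normal(64)
shifted = np.roll(sig, 17)                 # circularly delayed copy

mag = np.abs(np.fft.fft(sig))
mag_shifted = np.abs(np.fft.fft(shifted))

# The phases differ, but the magnitude spectra are identical.
assert np.allclose(mag, mag_shifted)
assert not np.allclose(np.angle(np.fft.fft(sig)),
                       np.angle(np.fft.fft(shifted)))
```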
This property of the DFT allows each of the two signatures to be transformed into a time-invariant representation. If the two signatures have the same length, the magnitude DFT can be computed directly for each signature and the result saved as the modified signature. If the signatures have different lengths, then before computing the DFT either the longer signature is truncated to the length of the shorter one, or the shorter signature is zero-padded or extended to the length of the longer one. The following paragraphs express the method mathematically.
Suppose S1 (of length N1) is the array from signature 1 and S2 (of length N2) is the array from signature 2. First, the longer signature is truncated, or the shorter signature is zero-padded, so that the two signatures have the same length N12. The transformed signature arrays S1' and S2' are generated by taking the magnitude DFT as follows:

S1'(k) = | Σ_{n=0…N12−1} S1(n)·e^(−j2πnk/N12) |,  k = 0 … N12 − 1    (7)

S2'(k) = | Σ_{n=0…N12−1} S2(n)·e^(−j2πnk/N12) |,  k = 0 … N12 − 1    (8)

In practice, the mean of each signature is preferably subtracted before calculating the DFT. A window function may also be applied to the S1 and S2 signatures before the DFT; in practice, however, no particular window was found to give better results.
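The length equalization, mean removal, and magnitude DFT just described can be sketched as follows (the function name and the way N12 is supplied are illustrative assumptions):

```python
import numpy as np

def transformed_signature(s, n12):
    """Truncate or zero-pad a signature array to length n12, subtract its
    mean, and return the magnitude of its DFT."""
    s = np.asarray(s, dtype=float)[:n12]       # truncate if longer
    s = np.pad(s, (0, n12 - len(s)))           # zero-pad if shorter
    s = s - s.mean()                           # mean removal, per the text
    return np.abs(np.fft.fft(s))

t1 = transformed_signature([1, 2, 3, 4, 5], 8)
t2 = transformed_signature([9, 1, 2, 3, 4, 5, 6, 7, 8, 9], 8)
```

Both modified signatures come out with the same length N12, so they can be compared directly without any lag search.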
Second correlation process or subroutine
Similarity measure
The similarity measure step compares the two signatures to find a quantitative measure of their similarity. The preferred method uses the correlation coefficient (equation 9). This is a standard textbook method (William Mendenhall, Dennis D. Wackerly, Richard L. Scheaffer, Mathematical Statistics with Applications: Fourth Edition, Duxbury Press, 1990, ISBN 0-534-):

ρ = λ12 / (σ1·σ2)    (9)

where σ1 and σ2 are the standard deviations of S1' and S2', respectively.

The covariance λ12 of S1' and S2' is defined as:

λ12 = (1/N12) Σ_{n=0…N12−1} (S1'(n) − μ1)·(S2'(n) − μ2)    (10)

where μ1 and μ2 are the mean values of S1' and S2', respectively.
The correlation coefficient ρ lies in the range −1 ≤ ρ ≤ 1, where −1 and 1 represent perfect correlation. Preferably, a threshold is applied to the absolute value of this measure to indicate a match.

In practice, the value of the threshold (derived from a large set of training signatures) can be adjusted to ensure acceptable false-rejection and false-detection rates.
In practical applications, many signatures may be stored together, forming a signature library representing "known" audio content. In this case, the ability to discriminate between signatures can be improved by calculating an average signature and subtracting it from each of the two signatures being compared.
For example, for a database of W known signatures S0' … S(W−1)', the average signature is calculated as:

S_MEAN'(n) = (1/W) Σ_{w=0…W−1} Sw'(n)    (11)
When two signatures are compared (even if one of them is not in the signature library), the average signature is subtracted from both before the covariance is calculated (for subsequent use in the correlation coefficient). The covariance becomes:

λ12 = (1/N12) Σ_{n=0…N12−1} (S1'(n) − S_MEAN'(n) − μ1)·(S2'(n) − S_MEAN'(n) − μ2)    (12)

where μ1 and μ2 are the mean values of S1' − S_MEAN' and S2' − S_MEAN', respectively.
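The average-signature subtraction can be sketched as follows (an illustrative helper, assuming the library entries are already transformed to equal-length arrays):

```python
import numpy as np

def covariance_vs_library_mean(s1t, s2t, library):
    """Covariance of two transformed signatures after subtracting the
    average signature of a library of transformed signatures."""
    s_mean = np.mean(library, axis=0)        # the average signature
    d1 = np.asarray(s1t, float) - s_mean
    d2 = np.asarray(s2t, float) - s_mean
    return np.mean((d1 - d1.mean()) * (d2 - d2.mean()))
```

Subtracting the library mean removes the spectral shape common to all known signatures, so the covariance is driven by what distinguishes the two signatures rather than by what they share with everything in the library.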
The second correlation process or subroutine is preferred for signatures with small lags or delays, and for signatures of similar length. It is also significantly faster than the first correlation process or subroutine. However, it yields a somewhat less accurate similarity measure, since some information is inevitably lost when the phase of the DFT is discarded.
Applications of
As briefly mentioned above, one application of the present invention is searching an audio database, such as a record company's music library. Signatures may be generated for all songs in the library and stored in a database. Given a song of unknown origin, the present invention provides a means to calculate its signature and compare that signature against all signatures in the database to determine the identity of the unknown song.
In practice, the accuracy (or confidence) of the similarity measure is proportional to the size of the signatures being compared. The longer the signatures, the more data is used in the comparison and thus the greater the confidence or accuracy of the similarity measure. Signatures generated from about 30 seconds of audio have been found to provide good discrimination. However, the larger the signature, the longer the comparison takes.
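Such a database search can be sketched end to end using the second (DFT-magnitude) correlation process. All names, the signature length of 64, and the 0.5 threshold here are illustrative assumptions, not values from the patent:

```python
import numpy as np

def dft_signature(x, n):
    """Delay-insensitive signature: magnitude DFT of a mean-removed,
    length-equalized array (second correlation process)."""
    x = np.asarray(x, dtype=float)[:n]
    x = np.pad(x, (0, n - len(x)))
    return np.abs(np.fft.fft(x - x.mean()))

def identify(unknown, library, n=64, threshold=0.5):
    """Return the index of the best-matching library entry, or None if no
    correlation coefficient exceeds the threshold."""
    u = dft_signature(unknown, n)
    best, best_rho = None, threshold
    for i, known in enumerate(library):
        rho = np.corrcoef(u, dft_signature(known, n))[0, 1]
        if abs(rho) > best_rho:
            best, best_rho = i, abs(rho)
    return best
```

In a real library the transformed signatures would be precomputed and stored, so only the unknown signature's DFT and W correlation coefficients need computing per query.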
Conclusion
It should be understood that other variations and modifications of the invention and its various aspects will be apparent to those of ordinary skill in the art, and that the invention is not limited to the specific embodiments described. Therefore, the present invention is intended to cover any modifications, variations, or equivalents that fall within the spirit and scope of the basic underlying principles disclosed and claimed herein.
The present invention and its various aspects may be implemented as software subroutines which are executed in a digital signal processor, a programmed general-purpose digital computer, and/or a special-purpose digital computer. The interface between the analog and digital signal streams may be implemented in suitable hardware and/or as subroutines in software and/or firmware.

Claims (19)

1. A method of determining whether one audio signal originates from another audio signal or whether two audio signals originate from the same audio signal, comprising:
comparing reduced-information characterizations of the audio signals, wherein the characterizations are based on auditory scene analysis.
2. The method of claim 1, wherein the comparing comprises:
eliminating or minimizing, from the characterizations, the effect of time drift or delay on the audio signals,
calculating a measure of similarity, an
The measure of similarity is compared against a threshold.
3. The method of claim 2, wherein said eliminating identifies a portion in each of the characterizations such that the portions are the most similar portions of the characterizations and have the same length.
4. The method of claim 3, wherein said eliminating identifies a portion of each of said characterizations by performing a cross-correlation.
5. The method of claim 4, wherein said calculating computes the measure of similarity by computing a correlation coefficient for the identified portions of each of said characterizations.
6. The method of claim 2, wherein said eliminating transforms the characterizations into a domain that is independent of time delay effects.
7. The method of claim 6, wherein said eliminating transforms the characterizations into the frequency domain.
8. The method of claim 7, wherein said calculating computes the measure of similarity by computing a correlation coefficient for the transformed characterizations.
9. The method according to any of claims 1-8, wherein said reduced-information characterization based on auditory scene analysis is a plurality of sets of information representing at least the locations of auditory event boundaries.
10. The method of claim 9, wherein the step of determining the auditory event boundaries comprises:
calculating the spectral content of successive time segments of the audio signal,
calculating a difference in spectral content between successive time segments of said audio signal, an
Auditory event boundaries are identified as boundaries between successive time segments when a difference in spectral content between the successive time segments exceeds a threshold.
11. The method of claim 9, wherein said plurality of sets of information further represent a dominant sub-band for each of said auditory events.
12. The method of any of claims 1-8, wherein one of the characterizations is a characterization from a library of characterizations representing known audio content.
13. The method of claim 12, wherein the reduced-information characterizations based on auditory scene analysis are sets of information representing at least the locations of auditory event boundaries.
14. The method of claim 13, wherein the step of determining the auditory event boundaries comprises:
calculating the spectral content of successive time segments of the audio signal,
calculating a difference in spectral content between successive time segments of said audio signal, an
Auditory event boundaries are identified as boundaries between successive time segments when a difference in spectral content between the successive time segments exceeds a threshold.
15. The method of claim 13, wherein said plurality of sets of information further represent a dominant sub-band for each of said auditory events.
16. The method of claim 12, further comprising subtracting the average of the characterizations in the library from both characterizations after said eliminating and before said comparing.
17. The method of claim 16, wherein the reduced-information characterizations based on auditory scene analysis are sets of information representing at least the locations of auditory event boundaries.
18. The method of claim 17, wherein the step of determining the auditory event boundaries comprises:
calculating the spectral content of successive time segments of the audio signal,
calculating a difference in spectral content between successive time segments of said audio signal, an
Auditory event boundaries are identified as boundaries between successive time segments when a difference in spectral content between the successive time segments exceeds a threshold.
19. The method of claim 17, wherein said plurality of sets of information further represent a dominant sub-band for each of said auditory events.
HK05108591.1A 2001-05-25 2002-02-22 Method for comparing audio using characterizations based on auditory events HK1076660B (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US29382501P 2001-05-25 2001-05-25
US60/293,825 2001-05-25
US4564402A 2002-01-11 2002-01-11
US10/045,644 2002-01-11
US60/351,498 2002-01-23
USPCT/US02/04317 2002-02-12
PCT/US2002/005329 WO2002097790A1 (en) 2001-05-25 2002-02-22 Comparing audio using characterizations based on auditory events

Publications (2)

Publication Number Publication Date
HK1076660A1 HK1076660A1 (en) 2006-01-20
HK1076660B true HK1076660B (en) 2007-04-04


Similar Documents

Publication Publication Date Title
CN1272765C (en) Comparing audio using characterizations based on auditory events
AU2002240461B2 (en) Comparing audio using characterizations based on auditory events
US7283954B2 (en) Comparing audio using characterizations based on auditory events
US7461002B2 (en) Method for time aligning audio signals using characterizations based on auditory events
US7711123B2 (en) Segmenting audio signals into auditory events
AU2002240461A1 (en) Comparing audio using characterizations based on auditory events
HK1076660A1 (en) Method for comparing audio using characterizations based on auditory events
HK1076660B (en) Method for comparing audio using characterizations based on auditory events
HK1066087B (en) Method for time aligning audio signals using characterizations based on auditory events
HK1066902B (en) Segmenting audio signals into auditory events
HK1175882A (en) Segmenting audio signals into auditory events