
HK1190473B - Extraction and matching of characteristic fingerprints from audio signals - Google Patents

Extraction and matching of characteristic fingerprints from audio signals

Info

Publication number
HK1190473B
HK1190473B (application HK14103363.7A)
Authority
HK
Hong Kong
Prior art keywords
audio
fingerprint
frequency
samples
band
Prior art date
Application number
HK14103363.7A
Other languages
Chinese (zh)
Other versions
HK1190473A (en)
Inventor
Sergiy Bilobrov
Original Assignee
Excalibur IP, LLC
Priority date
Filing date
Publication date
Application filed by Excalibur IP, LLC
Publication of HK1190473A publication Critical patent/HK1190473A/en
Publication of HK1190473B publication Critical patent/HK1190473B/en

Abstract

An audio fingerprint is extracted from an audio sample, where the fingerprint contains information that is characteristic of the content in the sample. The fingerprint may be generated by computing an energy spectrum for the audio sample, resampling the energy spectrum, transforming the resampled energy spectrum to produce a series of feature vectors, and computing the fingerprint using differential coding of the feature vectors. The generated fingerprint can be compared to a set of reference fingerprints in a database to identify the original audio content.

Description

Extraction and matching of characteristic fingerprints from audio signals
Technical Field
The present invention relates generally to audio signal processing, and more particularly to extracting characteristic fingerprints from audio signals and searching a database of such fingerprints.
Background
The problem of identifying or comparing data signals presents a significant technical challenge due to differences in file formats, compression techniques, and other data representation methods. For example, in the case of digital music files on a computer, there are a wide variety of formats for encoding and compressing songs. Moreover, songs are often sampled in digital form at different data rates and have different characteristics (e.g., different waveforms). Recorded analog audio also contains noise and distortion. These significant waveform differences make direct comparison of such files a poor choice for efficient recognition or comparison of documents or signals. Direct file comparison also does not allow media encoded in different formats to be compared (e.g., the same song encoded as MP3 and as WAV).
For these reasons, identifying and tracking media and other content, such as content distributed over the Internet, is typically accomplished by attaching metadata, a watermark, or some other code containing identifying information for the media. But such attached information is often incomplete, incorrect, or both. For example, metadata is rarely complete, and filenames are even less uniform. Furthermore, methods such as watermarking are invasive, altering the original file with the appended data or code. Another disadvantage of these methods is that they are vulnerable to tampering. Even if every media file included accurate identification data, such as metadata or watermarks, the files could be "unlocked" (and thus pirated) if the information were successfully removed.
To avoid these problems, other methods have been developed based on the concept of analyzing the content of the data signal itself. In one type of method, an audio fingerprint is generated for a piece of audio, where the fingerprint contains characteristic information about the audio that can be used to identify the original audio. In one example, the audio fingerprint includes a sequence of numbers identifying the audio piece. The generation of audio fingerprints is typically based on the acoustic and perceptual properties of the audio for which the fingerprint is being generated. Audio fingerprints typically have a much smaller size than the original audio content and thus can be used as a convenient tool to identify, compare and search for audio content. Audio fingerprinting may be used in a variety of applications, including broadcast monitoring, audio content organization, content filtering of P2P networks, and identification of songs or other audio content. When applied to these different fields, audio fingerprint analysis typically involves fingerprint extraction and fingerprint database search algorithms.
Most existing fingerprinting techniques are based on extracting audio features from audio samples in the frequency domain. The audio is first divided into frames and a set of features is calculated for each frame. Audio features that may be used include Fast Fourier Transform (FFT) coefficients, mel-frequency cepstral coefficients (MFCCs), spectral flatness, sharpness, entropy, and modulation frequency. The computed features are assembled into a feature vector, which is typically post-processed using derivatives, means, or variances. The feature vector is mapped to a more compact representation using an algorithm such as principal component analysis and then quantized to produce an audio fingerprint. Typically, fingerprints obtained by processing a single audio frame are of relatively small size and may not sufficiently uniquely identify the original audio sequence with a desired degree of reliability. To enhance the uniqueness of the fingerprint and thus increase the probability of correct identification (and reduce the false positive rate), small sub-fingerprints may be combined into larger blocks representing about 3 to 5 seconds of audio.
One fingerprinting technique developed by Philips uses a Short Time Fourier Transform (STFT) to extract a 32-bit sub-fingerprint for each 11.8 millisecond interval of the audio signal. The audio signal is first divided into overlapping frames 0.37 seconds in length; the frames are weighted by a Hamming window with an overlap factor of 31/32 and transformed to the frequency domain using an FFT. The obtained frequency domain data may be presented as a spectrogram (e.g., a time-frequency plot), with time on the horizontal axis and frequency on the vertical axis. The spectrum (spectrogram column) of each frame is divided into 33 non-overlapping frequency bands at logarithmic intervals in the range of 300 Hz to 2000 Hz. The spectral energy in each band is calculated, and a 32-bit sub-fingerprint is generated using the sign of the energy difference of successive bands along the time and frequency axes. If the energy difference between two frequency bands in one frame is greater than the energy difference between the same frequency bands in the previous frame, the algorithm outputs a "1" for the corresponding bit in the sub-fingerprint; otherwise, it outputs a "0". The fingerprint is assembled by combining 256 subsequent 32-bit sub-fingerprints into a single fingerprint block, which corresponds to 3 seconds of audio.
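The sign-of-difference rule above can be sketched as follows. This is a minimal illustration of the bit derivation only, not the Philips reference implementation, and the band energies here are synthetic:

```python
import numpy as np

def sub_fingerprint(prev_energies, cur_energies):
    """Derive one sub-fingerprint from the band energies of two consecutive
    frames: 33 bands yield 32 bits, one per pair of adjacent bands."""
    prev = np.asarray(prev_energies, dtype=float)
    cur = np.asarray(cur_energies, dtype=float)
    # Energy difference between adjacent bands within each frame ...
    d_cur = cur[:-1] - cur[1:]
    d_prev = prev[:-1] - prev[1:]
    # ... and the sign of its change along the time axis decides each bit.
    return (d_cur - d_prev > 0).astype(int)

# Example with 33 synthetic band energies per frame.
rng = np.random.default_rng(0)
bits = sub_fingerprint(rng.random(33), rng.random(33))
```

A full fingerprint block would then concatenate 256 such sub-fingerprints.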
Although designed to be robust to common types of audio processing, noise, and distortion, the algorithm is still not very robust to large variations in playback speed and the spectral scaling they cause. Thus, a modified algorithm was proposed in which the audio fingerprint is extracted in the scale-invariant Fourier-Mellin domain. The modified algorithm includes additional steps performed after transforming the audio frame to the frequency domain: a logarithmic mapping of the spectrum followed by a second Fourier transform. Thus, for each frame, a first FFT is applied, the result is logarithmically mapped to a power spectrum, and a second FFT is applied. This can be described as a Fourier transform of a logarithmically resampled Fourier transform, and it is similar to the well-known MFCC method widely used in speech recognition. The main difference is that the Fourier-Mellin transform uses a logarithmic mapping of the full spectrum, whereas MFCC is based on the mel frequency scale (linear up to 1 kHz and logarithmically spaced for higher frequencies, mimicking the properties of the human auditory system).
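The per-frame chain of FFT, logarithmic frequency mapping, and second FFT might be sketched as below. The bin count and the interpolation-free log mapping are illustrative assumptions, not parameters from the modified algorithm:

```python
import numpy as np

def scale_invariant_features(frame, n_log_bins=64):
    """Sketch of the scale-invariant per-frame analysis:
    FFT -> log-spaced resampling of the magnitude spectrum -> second FFT.
    Scaling the input spectrum then shows up mainly as a shift on the
    log axis, which the magnitude of the second FFT is insensitive to."""
    spectrum = np.abs(np.fft.rfft(frame))
    # Log-map the frequency axis by picking logarithmically spaced bins.
    idx = np.unique(np.geomspace(1, len(spectrum) - 1, n_log_bins).astype(int))
    log_spectrum = spectrum[idx]
    return np.abs(np.fft.rfft(log_spectrum))

feat = scale_invariant_features(np.sin(2 * np.pi * 440 * np.arange(1024) / 8000.0))
```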
The Philips algorithm falls into the category of so-called short-term analysis algorithms, since the sub-fingerprints are calculated using spectral coefficients of only two consecutive frames. There are other algorithms that extract spectral features using multiple overlapping FFT frames in a spectrogram. Some of these methods based on evaluating multiple frames over time are referred to as long-term spectrogram analysis algorithms.
A long-term analysis algorithm, described for example in Sukittanon, "Modulation-Scale Analysis for Content Identification", IEEE Transactions on Signal Processing, vol. 52, no. 10 (October 2004), is based on an estimation of the modulation frequency. In this algorithm, the audio is divided into frames and a spectrogram is calculated. The modulation spectrum is then calculated by applying a second transform along the time line (e.g., horizontal axis) of the spectrogram for each spectrogram band (e.g., frequency range in the spectrogram). This is in contrast to the modified Philips approach, where the second FFT is applied along the frequency column (e.g., vertical axis) of the spectrogram. In this method, the spectrogram is divided into N frequency bands, and the same number N of Continuous Wavelet Transforms (CWTs), one for each frequency band, is computed.
Although the developers of this algorithm claim better performance than the Philips algorithm, the existing algorithms still exhibit some drawbacks. For example, these algorithms may not be robust enough to reliably recognize distorted speech and music, especially when the audio is compressed with a CELP audio codec (e.g., as associated with cellular telephone audio, such as GSM). Moreover, these algorithms are typically sensitive to noise and analog distortions, such as those associated with microphone recordings. Also, even though these algorithms can identify audio in the presence of a single type of distortion, they may not be able to handle a combination of distortions, which is more common and closer to real-world situations (e.g., for cellular phones, audio recorded from a microphone in a noisy room with slight reverberation, followed by GSM compression).
Thus, existing fingerprinting schemes, when applied to practical applications, have unacceptably high error rates (e.g., false positives and false negatives), produce fingerprints that are too large to be commercially viable, and/or are too slow. Accordingly, there is a need to overcome the existing limitations that current audio recognition techniques do not address.
Disclosure of Invention
The invention extracts a characteristic fingerprint from an audio signal based on the content of the signal. The fingerprint may be matched against a set of reference fingerprints (e.g., in a database) to determine the identity of the signal or the similarity between two signals. Due to the nature of the fingerprint extraction algorithm, it does not suffer from many of the problems that plague existing solutions, and compared to these algorithms it is fast, efficient, highly accurate, scalable, and robust.
In an embodiment of the method for generating an audio fingerprint, an audio signal is sampled and spectrogram information is calculated from the signal. The spectrogram is divided into a plurality of frequency bands. The sequence of samples in each band is then resampled.
In one embodiment, resampling the sample sequences comprises logarithmically mapping the samples. In another embodiment, resampling comprises scaling the size of each sample sequence in time based on the center frequency and/or frequency range of the respective frequency band, and resampling the scaled sequence. In another embodiment, resampling comprises shifting each sample sequence in time based on the center frequency and/or frequency range of the respective frequency band, and resampling the shifted sequence. In another embodiment, resampling comprises sampling from different sample sequences (i.e., frequency bands) over time.
A second transformation is then performed on the resampled sequences to obtain a feature vector for each sequence. In one embodiment, the second transformation comprises a transformation along a time axis. In another embodiment, the second transform comprises a transform along the time axis followed by a transform across the frequency axis. In another embodiment, the second transform comprises a two-dimensional discrete cosine transform (2D DCT). An audio fingerprint is computed based on the feature vectors. The audio fingerprint may be stored on a computer readable medium or may be determined immediately as a transmittable signal.
The different types of resampling described make the algorithm less sensitive to variations in audio playback speed and to temporal compression and stretching. Thus, the fingerprint of the audio signal should change little, if at all, despite changes in playback speed or time compression or stretching. The resampling also improves the low-frequency resolution of the second time-frequency transform. This makes it possible to use a simple transform instead of a complex wavelet transform for analyzing the modulation spectrum of the spectrogram, making the implementation more efficient and faster than previous methods.
Furthermore, due to the described resampling, the band output frame consists for the most part of samples representing the beginning of the analyzed audio sequence. The resulting fingerprint is thus generated using samples mainly from the beginning of the sequence. Since a relatively small part of the audio sequence makes the major contribution to the resulting fingerprint, the fingerprint can be used to match shorter audio sequences. In one implementation, for example, a fingerprint generated from a 5-second original audio frame may be reliably matched to a sample taken from an audio clip half as long.
Embodiments of the fingerprinting technique are also tolerant to noise and signal distortion. One implementation can detect a signal such as speech in the presence of 100% white noise (i.e., a signal-to-noise ratio of 0 dB). The techniques are also tolerant to filtering, compression, frequency equalization, and phase distortion.
In another embodiment, where the generated fingerprint frame is formed using a prescribed number of frequency bands, an acoustic model is used to mark insignificant frequency bands. Insignificant frequency bands may include frequency bands that do not add substantially any perceptible value when discriminating audio samples. Processing only the relevant frequency bands improves the signal-to-noise ratio and improves the robustness of the entire fingerprint matching process. Furthermore, excluding irrelevant frequency bands can greatly improve the recognition efficiency of band-limited audio content, for example in the case of speech encoded at a very low bit rate or low-bandwidth analog recordings.
Embodiments of the present invention also provide fast indexing and efficient searching of fingerprints in large-scale databases. For example, an index for each audio fingerprint may be computed from a portion of the fingerprint content. In one embodiment, a set of bits from the fingerprint is used as an index to the fingerprint, where the bits correspond to more stable low frequency coefficients due to resampling. To match a test fingerprint to a set of fingerprints in a database, the test fingerprint may be matched to the indices to obtain a set of candidate fingerprints. The test fingerprint is then matched against the candidate fingerprints, thereby avoiding the need to match the test fingerprint against each of the fingerprints in the database.
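The index-then-verify search described above can be illustrated with a simple hash map keyed on a group of leading fingerprint bits. The 16-bit key width and the toy 32-bit fingerprints are assumptions for illustration only:

```python
def build_index(fingerprints, index_bits=16):
    """Build a candidate index: the first index_bits fingerprint bits
    (assumed here to be the stable low-frequency ones) form a hash key
    mapping to the IDs of all reference fingerprints sharing that key."""
    index = {}
    for fid, bits in fingerprints.items():
        index.setdefault(tuple(bits[:index_bits]), []).append(fid)
    return index

def candidates(index, test_bits, index_bits=16):
    """Return only the reference IDs whose index key matches the test
    fingerprint, avoiding a full scan of the database."""
    return index.get(tuple(test_bits[:index_bits]), [])

db = {"song_a": [0] * 32, "song_b": [1] * 32}  # toy 32-bit fingerprints
idx = build_index(db)
```

The full (possibly noisy) test fingerprint would then be compared only against the returned candidates.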
In another embodiment, an edge detection algorithm is used to determine the exact edges of the analyzed audio frame or segment. In some applications, especially when the audio samples differ only during a short period of time of the total samples, it is important to know the position of the edges of the analyzed audio frame in the audio samples. The edge detection algorithm may use a linear regression technique to determine the edges of the audio frame.
Applications of embodiments of fingerprinting techniques are numerous and include real-time identification of audio streams and other audio content (e.g., streaming media, radio, advertisements, internet broadcasts, songs in a CD, MP3 files, or any other type of audio content). Embodiments of the present invention may thus enable efficient, real-time media content auditing and other reporting.
Drawings
FIG. 1 is a schematic diagram of a process of extracting and using fingerprints from an audio sample according to an embodiment of the invention.
Fig. 2 is a schematic diagram of a fingerprint extraction system according to an embodiment of the present invention.
Fig. 3 is a flow chart of a matching algorithm according to an embodiment of the present invention.
FIG. 4 illustrates an edge detection algorithm according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a fingerprint extraction system including a log resampler and a T-point transform module according to an embodiment of the invention.
Figs. 6A and 6B show diagrams of a fingerprint extraction algorithm according to several alternative embodiments of the present invention.
Figures 7A and 7B show graphical representations of bandpass filters applied to audio frames according to embodiments of the present invention.
Figs. 8A-8C illustrate graphical representations of resampling subband sample sequences, in accordance with several alternative embodiments of the present invention.
Detailed Description
SUMMARY
Embodiments of the present invention enable extraction of characteristic information (e.g., audio fingerprints) from an audio sample and matching or identification of the audio using the extracted characteristic information. As shown in FIG. 1, an audio frame 105 taken from an audio sample 100 is input into a fingerprint extraction algorithm 110. The audio samples 100 may be provided by any of a variety of sources. Using the sequence of audio frames 105, the fingerprint extraction algorithm 110 generates one or more audio fingerprints 115 that are characteristic of the sequence. As a distinguishing identifier, the audio fingerprint 115 provides information related to the identity or other characteristic of the sequence of frames 105 of the audio sample 100. In particular, the one or more fingerprints 115 of the audio sample 100 may enable the audio sample 100 to be uniquely identified. An embodiment of the fingerprint extraction algorithm 110 is described in more detail below.
The extracted fingerprint 115, once generated, may then be used for further processing or stored on a medium for later reuse. For example, the fingerprint 115 may be used by a fingerprint matching algorithm 120, the fingerprint matching algorithm 120 comparing the fingerprint 115 to entries in a fingerprint database 125 (e.g., a set of audio fingerprints from known sources) to determine the identity of the audio sample 100. Various methods of using fingerprints are also described below.
Depending on the application of the fingerprinting system, the audio sample 100 may originate from any of a variety of sources. In one embodiment, the audio sample 100 is sampled from a broadcast received from a media broadcaster and digitized. Alternatively, the media broadcaster may also transmit audio in digital form, thereby eliminating the need to digitize it. Types of media broadcasters include, but are not limited to: radio transmitters, satellite transmitters, and cable operators. The fingerprinting system may thus be used to audit these broadcasters to determine what audio was broadcast at what time. This enables an automated system for ensuring compliance with broadcast restrictions, license agreements, and the like. Because the fingerprint extraction algorithm 110 does not need to know the exact beginning and end of the broadcast signal, it can operate without the cooperation or knowledge of the media broadcaster, ensuring independent and unbiased results.
In another embodiment, the media server retrieves the audio file from the media library and sends the digital broadcast over a network (e.g., the internet) for use by the fingerprint extraction algorithm 110. Streaming internet radio broadcasting is one example of this type of architecture, where media, advertisements, and other content are delivered to individual or group users. In such embodiments, the fingerprint extraction algorithm 110 and the matching algorithm 120 typically do not have any information about the start or end times of the individual media items contained in the streaming content of the audio sample 100; however, these algorithms 110 and 120 do not require this information to identify the streaming content.
In another embodiment, the fingerprint extraction algorithm 110 receives an audio sample 100 or a series of frames 105 thereof from a client computer having access to a storage device containing audio files. The client computer retrieves an individual audio file from storage and sends the file to the fingerprint extraction algorithm 110 for generating one or more fingerprints 115 from the file. Alternatively, the client computer may retrieve a batch of files from storage 140 and send them in sequence to fingerprint extractor 110 for generating a set of fingerprints for each file. (As used herein, "set" is understood to include any number of items, including a single item, in a set.) The fingerprint extraction algorithm 110 may be executed by the client computer or by a remote server coupled to the client computer through a network.
Algorithm
One embodiment of a fingerprint extraction system 200 implementing the fingerprint extraction algorithm 110 shown in FIG. 1 is shown in FIG. 2. The fingerprint extraction system 200 includes an analysis filter bank 205, the analysis filter bank 205 being coupled to a plurality of processing channels (each channel including one or more processing modules, here labeled as elements 210 and 215), which are in turn coupled to a differential encoder 225 for generating the audio fingerprint 115. The fingerprint extraction system 200 is configured to receive audio frames 105 and to generate audio fingerprints from them.
As described in more detail below, the analysis filter bank 205 typically calculates power spectrum information of the received signal across a frequency range for each input audio frame 105. In one embodiment, each processing channel corresponds to a frequency band within the frequency range, which may overlap. Thus, the channels divide the processing performed by the fingerprint extraction system 200 such that each channel performs processing for a respective frequency band. In another embodiment, each processing channel processes multiple frequency bands (i.e., multiple frequency bands are associated with each processing channel). In other embodiments, the processing of these multiple bands may be performed by a single module in a single channel, or the processing may be divided in any other configuration as appropriate for the application and system technology limitations.
The analysis filterbank 205 receives audio frames 105 (e.g., frames 105 from audio samples 100 as shown in fig. 1). Analysis filter bank 205 converts audio frame 105 from the time domain into the frequency domain to calculate power spectrum information for frame 105 over a range of frequencies. In one embodiment, the power spectrum of the signal in the range of approximately 250 to 2250Hz is divided into a plurality of frequency bands (e.g., Y frequency bands, where Y = 13). These bands may have a linear distribution or a logarithmic center frequency distribution (or any other scale) and may also overlap. The output of the filter bank contains a measure of the signal energy of each of the plurality of frequency bands. In one embodiment, the measure of the average energy is taken using the cube root of the average spectral energy in the band.
Various implementations of the analysis filter bank 205 are possible depending on the software and hardware requirements and limitations of the system. In one embodiment, the analysis filter bank 205 includes a plurality of band pass filters that isolate the signal of the audio frame 105 for each frequency band, followed by energy estimation and downsampling. In one embodiment, the frequency passed by each band pass filter varies over time. In another embodiment, the frequency passed by each band pass filter is constant (i.e., does not change over time). Figure 7A is a graphical representation of an embodiment in which the frequency passed by each band pass filter does not change over time. Each rectangle in the graph 702 represents the signal of an audio frame 105 output by a bandpass filter. On the other hand, fig. 7B shows a graphical representation of an embodiment in which the frequency passed by each band pass filter varies with time. As can be seen in graph 704, the frequency passed by each bandpass filter decreases over time in this example. In other embodiments, the frequency of passage increases over time. After the band pass filters are applied to the audio frames 105, each band includes the signal output by its corresponding band pass filter.
In another embodiment, the analysis filterbank 205 is implemented using a short-time Fast Fourier Transform (FFT). For example, audio 100 sampled at 8 kHz is divided into 64 ms frames 105 (i.e., 512 samples). The power spectrum of each 50% overlapping segment made up of two audio frames 105 (i.e., 1024 samples) is then calculated by applying a Hamming window and performing an FFT, followed by band filtering with M uniformly or logarithmically spaced overlapping triangular windows.
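The framing and windowing just described might look like the sketch below; the hop and FFT sizes follow the text, while the M triangular band filters that follow are omitted for brevity:

```python
import numpy as np

def power_spectrogram(audio, frame_len=512, fft_len=1024):
    """Short-time FFT analysis: 512-sample (64 ms at 8 kHz) frames, each pair
    of consecutive frames forming a 50%-overlapping 1024-sample segment that
    is Hamming-windowed before the FFT."""
    window = np.hamming(fft_len)
    rows = []
    for start in range(0, len(audio) - fft_len + 1, frame_len):
        segment = audio[start:start + fft_len] * window
        rows.append(np.abs(np.fft.rfft(segment)) ** 2)  # power spectrum
    return np.array(rows)  # shape: (n_segments, fft_len // 2 + 1)

# One second of a 1 kHz tone sampled at 8 kHz.
tone = np.sin(2 * np.pi * 1000 * np.arange(8000) / 8000.0)
spec = power_spectrogram(tone)
```

The 1 kHz tone lands exactly on FFT bin 128 (1000 x 1024 / 8000), which is where each spectrogram row peaks.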
Various time-frequency domain transforms may be used in place of the FFT described above. For example, a Modified Discrete Cosine Transform (MDCT) may be used. One advantage of the MDCT is its low complexity, since it can be computed using only one N/4-point FFT plus some pre- and post-rotation of the samples. Thus, a filter bank 205 implemented with the MDCT is expected to perform better than one implemented with the FFT; e.g., the transform can be calculated roughly twice as fast.
In another embodiment, the analysis filterbank 205 is implemented using an MP3 hybrid filterbank, which includes cascaded polyphase filters and MDCTs, followed by aliasing cancellation. The MP3 filter bank produces 576 spectral coefficients for each audio frame 105 consisting of 576 samples. For audio sampled at 8 kHz, the resulting frame rate is 13.8 fps, compared to 15.625 fps for the 1024-point FFT filter bank described above. The frame rate difference is offset during time-frequency analysis when the data is resampled, as described below. The analysis filterbank 205 may also be implemented using Quadrature Mirror Filters (QMFs). The first stage of the MP3 hybrid filterbank employs a QMF filter having 32 equal-width bands. Thus, the 250-2250 Hz frequency range of an 11,025 Hz audio signal can be divided into 13 bands.
One advantage of the MP3 filter bank is its portability: highly optimized implementations exist for different CPUs. The fingerprint generation routine can thus be easily integrated with an MP3 encoder, obtaining spectral coefficients from the MP3 filterbank without additional processing. It can likewise be integrated with an MP3 decoder, obtaining spectral data directly from an MP3 bitstream without fully decoding it. Combinations with other audio codecs are also possible.
Once determined, the sub-band samples are buffered and provided to one or more resamplers 210. The resampler 210 receives the sub-band samples and resamples them to produce a resampled sequence. In one embodiment, resampler 210 resamples the sub-band samples in a non-uniform order, such as a non-sequential order or the reverse of the order in which the samples were captured.
In one embodiment, each resampler 210 corresponds to one of the Y frequency bands and receives a sequence of S samples linearly spaced in time for that respective frequency band (e.g., where S is selected from 64 to 80 depending on the implementation of the filter bank). In one embodiment, each resampler performs logarithmic, scaled or offset resampling on its respective sequence of subband samples as the sequence of subband samples is received. As a result of the resampling, the resampler 210 produces M resample sequences for each audio frame.
In one embodiment, logarithmic resampling comprises a resampler 210 logarithmically mapping its respective sub-band samples to produce a resample sequence having T samples logarithmically spaced in time (e.g., where T = 64). Instead of logarithmic sampling, other types of non-linear sampling may be performed, such as exponential resampling.
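A minimal sketch of the logarithmic mapping follows, assuming linear interpolation between the original linearly spaced samples (the interpolation method is an implementation choice, not specified in the text):

```python
import numpy as np

def log_resample(samples, t_out=64):
    """Map a linearly spaced sub-band sequence onto T = 64 logarithmically
    spaced points in time, denser near the start of the sequence."""
    s = np.asarray(samples, dtype=float)
    lin = np.arange(len(s))
    # Log-spaced positions spanning the first to the last sample.
    pos = np.geomspace(1.0, len(s), t_out) - 1.0
    return np.interp(pos, lin, s)

resampled = log_resample(np.arange(80.0))  # S = 80 -> T = 64
```

Because the sampling positions cluster near zero, most of the output represents the beginning of the sequence, consistent with the matching behavior noted earlier.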
In one embodiment, scaling resampling comprises the resampler 210 scaling the size (i.e., length) of its respective sub-band sample sequence in time. The sequence of sub-band samples is scaled based on the center frequency and/or frequency range of the frequency band. For example, the scaling may be as follows: the higher the center frequency of a subband, the larger the size of the sequence of subband samples. As another example, the scaling may be as follows: the higher the center frequency of a subband, the smaller the size of the subband sequence. The scaled sub-band sample sequence is resampled by the resampler 210 to produce a resampled sequence of T samples.
In one embodiment, offset resampling comprises the resampler 210 offsetting (i.e., shifting) its respective sequence of sub-band samples in time. The offset of the sub-band sequence is based on the center frequency and/or frequency range of the frequency band of the resampler. For example, it may be as follows: the higher the center frequency of a subband, the larger the time offset of the sequence of subband samples. The shifted sub-band sample sequence is resampled by the resampler 210 to produce a resampled sequence of T samples.
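The scaled and offset variants can be sketched together as below. The exact dependence on center frequency, proportional stretching relative to 250 Hz and a shift of 8 samples per octave, is an assumed illustrative rule, not specified in the text:

```python
import numpy as np

def resample_to(samples, t_out):
    """Linear-interpolation resampling of a sequence to t_out points."""
    s = np.asarray(samples, dtype=float)
    return np.interp(np.linspace(0, len(s) - 1, t_out), np.arange(len(s)), s)

def scaled_resample(samples, center_freq, f_ref=250.0, t_out=64):
    """Scaled variant: stretch the sequence in time in proportion to the
    band's center frequency (higher band -> longer sequence; the
    proportionality to f_ref is an assumed rule), then resample to T."""
    factor = center_freq / f_ref
    stretched = resample_to(samples, max(2, int(round(len(samples) * factor))))
    return resample_to(stretched, t_out)

def offset_resample(samples, center_freq, f_ref=250.0,
                    samples_per_octave=8, t_out=64):
    """Offset variant: shift the sequence in time by an amount growing with
    the band's center frequency (assumed rule: 8 samples per octave above
    f_ref), then resample the shifted sequence to T."""
    shift = int(round(samples_per_octave * np.log2(center_freq / f_ref)))
    return resample_to(np.asarray(samples, dtype=float)[shift:], t_out)

seq = np.arange(80.0)
r_scaled = scaled_resample(seq, 500.0)
r_offset = offset_resample(seq, 500.0)
```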
In another embodiment, each resampler 210 corresponds to a plurality of frequency bands. Each resampler 210 receives a sequence of sub-band samples for a plurality of frequency bands. The number of sub-band sample sequences received by each resampler 210 varies based on implementation. In one embodiment, the frequency bands corresponding to each resampler are contiguous.
Each resampler 210 performs time-frequency resampling on its corresponding sub-band sequence. Time-frequency resampling comprises resampling 210 from different respective frequency bands over time to produce a resample sequence of T samples. In one embodiment, the frequency at which the resampler 210 samples decreases with time. In another embodiment, the frequency at which the resampler 210 samples increases with time. As a result of the resampling, the resampler 210 produces M resample sequences for each audio frame.
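One way to picture time-frequency resampling is a diagonal walk through the spectrogram matrix, taking each successive sample from a progressively lower band (the decreasing-frequency case). The nearest-neighbor index selection is an illustrative simplification:

```python
import numpy as np

def time_freq_resample(band_matrix, t_out=64):
    """Build one resample sequence by sampling from different frequency bands
    over time: the band index falls from the highest band to the lowest as
    the time index advances across the band_matrix (bands x samples)."""
    n_bands, n_samples = band_matrix.shape
    band_idx = np.linspace(n_bands - 1, 0, t_out).round().astype(int)
    time_idx = np.linspace(0, n_samples - 1, t_out).round().astype(int)
    return band_matrix[band_idx, time_idx]

# Demo matrix in which each sample equals its band index.
bm = np.tile(np.arange(13.0)[:, None], (1, 80))
seq = time_freq_resample(bm)
```

In a real implementation several such diagonals, shifted relative to each other, would yield the M resample sequences.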
Fig. 8A and 8B illustrate time-frequency resampling according to an embodiment. In fig. 8A, each of the gray outlined rectangles of the graph 802 represents a different frequency band (i.e., a sample sequence of that band). Each black slash represents a resample sequence generated by resampler 210 as a result of time-frequency resampling. As seen in graph 802, to produce a resample sequence, each resampler 210 samples a different respective frequency band over time. In the embodiment of graph 802, the frequency at which resamplers 210 sample decreases with time. Graph 804 of fig. 8B is a graphical representation of the resample sequences of fig. 8A without the frequency bands drawn.
Graph 806 of fig. 8C shows the resampling sequences produced by resamplers 210 in an embodiment where each resampler 210 corresponds to one of the Y frequency bands. Similar to fig. 8A, each gray outlined rectangle of graph 806 represents a different frequency band, and each black line in the middle of the rectangle represents a resampling sequence. As can be seen in fig. 8C, the number of resample sequences generated by the resampler 210 in this embodiment is the same as the number of frequency bands (i.e., M = Y). This is the case because each resampler 210 samples within its frequency band.
However, as can be seen in fig. 8A, in an embodiment where each resampler 210 corresponds to a plurality of frequency bands and performs time-frequency resampling, the number of resample sequences is less than the number of frequency bands (i.e., M < Y). More frequency bands are needed in this embodiment to ensure that each resampler 210 obtains samples from the same time period and that each resample sequence contains T samples.
After the resampler 210 performs the resampling and produces M resample sequences, the resample sequences may be stored in an M × T matrix corresponding to a sample spectrogram having a time (horizontal) axis and a frequency (vertical) axis. The M resample sequences are provided to one or more transform modules 215, which one or more transform modules 215 perform a transform on the samples.
In one embodiment, the transform performed on the samples of each band is a T-point transform, which is a transform along the time axis (i.e., each row of the [M × T] matrix). In one embodiment, the T-point transform is a T-point FFT. The coefficient sequence resulting from the FFT is called a feature vector. In one embodiment, the feature vector for each band includes every other FFT coefficient calculated for that band, in ascending order of frequency. Thus, each feature vector includes N coefficients (e.g., where N = T/2 = 32). In another embodiment, instead of performing a T-point FFT, a T-point Discrete Cosine Transform (DCT), a T-point Discrete Hartley Transform (DHT), or a Discrete Wavelet Transform (DWT) is performed. The resulting feature vectors are provided to a differential encoder 225.
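The T-point FFT step can be sketched as follows. Using coefficient magnitudes is an illustrative choice, since the text does not specify how the complex FFT coefficients are reduced to values; keeping every other coefficient yields the N = T/2 entries described above.

```python
import numpy as np

def feature_vector(resampled):
    """T-point FFT along the time axis; keep every other coefficient in
    ascending frequency order, so N = T/2 values remain. Taking
    magnitudes is an illustrative choice, not fixed by the text."""
    coeffs = np.fft.fft(resampled)       # T-point FFT
    return np.abs(coeffs[::2])           # every other coefficient -> N = T/2

seq = np.sin(2 * np.pi * 4 * np.arange(64) / 64)  # pure tone in FFT bin 4
fv = feature_vector(seq)
print(len(fv))                   # 32
print(int(np.argmax(fv[:16])))   # 2 -> bin 4 maps to index 4//2 in the lower half
```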
In another embodiment, the transformation performed is a T-point transformation followed by an M-point transformation. A T-point transform is performed on the samples of each band as described above. After the T-point transform, the samples for each band are scaled, windowed, and normalized in size. After scaling, windowing, and normalization, an M-point transform is performed on the samples, which is a transform along the frequency axis (e.g., each column of the [ M × T ] matrix). In one embodiment, the M-point transform is an FFT, DCT, DHT, or DWT along the frequency axis. The resulting feature vector is provided to a differential encoder 225.
In another embodiment, the transform is a two-dimensional discrete cosine transform (2D DCT). To perform this transform, the samples of each band are normalized. Once the samples are normalized, a one-dimensional DCT is performed along the time axis, followed by a one-dimensional DCT along the frequency axis. The resulting feature vectors are provided to a differential encoder 225.
The differential encoder 225 generates a fingerprint 115 for the audio sample. In one embodiment, the feature vectors corresponding to each pair of adjacent bands are subtracted by the differential encoder 225. If there are Y bands, there are Y-1 pairs of adjacent bands. Subtracting the two feature vectors gives a vector of N difference values. For each of these differences, the differential encoder 225 selects 1 if the difference is greater than or equal to 0, and selects 0 if the difference is less than 0. For each group of four bits in the resulting sequence, the encoder assigns bit values according to a codebook table. The optimal codebook values are calculated during tuning and training of the fingerprinting algorithm. Repeating this process for the feature vectors of each successive band pair yields the bits of a [(Y-1) × N/4] matrix. The matrix may be represented as a linear sequence of bits, which serves as the audio fingerprint 115. In the example where Y = 13 and N = 8, the fingerprint 115 contains 12 bytes of information.
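The thresholding part of the differential encoding can be sketched as below. The codebook remapping of 4-bit groups is omitted, since the text describes the codebook values as tuned offline during training; only the adjacent-band subtraction and sign thresholding are shown.

```python
import numpy as np

def differential_encode(feature_vectors):
    """Differentially encode adjacent-band feature vectors into bits:
    each pair of neighbouring bands is subtracted, and each of the N
    differences yields 1 if >= 0, else 0. The codebook remapping of
    4-bit groups described in the text is omitted here."""
    fv = np.asarray(feature_vectors)           # shape (Y, N)
    diffs = fv[1:] - fv[:-1]                   # (Y-1, N) adjacent differences
    return (diffs >= 0).astype(np.uint8)       # (Y-1, N) fingerprint bits

fvs = [[0, 1], [2, 0], [1, 3]]                 # toy example: Y=3 bands, N=2
bits = differential_encode(fvs)
print(bits.shape)          # (2, 2)
print(bits.tolist())       # [[1, 0], [0, 1]]
```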
In one embodiment, the obtained feature vectors are decorrelated and reduced in size using Principal Component Analysis (PCA) before being quantized. Other decorrelation techniques, such as discrete cosine transforms, may additionally or alternatively be used to remove redundancy and compress the feature vectors.
In one embodiment, the fingerprint extraction system 200 generates a plurality of fingerprints for a series of audio frames in a particular audio signal that are highly overlapping. In one example, each series of frames 105 processed by the system 200 contains an audio signal that is 3 seconds long and begins 64 milliseconds after the beginning of the previous series. In this way, fingerprints are generated for a number of 3 second long portions of the audio signal starting every 64 milliseconds. To implement such a scheme, the fingerprint extraction system 200 may include memory buffers before and after the analysis filter bank 205, where the buffers are updated with the next 64 milliseconds of audio signal when the next audio frame 105 is received.
Fig. 6A and 6B show graphical representations of a fingerprint extraction algorithm 110 according to several alternative embodiments of the present invention. In this process, the analysis filter bank 205 receives audio frames. Graph 602 is a representation of received audio frames in the time domain. The analysis filter bank 205 performs an FFT on the audio frame to convert it from the time domain to the frequency domain, as shown in a graph 604. Power spectrum information is then calculated for the frame while in the frequency domain. The analysis filter bank 205 applies a plurality of band pass filters to isolate the frame signal of each frequency band, as shown by the graph 606.
The sub-band sample sequences of the frequency bands are resampled by the resamplers 210. Fig. 6B shows four alternative techniques (labeled A, B, C, and D) that may be performed by the resamplers 210 to resample the sub-band sample sequences. In one embodiment, techniques A, B, and C may be performed when each resampler 210 corresponds to one frequency band. In one embodiment, technique D may be performed when each resampler 210 corresponds to a plurality of frequency bands and the resamplers 210 are configured to perform time-frequency resampling.
In technique a, each resampler 210 logarithmically samples its corresponding sub-band sample sequence (graph 608) to produce a resampled sequence of T samples (graph 616). In technique B, each resampler 210 scales the size of its respective sequence of subband samples based on the center frequency and/or frequency range of the subband. As shown in graph 610, in this example, the higher the center frequency of a subband and the wider the frequency range of the subband, the smaller the size of the sequence of subband samples. The scaled sub-band sample sequences are resampled to produce resample sequences, each having T samples (graph 618).
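The logarithmic resampling of technique A can be sketched as follows; picking T index positions spaced logarithmically along the sequence (denser toward the start) is one plausible reading of "logarithmically samples", and is an assumption made for illustration.

```python
import numpy as np

def log_resample(subband, T):
    """Technique A sketch: pick T sample positions spaced
    logarithmically along a sub-band sequence (denser at the start).
    The exact spacing rule is an illustrative assumption."""
    L = len(subband)
    # logspace over [1, L], shifted to 0-based indices
    idx = (np.logspace(0, np.log10(L), T) - 1).round().astype(int)
    idx = np.clip(idx, 0, L - 1)
    return subband[idx]

band = np.arange(1024, dtype=float)
res = log_resample(band, T=64)
print(res.shape)          # (64,)
print(res[0], res[-1])    # 0.0 1023.0 -> spans the full sequence
```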
In technique C, each resampler 210 shifts its respective sequence of subband samples in time based on the center frequency and/or frequency range of the subband. As shown in graph 612, in this example, the higher the subband center frequency, the greater the offset of the subband sample sequence. The shifted sub-band sample sequences are resampled to produce resample sequences, each resample sequence having T samples (graph 620).
In technique D, each resampler 210 time-frequency resamples its respective sequences of sub-band samples. In time-frequency resampling, the resampler 210 samples from different respective frequency bands as time advances. As shown in graph 614, in this example, the frequency at which the resampler 210 samples decreases as time increases. The resampling produces resample sequences, each having T samples (graph 622).
The resample sequences (M resample sequences) produced by the resampler 210 are stored in an M x T matrix. Each transform module 215 performs a transform on the resample sequence produced by its corresponding resampler 210 (i.e., resampler 210 in the same channel as transform module 215). Fig. 6 shows 3 alternative techniques (labeled E, F and G) that may be performed by the transform module 215 to transform the resampled sequence and generate the feature vectors.
In technique E, transformation module 215 performs a T-point transformation, as shown in graph 624. In technique F, the transformation module 215 performs a T-point transformation in one dimension, followed by an M-point transformation in the other dimension, as shown in graph 626. In technique G, transform module 215 performs a two-dimensional DCT or other suitable two-dimensional transform, as shown in graph 628.
Once the sub-band samples have been transformed to obtain the feature vector, the differential encoder 225 uses the feature vector generated by the transform module 215 to generate the fingerprint 115.
Fig. 5 is an example of a fingerprint extraction system 200, where resampler 210 is a logarithmic resampler 210 and transform module 215 is a T-point transform module 215. The logarithmic resampler 210 performs logarithmic resampling as described above (technique a). However, it should be understood that in other embodiments, the logarithmic resamplers 210 may be interchanged with resamplers 210 that perform other resampling techniques (i.e., techniques B, C or D).
T-point transform module 215 performs a T-point transform as described above (technique E). However, in other embodiments, the T-point transform module 215 may be interchanged with transform module 215 performing other transform techniques (i.e., techniques F or G).
Acoustic model
In various applications of fingerprinting systems, certain frequency bands may be insignificant because they are imperceptible, because the encoding process of the audio sample has removed them, or for some other reason. Thus, in one embodiment, acoustic model 235 is used to identify and mark these insignificant frequency bands for a particular fingerprint. Acoustic models (e.g., psychoacoustic models) are well known in various audio processing fields. A set of model parameters for the acoustic model 235 may be calculated for a high-quality reference sample during creation of the fingerprint 115 and stored in the database 125. Insignificant bands in the fingerprint 115 may be marked by assigning a special code or zeroing out their corresponding values (i.e., bits). This effectively causes these bands to be ignored during any subsequent matching process, since only the relevant band pairs having non-zero values are used to discern the fingerprint 115 during matching of the fingerprint to the database records. Masked bands (i.e., those with a value of zero) may also be excluded from comparison entirely.
In one embodiment, the acoustic model is a psychoacoustic model of the human auditory system. This may be useful when the purpose of the fingerprinting system is the identification of audio intended for the human auditory system. Such audio may be compressed by one or more perceptual coders that remove irrelevant audio information. The use of a human psychoacoustic model makes it possible to identify and exclude such irrelevant bands from the fingerprint.
The psychoacoustic model is only one acoustic model suited to human perception of the encoded audio. Another acoustic model mimics the properties of a particular sound recording device. Each band of such a recording-device acoustic model may be assigned a weight factor depending on its importance. Yet another acoustic model mimics the properties of a particular environment, such as the background noise found in a vehicle or room. In such embodiments, each band of the acoustic model may be assigned a weight factor depending on its importance in the environment for which the system is designed.
In one embodiment, the parameters of the acoustic model 235 and the filter bank 205 depend on the type and properties of the analyzed audio signal 100. Different profiles, comprising a set of sub-band weight factors and a number of filter bank bands and their frequency distributions, are used to better match the properties of the target audio signal. For example, for speech-like audio the power of the signal is concentrated mainly in the low-frequency bands, whereas music may contain relevant high-frequency components depending on the genre. In one embodiment, the parameters of the acoustic model are calculated from the reference audio signal and stored in the content database together with the generated fingerprint. In another embodiment, the parameters of the acoustic model are calculated dynamically based on the properties of the audio signal analyzed during the matching process.
Thus, possible applications of the acoustic model 235 include tuning audio recognition parameters for specific environments and/or recording device and encoding algorithm attributes. For example, knowing the acoustic properties of the cellular telephone audio path (microphone characteristics, audio processing and compression algorithms, etc.) allows the development of acoustic models that mimic these properties. The use of the model during the fingerprint comparison may significantly improve the robustness of the matching process of the generated fingerprints.
Fingerprint indexing and matching
In one embodiment, the fingerprint indexer 230 generates an index for each fingerprint 115. The fingerprints 115 are then stored in the fingerprint database 125 so that the contents of the fingerprint database 125 can be efficiently searched and matched. In an embodiment, the index of the fingerprint 115 includes a portion or hash of the fingerprint 115. Thus, the fingerprints 115 in the fingerprint database 125 are indexed according to useful identifying information about them.
In the above-described embodiment where each fingerprint 115 comprises the bits of a [(Y-1) × N/4] matrix, the indexer 230 uses the bits from the leftmost columns as an index. In an embodiment where each fingerprint 115 is a 12 × 8 bit matrix, the index of the fingerprint 115 may be the leftmost two columns of bits (24 bits total). Thus, the bits used as an index for each fingerprint 115 are based on a subset of the low spectral coefficients of the feature vectors used to compute that fingerprint 115. These bits correspond to the low-frequency components of the resampled and transformed spectral bands, which are stable and insensitive to moderate noise and distortion. Thus, with a high level of probability, similar fingerprints will have the same index value. In this way, the index can be used to label and group similar and possibly matching fingerprints in the database.
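Index extraction for the 12 × 8 bit fingerprint can be sketched as below; the MSB-first bit packing is an arbitrary illustrative choice, since the text only says which bits form the index, not how they are packed.

```python
import numpy as np

def fingerprint_index(fp_bits, n_cols=2):
    """Pack the leftmost `n_cols` columns of a (Y-1) x N fingerprint bit
    matrix into an integer index (12 x 2 = 24 bits for a 12 x 8
    fingerprint). MSB-first packing is an arbitrary choice."""
    cols = np.asarray(fp_bits)[:, :n_cols].ravel()
    idx = 0
    for b in cols:
        idx = (idx << 1) | int(b)
    return idx

fp = np.zeros((12, 8), dtype=np.uint8)
fp[0, 0] = 1                           # set the most significant index bit
print(fingerprint_index(fp))           # 8388608 = 2**23
```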
Fig. 3 illustrates a method of matching a test fingerprint to the fingerprint database 125 using the above-described index, according to one embodiment of the present invention. To find a match for a test fingerprint in the fingerprint database 125, the matching algorithm begins by calculating (310) an index value for the test fingerprint as described above. Using the index value, a set of candidate fingerprints is obtained (320), e.g., the set includes all fingerprints in the database 125 having the same index value. As described above, due to the way the index values are computed, it is highly likely that any matches in the database 125 are in the set of candidate fingerprints.
To test for matches in the set of candidate fingerprints, a bit error rate (BER) between the test fingerprint and each candidate fingerprint is calculated (330). The BER between two fingerprints is the percentage of corresponding bits that do not match. For unrelated, completely random fingerprints, the BER is expected to be 50%. In one embodiment, two fingerprints match if the BER is less than about 35%; however, other numerical limits may be used depending on the desired tolerance for false positives and/or false negatives. In addition, calculations or criteria other than BER may be used to compare two fingerprints; for example, the inverse of the BER, the match rate, may also be used. Furthermore, when comparing two fingerprints, some bits may be weighted more heavily than others.
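The BER comparison described above reduces to a short computation; the 35% threshold below is the example value from the text.

```python
import numpy as np

def bit_error_rate(fp_a, fp_b):
    """Fraction of mismatched bits between two equal-length fingerprints;
    ~0.5 is expected for unrelated fingerprints, and a value below the
    chosen threshold (about 0.35 in the text) declares a match."""
    a, b = np.asarray(fp_a), np.asarray(fp_b)
    return np.count_nonzero(a != b) / a.size

a = np.array([1, 0, 1, 1, 0, 0, 1, 0])
b = np.array([1, 1, 1, 0, 0, 0, 1, 0])
ber = bit_error_rate(a, b)
print(ber)                  # 0.25
print(ber < 0.35)           # True -> would count as a match
```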
If there are no matches within the predetermined matching criteria (340) and there are no more indices to modify (350), the matching algorithm has failed to find any matches to the test fingerprint in the database 125. The system may then continue searching (e.g., using less restrictive criteria to obtain candidate fingerprints) or may stop. If there are one or more matching fingerprints (340), a list of the matching fingerprints is returned (360).
In one embodiment, to obtain different sets of candidate fingerprints for searching for a match, the system may repeat the search after modifying (370) the computed fingerprint index. To modify (370) the computed fingerprint index, one or more bits of the computed fingerprint index may be flipped. In one example where the fingerprint index has 24 bits, after failing to find a match using the original fingerprint index, the search step is repeated 24 times, each time flipping a different single bit of the 24-bit fingerprint index. Various other techniques may be used to expand the search space.
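The single-bit-flipping expansion of the index can be sketched as below; for a 24-bit index this yields the original index plus 24 variants, matching the repeated searches described above.

```python
def expanded_indices(index, n_bits=24):
    """Return the original index plus every single-bit-flipped variant,
    used to widen the candidate lookup when the exact index finds no
    match (24 flips for the 24-bit index in the text)."""
    return [index] + [index ^ (1 << i) for i in range(n_bits)]

variants = expanded_indices(0b101, n_bits=4)   # small toy index for clarity
print(len(variants))    # 5
print(variants)         # [5, 4, 7, 1, 13]
```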
In one embodiment, the fingerprint indexer 230 generates one or more indices by selecting index bits from one or more fingerprints based on a set of band weight factors calculated by the acoustic model 235 and pre-stored in the database 125. When a plurality of indexes (including indexes obtained by bit flipping) are used, the candidate fingerprint group includes all candidates obtained for each calculated index.
In another embodiment, the search is narrowed by pre-screening and selecting only the fingerprint candidates found in most or all of the candidate sets obtained for each computed index. Pre-screening multiple fingerprint candidate sets using multiple indices (including indices obtained by bit flipping) can significantly improve the performance of the database search. In one embodiment, indices and references to possible fingerprint candidates are stored in computer memory so that fingerprint candidates can be quickly selected and pre-screened. In a second step (step 320), only the fingerprint candidates with the highest probability of matching a given fingerprint are loaded into computer memory and compared. This approach achieves fast searches by keeping only a small index in computer memory while storing the larger fingerprints on slower devices (e.g., a hard disk or over a network).
Detecting edges of audio frames
In some applications, it may be desirable to detect edges of matching audio segments. Edge detection allows the system to know exactly when a particular matching audio piece is occurring. Depending on the quality of the analyzed audio, embodiments of the edge detection algorithm may be able to detect edges of matching audio segments with an accuracy of about 0.1 to 0.5 seconds.
As described above, embodiments of the fingerprinting technique accumulate audio samples in a sub-band processing buffer. Due to this buffering, the output of the fingerprinting algorithm is delayed and smeared at the edges of an audio piece. This effect is illustrated in fig. 4, which is a graph of bit error rate (BER) over time between a reference fingerprint of an audio piece and a series of fingerprints generated over time for an incoming sample audio stream. In the illustrated embodiment, the sub-band buffer holds 3 seconds of audio, and a match is declared when two fingerprints have a BER of 35% or less.
Initially, at time T0, the sub-band processing buffer is empty, and the resulting fingerprint therefore yields no match to the original audio (i.e., the BER is expected to be approximately 50%). As audio samples are added to the sub-band buffer, the BER drops, indicating a better match. After sufficient time has elapsed, the BER drops below the 35% threshold at time T1, indicating a match. Finally, at time T2, the BER reaches a plateau as the buffer fills with samples. When the fingerprinting algorithm passes the end of the corresponding audio piece, it begins to produce fingerprints that match less and less: the BER increases from time T3 and crosses the recognition threshold of 35% at time T4. The durations of the obtained matching curve (T1-T4) and of the plateau (T2-T3) are each shorter than the duration of the matching audio piece (T0-T3).
In one embodiment, an edge detection algorithm is used to determine the exact edges of a matching audio frame or segment. A BER curve such as the one shown in fig. 4 is obtained. The BER curve is divided into regions corresponding to the onset of a match with decreasing BER (e.g., T1-T2), the plateau with roughly constant BER (e.g., T2-T3), and the end of a match with increasing BER (e.g., T3-T4). Because a real BER curve will typically be noisy, it is partitioned using a suitable technique such as regression analysis. In one embodiment, all samples that produce more than 30% BER are ignored because they may not be reliable. The start of the matching audio segment (i.e., time T0) may then be calculated using linear regression as the intersection of the line that best fits the region of decreasing BER (e.g., T1-T2) with the horizontal line corresponding to 50% BER. A similar method may be used to estimate a time T5, taking the intersection of the line that best fits the region of increasing BER (e.g., T3-T4) with the horizontal line corresponding to 50% BER. In this case, however, time T5 corresponds to the end of the segment delayed by the duration B of the sub-band buffer, rather than the actual end of the matching audio segment. The position of the segment end (i.e., time T3) may be calculated by subtracting the sub-band buffer duration B from the obtained estimate T5.
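The regression step for the segment start can be sketched as follows: fit a line to the low-BER portion of the declining region and intersect it with the 50% level. Using `np.polyfit` for the linear regression and the synthetic BER curve are illustrative choices.

```python
import numpy as np

def segment_start(times, bers, max_ber=0.30):
    """Estimate the start of a matching segment: fit a line to the
    decreasing-BER region (samples above 30% BER are ignored as
    unreliable, per the text) and intersect it with the 50% BER level."""
    t = np.asarray(times, dtype=float)
    b = np.asarray(bers, dtype=float)
    keep = b <= max_ber                     # drop unreliable high-BER samples
    slope, intercept = np.polyfit(t[keep], b[keep], 1)
    return (0.5 - intercept) / slope        # where the fitted line hits 50% BER

# synthetic falling BER: 50% at t=1, dropping 10 percentage points per step
times = np.arange(2, 7)
bers = 0.5 - 0.1 * (times - 1)             # 0.4, 0.3, 0.2, 0.1, 0.0
print(round(segment_start(times, bers), 3))   # 1.0 -> extrapolated 50% crossing
```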
In another embodiment, the end of the matching audio piece is estimated to be the end of the region T2-T3, and the start of the audio piece is calculated by subtracting the sub-band buffer duration B from the T2 time, which corresponds to the start of the region T2-T3.
Summary of the invention
Although discussed in terms of vectors and matrices, the information computed for any fingerprint or sub-fingerprint may be stored and processed in any form, not just as vectors or matrices of values. The terms "vector" and "matrix" are thus used merely as a convenient mechanism to represent data extracted from audio samples, and are not meant to be limiting in any other way. Furthermore, although the power spectrum is discussed in terms of a spectrogram, it will be appreciated that data representing the power spectrum or spectral analysis of an audio signal may not only be represented and used as a spectrogram, but may also be represented and used in any other suitable form.
In one embodiment, the software modules are implemented using a computer program product comprising a computer readable medium containing computer program code executable by a computer processor to perform any and all of the steps, operations, or processes described herein. Thus, any of the steps, operations, or processes described herein may be performed or implemented using one or more software modules or hardware modules, alone or in combination with other devices. Furthermore, any system parts described in terms of hardware elements may be implemented in software, and any system parts described in terms of software elements may be implemented in hardware, e.g. hard-coded into dedicated circuitry. For example, the code for performing the described methods may be embedded in a hardware device, e.g., in an ASIC or other conventional circuitry. This allows the benefits of the present invention to be combined with the capabilities of many different devices.
In another embodiment, the fingerprinting algorithm is embedded in and runs on any of a variety of audio devices, such as a cellular phone, a Personal Digital Assistant (PDA), an MP3 player and/or recorder, a set-top box, a television, a game console, or any other device that stores, processes, or plays audio content. Embedding fingerprinting algorithms in such devices may have several benefits. For example, generating an audio fingerprint directly on a cellular phone would provide better results than sending compressed audio from the phone to a fingerprinting server over a cellular network. Running the algorithm on a cellular phone eliminates the distortion caused by GSM compression, which is designed to compress speech and has poor performance on music. Thus, the method can significantly improve the recognition of audio recorded by a cellular phone. It also reduces the load on the server and network traffic.
Another benefit of such an embedded approach is the ability to monitor the listening experience without violating privacy and user interests. For example, a sound recording device may record audio, create a fingerprint, and then send only the fingerprint to a server for analysis. Recorded audio never leaves the device. The server may then use the transmitted fingerprint to identify the targeted music or advertisement, although it is not possible to recover the original audio from the fingerprint.
The foregoing description of the embodiments of the invention has been presented for the purposes of illustration and description; it is not intended to be exhaustive or to limit the invention to the precise form disclosed. One skilled in the relevant art will recognize that many modifications and variations are possible in light of the above teaching. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

Claims (6)

1. A method for extracting an audio fingerprint from an audio frame, the method comprising:
filtering the audio frame into a plurality of frequency bands to produce a corresponding plurality of filtered audio signals, wherein the frequency bands differ over time;
scaling in time the size of each filtered audio signal based on the frequency of the respective frequency band;
sampling from different respective frequency bands over time, thereby resampling the scaled and filtered audio signal to produce a resampled audio signal;
transforming the resampled audio signals to produce a feature vector for each resampled audio signal; and
computing the audio fingerprint based on the feature vectors.
2. The method of claim 1, wherein the frequency on which scaling is based is at least one of a center frequency and a frequency range of the respective frequency band.
3. The method of claim 1, wherein transforming the resampled audio signal comprises performing a transform along a time axis.
4. The method of claim 1, wherein transforming the resampled audio signal comprises:
transforming the resampled audio signal along a time axis; and
the resampled audio signal is transformed along a frequency axis.
5. The method of claim 4, further comprising size scaling, windowing, and normalizing the resampled audio signal.
6. The method of claim 1, wherein transforming the resampled audio signal comprises performing a two-dimensional discrete cosine transform (2D DCT).
HK14103363.7A 2011-02-10 2012-01-13 Extraction and matching of characteristic fingerprints from audio signals HK1190473B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/025,060 2011-02-10

Publications (2)

Publication Number Publication Date
HK1190473A HK1190473A (en) 2014-07-04
HK1190473B true HK1190473B (en) 2017-12-15
