HK1060632B

HK1060632B - Method and device for producing a fingerprint and method and device for identifying an audio signal

Info

Publication number: HK1060632B
Application number: HK04103530.7A
Authority: HK
Inventors: Jurgen Herre; Oliver Hellmuth; Markus Cremer; Eric Allamanche; Thorsten Kastner
Original assignee: M2Any Ltd.
Priority date: 2001-07-10
Filing date: 2002-06-20
Publication date: 2012-07-06

Description

The present invention relates to the characterization or identification of audio signals in terms of their content, in particular the generation and use of different fingerprints for an audio signal.

In recent years, the availability of multimedia data, i.e. audio data, has increased significantly. This development has been driven by a number of technical factors. These technical factors include, for example, the wide availability of the Internet, the wide availability of high-performance computers, and the wide availability of high-performance methods for data compression, i.e. source coding, of audio data.

The enormous volume of audiovisual data available worldwide, for example on the Internet, requires concepts that allow this data to be assessed, catalogued or managed according to content criteria.

U.S. Patent No. 5,918,223 discloses a method for content-based analysis, storage, retrieval, and segmentation of audio information. An analysis of audio data generates a set of numerical values, also known as a feature vector, that can be used to classify and rank the similarity between individual audio pieces typically stored in a multimedia database or on the World Wide Web.

The analysis also allows the description of user-defined classes of audio tracks based on an analysis of a set of audio tracks which are all members of a user-defined class. The system is able to find individual sound segments within a longer piece of audio, which allows the audio recording to be automatically segmented into a series of shorter audio segments.

The database system is able to quantify the distance in n-dimensional space between two n-dimensional vectors. It is also possible to generate classes of audio tracks by specifying a set of audio tracks that belong to a class. Examples of classes are bird song, rock music, etc. The user is enabled to search the audio tracks database using specific methods. The result of a search is a list of files, arranged by their distance from the specified n-dimensional vector.

Various categories are proposed for the characterization of audio pieces, such as animal sounds, bell sounds, crowd sounds, laughter, machine sounds, musical instruments, male speech, female speech, telephone sounds or water sounds.

The problem in selecting the features used is that the computational effort to extract a feature should be moderate to achieve rapid characterization, but that at the same time the feature should be characteristic of the audio recording, so that two different pieces also have distinguishable features.

The other extreme would be to take, for example, only a mean value of all the samples of a piece. This mean value takes up very little memory space and is therefore ideal for both a large music database and matching algorithms.

The disadvantage of this approach is that the different types of fingerprints are each optimally suited to only one particular application and are more or less unsuitable for others. In this connection, it should be noted that audio identification or characterization is of particular interest only if very large feature databases exist whose fingerprints can be compared with a search fingerprint in order to identify or characterize an audio signal directly. If it turns out that a certain type of fingerprint, while favourable for one application, is no longer favourable for another, then, in order to achieve an optimal trade-off between characterizing power on the one hand and storage space on the other, a new feature extraction must be carried out for the large number of audio signals whose fingerprints are stored in the database, so as to create a new feature database that is an optimal trade-off for the current application.

Another problem is that fingerprints should be transmittable over a wide variety of transmission channels. A transmission channel with very low transmission capacity is, for example, the air interface of a mobile phone. Here, in addition to the characterizing power and the storage capacity for the database, the bandwidth of the transmission channel also plays a decisive role. It would make no sense to generate a highly characteristic fingerprint that then cannot be transmitted, or can be transmitted only with difficulty, over the narrow-band transmission channel. The optimal fingerprint for such an application is therefore additionally dictated by the transmission channel over which the fingerprint is to be transmitted, e.g. to a search database.

The object of the present invention is to provide a flexible and adaptable fingerprinting concept.

This object is achieved by a method for producing a fingerprint according to claim 1, a method for characterizing an audio signal according to claim 11, a fingerprint representation according to claim 15, a device for producing a fingerprint according to claim 16, or a device for characterizing an audio signal according to claim 17.

Preferably, whichever of the search fingerprint and the database fingerprint has the higher characterizing power is converted, so that the two fingerprints being compared are actually comparable with each other.

This has the advantage that very weak fingerprints but also very strong fingerprints can be processed using the same fingerprint database, so that, depending on the predetermined fingerprint modes allowed, a suitable fingerprint mode can be found for each application, while still using the same fingerprint database.

This approach has the further advantage that producers of audio databases are relieved of the task of producing different fingerprints again and again for changing applications. Instead, a scalable fingerprint is produced once and can then be used for a wide range of applications because of its scalability. Users of such search databases, on the other hand, are given enough flexibility to generate, as circumstances require, either a fingerprint of very low characterizing power that can be transmitted very quickly or, in another application where characterizing power matters more than ease of data transmission, a highly characteristic fingerprint. In both cases the same fingerprint database can be used, which makes the concept of a user-friendly audio database both conceptually simpler and commercially usable.

Preferably, frequency scalability and/or time scalability are used. In a preferred embodiment of the present invention, frequency scalability is achieved in that the fingerprint modes each contain separate fingerprint information for separate sub-bands of the audio signal, and in that the fingerprint modes differ from each other in that they contain separate fingerprint information for different numbers of sub-bands. The sub-band division is the same for all fingerprint modes. If a database is generated with highly characteristic fingerprints, i.e. in a fingerprint mode that contains separate fingerprint information for, say, 20 sub-bands of the audio signal, then fingerprint modes of lower characterizing power contain separate fingerprint information for fewer of these sub-bands.

With time scalability, a fingerprint of relatively low characterizing power contains, for example, one piece of fingerprint information per 10 blocks of samples of the audio signal, while a fingerprint of high characterizing power contains one piece of fingerprint information per block of samples. With the same block length for both fingerprints, downward conversion combines a corresponding number of consecutive pieces of high-characterizing-power fingerprint information into a single converted piece, so that the converted fingerprint has the same number of fingerprint values as the fingerprint of low characterizing power.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of a device for producing a fingerprint;
Fig. 2 is a block diagram of a device for characterizing an audio signal according to the invention;
Fig. 3a is a diagram of the division of an audio signal into different sub-bands;
Fig. 3b is a schematic representation of different fingerprint representations which can be produced by different fingerprint modes from the sub-band division shown in Fig. 3a;
Fig. 4a is a diagram of the division of an audio signal into blocks over time;
Fig. 4b is a schematic overview of different fingerprint representations which can be produced by different fingerprint modes from the block division shown in Fig. 4a; and
Fig. 5 is a block diagram of a pattern recognition system.

In the following, a pattern recognition system in which the present invention can be used to advantage is described with reference to Fig. 5.

In classification mode, an attempt is made to compare and classify a signal to be characterized with the entries in the database 54.

The pattern recognition system includes a device 56 for signal preprocessing, a downstream device 58 for feature extraction, a device 60 for feature processing, a device 62 for cluster generation, and a device 64 for performing a classification, for example in order to obtain, as the result of classification mode 52, a statement about the content of the signal to be characterized, such as that the signal is identical to a signal xy that was entered in a previous training mode.

The functionality of the individual blocks of Figure 5 is discussed below.

Block 56 together with block 58 forms a feature extractor, while block 60 is a feature processor. Block 56 converts an input signal into a uniform target format with respect to, for example, the number of channels, the sampling rate and the resolution (in bits per sample). This is useful and necessary because no assumptions should be made about the source from which the input signal originates.

The purpose of the feature extraction device 58 is to reduce the usually large amount of information at the output of device 56 to a small amount of information. The signals to be examined usually have a high data rate, i.e. a large number of samples per unit of time. The reduction to a small amount of information must be done in such a way that the essence of the original signal, i.e. its specific character, is not lost. In device 58, predetermined characteristic properties, such as loudness, fundamental frequency, etc., are generally extracted and/or, according to the present invention, tonality features or the SFM are extracted from the signal. The tonality features thus obtained are intended to contain, so to speak, the essence of the signal being examined.

In block 60 the previously calculated feature vectors can be processed. A simple form of processing is the normalization of the vectors. Other possible forms of feature processing are linear transformations, such as the Karhunen-Loeve transformation (KLT) or linear discriminant analysis (LDA), which are known in the art. Other transformations, especially nonlinear transformations, are also applicable to feature processing.
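
The simple normalization mentioned above might be sketched as follows. The text does not specify which normalization is meant, so this sketch assumes a per-dimension scaling to zero mean and unit variance; the function name `normalize_features` and the sample values are hypothetical.

```python
import math


def normalize_features(vectors):
    """Scale each feature dimension to zero mean and unit variance.

    Assumption: per-dimension standardization; the source only says
    that the vectors are normalized.
    """
    count = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / count for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((v[d] - means[d]) ** 2 for v in vectors) / count
        # Constant dimensions keep a divisor of 1 to avoid dividing by zero.
        stds.append(math.sqrt(var) if var > 0 else 1.0)
    return [[(v[d] - means[d]) / stds[d] for d in range(dims)] for v in vectors]


# Three made-up two-dimensional feature vectors.
features = [[0.2, 10.0], [0.4, 30.0], [0.6, 20.0]]
normalized = normalize_features(features)
```

After this step, dimensions with very different numeric ranges contribute comparably to a distance measure.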

The class generator 62 is used to combine the processed feature vectors into classes, which correspond to a compact representation of the associated signal. Finally, the classifier 64 is used to assign a generated feature vector to a predefined class or a predefined signal.

Fig. 1 shows a schematic block diagram of a device for producing a fingerprint of an audio signal, as may be present, for example, in block 58 of Fig. 5. To produce a fingerprint of an audio signal, information is used that defines a number of predefined fingerprint modes, this mode information being stored, for mutually compatible fingerprint modes, by means of a device 10. The fingerprint modes defined by the mode information stored in device 10 all refer to the same type of fingerprint, but yield different fingerprints which differ from each other in their data volume on the one hand and in their characterizing power for identifying the audio signal on the other hand. The fingerprint modes are predefined such that a fingerprint in a first mode, with a first characterizing power, is convertible into a fingerprint in a second mode, with a second characterizing power; preferably, the conversion goes from the higher to the lower characterizing power.

The device of the invention also includes a device 12 for setting a fingerprint mode from the plurality of predefined fingerprint modes. In a device 14 for calculating the fingerprint according to the fingerprint mode set by the device 12, a fingerprint of an audio signal fed in through an input 16 is finally calculated and output at an output 18. The device 14 is connected to the storage device 10 in order to apply the calculation rules corresponding to the fingerprint mode.

In the following, the device 14 for calculating the fingerprint in a set fingerprint mode is discussed in detail.

A time signal to be characterized can be transformed into the spectral domain by means of a device that generates a block of spectral coefficients from a block of time samples. As will be shown later, a specific tonality value can be determined for each spectral coefficient or spectral component, for example in order to classify, by a yes/no determination, whether a spectral component is tonal or not.

Because a quantitative tonality measure is obtained, it is also possible to indicate distances or similarities between two tonality-indexed pieces. Pieces can be classified as similar if their tonality measures differ by less than a predetermined threshold, while other pieces can be classified as dissimilar if their tonality indices differ by more than a dissimilarity threshold.

It should be noted that the signal to be characterised need not necessarily be a time signal, but that it may also be an MP3-coded signal, for example, consisting of a sequence of Huffman code words generated from quantized spectral values.

The quantized spectral values have been generated from the original spectral values by quantization, the quantization having been chosen in such a way that the quantization noise it introduces is below the psychoacoustic masking threshold. In such a case, the encoded MP3 data stream can be used directly to calculate the spectral values, for example by means of an MP3 decoder. It is not necessary to perform a conversion into the time domain and then a conversion back into the spectral domain before determining the tonality. Instead, the spectral values calculated within the MP3 decoder can be taken directly to calculate the tonality per spectral component or the SFM (SFM = Spectral Flatness Measure). If spectral components are used to determine the tonality and the signal to be characterized is an MP3 data stream, the spectral values are thus obtained as in a decoder, but without the synthesis filter bank.

The spectral flatness measure (SFM) is calculated by the following equation:

$$\mathrm{SFM} = \frac{\left(\prod_{n=0}^{N-1} X(n)\right)^{1/N}}{\frac{1}{N}\sum_{n=0}^{N-1} X(n)}$$

In this equation, X(n) represents the squared magnitude of a spectral component with index n, while N represents the total number of spectral coefficients of a spectrum. From the equation it can be seen that the SFM is equal to the ratio of the geometric mean of the spectral components to the arithmetic mean of the spectral components. As is known, the geometric mean is always smaller than or at most equal to the arithmetic mean, so that the SFM has a range of values between 0 and 1. A value close to 0 indicates a tonal signal, and a value close to 1 indicates a more noise-like signal with a very flat spectral curve.
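
The ratio of geometric to arithmetic mean described above can be illustrated with a short sketch. The function name `spectral_flatness` and the example spectra are hypothetical, and returning 0 for silent blocks or blocks containing a zero component is a convention chosen for this sketch, not something the text prescribes.

```python
import math


def spectral_flatness(power_spectrum):
    """Ratio of the geometric mean to the arithmetic mean of X(n)."""
    count = len(power_spectrum)
    arithmetic_mean = sum(power_spectrum) / count
    # Convention of this sketch: a silent block or a block with a zero
    # component has a geometric mean of zero, so 0 is returned.
    if arithmetic_mean == 0.0 or min(power_spectrum) <= 0.0:
        return 0.0
    # The geometric mean is computed in the log domain to avoid overflow
    # or underflow of the product for long spectra.
    log_mean = sum(math.log(x) for x in power_spectrum) / count
    return math.exp(log_mean) / arithmetic_mean


# A perfectly flat spectrum and a spectrum dominated by one component.
flat = spectral_flatness([1.0] * 8)
peaky = spectral_flatness([100.0] + [0.01] * 7)
```

In this example `flat` evaluates to 1, while `peaky` is close to 0, which matches the interpretation of the two ends of the value range given above.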

Another way to determine tonality is described in US Patent No. 5,918,203. Again, a positive real-valued representation of the spectrum of the signal to be characterized is used. This representation may include the magnitudes, squared magnitudes, etc. of the spectral components. In one embodiment, the magnitudes or squared magnitudes of the spectral components are first compressed logarithmically and then filtered with a filter with differentiating characteristics to obtain a block of differentially filtered spectral components.

In another embodiment, the magnitudes of the spectral components are filtered with a filter with differentiating characteristics to obtain a numerator and with a filter with integrating characteristics to obtain a denominator, and the ratio of the two is formed.

These two approaches suppress slow changes between adjacent magnitudes of spectral components while highlighting abrupt changes between adjacent magnitudes of spectral components in the spectrum. Slow changes between adjacent magnitudes indicate non-tonal signal components, while abrupt changes indicate tonal signal components. The logarithmically compressed and differentially filtered spectral components, or the ratios, can then be used to calculate a tonality measure for the spectrum under consideration.
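
A minimal sketch of the first approach, logarithmic compression followed by differentiating filtering, is given here. It assumes a plain first-order difference as the filter with differentiating characteristics; the filter actually used in the cited patent is not specified in this text. The name `differential_log_spectrum`, the `floor` parameter and the sample spectra are hypothetical.

```python
import math


def differential_log_spectrum(magnitudes, floor=1e-12):
    """Log-compress the magnitudes, then take first-order differences.

    Assumption: a first difference stands in for the filter with
    differentiating characteristics. `floor` guards against log(0).
    """
    log_mag = [math.log(max(m, floor)) for m in magnitudes]
    return [log_mag[k + 1] - log_mag[k] for k in range(len(log_mag) - 1)]


# A slowly varying spectrum and one with a single sharp peak.
smooth = differential_log_spectrum([1.0, 1.1, 1.2, 1.3, 1.4])
peaked = differential_log_spectrum([1.0, 1.0, 50.0, 1.0, 1.0])
```

The slowly varying spectrum yields only small differences, whereas the sharp peak produces large positive and negative values on either side of it, which is the behaviour the paragraph above attributes to tonal components.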

Although it has been stated above that a tonality value is calculated per spectral component, it is preferable, in order to reduce computational effort, always to add, for example, the squared magnitudes of two adjacent spectral components and then to calculate a tonality value for each result of the addition by one of the above methods.

Another way of determining the tonality of a spectral component is to compare the level of a spectral component with an average of spectral component levels in a frequency band. The width of the frequency band in which the spectral component is located, and over which the average of, for example, the magnitudes or squared magnitudes of the spectral components is formed, can be chosen as required. One option is, for example, to choose a narrow band. Alternatively, the band could be chosen to be broad, or chosen according to psychoacoustic criteria.

Although the tonality of an audio signal has been determined above from its spectral components, this can also be done in the time domain, i.e. using the samples of the audio signal. For this purpose, an LPC analysis of the signal could be performed to estimate a prediction gain for the signal. The prediction gain is inversely proportional to the SFM and is likewise a measure of the tonality of the audio signal.

For example, the short-term spectrum can be divided into four adjacent and preferably non-overlapping regions or frequency bands, with a tonality value being determined for each frequency band. This results in a 4-dimensional tonality vector for a short-term spectrum of the signal to be characterized. To allow for better characterization, it would be preferable to process further short-term spectra in the same way, for example four consecutive short-term spectra, so that a 16-dimensional tonality vector is obtained overall or, in general, a vector whose dimension is the number of tonality values per frame multiplied by the number of frames considered. It would be better still to calculate several such vectors and to process them statistically, for example by taking their mean value, in order to index a piece of a given length.

In general, the tonality can thus be calculated from parts of the entire spectrum, which makes it possible to determine the tonality or noisiness of one or more sub-spectra and thus to obtain a finer characterization of the spectrum and hence of the audio signal.
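
The construction of such a multi-dimensional vector, with four bands per short-term spectrum and four consecutive spectra, can be sketched as follows. The sketch assumes equal-width bands and uses the spectral flatness as the per-band tonality value; both are choices made for illustration, and all function names and the test spectra are hypothetical.

```python
import math


def flatness(values):
    """Geometric mean divided by arithmetic mean (0 if any value is 0)."""
    if min(values) <= 0.0:
        return 0.0
    log_mean = sum(math.log(v) for v in values) / len(values)
    return math.exp(log_mean) / (sum(values) / len(values))


def band_tonality_vector(power_spectrum, num_bands=4):
    """One tonality value per band; trailing bins that do not fill a
    whole band are ignored in this sketch."""
    width = len(power_spectrum) // num_bands
    return [
        flatness(power_spectrum[b * width:(b + 1) * width])
        for b in range(num_bands)
    ]


def stacked_vector(spectra, num_bands=4):
    """Concatenate the band vectors of consecutive short-term spectra."""
    vector = []
    for spectrum in spectra:
        vector.extend(band_tonality_vector(spectrum, num_bands))
    return vector


# Four synthetic short-term power spectra with 32 bins each.
frames = [[1.0 + ((k * 7 + f) % 5) for k in range(32)] for f in range(4)]
vector = stacked_vector(frames)
```

With four bands and four spectra, `vector` has 16 entries, corresponding to the 16-dimensional case discussed above.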

In addition, short-term statistics from tonality values, such as mean, variance and central moments of higher order, can be calculated as a measure of tonality.

In addition, differences between temporally successive tonality vectors or linearly filtered tonality values may be used, with, for example, IIR or FIR filters serving as linear filters.

For the calculation of the SFM, it is also preferable, for example, to add or average two squared magnitudes that are adjacent in frequency and to perform the SFM calculation on this coarsened, positive, real-valued spectral representation. This saves computing time and also leads to greater robustness against narrow-band frequency dropouts.

Referring again to Figure 1, the device 12 for setting one of the predefined fingerprint modes is discussed in more detail below. The task of device 12 is to select and set, from the plurality of predefined fingerprint modes, the fingerprint mode most suitable for a given application. The selection can be made either empirically or automatically by means of predefined test matching operations. In such test matching operations, several known audio signals, for example, are processed according to different fingerprint modes to produce different fingerprints of different characterizing power. A test matching operation against a database is then performed with these fingerprints, which all represent the same type of fingerprint, and a fingerprint mode is selected that meets a predetermined quality threshold.

Alternatively, the device 12 may select the fingerprint mode independently of threshold values, for example as dictated by a transmission channel, choosing the mode that produces a fingerprint whose data volume allows it to be transmitted, for example, via a band-limited transmission channel.

The same applies if the fingerprint is to be stored rather than transmitted. Depending on the storage resources available, the device 12 may set either a memory-intensive and thus highly characteristic fingerprint mode or a memory-saving fingerprint mode with relatively low characterizing power.

Fig. 2 shows a block diagram of a device for characterizing an audio signal according to the invention. Such a device includes a device for generating a search fingerprint in one of the predefined fingerprint modes. This device is designated in Fig. 2 by reference sign 20 and is preferably made as described in conjunction with Fig. 1. The device for characterizing an audio signal also includes a database 22 in which database fingerprints are stored, which have also been calculated in one of the predefined fingerprint modes.

The device shown in Fig. 2 also includes a device 24 for comparing the search fingerprint produced by the device 20 with the database fingerprints. First, a device 24a determines whether the search fingerprint and the database fingerprint to be compared have the same characterizing power, i.e. whether they were produced in the same fingerprint mode, or whether the search fingerprint was produced in a different fingerprint mode than the database fingerprint. If it is established that one of the fingerprints has a higher characterizing power than the other, a conversion is carried out in a device 24b in such a way that, after the conversion, the search fingerprint and the database fingerprint have the same characterizing power, i.e. are comparable. The two fingerprints are compared only once this condition is met.

Preferably, the device 24a is arranged to determine which fingerprint has the higher characterizing power, and this fingerprint is then scaled down to the fingerprint mode of the fingerprint with the lower characterizing power. Alternatively, if desired for example for rapid searching, both fingerprints can be scaled down to a fingerprint mode of lower characterizing power than both the search fingerprint and the database fingerprint.

Depending on the application, it may also be necessary to scale up the fingerprint with the lower characterizing power, e.g. by interpolation, but this alternative will only be useful if the nature of the fingerprint allows interpolation.

As already shown, there are conflicting requirements for the definition of the fingerprint mode. On the one hand, it is of great interest to achieve the greatest possible data reduction, i.e. a small fingerprint size, in order to be able to hold as many such fingerprints as possible in the memory of a computer and to make further processing more efficient.

On the other hand, with a decrease in fingerprint size, the risk of not being able to distinguish correctly between the pieces recorded in the database increases, particularly in the case of a large audio database, which may contain, for example, 500,000 titles, and in applications where the audio pieces are subject to severe distortion before detection, for example in the case of acoustic transmission of the signal or lossy compression.

It would of course be possible to define for this reason more compact fingerprint formats which are less robust and less compact formats which offer correspondingly better discriminatory characteristics, but this requires, as has been shown, that the complex fingerprint databases have to be created and stored several times, i.e. once in each format, especially since a description in a first fingerprint type cannot generally be compared with a fingerprint of another type.

To address these problems, the present invention provides a universal scalable description format which allows a flexible compromise between the characterizing power and the compactness of the fingerprint depending on the application, without losing the property of fingerprint comparability. This is preferably achieved by two-dimensional scalability, with one dimension being bandwidth scalability and the other dimension being time scalability. Generally, bandwidth scalability is based on a spectral division of the audio bandwidth, or of a part of the frequency range of the audio bandwidth, for example 250 Hz to 4 kHz, into sub-bands.

A preferred embodiment uses a band division that is at least partially logarithmic and closely follows the frequency scale or frequency resolution of human hearing for frequencies that are not too low, e.g. for frequencies greater than 500 Hz. It is preferable to use the logarithmic division only from, for example, 500 Hz upwards and to divide the range below 500 Hz into bands of equal width, e.g. into five bands of 100 Hz each.
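
The band division described above, with equal-width bands below 500 Hz and a logarithmic division above, might be sketched as follows. The upper limit of 4000 Hz and the number of three logarithmic bands are assumptions made purely for illustration, as are the function and parameter names.

```python
def band_edges(upper_hz=4000.0, log_bands=3, split_hz=500.0,
               linear_width_hz=100.0):
    """Band edges in Hz: linear below split_hz, logarithmic above.

    Assumptions: upper_hz and log_bands are illustrative values that
    are not taken from the text.
    """
    edges = []
    frequency = 0.0
    while frequency < split_hz:
        edges.append(frequency)
        frequency += linear_width_hz
    # Above the split, every band spans the same frequency ratio.
    ratio = (upper_hz / split_hz) ** (1.0 / log_bands)
    for i in range(log_bands + 1):
        edges.append(split_hz * ratio ** i)
    return edges


edges = band_edges()
```

With the default values this yields five 100 Hz bands followed by three octave-wide bands ending at approximately 1000 Hz, 2000 Hz and 4000 Hz.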

Figure 3b shows different fingerprint representations as they can be produced by different fingerprint modes. Each fingerprint representation in Figure 3b contains an identifier section indicating how many sub-bands contain fingerprint information, i.e. which fingerprint mode the fingerprint was generated in.

In the following, the function of block 24b of Fig. 2, i.e. the conversion of fingerprints from one fingerprint mode to another, is discussed with reference to Fig. 3b. Purely by way of example, it is assumed that a database fingerprint has been generated in fingerprint mode No. 4. The database therefore contains very distinctive fingerprints. It is further assumed that a search fingerprint has been generated in fingerprint mode No. 2. After the device 24a of Fig. 2 has determined, for example from the fingerprint identifier of Fig. 3b, that the search fingerprint and the database fingerprint have been generated in different fingerprint modes, the fingerprint with the higher characterizing power, i.e. the database fingerprint, is converted. In the example shown in Figure 3b, the conversion consists in no longer taking into account the fingerprint information of the third sub-band and of the fourth sub-band of the database fingerprint, so that they are not involved in the matching operation. Only the fingerprint information of the first sub-band and of the second sub-band is compared. Alternatively, the database fingerprint generated in fingerprint mode No. 4 and the search fingerprint generated in fingerprint mode No. 2 can both be converted into fingerprint mode No. 1, which is advantageous especially if rapid matching is desired.
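
The downward conversion by discarding higher sub-bands can be illustrated with a small sketch. It assumes a hypothetical fingerprint layout, a dictionary with an identifier field `num_bands` and one scalar value per sub-band, and a Euclidean distance as the matching operation; none of these details are taken from the text.

```python
def downconvert(fingerprint, target_bands):
    """Keep only the first target_bands sub-bands of a fingerprint.

    Upward conversion is deliberately not handled in this sketch.
    """
    if target_bands > len(fingerprint["bands"]):
        raise ValueError("upward conversion is not supported in this sketch")
    return {
        "num_bands": target_bands,
        "bands": fingerprint["bands"][:target_bands],
    }


def match_distance(search_fp, database_fp):
    """Convert both fingerprints to the coarser mode, then compare."""
    common = min(search_fp["num_bands"], database_fp["num_bands"])
    a = downconvert(search_fp, common)["bands"]
    b = downconvert(database_fp, common)["bands"]
    # Euclidean distance is an assumed matching measure.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Database fingerprint with four sub-bands, search fingerprint with two.
database_fp = {"num_bands": 4, "bands": [0.2, 0.5, 0.7, 0.9]}
search_fp = {"num_bands": 2, "bands": [0.2, 0.5]}
distance = match_distance(search_fp, database_fp)
```

Because the comparison always uses the smaller number of sub-bands, the same function also covers the reverse case in which the search fingerprint is the more detailed one.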

It should be noted that it is not essential for the database fingerprint to have a higher characterizing power than the search fingerprint. If, for example, only an older database with lower characterizing power exists while the search fingerprints have higher characterizing power, the reverse can be done, so that the search fingerprints are converted into a weaker but more compact form and the matching operation is then performed.

Although sub-bands 1 to 4 (30a to 30d) are shown in Fig. 3a without overlap, it should be noted that even a small overlap of the sub-bands leads to greater robustness against pitch changes. Some overlap is therefore preferred in order to increase the robustness of the representation against signal changes that involve a change in pitch, for example a sampling rate conversion or a signal played back slightly faster or slower.

Time scalability is explained below with reference to Figures 4a and 4b. By using the mean and variance to summarize a number n of individual feature values, the temporal granularity of the fingerprint can be set. A compact description chooses a larger value for n, and thus a longer temporal summary, than a more detailed but less compact description. Figure 4a shows block-based processing of an audio signal over time t, with four temporally consecutive blocks 40a to 40d shown for the sake of clarity. The blocks 40a to 40d all have the same length, i.e. the same number of samples.

If fingerprint information generated in fingerprint mode 3 is stored in a database and the search fingerprint was generated in fingerprint mode 2, the database fingerprint is converted by combining the fingerprint information of the first two blocks and then comparing the result with the first fingerprint information of the search fingerprint. This procedure is repeated for the subsequent blocks 3 and 4.

In practical applications, it is preferred to combine the fingerprint information of n blocks so that the fingerprint representation contains the mean and/or variance of the fingerprint information of the individual blocks:

Mean: $$M_n = \frac{1}{n}\sum_{i=1}^{n} F_i$$

Variance: $$V_n = \frac{1}{n}\sum_{i=1}^{n} \left(F_i - M_n\right)^2$$

In the above equations, n indicates how many pieces of fingerprint information F_i, from how many blocks or bands etc., are combined to form the mean M_n.

With reference to Figure 4b, the fingerprint information of block 1 of the fingerprint representation generated by Fingerprint Mode 3 will include the mean and/or variance of audio characteristics, the same would apply to the fingerprint information for block 2 of the fingerprint representation generated by Fingerprint Mode 3. Now, the two fingerprint information for block 1 and block 2 of the fingerprint representation generated by Fingerprint Mode 3 must be converted into fingerprint information of the fingerprint representation generated by Fingerprint Mode 2 as shown by Figure 42. Average: Variance:

The above equations apply to an example factor of 2. In them, M_n and V_n denote the mean and the variance of the fingerprint information for block 1 according to fingerprint mode 3, while M_n' and V_n' denote the mean and the variance for block 2 of the fingerprint information according to fingerprint mode 3 of Fig. 4b. If the variance is used as fingerprint information, the mean must additionally be included as fingerprint information so that the representation remains scalable.
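The conversion by a factor of 2 can be sketched as follows. The sketch assumes that both blocks summarize the same number of values and that the variance is normalized by n (population variance); under these assumptions the merged values equal those obtained by summarizing all underlying values directly. The function names `merge_pair` and `coarsen_by_two` are hypothetical.

```python
def merge_pair(m1, v1, m2, v2):
    """Merge the (mean, variance) of two adjacent blocks of equal size.

    Assumes population variance (division by n) in both inputs.
    Note that the merged variance depends on the two means, which is why
    the mean has to be available whenever the variance is to be scaled.
    """
    mean = 0.5 * (m1 + m2)
    variance = 0.5 * (v1 + v2) + 0.25 * (m1 - m2) ** 2
    return mean, variance


def coarsen_by_two(summary):
    """Merge consecutive pairs of (mean, variance) entries.

    An unpaired trailing entry is dropped in this sketch.
    """
    merged = []
    for k in range(0, len(summary) - 1, 2):
        (m1, v1), (m2, v2) = summary[k], summary[k + 1]
        merged.append(merge_pair(m1, v1, m2, v2))
    return merged


if __name__ == "__main__":
    # Illustrative values: blocks summarizing [1, 3] and [5, 7].
    print(merge_pair(2.0, 1.0, 6.0, 1.0))
```

For the illustrative values, the two finer entries summarize the values 1, 3 and 5, 7; merging them gives the same mean and variance as summarizing 1, 3, 5, 7 in one step.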

It should be noted that the fingerprint information of a fingerprint representation can be converted analogously for comparison with fingerprint information according to fingerprint mode 1.

This allows comparison of fingerprint representations of different temporal granularity, i.e. according to different fingerprint modes, e.g. by converting the finer ones into a coarser representation.
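A comparison across different temporal granularities can be sketched as follows. The sketch assumes fingerprints stored as lists of (mean, variance) pairs, a population variance, and an integer ratio between the numbers of blocks summarized per entry. The function names and the squared-difference distance are hypothetical choices for illustration and not the comparison measure of the description.

```python
def coarsen(summary, factor):
    """Merge every `factor` consecutive (mean, variance) entries into one.

    Assumes each entry summarizes the same number of values and that the
    variance is normalized by n. Incomplete trailing groups are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    usable = len(summary) - len(summary) % factor
    merged = []
    for start in range(0, usable, factor):
        chunk = summary[start:start + factor]
        mean = sum(m for m, _ in chunk) / factor
        second_moment = sum(v + m * m for m, v in chunk) / factor
        merged.append((mean, second_moment - mean * mean))
    return merged


def to_common_granularity(fp_a, n_a, fp_b, n_b):
    """Convert the finer fingerprint so that both use the coarser granularity."""
    if n_a == n_b:
        return fp_a, fp_b
    if n_a < n_b:
        if n_b % n_a:
            raise ValueError("granularities must have an integer ratio")
        return coarsen(fp_a, n_b // n_a), fp_b
    if n_a % n_b:
        raise ValueError("granularities must have an integer ratio")
    return fp_a, coarsen(fp_b, n_a // n_b)


def distance(fp_a, fp_b):
    """Sum of squared differences of means and variances (illustrative measure)."""
    return sum((ma - mb) ** 2 + (va - vb) ** 2
               for (ma, va), (mb, vb) in zip(fp_a, fp_b))


if __name__ == "__main__":
    stored = [(2.0, 1.0), (6.0, 1.0)]   # finer: n = 2 blocks per entry
    search = [(4.0, 5.0)]               # coarser: n = 4 blocks per entry
    a, b = to_common_granularity(stored, 2, search, 4)
    print(distance(a, b))
```

Only the finer fingerprint is modified; the coarser one cannot be refined, since the discarded detail is not recoverable.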

The fingerprint representation of the invention may be defined as a so-called scalable series, as described, for example, in paragraph 4.2 of ISO/IEC JTC 1/SC 29/WG11 (MPEG), Information technology - Multimedia content description interface - Part 4: Audio, 27.10.2000.

Claims

1. Method for producing a fingerprint of an audio signal using modus information (10) defining a plurality of predetermined fingerprint modi, all of the fingerprint modi relating to the same type of fingerprint, the fingerprint modi, however, providing different fingerprints scalable with regard to time and/or frequency differing from each other with regard to their data volume, on the one hand, and to their characterizing strength for characterizing the audio signal, on the other hand, the fingerprint modi differing from each other in that they include separate fingerprint information for a different number of sub-bands, or the scalable fingerprint comprising fingerprint information for a number of temporal blocks depending on the fingerprint modus, the method comprising:
setting (12) a predetermined fingerprint modus of the plurality of predetermined fingerprint modi; and

computing (14) a scalable fingerprint in accordance with the set predetermined fingerprint modus by applying computing regulations in accordance with the modus information for the set fingerprint modus.
2. Method as claimed in claim 1, wherein the fingerprint in accordance with a fingerprint modus having higher characterizing strength is convertible to a fingerprint in accordance with a fingerprint modus having lower characterizing strength.
3. Method as claimed in claim 1 or 2, further comprising:
transferring or storing the produced fingerprint via a transmission channel having a limited transmission capacity or on a storage medium having a limited storage capacity, respectively,

wherein in the step of setting (12) a fingerprint modus, the predetermined fingerprint modus is set depending on the transmission channel or the storage capacity, respectively.
4. Method as claimed in any one of the previous claims, wherein the type of fingerprint relates to the tonality properties of the audio signal.
5. Method as claimed in any one of the previous claims, wherein the audio signal may be subdivided into a predetermined number of predefined frequency bands (30a to 30d), wherein each fingerprint modus includes the production of fingerprint information per predefined frequency band, the fingerprint modi differing with regard to the number of items of fingerprint information, so that a first fingerprint modus includes, as a fingerprint, separately for each frequency band, a first number of items of fingerprint information for a first number of frequency bands, and a second fingerprint modus includes, as a fingerprint, separately for each frequency band, a second number of items of fingerprint information for a second number of frequency bands, the first number differing from the second number, and the predefined frequency bands being the same for all fingerprint modi.
6. Method as claimed in claim 5, wherein the subdivision of the audio signal into the predefined frequency bands comprises, at least partially, logarithmic band partitioning.
7. Method as claimed in claim 5 or 6, wherein two frequency bands mutually adjacent in terms of frequency have an overlap area, spectral components in the overlap area belonging to both adjacent frequency bands.
8. Method as claimed in any one of claims 5 to 7, wherein the frequency band including the lowest frequency is contained in all fingerprint modi, the fingerprint modi differing in the number of higher frequencies of subsequent frequency bands.
9. Method as claimed in any one of claims 1 to 8, wherein the audio signal may be subdivided into blocks (40a to 40d) successive in time and having a predetermined length, wherein in the production of a fingerprint, fingerprint information per block are determined, the fingerprint modi differing in the number of blocks represented by fingerprint information, and the length of the blocks being the same for all fingerprint modi.
10. Method as claimed in claim 9, wherein a first fingerprint modus includes the mean value and/or the variance of a first predefined number of blocks as fingerprint information, and a second fingerprint modus includes the mean value and/or the variance of a second predefined number of blocks, the ratio of the first predefined number to the second predefined number being an integral one.
11. Method of characterizing an audio signal, comprising:
producing a fingerprint of the audio signal using modus information (10) defining a plurality of predetermined fingerprint modi, all of the fingerprint modi relating to the same type of fingerprint, the fingerprint modi, however, providing different fingerprints scalable with regard to time and/or frequency differing from each other with regard to their data volume, on the one hand, and to their characterizing strength for characterizing the audio signal, on the other hand, the fingerprint modi differing from each other in that they include separate fingerprint information for a different number of sub-bands, or the scalable fingerprint comprising fingerprint information for a number of temporal blocks depending on the fingerprint modus, the method comprising:
setting (12) a predetermined fingerprint modus of the plurality of predetermined fingerprint modi; and

computing (14) a scalable fingerprint in accordance with the set predetermined fingerprint modus by applying computing regulations in accordance with the modus information for the set fingerprint modus;

comparing (24) the computed fingerprint with a plurality of stored fingerprints representing known audio signals to characterize the audio signal, the stored fingerprints having been produced in accordance with one of the plurality of fingerprint modi, the step of comparing (24) comprising:
examining (24a) whether the search fingerprint and the database fingerprint have been produced in accordance with different fingerprint modi;

converting (24b) the search fingerprint and/or the database fingerprint so that the fingerprints to be compared exist in accordance with the same fingerprint modus; and

performing (24c) the comparison using the fingerprints existing in the same fingerprint modus.
12. Method as claimed in claim 11, wherein each fingerprint modus includes the production of fingerprint information per predefined frequency band, the fingerprint modi differing with regard to the number of items of fingerprint information, so that a first fingerprint modus includes, as a fingerprint, separately for each frequency band, a first number of items of fingerprint information for a first number of frequency bands, and a second fingerprint modus includes, as a fingerprint, separately for each frequency band, a second number of items of fingerprint information for a second number of frequency bands, the first number differing from the second number, wherein the step of converting (24b) comprises suppressing fingerprint information for sub-bands.
13. Method as claimed in claim 11, wherein the audio signal may be subdivided into blocks (40a to 40d) successive in time and having a predetermined length, wherein in the production of a fingerprint, fingerprint information per block are determined, the fingerprint modi differing in the number of blocks represented by fingerprint information, and the length of the blocks being the same for all fingerprint modi, and wherein the step of converting (24b) comprises the step of combining the fingerprint information of blocks successive in time.
14. Method as claimed in claim 13, wherein the fingerprint information includes a mean value and/or a variance, and wherein an integer ratio exists between the blocks combined in the search fingerprint and the blocks combined in the database fingerprint.
15. Fingerprint representation for an audio signal, comprising:
a fingerprint scalable with regard to time and/or frequency, the fingerprint being configured in accordance with one of a plurality of predetermined fingerprint modi, all of the fingerprint modi relating to the same type of fingerprint, the fingerprint modi, however, providing different fingerprints scalable with regard to time and/or frequency differing from each other with regard to their data volume, on the one hand, and to their characterizing strength for characterizing the audio signal, on the other hand, the scalable fingerprint comprising separate fingerprint information for separate sub-bands of the audio signal, the fingerprint modi differing from each other in that they include separate fingerprint information for a different number of sub-bands, or the scalable fingerprint comprising fingerprint information for a number of temporal blocks depending on the fingerprint modus; and

an indicator (31) indicating the fingerprint modus underlying the fingerprint.
16. Apparatus for producing a fingerprint of an audio signal using modus information (10) defining a plurality of predetermined fingerprint modi, all of the fingerprint modi relating to the same type of fingerprint, the fingerprint modi, however, providing different fingerprints scalable with regard to time and/or frequency differing from each other with regard to their data volume, on the one hand, and to their characterizing strength for characterizing the audio signal, on the other hand, the fingerprint modi differing from each other in that they include separate fingerprint information for a different number of sub-bands, or the scalable fingerprint comprising fingerprint information for a number of temporal blocks depending on the fingerprint modus, the apparatus comprising:
means for setting (12) a predetermined fingerprint modus of the plurality of predetermined fingerprint modi; and

means for computing (14) a scalable fingerprint in accordance with the set predetermined fingerprint modus by applying computing regulations in accordance with the modus information for the set fingerprint modus.
17. Apparatus for characterizing an audio signal, comprising:
means for producing a search fingerprint of the audio signal using modus information (10) defining a plurality of predetermined fingerprint modi, all of the fingerprint modi relating to the same type of fingerprint, the fingerprint modi, however, providing different fingerprints scalable with regard to time and/or frequency differing from each other with regard to their data volume, on the one hand, and to their characterizing strength for characterizing the audio signal, on the other hand, the fingerprint modi differing from each other in that they include separate fingerprint information for a different number of sub-bands, or the scalable fingerprint comprising fingerprint information for a number of temporal blocks depending on the fingerprint modus, the means comprising:
means for setting (12) a predetermined fingerprint modus of the plurality of predetermined fingerprint modi; and

means for computing (14) a scalable fingerprint in accordance with the set predetermined fingerprint modus by applying computing regulations in accordance with the modus information for the set fingerprint modus;

means for comparing the computed fingerprint with a plurality of stored fingerprints representing known audio signals to characterize the audio signal, the stored fingerprints having been produced in accordance with one of the plurality of fingerprint modi, the means comprising:
means for examining (24a) whether the search fingerprint and the database fingerprint have been produced in accordance with different fingerprint modi;

means for converting (24b) the search fingerprint and/or the database fingerprint so that the fingerprints to be compared exist in accordance with the same fingerprint modus; and

means for performing (24c) the comparison using the fingerprints existing in the same fingerprint modus.