WO2009001202A1 - Methods and systems for music similarity, including the use of descriptors - Google Patents
Methods and systems for music similarity, including the use of descriptors
- Publication number
- WO2009001202A1 (PCT/IB2008/001669)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- descriptor
- hpcp
- audio piece
- piece
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/36—Accompaniment arrangements
- G10H1/38—Chord
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/066—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/076—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction of timing, tempo; Beat detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2240/00—Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
- G10H2240/121—Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
- G10H2240/131—Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/131—Mathematical functions for musical analysis, processing, synthesis or composition
- G10H2250/215—Transforms, i.e. mathematical transforms into domains appropriate for musical signal processing, coding or compression
- G10H2250/235—Fourier transform; Discrete Fourier Transform [DFT]; Fast Fourier Transform [FFT]
Definitions
- the present invention relates generally to the field of audio classification systems and techniques for determining similarities between audio pieces. More specifically, the present invention relates to systems and methods for providing a music similarity framework that can be utilized to extract features or sets of features from an audio piece based on descriptors, and for performing content-based audio classification.
- Audio classification systems typically utilize a front-end system that extracts acoustic features from an audio signal, and machine-learning or statistical techniques for classifying an audio piece according to a given criterion such as genre, tempo, or key.
- Some content-based audio classification systems are based on extracting spectral features within the audio signal, often on a frame-by-frame basis.
- spectral features may be extracted from the audio file using statistical techniques that employ Short-Time Fourier Transforms (STFTs) and Mel-Frequency Cepstrum Coefficients (MFCCs).
- the audio classification system may be tasked to find similarities between multiple audio pieces.
- One situation in which a song or musical piece may have different versions available includes the digital remastering of an original master version.
- Other cases in which different versions of the same song or musical piece may be available include the recording of a live track from a live performance, a karaoke version of the song translated to a different language, or an acoustic track or remix in which one or more instruments have been changed in timbre, tempo, pitch, etc. to create a new song.
- an artist may perform a cover version of a particular song that may have differing levels of musical similarity (e.g., style, harmonization, instrumentation, tempo, structure, etc.) between the original and cover version.
- the degree of disparity between these different aspects often establishes a vague boundary between what is considered a cover version of the song or an entirely different version of the song.
- the evaluation of similarity measures in music is often a difficult task, particularly in view of the large quantity of music currently available and the different musical, cultural, and personal aspects associated with music. This process is often exacerbated in certain genres of music where the score is not available, as is the case in most popular music.
- Another factor that becomes relevant when analyzing large amounts of data is the computational cost of the algorithms used in detecting music similarities.
- the algorithm performing the similarity analysis should be capable of quickly analyzing large amounts of data, and should be robust enough to handle real situations where vast differences in musical styles are commonplace.
- the present invention pertains to systems and methods for providing a music similarity framework that can be utilized to extract features or sets of features from an audio piece based on descriptors, and for performing content- based audio classification.
- An illustrative method for determining similarity between two or more audio pieces may include the steps of extracting one or more descriptors from each of the audio pieces, generating a vector for each of the audio pieces, extracting one or more audio features from each of the audio pieces, calculating values for each audio feature, normalizing the values for each audio feature, calculating a distance between a vector containing the normalized values and the vectors generated for the audio pieces, and outputting a result to a user or process.
- the descriptors extracted from the audio pieces can include dissonance descriptors, tonal descriptors, rhythm descriptors, and/or spatial descriptors.
- An illustrative tonal descriptor, for example, is a Harmonic Pitch Class Profile (HPCP) vector, which in some embodiments can be used to provide key estimation and tracking, chord estimation, and/or to perform a music similarity analysis between audio pieces.
- An illustrative music processing system in accordance with an embodiment of the present invention can include an input device for receiving an audio signal containing an audio piece, a tonality analysis module configured to extract tonal features from the audio signal, a data storing device adapted to store the extracted tonal features, a tonality comparison device configured to compare the extracted tonal features to tonal features from one or more reference audio pieces stored in memory, and an interface for providing a list of audio pieces to a user.
- the music processing system may utilize one or more descriptors to classify the audio piece and/or to perform a music similarity analysis on the audio piece.
- the music processing system can be tasked to determine whether the audio piece is a cover version of at least one of the reference audio pieces.
- FIG. 1 is a block diagram of a music processing system in accordance with an illustrative embodiment
- Figure 2 is a block diagram showing an illustrative method of computing a Harmonic Pitch Class Profile (HPCP) vector
- Figure 3 is a block diagram showing an illustrative implementation of computing an HPCP vector using band preset and frequency filtering
- Figure 4 is a block diagram showing an illustrative method for providing linear mapping to an amplitude normalized HPCP vector
- Figure 5 is a block diagram of an illustrative method of obtaining a tonality estimate using an HPCP vector
- Figure 6 is a block diagram of an illustrative method of determining an equal tempered deviation descriptor
- Figure 7 is a block diagram of an illustrative music similarity system for finding similarities between songs
- Figures 8-10 are flow charts showing an illustrative method for detecting cover songs using the music similarity system of Figure 7;
- Figure 11 is a block diagram showing an illustrative input and output of the description extractors module of Figure 7;
- Figure 12 is a block diagram showing the refinement of the HPCP matrix using the HPCP post processing module of Figure 7;
- Figure 13 is a block diagram showing the calculation of a global HPCP using the HPCP averaging module of Figure 7;
- Figure 14 is a block diagram showing the calculation of a transposition index using the transposition module of Figure 7;
- Figure 15 is a block diagram showing the calculation of a similarity matrix using the similarity matrix creation module of Figure 7;
- Figure 16 is a block diagram showing the calculation of a local alignment matrix using the dynamic programming local alignment module of Figure 7;
- Figure 17 is a block diagram showing the calculation of song alignments and song distance using the alignment analysis module and score post-processing module of Figure 7;
- Figure 18 is a flow chart showing an illustrative backtracking algorithm that can be used to determine a similarity path
- Figure 19 is an illustrative dendrogram showing the distances between a set of known songs
- Figure 20 is a flow chart showing an illustrative method of determining whether an audio piece belongs to a western or non-western musical genre
- Figure 21 is a flow chart showing an illustrative method of determining the dissonance of an audio piece
- Figure 22 is a graph showing the relationship between critical bandwidth or bark values and dissonance/consonance
- Figure 23 is a flow chart showing an illustrative method of determining the dissonance between chords in an audio piece
- Figure 24 is a flow chart showing an illustrative method of computing the spectral complexity of an audio piece
- Figure 25 is a flow chart showing an illustrative method of determining the onset rate of an audio piece
- Figure 26 is a flow chart showing an illustrative method of determining the beats-per-minute (BPM) of an audio piece
- Figure 27 is a flow chart showing an illustrative method for calculating the beats loudness and/or bass beats loudness of an audio piece
- Figure 28 is a flow chart showing an illustrative method of computing a rhythmic intensity descriptor
- Figure 29 is a block diagram of an illustrative audio classification system for combining spectral features with spatial features within an audio piece;
- Figure 30 is a flow chart showing an illustrative method of extracting panning coefficients from the audio piece using the audio classification system of Figure 29;
- Figure 31 is a graph showing a warping function applied to the azimuth angle of the histogram
- Figure 32 is a graph showing a representation of the extracted panning coefficients for a jazz recording
- Figure 33 shows a representation of the extracted panning coefficients for a Pop-Rock recording
- Figure 34 is a flow chart showing an illustrative method for determining musical similarity between audio pieces.
- the music processing system 10 includes a microphone input device 12, a line input device 14, a music input device 16, an input operation device 18, an input selector switch 20, an analog-to-digital (A/D) converter 22, a tonality analysis device 24, data storing devices 26,28, a temporary memory 30, a tonality comparison device 32, a display device 34, a music reproducing device 36, a digital-to-analog (D/A) converter 38, and a speaker 40.
- the microphone input device 12 can collect a music audio signal with a microphone and output an analog audio signal representing the collected music audio signal.
- the line input device 14 can be connected to a disc player, tape recorder, or other such device so that an analog audio signal containing an audio piece can be input.
- the music input device 16 may be, for example, an MP3 player or other digital audio player (DAP) connected to the tonality analysis device 24 and the data storing device 28 to reproduce a digitized audio signal, such as a PCM signal.
- the input operation device 18 can be a device for a user or process to input data or commands to the system 10.
- the output of the input operation device 18 is connected to the input selector switch 20, the tonality analysis device 24, the tonality comparison device 32, and the music reproducing device 36.
- the input selector switch 20 can be used to selectively supply one of the output signals from the microphone input device 12 and the line input device 14 to the A/D converter 22. In some embodiments, the input selector switch 20 may operate in response to a command from the input operation device 18.
- the A/D converter 22 is connected to the tonality analysis device 24 and the data storing device 26, and is configured to digitize an analog audio signal and supply the digitized audio signal to the data storing device 28 as music data.
- the data storing device 26 stores the music data supplied from the A/D converter 22 and the music input device 16.
- the data storing device 26 may also provide access to digitized audio stored in a computer hard drive or other suitable storage device.
- the tonality analysis device 24 can be configured to extract tonal features from the supplied music data by executing a tonality analysis operation described further herein.
- the tonal features obtained from the music data are stored in the data storing device 28.
- a temporary memory 30 is used by the tonality analysis device 24 to store intermediate information.
- the display device 34 displays a visualization of the tonal features extracted by the tonality analysis device 24.
- the tonality comparison device 32 can be tasked to compare tonal features within a search query to the tonal features stored in the data storing device 28.
- a set of tonal features with high similarities to the search query may be detected by the tonality comparison device 32.
- when a search query is used to detect a cover version of a particular song, for example, a set of tonal features with high similarities may be detected within the song via the tonality comparison device 32, indicating the likelihood that the song is a cover version.
- the display device 34 may then display a result of the comparison as a list of audio pieces.
- the music reproducing device 36 reads out the data file of the audio pieces detected as showing the highest similarity by the tonality comparison device 32 from the data storing device 26, reproduces the data and outputs the data as a digital audio signal.
- the D/A converter 38 converts the digital audio signal reproduced by the music reproducing device 36 into an analog audio signal, which may then be delivered to a user via the speaker 40.
- the tonality analysis device 24, the tonality comparison device 32, and the music reproducing device 36 may each operate in response to a command from the input operation device 18.
- the input operation device 18 may comprise a graphical user interface (GUI), keyboard, touchpad, or other suitable interface that can be used to select the particular input device 12,14,16 to receive an audio signal, to select one or more search queries for analysis, or to perform other desired tasks.
- the music processing system 10 is configured to automatically extract semantic descriptors in order to analyze the content of the music.
- exemplary descriptors that can be extracted include, but are not limited to, tonal descriptors, dissonance or consonance descriptors, rhythm descriptors, and spatial descriptors.
- Exemplary tonal descriptors can include, for example, a Harmonic Pitch Class Profile (HPCP) descriptor, a chord detection descriptor, a key detection descriptor, a local tonality detection descriptor, a cover song detection descriptor, and a western/non-western music detection descriptor.
- Exemplary dissonance or consonance descriptors can include a dissonance descriptor, a dissonance of chords descriptor, and a spectral complexity descriptor.
- Exemplary rhythm descriptors can include an onset rate descriptor, a beats per minute descriptor, a beats loudness descriptor, and a bass beats loudness descriptor.
- An example spatial descriptor can include a panning descriptor.
- the descriptors used to extract musical content from an audio piece can be generated as derivations and combinations of lower-level descriptors, and as generalizations induced from manually annotated databases by the application of machine-learning techniques.
- Some of the musical descriptors can be classified as instantaneous descriptors, which relate to an analysis frame representing a minimum temporal unit of the audio piece.
- An exemplary instantaneous descriptor may be the fundamental frequency, pitch class distribution, or chord of an analysis frame.
- Other musical descriptors are related to a certain segment of the musical piece, or to global descriptors relating to the entire musical piece (e.g., global pitch class distribution or key).
- An example global descriptor may be, for example, a phrase or chorus of a musical piece.
- the music processing system 10 may utilize tonal descriptors to automatically extract a set of tonal features from an audio piece.
- the tonal descriptors can be used to locate cover versions of the same song or to detect the key of a particular audio piece.
- the music processing system 10 can be configured to compute a Harmonic Pitch Class Profile (HPCP) vector of each audio piece.
- the HPCP vector may represent a low-level tonal descriptor that can be used to provide key estimation and tracking, chord estimation, and to perform a music similarity analysis between audio pieces.
- a correlation between HPCP vectors can be used to identify versions of the same song by computing similarity measures for each song.
- a set of features representative of the pitch class distribution of the music can be extracted.
- the pitch-class distribution of the music can be related, either directly or indirectly, to the chords and the tonality of a piece, and is general to all types of music. Chords can be recognized from the pitch-class distribution without precisely detecting which notes are played in the music. Tonality can also be estimated from the pitch-class distribution without a previous chord-estimation procedure. These features can be also used to determine music similarity between pieces.
- the pitch class descriptors may fulfil one or more requirements in order to reliably extract information from the audio signal.
- the pitch class descriptors may, for example: take into consideration the pitch class distribution in both monophonic and polyphonic audio signals; account for the presence of harmonic frequencies within the audio signal; be robust to ambient noise (e.g., noise occurring during live recordings, percussive sounds, etc.); be independent of timbre and of the instruments played, such that the same piece played with different instruments has the same tonal description; be independent of loudness and dynamics within the piece; and be independent of tuning, such that the reference frequency within the piece can differ from the standard A reference frequency (i.e., 440 Hz).
- the pitch class descriptors can also exhibit other desired features.
- an HPCP vector can be computed over three main stages, including a pre-processing stage (block 52), a frequency to pitch-class mapping stage (block 54), and a post-processing stage (block 56).
- the pre-processing stage (block 52) can include the step of performing a frequency quantization in order to obtain a spectral analysis on the audio piece. In certain embodiments, for example, a spectral analysis using a Discrete Fourier Transform is performed.
- the frequency quantization can be computed with long frames of 4096 samples at a 44.1 kHz sampling rate, a hop size of 2048, and windowing.
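- As an illustration of this pre-processing stage, the following minimal C++ sketch frames the signal into 4096-sample windows with a hop size of 2048 at 44.1 kHz, applies a window, and computes the spectrum of each frame. The Hann window choice and the naive dft() helper are illustrative assumptions rather than the patent's exact implementation.
#include <cmath>
#include <complex>
#include <vector>

// Naive DFT of one windowed frame (an FFT library would normally be used instead).
std::vector<std::complex<double>> dft(const std::vector<double>& frame) {
    const size_t N = frame.size();
    std::vector<std::complex<double>> spectrum(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t n = 0; n < N; ++n)
            spectrum[k] += frame[n] * std::polar(1.0, -2.0 * M_PI * k * n / N);
    return spectrum;
}

// Cut the 44.1 kHz signal into overlapping frames of 4096 samples (hop size 2048),
// apply a window and return the spectrum of each frame.
std::vector<std::vector<std::complex<double>>> analyze(const std::vector<double>& audio) {
    const size_t frameSize = 4096, hopSize = 2048;
    std::vector<std::vector<std::complex<double>>> spectra;
    for (size_t start = 0; start + frameSize <= audio.size(); start += hopSize) {
        std::vector<double> frame(frameSize);
        for (size_t n = 0; n < frameSize; ++n) {
            double w = 0.5 - 0.5 * std::cos(2.0 * M_PI * n / (frameSize - 1)); // Hann window
            frame[n] = audio[start + n] * w;
        }
        spectra.push_back(dft(frame));
    }
    return spectra;
}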
- the spectrum obtained via the spectral analysis can be normalized according to its spectral envelope in order to convert it to a flat spectrum.
- notes in high octaves thereby contribute equally to the final HPCP vector as notes in the low pitch range, so that the results are not influenced by different equalization procedures.
- a peak detection step is then performed on the spectra wherein the local maxima of the spectra (representing the harmonic part of the spectrum) are extracted.
- a global tuning frequency value is then determined from the spectral peaks.
- a global tuning frequency value may be determined by computing the deviation of the frequency values with respect to the A440 Hz reference frequency, mapped to a fraction of a semitone, and then computing a histogram of these values.
- the tuning frequency which is assumed to be constant for a given musical piece, can then be defined by the maximum value of the histogram, as further discussed herein.
- a value can then be computed for each analysis frame in a given segment of the piece, and a global value computed by building a histogram of frame values and selecting the value corresponding to the maximum of the histogram.
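- A minimal sketch of this tuning estimation is given below, assuming the per-frame spectral peak frequencies are already available. Each peak is mapped to its deviation from the nearest equal-tempered semitone of the A440 grid, the deviations are accumulated into a histogram across frames, and the histogram maximum gives the global tuning. The bin count and helper names are illustrative assumptions.
#include <algorithm>
#include <cmath>
#include <vector>

// Deviation (in fractions of a semitone, range [-0.5, 0.5)) of a frequency
// from the closest note of an equal-tempered scale tuned to A = 440 Hz.
double semitoneDeviation(double freqHz) {
    double semis = 12.0 * std::log2(freqHz / 440.0);
    return semis - std::round(semis);
}

// Build a histogram of deviations over all frames and take its maximum as the
// global tuning deviation (assumed constant over the musical piece).
double estimateTuningFrequency(const std::vector<std::vector<double>>& peakFreqsPerFrame,
                               int numBins = 100) {
    std::vector<int> histogram(numBins, 0);
    for (const auto& framePeaks : peakFreqsPerFrame)
        for (double f : framePeaks) {
            double d = semitoneDeviation(f);                               // in [-0.5, 0.5)
            int bin = std::min(numBins - 1, static_cast<int>((d + 0.5) * numBins));
            ++histogram[bin];
        }
    int best = static_cast<int>(std::max_element(histogram.begin(), histogram.end())
                                - histogram.begin());
    double deviation = (best + 0.5) / numBins - 0.5;    // bin centre, back in semitones
    return 440.0 * std::pow(2.0, deviation / 12.0);     // global tuning frequency in Hz
}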
- a band preset/frequency filtering step can then be performed separately for the high frequency band (peaks at frequencies higher than 500 Hz) and for the low frequency band (peaks at frequencies lower than 500 Hz).
- these two frequency bands are processed separately, and the results are then normalized such that they are equally important to the HPCP computation.
- Such normalization may account for, for example, the predominance of the lower frequencies in the HPCP computation due to their higher energy.
- an HPCP vector can be computed based on the global tuning frequency determined during the pre-processing stage (block 52).
- the HPCP vector can be defined generally by the following equation:
HPCP(n) = Σ_{i=1}^{nPeaks} w(n, f_i) · a_i² ,  n = 1, ..., size    (4)
- a_i corresponds to the magnitude (in linear units) of a spectral peak;
- f_i corresponds to the frequency (in Hz) of a spectral peak; and
- nPeaks corresponds to the number of peaks detected in the peak detection step.
- the w(n, f_i) function in equation (4) above can be defined as a weighting window (a cosine window) for the frequency contribution.
- each frequency f_i contributes to the HPCP bin(s) that are contained in a certain window around this frequency value.
- the contribution of the peak i (the square of the peak linear amplitude, a_i²) is weighted by w(n, f_i).
- the value of the weight depends on the distance d, measured in semitones, between f_i and the center frequency f_n of the bin n, as can be seen from the following equations:
d = 12 · log2(f_i / f_n) + 12 · m
w(n, f_i) = cos²( (π/2) · d / (0.5 · l) )  if |d| ≤ 0.5 · l,  and 0 otherwise
- where m is the integer that minimizes the modulus of the distance |d|, and l is the window size in semitones.
- the weighting window minimizes the estimation errors that can occur when there are tuning differences and inharmonicity present in the spectrum.
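- A short sketch of how a single spectral peak can contribute to the HPCP bins through such a weighting window follows. The squared-cosine shape, the default window size of 4/3 semitones, and the bin layout are illustrative assumptions consistent with the description above, not the patent's exact parameters.
#include <cmath>
#include <vector>

// Add the contribution of one spectral peak (frequency fi in Hz, linear amplitude ai)
// to an HPCP vector of hpcp.size() bins, using a squared-cosine weighting window of
// 'l' semitones centred on each bin.
void addPeakToHPCP(std::vector<double>& hpcp, double fi, double ai,
                   double tuningHz = 440.0, double l = 4.0 / 3.0) {
    const int size = static_cast<int>(hpcp.size());
    for (int n = 0; n < size; ++n) {
        // Distance in semitones between the peak and the centre of bin n,
        // folded to the closest octave (the integer m that minimises |d|).
        double d = 12.0 * std::log2(fi / tuningHz) - n * 12.0 / size;
        d = d - 12.0 * std::round(d / 12.0);
        if (std::fabs(d) <= 0.5 * l) {
            double w = std::cos(M_PI * d / l);   // weighting window
            hpcp[n] += w * w * ai * ai;          // squared peak amplitude, weighted
        }
    }
}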
- a weighting procedure can also be employed to take into consideration the contribution of the harmonics to the pitch class of its fundamental frequency.
- each peak frequency f_i contributes to the frequencies having f_i as a harmonic frequency (f_i, f_i/2, f_i/3, f_i/4, ..., f_i/n).
- the contribution decreases along the list of harmonics, with progressively smaller weights assigned to higher harmonic numbers.
- the interval resolution selected by the user is directly related to the size of the HPCP vector. For example, an interval resolution of one semitone or 100 cents would yield a vector size of 12, while one third of a semitone or 33 cents would yield a vector size of 36, and so forth.
- the interval resolution influences the frequency resolution of the HPCP vector. As the interval resolution increases, it is generally easier to distinguish frequency details such as vibrato or glissando, and to differentiate voices in the same frequency range. Typically, a high frequency resolution is desirable when analyzing expressive frequency evolutions. On the other hand, increasing the interval resolution also increases the quantity of data and the computation cost.
- an amplitude normalization step is performed so that every element in the HPCP vector is divided by the maximum value such that the maximum value equals 1.
- the two HPCP vectors corresponding to the high and low frequencies, respectively, are then added up and normalized with respect to each other.
- a non-linear mapping function may then be applied to the normalized vector. In some embodiments, for example, the following non-linear mapping function may be applied to the HPCP vector:
for (int k = 0; k < size; k++) {           // non-linear mapping of each HPCP bin
    HPCP[k] = sinf(HPCP[k] * PI * 0.5f);   // sine function
    HPCP[k] *= HPCP[k];                    // squaring
    if (HPCP[k] < 0.6f)                    // mapping applied below the 0.6 factor
        HPCP[k] *= (HPCP[k] / 0.6f) * (HPCP[k] / 0.6f);
}
- FIG. 3 is a block diagram showing an illustrative implementation of the method 50 of Figure 2 using band preset and frequency filtering to obtain an HPCP vector.
- the HPCP vector can be obtained by performing frequency to pitch mapping, considering the estimated tuning frequency (e.g., A440) and its harmonic frequencies (blocks 58 and 60).
- a weighting technique may be used to make the harmonics contribute to the pitch class of their fundamental frequency, as discussed above. From this mapping, an HPCP high frequency value (block 62) and an HPCP low frequency value (block 64) can be computed using selected peaks within a given frequency range.
- the selection of a high frequency range between about 500 Hz and about 5 kHz can be used in computing the HPCP high frequency value (block 62), whereas the selection of a low frequency range between about 40 Hz and about 500 Hz can be used in computing the HPCP low frequency value (block 64).
- Other ranges are possible, however.
- FIG. 4 is a block diagram showing an illustrative method 72 for providing non-linear mapping to the amplitude normalized HPCP vector.
- Figure 4 may correspond, for example, to one or more of the steps (e.g., block 70) shown in Figure 3.
- the method 72 can include the steps of applying a sine function to the HPCP vector (block 74), applying a squaring function (block 76), comparing the result against a factor (block 78), and applying a mapping to the result (block 80) if the result is less than the factor.
- Figure 5 is a block diagram showing an illustrative method 82 for obtaining a tonality estimation using the HPCP vector.
- the average of the HPCP vector (block 90) can be computed for all of the audio piece in order to estimate the global tonality, or alternatively for a certain segment of the audio piece in order to obtain a sequence of tonality values, one for each segment (for tracking the evolution of the tonality).
- the method 82 includes the definition of two vectors, a major and minor tonal profile (block 84), which are adapted (block 86) to generate a key profile matrix (block 88).
- This key profile matrix contains the tonal profile vector for each of the 24 possible keys.
- the HPCP average vector 90 is compared with each of the vectors of the key profile matrix in a similarity computation process (block 90), which results in the creation of a similarity matrix (block 94).
- the estimated tonality may be defined by the note name (i.e., pitch class), the mode (major or minor), and the strength (block 98), which is equal to the value of the similarity matrix.
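- A compact sketch of this key estimation step is given below, assuming the 24-row key profile matrix (one tonal profile per key) and the averaged HPCP vector are already available; correlation is used here as the similarity measure, which is one common choice rather than the only possibility.
#include <cmath>
#include <vector>

struct KeyEstimate { int key; double strength; };   // key index 0..23: 0..11 major, 12..23 minor

// Correlate the averaged HPCP vector with each of the 24 key profiles and
// return the best-matching key and its strength (the correlation value).
KeyEstimate estimateKey(const std::vector<double>& avgHPCP,
                        const std::vector<std::vector<double>>& keyProfiles) {
    auto correlation = [](const std::vector<double>& a, const std::vector<double>& b) {
        double meanA = 0, meanB = 0;
        for (size_t i = 0; i < a.size(); ++i) { meanA += a[i]; meanB += b[i]; }
        meanA /= a.size(); meanB /= b.size();
        double num = 0, da = 0, db = 0;
        for (size_t i = 0; i < a.size(); ++i) {
            num += (a[i] - meanA) * (b[i] - meanB);
            da  += (a[i] - meanA) * (a[i] - meanA);
            db  += (b[i] - meanB) * (b[i] - meanB);
        }
        return num / std::sqrt(da * db);
    };
    KeyEstimate best{0, -1.0};
    for (size_t k = 0; k < keyProfiles.size(); ++k) {
        double s = correlation(avgHPCP, keyProfiles[k]);
        if (s > best.strength) best = {static_cast<int>(k), s};
    }
    return best;
}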
- the descriptors that can be derived from the HPCP vector can include, but are not limited to, diatonic strength, local tonality (key), tonality (key/chord), key and chord histograms, equal tempered deviations, and a non tempered/tempered energy ratio.
- the diatonic strength is the key strength resulting from the key estimation algorithm, but using a diatonic profile. It may be the maximum correlation with a diatonic major or minor profile, representing the chromaticity of the musical piece.
- the local tonality (key) descriptor provides information about the temporal evolution of the tonality.
- the tonality of the audio piece can be estimated in segments using a sliding window approach in order to obtain a key value for each segment representing the temporal evolution of the tonality of the piece.
- the tonality (key/chord) contour descriptor is a relative contour representing the distance between consecutive local tonality values. Pitch intervals are often preferred to absolute pitch in melodic retrieval and similarity applications since melodic perception is invariant to transposition. For different versions of the same song that can be transposed to adapt the song to a singer or instrument tessitura, for example, the tonality contour descriptor may permit a relative representation of the key evolution of the song.
- the distance between consecutive tonalities can be measured in the circle of fifths: a transition from C major to F major may be represented by -1, a transition from C major to A minor by 0, a transition from C major to D major by +2, etc.
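- The sketch below illustrates one way such a circle-of-fifths distance between consecutive tonalities could be computed. The mapping of minor keys to their relative major (consistent with the C major to A minor example) and the choice of the shortest signed distance are assumptions, since the exact convention is not spelled out here.
// Pitch classes: 0 = C, 1 = C#, ..., 11 = B.  A key is a pitch class plus a mode.
struct Key { int pitchClass; bool minor; };

// Signed distance, in steps along the circle of fifths, between two consecutive
// local tonalities.  Minor keys are mapped to their relative major (A minor -> C major),
// so C major -> A minor gives 0, C -> F gives -1, C -> D gives +2.
int circleOfFifthsDistance(Key from, Key to) {
    auto toFifthsIndex = [](Key k) {
        int pc = k.minor ? (k.pitchClass + 3) % 12 : k.pitchClass; // relative major
        return (pc * 7) % 12;   // position on the circle of fifths (C=0, G=1, D=2, ...)
    };
    int d = (toFifthsIndex(to) - toFifthsIndex(from)) % 12;
    if (d > 6)  d -= 12;        // take the shortest way around the circle
    if (d < -6) d += 12;
    return d;
}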
- Key and chord histograms can be derived from the local tonality computed over segments of an audio piece.
- the total number of different tonalities (i.e., keys/chords) present in an audio piece can be determined.
- the most repeated tonality and the tonality change rate for an audio piece can also be determined.
- the equal-tempered deviation descriptor can be used to measure the deviation of the local maxima of an HPCP vector from the equal-tempered bins. An illustrative method 100 of computing an equal-tempered deviation descriptor is shown in Figure 6.
- a non-tempered/tempered energy ratio between the amplitude of the non-tempered bins of the HPCP vector and the total energy can also be determined using the HPCP vector.
- the HPCP vector should be computed with a high interval resolution (e.g., 120 bins, or 10 cents per bin).
- a chord is a combination of three or more notes that are played simultaneously or almost simultaneously.
- the sequence of chords that forms an audio piece is extremely useful for characterizing a song.
- the detection of chords within an audio piece may begin by obtaining an HPCP 36-bin feature vector per frame representing the tonality statistics of the frame. Then, the HPCP is averaged over a 2-second window. At a sampling rate of 44,100 Hz and with frames of 4096 samples with a 50% overlap, 2 seconds corresponds to 43 frames. Thus, each element of the HPCP is averaged with the same element (i.e., at the same position in the vector) of the subsequent 42 frames.
- the chord corresponding to the averaged HPCP vector is extracted by correlating the average HPCP vector with a set of tonic triad tonal profiles. These tonal profiles can be computed after listening tests and refined to work with the HPCP. The process can then be repeated for each successive frame in the audio piece, thus producing a sequence of chords within the audio piece. If, instead of averaging HPCP vectors over a 2-second window, they are averaged over the whole duration of the audio piece, the result of the correlation with a set of tonal profiles would produce the estimated key for the audio piece.
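- A rough sketch of this chord estimation by sliding HPCP averaging follows: 43 consecutive 36-bin HPCP frames are averaged element-wise, and the average is compared against circular shifts of a major and a minor tonic-triad profile. The use of a plain dot product and the unspecified profile contents are illustrative assumptions; the listening-test-derived profiles are not reproduced here.
#include <algorithm>
#include <vector>

// Average 'span' consecutive HPCP frames starting at 'start' (element-wise).
std::vector<double> averageHPCP(const std::vector<std::vector<double>>& hpcpFrames,
                                size_t start, size_t span = 43) {
    std::vector<double> avg(hpcpFrames[start].size(), 0.0);
    size_t end = std::min(start + span, hpcpFrames.size());
    for (size_t f = start; f < end; ++f)
        for (size_t k = 0; k < avg.size(); ++k) avg[k] += hpcpFrames[f][k];
    for (double& v : avg) v /= (end - start);
    return avg;
}

// Compare the averaged HPCP with every semitone shift of a major and a minor
// tonic-triad profile (36 bins each) and return the best match (0..11 major, 12..23 minor).
int estimateChord(const std::vector<double>& avg,
                  const std::vector<double>& majorProfile,
                  const std::vector<double>& minorProfile) {
    const size_t N = avg.size();
    double best = -1.0; int bestChord = 0;
    for (size_t shift = 0; shift < 12; ++shift) {
        double sMaj = 0.0, sMin = 0.0;
        for (size_t k = 0; k < N; ++k) {
            size_t src = (k + shift * (N / 12)) % N;   // shift the profile by 'shift' semitones
            sMaj += avg[k] * majorProfile[src];
            sMin += avg[k] * minorProfile[src];
        }
        if (sMaj > best) { best = sMaj; bestChord = static_cast<int>(shift); }       // major
        if (sMin > best) { best = sMin; bestChord = static_cast<int>(shift) + 12; }  // minor
    }
    return bestChord;
}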
- FIG. 7 is a block diagram of an illustrative music similarity system 110 for finding similarities between songs.
- the system 110 includes an HPCP descriptors extraction module 112, an HPCP matrix postprocessing module 114, an HPCP averaging module 116, a transposition module 118, a similarity matrix creation module 120, a dynamic programming local alignment module 122, an alignment analysis module 124, and a score postprocessing module 126.
- An illustrative method 128 for detecting cover songs using the music similarity system 110 of Figure 7 will now be described with respect to Figures 8-18.
- the method 128 may begin generally at block 130 with the step of inputting two audio files each representing a song 132,134.
- the songs 132,134 may represent, for example, an original version of a song and a cover version of the song.
- Each of the audio files containing the songs 132,134 is transmitted to the HPCP descriptors extraction module 112, which calculates an HPCP vector (138,140) for each of the audio files (block 136).
- each of the HPCP vectors is inputted to the HPCP matrix post-processing module 114, which refines the HPCP vectors 138,140 and provides both refined vectors 144,146 to the similarity matrix creation module 120.
- the HPCP vectors 138,140 are also transmitted to the HPCP averaging module 116, which calculates (block 144) a global HPCP 150,152 from each HPCP vector 138,140.
- the global HPCPs 150,152 are then provided to the transposition module 118, which calculates (block 154) a transposition index 156 that is transmitted to the similarity matrix creation module 120.
- the similarity matrix creation module 120 receives as inputs the two HPCP matrices 144,146, one for each of the audio files 132,134, and the transposition index 156, which are used to produce a similarity matrix 160.
- the similarity matrix 160 is then transmitted to the dynamic programming local alignment module 122, which calculates (block 162) a matrix of local alignments 164 between the two songs 132,134 that is transmitted to the alignment analysis module 124.
- the alignment analysis module 124 then processes the matrix 164 to produce an alignment score 168.
- the alignment score 168 is transmitted to the score post-processing module 126 that gives the final output, which in some embodiments is a song distance value 174 for the audio files 132,134.
- FIG 11 is a block diagram showing an illustrative input and output of the descriptors extraction module 112. As shown in Figure 11 , the module 112 takes as an input an audio file 176.
- the audio file 176 may comprise a song in a WAV, 44,100 Hz, 16-bit format, although other formats are possible.
- Each song 132,134 from the audio file 176 is decomposed into short overlapping frames, with frame-lengths ranging from 25 ms to 500 ms. For example, a frame-length of 96 ms with 50% overlapping can be utilized. Then, the spectral information of each frame is processed to obtain a Harmonic Pitch Class Profile (HPCP) 178, a 36-bin feature vector representing the tonality statistics of the frame. Then, the feature vectors are normalized by dividing every component of the vector by the maximum value of the vector, i.e., h̄[k] = h[k] / max_n h[n]. Since there are no negative values of h, each component is finally comprised between 0 and 1.
- Figure 12 is a block diagram showing how the HPCP matrix 138,140 is refined using the HPCP post-processing module 114 of Figure 7.
- in the HPCP post-processing module 114, inharmonic or silent frames are removed from the HPCP matrix 138,140, thus producing a refined HPCP matrix 144,146. These frames can be detected by looking for vectors that have an infinite value, which means that their maximum was zero.
- the mean value of the valid vectors over all frames is calculated in the HPCP averaging module 116.
- FIG. 14 is a block diagram showing the calculation of a transposition index 156 using the transposition module 118.
- the result of the HPCP averaging module 116 is used as an input for the transposition module 118.
- other features similar to HPCP may also be used.
- a chroma feature vector may be used.
- the number of components or bins of these feature vectors may vary from 12 to 36, or even higher values. However, since each bin represents a note or a part of a note, the number of components chosen has to be a multiple of 12. Therefore, selecting a higher number of bins improves accuracy of the feature vectors, but also increases the computational load of the method.
- the transposition module 118 can be used to calculate a transposition index 156 for the two songs 132,134, as illustrated by the sketch below.
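- Since the exact formula is not reproduced in this extract, the following sketch shows the usual way such a transposition index can be obtained from the two global HPCP vectors: the circular shift of one global HPCP that maximizes its dot product with the other is taken as the index t. This should be read as an assumption consistent with the circularshift() usage described below, not as the patent's literal equation.
#include <vector>

// Transposition index: the circular shift (in bins) of globalB that best matches
// globalA, measured by the dot product of the two global HPCP vectors.
int transpositionIndex(const std::vector<double>& globalA,
                       const std::vector<double>& globalB) {
    const size_t N = globalA.size();
    int bestShift = 0;
    double bestScore = -1.0;
    for (size_t shift = 0; shift < N; ++shift) {
        double score = 0.0;
        for (size_t k = 0; k < N; ++k)
            score += globalA[k] * globalB[(k + shift) % N];
        if (score > bestScore) { bestScore = score; bestShift = static_cast<int>(shift); }
    }
    return bestShift;
}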
- Figure 15 is a block diagram showing the calculation of a similarity matrix using the similarity matrix creation module 120.
- the transposition index t calculated by the transposition module 118 is used to transpose one refined HPCP matrix with respect to the other so that both are in the same tonal reference or key. Each column h of only one of the two HPCP matrices is replaced by circularshift(h, t), where N is the number of components of the vector.
- a similarity matrix 160 can then be constructed, where each element (i,j) within the matrix 160 is computed from the HPCP vector of frame i of the first song and the HPCP vector of frame j of the second song. In the corresponding equations, N is the number of components of the vectors h (columns of the HPCP matrix), "·" indicates a dot product, circularshift() is the same function as described previously, t corresponds to the transposition index 156 calculated by the transposition module 118, and ((x))_N is the modulo N of x.
- Equation (18) may be computationally costly to compute (O(2·N·N) operations, where N is the number of components of a vector). As an alternative, a Fourier Transform can be used, which reduces the cost to the order of N·log(N) operations. In that formulation, N is the number of components of the vector, ((x))_N is the modulo N of x, FFT is the Fast Fourier Transform, and C indicates the complex conjugate.
- a similarity matrix 160 may also be constructed by any other suitable similarity measure between HPCP vectors such as the Euclidean distance, cosine distance, correlation, or histogram intersection.
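- As one such alternative, the sketch below builds a similarity matrix from the plain dot product between the HPCP columns of the two songs, after one song has been circularly shifted by the transposition index t. Since equation (18) itself is not reproduced here, this is an illustrative variant rather than the exact measure of the patent.
#include <vector>

using HPCPMatrix = std::vector<std::vector<double>>;   // one HPCP vector per frame

// Similarity matrix S(i, j): dot product between frame i of song A and frame j of
// song B, with song B circularly shifted by the transposition index t.
std::vector<std::vector<double>> similarityMatrix(const HPCPMatrix& songA,
                                                  const HPCPMatrix& songB,
                                                  int t) {
    const size_t N = songA.front().size();
    std::vector<std::vector<double>> S(songA.size(), std::vector<double>(songB.size(), 0.0));
    for (size_t i = 0; i < songA.size(); ++i)
        for (size_t j = 0; j < songB.size(); ++j) {
            double dot = 0.0;
            for (size_t k = 0; k < N; ++k)
                dot += songA[i][k] * songB[j][(k + t) % N];
            S[i][j] = dot;
        }
    return S;
}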
- FIG. 16 is a block diagram showing the calculation of the local alignment matrix 164 using the dynamic programming local alignment module 122.
- the similarity matrix 160 obtained from the similarity matrix creation module 120 is the only input of the module 122.
- the module 122 is configured to perform a dynamic programming local alignment on the similarity matrix 160, producing a local alignment matrix 164. If, for example, one song is n frames long and the other song is m frames long, then the resultant local alignment matrix 164 would have the dimensions (n+1, m+1).
- the local alignment matrix 164 can be initialized as follows:
H(i, 0) = S(i, 0)
H(0, j) = S(0, j)
- the remaining elements are then filled in using the recursion:
H(i, j) = max{ 0, H(i-1, j-1) + S(i, j), H(i-1, j-2) + S(i, j), H(i-2, j-1) + S(i, j) }
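- A sketch of this local alignment recursion over the similarity matrix follows. For simplicity the borders of H are left at zero here, which is a simplification of the initialization reconstructed above and should be treated as an assumption.
#include <algorithm>
#include <vector>

// Dynamic programming local alignment over the similarity matrix S (n x m).
// H has dimensions (n + 1) x (m + 1); gaps of (1,1), (1,2) and (2,1) are allowed,
// matching the recursion above.
std::vector<std::vector<double>> localAlignment(const std::vector<std::vector<double>>& S) {
    const size_t n = S.size(), m = S.front().size();
    std::vector<std::vector<double>> H(n + 1, std::vector<double>(m + 1, 0.0));
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            double s = S[i - 1][j - 1];                       // S is 0-indexed, H is 1-indexed
            double best = std::max(0.0, H[i - 1][j - 1] + s);
            if (j >= 2) best = std::max(best, H[i - 1][j - 2] + s);
            if (i >= 2) best = std::max(best, H[i - 2][j - 1] + s);
            H[i][j] = best;
        }
    return H;
}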
- Figure 17 is a block diagram showing the calculation of the song alignments 170 and song distance 174 using the alignment analysis module 124 and score post-processing module 126.
- the alignment analysis module 124 receives as an input a local alignment matrix 164, which comprises high peak values corresponding to strong similarity points. These peaks can then be backtracked by a backtracking algorithm in order to determine a path.
- FIG. 18 is a flow chart showing an illustrative backtracking algorithm 180 that can be used to determine a similarity path.
- the algorithm 180 may begin generally at block 182 with the step of picking a position (i,j) where a peak is located. Once a position is selected, the algorithm 180 next determines the values of H(i-1,j), H(i,j-1), and H(i-1,j-1) and selects the maximum value max(H(i-1,j), H(i,j-1), H(i-1,j-1)), as indicated generally at block 184.
- Figure 18 shows an illustrative backtracking algorithm 180 that can be used to compute possible song alignments 170
- other backtracking algorithms can also be used.
- One alternative backtracking algorithm that can be used to compute the song alignments 170 is described, for example, in Waterman and Eggert, "A New Algorithm for Best Subsequence Alignments with Application to tRNA-rRNA Comparisons", published in the Journal of Molecular Biology, vol. 197 (pp. 723-728), 1987, which is incorporated herein by reference in its entirety.
- the backtracking algorithm 180 may be run for each of the peaks. The starting point or initial peak will typically be the maximum value of the local alignment matrix 164.
- the path may then be calculated for the second highest value of the local alignment matrix 164, and so forth.
- a path representing the optimal alignment between subsequences of the two songs is then created as the backtracking algorithm 180 calculates each of the peak positions and stores their values in the local alignment matrix 164.
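- A sketch of this backtracking step over the local alignment matrix H is given below, following at each step the largest of the three neighbouring cells and stopping when a zero (or the matrix border) is reached; the stopping condition and tie-breaking rule are assumptions.
#include <algorithm>
#include <utility>
#include <vector>

// Backtrack from a peak at (i, j) in the local alignment matrix H, moving at each
// step to the largest of H(i-1, j), H(i, j-1) and H(i-1, j-1), and stopping when a
// zero value (or the matrix border) is reached.  Returns the path as frame index pairs.
std::vector<std::pair<size_t, size_t>> backtrack(const std::vector<std::vector<double>>& H,
                                                 size_t i, size_t j) {
    std::vector<std::pair<size_t, size_t>> path;
    while (i > 0 && j > 0 && H[i][j] > 0.0) {
        path.emplace_back(i, j);
        double up = H[i - 1][j], left = H[i][j - 1], diag = H[i - 1][j - 1];
        if (diag >= up && diag >= left) { --i; --j; }   // prefer the diagonal on ties
        else if (up >= left)            { --i; }
        else                            { --j; }
    }
    std::reverse(path.begin(), path.end());              // path from alignment start to the peak
    return path;
}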
- although the backtracking algorithm 180 may be focused on the first path found, whose starting point is the maximum value of the local alignment matrix, different paths can also be found. These different paths could be used for other applications such as segmentation, comparing different interpretations of different passages of the same song, and so forth.
- the alignment score 168 is obtained from the resulting alignment, and is normalized by the maximum path length using the score post-processing module 126. This results in the song distance 174 via an inverse relation, so that higher alignment scores map to smaller distances.
- the song distance 174 generated by the score post-processing module 126 can be used to determine the similarity of the two songs based on their tonal progression.
- the song distance 174 between two songs can be used in any system where such a measure is needed, including a cover song identification system in which low values of the song distance 174 indicate a greater probability that the songs 132,134 are the same, or a music recommendation system in which it is desired to sort the songs 132,134 according to a tonal progression similarity criterion.
- the song distance 174 can be the input of another process or method such as a clustering algorithm that can be used to find groups of similar songs, to determine hierarchies, or to perform some other desired function.
- a dendrogram 192 made with the song distances 174 of a small group of known songs is illustrated.
- the optimal alignment (or path) found between two songs is a summarization of the intervals that they both have in common (i.e., where the tonality sequences coincide). For example, if a path starts at position (i,j) and ends at position (i+k1, j+k2), this indicates that the frames from i to i+k1 for the first song (and from j to j+k2 for the second song) are similar and belong to the same tonal sequence.
- the path or song alignment 170 can be used to detect tempo differences between songs. If there is an alignment such as (1,1), (2,2), (3,3), (4,4), ..., it is possible to infer that both songs are aligned and therefore should have a similar tempo. Instead, if song A and song C have an alignment like (1,1), (1,2), (2,3), (2,4), (3,5), (3,6), (4,7), then it is possible to infer that the tempo of song C would be twice that of song A. Another application of the path or song alignment 170 is when the two songs are the same. In this case, the sequence between these frames (i and i+k1) would correspond to the most representative (or repeated) excerpt of the song, and the local alignment matrix 164 would have several peaks corresponding to the points of high self-similarity.
- the term "western" is often used in contrast to Asian, African, or Arab cultural origins. Although this distinction is usually not sufficient to characterize a musical piece on its own, it is very useful in combination with other features such as tonal features or rhythm features for filtering after a similarity search. This filtering may help to avoid taking as similar two audio pieces with similar tonal features and similar rhythmical features but belonging to a different cultural genre. This may also take into consideration human perception in assessing the performance of the music similarity system.
- in order to determine the cultural genre of an audio piece 196, several different criteria are taken into account in order to obtain a western/non-western classification descriptor 214. As shown in Figure 20, these may include, for example, the high resolution pitch class distribution (HPCP) 198, the tuning frequency 200, the octave centroid 202, the dissonance 204, the equal-tempered deviation 206, the non-tempered to tempered energy ratio 208, and the diatonic strength 210. Other factors in addition to, or in lieu of, those shown in Figure 20 may also be taken into consideration in order to obtain the western/non-western classification descriptor 214.
- the high-resolution pitch class distribution 198 may be determined first by calculating the Harmonic Pitch Class Profile (HPCP) on a frame by frame basis, as discussed previously. In some embodiments, this calculation can be performed with 100 ms overlapped frames (i.e., a frame length of 4096 samples at 44,100 Hz, overlapped 50% with a hop size of 2048). Other parameters are also possible, however.
- the frequency used to tune a musical piece is typically a standard reference A, or 440 Hz.
- a measure of the deviation with 440 Hz is thus very useful for cultural genre determination.
- the reference frequency is estimated for each analysis frame by analyzing the deviation of the spectral peaks with respect to the standard reference frequency of 440 Hz. A global value is then obtained by combining the frame estimates in a histogram.
- the equal-tempered deviation 206 measures the deviation of the HPCP local maxima from equal-tempered bins.
- the deviation of the HPCP local maxima from their closest equal-tempered bins, weighted by their magnitudes and normalized by the sum of the peak magnitudes, can then be calculated.
- the non-tempered to tempered energy ratio 208 represents the ratio between the HPCP amplitude of the non-tempered bins and the total amplitude.
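- A combined sketch of these two descriptors for a high-resolution HPCP (e.g., 120 bins, 10 bins per semitone) follows. Since the exact formulas are not reproduced in this extract, the weighting and the convention that the first bin of each semitone is the "tempered" one are assumptions consistent with the textual description.
#include <algorithm>
#include <vector>

// Equal-tempered deviation: mean deviation of the HPCP local maxima from the closest
// equal-tempered bin, weighted by magnitude and normalized by the sum of the peak
// magnitudes.  The deviation is expressed in fractions of a semitone.
double equalTemperedDeviation(const std::vector<double>& hpcp, int binsPerSemitone = 10) {
    double weighted = 0.0, total = 0.0;
    const int N = static_cast<int>(hpcp.size());
    for (int k = 0; k < N; ++k) {
        bool isMax = hpcp[k] > hpcp[(k + N - 1) % N] && hpcp[k] > hpcp[(k + 1) % N];
        if (!isMax) continue;
        int offset = k % binsPerSemitone;                        // position inside the semitone
        double dev = std::min(offset, binsPerSemitone - offset)  // distance to nearest tempered bin
                     / static_cast<double>(binsPerSemitone);
        weighted += dev * hpcp[k];
        total += hpcp[k];
    }
    return total > 0.0 ? weighted / total : 0.0;
}

// Ratio between the amplitude in non-tempered bins and the total HPCP amplitude.
double nonTemperedEnergyRatio(const std::vector<double>& hpcp, int binsPerSemitone = 10) {
    double nonTempered = 0.0, total = 0.0;
    for (size_t k = 0; k < hpcp.size(); ++k) {
        total += hpcp[k];
        if (k % binsPerSemitone != 0) nonTempered += hpcp[k];    // bin 0 of each semitone = tempered
    }
    return total > 0.0 ? nonTempered / total : 0.0;
}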
- the diatonic strength 210 represents the maximum correlation of the HPCP 198 and a diatonic major profile ring-shifted in all possible positions.
- western music uses a diatonic major profile.
- a higher score in the correlation would indicate that the audio piece 196 is more likely to belong to a western music genre than a non-western music genre.
- the HPCP 198 and the related descriptors 206,208,210 map all of the pitch class values to a single octave. This introduces an inherent limitation, namely the inability to differentiate the absolute pitch height of the audio piece 196.
- an octave centroid 202 feature is computed.
- a multi-pitch estimation process is then applied to the spectral analysis.
- the dissonance 204 of the audio piece 196 is also calculated.
- An exemplary method for computing the dissonance 204 of the audio piece 196 is described, for example, with respect to Figure 21 herein.
- the audio pieces 196 and their classification are then fed as training data to an automatic classification tool, which uses a machine learning process 212 to extract data.
- An example of an automatic classification tool that uses machine learning techniques is the Waikato Environment for Knowledge Analysis (WEKA) software developed by the University of Waikato.
- the aforementioned features are extracted from the audio piece 196 and fed to the automatic classification tool that classifies it as belonging to either a western music genre or to a non-western music genre 214.
- This classification 214 can then be used in conjunction with other features (e.g., tonal features, rhythm features, etc.) for filtering a similarity search performed on the audio piece 196.
- the dissonance of an audio piece may be generally defined as the quality of a sound which seems unstable, and which has an aural need to resolve to a stable consonance.
- a consonance may be generally defined as a harmony, chord or interval that is considered stable.
- the dissonance descriptor can be represented by a real number comprised within the range [0,1], where a dissonance of 0 corresponds to a perfect consonance and a dissonance of 1 corresponds to a complete dissonance.
- the method 216 can take as an input a digitized audio piece 218 sampled at a 44,100 Hz sampling rate.
- if the audio piece 218 has been digitized with a different sampling rate, it can be re-sampled to 44,100 Hz. If the audio piece 218 is in a compressed format, it can be decompressed to a PCM format or other uncompressed format.
- the audio piece 218 can be divided into overlapping frames.
- Each frame may have a size of 2048 samples and an overlap of 1024 samples, corresponding to a frame length of around 50 ms. If a different frame length is selected, then either the number of samples per frame must also be changed or the sampling rate must be changed. The number of samples can be linked to the sampling rate by correlating 1 second to 44,100 samples.
- each frame can be smoothed down to eliminate or reduce noisy frequencies. To accomplish this, each frame can be modulated with a window function (block 220).
- this window function is a Blackman-Harris 62 dB function, but other window functions such as any low-resolution window function may be used.
- a frequency quantization of each windowed frame can be performed (block 222).
- a Fast Fourier Transform can be performed.
- a vector containing the frequencies with the corresponding energies can be obtained.
- the resulting vector can then be weighted to take into consideration the difference in perception of the human ear for different frequencies. A pure sinusoidal tone with a frequency at around 4 kHz would be perceived as louder by the human ear than a pure tone of identical physical energy (typically measured in dB SPL) at a lower or higher frequency.
- the vector can be weighted according to a weighting curve (block 224) such as that defined in standard IEC179.
- the weighting curve applied is a dB-A scale.
- all local spectral maxima can be extracted for each frame (block 226). For this operation, and for computational cost optimization reasons, only the frequencies within the range 100 Hz to 5,000 Hz are taken into consideration, which contains the frequencies present in most music. It should be understood, however, that the spectral maxima can be extracted for other frequency ranges. Then, every pair of maxima separated by less than 1.18 of their critical band can be determined.
- the critical band is the specific area on the inner ear membrane that goes into vibration in response to an incoming sine wave. Frequencies within a critical band interact with each other whereas frequencies that do not reside in the same critical band are typically treated independently.
- a frequency-to-bark conversion assigns a bark value to each frequency, where f corresponds to the frequency for which the bark value is being calculated.
- the width of a critical band at a frequency f1 is then the distance between f1 and a second frequency f2 whose bark value differs from that of f1 by a value of 1.
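- Since the bark formula itself is not reproduced in this extract, the sketch below uses one commonly cited frequency-to-bark conversion and a rough approximation of the dissonance curve of Figure 22; both the conversion and the curve shape are assumptions for illustration, not the patent's exact functions.
#include <cmath>

// Commonly used frequency-to-bark conversion (an assumption; the exact formula
// used by the method is not reproduced in this extract).
double barkValue(double freqHz) {
    return 13.0 * std::atan(0.00076 * freqHz)
         + 3.5  * std::atan((freqHz / 7500.0) * (freqHz / 7500.0));
}

// Width of the critical band around f1: the distance to the frequency whose bark
// value differs by 1, found here by a simple upward search (1 Hz resolution).
double criticalBandwidth(double f1) {
    double target = barkValue(f1) + 1.0;
    double f2 = f1;
    while (barkValue(f2) < target && f2 < 25000.0) f2 += 1.0;
    return f2 - f1;
}

// Rough approximation of the dissonance of a pair of partials as a function of their
// distance in critical bands: zero at zero distance, maximal near a quarter of a
// critical band, then decaying back toward consonance, as suggested by Figure 22.
double pairDissonance(double f1, double f2) {
    double d = std::fabs(barkValue(f2) - barkValue(f1));   // distance in barks
    if (d >= 1.18) return 0.0;                             // outside the interaction range
    double x = d / 0.25;
    return x * std::exp(1.0 - x);                          // peaks at 1 when d = 0.25 bark
}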
- the dissonance of these pairs of frequencies can then be derived (block 230) from the relationship between their distance in bark values or critical bands and the resulting dissonance.
- a graph showing the relationship between the critical bandwidth or bark values and resulting dissonance/consonance is shown, for example, in Figure 22.
- the total dissonance for a single peak f_i may thus be calculated by summing, over every peak f_j found at a distance of less than the critical band from f_i, the pair dissonance Dissonance(f_i, f_j) weighted by the energy Energy(f_j) of the peak at f_j, where n represents the total number of maxima in the frame being processed.
- the dissonance of the frame is then obtained by combining the per-peak values, where Dissonance(f_i) is the dissonance calculated previously for the peak at f_i, Energy(f_i) is the energy of the peak at f_i, and n represents the total number of maxima in the frame being processed.
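- A sketch of this aggregation follows, reusing the pair dissonance and critical bandwidth helpers from the previous sketch: each peak accumulates the energy-weighted dissonance of the peaks inside its critical band, and the frame value is the energy-weighted combination of the per-peak values. The per-peak and per-frame normalizations used to keep the result in [0, 1] are assumptions.
#include <cmath>
#include <vector>

struct Peak { double freq; double energy; };

// Helpers sketched after Figure 22 above (frequency-to-bark based).
double pairDissonance(double f1, double f2);
double criticalBandwidth(double f1);

// Dissonance of one analysis frame from its spectral peaks.
double frameDissonance(const std::vector<Peak>& peaks) {
    double frameDiss = 0.0, totalEnergy = 0.0;
    for (const Peak& p : peaks) totalEnergy += p.energy;
    if (totalEnergy <= 0.0) return 0.0;
    for (const Peak& pi : peaks) {
        double peakDiss = 0.0, neighbourEnergy = 0.0;
        for (const Peak& pj : peaks) {
            if (&pi == &pj) continue;
            if (std::fabs(pj.freq - pi.freq) < 1.18 * criticalBandwidth(pi.freq)) {
                peakDiss += pairDissonance(pi.freq, pj.freq) * pj.energy;
                neighbourEnergy += pj.energy;
            }
        }
        if (neighbourEnergy > 0.0) peakDiss /= neighbourEnergy;   // keep the per-peak value in [0, 1]
        frameDiss += peakDiss * (pi.energy / totalEnergy);        // weight by peak energy
    }
    return frameDiss;
}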
- the dissonance (block 234) of the audio file 218 is obtained by averaging the dissonance of all frames.
- the dissonance (block 234) can also be computed by summing a weighted dissonance (block 232) for each frame, with the weighting factor being proportional to the energy of the frame in which the dissonance has been calculated.
- the dissonance descriptor can be represented by a real number within the range [0,1].
- the method 236 takes as an input (block 238) the sequence of chords of a song or an audio piece, which can be calculated in a manner described above.
- the dissonance of two successive chords is assumed to be substantially the same as the dissonance of the two chords played simultaneously. Accordingly, and as shown generally at block 240, successive pairs of chords are selected from the chord sequence 238. It can also be assumed that both successive chords have an equal amount of energy, and therefore the energies of the fundamentals of their chords in the method 236 are considered to be 1.
- the dissonance between two simultaneous chords is therefore the dissonance produced by the superimposed fundamentals of every chord and their harmonics.
- since the number of fundamentals may vary and the number of harmonics per fundamental is theoretically infinite, it may be necessary for computational reasons to limit the number of fundamentals and harmonics.
- the number of fundamentals taken per chord is 3 since the three most representative fundamentals of a chord are generally sufficient to characterize the chord.
- ten harmonics, including the fundamental, can also be considered sufficient to characterize the chord.
- the amplitude of the subsequent harmonics can be multiplied by a factor such as 0.9.
- if F0 is the amplitude of the fundamental f0,
- the amplitude of the first harmonic 2·f0 would be 0.9·F0, and
- the amplitude of the second harmonic 3·f0 would be (0.9)²·F0, and so forth for the subsequent harmonics.
- a spectrum comprising 60 frequencies can be obtained (block 242).
- the 60 frequencies may correspond, for example, to the six fundamentals and their 54 harmonics, 9 per fundamental.
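- A sketch of the construction of that 60-frequency spectrum from two consecutive chords follows, assuming three fundamentals per chord with unit amplitude and ten partials per fundamental decayed by a factor of 0.9, as described above.
#include <vector>

struct Partial { double freq; double amplitude; };

// Build the synthetic spectrum of two consecutive chords: 3 fundamentals per chord
// (amplitude 1) plus 9 harmonics each, every harmonic 0.9 times the amplitude of
// the previous partial -> 2 * 3 * 10 = 60 partials in total.
std::vector<Partial> chordPairSpectrum(const std::vector<double>& chordA,   // 3 fundamentals, Hz
                                       const std::vector<double>& chordB) { // 3 fundamentals, Hz
    std::vector<Partial> spectrum;
    auto addChord = [&spectrum](const std::vector<double>& fundamentals) {
        for (double f0 : fundamentals) {
            double amplitude = 1.0;                      // energy of the fundamental taken as 1
            for (int h = 1; h <= 10; ++h) {              // fundamental + 9 harmonics
                spectrum.push_back({f0 * h, amplitude});
                amplitude *= 0.9;                        // decay of successive harmonics
            }
        }
    };
    addChord(chordA);
    addChord(chordB);
    return spectrum;   // this spectrum is then processed like a frame in method 216
}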
- the dissonance of this spectrum can then be calculated in the same way as if it were a frame in the aforementioned method 216 for calculating the dissonance of a song or audio piece.
- the spectrum can be weighted to take into consideration the difference in perception of the human ear for different frequencies.
- the local maxima can then be extracted and the dissonance of the spectrum calculated (block 244).
- the process can then be repeated advancing by one in the sequence of chords of a song or audio piece (block 246).
- for a sequence of n chords, a sequence of n-1 dissonances is obtained.
- the average of the sequence of dissonances is then computed (block 248) in order to obtain the dissonance of chords of the song or audio piece (block 250).
- the spectral complexity may also be calculated as a step or steps used in determining the dissonance of a song or audio piece, or in determining the dissonance between chords, as discussed above.
- the spectral complexity can be understood generally as a measure of the complexity of the instrumentation of the audio piece. Typically, several instruments are present in an audio piece. The presence of multiple instruments increases the complexity of the spectrum of the audio piece and, as such, can represent a useful audio feature for characterizing the audio piece.
- the spectral complexity descriptor is a real number representing this complexity.
- an illustrative method 252 for determining the spectral complexity of an audio piece can begin at block 254 with the step of extracting spectral peaks from the audio piece.
- the maxima can be extracted from each audio frame in a manner similar to that discussed above, by removing any spectral peaks below a certain threshold (block 256), and then counting those spectral peaks that are above the threshold (block 258).
- the chosen threshold should not be too low since, in such case, spectral peaks that do not correspond to actual content (i.e., noise) would also be counted.
- a value of 0.005 can be used to provide a proper count of the peaks while avoiding noise.
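A minimal sketch of this peak-counting step, assuming magnitude spectra scaled to [0, 1] so that the 0.005 threshold is meaningful, and using a Hann-windowed FFT front end like the one described elsewhere in this document; the helper names are illustrative.

```python
import numpy as np

def count_peaks(magnitude, threshold=0.005):
    """Count local maxima of a magnitude spectrum that exceed the threshold."""
    m = np.asarray(magnitude, dtype=float)
    is_peak = (m[1:-1] > m[:-2]) & (m[1:-1] > m[2:]) & (m[1:-1] > threshold)
    return int(np.count_nonzero(is_peak))

def spectral_complexity(frames, threshold=0.005):
    """Average number of above-threshold spectral peaks per frame."""
    counts = [count_peaks(np.abs(np.fft.rfft(f * np.hanning(len(f)))) / len(f),
                          threshold) for f in frames]
    return float(np.mean(counts)) if counts else 0.0

# Example: 1024-sample frames drawn from a synthetic four-tone signal.
sr = 44100
t = np.arange(sr) / sr
signal = sum(np.sin(2 * np.pi * f * t) for f in (220, 440, 660, 1000))
frames = [signal[i:i + 1024] for i in range(0, len(signal) - 1024, 512)]
print("spectral complexity:", spectral_complexity(frames))
```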
- the onset of an audio piece is the beginning of a note or a sound in which the amplitude of the sound rises from zero to an initial peak.
- the onset rate is a real number representing the number of onsets per second. It may also be considered as a measure of the number of sonic events per second, and is thus a rhythmic indicator of the audio piece. A higher onset rate typically means that the audio piece has a higher rhythmic density.
- Figure 25 is a flow chart showing an illustrative method 264 of determining the onset rate of an audio piece 266. First, the audio piece 266 can be divided into overlapping frames via a windowing process (block 268).
- Each frame may have a size of 1024 samples and an overlap of 512 samples, corresponding to a frame length of around 20 ms. If a different frame length is selected, then either the number of samples per frame or the sampling rate must also be changed. The number of samples can be linked to the sampling rate by correlating 1 second to 44,100 samples.
- Then, each frame can be smoothed down to get rid of noisy frequencies. To accomplish this, each frame can be modulated with a window function (block 270). In certain embodiments, for example, this window function is a discrete probability mass function such as a Hann function, although other window functions such as any low-resolution window function may be used.
- Next, a frequency quantization of each windowed frame can be performed (block 272). In certain embodiments, such quantization can include the use of a Fast Fourier Transform (FFT).
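As a hedged illustration of this front end, the following Python sketch frames a signal into 1024-sample windows with a 512-sample hop, applies a Hann window, and computes magnitude spectra; the function names are illustrative and not part of the described method.

```python
import numpy as np

def frame_signal(x, frame_size=1024, hop=512):
    """Split a signal into overlapping frames (1024 samples, 512-sample hop)."""
    n = 1 + max(0, (len(x) - frame_size) // hop)
    return np.stack([x[i * hop:i * hop + frame_size] for i in range(n)])

def spectra(frames):
    """Apply a Hann window to each frame and compute its magnitude spectrum."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

sr = 44100
x = np.random.randn(sr)            # one second of audio at 44,100 Hz
mag = spectra(frame_signal(x))     # shape: (num_frames, 513)
print(mag.shape)
```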
- An onset detection function is a function that converts the signal or its spectrum into a function that is more effective in detecting onset transients.
- Most onset detection functions known in the art are better adapted to detect a special kind of onset. Therefore, in the present invention two different onset detection functions are used.
- the first onset detection function is the High Frequency Content (HFC) function (block 274), which is better adapted to detect percussive onsets while being less precise for tonal onsets.
- the second onset function is the Complex Domain function (block 276), which is better adapted to detect tonal onsets while being less precise for percussive onsets.
- the Complex Domain function determines the onsets by calculating the difference in phase between the current frame and a prediction for the current frame that does not comprise an onset. An example of such a process is described, for example, in "Complex Domain Onset Detection For Musical Signals" by Duxbury et al., published in "Proc. of the 6th Int. Conference on Digital Audio Effects (2003)," which is incorporated herein by reference in its entirety. This calculation is also carried out for each frame in the FFT of the audio piece 266.
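Of the two functions, the High Frequency Content is straightforward to illustrate: it is commonly computed as a bin-index-weighted energy sum. The exact weighting used by the method is not given in this extract, so the form below is an assumption, and the Complex Domain function is omitted from the sketch.

```python
import numpy as np

def hfc(magnitude):
    """High Frequency Content: each bin's energy weighted by its bin index,
    so energy high in the spectrum (typical of percussive onsets) dominates."""
    m = np.asarray(magnitude, dtype=float)
    return float(np.sum(np.arange(len(m)) * m ** 2))

def hfc_detection_function(mag_frames):
    """One HFC value per frame, normalized by its maximum over the audio piece."""
    values = np.array([hfc(m) for m in mag_frames])
    peak = values.max() if len(values) else 0.0
    return values / peak if peak > 0 else values
```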
- both detection functions are normalized (blocks 278,280) by dividing each of their values by the maximum value of the corresponding function.
- the two detection functions are then summed, value by value (block 282).
- the resulting onset detection function is then smoothed down by applying a smoothing filter.
- each onset detection function value is compared to a dynamic threshold.
- This threshold is calculated for each frame, and its calculation is carried out by taking into consideration values of the onset detection function for preceding frames as well as subsequent frames. This calculation takes advantage of the fact that, during the sustained period of a sound, the only difference between a frame and the next one is the phase of the harmonics, provided there is no onset in the subsequent frames. Using this phase difference, it is therefore possible to predict the values in the subsequent frames.
- the threshold may be calculated (block 286) as the median of a determined number of values of the onset detection function samples, where the selected values of the onset detection function comprise a determined number of values corresponding to frames before the frame that is being considered, and a number of values corresponding to frames following the frame that is being considered.
- the number of values of frames preceding the current frame being considered is 5.
- the number of values of frames following the current frame is 4.
- the foreseen value for the current frame may also be taken into account.
- an onset binary function is then defined (block 288) that yields the potential onsets by assigning a value of 1 to the function if there is a local maximum in the frame higher than the threshold. If there are no local maxima higher than the threshold, the function yields a value of 0. A value of 1 for a determined frame indicates that this frame potentially comprises an audio onset. Thus, the results of this function are concatenated and may be considered as a string of bits.
- frames that have an assigned value of 1 but whose preceding and subsequent frames have an assigned value of 0 are assumed to be false positives. For example, if a part of the bit string is 010, it is changed to 000.
- the onset rate can then be calculated (block 292) by dividing the number of onsets in the audio piece 266 by the length of the audio piece 266 in seconds.
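A minimal sketch of the thresholding and counting steps, assuming the median window includes the current frame along with the 5 preceding and 4 following values, and that isolated detections at the very edges of the piece are also treated as false positives; the helper names and the synthetic detection function are illustrative.

```python
import numpy as np

def onset_rate(detection, duration_seconds, n_before=5, n_after=4):
    """Mark frames whose detection value is a local maximum above a running
    median threshold, drop isolated 010 detections, return onsets per second."""
    d = np.asarray(detection, dtype=float)
    onsets = np.zeros(len(d), dtype=int)
    for i in range(len(d)):
        lo, hi = max(0, i - n_before), min(len(d), i + n_after + 1)
        threshold = np.median(d[lo:hi])      # window also includes the current value
        left = d[i - 1] if i > 0 else -np.inf
        right = d[i + 1] if i < len(d) - 1 else -np.inf
        onsets[i] = int(d[i] >= left and d[i] >= right and d[i] > threshold)
    cleaned = onsets.copy()
    for i in range(len(onsets)):             # 010 -> 000; edges treated as isolated
        prev = onsets[i - 1] if i > 0 else 0
        nxt = onsets[i + 1] if i < len(onsets) - 1 else 0
        if onsets[i] == 1 and prev == 0 and nxt == 0:
            cleaned[i] = 0
    return cleaned.sum() / duration_seconds

# Example: a noisy detection function with three two-frame onsets.
rng = np.random.default_rng(0)
det = rng.random(200) * 0.1
for i in (20, 80, 140):
    det[i] = det[i + 1] = 1.0                # adjacent detections survive the 010 filter
print("onset rate:", round(onset_rate(det, duration_seconds=200 * 512 / 44100), 2))
```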
- the beat of an audio piece is a metrical or rhythmic stress in music. In other words, the beat corresponds to each tick of a metronome that indicates the rhythm of a song.
- the BPM is a real positive number representing the most frequently observed tempo period of the audio piece in beats per minute.
- Figure 26 is a flow chart showing an illustrative method 294 of determining the beats per minute (BPM) of an audio piece 296. The method 294 may begin by reproducing the above steps for onset detection until the frequency quantization is obtained.
- the audio piece 296 can be divided into overlapping frames via a windowing process (block 298). Each frame can then be modulated with a window function (block 300) in order to remove or eliminate noise frequencies. A frequency quantization of each windowed frame can then be performed (block 302).
- the spectrum may then be divided (block 304) into several different frequency bands.
- the spectrum may be divided into 8 different bands having boundaries at 40.0 Hz, 413.16 Hz, 974.51 Hz, 1,818.94 Hz, 3,089.19 Hz, 5,000.0 Hz, 7,874.4 Hz, 12,198.29 Hz, and 17,181.13 Hz.
- the energy of every band is then computed.
- positive variations of the per-band energy derivatives are extracted (block 306).
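A sketch of the band-energy step using the eight band boundaries listed above; half-wave rectifying the frame-to-frame energy difference is assumed here as the way to keep only the positive variations, and the function names are illustrative.

```python
import numpy as np

BAND_EDGES = [40.0, 413.16, 974.51, 1818.94, 3089.19,
              5000.0, 7874.4, 12198.29, 17181.13]       # 9 edges -> 8 bands

def band_energies(mag_frame, sr=44100, fft_size=1024):
    """Spectral energy of one magnitude spectrum inside each of the 8 bands."""
    freqs = np.fft.rfftfreq(fft_size, d=1.0 / sr)
    return np.array([np.sum(mag_frame[(freqs >= lo) & (freqs < hi)] ** 2)
                     for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:])])

def positive_band_derivatives(mag_frames):
    """Frame-to-frame derivative of each band energy, half-wave rectified so
    that only positive variations (energy increases) are kept."""
    energies = np.array([band_energies(m) for m in mag_frames])   # (frames, 8)
    diff = np.diff(energies, axis=0, prepend=energies[:1])
    return np.maximum(diff, 0.0)
```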
- the two onset detection functions (308) are also calculated.
- the two onset detection functions (308) are the High Frequency Content (HFC) function and the Complex Domain function.
- the 8 band energy derivatives (block 306) and the two onset detection functions (block 308) are referred to herein as feature functions since the process for all of them is the same.
- each of the 8 band energy derivatives (306) and the two onset detection functions are resampled (block 310) by taking a sample every 256 samples.
- the resampled feature functions are then fed to a tempo detection module, which forms a 6 second window with each of the resampled feature functions and calculates (block 312) a temporal unbiased autocorrelation function (ACF) over the window based on the following formula:
- feat[n] is the feature function that is currently being processed
- N is the length of the feature function frame.
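The formula itself is not reproduced in this extract; a standard unbiased estimator, which divides each lag's product sum by the number of overlapping samples, is assumed in the sketch below, and the impulse-train example is purely illustrative.

```python
import numpy as np

def unbiased_acf(feat, max_lag=None):
    """Unbiased autocorrelation: ACF[tau] = sum_n feat[n] * feat[n + tau] / (N - tau)."""
    x = np.asarray(feat, dtype=float)
    n = len(x)
    max_lag = n - 1 if max_lag is None else max_lag
    return np.array([np.dot(x[:n - tau], x[tau:]) / (n - tau)
                     for tau in range(max_lag + 1)])

# Feature functions are resampled to 44100 / 256 ≈ 172 samples per second,
# so a 6-second window holds about 1033 feature samples.
feature_rate = 44100 / 256
pulse = np.zeros(int(6 * feature_rate))
pulse[::86] = 1.0                    # one impulse every 86 feature samples (~120 bpm)
acf = unbiased_acf(pulse, max_lag=len(pulse) // 2)
lag = 40 + int(np.argmax(acf[40:150]))        # search a plausible beat-period range
print("dominant lag:", lag, "samples ->", round(60 * feature_rate / lag, 1), "bpm")
```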
- the observed lags should be related, with the first lag corresponding to a fundamental frequency and the remaining n observed lags to its first n harmonics. In certain embodiments, the number of peaks observed is 4.
- the ACF is passed through a comb filter bank (block 314), producing a tempo (block 316).
- Each filter in the bank corresponds to a different fundamental beat period.
- the comb filter bank is not equally weighted but uses a Rayleigh weighting function with a maximum around 110 bpm. This is also useful to minimize the weight of tempi below 40 bpm and above 210 bpm.
- the tempo is then computed (block 318). The filter that gives a maximum output corresponds to the best matching tempo period.
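One possible realization of the comb-filter bank with Rayleigh weighting is sketched below; the exact filter shape and the Rayleigh parameter are assumptions chosen only so that the weighting peaks near 110 bpm and that candidate tempi outside 40 to 210 bpm are excluded.

```python
import numpy as np

def comb_filter_tempo(acf, feature_rate, bpm_range=(40, 210), n_harmonics=4):
    """Score each candidate beat period by summing the ACF at its first
    n_harmonics integer multiples, weighted by a Rayleigh curve whose
    maximum sits near the 110 bpm lag; return the best-scoring tempo."""
    bpms = np.arange(bpm_range[0], bpm_range[1] + 1)
    lags = 60.0 * feature_rate / bpms            # candidate beat periods (in samples)
    beta = 60.0 * feature_rate / 110.0           # Rayleigh mode at the 110 bpm lag
    weights = (lags / beta ** 2) * np.exp(-lags ** 2 / (2 * beta ** 2))
    scores = np.zeros(len(bpms))
    for i, lag in enumerate(lags):
        idx = (np.arange(1, n_harmonics + 1) * lag).round().astype(int)
        idx = idx[idx < len(acf)]
        scores[i] = weights[i] * acf[idx].sum() if len(idx) else 0.0
    return int(bpms[np.argmax(scores)])

# With the impulse-train ACF and feature_rate from the previous sketch:
# comb_filter_tempo(acf, feature_rate) returns 120.
```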
- a tempo module calculates the beat positions by determining the phase, which can be calculated (block 320) by correlating the feature function with an impulse train having the same period as the tempo (block 316) found in the previous step. The value that maximizes the correlation corresponds to the phase of the signal. However, since this value may be greater than the tempo period, the phase is determined (block 322) by taking this value modulo the tempo period determined before. The phase determines the shift between the beginning of the window and the first beat position. The rest of the beat positions are found periodically, one tempo period apart, after the first beat.
- This process is repeated for each of the feature functions for each frame. The most frequently observed tempo across all feature functions is selected as the tempo period for this frame. The same process is applied to the phase detection across all feature functions.
- the 6 second window is then slid by 512 feature samples and its tempo computed again.
- the 6 second window constitutes a sliding window with a hop size of 512.
- a sequence of tempi and a sequence of phases can then be obtained from all the calculations across the whole song of the sliding window.
- the tempo period of the song is then selected as the most frequently observed tempo in the sequence.
- the same process is applied to obtain the phase.
- the selected phase can also be used to calculate the beat positions.
- Beats loudness is a measure of the strength of the rhythmic beats of an audio piece.
- the beats loudness is a real number between 0 and 1, where a higher value indicates louder, more prominent beats.
- Figure 27 is a flow chart showing an illustrative method 324 for calculating the beats loudness and/or bass beats loudness of an audio piece.
- the method 324 may begin generally at block 326 by receiving an audio piece having the format described above with respect to Figure 26 in which the beats position has been determined. The steps described hereafter, although described for a single beat, can be applied to every beat in the audio piece.
- the beat attack position in the audio piece is determined (block 328).
- the beat attack position of a beat is the position in time of the point of maximum energy of the signal during a beat. Thus, there is only one beat attack position per beat. It is necessary to finely determine the beat attack position because the precision of the method 324 is very sensitive to the precision in the position of the beat attack.
- In order to determine the beat attack position of the audio piece, a frame covering a 100 ms window centered on the beat position is taken. Then, the beat attack position is determined by finding the point of highest energy in this range. To accomplish this, the index i that maximizes the relation frame(i)*frame(i) is determined, where frame(i) is the value of the sample in the frame at index i.
- a frame starting from the beat attack position can be taken from the audio piece.
- the size (in milliseconds) of the frame may be taken arbitrarily. However, the frame should represent an audio beat from the beat attack to the beat decrease. In certain embodiments, the size of the frame may be 50 ms.
- the frame can be smoothed down to get rid of noisy frequencies.
- the frame can be modulated with a window function (block 330).
- this window function is a Blackman-Harris-62dB function, but other window functions such as any low-resolution window function may be used.
- a frequency quantization (block 332) of the windowed frame is performed (e.g., using a Fast Fourier Transform). After the frequency quantization, a vector containing the frequencies with the corresponding energies is then obtained.
- the total energy of the beat is then calculated (block 334) by adding the square of the value of every bin in the vector obtained from the frequency quantization. The resulting energy represents the energy of the beat; therefore, the higher the energy of the frame spectrum, the louder the beat.
- the above steps can be performed for each beat in the audio piece. Once all the beats have been analyzed, the energy of the frames each corresponding to one beat in the audio piece is averaged (block 336). The averaged result is between 0 and 1 because the window function applied to each frame is already normalized. Therefore, the total energy of each frame, and thus their average is also between 0 and 1. The averaged energy of the frames constitutes the beat loudness (block 338).
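A minimal sketch of the per-beat measurement under stated assumptions: np.blackman stands in for the Blackman-Harris-62dB window, and the energy is normalized by the window energy and frame length so that it stays within [0, 1] for signals bounded by ±1; the helper names are illustrative.

```python
import numpy as np

def beat_attack_index(signal, beat_sample, sr=44100):
    """Index of the maximum-energy sample within a 100 ms window centered
    on the estimated beat position."""
    half = int(0.05 * sr)
    lo, hi = max(0, beat_sample - half), min(len(signal), beat_sample + half)
    seg = signal[lo:hi]
    return lo + int(np.argmax(seg * seg))

def beat_energy(signal, beat_sample, sr=44100, frame_ms=50):
    """Spectral energy of a 50 ms frame starting at the beat attack, normalized
    by window energy and frame length so it stays in [0, 1] for |signal| <= 1."""
    start = beat_attack_index(signal, beat_sample, sr)
    frame = signal[start:start + int(frame_ms / 1000 * sr)]
    if len(frame) == 0:
        return 0.0
    window = np.blackman(len(frame))     # stand-in for Blackman-Harris-62dB
    spectrum = np.fft.rfft(frame * window)
    return float(np.sum(np.abs(spectrum) ** 2) / (np.sum(window ** 2) * len(frame)))

def beats_loudness(signal, beat_samples, sr=44100):
    """Average per-beat energy over all detected beat positions."""
    if len(beat_samples) == 0:
        return 0.0
    return float(np.mean([beat_energy(signal, b, sr) for b in beat_samples]))
```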
- the above described method 326 may also be used for deriving the bass beats loudness.
- the bass beats loudness is a measure of the weight of low frequencies in the whole spectrum within the beats of the audio piece. It is a real number between 0 and 1.
- the calculation process of the bass beats loudness is generally the same as for calculating the beats loudness, but further includes the step of calculating the ratio between the energy of the low frequencies and the total energy of the spectrum (block 340). This can be calculated on a frame-by-frame basis or over all frames. For example, the beat energy band ratio can be determined by calculating the average energy in the low frequencies of the frames and dividing it by the average total energy of the frames.
- the range of low frequencies can be between about 20-150 Hz, which corresponds to the bass frequency used in many commercial high-fidelity system equalizers. Other low frequency range values could be chosen, however.
- the above steps can be performed for each beat in the audio piece. Once all of the beats have been analyzed, the energy of the frames each corresponding to one bass beat is averaged (block 342). The averaged result is between 0 and 1 because the window function applied to each frame is already normalized. Therefore, the total energy of each frame, and thus their average, is also between 0 and 1.
- the averaged energy of the frames constitutes the bass beats loudness (block 344).
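The bass variant only adds a band-energy ratio; a short sketch, assuming the same windowing as above and the 20-150 Hz band mentioned in the text:

```python
import numpy as np

def bass_energy_ratio(frame, sr=44100, low=20.0, high=150.0):
    """Ratio of the 20-150 Hz spectral energy to the total spectral energy
    of a windowed beat frame."""
    window = np.blackman(len(frame))
    spectrum = np.abs(np.fft.rfft(frame * window)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = spectrum.sum()
    if total == 0:
        return 0.0
    return float(spectrum[(freqs >= low) & (freqs <= high)].sum() / total)
```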
- the combination of both the beats loudness and the bass beats loudness may be useful for characterizing an audio piece.
- a folk song may have a low beats loudness and a low bass beats loudness
- a punk-rock song may have a high beats loudness and a low bass beats loudness
- a hip-hop song may have a high beats loudness and a high bass beats loudness.
- Rhythmic intensity is a measure of the intensity of an audio piece from a rhythmic point of view. Typically, a slow, soft and relaxing audio piece can be considered to have a low rhythmic intensity. On the other hand, a fast, energetic audio piece can be considered to have a high rhythmic intensity.
- the rhythmic intensity is a number between 0 and 1 where higher values indicate a more rhythmically intensive audio piece.
- FIG 28 is a flow chart showing an illustrative method 346 of computing a rhythmic intensity descriptor.
- the rhythmic intensity descriptor can be based on the onset rate (block 348), the beats per minute (block 350), the beats loudness (block 352), and the bass beats loudness (block 354).
- each of these descriptors represents a number within a range. Typically, the chosen range depends on the descriptor that is being considered.
- the rhythmic intensity can be calculated by splitting each of the ranges into three different zones.
- the choice of the zones for each descriptor can be made according to two criteria, one criterion being the statistical analysis of the descriptors of a sample of music pieces that is large enough to be representative (e.g., around one million).
- Musicological concepts can also be utilized as the other criterion.
- the threshold values correspond to the limits that human perception presents. For example, it is known that the higher threshold for the perception of a slow rhythm is around 100 bpm, while the lower limit for a fast rhythm is around 120 bpm. Audio pieces with a bpm between 100 and 120 therefore fall into an intermediate zone.
- the zones for each of the descriptors can be defined as shown in the following table:
- the rhythmic intensity can be calculated by assigning a score (blocks 356,358,360,362) depending on which zone the value of the corresponding descriptor falls in.
- the final value of the rhythmic intensity (block 368) is between 0 and 1.
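A sketch of the zone-scoring scheme follows. Only the 100/120 bpm thresholds appear in the text, so every other zone boundary below is a hypothetical placeholder, and scaling the summed scores to [0, 1] is likewise an assumption.

```python
# Minimal zone-scoring sketch. Only the 100/120 bpm thresholds come from the
# text; every other boundary below is a hypothetical placeholder.
ZONES = {
    "onset_rate":          (2.0, 5.0),      # onsets per second (placeholder)
    "bpm":                 (100.0, 120.0),  # from the perception thresholds above
    "beats_loudness":      (0.05, 0.2),     # placeholder
    "bass_beats_loudness": (0.02, 0.1),     # placeholder
}

def zone_score(value, thresholds):
    """0, 1 or 2 points depending on which of the three zones the value falls in."""
    low, high = thresholds
    return 0 if value < low else (1 if value < high else 2)

def rhythmic_intensity(descriptors):
    """Sum the per-descriptor zone scores and scale the total to [0, 1]
    (the scaling step is an assumption)."""
    total = sum(zone_score(descriptors[name], ZONES[name]) for name in ZONES)
    return total / (2 * len(ZONES))

print(rhythmic_intensity({"onset_rate": 4.1, "bpm": 128,
                          "beats_loudness": 0.12, "bass_beats_loudness": 0.15}))
```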
- Panning of an audio piece is generally the spread of a monaural audio signal in a multi-channel sound field.
- a panning descriptor containing the spatial distribution of audio mixtures within polyphonic audio can be used to classify an audio piece.
- the extraction of spatial information from an audio piece can be used to perform music genre classification.
- FIG 29 is a block diagram of an illustrative audio classification system 370 for combining spectral features with spatial features within an audio piece.
- audio classification system 370 includes an audio database 372 that contains the audio pieces that are to be classified.
- An audio piece is provided to a spectral features module 374 that extracts a vector of features z_Q from the audio piece, mixing the left and right audio channels.
- the audio piece, with separated left and right channels ChL, ChR, is also provided to a spatial features module 376 that extracts panning coefficients p_L from the audio piece.
- the spectral features z_Q and the panning coefficients p_L are then provided to an audio classifier module 378.
- the audio classifier module 378 can be previously trained by means of machine learning techniques using a number of audio pieces as examples. These example audio pieces contain both the features (spectral and spatial) and the class associated with each example. After the training phase, the audio classifier module 378 is able to predict the class 380 associated with new audio pieces not used in the training phase.
- Figure 30 is a flow chart showing an illustrative method 382 of extracting the panning coefficients PL from an audio piece using the audio classification system 370.
- the audio piece may comprise a multi-channel audio piece having a left channel ChL(t) and a right channel ChR(t). In some embodiments, the audio piece may be in a PCM format with a sample rate of 44,100 Hz and 16 bits per sample.
- the method 382 can be performed on a frame by frame basis, where each frame corresponds to a short-time window of the audio piece such as several milliseconds.
- the stereo mix of an audio piece can be represented as a linear combination of n monophonic sound sources:
- the panning knob in mixing consoles or digital audio workstations follows the following law, which constitutes the typical panning formula, wherein x ∈ [0,1] for mixing a sound source i:
- a short-time Fourier transformation (STFT) for each of the audio channels ChL(t),ChR(t) is performed.
- ratios R[k] are derived from the typical panning formula above (block 386), referring to an azimuth angle range going from -45° to +45°, and the ratio of the magnitudes of both spectra S_L(t,f), S_R(t,f).
- the resulting sequence R[k] represents the spatial localization of each frequency bin k of the STFT.
- the range of the azimuth angle of the panorama is θ ∈ [-45°, +45°], while the range of the ratios sequence values is R[k] ∈ [0,1].
- R[k] can thus be expressed as:
- the effect of the direction of reception of acoustic signals in auditory perception is taken into consideration using a warping function. Since human auditory perception presents a higher resolution towards the center of the azimuth, a non-linear function such as the one depicted in Figure 31 is used. The warping function is applied to the azimuth angle in order to provide more resolution towards the center of the panorama.
- an energy-weighted histogram H_w is calculated (block 390) by weighting each bin k with the energy of the corresponding frequency bin of the STFT.
- M is the number of bins of the histogram H_w;
- N is the size of the spectrum, which corresponds to half of the STFT size.
- the computed histograms are then averaged together. Since panning histograms can vary very rapidly from one frame to the next, the histograms may be averaged over a time window of several frames which can range from hundreds of milliseconds to several seconds. If a single panning histogram for the whole song is required, the averaging time should correspond to the song length.
- the minimum value of time for averaging purposes is the frame length which is determined by the input in the STFT algorithm. In one embodiment, the minimum value of time for averaging is around 2 seconds, although other averaging times greater or lesser than this are possible.
- a running average filter, such as the one in equation (42), is used for each of the M bins of the histogram H_{w,n}, where A is the number of averaging frames and n indicates the current frame index.
- the histogram H_w is normalized to produce an averaged histogram that is independent of the energy in the audio signal. This energy is represented by the magnitudes of S_L(t,f), S_R(t,f).
- the histogram H_w can be normalized by dividing every element m in the histogram by the sum of all bins in the histogram, i.e., H_w^norm[m] = H_w[m] / Σ_j H_w[j].
- the normalized panning histogram H_w^norm is converted to the final panning coefficients p_L using a cepstral analysis process.
- the logarithm of the panning histogram H_w^norm may be taken before applying an Inverse Discrete Fourier Transform (IDFT).
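The whole extraction chain can be sketched as follows. The exact ratio, warping, and histogram formulas are not reproduced in this extract, so the magnitude ratio, arctangent-style warping, and single whole-piece histogram used here are assumptions broadly consistent with the description; variable names mirror the text where possible and are otherwise illustrative.

```python
import numpy as np

def panning_coefficients(ch_l, ch_r, fft_size=4096, hop=2048,
                         n_bins=128, n_coeffs=20):
    """Sketch: STFT of both channels, per-bin left/right ratio mapped to a
    warped azimuth, energy-weighted histogram averaged over the whole piece,
    normalization, then log + inverse DFT to obtain cepstral-style coefficients."""
    window = np.hanning(fft_size)
    n_frames = 1 + max(0, (len(ch_l) - fft_size) // hop)
    hist_sum = np.zeros(n_bins)
    for i in range(n_frames):
        seg = slice(i * hop, i * hop + fft_size)
        sl = np.abs(np.fft.rfft(ch_l[seg] * window))
        sr_ = np.abs(np.fft.rfft(ch_r[seg] * window))
        ratio = sr_ / (sl + sr_ + 1e-12)                 # assumed ratio in [0, 1]
        azimuth = (ratio - 0.5) * 90.0                   # map to [-45, +45] degrees
        # Warp the azimuth so the center of the panorama gets more resolution.
        warped = 0.5 + np.arctan(azimuth / 20.0) / (2.0 * np.arctan(45.0 / 20.0))
        energy = sl ** 2 + sr_ ** 2
        hist, _ = np.histogram(warped, bins=n_bins, range=(0.0, 1.0), weights=energy)
        hist_sum += hist
    hist_avg = hist_sum / max(n_frames, 1)
    hist_norm = hist_avg / max(hist_avg.sum(), 1e-12)    # independent of signal energy
    cepstrum = np.real(np.fft.ifft(np.log(hist_norm + 1e-12)))
    return cepstrum[:n_coeffs]

# Example: a source panned hard right on top of centered noise.
sr = 44100
t = np.arange(2 * sr) / sr
center = 0.1 * np.random.randn(len(t))
right_source = np.sin(2 * np.pi * 440 * t)
p_l = panning_coefficients(center, center + right_source)
print(p_l.shape)     # (20,) panning coefficients
```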
- although the method 382 has been described for all frequency bins k, it can also be applied to different frequency bands.
- the method 382 can be repeated from block 386 on a subset of the ratios vector R[k], for k ∈ [S_i, E_i], where S_i is the beginning bin of one band and E_i is the ending bin of the band.
- the method 382 could also be used to extract information from other audio channels or combinations of audio channels.
- surround-sound audio mixtures could also be analyzed by taking pairs of channels such as front/left and back/left channels.
- the panning distribution between four or more channels can be used.
- a ratio between each pair of channels is calculated. This yields one R[k] vector for each pair of channels in the step shown at block 386, with the remainder of the method 382 applied in the manner described above. In such case, one vector of panning coefficients p_i for each pair of channels would be obtained. For audio classification, all vectors of panning coefficients would be combined.
- the panning coefficients p_i may be combined in an algorithm based on the Bayesian Information Criterion (BIC) to detect homogeneous parts of an audio mixture to produce a segmentation of the audio piece. This may be useful, for example, to identify the presence of a soloist in a song.
- Figures 32 and 33 represent illustrative panning coefficients for two different musical genres.
- Figure 32 represents, for example, the panning coefficients for a song belonging to the jazz genre, and Figure 33 represents the panning coefficients for a song belonging to a different musical genre.
- Figure 34 is a flow chart showing another illustrative method 398 for determining musical similarity between audio pieces.
- descriptor features are first extracted from the audio pieces contained in a database of audio pieces (block 400).
- Example descriptor features that can be extracted from the database may include, for example, tonal descriptors, dissonance or consonance descriptors, rhythm descriptors, and/or spatial descriptors, as discussed herein.
- the descriptor features can be stored in a reference database (block 400). Since only those audio pieces that were processed already can be used in the similarity calculation, the number of audio pieces should typically be as large as possible.
- each audio piece has two vectors of features associated with it, where one vector contains the values for the audio features and the other vector contains the normalized values for each audio feature.
- Each feature represents a dimension in an n-dimensional space.
- one or more audio features from an audio piece (block 408) to be compared against the audio pieces contained in the database (block 400) can be extracted and stored in a vector (block 410).
- the audio piece may comprise, for example, a cover version of a song to be compared against a database of audio pieces containing the original song version.
- a distance is calculated between the vector containing the normalized values from the audio piece (block 408) and the vectors containing the normalized values from the audio pieces in the reference database (block 400).
- the distance is a Euclidean distance, although any other suitable distance calculation may be used.
- additional filtering can be performed using additional conditional expressions to limit the number of similar audio pieces found (block 418).
- an additional conditional expression that can be used is an expression limiting the number of closest audio pieces provided as a result.
- other conditions not related to similarity but to some of the extracted features can also be used to perform additional filtering.
- a condition may be used to specify the 10 closest audio pieces that have a beats per minute (bpm) of more than 100, thus restricting the results to audio pieces that have a higher rhythmic intensity.
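A minimal sketch of the distance-and-filter step over normalized feature vectors; the feature names and the reference entries are purely illustrative.

```python
import numpy as np

def find_similar(query_norm, reference, k=10, min_bpm=None):
    """Return up to k reference pieces closest (Euclidean distance on the
    normalized feature vectors) to the query, optionally keeping only
    pieces whose raw bpm exceeds min_bpm."""
    results = []
    for piece in reference:
        if min_bpm is not None and piece["features"]["bpm"] <= min_bpm:
            continue
        dist = float(np.linalg.norm(query_norm - piece["normalized"]))
        results.append((dist, piece["id"]))
    return sorted(results)[:k]

# Illustrative reference database: raw and normalized feature vectors per piece.
reference = [
    {"id": "song_a", "features": {"bpm": 95},
     "normalized": np.array([0.40, 0.10, 0.30])},
    {"id": "song_b", "features": {"bpm": 128},
     "normalized": np.array([0.70, 0.60, 0.80])},
    {"id": "song_c", "features": {"bpm": 140},
     "normalized": np.array([0.75, 0.55, 0.85])},
]
query = np.array([0.72, 0.58, 0.82])
print(find_similar(query, reference, k=10, min_bpm=100))
```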
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The invention relates to systems and methods for determining the similarity between two audio pieces. According to the invention, a method for determining musical similarity comprises the following steps: extracting one or more descriptors from each audio piece; producing a vector for each piece; extracting one or more audio features from each audio piece; computing values for each audio feature; computing the distance between a vector containing the normalized values and the vectors containing the audio pieces; and outputting a response to a user indicating the similarity between the audio pieces. The descriptors can be used to perform content-based audio classification and to determine the similarity between musical pieces. The descriptors extracted from each audio piece may comprise tonal descriptors, dissonance descriptors, rhythm descriptors, and spatial descriptors.
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US94686007P | 2007-06-28 | 2007-06-28 | |
| US60/946,860 | 2007-06-28 | ||
| US97010907P | 2007-09-05 | 2007-09-05 | |
| US60/970,109 | 2007-09-05 | ||
| US98871407P | 2007-11-16 | 2007-11-16 | |
| US60/988,714 | 2007-11-16 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2009001202A1 true WO2009001202A1 (fr) | 2008-12-31 |
Family
ID=39864680
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2008/001669 Ceased WO2009001202A1 (fr) | 2007-06-28 | 2008-06-25 | Procédés et systèmes de similitudes musicales comprenant l'utilisation de descripteurs |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2009001202A1 (fr) |
Cited By (20)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2011009946A1 (fr) | 2009-07-24 | 2011-01-27 | Johannes Kepler Universität Linz | Procédé et appareil permettant de dériver des informations à partir d'une piste audio et de déterminer une similarité entre des pistes audio |
| US8190663B2 (en) | 2009-07-06 | 2012-05-29 | Osterreichisches Forschungsinstitut Fur Artificial Intelligence Der Osterreichischen Studiengesellschaft Fur Kybernetik Of Freyung | Method and a system for identifying similar audio tracks |
| EP2551843A1 (fr) * | 2011-07-27 | 2013-01-30 | YAMAHA Corporation | Appareil d'analyse musicale |
| TWI413096B (zh) * | 2009-10-08 | 2013-10-21 | Chunghwa Picture Tubes Ltd | 適應性畫面更新率調變系統及其方法 |
| FR3022048A1 (fr) * | 2014-06-10 | 2015-12-11 | Weezic | Procede de suivi d'une partition musicale et procede de modelisation associe |
| WO2019115333A1 (fr) * | 2017-12-11 | 2019-06-20 | 100 Milligrams Holding Ab | Système et procédé de création et de recréation d'un mixage de musique, produit programme d'ordinateur et système informatique |
| CN110010151A (zh) * | 2018-12-31 | 2019-07-12 | 瑞声科技(新加坡)有限公司 | 一种音频信号处理方法及设备、存储介质 |
| CN111583963A (zh) * | 2020-05-18 | 2020-08-25 | 合肥讯飞数码科技有限公司 | 一种重复音频检测方法、装置、设备及存储介质 |
| CN111816147A (zh) * | 2020-01-16 | 2020-10-23 | 武汉科技大学 | 一种基于信息提取的音乐节奏定制方法 |
| US10827028B1 (en) | 2019-09-05 | 2020-11-03 | Spotify Ab | Systems and methods for playing media content on a target device |
| CN112071333A (zh) * | 2019-06-11 | 2020-12-11 | 纳宝株式会社 | 用于动态音符匹配的电子装置及其操作方法 |
| CN113196381A (zh) * | 2019-01-11 | 2021-07-30 | 雅马哈株式会社 | 音响解析方法以及音响解析装置 |
| CN113257276A (zh) * | 2021-05-07 | 2021-08-13 | 普联国际有限公司 | 一种音频场景检测方法、装置、设备及存储介质 |
| US20210287662A1 (en) * | 2018-09-04 | 2021-09-16 | Gracenote, Inc. | Methods and apparatus to segment audio and determine audio segment similarities |
| CN113724739A (zh) * | 2021-09-01 | 2021-11-30 | 腾讯音乐娱乐科技(深圳)有限公司 | 检索音频和训练声学模型的方法、终端及存储介质 |
| US11544314B2 (en) | 2019-06-27 | 2023-01-03 | Spotify Ab | Providing media based on image analysis |
| US11551678B2 (en) | 2019-08-30 | 2023-01-10 | Spotify Ab | Systems and methods for generating a cleaned version of ambient sound |
| US11810564B2 (en) | 2020-02-11 | 2023-11-07 | Spotify Ab | Dynamic adjustment of wake word acceptance tolerance thresholds in voice-controlled devices |
| US11822601B2 (en) | 2019-03-15 | 2023-11-21 | Spotify Ab | Ensemble-based data comparison |
| CN120089160A (zh) * | 2025-04-27 | 2025-06-03 | 苏州大学 | 一种基于音频处理的无损管道风险等级检测方法 |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2004095315A1 (fr) * | 2003-04-24 | 2004-11-04 | Koninklijke Philips Electronics N.V. | Analyse de caracteristiques temporelles parametrees |
Non-Patent Citations (4)
| Title |
|---|
| EMILIA GÓMEZ: "TONAL DESCRIPTION OF MUSIC AUDIO SIGNALS", 2006, XP002501266 * |
| FABIEN GOUYON: "A computational approach to rhythm description --- Audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing", 2005, XP002501267 * |
| JOAN SERRA ET AL: "Audio cover song identification based on tonal sequence alignment", ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 2008. ICASSP 2008. IEEE INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 31 March 2008 (2008-03-31), pages 61 - 64, XP031250488, ISBN: 978-1-4244-1483-3 * |
| SOBIECZKY F ED - KEEVE E ET AL: "Results in mathematics and music: visualization of roughness in musical consonance", VISUALIZATION '96. PROCEEDINGS, IEEE, NE, 1 January 1996 (1996-01-01), pages 355 - 357, XP031172362, ISBN: 978-0-89791-864-0 * |
Cited By (27)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8190663B2 (en) | 2009-07-06 | 2012-05-29 | Osterreichisches Forschungsinstitut Fur Artificial Intelligence Der Osterreichischen Studiengesellschaft Fur Kybernetik Of Freyung | Method and a system for identifying similar audio tracks |
| WO2011009946A1 (fr) | 2009-07-24 | 2011-01-27 | Johannes Kepler Universität Linz | Procédé et appareil permettant de dériver des informations à partir dune piste audio et de déterminer une similarité entre des pistes audio |
| TWI413096B (zh) * | 2009-10-08 | 2013-10-21 | Chunghwa Picture Tubes Ltd | 適應性畫面更新率調變系統及其方法 |
| EP2551843A1 (fr) * | 2011-07-27 | 2013-01-30 | YAMAHA Corporation | Appareil d'analyse musicale |
| US9024169B2 (en) | 2011-07-27 | 2015-05-05 | Yamaha Corporation | Music analysis apparatus |
| FR3022048A1 (fr) * | 2014-06-10 | 2015-12-11 | Weezic | Procede de suivi d'une partition musicale et procede de modelisation associe |
| WO2019115333A1 (fr) * | 2017-12-11 | 2019-06-20 | 100 Milligrams Holding Ab | Système et procédé de création et de recréation d'un mixage de musique, produit programme d'ordinateur et système informatique |
| US12125472B2 (en) | 2018-09-04 | 2024-10-22 | Gracenote, Inc. | Methods and apparatus to segment audio and determine audio segment similarities |
| US11657798B2 (en) * | 2018-09-04 | 2023-05-23 | Gracenote, Inc. | Methods and apparatus to segment audio and determine audio segment similarities |
| US20210287662A1 (en) * | 2018-09-04 | 2021-09-16 | Gracenote, Inc. | Methods and apparatus to segment audio and determine audio segment similarities |
| CN110010151A (zh) * | 2018-12-31 | 2019-07-12 | 瑞声科技(新加坡)有限公司 | 一种音频信号处理方法及设备、存储介质 |
| CN113196381A (zh) * | 2019-01-11 | 2021-07-30 | 雅马哈株式会社 | 音响解析方法以及音响解析装置 |
| CN113196381B (zh) * | 2019-01-11 | 2023-12-26 | 雅马哈株式会社 | 音响解析方法以及音响解析装置 |
| US11822601B2 (en) | 2019-03-15 | 2023-11-21 | Spotify Ab | Ensemble-based data comparison |
| CN112071333A (zh) * | 2019-06-11 | 2020-12-11 | 纳宝株式会社 | 用于动态音符匹配的电子装置及其操作方法 |
| US11544314B2 (en) | 2019-06-27 | 2023-01-03 | Spotify Ab | Providing media based on image analysis |
| US11551678B2 (en) | 2019-08-30 | 2023-01-10 | Spotify Ab | Systems and methods for generating a cleaned version of ambient sound |
| US10827028B1 (en) | 2019-09-05 | 2020-11-03 | Spotify Ab | Systems and methods for playing media content on a target device |
| CN111816147A (zh) * | 2020-01-16 | 2020-10-23 | 武汉科技大学 | 一种基于信息提取的音乐节奏定制方法 |
| US11810564B2 (en) | 2020-02-11 | 2023-11-07 | Spotify Ab | Dynamic adjustment of wake word acceptance tolerance thresholds in voice-controlled devices |
| CN111583963B (zh) * | 2020-05-18 | 2023-03-21 | 合肥讯飞数码科技有限公司 | 一种重复音频检测方法、装置、设备及存储介质 |
| CN111583963A (zh) * | 2020-05-18 | 2020-08-25 | 合肥讯飞数码科技有限公司 | 一种重复音频检测方法、装置、设备及存储介质 |
| CN113257276A (zh) * | 2021-05-07 | 2021-08-13 | 普联国际有限公司 | 一种音频场景检测方法、装置、设备及存储介质 |
| CN113257276B (zh) * | 2021-05-07 | 2024-03-29 | 普联国际有限公司 | 一种音频场景检测方法、装置、设备及存储介质 |
| CN113724739A (zh) * | 2021-09-01 | 2021-11-30 | 腾讯音乐娱乐科技(深圳)有限公司 | 检索音频和训练声学模型的方法、终端及存储介质 |
| CN113724739B (zh) * | 2021-09-01 | 2024-06-11 | 腾讯音乐娱乐科技(深圳)有限公司 | 检索音频和训练声学模型的方法、终端及存储介质 |
| CN120089160A (zh) * | 2025-04-27 | 2025-06-03 | 苏州大学 | 一种基于音频处理的无损管道风险等级检测方法 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20080300702A1 (en) | Music similarity systems and methods using descriptors | |
| WO2009001202A1 (fr) | Procédés et systèmes de similitudes musicales comprenant l'utilisation de descripteurs | |
| Muller et al. | Signal processing for music analysis | |
| Paulus et al. | Measuring the similarity of Rhythmic Patterns. | |
| Li et al. | Separation of singing voice from music accompaniment for monaural recordings | |
| JP3964792B2 (ja) | 音楽信号を音符基準表記に変換する方法及び装置、並びに、音楽信号をデータバンクに照会する方法及び装置 | |
| Yoshii et al. | Drum sound recognition for polyphonic audio signals by adaptation and matching of spectrogram templates with harmonic structure suppression | |
| US20100198760A1 (en) | Apparatus and methods for music signal analysis | |
| Zhu et al. | Precise pitch profile feature extraction from musical audio for key detection | |
| Hargreaves et al. | Structural segmentation of multitrack audio | |
| Mauch et al. | Timbre and Melody Features for the Recognition of Vocal Activity and Instrumental Solos in Polyphonic Music. | |
| US20060075883A1 (en) | Audio signal analysing method and apparatus | |
| JP5127982B2 (ja) | 音楽検索装置 | |
| Elowsson et al. | Modeling the perception of tempo | |
| Lerch | Software-based extraction of objective parameters from music performances | |
| Bay et al. | Harmonic source separation using prestored spectra | |
| Grosche et al. | Automatic transcription of recorded music | |
| Osmalsky et al. | Neural networks for musical chords recognition | |
| Waghmare et al. | Analyzing acoustics of indian music audio signal using timbre and pitch features for raga identification | |
| Thomas et al. | Detection of largest possible repeated patterns in indian audio songs using spectral features | |
| Kumar et al. | Melody extraction from music: A comprehensive study | |
| Dittmar et al. | A toolbox for automatic transcription of polyphonic music | |
| Pardo | Finding structure in audio for music information retrieval | |
| Holzapfel et al. | Similarity methods for computational ethnomusicology | |
| Chen et al. | An efficient method for polyphonic audio-to-score alignment using onset detection and constant Q transform |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 08776294; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 08776294; Country of ref document: EP; Kind code of ref document: A1 |