HK40063033B

HK40063033B - Methods and apparatus to fingerprint an audio signal via normalization

Info

Publication number: HK40063033B
Application number: HK62022051906.5A
Authority: HK
Inventors: R. Coover; Z. Rafii
Original assignee: Gracenote, Inc.
Priority date: 2018-09-07
Filing date: 2019-09-06
Publication date: 2025-04-11

Description

Method and apparatus for fingerprinting audio signals via normalization

Related Applications

This patent claims priority to, and the benefit of, French patent application serial number 1858041, filed on September 7, 2018. The entire contents of French patent application serial number 1858041 are incorporated herein by reference.

Technical Field

This disclosure relates generally to audio signals and, more specifically, to methods and apparatus to fingerprint an audio signal via normalization.

Background

Audio information (e.g., sounds, speech, music, etc.) can be represented as digital data (e.g., electronic, optical, etc.). Captured audio (e.g., via a microphone) can be digitized, stored electronically, processed, and/or classified. One way of classifying audio information is by generating an audio fingerprint. An audio fingerprint is a digital summary of audio information created by sampling a portion of an audio signal. Audio fingerprints have historically been used to identify audio and/or verify audio authenticity.

Brief Description of the Drawings

Figure 1 is an example system in which the teachings of this disclosure can be implemented.

Figure 2 is an example implementation of the audio processor of Figure 1.

Figures 3A and 3B depict an example unprocessed spectrogram generated by the example frequency range separator of Figure 2.

Figure 3C depicts an example of a normalized spectrogram generated by the signal normalizer of Figure 2 from the unprocessed spectrogram of Figures 3A and 3B.

Figure 4 is the example unprocessed spectrogram of Figures 3A and 3B divided into fixed audio signal frequency components.

Figure 5 is an example of a normalized spectrogram generated by the signal normalizer of Figure 2 from the fixed audio signal frequency components of Figure 4.

Figure 6 is an example of a normalized and weighted spectrogram generated by the point selector of Figure 2 from the normalized spectrogram of Figure 5.

Figures 7 and 8 are flowcharts representative of machine-readable instructions that can be executed to implement the audio processor of Figure 2.

Figure 9 is a block diagram of an example processing platform structured to execute the instructions of Figures 7 and 8 to implement the audio processor of Figure 2.

The figures are not drawn to scale. In general, the same reference numerals are used throughout the drawings and the accompanying written description to refer to the same or like parts.

Detailed Description

Fingerprint- or signature-based media monitoring techniques generally use one or more inherent characteristics of the monitored media during a monitoring time interval to generate a substantially unique proxy for the media. Such a proxy is referred to as a signature or fingerprint, and can take any form (e.g., a series of digital values, a waveform, etc.) representative of any aspect of the media signal (e.g., the audio and/or video signals forming the media presentation being monitored). A signature can be a series of signatures collected consecutively over a time interval. The terms "fingerprint" and "signature" are used interchangeably herein and are defined herein to mean a proxy for identifying media that is generated from one or more inherent characteristics of the media.

Signature-based media monitoring generally involves determining (e.g., generating and/or collecting) a signature representative of a media signal (e.g., an audio signal and/or a video signal) output by a monitored media device, and comparing the monitored signature to one or more reference signatures corresponding to known (e.g., reference) media sources. Various comparison criteria (such as a cross-correlation value, a Hamming distance, etc.) can be evaluated to determine whether the monitored signature matches a particular reference signature.
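As a non-limiting illustration of one of the comparison criteria named above, the following Python sketch matches a monitored signature against reference signatures by Hamming distance. The representation of a signature as a fixed-width integer, the function names `hamming_distance` and `best_match`, the reference names, and the `max_distance` threshold are all assumptions made for illustration; the disclosure does not prescribe a signature format or a matching threshold.

```python
def hamming_distance(a, b):
    """Count the bit positions at which two equal-width signatures differ."""
    return bin(a ^ b).count("1")


def best_match(monitored, references, max_distance):
    """Return the name of the closest reference, or None if none is close enough.

    `references` maps a (hypothetical) media name to its reference signature.
    """
    name, signature = min(
        references.items(),
        key=lambda item: hamming_distance(monitored, item[1]),
    )
    if hamming_distance(monitored, signature) <= max_distance:
        return name
    return None


# Hypothetical 8-bit signatures, chosen only to keep the example readable.
references = {"song_a": 0b10110010, "song_b": 0b01001101}
print(best_match(0b10110110, references, max_distance=2))  # song_a
```

The monitored signature differs from `song_a` in one bit and from `song_b` in seven bits, so `song_a` is reported. A cross-correlation criterion could be substituted for the distance function without changing the surrounding logic.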

When a match between the monitored signature and one of the reference signatures is found, the monitored media can be identified as corresponding to the particular reference media represented by the reference signature that matched the monitored signature. Because attributes such as a media identifier, a presentation time, a broadcast channel, etc., are collected for the reference signature, these attributes can then be associated with the monitored media whose monitored signature matched the reference signature. Example systems for identifying media based on codes and/or signatures have long been known and were first disclosed in Thomas, U.S. Patent 5,481,294, the entire contents of which are incorporated herein by reference.

Historically, audio fingerprinting techniques have used the loudest parts of an audio signal (e.g., the parts with the most energy, etc.) to create fingerprints over a period of time. In some cases, however, this approach has several serious limitations. In some examples, the loudest parts of the audio signal are associated with noise (e.g., unwanted audio) rather than with the audio of interest. For example, if a user attempts to fingerprint a song in a noisy restaurant, the loudest parts of the captured audio signal may be conversations between restaurant patrons rather than the song or media to be identified. In this example, many of the sampled portions of the audio signal would contain background noise rather than music, which reduces the usefulness of the generated fingerprint.

Another potential limitation of previous fingerprinting techniques is that, particularly in music, audio in the bass frequency range tends to be the loudest. In some examples, the dominant bass frequency energy causes the sampled portions of the audio signal to lie predominantly in the bass frequency range. As a result, fingerprints generated using existing methods often do not include samples from all parts of the audio spectrum that could be used for signature matching, especially samples in the higher frequency ranges (e.g., the treble range, etc.).

The example methods and apparatus disclosed herein overcome the above problems by generating a fingerprint from an audio signal using mean normalization. One example method includes normalizing one or more time-frequency bins of the audio signal by an audio characteristic of a surrounding audio region. As used herein, a "time-frequency bin" is a portion of an audio signal corresponding to a specific frequency bin (e.g., an FFT bin) at a specific time (e.g., three seconds into the audio signal). In some examples, the normalization is weighted by an audio category of the audio signal. In some examples, a fingerprint is generated by selecting points from the normalized time-frequency bins.

Another example method disclosed herein includes dividing an audio signal into two or more audio signal frequency components. As used herein, an "audio signal frequency component" is a portion of an audio signal corresponding to a frequency range and a time period. In some examples, an audio signal frequency component can be composed of multiple time-frequency bins. In some examples, an audio characteristic is determined for some of the audio signal frequency components. In this example, each of those audio signal frequency components is normalized by its associated audio characteristic (e.g., an audio mean, etc.). In some examples, a fingerprint is generated by selecting points from the normalized audio signal frequency components.

Figure 1 is an example system 100 in which the teachings of this disclosure can be implemented. The example system 100 includes an example audio source 102 and an example microphone 104 that captures sound from the audio source 102 and converts the captured sound into an example audio signal 106. An example audio processor 108 receives the audio signal 106 and generates an example fingerprint 110.

The example audio source 102 emits audible sound. The example audio source can be a speaker (e.g., an electroacoustic transducer, etc.), a live performance, a conversation, and/or any other suitable source of audio. The example audio source 102 can include desired audio (e.g., the audio to be fingerprinted, etc.) and can also include undesired audio (e.g., background noise, etc.). In the illustrated example, the audio source 102 is a speaker. In other examples, the audio source 102 can be any other suitable audio source (e.g., a person, etc.).

The example microphone 104 is a transducer that converts the sound emitted by the audio source 102 into the audio signal 106. In some examples, the microphone 104 can be a component of a computer, a mobile device (e.g., a smartphone, a tablet, etc.), a navigation device, or a wearable device (e.g., a smartwatch, etc.). In some examples, the microphone can include analog-to-digital conversion to digitize the audio signal 106. In other examples, the audio processor 108 can digitize the audio signal 106.

The example audio signal 106 is a digitized representation of the sound emitted by the audio source 102. In some examples, the audio signal 106 can be saved on a computer before being processed by the audio processor 108. In some examples, the audio signal 106 can be transferred over a network to the example audio processor 108. Additionally or alternatively, any other suitable method can be used to generate the audio (e.g., digital synthesis, etc.).

The example audio processor 108 converts the example audio signal 106 into the example fingerprint 110. In some examples, the audio processor 108 divides the audio signal 106 into frequency bins and/or time periods and then determines the mean energy of one or more of the resulting audio signal frequency components. In some examples, the audio processor 108 can normalize an audio signal frequency component using the associated mean energy of the audio region surrounding each time-frequency bin. In other examples, any other suitable audio characteristic can be determined and used to normalize each time-frequency bin. In some examples, the fingerprint 110 can be generated by selecting the highest energies among the normalized audio signal frequency components. Additionally or alternatively, any suitable means can be used to generate the fingerprint 110. An example implementation of the audio processor 108 is described below in conjunction with Figure 2.

The example fingerprint 110 is a condensed digital summary of the audio signal 106 that can be used to identify and/or verify the audio signal 106. For example, the fingerprint 110 can be generated by sampling portions of the audio signal 106 and processing those portions. In some examples, the fingerprint 110 can include samples of the highest-energy portions of the audio signal 106. In some examples, the fingerprint 110 can be indexed in a database that can be used for comparison with other fingerprints. In some examples, the fingerprint 110 can be used to identify the audio signal 106 (e.g., to determine what song is being played, etc.). In some examples, the fingerprint 110 can be used to verify the authenticity of the audio.

Figure 2 is an example implementation of the audio processor 108 of Figure 1. The example audio processor 108 includes an example frequency range separator 202, an example audio characteristic determiner 204, an example signal normalizer 206, an example point selector 208, and an example fingerprint generator 210.

The example frequency range separator 202 divides an audio signal (e.g., the digitized audio signal 106 of Figure 1) into time-frequency bins and/or audio signal frequency components. For example, the frequency range separator 202 can perform a fast Fourier transform (FFT) on the audio signal 106 to transform the audio signal 106 into the frequency domain. Additionally, the example frequency range separator 202 can divide the transformed audio signal 106 into two or more frequency bins (e.g., using a Hamming function, a Hann function, etc.). In this example, each audio signal frequency component is associated with one of the two or more frequency bins. Additionally or alternatively, the frequency range separator 202 can aggregate the audio signal 106 into one or more time periods (e.g., the duration of the audio, six-second periods, one-second periods, etc.). In other examples, the frequency range separator 202 can use any suitable technique to transform the audio signal 106 (e.g., a discrete Fourier transform, a sliding time window Fourier transform, a wavelet transform, a discrete Hadamard transform, a discrete Walsh-Hadamard transform, a discrete cosine transform, etc.). In some examples, the frequency range separator 202 can be implemented by one or more band-pass filters (BPFs). In some examples, the output of the example frequency range separator 202 can be represented by a spectrogram. Example outputs of the frequency range separator 202 are discussed below in conjunction with Figures 3A and 3B and Figure 4.
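By way of a hedged, non-limiting sketch, the separation into time-frequency bins described above can be illustrated in Python as a windowed transform over overlapping frames. The function names `hann` and `spectrogram`, the frame length, the hop size, and the use of squared magnitude as the energy of a bin are assumptions chosen for illustration; a naive discrete Fourier transform is used here only for readability, where a practical implementation would use an FFT.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def spectrogram(samples, frame_len, hop):
    """Return energy[f][t]: rows are frequency bins, columns are time bins."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        column = []
        for k in range(n_bins):
            # Naive DFT for clarity; an FFT gives the same values faster.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            column.append(abs(acc) ** 2)
        columns.append(column)
    # Transpose so that energy[f][t] has a frequency-by-time layout.
    return [list(row) for row in zip(*columns)]


# A pure tone that completes exactly 4 cycles per 32-sample frame.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
energy = spectrogram(tone, frame_len=32, hop=16)
peak_bin = max(range(len(energy)), key=lambda f: energy[f][0])
print(len(energy), len(energy[0]), peak_bin)  # 17 7 4
```

The test tone lands in frequency bin 4 of each frame, and the 128 samples yield 7 overlapping frames of 17 frequency bins each.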

The example audio characteristic determiner 204 determines an audio characteristic of a portion of the audio signal 106 (e.g., an audio signal frequency component, an audio region surrounding a time-frequency bin, etc.). For example, the audio characteristic determiner 204 can determine the mean energy (e.g., average power, etc.) of one or more of the audio signal frequency components. Additionally or alternatively, the audio characteristic determiner 204 can determine other characteristics of a portion of the audio signal (e.g., mode energy, median energy, mode power, mean energy, mean amplitude, etc.).
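As an illustrative sketch only, several of the characteristics listed above can be computed from the bin energies of one region with the Python standard library. The function name `region_characteristics` and the flat-list input format are assumptions; the disclosure does not specify how a region's energies are stored.

```python
import statistics


def region_characteristics(energies):
    """Summarize a flat list of bin energies taken from one audio region."""
    return {
        "mean": statistics.mean(energies),
        "median": statistics.median(energies),
        # A mode is only meaningful if energies are quantized to repeated values.
        "mode": statistics.mode(energies),
    }


region = [1.0, 2.0, 2.0, 3.0, 12.0]
stats = region_characteristics(region)
print(stats)  # {'mean': 4.0, 'median': 2.0, 'mode': 2.0}
```

The single loud bin pulls the mean up to 4.0 while the median stays at 2.0, which shows why the choice of characteristic affects how strongly an outlier influences the subsequent normalization.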

The example signal normalizer 206 normalizes one or more time-frequency bins by the associated audio characteristic of the surrounding audio region. For example, the signal normalizer 206 can normalize a time-frequency bin by the mean energy of the surrounding audio region. In other examples, the signal normalizer 206 normalizes some of the audio signal frequency components by an associated audio characteristic. For example, the signal normalizer 206 can use the mean energy associated with an audio signal frequency component to normalize each time-frequency bin of that audio signal frequency component. In some examples, the output of the signal normalizer 206 (e.g., normalized time-frequency bins, normalized audio signal frequency components, etc.) can be represented as a spectrogram. Example outputs of the signal normalizer 206 are discussed below in conjunction with Figure 3C and Figure 5.

The example point selector 208 selects, from the normalized audio signal, one or more points to be used to generate the fingerprint 110. For example, the example point selector 208 can select a plurality of energy maxima of the normalized audio signal. In other examples, the point selector 208 can select any other suitable points of the normalized audio.
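One way the selection of energy maxima might be sketched, under stated assumptions, is shown below in Python. Treating a maximum as a bin that is strictly greater than its immediate neighbors, and keeping only the strongest `count` of them, are assumptions made for illustration, as are the function name `select_peaks` and the sample values.

```python
def select_peaks(energy, count):
    """Return up to `count` (freq_bin, time_bin) local maxima, strongest first."""
    n_freq, n_time = len(energy), len(energy[0])
    peaks = []
    for f in range(n_freq):
        for t in range(n_time):
            value = energy[f][t]
            # Immediate neighbors of (f, t), clipped at the spectrogram edges.
            neighbors = [
                energy[ff][tt]
                for ff in range(max(0, f - 1), min(n_freq, f + 2))
                for tt in range(max(0, t - 1), min(n_time, t + 2))
                if (ff, tt) != (f, t)
            ]
            if all(value > other for other in neighbors):
                peaks.append((value, f, t))
    peaks.sort(reverse=True)
    return [(f, t) for _, f, t in peaks[:count]]


# Hypothetical normalized energies, indexed as normalized[freq_bin][time_bin].
normalized = [
    [0.2, 0.3, 0.1, 0.2],
    [0.4, 2.5, 0.3, 0.1],
    [0.1, 0.2, 0.4, 1.8],
    [0.3, 0.1, 0.2, 0.5],
]
print(select_peaks(normalized, count=2))  # [(1, 1), (2, 3)]
```

The selected (frequency bin, time bin) pairs are the kind of points a fingerprint generator could then encode.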

Additionally or alternatively, the point selector 208 can weight the selection of points based on a category of the audio signal 106. For example, if the category of the audio signal is music, the point selector 208 can weight the selection of points toward the common frequency ranges of music (e.g., bass, treble, etc.). In some examples, the point selector 208 can determine the category of the audio signal (e.g., music, speech, sound effects, advertisements, etc.). The example fingerprint generator 210 generates a fingerprint (e.g., the fingerprint 110) using the points selected by the example point selector 208. The example fingerprint generator 210 can use any suitable method to generate a fingerprint from the selected points.

While an example manner of implementing the audio processor 108 of Figure 1 is illustrated in Figure 2, one or more of the elements, processes, and/or devices illustrated in Figure 2 can be combined, divided, rearranged, omitted, eliminated, and/or implemented in any other way. Further, the example frequency range separator 202, the example audio characteristic determiner 204, the example signal normalizer 206, the example point selector 208, the example fingerprint generator 210, and/or, more generally, the example audio processor 108 of Figures 1 and 2 can be implemented by hardware, software, firmware, and/or any combination of hardware, software, and/or firmware. Thus, for example, any of the example frequency range separator 202, the example audio characteristic determiner 204, the example signal normalizer 206, the example point selector 208, the example fingerprint generator 210, and/or, more generally, the example audio processor 108 can be implemented by one or more analog or digital circuits, logic circuits, programmable processors, programmable controllers, graphics processing units (GPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and/or field-programmable logic devices (FPLDs). When any of the apparatus or system claims of this patent is read to cover a purely software and/or firmware implementation, at least one of the example frequency range separator 202, the example audio characteristic determiner 204, the example signal normalizer 206, the example point selector 208, and the example fingerprint generator 210 is hereby expressly defined to include a non-transitory computer-readable storage device or storage disk (such as a memory, a digital versatile disk (DVD), a compact disk (CD), a Blu-ray disk, etc.) including the software and/or firmware. Further still, the example audio processor 108 of Figures 1 and 2 can include one or more elements, processes, and/or devices in addition to, or instead of, those illustrated in Figure 2, and/or can include more than one of any or all of the illustrated elements, processes, and devices. As used herein, the phrase "in communication," including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.

Figures 3A and 3B depict an example unprocessed spectrogram 300 generated by the example frequency range separator of Figure 2. In the illustrated example of Figure 3A, the example unprocessed spectrogram 300 includes an example first time-frequency bin 304A surrounded by an example first audio region 306A. In the illustrated example of Figure 3B, the example unprocessed spectrogram 300 includes an example second time-frequency bin 304B surrounded by an example second audio region 306B. The example unprocessed spectrogram 300 of Figures 3A and 3B and the normalized spectrogram 302 each include an example vertical axis 308 denoting frequency bins and an example horizontal axis 310 denoting time bins. Figures 3A and 3B illustrate the example audio regions 306A and 306B from which the audio characteristic determiner 204 derives the normalizing audio characteristic, which the signal normalizer 206 uses to normalize the first time-frequency bin 304A and the second time-frequency bin 304B, respectively. In the illustrated example, each time-frequency bin of the unprocessed spectrogram 300 is normalized to generate the normalized spectrogram 302. In other examples, any suitable number of the time-frequency bins of the unprocessed spectrogram 300 can be normalized to generate the normalized spectrogram 302 of Figure 3C.

The example vertical axis 308 has units of frequency bins generated by a fast Fourier transform (FFT) and has a length of 1024 FFT bins. In other examples, the example vertical axis 308 can be measured by any other suitable technique for measuring frequency (e.g., hertz, another transform algorithm, etc.). In some examples, the vertical axis 308 encompasses the entire frequency range of the audio signal 106. In other examples, the vertical axis 308 can encompass a portion of the audio signal 106.

In the illustrated example, the example horizontal axis 310 represents a time period over which the unprocessed spectrogram 300 has a total length of 11.5 seconds. In the illustrated example, the horizontal axis 310 has units of sixty-four millisecond (ms) intervals. In other examples, the horizontal axis 310 can be measured in any other suitable unit (e.g., 1 second, etc.). In some examples, the horizontal axis 310 encompasses the complete duration of the audio. In other examples, the horizontal axis 310 can encompass a portion of the duration of the audio signal 106. In the illustrated example, each time-frequency bin of the spectrograms 300, 302 has a size of 64 ms by 1 FFT bin.

In the illustrated example of Figure 3A, the first time-frequency bin 304A is associated with an intersection of a frequency bin and a time bin of the unprocessed spectrogram 300 and with the portion of the audio signal 106 associated with that intersection. The example first audio region 306A includes the time-frequency bins within a predefined distance of the example first time-frequency bin 304A. For example, the audio characteristic determiner 204 can determine the vertical length of the first audio region 306A (e.g., the length of the first audio region 306A along the vertical axis 308, etc.) based on a set number of FFT bins (e.g., 5 bins, 11 bins, etc.). Similarly, the audio characteristic determiner 204 can determine the horizontal length of the first audio region 306A (e.g., the length of the first audio region 306A along the horizontal axis 310, etc.). In the illustrated example, the first audio region 306A is a square. Alternatively, the first audio region 306A can be any suitable size and shape and can contain any suitable combination of time-frequency bins within the unprocessed spectrogram 300 (e.g., any suitable group of time-frequency bins, etc.). The example audio characteristic determiner 204 can then determine an audio characteristic (e.g., mean energy, etc.) of the time-frequency bins contained within the first audio region 306A. Using the determined audio characteristic, the example signal normalizer 206 of Figure 2 can normalize the associated value of the first time-frequency bin 304A (e.g., the energy of the first time-frequency bin 304A can be normalized by the mean energy of the time-frequency bins within the first audio region 306A).
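As a minimal sketch of the normalization of a single time-frequency bin described above, the following Python example divides a bin's energy by the mean energy of a rectangular region centered on it. Several details are assumptions made for illustration and are not stated in the disclosure: that normalization is a division, that the region includes the center bin itself, that the region is clipped at the edges of the spectrogram, and the names `region_mean`, `normalize_bin`, `half_height`, and `half_width`.

```python
def region_mean(energy, f, t, half_height, half_width):
    """Mean energy of the rectangle centered on bin (f, t), clipped at the edges."""
    rows = range(max(0, f - half_height), min(len(energy), f + half_height + 1))
    cols = range(max(0, t - half_width), min(len(energy[0]), t + half_width + 1))
    values = [energy[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


def normalize_bin(energy, f, t, half_height=1, half_width=1):
    """Divide one bin's energy by the mean energy of its surrounding region."""
    mean = region_mean(energy, f, t, half_height, half_width)
    return energy[f][t] / mean if mean > 0 else 0.0


# Hypothetical energies, indexed as raw[freq_bin][time_bin].
raw = [
    [1.0, 1.0, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 1.0, 1.0],
]
print(normalize_bin(raw, 1, 1))  # 5.0
```

Here the 3-by-3 region has a mean energy of 2.0, so the center bin's energy of 10.0 normalizes to 5.0. A vertical length of 5 or 11 FFT bins, as mentioned above, would correspond to `half_height` values of 2 or 5 in this sketch.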

In the illustrated example of Figure 3B, the second time-frequency bin 304B is associated with an intersection of a frequency bin and a time bin of the unprocessed spectrogram 300 and with the portion of the audio signal 106 associated with that intersection. The example second audio region 306B includes the time-frequency bins within a predefined distance of the example second time-frequency bin 304B. Similarly, the audio characteristic determiner 204 can determine the horizontal length of the second audio region 306B (e.g., the length of the second audio region 306B along the horizontal axis 310, etc.). In the illustrated example, the second audio region 306B is a square. Alternatively, the second audio region 306B can be any suitable size and shape and can contain any suitable combination of time-frequency bins within the unprocessed spectrogram 300 (e.g., any suitable group of time-frequency bins, etc.). In some examples, the second audio region 306B can overlap the first audio region 306A (e.g., contain some of the same time-frequency bins, be displaced along the horizontal axis 310, be displaced along the vertical axis 308, etc.). In some examples, the second audio region 306B can have the same size and shape as the first audio region 306A. In other examples, the second audio region 306B can have a different size and shape than the first audio region 306A. The example audio characteristic determiner 204 can then determine an audio characteristic (e.g., mean energy, etc.) of the time-frequency bins contained within the second audio region 306B. Using the determined audio characteristic, the example signal normalizer 206 of Figure 2 can normalize the associated value of the second time-frequency bin 304B (e.g., the energy of the second time-frequency bin 304B can be normalized by the mean energy of the bins located within the second audio region 306B).

Figure 3C depicts an example of the normalized spectrogram 302 generated by the signal normalizer of Figure 2 by normalizing a plurality of the time-frequency bins of the unprocessed spectrogram 300 of Figures 3A and 3B. For example, some or all of the time-frequency bins of the unprocessed spectrogram 300 can be normalized in the same manner in which the time-frequency bins 304A and 304B were normalized. An example process 700 for generating the normalized spectrogram is described in conjunction with Figure 7. The resulting time-frequency bins of Figure 3C have been normalized by the local mean energy within the local area surrounding each bin. As a result, the darker regions are the areas that have the most energy within their respective local areas. This allows the fingerprint to incorporate relevant audio features even in areas that have lower energy than the typically louder bass frequency region.
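The effect described above can be illustrated, under stated assumptions, by normalizing every bin of a small synthetic spectrogram in Python. The division by a local mean, the inclusion of the center bin in its own region, the clipping of regions at the edges, the 3-by-3 region size, and the names `local_mean_normalize` and `argmax2d` are assumptions chosen for illustration. The synthetic data (a loud, flat band standing in for bass and a quiet isolated peak at a higher bin) is invented for this sketch.

```python
def local_mean_normalize(energy, half_height=1, half_width=1):
    """Divide every bin by the mean energy of the region centered on it."""
    n_freq, n_time = len(energy), len(energy[0])
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            rows = range(max(0, f - half_height), min(n_freq, f + half_height + 1))
            cols = range(max(0, t - half_width), min(n_time, t + half_width + 1))
            values = [energy[r][c] for r in rows for c in cols]
            mean = sum(values) / len(values)
            out[f][t] = energy[f][t] / mean if mean > 0 else 0.0
    return out


def argmax2d(grid):
    """Return the (row, column) index of the largest value in a 2-D list."""
    cells = ((f, t) for f in range(len(grid)) for t in range(len(grid[0])))
    return max(cells, key=lambda ft: grid[ft[0]][ft[1]])


# Rows 0-1 are a loud, flat band; row 4 holds a quiet but distinct peak.
raw = [
    [100.0, 100.0, 100.0, 100.0, 100.0],
    [100.0, 100.0, 100.0, 100.0, 100.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 5.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
]
normalized = local_mean_normalize(raw)
print(argmax2d(raw)[0], argmax2d(normalized))  # 0 (4, 2)
```

Before normalization the strongest bin lies in the loud band (row 0). After normalization the strongest bin is the isolated peak at (4, 2), even though its raw energy is twenty times lower than that of the loud band.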

Figure 4 illustrates the example unprocessed spectrogram 300 of Figure 3 divided into a plurality of fixed audio signal frequency components. The example unprocessed spectrogram 300 is generated by processing the audio signal 106 with a fast Fourier transform (FFT). In other examples, any other suitable method can be used to generate the unprocessed spectrogram 300. In this example, the unprocessed spectrogram 300 is divided into a plurality of example audio signal frequency components 402. The example unprocessed spectrogram 300 includes the example vertical axis 308 and the example horizontal axis 310 of Figure 3. In the illustrated example, the example audio signal frequency components 402 each have an example frequency range 408 and an example time period 410. The example audio signal frequency components 402 include an example first audio signal frequency component 412A and an example second audio signal frequency component 412B. In the illustrated example, the darker portions of the unprocessed spectrogram 300 represent the portions of the audio signal 106 that have higher energy.

The example audio signal frequency components 402 are each associated with a unique combination of a contiguous frequency range (e.g., frequency bins, etc.) and a contiguous time period. In the illustrated example, each of the audio signal frequency components 402 spans a frequency range of equal size (e.g., the frequency range 408). In other examples, some or all of the audio signal frequency components 402 may span frequency ranges of different sizes. In the illustrated example, each of the audio signal frequency components 402 has a time period of equal duration (e.g., the time period 410). In other examples, some or all of the audio signal frequency components 402 may have time periods of different durations. In the illustrated example, the audio signal frequency components 402 together make up the entire audio signal 106. In other examples, the audio signal frequency components 402 may include only a portion of the audio signal 106.

In the illustrated example, the first audio signal frequency component 412A lies in the treble range of the audio signal 106 and has no visible energy points. The example first audio signal frequency component 412A is associated with the frequency bins between FFT bin 768 and FFT bin 896 and with the time period between 10024 ms and 11520 ms. In some examples, portions of the audio signal 106 are present within the first audio signal frequency component 412A. In this example, because the audio within the bass spectrum of the audio signal 106 (e.g., the audio in the second audio signal frequency component 412B, etc.) has comparatively high energy, the portions of the audio signal 106 within the first audio signal frequency component 412A are not visible. The second audio signal frequency component 412B lies in the bass range of the audio signal 106 and has visible energy points. The example second audio signal frequency component 412B is associated with the frequency bins between FFT bin 128 and FFT bin 256 and with the time period between 10024 ms and 11520 ms. In some examples, because the portions of the audio signal 106 within the bass spectrum (e.g., the second audio signal frequency component 412B, etc.) have comparatively high energy, a fingerprint generated from the unprocessed spectrogram 300 would include a disproportionate number of samples from the bass spectrum.

Figure 5 is an example of a normalized spectrogram 500 generated by the signal normalizer of Figure 2 from the fixed audio signal frequency components of Figure 4. The example normalized spectrogram 500 includes the example vertical axis 308 and the example horizontal axis 310 of Figure 3. The example normalized spectrogram 500 is divided into a plurality of example audio signal frequency components 502. In the illustrated example, the audio signal frequency components 502 each have the example frequency range 408 and the example time period 410. The example audio signal frequency components 502 include an example first audio signal frequency component 504A and an example second audio signal frequency component 504B. In some examples, the first audio signal frequency component 504A and the second audio signal frequency component 504B correspond to the same frequency bins and time periods as the first audio signal frequency component 412A and the second audio signal frequency component 412B of Figure 4, respectively. In the illustrated example, the darker portions of the normalized spectrogram 500 represent the regions of the audio spectrum that have higher energy.

The example normalized spectrogram 500 is generated by normalizing the unprocessed spectrogram 300, that is, by normalizing each of the audio signal frequency components 402 of Figure 4 by its associated audio characteristic. For example, the audio characteristic determiner 204 can determine an audio characteristic (e.g., mean energy, etc.) of the first audio signal frequency component 412A. In this example, the signal normalizer 206 can then normalize the first audio signal frequency component 412A by the determined audio characteristic to create the example first audio signal frequency component 504A. Similarly, the example second audio signal frequency component 504B can be generated by normalizing the second audio signal frequency component 412B of Figure 4 by the audio characteristic associated with it. In other examples, the normalized spectrogram 500 can be generated by normalizing only a portion of the audio signal frequency components 402. In other examples, any other suitable method can be used to generate the example normalized spectrogram 500.
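The per-component normalization described above can be sketched as follows. This is only an illustration under stated assumptions: the spectrogram is a list of rows (frequency bins) by columns (time bins), every component is a fixed rectangular block of `block_rows` by `block_cols` bins, the audio characteristic is the mean energy of the block, and the names `normalize_blocks`, `block_rows`, `block_cols`, and `eps` are hypothetical.

```python
def normalize_blocks(spec, block_rows, block_cols, eps=1e-12):
    """Normalize each fixed block of the spectrogram by its own mean energy.

    `eps` is an assumption of this sketch; it avoids division by zero for
    silent blocks.
    """
    n_rows, n_cols = len(spec), len(spec[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            rows = range(r0, min(r0 + block_rows, n_rows))
            cols = range(c0, min(c0 + block_cols, n_cols))
            total = sum(spec[i][j] for i in rows for j in cols)
            mean = total / (len(rows) * len(cols))
            for i in rows:
                for j in cols:
                    out[i][j] = spec[i][j] / (mean + eps)
    return out


# Rows are frequency bins (low to high); columns are time bins.
spec = [
    [100.0, 300.0],  # loud bass component
    [200.0, 200.0],
    [0.01, 0.03],    # quiet treble component with the same relative shape
    [0.02, 0.02],
]
normalized = normalize_blocks(spec, block_rows=2, block_cols=2)
```

In this toy input the treble block is four orders of magnitude quieter than the bass block, yet after normalization both blocks carry the same relative values, which is the effect the comparison of components 412A and 504A describes.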

In the illustrated example of Figure 5, the first audio signal frequency component 504A (e.g., the first audio signal frequency component 412A of Figure 4 after processing by the signal normalizer 206, etc.) has visible energy points on the normalized spectrogram 500. For example, because the first audio signal frequency component 504A has been normalized by the energy of the first audio signal frequency component 412A, previously hidden portions of the audio signal 106 (e.g., when compared to the first audio signal frequency component 412A) are visible on the normalized spectrogram 500. The second audio signal frequency component 504B (e.g., the second audio signal frequency component 412B of Figure 4 after processing by the signal normalizer 206, etc.) corresponds to the bass range of the audio signal 106. For example, because the second audio signal frequency component 504B has been normalized by the energy of the second audio signal frequency component 412B, the number of visible energy points has been reduced (e.g., when compared to the second audio signal frequency component 412B). In some examples, a fingerprint generated from the normalized spectrogram 500 (e.g., the fingerprint 110 of Figure 1) will include samples that are distributed more evenly across the audio spectrum than a fingerprint generated from the unprocessed spectrogram 300 of Figure 4.

Figure 6 is an example of a normalized and weighted spectrogram 600 generated by the point selector 208 of Figure 2 from the normalized spectrogram 500 of Figure 5. The example spectrogram 600 includes the example vertical axis 308 and the example horizontal axis 310 of Figure 3. The example normalized and weighted spectrogram 600 is divided into a plurality of example audio signal frequency components 502. In the illustrated example, the example audio signal frequency components 502 each have the example frequency range 408 and the example time period 410. The example audio signal frequency components 502 include an example first audio signal frequency component 604A and an example second audio signal frequency component 604B. In some examples, the first audio signal frequency component 604A and the second audio signal frequency component 604B correspond to the same frequency bins and time periods as the first audio signal frequency component 412A and the second audio signal frequency component 412B of Figure 4, respectively. In the illustrated example, the darker portions of the normalized and weighted spectrogram 600 represent the regions of the audio spectrum that have higher energy.

The example normalized and weighted spectrogram 600 is generated by weighting the normalized spectrogram 500 with values ranging from zero to one based on the category of the audio signal 106. For example, if the audio signal 106 is music, the point selector 208 of Figure 2 weights, along each column, the regions of the audio spectrum associated with music. In other examples, the weighting can be applied across multiple columns and can use a different range of values between zero and one.
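One way to picture the category-based weighting is a per-frequency weight profile applied to every column of the normalized spectrogram. The sketch below is an assumption-laden illustration: the weight values in `CATEGORY_WEIGHTS` are invented for the example and are not taken from the disclosure, and the name `weight_spectrogram` is hypothetical.

```python
# Hypothetical weight profiles: one weight in [0, 1] per frequency bin,
# ordered from the lowest to the highest frequency bin.
CATEGORY_WEIGHTS = {
    "music": [1.0, 0.8, 0.8, 1.0],   # keep bass and treble
    "speech": [0.2, 1.0, 1.0, 0.2],  # emphasize the mid range
}


def weight_spectrogram(spec, category):
    """Scale every column (time bin) of `spec` by the category's profile."""
    weights = CATEGORY_WEIGHTS[category]
    if len(weights) != len(spec):
        raise ValueError("need exactly one weight per frequency bin")
    if any(not 0.0 <= w <= 1.0 for w in weights):
        raise ValueError("weights must lie between zero and one")
    # Row i holds frequency bin i across all time bins, so multiplying the
    # whole row by weights[i] applies the same profile along each column.
    return [[w * value for value in row] for w, row in zip(weights, spec)]


# A flat normalized spectrogram: 4 frequency bins by 3 time bins.
flat = [[1.0, 1.0, 1.0] for _ in range(4)]
weighted = weight_spectrogram(flat, "speech")
```

Keeping the weights between zero and one means weighting can only attenuate a bin relative to the normalized spectrogram, so the strongest points that survive lie in the frequency ranges most characteristic of the category.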

Figures 7 and 8 show flowcharts representative of example hardware logic, machine-readable instructions, hardware-implemented state machines, and/or any combination thereof for implementing the audio processor 108 of Figure 2. The machine-readable instructions may be an executable program or a portion of an executable program for execution by a computer processor, such as the processor 912 shown in the example processor platform 900 discussed below in conjunction with Figure 9. The program may be embodied in software stored on a non-transitory computer-readable storage medium, such as a CD-ROM, a floppy disk, a hard disk drive, a DVD, a Blu-ray disc, or a memory associated with the processor 912, but the entire program and/or parts thereof may alternatively be executed by a device other than the processor 912 and/or embodied in firmware or dedicated hardware. Furthermore, although the example program is described with reference to the flowcharts illustrated in Figures 7 and 8, many other methods of implementing the example audio processor 108 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined. Additionally or alternatively, any or all of the blocks may be implemented by one or more hardware circuits (e.g., discrete and/or integrated analog and/or digital circuitry, an FPGA, an ASIC, a comparator, an operational amplifier (op-amp), a logic circuit, etc.) structured to perform the corresponding operation without executing software or firmware.

As mentioned above, the example processes of Figures 7 and 8 may be implemented using executable instructions (e.g., computer and/or machine-readable instructions) stored on a non-transitory computer and/or machine-readable medium, such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory, and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information). As used herein, the term non-transitory computer-readable medium is expressly defined to include any type of computer-readable storage device and/or storage disk and to exclude propagating signals and transmission media.

"Including" and "comprising" (and all forms and tenses thereof) are used herein as open-ended terms. Thus, whenever a claim employs any form of "include" or "comprise" (e.g., comprises, includes, comprising, including, having, etc.) as a preamble or within a claim recitation of any kind, it is to be understood that additional elements, terms, etc. may be present without falling outside the scope of the corresponding claim or recitation. As used herein, when the phrase "at least" is used as the transition term in, for example, a preamble of a claim, it is open-ended in the same manner as the terms "comprising" and "including" are open-ended. The term "and/or", when used, for example, in a form such as A, B, and/or C, refers to any combination or subset of A, B, C, such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C. As used herein in the context of describing structures, components, items, objects, and/or things, the phrase "at least one of A and B" is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing structures, components, items, objects, and/or things, the phrase "at least one of A or B" is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. As used herein in the context of describing the performance or execution of processes, instructions, actions, activities, and/or steps, the phrase "at least one of A and B" is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B. Similarly, as used herein in the context of describing the performance or execution of processes, instructions, actions, activities, and/or steps, the phrase "at least one of A or B" is intended to refer to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.

The process 700 of Figure 7 begins at block 702. At block 702, the audio processor 108 receives the digitized audio signal 106. For example, the audio processor 108 may receive audio (e.g., emitted by the audio source 102 of Figure 1, etc.) captured by the microphone 104. In this example, the microphone may include an analog-to-digital converter to convert the audio into the digitized audio signal 106. In other examples, the audio processor 108 may receive audio stored in a database (e.g., the volatile memory 914 of Figure 9, the non-volatile memory 916 of Figure 9, the mass storage device 928 of Figure 9, etc.). In other examples, the digitized audio signal 106 may be transmitted to the audio processor 108 over a network (e.g., the Internet, etc.). Additionally or alternatively, the audio processor 108 may receive the audio signal 106 by any other suitable means.

At block 704, the frequency range separator 202 windows the audio signal 106 and transforms the audio signal 106 into the frequency domain. For example, the frequency range separator 202 may apply a windowing function (e.g., a Hamming function, a Hann function, etc.) and perform a fast Fourier transform to transform the audio signal 106 into the frequency domain. Additionally or alternatively, the frequency range separator 202 may aggregate the audio signal 106 into two or more time bins. In these examples, a time-frequency bin corresponds to an intersection of a frequency bin and a time bin and contains a portion of the audio signal 106.
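The windowing and frequency-domain transform of block 704 can be sketched with the standard library alone. Note the substitutions: a direct discrete Fourier transform is used here in place of a fast Fourier transform (same result, slower), and the frame length, hop size, and the names `hann`, `power_spectrum`, and `spectrogram` are assumptions made for illustration.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def power_spectrum(frame):
    """Window one frame and return its energy per frequency bin.

    Uses a direct DFT for clarity; an FFT would give the same values.
    """
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann(n))]
    energies = []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        energies.append(abs(acc) ** 2)
    return energies


def spectrogram(samples, frame_len, hop):
    """Return a grid with rows = frequency bins and columns = time bins."""
    starts = range(0, len(samples) - frame_len + 1, hop)
    columns = [power_spectrum(samples[s:s + frame_len]) for s in starts]
    return [list(row) for row in zip(*columns)]  # transpose


# An 8 Hz tone sampled at 64 Hz; with 32-sample frames each bin spans 2 Hz,
# so the tone should land in frequency bin 4.
rate = 64
tone = [math.sin(2 * math.pi * 8 * i / rate) for i in range(128)]
spec = spectrogram(tone, frame_len=32, hop=16)
peak_bin = max(range(len(spec)), key=lambda k: spec[k][0])
```

Each entry `spec[f][t]` then plays the role of a time-frequency bin: the intersection of frequency bin `f` and time bin `t`.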

At block 706, the audio characteristic determiner 204 selects a time-frequency bin to normalize. For example, the audio characteristic determiner 204 may select the first time-frequency bin 304A of Figure 3A. In some examples, the audio characteristic determiner 204 may select a time-frequency bin adjacent to a previously selected time-frequency bin.

At block 708, the audio characteristic determiner 204 determines an audio characteristic of the surrounding audio region. For example, if the audio characteristic determiner 204 selected the first time-frequency bin 304A, the audio characteristic determiner 204 may determine an audio characteristic of the first audio region 306A. In some examples, the audio characteristic determiner 204 may determine the mean energy of the audio region. In other examples, the audio characteristic determiner 204 may determine any other suitable audio characteristic (e.g., mean amplitude, etc.).

At block 710, the audio characteristic determiner 204 determines whether another time-frequency bin is to be selected. If another time-frequency bin is to be selected, the process 700 returns to block 706. If no other time-frequency bin is to be selected, the process 700 advances to block 712. In some examples, blocks 706 to 710 are repeated until every time-frequency bin of the unprocessed spectrogram 300 has been selected. In other examples, blocks 706 to 710 may be repeated for any suitable number of iterations.

At block 712, the signal normalizer 206 normalizes each time-frequency bin based on its associated audio characteristic. For example, the signal normalizer 206 may normalize each of the time-frequency bins selected at block 706 with the associated audio characteristic determined at block 708. For example, the signal normalizer 206 may normalize the first time-frequency bin 304A and the second time-frequency bin 304B by the audio characteristics (e.g., mean energy) of the first audio region 306A and the second audio region 306B, respectively. In some examples, the signal normalizer 206 generates a normalized spectrogram (e.g., the normalized spectrogram 302 of Figure 3C) based on the normalization of the time-frequency bins.

At block 714, the point selector 208 determines whether fingerprint generation is to be weighted based on audio category. If fingerprint generation is to be weighted based on audio category, the process 700 advances to block 716. If fingerprint generation is not to be weighted based on audio category, the process 700 advances to block 720. At block 716, the point selector 208 determines the audio category of the audio signal 106. For example, the point selector 208 may present a prompt asking a user to indicate the category of the audio (e.g., music, speech, sound effects, advertisement, etc.). In other examples, the audio processor 108 may use an audio category determination algorithm to determine the audio category. In some examples, the audio category may be the voice of a specific person, human speech in general, music, sound effects, and/or advertisement.

At block 718, the point selector 208 weights the time-frequency bins based on the determined audio category. For example, if the audio category is music, the point selector 208 may weight the audio signal frequency components associated with the treble and bass ranges commonly associated with music. In some examples, if the audio category is the voice of a specific person, the point selector 208 may weight the audio signal frequency components associated with that person's voice. In some examples, the output of the signal normalizer 206 may be represented as a spectrogram.

At block 720, the fingerprint generator 210 generates a fingerprint of the audio signal 106 (e.g., the fingerprint 110 of Figure 1) by selecting energy extrema of the normalized audio signal. For example, the fingerprint generator 210 may use the frequency bins, time bins, and energies associated with one or more energy extrema (e.g., one extremum, twenty extrema, etc.). In some examples, the fingerprint generator 210 may select energy maxima of the normalized audio signal 106. In other examples, the fingerprint generator 210 may select any other suitable feature of the normalized audio signal frequency components. In some examples, the fingerprint generator 210 may use any suitable means (e.g., an algorithm, etc.) to generate a fingerprint 110 representative of the audio signal 106. Once the fingerprint 110 has been generated, the process 700 ends.
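Selecting energy extrema from a normalized spectrogram might look like the following sketch. The disclosure does not fix how extrema are found or encoded, so everything specific here is an assumption: a bin counts as a maximum when it is strictly greater than its up-to-eight immediate neighbors, only the `count` strongest maxima are kept, and the fingerprint is a time-ordered list of `(time_bin, frequency_bin, energy)` tuples. The name `select_peaks` is hypothetical.

```python
def select_peaks(spec, count):
    """Return up to `count` local energy maxima, ordered by time bin.

    `spec` has rows = frequency bins and columns = time bins.
    """
    n_f, n_t = len(spec), len(spec[0])
    peaks = []
    for f in range(n_f):
        for t in range(n_t):
            value = spec[f][t]
            neighbours = [
                spec[i][j]
                for i in range(max(0, f - 1), min(n_f, f + 2))
                for j in range(max(0, t - 1), min(n_t, t + 2))
                if (i, j) != (f, t)
            ]
            if all(value > other for other in neighbours):
                peaks.append((t, f, value))
    peaks.sort(key=lambda p: p[2], reverse=True)  # strongest first
    return sorted(peaks[:count])                  # then order by time


normalized = [
    [0.1, 0.2, 0.1, 0.1],
    [0.2, 2.0, 0.2, 0.1],
    [0.1, 0.2, 0.1, 1.5],
    [0.9, 0.1, 0.1, 0.2],
]
fingerprint = select_peaks(normalized, count=2)
all_peaks = select_peaks(normalized, count=20)
```

Because the input has already been normalized, the strongest maxima are the bins that stand out from their surroundings rather than the bins that are loudest in absolute terms.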

The process 800 of Figure 8 begins at block 802. At block 802, the audio processor 108 receives a digitized audio signal. For example, the audio processor 108 may receive audio (e.g., emitted by the audio source 102 of Figure 1, etc., and captured by the microphone 104). In this example, the microphone may include an analog-to-digital converter to convert the audio signal into the digitized audio signal 106. In other examples, the audio processor 108 may receive audio stored in a database (e.g., the volatile memory 914 of Figure 9, the non-volatile memory 916 of Figure 9, the mass storage device 928 of Figure 9, etc.). In other examples, the digitized audio signal 106 may be transmitted to the audio processor 108 over a network (e.g., the Internet, etc.). Additionally or alternatively, the audio processor 108 may receive the audio signal 106 by any suitable means.

At block 804, the frequency range separator 202 divides the audio signal into two or more audio signal frequency components (e.g., the audio signal frequency components 402 of Figure 4, etc.). For example, the frequency range separator 202 may perform a fast Fourier transform to transform the audio signal 106 into the frequency domain and may apply a windowing function (e.g., a Hamming function, a Hann function, etc.) to create frequency bins. In these examples, each audio signal frequency component is associated with one or more of the frequency bins. Additionally or alternatively, the frequency range separator 202 may further divide the audio signal 106 into two or more time periods. In these examples, each audio signal frequency component corresponds to a unique combination of one of the two or more time periods and one of the two or more frequency bins. For example, the frequency range separator 202 may divide the audio signal 106 into a first frequency bin, a second frequency bin, a first time period, and a second time period. In this example, a first audio signal frequency component corresponds to the portion of the audio signal 106 within the first frequency bin and the first time period, a second audio signal frequency component corresponds to the portion of the audio signal 106 within the first frequency bin and the second time period, a third audio signal frequency component corresponds to the portion of the audio signal 106 within the second frequency bin and the first time period, and a fourth audio signal frequency component corresponds to the portion of the audio signal 106 within the second frequency bin and the second time period. In some examples, the output of the frequency range separator 202 may be represented as a spectrogram (e.g., the unprocessed spectrogram 300 of Figure 3).
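The four-component example above can be sketched as a partition of a spectrogram grid. This is an illustration only: the spectrogram is assumed to be a list of rows (frequency bins) by columns (time bins), the component boundaries are given as index lists, and the names `split_into_components`, `freq_edges`, and `time_edges` are hypothetical.

```python
def split_into_components(spec, freq_edges, time_edges):
    """Partition `spec` into components keyed by (frequency_band, time_period).

    `freq_edges` and `time_edges` are ascending index boundaries, e.g.
    [0, 2, 4] yields the two ranges [0, 2) and [2, 4).
    """
    components = {}
    for fi in range(len(freq_edges) - 1):
        for ti in range(len(time_edges) - 1):
            rows = spec[freq_edges[fi]:freq_edges[fi + 1]]
            components[(fi, ti)] = [
                row[time_edges[ti]:time_edges[ti + 1]] for row in rows
            ]
    return components


# 4 frequency rows by 4 time columns, split into 2 bands and 2 periods.
spec = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
components = split_into_components(spec, freq_edges=[0, 2, 4], time_edges=[0, 2, 4])
first = components[(0, 0)]   # first frequency band, first time period
second = components[(0, 1)]  # first frequency band, second time period
third = components[(1, 0)]   # second frequency band, first time period
fourth = components[(1, 1)]  # second frequency band, second time period
```

Unequal entries in `freq_edges` or `time_edges` would produce components of different sizes, which corresponds to the variants in which the components do not all share the same frequency range or duration.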

At block 806, the audio characteristic determiner 204 determines an audio characteristic of each audio signal frequency component. For example, the audio characteristic determiner 204 may determine the mean energy of each audio signal frequency component. In other examples, the audio characteristic determiner 204 may determine any other suitable audio characteristic (e.g., mean amplitude, etc.).

At block 808, the signal normalizer 206 normalizes each audio signal frequency component based on the determined audio characteristic associated with that audio signal frequency component. For example, the signal normalizer 206 may normalize each audio signal frequency component by the mean energy associated with that audio signal frequency component. In other examples, the signal normalizer 206 may normalize the audio signal frequency components using any other suitable audio characteristic. In some examples, the output of the signal normalizer 206 may be represented as a spectrogram (e.g., the normalized spectrogram 500 of Figure 5).

At block 810, the audio characteristic determiner 204 determines whether fingerprint generation is to be weighted based on audio category. If fingerprint generation is to be weighted based on audio category, the process 800 advances to block 812. If fingerprint generation is not to be weighted based on audio category, the process 800 advances to block 816. At block 812, the audio processor 108 determines the audio category of the audio signal 106. For example, the audio processor 108 may present a prompt asking a user to indicate the category of the audio (e.g., music, speech, etc.). In other examples, the audio processor 108 may use an audio category determination algorithm to determine the audio category. In some examples, the audio category may be the voice of a specific person, human speech in general, music, sound effects, and/or advertisement.

At block 814, the signal normalizer 206 weights the audio signal frequency components based on the determined audio category. For example, if the audio category is music, the signal normalizer 206 may weight the audio signal frequency components along each column with a different scaling value from zero to one for each frequency position from treble to bass, in accordance with the average spectral envelope of music. In some examples, if the audio category is human voice, the signal normalizer 206 may weight the audio signal frequency components in accordance with the spectral envelope of the human voice. In some examples, the output of the signal normalizer 206 may be represented as a spectrogram (e.g., the spectrogram 600 of Figure 6).

At block 816, the fingerprint generator 210 generates a fingerprint of the audio signal 106 (e.g., the fingerprint 110 of Figure 1) by selecting energy extrema of the normalized audio signal frequency components. For example, the fingerprint generator 210 may use the frequency bins, time bins, and energies associated with one or more energy extrema (e.g., twenty extrema, etc.). In some examples, the fingerprint generator 210 may select energy maxima of the normalized audio signal. In other examples, the fingerprint generator 210 may select any other suitable feature of the normalized audio signal frequency components. In some examples, the fingerprint generator 210 may use another suitable means (e.g., an algorithm, etc.) to generate a fingerprint 110 representative of the audio signal 106. Once the fingerprint 110 has been generated, the process 800 ends.

Figure 9 is a block diagram of an example processor platform 900 structured to execute the instructions of Figures 7 and/or 8 to implement the audio processor 108 of Figure 2. For example, the processor platform 900 may be a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smartphone, a tablet computer such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a DVD player, a CD player, a digital video recorder, a Blu-ray player, a game console, a personal video recorder, a set-top box, a headset or other wearable device, or any other type of computing device.

The processor platform 900 of the illustrated example includes a processor 912. The processor 912 of the illustrated example is hardware. For example, the processor 912 may be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer. The hardware processor may be a semiconductor-based (e.g., silicon-based) device. In this example, the processor 912 implements the example frequency range separator 202, the example audio characteristic determiner 204, the example signal normalizer 206, the example point selector 208, and the example fingerprint generator 210.

The processor 912 of the illustrated example includes a local memory 913 (e.g., a cache). The processor 912 of the illustrated example communicates via a bus 918 with a main memory that includes a volatile memory 914 and a non-volatile memory 916. The volatile memory 914 may be implemented by synchronous dynamic random access memory (SDRAM), dynamic random access memory (DRAM), and/or any other type of random access memory device. The non-volatile memory 916 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 914, 916 is controlled by a memory controller.

The processor platform 900 of the illustrated example also includes an interface circuit 920. The interface circuit 920 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB) interface, a near field communication (NFC) interface, and/or a PCI Express interface.

In the illustrated example, one or more input devices 922 are connected to the interface circuit 920. The input devices 922 permit a user to enter data and/or commands into the processor 912. For example, the input devices 922 may be implemented by an audio sensor, a microphone, a camera (still or video), and/or a voice recognition system.

One or more output devices 924 are also connected to the interface circuit 920 of the illustrated example. The output devices 924 may be implemented, for example, by display devices (e.g., a light-emitting diode (LED), an organic light-emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube (CRT) display, an in-plane switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer, and/or a speaker. The interface circuit 920 of the illustrated example thus typically includes a graphics driver card, a graphics driver chip, and/or a graphics driver processor.

The interface circuit 920 of the illustrated example also includes a communication device (such as a transmitter, a receiver, a transceiver, a modem, a residential gateway, a wireless access point, and/or a network interface) to facilitate the exchange of data with external machines (e.g., computing devices of any kind) via a network 926. The communication may be, for example, via an Ethernet connection, a digital subscriber line (DSL) connection, a telephone line connection, a coaxial cable system, a satellite system, a line-of-sight wireless system, a cellular telephone system, etc.

The processor platform 900 of the illustrated example also includes one or more mass storage devices 928 for storing software and/or data. Examples of such mass storage devices 928 include floppy disk drives, hard disk drives, compact disc drives, Blu-ray disc drives, redundant array of independent disks (RAID) systems, and digital versatile disc (DVD) drives.

The machine-executable instructions 932 for implementing the method of Figure 6 may be stored in the mass storage device 928, in the volatile memory 914, in the non-volatile memory 916, and/or on a removable non-transitory computer-readable storage medium (such as a CD or DVD).

From the foregoing, it will be appreciated that example methods and apparatus have been disclosed that allow fingerprints of audio signals to be created in a way that reduces the amount of noise captured in the fingerprint. Additionally, by sampling audio from less energetic regions of the audio signal, more robust audio fingerprints can be created than with previously used audio fingerprinting methods.

Although certain example methods, apparatus, and articles of manufacture have been disclosed herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus, and articles of manufacture that fall fully within the scope of the claims of this patent.

Claims

1. An apparatus for audio fingerprinting, the apparatus comprising:

a frequency range separator to transform an audio signal into a frequency domain, the transformed audio signal including a plurality of time-frequency bins, the plurality of time-frequency bins including a first time-frequency bin;

an audio characteristic determiner to determine a first characteristic of a first group of time-frequency bins of the plurality of time-frequency bins, the first group of time-frequency bins surrounding the first time-frequency bin;

a signal normalizer to normalize the audio signal to generate normalized energy values, wherein normalizing the audio signal includes normalizing the first time-frequency bin according to the first characteristic;

a point selector to select one of the normalized energy values; and

a fingerprint generator to generate a fingerprint of the audio signal using the selected one of the normalized energy values,

wherein the point selector is further to:

determine a category of the audio signal; and

weight the selection of the one of the normalized energy values according to the category of the audio signal,

wherein black portions of the normalized and weighted spectrogram generated by the point selector represent high-energy regions of the audio spectrum, and

wherein each of the plurality of time-frequency bins is a unique combination of (1) a time period of the audio signal and (2) a frequency bin of the transformed audio signal.

2. The apparatus of claim 1, wherein the frequency range separator further performs a fast Fourier transform on the audio signal.

3. The apparatus of claim 1, wherein the category of the audio signal includes at least one of music, human speech, sound effects, and advertisements.

4. The apparatus according to any one of claims 1 to 3, wherein the audio characteristic determiner further determines a second characteristic of a second group of time-frequency bins in the plurality of time-frequency bins, the second group of time-frequency bins surrounding a second time-frequency bin in the plurality of time-frequency bins, and the signal normalizer further normalizes the first time-frequency bin according to the first characteristic.

5. The apparatus according to any one of claims 1 to 3, wherein the point selector selects one of the normalized energy values based on the energy extrema of the normalized audio signal.

6. The apparatus of claim 1, wherein each of the normalized energy values corresponds to a respective time-frequency bin of the plurality of time-frequency bins.

7. The apparatus of claim 1, wherein the determined first characteristic includes at least one of the following: (i) mean energy value, (ii) mode energy value, (iii) median energy value, (iv) mode power value, and (v) mean amplitude of the audio signal.

8. A method for audio fingerprinting, the method comprising:

transforming an audio signal into a frequency domain, the transformed audio signal including a plurality of time-frequency bins, the plurality of time-frequency bins including a first time-frequency bin;

determining a first characteristic of a first group of time-frequency bins of the plurality of time-frequency bins, the first group of time-frequency bins surrounding the first time-frequency bin;

normalizing the audio signal to generate normalized energy values, wherein normalizing the audio signal includes normalizing the first time-frequency bin according to the first characteristic;

selecting one of the normalized energy values; and

generating a fingerprint of the audio signal using the selected one of the normalized energy values,

wherein selecting the one of the normalized energy values includes:

determining a category of the audio signal; and

wherein, in the generated normalized and weighted spectrogram, black portions represent high-energy regions of the audio spectrum, and

9. The method of claim 8, wherein the step of transforming the audio signal to the frequency domain includes performing a fast Fourier transform on the audio signal.

10. The method of claim 8, wherein the category of the audio signal includes at least one of music, human speech, sound effects, and advertisements.

11. The method according to any one of claims 8 to 10, further comprising:

determining a second characteristic of a second group of time-frequency bins of the plurality of time-frequency bins, the second group of time-frequency bins surrounding a second time-frequency bin of the plurality of time-frequency bins; and

normalizing the first time-frequency bin according to the first characteristic.

12. The method according to any one of claims 8 to 10, wherein the step of selecting one of the normalized energy values is based on the energy extrema of the normalized audio signal.

13. The method of claim 8, wherein each of the normalized energy values corresponds to a respective time-frequency bin of the plurality of time-frequency bins.

14. The method of claim 8, wherein the determined first characteristic includes at least one of the following: (i) mean energy value, (ii) mode energy value, (iii) median energy value, (iv) mode power value, and (v) mean amplitude of the audio signal.

15. A non-transitory computer-readable storage medium comprising instructions that, when executed, cause a processor to at least:

select one of the normalized energy values; and

wherein the instructions, when executed, further cause the processor to:

determine a category of the audio signal; and

wherein black portions of the normalized and weighted spectrogram represent high-energy regions of the audio spectrum, and...

16. The non-transitory computer-readable storage medium of claim 15, wherein the processor further performs a fast Fourier transform of the audio signal.

17. The non-transitory computer-readable storage medium of claim 15, wherein the processor further determines a second characteristic of a second group of time-frequency bins in the plurality of time-frequency bins, the second group of time-frequency bins surrounding a second time-frequency bin in the plurality of time-frequency bins, and the processor further normalizes the first time-frequency bin according to the first characteristic.

18. The non-transitory computer-readable storage medium of claim 15, wherein the processor selects the one of the normalized energy values based on the energy extrema of the normalized audio signal.

19. The non-transitory computer-readable storage medium of claim 15, wherein each of the normalized energy values corresponds to a respective time-frequency bin of the plurality of time-frequency bins.

20. The non-transitory computer-readable storage medium of claim 15, wherein the determined first characteristic includes at least one of the following: (i) mean energy value, (ii) mode energy value, (iii) median energy value, (iv) mode power value, and (v) mean amplitude of the audio signal.

21. An apparatus for audio fingerprinting, the apparatus comprising:

at least one memory;

instructions; and

at least one processor to execute the instructions to:

normalize the audio signal to generate normalized energy values, wherein normalizing the audio signal includes normalizing the first time-frequency bin according to the first characteristic, each of the normalized energy values corresponding to a respective time-frequency bin of the plurality of time-frequency bins;

select one of the normalized energy values; and

wherein selecting the one of the normalized energy values includes:

determining a category of the audio signal; and

weighting the selection of the one of the normalized energy values according to the category of the audio signal, and

wherein, in the generated normalized and weighted spectrogram, black portions represent high-energy regions of the audio spectrum, and

wherein each of the plurality of time-frequency bins is a unique combination of (1) a time period of the audio signal and (2) a frequency bin of the transformed audio signal.

22. The apparatus of claim 21, wherein transforming the audio signal to the frequency domain comprises performing a fast Fourier transform on the audio signal.

23. The apparatus of claim 21, wherein the category of the audio signal includes at least one of music, human speech, sound effects, and advertisements.

24. The apparatus of claim 21, wherein the at least one processor executes the instructions to:

25. The apparatus of claim 21, wherein the determined first characteristic includes at least one of the following: (i) mean energy value, (ii) mode energy value, (iii) median energy value, (iv) mode power value, and (v) mean amplitude of the audio signal.