
TW201828285A - Audio recognition method and system improving the feature point matching success rate in the audio recognition - Google Patents

Audio recognition method and system improving the feature point matching success rate in the audio recognition

Info

Publication number
TW201828285A
TW201828285A (application TW106101958A)
Authority
TW
Taiwan
Prior art keywords
feature point
audio file
feature
window
map
Prior art date
Application number
TW106101958A
Other languages
Chinese (zh)
Inventor
杜志軍
Original Assignee
阿里巴巴集團服務有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集團服務有限公司 filed Critical 阿里巴巴集團服務有限公司
Priority to TW106101958A priority Critical patent/TW201828285A/en
Publication of TW201828285A publication Critical patent/TW201828285A/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application discloses an audio recognition method, including: performing diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, there being a plurality of first feature points; searching the spectrogram of a target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map; and if such points are found, determining that the audio file to be recognized is part of the target audio file. The present application also discloses an embodiment of an audio recognition system. With these embodiments, the feature point matching success rate in audio recognition can be improved.

Description

Audio recognition method and system

The present application relates to the field of Internet technologies, and in particular to an audio recognition method and system.

With the continuous development of Internet technology, the Internet has become an indispensable tool in people's lives. Using Internet devices to recognize unknown audio, and building interactions on top of such audio recognition, has become a new application trend.

Interaction based on audio recognition has many applications. In one example, a user who hears a song without knowing its title can record a segment of the song and then use audio recognition technology to identify information such as the song's title and singer.

In the prior art, feature points are generally extracted from the audio to be recognized, and recognition is performed using feature point pairs. As shown in FIG. 1, the horizontal axis represents time and the vertical axis represents frequency. The extracted feature points are the "X" marks in the figure; two feature points form a feature point pair, and there are 8 feature point pairs in the target region. Recognition is performed against a database using these feature point pairs; the database stores the feature points of songs together with song information such as title and singer. If the same feature point pairs can be matched within the same target region in the database, the match succeeds, and the corresponding song information can then be obtained. However, because recording audio is inevitably affected by noise, the extracted feature points do not always appear at their normal positions, so the probability of successfully matching feature point pairs is low.

In summary, the prior art suffers from a low feature point matching success rate in audio recognition.

An object of the embodiments of the present application is to provide an audio recognition method and system that solve the prior-art problem of a low feature point matching success rate in audio recognition.

To solve the above technical problem, an audio recognition method provided by an embodiment of the present application includes: performing diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, there being a plurality of first feature points; searching the spectrogram of a target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map; and if such points exist, determining that the audio file to be recognized is part of the target audio file.

An audio recognition system provided by an embodiment of the present application includes: a diffusion unit, configured to perform diffusion processing on first feature points in a spectrogram of an audio file to be recognized to obtain a feature point map, there being a plurality of first feature points; a search unit, configured to search the spectrogram of a target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map; and a determination unit, configured to determine, when second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map are found in the spectrogram of the target audio file, that the audio file to be recognized is part of the target audio file.

As can be seen from the technical solutions above, the audio recognition method and system provided by the embodiments of the present application apply diffusion processing to the first feature points in the spectrogram of the audio file to be recognized, which reduces the deviation of those first feature points caused by noise. This raises the rate at which the diffusion-processed first feature points match the target audio file, i.e., it improves the feature point matching success rate.

210‧‧‧Diffusion unit

220‧‧‧Search unit

230‧‧‧Determination unit

To explain the embodiments of the present application or the prior-art technical solutions more clearly, the drawings needed in describing the embodiments or the prior art are briefly introduced below. Obviously, the following drawings show only some embodiments recorded in the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of prior-art recognition using feature point pairs; FIG. 2 is a flowchart of the audio recognition method provided in an embodiment of the present application; FIG. 3 is a schematic diagram of the spectrogram of the audio file to be recognized; FIG. 4a is a schematic diagram of first feature points before diffusion processing; FIG. 4b is a schematic diagram of first feature points after diffusion processing; FIG. 5 is a flowchart of step S120 of the method of FIG. 2; FIG. 6 is a schematic diagram of searching the spectrogram of the target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map; FIG. 7 is a flowchart of the audio recognition method provided in another embodiment of the present application; FIG. 8a is a schematic diagram of first feature points determined in the spectrogram; FIG. 8b is a partial enlargement of FIG. 8a; FIG. 9 is a module diagram of the audio recognition system provided in an embodiment of the present application.

To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.

FIG. 2 is a flowchart of the audio recognition method provided in an embodiment of the present application. In this embodiment, the audio recognition method includes the following steps:

S110: Perform diffusion processing on the first feature points in the spectrogram of the audio file to be recognized to obtain a feature point map, there being a plurality of first feature points.

A spectrogram, also called a speech spectrogram, is generally obtained by processing a received time-domain signal. Typically, the abscissa of a spectrogram represents time, the ordinate represents frequency, and the value at each coordinate point represents the energy of the speech data. A two-dimensional plane is usually used to express this three-dimensional information, so the magnitude of the energy at each coordinate point can be expressed by colour. In a colour rendering, a darker colour can indicate stronger speech energy at that point, and a lighter colour weaker energy. In a greyscale rendering, a colour closer to white can indicate stronger speech energy at that point, and a colour closer to black weaker energy.

In this way, the spectrogram intuitively shows how the spectral characteristics of the speech signal vary over time. The strength of any given frequency component at a given moment is indicated by the grey level or colour intensity of the corresponding point.

Specifically, the spectrogram can be obtained through the following steps:

A1: Divide the audio file to be recognized into frames according to a preset time.

The preset time may be an empirical value derived from past experience. In this embodiment, the preset time includes 32 milliseconds: the audio file to be recognized is divided into 32-millisecond frames with a 16-millisecond overlap between adjacent frames.

A2: Perform short-time spectrum analysis on the framed audio segments to obtain the spectrogram.

The short-time spectrum analysis includes the fast Fourier transform (FFT). The FFT is a fast algorithm for the discrete Fourier transform; with it, the audio signal can be converted into a spectrogram that records the joint time-frequency distribution of the signal.

Since framing uses 32-millisecond frames, and 32 milliseconds at an 8000 Hz sampling rate corresponds to 256 samples, 256 frequency points are obtained after the FFT.

As shown in FIG. 3, the horizontal axis can represent the frame index, i.e., the number of frames after the audio file is framed, corresponding to the width of the spectrogram; the vertical axis can represent frequency, with 256 frequency points in total, corresponding to the height of the spectrogram; and the coordinate point values represent the energies of the first feature points.
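As a concrete illustration of steps A1 and A2, the framing and per-frame FFT described above can be sketched as follows (a minimal sketch assuming an 8 kHz mono signal already loaded as a NumPy array; the Hann window and plain magnitude FFT are illustrative choices, not the patent's exact implementation):

```python
import numpy as np

def spectrogram(samples, frame_len=256, hop=128):
    """Split the signal into 32 ms frames (256 samples at 8 kHz) with a
    16 ms (128-sample) overlap, then take the FFT magnitude per frame,
    yielding 256 frequency points per frame as in step A2."""
    n_frames = 1 + (len(samples) - frame_len) // hop
    spec = np.empty((frame_len, n_frames))
    window = np.hanning(frame_len)  # taper each frame before the FFT
    for t in range(n_frames):
        frame = samples[t * hop : t * hop + frame_len] * window
        spec[:, t] = np.abs(np.fft.fft(frame))
    return spec  # shape: (frequency points, frames)
```

The returned array's horizontal axis is the frame index and its vertical axis the frequency point, matching the description of FIG. 3.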

Preferably, after the short-time spectrum analysis of the framed audio segments, the method may further include:

A3: Extract the 300 Hz–2 kHz frequency band from the short-time spectrum.

Since the dominant frequencies of typical songs are concentrated in the 300 Hz–2 kHz band, extracting this band in this embodiment removes the negative influence of noise in other frequency bands.

In another embodiment of the present application, before step S110, the method may further include: normalizing the energy values of the first feature points of the spectrogram of the audio file to be recognized into grey values of the first feature points.

In this embodiment, because the energy values of the first feature points after the FFT span a wide range — sometimes 0 to 2^8 or even 0 to 2^16 (the range is proportional to the signal strength of the audio file) — the energy values are normalized here into the range 0-255, so that 0-255 can be treated as grey values, with 0 representing black and 255 representing white.

A general normalization method is: traverse the energy values of the first feature points over the entire spectrogram to obtain the maximum and minimum values, then normalize each first feature point as

G = 255 × (V − V_min) / (V_max − V_min)

where V is the energy value of a first feature point, V_min is the minimum value, and V_max is the maximum value.

The embodiments of the present application may use this general normalization method. However, when some very quiet passages exist, the obtained V_min is too small — it may approach 0 — so the formula degenerates to G = 255 × V / V_max and no longer depends on V_min. Such a V_min is therefore not representative, and it distorts the overall normalization result.

A new normalization method is provided in an embodiment of the present application. It may include: traversing the spectrogram frame by frame with a window of a first preset length; obtaining the local maximum and local minimum among the energy values of the first feature points within the window; and normalizing the energy values of the first feature points into grey values of the first feature points according to the local maximum and local minimum.

Formula (2) is used: G = 255 × (V − V_min) / (V_max − V_min), where V is the energy value of a first feature point, V_min is the local minimum, and V_max is the local maximum.

Described in terms of the framed signal, the first preset length may cover from the T frames before the current frame to the T frames after it; that is, the first preset length is 2T frames, and the 2T+1 frames span more than 1 second.

With the normalization method provided in this embodiment, a quiet passage can only affect the normalization result within the first preset length containing it, and cannot affect results outside that length. This normalization method therefore reduces the influence of quiet passages on the overall normalization result.
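The windowed normalization above can be sketched as follows (a minimal sketch assuming the spectrogram is a NumPy array of energies; the value of `T` is illustrative and should be chosen so that 2T+1 frames exceed 1 second):

```python
import numpy as np

def normalize_local(spec, T=32):
    """Slide a (2T+1)-frame window over the spectrogram and map each
    frame's energies to 0-255 grey values using the window's local
    min/max, per formula (2)."""
    n_frames = spec.shape[1]
    grey = np.zeros_like(spec, dtype=float)
    for t in range(n_frames):
        lo, hi = max(0, t - T), min(n_frames, t + T + 1)
        v_min = spec[:, lo:hi].min()  # local minimum in the window
        v_max = spec[:, lo:hi].max()  # local maximum in the window
        if v_max > v_min:
            grey[:, t] = 255.0 * (spec[:, t] - v_min) / (v_max - v_min)
    return grey
```

A quiet stretch can now only distort the grey values inside the windows that contain it, which is precisely the point of the local method.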

The diffusion processing may include Gaussian-function diffusion, i.e., diffusing the first feature points with a Gaussian function; it may also include enlargement, i.e., enlarging each first feature point several times, for example by a factor of 10.

Taking Gaussian-function diffusion as an example, the following formula is used:

f(x) = a · exp(−(x − b)² / (2c²))

where a, b, and c are constants and a > 0. That is, Gaussian-function diffusion is applied to the radius or diameter of each first feature point using this formula.

Taking enlargement of the first feature points as an example, the radius or diameter of each first feature point is enlarged, for example by a factor of 10. Of course, in some embodiments, a first feature point may also be enlarged several times into at least one of a circle, a diamond, a rectangle, and so on.

As shown in FIG. 4a, before diffusion processing the white points (first feature points of the audio file to be recognized) deviate from the black points (feature points of the target audio file), so few second feature points are ultimately matched. As shown in FIG. 4b, after diffusion processing each white point has spread from a single point into a region, and these regions coincide with the black points.

Diffusion processing spreads each first feature point from a point into a region, giving a degree of robustness to noise. For example, under noise interference the first feature points of the recorded audio may deviate slightly in position from those of the original audio; after diffusion processing this deviation can be ignored, increasing the number of matched second feature points.
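The enlargement-style diffusion can be sketched as follows (a minimal sketch; the patent also allows Gaussian-function diffusion, and the square shape and radius of 10 used here are illustrative stand-ins for the enlarged circle/diamond/rectangle):

```python
import numpy as np

def diffuse(point_map, radius=10):
    """Spread each first feature point (a 1 in the binary map) into a
    square region of the given radius, producing the feature point map."""
    diffused = np.zeros_like(point_map)
    for y, x in zip(*np.nonzero(point_map)):
        diffused[max(0, y - radius):y + radius + 1,
                 max(0, x - radius):x + radius + 1] = 1
    return diffused
```

A target feature point offset from the original position by a few bins or frames still lands inside the diffused region, which is how the deviation caused by noise is absorbed.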

S120: Search the spectrogram of the target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map.

As shown in FIG. 5, step S120 may specifically include: S121: traverse the spectrogram of the target audio file frame by frame using the feature point map as a window; S122: in each traversal step, determine as second feature points those feature points in the windowed portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points in the window; S123: check whether the windowed portion of the target audio file's spectrogram contains second feature points respectively corresponding to each diffusion-processed first feature point.

FIG. 6 illustrates searching the spectrogram of the target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map. Suppose the feature point map has N frames and the spectrogram of the target audio file has L frames, with L greater than or equal to N. The search first covers the region of frames [0, N] in the target spectrogram, then the region [1, N+1], and so on frame by frame until the traversal ends with the region [L−N, L]. In each traversal step, within the window [t, t+N] (where t is the frame index), the feature points in the target spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points are determined as second feature points. The target audio file is thus searched for second feature points respectively corresponding to each diffusion-processed first feature point.
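The frame-by-frame traversal of steps S121-S123 can be sketched as follows (a minimal sketch over binary point maps; `n_first` is the number of first feature points before diffusion, and the 80% acceptance ratio is an illustrative assumption):

```python
import numpy as np

def find_in_target(diffused, target_points, n_first, min_ratio=0.8):
    """Slide the N-frame feature point map over the L-frame target
    point map ([0, N], [1, N+1], ..., [L-N, L]); target points landing
    inside a diffused region count as second feature points."""
    N = diffused.shape[1]
    L = target_points.shape[1]
    for t in range(L - N + 1):
        window = target_points[:, t:t + N]
        hits = np.logical_and(window > 0, diffused > 0).sum()
        if hits / n_first >= min_ratio:
            return t  # matched: the clip starts at frame t of the target
    return None  # no offset matched
```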

In other embodiments, all audio files in the database may also be traversed. In this way, the audio information of the audio file to be recognized can be identified more accurately.

S130: If so, determine that the audio file to be recognized is part of the target audio file.

If second feature points respectively corresponding to the diffusion-processed first feature points are found in the spectrogram of the target audio file, it can be determined that the audio file to be recognized is part of the target audio file.

In this embodiment, applying diffusion processing to the first feature points in the spectrogram of the audio file to be recognized reduces the deviation of those points caused by noise, thereby raising the rate at which the diffusion-processed first feature points match the target audio file, i.e., improving the feature point matching success rate.

In an embodiment of the present application, step S122 may specifically include: determining the matching degree between the first feature points and those feature points in the windowed portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points in the window; and determining the feature points whose matching degree exceeds a first threshold as second feature points.

The matching degree includes either the ratio of the number of feature points in the windowed spectrogram falling within the coordinate ranges of the diffusion-processed first feature points to the number of first feature points, or the sum of the energy values or grey values of the first feature points corresponding to those feature points. The first threshold may be a statistic derived by the user from the relevant factors taken together.

Taking the ratio variant as an example: if there are 100 diffusion-processed first feature points and 60 such feature points, the matching degree between the first feature points and those feature points is 60%. If the first threshold is, for example, 50%, the matching degree exceeds it and those feature points are determined as second feature points.

Taking the energy-sum variant as an example: if there are 10 such feature points, the energy values of the 10 corresponding first feature points are added to obtain the energy sum. If this sum exceeds the first threshold, those feature points are determined as second feature points.

Taking the grey-value-sum variant as an example: if there are 10 such feature points, the grey values of the 10 corresponding first feature points are added to obtain the grey-value sum. If this sum exceeds the first threshold, those feature points are determined as second feature points.
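The matching-degree variants described above might be computed as follows (a sketch; `window_points` is the windowed portion of the target's point map, `diffused` the diffused feature point map, and `first_values` the per-bin energy or grey values of the first feature points — all names are illustrative assumptions):

```python
import numpy as np

def match_degree(window_points, diffused, n_first, first_values=None):
    """Ratio variant: matched target points / number of first feature
    points. Sum variant: sum of the energy (or grey) values of the
    first feature points whose diffused regions caught a target point."""
    inside = np.logical_and(window_points > 0, diffused > 0)
    if first_values is None:
        return inside.sum() / n_first          # e.g. 60 of 100 -> 0.6
    return float(first_values[inside].sum())   # energy/grey-value sum
```

Whichever variant is used, the result is then compared against the first threshold to decide whether the window's feature points qualify as second feature points.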

In an embodiment of the present application, steps S101 and S102 may further be included before step S110, as shown in FIG. 7. S101: take the feature points in the spectrogram of the audio file to be recognized whose energy value or grey value exceeds a second threshold as key points. The second threshold may be a statistic derived by the user from the relevant factors taken together: the smaller the second threshold, the more key points can be extracted, which may lengthen subsequent matching; the larger the second threshold, the fewer key points can be extracted, which may make the success probability of subsequent matching too low.

S102: If a key point's energy value or grey value is the maximum within a preset region, determine the key point as a first feature point. The preset region may be a circular region centred on the key point with a preset radius, or a rectangular region centred on the key point with a preset length and width.

The preset region may be a statistic derived by the user from the relevant factors taken together: the smaller the preset region, the more first feature points can be determined, which may lengthen subsequent matching; the larger the preset region, the fewer first feature points can be determined, which may make the success probability of subsequent matching too low.

FIG. 8a shows the determined first feature points on the spectrogram; the white points are the first feature points. Specifically, suppose the second threshold is 30 and the preset region is 15×15 (centred on the key point, spanning 15 frames on the abscissa and length 15 on the ordinate). FIG. 8b is a partial enlargement of FIG. 8a; the white points in the figure have energy values or grey values that exceed the second threshold of 30 and remain the maximum within the 15×15 preset region, and such points are extracted as the first feature points.
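Steps S101-S102 with the worked numbers above (second threshold 30, 15×15 region) can be sketched as follows (a minimal sketch over a grey-value array; the brute-force scan is illustrative, not the patent's implementation):

```python
import numpy as np

def extract_first_points(grey, threshold=30, region=15):
    """S101: keep points whose grey value exceeds the second threshold.
    S102: keep such a key point only if it is the maximum of the
    region x region neighbourhood centred on it."""
    h, w = grey.shape
    r = region // 2
    points = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            v = grey[y, x]
            if v <= threshold:
                continue  # below the second threshold: not a key point
            patch = grey[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            if v >= patch.max():
                points[y, x] = True  # local maximum: a first feature point
    return points
```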

This embodiment differs from the previous one in that feature points with large energy or grey values in the spectrogram are extracted as the first feature points. This excludes interference from weak-energy feature points in subsequent matching, and also greatly reduces the amount of data subjected to diffusion processing, thereby improving system performance.

In an embodiment of the present application, the target audio file may carry audio information. When the present application is applied to a song recognition scenario, the audio information may include the song title. The user records an audio file to be recognized without knowing the song title (or the audio file to be recognized is itself a song of unknown title); when the audio file to be recognized is determined to be part of a target audio file, the song title of the audio file to be recognized can be identified.

FIG. 9 is a module diagram of the audio recognition system provided in an embodiment of the present application. In this embodiment, the audio recognition system includes: a diffusion unit 210, configured to perform diffusion processing on the first feature points in the spectrogram of the audio file to be recognized to obtain a feature point map, there being a plurality of first feature points; a search unit 220, configured to search the spectrogram of the target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map; and a determination unit 230, configured to determine, when regions of second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map are found in the spectrogram of the target audio file, that the audio file to be recognized is part of the target audio file.

Preferably, before the diffusion unit 210, the system may further include: a normalization unit, configured to normalize the energy values of the first feature points in the spectrogram of the to-be-recognized audio file into gray values of the first feature points.

Preferably, the diffusion processing includes at least one of Gaussian-function diffusion processing or enlargement processing.
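A Gaussian-function diffusion can be sketched as weighting each point's neighborhood by a 2-D Gaussian, so that nearby coordinates still count as (partial) matches later. The radius and sigma below are illustrative, not values from the patent:

```python
import math

def gaussian_diffuse(points, radius=2, sigma=1.0):
    """Spread each first feature point (t, f) into a weighted neighborhood.
    Returns {(t, f): weight in (0, 1]}; where diffused regions overlap,
    the larger weight is kept."""
    diffused = {}
    for t, f in points:
        for dt in range(-radius, radius + 1):
            for df in range(-radius, radius + 1):
                w = math.exp(-(dt * dt + df * df) / (2 * sigma * sigma))
                key = (t + dt, f + df)
                diffused[key] = max(diffused.get(key, 0.0), w)
    return diffused
```

The simpler "enlargement" variant corresponds to setting every weight in the neighborhood to 1 instead of a Gaussian falloff.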

Preferably, the normalization unit may specifically include: a first normalization subunit, configured to traverse the spectrogram frame by frame using a window of a first preset length; a second normalization subunit, configured to obtain the local maximum and local minimum of the energy values of the first feature points within the window; and a third normalization subunit, configured to normalize the energy values of the first feature points into gray values of the first feature points according to the local maximum and the local minimum.
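A pure-Python sketch of this windowed min-max normalization follows; the window length and the 0–255 gray range are illustrative assumptions:

```python
def normalize_energy(frames, win_len=32):
    """Map each frame's energy value to a gray value in [0, 255] using the
    local maximum/minimum found in a sliding window around that frame."""
    grays = []
    for i, e in enumerate(frames):
        lo = max(0, i - win_len // 2)
        window = frames[lo:lo + win_len]
        mx, mn = max(window), min(window)
        if mx == mn:                      # flat window: avoid division by zero
            grays.append(0)
        else:
            grays.append(round(255 * (e - mn) / (mx - mn)))
    return grays
```

Normalizing against local rather than global extrema keeps quiet passages from being swamped by one loud section of the recording.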

Preferably, the search unit 220 may specifically include: a first search subunit, configured to traverse the spectrogram of the target audio file frame by frame using the feature point map as a window; a second search subunit, configured to, in each traversal, determine the feature points in the windowed portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points in the window as second feature points; and a third search subunit, configured to search whether the windowed portion of the target audio file's spectrogram contains second feature points respectively corresponding to the diffusion-processed first feature points.

Preferably, the second search subunit may specifically include: a fourth search subunit, configured to determine the matching degree between the first feature points and the feature points in the windowed portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points in the window; and a fifth search subunit, configured to determine feature points whose matching degree is greater than a first threshold as second feature points.

Preferably, the matching degree includes either the ratio of the number of feature points in the windowed spectrogram falling within the coordinate ranges of the diffusion-processed first feature points to the number of first feature points, or the sum of the energy values or gray values of the first feature points corresponding to the feature points in the windowed spectrogram falling within those coordinate ranges.
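The two matching-degree variants can be sketched side by side. Here `diffused` maps each first feature point to its diffused coordinate set, `target_pts` is the set of target feature points inside the current window, and `energy` holds each first feature point's energy or gray value (all names are illustrative):

```python
def match_ratio(diffused, target_pts):
    """Ratio variant: matched first feature points / total first feature points."""
    hit = sum(1 for region in diffused.values() if region & target_pts)
    return hit / len(diffused)

def match_energy(diffused, target_pts, energy):
    """Energy variant: sum of the energy (or gray) values of the first
    feature points whose diffused region contains a target feature point."""
    return sum(energy[p] for p, region in diffused.items() if region & target_pts)
```

The energy variant weights strong feature points more heavily, so a match supported only by weak points scores lower.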

Preferably, before the diffusion processing, the system may further include: a first processing unit, configured to take feature points in the spectrogram of the to-be-recognized audio file whose energy values or gray values are greater than a second threshold as key points; and a second processing unit, configured to determine a key point as a first feature point when its energy value or gray value is the maximum within a preset region.
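This two-stage selection, an absolute energy threshold followed by a local-maximum test, can be sketched as follows; the threshold and neighborhood radius are illustrative:

```python
def select_feature_points(spec, thresh=0.5, radius=1):
    """spec: 2-D list of energy/gray values indexed as spec[frame][bin].
    A point is a key point if it exceeds `thresh`, and becomes a first
    feature point only if it is also the maximum of its neighborhood."""
    n_t, n_f = len(spec), len(spec[0])
    points = []
    for t in range(n_t):
        for f in range(n_f):
            if spec[t][f] <= thresh:              # not a key point
                continue
            neighborhood = [
                spec[tt][ff]
                for tt in range(max(0, t - radius), min(n_t, t + radius + 1))
                for ff in range(max(0, f - radius), min(n_f, f + radius + 1))
            ]
            if spec[t][f] >= max(neighborhood):   # local maximum
                points.append((t, f))
    return points
```

The threshold discards weak points that would only add matching noise, and the local-maximum test thins dense clusters, so the diffusion step has far less data to process.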

Preferably, the target audio file carries audio information, and the audio information includes a song title.

In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to circuit structures such as diodes, transistors, and switches) or a software improvement (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD through his or her own programming, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this kind of programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. A person skilled in the art will also understand that a hardware circuit implementing a logic method flow can easily be obtained merely by logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (for example, software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320; a memory controller may also be implemented as part of the memory's control logic. A person skilled in the art also knows that, besides implementing the controller purely in computer-readable program code, the method steps can be logic-programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. Or, the devices for implementing various functions can even be regarded as both software modules implementing the method and structures within the hardware component.

The systems, devices, modules, or units illustrated in the above embodiments may specifically be implemented by a computer chip or an entity, or by a product having a certain function.

For convenience of description, the above devices are described in terms of functions divided into various units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.

A person skilled in the art will understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk memory, CD-ROM, and optical memory) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data-processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data-processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data-processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, the instruction apparatus implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or another programmable data-processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.

The memory may include non-permanent memory among computer-readable media, in the form of random-access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.

A person skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk memory, CD-ROM, and optical memory) containing computer-usable program code.

The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant parts, reference may be made to the description of the method embodiment.

The above descriptions are merely embodiments of the present application and are not intended to limit the present application. Various modifications and changes to the present application are possible for a person skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims (18)

1. An audio recognition method, comprising: performing diffusion processing on first feature points in a spectrogram of a to-be-recognized audio file to obtain a feature point map, the number of the first feature points being plural; searching a spectrogram of a target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map; and if such second feature points exist, determining that the to-be-recognized audio file is part of the target audio file.

2. The method of claim 1, wherein before performing the diffusion processing on the first feature points in the spectrogram of the to-be-recognized audio file, the method further comprises: normalizing energy values of the first feature points in the spectrogram of the to-be-recognized audio file into gray values of the first feature points.

3. The method of claim 1 or 2, wherein the diffusion processing comprises at least one of Gaussian-function diffusion processing or enlargement processing.

4. The method of claim 2, wherein normalizing the energy values of the first feature points in the spectrogram of the to-be-recognized audio file into gray values of the first feature points specifically comprises: traversing the spectrogram frame by frame using a window of a first preset length; obtaining the local maximum and local minimum of the energy values of the first feature points within the window; and normalizing the energy values of the first feature points into gray values of the first feature points according to the local maximum and the local minimum.

5. The method of claim 1 or 2, wherein searching the spectrogram of the target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map specifically comprises: traversing the spectrogram of the target audio file frame by frame using the feature point map as a window; in each traversal, determining feature points in the windowed portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points in the window as second feature points; and searching whether the windowed portion of the target audio file's spectrogram contains second feature points respectively corresponding to the diffusion-processed first feature points.

6. The method of claim 5, wherein determining feature points in the windowed portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points in the window as second feature points comprises: determining the matching degree between the first feature points and the feature points in the windowed portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points; and determining feature points whose matching degree is greater than a first threshold as second feature points.

7. The method of claim 6, wherein the matching degree comprises the ratio of the number of feature points in the windowed spectrogram falling within the coordinate ranges of the diffusion-processed first feature points to the number of first feature points, or the sum of the energy values or gray values of the first feature points corresponding to the feature points in the windowed spectrogram falling within those coordinate ranges.

8. The method of claim 1 or 2, wherein before performing the diffusion processing on the first feature points of the spectrogram of the to-be-recognized audio file, the method further comprises: taking feature points in the spectrogram of the to-be-recognized audio file whose energy values or gray values are greater than a second threshold as key points; and if a key point's energy value or gray value is the maximum within a preset region, determining that key point as a first feature point.

9. The method of claim 1, wherein the target audio file carries audio information, and the audio information includes a song title.

10. An audio recognition system, comprising: a diffusion unit, configured to perform diffusion processing on first feature points in a spectrogram of a to-be-recognized audio file to obtain a feature point map, the number of the first feature points being plural; a search unit, configured to search a spectrogram of a target audio file for second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map; and a determination unit, configured to determine, when second feature points respectively corresponding to the diffusion-processed first feature points in the feature point map are found in the spectrogram of the target audio file, that the to-be-recognized audio file is part of the target audio file.

11. The system of claim 10, further comprising, before the diffusion unit: a normalization unit, configured to normalize energy values of the first feature points in the spectrogram of the to-be-recognized audio file into gray values of the first feature points.

12. The system of claim 10 or 11, wherein the diffusion processing comprises at least one of Gaussian-function diffusion processing or enlargement processing.

13. The system of claim 11, wherein the normalization unit specifically comprises: a first normalization subunit, configured to traverse the spectrogram frame by frame using a window of a first preset length; a second normalization subunit, configured to obtain the local maximum and local minimum of the energy values of the first feature points within the window; and a third normalization subunit, configured to normalize the energy values of the first feature points into gray values of the first feature points according to the local maximum and the local minimum.

14. The system of claim 10 or 11, wherein the search unit specifically comprises: a first search subunit, configured to traverse the spectrogram of the target audio file frame by frame using the feature point map as a window; a second search subunit, configured to, in each traversal, determine feature points in the windowed portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points in the window as second feature points; and a third search subunit, configured to search whether the windowed portion of the target audio file's spectrogram contains second feature points respectively corresponding to the diffusion-processed first feature points.

15. The system of claim 14, wherein the second search subunit specifically comprises: a fourth search subunit, configured to determine the matching degree between the first feature points and the feature points in the windowed portion of the target audio file's spectrogram whose coordinates fall within the coordinate ranges of the diffusion-processed first feature points in the window; and a fifth search subunit, configured to determine feature points whose matching degree is greater than a first threshold as second feature points.

16. The system of claim 15, wherein the matching degree comprises the ratio of the number of feature points in the windowed spectrogram falling within the coordinate ranges of the diffusion-processed first feature points to the number of first feature points, or the sum of the energy values or gray values of the first feature points corresponding to the feature points in the windowed spectrogram falling within those coordinate ranges.

17. The system of claim 10 or 11, further comprising, before the diffusion processing: a first processing unit, configured to take feature points in the spectrogram of the to-be-recognized audio file whose energy values or gray values are greater than a second threshold as key points; and a second processing unit, configured to determine a key point as a first feature point when its energy value or gray value is the maximum within a preset region.

18. The system of claim 10, wherein the target audio file carries audio information, and the audio information includes a song title.
TW106101958A 2017-01-19 2017-01-19 Audio recognition method and system improving the feature point matching success rate in the audio recognition TW201828285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW106101958A TW201828285A (en) 2017-01-19 2017-01-19 Audio recognition method and system improving the feature point matching success rate in the audio recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW106101958A TW201828285A (en) 2017-01-19 2017-01-19 Audio recognition method and system improving the feature point matching success rate in the audio recognition

Publications (1)

Publication Number Publication Date
TW201828285A true TW201828285A (en) 2018-08-01

Family

ID=63960286

Family Applications (1)

Application Number Title Priority Date Filing Date
TW106101958A TW201828285A (en) 2017-01-19 2017-01-19 Audio recognition method and system improving the feature point matching success rate in the audio recognition

Country Status (1)

Country Link
TW (1) TW201828285A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI814181B (en) * 2021-02-01 2023-09-01 宏達國際電子股份有限公司 Audio processing method and electronic apparatus
CN114822589A (en) * 2022-04-02 2022-07-29 中科猷声(苏州)科技有限公司 Indoor acoustic parameter measuring method, model building method, device and electronic equipment
CN114822589B (en) * 2022-04-02 2023-07-04 中科猷声(苏州)科技有限公司 Indoor acoustic parameter determination method, model construction method, device and electronic equipment

Similar Documents

Publication Publication Date Title
WO2017050175A1 (en) Audio recognition method and system
CN109065044A (en) Wake up word recognition method, device, electronic equipment and computer readable storage medium
Roma et al. Recurrence quantification analysis features for environmental sound recognition
Wang et al. Digital audio tampering detection based on ENF consistency
CN107680584B (en) Method and device for segmenting audio
CN112397073B (en) Audio data processing method and device
CN108804525B (en) Intelligent answering method and device
CN104077336A (en) Method and device for dragging audio file to retrieve audio file information
CN116708917A (en) Video processing method, device, equipment and medium
TW201828285A (en) Audio recognition method and system improving the feature point matching success rate in the audio recognition
CN114495982B (en) Risk detection methods, devices and equipment
CN119274119B (en) Video content risk detection method, device, medium and equipment
CN114444542A (en) Liquid chromatography peak noise estimation method, device, storage medium and system
Song et al. Feature extraction and classification for audio information in news video
CN115620706B (en) Model training method, device, equipment and storage medium
CN111489739A (en) Phoneme recognition method and device and computer readable storage medium
CN115171735B (en) Voice activity detection method, storage medium and electronic device
CN110348377B (en) Fingerprint identification method and device
HK1235537B (en) Method and system for audio recognition
HK1235537A1 (en) Method and system for audio recognition
HK1235537A (en) Method and system for audio recognition
CN114373446A (en) Conference language determination method and device and electronic equipment
CN111275095B (en) Method and device for object type identification
CN120544566A (en) Speech analysis method, device, electronic device, and computer-readable storage medium
CN117953871A (en) Speech endpoint detection model training method, speech endpoint detection method and device