Disclosure of Invention
In view of the foregoing, it is desirable to provide an audio fingerprint extraction method, apparatus, computer device and storage medium capable of improving noise robustness.
A method of audio fingerprint extraction, the method comprising:
acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal;
acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
obtaining audio characteristic hash data according to the updated star spectrogram;
and obtaining the audio fingerprint according to the audio characteristic hash data.
In one embodiment, the audio signal includes a target audio and a noise audio; acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal comprises:
acquiring an audio signal, performing voice endpoint detection processing on the audio signal based on a voice endpoint detection deep learning model, and identifying the noise audio and the target audio in the audio signal, wherein the voice endpoint detection deep learning model is constructed from audio training data.
In one embodiment, the updated star spectrogram is the star spectrogram corresponding to the target audio; the training process of the voice endpoint detection deep learning model comprises the following steps:
acquiring audio training data with classification labels, wherein the audio training data comprises target audio training data and noise audio training data;
and acquiring an initial voice endpoint detection deep learning model, and training the initial voice endpoint detection deep learning model through a voice endpoint detection algorithm according to the audio training data to obtain the voice endpoint detection deep learning model.
In one embodiment, before the acquiring a star spectrogram corresponding to the audio signal and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram, the method further includes:
and performing voice enhancement processing on the target audio, wherein the voice enhancement processing comprises any one of voice enhancement processing based on spectral subtraction and voice enhancement processing based on a deep learning algorithm.
In one embodiment, the obtaining a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram includes:
extracting a star spectrogram corresponding to the audio signal based on an audio fingerprint algorithm, wherein star points on the star spectrogram are distributed along a time axis;
detecting the star points at each moment in the star spectrogram along a time axis;
and when the audio signals corresponding to the star points are detected to be noise audio, clearing the star points corresponding to the noise audio to obtain an updated star spectrogram.
In one embodiment, the audio feature hash data comprises an audio feature hash table; obtaining the audio feature hash data according to the updated star spectrogram comprises:
forming a hash key according to any two star points in the updated star spectrogram corresponding to the target audio;
and constructing an audio characteristic hash table according to the hash key and the hash value corresponding to the hash key.
In one embodiment, constructing the audio feature hash table according to the hash key and the hash value corresponding to the hash key includes:
determining a hash value corresponding to the hash key according to the time deviation of any two star points on a time axis;
obtaining a hash pair according to the hash key and the hash value corresponding to the hash key;
and constructing an audio characteristic hash table according to the hash pairs.
An audio fingerprint extraction device, the device comprising:
the audio signal acquisition module is used for acquiring an audio signal, performing voice endpoint detection processing on the audio signal and identifying noise audio in the audio signal;
the noise audio clearing module is used for acquiring a star spectrogram corresponding to the audio signal, and clearing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
the audio data processing module is used for obtaining audio characteristic hash data according to the updated star spectrogram;
and the audio fingerprint generation module is used for obtaining the audio fingerprint according to the audio characteristic hash data.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal;
acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
obtaining audio characteristic hash data according to the updated star spectrogram;
and obtaining the audio fingerprint according to the audio characteristic hash data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal;
acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
obtaining audio characteristic hash data according to the updated star spectrogram;
and obtaining the audio fingerprint according to the audio characteristic hash data.
According to the audio fingerprint extraction method, apparatus, computer device, and storage medium, an audio signal is acquired, voice endpoint detection processing is performed on the audio signal, and the noise audio in the audio signal is identified; the noise audio can thus be effectively identified, which facilitates its subsequent removal to obtain a more accurate audio signal. By acquiring the star spectrogram corresponding to the audio signal and removing the corresponding star points of the noise audio in the star spectrogram, an updated star spectrogram is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed according to the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, the influence of noise audio on audio fingerprint extraction can be effectively resisted, and noise robustness is improved.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The audio fingerprint extraction method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 acquires the audio signal sent by the terminal 102, performs voice endpoint detection processing on the audio signal, and identifies the noise audio in the audio signal; acquires a star spectrogram corresponding to the audio signal, and removes corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram; obtains audio feature hash data according to the updated star spectrogram; and obtains the audio fingerprint according to the audio feature hash data. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, an audio fingerprint extraction method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying a noise audio in the audio signal.
The audio signal may be data carrying the frequency and amplitude variation information of regular sound waves such as voice, music, and sound effects. The voice endpoint detection processing is processing performed by VAD (Voice Activity Detection) technology, hereinafter referred to as VAD processing. The acquired audio signal includes the desired target audio and uncorrelated noise audio, where the noise audio is an interference signal that disturbs the desired target audio. The interfering noise audio includes noise audio overlapping with the target audio, noise audio juxtaposed with the target audio, and the like. Both overlapping noise and juxtaposed noise can be identified by VAD processing.
Specifically, when the server acquires the audio signal, the server performs voice endpoint detection processing on the audio signal based on a voice endpoint detection algorithm. From the audio after the voice endpoint detection processing, the interfering and overlapping noise audio in the audio signal can be identified, and the part of the audio signal that is not noise audio corresponds to the desired target audio.
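As an illustration of the frame-level decision described above, the following is a minimal sketch of a signal-processing VAD that labels each frame as target audio or noise by thresholding its short-time energy. The frame length, threshold, and toy signal are hypothetical choices, not taken from the source; the patent may equally use a deep learning VAD model.

```python
import numpy as np

def energy_vad(signal, frame_len=256, threshold=0.01):
    """Return a boolean array: True = target-audio frame, False = noise frame."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)  # short-time energy per frame
    return energy > threshold

# Toy signal: low-level noise followed by a loud sine tone (target audio).
t = np.arange(2048) / 8000.0
sig = np.concatenate([0.001 * np.random.randn(1024),
                      0.5 * np.sin(2 * np.pi * 440 * t[:1024])])
labels = energy_vad(sig)  # frames 0-3 noise, frames 4-7 target audio
```

The boolean frame labels play the role of the VAD identification result that later steps use to decide which star points to remove.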
And step 204, acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram.
The star spectrogram (constellation map) is a noise-robust feature extracted from the audio signal. Noise robustness refers to the ability to withstand the effects of noise. The star spectrogram consists of star points distributed over the time-frequency plane.
Specifically, the server acquires the star spectrogram corresponding to the audio signal. The star spectrogram is formed by performing a Fourier transform on the audio signal to obtain a spectrogram of the audio, and then selecting noise-robust points from the signal of each frame in the spectrogram. These noise-robust points are the star points of the star spectrogram, whose horizontal axis represents time and whose vertical axis represents frequency. There are many ways to select the points, such as taking formant peaks, or dividing the frequency axis into several intervals and selecting the point with the largest energy in each interval. The server sets to zero all star points corresponding to the segments identified as noise audio, retaining only the star points identified as target audio, so as to remove the corresponding star points of the noise audio from the star spectrogram. At this point, only the star points corresponding to the target audio remain in the star spectrogram, and the updated star spectrogram corresponding to the target audio is obtained. For example, the star spectrogram may be traversed along the time axis, and if the VAD identification result at the current moment is noise audio, all star points at that moment are deleted; finally, the updated star spectrogram corresponding to the target audio is obtained.
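The zeroing step above can be sketched as follows. This is an illustrative model only: the star spectrogram is represented as a binary time-frequency matrix (1 = star point), and every column whose frame the VAD marked as noise is cleared. The shapes and names are assumptions for the example.

```python
import numpy as np

def remove_noise_star_points(star_map, vad_labels):
    """Zero every star-map column whose frame is labeled noise (False)."""
    updated = star_map.copy()
    updated[:, ~vad_labels] = 0  # clear all star points at noise frames
    return updated

# 4 frequency bins x 6 time frames; frames 0 and 3 were identified as noise.
star_map = np.array([[1, 0, 0, 1, 0, 1],
                     [0, 1, 0, 0, 1, 0],
                     [1, 0, 1, 1, 0, 0],
                     [0, 0, 0, 0, 0, 1]])
vad = np.array([False, True, True, False, True, True])
updated = remove_noise_star_points(star_map, vad)
```

Only the star points at target-audio frames survive, which is the "updated star spectrogram" used by the hashing steps.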
And step 206, obtaining the audio characteristic hash data according to the updated star spectrogram.
The updated star spectrogram contains only the star points of the target audio. The audio feature hash data is obtained based on an audio fingerprint algorithm; it is the hash data corresponding to the audio signal of the target audio and is used to represent the audio fingerprint.
Specifically, the server obtains the audio feature hash data from the star points in the updated star spectrogram. For example, a hash key and a hash value may be obtained from any two star points in the updated star spectrogram, and a hash table may be formed from the hash keys and hash values.
And step 208, obtaining the audio fingerprint according to the audio characteristic hash data.
The hash data may be a hash table. The audio fingerprint refers to a unique digital feature of the audio signal, embodied in the form of an identifier, and is used for matching and comparing audio, for example to identify a huge number of sound samples or to track and locate the positions of samples in a database. An audio fingerprint is essentially an identifier associated with an audio signal. Taking the hash table as an example: specifically, the server obtains the audio fingerprint corresponding to the target audio according to the constructed audio feature hash table.
In the audio fingerprint extraction method, an audio signal is acquired and voice endpoint detection processing is performed on it to identify the noise audio in the audio signal; the noise audio can thus be effectively identified, which facilitates its subsequent removal to obtain a more accurate audio signal. By acquiring the star spectrogram corresponding to the audio signal and removing the corresponding star points of the noise audio in the star spectrogram, an updated star spectrogram is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed according to the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, the influence of noise audio on audio fingerprint extraction can be effectively resisted, and noise robustness is improved.
In one embodiment, as shown in fig. 3, the audio signal includes a target audio and a noise audio, and step 202, acquiring the audio signal, performing voice endpoint detection processing on the audio signal, and identifying the noise audio in the audio signal, includes:
step 302, acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying a noise audio and a target audio in the audio signal.
The voice endpoint detection processing is processing performed by VAD (Voice Activity Detection) technology, hereinafter referred to as VAD processing. The VAD processing may be performed by a VAD algorithm based on signal processing, or by collecting data of the target audio type and the noise audio type as training data to train a VAD deep learning model; the audio processed by a trained VAD deep learning model is more accurate and more noise-robust. VAD processing, followed by the subsequent processing based on the audio fingerprint algorithm, makes the audio more noise-robust. Specifically, the server acquires an audio signal that includes a target audio and a noise audio, performs VAD processing on the audio signal, and thereby identifies the target audio and the noise audio in the audio signal.
By acquiring the audio signal and performing voice endpoint detection processing on it, the noise audio and the target audio in the audio signal can be accurately distinguished and identified, so that a more accurate audio signal can be obtained after the noise audio is subsequently removed. By acquiring the star spectrogram corresponding to the audio signal and removing the corresponding star points of the noise audio in the star spectrogram, an updated star spectrogram is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed according to the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, the influence of noise audio on audio fingerprint extraction can be effectively resisted, and noise robustness is improved.
Before step 204, acquiring a star spectrogram corresponding to the audio signal and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram, the method further comprises:
and step 304, performing speech enhancement processing on the target audio, wherein the speech enhancement processing comprises any one of speech enhancement processing based on spectral subtraction and speech enhancement processing based on a deep learning algorithm.
Here, the speech enhancement processing refers to enhancing the desired target audio in the audio signal. Specifically, after the noise audio and the target audio in the audio signal are identified by VAD processing, the target audio is enhanced by speech enhancement processing, which may be either speech enhancement based on spectral subtraction or speech enhancement based on a deep learning algorithm. Spectral subtraction is a method of subtracting the spectrum of the noise signal from the spectrum of the noisy signal; speech enhancement based on spectral subtraction may subtract the estimated power spectrum of the noise from the noisy speech. The noise signal is the signal corresponding to the noise audio of the audio signal. Speech enhancement based on a deep learning algorithm may train a speech enhancement deep learning network and input the noisy speech into the network to obtain the enhanced speech. For example, a section of audio signal may include both speech with superimposed noise and audio that is entirely noise; in the part where the noise audio is superimposed on the target audio, speech enhancement of the target audio is performed by spectral subtraction or a deep learning algorithm. Further, the speech enhancement processing is not limited to spectral subtraction or a deep learning algorithm; other speech enhancement algorithms may also be used.
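The spectral-subtraction variant described above can be sketched minimally: subtract an estimated noise magnitude spectrum from each noisy frame's spectrum and clamp negative results at a floor. The spectra below are toy values for illustration; a real implementation would estimate the noise spectrum from frames the VAD marked as noise-only.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.0):
    """Subtract an estimated noise magnitude spectrum, clamped at a floor."""
    return np.maximum(noisy_mag - noise_mag, floor)

noisy = np.array([[5.0, 2.0, 0.5],      # magnitude spectra of two noisy frames
                  [4.0, 1.5, 1.0]])
noise_est = np.array([1.0, 1.0, 1.0])   # noise spectrum from noise-only frames
enhanced = spectral_subtraction(noisy, noise_est)
```

Clamping at the floor prevents negative magnitudes where the noise estimate exceeds the observed spectrum, a standard precaution in spectral subtraction.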
In this embodiment, the target audio in the audio signal can be enhanced by the speech enhancement processing, making it easier to distinguish the target audio from the noise audio where they overlap in the audio signal, so that the subsequent star spectrogram processing of the audio signal is more accurate and the extracted star spectrogram is more accurate.
In one embodiment, as shown in fig. 4, step 202, acquiring the audio signal, performing voice endpoint detection processing on the audio signal, and identifying the noise audio in the audio signal, includes:
step 402, obtaining an audio signal, performing voice endpoint detection processing on the audio signal based on the voice endpoint detection deep learning model, and recognizing noise audio and target audio in the audio signal, wherein the voice endpoint detection deep learning model is constructed through audio training data.
Here, voice endpoint detection identifies, from a segment of speech signal (clean or noisy), the speech segments, which may be the target audio in the audio signal, and the non-speech segments, which may be the noise audio. The voice endpoint detection deep learning model is a VAD deep learning model obtained by training on collected data of the target audio type and the noise audio type.
Specifically, the server acquires an audio signal that includes a noise audio and a target audio, which need to be identified. Based on the voice endpoint detection deep learning model, voice endpoint detection processing is performed on the audio signal: first the features of the segments in the audio signal are extracted, and then the speech segments and the non-speech segments are identified, finally yielding the target audio and the noise audio. The voice endpoint detection deep learning model is constructed from audio training data.
In this embodiment, the audio signal is acquired and subjected to voice endpoint detection processing based on the voice endpoint detection deep learning model to obtain the noise audio, the voice endpoint detection deep learning model being constructed from audio training data. The noise audio can thus be effectively identified, which facilitates its subsequent removal to obtain a more accurate audio signal. Then, by acquiring the star spectrogram corresponding to the audio signal and removing the corresponding star points of the noise audio in the star spectrogram, an updated star spectrogram is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed according to the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, the influence of noise audio on audio fingerprint extraction can be effectively resisted, and noise robustness is improved.
In one embodiment, as shown in fig. 5, the updated star spectrogram is the star spectrogram corresponding to the target audio; the training process of the voice endpoint detection deep learning model comprises the following steps:
step 502, audio training data with classification labels are obtained, wherein the audio training data comprises target audio training data and noise audio training data.
The audio training data is the audio data input for training the model; the target audio training data is training data of the target audio type, and the noise audio training data is training data of the noise audio type. The target audio training data and the noise audio training data carry different classification labels, which facilitate subsequent model training.
Specifically, the server acquires audio training data carrying classification labels, wherein the audio training data includes target audio training data and noise audio training data. For example, the target audio training data may be the voice of the human agent in human customer-service call audio: a large amount of agent audio free of any other noise is collected, and the corresponding noise audio training data consists of a large amount of common telephone-call noise, such as telephone beeps, ring-back tones, and environmental noise such as horn sounds. The audio training data may be formed by combining the two, where the combination may be, but is not limited to, splicing, overlapping, and the like. The different audio training data are classified and labeled so that each carries its corresponding classification label; the classification labels mark whether each speech frame of each utterance corresponds to agent speech or to noise.
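The splicing-and-overlapping combination above can be sketched as follows. This is an illustrative construction with synthetic stand-in signals: a noise clip is spliced before the target audio (juxtaposed noise) and also scaled and overlaid onto it (overlapping noise), and frame-level labels mark which frames are target audio. The frame length and scale factor are hypothetical.

```python
import numpy as np

FRAME = 160  # hypothetical frame length (10 ms at 16 kHz)

def make_training_example(target, noise, snr_scale=0.3):
    """Splice noise before the target, then overlay scaled noise onto the target part."""
    overlay = noise[:len(target)] * snr_scale
    noisy_target = target + overlay                  # overlapping noise
    audio = np.concatenate([noise, noisy_target])    # juxtaposed noise, then speech
    labels = np.concatenate([np.zeros(len(noise) // FRAME, dtype=int),
                             np.ones(len(target) // FRAME, dtype=int)])
    return audio, labels

target = 0.5 * np.sin(2 * np.pi * 300 * np.arange(1600) / 16000.0)  # stand-in speech
noise = 0.05 * np.random.randn(1600)                                # stand-in noise
audio, labels = make_training_example(target, noise)
```

Pairs of `(audio, labels)` built this way are the kind of labeled data a frame-level VAD model can be trained on.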
And step 504, acquiring an initial voice endpoint detection deep learning model, and training the initial voice endpoint detection deep learning model through a voice endpoint detection algorithm according to the audio training data to obtain the voice endpoint detection deep learning model.
The voice endpoint detection deep learning model is trained based on a voice endpoint detection algorithm. Voice endpoint detection can be understood as detection processing carried out from the start point to the end point of a section of audio when the audio signal is collected. The voice endpoint detection algorithm extracts features by performing endpoint detection on the audio signal, which includes identifying the start point and the end point. The algorithm may be, but is not limited to, a short-time energy algorithm, which computes the energy of a frame of the speech signal, or a zero-crossing rate algorithm, which computes the number of times a frame of the time-domain speech signal crosses zero (the time axis). The voice endpoint detection deep learning model obtained after training can predict labels for unknown data and can effectively identify the target audio type and the noise audio type.
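The two classic features named above, short-time energy and zero-crossing rate, can be computed per frame as in this hedged sketch; the frame length and test tones are illustrative assumptions.

```python
import numpy as np

def frame_features(signal, frame_len=256):
    """Return per-frame short-time energy and zero-crossing rate."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    # zero-crossing rate: fraction of sample-to-sample sign changes in each frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return energy, zcr

fs = 8000
t = np.arange(256) / fs
voiced = 0.8 * np.sin(2 * np.pi * 100 * t)   # loud, low-frequency: high energy, low ZCR
hiss = 0.01 * np.sin(2 * np.pi * 3000 * t)   # quiet, high-frequency: low energy, high ZCR
energy, zcr = frame_features(np.concatenate([voiced, hiss]))
```

High energy with low zero-crossing rate is typical of voiced speech, while low energy with high zero-crossing rate suggests noise or silence, which is why these two features are common inputs for endpoint detection.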
Specifically, in the preprocessing stage, the audio signal is processed based on the voice endpoint detection algorithm to perform feature processing of the audio signal, and an initial voice endpoint detection deep learning model with the relevant features is obtained according to the preprocessing result. Classification is also needed, that is, the label classification described above, which yields target audio training data and noise audio training data carrying different classification labels. The initial voice endpoint detection deep learning model is trained on this labeled data; the training may include augmenting the target audio with added noise audio so that effective recognition is achieved, yielding the voice endpoint detection deep learning model. For example, given a section of input speech, the trained VAD deep learning model recognizes which speech frames are agent speech and which are noise. The audio training data is the collected training data of the target audio type and the noise audio type: the target audio training data is agent audio free of any other noise, and the noise audio training data is a large amount of common telephone-call noise such as telephone beeps, ring-back tones, and environmental noise such as horn sounds.
In this embodiment, audio training data carrying classification labels, including target audio training data and noise audio training data, is acquired; an initial voice endpoint detection deep learning model is acquired and trained through the voice endpoint detection algorithm according to the audio training data, so as to obtain the voice endpoint detection deep learning model. With the obtained model, voice endpoint detection processing is performed on the audio signal to identify the noise audio; the noise audio and the target audio can thus be effectively distinguished, which facilitates the subsequent removal of the noise audio to obtain a more accurate audio signal retaining the target audio. Then, by acquiring the star spectrogram corresponding to the audio signal and removing the corresponding star points of the noise audio, the updated star spectrogram corresponding to the target audio is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed according to the updated star spectrogram, yielding hash data corresponding to the target audio with the noise audio removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, the influence of noise audio on audio fingerprint extraction can be effectively resisted, and noise robustness is improved.
In an embodiment, as shown in fig. 6, in step 204, acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram includes:
step 602, extracting a star spectrogram corresponding to the audio signal based on an audio fingerprint algorithm, wherein star points on the star spectrogram are distributed along a time axis.
The star spectrogram is composed of star points. The features in the audio signal are extracted based on the audio fingerprint algorithm to obtain the star spectrogram. Specifically, the server obtains a spectrogram by performing a Fourier transform on the audio signal, then selects noise-robust points from the signal of each frame; this series of noise-robust points forms the star spectrogram, in which the noise-robust points are the star points, the horizontal axis represents time, and the vertical axis represents the frequency of the audio signal. There are many ways to select the points: for example, formant peaks may be chosen, or the frequency axis may be divided into several intervals and the point with the largest energy selected in each interval.
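The band-wise peak selection just described can be sketched as follows: split each frame's magnitude spectrum into frequency intervals and keep the highest-energy bin in each interval as a star point. The number of bands and the random toy spectrogram are illustrative assumptions.

```python
import numpy as np

def pick_star_points(spectrogram, n_bands=4):
    """Return a binary star map: one peak per frequency band per frame."""
    n_freq, n_frames = spectrogram.shape
    star_map = np.zeros_like(spectrogram, dtype=int)
    edges = np.linspace(0, n_freq, n_bands + 1, dtype=int)  # band boundaries
    for frame in range(n_frames):
        for lo, hi in zip(edges[:-1], edges[1:]):
            peak = lo + np.argmax(spectrogram[lo:hi, frame])  # strongest bin in band
            star_map[peak, frame] = 1
    return star_map

spec = np.abs(np.random.randn(16, 5))  # toy magnitude spectrogram, 16 bins x 5 frames
stars = pick_star_points(spec)
```

Selecting one peak per band keeps the star points spread across the frequency axis rather than clustered where energy happens to be highest, which contributes to the noise robustness of the feature.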
And step 604, detecting the star points at each moment in the star spectrogram along a time axis.
The horizontal axis of the star spectrogram represents time, i.e., the time axis, along which the star points are distributed; the vertical axis represents the frequency of the audio signal. Each moment may correspond to one or more star points. Specifically, the server detects the star points at each moment in the star spectrogram one by one along the time axis.
Step 606, when the audio signal corresponding to the star points is detected to be noise audio, the star points corresponding to the noise audio are removed, and the updated star spectrogram is obtained.
Specifically, according to the identification result of the VAD processing, all star points corresponding to segments identified as noise audio are set to zero and removed from the star spectrogram, and only the star points identified as target audio are retained; when the detection is complete, the updated star spectrogram containing only the target audio is obtained.
In this embodiment, the star spectrogram corresponding to the audio signal is acquired, and the corresponding star points of the noise audio in the star spectrogram are removed, so that the updated star spectrogram is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed according to the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, the influence of noise audio on audio fingerprint extraction can be effectively resisted, and noise robustness is improved.
In one embodiment, as shown in FIG. 7, the audio feature hash data comprises an audio feature hash table; step 206, obtaining the audio characteristic hash data according to the updated star spectrogram comprises:
step 702, forming a hash key according to any two star points in the updated star spectrogram corresponding to the target audio.
Specifically, any two star points in the updated star spectrogram are combined to form a Key, i.e., a hash Key, and the hash key is used to construct a hash table.
Step 704, an audio feature hash table is constructed according to the hash key and the hash value corresponding to the hash key.
Each hash Key corresponds to a hash Value, and the hash value is obtained from the frequencies of the two star points forming the hash key and their time deviation on the time axis. Specifically, an audio characteristic hash table composed of Key-Value pairs is constructed from each hash key and its corresponding hash value; the audio characteristic hash data may be this hash table. Step 208, obtaining the audio fingerprint according to the audio characteristic hash data, includes: step 706, obtaining the audio fingerprint according to the audio characteristic hash table.
In this embodiment, a hash key is formed from any two star points in the updated star spectrogram corresponding to the target audio, and an audio characteristic hash table is constructed from each hash key and its corresponding hash value. This yields the audio characteristic hash data corresponding to the target audio after the noise audio has been removed. From this hash data, an accurate audio fingerprint of the audio signal can be obtained, the influence of the noise audio on audio fingerprint extraction can be effectively resisted, and noise robustness is improved.
In one embodiment, as shown in fig. 8, in step 704, constructing an audio feature hash table according to the hash key and the hash value corresponding to the hash key includes:
step 802, determining a hash value corresponding to the hash key according to the time deviation of any two star points on the time axis.
The time deviation of the two star points on the time axis is the time deviation between the frequency points corresponding to the two star points. Specifically, the hash value corresponding to the hash key is composed of the frequencies of the two star points (f1 and f2) and their time deviation on the time axis (Δt); f1 and f2 can each be represented by m-bit quantization (e.g., 10 bits) and Δt by n-bit quantization (e.g., 12 bits), so the hash value can be represented with 2×m+n bits (32 bits in this example).
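The 2×m+n-bit representation described here (m = 10 bits for each frequency and n = 12 bits for Δt, giving 32 bits) can be sketched as a bit-packing routine; the field order is our assumption, since the embodiment specifies only the bit widths:

```python
def pack_hash(f1, f2, dt, m=10, n=12):
    """Quantize f1 and f2 to m bits each and dt to n bits, then pack them
    into one (2*m + n)-bit integer: 32 bits for m=10, n=12."""
    f1 &= (1 << m) - 1
    f2 &= (1 << m) - 1
    dt &= (1 << n) - 1
    return (f1 << (m + n)) | (f2 << n) | dt

h = pack_hash(300, 512, 100)  # fits in 32 bits
```

Because the three fields occupy disjoint bit ranges, the frequencies and time deviation can be recovered from the packed integer with shifts and masks.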
Step 804, obtaining a hash pair according to the hash key and the hash value corresponding to the hash key.
One hash pair consists of a hash Key and its corresponding hash Value and can be written as Key-Value. Specifically, a hash pair is obtained from the hash key and the hash value corresponding to the hash key; for example, the hash key may be represented by a 32-bit integer. The hash value corresponding to each hash key is the time deviation, within the current audio signal, of the frequency points corresponding to the two star points.
Step 806, constructing an audio characteristic hash table according to the hash pairs.
The audio characteristic hash data can be an audio characteristic hash table, i.e., a hash table representing audio characteristics, composed of the Key-Value hash pairs corresponding to those characteristics. Specifically, a hash key and a hash value are formed from any two star points on the updated star spectrogram corresponding to the target audio in the audio signal, so as to obtain all the hash pairs, and the audio characteristic hash table is constructed from all the hash pairs corresponding to the target audio in the audio signal.
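Combining the pairing of star points with the packed keys, the full audio characteristic hash table might be built as follows (a sketch under the same assumed 10/10/12-bit layout; storing the anchor frame as the value is one common design choice, not something the embodiment mandates):

```python
from collections import defaultdict
from itertools import combinations

def build_hash_table(stars, max_dt=4095):
    """Pair up star points, key each pair by (f1, f2, dt), and store the
    anchor frame index as the value in the audio characteristic hash table."""
    table = defaultdict(list)
    for (t1, f1), (t2, f2) in combinations(sorted(stars), 2):
        dt = t2 - t1
        if 0 < dt <= max_dt:
            key = (f1 << 22) | (f2 << 12) | dt  # 10/10/12-bit packing
            table[key].append(t1)
    return table

table = build_hash_table([(0, 28), (10, 30), (40, 55)])
```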
In this embodiment, the hash value corresponding to the hash key is determined according to the time deviation of any two star points on the time axis, a hash pair is obtained from the hash key and its corresponding hash value, and the audio characteristic hash data is constructed from the hash pairs. This yields the audio characteristic hash data corresponding to the target audio after the noise audio has been removed. From this hash data, an accurate audio fingerprint of the audio signal can be obtained, the influence of the noise audio on audio fingerprint extraction can be effectively resisted, and noise robustness is improved.
As an application example, the present application further provides an application scenario to which the above audio fingerprint extraction method is applied. Specifically, the audio fingerprint extraction method is applied in this scenario as follows:
the method comprises the steps that a server acquires audio training data which are acquired by a terminal and carry classification labels, wherein the audio training data comprise target audio training data and noise audio training data; and then the server acquires an initial VAD deep learning model, and trains the initial VAD deep learning model through a VAD algorithm according to the audio training data to obtain the VAD deep learning model.
And the server acquires the audio signal uploaded by the terminal, and performs VAD processing on the audio signal based on the obtained VAD deep learning model to obtain noise audio and target audio. And then carrying out speech enhancement processing based on spectral subtraction or speech enhancement processing based on a deep learning algorithm on the target audio.
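As one concrete illustration of the spectral-subtraction option mentioned above (a textbook magnitude-subtraction sketch, not the embodiment's exact enhancement procedure, and it assumes a noise-only excerpt is available for estimating the noise spectrum):

```python
import numpy as np

def spectral_subtraction(noisy, noise_excerpt, frame_len=512, hop=256):
    """Subtract an average noise magnitude spectrum from each frame's
    magnitude, keep the noisy phase, and overlap-add the result."""
    window = np.hanning(frame_len)
    # Estimate the noise magnitude spectrum from a noise-only excerpt.
    noise_mags = [np.abs(np.fft.rfft(noise_excerpt[i:i + frame_len] * window))
                  for i in range(0, len(noise_excerpt) - frame_len + 1, hop)]
    noise_mag = np.mean(noise_mags, axis=0)
    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len + 1, hop):
        spec = np.fft.rfft(noisy[i:i + frame_len] * window)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor at zero
        out[i:i + frame_len] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
    return out
```

Flooring the subtracted magnitudes at zero is the simplest choice; practical implementations usually use a small positive spectral floor to reduce musical-noise artifacts.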
The server extracts a star spectrogram corresponding to the audio signal based on an audio fingerprint algorithm: the audio signal is Fourier-transformed to obtain the corresponding spectrogram, noise-robust points are then selected in the spectrogram based on the signal of each frame, and the star spectrogram is composed of this series of noise-robust points. The noise-robust points are the star points of the star spectrogram; the horizontal axis represents time, i.e., the time axis, and the vertical axis represents the frequency of the audio signal, so the star points are distributed along the time axis. The server detects the star points at each moment in the star spectrogram along the time axis, and when the audio signal corresponding to a star point is detected to be noise audio, that star point is removed, so as to obtain an updated star spectrogram corresponding to the target audio.
The server forms a hash Key according to any two star points in the updated star spectrogram corresponding to the target audio; then, determining a hash Value corresponding to the hash key according to the time deviation of any two star points on a time axis; obtaining a hash pair Key-Value according to the hash Key and the hash Value corresponding to the hash Key; and constructing an audio characteristic hash table according to the Key-Value of the hash pair.
And the server obtains the audio fingerprint according to the constructed audio characteristic hash table. Specifically, the audio fingerprint refers to a unique digital feature in an audio signal embodied in the form of an identifier for matching and comparing audio, for example, for identifying a huge amount of sound samples or tracking and locating the positions of the samples in a database. An audio fingerprint is essentially an identifier associated with an audio signal. And the server obtains the audio fingerprint corresponding to the target audio according to the constructed audio characteristic hash table.
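As a toy illustration of the matching use mentioned above (the counting scheme below is an assumption for illustration, not the embodiment's matching method), two audio characteristic hash tables can be compared by their shared keys:

```python
def match_score(query_table, reference_table):
    """Count hash keys shared between a query fingerprint and a stored
    reference fingerprint; a larger overlap suggests the same audio."""
    return len(set(query_table) & set(reference_table))

ref = {101: [0], 202: [5], 303: [9]}     # hypothetical reference hash table
query = {101: [2], 303: [11], 404: [7]}  # hypothetical query hash table
score = match_score(query, ref)          # keys 101 and 303 are shared
```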
In this embodiment, by acquiring the audio signal and performing voice endpoint detection processing on it, the noise audio in the audio signal is identified. It should be noted that, in application scenarios requiring optimization, the audio signal carries superimposed noise audio, which includes both noise that overlaps the target audio and noise that lies alongside the target audio; both kinds can be effectively identified through voice endpoint detection processing, which makes it convenient to subsequently remove the noise audio and obtain a more accurate audio signal. A star spectrogram corresponding to the audio signal is obtained, whose star points are selected noise-robust points, and the star points corresponding to the noise audio are removed to obtain an updated star spectrogram; this effectively removes the noise audio, improves noise robustness, and yields an accurate star spectrogram. Audio characteristic hash data is then constructed from the updated star spectrogram, giving hash data from which the noise audio has been removed. From this audio characteristic hash data, an accurate audio fingerprint of the audio signal can be obtained, the influence of noise audio on audio fingerprint extraction can be effectively resisted, and noise robustness is improved.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least part of the steps in each flowchart may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; nor is their order of execution necessarily sequential, as they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided an audio fingerprint extraction apparatus including: an audio signal acquisition module 902, a noise audio clean-up module 904, an audio data processing module 906, and an audio fingerprint generation module 908, wherein:
an audio signal obtaining module 902, configured to obtain an audio signal, perform voice endpoint detection processing on the audio signal, and identify a noise audio in the audio signal;
the noise audio clearing module 904 is configured to obtain a star spectrogram corresponding to the audio signal, and clear corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
the audio data processing module 906 is configured to obtain audio feature hash data according to the updated star spectrogram;
and an audio fingerprint generating module 908, configured to obtain an audio fingerprint according to the audio feature hash data.
In one embodiment, the audio signal acquiring module is further configured to acquire an audio signal, perform voice endpoint detection processing on the audio signal, and identify a noise audio and a target audio in the audio signal.
In one embodiment, the audio fingerprint extraction device further includes a speech enhancement processing module, and the speech enhancement processing module is further configured to perform speech enhancement processing on the target audio, where the speech enhancement processing includes any one of speech enhancement processing based on spectral subtraction and speech enhancement processing based on a deep learning algorithm.
In one embodiment, the audio signal acquisition module is further configured to acquire an audio signal, perform speech endpoint detection processing on the audio signal based on the speech endpoint detection deep learning model, and identify a noise audio and a target audio in the audio signal, where the speech endpoint detection deep learning model is constructed by audio training data.
In one embodiment, the audio fingerprint extraction device comprises a model training module, a classification module and a comparison module, wherein the model training module is used for acquiring audio training data carrying classification labels, and the audio training data comprises target audio training data and noise audio training data; and acquiring an initial voice endpoint detection deep learning model, and training the initial voice endpoint detection deep learning model through a voice endpoint detection algorithm according to the audio training data to obtain the voice endpoint detection deep learning model.
In one embodiment, the noise audio clearing module is further configured to extract a star spectrogram corresponding to the audio signal based on an audio fingerprint algorithm, wherein star points on the star spectrogram are distributed along a time axis; detecting the star points at each moment in the star spectrogram along a time axis; and when the audio signals corresponding to the star points are detected to be noise audio, clearing the star points corresponding to the noise audio to obtain an updated star spectrogram.
In one embodiment, the audio data processing module is further configured to form a hash key according to any two star points in the updated star spectrogram corresponding to the target audio; and constructing an audio characteristic hash table according to the hash key and the hash value corresponding to the hash key.
In one embodiment, the audio data processing module is further configured to determine a hash value corresponding to the hash key according to a time deviation of any two stars on the time axis; obtaining a hash pair according to the hash key and the hash value corresponding to the hash key; and constructing an audio characteristic hash table according to the hash pairs.
For the specific definition of the audio fingerprint extraction device, reference may be made to the above definition of the audio fingerprint extraction method, which is not repeated here. Each module in the audio fingerprint extraction device can be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store audio training data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an audio fingerprint extraction method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of part of the structure related to the disclosed solution and does not limit the computer devices to which the disclosed solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.