
CN113113051A - Audio fingerprint extraction method and device, computer equipment and storage medium - Google Patents

Audio fingerprint extraction method and device, computer equipment and storage medium

Info

Publication number
CN113113051A
CN113113051A
Authority
CN
China
Prior art keywords: audio, star, spectrogram, noise, hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110260352.0A
Other languages
Chinese (zh)
Inventor
黄润乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Voiceai Technologies Co ltd
Original Assignee
Voiceai Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Voiceai Technologies Co ltd filed Critical Voiceai Technologies Co ltd
Priority to CN202110260352.0A
Publication of CN113113051A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61: Indexing; Data structures therefor; Storage structures
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract


The present application relates to an audio fingerprint extraction method and device, computer equipment, and a storage medium. The method includes: acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal; acquiring a star spectrogram corresponding to the audio signal, and clearing the star points corresponding to the noise audio in the star spectrogram to obtain an updated star spectrogram; obtaining audio feature hash data according to the updated star spectrogram; and obtaining an audio fingerprint according to the audio feature hash data. The method effectively resists the influence of noise audio on audio fingerprint extraction and improves noise robustness.


Description

Audio fingerprint extraction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of audio fingerprint technology, and in particular, to an audio fingerprint extraction method, apparatus, computer device, and storage medium.
Background
With the development of audio fingerprint technology, the technology is widely used in fields such as music identification and copyright content monitoring. Audio fingerprint technology extracts unique characteristic information from an audio signal for the identification, search, and localization of massive collections of sound samples; it is a technology capable of automatically identifying audio content.
Audio fingerprint technology extracts noise-robust features from audio signals. Traditional audio fingerprint technology offers some noise robustness, but only for the case in which weak noise audio is superimposed on the target audio. An interfering noise signal may contain not only noise superimposed on the target audio but also noise juxtaposed with it in time. When the noise audio is strong and a large proportion of it is not superimposed on the target audio, that is, when the noise audio is juxtaposed with the target audio, traditional audio fingerprint technology cannot guarantee noise robustness, and the noise audio greatly affects the result of audio fingerprint identification.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an audio fingerprint extraction method, apparatus, computer device and storage medium capable of improving noise robustness.
A method of audio fingerprint extraction, the method comprising:
acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal;
acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
obtaining audio characteristic hash data according to the updated star spectrogram;
and obtaining the audio fingerprint according to the audio characteristic hash data.
In one embodiment, the audio signal includes a target audio and a noise audio; acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal comprises:
the method comprises the steps of obtaining an audio signal, carrying out voice endpoint detection processing on the audio signal based on a voice endpoint detection deep learning model, identifying noise audio and target audio in the audio signal, and constructing the voice endpoint detection deep learning model through audio training data.
In one embodiment, the star spectrogram is updated to be the star spectrogram corresponding to the target audio; the training process of the voice endpoint detection deep learning model comprises the following steps:
acquiring audio training data with classification labels, wherein the audio training data comprises target audio training data and noise audio training data;
and acquiring an initial voice endpoint detection deep learning model, and training the initial voice endpoint detection deep learning model through a voice endpoint detection algorithm according to the audio training data to obtain the voice endpoint detection deep learning model.
In one embodiment, the obtaining a star spectrogram corresponding to an audio signal, removing corresponding star points of a noise audio in the star spectrogram, and before obtaining an updated star spectrogram, further includes:
and performing voice enhancement processing on the target audio, wherein the voice enhancement processing comprises any one of voice enhancement processing based on spectral subtraction and voice enhancement processing based on a deep learning algorithm.
In one embodiment, the obtaining a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram includes:
extracting a star spectrogram corresponding to the audio signal based on an audio fingerprint algorithm, wherein star points on the star spectrogram are distributed along a time axis;
detecting the star points at each moment in the star spectrogram along a time axis;
and when the audio signals corresponding to the star points are detected to be noise audio, clearing the star points corresponding to the noise audio to obtain an updated star spectrogram.
In one embodiment, the audio feature hash data comprises an audio feature hash table; obtaining the audio characteristic hash data according to the updated star spectrogram comprises:
forming a hash key according to any two star points in the updated star spectrogram corresponding to the target audio;
and constructing an audio characteristic hash table according to the hash key and the hash value corresponding to the hash key.
In one embodiment, constructing the audio feature hash table according to the hash key and the hash value corresponding to the hash key includes:
determining a hash value corresponding to the hash key according to the time deviation of any two star points on a time axis;
obtaining a hash pair according to the hash key and the hash value corresponding to the hash key;
and constructing an audio characteristic hash table according to the hash pairs.
An audio fingerprint extraction device, the device comprising:
the audio signal acquisition module is used for acquiring an audio signal, performing voice endpoint detection processing on the audio signal and identifying noise audio in the audio signal;
the noise audio clearing module is used for acquiring a star spectrogram corresponding to the audio signal, and clearing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
the audio data processing module is used for obtaining audio characteristic hash data according to the updated star spectrogram;
and the audio fingerprint generation module is used for obtaining the audio fingerprint according to the audio characteristic hash data.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal;
acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
obtaining audio characteristic hash data according to the updated star spectrogram;
and obtaining the audio fingerprint according to the audio characteristic hash data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal;
acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
obtaining audio characteristic hash data according to the updated star spectrogram;
and obtaining the audio fingerprint according to the audio characteristic hash data.
According to the above audio fingerprint extraction method and device, computer equipment, and storage medium, an audio signal is acquired and voice endpoint detection processing is performed on it to identify the noise audio in the audio signal; the noise audio can thus be effectively identified, which facilitates its subsequent removal to obtain a more accurate audio signal. By acquiring the star spectrogram corresponding to the audio signal and clearing the star points corresponding to the noise audio in the star spectrogram, a corresponding updated star spectrogram is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed from the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, effectively resisting the influence of noise audio on audio fingerprint extraction and improving noise robustness.
Drawings
FIG. 1 is a diagram of an application environment for an audio fingerprint extraction method in one embodiment;
FIG. 2 is a flowchart illustrating an audio fingerprint extraction method according to an embodiment;
FIG. 3 is a flowchart illustrating an audio fingerprint extraction method according to another embodiment;
FIG. 4 is a flowchart illustrating an audio fingerprint extraction method according to another embodiment;
FIG. 5 is a flowchart illustrating steps of a method for audio fingerprint extraction to build a deep learning model for voice endpoint detection in one embodiment;
FIG. 6 is a flowchart illustrating an audio fingerprint extraction method according to yet another embodiment;
FIG. 7 is a flowchart illustrating an audio fingerprint extraction method according to another embodiment;
FIG. 8 is a flowchart illustrating an audio fingerprint extraction method according to another embodiment;
FIG. 9 is a block diagram of an audio fingerprint extraction device according to an embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The audio fingerprint extraction method provided by the present application can be applied in the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The server 104 acquires the audio signal sent by the terminal 102, performs voice endpoint detection processing on the audio signal, and identifies the noise audio in the audio signal; acquires the star spectrogram corresponding to the audio signal, and clears the star points corresponding to the noise audio in the star spectrogram to obtain an updated star spectrogram; obtains audio feature hash data according to the updated star spectrogram; and obtains the audio fingerprint according to the audio feature hash data. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in fig. 2, an audio fingerprint extraction method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying a noise audio in the audio signal.
An audio signal is data carrying the frequency and amplitude variation information of regular sound waves such as speech, music, and sound effects. Voice endpoint detection processing is performed with VAD (Voice Activity Detection) technology, hereinafter referred to as VAD processing. The acquired audio signal includes the desired target audio and unrelated noise audio; the noise audio is an interfering signal in the audio signal that disturbs the desired target audio. The interfering noise audio includes noise audio that overlaps the target audio and noise audio that is juxtaposed with the target audio, among others. VAD processing can identify not only overlapping noise but also juxtaposed noise.
Specifically, when the server acquires the audio signal, it performs voice endpoint detection processing on the audio signal based on a voice endpoint detection algorithm. From the audio after this processing, the interfering and overlapping noise audio in the audio signal can be identified, and the part of the audio signal that is not noise audio corresponds to the desired target audio.
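As a concrete illustration of this step, the following is a minimal frame-level VAD sketch in Python; it uses a simple energy threshold rather than the signal-processing or deep-learning VAD the description contemplates, and the frame length and threshold ratio are assumptions chosen for illustration:

```python
import numpy as np

def simple_vad(signal, sample_rate, frame_ms=25, energy_ratio=0.1):
    """Flag each frame as speech/target (True) or noise (False).

    A minimal energy-threshold VAD sketch; frame_ms and energy_ratio
    are illustrative assumptions, not values from the description.
    `signal` is a 1-D NumPy array of samples.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    threshold = energy_ratio * energy.max()  # assumed global threshold
    return energy > threshold
```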
And step 204, acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram.
A star spectrogram (constellation map) is a noise-robust feature extracted from an audio signal. Noise robustness refers to the ability to withstand the effects of noise. The star spectrogram consists of star points distributed across it.
Specifically, the server acquires the star spectrogram corresponding to the audio signal. The star spectrogram is formed by applying a Fourier transform to the audio signal to obtain its spectrogram and then selecting noise-robust points from the signal of each frame in the spectrogram. The selected noise-robust points are the star points of the star spectrogram, whose horizontal axis represents time and whose vertical axis represents frequency. There are many ways to select the points, such as taking formant peaks, or dividing the frequency axis into several intervals and selecting the highest-energy point in each interval. The server keeps only the star points identified as target audio by zeroing out all star points corresponding to segments identified as noise audio, thereby clearing the star points of the noise audio from the star spectrogram. At this point, only the star points corresponding to the target audio remain, and the updated star spectrogram corresponding to the target audio is obtained. For example, the star spectrogram may be searched along the time axis: if the VAD result at the current time is noise audio, all star points at that time are deleted, and finally the updated star spectrogram corresponding to the target audio is obtained.
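A sketch of the star-point selection described above, using the highest-energy-point-per-band variant; the STFT parameters and the number of frequency bands are assumptions for illustration:

```python
import numpy as np
from scipy.signal import stft

def constellation(signal, fs, n_bands=6, nfft=512, hop=256):
    """Build a boolean 'star spectrogram': in each frame, keep the
    highest-energy bin of each of n_bands frequency bands.

    Band layout and STFT parameters are illustrative assumptions.
    Returns an array of shape (freq_bins, time_frames).
    """
    _, _, Z = stft(signal, fs, nperseg=nfft, noverlap=nfft - hop)
    mag = np.abs(Z)
    bands = np.array_split(np.arange(mag.shape[0]), n_bands)
    stars = np.zeros(mag.shape, dtype=bool)
    for t in range(mag.shape[1]):
        for band in bands:
            peak = band[np.argmax(mag[band, t])]
            stars[peak, t] = True  # one star point per band per frame
    return stars
```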
And step 206, obtaining the audio characteristic hash data according to the updated star spectrogram.
The updated star spectrogram contains only the star points of the target audio. The audio feature hash data is obtained based on an audio fingerprint algorithm; it is the hash data corresponding to the audio signal of the target audio and is used to represent the audio fingerprint.
Specifically, the server obtains the audio feature hash data according to the star points in the updated star spectrogram, for example, may obtain a hash key and a hash value by updating any two star points in the star spectrogram, and form a hash table according to the hash key and the hash value.
And step 208, obtaining the audio fingerprint according to the audio characteristic hash data.
The hash data may be a hash table. An audio fingerprint refers to a unique digital feature of an audio signal, embodied in the form of an identifier, used for matching and comparing audio, for example to identify massive numbers of sound samples or to track and locate samples in a database. An audio fingerprint is essentially an identifier associated with an audio signal. Taking the hash table as an example, the server obtains the audio fingerprint corresponding to the target audio from the constructed audio feature hash table.
In the audio fingerprint extraction method, the audio signal is acquired and voice endpoint detection processing is performed on it to identify the noise audio in the audio signal; the noise audio can thus be effectively identified, which facilitates its subsequent removal to obtain a more accurate audio signal. By acquiring the star spectrogram corresponding to the audio signal and clearing the star points corresponding to the noise audio, a corresponding updated star spectrogram is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed from the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, effectively resisting the influence of noise audio on audio fingerprint extraction and improving noise robustness.
In one embodiment, as shown in fig. 3, the audio signal includes a target audio and a noise audio, step 202, acquiring the audio signal, and performing a speech endpoint detection process on the audio signal, where identifying the noise audio in the audio signal includes:
step 302, acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying a noise audio and a target audio in the audio signal.
Voice endpoint detection processing is performed with VAD (Voice Activity Detection) technology, hereinafter VAD processing. VAD processing may be performed with a signal-processing-based VAD algorithm, or training data of the target audio type and the noise audio type may be collected to train a VAD deep learning model for VAD processing; audio processed by a trained VAD deep learning model is more accurate and more noise robust. VAD processing followed by subsequent processing based on the audio fingerprint algorithm makes the audio more noise robust. Specifically, the server acquires an audio signal that includes target audio and noise audio, performs VAD processing on it, and thereby identifies the noise audio and the target audio in the audio signal.
By acquiring the audio signal and performing voice endpoint detection processing on it, the noise audio and the target audio in the audio signal can be accurately distinguished and identified, so that a more accurate audio signal is obtained after the noise audio is subsequently removed. By acquiring the star spectrogram corresponding to the audio signal and clearing the star points corresponding to the noise audio, a corresponding updated star spectrogram is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed from the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, effectively resisting the influence of noise audio on audio fingerprint extraction and improving noise robustness.
Step 204, obtaining a star spectrogram corresponding to the audio signal, removing corresponding star points of the noise audio in the star spectrogram, and before obtaining an updated star spectrogram, further comprising:
and step 304, performing speech enhancement processing on the target audio, wherein the speech enhancement processing comprises any one of speech enhancement processing based on spectral subtraction and speech enhancement processing based on a deep learning algorithm.
Here, speech enhancement processing refers to enhancing the desired target audio in the audio signal. Specifically, after VAD processing identifies the noise audio and the target audio in the audio signal, the target audio is enhanced by speech enhancement processing, which may be either speech enhancement based on spectral subtraction or speech enhancement based on a deep learning algorithm. Spectral subtraction subtracts the spectrum of the noise signal from the spectrum of the noisy signal; spectral-subtraction-based speech enhancement may subtract the estimated power spectrum of the noise audio from the noisy speech. The noise signal is the signal corresponding to the noise audio of the audio signal. Deep-learning-based speech enhancement may train a speech enhancement deep learning network and feed the noisy speech into it to obtain enhanced speech output. For example, a section of audio signal may contain both speech with noise superimposed on it and segments that are entirely noise; where noise audio is superimposed on the target audio, speech enhancement of the target audio is performed by spectral subtraction or a deep learning algorithm. Further, the speech enhancement processing is not limited to spectral subtraction or deep learning and may also be based on other speech enhancement algorithms.
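For the spectral-subtraction variant, a minimal sketch operating on an STFT of the noisy signal; the spectral floor and the way the noise spectrum is estimated are assumptions of this sketch:

```python
import numpy as np

def spectral_subtraction(noisy_spec, noise_mag, floor=0.01):
    """Subtract an estimated noise power spectrum from the noisy power
    spectrum, keeping the noisy phase.

    noisy_spec: complex STFT of the noisy signal (freq_bins, frames).
    noise_mag: estimated noise magnitude spectrum, e.g. averaged over
    frames the VAD marked as noise-only (an assumption of this sketch).
    """
    power = np.abs(noisy_spec) ** 2 - noise_mag[:, None] ** 2
    power = np.maximum(power, floor * noise_mag[:, None] ** 2)  # spectral floor
    return np.sqrt(power) * np.exp(1j * np.angle(noisy_spec))
```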
In this embodiment, the target audio in the audio signal can be enhanced by performing the speech enhancement processing on the target audio, so that the target audio and the noise audio which have the superposition condition in the audio signal can be more favorably distinguished, the audio signal can be more accurately processed in the subsequent star spectrogram, and the extracted star spectrogram is more accurate.
In one embodiment, as shown in fig. 4, step 202, acquiring an audio signal, and performing a speech endpoint detection process on the audio signal, wherein identifying noise audio in the audio signal includes:
step 402, obtaining an audio signal, performing voice endpoint detection processing on the audio signal based on the voice endpoint detection deep learning model, and recognizing noise audio and target audio in the audio signal, wherein the voice endpoint detection deep learning model is constructed through audio training data.
Voice endpoint detection identifies, within a segment of (clean or noisy) speech signal, the speech segments, which may be the target audio of the audio signal, and the non-speech segments, which may be noise audio. The voice endpoint detection deep learning model is a VAD deep learning model obtained by training on collected data of the target audio type and the noise audio type.
Specifically, the server acquires an audio signal that includes noise audio and target audio, which need to be identified. Voice endpoint detection processing is performed on the audio signal based on the voice endpoint detection deep learning model: first the features of the segments in the audio signal are extracted, then the speech segments and non-speech segments are identified, and finally the target audio and the noise audio are obtained. The voice endpoint detection deep learning model is constructed from audio training data.
In this embodiment, the audio signal is acquired and voice endpoint detection processing is performed on it based on the voice endpoint detection deep learning model, which is constructed from audio training data, to obtain the noise audio. The noise audio can thus be effectively identified, which facilitates its subsequent removal to obtain a more accurate audio signal. Then, by acquiring the star spectrogram corresponding to the audio signal and clearing the star points corresponding to the noise audio, a corresponding updated star spectrogram is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed from the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, effectively resisting the influence of noise audio on audio fingerprint extraction and improving noise robustness.
In one embodiment, as shown in fig. 5, the star spectrogram is updated to be the star spectrogram corresponding to the target audio; the training process of the voice endpoint detection deep learning model comprises the following steps:
step 502, audio training data with classification labels are obtained, wherein the audio training data comprises target audio training data and noise audio training data.
The audio training data is used for inputting audio data for training the model, the target audio training data is training data of a target audio type, and the noise audio training data is training data of a noise audio type. The target audio training data and the noise audio training data carry different classification labels, and the classification labels are beneficial to subsequent model training.
Specifically, the server acquires audio training data carrying classification labels, including target audio training data and noise audio training data. For example, the target-audio-type training data may be the voice of human customer-service agents in customer-service telephone recordings; the audio training data may be formed by collecting a large amount of human customer-service audio free of any other noise and combining it with corresponding noise-audio-type training data consisting of common telephone-call noise such as telephone beeps, ring-back tones, and environmental noise such as car horns. The combination may be, but is not limited to, splicing, overlaying, and so on, as sketched below. Different audio training data are classified and labeled so that each carries its corresponding classification label; the labels mark whether each speech frame of each utterance corresponds to human customer-service speech or to noise.
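A sketch of how such labeled training examples might be assembled from clean target audio and collected noise; the overlay and splice modes mirror the superimposed and juxtaposed noise cases, and the SNR scaling is an assumption:

```python
import numpy as np

def make_training_example(clean, noise, mode="overlay", snr_db=5.0):
    """Combine clean target audio with noise and emit per-sample labels
    (1 = target speech present, 0 = noise only).

    'overlay' superimposes noise on the target at an assumed SNR;
    'splice' appends a noise-only segment (noise juxtaposed in time).
    """
    if mode == "overlay":
        noise = noise[:len(clean)]  # assumes noise is at least as long
        gain = np.sqrt((clean ** 2).mean()
                       / ((noise ** 2).mean() * 10 ** (snr_db / 10)))
        return clean + gain * noise, np.ones(len(clean), dtype=np.int8)
    # 'splice': the noise segment follows the target audio
    mixed = np.concatenate([clean, noise])
    labels = np.concatenate([np.ones(len(clean), dtype=np.int8),
                             np.zeros(len(noise), dtype=np.int8)])
    return mixed, labels
```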
And step 504, acquiring an initial voice endpoint detection deep learning model, and training the initial voice endpoint detection deep learning model through a voice endpoint detection algorithm according to the audio training data to obtain the voice endpoint detection deep learning model.
The voice endpoint detection deep learning model is trained based on a voice endpoint detection algorithm. Voice endpoint detection can be understood as detection processing carried out from the beginning to the end of a section of audio as the audio signal is collected. A voice endpoint detection algorithm performs endpoint detection on the audio signal to extract features, where endpoint detection also includes identifying the starting and ending points. The algorithm may be, but is not limited to, a short-time energy algorithm, which computes the energy of a frame of the speech signal, or a zero-crossing rate algorithm, which computes the number of times a frame of the time-domain speech signal crosses zero (the time axis). The trained voice endpoint detection deep learning model can predict labels for unknown data and effectively identify the target audio type and the noise audio type.
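The two per-frame features named above are simple to compute; a brief sketch, assuming `frame` is a 1-D NumPy array of samples:

```python
import numpy as np

def short_time_energy(frame):
    """Energy of one speech frame: sum of squared samples."""
    return float(np.sum(frame.astype(np.float64) ** 2))

def zero_crossing_rate(frame):
    """Number of times the time-domain signal crosses zero in the frame."""
    signs = np.signbit(frame).astype(np.int8)
    return int(np.abs(np.diff(signs)).sum())
```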
Specifically, in the preprocessing stage, the audio signal is processed based on a voice endpoint detection algorithm to provide feature processing of the audio signal, and an initial voice endpoint detection deep learning model with the relevant features is obtained from the preprocessing result. Classification is also needed, that is, the label classification described above, yielding target audio training data and noise audio training data carrying different classification labels. The initial voice endpoint detection deep learning model is trained on these labeled data; the training may add noise audio to the target audio so that effective recognition is achieved through this training, producing the voice endpoint detection deep learning model. For example, a section of input speech is recognized by the trained VAD deep learning model to determine which speech frames are human customer-service sounds and which are noise. The audio training data consist of the collected target-audio-type training data, namely human customer-service audio free of any other noise, and noise-audio-type training data, namely a large amount of common telephone-call noise such as telephone beeps, ring-back tones, and environmental noise such as car horns.
In this embodiment, audio training data carrying classification labels, including target audio training data and noise audio training data, are acquired; an initial voice endpoint detection deep learning model is acquired and trained with a voice endpoint detection algorithm according to the audio training data to obtain the voice endpoint detection deep learning model. With the obtained model, voice endpoint detection processing is performed on the audio signal to obtain the noise audio, so that the noise audio and the target audio can be effectively identified and the noise audio subsequently removed to obtain a more accurate audio signal retaining the target audio. Then, by acquiring the star spectrogram corresponding to the audio signal and clearing the star points corresponding to the noise audio, the updated star spectrogram corresponding to the target audio is obtained; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed from the updated star spectrogram, yielding the hash data corresponding to the target audio after the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, effectively resisting the influence of noise audio on audio fingerprint extraction and improving noise robustness.
In an embodiment, as shown in fig. 6, in step 204, acquiring a star spectrogram corresponding to the audio signal, and removing corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram includes:
step 602, extracting a star spectrogram corresponding to the audio signal based on an audio fingerprint algorithm, wherein star points on the star spectrogram are distributed along a time axis.
The star spectrogram corresponds to, and consists of, star points. The features in the audio signal are extracted based on an audio fingerprint algorithm to obtain the star spectrogram. Specifically, the server applies a Fourier transform to the audio signal to obtain its spectrogram, then selects noise-robust points from the signal of each frame; a series of such points forms the star spectrogram, where each noise-robust point is a star point, the horizontal axis represents time, and the vertical axis represents the frequency of the audio signal. There are many ways to select the points, for example taking formant peaks, or dividing the frequency axis into several intervals and selecting the highest-energy point in each interval.
And step 604, detecting the star points at each moment in the star spectrogram along a time axis.
The horizontal axis of the star spectrogram represents time, that is, the time axis, along which the star points are distributed; the vertical axis represents the frequency of the audio signal. Each moment corresponds to its star points. Specifically, the server inspects the star points at each moment in the star spectrogram, one moment at a time, along the time axis.
Step 606, when the audio signal corresponding to the star point is detected to be the noise audio, the star point corresponding to the noise audio is removed, and the updated star spectrogram is obtained.
Specifically, according to the identification result of VAD processing, all the star points corresponding to the segment identified as the noise audio are set to zero and removed on the star spectrogram, only the star points of the star spectrogram identified as the target audio are reserved, and when all detection is finished, the updated star spectrogram only including the target audio is obtained.
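A sketch of this clearing step, assuming the VAD decisions have already been aligned to the star spectrogram's time frames:

```python
def clear_noise_stars(stars, frame_is_noise):
    """Zero out every star point whose frame the VAD classified as noise.

    stars: boolean constellation of shape (freq_bins, time_frames);
    frame_is_noise: boolean array of length time_frames (assumed aligned).
    """
    updated = stars.copy()
    updated[:, frame_is_noise] = False  # delete all stars at noise moments
    return updated
```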
In this embodiment, the star spectrogram corresponding to the audio signal is acquired and the star points corresponding to the noise audio are cleared from it, yielding the corresponding updated star spectrogram; the noise audio can thus be effectively removed and an accurate star spectrogram obtained. Audio feature hash data is constructed from the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, effectively resisting the influence of noise audio on audio fingerprint extraction and improving noise robustness.
In one embodiment, as shown in FIG. 7, the audio feature hash data comprises an audio feature hash table; step 206, obtaining the audio characteristic hash data according to the updated star spectrogram comprises:
step 702, forming a hash key according to any two star points in the updated star spectrogram corresponding to the target audio.
Specifically, a Key is formed by combining any two star points in the updated star spectrogram; this Key is the hash key, which is used to construct the hash table.
Step 704, an audio feature hash table is constructed according to the hash key and the hash value corresponding to the hash key.
Each hash key corresponds to a hash Value; the hash value is the Value of the hash key and is obtained from the frequencies of the two star points forming the hash key and their time offset on the time axis. Specifically, an audio feature hash table composed of Key-Value pairs is constructed from the hash keys and their corresponding hash values. The hash data may be a hash table. Accordingly, step 208, obtaining the audio fingerprint according to the audio feature hash data, includes: step 706, obtaining the audio fingerprint according to the audio feature hash table.
In this embodiment, a hash key is formed by updating any two star points in the star spectrogram corresponding to the target audio, and an audio characteristic hash table is constructed according to the hash key and a hash value corresponding to the hash key. And obtaining the audio characteristic hash data corresponding to the target audio after the noise audio is removed. Through the audio characteristic hash data corresponding to the target audio, the accurate audio fingerprint in the audio signal can be obtained, the influence of the noise audio on the extraction of the audio fingerprint can be effectively resisted, and the noise robustness is improved.
In one embodiment, as shown in fig. 8, in step 704, constructing an audio feature hash table according to the hash key and the hash value corresponding to the hash key includes:
step 802, determining a hash value corresponding to the hash key according to the time deviation of any two star points on the time axis.
The time offset of the two star points on the time axis is the time offset between the frequency points corresponding to the two star points. Specifically, the hash value corresponding to the hash key is composed of the frequencies (f1 and f2) of the two star points and their time offset (Δt) on the time axis; f1 and f2 can each be represented with m-bit (e.g., 10-bit) quantization and Δt with n-bit (e.g., 12-bit) quantization, so the hash value corresponding to the hash key can be represented with 2 × m + n bits.
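A sketch of this quantized packing for m = 10 and n = 12 (2 × 10 + 12 = 32 bits); the field order inside the integer is an assumption of the sketch:

```python
def pack_fingerprint_code(f1, f2, dt, m=10, n=12):
    """Pack two quantized frequencies (m bits each) and a quantized time
    offset (n bits) into a single 2*m + n bit integer."""
    assert 0 <= f1 < (1 << m) and 0 <= f2 < (1 << m) and 0 <= dt < (1 << n)
    return (f1 << (m + n)) | (f2 << n) | dt
```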
And step 804, obtaining a hash pair according to the hash key and the hash value corresponding to the hash key.
A hash pair consists of a hash Key and its corresponding hash Value and can be written as Key-Value. Specifically, a hash pair is obtained from the hash key and its corresponding hash value; for example, the hash Key may be represented by a 32-bit integer. The hash Value corresponding to each hash key is the time offset, within the current audio signal, of the frequency points corresponding to the two star points.
And step 806, constructing an audio characteristic hash table according to the hash pairs.
The audio feature hash data may be an audio feature hash table, a hash table representing the audio features and composed of the Key-Value hash pairs corresponding to those features. Specifically, hash keys and hash values are formed from any two star points on the updated star spectrogram corresponding to the target audio in the audio signal, yielding all the hash pairs, and the audio feature hash table is constructed from all the hash pairs corresponding to the target audio in the audio signal.
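A sketch of the table construction, using the packing helper sketched earlier; limiting each anchor star to a fan_out of nearby partners is an assumption of this sketch (the description pairs any two star points):

```python
from collections import defaultdict

def build_hash_table(star_points, fan_out=5, n=12):
    """Map packed (f1, f2, dt) keys to the anchor star's time offset.

    star_points: list of (t, f) integer tuples sorted by time. Pairs
    whose time offset overflows the n-bit field are skipped.
    """
    table = defaultdict(list)
    for i, (t1, f1) in enumerate(star_points):
        for t2, f2 in star_points[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if dt >= (1 << n):  # offset too large for the n-bit field
                continue
            table[pack_fingerprint_code(f1, f2, dt)].append(t1)
    return table
```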
In this embodiment, the hash value corresponding to the hash key is determined according to the time deviation of any two star points on the time axis, and the hash pair is obtained according to the hash key and the hash value corresponding to the hash key. And constructing audio characteristic hash data according to the hash pairs. And obtaining the audio characteristic hash data corresponding to the target audio after the noise audio is removed. Through the audio characteristic hash data corresponding to the target audio, the accurate audio fingerprint in the audio signal can be obtained, the influence of the noise audio on the extraction of the audio fingerprint can be effectively resisted, and the noise robustness is improved.
In an application example, the application also provides an application scenario, and the application scenario applies the audio fingerprint extraction method. Specifically, the audio fingerprint extraction method is applied to the application scene as follows:
the method comprises the steps that a server acquires audio training data which are acquired by a terminal and carry classification labels, wherein the audio training data comprise target audio training data and noise audio training data; and then the server acquires an initial VAD deep learning model, and trains the initial VAD deep learning model through a VAD algorithm according to the audio training data to obtain the VAD deep learning model.
And the server acquires the audio signal uploaded by the terminal, and performs VAD processing on the audio signal based on the obtained VAD deep learning model to obtain noise audio and target audio. And then carrying out speech enhancement processing based on spectral subtraction or speech enhancement processing based on a deep learning algorithm on the target audio.
The server extracts the star spectrogram corresponding to the audio signal based on an audio fingerprint algorithm: a Fourier transform is applied to the audio signal to obtain its spectrogram, and noise-robust points are then selected in the spectrogram from the signal of each frame, so that the star spectrogram consists of a series of noise-robust points. These noise-robust points are the star points of the star spectrogram; the horizontal axis represents time, namely the time axis, and the vertical axis represents the frequency of the audio signal, so the star points are distributed along the time axis. The star points at each moment in the star spectrogram are detected along the time axis, and when the audio signal corresponding to a star point is detected to be noise audio, the star points corresponding to the noise audio are cleared, yielding the updated star spectrogram corresponding to the target audio.
The server forms a hash Key according to any two star points in the updated star spectrogram corresponding to the target audio; then, determining a hash Value corresponding to the hash key according to the time deviation of any two star points on a time axis; obtaining a hash pair Key-Value according to the hash Key and the hash Value corresponding to the hash Key; and constructing an audio characteristic hash table according to the Key-Value of the hash pair.
And the server obtains the audio fingerprint according to the constructed audio characteristic hash table. Specifically, the audio fingerprint refers to a unique digital feature in an audio signal embodied in the form of an identifier for matching and comparing audio, for example, for identifying a huge amount of sound samples or tracking and locating the positions of the samples in a database. An audio fingerprint is essentially an identifier associated with an audio signal. And the server obtains the audio fingerprint corresponding to the target audio according to the constructed audio characteristic hash table.
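Tying together the sketches given earlier in this description, an end-to-end illustration of the pipeline this application scenario describes; stretching the VAD decisions onto the STFT frame grid by nearest-neighbour indexing is a simplifying assumption:

```python
import numpy as np

def extract_fingerprint(signal, fs):
    """VAD -> constellation -> clear noise stars -> hash table, using the
    helper sketches defined earlier (simple_vad, constellation,
    clear_noise_stars, build_hash_table)."""
    speech = simple_vad(signal, fs)                # per-VAD-frame decision
    stars = constellation(signal, fs)              # (freq_bins, stft_frames)
    idx = np.linspace(0, len(speech) - 1, stars.shape[1]).astype(int)
    updated = clear_noise_stars(stars, ~speech[idx])
    freqs, times = np.nonzero(updated)
    points = sorted(zip(times.tolist(), freqs.tolist()))  # (t, f) by time
    return build_hash_table(points)
```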
In this application example, the audio signal is acquired, voice endpoint detection processing is performed on it, and the noise audio in the audio signal is identified. It should be noted that in the application scenario to be optimized, the audio signal contains noise audio that includes both noise overlapping the target audio and noise juxtaposed with the target audio; both kinds can be effectively identified through voice endpoint detection processing, which facilitates the subsequent removal of the noise audio to obtain a more accurate audio signal. The star spectrogram corresponding to the audio signal is acquired, its star points being selected noise-robust points, and the star points corresponding to the noise audio are cleared to obtain the corresponding updated star spectrogram; the noise audio can thus be effectively removed, noise robustness improved, and an accurate star spectrogram obtained. Audio feature hash data is constructed from the updated star spectrogram, yielding hash data from which the noise audio has been removed. From the audio feature hash data, an accurate audio fingerprint of the audio signal can be obtained, effectively resisting the influence of noise audio on audio fingerprint extraction and improving noise robustness.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least part of the steps in each flowchart may comprise multiple sub-steps or stages, which need not be completed at the same time but may be performed at different times, and need not be performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided an audio fingerprint extraction apparatus including: an audio signal acquisition module 902, a noise audio clean-up module 904, an audio data processing module 906, and an audio fingerprint generation module 908, wherein:
an audio signal obtaining module 902, configured to obtain an audio signal, perform voice endpoint detection processing on the audio signal, and identify a noise audio in the audio signal;
the noise audio clearing module 904 is configured to obtain a star spectrogram corresponding to the audio signal, and clear corresponding star points of the noise audio in the star spectrogram to obtain an updated star spectrogram;
the audio data processing module 906 is configured to obtain audio feature hash data according to the updated star spectrogram;
and an audio fingerprint generating module 908, configured to obtain an audio fingerprint according to the audio feature hash data.
In one embodiment, the audio signal acquiring module is further configured to acquire an audio signal, perform voice endpoint detection processing on the audio signal, and identify a noise audio and a target audio in the audio signal.
In one embodiment, the audio fingerprint extraction device further includes a speech enhancement processing module, and the speech enhancement processing module is further configured to perform speech enhancement processing on the target audio, where the speech enhancement processing includes any one of speech enhancement processing based on spectral subtraction and speech enhancement processing based on a deep learning algorithm.
In one embodiment, the audio signal acquisition module is further configured to acquire an audio signal, perform speech endpoint detection processing on the audio signal based on the speech endpoint detection deep learning model, and identify a noise audio and a target audio in the audio signal, where the speech endpoint detection deep learning model is constructed by audio training data.
In one embodiment, the audio fingerprint extraction device comprises a model training module, a classification module and a comparison module, wherein the model training module is used for acquiring audio training data carrying classification labels, and the audio training data comprises target audio training data and noise audio training data; and acquiring an initial voice endpoint detection deep learning model, and training the initial voice endpoint detection deep learning model through a voice endpoint detection algorithm according to the audio training data to obtain the voice endpoint detection deep learning model.
In one embodiment, the noise audio clearing module is further configured to extract a star spectrogram corresponding to the audio signal based on an audio fingerprint algorithm, wherein star points on the star spectrogram are distributed along a time axis; detecting the star points at each moment in the star spectrogram along a time axis; and when the audio signals corresponding to the star points are detected to be noise audio, clearing the star points corresponding to the noise audio to obtain an updated star spectrogram.
In one embodiment, the audio data processing module is further configured to form a hash key according to any two star points in the updated star spectrogram corresponding to the target audio; and constructing an audio characteristic hash table according to the hash key and the hash value corresponding to the hash key.
In one embodiment, the audio data processing module is further configured to determine a hash value corresponding to the hash key according to a time deviation of any two stars on the time axis; obtaining a hash pair according to the hash key and the hash value corresponding to the hash key; and constructing an audio characteristic hash table according to the hash pairs.
For the specific definition of the audio fingerprint extraction device, reference may be made to the above definition of the audio fingerprint extraction method, which is not described herein again. The modules in the audio fingerprint extraction device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store audio training data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an audio fingerprint extraction method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, comprising a memory and a processor, the memory storing a computer program; when the processor executes the computer program, the steps of the above method embodiments are implemented.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the above method embodiments are implemented.
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and while their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An audio fingerprint extraction method, characterized in that the method comprises:
acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal;
acquiring a star spectrogram corresponding to the audio signal, and clearing the star points corresponding to the noise audio in the star spectrogram to obtain an updated star spectrogram;
obtaining audio feature hash data according to the updated star spectrogram; and
obtaining an audio fingerprint according to the audio feature hash data.

2. The method according to claim 1, characterized in that the audio signal comprises target audio and noise audio, and the acquiring an audio signal, performing voice endpoint detection processing on the audio signal, and identifying noise audio in the audio signal comprises:
acquiring an audio signal, performing voice endpoint detection processing on the audio signal based on a voice endpoint detection deep learning model, and identifying the noise audio and the target audio in the audio signal, wherein the voice endpoint detection deep learning model is constructed from audio training data.

3. The method according to claim 2, characterized in that the updated star spectrogram is the star spectrogram corresponding to the target audio, and the training process of the voice endpoint detection deep learning model comprises:
acquiring audio training data carrying classification labels, the audio training data comprising target audio training data and noise audio training data; and
acquiring an initial voice endpoint detection deep learning model, and training the initial voice endpoint detection deep learning model through a voice endpoint detection algorithm according to the audio training data to obtain the voice endpoint detection deep learning model.

4. The method according to claim 2, characterized in that before the acquiring a star spectrogram corresponding to the audio signal and clearing the star points corresponding to the noise audio in the star spectrogram to obtain an updated star spectrogram, the method further comprises:
performing speech enhancement processing on the target audio, the speech enhancement processing comprising any one of speech enhancement processing based on spectral subtraction and speech enhancement processing based on a deep learning algorithm.

5. The method according to claim 1, characterized in that the acquiring a star spectrogram corresponding to the audio signal and clearing the star points corresponding to the noise audio in the star spectrogram to obtain an updated star spectrogram comprises:
extracting the star spectrogram corresponding to the audio signal based on an audio fingerprint algorithm, the star points on the star spectrogram being distributed along a time axis;
detecting the star points at each moment in the star spectrogram along the time axis; and
when it is detected that the audio signal corresponding to a star point is noise audio, clearing the star point corresponding to the noise audio to obtain the updated star spectrogram.

6. The method according to claim 5, characterized in that the audio feature hash data comprises an audio feature hash table, and the obtaining audio feature hash data according to the updated star spectrogram comprises:
forming a hash key according to any two star points in the updated star spectrogram corresponding to the target audio; and
constructing the audio feature hash table according to the hash key and a hash value corresponding to the hash key.

7. The method according to claim 6, characterized in that the constructing the audio feature hash table according to the hash key and the hash value corresponding to the hash key comprises:
determining the hash value corresponding to the hash key according to the time deviation of the any two star points on the time axis;
obtaining a hash pair according to the hash key and the hash value corresponding to the hash key; and
constructing the audio feature hash table according to the hash pair.

8. An audio fingerprint extraction device, characterized in that the device comprises:
an audio signal acquisition module, configured to acquire an audio signal, perform voice endpoint detection processing on the audio signal, and identify noise audio in the audio signal;
a noise audio clearing module, configured to acquire a star spectrogram corresponding to the audio signal and clear the star points corresponding to the noise audio in the star spectrogram to obtain an updated star spectrogram;
an audio data processing module, configured to obtain audio feature hash data according to the updated star spectrogram; and
an audio fingerprint generation module, configured to obtain an audio fingerprint according to the audio feature hash data.

9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202110260352.0A 2021-03-10 2021-03-10 Audio fingerprint extraction method and device, computer equipment and storage medium Pending CN113113051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110260352.0A CN113113051A (en) 2021-03-10 2021-03-10 Audio fingerprint extraction method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110260352.0A CN113113051A (en) 2021-03-10 2021-03-10 Audio fingerprint extraction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113113051A (en) 2021-07-13

Family

ID=76711461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110260352.0A Pending CN113113051A (en) 2021-03-10 2021-03-10 Audio fingerprint extraction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113113051A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110173208A1 (en) * 2010-01-13 2011-07-14 Rovi Technologies Corporation Rolling audio recognition
CN108831492A (en) * 2018-05-21 2018-11-16 广州国音科技有限公司 A kind of method, apparatus, equipment and readable storage medium storing program for executing handling voice data
CN109271501A (en) * 2018-09-19 2019-01-25 北京容联易通信息技术有限公司 A kind of management method and system of audio database
CN109473123A (en) * 2018-12-05 2019-03-15 百度在线网络技术(北京)有限公司 Voice activity detection method and device
CN110085251A (en) * 2019-04-26 2019-08-02 腾讯音乐娱乐科技(深圳)有限公司 Voice extracting method, voice extraction element and Related product
CN111428078A (en) * 2020-03-20 2020-07-17 腾讯科技(深圳)有限公司 Audio fingerprint coding method and device, computer equipment and storage medium
CN111737515A (en) * 2020-07-22 2020-10-02 深圳市声扬科技有限公司 Audio fingerprint extraction method and device, computer equipment and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506550A (en) * 2021-07-29 2021-10-15 北京花兰德科技咨询服务有限公司 Artificial intelligent reading display and display method
CN114783453A (en) * 2022-03-18 2022-07-22 深圳市声扬科技有限公司 Speech enhancement method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20210713)