CN117916801A - Reverberation and noise robust speech activity detection via modulation domain attention - Google Patents
- Publication number
- CN117916801A CN117916801A CN202280060615.6A CN202280060615A CN117916801A CN 117916801 A CN117916801 A CN 117916801A CN 202280060615 A CN202280060615 A CN 202280060615A CN 117916801 A CN117916801 A CN 117916801A
- Authority
- CN
- China
- Prior art keywords
- computer
- modulation
- spectral
- implemented method
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
A system for detecting speech from a reverberant signal is disclosed. The system is programmed to receive spectro-temporal amplitude data in the modulation frequency domain. The system is then programmed to enhance the spectro-temporal amplitude data by reducing reverberation and other noise and by smoothing, based on certain properties of the modulation spectrogram associated with the spectro-temporal amplitude data. Next, the system is programmed to calculate various features related to the presence of speech based on the enhanced spectro-temporal amplitude data and other data in the modulation frequency domain or the (acoustic) frequency domain. The system is then programmed to determine, based on the various features, a degree of speech present in the audio data corresponding to the received spectro-temporal amplitude data. The system may be programmed to transmit the determined degree of speech to an output device.
Description
Cross Reference to Related Applications
The present application claims priority from the following applications: international application PCT/CN2021/112265 (ref: D20109WO) filed on August 12, 2021, U.S. provisional application 63/239,976 (ref: D20109USP1) filed on September 2, 2021, and European application EP 21205203.9 filed on October 28, 2021.
Technical Field
The present application relates to voice activity detection. More particularly, the example embodiments described below relate to noise- and reverberation-robust voice activity detection based on modulation domain attention.
Background
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Thus, unless otherwise indicated, any approaches described in this section are not to be construed so as to qualify as prior art merely by virtue of their inclusion in this section.
Conventionally, it is difficult for a voice enhancement system incorporated in a speakerphone, video conferencing system, or hearing aid to properly manage noise and reverberation (which may be regarded as a kind of noise but is referred to separately below). It would be helpful to have robust Voice Activity Detection (VAD) that estimates information about noise and reverberation and reduces artifacts and perceptual discontinuities caused by noise and reverberation during speech. Such VAD is particularly helpful for enhancing the voice quality and intelligibility of audio/video content recording and playback systems, such as the voice messaging components of social networking software, video blogging (vlog) platforms, or podcast setups.
Disclosure of Invention
A computer-implemented method of detecting speech from a reverberant signal based on data in a modulation frequency domain is disclosed. The method comprises the following steps: receiving, by the processor, new audio data in the time domain; converting, by the processor, a piece of new audio data corresponding to a point in time into a particular Spectral Time Amplitude (STA) as a time-frequency representation; applying a detection model to the particular STA to obtain an estimate of the degree of speech in the new audio data, comprising: obtaining a Modulation Spectrum Metric (MSM) for the point in time having an acoustic band dimension and a modulation band dimension from one or more STAs obtained from the new audio data; calculating a Diffusion Indicator (DI) based on the MSM, the DI indicating a degree of diffusion of a new piece of audio data in a modulation frequency domain; generating an enhanced STA that filters out reverberation and other noise from the particular STA; calculating one or more features from the enhanced STA; creating one or more feature vectors using the DI and the one or more features; and determining an estimate of the degree of speech in the new piece of audio data from the one or more feature vectors; and transmitting an estimate of the degree of speech in the new piece of audio data.
Drawings
Example embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
FIG. 1 illustrates an example networked computer system in which various embodiments may be practiced.
Fig. 2 illustrates example components of an audio management server computer in accordance with the disclosed embodiments.
Fig. 3A illustrates an energy diagram of a joint acoustic/modulation frequency representation of a clean (non-reverberant) speech signal, i.e., with a reverberation time of 0 ms.
Fig. 3B illustrates an energy diagram of a joint acoustic/modulation frequency representation of a reverberant speech signal with a reverberation time of 500 ms.
Fig. 3C illustrates an energy diagram of a joint acoustic/modulation frequency representation of a reverberant speech signal with a reverberation time of 1 second(s).
Fig. 4A illustrates an energy diagram of a joint acoustic/modulation frequency representation of noise recorded in a room, wherein the modulation frequency ranges up to 24Hz.
Fig. 4B illustrates an energy diagram of a joint acoustic/modulation frequency representation of noise recorded in a room, wherein the modulation frequency range is between 4-24 Hz.
Fig. 5A illustrates an energy plot of a joint acoustic/modulation frequency representation with a signal-to-noise ratio (SNR) of 20 dB.
Fig. 5B illustrates an energy plot of a joint acoustic/modulation frequency representation with a signal-to-noise ratio (SNR) of 10 dB.
Fig. 5C illustrates an energy plot of a joint acoustic/modulation frequency representation with a signal-to-noise ratio (SNR) of 0 dB.
Fig. 6 illustrates a process performed by an audio management server computer in a spectro-temporal amplitude enhancer to enhance spectro-temporal amplitude data by noise reduction.
Fig. 7 illustrates an example process performed with an audio management server computer according to some embodiments described herein.
FIG. 8 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.
Detailed Description
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the example embodiments of the invention. It may be evident, however, that the example embodiments may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the example embodiments.
Embodiments are described in the following subsections according to the following summary:
1. General overview
2. Example computing Environment
3. Example computer component
4. Description of the functionality
4.1. Diffusion indicator module
4.2. Spectrum time amplitude enhancer
4.3. Enhanced feature extractor
4.4. Feature fusion and classification
5. Example procedure
6. Hardware implementation
1. General overview
A system and associated method for detecting speech from a reverberant signal based on data in the modulation frequency domain are disclosed. In some embodiments, the system is programmed to receive spectro-temporal amplitude data. The system is then programmed to enhance the spectro-temporal amplitude data by reducing reverberation and other noise and by smoothing, based on certain properties, in the modulation frequency domain, of the modulation spectrogram associated with the spectro-temporal amplitude data. Next, the system is programmed to calculate various features related to the presence of speech based on the enhanced spectro-temporal amplitude data and other data in the modulation frequency domain or the (acoustic) frequency domain. The system is then programmed to determine, based on the various features, the degree of speech present in the audio data corresponding to the received spectro-temporal amplitude data. The system may be programmed to transmit the determined degree of speech to an output device.
In some embodiments, reducing reverberation to produce enhanced spectro-temporal amplitude data is based primarily on filtering out information that falls within certain modulation frequency ranges. The calculation of features reflecting the reduced presence of reverberation may include applying an existing metric, typically applied in the frequency domain, to the modulation frequency domain, or extracting features directly from the modulation spectrogram associated with the spectro-temporal amplitude.
The system has several technical advantages. It achieves efficient VAD by intelligently selecting features from the audio data that distinguish speech from noise, including reverberation. These features exist at different levels, some related to noise in the environment and some related to clean speech, so using them together improves classification accuracy. Such a VAD further enables the detection and extraction of clean speech from given audio data, and has many applications, especially in environments where reverberation occurs frequently.
2. Example computing Environment
FIG. 1 illustrates an example networked computer system in which various embodiments may be practiced. Fig. 1 is shown in simplified schematic format for illustration of a clear example, and other embodiments may include more, fewer, or different elements.
In some embodiments, the networked computer system includes an audio management server computer 102 ("server"), one or more sensors 104 or input devices, and one or more output devices 110, which are communicatively coupled by a direct physical connection or via one or more networks 118.
In some embodiments, server 102 broadly represents an instance of one or more computers, virtual computing instances, and/or applications programmed or configured with data structures and/or database records arranged to host or perform functions related to low latency speech enhancement by noise reduction. Server 102 may include a server farm, a cloud computing platform, a parallel computer, or any other computing facility with sufficient computing power in terms of data processing, data storage, and network communications for the functions described above.
In some embodiments, each of the one or more sensors 104 may include a microphone or another digital recording device that converts sound into an electrical signal. Each sensor is configured to transmit detected audio data to the server 102. Each sensor may include a processor or may be integrated into a typical client device, such as a desktop computer, laptop computer, tablet computer, smart phone, or wearable device.
In some embodiments, each of the one or more output devices 110 may include a speaker or another digital playback device that converts electrical signals back into sound. Each output device is programmed to play audio data received from the server 102. Similar to the sensor, the output device may include a processor, or may be integrated into a typical client device, such as a desktop computer, laptop computer, tablet computer, smart phone, or wearable device.
One or more of the networks 118 may be implemented by any medium or mechanism that provides for the exchange of data between the various elements of fig. 1. Examples of network 118 include, but are not limited to, one or more cellular networks (communicatively coupled with data connections to computing devices through cellular antennas), Near-Field Communication (NFC) networks, Local Area Networks (LANs), Wide Area Networks (WANs), the Internet, terrestrial or satellite links, and the like.
In some embodiments, the server 102 is programmed to receive input audio data corresponding to sound in a given environment from one or more sensors 104. The server 102 is programmed to next process the input audio data, which typically corresponds to a mix of speech and noise, to estimate how much speech is present in each frame of the input data. The server 102 is further programmed to update the input audio data based on the estimate to produce cleaned output audio data expected to contain less noise than the input audio data. Further, the server 102 is programmed to send output audio data to one or more output devices.
3. Example computer component
Fig. 2 illustrates example components of an audio management server computer in accordance with the disclosed embodiments. The figure is for illustration purposes only, and server 102 may include fewer or more functional components or storage components. Each functional component may be implemented as a software component, a general-purpose or special-purpose hardware component, a firmware component, or any combination thereof. Each functional component may also be coupled with one or more storage components (not shown). A storage component can be implemented using any of a relational database, an object database, a flat file system, or JSON storage. A storage component may be connected to the functional components locally or over a network, using programming calls, Remote Procedure Call (RPC) facilities, or message buses. A component may or may not be self-contained. The components may be functionally or physically centralized or distributed, depending on implementation-specific or other considerations.
In some embodiments, the server 102 includes a modulation domain attention module 220 that includes a diffusion indicator module 202, a spectro-temporal amplitude enhancer 204, and an enhanced feature extractor 206. Server 102 also includes a feature fusion operator 208 and a classification operator 210.
In some embodiments, the diffusion indicator module 202 includes computer-executable instructions that enable the generation of distinguishing features that distinguish between speech and non-speech (e.g., reverberation or other noise) based on different clustering characteristics in the modulation frequency domain.
In some embodiments, the spectral time amplitude enhancer 204 includes computer-executable instructions that enable enhancement of spectral time amplitude in the modulation frequency domain to enhance feature extraction.
In some embodiments, enhanced feature extractor 206 includes computer-executable instructions that enable extraction of temporal and spectral features from the enhanced spectral temporal amplitude data.
In some embodiments, the feature fusion operator 208 includes computer-executable instructions that enable combining features generated by the diffusion indicator module 202, the enhanced feature extractor 206, and optionally other features, as discussed further below.
In some embodiments, classification operator 210 includes computer-executable instructions that enable the presence of clean speech without reverberation or other noise in given audio data to be determined based on a combination of features produced by feature fusion operator 208.
4. Description of the functionality
While mixed audio signals may overlap heavily in the time domain, modulation frequency analysis provides an additional dimension that may present a greater degree of separation between audio sources. In other words, an audio signal originally captured in the time domain may be converted into a time-frequency representation (TFR), a view of the time-varying signal represented in time and frequency, by a transformation such as the discrete short-time Fourier transform (STFT). The TFR may then be extended to a third dimension that represents the modulation frequency, under certain assumptions.
As illustrated in fig. 3A, 3B, 4A, 4B, 5A, 5B, and 5C, the modulation frequency domain is typically shown by a modulation spectrum indicating intensity values, wherein the vertical axis represents a conventional acoustic frequency index k and the horizontal axis represents a modulation frequency index i. The modulation profile may exhibit a greater degree of separation between audio sources.
In the modulation frequency domain, the temporal envelope of clean (anechoic) speech contains frequencies of 2-16 Hz, with a spectral peak at about 4 Hz corresponding to the syllable rate of spoken speech. However, noise and reverberation exhibit different modulation characteristics. For reverberated speech, the diffuse reverberant tail is typically modeled as an exponentially decaying Gaussian white noise process. As the level of reverberation increases, the signal attains properties more closely resembling Gaussian white noise. The reverberant signal exhibits a higher-frequency temporal envelope due to the "whitening" effect of the reverberant tail.
Fig. 3A illustrates an energy diagram of a joint acoustic/modulation frequency representation of a clean speech signal with a reverberation time of 0 ms. Fig. 3B illustrates an energy diagram of a joint acoustic/modulation frequency representation of a reverberant speech signal with a reverberation time of 500 ms. Fig. 3C illustrates an energy diagram of a joint acoustic/modulation frequency representation of a reverberant speech signal with a reverberation time of 1 second (s). As illustrated in fig. 3A, for clean speech, the body of the modulated energy 302 lies primarily below 10 Hz in the modulation frequency domain and peaks around 4 Hz. As illustrated in fig. 3C, reverberation causes the energy to be smeared toward higher modulation frequencies: the stronger the reverberation, the greater the shift. These figures show that clean speech generally concentrates its higher energy in lower modulation frequency regions, and that the stronger the reverberation, the more energy is transferred to higher modulation frequency regions.
Since indoor noise generally varies slowly over time, its modulation spectrum is mainly determined by modulation frequencies below 1 Hz. Thus, the envelope of indoor noise can be modeled as a constant plus a random value. The constant envelope carries the main energy and is centered below 1 Hz in modulation frequency, while the random envelope carries the residual energy, distributed uniformly throughout the modulation frequency domain.
Fig. 4A illustrates a normalized energy plot in a joint acoustic/modulation frequency representation of noise recorded in a room (room noise without speech), where the modulation frequency ranges up to 24Hz and normalization is implemented between 0-24 Hz. As shown in fig. 4A, the main energy is concentrated below 1Hz in the modulation frequency, which illustrates the constant envelope of the indoor noise. Fig. 4B illustrates an energy diagram of a joint acoustic/modulation frequency representation of noise recorded in a room, wherein the modulation frequency range is between 4-24Hz and the normalization is implemented between 4-24 Hz. As shown in fig. 4B, the residual random envelope of the indoor noise exhibits a uniform distribution along the modulation frequency dimension. In addition, energy is concentrated mainly at low acoustic frequencies and gradually decreases as the acoustic frequency increases.
Fig. 5A illustrates an energy plot of a joint acoustic/modulation frequency representation with a signal-to-noise ratio (SNR) of 20 dB. Fig. 5B illustrates an energy plot of a joint acoustic/modulation frequency representation with an SNR of 10 dB. Fig. 5C illustrates an energy plot of a joint acoustic/modulation frequency representation with an SNR of 0 dB. The noise here is actually recorded indoor noise. From these figures, it can be seen that when the modulation frequency is above 4 Hz, the constant temporal envelope of the indoor noise (below 4 Hz) has been filtered out or masked, as illustrated in fig. 4A, and that the residual random temporal envelope of the noise results in relatively uniform masking of the speech region, especially in the low acoustic frequency bands, as illustrated in fig. 4B. The stronger the noise, the more masking occurs in the modulation frequency domain; this can be seen by comparing the smaller proportion of the high-energy portion (such as 502) within a given acoustic frequency range (such as 504) in fig. 5C with the larger proportion of the high-energy portion (such as 506) within the same acoustic frequency range in fig. 5A. Therefore, most of the energy resides in the low acoustic frequency range. In addition, clean speech generally concentrates its higher energy in lower modulation frequency regions, and the stronger the noise, the more the random temporal envelope mixes or masks energy into higher modulation frequency regions.
In some embodiments, the server 102 receives a time-domain signal x(n), with n a discrete-time index. The STFT may be used to obtain the time-frequency (T-F) transform X(l, k) of x(n):

X(l, k) = Σ_{n=0}^{N−1} g(n) · x(lM + n) · e^{−j2πnk/N},

where l denotes a time/frame index, k denotes a channel index, N denotes the frame length or Fast Fourier Transform (FFT) length, g(·) denotes an analysis window of length N, and M denotes a decimation factor.
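As a minimal sketch of the transform above (the Hann window and the values of N and M are illustrative assumptions, not the patent's choices):

```python
import numpy as np

def stft(x, N=256, M=128):
    """Discrete STFT X(l, k): frame/FFT length N, decimation factor M,
    and a length-N analysis window g (Hann, as an assumption)."""
    g = np.hanning(N)
    n_frames = (len(x) - N) // M + 1
    frames = [g * x[l * M:l * M + N] for l in range(n_frames)]
    return np.array([np.fft.fft(f, N) for f in frames])  # shape (frames, bins)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)              # a 1 kHz test tone
X = stft(x)
peak_bin = int(np.abs(X[0, :129]).argmax())
```

A pure tone at frequency f lands in channel k = f·N/fs, so the 1 kHz tone sampled at 8 kHz peaks at bin 32.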
In some embodiments, server 102 then transforms the T-F narrowband signal X(l, k) into the spectro-temporal amplitude Y(l, m) of perceptual acoustic bands, modeled on the human auditory system, using a transformation (banding) matrix:

Y(l, m) = Σ_{k=0}^{N/2} H(m, k) · |X(l, k)|,

where m denotes the index of a perceptual acoustic band and H denotes the banding matrix. Only the first N/2 + 1 narrow bands of X(l, k), with k ranging from 0 to N/2, are used, because for a real-valued signal the remaining narrow bands can be recovered from the first N/2 + 1 FFT components.
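The patent does not specify H beyond mapping narrowband bins to perceptual bands; a mel-spaced triangular filterbank is one common stand-in, sketched here (the band count and resolution are assumptions):

```python
import numpy as np

def banding_matrix(n_bins=129, n_bands=20, fs=8000):
    """Illustrative triangular banding matrix H mapping N/2 + 1 narrowband
    bins to perceptual acoustic bands on a mel-like scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda v: 700.0 * (10.0 ** (v / 2595.0) - 1.0)
    edges = inv(np.linspace(0.0, mel(fs / 2), n_bands + 2))   # band edges in Hz
    bins = np.floor((n_bins - 1) * edges / (fs / 2)).astype(int)
    H = np.zeros((n_bands, n_bins))
    for m in range(n_bands):
        a, b, c = bins[m], bins[m + 1], bins[m + 2]
        H[m, a:b + 1] = np.linspace(0.0, 1.0, b - a + 1)      # rising slope
        H[m, b:c + 1] = np.linspace(1.0, 0.0, c - b + 1)      # falling slope
    return H

H = banding_matrix()
# Per frame: Y(l, m) = (H @ np.abs(X[l, :129]))[m]
```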
In some embodiments, the last L frames of spectral amplitude are used to calculate the modulation spectral metric (spectrogram) Z(l, m, c) at any frame l, perceptual acoustic band m, and modulation band c, based on the FFT:

Z(l, m, c) = Σ_{p=0}^{L−1} ω(p) · Y(l − p, m) · e^{−j2πpc/L},   (1)

where ω(·) represents a window function known to those skilled in the art.
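The modulation spectral metric above can be sketched as a windowed FFT along the frame axis, assuming a Hann window for ω and L = 32 trailing frames:

```python
import numpy as np

def modulation_spectrum(Y, L=32):
    """Z(l, m, c): windowed FFT over the last L banded amplitudes
    Y(l - p, m), p = 0..L-1, computed per acoustic band m."""
    w = np.hanning(L)
    seg = Y[-L:, :] * w[:, None]        # last L frames, windowed
    return np.fft.fft(seg, L, axis=0)   # shape (modulation band c, band m)

rng = np.random.default_rng(0)
Y = np.abs(rng.normal(size=(64, 20)))   # 64 frames, 20 acoustic bands
Z = modulation_spectrum(Y)

# A band whose envelope oscillates 4 times over the L frames shows a
# modulation peak at c = 4.
env = 1.0 + 0.5 * np.cos(2 * np.pi * 4 * np.arange(32) / 32)
Z_env = modulation_spectrum(env[:, None])
```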
4.1. Diffusion indicator module
In some embodiments, the server 102 in the diffusion indicator module 202 calculates a Diffusion Indicator (DI) for a particular time based on the last L frames, the DI characterizing the relationship between energy falling in the lower range of the modulation frequency domain and energy falling in the higher range. As discussed above, energy corresponding to clean speech tends to fall within the lower range of the modulation frequency domain, but the more reverberation or other noise is mixed with the clean speech, the more the energy of the mixture tends to spread into the higher range, resulting in greater "diffusion" of energy values in the modulation frequency domain. Thus, a higher DI indicates a more reverberant or noisier audio signal.
In some embodiments, the DI may be calculated as the center of gravity of the modulation spectrum:

DI(l) = [Σ_{m=m_L}^{m_H} Σ_{c=c_L}^{c_H} c · |Z(l, m, c)|] / [Σ_{m=m_L}^{m_H} Σ_{c=c_L}^{c_H} |Z(l, m, c)|],   (2)

where c_L and c_H denote the lowest and highest modulation bands in the analysis, typically corresponding to 3 Hz and 30 Hz, and m_L and m_H denote the lowest and highest acoustic bands in the analysis, typically corresponding to 125 Hz and 8,000 Hz.
In some embodiments, the DI may be calculated as the energy ratio of the low modulation portion to the high modulation portion:

DI(l) = [Σ_m Σ_{c=c_L1}^{c_L2} |Z(l, m, c)|²] / [Σ_m Σ_{c=c_H1}^{c_H2} |Z(l, m, c)|²],   (3)

where c_L1 and c_L2 denote modulation bands typically corresponding to 3 Hz and 16 Hz, and c_H1 and c_H2 denote modulation bands typically corresponding to 16 Hz and 30 Hz.
In some embodiments, the diffusion indicator may be calculated as the energy ratio of the low modulation portion to the entire modulation range:

DI(l) = [Σ_m Σ_{c=c_L1}^{c_L2} |Z(l, m, c)|²] / [Σ_m Σ_{c=c_L}^{c_H} |Z(l, m, c)|²].   (4)
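The three DI variants can be sketched as follows; band indices stand in for the stated 3-30 Hz modulation range, and the synthetic inputs only illustrate the expected ordering between speech-like and diffuse modulation spectra:

```python
import numpy as np

def di_centroid(Zmag, cL=3, cH=30):
    """Center of gravity of |Z| over modulation bands cL..cH,
    summed over the analysis acoustic bands."""
    c = np.arange(cL, cH + 1)
    E = np.abs(Zmag[cL:cH + 1, :]).sum(axis=1)
    return float((c * E).sum() / E.sum())

def di_low_high_ratio(Zmag, cL1=3, cL2=16, cH1=16, cH2=30):
    """Low-band to high-band modulation energy ratio."""
    low = (np.abs(Zmag[cL1:cL2, :]) ** 2).sum()
    high = (np.abs(Zmag[cH1:cH2 + 1, :]) ** 2).sum()
    return float(low / high)

def di_low_total_ratio(Zmag, cL1=3, cL2=16, cL=3, cH=30):
    """Low-band to whole-range modulation energy ratio."""
    low = (np.abs(Zmag[cL1:cL2, :]) ** 2).sum()
    total = (np.abs(Zmag[cL:cH + 1, :]) ** 2).sum()
    return float(low / total)

# Speech-like energy decays with modulation band; reverberant/noisy energy
# spreads evenly, which raises the centroid and lowers the low-band ratios.
c = np.arange(40)[:, None]
speech_like = np.exp(-c / 4.0) * np.ones((40, 20))
diffuse = np.ones((40, 20))
```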
4.2. Spectrum time amplitude enhancer
Fig. 6 illustrates a process performed by the server in the spectro-temporal amplitude enhancer to enhance spectro-temporal amplitude data by noise reduction. In some embodiments, the server 102 in the spectro-temporal amplitude enhancer 204 performs a series of steps, including reverberation and noise filtering, residual noise estimation, and residual noise suppression in the modulation frequency domain, to convert the initial spectro-temporal amplitude data into enhanced spectro-temporal amplitude data.
In some embodiments, given the modulation spectral metric calculated according to equation (1), server 102 filters noise and reverberation in block 604 to obtain a filtered modulation spectral metric Z̃(l, m, c):

Z̃(l, m, c) = Z(l, m, c) for c_L ≤ c ≤ c_H, and 0 otherwise,   (5)

where c_L is the index of the low-cutoff modulation band and c_H is the index of the high-cutoff modulation band, as in equation (2).
In some embodiments, server 102 smoothes the filtered modulation spectral metric in block 606 as follows. Parseval's theorem roughly states that the sum (or integral) of the squares of a function equals the sum (or integral) of the squares of its Fourier transform:

Σ_{p=0}^{L−1} |Y(l − p, m)|² ≈ (1/L) Σ_{c=0}^{L−1} |Z(l, m, c)|²,   (6)

where |Y(n, m)|² is proportional to the spectro-temporal energy corresponding to amplitude Y(n, m).

The server 102 then calculates a smoothed spectro-temporal energy by aggregating in the modulation frequency domain:

Ȳ²(l, m) = (1/L²) Σ_{c=0}^{L−1} |Z(l, m, c)|²,

which represents the average value of |Y(n, m)|² mentioned in formula (6).
Now, the server 102 calculates the enhanced spectro-temporal energy Ỹ²(l, m), with reverberation and noise filtered in the modulation frequency domain, based on formulas (5) and (6) above:

Ỹ²(l, m) = (2/L²) Σ_{c=c_L}^{c_H} |Z(l, m, c)|²,   (7)

The server 102 may then calculate the smoothed and enhanced spectro-temporal amplitude based on equation (7) above:

Ỹ(l, m) = √(Ỹ²(l, m)),   (8)

where the constant 2 is used to keep the energy from scaling down, because the FFT of a real-valued signal has conjugate symmetry.
In some embodiments, server 102 estimates the spectro-temporal amplitude Ñ(l, m) of the residual (ambient) noise in block 608. One approach is for the server 102 to track the lowest level of spectro-temporal energy in the room over a period of time.

In some embodiments, server 102 performs residual noise estimation and suppression in block 610 to obtain the enhanced spectro-temporal amplitude Ŷ(l, m) as the output data of block 620, for example via spectral subtraction:

Ŷ(l, m) = max(Ỹ(l, m) − Ñ(l, m), 0).   (9)
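Blocks 604 through 610 can be sketched end to end; the Hann window, cutoff indices, lowest-level noise tracker, and the spectral-subtraction rule in the last line are all simplifying assumptions rather than the patent's exact choices:

```python
import numpy as np

def enhance_sta(Y, L=32, cL=2, cH=12, floor=1e-8):
    """Enhance the current frame's banded amplitudes in the modulation domain:
    compute Z, band-pass filter it (block 604), aggregate energy with the
    conjugate-symmetry factor 2 and take the square root (block 606), then
    subtract a crude residual-noise estimate (blocks 608/610, assumed form)."""
    w = np.hanning(L)
    Z = np.fft.fft(Y[-L:, :] * w[:, None], L, axis=0)
    mask = np.zeros(L)
    mask[cL:cH + 1] = 1.0                                   # keep bands cL..cH
    energy = (2.0 / L**2) * (mask[:, None] * np.abs(Z) ** 2).sum(axis=0)
    amp = np.sqrt(energy)
    noise = Y[-L:, :].min(axis=0)                           # lowest-level tracker
    return np.maximum(amp - noise, floor)

rng = np.random.default_rng(1)
Y = np.abs(rng.normal(size=(64, 20)))
enhanced = enhance_sta(Y)
```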
In some embodiments, data in the modulation frequency domain may be used to calculate the enhanced spectro-temporal amplitude via a machine learning model. To build the model, a "raw speech" class may be included in the training dataset as input data, comprising pieces of spectro-temporal amplitude data in the modulation frequency domain corresponding to combinations of clean speech, noise, and reverberation, with lengths in a specific range (such as 5 minutes). An "enhanced speech" class may be included in the training dataset as output data, comprising pieces of spectro-temporal amplitude data in the modulation frequency domain corresponding to smoothed, noise-reduced clean speech. As discussed above, noise reduction includes eliminating reverberation, ambient sound, and other noise. Machine learning methods known to those skilled in the art (such as those described in arXiv:1709.08243 or arXiv:1704.07804 [cs.CV]) can then be applied to the training dataset to build a model configured to produce enhanced spectro-temporal amplitude data. The feature extractor may then extract features from the enhanced spectro-temporal amplitude instead of the original amplitude, resulting in enhanced features, as discussed below.
4.3. Enhanced feature extractor
In some embodiments, the server 102 in the enhanced feature extractor 206 calculates certain features from the enhanced spectro-temporal amplitude, such as enhanced mel-frequency cepstral coefficients (MFCCs) or enhanced spectral flatness (SFT), features that are typically computed from the spectrum.
In some embodiments, in the computation of the MFCC, server 102 computes Enhanced MFCCs (EMFCCs) using the enhanced spectro-temporal amplitude computed in the spectro-temporal amplitude enhancer 204 instead of the original spectro-temporal amplitude. For the purpose of calculating the MFCC, the mel-frequency filterbank may be considered a specific banding matrix.
In some embodiments, in the calculation of the SFT, server 102 calculates an enhanced SFT (ESFT) using the enhanced spectral-temporal amplitude calculated in the spectral-temporal amplitude enhancer 204 instead of the original spectral-temporal amplitude. Specifically, to take the time dimension into account, the original SFT may be calculated from Y(l, m) as the ratio of the geometric mean to the arithmetic mean of the time-summed band amplitudes:

SFT = \frac{\left( \prod_{m=1}^{M} \sum_{l} Y(l, m) \right)^{1/M}}{\frac{1}{M} \sum_{m=1}^{M} \sum_{l} Y(l, m)}

where Y(l, m) again represents the spectral-temporal amplitude at time stamp l (the l-th frame) and perceptual acoustic band m, M represents the total number of bands, and the inner sums run along the time dimension. The ESFT is obtained in the same way from the enhanced spectral-temporal amplitude \hat{Y}(l, m):

ESFT = \frac{\left( \prod_{m=1}^{M} \sum_{l} \hat{Y}(l, m) \right)^{1/M}}{\frac{1}{M} \sum_{m=1}^{M} \sum_{l} \hat{Y}(l, m)}
In some embodiments, other spectral correlation metrics may also be used to characterize how flat or peaked the signal spectrum is and to produce additional features from the enhanced spectral-temporal amplitude, such as the following:
spectral peakiness based on the ratio of the peak band power to the sum of the other bands' powers;
spectral peakiness based on the peak-to-average (excluding the peak band) power ratio;
the variance or standard deviation of adjacent band powers;
the sum or maximum of the band power differences between adjacent bands;
the spectral variance or spectral spread around the spectral centroid;
spectral entropy.
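A minimal sketch of several of the listed peakiness measures, computed from a single band-power vector; the exact definitions (base-2 entropy, index-based centroid, the small stabilizing constants) are plausible readings rather than the normative formulas:

```python
import numpy as np

def spectral_shape_features(p):
    """Peakiness features of a nonnegative band-power vector p."""
    p = np.asarray(p, dtype=float)
    bands = np.arange(len(p))
    peak = np.argmax(p)
    others = np.delete(p, peak)
    feats = {
        # peak band power vs. the sum / average of the other bands
        "peak_to_rest": p[peak] / (others.sum() + 1e-12),
        "peak_to_avg": p[peak] / (others.mean() + 1e-12),
        # dispersion of band powers and differences between adjacent bands
        "band_var": np.var(p),
        "adj_diff_sum": np.sum(np.abs(np.diff(p))),
        "adj_diff_max": np.max(np.abs(np.diff(p))),
    }
    # spectral spread around the (index-based) spectral centroid
    centroid = np.sum(bands * p) / (p.sum() + 1e-12)
    feats["spread"] = np.sum(((bands - centroid) ** 2) * p) / (p.sum() + 1e-12)
    # spectral entropy of the normalized band-power distribution
    q = p / (p.sum() + 1e-12)
    feats["entropy"] = -np.sum(q * np.log2(q + 1e-12))
    return feats

flat = spectral_shape_features(np.ones(8))
peaky = spectral_shape_features(np.array([8.0, 0, 0, 0, 0, 0, 0, 0]))
assert peaky["entropy"] < flat["entropy"]        # peaked spectra: low entropy
assert peaky["peak_to_rest"] > flat["peak_to_rest"]
```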
4.4. Feature fusion and classification
In some embodiments, in the feature fusion operator 208, the server 102 combines the diffusion indicator, the enhanced features, and other common features that are not enhanced, such as the zero-crossing rate, the spectral flux in the frequency domain, or pitch. Server 102 then calculates one or more feature vectors from the combination. The outputs of all the features may simply be concatenated to form a single feature vector. Alternatively, the different features may form respective feature vectors, each vector having one feature.
In some embodiments, in the classification operator 210, server 102 classifies the one or more feature vectors generated by the feature fusion operator 208 via a machine learning model. To build the model, the server 102 may prepare a training set of feature vectors generated by running a set of audio signals (converted to the frequency domain and the modulation frequency domain) that contain various degrees of speech (excluding reverberation or other noise) and various degrees of reverberation through modules 202, 204, 206, and 208. A "degree" may be defined as a ratio in terms of volume or loudness (i.e., the amplitude of the sound waves) or another sound characteristic. For each signal in the training set, the extracted feature vector may be the input data, and an indication of the presence of any speech in the signal (a binary value) or the degree of clean speech in the signal (a continuous value) may be the output data. The server 102 may then apply any machine learning model known to those skilled in the art for classification, such as logistic regression; statistical methods, including adaptive boosting (AdaBoost) or a Gaussian mixture model (GMM); artificial neural networks, including multi-layer perceptrons; or support vector machines. For example, for a neural network, a softmax function may be applied to calculate the probability that an input signal contains speech, which may be used as an estimate of the degree of speech in the input signal.
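As a sketch of the classification stage, the following trains a plain logistic-regression classifier (one of the model families named above) on synthetic stand-ins for the fused feature vectors; the feature layout and training data are invented for illustration:

```python
import numpy as np

# Toy stand-in for the fused feature vectors: signals containing speech
# are assumed here to cluster away from non-speech signals.
rng = np.random.default_rng(2)
speech = rng.normal(1.0, 0.2, size=(100, 3))
nonspeech = rng.normal(-1.0, 0.2, size=(100, 3))
X = np.vstack([speech, nonspeech])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Minimal logistic regression trained by plain gradient descent.
w = np.zeros(3)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid (2-class softmax)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# p estimates the probability (degree) of speech for each signal.
prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = np.mean((prob > 0.5).astype(float) == y)
assert accuracy > 0.95
```

With a continuous "degree of clean speech" label, the same setup becomes a regression problem instead of binary classification.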
5. Example procedure
Fig. 7 illustrates an example process performed with an audio management server computer according to some embodiments described herein. Fig. 7 is shown in a simplified schematic format for purposes of illustrating a clear example, and other embodiments may include more, fewer, or different elements connected in various ways. Fig. 7 is intended to disclose an algorithm, plan, or outline that may be used to implement one or more computer programs or other software elements which, when executed, cause performing the functional improvements and technical advances described herein. Furthermore, the flowcharts herein are described at the same level of detail that persons of ordinary skill in the art ordinarily use to communicate with one another about algorithms, plans, or specifications forming a basis of the software programs that they plan to write or implement using their accumulated skill and knowledge.
In some embodiments, in step 702, the server 102 is programmed to receive new audio data in the time domain.
In some embodiments, in step 704, the server 102 is programmed to convert a new piece of audio data corresponding to a point in time into a particular Spectral Time Amplitude (STA) as a time-frequency representation.
In some embodiments, in step 706, the server 102 is programmed to obtain a Modulation Spectrum Metric (MSM) for the point in time, having an acoustic band dimension and a modulation band dimension, from one or more STAs obtained from the new audio data.
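A minimal sketch of this step under common assumptions: STA frames laid out as time by acoustic band, a 100 Hz frame rate, and a second FFT along the time dimension yielding the modulation band dimension:

```python
import numpy as np

frame_rate = 100.0                     # STA frames per second (assumption)
t = np.arange(128) / frame_rate

# Synthetic STA sequence: 6 acoustic bands carrying a 4 Hz amplitude
# modulation, typical of the syllable rate of speech.
sta = 1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)[:, None] * np.ones((128, 6))

# Second FFT along time of each band's amplitude trajectory gives the
# modulation band dimension; the mean is removed to suppress the DC bin.
msm = np.abs(np.fft.rfft(sta - sta.mean(axis=0), axis=0))
mod_freqs = np.fft.rfftfreq(sta.shape[0], d=1.0 / frame_rate)

peak_band = mod_freqs[np.argmax(msm[:, 0])]
assert abs(peak_band - 4.0) < 0.5      # energy concentrates near 4 Hz
```

The resulting `msm` is indexed by modulation band (rows) and acoustic band (columns), matching the two dimensions named in the text.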
In some embodiments, server 102 is programmed to calculate a Diffusion Indicator (DI) based on the MSM, the DI indicating a degree of diffusion of a new piece of audio data in the modulation frequency domain, in step 708.
In some embodiments, DI is the center of gravity of the modulation spectrum based on MSM values in the modulation band range and the acoustic band range. In some embodiments, DI is the energy ratio of a low modulation portion based on MSM values in a low modulation band range and an acoustic band range to a high modulation portion based on MSM values in a high modulation band range and an acoustic band range. In some embodiments, DI is the energy ratio of the low modulation portion based on the MSM values in the low modulation band range and the acoustic band range to the entire modulation portion based on the MSM values in the entire modulation band range and the acoustic band range.
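The three DI variants can be sketched as follows; the 10 Hz split between "low" and "high" modulation bands and the MSM layout (modulation band by acoustic band) are illustrative assumptions:

```python
import numpy as np

def diffusion_indicators(msm, mod_freqs, split_hz=10.0):
    """The three DI variants described above, computed from an MSM with
    rows = modulation bands and columns = acoustic bands."""
    energy = msm.sum(axis=1)                  # collapse the acoustic bands
    low = energy[mod_freqs < split_hz].sum()
    high = energy[mod_freqs >= split_hz].sum()
    return {
        # center of gravity of the modulation spectrum
        "centroid": np.sum(mod_freqs * energy) / (energy.sum() + 1e-12),
        # low-to-high and low-to-all modulation energy ratios
        "low_to_high": low / (high + 1e-12),
        "low_to_all": low / (energy.sum() + 1e-12),
    }

mod_freqs = np.linspace(0, 50, 26)
# Speech-like energy concentrated near 4 Hz vs. a flat, diffuse
# (reverberation-like) modulation spectrum.
speechlike = np.exp(-((mod_freqs - 4) ** 2) / 8)[:, None] * np.ones((26, 4))
diffuse = np.ones((26, 4))

di_s = diffusion_indicators(speechlike, mod_freqs)
di_d = diffusion_indicators(diffuse, mod_freqs)
assert di_s["centroid"] < di_d["centroid"]    # speech energy sits low
assert di_s["low_to_all"] > di_d["low_to_all"]
```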
In some embodiments, computing DI includes applying a machine learning model that is trained with metrics of MSM of audio data with only clean speech and audio data with varying degrees of reverberation and other noise as input data and corresponding DI values as output data.
In some embodiments, the server 102 is programmed to generate an enhanced STA that filters out reverberation and other noise from a particular STA in step 710.
In some embodiments, generating the enhanced STA includes filtering out MSM values outside of a modulation band range. In further embodiments, the modulation band range is 3 Hz to 30 Hz.
In some embodiments, generating the enhanced STA includes calculating smoothed spectral-temporal energy by aggregating over time. In further embodiments, generating the enhanced STA includes removing residual noise by tracking a minimum spectral-temporal energy over time.
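The band-range filtering and minimum-tracking steps can be sketched as follows; the retained 3-30 Hz range follows the text, while the window length and array layouts are illustrative assumptions:

```python
import numpy as np

def enhance_modulation(msm, mod_freqs, lo=3.0, hi=30.0):
    """Keep only modulation bands in [lo, hi] Hz, the speech-dominant
    range cited in the text, zeroing MSM values outside it."""
    keep = (mod_freqs >= lo) & (mod_freqs <= hi)
    return msm * keep[:, None]

def track_noise_floor(energy, window=5):
    """Minimum statistics: the running minimum of spectral-temporal
    energy over a short window estimates residual stationary noise."""
    return np.array([energy[max(0, i - window + 1):i + 1].min()
                     for i in range(len(energy))])

mod_freqs = np.linspace(0, 50, 26)        # 2 Hz modulation-band spacing
msm = np.ones((26, 4))                    # 4 acoustic bands
out = enhance_modulation(msm, mod_freqs)
assert out[0].sum() == 0.0                # 0 Hz band removed
assert out[2].sum() == 4.0                # 4 Hz band kept

energy = np.array([1.0, 5.0, 2.0, 6.0, 3.0])
floor = track_noise_floor(energy, window=3)
assert floor[3] == 2.0                    # min over [5, 2, 6]
```

The estimated floor would then be subtracted (or spectrally gated) from the smoothed energy to remove residual noise.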
In some embodiments, generating the enhanced STA includes applying a machine learning model that trains with spectral time amplitude data corresponding to different degrees of reverberation and other noise as input data and with corresponding spectral time amplitude data corresponding to only clean speech as output data. In some embodiments, the server 102 is further programmed to extract features characterizing clean speech from the application of the machine learning model, including low cutoff modulation frequencies and high cutoff modulation frequencies.
In some embodiments, server 102 is programmed to calculate one or more features from the enhanced STA and create one or more feature vectors using the DI and the one or more features in step 712.
In some embodiments, the calculating includes computing enhanced mel-frequency cepstral coefficients (MFCCs) by applying the mel-frequency filter to the enhanced STA in the computation of the MFCCs. In some embodiments, the calculating includes computing enhanced spectral flatness (SFT) by using the enhanced STA instead of the STA, and summing values over time in the calculation of the SFT.
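A sketch of the EMFCC computation: a triangular mel filter bank (the banded matrix mentioned earlier) applied to an enhanced amplitude frame, followed by a log and a type-II DCT; the 16 kHz sample rate, band count, and coefficient count are assumptions:

```python
import numpy as np

def mel_filterbank(n_bands, n_fft_bins, sr=16000):
    """Triangular mel filters as a banded matrix (n_bands x n_fft_bins)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_bands + 2))
    bins = np.floor((n_fft_bins - 1) * 2 * pts / sr).astype(int)
    fb = np.zeros((n_bands, n_fft_bins))
    for i in range(n_bands):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                     # rising edge
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                     # falling edge
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def emfcc(enhanced_frame, fb, n_coeffs=13):
    """MFCC pipeline applied to an enhanced amplitude frame:
    mel filtering -> log -> type-II DCT, keeping the first n_coeffs."""
    log_mel = np.log(fb @ (enhanced_frame ** 2) + 1e-12)
    n = len(log_mel)
    k = np.arange(n_coeffs)[:, None]
    dct = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return dct @ log_mel

fb = mel_filterbank(n_bands=20, n_fft_bins=257)
frame = np.abs(np.random.default_rng(3).normal(size=257))
coeffs = emfcc(frame, fb)
assert coeffs.shape == (13,)
```

The only change from a conventional MFCC is that `enhanced_frame` comes from the enhanced STA rather than the original spectrum.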
In some embodiments, the one or more features include a spectral peakiness based on the ratio of the peak band power to the sum of the other bands' powers, a spectral peakiness based on a peak-to-average (excluding the peak band) power ratio, a variance or standard deviation of adjacent band powers, a sum or maximum of band power differences between adjacent bands, a spectral spread or spectral variance around a spectral centroid, and a spectral entropy.
In some embodiments, in step 714, the server 102 is programmed to determine an estimate of the degree of speech in the new piece of audio data from the one or more feature vectors and transmit the estimate of the degree of speech in the new piece of audio data.
In some embodiments, the determining includes applying a machine learning model trained with one or more features of the spectral time amplitude data corresponding to clean speech and the spectral time amplitude data corresponding to different degrees of reverberation and other noise as input data and the corresponding speech degrees as output data.
6. Hardware implementation
According to one embodiment, the techniques described herein are implemented by at least one computing device. The techniques may be implemented, in whole or in part, using a combination of at least one server computer and/or other computing device coupled using a network, such as a packet data network. The computing device may be hardwired for performing the techniques, or may include a digital electronic device such as at least one Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) that is permanently programmed to perform the techniques, or may include at least one general purpose hardware processor that is programmed to perform the techniques according to program instructions in firmware, memory, other storage, or a combination. Such computing devices may also incorporate custom hard-wired logic, ASICs, or FPGAs in combination with custom programming to implement the described techniques. The computing device may be a server computer, workstation, personal computer, portable computer system, handheld device, mobile computing device, wearable device, body mounted or implantable device, smart phone, smart appliance, internetworking device, autonomous or semi-autonomous device such as a robotic or unmanned ground or air vehicle, any other electronic device incorporating hard-wiring and/or program logic to implement the described techniques, one or more virtual computing machines or instances in a data center, and/or a network of server computers and/or personal computers.
FIG. 8 is a block diagram illustrating an example computer system that may be used to implement embodiments. In the example of fig. 8, a computer system 800 and instructions for implementing the disclosed techniques in hardware, software, or a combination of hardware and software are schematically represented as, for example, blocks and circles, in the same degree of detail commonly used by those of ordinary skill in the art to which this disclosure pertains to computer architecture and computer system implementations.
Computer system 800 includes an input/output (I/O) subsystem 802, which may include buses and/or other communication mechanisms for communicating information and/or instructions between the components of computer system 800 via electronic signal paths. The I/O subsystem 802 may include an I/O controller, a memory controller, and at least one I/O port. The electrical signal paths are schematically represented in the figures as, for example, lines, unidirectional arrows, or bidirectional arrows.
At least one hardware processor 804 is coupled to the I/O subsystem 802 for processing information and instructions. The hardware processor 804 may include, for example, a general purpose microprocessor or microcontroller and/or a special purpose microprocessor such as an embedded system or a Graphics Processing Unit (GPU) or a digital signal processor or an ARM processor. The processor 804 may include an integrated Arithmetic Logic Unit (ALU) or may be coupled to a separate ALU.
Computer system 800 includes one or more units of memory 806, such as main memory, coupled to I/O subsystem 802 for electronically and digitally storing data and instructions to be executed by processor 804. Memory 806 may include volatile memory, such as various forms of Random Access Memory (RAM), or other dynamic storage devices. Memory 806 may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 804. Such instructions, when stored in a non-transitory computer-readable storage medium accessible to the processor 804, may cause the computer system 800 to become a special-purpose machine customized to perform the operations specified in the instructions.
Computer system 800 also includes a non-volatile memory, such as Read Only Memory (ROM) 808 or other static storage device coupled to I/O subsystem 802 for storing information and instructions for processor 804. ROM 808 may include various forms of Programmable ROM (PROM) such as an Erasable PROM (EPROM) or an Electrically Erasable PROM (EEPROM). Persistent storage unit 810 may include various forms of non-volatile RAM (NVRAM) such as flash memory or solid state storage, magnetic or optical disks (e.g., CD-ROM or DVD-ROM), and may be coupled to I/O subsystem 802 for storing information and instructions. Storage 810 is an example of a non-transitory computer-readable medium that may be used to store instructions and data that, when executed by processor 804, cause a computer-implemented method for performing the techniques herein to be performed.
The instructions in memory 806, ROM 808 or storage 810 may include one or more sets of instructions organized as a module, method, object, function, routine, or call. The instructions may be organized as one or more computer programs, operating system services, or application programs including mobile applications. The instructions may include an operating system and/or system software; one or more libraries supporting multimedia, programming, or other functions; data protocol instructions or stacks for implementing TCP/IP, HTTP or other communication protocols; file processing instructions for interpreting and presenting files encoded using HTML, XML, JPEG, MPEG or PNG; user interface instructions for rendering or interpreting commands for a Graphical User Interface (GUI), a command line interface, or a text user interface; application software such as office suites, internet access applications, design and manufacturing applications, graphics applications, audio applications, software engineering applications, educational applications, games, or other applications. The instructions may implement a web server, a web application server, or a web client. The instructions may be organized into a presentation layer, an application layer, and a data storage layer such as a relational database system using Structured Query Language (SQL) or NoSQL, an object store, a graph database, a flat file system, or other data store.
Computer system 800 may be coupled to at least one output device 812 via I/O subsystem 802. In one embodiment, the output device 812 is a digital computer display. Examples of displays that may be used in various embodiments include touch screen displays or Light Emitting Diode (LED) displays or Liquid Crystal Displays (LCDs) or electronic paper displays. Computer system 800 may include other types of output devices 812 in addition to or instead of display devices. Examples of other output devices 812 include printers, ticket printers, plotters, projectors, sound or video cards, speakers, buzzers or piezoelectric or other audible devices, lights or LED or LCD indicators, haptic devices, actuators or servos.
At least one input device 814 is coupled to the I/O subsystem 802 for communicating signals, data, command selections, or gestures to the processor 804. Examples of input devices 814 include touch screens, microphones, still and video digital cameras, alphanumeric and other keys, keypads, keyboards, tablets, image scanners, joysticks, clocks, switches, buttons, dials, sliders, and/or various types of sensors such as force sensors, motion sensors, thermal sensors, accelerometers, gyroscopes, and Inertial Measurement Unit (IMU) sensors, and/or various types of transceivers such as wireless (such as cellular or Wi-Fi) transceivers, radio Frequency (RF) transceivers, or Infrared (IR) transceivers, and Global Positioning System (GPS) transceivers.
Another type of input device is a control device 816 that may perform cursor control or other automatic control functions, such as navigating through a graphical interface on a display screen, in lieu of or in addition to input functions. The control device 816 may be a touch pad, mouse, trackball, or cursor direction keys for communicating direction information and command selections to the processor 804 and for controlling cursor movement on the display 812. The input device may have at least two degrees of freedom in two axes, a first axis (e.g., x-axis) and a second axis (e.g., y-axis), allowing the device to specify an orientation in a certain plane. Another type of input device is a wired control device, a wireless control device, or an optical control device, such as a joystick, stick, console, steering wheel, pedal, shift mechanism, or other type of control device. The input device 814 may include a combination of a plurality of different input devices, such as a camera and a depth sensor.
In another embodiment, the computer system 800 may include internet of things (IoT) devices in which one or more of the output device 812, the input device 814, and the control device 816 are omitted. In such embodiments, the input device 814 may include one or more cameras, motion detectors, thermometers, microphones, seismic detectors, other sensors or detectors, measurement devices, or encoders, and the output device 812 may include a special-purpose display such as a single-line LED or LCD display, one or more indicators, a display panel, a meter, a valve, a solenoid, an actuator, or a servo.
When the computer system 800 is a mobile computing device, the input device 814 may include a Global Positioning System (GPS) receiver coupled to a GPS module capable of triangulating, determining, and generating geographic location or location data, such as latitude-longitude values, for the geophysical location of the computer system 800. Output device 812 may include hardware, software, firmware, and interfaces for generating location report packets, notifications, pulse or heartbeat signals, or other repetitive data transmissions that specify the location of computer system 800, either alone or in combination with other application specific data, directed to host 824 or server 830.
Computer system 800 may implement the techniques described herein using custom hardwired logic, at least one ASIC or FPGA, firmware, and/or program instructions or logic that, when loaded and used or executed, in combination with a computer system, cause the computer system to operate as a special purpose machine. According to one embodiment, computer system 800 performs the techniques herein in response to processor 804 executing at least one sequence of at least one instruction contained in main memory 806. Such instructions may be read into main memory 806 from another storage medium, such as storage 810. Execution of the sequences of instructions contained in main memory 806 causes processor 804 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term "storage medium" as used herein refers to any non-transitory medium that stores data and/or instructions that cause a machine to operate in a specific manner. Such storage media may include non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage 810. Volatile media includes dynamic memory, such as memory 806. Common forms of storage media include, for example, a hard disk, a solid state drive, a flash memory drive, a magnetic data storage medium, any optical or physical data storage medium, a memory chip, etc.
Storage media are different from, but may be used in conjunction with, transmission media. Transmission media participate in the transfer of information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 802 of the I/O subsystem. Transmission media can also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
Various forms of media may be involved in carrying at least one sequence of at least one instruction to processor 804 for execution. For example, the instructions may initially be carried on a magnetic disk or solid state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a communication link, such as an optical or coaxial cable or a telephone line, using a modem. A modem or router local to computer system 800 can receive the data on the communication link and convert the data for reading by computer system 800. For example, a receiver such as a radio frequency antenna or an infrared detector may receive data carried in a wireless or optical signal and appropriate circuitry may provide the data to the I/O subsystem 802, such as placing the data on a bus. The I/O subsystem 802 carries data to memory 806 from which the processor 804 retrieves and executes the instructions. The instructions received by memory 806 may optionally be stored on storage 810 either before or after execution by processor 804.
Computer system 800 also includes a communication interface 818 coupled to bus 802. Communication interface 818 provides a two-way data communication coupling to a network link 820 that connects, directly or indirectly, to at least one communication network, such as network 822 or a public or private cloud on the Internet. For example, communication interface 818 may be an Ethernet networking interface, an Integrated Services Digital Network (ISDN) card, a cable modem, a satellite modem, or a modem to provide a data communication connection to a corresponding type of communication line (e.g., an Ethernet cable, any type of metal cable or fiber-optic line, or a telephone line). Network 822 broadly represents a Local Area Network (LAN), Wide Area Network (WAN), campus network, internetwork, or any combination thereof. Communication interface 818 may include a LAN card to provide a data communication connection to a compatible LAN, a cellular radiotelephone interface to wirelessly send or receive cellular data according to a cellular radiotelephone wireless networking standard, or a satellite radio interface to wirelessly send or receive digital data according to a satellite wireless networking standard. In any such implementation, communication interface 818 sends and receives electrical, electromagnetic, or optical signals over signal paths that carry digital data streams representing various types of information.
Network link 820 typically provides electrical, electromagnetic, or optical data communication directly or through at least one network to other data devices using, for example, satellite, cellular, wi-Fi, or bluetooth technology. For example, network link 820 may provide a connection through network 822 to a host computer 824.
In addition, network link 820 may provide a connection through network 822 or through internet equipment and/or computers operated via an Internet Service Provider (ISP) 826 to other computing devices. ISP 826 provides data communication services through the world wide packet data communication network (denoted as the Internet 828). A server computer 830 may be coupled to the internet 828. Server 830 broadly represents any computer, data center, virtual machine or virtual computing instance with or without a hypervisor, or computer executing a containerized program system such as DOCKER or KUBERNETES. Server 830 may represent an electronic digital service implemented using more than one computer or instance and accessed and used by transmitting a web service request, a Uniform Resource Locator (URL) string with parameters in an HTTP payload, an API call, an application service call, or other service call. Computer system 800 and server 830 can form elements of a distributed computing system that includes other computers, processing clusters, server clusters, or other computer organizations that cooperate to perform tasks or execute applications or services. The server 830 may include one or more sets of instructions organized as a module, method, object, function, routine, or call. The instructions may be organized as one or more computer programs, operating system services, or application programs including mobile applications. 
The instructions may include an operating system and/or system software; one or more libraries supporting multimedia, programming, or other functions; data protocol instructions or stacks for implementing TCP/IP, HTTP or other communication protocols; file format processing instructions for interpreting or rendering files encoded using HTML, XML, JPEG, MPEG or PNG; user interface instructions for rendering or interpreting commands for a Graphical User Interface (GUI), a command line interface, or a text user interface; application software such as office suites, internet access applications, design and manufacturing applications, graphics applications, audio applications, software engineering applications, educational applications, games, or other applications. Server 830 may include a web application server hosting a presentation layer, an application layer, and a data storage layer, such as a relational database system using Structured Query Language (SQL) or NoSQL, object storage, graphics database, flat file system, or other data storage.
Computer system 800 can send messages and receive data and instructions, including program code, through the network(s), network link 820 and communication interface 818. In the Internet example, a server 830 might transmit a requested code for an application program through Internet 828, ISP 826, local network 822 and communication interface 818. The received code may be executed by processor 804 as it is received, and/or stored in storage 810, or other non-volatile storage for later execution.
Execution of the instructions described in this section may implement a process in the form of an instance of a computer program that is being executed and that consists of program code and its current activity. Depending on the operating system (OS), a process may be made up of multiple threads of execution that execute instructions concurrently. In this context, a computer program is a passive collection of instructions, while a process may be the actual execution of those instructions. Several processes may be associated with the same program; for example, opening several instances of the same program often means more than one process is being executed. Multitasking may be implemented to allow multiple processes to share the processor 804. While each processor 804 or core of the processor executes a single task at a time, computer system 800 may be programmed to implement multitasking to allow each processor to switch between tasks that are being executed without having to wait for each task to finish. In an embodiment, switches may be performed when tasks perform input/output operations, when a task indicates that it can be switched, or upon a hardware interrupt. Time-sharing may be implemented to allow fast response for interactive user applications by rapidly performing context switches that provide the appearance of concurrent execution of multiple processes. In an embodiment, for security and reliability, an operating system may prevent direct communication between independent processes, providing strictly mediated and controlled inter-process communication functionality.
7. Extensions and alternatives
In the foregoing specification, embodiments of the disclosure have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the disclosure, and what is intended by the applicants to be the scope of the disclosure, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.
Aspects of the invention may be understood from the enumerated example embodiments (EEEs) below:
EEE 1. A computer-implemented method of detecting speech from a reverberant signal based on data in a modulated frequency domain, the method comprising:
Obtaining, by a processor, a particular Spectral Time Amplitude (STA) as a time-frequency representation corresponding to a point in time covered by new audio data in the time domain;
Obtaining a Modulation Spectrum Metric (MSM) for the point in time having an acoustic band dimension and a modulation band dimension from one or more STAs obtained from the new audio data;
Calculating a Diffusion Indicator (DI) based on the MSM, the DI indicating a degree of diffusion of a new piece of audio data in a modulation frequency domain;
Generating an enhanced STA that filters out reverberation and other noise from the particular STA;
Calculating one or more features from the enhanced STA;
creating one or more feature vectors using the DI and the one or more features;
determining an estimate of the degree of speech in the new piece of audio data from the one or more feature vectors; and
outputting the estimate of the degree of speech in the new piece of audio data.
EEE 2. The computer-implemented method of EEE 1, wherein the DI is the center of gravity of the modulation spectrum based on MSM values in a modulation band range and an acoustic band range.
EEE 3. The computer-implemented method of EEE 1, wherein the DI is an energy ratio of a low modulation portion, based on MSM values in a low modulation band range and an acoustic band range, to a high modulation portion, based on MSM values in a high modulation band range and the acoustic band range.
EEE 4. The computer-implemented method of EEE 1, wherein the DI is an energy ratio of a low modulation portion, based on MSM values in a low modulation band range and an acoustic band range, to an entire modulation portion, based on MSM values in an entire modulation band range and the acoustic band range.
EEE 5. The computer-implemented method of EEE 1, wherein the obtaining comprises calculating the MSM with a fast Fourier transform using a plurality of pieces of new audio data corresponding to a particular number of consecutive time points prior to the point in time.
EEE 6. The computer-implemented method of any of EEEs 1-5, wherein generating the enhanced STA comprises filtering out MSM values outside of a modulation band range.
EEE 7. The computer-implemented method of EEE 6, wherein the modulation band range is 3 Hz to 30 Hz.
EEE 8. The computer-implemented method of any of EEEs 1-7, wherein generating the enhanced STA comprises calculating smoothed spectral-temporal energy by aggregating over time.
EEE 9. The computer-implemented method of any of EEEs 1-8, wherein generating the enhanced STA comprises removing residual noise by tracking a minimum spectral-temporal energy over time.
EEE 10. The computer-implemented method of any of EEEs 1-7, wherein generating the enhanced STA comprises applying a machine learning model trained with spectral-temporal amplitude data corresponding to different degrees of reverberation and other noise as input data and corresponding spectral-temporal amplitude data corresponding to only clean speech as output data.
EEE 11. The computer-implemented method of EEE 10, further comprising extracting features characterizing the clean speech, including a low cutoff modulation frequency and a high cutoff modulation frequency, from application of the machine learning model.
The computer-implemented method of any of EEEs 1-11, the calculating comprising calculating enhanced mel-frequency filter cepstral coefficients (MFCCs) using the enhanced STA.
EEE 13 the computer-implemented method of any of EEEs 1-12, the calculating comprising calculating an enhanced Spectral Flatness (SFT) by using the enhanced STA instead of the STA, and summing values over time in the calculation of the SFT.
EEE 14. The computer-implemented method of any of EEEs 1-13, the one or more features comprising a spectral peak based on a sum of peak-to-other band power ratios, a spectral peak based on a peak-to-average (non-peak band) power ratio, a variance or standard deviation of adjacent band powers, a sum or maximum of band power differences between adjacent bands, a spectral spread or spectral variance around a spectral centroid, and a spectral entropy.
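A few of the EEE 14 features, sketched with conventional definitions (the exact formulations in the patent may differ):

```python
import numpy as np

def spectral_features(power, freqs, eps=1e-12):
    """Illustrative subset of EEE 14 features: spectral entropy,
    centroid/spread, and a peak-to-average power ratio over non-peak bands."""
    p = np.asarray(power, dtype=float) + eps
    prob = p / p.sum()
    entropy = -np.sum(prob * np.log2(prob))                   # spectral entropy
    centroid = np.sum(freqs * prob)
    spread = np.sqrt(np.sum(prob * (freqs - centroid) ** 2))  # spread around centroid
    k = np.argmax(p)
    peak_to_avg = p[k] / np.mean(np.delete(p, k))             # peak vs non-peak bands
    return {"entropy": entropy, "centroid": centroid,
            "spread": spread, "peak_to_avg": peak_to_avg}
```

Voiced speech tends to show low entropy and high peak-to-average ratios; diffuse noise shows the opposite.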
EEE 15. The computer-implemented method of any of EEEs 1-14, the determining comprising applying a machine learning model trained with one or more features of spectral time amplitude data corresponding to clean speech and spectral time amplitude data corresponding to different degrees of reverberation and other noise as input data and corresponding speech degrees as output data.
EEE 16. The computer-implemented method of any of EEEs 1-15, further comprising:
receiving new audio data in the time domain; and
converting a new piece of audio data corresponding to a point in time into the particular Spectral Time Amplitude (STA) as a time-frequency representation.
EEE 17. A computer-implemented method of detecting speech from a reverberant signal based on data in a modulation frequency domain, the method comprising:
receiving, by a processor, new audio data in the time domain;
converting, by the processor, a piece of new audio data corresponding to a point in time into a particular Spectral Time Amplitude (STA) as a time-frequency representation;
applying a detection model to the particular STA to obtain an estimate of a degree of speech in the new audio data, comprising:
obtaining, by the processor, a Modulation Spectrum Metric (MSM) for the point in time having an acoustic band dimension and a modulation band dimension from one or more STAs obtained from the new audio data;
calculating a Diffusion Indicator (DI) based on the MSM, the DI indicating a degree of diffusion of a new piece of audio data corresponding to the point in time in a modulation frequency domain;
generating an enhanced STA that filters out reverberation and other noise from the particular STA;
calculating one or more features from the enhanced STA;
creating one or more feature vectors using the DI and the one or more features;
determining an estimate of the degree of speech in the new piece of audio data from the one or more feature vectors; and
transmitting the estimate of the degree of speech in the new piece of audio data.
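The steps of EEE 17 can be strung together as a rough end-to-end sketch. Everything below the STFT is an illustrative stand-in: the frame sizes, the 3 Hz and 3-30 Hz boundaries, the two features, and especially the final logistic score (which replaces the trained detection model of the claim) are assumptions:

```python
import numpy as np

def detect_speech_degree(audio, sr=16000, frame=512, hop=256):
    """Illustrative EEE 17 flow: STA buffer -> MSM -> DI -> enhanced STA
    -> features -> speech-degree score in [0, 1]."""
    # 1. STA frames: magnitude STFT of the incoming audio
    n = (len(audio) - frame) // hop + 1
    win = np.hanning(frame)
    sta = np.array([np.abs(np.fft.rfft(audio[i * hop:i * hop + frame] * win))
                    for i in range(n)])
    # 2. MSM: FFT along time of each acoustic band's trajectory
    spec = np.fft.rfft(sta, axis=0)
    msm = np.abs(spec)
    mod_freqs = np.fft.rfftfreq(n, d=hop / sr)
    # 3. DI: low-modulation to total modulation energy ratio
    di = np.sum(msm[mod_freqs < 3.0] ** 2) / max(np.sum(msm ** 2), 1e-12)
    # 4. enhanced STA: keep only 3-30 Hz modulations
    keep = (mod_freqs >= 3.0) & (mod_freqs <= 30.0)
    enh = np.fft.irfft(np.where(keep[:, None], spec, 0), n=n, axis=0)
    # 5./6. feature vector + placeholder score (a trained model would go here)
    feat = np.array([di, np.mean(enh ** 2)])
    weights = np.array([-5.0, 50.0])              # hypothetical weights
    return 1.0 / (1.0 + np.exp(-feat @ weights))
```

The placeholder sigmoid only illustrates that the claim's "estimate of the degree of speech" is a soft score rather than a binary decision.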
EEE 18. The computer-implemented method of EEE 17, the obtaining comprising calculating the MSM using a fast Fourier transform over a plurality of new pieces of audio data corresponding to a particular number of consecutive points in time prior to the point in time.
EEE 19. The computer-implemented method of EEE 17, the generating being based on Parseval's theorem.
EEE 20. The computer-implemented method of EEE 17, the calculating comprising using MSM values in an acoustic band range of 125 Hz to 8,000 Hz.
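EEE 19 invokes Parseval's theorem, which for a length-N modulation DFT of acoustic band k ties the energy of the band's STA trajectory S_k(n) to the energy of its modulation spectrum M_k(m), so removing energy in the modulation domain has a predictable effect on the reconstructed STA energy:

```latex
\sum_{n=0}^{N-1} \left| S_k(n) \right|^{2}
  \;=\; \frac{1}{N} \sum_{m=0}^{N-1} \left| M_k(m) \right|^{2},
\qquad
M_k(m) = \sum_{n=0}^{N-1} S_k(n)\, e^{-j 2\pi m n / N}.
```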
Claims (15)
1. A computer-implemented method of detecting speech from a reverberant signal based on data in a modulation frequency domain, the method comprising:
obtaining, by a processor, a particular Spectral Time Amplitude (STA) as a time-frequency representation corresponding to a point in time in the time domain covered by new audio data;
obtaining a Modulation Spectrum Metric (MSM) for the point in time having an acoustic band dimension and a modulation band dimension from one or more STAs obtained from the new audio data;
calculating a Diffusion Indicator (DI) based on the MSM, the DI indicating a degree of diffusion of a new piece of audio data in a modulation frequency domain;
generating an enhanced STA that filters out reverberation and other noise from the particular STA;
calculating one or more features from the enhanced STA;
creating one or more feature vectors using the DI and the one or more features;
determining an estimate of the degree of speech in the new piece of audio data from the one or more feature vectors; and
outputting the estimate of the degree of speech in the new piece of audio data.
2. The computer-implemented method of claim 1, the DI being a center of gravity of a modulation spectrum based on MSM values in a modulation band range and an acoustic band range.
3. The computer-implemented method of claim 1, the DI being an energy ratio of a low modulation portion based on MSM values in a low modulation band range and an acoustic band range to a high modulation portion based on MSM values in a high modulation band range and the acoustic band range.
4. The computer-implemented method of claim 1, the DI being an energy ratio of a low modulation portion based on MSM values in a low modulation band range and an acoustic band range to an entire modulation portion based on MSM values in an entire modulation band range and the acoustic band range.
5. The computer-implemented method of claim 1, the obtaining comprising calculating the MSM using a fast Fourier transform over a plurality of new pieces of audio data corresponding to a particular number of consecutive points in time prior to the point in time.
6. The computer-implemented method of any of claims 1-5, generating the enhanced STA comprising filtering out MSM values outside of a modulation band retention range.
7. The computer-implemented method of claim 6, the modulation band retention range being 3 Hz to 30 Hz.
8. The computer-implemented method of any of claims 1-7, generating the enhanced STA comprising calculating smoothed spectral-temporal energy by aggregation over time.
9. The computer-implemented method of any of claims 1-8, generating the enhanced STA comprising removing residual noise by tracking a minimum spectral-temporal energy over time.
10. The computer-implemented method of any of claims 1-7, generating the enhanced STA comprising applying a machine learning model trained with spectral time amplitude data corresponding to different degrees of reverberation and other noise as input data and corresponding spectral time amplitude data corresponding to only clean speech as output data.
11. The computer-implemented method of claim 10, further comprising extracting features characterizing the clean speech from application of the machine learning model, including a low cutoff modulation frequency and a high cutoff modulation frequency.
12. The computer-implemented method of any of claims 1-11, the calculating comprising calculating enhanced mel-frequency cepstral coefficients (MFCCs) using the enhanced STA.
13. The computer-implemented method of any of claims 1-12, the calculating comprising calculating an enhanced Spectral Flatness (SFT) by using the enhanced STA instead of the STA, and summing values over time in the calculation of the SFT.
14. The computer-implemented method of any of claims 1-13, the one or more features comprising a spectral peak based on a sum of peak-to-other band power ratios, a spectral peak based on a peak-to-average (non-peak band) power ratio, a variance or standard deviation of adjacent band powers, a sum or maximum of band power differences between adjacent bands, a spectral spread or spectral variance around a spectral centroid, and a spectral entropy.
15. The computer-implemented method of any of claims 1 to 14, the determining comprising applying a machine learning model trained with one or more features of spectral time amplitude data corresponding to clean speech and spectral time amplitude data corresponding to different degrees of reverberation and other noise as input data and corresponding speech degrees as output data.
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNPCT/CN2021/112265 | 2021-08-12 | ||
| CN2021112265 | 2021-08-12 | ||
| US202163239976P | 2021-09-02 | 2021-09-02 | |
| US63/239,976 | 2021-09-02 | ||
| EP21205203.9 | 2021-10-28 | ||
| PCT/US2022/040076 WO2023018880A1 (en) | 2021-08-12 | 2022-08-11 | Reverb and noise robust voice activity detection based on modulation domain attention |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117916801A | 2024-04-19 |
Family
ID=78414313
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202280060615.6A Pending CN117916801A (en) | 2021-08-12 | 2022-08-11 | Reverberation and noise robust speech activity detection via modulation domain attention |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117916801A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118748003A (en) * | 2024-08-08 | 2024-10-08 | 宁波方太厨具有限公司 | Active noise reduction system and control method thereof, abnormal sound detection method and device |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP4238089B1 (en) | Deep-learning based speech enhancement | |
| CN107464564B (en) | Voice interaction method, device and equipment | |
| CN111986691B (en) | Audio processing method, device, computer equipment and storage medium | |
| JP7791984B2 (en) | Modulation-domain attention-based voice activity detection for reverberation and noise | |
| US11482237B2 (en) | Method and terminal for reconstructing speech signal, and computer storage medium | |
| CN116959471A (en) | Speech enhancement method, speech enhancement network training method and electronic device | |
| CN111462764B (en) | Audio encoding method, apparatus, computer-readable storage medium and device | |
| CN111863020A (en) | Voice signal processing method, device, equipment and storage medium | |
| WO2022256577A1 (en) | A method of speech enhancement and a mobile computing device implementing the method | |
| CN117916801A (en) | Reverberation and noise robust speech activity detection via modulation domain attention | |
| US12243519B2 (en) | Automatic adaptation of multi-modal system components | |
| US20240290341A1 (en) | Over-suppression mitigation for deep learning based speech enhancement | |
| EP4483360B1 (en) | Coded speech enhancement based on deep generative model | |
| EP4385012B1 (en) | Management of professionally generated and user-generated audio content | |
| CN117597732A (en) | Over-suppression mitigation for deep learning based speech enhancement | |
| EP4566054A1 (en) | Deep learning based mitigation of audio artifacts | |
| HK40084137A (en) | Processing method, device, electronic equipment and storage medium for audio signals | |
| HK40043558A (en) | Echo cancellation method and device, terminal, server and storage medium | |
| CN118742954A (en) | Coded Speech Enhancement Based on Deep Generative Model | |
| HK40026171B (en) | Audio coding method and device, computer readable storage medium and equipment | |
| HK40026171A (en) | Audio coding method and device, computer readable storage medium and equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |