Disclosure of Invention
The invention aims to provide a multi-microphone array beam forming signal enhancement method and device that compensate for signal deformation more accurately and improve the quality of sound field consistency signals.
To achieve the above object, the present invention provides a multi-microphone array beam forming signal enhancement method, including:
acquiring a multi-channel time domain signal collected by the multi-microphone array, and performing time-frequency conversion to obtain a complex time-frequency domain signal;
obtaining the spatial distribution vector of the multi-microphone array, and performing deformation compensation processing on the complex time-frequency domain signal to obtain a sound field consistency signal;
extracting sound source direction characteristics of the sound field consistency signal, and performing complex space-time attention beam forming network processing to obtain a beam forming signal;
optimizing the beam direction and gain parameters of the beam forming signal through a preset adversarial environment simulator to obtain a dynamic frequency band enhancement signal;
and performing inverse time-frequency conversion on the dynamic frequency band enhancement signal, and performing low-delay optimization to obtain a target quality voice output signal.
Further, the acquiring a multi-channel time domain signal collected by the multi-microphone array and performing time-frequency conversion to obtain a complex time-frequency domain signal includes:
performing endpoint detection and noise power estimation on the multi-channel time domain signal to obtain a preprocessed signal;
framing and windowing the preprocessed signal according to preset frame length and frame shift parameters to obtain an initial complex frequency domain signal;
performing spectrum smoothing on the initial complex frequency domain signal to obtain a smoothed frequency domain signal;
performing nonlinear subband division on the smoothed frequency domain signal to obtain subband signals;
performing geometric structure phase compensation on the subband signals to obtain subband correction signals;
and performing spectral subtraction denoising on the subband correction signals based on a preset noise power spectrum, and performing complex time-frequency domain characteristic representation conversion to obtain the complex time-frequency domain signal.
Further, the obtaining the spatial distribution vector of the multi-microphone array and performing deformation compensation processing on the complex time-frequency domain signal to obtain a sound field consistency signal includes:
acquiring physical coordinates and orientation parameters of the multi-microphone array, and performing spatial construction to obtain the spatial distribution vector;
performing spatial transfer function mapping calculation on the complex time-frequency domain signal according to the spatial distribution vector to obtain array acoustic characteristic mapping information;
performing singular value decomposition on the array acoustic characteristic mapping information, and performing nonlinear correction of phase and amplitude, to obtain a preliminary compensation signal;
calculating a cross power spectral density matrix from the preliminary compensation signal, and constructing a geometric consistency constraint of the sound field to obtain a sound field topology representation;
performing complex domain tensor decomposition on the sound field topology representation to obtain a deformation compensation coefficient;
and performing phase and amplitude alignment on the complex time-frequency domain signal according to the deformation compensation coefficient to obtain the sound field consistency signal.
Further, the extracting sound source direction characteristics of the sound field consistency signal and performing complex space-time attention beam forming network processing to obtain a beam forming signal includes:
performing adaptive wavelet packet transformation and decomposition on the sound field consistency signal to obtain a multi-level sound field time-frequency representation;
performing complex domain spatial covariance calculation on the multi-level sound field time-frequency representation to obtain a frequency band phase difference spectrum;
performing high-order singular value decomposition on the frequency band phase difference spectrum to obtain an orthogonal projection structure;
constructing a hyperbolic direction estimation function from the orthogonal projection structure, and performing gradient polarization search to obtain a sound source direction vector;
performing phase alignment between the sound source direction vector and the sound field consistency signal to obtain an initial beam gain coefficient;
and constructing a complex domain attention mask according to the initial beam gain coefficient, and performing channel weighting on the sound field consistency signal to obtain the beam forming signal.
Further, the optimizing the beam direction and gain parameters of the beam forming signal through a preset adversarial environment simulator to obtain a dynamic frequency band enhancement signal includes:
simulating multiple noise interference and reverberation conditions on the beam forming signal based on the adversarial environment simulator to obtain diversified environmental interference samples;
performing gradient descent optimization on the diversified environmental interference samples, and adjusting beam direction parameters and frequency band gain coefficients according to a preset signal-to-noise ratio objective function to obtain a parameter optimization matrix;
performing directional reconstruction on the beam forming signal according to the parameter optimization matrix to obtain a direction enhancement signal;
performing acoustic feature matching between the direction enhancement signal and preset voice data to obtain corresponding target actual short-time voice data;
performing feature extraction on the target actual short-time voice data based on a preset meta-learning online calibration algorithm to obtain a target feature adaptation matrix;
performing a convolution operation on the target feature adaptation matrix and the direction enhancement signal to obtain a spectrum correction signal;
and performing cross-band consistency processing and per-band phase gain adjustment on the spectrum correction signal to obtain the dynamic frequency band enhancement signal.
Further, the simulating multiple noise interference and reverberation conditions on the beam forming signal based on the adversarial environment simulator to obtain diversified environmental interference samples includes:
performing time-frequency transformation on the beam forming signal through the adversarial environment simulator to obtain a time-frequency domain representation signal matrix;
constructing a three-dimensional acoustic propagation structure from the time-frequency domain representation signal matrix, and setting random reflection coefficients to obtain a spatial reverberation parameter set;
randomly extracting various types of noise sources from a preset environmental noise library, and performing time-domain stretching and frequency-domain shifting to obtain a deformed noise source set;
performing position mapping on the deformed noise source set according to a preset spatial directivity distribution function to obtain a multidirectional noise distribution matrix;
performing acoustic propagation delay and energy attenuation calculations from the multidirectional noise distribution matrix and the spatial reverberation parameter set to obtain a noise propagation characteristic vector;
and performing reverberation processing on the time-frequency domain representation signal matrix based on the spatial reverberation parameter set, and superposing the noise propagation characteristic vector on it as interference, to obtain the diversified environmental interference samples.
Further, the performing feature extraction on the target actual short-time voice data based on a preset meta-learning online calibration algorithm to obtain a target feature adaptation matrix includes:
performing short-time framing and windowing on the target actual short-time voice data to obtain an overlapped segmented signal sequence;
calculating Mel frequency cepstral coefficients on the overlapped segmented signal sequence to obtain segmented feature description vectors;
constructing a meta-learning task sampling set from the segmented feature description vectors to obtain a feature support set and a feature query set;
constructing a meta-learning inner-loop optimization function based on the feature support set to obtain environment adaptation parameters;
performing a MAML gradient update on the feature query set according to the environment adaptation parameters to obtain task model parameters;
performing meta-learning outer-loop optimization and online learning rate adjustment on the task model parameters to obtain a dynamic learning rate matrix;
performing prototype network mapping on the segmented feature description vectors according to the dynamic learning rate matrix to obtain an environment-aware feature representation;
constructing a metric learning embedding space from the environment-aware feature representation to obtain a voice feature embedding matrix;
and applying a Bayesian credible interval constraint to the voice feature embedding matrix to obtain the target feature adaptation matrix.
Further, the performing inverse time-frequency conversion on the dynamic frequency band enhancement signal and performing low-delay optimization to obtain a target quality voice output signal includes:
performing overlap-add inverse transformation on the dynamic frequency band enhancement signal to obtain an initial time domain signal;
performing inter-frame phase consistency compensation on the initial time domain signal to obtain a phase-continuous intermediate signal;
dynamically adjusting a buffer according to the phase-continuous intermediate signal to obtain a variable signal buffer;
performing nonlinear amplitude normalization on the variable signal buffer to obtain a dynamic-range-compressed signal;
calculating a time domain transient characteristic vector from the dynamic-range-compressed signal, and performing time domain jitter elimination to obtain a jitter correction signal;
performing multistage cascaded band reconstruction on the jitter correction signal to obtain a frequency response equalized signal;
performing device pre-compensation on the frequency response equalized signal to obtain a device matching signal;
and eliminating noise and artifacts from the device matching signal according to a preset selective frame discarding algorithm to obtain the target quality voice output signal.
The invention also provides a multi-microphone array beam forming signal enhancement device, applied to any of the multi-microphone array beam forming signal enhancement methods described above, comprising:
an acquisition module, configured to acquire multi-channel time domain signals collected by the multi-microphone array and perform time-frequency conversion to obtain complex time-frequency domain signals;
an analysis module, configured to obtain the spatial distribution vector of the multi-microphone array and perform deformation compensation processing on the complex time-frequency domain signals to obtain sound field consistency signals;
an association module, configured to extract sound source direction characteristics of the sound field consistency signals and perform complex space-time attention beam forming network processing to obtain beam forming signals;
a processing module, configured to optimize the beam direction and gain parameters of the beam forming signals through a preset adversarial environment simulator to obtain dynamic frequency band enhancement signals;
and a control module, configured to perform inverse time-frequency conversion on the dynamic frequency band enhancement signals and perform low-delay optimization to obtain target quality voice output signals.
The method and the device for enhancing the beam forming signals of the multi-microphone array have the following beneficial effects:
By performing time-frequency conversion on the time domain signals acquired by the multi-microphone array and combining the spatial distribution vector information of the array, signal deformation can be compensated more accurately, improving the quality of the sound field consistency signals. By extracting the direction characteristics of the sound source and applying the complex space-time attention beam forming network, the method effectively improves the extraction accuracy of the target voice signal and avoids the signal attenuation or distortion common in traditional signal enhancement methods. By introducing a preset adversarial environment simulator to optimize the beam direction and gain parameters, the method adapts to dynamic, complex acoustic environments, realizes adaptive adjustment of the beam direction and gain parameters, and further improves signal clarity and voice recognition accuracy. On this basis, noise interference and echo effects are effectively reduced, so that high-quality voice output is maintained across different environments, mitigating the sound quality degradation and voice distortion of traditional methods. In addition, the low-delay optimization of the dynamic frequency band enhancement signal supports smooth operation of real-time voice interaction systems and enhances system robustness and user experience. By jointly considering space-time information and dynamic sound field characteristics, the method flexibly adapts to complex and changeable acoustic environments and improves the overall performance and application effect of the multi-microphone array system.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, a multi-microphone array beam forming signal enhancement method includes:
Step S1: acquiring multi-channel time domain signals collected by a multi-microphone array, and performing time-frequency conversion to obtain complex time-frequency domain signals;
Step S2: obtaining the spatial distribution vector of the multi-microphone array, and performing deformation compensation processing on the complex time-frequency domain signals to obtain sound field consistency signals;
Step S3: extracting sound source direction characteristics of the sound field consistency signals, and performing complex space-time attention beam forming network processing to obtain beam forming signals;
Step S4: optimizing the beam direction and gain parameters of the beam forming signals through a preset adversarial environment simulator to obtain dynamic frequency band enhancement signals;
Step S5: performing inverse time-frequency conversion on the dynamic frequency band enhancement signals, and performing low-delay optimization to obtain target quality voice output signals.
Based on the above steps, the detailed procedure is as follows:
Step S1: the first step in multi-microphone array beam forming is to acquire the original acoustic signals and convert them into a representation suitable for processing. The microphone array consists of a plurality of microphones arranged in a particular geometry, each capturing a time domain representation of the sound waves. These time domain signals contain sound source information, background noise and reverberation. To process them efficiently, they must be converted to the time-frequency domain. Time-frequency conversion typically uses the short-time Fourier transform (STFT), which divides the time domain signal into short frames and applies a Fourier transform to each frame. Specifically, a window function (e.g., a Hamming or Hann window) is applied to the time domain signal of each microphone channel to reduce spectral leakage. After windowing, a fast Fourier transform (FFT) is performed, generating a complex time-frequency domain representation. This complex representation contains amplitude and phase information, which is critical for subsequent beam forming. The result of the time-frequency conversion is a three-dimensional tensor with dimensions [number of microphones x number of time frames x number of frequency bins], each element being a complex number. The choice of frame length and frame shift governs the trade-off between time resolution and frequency resolution and should be tuned to the application scenario. Typical speech processing may use a frame length of 20-30 ms and a frame shift of 10-15 ms.
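As an illustrative sketch (not the claimed implementation), the framing, windowing and FFT described above can be written directly in NumPy; the 512-sample frame, 256-sample hop and Hann window are example choices for a 16 kHz signal:

```python
import numpy as np

def stft_tensor(x, frame_len=512, hop=256, n_fft=512):
    """STFT of a multi-channel signal x of shape (n_mics, n_samples).
    Returns a complex tensor (n_mics, n_frames, n_bins) matching the
    [microphones x time frames x frequency bins] layout above."""
    win = np.hanning(frame_len)                      # window to reduce spectral leakage
    n_mics, n_samples = x.shape
    n_frames = 1 + (n_samples - frame_len) // hop
    out = np.empty((n_mics, n_frames, n_fft // 2 + 1), dtype=complex)
    for m in range(n_mics):
        for t in range(n_frames):
            frame = x[m, t * hop : t * hop + frame_len] * win
            out[m, t] = np.fft.rfft(frame, n_fft)    # amplitude + phase per bin
    return out

# 4-microphone array, 1 s of audio at 16 kHz (32 ms frames, 16 ms shift)
rng = np.random.default_rng(0)
X = stft_tensor(rng.standard_normal((4, 16000)))
print(X.shape)  # (4, 61, 257)
```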
Step S2: the spatial distribution vector of the multi-microphone array describes the relative positions of the microphones in three-dimensional space. This spatial distribution information is critical to understanding how sound waves reach different parts of the array. In practical applications, the microphone array may suffer from shape distortion, installation errors, or inconsistent microphone characteristics, all of which affect consistent capture of the sound field. The deformation compensation process compensates for these errors by estimating calibration coefficients for each microphone using the spatial distribution vector. The compensation applies phase and amplitude corrections to the complex time-frequency domain signal based on an acoustic propagation model. The correction factors typically form a matrix of frequency-dependent complex coefficients that adjusts the phase and amplitude relationships between the microphone channels. Deformation compensation may also involve microphone sensitivity calibration and frequency response equalization. Ideally, the processed sound field consistency signal is such that, for a source arriving from a given direction, the signals of all microphones are fully coherent up to the corresponding phase delays. This consistency provides the key basis for the sound source direction estimation and beam forming in the next step.
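The frequency-dependent complex correction can be sketched under the simplifying assumption that a calibration recording of a known source is available and one microphone serves as the reference; a plain relative-transfer-function estimate stands in here for the SVD and tensor-decomposition machinery of the claims:

```python
import numpy as np

def deformation_weights(X_cal, ref=0, eps=1e-12):
    """Per-frequency complex correction factors from a calibration STFT
    X_cal of shape (n_mics, n_frames, n_bins): estimate each channel's
    transfer function relative to the reference microphone, then invert
    it so live signals align in phase and amplitude."""
    num = np.mean(X_cal * np.conj(X_cal[ref]), axis=1)    # cross-spectra (n_mics, n_bins)
    den = np.mean(np.abs(X_cal[ref]) ** 2, axis=0) + eps  # reference auto-spectrum (n_bins,)
    H = num / den                                         # relative transfer functions
    return 1.0 / (H + eps)

def apply_compensation(X, w):
    return X * w[:, None, :]        # broadcast the weights over time frames

# synthetic check: channels are gain/phase-distorted copies of channel 0
rng = np.random.default_rng(1)
S = rng.standard_normal((50, 129)) + 1j * rng.standard_normal((50, 129))
gains = np.array([1.0, 0.8 * np.exp(1j * 0.3), 1.2 * np.exp(-1j * 0.5)])
X = gains[:, None, None] * S
Xc = apply_compensation(X, deformation_weights(X))
print(np.allclose(Xc, Xc[0]))  # True — all channels aligned to the reference
```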
Step S3: sound source localization and enhancement are performed using the calibrated sound field consistency signals. The sound source direction feature extraction computes likely source directions from the phase differences between the microphone signals, based on the phase information of the sound field consistency signal. Common methods include generalized cross-correlation (GCC), multichannel cross power spectral density (CPSD) analysis, and deep-learning-based directional feature extraction networks. The extracted directional features are typically expressed as probability distributions or feature vectors over azimuth and elevation angles. The complex space-time attention beam forming network is a signal processing architecture that combines the advantages of traditional beam forming with deep learning. The network applies attention mechanisms to the time and space dimensions simultaneously and can dynamically adjust the weight given to different time-frequency points and spatial directions. The network processing includes a complex masking matrix computation that assigns a different complex weight to each time-frequency point based on the sound source direction and the time-frequency characteristics. Applying the masking matrix to the sound field consistency signal enhances the signal from the target direction while suppressing interference and noise from other directions. The quality of the beam forming signal depends on the accuracy of the directional features and the degree of optimization of the network parameters.
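A minimal geometric counterpart of the channel weighting described above is the classical delay-and-sum beamformer; the learned complex attention mask would replace the analytically derived weights `w` below (the array geometry and angle are illustrative):

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def steering_vector(mic_x, theta, freqs):
    """Expected per-channel phases for a plane wave hitting a linear
    array (mic_x in metres on the x-axis, theta = arrival angle)."""
    delays = mic_x * np.sin(theta) / C
    return np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])  # (n_mics, n_bins)

def delay_and_sum(X, d):
    """Channel weighting: phase-align with the steering vector, then
    average.  X is (n_mics, n_frames, n_bins)."""
    w = np.conj(d) / len(d)                     # complex weights per channel and bin
    return np.sum(w[:, None, :] * X, axis=0)    # (n_frames, n_bins)

# a plane wave from 30 degrees passes the beamformer undistorted
freqs = np.fft.rfftfreq(512, 1 / 16000)
d = steering_vector(np.array([0.0, 0.05, 0.10, 0.15]), np.deg2rad(30), freqs)
rng = np.random.default_rng(2)
S = rng.standard_normal((40, 257)) + 1j * rng.standard_normal((40, 257))
X = d[:, None, :] * S
Y = delay_and_sum(X, d)
print(np.allclose(Y, S))  # True
```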
Step S4: the adversarial environment simulator is a specially designed module for simulating acoustic interference conditions that may occur in real environments, such as reverberation, non-stationary noise and competing speakers. The simulator employs an adversarial training framework in which a generator creates various complex acoustic scenarios and a discriminator evaluates the performance of the beam forming algorithm under those scenarios. Through this adversarial mechanism, the system can discover potential weaknesses in the beam forming process. The beam direction optimization fine-tunes the width and direction of the beam main lobe to maximize suppression of interference sources while maintaining target signal enhancement. The gain parameter optimization adjusts the gain profile of different frequency bands to balance signal amplification against noise suppression. The optimization uses gradient descent or evolutionary algorithms, iterating against objective indicators such as signal-to-noise ratio and speech clarity. The resulting dynamic frequency band enhancement signal has adaptive gain characteristics for different frequency components and can improve speech intelligibility while preserving naturalness. This step turns beam forming from static optimization into dynamic adaptation, greatly improving the robustness of the system in changing environments.
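The gradient-based gain optimization can be illustrated with a toy objective; this sketch assumes paired clean and interfered band magnitudes, and substitutes a mean-squared-error criterion for the SNR objective and adversarial sampling described above:

```python
import numpy as np

def optimize_band_gains(S, Y, steps=500, lr=0.1):
    """Gradient descent on per-band gains (illustrative).  S are clean
    reference band values, Y the interfered ones, both (n_frames,
    n_bands).  Minimising |g*Y - S|^2 drives g toward the Wiener-style
    gain Ps / (Ps + Pn)."""
    g = np.ones(Y.shape[1])
    for _ in range(steps):
        err = g * Y - S                        # per-frame residual
        grad = 2 * np.mean(err * Y, axis=0)    # dJ/dg for each band
        g -= lr * grad
    return g

rng = np.random.default_rng(3)
S = rng.standard_normal((200, 8))              # clean band values, power 1
N = 0.5 * rng.standard_normal((200, 8))        # interference, power 0.25
g = optimize_band_gains(S, S + N)
print(round(float(g.mean()), 2))  # near the Wiener gain 1 / (1 + 0.25) = 0.8
```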
Step S5: inverse time-frequency conversion converts the complex time-frequency domain signal back to the time domain using the inverse short-time Fourier transform (ISTFT). The conversion applies an inverse Fourier transform to each time-frequency frame and then synthesizes a continuous time domain signal with the overlap-add method. To ensure signal quality, spectral smoothing techniques are often employed to reduce artifacts caused by phase discontinuities. Low-delay optimization is a key requirement for real-time voice interaction systems, since conventional STFT-ISTFT frameworks may introduce significant processing delay. The low-delay strategy includes adopting shorter analysis windows and frame shifts, implementing a streaming architecture to reduce buffering delay, and using frequency domain prediction to estimate the content of subsequent frames in advance. Optimization of computing resources, including algorithm parallelization and GPU acceleration, is another important means of reducing system delay. The final target quality voice signal meets the low-delay requirement while retaining good clarity and naturalness. Quality assessment typically combines subjective listening tests (e.g., MOS scores) with objective indicators (e.g., PESQ, STOI). This step lets the whole system meet the requirements of latency-sensitive applications such as teleconferencing and smart speakers.
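The overlap-add synthesis can be sketched as weighted overlap-add: per-sample normalisation by the summed squared window makes the interior reconstruction exact whenever the analysis used the same Hann window (frame sizes are example values):

```python
import numpy as np

def istft_overlap_add(Z, frame_len=512, hop=256):
    """Weighted overlap-add resynthesis of an STFT computed with a Hann
    analysis window; dividing by the summed squared window makes the
    interior reconstruction exact."""
    win = np.hanning(frame_len)
    n_frames = Z.shape[0]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for t in range(n_frames):
        frame = np.fft.irfft(Z[t], frame_len)      # back to the time domain
        out[t * hop : t * hop + frame_len] += frame * win
        norm[t * hop : t * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-12)

# round trip: analyse, resynthesise, compare the interior samples
rng = np.random.default_rng(4)
x = rng.standard_normal(4096)
win = np.hanning(512)
Z = np.array([np.fft.rfft(x[t * 256 : t * 256 + 512] * win)
              for t in range(15)])
y = istft_overlap_add(Z)
print(np.allclose(x[512:3584], y[512:3584]))  # True
```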
Step S6: as the final link of the whole multi-microphone array beam forming signal enhancement method, post-processing performs final polishing and quality verification of the output signal. Post-processing includes residual artifact suppression, dynamic range compression, and dereverberation enhancement. Artifact processing removes musical noise and spectral discontinuities that may be introduced during beam forming, using spectral smoothing and transient detection techniques. Dynamic range compression adjusts the dynamic characteristics of the signal to the application scenario, so that the output retains consistent audibility on different playback devices. Dereverberation enhancement supplements the shortcomings of beam forming in reverberation suppression and further improves speech clarity through blind room impulse response estimation and inverse filtering. The quality evaluation stage establishes a multi-dimensional evaluation system comprising objective and subjective evaluation. Objective evaluation uses standardized indicators such as PESQ, STOI and SNR; subjective evaluation adopts MUSHRA tests or ABX comparisons, inviting professional listeners and ordinary users to participate. The evaluation results are fed back to the preceding steps to form a closed-loop optimization mechanism. This step ensures that the final voice signal meets the quality standard expected by users in different application scenarios, and provides a scientific basis for continuous optimization of the system.
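Of the post-processing operations listed, dynamic range compression is the easiest to sketch; this static, sample-wise compressor omits the attack/release smoothing of a production implementation, and the threshold and ratio are example values:

```python
import numpy as np

def compress(x, threshold_db=-20.0, ratio=4.0, eps=1e-12):
    """Static dynamic-range compression: magnitudes above the threshold
    are attenuated by the given ratio on the dB scale."""
    level_db = 20 * np.log10(np.abs(x) + eps)
    over = np.maximum(level_db - threshold_db, 0.0)   # excess above threshold
    gain_db = -over * (1.0 - 1.0 / ratio)             # shrink the excess by the ratio
    return x * 10 ** (gain_db / 20)

x = np.array([0.01, 0.1, 1.0])
y = compress(x)
print(np.round(y, 3))  # quiet samples pass unchanged; 1.0 is attenuated to ~0.178
```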
According to the multi-microphone array beam forming signal enhancement method, performing time-frequency conversion on the time domain signals acquired by the multi-microphone array and combining the spatial distribution vector information of the array allows signal deformation to be compensated more accurately, improving the quality of the sound field consistency signals. By extracting the direction characteristics of the sound source and applying the complex space-time attention beam forming network, the method effectively improves the extraction accuracy of the target voice signal and avoids the signal attenuation or distortion common in traditional signal enhancement methods. By introducing a preset adversarial environment simulator to optimize the beam direction and gain parameters, the method adapts to dynamic, complex acoustic environments, realizes adaptive adjustment of the beam direction and gain parameters, and further improves signal clarity and voice recognition accuracy. On this basis, noise interference and echo effects are effectively reduced, so that high-quality voice output is maintained across different environments, mitigating the sound quality degradation and voice distortion of traditional methods. In addition, the low-delay optimization of the dynamic frequency band enhancement signal supports smooth operation of real-time voice interaction systems and enhances system robustness and user experience. By jointly considering space-time information and dynamic sound field characteristics, the method flexibly adapts to complex and changeable acoustic environments and improves the overall performance and application effect of the multi-microphone array system.
In one embodiment, acquiring a multi-channel time domain signal collected by the multi-microphone array and performing time-frequency conversion to obtain a complex time-frequency domain signal includes:
After the multi-channel time domain signals collected by the multi-microphone array are acquired, endpoint detection and noise power estimation are performed on them to obtain preprocessed signals. Endpoint detection identifies the start and end points of valid speech segments in a speech signal, distinguishing valid speech from background noise. It is typically based on features such as short-time energy and zero-crossing rate, with appropriate thresholds determining whether the current frame contains valid speech. Noise power estimation estimates the power spectral density of the background noise during periods of no speech activity, providing a reference for subsequent signal enhancement. The noise power estimate is updated during non-speech periods using a minimum statistics method or a method based on voice activity detection. The preprocessed signal retains the active speech portion, together with the noise power estimate.
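A minimal sketch of the endpoint detection and noise power estimation, assuming frame-wise input and an energy threshold set relative to the quietest frames (the percentile and margin are illustrative, not the claimed minimum-statistics method):

```python
import numpy as np

def energy_vad(frames, margin_db=6.0):
    """Frame-level endpoint detection by short-time energy.  A frame is
    marked as speech when its energy exceeds the noise floor proxy
    (10th percentile of frame energies) by margin_db."""
    e_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    floor_db = np.percentile(e_db, 10)
    return e_db > floor_db + margin_db

def noise_psd(frames, vad):
    """Average the periodogram over non-speech frames only."""
    spec = np.abs(np.fft.rfft(frames[~vad], axis=1)) ** 2
    return spec.mean(axis=0)

rng = np.random.default_rng(5)
frames = np.vstack([0.01 * rng.standard_normal((50, 256)),   # background noise
                    rng.standard_normal((50, 256))])         # loud "speech" frames
vad = energy_vad(frames)
psd = noise_psd(frames, vad)
print(bool(vad[:50].any()), bool(vad[50:].all()))  # False True
```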
The preprocessed signal is framed and windowed according to the preset frame length and frame shift parameters to obtain an initial complex frequency domain signal. Framing divides the long speech signal into short frames, within which the signal is relatively stationary. The preset frame length is typically 20-30 ms (320-480 samples at a 16 kHz sampling rate), and the frame shift is at most half the frame length (e.g., 10-15 ms, or 160-240 samples) to ensure sufficient overlap between adjacent frames. Windowing multiplies each frame by a window function (e.g., a Hamming or Hann window) to reduce spectral leakage. After framing and windowing, a fast Fourier transform (FFT) is applied to each frame, converting it to the frequency domain and yielding the initial complex frequency domain signal. This signal contains both amplitude and phase information and provides the basis for subsequent spectrum processing.
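The choice of a half-frame shift is not arbitrary: a periodic Hann window shifted by half its length satisfies the constant-overlap-add (COLA) condition, which later permits artifact-free overlap-add resynthesis. A quick numerical check (frame sizes are example values):

```python
import numpy as np

frame_len, hop = 512, 256            # 32 ms frames, 16 ms shift at 16 kHz
n = np.arange(frame_len)
win = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)   # periodic Hann window

# Sum windows shifted by the hop; in the fully covered region the sum
# must be constant, or overlap-add resynthesis shows amplitude ripple.
cover = np.zeros(2 * frame_len)
for shift in (0, hop, 2 * hop):
    cover[shift : shift + frame_len] += win
print(np.allclose(cover[frame_len : frame_len + hop], 1.0))  # True
```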
Spectrum smoothing is performed on the initial complex frequency domain signal to obtain a smoothed frequency domain signal. Spectrum smoothing takes a weighted average of the energy spectra of adjacent frequency bins, reducing spectral fluctuation and improving the stability of the spectrum estimate. It uses a sliding-window method in the time-frequency domain: for each frequency bin, a weighted average is computed over several neighbouring bins and over the corresponding bins of the preceding frames. The smoothed frequency domain signal is smoother and more continuous than the initial complex frequency domain signal, reduces the influence of random noise, and benefits subsequent processing.
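A simple realisation of the described time-frequency smoothing, combining a sliding average over neighbouring bins with first-order recursive averaging over frames (the window size and forgetting factor are illustrative):

```python
import numpy as np

def smooth_spectrum(P, k=2, alpha=0.8):
    """Smooth a power spectrogram P of shape (n_frames, n_bins):
    average each bin with its k neighbours on either side, then smooth
    recursively over frames with forgetting factor alpha."""
    kernel = np.ones(2 * k + 1) / (2 * k + 1)
    freq_sm = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, P)
    out = np.empty_like(freq_sm)
    out[0] = freq_sm[0]
    for t in range(1, len(freq_sm)):
        out[t] = alpha * out[t - 1] + (1 - alpha) * freq_sm[t]
    return out

rng = np.random.default_rng(7)
P = np.abs(np.fft.rfft(rng.standard_normal((30, 256)), axis=1)) ** 2
Psm = smooth_spectrum(P)
print(bool(Psm.var() < P.var()))  # True — smoothing reduces spectral fluctuation
```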
Nonlinear subband division is performed on the smoothed frequency domain signal to obtain subband signals. Nonlinear subband division splits the frequency domain signal into multiple subbands according to the auditory characteristics of the human ear: finer divisions in the low-frequency region and coarser divisions in the high-frequency region. This matches human perception of different frequencies, since the low-frequency region carries more speech information and requires finer processing. The division adopts the Mel or Bark frequency scale, converting the linear frequency scale into a perceptual one. A signal at a 16 kHz sampling rate is typically divided into 20-30 subbands. Each subband signal contains the signal components of a specific frequency range.
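The Mel-scale division can be sketched directly from the standard Mel formula: equally spaced points on the Mel scale map back to frequency band edges that are dense at low frequencies and coarse at high frequencies (24 bands is an example within the 20-30 range above):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_bands=24, fs=16000):
    # equally spaced points on the Mel scale, mapped back to Hz
    mels = np.linspace(0.0, hz_to_mel(fs / 2), n_bands + 1)
    return mel_to_hz(mels)

edges = mel_band_edges()
widths = np.diff(edges)
print(len(edges), bool(np.all(np.diff(widths) > 0)))  # 25 True — bands widen with frequency
```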
Geometric structure phase compensation is performed on the subband signals to obtain subband correction signals. Geometric phase compensation adjusts the phase of the signal received by each microphone according to the geometric layout of the microphone array and the sound source direction, so that sound source signals from the target direction are aligned in phase and the target sound source signal is thereby enhanced. The compensation calculates the phase difference corresponding to the time delay difference from the sound source to each microphone and applies a phase correction to each microphone signal. For linear microphone arrays, the phase compensation is typically based on the plane-wave assumption; for circular or other array shapes, it takes more complex geometric relationships into account. The subband correction signals are the signals after geometric structure phase compensation, in which the target direction sound source signals received by all microphones are aligned in phase.
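Under the plane-wave assumption for a linear array, the phase compensation can be sketched as follows (array geometry, sampling rate, and function name are illustrative assumptions):

```python
import numpy as np

def phase_compensate(spec, mic_x, theta, fs=16000, n_fft=512, c=343.0):
    """Align the phases of a linear array toward direction theta (plane-wave model).
    spec: (mics, frames, bins) complex STFT; mic_x: mic positions along the array
    axis in metres; theta: source direction relative to broadside, in radians."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)    # bin centre frequencies (Hz)
    delays = mic_x * np.sin(theta) / c            # per-mic arrival delays (s)
    # conjugate steering phase undoes the propagation delay at every bin
    comp = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return spec * comp[:, None, :]
```

Applied to an ideal plane wave from direction `theta`, the compensation removes the inter-microphone phase differences entirely, so the channels add coherently.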
Spectral subtraction denoising is carried out on the subband correction signal based on a preset noise power spectrum, followed by complex time-frequency domain characteristic representation conversion, to obtain the complex time-frequency domain signal. Spectral subtraction denoising is a common speech enhancement technique that derives an enhanced speech power spectrum by subtracting an estimated noise power spectrum from the observed signal power spectrum. The preset noise power spectrum is the background noise power spectrum estimated during periods of no speech activity. The spectral subtraction adopts an over-subtraction or under-subtraction strategy, and a spectral floor is set to avoid musical noise. The complex time-frequency domain characteristic representation conversion converts the enhanced signal from the power spectrum domain back to the complex time-frequency domain, facilitating subsequent beam forming and signal reconstruction. The complex time-frequency domain signal retains the amplitude and phase information of the enhanced signal and provides the information necessary for beam forming and inverse transformation.
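A minimal sketch of over-subtracting spectral subtraction with a spectral floor; the over-subtraction factor `alpha` and floor factor `beta` are illustrative values, not claimed parameters:

```python
import numpy as np

def spectral_subtract(spec, noise_psd, alpha=2.0, beta=0.01):
    """Over-subtracting spectral subtraction with a spectral floor to limit
    musical noise. spec: complex STFT (frames, bins); noise_psd: estimated
    noise power per bin."""
    power = np.abs(spec) ** 2
    clean_power = power - alpha * noise_psd      # over-subtraction (alpha > 1)
    floor = beta * power                         # spectral floor
    clean_power = np.maximum(clean_power, floor)
    gain = np.sqrt(clean_power / np.maximum(power, 1e-12))
    return gain * spec                           # keep the noisy phase
```

Because only a real gain is applied per bin, the phase of the observed signal is preserved, which is exactly what the conversion back to the complex time-frequency domain requires.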
According to this embodiment, efficient target sound source extraction in complex acoustic environments is achieved through a systematic signal processing flow. The preprocessing combination of endpoint detection and noise power estimation accurately distinguishes effective speech from background noise and lays the foundation for subsequent processing. The nonlinear subband division technique finely partitions the frequency domain signal according to the auditory characteristics of the human ear, finer at low frequencies and relatively coarse at high frequencies, effectively improving perceived speech quality. Geometric structure phase compensation makes full use of the spatial information of the microphone array, achieving phase alignment of target direction sound source signals and markedly enhancing the spatial selectivity toward the target source. Spectrum smoothing reduces spectral fluctuation and improves the stability of spectrum estimation, while spectral subtraction denoising based on the preset noise power spectrum further suppresses background noise interference. The organic combination of these techniques forms an integrated signal enhancement solution that improves signal clarity and intelligibility while maintaining speech naturalness, and is especially suitable for voice interaction scenarios in complex acoustic environments such as conference systems and smart speakers.
In one embodiment, obtaining the spatial distribution vector of the multi-microphone array and performing deformation compensation processing on the complex time-frequency domain signal to obtain the sound field consistency signal includes:
A complete spatial geometric model is constructed by measuring or setting the position coordinates of each microphone element in three-dimensional space and its pickup direction angle parameters. These physical coordinates and orientation parameters are integrated into a unified spatial coordinate system to generate the spatial distribution vector. The spatial distribution vector contains the precise spatial position and directivity information of every pickup element in the microphone array, providing the base data for subsequent acoustic property mapping.
After the spatial distribution vector is obtained, spatial transfer function mapping calculation is performed on the converted complex time-frequency domain signal using this vector. The mapping process correlates the complex time-frequency domain signal received by each microphone with that microphone's spatial location and directional characteristics to generate array acoustic characteristic mapping information. This information characterizes how the acoustic wave interacts with the microphone array as it propagates in space, including the time difference, amplitude attenuation, and phase change of the wave arriving at each microphone.
A Singular Value Decomposition (SVD) operation is performed on the array acoustic characteristic mapping information, decomposing the complex mapping information into a number of orthogonal feature components. Through this decomposition, the system identifies the primary acoustic signature patterns and the secondary interference components. The system then applies nonlinear correction of phase and amplitude to the decomposition result, eliminating or mitigating distortion caused by uneven microphone spacing, sensitivity differences, and hardware errors, thereby obtaining a preliminary compensation signal. The preliminary compensation signal has higher spatial consistency than the original signal, but is not yet perfectly aligned.
Based on the preliminary compensation signal, the system calculates a cross-power spectral density matrix (CPSD) that describes the correlation and energy distribution between the different microphone signals. By combining physical properties and propagation rules of the sound field, the system constructs constraint conditions of geometric consistency of the sound field, and the constraint conditions reflect geometric relations which the sound field should present at the positions of all microphones under ideal conditions. By imposing these constraints, the system generates a representation of the sound field topology that characterizes the geometric relationship between the sound source, the sound field spatial structure, and the microphone array.
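The cross-power spectral density matrix described above can be estimated per frequency bin by averaging outer products over frames; for example (shapes and names are illustrative):

```python
import numpy as np

def cross_power_spectral_density(spec):
    """Per-bin cross-power spectral density matrix, averaged over frames.
    spec: (mics, frames, bins) complex STFT -> (bins, mics, mics) CPSD."""
    m, t, f = spec.shape
    # CPSD[f, m, n] = mean over frames of X_m(f) * conj(X_n(f))
    return np.einsum("mtf,ntf->fmn", spec, spec.conj()) / t
```

The resulting matrix is Hermitian per bin; its diagonal carries each channel's energy and its off-diagonal entries carry the inter-microphone correlation the geometric consistency constraints operate on.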
The sound field topology representation is subjected to complex domain tensor decomposition processing, and the sound field characteristics are decomposed into amplitude components and phase components in the complex domain to obtain deformation compensation coefficients. The deformation compensation coefficients contain fine correction parameters for each frequency point and each microphone, used to counteract non-ideal deformation factors in the sound field.
Accurate phase and amplitude alignment is then performed on the complex time-frequency domain signal using the deformation compensation coefficients. Phase alignment ensures that the phase differences between signals originating from the same sound source are properly compensated across the different microphones, while amplitude alignment balances the signal strengths received by the microphones. Through this process, the system obtains a sound field consistency signal with high spatial consistency, providing ideal underlying data for subsequent beamforming processing.
Compared with the original signal, the spatial coherence of the sound field consistency signal is obviously improved, the influence of environmental noise and reverberation is reduced, and the quality of the target sound source signal is enhanced, so that more accurate sound source positioning and signal enhancement effects are realized in a complex acoustic environment.
According to the method, the high-precision spatial consistency alignment of the microphone array signals is realized by acquiring the spatial distribution vectors of the microphone arrays and performing deformation compensation processing, so that the accuracy and the stability of beam forming are remarkably improved. The problem of acoustic characteristic distortion caused by element position errors, sensitivity differences and hardware inconsistencies of the microphone array in actual deployment is effectively solved by combining space transfer function mapping calculation with a singular value decomposition technology. By constructing geometric consistency constraint of the sound field and performing complex domain tensor decomposition, the method can accurately identify and compensate nonlinear deformation in the sound field, and overcomes the limitation that the traditional method is difficult to cope with complex acoustic environments. The accurate alignment of phase and amplitude results in signals from the same source being highly consistent among the different microphones, thereby significantly suppressing ambient noise and reverberation disturbances while maintaining the target source signal integrity.
In one embodiment, extracting sound source direction characteristics of the sound field consistency signal and performing complex space-time attention beam forming network processing to obtain the beam forming signal includes:
The sound field consistency signals acquired by the multi-microphone array are decomposed through adaptive wavelet packet transformation to generate a multi-level sound field time-frequency representation. Adaptive wavelet packet transformation is a time-frequency analysis method that decomposes signals at different frequencies and time scales and is suitable for processing non-stationary signals. The transformation automatically adjusts the decomposition level and wavelet basis functions according to the local characteristics of the signal, thereby obtaining optimal time-frequency resolution. During the transformation, each microphone signal is mapped to multiple frequency subbands, forming a time-frequency atomic structure. The time-frequency atoms retain the time and frequency characteristics of the original signals and effectively capture the energy distribution and phase information of the sound field in each frequency band. The multi-level sound field time-frequency representation contains the complete time-frequency characteristics of the sound field over the different frequency intervals and provides the basis for subsequent spatial characteristic analysis.
The multi-level sound field time-frequency representation is sent to a complex domain spatial covariance calculation unit to compute the phase difference spectrum of each frequency band. The complex domain spatial covariance is a statistical description of the phase relationships between the different microphone signals, reflecting how sound waves propagate in space. In the calculation, a complex covariance matrix is formed between the microphone signals for each frequency band; this matrix contains the phase difference information between microphone pairs. The band phase difference spectrum is a three-dimensional tensor whose dimensions are frequency, space, and time; it records the spatial phase variation patterns of the sound field, which are closely related to the direction and distance of the sound source. The band phase difference spectrum has a high signal-to-noise ratio and maintains relatively stable direction information even in noisy environments.
The frequency band phase difference spectrum is subjected to high-order singular value decomposition to obtain an orthogonal projection structure. The high-order singular value decomposition is a multidimensional data analysis technology and is suitable for processing tensor data with three dimensions and above. The decomposition decomposes the band phase difference spectrum tensor into a series of orthogonal subspace projections, each projection representing an independent spatial mode in the sound field. The orthogonal projection structure consists of a main singular matrix and corresponding singular values, wherein the main singular matrix reflects main spatial features in the frequency band phase difference spectrum, and the singular values represent the significance of the features. The orthogonal projection structure eliminates redundant information, highlights main direction characteristics in a sound field, and effectively improves accuracy and robustness of subsequent direction estimation.
Based on the orthogonal projection structure, a hyperbolic direction estimation function is constructed, gradient polarization search is carried out, and a sound source direction vector is obtained. The hyperbolic direction estimation function is a non-linear mapping that maps spatial coordinates to direction similarity scores to form a hyperbolic surface. The function forms a directional response surface in three-dimensional space by using an orthogonal projection structure of the band phase difference spectrum. The gradient polarization search is an algorithm for searching the optimal point on the response surface, and the gradient direction is calculated iteratively and moved along the gradient direction, so that the extreme point of the response surface is converged finally. The sound source direction vector is a three-dimensional unit vector, points to the space position of the sound source and comprises azimuth angle and pitch angle information. The calculation of the direction vector takes the fusion of the multiband information into account, ensuring the stability of the direction estimation under various acoustic environments.
And carrying out phase alignment on the sound source direction vector and the sound field consistency signal to obtain an initial beam gain coefficient. Phase alignment is the core step of beamforming, which causes each microphone signal to be phase-aligned in a particular direction, thereby enhancing the sound source signal in that direction. In the alignment process, according to the sound source direction vector and the geometric layout of the microphone array, the time delay difference between the sound source and each microphone is calculated, and the phase of each microphone signal is correspondingly adjusted. The initial beam gain coefficients are a set of complex weights that contain amplitude and phase information for weighted synthesis of each microphone signal. These coefficients cause the phase of the signals in the target direction to be superimposed consistently, while the signals in the non-target direction cancel each other out due to the phase mismatch, thus creating a spatially selective gain.
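The delay-and-sum weighting described in this step can be sketched as follows, under free-field, far-field assumptions (geometry, names, and normalisation are illustrative):

```python
import numpy as np

def delay_and_sum_weights(mic_pos, direction, freqs, c=343.0):
    """Complex beam gain coefficients that phase-align a target direction.
    mic_pos: (mics, 3) coordinates in metres; direction: unit vector toward
    the source; freqs: FFT bin frequencies in Hz."""
    delays = mic_pos @ direction / c                  # relative arrival delays (s)
    w = np.exp(2j * np.pi * np.outer(freqs, delays))  # (bins, mics) phase weights
    return w / mic_pos.shape[0]                       # normalise for unit gain

def beamform(spec, w):
    """Weighted coherent sum across channels. spec: (mics, frames, bins)."""
    return np.einsum("fm,mtf->tf", w, spec)
```

For a plane wave arriving exactly from the steered direction, the per-channel phases cancel and the weighted sum has unit gain; off-target arrivals add with mismatched phases and partially cancel, which is the spatially selective gain described above.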
And constructing a complex domain attention mask according to the initial beam gain coefficient, and carrying out channel weighting on the sound field consistency signal to obtain a beam forming signal. A complex domain attention mask is a spatial filter operating on a complex plane that imparts different weights to signals of different directions and frequencies. The mask construction process further optimizes the spatial filtering characteristics by using the initial beam gain coefficients through an adaptive learning mechanism. The mask not only considers the amplitude of the signal, but also retains the phase information, ensuring the phase consistency in the beamforming process. The channel weighting is a process of applying a complex domain attention mask to the sound field consistency signal, and each channel signal is weighted and then subjected to complex domain superposition to form a final beam forming output. The beam forming signal has higher signal to noise ratio and definition, effectively inhibits the interference of environmental noise and reverberation, and enhances the sound source signal in the target direction. The beam forming signals are used for subsequent applications such as voice recognition, sound source localization, acoustic monitoring and the like, and the performance of the applications in complex acoustic environments is remarkably improved.
According to the embodiment, the adaptive wavelet packet transformation decomposition is carried out on the sound field consistency signal to obtain the multi-level sound field time-frequency representation, so that the energy distribution and phase information of the sound field in each frequency band are effectively captured, and a solid foundation is provided for accurate sound source positioning. The frequency band phase difference spectrum obtained by complex domain space covariance calculation can keep stable direction information in a high noise environment, and the anti-interference capability of the system is obviously improved. The orthogonal projection structure generated by high-order singular value decomposition eliminates redundant information, highlights main direction characteristics in a sound field, and enhances the accuracy of direction estimation. The method based on hyperbolic direction estimation function and gradient polarization search realizes high-precision sound source direction vector calculation and adapts to the change of complex acoustic environment. The initial beam gain coefficient obtained by phase alignment enables the phases of signals in the target direction to be superposed consistently, and signals in the non-target direction are mutually offset. Finally, the construction and application of the complex domain attention mask optimizes the spatial filtering characteristic while maintaining the phase information, generates a beam forming signal with high signal to noise ratio and definition, and effectively suppresses the environmental noise and reverberation interference.
In one embodiment, optimizing the beam direction and gain parameters of the beam forming signal through the preset countermeasure environmental simulator to obtain the dynamic frequency band enhancement signal includes:
The simulator comprises a noise generation module, a reverberation simulation module, and an environmental condition parameter library, and is used to create diversified environmental interference samples.
Diversified environmental interference samples are generated by performing multiple noise interference and reverberation condition simulations on the beam forming signal through the antagonistic environment simulator. These samples cover various noise types such as white noise, pink noise, mechanical noise, and human voice, as well as reverberation conditions for rooms of different sizes, materials, and shapes. The environmental interference samples form an interference matrix through interference source localization and interference intensity mapping; the matrix contains interference information in both the spatial and frequency dimensions.
Gradient descent optimization is then performed on the diversified environmental interference samples using a stochastic gradient descent algorithm, adjusting the beam direction parameters and frequency band gain coefficients according to a preset signal-to-noise ratio objective function. The signal-to-noise ratio objective function is defined as the logarithm of the ratio of desired signal power to noise power, with its threshold set to 15 dB, enabling adaptive adjustment of the beam direction. During optimization, the beam direction parameters are adjusted within ±30°, the band gain coefficients are adjusted within 0.5-2.0, and the parameter optimization matrix is obtained after 300 iterations. The parameter optimization matrix contains the beam direction angle correction values and the gain coefficient adjustment values of each frequency band for subsequent signal reconstruction.
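As an illustration of the constrained stochastic optimization described above, the sketch below runs projected gradient ascent on a caller-supplied SNR objective, clipping the direction to ±30° and the gains to [0.5, 2.0] over 300 iterations. The finite-difference gradient scheme and the one-random-band-per-step update are assumptions for illustration; the embodiment does not specify how gradients are obtained:

```python
import numpy as np

def sgd_beam_params(snr_fn, theta0, gains0, lr=0.01, iters=300,
                    theta_range=(-30.0, 30.0), gain_range=(0.5, 2.0)):
    """Projected stochastic gradient ascent on an SNR objective.
    snr_fn(theta, gains) -> scalar SNR in dB. Constraints follow the text:
    direction within +/-30 degrees, band gains within [0.5, 2.0]."""
    rng = np.random.default_rng(0)
    theta = float(theta0)
    gains = np.array(gains0, dtype=float)
    eps = 1e-3
    for _ in range(iters):
        # central finite-difference gradient for the direction parameter
        g_theta = (snr_fn(theta + eps, gains) - snr_fn(theta - eps, gains)) / (2 * eps)
        # pick one random band per step ("stochastic" coordinate update)
        b = rng.integers(len(gains))
        d = np.zeros_like(gains)
        d[b] = eps
        g_gain = (snr_fn(theta, gains + d) - snr_fn(theta, gains - d)) / (2 * eps)
        # gradient step, projected back into the allowed ranges
        theta = float(np.clip(theta + lr * g_theta, *theta_range))
        gains[b] = np.clip(gains[b] + lr * g_gain, *gain_range)
    return theta, gains
```

With a smooth objective whose optimum lies inside the feasible box, the iterates converge toward it while never leaving the ±30° / [0.5, 2.0] constraints.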
Directional reconstruction is performed on the beam forming signal according to the parameter optimization matrix to obtain a direction enhancement signal. During directional reconstruction, a subband decomposition technique splits the signal into 32 subbands, and the corresponding direction parameter correction is applied to each subband independently. Compared with the original beam forming signal, the direction enhancement signal has more accurate sound source localization capability and a stronger noise suppression effect, and performs particularly well in low signal-to-noise ratio environments.
And carrying out acoustic feature matching according to the direction enhancement signal and preset voice data to obtain corresponding target actual short-time voice data. The acoustic feature matching process uses mel-frequency cepstral coefficient (MFCC) feature extraction and Dynamic Time Warping (DTW) distance calculation to retrieve the most similar speech segments from a pre-set speech database. The pre-set voice database contains 8000 standard voice samples covering speakers of different gender, age and language. The target actual short-term speech data represents ideal speech signal characteristics for guiding subsequent signal enhancement processing.
And carrying out feature extraction on the target actual short-time voice data based on a preset meta-learning online calibration algorithm to obtain a target feature adaptation matrix. The meta-learning online calibration algorithm adopts a model-independent adaptive learning mechanism and comprises three core components of a feature extraction layer, a feature mapping layer and an adaptation layer. The feature extraction layer extracts time-frequency features by using short-time Fourier transform, the feature mapping layer identifies key feature points through an attention mechanism, and the adaptation layer generates a 16×16-dimensional adaptation matrix. The target feature adaptation matrix contains gain and phase adjustment parameters required for spectrum correction, and has high-dimensional feature expression capability.
A convolution operation is performed between the target feature adaptation matrix and the direction enhancement signal to obtain a spectrum correction signal. The operation uses two-dimensional convolution with a 5×5 kernel, a stride of 1, and "same" padding. During convolution, each element of the adaptation matrix is multiplied and accumulated with the corresponding frequency point of the signal, realizing fine adjustment of the spectrum. The spectrum correction signal exhibits clearer harmonic structure and formants, effectively eliminating spectral distortion caused by environmental interference.
Cross-band consistency processing and phase band gain are applied to the spectrum correction signal to obtain the dynamic frequency band enhancement signal. The cross-band consistency processing uses an adjacent-band smoothing algorithm with a 5-point smoothing window to eliminate inter-band discontinuities. The phase band gain uses a group delay correction technique to maintain the continuity of the signal phase, with a phase adjustment range of ±0.2π. The dynamic frequency band enhancement signal boosts the low band by 3-6 dB, the mid band by 1-3 dB, and the high band by 4-8 dB, with the overall enhancement characteristic dynamically adjusted as the environment changes.
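The 5-point adjacent-band smoothing and per-band dB boosts can be illustrated as follows (band edges and boost values are example inputs, not claimed figures):

```python
import numpy as np

def cross_band_smooth(band_gains, win=5):
    """5-point moving-average smoothing across bands to remove inter-band jumps."""
    pad = win // 2
    padded = np.pad(band_gains, pad, mode="edge")   # replicate edge bands
    kernel = np.ones(win) / win
    return np.convolve(padded, kernel, mode="valid")

def apply_band_boost_db(spectrum, edges, boosts_db):
    """Apply per-band gains in dB (e.g. low +3..6 dB, mid +1..3 dB, high +4..8 dB).
    edges: list of (lo, hi) bin ranges; boosts_db: one dB value per range."""
    out = spectrum.astype(float).copy()
    for (lo, hi), db in zip(edges, boosts_db):
        out[lo:hi] *= 10.0 ** (db / 20.0)   # dB -> linear amplitude gain
    return out
```

Smoothing a step-shaped gain profile with the 5-point window yields a gradual ramp across the band boundary instead of a discontinuity.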
According to the embodiment, the beam direction and gain parameters of the beam forming signals are optimized through the countermeasure environment simulator, so that the quality and definition of the signals can be obviously improved in various complex noise and reverberation environments. According to the method, by simulating a plurality of environmental interference samples and combining a gradient descent optimization algorithm, the beam direction and the band gain are effectively adjusted, so that the beam forming signals can be adaptively adjusted when facing different interferences, and the robustness and the adaptability of the system are improved. And the acoustic feature matching is carried out based on the target actual short-time voice data, so that the system can recover the time-frequency feature of the voice signal more accurately, and the voice intelligibility is further enhanced. The target characteristics are extracted and an adaptive matrix is generated through a meta-learning online calibration algorithm, so that the spectrum quality of the signal is effectively improved, and the effect of correcting the signal by the spectrum is more remarkable. The final dynamic band enhancement signal can provide stronger enhancement effect in different frequency bands, and particularly shows prominence in a low signal-to-noise ratio environment.
In one embodiment, performing multiple noise interference and reverberation condition simulations on the beam forming signal through the antagonistic environment simulator to obtain diversified environmental interference samples includes:
The antagonistic environment simulator performs a time-frequency transformation operation on the beamformed signal. The time-frequency transform is a process of converting a time-domain signal into a time-frequency domain representation, and the beamformed signal is decomposed into two-dimensional representations of time and frequency using a short-time fourier transform (STFT). The result of the time-frequency transformation is a time-frequency domain representation of a signal matrix containing information about the energy distribution of the signal at different points in time and at different frequencies.
And constructing a three-dimensional acoustic propagation structure based on the time-frequency domain representation signal matrix. A three-dimensional acoustic propagation structure is a mathematical model that simulates the propagation characteristics of sound waves in three dimensions, and takes into account factors such as space geometry, obstacle location, and surface texture. In the construction process, the countermeasure environmental simulator sets random reflection coefficients to simulate the reflection characteristics of the surfaces of different materials on the sound waves. Randomization of the reflection coefficients ensures the diversity and authenticity of the simulated environment. Through this step, the system generates a set of spatial reverberation parameters including key parameters such as Room Impulse Response (RIR), early reflections, and late reverberation.
The antagonistic environmental simulator randomly extracts various types of noise sources from a preset environmental noise library. The environmental noise library is a data set of real ambient noise recordings, such as traffic noise, crowd noise, and machine noise. The noise sources undergo time domain stretching, which changes their duration and temporal characteristics, and frequency domain shifting, which adjusts their frequency characteristics. The time domain stretching uses a phase vocoder technique that keeps the pitch characteristics of the noise unchanged, and the frequency domain shifting is realized through spectrum translation. Through these processes, the system obtains a distorted noise source set with higher diversity and unpredictability.
The antagonistic environment simulator performs position mapping on the distorted noise source set according to a preset spatial directivity distribution function. The spatial directivity distribution function describes the distribution law of noise sources in space, defining the probability that a noise source appears in a specific direction based on a probability density function. The position mapping process distributes the distorted noise sources to various directions in three-dimensional space, generating a multidirectional noise distribution matrix. The matrix represents the spatial distribution of noise sources from different directions, increasing the spatial diversity of noise interference.
The specific process of performing position mapping on the distorted noise source set according to the preset spatial directivity distribution function is as follows:
The spatial directivity distribution function is a mathematical description of the distribution law of noise sources in three dimensions. The position mapping process first assigns each distorted noise source a set of spatial position coordinates (θ, φ, r), where θ represents the horizontal angle (azimuth), φ represents the vertical angle (elevation), and r represents the distance.
For each noise source n_i, the system samples the azimuth θ_i and elevation φ_i from the spatial directivity distribution function P(θ, φ). This distribution function may be a uniform distribution, a Gaussian distribution, or a custom distribution based on actual environmental measurements. The distance parameter r_i is then sampled from a preset distance distribution function D(r) defining the possible range of distances between the noise source and the microphone array.
The mathematical expression of the position mapping is:
p_i = M(n_i, θ_i, φ_i, r_i)
where M represents the mapping function, n_i represents the i-th distorted noise source, (θ_i, φ_i) represents the angle values sampled from the directional distribution function, and r_i represents the distance value sampled from the distance distribution function.
After the position mapping is completed for all the deformed noise sources, the system constructs a multidirectional noise distribution matrix N, and each row of the matrix contains the position information and the noise characteristic parameters of one noise source to form a complete spatial distribution representation. This matrix will be used for subsequent acoustic propagation delay and energy attenuation calculations, ensuring that the simulated noise disturbance has real spatial characteristics.
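Sampling the noise-source positions as described, with uniform distributions standing in for the directivity function and the distance function purely for illustration, might look like:

```python
import numpy as np

def sample_noise_positions(n_sources, rng=None, r_min=0.5, r_max=5.0):
    """Draw (azimuth, elevation, distance) for each distorted noise source:
    azimuth uniform on [0, 2*pi), elevation uniform on [-pi/2, pi/2],
    distance uniform on [r_min, r_max] as a simple stand-in for the preset
    distance distribution."""
    rng = rng or np.random.default_rng(0)
    az = rng.uniform(0.0, 2.0 * np.pi, n_sources)
    el = rng.uniform(-np.pi / 2, np.pi / 2, n_sources)
    r = rng.uniform(r_min, r_max, n_sources)
    # each row is one noise source's position entry in the distribution matrix N
    return np.stack([az, el, r], axis=1)
```

Each row of the returned array corresponds to one row of the multidirectional noise distribution matrix N, ready to be augmented with the noise characteristic parameters.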
The antagonistic environmental simulator performs acoustic propagation delay and energy attenuation calculations based on the multidirectional noise distribution matrix and the set of spatial reverberation parameters. The acoustic propagation delay is the time required for the sound wave to travel from a noise source to the receiving point, determined by the source distance and the propagation speed of the medium; the energy attenuation describes the energy loss of the sound wave during propagation, determined by the propagation distance and the environmental absorption coefficient. By calculating the propagation path characteristics from each noise source to the microphone array, the system generates a noise propagation characteristic vector containing the delay, attenuation, and direction information of each noise source's propagation to the receiving point.
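The per-source delay (r/c) and free-field spherical-spreading attenuation (1/r) can be computed, for example, as follows; the free-field 1/r model is an illustrative simplification that ignores the environmental absorption coefficient mentioned in the text:

```python
import numpy as np

def propagation_features(positions, c=343.0):
    """Per-source propagation delay and free-field amplitude attenuation toward
    the array origin. positions: (n, 3) rows of (azimuth, elevation, distance)."""
    r = positions[:, 2]
    delay = r / c            # seconds from each source to the array
    attenuation = 1.0 / r    # spherical-spreading amplitude loss
    return delay, attenuation
```

Doubling a source's distance doubles its delay and halves its amplitude, which is the behaviour the propagation feature vector encodes.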
The antagonistic environmental simulator reverberates the time-frequency domain representation signal matrix based on the set of spatial reverberation parameters. The reverberation process simulates the acoustic effect of multiple reflections of sound waves in an enclosed space by convolving with the room impulse response. The processed signal and the noise propagation characteristic vector are subjected to interference superposition, and the situation that a target signal in a real environment is simultaneously interfered by various noise sources and reverberation is simulated. The superposition process considers the spatial position, propagation characteristics and energy size of the noise source, and ensures that the generated interference mode is more real. And finally obtaining diversified environment interference samples, wherein the samples reflect signal characteristics under various complex acoustic environments.
Generating these diversified environment interference samples greatly enriches the diversity of the training data, allowing the beamforming algorithm to learn to adapt to a wide range of complex environmental conditions and thereby improving the robustness and generalization capability of the system in practical applications. With the samples produced by the adversarial environment simulator, the beamforming system can maintain good signal enhancement under previously unseen noise and reverberation conditions.
In this method, an adversarial environment simulator is introduced into the multi-microphone array beam forming signal enhancement method, so that various noise interference and reverberation conditions can be effectively simulated and diversified environment interference samples can be generated. Accurate simulation of a complex acoustic environment is achieved through time-frequency transformation, construction of a three-dimensional acoustic propagation structure, random extraction of deformation noise sources, position mapping, and acoustic propagation delay and energy attenuation calculation, greatly improving the robustness and adaptability of beam forming signal enhancement. Applying time-domain stretching and frequency-domain shifting to real environmental noise gives the generated noise sources higher diversity and effectively improves the system's ability to handle incidental noise interference. The introduction of the spatial reverberation parameter set and the multidirectional noise distribution matrix makes the generated interference samples more realistic, reflecting the reverberation and noise characteristics of real environments. Finally, reverberation processing and interference superposition are performed on the time-frequency domain representation signal matrix; the resulting diversified environment interference samples greatly enrich the training data and benefit the performance of the beam forming algorithm in complex environments, thereby improving the signal enhancement effect and generalization capability of the system.
In one embodiment, feature extraction is performed on the target actual short-time voice data based on a preset meta-learning online calibration algorithm to obtain a target feature adaptation matrix, including:
Short-time window framing and windowing are applied to the target actual short-time voice data to obtain an overlapped segmented signal sequence. Specifically, the original speech signal is framed using a 25 ms Hamming window with a 10 ms frame shift between adjacent frames. The processing is based on the assumption that the speech signal is quasi-stationary over short intervals; overlapping frames ensure continuity and smooth transitions of the signal. Windowing each voice frame attenuates the frame edges and reduces spectral leakage. The overlapped segmented signal sequence is a time-ordered set of mutually overlapping short-time voice segments that preserves the temporal and local spectral characteristics of the original speech.
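The framing and windowing step can be sketched as follows, interpreting the figures above as a 25 ms frame with a 10 ms frame shift; the 16 kHz sampling rate used in practice below is an assumption.

```python
import math

def frame_signal(x, fs, frame_ms=25, hop_ms=10):
    """Split x into overlapping Hamming-windowed frames
    (25 ms frame, 10 ms shift by default, as described above)."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    # Hamming window coefficients
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * window[n] for n in range(frame_len)])
    return frames
```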
Mel frequency cepstrum coefficient calculation is performed on the overlapped segmented signal sequence to obtain a segmented feature description vector. The Mel frequency cepstrum coefficient calculation includes performing a fast Fourier transform on each segmented signal to obtain a power spectrum, converting the linear frequency scale to the Mel frequency scale, and applying a discrete cosine transform to obtain the cepstral coefficients. Typically, 13 to 26 Mel frequency cepstrum coefficients are extracted and combined with their first-order and second-order difference coefficients to form 39- to 78-dimensional feature vectors. The segmented feature description vector is a set of multidimensional vectors that effectively represent the acoustic characteristics of the speech and capture its key spectral-domain information.
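The linear-to-Mel frequency warping used in this step can be sketched with the common HTK-style formula; the band-edge helper illustrating how triangular mel filters might be placed is an assumption, not part of the stated method.

```python
import math

def hz_to_mel(f):
    """HTK-style linear-to-mel frequency mapping."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(low_hz, high_hz, n_bands):
    """Edge frequencies of n_bands triangular mel filters, equally
    spaced on the mel scale between low_hz and high_hz."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]
```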
And constructing a meta-learning task sampling set for the segmented feature description vector to obtain a feature support set and a feature query set. The construction of the meta-learning task sampling set adopts an N-way K-shot strategy, N categories are randomly selected from the segmented feature description vector, each category comprises K samples to form a feature support set, and part of samples are selected to form a feature query set. The feature support set is used for model parameter initialization and internal circulation optimization, and the feature query set is used for evaluating model generalization capability and external circulation optimization. The "category" herein refers to different acoustic environments or speaker characteristic patterns.
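An N-way K-shot episode construction consistent with the description above can be sketched as follows; the dictionary layout of `feature_pool` (class label to feature vectors) is an assumption made for illustration.

```python
import random

def sample_task(feature_pool, n_way, k_shot, q_query, rng=None):
    """Draw one N-way K-shot episode: feature_pool maps a class label
    (an acoustic environment or speaker pattern) to its feature
    vectors. Returns (support, query) lists of (vector, label)."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(feature_pool), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(feature_pool[c], k_shot + q_query)
        support += [(v, c) for v in picked[:k_shot]]
        query += [(v, c) for v in picked[k_shot:]]
    return support, query
```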
A meta-learning inner-loop optimization function is constructed based on the feature support set to obtain the environment adaptation parameters. The meta-learning inner-loop optimization function applies a gradient descent algorithm for local optimization on the feature support set; the loss function is set to the prototypical network metric loss, and the optimization objective is to minimize the Euclidean distance between each sample in the feature support set and the prototype of its corresponding category. The environment adaptation parameters comprise an embedding function parameter matrix and a metric space transformation matrix for the specific environment, reflecting how the model adapts to the current acoustic environment.
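The prototypical network metric loss minimized by the inner loop can be sketched as the mean squared Euclidean distance of each support sample to its class prototype:

```python
def prototypes(support):
    """Class prototype = mean of support vectors per class."""
    sums, counts = {}, {}
    for vec, label in support:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def proto_loss(support):
    """Mean squared Euclidean distance of each support sample to its
    class prototype -- the inner-loop objective described above."""
    protos = prototypes(support)
    total = 0.0
    for vec, label in support:
        p = protos[label]
        total += sum((a - b) ** 2 for a, b in zip(vec, p))
    return total / len(support)
```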
MAML (Model-Agnostic Meta-Learning) gradient updates are performed on the feature query set according to the environment adaptation parameters to obtain task model parameters. The MAML gradient update computes the gradient of the query-set loss function with respect to the environment adaptation parameters and adjusts the parameters using second-order derivative optimization. The update process comprises three steps: forward propagation to compute the loss, backward propagation to compute the gradient, and parameter updating. The task model parameters are the environment-adapted model weights, capable of quickly adapting to new environments.
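The inner-loop adaptation step can be sketched as plain gradient descent on the support-set loss. Note this first-order sketch omits the second-order differentiation through the inner step that full MAML performs; `grad_fn` is a placeholder for the support-set gradient computation.

```python
def inner_adapt(theta, grad_fn, alpha, steps=1):
    """Inner-loop adaptation: theta' = theta - alpha * grad(theta),
    repeated for the given number of steps. grad_fn returns the
    gradient list at the given parameters."""
    adapted = list(theta)
    for _ in range(steps):
        g = grad_fn(adapted)
        adapted = [t - alpha * gi for t, gi in zip(adapted, g)]
    return adapted
```

For example, one step on the quadratic loss (theta - 3)^2 from theta = 0 with alpha = 0.25 moves the parameter to 1.5.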
The meta-learning outer-loop optimization integrates and optimizes the meta-learning results of multiple tasks to improve the generalization capability and adaptability of the model. The specific process is as follows:
The tasks in the task pool are sampled to form a plurality of tasks, and each task comprises a feature support set and a feature query set. For each task, a penalty function on the feature query set is calculated based on the current task model parameters. The loss function typically employs cross entropy loss or mean square error, reflecting the performance of the model on that task.
And carrying out gradient calculation on the loss function of each task, and obtaining the gradient of the loss function on the model parameters. And averaging the gradients of all tasks to obtain a global gradient. Based on the global gradient, updating global parameters of the model through a gradient descent algorithm. The updated model parameters can be better adapted to different tasks, and the generalization capability of the model is improved.
The task sampling, loss calculation, gradient calculation and parameter updating processes are iterated in the outer loop repeatedly until the loss function converges or reaches the preset iteration times. Through multiple iterations, model parameters are gradually optimized, and different acoustic environments and signal characteristics can be better adapted.
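The gradient averaging and global parameter update of the outer loop described above can be sketched as:

```python
def outer_update(theta, task_grads, beta):
    """Meta (outer-loop) update: average the per-task query-set
    gradients and take one gradient descent step on the shared
    parameters."""
    n = len(task_grads)
    return [t - beta * sum(g[i] for g in task_grads) / n
            for i, t in enumerate(theta)]
```

This single step is repeated, with fresh task sampling and loss evaluation, until the loss converges or the iteration limit is reached.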
The online learning rate adjustment aims to dynamically adjust the learning rate of the model so that the model has optimal learning rate under different environments and tasks. The specific process is as follows:
The online learning rate adjustment adopts a Bayesian optimization framework. Bayesian optimization is a global optimization method based on a probability model that selects the optimal learning rate parameter by constructing a probability distribution model of the objective function. The Bayesian optimization framework comprises three main parts: the prior distribution, the likelihood function, and the posterior distribution.
And constructing prior distribution of the learning rate parameters according to the historical optimization track and the current performance index. The prior distribution reflects a preliminary estimate of the learning rate parameter and may be described using a gaussian process or other probability distribution model.
And updating posterior distribution of the learning rate parameter based on the loss function and gradient information of the current task. The posterior distribution combines prior information and observation data of the current task, and more accurately reflects the optimal value of the learning rate parameter.
And selecting the optimal learning rate parameter by maximizing a likelihood function of posterior distribution. The optimal learning rate parameter is the learning rate value that minimizes the model loss on the current task.
And generating a dynamic learning rate matrix according to the optimal learning rate parameter. The dynamic learning rate matrix contains adaptive learning rate values for different parameter layers and different tasks. The learning rate of each parameter layer is adjusted according to the importance and the change degree of the parameter layer on different tasks, so that the model can learn at an optimal rate under different environments and tasks.
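As a heavily simplified stand-in for the Bayesian learning rate optimization described above, the following conjugate Gaussian update shows how a prior over a (log-)learning rate combines with one noisy observation to form the posterior. A practical implementation would use a Gaussian process surrogate and an acquisition function; the scalar conjugate update here is an illustrative assumption.

```python
def posterior_update(prior_mean, prior_var, obs, obs_var):
    """Conjugate Gaussian update for a scalar (log-)learning-rate
    parameter: combines the prior with one noisy observation of the
    apparent best rate, returning (posterior_mean, posterior_var)."""
    precision = 1.0 / prior_var + 1.0 / obs_var
    mean = (prior_mean / prior_var + obs / obs_var) / precision
    return mean, 1.0 / precision
```

Repeating this per parameter layer and per task would populate the dynamic learning rate matrix; the posterior variance also indicates how confident the estimate is.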
And performing prototype network mapping on the segmented feature description vector according to the dynamic learning rate matrix to obtain the environment perception feature representation. The prototype network mapping process includes two main steps, feature embedding and prototype calculation. The feature embedding adopts a deep neural network to map the segmented feature description vector into a high-dimensional embedding space, and the prototype calculation obtains a class prototype vector by averaging the embedded vectors of the same class sample. The environment perception feature representation is a high-dimensional feature vector set with environment adaptability generated under a prototype network framework, so that key features of a voice signal are reserved, and the influence of environmental noise is filtered.
Metric learning embedding space construction is performed on the environment-aware feature representation to obtain a voice feature embedding matrix. The metric learning embedding space is constructed using a Mahalanobis distance metric framework, learning an optimal distance metric matrix that minimizes the distances between same-class samples and maximizes the distances between different-class samples.
The metric learning embedding space construction comprises three main stages: initializing the metric matrix, iteratively optimizing the metric parameters, and verifying metric performance. In the initialization stage, a covariance matrix is computed from the environment-aware feature representation, and an initial distance metric matrix is obtained by eigendecomposition. The covariance matrix reflects the correlation between feature dimensions, and its inverse is used to construct the Mahalanobis distance metric. The initial distance metric matrix serves as the starting parameter for metric learning, providing an initial reference for subsequent optimization.
In the iterative optimization stage, a contrastive loss function is used to update the distance metric matrix; the optimization objective is to minimize the Mahalanobis distance between same-class samples and maximize the Mahalanobis distance between different-class samples. During optimization, positive and negative sample pairs are constructed, where positive pairs come from the same class and negative pairs from different classes. Multiple positive and negative sample pairs are built by random sampling, the gradient of the loss function is computed, and the distance metric matrix is iteratively adjusted by gradient descent until the optimization objective converges.
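The Mahalanobis metric and the contrastive objective over positive/negative pairs can be sketched as follows; the margin value is an illustrative parameter.

```python
def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) under a
    learned metric matrix M (assumed positive semi-definite)."""
    d = [a - b for a, b in zip(x, y)]
    return sum(d[i] * M[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

def contrastive_loss(pairs, M, margin=1.0):
    """Contrastive loss: pull same-class pairs together, push
    different-class pairs beyond the margin. Each pair is
    (x, y, same_class)."""
    loss = 0.0
    for x, y, same in pairs:
        d2 = mahalanobis_sq(x, y, M)
        loss += d2 if same else max(0.0, margin - d2)
    return loss / len(pairs)
```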
In the metric performance verification stage, the quality of the optimized metric space is evaluated using classification accuracy and clustering metrics. Classification accuracy is assessed by the performance of a nearest-neighbor classifier in the metric space; the clustering metrics include intra-class distance, inter-class distance, and the silhouette coefficient. The intra-class distance measures how tightly same-class samples cluster, the inter-class distance measures how well different classes are separated, and the silhouette coefficient evaluates the overall clustering effect. The optimized metric learning embedding space is evaluated with these indicators, and the training strategy is adjusted according to the evaluation results.
The voice feature embedding matrix is a data structure containing multiple voice feature vectors; in the learned metric space these feature vectors exhibit good intra-class aggregation and inter-class separability, ensuring that voice features from different environments remain distinguishable in the metric space.
A Bayesian credible interval constraint is applied to the voice feature embedding matrix to obtain a target feature adaptation matrix. The Bayesian credible interval constraint estimates the uncertainty interval of each parameter by building a probability distribution model of the feature parameters, filtering out unreliable feature representations. The constraint process comprises four key steps: constructing the prior distribution of the feature parameters, computing the posterior distribution, determining the credible interval, and screening the feature vectors. The target feature adaptation matrix is a reliability-screened set of feature representations with higher robustness and environmental adaptability, providing a reliable feature basis for subsequent beam forming processing.
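A simple stand-in for the credible-interval screening is shown below: vectors whose norm falls outside the central Gaussian credible interval estimated from the batch are filtered out. Both the norm-based statistic and the z = 1.96 (roughly 95%) interval are illustrative assumptions, not the method's actual probability model.

```python
import math

def credible_filter(vectors, z=1.96):
    """Keep only vectors whose Euclidean norm lies inside the central
    Gaussian credible interval (mean +/- z * std) estimated from the
    batch -- a simplified sketch of the reliability screening step."""
    norms = [math.sqrt(sum(v * v for v in vec)) for vec in vectors]
    mean = sum(norms) / len(norms)
    var = sum((n - mean) ** 2 for n in norms) / len(norms)
    std = math.sqrt(var)
    lo, hi = mean - z * std, mean + z * std
    return [vec for vec, n in zip(vectors, norms) if lo <= n <= hi]
```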
In this method, feature extraction using the preset meta-learning online calibration algorithm in the multi-microphone array beam forming signal enhancement method effectively adapts to different acoustic environments and improves the clarity and intelligibility of voice signals. Short-time window framing, windowing, and Mel frequency cepstrum coefficient calculation carefully preserve the temporal and spectral characteristics of the original speech. Constructing the meta-learning task sampling set and the inner-loop optimization function allows model parameters to be quickly adjusted to new acoustic environments, improving the flexibility and adaptability of the model. MAML gradient updates and meta-learning outer-loop optimization ensure efficient learning across different tasks, while online learning rate adjustment adapts the learning rate to further improve training efficiency and stability. Through metric learning embedding space construction, the voice features exhibit good intra-class aggregation and inter-class separability in the high-dimensional space, ensuring that voice features remain distinguishable across environments. The overall method not only improves the accuracy of voice recognition in noisy environments but also enhances the robustness and practicality of the system under complex acoustic conditions.
In one embodiment, performing inverse time-frequency conversion on the dynamic band enhancement signal and low-delay optimization to obtain a target quality speech output signal comprises:
The dynamic band enhancement signal is converted into an initial time domain signal through an overlap-add inverse transform. The overlap-add inverse transform converts a frequency domain signal back to the time domain using the inverse short-time Fourier transform (ISTFT): the frequency domain data is divided into frames, an inverse Fourier transform is performed on each frame, and the results are overlap-added in the time domain. The overlap factor is typically set to 50% or 75% to ensure smooth transitions. The initial time domain signal retains the basic time domain characteristics of the original sound but may exhibit phase discontinuities.
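The overlap-add synthesis at the core of the ISTFT described above can be sketched as:

```python
def overlap_add(frames, hop):
    """Overlap-add reconstruction: each inverse-transformed frame is
    shifted by `hop` samples and summed into the output, as in ISTFT
    synthesis. Assumes all frames have equal length."""
    n = hop * (len(frames) - 1) + len(frames[0])
    out = [0.0] * n
    for k, frame in enumerate(frames):
        for i, v in enumerate(frame):
            out[k * hop + i] += v
    return out
```

With a frame length of 4 and hop of 2 (50% overlap), interior samples receive contributions from two frames, which is why proper window normalization matters in a full ISTFT.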
The initial time domain signal is subjected to inter-frame phase consistency compensation to obtain a phase continuous intermediate signal. This step aims to solve the problem of phase discontinuity which may occur between adjacent frames, avoiding the generation of artificial synthetic sound effects and distortions. Phase consistency compensation achieves a smooth transition in phase by calculating the phase difference of adjacent frames and applying a phase correction factor. For each frequency component, the system calculates its phase continuity metric and performs a phase correction when a phase jump exceeding a preset threshold (typically pi/2) is detected. The phase continuous intermediate signal has more natural sound characteristics, and reduces the synthetic trace.
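The phase-jump detection and correction can be sketched as follows, using the pi/2 detection threshold stated above and snapping the offending phase to its nearest 2*pi-equivalent value; the snapping rule is an illustrative choice, not the method's exact correction factor.

```python
import math

def continuity_correct(phases, threshold=math.pi / 2):
    """Frame-to-frame phase continuity for one frequency bin: when
    the jump between consecutive frames exceeds the threshold, shift
    the phase by the nearest multiple of 2*pi."""
    out = [phases[0]]
    for p in phases[1:]:
        jump = p - out[-1]
        if abs(jump) > threshold:
            # snap to the nearest 2*pi-equivalent phase value
            k = round(jump / (2.0 * math.pi))
            p -= k * 2.0 * math.pi
        out.append(p)
    return out
```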
The phase continuous intermediate signal undergoes dynamic buffer adjustment to obtain a variable signal buffer. Dynamic buffer adjustment adaptively sizes the buffer according to signal characteristics and processing requirements. Guided by voice activity detection (VAD), silent segments use a smaller buffer to reduce delay, while active speech segments use a larger buffer to ensure processing quality. The buffer size ranges from 10 ms to 50 ms, adjusted in 5 ms steps. The variable signal buffer effectively balances processing delay against signal quality.
The variable signal buffer is subjected to nonlinear amplitude normalization to obtain a dynamic range compressed signal. Nonlinear amplitude normalization applies a logarithmic compression or hyperbolic tangent function to map the signal amplitude nonlinearly, compressing high-amplitude signals while enhancing low-amplitude signals. The normalization parameters are dynamically adjusted according to the statistical properties of the signal, including the compression ratio (1.5:1 to 3:1) and the threshold (-20 dB to -30 dB). The dynamic range compressed signal has a more balanced sound energy distribution, enhancing the audibility and intelligibility of speech.
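A minimal ratio-based compressor in the spirit of this step is sketched below, with a power-law gain curve standing in for the logarithmic or tanh mapping; the -24 dB threshold and 2:1 ratio are illustrative values picked from the stated ranges.

```python
import math

def compress_dynamic_range(samples, threshold_db=-24.0, ratio=2.0):
    """Soft dynamic-range compression: below the threshold the gain
    is unity; above it, level growth above the threshold is divided
    by `ratio` (power-law stand-in for log/tanh compression)."""
    thr = 10.0 ** (threshold_db / 20.0)  # dB -> linear amplitude
    out = []
    for s in samples:
        a = abs(s)
        if a <= thr:
            out.append(s)
        else:
            compressed = thr * (a / thr) ** (1.0 / ratio)
            out.append(math.copysign(compressed, s))
    return out
```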
A time domain transient characteristic vector of the dynamic range compressed signal is computed and time domain jitter elimination is performed to obtain a jitter-corrected signal. The time domain transient characteristic vector is computed by analyzing parameters such as the signal's short-time energy variation, zero-crossing rate, and spectral entropy. Jitter detection is based on abrupt changes in the feature vector; when jitter is detected (a feature-vector change exceeding a preset threshold, for example 25%), an adaptive smoothing filter is applied for jitter suppression. The jitter elimination process preserves the transient characteristics of the speech signal while removing instability, giving the jitter-corrected signal more stable speech quality.
And the jitter correction signal is subjected to multistage cascade frequency band reconstruction processing to obtain a frequency response equalization signal. The multistage cascade band reconstruction adopts a segmented multi-resolution filter bank to decompose a signal into a plurality of sub-bands (usually 4 to 8 sub-bands), and the sub-bands are respectively subjected to enhancement processing and then synthesized. The low frequency band (0-1 kHz) focuses on preserving fundamental frequency and tone information, the medium frequency band (1-4 kHz) focuses on improving speech intelligibility, and the high frequency band (4-8 kHz) focuses on restoring speech details. The frequency response balanced signal has more comprehensive and balanced frequency spectrum distribution, and the definition and naturalness of the voice are obviously improved.
And obtaining the equipment matching signal after the frequency response equalization signal is subjected to equipment pre-compensation processing. The device pre-compensation is based on the frequency response characteristics of the target playback device, and an inverse filtering technique is applied to perform pre-correction to compensate for distortion that may be introduced by the device. The compensation process includes frequency response correction, phase correction, dynamic range adjustment, and the like. The correction parameters are obtained through a database of device characteristics, and different compensation templates are applied for different classes of devices, such as headphones, speakers or mobile devices. The device matching signal is optimized for the specific playback device, ensuring that an optimal hearing experience is obtained in the actual use environment.
The device matching signal is processed by a frame selective discarding algorithm to eliminate noise and artifacts, finally yielding the target quality voice output signal. The frame selective discarding algorithm scores the quality of each signal frame based on signal quality assessment indicators, including the signal-to-noise ratio (SNR), the speech presence probability (estimated via voice activity detection, VAD), and the spectral distortion (SD). In the signal-to-noise ratio evaluation, the system computes the ratio between each frame's signal energy and the background noise; when the SNR falls below a preset threshold (e.g. 10 dB), the frame is considered poor quality. The speech presence probability is determined by a voice activity detection algorithm; when it falls below a preset threshold (e.g. 0.4), the frame is considered a non-speech frame. The spectral distortion is evaluated by comparing the spectral differences between the current frame and its neighboring frames; when it exceeds a preset threshold (e.g. 0.2), the frame is considered to contain artifacts.
When the signal quality score is below a preset threshold, the frame will be discarded or replaced with an interpolated composite frame. The interpolation composite frame is generated according to the data of the front and rear adjacent frames through a linear interpolation or spline interpolation technology. This way of processing ensures the continuity and natural transition of the signal.
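The per-frame quality gate and the interpolation-based repair can be sketched as follows, using the thresholds quoted above (10 dB SNR, 0.4 speech probability, 0.2 spectral distortion); how the three indicators themselves are measured is left abstract.

```python
def frame_quality_ok(snr_db, speech_prob, spectral_distortion,
                     snr_min=10.0, vad_min=0.4, sd_max=0.2):
    """Keep/discard decision per frame using the stated thresholds."""
    return (snr_db >= snr_min and speech_prob >= vad_min
            and spectral_distortion <= sd_max)

def repair(frames, ok_flags):
    """Replace each discarded frame by linear interpolation of its
    nearest valid neighbours; edge frames fall back to copying."""
    out = []
    for i, (f, ok) in enumerate(zip(frames, ok_flags)):
        if ok:
            out.append(list(f))
            continue
        prev = next((frames[j] for j in range(i - 1, -1, -1)
                     if ok_flags[j]), None)
        nxt = next((frames[j] for j in range(i + 1, len(frames))
                    if ok_flags[j]), None)
        if prev is not None and nxt is not None:
            out.append([(a + b) / 2.0 for a, b in zip(prev, nxt)])
        elif prev is not None:
            out.append(list(prev))
        elif nxt is not None:
            out.append(list(nxt))
        else:
            out.append(list(f))
    return out
```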
The noise artifact cancellation process employs wavelet decomposition and reconstruction techniques. First, the device matching signal is wavelet decomposed to decompose the signal into several levels of detail and approximation components. The detail component contains high frequency noise and artifact information, while the approximation component retains the main features of the signal. In each level of detail components, the system applies a threshold processing method (such as a soft threshold or a hard threshold) to perform noise suppression, and the threshold is dynamically set according to the noise standard deviation. The processed detail components are recombined with the approximation components, and the denoised signals are obtained through wavelet reconstruction.
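A one-level Haar wavelet decomposition with soft thresholding of the detail coefficients illustrates the denoising scheme described above; a real implementation would use multi-level decompositions and a threshold derived from the noise standard deviation.

```python
import math

def haar_denoise(x, threshold):
    """One-level Haar decomposition, soft thresholding of the detail
    (high-frequency) coefficients, and reconstruction.
    Assumes an even-length input."""
    root2 = math.sqrt(2.0)
    approx = [(x[2 * i] + x[2 * i + 1]) / root2 for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / root2 for i in range(len(x) // 2)]
    # soft threshold: shrink detail magnitudes towards zero
    soft = [math.copysign(max(abs(d) - threshold, 0.0), d) for d in detail]
    out = []
    for a, d in zip(approx, soft):
        out += [(a + d) / root2, (a - d) / root2]
    return out
```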
In the noise artifact eliminating process, the system also combines a directional filtering technology to perform directional processing on the detected artifact region. The directional filtering technique adaptively adjusts filter parameters to minimize artifacts based on transient characteristics and local statistical characteristics of the signal while preserving the natural characteristics of the speech. And performing repeated iterative optimization on the processed signal to obtain a target quality voice output signal.
In this embodiment, the voice signal quality of the multi-microphone array beamforming system is effectively improved through a series of signal processing steps. The frame selective discarding algorithm accurately evaluates the quality of each signal frame, and low-quality frames are discarded or replaced according to their quality scores, ensuring the consistency and naturalness of the voice signal. This process effectively eliminates noise interference and artifacts and avoids traces of artificial synthesis, making the voice signal clearer and more natural. Wavelet decomposition and reconstruction enable refined noise suppression during denoising and improve the fidelity of the signal. In addition, combined with directional filtering, the system adaptively suppresses artifacts while effectively preserving the original characteristics of the voice, further improving speech intelligibility.
Referring to fig. 2, the present invention further provides a multi-microphone array beam forming signal enhancement device for implementing any of the above multi-microphone array beam forming signal enhancement methods, including:
the acquisition module is used for acquiring multi-channel time domain signals acquired by the multi-microphone array, and performing time-frequency conversion to obtain complex time-frequency domain signals;
the analysis module is used for acquiring the spatial distribution vector of the multi-microphone array, and performing deformation compensation processing on the complex time-frequency domain signals to obtain sound field consistency signals;
the association module is used for extracting sound source direction characteristics of the sound field consistency signal and carrying out complex space-time attention beam forming network processing to obtain a beam forming signal;
the processing module is used for optimizing the beam direction and gain parameters of the beam forming signal through a preset adversarial environment simulator to obtain a dynamic frequency band enhancement signal;
And the control module is used for performing reverse time-frequency conversion on the dynamic frequency band enhancement signal and performing low-delay optimization to obtain a target quality voice output signal.
According to the multi-microphone array beam forming signal enhancement device, time-frequency conversion is performed on the time domain signals acquired by the multi-microphone array, and combining this with the array's spatial distribution vector information allows signal deformation to be compensated more accurately, improving the quality of the sound field consistency signal. By extracting the direction characteristics of the sound source and applying the complex space-time attention beam forming network, the extraction accuracy of the target voice signal is effectively improved, avoiding the signal attenuation or distortion of traditional signal enhancement methods. Introducing a preset adversarial environment simulator to optimize the beam direction and gain parameters enables adaptation to dynamic and complex acoustic environments, realizing adaptive adjustment of the beam direction and gain parameters and further improving signal clarity and voice recognition accuracy. Based on this technology, noise interference and echo effects are effectively reduced, ensuring that high-quality voice output signals are maintained across different environments and mitigating the sound quality degradation and voice distortion of traditional methods. In addition, the low-delay optimization design of the dynamic frequency band enhancement signal supports smooth operation of real-time voice interaction systems and enhances the robustness of the system and the user experience. By comprehensively considering space-time information and dynamic sound field characteristics, the device flexibly adapts to complex and changeable acoustic environments, improving the overall performance and application effect of the multi-microphone array system.
In this embodiment, the processor and the memory may be connected by a bus or other means. The memory may include volatile memory, such as random access memory, or nonvolatile memory, such as read only memory, flash memory, hard disk, or solid state disk. The processor may be a general-purpose processor, such as a central processing unit, a digital signal processor, an application specific integrated circuit, or one or more integrated circuits configured to implement embodiments of the present invention.
It should be noted that, for convenience and brevity of description, the specific working process of the above-described system and each module may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes using the descriptions and drawings of the present invention or direct or indirect application in other related technical fields are included in the scope of the present invention.