
CN121054008A - Lightweight voice frequency band expansion method, device, terminal and medium for edge equipment - Google Patents

Lightweight voice frequency band expansion method, device, terminal and medium for edge equipment

Info

Publication number
CN121054008A
CN121054008A
Authority
CN
China
Prior art keywords
amplitude spectrum
audio signal
frequency component
narrowband audio
logarithmic domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511605734.7A
Other languages
Chinese (zh)
Other versions
CN121054008B (en)
Inventor
刘鑫
闫永杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Elevoc Technology Co ltd
Original Assignee
Elevoc Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Elevoc Technology Co ltd filed Critical Elevoc Technology Co ltd
Priority to CN202511605734.7A priority Critical patent/CN121054008B/en
Publication of CN121054008A publication Critical patent/CN121054008A/en
Application granted granted Critical
Publication of CN121054008B publication Critical patent/CN121054008B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a lightweight voice frequency band expansion method, device, terminal and medium for edge equipment, belonging to the technical field of voice signal processing. The method comprises: constructing a white noise amplitude spectrum; inputting the logarithmic-domain narrowband audio signal amplitude spectrum, the logarithmic-domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model; generating a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal, and thereby obtaining a predicted amplitude spectrum; expanding the phase information of the narrowband audio signal; and generating and outputting a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information. By using the voice bandwidth expansion model to reconstruct consonants in a targeted way, the invention ensures that the reconstructed consonant components have higher fidelity, and thereby ensures the clarity of the reconstructed voice.

Description

Lightweight voice frequency band expansion method, device, terminal and medium for edge equipment
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a lightweight speech band expansion method, apparatus, terminal, and medium for edge devices.
Background
Voice bandwidth extension (Bandwidth Extension, BWE) technology aims at reconstructing the high-frequency components (2 kHz and above) missing from a narrowband signal (whose band is usually no higher than 3 kHz), thereby improving the perceptual clarity of speech; it is a key means of optimizing the voice experience of edge devices (such as mobile phones and Internet of Things terminals) under limited bandwidth. However, existing mainstream methods generally rely on half-wave rectification (Half-Wave Rectification, HWR) to recover high-frequency information, a reconstruction mechanism that performs high-frequency extrapolation based on the periodic harmonic structure of vowels. This mechanism is inherently incompatible with the aperiodic, noise-like spectral characteristics of consonants, causing distortion in the reconstructed high-frequency components of consonants and in turn degrading the intelligibility of the reconstructed speech.
Accordingly, the prior art has drawbacks and needs to be improved and developed.
Disclosure of Invention
The invention aims to solve the above technical problem of the prior art, namely that the reconstruction of the high-frequency components of consonants is distorted and the clarity of the overall speech is thereby affected, by providing a lightweight voice frequency band expansion method, device, terminal and storage medium for edge equipment.
The technical scheme adopted for solving the technical problems is as follows:
in a first aspect, an embodiment of the present invention provides an edge device-oriented lightweight speech band extension method, where the method includes:
Obtaining a narrowband audio signal from an edge device and preprocessing the narrowband audio signal to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum;
Constructing a white noise amplitude spectrum;
inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model, respectively generating a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal by utilizing the voice bandwidth expansion model, and obtaining a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component;
and expanding the phase information of the narrowband audio signal, generating a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal, and outputting the wideband audio signal.
In one embodiment, obtaining a narrowband audio signal from an edge device and preprocessing to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum, including:
obtaining a narrowband audio signal from an edge device;
Up-sampling the sampling rate of the narrowband audio signal to a preset target sampling rate to obtain an up-sampled time domain waveform;
performing half-wave rectification operation on the time domain waveform after up sampling to obtain a rectified time domain waveform;
respectively adopting a hanning window to carry out short-time Fourier transform on the time domain waveform and the rectified time domain waveform to obtain a narrow-band audio signal amplitude spectrum and a rectified narrow-band audio signal amplitude spectrum;
Superposing the narrow-band audio signal amplitude spectrum and the rectified narrow-band audio signal amplitude spectrum to obtain a mixed amplitude spectrum;
And respectively applying logarithmic transformation to the narrow-band audio signal amplitude spectrum and the mixed amplitude spectrum to obtain a logarithmic-domain narrow-band audio signal amplitude spectrum and a logarithmic-domain mixed amplitude spectrum.
In one embodiment, the speech bandwidth extension model includes a feature extraction module, a dual weighted gaussian mixture module, a fusion module, and a band-guided masking module.
In one embodiment, inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum, and the white noise amplitude spectrum into a trained speech bandwidth extension model, generating a first high frequency component corresponding to a consonant in a narrowband audio signal and a second high frequency component corresponding to a vowel in the narrowband audio signal, respectively, using the speech bandwidth extension model, and obtaining a predicted amplitude spectrum based on the first high frequency component and the second high frequency component, comprising:
Inputting the logarithmic domain mixed amplitude spectrum into the characteristic extraction module, and processing to obtain time sequence characteristics representing the characteristics of the narrowband audio signal;
Inputting the time sequence characteristics, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into the double-weighted Gaussian mixture module, generating two groups of control parameters based on the time sequence characteristics, and respectively driving a first weighted Gaussian mixture module and a second weighted Gaussian mixture module of the double-weighted Gaussian mixture module to operate so as to correspondingly generate a first high-frequency component and a second high-frequency component;
Inputting the first high-frequency component and the second high-frequency component into the fusion module for fusion, and outputting an initial predicted amplitude spectrum;
inputting the logarithmic domain narrowband audio signal amplitude spectrum and the initial predicted amplitude spectrum into the frequency band guiding masking module for optimization to obtain the predicted amplitude spectrum.
In one embodiment, inputting the first high frequency component and the second high frequency component into the fusion module for fusion, and outputting an initial predicted magnitude spectrum, includes:
Inputting the first high-frequency component and the second high-frequency component into the fusion module, and splicing the first high-frequency component and the second high-frequency component by using the fusion module to obtain splicing characteristics;
processing the spliced characteristics to generate a weight coefficient;
and performing weighted calculation on the first high-frequency component and the second high-frequency component based on the weight coefficient to obtain an initial predicted amplitude spectrum.
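The fusion steps above (splice, derive a weight coefficient, blend) can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the sigmoid gate, the single linear projection `w`, `b`, and the convex-combination form of the blend are all assumptions, since the patent only states that a weight coefficient is generated from the spliced features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(h_cons, h_vowel, w, b):
    """Splice the two high-frequency components, derive a per-bin weight
    coefficient from the spliced feature, and blend the components with it."""
    spliced = np.concatenate([h_cons, h_vowel], axis=-1)  # (T, 2F) splicing feature
    alpha = sigmoid(spliced @ w + b)                      # (T, F) weight in (0, 1)
    return alpha * h_cons + (1.0 - alpha) * h_vowel       # weighted calculation

# toy shapes for illustration
rng = np.random.default_rng(1)
T, F = 2, 4
h_cons, h_vowel = rng.normal(size=(T, F)), rng.normal(size=(T, F))
w, b = 0.1 * rng.normal(size=(2 * F, F)), np.zeros(F)
fused = fuse(h_cons, h_vowel, w, b)
```

Because the weight lies in (0, 1), each output bin is a convex combination of the two components, so the fused spectrum never leaves the range spanned by them.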
In one embodiment, expanding the phase information of the narrowband audio signal, generating and outputting a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal, includes:
Performing overturn expansion on the phase information of the narrowband audio signal;
Combining the predicted amplitude spectrum with the extended phase information of the narrowband audio signal, and performing inverse short-time Fourier transform by adopting a Hanning window to generate a wideband audio signal;
Outputting the broadband audio signal.
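A minimal sketch of the waveform reconstruction above, in numpy. The patent specifies flip ("overturn") expansion of the narrowband phase followed by a Hanning-window inverse STFT; the exact mirroring rule and the window-sum normalization used here are illustrative assumptions, and the frame parameters (1024/256) follow the values given later in the description.

```python
import numpy as np

def flip_extend_phase(phase_low, n_bins=513):
    """Fill the missing high-band phase by mirroring (flipping) the known
    low-band phase; repeats the mirror if the high band is wider. The
    mirroring rule is an illustrative assumption."""
    T, K = phase_low.shape
    out = np.zeros((T, n_bins))
    out[:, :K] = phase_low
    fill = n_bins - K
    mirrored = phase_low[:, ::-1]
    reps = -(-fill // K)  # ceiling division
    out[:, K:] = np.tile(mirrored, (1, reps))[:, :fill]
    return out

def istft(mag, phase, n_fft=1024, hop=256):
    """Inverse short-time Fourier transform by Hann-windowed overlap-add,
    normalized by the squared-window sum."""
    spec = mag * np.exp(1j * phase)
    frames = np.fft.irfft(spec, n=n_fft, axis=-1)
    win = np.hanning(n_fft)
    T = frames.shape[0]
    out = np.zeros((T - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for i in range(T):
        out[i * hop:i * hop + n_fft] += frames[i] * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

# combine a (toy) predicted magnitude with the flip-extended phase
T, K = 4, 257
phase_low = np.linspace(-np.pi, np.pi, T * K).reshape(T, K)
phase_full = flip_extend_phase(phase_low)
y = istft(np.ones((T, 513)), phase_full)
```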
In one embodiment, the training step of the speech bandwidth expansion model includes:
acquiring training data pairs, wherein the training data pairs consist of narrowband audio signal samples and corresponding real wideband audio samples;
preprocessing the narrowband audio signal sample to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum for training;
Constructing a white noise amplitude spectrum for training;
Constructing a training framework comprising a generator and a discriminator, wherein the generator is the voice bandwidth expansion model to be trained, and the discriminator is used for distinguishing generated wideband audio samples from real wideband audio samples;
inputting the training logarithmic-domain narrowband audio signal amplitude spectrum, the training logarithmic-domain mixed amplitude spectrum, and the training white noise amplitude spectrum into the generator to obtain a predicted training amplitude spectrum;
performing inverse short-time Fourier transform according to the predicted training amplitude spectrum and the phase information of the real wideband audio sample, generating a wideband audio sample and outputting the wideband audio sample;
calculating a training loss between the wideband audio sample and the real wideband audio sample;
and optimizing the parameters of the generator and the discriminator based on the training loss until the loss converges, so as to obtain a trained voice bandwidth expansion model.
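The patent does not fix the form of the training loss between the generated and real wideband samples. One plausible reconstruction term, sketched here purely for illustration, is an L1 distance between log-magnitude spectra (the adversarial terms from the discriminator would be added on top):

```python
import numpy as np

def log_spectral_l1(pred_mag, true_mag, eps=1e-5):
    """L1 distance between log-magnitude spectra; one common choice for the
    reconstruction part of a BWE training loss (an assumption, not the
    patent's stated loss). eps guards the logarithm."""
    return float(np.mean(np.abs(np.log(pred_mag + eps) - np.log(true_mag + eps))))

# toy check: identical spectra give zero loss, mismatched spectra do not
true = np.abs(np.random.default_rng(2).normal(size=(10, 513))) + 0.1
zero_loss = log_spectral_l1(true, true)
pos_loss = log_spectral_l1(2.0 * true, true)
```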
In a second aspect, an embodiment of the present invention further provides an edge device-oriented lightweight speech band expansion apparatus, where the apparatus includes:
the first preprocessing module is used for acquiring a narrowband audio signal from the edge equipment and preprocessing the narrowband audio signal to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum;
The second preprocessing module is used for constructing a white noise amplitude spectrum;
The amplitude spectrum prediction module is used for inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model, respectively generating a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal by utilizing the voice bandwidth expansion model, and obtaining a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component;
And the waveform reconstruction module is used for expanding the phase information of the narrowband audio signal, generating a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal and outputting the wideband audio signal.
In a third aspect, the embodiment of the invention further provides a terminal, which comprises a memory, a processor and an edge-device-oriented lightweight speech band expansion program stored in the memory and capable of running on the processor, wherein the edge-device-oriented lightweight speech band expansion program, when executed by the processor, realizes the steps of the edge-device-oriented lightweight speech band expansion method.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium storing an edge device-oriented lightweight speech band extension program, where the edge device-oriented lightweight speech band extension program can be executed to implement the steps of the edge device-oriented lightweight speech band extension method as described above.
The method has the following advantages. A narrowband audio signal is obtained from edge equipment and preprocessed to obtain a logarithmic-domain narrowband audio signal amplitude spectrum and a logarithmic-domain mixed amplitude spectrum, and a white noise amplitude spectrum is constructed. The three spectra are input into a trained voice bandwidth expansion model, which respectively generates a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels, and a predicted amplitude spectrum is obtained based on the two components. The phase information of the narrowband audio signal is then expanded, and a wideband audio signal is generated and output according to the predicted amplitude spectrum and the expanded phase information. By using the voice bandwidth expansion model to reconstruct consonants in a targeted way, the invention ensures that the reconstructed consonant components have higher fidelity, and thereby ensures the clarity of the reconstructed voice.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a lightweight speech band expansion method for an edge device according to the present invention.
Fig. 2 is a schematic flow chart of generating control parameters and generating weights in the present invention.
Fig. 3 is a flow chart of the present invention for generating a wideband audio signal.
Fig. 4 is a schematic structural diagram of a lightweight speech band expanding device facing an edge device according to a preferred embodiment of the present invention.
Fig. 5 is a schematic block diagram of a terminal of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clear and clear, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Voice bandwidth extension (Bandwidth Extension, BWE) technology aims at reconstructing the high-frequency components (2 kHz and above) missing from a narrowband signal (whose band is usually no higher than 3 kHz), thereby improving the perceptual clarity of speech; it is a key means of optimizing the voice experience of edge devices (such as mobile phones and Internet of Things terminals) under limited bandwidth. However, existing mainstream methods generally rely on half-wave rectification (Half-Wave Rectification, HWR) to recover high-frequency information, a reconstruction mechanism that performs high-frequency extrapolation based on the periodic harmonic structure of vowels. This mechanism is inherently incompatible with the aperiodic, noise-like spectral characteristics of consonants, causing distortion in the reconstructed high-frequency components of consonants and in turn degrading the intelligibility of the reconstructed speech.
Aiming at the above defects of the prior art, the invention provides a lightweight voice frequency band expansion method, device, terminal and medium for edge equipment. The method obtains a logarithmic-domain narrowband audio signal amplitude spectrum and a logarithmic-domain mixed amplitude spectrum, and constructs a white noise amplitude spectrum. The three spectra are input into a trained voice bandwidth expansion model, which respectively generates a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels, and a predicted amplitude spectrum is obtained based on the two components. The phase information of the narrowband audio signal is expanded, and a wideband audio signal is generated and output according to the predicted amplitude spectrum and the expanded phase information. By using the voice bandwidth expansion model to reconstruct consonants in a targeted way, the invention ensures that the reconstructed consonant components have higher fidelity, and thereby ensures the clarity of the reconstructed voice.
Referring to fig. 1, the lightweight voice band expansion method for edge devices according to the embodiment of the present invention includes the following steps:
Step S100, obtaining a narrowband audio signal from the edge equipment and preprocessing the narrowband audio signal to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum.
Specifically, the edge device may be a mobile phone or an Internet of Things terminal device. To obtain the logarithmic-domain narrowband audio signal amplitude spectrum and the logarithmic-domain mixed amplitude spectrum, the narrowband audio signal is first acquired from the edge device, and its sampling rate is then upsampled to a preset target sampling rate (for example 22050 Hz) to obtain an upsampled time-domain waveform x(n). A half-wave rectification operation is performed on the upsampled time-domain waveform to obtain a rectified time-domain waveform x_rect(n); it will be appreciated that x_rect(n) = max(x(n), 0). A short-time Fourier transform with a Hanning window is applied to the time-domain waveform and the rectified time-domain waveform respectively, yielding a narrowband audio signal amplitude spectrum M_nb and a rectified narrowband audio signal amplitude spectrum M_rect. When performing the short-time Fourier transform (Short-Time Fourier Transform, STFT) with the Hanning window, the frame length is 1024 sampling points, the frame shift is 256 sampling points, and the number of fast Fourier transform (Fast Fourier Transform, FFT) points is 1024, so that both amplitude spectra have 513 dimensions. The narrowband audio signal amplitude spectrum and the rectified narrowband audio signal amplitude spectrum are superposed to obtain a mixed amplitude spectrum; this process can be expressed as M_mix = M_nb + M_rect. Finally, a logarithmic transformation is applied to the narrowband audio signal amplitude spectrum and the mixed amplitude spectrum respectively, giving the logarithmic-domain narrowband audio signal amplitude spectrum and the logarithmic-domain mixed amplitude spectrum. The logarithmic-domain mixed amplitude spectrum can be expressed as:
M_mix_log = log(M_mix + ε);
wherein ε is a small constant for avoiding numerical instability in the logarithmic calculation.
The amplitude spectrum of the original narrowband audio signal has a wide dynamic range: the energy of the middle and low frequencies is far higher than that of the high frequencies, so directly inputting it into the voice bandwidth expansion model would mask the high-frequency details and reduce the high-frequency reconstruction precision, while the large numerical fluctuations would make training unstable and restrict the convergence of the model. The logarithmic transformation compresses the wide dynamic range of the original amplitude spectrum into the narrower range of the logarithmic domain, which reduces the masking of high-frequency details by low-frequency energy, suppresses the numerical fluctuations, and improves the convergence speed and stability of model training.
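The preprocessing of Step S100 can be sketched as follows in numpy, using the stated parameters (1024-point Hann-windowed frames, hop 256, 513 bins). The upsampling step is omitted and the signal is assumed to already be at the 22050 Hz target rate; the value of ε (1e-5 here) is an illustrative assumption.

```python
import numpy as np

def stft_mag(x, n_fft=1024, hop=256):
    """Magnitude spectrogram with a Hanning window: frame 1024, hop 256,
    1024-point FFT, giving 513 frequency bins per frame."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))  # (n_frames, 513)

def preprocess(x, eps=1e-5):
    """Log-domain narrowband and mixed magnitude spectra, as described above;
    eps (assumed value) guards the logarithm."""
    x_rect = np.maximum(x, 0.0)   # half-wave rectification: max(x(n), 0)
    m_nb = stft_mag(x)            # narrowband magnitude spectrum
    m_rect = stft_mag(x_rect)     # rectified magnitude spectrum
    m_mix = m_nb + m_rect         # superposition -> mixed magnitude spectrum
    return np.log(m_nb + eps), np.log(m_mix + eps)

# toy one-second narrowband tone at the assumed 22050 Hz target rate
t = np.arange(22050) / 22050.0
x = np.sin(2 * np.pi * 440.0 * t)
log_nb, log_mix = preprocess(x)
```

Since the rectified spectrum is non-negative, the mixed spectrum dominates the narrowband spectrum bin by bin, and the log transform preserves that ordering.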
Referring to fig. 1, the lightweight speech band expanding method for edge devices according to the embodiment of the present invention further includes the following steps:
Step S200, constructing a white noise amplitude spectrum.
Specifically, a white noise amplitude spectrum with an average value of -1 is constructed.
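A minimal sketch of Step S200. Only the mean of -1 is given in the description; the Gaussian distribution, its spread (0.1), and the time-frequency shape matching the 513-bin spectra are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F = 83, 513  # frames x bins, matching the preprocessed spectra (assumed)
# White-noise amplitude spectrum with mean -1 (log-domain scale); the
# Gaussian form and spread are assumptions, only the mean is specified.
noise_spec = rng.normal(loc=-1.0, scale=0.1, size=(T, F))
```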
Referring to fig. 1, the lightweight speech band expanding method for edge devices according to the embodiment of the present invention further includes the following steps:
Step S300, inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model, respectively generating a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal by utilizing the voice bandwidth expansion model, and obtaining a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component.
Specifically, the existing HWB-Net (High-Performance and Efficient Hybrid Waveform Bandwidth Extension Method) is a common voice bandwidth expansion model; with 194K parameters and 12M multiply-accumulate operations per second, it is convenient to deploy on edge devices. However, it is not optimized for the aperiodic, noise-like character of consonants (e.g., /f/, /s/). To overcome this defect, the invention designs a voice bandwidth expansion model (HWB-PLUS) that improves on the existing HWB-Net model and processes vowels and consonants separately through a dual-path reconstruction mechanism. The voice bandwidth expansion model comprises a feature extraction module, a dual weighted Gaussian mixture module, a fusion module, and a band-guided masking module.
In one implementation, inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum, and the white noise amplitude spectrum into a trained speech bandwidth extension model, generating a first high frequency component corresponding to a consonant in a narrowband audio signal and a second high frequency component corresponding to a vowel in the narrowband audio signal, respectively, using the speech bandwidth extension model, and obtaining a predicted amplitude spectrum based on the first high frequency component and the second high frequency component, including:
Inputting the logarithmic domain mixed amplitude spectrum into the characteristic extraction module, and processing to obtain time sequence characteristics representing the characteristics of the narrowband audio signal;
Inputting the time sequence characteristics, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into the double-weighted Gaussian mixture module, generating two groups of control parameters based on the time sequence characteristics, and respectively driving a first weighted Gaussian mixture module and a second weighted Gaussian mixture module of the double-weighted Gaussian mixture module to operate so as to correspondingly generate a first high-frequency component and a second high-frequency component;
Inputting the first high-frequency component and the second high-frequency component into the fusion module for fusion, and outputting an initial predicted amplitude spectrum;
inputting the logarithmic domain narrowband audio signal amplitude spectrum and the initial predicted amplitude spectrum into the frequency band guiding masking module for optimization to obtain the predicted amplitude spectrum.
Specifically, the feature extraction module comprises a linear-frequency-to-equivalent-rectangular-bandwidth conversion (Linear Frequency to Equivalent Rectangular Bandwidth Conversion, Linear2ERB) module and an encoder module. Because the resolution of the human ear differs across frequencies, higher in the low band and lower in the high band, the Linear2ERB module processes the received logarithmic-domain mixed amplitude spectrum with a bank of triangular ERB (Equivalent Rectangular Bandwidth) filters, mapping it onto an ERB-based perceptual frequency axis to obtain an equivalent rectangular bandwidth band spectrum, thus providing the subsequent modules with feature input that better matches human-ear perception. The equivalent rectangular bandwidth band spectrum has 128 dimensions and conforms better to the auditory characteristics of the human ear.
The encoder module comprises four one-dimensional convolutional layers and a grouped gated recurrent unit (Grouped Gated Recurrent Unit). The one-dimensional convolutional layers form the core feature-encoding module of the voice bandwidth expansion model: they perform local feature extraction and dimensionality compression on the equivalent rectangular bandwidth band spectrum, provide compact and meaningful spectral features for the subsequent decoding module, and support real-time streaming inference. The channel configuration of the four one-dimensional convolutional layers is 128, 64, 64, and 64 in order; each convolutional layer has a kernel size of 3 and uses a ReLU activation function. After local features are extracted by the four one-dimensional convolutional layers, the grouped gated recurrent unit captures the inter-frame temporal dependencies of the spectral features (such as the continuity of the speech fundamental frequency and the dynamic change of the formants) while meeting the requirements of model lightweighting and streaming inference, thereby providing temporally consistent feature support for the subsequent amplitude completion and phase optimization. The grouped gated recurrent unit consists of two layers of gated recurrent units whose input and hidden-layer dimensions are both 64. The grouped gated recurrent unit finally outputs a temporal feature H ∈ R^(T×F) characterizing the narrowband audio signal, where T is the number of time frames, F = 64, and R denotes the set of real numbers.
The dual weighted Gaussian mixture module (DualWGMM) comprises a first weighted Gaussian mixture module and a second weighted Gaussian mixture module. The first weighted Gaussian mixture module (ConsWGMM) is used for reconstructing the high-frequency components of consonants in the narrowband audio signal, and the second weighted Gaussian mixture module (VowelWGMM) is used for reconstructing the high-frequency components of vowels. The base signal of the first weighted Gaussian mixture module is the white noise amplitude spectrum, whose aperiodic, noise-like statistical characteristics completely match high-frequency consonants (such as /f/ and /s/), overcoming the inability of the prior art to adapt to consonants. The first weighted Gaussian mixture module finely models the 2000-10000 Hz high band (the core region of consonant energy) through 32 Gaussian components, thereby avoiding consonant reconstruction distortion. The base signal of the second weighted Gaussian mixture module is the logarithmic-domain mixed amplitude spectrum, which retains the enhancing effect of half-wave rectification on vowel harmonics found in the prior art and focuses on the high-frequency harmonic extension of low-frequency vowels; in this way the continuity of vowel reconstruction is maintained, and the degradation of vowel quality caused by splitting the model is avoided.
Considering the wideband distribution characteristics of the consonant high band, the means of the first weighted Gaussian mixture module are initialized to 32 values uniformly sampled over the linear frequency range 2000-10000 Hz, and the standard deviation is fixed at 10. In this way the flexibility and stability of the first weighted Gaussian mixture module are balanced.
In addition, the parameters of the weighted Gaussian mixture model in the existing HWB-Net model are mostly set empirically, without incorporating the nonlinear perceptual characteristics of the human auditory system. Human sensitivity to frequency follows the mel scale (high sensitivity to low and middle frequencies, decreasing sensitivity to high frequencies), but empirically chosen parameters do not focus on the perceptually critical bands of the mel scale. As a result, the model wastes computing resources on background noise, inaudible sounds, and other non-critical frequencies, cannot efficiently optimize perceptually relevant high-frequency details, and high-frequency reconstruction quality is limited. To address this lack of acoustic rationality in parameter initialization, the invention designs the initial means and standard deviations of the second weighted Gaussian mixture module based on the mel scale, ensuring that the parameters focus on the perceptually critical bands. Specifically, the initial mean μ_vow(k) of the second weighted Gaussian mixture module is set to the center frequency of the k-th filter in a mel-scale triangular filter bank, directly aligning with the low-to-middle band (the core distribution region of vowels) to which human hearing is most sensitive. The initial standard deviation σ_vow,init(k) is calculated from the bandwidth BW(k) of the k-th mel filter. In this way, the spectral coverage of each Gaussian component is kept consistent with that of the corresponding mel filter, avoiding resource waste at non-critical frequencies.
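The mel-scale initialization described above can be sketched as follows. The filter count (32), the 0-8000 Hz range, and the half-bandwidth rule for the standard deviation are illustrative assumptions, since the patent's exact formula is given only as an image and is not reproduced here.

```python
import math

def hz_to_mel(f):
    """Standard HTK-style mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_params(n_filters=32, f_low=0.0, f_high=8000.0):
    """Center frequencies and bandwidths of a mel-scale triangular filter bank.
    Filter k spans mel points k-1 .. k+1; its center is mel point k."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    mel_points = [lo + i * (hi - lo) / (n_filters + 1) for i in range(n_filters + 2)]
    hz_points = [mel_to_hz(m) for m in mel_points]
    centers = hz_points[1:-1]                      # candidate initial means
    bandwidths = [hz_points[k + 2] - hz_points[k] for k in range(n_filters)]
    sigmas = [bw / 2.0 for bw in bandwidths]       # assumed sigma = BW/2 rule
    return centers, sigmas

centers, sigmas = mel_filter_params()
# Mel spacing is denser at low frequency, so early filters are narrower:
# perceptually critical low/middle bands get more, tighter Gaussian components.
```

Because the mel warp is convex, the bandwidths (and hence the standard deviations) grow with frequency, which is exactly the "resource concentration on critical bands" effect the text describes.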
After the timing features, the logarithmic-domain mixed amplitude spectrum, and the white noise amplitude spectrum are input into the dual weighted Gaussian mixture module, the first weighted Gaussian mixture module and the second weighted Gaussian mixture module each process the timing features to generate their corresponding control parameters.
The first weighted Gaussian mixture module comprises two parallel linear layers belonging to two branches. In the branch that computes the final standard deviation, the linear layer of the branch processes the timing features, and the result is passed through a softplus function to obtain the standard deviation offset Δσ_cons(t,k) of the k-th Gaussian component at the t-th frame. The softplus function is a smooth approximation of the ReLU (Rectified Linear Unit) function that remains differentiable everywhere. The final standard deviation of the first weighted Gaussian mixture module is then computed from its initial standard deviation. The formula referred to here is:
σ_cons(t,k) = σ_cons,init + Δσ_cons(t,k) ;
In the branch that computes the predictive weights, the linear layer of the branch processes the timing features, and the result is passed through a Sigmoid activation function to obtain the predictive weight w_cons(t,k) of the k-th Gaussian component at the t-th frame. The Sigmoid activation function is a nonlinear function that maps any real input into the interval (0, 1). Then, σ_cons(t,k) and w_cons(t,k) are used as control parameters of the first weighted Gaussian mixture module, driving it to generate the first high-frequency component. The formulas involved in this process are as follows:
G_cons(t,f) = Σ_{k=1}^{32} w_cons(t,k) · N(f; μ_cons(k), σ_cons(t,k)) ;
S1(t,f) = G_cons(t,f) ⊙ W(t,f) ;
wherein G_cons(t,f) is the Gaussian mixture function that models the consonant high-frequency distribution, and W(t,f) is the white noise amplitude spectrum.
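The two-branch computation for one frame of the consonant module can be sketched in pure Python as follows. The linear-layer weights, frequency grid, and white-noise values are toy stand-ins, and the additive standard-deviation update reflects one plausible reading of the "offset" formulation, since the original formula images are not reproduced.

```python
import math
import random

def softplus(x):
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gaussian_pdf(f, mu, sigma):
    return math.exp(-0.5 * ((f - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def cons_wgmm_frame(h, w_sigma, w_weight, mus, sigma_init, freqs, white_noise):
    """One frame of the consonant branch: timing feature h -> high-frequency frame.
    Two parallel linear layers produce the std offsets and the mixture weights."""
    k = len(mus)
    d_sigma = [softplus(sum(wi * hi for wi, hi in zip(w_sigma[j], h))) for j in range(k)]
    sigmas = [sigma_init + d for d in d_sigma]          # final std = init + offset
    weights = [sigmoid(sum(wi * hi for wi, hi in zip(w_weight[j], h))) for j in range(k)]
    # Gaussian mixture envelope over frequency, then point-wise scaling of white noise
    g = [sum(weights[j] * gaussian_pdf(f, mus[j], sigmas[j]) for j in range(k)) for f in freqs]
    return [gv * nv for gv, nv in zip(g, white_noise)]

random.seed(1)
feat_dim, n_comp = 64, 32
h = [random.random() for _ in range(feat_dim)]
w_sigma = [[random.uniform(-0.05, 0.05) for _ in range(feat_dim)] for _ in range(n_comp)]
w_weight = [[random.uniform(-0.05, 0.05) for _ in range(feat_dim)] for _ in range(n_comp)]
mus = [2000 + j * (10000 - 2000) / (n_comp - 1) for j in range(n_comp)]  # uniform 2-10 kHz
freqs = [2000 + i * 250 for i in range(33)]
white_noise = [abs(random.gauss(0, 1)) for _ in freqs]
frame = cons_wgmm_frame(h, w_sigma, w_weight, mus, 10.0, freqs, white_noise)
```

The softplus keeps every offset positive (so the final standard deviation can never fall below its initial value), and the Sigmoid keeps each mixture weight in (0, 1); the mixture envelope then shapes the white-noise base signal point by point.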
The second weighted Gaussian mixture module likewise comprises two parallel linear layers belonging to two branches. In the branch that computes the final standard deviation, the linear layer of the branch processes the timing features, and the result is passed through a softplus function to obtain the standard deviation offset Δσ_vow(t,k) of the k-th Gaussian component at the t-th frame. The final standard deviation of the second weighted Gaussian mixture module is then computed from its initial standard deviation. The formula referred to here is:
σ_vow(t,k) = σ_vow,init(k) + Δσ_vow(t,k) ;
In the branch that computes the predictive weights, the linear layer of the branch processes the timing features, and the result is passed through a Sigmoid activation function to obtain the predictive weight w_vow(t,k) of the k-th Gaussian component at the t-th frame. Finally, σ_vow(t,k) and w_vow(t,k) are used as control parameters of the second weighted Gaussian mixture module, driving it to generate the second high-frequency component. The formulas involved in this process are as follows:
G_vow(t,f) = Σ_{k} w_vow(t,k) · N(f; μ_vow(k), σ_vow(t,k)) ;
S2(t,f) = G_vow(t,f) ⊙ M(t,f) ;
wherein G_vow(t,f) is the Gaussian mixture function that models the vowel pitch-harmonic frequency distribution, f is the frequency variable, ⊙ denotes point-wise multiplication, M(t,f) is the logarithmic-domain mixed amplitude spectrum of the narrowband audio signal, and N(·; μ, σ) denotes the probability density function of a Gaussian distribution.
In one implementation, inputting the first high frequency component and the second high frequency component into the fusion module for fusion, outputting an initial predicted magnitude spectrum, comprising:
Inputting the first high-frequency component and the second high-frequency component into the fusion module, and splicing the first high-frequency component and the second high-frequency component by using the fusion module to obtain splicing characteristics;
processing the spliced characteristics to generate a weight coefficient;
and performing weighted calculation on the first high-frequency component and the second high-frequency component based on the weight coefficient to obtain an initial predicted amplitude spectrum.
Specifically, the first high-frequency component S1 and the second high-frequency component S2 are concatenated along the feature dimension to obtain the spliced feature. A linear layer inside the fusion module transforms the spliced feature and maps it to a scalar, which is then passed through a Sigmoid activation function to generate a frame-level weight coefficient α(t) in the range [0, 1]. If the current frame is dominated by vowels, α(t) is near 1; if dominated by consonants, α(t) is near 0. The initial predicted amplitude spectrum is then calculated according to the following formula: S_init(t,f) = α(t) · S2(t,f) + (1 - α(t)) · S1(t,f). In this way a natural transition between vowels and consonants is ensured. A schematic flow chart of the generation of control parameters and the generation of weights in the present invention is shown in fig. 2.
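The frame-level fusion can be sketched as follows. The scalar linear-mapping weights are toy values; only the gating mechanism (concatenate, map to scalar, Sigmoid, blend) follows the description above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_frame(s1, s2, w_lin, b_lin):
    """Concatenate consonant (s1) and vowel (s2) frames, map to a scalar,
    squash to a (0,1) gate alpha, and blend: alpha*s2 + (1-alpha)*s1."""
    spliced = s1 + s2                                   # concatenation along features
    alpha = sigmoid(sum(w * v for w, v in zip(w_lin, spliced)) + b_lin)
    return [alpha * b + (1.0 - alpha) * a for a, b in zip(s1, s2)], alpha

s_cons = [0.2, 0.1, 0.05, 0.0]        # toy first high-frequency component
s_vow = [0.9, 0.8, 0.6, 0.4]          # toy second high-frequency component
w_lin = [0.5] * 8                     # toy linear-layer weights for the 8-dim splice
fused, alpha = fuse_frame(s_cons, s_vow, w_lin, 0.0)
```

Because the gate is a convex weight, every fused bin lies between the consonant and vowel values for that bin, which is what produces the smooth vowel/consonant transition.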
Referring to fig. 1, the lightweight speech band expanding method for edge devices according to the embodiment of the present invention further includes the following steps:
step 400, expanding the phase information of the narrowband audio signal, generating a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal, and outputting the wideband audio signal.
Specifically, the phase information of the narrowband audio signal is extended by flipping (mirroring); the predicted amplitude spectrum is then combined with the extended phase information of the narrowband audio signal, and an inverse short-time Fourier transform with a Hanning window is applied to generate the wideband audio signal, which is output. This process can be expressed as y = ISTFT(Ŝ · e^(jφ)), wherein y is the wideband audio signal, ISTFT(·) is the inverse short-time Fourier transform, Ŝ is the predicted amplitude spectrum, e is the base of the natural logarithm, j is the imaginary unit, and φ is the extended phase information of the narrowband audio signal. The flow of generating a wideband audio signal in the present invention may be as shown in fig. 3.
In one implementation, the training step of the speech bandwidth expansion model includes:
acquiring training data pairs, wherein the training data pairs consist of narrowband audio signal samples and corresponding real wideband audio samples;
preprocessing the narrowband audio signal sample to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum for training;
Constructing a white noise amplitude spectrum for training;
Constructing a training framework comprising a generator and a discriminator, wherein the generator expands a model for the bandwidth of the voice to be trained, and the discriminator is used for distinguishing a generated broadband audio sample from a real broadband audio sample;
inputting the training logarithmic-domain narrowband audio signal amplitude spectrum, the logarithmic-domain mixed amplitude spectrum, and the white noise amplitude spectrum into the generator to obtain a predicted training amplitude spectrum;
performing inverse short-time Fourier transform according to the predicted training amplitude spectrum and the phase information of the real wideband audio sample, generating a wideband audio sample and outputting the wideband audio sample;
calculating a training loss between the wideband audio sample and the real wideband audio sample;
and optimizing the parameters of the generator and the discriminator based on the training loss until the loss converges, to obtain the trained speech bandwidth expansion model.
Specifically, the loss function in the invention combines the following terms:
wherein L_wav is the waveform loss, L_mrstft is the multi-resolution short-time Fourier transform loss, L_adv is the adversarial loss, and L_fm is the feature matching loss.
The waveform loss L_wav is calculated as follows:
L_wav = (1/T) · Σ_{t=1}^{T} | y_t - ŷ_t | ;
wherein T is the total number of frames of the narrowband audio signal samples and t is the frame index; ŷ_t is the t-th frame of the wideband audio signal predicted during training, and y_t is the t-th frame of the true wideband audio sample.
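A minimal sketch of a frame-wise L1 waveform loss consistent with the description above; the exact norm in the patent's formula image is an assumption.

```python
def waveform_loss(pred_frames, true_frames):
    """Mean absolute error between predicted and true wideband frames.
    Each argument is a list of frames; each frame is a list of samples."""
    t_total = len(true_frames)
    per_frame = [sum(abs(p - q) for p, q in zip(pf, tf)) / len(tf)
                 for pf, tf in zip(pred_frames, true_frames)]
    return sum(per_frame) / t_total

pred = [[0.0, 0.5], [1.0, 1.0]]
true = [[0.0, 0.0], [1.0, 0.0]]
loss = waveform_loss(pred, true)  # frame errors 0.25 and 0.5, mean 0.375
```

An L1 (rather than L2) waveform term is a common choice in neural vocoder training because it is less sensitive to occasional large sample errors.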
The multi-resolution short-time Fourier transform loss L_mrstft is calculated as follows:
L_mrstft = (1/3) · Σ_{i=1}^{3} ( L_sc^(i) + L_mag^(i) ) ;
wherein L_sc^(i) denotes the spectral convergence loss at the i-th resolution and L_mag^(i) denotes the log-magnitude spectral loss at the i-th resolution. The index i ranges from 1 to 3, corresponding to three short-time Fourier transform configurations with different time-frequency resolutions: the first uses an FFT size of 512, frame shift 50, and window length 240; the second an FFT size of 1024, frame shift 120, and window length 600; the third an FFT size of 2048, frame shift 240, and window length 1200.
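Given magnitude spectra already computed at each resolution, the two per-resolution terms can be sketched as follows. The STFT itself is omitted, and the uniform averaging over the three configurations is an assumption.

```python
import math

def spectral_convergence(mag_pred, mag_true):
    """Frobenius-norm ratio ||S_true - S_pred||_F / ||S_true||_F."""
    num = math.sqrt(sum((t - p) ** 2 for rp, rt in zip(mag_pred, mag_true)
                        for p, t in zip(rp, rt)))
    den = math.sqrt(sum(t ** 2 for rt in mag_true for t in rt))
    return num / den

def log_magnitude_loss(mag_pred, mag_true, eps=1e-7):
    """Mean absolute difference of log magnitude spectra."""
    n = sum(len(r) for r in mag_true)
    return sum(abs(math.log(t + eps) - math.log(p + eps))
               for rp, rt in zip(mag_pred, mag_true)
               for p, t in zip(rp, rt)) / n

def mr_stft_loss(pairs):
    """Average (spectral convergence + log magnitude) loss over resolution groups.
    pairs: list of (mag_pred, mag_true), one per STFT configuration."""
    terms = [spectral_convergence(p, t) + log_magnitude_loss(p, t) for p, t in pairs]
    return sum(terms) / len(terms)

# Toy magnitude spectra for three resolution groups (rows = frames, cols = bins)
pairs = [([[1.0, 2.0]], [[1.0, 2.0]]),
         ([[1.0, 1.0]], [[2.0, 2.0]]),
         ([[3.0]], [[3.0]])]
loss = mr_stft_loss(pairs)
```

Using several window/FFT configurations balances time resolution (short windows, good for consonant transients) against frequency resolution (long windows, good for vowel harmonics).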
The adversarial loss L_adv includes the discriminator loss L_D and the generator loss L_G. The discriminator loss is calculated as L_D = E[ ||D(y) - 1||_2 ] + E[ ||D(G(x))||_2 ], and the generator loss as L_G = E[ ||D(G(x)) - 1||_2 ], wherein D(y) represents the output of the discriminator for a true wideband audio sample, D(G(x)) represents the output of the discriminator for the sample generated from the narrowband audio signal sample, ||·||_2 represents the L2 distance, y is the true wideband audio sample, and x is the narrowband audio signal sample.
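Assuming the least-squares form implied by the L2-distance description (the exact formula images are not reproduced), the two adversarial losses can be sketched as:

```python
def discriminator_loss(d_real_outputs, d_fake_outputs):
    """Push D(real) toward 1 and D(fake) toward 0 (least-squares GAN style)."""
    real_term = sum((d - 1.0) ** 2 for d in d_real_outputs) / len(d_real_outputs)
    fake_term = sum(d ** 2 for d in d_fake_outputs) / len(d_fake_outputs)
    return real_term + fake_term

def generator_loss(d_fake_outputs):
    """Push D(fake) toward 1 so generated wideband audio fools the discriminator."""
    return sum((d - 1.0) ** 2 for d in d_fake_outputs) / len(d_fake_outputs)

# A perfect discriminator (1 on real, 0 on fake) has zero discriminator loss,
# and a perfectly fooled discriminator yields zero generator loss:
d_loss = discriminator_loss([1.0, 1.0], [0.0, 0.0])
g_loss = generator_loss([1.0, 1.0])
```

The least-squares formulation avoids the vanishing-gradient problem of the original cross-entropy GAN objective, which helps stabilize training of lightweight generators.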
The feature matching loss L_fm is calculated as L_fm = Σ_{l=1}^{L} || D_l(y) - D_l(G(x)) ||_1, wherein L is the number of layers of the discriminator and D_l(·) represents the features of the l-th layer of the discriminator.
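A sketch of the layer-wise feature matching term; the per-layer mean normalization is an assumption not stated in the text.

```python
def feature_matching_loss(real_feats, fake_feats):
    """Sum over discriminator layers of the mean absolute feature difference.
    real_feats / fake_feats: one feature list per discriminator layer."""
    total = 0.0
    for rf, ff in zip(real_feats, fake_feats):
        total += sum(abs(r - f) for r, f in zip(rf, ff)) / len(rf)
    return total

real = [[1.0, 2.0], [0.5]]   # toy layer-1 and layer-2 features of a real sample
fake = [[1.0, 0.0], [1.5]]   # toy features of a generated sample
loss = feature_matching_loss(real, fake)  # layer errors 1.0 + 1.0 = 2.0
```

Matching intermediate discriminator features gives the generator a denser training signal than the scalar adversarial output alone.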
The invention compares the proposed method (HWB-Plus) against the prior-art HWB-Net (High-Performance and Efficient Hybrid Waveform Bandwidth Extension) method, the band-limited sinc interpolation method, and the BAE-Lite (bandwidth-adaptive extension neural network) method using the following indexes. The Log Spectral Distance (LSD) quantifies the difference between the reconstructed and real spectra; smaller is better. The Deep Noise Suppression Mean Opinion Score (DNSMOS), conforming to the P.808 criterion, assesses the overall listening quality of speech; larger is better, with range [0, 5]. The Perceptual Evaluation of Speech Quality (PESQ) method evaluates speech quality; larger is better, with range [-0.5, 4.5]. The Virtual Speech Quality Objective Listener (VISQOL) evaluates objective speech quality; larger is better, with range [0, 5]. Non-Intrusive Speech Quality Assessment (NISQA) also evaluates objective speech quality; larger is better, with range [0, 5]. Deployment efficiency is measured using the number of parameters (Params) and multiply-accumulate operations per second (Multiply-Accumulate Operations per Second, MACs).
The performance index comparisons for the four methods are shown in table 1:
TABLE 1
As can be seen from Table 1, with parameter count and computation essentially the same as HWB-Net, the proposed HWB-Plus method is comprehensively better than BAE-Lite and HWB-Net on the perceptual indexes, demonstrating that performance improves while the model remains equally lightweight: consonant distortion is significantly reduced and the speech sounds more natural. Although the log spectral distance is slightly higher than that of BAE-Lite, the perceived quality better fits human hearing, which suits the practical application scenario of edge-device speech band expansion.
In addition, to verify the contributions of the dual weighted Gaussian mixture module, the mel-scale parameter initialization, and the logarithmic transformation, the indexes were re-evaluated after removing each of the three in turn; the results are shown in Table 2.
TABLE 2
As can be seen from Table 2, after removing the dual weighted Gaussian mixture module, DNSMOS and VISQOL each drop by 0.19, proving that separate vowel/consonant modeling is crucial to perceived quality. After removing the mel-scale parameter initialization, the log spectral distance rises to 1.07, indicating that the acoustic rationality of the parameters directly affects spectral matching accuracy. After removing the logarithmic transformation, DNSMOS drops to 3.31, verifying its effect on high-frequency detail retention and training stability.
In summary, the method obtains a narrowband audio signal from an edge device and preprocesses it to obtain a logarithmic-domain narrowband audio signal amplitude spectrum and a logarithmic-domain mixed amplitude spectrum; constructs a white noise amplitude spectrum; inputs the logarithmic-domain narrowband audio signal amplitude spectrum, the logarithmic-domain mixed amplitude spectrum, and the white noise amplitude spectrum into a trained speech bandwidth expansion model; uses the model to generate a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal; obtains a predicted amplitude spectrum based on the two components; extends the phase information of the narrowband audio signal; and generates and outputs a wideband audio signal according to the predicted amplitude spectrum and the extended phase information. By reconstructing consonants in a targeted way, the speech bandwidth expansion model ensures higher fidelity of the reconstructed consonant components and thereby the clarity of the reconstructed speech.
In an embodiment, as shown in fig. 4, based on the foregoing method for expanding a lightweight speech band for an edge device, the present invention further correspondingly provides a lightweight speech band expanding device for an edge device, where the device includes:
The first preprocessing module 100 is configured to obtain a narrowband audio signal from an edge device and perform preprocessing to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum;
a second preprocessing module 200 for constructing a white noise magnitude spectrum;
The amplitude spectrum prediction module 300 is configured to input the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum to a trained speech bandwidth expansion model, generate a first high-frequency component corresponding to a consonant in a narrowband audio signal and a second high-frequency component corresponding to a vowel in the narrowband audio signal by using the speech bandwidth expansion model, and obtain a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component;
The waveform reconstruction module 400 is configured to expand the phase information of the narrowband audio signal, generate a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal, and output the wideband audio signal.
In one embodiment, the first preprocessing module includes:
an audio acquisition unit for acquiring a narrowband audio signal from an edge device;
The up-sampling unit is used for up-sampling the sampling rate of the narrowband audio signal to a preset target sampling rate to obtain an up-sampled time domain waveform;
the half-wave rectification unit is used for performing half-wave rectification operation on the time domain waveform after up sampling to obtain a rectified time domain waveform;
the first short-time Fourier transform unit is used for performing short-time Fourier transforms with a Hanning window on the time-domain waveform and the rectified time-domain waveform to obtain a narrowband audio signal amplitude spectrum and a rectified narrowband audio signal amplitude spectrum, and superposing the two amplitude spectra to obtain a mixed amplitude spectrum;
and the logarithmic transformation unit is used for respectively applying logarithmic transformation to the narrow-band audio signal amplitude spectrum and the mixed amplitude spectrum to obtain a logarithmic domain narrow-band audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum.
In one embodiment, the amplitude spectrum prediction module comprises:
The time sequence feature generating unit is used for inputting the logarithmic domain mixed amplitude spectrum into the feature extracting module and obtaining time sequence features representing the features of the narrowband audio signals through processing;
The high-frequency component generating unit is used for inputting the time sequence characteristics, the white noise amplitude spectrum and the logarithmic domain amplitude spectrum into the double-weighted Gaussian mixture module, generating two groups of control parameters based on the time sequence characteristics, and respectively driving a first weighted Gaussian mixture module and a second weighted Gaussian mixture module of the double-weighted Gaussian mixture module to operate so as to correspondingly generate a first high-frequency component and a second high-frequency component;
The fusion unit is used for inputting the first high-frequency component and the second high-frequency component into the fusion module for fusion and outputting an initial predicted amplitude spectrum;
And the optimizing unit is used for inputting the logarithmic domain narrowband audio signal amplitude spectrum and the initial predicted amplitude spectrum into the frequency band guiding masking module for optimization to obtain the predicted amplitude spectrum.
In one embodiment, the apparatus further comprises:
The splicing unit is used for inputting the first high-frequency component and the second high-frequency component into the fusion module, and splicing the first high-frequency component and the second high-frequency component by using the fusion module to obtain splicing characteristics;
the weight generating unit is used for processing the splicing characteristics and generating weight coefficients;
and the initial prediction amplitude spectrum generation unit is used for performing weighted calculation on the first high-frequency component and the second high-frequency component based on the weight coefficient to obtain an initial prediction amplitude spectrum.
In one embodiment, the waveform reconstruction module includes:
the phase expansion unit is used for carrying out overturn expansion on the phase information of the narrowband audio signal;
the first inverse short-time Fourier transform unit is used for combining the predicted amplitude spectrum with the extended phase information of the narrowband audio signal, and performing inverse short-time Fourier transform by adopting a Hanning window to generate a broadband audio signal;
and the output unit is used for outputting the broadband audio signal.
In one embodiment, the apparatus further comprises:
the training data pair acquisition unit is used for acquiring training data pairs, wherein the training data pairs consist of narrowband audio signal samples and corresponding real broadband audio samples;
the first training preprocessing unit is used for preprocessing the narrowband audio signal sample to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum for training;
the second training preprocessing unit is used for constructing a white noise amplitude spectrum for training;
The training framework construction unit is used for constructing a training framework comprising a generator and a discriminator, wherein the generator is a speech bandwidth expansion model to be trained, and the discriminator is used for distinguishing a generated broadband audio sample from a real broadband audio sample;
A training unit for inputting the training logarithmic domain narrowband audio signal amplitude spectrum, logarithmic domain mixed amplitude spectrum and white noise amplitude spectrum into a generator, obtaining a predicted training amplitude spectrum;
the second inverse short-time Fourier transform unit is used for performing inverse short-time Fourier transform according to the predicted training amplitude spectrum and the phase information of the real wideband audio sample, generating a wideband audio sample and outputting the wideband audio sample;
a loss calculation unit for calculating a training loss between the wideband audio sample and the real wideband audio sample;
and the parameter optimization unit is used for optimizing the parameters of the generator and the discriminator based on the training loss until the loss converges, to obtain the trained speech bandwidth expansion model.
Based on the above embodiments, the present invention further provides a terminal, whose schematic structure may be as shown in fig. 5. The terminal comprises a processor, a memory, a network interface, and a display screen connected through a system bus. The processor of the terminal provides computing and control capabilities. The memory of the terminal includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and an edge-device-oriented lightweight speech band expansion program. The internal memory provides an environment for the operation of the operating system and the edge-device-oriented lightweight speech band expansion program in the nonvolatile storage medium. The network interface of the terminal is used for communicating with external terminals through a network connection. When executed by the processor, the edge-device-oriented lightweight speech band expansion program implements any of the edge-device-oriented lightweight speech band expansion methods. The display screen of the terminal may be a liquid crystal display screen or an electronic ink display screen.
It will be appreciated by those skilled in the art that the schematic structural diagram shown in fig. 5 is merely a schematic diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the terminal to which the present inventive arrangements are applied, and that a particular terminal may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a terminal is provided, where the terminal includes a memory, a processor, and an edge-device-oriented lightweight speech band expansion program stored in the memory and capable of running on the processor, where the step of any one of the edge-device-oriented lightweight speech band expansion methods provided by the embodiments of the present invention is implemented when the edge-device-oriented lightweight speech band expansion program is executed by the processor.
The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores an edge equipment-oriented lightweight voice band expansion program, and the edge equipment-oriented lightweight voice band expansion program realizes the steps of any one of the edge equipment-oriented lightweight voice band expansion methods provided by the embodiment of the invention when being executed by a processor.
It should be understood that the sequence number of each step in the above embodiment does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not be construed as limiting the implementation process of the embodiment of the present invention.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working process of the units and modules in the above device may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units described above is merely a logical function division, and may be implemented in other manners, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed.
The embodiments described above are only for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments may be modified or some of the technical features may be replaced equally, and that the modifications or replacements are not essential to the corresponding technical solution but are included in the scope of protection of the present invention.

Claims (10)

1.一种面向边缘设备的轻量化语音频带拓展方法,其特征在于,所述方法包括:1. A lightweight voice band extension method for edge devices, characterized in that the method includes: 从边缘设备获取窄带音频信号并进行预处理,得到对数域窄带音频信号幅度谱和对数域混合幅度谱;Narrowband audio signals are acquired from edge devices and preprocessed to obtain the logarithmic domain narrowband audio signal amplitude spectrum and the logarithmic domain mixed amplitude spectrum. 构建白噪声幅度谱;Construct the amplitude spectrum of white noise; 将所述对数域窄带音频信号幅度谱、所述对数域混合幅度谱和所述白噪声幅度谱输入至已训练的语音带宽拓展模型,利用所述语音带宽拓展模型分别生成与窄带音频信号中辅音对应的第一高频分量和与窄带音频信号中元音对应的第二高频分量,并基于所述第一高频分量和所述第二高频分量得到预测幅度谱;The logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum, and the white noise amplitude spectrum are input into a trained speech bandwidth expansion model. The speech bandwidth expansion model is used to generate a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal, respectively. The predicted amplitude spectrum is obtained based on the first high-frequency component and the second high-frequency component. 对所述窄带音频信号的相位信息进行扩展,根据预测幅度谱和窄带音频信号的扩展后相位信息,生成宽带音频信号并输出。The phase information of the narrowband audio signal is extended, and a wideband audio signal is generated and output based on the predicted amplitude spectrum and the extended phase information of the narrowband audio signal. 2.根据权利要求1所述的面向边缘设备的轻量化语音频带拓展方法,其特征在于,从边缘设备获取窄带音频信号并进行预处理,得到对数域窄带音频信号幅度谱和对数域混合幅度谱,包括:2. 
The lightweight audio band extension method for edge devices according to claim 1, characterized in that, narrowband audio signals are acquired from the edge device and preprocessed to obtain the logarithmic domain narrowband audio signal amplitude spectrum and the logarithmic domain mixed amplitude spectrum, including: 从边缘设备获取窄带音频信号;Acquire narrowband audio signals from edge devices; 将所述窄带音频信号的采样率上采样到预设的目标采样率,得到上采样后的时域波形;The sampling rate of the narrowband audio signal is upsampled to a preset target sampling rate to obtain the upsampled time-domain waveform; 对上采样后的所述时域波形进行半波整流操作,得到整流后时域波形;The upsampled time-domain waveform is subjected to half-wave rectification to obtain the rectified time-domain waveform. 对所述时域波形和所述整流后时域波形分别采用汉宁窗进行短时傅里叶变换,得到窄带音频信号幅度谱和整流后窄带音频信号幅度谱;The Hanning window is used to perform short-time Fourier transform on the time-domain waveform and the rectified time-domain waveform respectively to obtain the amplitude spectrum of the narrowband audio signal and the amplitude spectrum of the rectified narrowband audio signal. 将所述窄带音频信号幅度谱与所述整流后窄带音频信号幅度谱进行叠加,得到混合幅度谱;The amplitude spectrum of the narrowband audio signal is superimposed with the amplitude spectrum of the rectified narrowband audio signal to obtain a mixed amplitude spectrum; 对所述窄带音频信号幅度谱和所述混合幅度谱分别施加对数变换,得到对数域窄带音频信号幅度谱和对数域混合幅度谱。Logarithmic transformations are applied to the amplitude spectrum of the narrowband audio signal and the mixed amplitude spectrum to obtain the amplitude spectrum of the narrowband audio signal in the logarithmic domain and the mixed amplitude spectrum in the logarithmic domain. 3.根据权利要求1所述的面向边缘设备的轻量化语音频带拓展方法,其特征在于,所述语音带宽拓展模型包括特征提取模块、双加权高斯混合模块、融合模块和频带引导掩蔽模块。3. The lightweight speech bandwidth expansion method for edge devices according to claim 1, wherein the speech bandwidth expansion model includes a feature extraction module, a double-weighted Gaussian mixture module, a fusion module, and a frequency band guidance masking module. 
4. The lightweight speech bandwidth extension method for edge devices according to claim 3, characterized in that inputting the logarithmic-domain narrowband audio signal amplitude spectrum, the logarithmic-domain mixed amplitude spectrum, and the white-noise amplitude spectrum into the trained speech bandwidth extension model, using the speech bandwidth extension model to generate the first high-frequency component corresponding to consonants in the narrowband audio signal and the second high-frequency component corresponding to vowels in the narrowband audio signal, and obtaining the predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component, includes:
inputting the logarithmic-domain mixed amplitude spectrum into the feature extraction module and processing it, to obtain temporal features characterizing the narrowband audio signal;
inputting the temporal features, the logarithmic-domain mixed amplitude spectrum, and the white-noise amplitude spectrum into the dual-weighted Gaussian mixture module, and generating two sets of control parameters from the temporal features that respectively drive the first weighted Gaussian mixture module and the second weighted Gaussian mixture module of the dual-weighted Gaussian mixture module, so as to generate the first high-frequency component and the second high-frequency component accordingly;
inputting the first high-frequency component and the second high-frequency component into the fusion module for fusion, and outputting an initial predicted amplitude spectrum;
inputting the logarithmic-domain narrowband audio signal amplitude spectrum and the initial predicted amplitude spectrum into the band-guided masking module for refinement, to obtain the predicted amplitude spectrum.

5. The lightweight speech bandwidth extension method for edge devices according to claim 4, characterized in that inputting the first high-frequency component and the second high-frequency component into the fusion module for fusion and outputting the initial predicted amplitude spectrum includes:
inputting the first high-frequency component and the second high-frequency component into the fusion module, and concatenating the first high-frequency component and the second high-frequency component with the fusion module, to obtain a concatenated feature;
processing the concatenated feature to generate weight coefficients;
performing a weighted computation on the first high-frequency component and the second high-frequency component based on the weight coefficients, to obtain the initial predicted amplitude spectrum.

6. The lightweight speech bandwidth extension method for edge devices according to claim 1, characterized in that extending the phase information of the narrowband audio signal, and generating and outputting the wideband audio signal based on the predicted amplitude spectrum and the extended phase information of the narrowband audio signal, includes:
flipping and extending the phase information of the narrowband audio signal;
combining the predicted amplitude spectrum with the extended phase information of the narrowband audio signal and performing a Hanning-windowed inverse short-time Fourier transform, to generate the wideband audio signal;
outputting the wideband audio signal.

7. The lightweight speech bandwidth extension method for edge devices according to claim 1, characterized in that training the speech bandwidth extension model includes:
acquiring training data pairs, each consisting of a narrowband audio signal sample and its corresponding real wideband audio sample;
preprocessing the narrowband audio signal samples, to obtain a logarithmic-domain narrowband audio signal amplitude spectrum and a logarithmic-domain mixed amplitude spectrum for training;
constructing a white-noise amplitude spectrum for training;
constructing a training framework containing a generator and a discriminator, the generator being the speech bandwidth extension model to be trained, and the discriminator being used to distinguish generated wideband audio samples from real wideband audio samples;
inputting the training logarithmic-domain narrowband audio signal amplitude spectrum, logarithmic-domain mixed amplitude spectrum, and white-noise amplitude spectrum into the generator, to obtain a predicted training amplitude spectrum;
performing an inverse short-time Fourier transform based on the predicted training amplitude spectrum and the phase information of the real wideband audio samples, to generate and output wideband audio samples;
computing the training loss between the generated wideband audio samples and the real wideband audio samples;
optimizing the parameters of the generator and the discriminator based on the training loss until the loss converges, to obtain the trained speech bandwidth extension model.

8. A lightweight speech bandwidth extension apparatus for edge devices, characterized in that it includes:
a first preprocessing module, configured to acquire a narrowband audio signal from an edge device and preprocess it, to obtain a logarithmic-domain narrowband audio signal amplitude spectrum and a logarithmic-domain mixed amplitude spectrum;
a second preprocessing module, configured to construct a white-noise amplitude spectrum;
an amplitude spectrum prediction module, configured to input the logarithmic-domain narrowband audio signal amplitude spectrum, the logarithmic-domain mixed amplitude spectrum, and the white-noise amplitude spectrum into a trained speech bandwidth extension model, use the speech bandwidth extension model to generate a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal, and obtain a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component;
a waveform reconstruction module, configured to extend the phase information of the narrowband audio signal, and to generate and output a wideband audio signal based on the predicted amplitude spectrum and the extended phase information of the narrowband audio signal.

9. A terminal, characterized in that the terminal includes: a memory, a processor, and a lightweight speech bandwidth extension program for edge devices that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the lightweight speech bandwidth extension method for edge devices according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a lightweight speech bandwidth extension program for edge devices, wherein the program, when executed by a processor, implements the steps of the lightweight speech bandwidth extension method for edge devices according to any one of claims 1-7.
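The concatenate-weight-combine fusion of claim 5 can be sketched as a gated convex combination. This is illustrative only: the single linear map (`w`, `b`) and sigmoid gating are assumed choices, since the claims do not specify how the weight coefficients are produced from the concatenated feature.

```python
import numpy as np

def fuse(h_consonant, h_vowel, w, b):
    """Sketch of the claim-5 fusion of the two high-frequency components.

    h_consonant, h_vowel: (frames, F) high-frequency components.
    w: (2F, F) and b: (F,) parameters of an assumed linear gating layer.
    """
    # Concatenate the two components along the feature axis.
    cat = np.concatenate([h_consonant, h_vowel], axis=-1)     # (frames, 2F)
    # Derive one weight coefficient per time-frequency bin, squashed to (0, 1).
    alpha = 1.0 / (1.0 + np.exp(-(cat @ w + b)))              # (frames, F)
    # Weighted combination of the two components.
    return alpha * h_consonant + (1.0 - alpha) * h_vowel
```

Because `alpha` lies in (0, 1), every output bin is a convex combination of the corresponding consonant and vowel bins, so the fused spectrum never overshoots either component.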
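The reconstruction of claim 6 can be sketched as follows, assuming "flipping and extending" means mirroring the low-band phase into the high band. Negating the mirrored phase is one common convention, and the FFT size and hop length are again assumptions; the claims fix only the Hanning window and the inverse short-time Fourier transform.

```python
import numpy as np
from scipy import signal

def reconstruct(pred_log_mag, nb_phase, n_fft=512, hop=256, sr=16000):
    """Sketch of claim 6: phase mirroring plus Hanning-windowed inverse STFT.

    pred_log_mag: (F, T) predicted log-magnitude spectrum, F = n_fft // 2 + 1.
    nb_phase: (F, T) phase of the upsampled narrowband signal; only the
    lower half carries real information.
    """
    F, T = pred_log_mag.shape
    low = F // 2 + 1  # bins that carry genuine narrowband phase
    phase = np.empty((F, T))
    phase[:low] = nb_phase[:low]
    # Mirror (flip) the low-band phase into the high band; the sign flip
    # is an assumed convention, not mandated by the claims.
    phase[low:] = -nb_phase[low - 1: low - 1 - (F - low): -1]
    # Combine predicted magnitude with extended phase and invert.
    spec = np.exp(pred_log_mag) * np.exp(1j * phase)
    win = np.hanning(n_fft)
    _, x = signal.istft(spec, fs=sr, window=win, nperseg=n_fft,
                        noverlap=n_fft - hop)
    return x
```

The Hanning window at 50% overlap satisfies the overlap-add constraint, so the inverse STFT reconstructs a continuous time-domain waveform without framing artifacts.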
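For the adversarial training of claim 7, a least-squares GAN objective is one possible instance of "the training loss"; the claims do not fix the loss form, so the following is an assumed sketch of the generator and discriminator losses given discriminator scores.

```python
import numpy as np

def lsgan_losses(d_real, d_fake):
    """Least-squares GAN losses as an assumed instance of the claim-7 objective.

    d_real: discriminator scores on real wideband samples.
    d_fake: discriminator scores on generated wideband samples.
    """
    # Discriminator pushes real scores toward 1 and fake scores toward 0.
    d_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    # Generator pushes fake scores toward 1.
    g_loss = 0.5 * np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss
```

In practice this adversarial term would typically be combined with a spectral reconstruction loss between the generated and real wideband samples, consistent with the claim-7 step of computing a training loss between the two.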
CN202511605734.7A 2025-11-05 2025-11-05 Lightweight Voice Band Extension Method, Device, Terminal and Medium for Edge Devices Active CN121054008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511605734.7A CN121054008B (en) 2025-11-05 2025-11-05 Lightweight Voice Band Extension Method, Device, Terminal and Medium for Edge Devices


Publications (2)

Publication Number Publication Date
CN121054008A true CN121054008A (en) 2025-12-02
CN121054008B CN121054008B (en) 2026-02-06

Family

ID=97806023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511605734.7A Active CN121054008B (en) 2025-11-05 2025-11-05 Lightweight Voice Band Extension Method, Device, Terminal and Medium for Edge Devices

Country Status (1)

Country Link
CN (1) CN121054008B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110125492A1 (en) * 2009-11-23 2011-05-26 Cambridge Silicon Radio Limited Speech Intelligibility
CN109215635A (en) * 2018-10-25 2019-01-15 武汉大学 Broadband voice spectral tilt degree characteristic parameter method for reconstructing for speech intelligibility enhancing
CN112055278A (en) * 2020-08-17 2020-12-08 大象声科(深圳)科技有限公司 Deep learning noise reduction method and device integrating in-ear microphone and out-of-ear microphone
CN117877498A (en) * 2024-01-10 2024-04-12 中国科学技术大学 A method, device, equipment and storage medium for expanding speech waveform
CN117912485A (en) * 2022-10-17 2024-04-19 安克创新科技股份有限公司 Voice band extension method, noise reduction audio device and storage medium


Also Published As

Publication number Publication date
CN121054008B (en) 2026-02-06

Similar Documents

Publication Publication Date Title
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
CN109256138B (en) Identity verification method, terminal device and computer readable storage medium
US10497383B2 (en) Voice quality evaluation method, apparatus, and device
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
Hussain et al. Experimental study on extreme learning machine applications for speech enhancement
CN111899750B (en) Speech Enhancement Algorithm Combined with Cochlear Speech Features and Jump Deep Neural Networks
CN108447495A (en) A Deep Learning Speech Enhancement Method Based on Comprehensive Feature Set
Nossier et al. Mapping and masking targets comparison using different deep learning based speech enhancement architectures
CN108922561A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN108198566B (en) Information processing method and device, electronic device and storage medium
Sivapatham et al. Gammatone filter bank-deep neural network-based monaural speech enhancement for unseen conditions
CN103971697B (en) Sound enhancement method based on non-local mean filtering
CN116665701A (en) A method, system and device for classifying feeding intensity of fish schools
CN109215635B (en) A Reconstruction Method of Wideband Speech Spectrum Slope Feature Parameters for Speech Intelligibility Enhancement
CN115240701B (en) Noise reduction model training method, speech noise reduction method, device and electronic device
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
CN120148484B (en) Speech recognition method and device based on microcomputer
CN119694333B (en) Directional pickup method, system, equipment and storage medium
CN121054008B (en) Lightweight Voice Band Extension Method, Device, Terminal and Medium for Edge Devices
Ma et al. A modified Wiener filtering method combined with wavelet thresholding multitaper spectrum for speech enhancement
Xiang et al. Speech enhancement via generative adversarial LSTM networks
CN119580749A (en) Speech signal reconstruction method, device, equipment and storage medium
Hu et al. Learnable spectral dimension compression mapping for full-band speech enhancement
CN116913307A (en) Speech processing method, device, communication equipment and readable storage medium
Chen et al. Speech bandwidth extension based on Wasserstein generative adversarial network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant