CN121054008A

CN121054008A - Lightweight voice frequency band expansion method, device, terminal and medium for edge equipment

Info

Publication number: CN121054008A
Application number: CN202511605734.7A
Authority: CN
Inventors: 刘鑫; 闫永杰
Original assignee: Elevoc Technology Co ltd
Current assignee: Elevoc Technology Co ltd
Priority date: 2025-11-05
Filing date: 2025-11-05
Publication date: 2025-12-02
Anticipated expiration: 2045-11-05
Also published as: CN121054008B

Abstract

The invention provides an edge equipment-oriented lightweight voice frequency band expansion method, device, terminal and medium, belonging to the technical field of voice signal processing; the method comprises the steps of establishing a white noise amplitude spectrum, inputting the logarithmic domain narrow-band audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model, generating a first high-frequency component corresponding to consonants in the narrow-band audio signal and a second high-frequency component corresponding to vowels in the narrow-band audio signal, further obtaining a predicted amplitude spectrum, expanding phase information of the narrow-band audio signal, generating a broadband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrow-band audio signal, and outputting the broadband audio signal. The invention uses the voice bandwidth expansion model to reconstruct the consonant in a targeted way, can ensure that the reconstructed consonant component has higher fidelity, and further ensures the definition of the reconstructed voice.

Description

Lightweight voice frequency band expansion method, device, terminal and medium for edge equipment

Technical Field

The present invention relates to the field of speech signal processing technologies, and in particular, to a method, an apparatus, a terminal, and a medium for expanding a lightweight speech band for an edge device.

Background

The voice bandwidth expansion (Bandwidth Extension, BWE) technology aims at reconstructing a high-frequency component (more than or equal to 2 kHz) missing in a narrow-band signal (the frequency band is usually less than or equal to 3 kHz), so that the perception definition of voice is improved, and the voice bandwidth expansion technology is a key means for optimizing voice experience of edge equipment (such as mobile phones, terminals of the Internet of things and the like) under the condition of limited bandwidth. However, the existing mainstream method generally relies on Half-wave rectification (Half-Wave Rectification, HWR) technology to recover high-frequency information, and the reconstruction mechanism thereof performs high-frequency extrapolation based on the periodic harmonic structure of vowels. This mechanism is inherently incompatible with the non-periodic, noise-like spectral characteristics exhibited by consonants, resulting in distortion of the reconstruction of the high frequency components of the consonants, which in turn affects the intelligibility of the reconstructed speech.

Accordingly, the prior art has drawbacks and needs to be improved and developed.

Disclosure of Invention

The invention aims to solve the technical problems of the prior art by providing a lightweight voice frequency band expanding method, a lightweight voice frequency band expanding device, a lightweight voice frequency band expanding terminal and a lightweight voice frequency band expanding storage medium for edge equipment, and aims to solve the problems that the high-frequency component of consonants is reconstructed and distorted so as to influence the definition of overall voice in the prior art.

The technical scheme adopted for solving the technical problems is as follows:

in a first aspect, an embodiment of the present invention provides an edge device-oriented lightweight speech band extension method, where the method includes:

Obtaining a narrowband audio signal from an edge device and preprocessing the narrowband audio signal to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum;

Constructing a white noise amplitude spectrum;

inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model, respectively generating a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal by utilizing the voice bandwidth expansion model, and obtaining a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component;

and expanding the phase information of the narrowband audio signal, generating a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal, and outputting the wideband audio signal.

In one embodiment, obtaining a narrowband audio signal from an edge device and preprocessing to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum, including:

obtaining a narrowband audio signal from an edge device;

Up-sampling the sampling rate of the narrowband audio signal to a preset target sampling rate to obtain an up-sampled time domain waveform;

performing half-wave rectification operation on the time domain waveform after up sampling to obtain a rectified time domain waveform;

respectively adopting a hanning window to carry out short-time Fourier transform on the time domain waveform and the rectified time domain waveform to obtain a narrow-band audio signal amplitude spectrum and a rectified narrow-band audio signal amplitude spectrum;

Superposing the narrow-band audio signal amplitude spectrum and the rectified narrow-band audio signal amplitude spectrum to obtain a mixed amplitude spectrum;

And respectively applying logarithmic transformation to the narrow-band audio signal amplitude spectrum and the mixed amplitude spectrum to obtain a logarithmic-domain narrow-band audio signal amplitude spectrum and a logarithmic-domain mixed amplitude spectrum.

In one embodiment, the speech bandwidth extension model includes a feature extraction module, a dual weighted gaussian mixture module, a fusion module, and a band-guided masking module.

In one embodiment, inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum, and the white noise amplitude spectrum into a trained speech bandwidth extension model, generating a first high frequency component corresponding to a consonant in a narrowband audio signal and a second high frequency component corresponding to a vowel in the narrowband audio signal, respectively, using the speech bandwidth extension model, and obtaining a predicted amplitude spectrum based on the first high frequency component and the second high frequency component, comprising:

Inputting the logarithmic domain mixed amplitude spectrum into the characteristic extraction module, and processing to obtain time sequence characteristics representing the characteristics of the narrowband audio signal;

Inputting the time sequence characteristics, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into the double-weighted Gaussian mixture module, generating two groups of control parameters based on the time sequence characteristics, and respectively driving a first weighted Gaussian mixture module and a second weighted Gaussian mixture module of the double-weighted Gaussian mixture module to operate so as to correspondingly generate a first high-frequency component and a second high-frequency component;

Inputting the first high-frequency component and the second high-frequency component into the fusion module for fusion, and outputting an initial predicted amplitude spectrum;

inputting the logarithmic domain narrowband audio signal amplitude spectrum and the initial predicted amplitude spectrum into the frequency band guiding masking module for optimization to obtain the predicted amplitude spectrum.

In one embodiment, inputting the first high frequency component and the second high frequency component into the fusion module for fusion, and outputting an initial predicted magnitude spectrum, includes:

Inputting the first high-frequency component and the second high-frequency component into the fusion module, and splicing the first high-frequency component and the second high-frequency component by using the fusion module to obtain splicing characteristics;

processing the spliced characteristics to generate a weight coefficient;

and performing weighted calculation on the first high-frequency component and the second high-frequency component based on the weight coefficient to obtain an initial predicted amplitude spectrum.

In one embodiment, expanding the phase information of the narrowband audio signal, generating and outputting a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal, includes:

Performing overturn expansion on the phase information of the narrowband audio signal;

Combining the predicted amplitude spectrum with the extended phase information of the narrowband audio signal, and performing inverse short-time Fourier transform by adopting a Hanning window to generate a wideband audio signal;

Outputting the broadband audio signal.

In one embodiment, the training step of the speech bandwidth expansion model includes:

acquiring training data pairs, wherein the training data pairs consist of narrowband audio signal samples and corresponding real wideband audio samples;

preprocessing the narrowband audio signal sample to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum for training;

Constructing a white noise amplitude spectrum for training;

Constructing a training framework comprising a generator and a discriminator, wherein the generator expands a model for the bandwidth of the voice to be trained, and the discriminator is used for distinguishing a generated broadband audio sample from a real broadband audio sample;

Inputting the frequency spectrum of the training logarithmic domain narrowband audio signal and the frequency spectrum of the logarithmic domain mixed frequency spectrum and the frequency spectrum of the white noise into a generator to obtain a predicted training frequency spectrum;

performing inverse short-time Fourier transform according to the predicted training amplitude spectrum and the phase information of the real wideband audio sample, generating a wideband audio sample and outputting the wideband audio sample;

calculating a training loss between the wideband audio sample and the real wideband audio sample;

And obtaining a trained voice bandwidth expansion model based on the parameters of the training loss optimization generator and the discriminator until the loss converges.

In a second aspect, an embodiment of the present invention further provides an edge device-oriented lightweight speech band expansion apparatus, where the apparatus includes:

the first preprocessing module is used for acquiring a narrowband audio signal from the edge equipment and preprocessing the narrowband audio signal to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum;

The second preprocessing module is used for constructing a white noise amplitude spectrum;

The amplitude spectrum prediction module is used for inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model, respectively generating a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal by utilizing the voice bandwidth expansion model, and obtaining a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component;

And the waveform reconstruction module is used for expanding the phase information of the narrowband audio signal, generating a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal and outputting the wideband audio signal.

In a third aspect, the embodiment of the invention further provides a terminal, which comprises a memory, a processor and an edge-device-oriented lightweight speech band expansion program stored in the memory and capable of running on the processor, wherein the edge-device-oriented lightweight speech band expansion program, when executed by the processor, realizes the steps of the edge-device-oriented lightweight speech band expansion method.

In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium storing an edge device-oriented lightweight speech band extension program, where the edge device-oriented lightweight speech band extension program can be executed to implement the steps of the edge device-oriented lightweight speech band extension method as described above.

The method has the advantages that a narrow-band audio signal is obtained from edge equipment and preprocessed to obtain a logarithmic domain narrow-band audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum, a white noise amplitude spectrum is constructed, the logarithmic domain narrow-band audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum are input into a trained voice bandwidth expansion model, a first high-frequency component corresponding to consonants in the narrow-band audio signal and a second high-frequency component corresponding to vowels in the narrow-band audio signal are respectively generated by utilizing the voice bandwidth expansion model, a predicted amplitude spectrum is obtained based on the first high-frequency component and the second high-frequency component, phase information of the narrow-band audio signal is expanded, and a broadband audio signal is generated and output according to the predicted amplitude spectrum and the expanded phase information of the narrow-band audio signal. The invention uses the voice bandwidth expansion model to reconstruct the consonant in a targeted way, can ensure that the reconstructed consonant component has higher fidelity, and further ensures the definition of the reconstructed voice.

Drawings

FIG. 1 is a flow chart of a preferred embodiment of a lightweight speech band expansion method for an edge device according to the present invention.

Fig. 2 is a schematic flow chart of generating control parameters and generating weights in the present invention.

Fig. 3 is a flow chart of the present invention for generating a wideband audio signal.

Fig. 4 is a schematic structural diagram of a lightweight speech band expanding device facing an edge device according to a preferred embodiment of the present invention.

Fig. 5 is a schematic block diagram of a terminal of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clear and clear, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Aiming at the defects in the prior art, the invention provides a lightweight voice frequency band expanding method, a lightweight voice frequency band expanding device, a lightweight voice frequency band expanding terminal and a lightweight voice frequency band expanding medium for edge equipment, wherein the lightweight voice frequency band expanding method comprises the steps of obtaining a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum; the method comprises the steps of establishing a white noise amplitude spectrum, inputting the logarithmic domain narrow-band audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model, respectively generating a first high-frequency component corresponding to consonants in a narrow-band audio signal and a second high-frequency component corresponding to vowels in the narrow-band audio signal by utilizing the voice bandwidth expansion model, obtaining a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component, expanding phase information of the narrow-band audio signal, generating a broadband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrow-band audio signal, and outputting the broadband audio signal. The invention uses the voice bandwidth expansion model to reconstruct the consonant in a targeted way, can ensure that the reconstructed consonant component has higher fidelity, and further ensures the definition of the reconstructed voice.

Referring to fig. 1, the lightweight voice band expansion method for edge devices according to the embodiment of the present invention includes the following steps:

Step S100, obtaining a narrowband audio signal from the edge equipment and preprocessing the narrowband audio signal to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum.

Specifically, the edge device may be a mobile phone or an internet of things terminal device. In the process of acquiring a narrowband audio signal from an edge device and preprocessing the narrowband audio signal to obtain a narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum, firstly acquiring the narrowband audio signal from the edge device, and then upsampling the sampling rate of the narrowband audio signal to a preset target sampling rate to obtain an upsampled time domain waveform. The target sample rate may be 22050. Performing half-wave rectification operation on the time domain waveform after upsampling to obtain a rectified time domain waveformIt will be appreciated that the number of the devices,=. Short-time Fourier transform is carried out on the time domain waveform and the rectified time domain waveform by adopting a hanning window respectively to obtain a narrow-band audio signal amplitude spectrumAnd rectified narrowband audio signal amplitude spectrum. When the hanning window is used for Short-time fourier transform (Short-Time Fourier Transform, STFT), the frame length is 1024 sampling points, the frame is shifted by 256 sampling points, and the number of fast fourier transform (Fast Fourier Transform, FFT) points is 1024. The obtained narrow-band audio signal amplitude spectrum and the rectified narrow-band audio signal amplitude spectrum are 513-dimension. And superposing the narrow-band audio signal amplitude spectrum and the rectified narrow-band audio signal amplitude spectrum to obtain a mixed amplitude spectrum. This process can be expressed as. And finally, respectively applying logarithmic transformation to the narrow-band audio signal amplitude spectrum and the mixed amplitude spectrum to obtain a logarithmic domain narrow-band audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum. The process of obtaining the log-domain mixed magnitude spectrum can be expressed as:

;

wherein, the For avoiding numerical instability in logarithmic calculations.

The amplitude spectrum of the original narrowband audio signal has obvious dynamic range difference, the energy of middle and low frequencies is far higher than that of high frequencies, the high frequency details are covered up when the voice bandwidth expansion model is directly input, the high frequency reconstruction precision is reduced, the training instability is caused by overlarge factor value fluctuation, and the convergence effect of the voice bandwidth expansion model is further restricted. According to the invention, the wide dynamic range of the original amplitude spectrum can be compressed to the narrow range of the logarithmic domain through logarithmic transformation, so that the masking of low-frequency energy to high-frequency details is reduced, meanwhile, the numerical fluctuation is reduced, and the convergence speed and stability of model training are improved.

Referring to fig. 1, the lightweight speech band expanding method for edge devices according to the embodiment of the present invention further includes the following steps:

and step 200, constructing a white noise amplitude spectrum.

Specifically, a white noise amplitude spectrum with an average value of-1 is constructed。

And step 300, inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model, respectively generating a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal by utilizing the voice bandwidth expansion model, and obtaining a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component.

Specifically, the existing HWB-Net (High-Performance AND EFFICIENT Hybrid Waveform Bandwidth Extension Method) is a common speech bandwidth expansion model, the parameter number is 194K, the multiplication and addition operation times per second is 12M, and the method has the advantage of being convenient for edge equipment deployment. But it is not optimized for the aperiodic, noise-like character of consonants (e.g.,/f/,/s/, etc.). In order to overcome the defect, the invention designs a voice bandwidth expansion model (HWB-PLUS), which is improved based on the existing HWB-Net model, and carries out targeted processing on vowels and consonants respectively through a dual-path reconstruction mechanism. The voice bandwidth expansion model comprises a feature extraction module, a double-weighted Gaussian mixture module, a fusion module and a band guide masking module.

In one implementation, inputting the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum, and the white noise amplitude spectrum into a trained speech bandwidth extension model, generating a first high frequency component corresponding to a consonant in a narrowband audio signal and a second high frequency component corresponding to a vowel in the narrowband audio signal, respectively, using the speech bandwidth extension model, and obtaining a predicted amplitude spectrum based on the first high frequency component and the second high frequency component, including:

Specifically, the feature extraction module includes a Linear frequency to equivalent rectangular bandwidth conversion (Linear Frequency to Equivalent Rectangular Bandwidth Conversion, linear2 ERB) module and an encoder module. Because the resolution of human ears on different frequencies is different, the resolution of the human ears on a low frequency band is higher, and the resolution of the human ears on a high frequency band is lower, the linear frequency-to-equivalent rectangular bandwidth conversion module processes the received logarithmic domain mixed amplitude spectrum by utilizing a triangular ERB (Equivalent Rectangular Bandwidth ) filter in the linear frequency-to-equivalent rectangular bandwidth conversion module, maps the logarithmic domain mixed amplitude spectrum to a sensing frequency axis based on ERB (Equivalent Rectangular Bandwidth ) to obtain an equivalent rectangular bandwidth frequency band spectrum, and provides characteristic input more in line with human ear sensing for a subsequent module. The equivalent rectangular bandwidth frequency band spectrum is 128-dimensional, and is more in line with the auditory characteristics of human ears.

The encoder module includes four one-dimensional convolutional layers and a packet gating loop unit (Grouped Gated Recurrent Unit). The one-dimensional convolution layer is a core feature coding module in the voice bandwidth expansion model and has the functions of carrying out local feature extraction and dimension compression on the equivalent rectangular bandwidth frequency band spectrum, providing compact spectrum features rich in meaning for a subsequent decoding module, and supporting real-time stream reasoning. The channel number configuration of the four one-dimensional convolution layers is 128, 64, and 64 in order. The kernel size of each convolutional layer is 3 and a ReLU activation function is used. After the local features are extracted through four one-dimensional convolution layers, a grouping gating circulation unit is utilized to capture the inter-frame time sequence dependency relationship (such as the continuity of the voice fundamental frequency, the dynamic change of the formants and the like) of the frequency spectrum features, and meanwhile, the requirements of model weight reduction and flow reasoning are considered, so that feature support with more time sequence consistency is provided for subsequent amplitude complementation and phase optimization. The packet gating loop unit includes a two-layer gating loop unit with input and hidden layer dimensions of 64. The packet gating loop unit ultimately outputs a timing characteristic that characterizes the narrowband audio signal. Where T is the time frame, f=64,Is a real set.

The dual weighted gaussian mixture module (DualWGMM) includes a first weighted gaussian mixture module and a second weighted gaussian mixture module. The first weighted gaussian mixture module (ConsWGMM) is used for reconstructing high frequency components of consonants in the narrowband audio signal, and the second weighted gaussian mixture module (VowelWGMM) is used for reconstructing high frequency components of vowels in the narrowband audio signal. The base signal of the first weighted Gaussian mixture module isThe aperiodic and noise-like statistical characteristics of the device are completely matched with high-frequency consonants (such as/f/,/s /), and the defect that the prior art cannot adapt to consonants is overcome. The first weighted Gaussian mixture module is used for carrying out fine modeling on the 2000-10000Hz high frequency band (consonant energy core area) through 32 Gaussian components, so that consonant reconstruction distortion is avoided. The base signal of the second weighted Gaussian mixture module isThe method retains the enhancement effect of half-wave rectification on vowel harmonics in the prior art, focuses the high-frequency harmonic expansion of low-frequency vowels, can maintain the continuity of vowel reconstruction in this way, and avoids the quality degradation of vowels caused by model splitting.

Taking the broadband distribution characteristic of the consonant high-frequency band into consideration, the mean value of the first weighted Gaussian mixture moduleBased on the average of 32 values uniformly sampled in the linear frequency range of 2000-10000Hz, the standard deviationFixed at 10. In this way the flexibility and stability of the first weighted gaussian mixture module is balanced.

In addition, the parameters of the weighted Gaussian mixture model in the existing HWB-Net model are mostly empirically set, and are not combined with the nonlinear perceptual characteristics of the human auditory system. The sensitivity of human beings to high frequency follows the mel scale rule (namely sensitivity to medium and low frequencies and sensitivity to high frequency is reduced), but the experience parameters are not focused on the key frequency band of the mel scale, so that the model wastes calculation resources on background noise, inaudible sounds and other non-key frequencies, high-frequency details related to perception cannot be optimized efficiently, and high-frequency reconstruction quality is limited. The invention aims to solve the problem of lack of acoustic rationality in parameter initialization, and designs the initial mean and standard deviation of a second weighted Gaussian mixture module based on the Mel scale so as to ensure the focusing of parameters to sense a key frequency band. Specifically, the initial average of the second weighted Gaussian mixture moduleThe center frequency of the kth filter in the Mel-scale triangular filter bank is set to be directly aligned with the middle-low frequency band (namely the vowel core distribution area) sensitive to human hearing. Will be the initial standard deviationCalculated according to the bandwidth of the Mel filter, the formula is. Wherein, the Is the firstThe bandwidths of the individual mel filters. By the method, the spectrum coverage of each Gaussian component is ensured to be consistent with that of the Mel filter, and resource waste at non-critical frequency is avoided.

And after the time sequence characteristics, the logarithmic domain mixed amplitude spectrum and the logarithmic domain amplitude spectrum are input into the double-weighted Gaussian mixture module, the first weighted Gaussian mixture module and the second weighted Gaussian mixture module respectively process the time sequence characteristics to generate corresponding control parameters.

The first weighted gaussian mixture module comprises two parallel linear layers, belonging to two branches respectively. In a branch that calculates the final standard deviation, the linear layer of the branch processes the timing characteristicsThen, through softplus function processing, the standard deviation offset of the kth Gaussian component of the t frame is obtained. Softplus function is a smooth approximation function used to model the nonlinear characteristics of the ReLU (RECTIFIED LINEAR Unit) function while maintaining scalability. And then calculating based on the initial standard deviation of the first weighted Gaussian mixture module to obtain the final standard deviation of the first weighted Gaussian mixture module. The formula referred to herein is:

;

in a branch for calculating predictive weights, the linear layer processing timing characteristics of the branch Then, the predictive weight of the kth Gaussian component of the t frame is obtained through Sigmoid activation function processing. The Sigmoid activation function is a nonlinear function that maps any real input to a (0, 1) interval. Then, utilize、、As a control parameter of the first weighted Gaussian mixture module, driving the first weighted Gaussian mixture module to generate a first high-frequency component. The formula involved in this process is as follows:

;

wherein, the Is a gaussian mixture function that simulates the secondary high frequency distribution.Is a white noise magnitude spectrum.

The second weighted Gaussian mixture module comprises two parallel linear layers, which respectively belong to two branches. In a branch that calculates the final standard deviation, the linear layer of the branch processes the timing characteristicsThen, through softplus function processing, the standard deviation offset of the kth Gaussian component of the t frame is obtained. And then calculating based on the initial standard deviation of the second weighted Gaussian mixture module to obtain the final standard deviation of the second weighted Gaussian mixture module. The formula referred to herein is:

;

in a branch for calculating predictive weights, the linear layer processing timing characteristics of the branch Then, the predictive weight of the kth Gaussian component of the t frame is obtained through Sigmoid activation function processing. Finally, utilize、AndAs a control parameter of the second weighted Gaussian mixture module, driving the second weighted Gaussian mixture module to generate a second high-frequency component. The formula involved in this process is as follows:

;

wherein, the To simulate the gaussian mixture function of the pitch frequency distribution of the elements,As a function of the frequency variation,Representing a point-by-point multiplication,For mixing the amplitude spectrum in the logarithmic domain of the narrowband audio signal,Representing a probability density function of a gaussian distribution.

In one implementation, inputting the first high frequency component and the second high frequency component into the fusion module for fusion, outputting an initial predicted magnitude spectrum, comprising:

processing the spliced characteristics to generate a weight coefficient;

In particular, according toDimension stitching first high frequency componentAnd a second high frequency componentAnd obtaining splicing characteristics. The linear layer inside the fusion module transforms the above-mentioned spliced features, maps them into a scalar, then generates a frame-level weight coefficient ranging between [0,1] through Sigmoid activation function processing. If the current frame is mainly vowels, thenNear 1, if consonantNear 0. Then, an initial predicted magnitude spectrum is calculated according to the following formula: . In this way a natural transition between vowels and consonants is ensured. A schematic flow chart of the generation of control parameters and the generation of weights in the present invention is shown in fig. 2.

step 400, expanding the phase information of the narrowband audio signal, generating a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal, and outputting the wideband audio signal.

The method comprises the steps of carrying out overturn expansion on phase information of a narrow-band audio signal, combining the predicted amplitude spectrum with the expanded phase information of the narrow-band audio signal, carrying out inverse short-time Fourier transform by adopting a Hanning window to generate a wide-band audio signal, and outputting the wide-band audio signal. This process can be expressed as the following formula: wherein, the method comprises the steps of, In the case of a wideband audio signal,In the form of an inverse short-time fourier transform,For predicting the magnitude spectrum, e is the base of the natural logarithm, j is the imaginary unit,Is the phase information of the narrowband audio signal. The flow of the present invention for generating a wideband audio signal may be as shown in fig. 3.

In one implementation, the training step of the speech bandwidth expansion model includes:

Constructing a white noise amplitude spectrum for training;

Specifically, the formula of the loss function in the invention is:

wherein, the 。In the event of a loss of waveform,For the multi-resolution short-time fourier transform loss,In order to combat the loss of this,And (5) matching loss for the characteristics.

Waveform lossThe calculation formula of (2) is as follows:

;

Wherein T is the total frame number of the narrowband audio signal samples, and T is the frame number. For a wideband audio signal of the t frame predicted at training,T-th frame, which is a true wideband audio sample.

Multiresolution short-time Fourier transform lossThe calculation formula of (2) is as follows:

;

wherein, the Is a number of 1, and is not limited by the specification,Representing the spectral convergence loss at the i-th group resolution,Representing the log magnitude spectral loss at group i resolution. The value range of i is 1-3, and three groups of short-time Fourier transformation configurations with different time frequency resolutions are correspondingly arranged. The first group is the number 512 of fast Fourier transform points, the frame shift 50, the window length 240, the second group is the number 1024 of fast Fourier transform points, the frame shift 120, the window length 600, the third group is the number 2048 of fast Fourier transform points, the frame shift 240, the window length 1200.

Countering lossesIncluding loss of discriminatorsSum generator loss. Loss of discriminatorThe calculation formula of (2) is as follows: generator loss The calculation formula of (2) is as follows: . Wherein, the Representing the output of the arbiter for a true wideband audio sample,Representing the output of the arbiter for the narrowband audio signal samples,Represents the distance of L2 and,In the case of a true wideband audio sample,Is a narrowband audio signal sample.

Loss of feature matchingThe calculation formula of (2) is as follows: . Wherein L is the number of layers of the discriminator, Representative Distinguishing deviceLayer characteristics.

The invention adopts the following indexes to compare the method (HWB-Plus) with the HWB-Net method (High-Performance AND EFFICIENT Hybrid Waveform Bandwidth Extension Method), the band-limited sinc difference value (bandlimited sinc interpolation) method and the BAE-Lite (bandwidth adaptive extension neural network) method in the prior art, and the difference between the reconstructed spectrum and the real spectrum is quantified by using the Log spectrum distance (Log SPECTRAL DISTANCE, LSD), and the smaller the value is, the better the value is. The overall hearing quality of speech is assessed using a deep noise suppressed speech quality score (Deep Noise Suppression Mean Opinion Score, DNSMOS) that meets the P.808 criterion, the larger the value the better the value range [0,5]. The perceptual evaluation of speech quality (Perceptual Evaluation of Speech Quality, PESQ) method is used to evaluate speech quality, the larger the value the better the value is, the numerical range [ -0.5,4.5]. The objective quality of speech is evaluated by means of a virtual speech quality objective listener (Virtual Speech Quality Objective Listener, VISQOL), the larger the value the better the value is, the range of values [0,5]. The Non-invasive speech quality assessment (Non-Intrusive Speech Quality Assessment, NISQA) was used to assess the objective quality of speech, with values ranging from [0,5] being the greater the better. Deployment efficiency is measured using the number of parameters (para.) and the number of times per second (multiple-Accumulate Operations per Second, MACs).

The performance index comparisons for the four methods are shown in table 1:

TABLE 1

As can be seen from Table 1, the HWB-PLUS method of the invention has the perception index being comprehensively better than BAE-Lite and HWB-Net on the premise that the parameters and calculated amount are basically consistent with HWB-Net, thus proving that the aim of 'light weight is unchanged' and performance is improved, consonant distortion is obviously reduced, and voice is more natural. Although the logarithmic spectrum distance is slightly higher than BAE-Lite, the perceived quality is more fit to human hearing demand, and the practical application scene of edge voice frequency band expansion is met.

In addition, in order to verify the double-weighted gaussian mixture module, the method based on the mel scale initialization parameter and the function of applying logarithmic transformation in the invention, the indexes are checked after the three are removed respectively, and the results are shown in table 2.

TABLE 2

As can be seen from table 2, the deep neural network speech quality score and the virtual speech quality objective listener drop by 0.19 and 0.19 respectively after removing the double weighted gaussian mixture module, proving that splitting vowel/consonant modeling is crucial to improving perceived quality. After the operation based on the Mel scale initialization parameter is removed, the log spectrum distance is increased to 1.07, which indicates that the acoustic rationality of the parameter directly affects the spectrum matching precision. After removing the operation of applying logarithmic transformation, the voice quality score of the deep neural network is reduced to 3.31, and the effect of the deep neural network on high-frequency detail retention and training stability is verified.

In summary, the method comprises the steps of obtaining a narrowband audio signal from an edge device, preprocessing the narrowband audio signal to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum, constructing a white noise amplitude spectrum, inputting the logarithmic domain narrowband audio signal amplitude spectrum sum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained voice bandwidth expansion model, respectively generating a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal by utilizing the voice bandwidth expansion model, obtaining a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component, expanding phase information of the narrowband audio signal, and generating and outputting a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal. The invention uses the voice bandwidth expansion model to reconstruct the consonant in a targeted way, can ensure that the reconstructed consonant component has higher fidelity, and further ensures the definition of the reconstructed voice.

In an embodiment, as shown in fig. 4, based on the foregoing method for expanding a lightweight speech band for an edge device, the present invention further correspondingly provides a lightweight speech band expanding device for an edge device, where the device includes:

The first preprocessing module 100 is configured to obtain a narrowband audio signal from an edge device and perform preprocessing to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum;

a second preprocessing module 200 for constructing a white noise magnitude spectrum;

The amplitude spectrum prediction module 300 is configured to input the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum to a trained speech bandwidth expansion model, generate a first high-frequency component corresponding to a consonant in a narrowband audio signal and a second high-frequency component corresponding to a vowel in the narrowband audio signal by using the speech bandwidth expansion model, and obtain a predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component;

The waveform reconstruction module 400 is configured to expand the phase information of the narrowband audio signal, generate a wideband audio signal according to the predicted amplitude spectrum and the expanded phase information of the narrowband audio signal, and output the wideband audio signal.

In one embodiment, the first preprocessing module includes:

an audio acquisition unit for acquiring a narrowband audio signal from an edge device;

The up-sampling unit is used for up-sampling the sampling rate of the narrowband audio signal to a preset target sampling rate to obtain an up-sampled time domain waveform;

the half-wave rectification unit is used for performing half-wave rectification operation on the time domain waveform after up sampling to obtain a rectified time domain waveform;

the first short-time Fourier transform unit is used for superposing the narrow-band audio signal amplitude spectrum and the rectified narrow-band audio signal amplitude spectrum to obtain a mixed amplitude spectrum;

and the logarithmic transformation unit is used for respectively applying logarithmic transformation to the narrow-band audio signal amplitude spectrum and the mixed amplitude spectrum to obtain a logarithmic domain narrow-band audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum.

In one embodiment, the amplitude spectrum prediction module comprises:

The time sequence feature generating unit is used for inputting the logarithmic domain mixed amplitude spectrum into the feature extracting module and obtaining time sequence features representing the features of the narrowband audio signals through processing;

The high-frequency component generating unit is used for inputting the time sequence characteristics, the white noise amplitude spectrum and the logarithmic domain amplitude spectrum into the double-weighted Gaussian mixture module, generating two groups of control parameters based on the time sequence characteristics, and respectively driving a first weighted Gaussian mixture module and a second weighted Gaussian mixture module of the double-weighted Gaussian mixture module to operate so as to correspondingly generate a first high-frequency component and a second high-frequency component;

The fusion unit is used for inputting the first high-frequency component and the second high-frequency component into the fusion module for fusion and outputting an initial predicted amplitude spectrum;

And the optimizing unit is used for inputting the logarithmic domain narrowband audio signal amplitude spectrum and the initial predicted amplitude spectrum into the frequency band guiding masking module for optimization to obtain the predicted amplitude spectrum.

In one embodiment, the apparatus further comprises:

The splicing unit is used for inputting the first high-frequency component and the second high-frequency component into the fusion module, and splicing the first high-frequency component and the second high-frequency component by using the fusion module to obtain splicing characteristics;

the weight generating unit is used for processing the splicing characteristics and generating weight coefficients;

and the initial prediction amplitude spectrum generation unit is used for performing weighted calculation on the first high-frequency component and the second high-frequency component based on the weight coefficient to obtain an initial prediction amplitude spectrum.

In one embodiment, the band expansion module includes:

the phase expansion unit is used for carrying out overturn expansion on the phase information of the narrowband audio signal;

the first inverse short-time Fourier transform unit is used for combining the predicted amplitude spectrum with the extended phase information of the narrowband audio signal, and performing inverse short-time Fourier transform by adopting a Hanning window to generate a broadband audio signal;

and the output unit is used for outputting the broadband audio signal.

In one embodiment, the apparatus further comprises:

the training data pair acquisition unit is used for acquiring training data pairs, wherein the training data pairs consist of narrowband audio signal samples and corresponding real broadband audio samples;

the first training preprocessing unit is used for preprocessing the narrowband audio signal sample to obtain a logarithmic domain narrowband audio signal amplitude spectrum and a logarithmic domain mixed amplitude spectrum for training;

the second training preprocessing unit is used for constructing a white noise amplitude spectrum for training;

The training framework construction unit is used for constructing a training framework comprising a generator and a discriminator, wherein the generator is a speech bandwidth expansion model to be trained, and the discriminator is used for distinguishing a generated broadband audio sample from a real broadband audio sample;

A training unit for inputting the training logarithmic domain narrowband audio signal amplitude spectrum, logarithmic domain mixed amplitude spectrum and white noise amplitude spectrum into a generator, obtaining a predicted training amplitude spectrum;

the second inverse short-time Fourier transform unit is used for performing inverse short-time Fourier transform according to the predicted training amplitude spectrum and the phase information of the real wideband audio sample, generating a wideband audio sample and outputting the wideband audio sample;

a loss calculation unit for calculating a training loss between the wideband audio sample and the real wideband audio sample;

and the parameter optimization unit is used for obtaining a trained voice bandwidth expansion model based on the parameters of the training loss optimization generator and the discriminator until the loss converges.

Based on the above embodiment, the present invention further provides a terminal, and a schematic structural diagram of the terminal may be shown in fig. 5. The terminal comprises a processor, a memory, a network interface and a display screen which are connected through a device bus. Wherein the processor of the terminal is adapted to provide computing and control capabilities. The memory of the terminal includes a nonvolatile storage medium, an internal memory. The nonvolatile storage medium stores an operating device and a lightweight speech band expansion program for an edge device. The internal memory provides an environment for operation of an operating device and an edge device oriented lightweight speech band expansion program in a nonvolatile storage medium. The network interface of the terminal is used for communicating with an external terminal through a network connection. The method for expanding the lightweight voice frequency band of the edge equipment comprises the step of realizing any lightweight voice frequency band expanding method of the edge equipment when the lightweight voice frequency band expanding program of the edge equipment is executed by a processor. The display screen of the terminal may be a liquid crystal display screen or an electronic ink display screen.

It will be appreciated by those skilled in the art that the schematic structural diagram shown in fig. 5 is merely a schematic diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the terminal to which the present inventive arrangements are applied, and that a particular terminal may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a terminal is provided, where the terminal includes a memory, a processor, and an edge-device-oriented lightweight speech band expansion program stored in the memory and capable of running on the processor, where the step of any one of the edge-device-oriented lightweight speech band expansion methods provided by the embodiments of the present invention is implemented when the edge-device-oriented lightweight speech band expansion program is executed by the processor.

The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores an edge equipment-oriented lightweight voice band expansion program, and the edge equipment-oriented lightweight voice band expansion program realizes the steps of any one of the edge equipment-oriented lightweight voice band expansion methods provided by the embodiment of the invention when being executed by a processor.

It should be understood that the sequence number of each step in the above embodiment does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not be construed as limiting the implementation process of the embodiment of the present invention.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working process of the units and modules in the above device may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units described above is merely a logical function division, and may be implemented in other manners, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed.

The embodiments described above are only for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments may be modified or some of the technical features may be replaced equally, and that the modifications or replacements are not essential to the corresponding technical solution but are included in the scope of protection of the present invention.

Claims

1. A lightweight voice band extension method for edge devices, characterized in that the method includes:

Narrowband audio signals are acquired from edge devices and preprocessed to obtain the logarithmic domain narrowband audio signal amplitude spectrum and the logarithmic domain mixed amplitude spectrum.

Construct the amplitude spectrum of white noise;

The logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum, and the white noise amplitude spectrum are input into a trained speech bandwidth expansion model. The speech bandwidth expansion model is used to generate a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal, respectively. The predicted amplitude spectrum is obtained based on the first high-frequency component and the second high-frequency component.

The phase information of the narrowband audio signal is extended, and a wideband audio signal is generated and output based on the predicted amplitude spectrum and the extended phase information of the narrowband audio signal.

2. The lightweight audio band extension method for edge devices according to claim 1, characterized in that, narrowband audio signals are acquired from the edge device and preprocessed to obtain the logarithmic domain narrowband audio signal amplitude spectrum and the logarithmic domain mixed amplitude spectrum, including:

Acquire narrowband audio signals from edge devices;

The sampling rate of the narrowband audio signal is upsampled to a preset target sampling rate to obtain the upsampled time-domain waveform;

The upsampled time-domain waveform is subjected to half-wave rectification to obtain the rectified time-domain waveform.

The Hanning window is used to perform short-time Fourier transform on the time-domain waveform and the rectified time-domain waveform respectively to obtain the amplitude spectrum of the narrowband audio signal and the amplitude spectrum of the rectified narrowband audio signal.

The amplitude spectrum of the narrowband audio signal is superimposed with the amplitude spectrum of the rectified narrowband audio signal to obtain a mixed amplitude spectrum;

Logarithmic transformations are applied to the amplitude spectrum of the narrowband audio signal and the mixed amplitude spectrum to obtain the amplitude spectrum of the narrowband audio signal in the logarithmic domain and the mixed amplitude spectrum in the logarithmic domain.

3. The lightweight speech bandwidth expansion method for edge devices according to claim 1, wherein the speech bandwidth expansion model includes a feature extraction module, a double-weighted Gaussian mixture module, a fusion module, and a frequency band guidance masking module.

4. The lightweight speech bandwidth expansion method for edge devices according to claim 3, characterized in that the logarithmic domain narrowband audio signal amplitude spectrum, the logarithmic domain mixed amplitude spectrum, and the white noise amplitude spectrum are input into a trained speech bandwidth expansion model, and the speech bandwidth expansion model is used to generate a first high-frequency component corresponding to consonants in the narrowband audio signal and a second high-frequency component corresponding to vowels in the narrowband audio signal, respectively, and a predicted amplitude spectrum is obtained based on the first high-frequency component and the second high-frequency component, including:

The logarithmic domain mixed amplitude spectrum is input into the feature extraction module, and after processing, the temporal features characterizing the narrowband audio signal are obtained.

The time series features, the logarithmic domain mixed amplitude spectrum, and the white noise amplitude spectrum are input into the dual-weighted Gaussian mixing module. Based on the time series features, two sets of control parameters are generated to drive the first weighted Gaussian mixing module and the second weighted Gaussian mixing module of the dual-weighted Gaussian mixing module to run, so as to generate the first high-frequency component and the second high-frequency component respectively.

The first high-frequency component and the second high-frequency component are input into the fusion module for fusion, and the initial predicted amplitude spectrum is output.

The amplitude spectrum of the logarithmic domain narrowband audio signal and the initial predicted amplitude spectrum are input into the frequency band guiding masking module for optimization to obtain the predicted amplitude spectrum.

5. The lightweight speech band extension method for edge devices according to claim 4, characterized in that, the first high-frequency component and the second high-frequency component are input into the fusion module for fusion, and an initial predicted amplitude spectrum is output, including:

The first high-frequency component and the second high-frequency component are input into the fusion module, and the first high-frequency component and the second high-frequency component are spliced together by the fusion module to obtain the splicing feature;

The splicing features are processed to generate weight coefficients;

The first high-frequency component and the second high-frequency component are weighted based on the weighting coefficients to obtain the initial predicted amplitude spectrum.

6. The lightweight audio band extension method for edge devices according to claim 1, characterized in that, the phase information of the narrowband audio signal is extended, and a wideband audio signal is generated and output based on the predicted amplitude spectrum and the extended phase information of the narrowband audio signal, comprising:

The phase information of the narrowband audio signal is flipped and expanded;

The predicted amplitude spectrum is combined with the expanded phase information of the narrowband audio signal, and an inverse short-time Fourier transform is performed using a Hanning window to generate a wideband audio signal.

Output the broadband audio signal.

7. The lightweight speech bandwidth extension method for edge devices according to claim 1, characterized in that the training steps of the speech bandwidth extension model include:

Acquire training data pairs, which consist of narrowband audio signal samples and their corresponding real broadband audio samples;

The narrowband audio signal samples are preprocessed to obtain the logarithmic domain narrowband audio signal amplitude spectrum and the logarithmic domain mixed amplitude spectrum for training.

Construct the white noise amplitude spectrum for training;

A training framework is constructed that includes a generator and a discriminator. The generator is a speech bandwidth extension model to be trained, and the discriminator is used to distinguish between generated broadband audio samples and real broadband audio samples.

The amplitude spectrum of the logarithmic domain narrowband audio signal, the mixed amplitude spectrum of the logarithmic domain, and the amplitude spectrum of the white noise used for training are input into the generator to obtain the predicted training amplitude spectrum;

Based on the predicted training amplitude spectrum and the phase information of the real broadband audio samples, perform inverse short-time Fourier transform to generate broadband audio samples and output them;

Calculate the training loss between the broadband audio sample and the real broadband audio sample;

The parameters of the generator and discriminator are optimized based on the training loss until the loss converges, resulting in a trained speech bandwidth expansion model.

8. A lightweight voice band extension device for edge devices, characterized in that it comprises:

The first preprocessing module is used to acquire narrowband audio signals from edge devices and perform preprocessing to obtain the logarithmic domain narrowband audio signal amplitude spectrum and the logarithmic domain mixed amplitude spectrum.

The second preprocessing module is used to construct the white noise amplitude spectrum;

The amplitude spectrum prediction module is used to input the amplitude spectrum of the logarithmic domain narrowband audio signal, the logarithmic domain mixed amplitude spectrum and the white noise amplitude spectrum into a trained speech bandwidth expansion model, and use the speech bandwidth expansion model to generate a first high-frequency component corresponding to the consonants in the narrowband audio signal and a second high-frequency component corresponding to the vowels in the narrowband audio signal, respectively, and obtain the predicted amplitude spectrum based on the first high-frequency component and the second high-frequency component.

The waveform reconstruction module is used to extend the phase information of the narrowband audio signal, and generate and output a wideband audio signal based on the predicted amplitude spectrum and the extended phase information of the narrowband audio signal.

9. A terminal, characterized in that the terminal comprises: a memory, a processor, and a lightweight voice band extension program for edge devices stored in the memory and executable on the processor, wherein the lightweight voice band extension program for edge devices, when executed by the processor, implements the steps of the lightweight voice band extension method for edge devices as described in any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a lightweight voice band extension program for edge devices, wherein when the lightweight voice band extension program for edge devices is executed by a processor, it implements the steps of the lightweight voice band extension method for edge devices as described in any one of claims 1-7.