
CN114974271B - Voice reconstruction method based on vocal tract filtering and glottal excitation - Google Patents


Info

Publication number
CN114974271B
Authority
CN
China
Prior art keywords
voice
audio
frequency
cepstrum
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111650490.6A
Other languages
Chinese (zh)
Other versions
CN114974271A (en)
Inventor
邵玉斌
段云
龙华
杜庆治
苏树盟
黄张衡
刘晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202111650490.6A
Publication of CN114974271A
Application granted
Publication of CN114974271B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/78 Detection of presence or absence of voice signals


Abstract

The invention relates to a voice reconstruction method based on vocal tract filtering and glottal excitation, and belongs to the technical field of speech synthesis. The method first marks the start and end points of voice activity in the audio, labeling the positions of speech and non-speech segments according to the flatness of the short-time energy in the frequency domain. The pitch frequency of each frame is then extracted, with the glottal pulse cepstrum and the vocal tract response cepstrum separated by the cepstral method. A simulated glottal impulse excitation is constructed from the extracted pitch frequencies. After the audio passes through a fast Fourier transform and a log spectrum is taken, the cepstral data are recovered with a discrete cosine transform, and the energy-concentrated portion is extracted as the simulated vocal tract response. Finally, the voice is reconstructed from the extracted glottal excitation and vocal tract response. The invention improves reconstruction efficiency while preserving the accuracy of voice reconstruction.

Description

Voice reconstruction method based on vocal tract filtering and glottal excitation
Technical Field
The invention relates to a voice reconstruction method based on vocal tract filtering and glottal excitation, and belongs to the technical field of speech synthesis.
Background
With the continuous progress of society, speech synthesis technology is widely applied in people's daily lives, and its application value grows ever more important. Whether synthesized speech sounds natural and human-like, and whether it carries emotion, directly affects the human-computer interaction experience of a device.
Speech synthesis techniques can be broadly divided into three categories: waveform synthesis, parametric synthesis, and rule-based synthesis. Waveform synthesis stores the speech waveforms of different phonemes in a corpus; when speech is needed, the required material is retrieved from the corpus, spliced together, and output as the target speech signal. Parametric synthesis extracts acoustic parameters of speech, stores them in the system, and synthesizes the required speech signal by varying those parameters. Rule-based synthesis builds syllables and words from phonemes and applies rules controlling pitch, rhythm, stress, and the like to synthesize the target text and convert it into continuous sound waves.
Parametric synthesis has developed rapidly in recent years thanks to its small corpus requirements, convenient parameter extraction, and the wide range of prosodic features the system can accommodate. However, the method still suffers from complex extraction algorithms, an excessive number of extracted parameters, and emotionally flat synthesized speech.
Human voice production works as follows: the glottis opens and closes continuously, so that the airflow at the glottis drives the vocal cords to vibrate, and the airflow then passes through the vocal tract to produce speech. The vocal tract itself keeps changing shape during speaking, which is why different sounds are heard. The voice production model of this invention is based on the characteristics of the human articulatory organs and the principle of voice generation: speech signals are reconstructed by extracting the speaker's pitch frequency and vocal tract parameters at different moments in the voice and simulating the glottal excitation and vocal tract changes during phonation.
Disclosure of Invention
The invention aims to solve the technical problems of the complex extraction algorithm and excessive extraction parameters of traditional parametric synthesis methods by providing a voice reconstruction method based on vocal tract filtering and glottal excitation.
The technical scheme of the invention is a voice reconstruction method based on vocal tract filtering and glottal excitation, comprising the following steps:
Step1, marking the start and end points of voice activity in the audio, and labeling the positions of speech and non-speech segments according to the flatness of the short-time energy of the detected audio in the frequency domain, in preparation for pitch extraction.
Specifically, the audio signal is framed and windowed, the positions of speech and non-speech segments are detected from the flatness of the short-time energy in the frequency domain, frames with and without speech are distinguished, and the endpoint positions of a segment of voice are represented as a two-dimensional array, from which the start and end points of the voice are extracted.
Step2, since the pitch frequency carries the speaker's acoustic information in the audio, the pitch of the audio is extracted: the glottal pulse cepstrum and the vocal tract response cepstrum are separated by the cepstral method, and the pitch frequency of each frame of the audio is extracted.
Specifically, the quotient of the signal's sampling frequency and the position of the cepstral maximum within the pitch search range is taken as the feature, and the extraction result represents the pitch frequency of a segment of sample audio as a one-dimensional array.
Step3, constructing simulated glottal excitation according to the extracted pitch frequency;
Specifically, the sample pitch frequencies extracted in Step2 are interpolated, smoothed, and normalized, then fed into a voltage-controlled oscillator, which outputs an oscillating signal within 75-300 Hz. The oscillation is converted into a rectangular wave, delayed by one sample, and subtracted from itself to obtain the required impulse signal; the output represents the simulated glottal excitation of the sample audio as a one-dimensional array.
Step4, extracting the characteristic response of the vocal tract using the properties of the discrete cosine transform: after the audio passes through a fast Fourier transform and a log spectrum is taken, the cepstral data are recovered with a discrete cosine transform, and finally the energy-concentrated part is extracted as the simulated vocal tract response and used as the filter data for reconstructing the voice;
Specifically, the framed original voice data undergo a fast Fourier transform and a logarithm operation; half of the data points (namely 128 points) are extracted, a discrete cosine transform recovers the other (mirrored) half of the data, an inverse fast Fourier transform is applied, and the 42 points where the energy is most concentrated are extracted as the parameters of an FIR filter, i.e., the simulated vocal tract response.
Step5, reconstructing the voice by using the extracted glottal excitation and vocal tract response. According to the glottal excitation extracted in Step3 and the vocal tract time-domain FIR filter parameters extracted in Step4, the extracted glottal excitation is passed frame by frame through the FIR filter by convolution; finally the data of each frame are overlap-added into a one-dimensional array by inverse framing, and the array is written into an audio file at the sampling rate of the original voice.
The invention has the advantages that the computation for extracting the vocal tract impulse response is small and fast; constructing the vocal tract parameters requires only half of the data, which reduces computation time; and audio endpoint detection reduces the interference of silent segments on the extraction of reconstruction parameters, improving efficiency. Therefore, aiming at the drawbacks of heavy computation and high data demand in prior-art voice reconstruction, the invention improves reconstruction efficiency while preserving the accuracy of voice reconstruction.
Drawings
FIG. 1 is a diagram of the voice production model employed by the present invention;
FIG. 2 is a block diagram of the overall structure of the present invention;
FIG. 3 is a cepstrum of a frame of speech signal of the present invention;
FIG. 4 is a plot of the result of pitch frequency estimation of the present invention;
FIG. 5 is a waveform diagram of a frame of glottal excitation signals according to the present invention;
FIG. 6 is a waveform diagram of one frame of channel parameters of the present invention;
FIG. 7 is a comparison of the spectrograms of the original voice and the reconstructed voice.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
A voice reconstruction method based on vocal tract filtering and glottal excitation: the reconstruction control system diagram is shown in FIG. 1. The parameters required for reconstruction, such as voiced/unvoiced segment marks, pitch frequency, glottal parameters, and white noise, are extracted by a parameter extraction module, and the target audio is reconstructed through an FIR filter, solving the problems of complex extraction algorithms and excessive extraction parameters in traditional parametric synthesis methods.
The invention consists mainly of two parts: extraction of the vocal tract filter parameters and extraction of the glottal excitation parameters. The overall flow chart is shown in FIG. 2.
The specific technical scheme of the invention is as follows:
Step1, marking the start and end points of voice activity in the audio, and labeling the positions of speech and non-speech segments according to the flatness of the short-time energy of the detected audio in the frequency domain, for pitch extraction;
Step2, extracting the pitch frequency of the audio: separating the glottal pulse cepstrum and the vocal tract response cepstrum by the cepstral method, and extracting the pitch frequency of each frame of the audio;
Step3, constructing the impulse response of the simulated glottal pulses from the extracted pitch frequency;
Step4, using the properties of the discrete cosine transform: after the audio undergoes a fast Fourier transform and a log spectrum is taken, the cepstral data are recovered by the discrete cosine transform, and finally the energy-concentrated part is extracted as the simulated vocal tract response and used as the filter data of the reconstructed voice;
Step5, reconstructing the voice by using the extracted glottal excitation and vocal tract response.
Step1 specifically comprises the following steps:
Step1.1, framing and windowing the audio signals in the corpus, with a window length of 256 and a frame shift of 128, then applying a fast Fourier transform to each windowed short-time voice frame;
Step1.2, calculating the spectral energy of each frame;
Step1.3, calculating the probability density function of each sample point in each frame;
Step1.4, calculating the spectral entropy of each frame, as shown in formula (1):
H(i) = -∑n P(n,i) log2 P(n,i) (1)
wherein H(i) is the spectral entropy of the i-th frame and P(n,i) is the normalized spectral probability density function of spectral line n in the i-th frame;
Step1.5, setting a decision threshold; the threshold is set to 0.12;
Step1.6, performing endpoint detection according to the spectral entropy of each frame: values below the threshold are represented by 0 and regarded as non-speech segments, and values above the threshold are represented by 1 and regarded as speech segments. The detection result is represented by a one-dimensional array X whose length is the number of frames after framing.
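For illustration, the endpoint detection of Step1 can be sketched in a few lines of NumPy. This is a minimal sketch, not the patent's implementation: the window function, the epsilon guards, and the normalization of the entropy before comparing against the 0.12 threshold are assumptions, since the text does not specify them.

```python
import numpy as np

def endpoint_detection(x, frame_len=256, hop=128, threshold=0.12):
    """Step1 sketch: per-frame spectral entropy, thresholded at 0.12."""
    win = np.hamming(frame_len)                      # window choice is an assumption
    n_frames = 1 + (len(x) - frame_len) // hop
    X = np.zeros(n_frames, dtype=int)                # one-dimensional mark array X
    for i in range(n_frames):
        frame = x[i * hop:i * hop + frame_len] * win
        spec = np.abs(np.fft.rfft(frame)) ** 2       # Step1.2: spectral energy
        P = spec / (np.sum(spec) + 1e-12)            # Step1.3: probability density
        H = -np.sum(P * np.log2(P + 1e-12))          # Step1.4: spectral entropy, formula (1)
        # Decision rule per the text: above threshold -> speech (1), below -> non-speech (0).
        # Dividing by the maximum possible entropy to get a 0..1 value is an assumption.
        X[i] = 1 if H / np.log2(len(P)) > threshold else 0
    return X
```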
Step2 specifically comprises the following steps:
Step2.1, performing a fast Fourier transform on the framed voice signal xn(m) to obtain Xn(k), then taking the modulus and the logarithm to obtain the magnitude spectrum, as shown in formula (2):
En = 20 log10(|Xn(k)|) (2)
Step2.2, performing an inverse fast Fourier transform on En to obtain the cepstrum of the frame signal, as shown in FIG. 3. The cepstrum shows peaks spaced at the interval between harmonics, and the quotient of the sampling frequency and that spacing is the required pitch frequency. Finding the coordinates S1 and S2 of two adjacent peaks and using the voice sampling frequency fs, the pitch frequency is obtained according to formula (3):
F(i) = fs / (S2 - S1) (3)
where i is the index of the current frame.
The pitch frequency estimate of the voice signal is shown in FIG. 4, where the background is the spectrogram of the voice signal; the pitch extraction algorithm of the invention is quite accurate.
To estimate the pitch frequency conveniently, the method searches the range of the cepstrum corresponding to pitch frequencies of 60-500 Hz, i.e., the maximum peak coordinate S(i) is found in the cepstrum abscissa interval (16, 133) (for a sampling rate of 8000 Hz, fs/500 = 16 and fs/60 ≈ 133), and the pitch frequency is obtained according to formula (4):
F(i) = fs / S(i) (4)
Step2.3, outputting the calculated pitch frequency of each frame as a one-dimensional array whose length is the total number of frames after framing, as shown in formula (5), where n is the total number of frames:
F = [F(1), F(2), F(3), …, F(n)] (5)
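A minimal sketch of this cepstral pitch extractor follows, assuming a 256-point FFT per frame as in Step4.1 and the 60-500 Hz search band of formula (4); the epsilon guard inside the logarithm is an assumption.

```python
import numpy as np

def cepstral_pitch(frames, fs):
    """Steps 2.1-2.3 sketch: pitch per frame from the dominant cepstral peak."""
    lo, hi = int(fs / 500), int(fs / 60)             # 60-500 Hz band, e.g. (16, 133) at 8 kHz
    F = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        spec = np.fft.fft(frame, 256)
        En = 20.0 * np.log10(np.abs(spec) + 1e-12)   # formula (2); epsilon is an assumption
        ceps = np.real(np.fft.ifft(En))              # Step2.2: cepstrum of the frame
        S = lo + np.argmax(ceps[lo:hi])              # maximum peak coordinate S(i)
        F[i] = fs / S                                # formula (4)
    return F                                         # one-dimensional array, formula (5)
```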
Step3 specifically comprises the following steps:
Step3.1, applying cubic spline interpolation to the pitch frequencies extracted in Step2.3 to generate a pitch frequency sequence Fc with smoother pitch period transitions between frames; its length is the frame length multiplied by the total number of frames;
Step3.2, feeding the interpolated pitch frequency into a voltage-controlled oscillator (VCO), whose expression is shown in equation (6);
Step3.3, normalizing the data output by the VCO, as shown in formula (7), where the normalization range is the 75-300 Hz frequency band; the waveform of the resulting frame of the glottal excitation signal is shown in FIG. 5.
The oscillation z(n) is then converted into a rectangular wave r(n), and the difference in formula (8) is applied to obtain the glottal pulse excitation:
R(n) = r(n) - r(n-1) (8)
The glottal pulse excitation extracted from one frame of the speech signal is shown in FIG. 5, where the abscissa is the sample index within the frame and the ordinate is the amplitude of the glottal pulse excitation; the number of glottal pulses in a frame is determined by the pitch period of the current frame.
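The following sketch shows one plausible realization of Step3 in NumPy/SciPy. The phase-accumulating VCO and the positive-impulse selection are assumptions, since equations (6) and (7) are not reproduced in this text; only the spline interpolation, the 75-300 Hz range, and the difference of formula (8) are taken from the description.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def glottal_excitation(F, frame_len, fs):
    """Step3 sketch: spline-interpolated pitch track -> VCO -> rectangular wave
    -> difference -> impulse train."""
    knots = np.arange(len(F)) * frame_len
    Fc = CubicSpline(knots, F)(np.arange(len(F) * frame_len))  # Step3.1
    Fc = np.clip(Fc, 75.0, 300.0)             # keep the track inside the 75-300 Hz range
    phase = 2.0 * np.pi * np.cumsum(Fc) / fs  # assumed VCO form: frequency -> phase
    z = np.cos(phase)                         # oscillating signal z(n)
    r = (z > 0).astype(float)                 # rectangular wave r(n)
    R = np.diff(r, prepend=0.0)               # R(n) = r(n) - r(n-1), formula (8)
    return np.maximum(R, 0.0)                 # keeping only positive edges is an assumption
```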
Step4 specifically comprises the following steps:
Step4.1, applying a 256-point FFT to the voice data of each frame and taking the log spectrum P1 of the first 128 points;
Step4.2, applying a discrete cosine transform to P1 to obtain P2, then taking the data in the interval 1-25 of P2 and applying an inverse discrete cosine transform to obtain the matrix P3;
Step4.3, reversing the order of P3 to obtain P4, and reconstructing a new matrix P5 = [P3, P4] from P3 and P4;
Step4.4, applying an inverse Fourier transform to P5 and taking the real part to obtain the all-pole filter parameter matrix G of the vocal tract model;
Step4.5, taking the 42 points of G where the energy is most concentrated as the vocal tract filter parameter matrix G1. The output is a two-dimensional array whose rows index the frames and whose columns are the vocal tract filter parameters of each frame.
The vocal tract parameters extracted from one frame of the speech signal are shown in FIG. 6, where the abscissa is the sample index and the ordinate is the magnitude of the vocal tract parameters.
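A sketch of Steps 4.1-4.5 might read as follows, assuming that taking "the interval 1-25" means zeroing the higher-order DCT coefficients and that the 42 energy-concentrated points are the leading samples of G; both readings are assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

def vocal_tract_taps(frame, n_ceps=25, n_taps=42):
    """Steps 4.1-4.5 sketch: 42 FIR taps for the frame's vocal tract response."""
    spec = np.fft.fft(frame, 256)                   # Step4.1: 256-point FFT
    P1 = np.log10(np.abs(spec[:128]) + 1e-12)       # log spectrum of the first 128 points
    P2 = dct(P1, norm='ortho')                      # Step4.2: DCT of P1
    P2[n_ceps:] = 0.0                               # keep interval 1-25 (assumed reading)
    P3 = idct(P2, norm='ortho')                     # smoothed half spectrum
    P4 = P3[::-1]                                   # Step4.3: reversed order
    P5 = np.concatenate([P3, P4])                   # symmetric full-length spectrum
    G = np.real(np.fft.ifft(P5))                    # Step4.4: IFFT, real part
    return G[:n_taps]                               # Step4.5: leading points as FIR taps
```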
Step5 specifically comprises the following steps:
Step5.1, according to the extracted speech/silence mark array X: when the mark of the current frame is 0, i.e., silence, silent glottal excitation is generated from random white noise, the vocal tract parameters of the current frame are extracted and placed into the FIR filter, and the voice of the current frame is reconstructed, the parameters being updated once per frame;
Step5.2, when the mark of the current frame is 1, i.e., a voiced segment, the glottal pulse excitation of the current frame is extracted, the vocal tract parameters of the current frame are extracted and placed into the FIR filter, and the voice of the current voiced frame is reconstructed, the parameters being updated once per frame;
Step5.3, storing the voice data reconstructed from each frame in a matrix W, where each column holds the voice signal reconstructed from one frame, N frames in total;
Step5.4, restoring the matrix W to a speech signal by inverse framing. The spectrogram of the synthesized speech is shown in FIG. 7: the first subplot is the spectrogram of the original voice and the second is that of the reconstructed voice. The method restores the original voice well; at low frequencies the relationships between formants and harmonics are preserved, and at high frequencies the information contained in the voice is restored. When the reconstructed voice is fed into speech-to-text software, its textual content can be recognized.
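Finally, Step5 can be sketched as frame-by-frame convolution followed by overlap-add. This is a sketch under stated assumptions: the white-noise gain for silent frames and the overlap-add inverse framing with a 128-sample hop are not specified in the text.

```python
import numpy as np

def reconstruct(X, excitations, taps, frame_len=256, hop=128):
    """Step5 sketch: per-frame FIR filtering of the excitation, then overlap-add."""
    out = np.zeros(len(X) * hop + frame_len)
    for i in range(len(X)):
        if X[i] == 0:                                   # silent frame: white noise source
            src = 0.01 * np.random.randn(frame_len)     # noise gain is an assumption
        else:                                           # voiced frame: glottal pulses
            src = excitations[i]
        y = np.convolve(src, taps[i])[:frame_len]       # pass through the frame's FIR filter
        out[i * hop:i * hop + frame_len] += y           # inverse framing via overlap-add
    return out                                          # write out at the original rate
```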
While the present invention has been described in detail with reference to the drawings, the invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (3)

1. A voice reconstruction method based on vocal tract filtering and glottal excitation, characterized by comprising the following steps:
Step1, marking the start and end points of voice activity in the audio, and labeling the positions of speech and non-speech segments according to the flatness of the short-time energy of the detected audio in the frequency domain, for pitch extraction;
Step2, extracting the pitch frequency of the audio: separating the glottal pulse cepstrum and the vocal tract response cepstrum by the cepstral method, and extracting the pitch frequency of each frame of the audio;
Step3, constructing simulated glottal excitation according to the extracted pitch frequency;
Step4, after the audio undergoes a fast Fourier transform and a log spectrum is taken, recovering the cepstral data by the discrete cosine transform, and finally extracting the energy-concentrated part as the simulated vocal tract response and as the filter data of the reconstructed voice;
Step5, reconstructing the voice by using the extracted glottal excitation and vocal tract response;
Step3 is specifically:
Step3.1, applying cubic spline interpolation to the pitch frequencies to generate a pitch frequency sequence Fc with smoother pitch period transitions between frames, its length being the frame length multiplied by the total number of frames;
Step3.2, feeding the interpolated pitch frequency into a voltage-controlled oscillator (VCO), whose expression is shown in equation (6);
Step3.3, normalizing the data output by the VCO as shown in formula (7), wherein the normalization range is 75 to 300 Hz;
converting z(n) into a rectangular wave r(n), and applying the difference in formula (8) to obtain the glottal pulse excitation;
R(n) = r(n) - r(n-1) (8)
Step4 is specifically:
Step4.1, applying a 256-point FFT to the voice data of each frame and taking the log spectrum P1 of the first 128 points;
Step4.2, applying a discrete cosine transform to P1 to obtain P2, then taking the data in the interval 1-25 of P2 and applying an inverse discrete cosine transform to obtain the matrix P3;
Step4.3, reversing the order of P3 to obtain P4, and reconstructing a new matrix P5 = [P3, P4] from P3 and P4;
Step4.4, applying an inverse Fourier transform to P5 and taking the real part to obtain the all-pole filter parameter matrix G of the vocal tract model;
Step4.5, taking the 42 points of G where the energy is most concentrated as the vocal tract filter parameter matrix G1, wherein the output is expressed as a two-dimensional array whose rows index the frames and whose columns are the vocal tract filter parameters of each frame.
2. The voice reconstruction method based on vocal tract filtering and glottal excitation according to claim 1, wherein in Step1 the start and end points of voice activity are marked by framing and windowing the audio signal, detecting the positions of speech and non-speech segments from the flatness of the short-time energy of the audio in the frequency domain, and distinguishing frames with and without speech; the extraction result represents the endpoint positions of a segment of voice as a two-dimensional array, from which the start and end points of the voice are extracted.
3. The voice reconstruction method based on vocal tract filtering and glottal excitation according to claim 1, wherein Step2 specifically takes as the feature the quotient of the signal's sampling frequency and the position of the maximum of the cepstrum within the pitch search range, and the extraction result represents the pitch frequency of a segment of sample audio as a one-dimensional array.
CN202111650490.6A 2021-12-29 2021-12-29 Voice reconstruction method based on vocal tract filtering and glottal excitation Active CN114974271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111650490.6A CN114974271B (en) 2021-12-29 2021-12-29 Voice reconstruction method based on vocal tract filtering and glottal excitation

Publications (2)

Publication Number Publication Date
CN114974271A CN114974271A (en) 2022-08-30
CN114974271B true CN114974271B (en) 2025-07-08

Family

ID=82975223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111650490.6A Active CN114974271B (en) 2021-12-29 2021-12-29 Voice reconstruction method based on vocal tract filtering and glottal excitation

Country Status (1)

Country Link
CN (1) CN114974271B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007015489A1 (en) * 2005-08-01 2007-02-08 Kyushu Institute Of Technology Voice search device and voice search method
CN108281150B (en) * 2018-01-29 2020-11-17 上海泰亿格康复医疗科技股份有限公司 Voice tone-changing voice-changing method based on differential glottal wave model
CN111147021B (en) * 2020-04-03 2020-07-24 恒玄科技(上海)股份有限公司 Voltage controlled oscillator

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346106A (en) * 2018-09-06 2019-02-15 河海大学 A Cepstral Domain Pitch Period Estimation Method Based on Subband Signal-to-Noise Ratio Weighting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A pitch detection algorithm based on linear prediction residual cepstrum; Qian Bo; Li Yanping; Tang Zhenmin; Xu Limin; Computer Engineering and Applications; 2007-11-11 (No. 32); 210-213 *
A language identification method with pitch frequency normalization; Duan Yun, et al.; Microelectronics & Computer; 2023-05-31; Vol. 40 (No. 5); 20-28 *

Similar Documents

Publication Publication Date Title
US8719030B2 (en) System and method for speech synthesis
US9324330B2 (en) Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm
CN102664003B (en) Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
US8280724B2 (en) Speech synthesis using complex spectral modeling
CN101589430A (en) Voice isolation device, voice synthesis device, and voice quality conversion device
CN108682432B (en) Voice emotion recognition device
CN113436606A (en) Original sound speech translation method
Ai et al. Knowledge-and-data-driven amplitude spectrum prediction for hierarchical neural vocoders
US11183169B1 (en) Enhanced virtual singers generation by incorporating singing dynamics to personalized text-to-speech-to-singing
CN114974271B (en) Voice reconstruction method based on vocal tract filtering and glottal excitation
Chazan et al. Small footprint concatenative text-to-speech synthesis system using complex spectral envelope modeling.
JP4430174B2 (en) Voice conversion device and voice conversion method
TWI302296B (en)
Richards et al. Deriving articulatory representations from speech with various excitation modes
Prasad et al. Backend tools for speech synthesis in speech processing
Wang et al. Beijing opera synthesis based on straight algorithm and deep learning
Hasan et al. An approach to voice conversion using feature statistical mapping
Bonada et al. Spectral approach to the modeling of the singing voice
Jung et al. Pitch alteration technique in speech synthesis system
He et al. Collaborative AI Dysarthric Speech Recognition System with Data Augmentation using Generative Adversarial Neural Network
CN120564735B (en) Audio real-time conversion and analysis management system and method based on artificial intelligence
Lohiya et al. Survey on Computer Aided Language Learning using automatic accent assessment techniques
Bohm et al. Algorithm for formant tracking, modification and synthesis
Liu et al. Research on Voice Humming to Guzheng Audio Transcription and Pitch-Lowering Processing Based on Genetic Algorithm and YIN Algorithm
Luo et al. On End-to-End Chinese Speech Synthesis Based on World-Tacotron

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant