CN114974271B - Voice reconstruction method based on sound channel filtering and glottal excitation - Google Patents
Voice reconstruction method based on sound channel filtering and glottal excitation
- Publication number
- CN114974271B (application CN202111650490.6A)
- Authority
- CN
- China
- Prior art keywords
- voice
- audio
- frequency
- cepstrum
- extracted
- Prior art date
- Legal status: Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention relates to a voice reconstruction method based on vocal tract filtering and glottal excitation, and belongs to the technical field of speech synthesis. The method first marks the start and end points of the voice features in the audio, labelling the positions of speech segments and non-speech segments according to the flatness of the short-time energy of the audio in the frequency domain. The pitch frequency of the audio is then extracted: the glottal pulse cepstrum and the vocal tract response cepstrum are separated by the cepstrum method, the pitch frequency of each frame is extracted, and an impulse response simulating the glottal pulses is constructed from the extracted pitch frequency. After a fast Fourier transform and a logarithmic spectrum of the audio are taken, the cepstrum data are recovered with a discrete cosine transform, and the energy-concentrated part is extracted as the simulated vocal tract response. Finally, the voice is reconstructed from the extracted glottal impulse response and vocal tract response. The invention improves reconstruction efficiency while ensuring the accuracy of voice reconstruction.
Description
Technical Field
The invention relates to a voice reconstruction method based on vocal tract filtering and glottal excitation, and belongs to the technical field of speech synthesis.
Background
With the continuous progress of society, speech synthesis technology is widely applied in daily life, and its application value is becoming increasingly important. Whether synthesized speech sounds natural and human-like, and whether it conveys emotion, directly affects the human-computer interaction experience of a device.
Speech synthesis techniques can be broadly divided into three categories: waveform synthesis, parametric synthesis, and rule-based synthesis. Waveform synthesis stores the speech waveforms of different phonemes in a corpus; when needed, the required material is retrieved from the corpus, spliced together, and the desired speech signal is output. Parametric synthesis extracts acoustic parameters of speech, stores them in the system, and synthesizes the required speech signal by varying these parameters. Rule-based synthesis composes phonemes into syllables and words, controls pitch, rhythm, stress and similar features by rules, and uses these rules to convert the target text into a continuous sound wave.
The parametric synthesis method has developed rapidly in recent years owing to its small demand on the speech corpus, its convenient parameter extraction, and the wider range of prosodic features the system can accommodate. However, the method still suffers from drawbacks such as complex extraction algorithms, too many extracted parameters, and flat emotion in the synthesized speech.
Human voice is produced as the glottis continuously opens and closes: the airflow at the glottis makes the vocal cords vibrate, and the airflow then passes through the vocal tract to produce speech. The vocal tract also changes continuously during speaking, which is why different sounds are heard. The sounding model is based on the characteristics of the human articulatory organs and the principle of speech production: speech signals are reconstructed by extracting the speaker's pitch frequency and vocal tract parameters at different moments of the utterance and by simulating the glottal excitation and the vocal tract variation during phonation.
Disclosure of Invention
The invention aims to solve the technical problems of the complex extraction algorithm and the excessive number of extracted parameters in the traditional parametric synthesis method by providing a voice reconstruction method based on vocal tract filtering and glottal excitation.
The technical scheme of the invention is a voice reconstruction method based on vocal tract filtering and glottal excitation, comprising the following steps:
Step1, marking the start and end points of the voice features of the audio, and marking the position information of the speech segments and non-speech segments in the audio according to the flatness of the short-time energy of the audio in the frequency domain, for use in pitch frequency extraction.
Specifically, the audio signal is framed and windowed; the positions of speech segments and non-speech segments are detected from the flatness of the short-time energy of each frame in the frequency domain, so that frames with and without speech are distinguished; the detection result represents the voice end-point positions as a two-dimensional array, from which the start and end points of the voice in the audio are extracted.
Step2, the pitch frequency carries the acoustic information of the speaker in the audio. The pitch frequency of the audio is extracted: the glottal pulse cepstrum and the vocal tract response cepstrum are separated by the cepstrum method, and the pitch frequency of each frame of the audio is extracted.
Specifically, the quotient of the sampling frequency of the signal and the position of the maximum value of the cepstrum within the search range is taken as the feature, and the extraction result represents the pitch frequency of a section of sample audio as a one-dimensional array.
Step3, constructing simulated glottal excitation according to the extracted pitch frequency;
Specifically, the pitch frequency sequence extracted in Step2 is interpolated, smoothed and normalized and then fed into a voltage-controlled oscillator, which outputs an oscillating signal in the 75-300 Hz range; the oscillating signal is converted into a rectangular wave and subtracted from a copy of itself delayed by one sample, which yields the required impulse signal; the output represents the simulated glottal excitation of the sample audio as a one-dimensional array.
Step4, extracting the characteristic response of the vocal tract using the properties of the discrete cosine transform: after the fast Fourier transform and the logarithmic spectrum of the audio are taken, the cepstrum data are recovered with a discrete cosine transform, and finally the energy-concentrated part is extracted as the simulated vocal tract response and used as the filter data for reconstructing the voice;
Specifically, the framed original voice data are subjected to a fast Fourier transform and a logarithmic operation, half of the data points (namely 128 points) are extracted, a discrete cosine transform is used to recover the other half of the data, an inverse fast Fourier transform is performed, and the 42 points in which the vocal tract energy is most concentrated are extracted as the parameters of an FIR filter, i.e. the simulated vocal tract response.
Step5, reconstructing the voice from the extracted glottal excitation and vocal tract response. Using the glottal excitation extracted in Step3 and the vocal tract time-domain FIR filter parameters extracted in Step4, the extracted glottal excitation is passed frame by frame through the FIR filter by convolution; finally the data of each frame are combined into a one-dimensional array by inverse framing, and the array is written to an audio file at the sampling rate of the original voice.
The invention has the advantages that the computation needed to extract the vocal tract parameters is small and fast: only half of the data is needed to construct the vocal tract parameters, which reduces the computation time, and the audio endpoint detection reduces the interference of silent segments on the extraction of the reconstruction parameters, which improves operating efficiency. Therefore, aiming at the defects of the large amount of computation and the high data demand of voice reconstruction in the prior art, the invention improves the reconstruction efficiency while ensuring the accuracy of voice reconstruction.
Drawings
FIG. 1 is a diagram of the sounding model employed by the present invention;
FIG. 2 is a block diagram of the overall structure of the present invention;
FIG. 3 is the cepstrum of one frame of the speech signal of the present invention;
FIG. 4 is a plot of the pitch frequency estimation result of the present invention;
FIG. 5 is a waveform diagram of one frame of the glottal excitation signal of the present invention;
FIG. 6 is a waveform diagram of one frame of the vocal tract parameters of the present invention;
FIG. 7 is a comparison of the spectrograms of the original voice and the reconstructed voice.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
In this voice reconstruction method based on vocal tract filtering and glottal excitation, the reconstruction control system is shown in FIG. 1: the parameters required for reconstructing the audio, such as the speech/non-speech segment marks, the pitch frequency, the glottal parameters and white noise, are extracted by a parameter extraction module, and the target audio is reconstructed through an FIR filter, which solves the problems of the complex extraction algorithm and the excessive number of extracted parameters of the traditional parametric synthesis method.
The invention consists mainly of two parts, extracting the vocal tract filtering parameters and extracting the glottal excitation parameters; the overall flow chart is shown in FIG. 2.
The specific technical scheme of the invention is as follows:
Step1, marking the start and end points of the voice features of the audio, and marking the position information of the speech segments and non-speech segments according to the flatness of the short-time energy of the audio in the frequency domain, for use in pitch frequency extraction;
Step2, extracting the pitch frequency of the audio: separating the glottal pulse cepstrum and the vocal tract response cepstrum by the cepstrum method, and extracting the pitch frequency of each frame of the audio;
Step3, constructing the impulse response of the simulated glottal pulses according to the extracted pitch frequency;
Step4, using the properties of the discrete cosine transform: after the fast Fourier transform and the logarithmic spectrum of the audio are taken, recovering the cepstrum data with a discrete cosine transform, and finally extracting the energy-concentrated part as the simulated vocal tract response and as the filter data of the reconstructed voice;
Step5, reconstructing the voice from the extracted glottal excitation and vocal tract response.
The Step1 specifically comprises the following steps:
Step1.1, framing and windowing the audio signals in the corpus, with a window length of 256 and a frame shift of 128, and performing a fast Fourier transform on each windowed short-time voice frame;
Step1.2, calculating the spectral energy of each frame;
Step1.3, calculating the probability density function of each sample point in each frame;
Step1.4, calculating the spectral entropy value of each frame, as shown in formula (1):
H(i) = -Σn P(n,i)·log P(n,i) (1)
wherein H(i) is the spectral entropy of the i-th frame and P(n,i) is the normalized spectral probability density function of spectral line n in the i-th frame;
Step1.5, setting a decision threshold; here the threshold is set to 0.12;
Step1.6, performing endpoint detection according to the spectral entropy value of each frame: a value below the threshold is represented by 0 and regarded as a non-speech frame, and a value above the threshold is represented by 1 and regarded as a speech frame; the detection result is represented by a one-dimensional array X whose length is the number of frames after framing.
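For reference, a minimal Python (NumPy) sketch of Step1 is given below. It assumes a mono signal already loaded as a floating-point array; the window length 256, frame shift 128 and threshold 0.12 come from the description above, while the Hamming window, the entropy normalization to [0, 1] and the function name are illustrative assumptions rather than requirements of the patent.

```python
import numpy as np

def endpoint_detection(signal, win_len=256, hop=128, threshold=0.12):
    """Step1: mark each frame as speech (1) or non-speech (0) from its spectral entropy."""
    window = np.hamming(win_len)
    n_frames = 1 + max(0, (len(signal) - win_len) // hop)
    marks = np.zeros(n_frames, dtype=int)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + win_len] * window     # Step1.1: framing and windowing
        spectrum = np.abs(np.fft.rfft(frame)) ** 2             # Step1.2: spectral energy of the frame
        p = spectrum / (spectrum.sum() + 1e-12)                # Step1.3: probability density P(n, i)
        h = -np.sum(p * np.log2(p + 1e-12)) / np.log2(len(p))  # Step1.4: normalized spectral entropy, formula (1)
        marks[i] = 1 if h > threshold else 0                   # decision against the 0.12 threshold
    return marks                                               # one-dimensional array X, one mark per frame
```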
The Step2 specifically comprises the following steps:
Step2.1, performing a fast Fourier transform on the framed voice signal xn(m) to obtain Xn(k), then taking the modulus and the logarithm to obtain the magnitude spectrum En, as shown in formula (2):
En=20log10(|Xn(k)|) (2)
Step2.2, taking the inverse fast Fourier transform of En to obtain the cepstrum of the frame signal, as shown in FIG. 3; peaks appear in the cepstrum at intervals equal to the harmonic spacing, and the quotient of the sampling frequency and the peak spacing is the required pitch frequency. With the coordinates S1 and S2 of two adjacent peaks and the voice sampling frequency fs, the pitch frequency is obtained according to formula (3):
F(i) = fs/(S2 - S1) (3)
where i is the index of the current frame.
The pitch frequency estimate of the voice signal is shown in FIG. 4, where the background is the spectrogram of the voice signal; the pitch frequency extracted by the algorithm of the invention is accurate.
In order to estimate the pitch frequency conveniently, the method searches the cepstrum range corresponding to pitch frequencies of 60-500 Hz, i.e. the maximum peak coordinate S(i) is found in the abscissa interval (16, 133) of the cepstrum, and the pitch frequency is obtained according to formula (4):
F(i) = fs/S(i) (4)
Step2.3, outputting the calculated pitch frequency of each frame as a one-dimensional array whose length is the total number of frames after framing the voice signal, as shown in formula (5), where n is the total number of frames.
F=[F(1),F(2),F(3),…,F(n)] (5)
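A corresponding sketch of Step2 might look as follows; the frames are assumed to be those produced in Step1, and the 60-500 Hz search range is converted into quefrency indices from the sampling frequency (which gives the interval (16, 133) mentioned above for 8 kHz audio).

```python
import numpy as np

def pitch_per_frame(frames, fs, n_fft=256):
    """Step2: estimate the pitch frequency of each frame by the cepstrum method."""
    pitches = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        spectrum = np.fft.fft(frame, n_fft)
        log_mag = 20 * np.log10(np.abs(spectrum) + 1e-12)  # formula (2): En = 20log10(|Xn(k)|)
        cepstrum = np.real(np.fft.ifft(log_mag))           # Step2.2: cepstrum of the frame signal
        lo, hi = int(fs / 500), int(fs / 60)               # quefrency interval for pitches of 60-500 Hz
        peak = lo + np.argmax(cepstrum[lo:hi])             # maximum peak coordinate S(i)
        pitches[i] = fs / peak                             # formula (4): F(i) = fs / S(i)
    return pitches                                         # formula (5): F = [F(1), F(2), ..., F(n)]
```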
The Step3 specifically comprises the following steps:
Step3.1, applying cubic spline interpolation to the pitch frequency extracted in Step2.3 to generate a pitch frequency sequence Fc with smoother pitch period transitions between frames, whose length is the frame length multiplied by the total number of frames;
Step3.2, feeding the interpolated pitch frequency into a VCO (voltage-controlled oscillator), whose expression is shown in formula (6);
Step3.3, normalizing the data output by the VCO as shown in formula (7);
In formula (7), the normalization range is the 75 to 300 Hz frequency range, and the waveform of one frame of the resulting glottal excitation signal is shown in FIG. 5.
Converting z(n) into a rectangular wave r(n), and taking the difference of r(n) according to formula (8), yields the glottal pulse excitation;
R(n)=r(n)-r(n-1) (8)
the glottal pulse excitation extracted from a frame of speech signal is shown in fig. 5, wherein the abscissa represents the number of sampling points in a frame, the ordinate represents the amplitude value of the glottal pulse excitation, and the number of glottal pulses in a frame is determined by the pitch period of the current frame.
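Step3 can be sketched as below. Because formulas (6) and (7) are not reproduced in this text, the VCO is modelled here as a simple cumulative-phase oscillator and the normalization as clipping to 75-300 Hz; both are assumptions standing in for the patent's exact expressions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def glottal_excitation(pitch, fs, frame_len=256):
    """Step3: build a simulated glottal pulse train from the per-frame pitch contour."""
    n_frames = len(pitch)
    # Step3.1: cubic-spline interpolation from one pitch value per frame to one per sample
    t = np.linspace(0, n_frames - 1, n_frames * frame_len)
    fc = CubicSpline(np.arange(n_frames), pitch)(t)
    fc = np.clip(fc, 75.0, 300.0)                # normalization range of formula (7), assumed as clipping
    # Step3.2: voltage-controlled oscillator driven by the interpolated pitch (assumed form of formula (6))
    z = np.cos(2.0 * np.pi * np.cumsum(fc) / fs)
    # Rectangular wave and one-sample-delayed difference, formula (8): R(n) = r(n) - r(n-1)
    r = (z > 0).astype(float)
    return np.diff(r, prepend=r[0])              # impulses at the edges of each glottal cycle
```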
The Step4 specifically comprises the following steps:
Step4.1, performing a 256-point FFT (fast Fourier transform) on the voice data of each frame, and taking the logarithmic spectrum P1 of the first 128 points;
Step4.2, applying a discrete cosine transform to the spectrum P1 to obtain P2, taking the data in the interval 1-25 of P2 and applying an inverse discrete cosine transform to obtain the matrix P3;
Step4.3, reversing the order of P3 to obtain the matrix P4, and concatenating P3 and P4 into a new matrix P5 = [P3, P4].
Step4.4, performing an inverse Fourier transform on P5 and taking the real part to obtain the all-pole filter parameter matrix G of the vocal tract model.
Step4.5, taking the 42 points of G in which the energy is most concentrated as the vocal tract filter parameter matrix G1; the output is represented as a two-dimensional array whose rows correspond to frames and whose columns are the vocal tract filter parameters of each frame.
The vocal tract parameters extracted from one frame of the speech signal are shown in FIG. 6; the abscissa is the number of sampling points and the ordinate is the amplitude of the vocal tract parameters.
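A single-frame sketch of Step4 is given below, using SciPy's DCT routines. Reading "the 42 points with the most concentrated energy" as the first 42 points of the time-domain response, and keeping all 128 points when the low-order DCT coefficients are inverted, are interpretations of the text rather than details it states explicitly.

```python
import numpy as np
from scipy.fft import dct, idct

def vocal_tract_filter(frame, n_fft=256, n_keep=25, n_taps=42):
    """Step4: extract the per-frame vocal tract response used as FIR filter parameters."""
    log_spec = np.log(np.abs(np.fft.fft(frame, n_fft)) + 1e-12)
    p1 = log_spec[:n_fft // 2]              # Step4.1: log spectrum of the first 128 points
    p2 = dct(p1, norm='ortho')              # Step4.2: discrete cosine transform
    p2[n_keep:] = 0.0                       # keep the interval 1-25 (the smooth envelope part)
    p3 = idct(p2, norm='ortho')             # inverse DCT back to a smoothed log spectrum
    p5 = np.concatenate([p3, p3[::-1]])     # Step4.3: P5 = [P3, P4], where P4 is P3 reversed
    g = np.real(np.fft.ifft(p5))            # Step4.4: real part of the inverse Fourier transform
    return g[:n_taps]                       # Step4.5: 42 energy-concentrated points as the FIR taps
```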
The Step5 specifically comprises the following steps:
Step5.1, according to the extracted speech/silence mark array X: when the mark of the current frame is 0, i.e. silence, random white noise is used to generate the glottal excitation; the vocal tract parameters of the current frame are extracted and placed into the FIR filter, the voice of the current frame is reconstructed, and the parameters are updated once per frame;
Step5.2, when the mark of the current frame is 1, i.e. a speech segment, the glottal pulse excitation of the current frame is extracted; the vocal tract parameters of the current frame are extracted and placed into the FIR filter, the voice of the current speech frame is reconstructed, and the parameters are updated once per frame;
Step5.3, storing the voice data reconstructed from each frame into a matrix W, in which each column stores the voice signal reconstructed from one frame, N columns in total.
Step5.4, restoring the matrix W to a voice signal by inverse framing. The spectrogram of the synthesized speech is shown in FIG. 7: the first subfigure is the spectrogram of the original voice and the second subfigure is the spectrogram of the reconstructed voice. The method used by the invention restores the original voice well: at low frequencies the relations between the formants and the harmonics are well preserved, and at high frequencies the information contained in the voice is well preserved; when the reconstructed voice is fed into speech-to-text software, the text content of the reconstructed voice can be recognized.
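Step5 can then be sketched as follows, combining the outputs of the previous sketches. The overlap-add with frame shift 128, the silent-frame noise level and the use of scipy.io.wavfile for writing the result are all illustrative assumptions.

```python
import numpy as np
from scipy.io import wavfile

def reconstruct(marks, excitation, tract_filters, fs, frame_len=256, hop=128,
                out_path='reconstructed.wav'):
    """Step5: filter each frame's excitation with its vocal tract FIR filter and reassemble the signal."""
    n_frames = len(marks)
    out = np.zeros(n_frames * hop + frame_len)
    for i in range(n_frames):
        if marks[i] == 0:                                           # Step5.1: silent frame, white-noise excitation
            exc = 0.01 * np.random.randn(frame_len)
        else:                                                       # Step5.2: speech frame, glottal pulse excitation
            exc = excitation[i * frame_len:(i + 1) * frame_len]
        frame_out = np.convolve(exc, tract_filters[i])[:frame_len]  # per-frame FIR filtering by convolution
        out[i * hop:i * hop + frame_len] += frame_out               # Steps 5.3-5.4: inverse framing (overlap-add)
    out /= np.max(np.abs(out)) + 1e-12                              # normalize before writing
    wavfile.write(out_path, fs, out.astype(np.float32))             # write at the original sampling rate
    return out
```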
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.
Claims (3)
1. A voice reconstruction method based on vocal tract filtering and glottal excitation, characterized by comprising the following steps:
Step1, marking the start and end points of the voice features of the audio, and marking the position information of the speech segments and non-speech segments in the audio according to the flatness of the short-time energy of the audio in the frequency domain, for use in pitch frequency extraction;
Step2, extracting the pitch frequency of the audio: separating the glottal pulse cepstrum and the vocal tract response cepstrum by the cepstrum method, and extracting the pitch frequency of each frame of the audio;
Step3, constructing simulated glottal excitation according to the extracted pitch frequency;
Step4, after the fast Fourier transform and the logarithmic spectrum of the audio are taken, recovering the cepstrum data with a discrete cosine transform, and finally extracting the energy-concentrated part as the simulated vocal tract response and as the filter data of the reconstructed voice;
Step5, reconstructing the voice from the extracted glottal excitation and vocal tract response;
Step3 is specifically:
Step3.1, applying cubic spline interpolation to the pitch frequency to generate a pitch frequency sequence Fc with smoother pitch period transitions between frames, whose length is the frame length multiplied by the total number of frames;
Step3.2, feeding the interpolated pitch frequency into a VCO (voltage-controlled oscillator), whose expression is shown in formula (6);
Step3.3, normalizing the data output by the VCO as shown in formula (7),
wherein the normalization range is 75 to 300 Hz;
converting z(n) into a rectangular wave r(n), and taking the difference of r(n) according to formula (8) to obtain the glottal pulse excitation;
R(n)=r(n)-r(n-1) (8)
Step4 is specifically:
Step4.1, performing a 256-point FFT (fast Fourier transform) on the voice data of each frame, and taking the logarithmic spectrum P1 of the first 128 points;
Step4.2, applying a discrete cosine transform to the spectrum P1 to obtain P2, taking the data in the interval 1-25 of P2 and applying an inverse discrete cosine transform to obtain the matrix P3;
Step4.3, reversing the order of P3 to obtain the matrix P4, and concatenating P3 and P4 into a new matrix P5 = [P3, P4];
Step4.4, performing an inverse Fourier transform on P5 and taking the real part to obtain the all-pole filter parameter matrix G of the vocal tract model;
Step4.5, taking the 42 points of G in which the energy is most concentrated as the vocal tract filter parameter matrix G1; the output is represented as a two-dimensional array whose rows correspond to frames and whose columns are the vocal tract filter parameters of each frame.
2. The voice reconstruction method based on vocal tract filtering and glottal excitation according to claim 1, wherein in Step1 the start and end points of the voice features of the audio are marked specifically by framing and windowing the audio signal and then detecting the positions of the speech segments and non-speech segments from the flatness of the short-time energy of the audio in the frequency domain, so that frames with and without speech are distinguished; the extraction result represents the voice end-point positions as a two-dimensional array, from which the start and end points of the voice in the audio are extracted.
3. The voice reconstruction method based on vocal tract filtering and glottal excitation according to claim 1, wherein in Step2 the quotient of the sampling frequency of the signal and the position of the maximum value of the cepstrum within the search range is taken as the feature, and the extraction result represents the pitch frequency of a section of sample audio as a one-dimensional array.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111650490.6A CN114974271B (en) | 2021-12-29 | 2021-12-29 | Voice reconstruction method based on sound channel filtering and glottal excitation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114974271A CN114974271A (en) | 2022-08-30 |
CN114974271B (en) | 2025-07-08
Family
ID=82975223
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111650490.6A Active CN114974271B (en) | 2021-12-29 | 2021-12-29 | Voice reconstruction method based on sound channel filtering and glottal excitation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114974271B (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007015489A1 (en) * | 2005-08-01 | 2007-02-08 | Kyushu Institute Of Technology | Voice search device and voice search method |
CN108281150B (en) * | 2018-01-29 | 2020-11-17 | 上海泰亿格康复医疗科技股份有限公司 | Voice tone-changing voice-changing method based on differential glottal wave model |
CN111147021B (en) * | 2020-04-03 | 2020-07-24 | 恒玄科技(上海)股份有限公司 | Voltage controlled oscillator |
- 2021-12-29: application CN202111650490.6A filed in CN (patent CN114974271B, status Active)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109346106A (en) * | 2018-09-06 | 2019-02-15 | 河海大学 | A Cepstral Domain Pitch Period Estimation Method Based on Subband Signal-to-Noise Ratio Weighting |
Non-Patent Citations (2)
Title |
---|
A pitch detection algorithm based on linear prediction residual cepstrum (一种基于线性预测残差倒谱的基音检测算法); Qian Bo, Li Yanping, Tang Zhenmin, Xu Limin; Computer Engineering and Applications; 2007-11-11, No. 32; pp. 210-213 *
A language identification method with pitch frequency normalization (一种基音频率归一化的语种识别方法); Duan Yun et al.; Microelectronics & Computer (微电子学与计算机); 2023-05-31, Vol. 40, No. 5; pp. 20-28 *
Also Published As
Publication number | Publication date |
---|---|
CN114974271A (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8719030B2 (en) | System and method for speech synthesis | |
US9324330B2 (en) | Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm | |
CN102664003B (en) | Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM) | |
US8280724B2 (en) | Speech synthesis using complex spectral modeling | |
CN101589430A (en) | Voice isolation device, voice synthesis device, and voice quality conversion device | |
CN108682432B (en) | Voice emotion recognition device | |
CN113436606A (en) | Original sound speech translation method | |
Ai et al. | Knowledge-and-data-driven amplitude spectrum prediction for hierarchical neural vocoders | |
US11183169B1 (en) | Enhanced virtual singers generation by incorporating singing dynamics to personalized text-to-speech-to-singing | |
CN114974271B (en) | Voice reconstruction method based on sound channel filtering and glottal excitation | |
Chazan et al. | Small footprint concatenative text-to-speech synthesis system using complex spectral envelope modeling. | |
JP4430174B2 (en) | Voice conversion device and voice conversion method | |
TWI302296B (en) | ||
Richards et al. | Deriving articulatory representations from speech with various excitation modes | |
Prasad et al. | Backend tools for speech synthesis in speech processing | |
Wang et al. | Beijing opera synthesis based on straight algorithm and deep learning | |
Hasan et al. | An approach to voice conversion using feature statistical mapping | |
Bonada et al. | Spectral approach to the modeling of the singing voice | |
Jung et al. | Pitch alteration technique in speech synthesis system | |
He et al. | Collaborative AI Dysarthric Speech Recognition System with Data Augmentation using Generative Adversarial Neural Network | |
CN120564735B (en) | Audio real-time conversion and analysis management system and method based on artificial intelligence | |
Lohiya et al. | Survey on Computer Aided Language Learning using automatic accent assessment techniques | |
Bohm et al. | Algorithm for formant tracking, modification and synthesis | |
Liu et al. | Research on Voice Humming to Guzheng Audio Transcription and Pitch-Lowering Processing Based on Genetic Algorithm and YIN Algorithm | |
Luo et al. | On End-to-End Chinese Speech Synthesis Based on World-Tacotron |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |