
CN114974271B - Voice reconstruction method based on vocal tract filtering and glottal excitation - Google Patents


Info

Publication number
CN114974271B
Authority
CN
China
Prior art keywords
voice
audio
frequency
cepstrum
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111650490.6A
Other languages
Chinese (zh)
Other versions
CN114974271A (en)
Inventor
邵玉斌
段云
龙华
杜庆治
苏树盟
黄张衡
刘晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202111650490.6A
Publication of CN114974271A
Application granted
Publication of CN114974271B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/78 Detection of presence or absence of voice signals


Abstract

The invention relates to a voice reconstruction method based on vocal tract filtering and glottal excitation, and belongs to the technical field of speech synthesis. The method first marks the start and end points of voice activity in the audio, labeling the positions of speech and non-speech segments according to the flatness of the short-time energy in the frequency domain. The pitch frequency of each frame is then extracted, with the glottal pulse cepstrum and the vocal tract response cepstrum separated by the cepstral method. A simulated glottal impulse excitation is constructed from the extracted pitch frequencies. After the audio passes through a fast Fourier transform and a log spectrum is taken, the cepstral data are recovered with a discrete cosine transform, and the energy-concentrated portion is extracted as the simulated vocal tract response. Finally, the voice is reconstructed from the extracted glottal excitation and vocal tract response. The invention improves reconstruction efficiency while preserving the accuracy of voice reconstruction.

Description

Voice reconstruction method based on vocal tract filtering and glottal excitation
Technical Field
The invention relates to a voice reconstruction method based on vocal tract filtering and glottal excitation, and belongs to the technical field of speech synthesis.
Background
With the continuous progress of society, speech synthesis technology is widely applied in people's daily lives, and its application value grows ever more important. Whether synthesized speech sounds natural and human-like, and whether it carries emotion, directly affects the human-computer interaction experience of a device.
Speech synthesis techniques can be broadly divided into three categories: waveform synthesis, parametric synthesis, and rule-based synthesis. Waveform synthesis stores the speech waveforms of different phonemes in a corpus; when speech is needed, the required material is retrieved from the corpus, spliced together, and output as the target speech signal. Parametric synthesis extracts acoustic parameters of speech, stores them in the system, and synthesizes the required speech signal by varying those parameters. Rule-based synthesis builds syllables and words from phonemes and applies rules controlling pitch, rhythm, stress, and the like to synthesize the target text and convert it into continuous sound waves.
Parametric synthesis has developed rapidly in recent years thanks to its small corpus requirements, convenient parameter extraction, and the wide range of prosodic features the system can accommodate. However, the method still suffers from complex extraction algorithms, an excessive number of extracted parameters, and emotionally flat synthesized speech.
Human voice production works as follows: the glottis opens and closes continuously, so that the airflow at the glottis drives the vocal cords to vibrate, and the airflow then passes through the vocal tract to produce speech. The vocal tract itself keeps changing shape during speaking, which is why different sounds are heard. The voice production model of this invention is based on the characteristics of the human articulatory organs and the principle of voice generation: speech signals are reconstructed by extracting the speaker's pitch frequency and vocal tract parameters at different moments in the voice and simulating the glottal excitation and vocal tract changes during phonation.
Disclosure of Invention
The invention aims to solve the technical problems of the complex extraction algorithm and excessive extraction parameters of traditional parametric synthesis methods by providing a voice reconstruction method based on vocal tract filtering and glottal excitation.
The technical scheme of the invention is a voice reconstruction method based on vocal tract filtering and glottal excitation, comprising the following steps:
Step1, marking the start and end points of voice activity in the audio, and labeling the positions of speech and non-speech segments according to the flatness of the short-time energy of the detected audio in the frequency domain, in preparation for pitch extraction.
Specifically, the audio signal is framed and windowed, the positions of speech and non-speech segments are detected from the flatness of the short-time energy in the frequency domain, frames with and without speech are distinguished, and the endpoint positions of a segment of voice are represented as a two-dimensional array, from which the start and end points of the voice are extracted.
Step2, since the pitch frequency carries the speaker's acoustic information in the audio, the pitch of the audio is extracted: the glottal pulse cepstrum and the vocal tract response cepstrum are separated by the cepstral method, and the pitch frequency of each frame of the audio is extracted.
Specifically, the quotient of the signal's sampling frequency and the position of the cepstral maximum within the pitch search range is taken as the feature, and the extraction result represents the pitch frequency of a segment of sample audio as a one-dimensional array.
Step3, constructing simulated glottal excitation according to the extracted pitch frequency;
Specifically, the sample pitch frequencies extracted in Step2 are interpolated, smoothed, and normalized, then fed into a voltage-controlled oscillator, which outputs an oscillating signal within 75-300 Hz. The oscillation is converted into a rectangular wave, delayed by one sample, and subtracted from itself to obtain the required impulse signal; the output represents the simulated glottal excitation of the sample audio as a one-dimensional array.
Step4, extracting the characteristic response of the vocal tract using the properties of the discrete cosine transform: after the audio passes through a fast Fourier transform and a log spectrum is taken, the cepstral data are recovered with a discrete cosine transform, and finally the energy-concentrated part is extracted as the simulated vocal tract response and used as the filter data for reconstructing the voice;
Specifically, the framed original voice data undergo a fast Fourier transform and a logarithm operation; half of the data points (namely 128 points) are extracted, a discrete cosine transform recovers the other (mirrored) half of the data, an inverse fast Fourier transform is applied, and the 42 points where the energy is most concentrated are extracted as the parameters of an FIR filter, i.e., the simulated vocal tract response.
Step5, reconstructing the voice by using the extracted glottal excitation and vocal tract response. According to the glottal excitation extracted in Step3 and the vocal tract time-domain FIR filter parameters extracted in Step4, the extracted glottal excitation is passed frame by frame through the FIR filter by convolution; finally the data of each frame are overlap-added into a one-dimensional array by inverse framing, and the array is written into an audio file at the sampling rate of the original voice.
The invention has the advantages that the computation for extracting the vocal tract impulse response is small and fast; constructing the vocal tract parameters requires only half of the data, which reduces computation time; and audio endpoint detection reduces the interference of silent segments on the extraction of reconstruction parameters, improving efficiency. Therefore, aiming at the drawbacks of heavy computation and high data demand in prior-art voice reconstruction, the invention improves reconstruction efficiency while preserving the accuracy of voice reconstruction.
Drawings
FIG. 1 is a diagram of the voice production model employed by the present invention;
FIG. 2 is a block diagram of the overall structure of the present invention;
FIG. 3 is a cepstrum of a frame of speech signal of the present invention;
FIG. 4 is a plot of the result of pitch frequency estimation of the present invention;
FIG. 5 is a waveform diagram of a frame of glottal excitation signals according to the present invention;
FIG. 6 is a waveform diagram of one frame of channel parameters of the present invention;
FIG. 7 is a comparison of the spectrograms of the original voice and the reconstructed voice.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
A voice reconstruction method based on vocal tract filtering and glottal excitation: the reconstruction control system diagram is shown in FIG. 1. The parameters required for reconstruction, such as voiced/unvoiced segment marks, pitch frequency, glottal parameters, and white noise, are extracted by a parameter extraction module, and the target audio is reconstructed through an FIR filter, solving the problems of complex extraction algorithms and excessive extraction parameters in traditional parametric synthesis methods.
The invention consists mainly of two parts: extraction of the vocal tract filter parameters and extraction of the glottal excitation parameters. The overall flow chart is shown in FIG. 2.
The specific technical scheme of the invention is as follows:
Step1, marking the start and end points of voice activity in the audio, and labeling the positions of speech and non-speech segments according to the flatness of the short-time energy of the detected audio in the frequency domain, for pitch extraction;
Step2, extracting the pitch frequency of the audio: separating the glottal pulse cepstrum and the vocal tract response cepstrum by the cepstral method, and extracting the pitch frequency of each frame of the audio;
Step3, constructing the impulse response of the simulated glottal pulses from the extracted pitch frequency;
Step4, using the properties of the discrete cosine transform: after the audio undergoes a fast Fourier transform and a log spectrum is taken, the cepstral data are recovered by the discrete cosine transform, and finally the energy-concentrated part is extracted as the simulated vocal tract response and used as the filter data of the reconstructed voice;
Step5, reconstructing the voice by using the extracted glottal excitation and vocal tract response.
Step1 specifically comprises the following steps:
Step1.1, framing and windowing the audio signals in the corpus, with a window length of 256 and a frame shift of 128, then applying a fast Fourier transform to each windowed short-time voice frame;
Step1.2, calculating the spectral energy of each frame;
Step1.3, calculating the probability density function of each sample point in each frame;
Step1.4, calculating the spectral entropy of each frame, as shown in formula (1):
H(i) = -∑n P(n,i) log2 P(n,i) (1)
wherein H(i) is the spectral entropy of the i-th frame and P(n,i) is the normalized spectral probability density function of spectral line n in the i-th frame;
Step1.5, setting a decision threshold; the threshold is set to 0.12;
Step1.6, performing endpoint detection according to the spectral entropy of each frame: values below the threshold are represented by 0 and regarded as non-speech segments, and values above the threshold are represented by 1 and regarded as speech segments. The detection result is represented by a one-dimensional array X whose length is the number of frames after framing.
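For illustration, the endpoint detection of Step1 can be sketched in a few lines of NumPy. This is a minimal sketch, not the patent's implementation: the window function, the epsilon guards, and the normalization of the entropy before comparing against the 0.12 threshold are assumptions, since the text does not specify them.

```python
import numpy as np

def endpoint_detection(x, frame_len=256, hop=128, threshold=0.12):
    """Step1 sketch: per-frame spectral entropy, thresholded at 0.12."""
    win = np.hamming(frame_len)                      # window choice is an assumption
    n_frames = 1 + (len(x) - frame_len) // hop
    X = np.zeros(n_frames, dtype=int)                # one-dimensional mark array X
    for i in range(n_frames):
        frame = x[i * hop:i * hop + frame_len] * win
        spec = np.abs(np.fft.rfft(frame)) ** 2       # Step1.2: spectral energy
        P = spec / (np.sum(spec) + 1e-12)            # Step1.3: probability density
        H = -np.sum(P * np.log2(P + 1e-12))          # Step1.4: spectral entropy, formula (1)
        # Decision rule per the text: above threshold -> speech (1), below -> non-speech (0).
        # Dividing by the maximum possible entropy to get a 0..1 value is an assumption.
        X[i] = 1 if H / np.log2(len(P)) > threshold else 0
    return X
```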
Step2 specifically comprises the following steps:
Step2.1, performing a fast Fourier transform on the framed voice signal xn(m) to obtain Xn(k), then taking the modulus and the logarithm to obtain the magnitude spectrum, as shown in formula (2):
En = 20 log10(|Xn(k)|) (2)
Step2.2, performing an inverse fast Fourier transform on En to obtain the cepstrum of the frame signal, as shown in FIG. 3. The cepstrum shows peaks spaced at the interval between harmonics, and the quotient of the sampling frequency and that spacing is the required pitch frequency. Finding the coordinates S1 and S2 of two adjacent peaks and using the voice sampling frequency fs, the pitch frequency is obtained according to formula (3):
F(i) = fs / (S2 - S1) (3)
where i is the index of the current frame.
The pitch frequency estimate of the voice signal is shown in FIG. 4, where the background is the spectrogram of the voice signal; the pitch extraction algorithm of the invention is quite accurate.
To estimate the pitch frequency conveniently, the method searches the range of the cepstrum corresponding to pitch frequencies of 60-500 Hz, i.e., the maximum peak coordinate S(i) is found in the cepstrum abscissa interval (16, 133) (for a sampling rate of 8000 Hz, fs/500 = 16 and fs/60 ≈ 133), and the pitch frequency is obtained according to formula (4):
F(i) = fs / S(i) (4)
Step2.3, outputting the calculated pitch frequency of each frame as a one-dimensional array whose length is the total number of frames after framing, as shown in formula (5), where n is the total number of frames:
F = [F(1), F(2), F(3), …, F(n)] (5)
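A minimal sketch of this cepstral pitch extractor follows, assuming a 256-point FFT per frame as in Step4.1 and the 60-500 Hz search band of formula (4); the epsilon guard inside the logarithm is an assumption.

```python
import numpy as np

def cepstral_pitch(frames, fs):
    """Steps 2.1-2.3 sketch: pitch per frame from the dominant cepstral peak."""
    lo, hi = int(fs / 500), int(fs / 60)             # 60-500 Hz band, e.g. (16, 133) at 8 kHz
    F = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        spec = np.fft.fft(frame, 256)
        En = 20.0 * np.log10(np.abs(spec) + 1e-12)   # formula (2); epsilon is an assumption
        ceps = np.real(np.fft.ifft(En))              # Step2.2: cepstrum of the frame
        S = lo + np.argmax(ceps[lo:hi])              # maximum peak coordinate S(i)
        F[i] = fs / S                                # formula (4)
    return F                                         # one-dimensional array, formula (5)
```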
Step3 specifically comprises the following steps:
Step3.1, applying cubic spline interpolation to the pitch frequencies extracted in Step2.3 to generate a pitch frequency sequence Fc with smoother pitch period transitions between frames; its length is the frame length multiplied by the total number of frames;
Step3.2, feeding the interpolated pitch frequency into a voltage-controlled oscillator (VCO), whose expression is shown in equation (6);
Step3.3, normalizing the data output by the VCO, as shown in formula (7), where the normalization range is the 75-300 Hz frequency band; the waveform of the resulting frame of the glottal excitation signal is shown in FIG. 5.
The oscillation z(n) is then converted into a rectangular wave r(n), and the difference in formula (8) is applied to obtain the glottal pulse excitation:
R(n) = r(n) - r(n-1) (8)
The glottal pulse excitation extracted from one frame of the speech signal is shown in FIG. 5, where the abscissa is the sample index within the frame and the ordinate is the amplitude of the glottal pulse excitation; the number of glottal pulses in a frame is determined by the pitch period of the current frame.
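The following sketch shows one plausible realization of Step3 in NumPy/SciPy. The phase-accumulating VCO and the positive-impulse selection are assumptions, since equations (6) and (7) are not reproduced in this text; only the spline interpolation, the 75-300 Hz range, and the difference of formula (8) are taken from the description.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def glottal_excitation(F, frame_len, fs):
    """Step3 sketch: spline-interpolated pitch track -> VCO -> rectangular wave
    -> difference -> impulse train."""
    knots = np.arange(len(F)) * frame_len
    Fc = CubicSpline(knots, F)(np.arange(len(F) * frame_len))  # Step3.1
    Fc = np.clip(Fc, 75.0, 300.0)             # keep the track inside the 75-300 Hz range
    phase = 2.0 * np.pi * np.cumsum(Fc) / fs  # assumed VCO form: frequency -> phase
    z = np.cos(phase)                         # oscillating signal z(n)
    r = (z > 0).astype(float)                 # rectangular wave r(n)
    R = np.diff(r, prepend=0.0)               # R(n) = r(n) - r(n-1), formula (8)
    return np.maximum(R, 0.0)                 # keeping only positive edges is an assumption
```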
Step4 specifically comprises the following steps:
Step4.1, applying a 256-point FFT to the voice data of each frame and taking the log spectrum P1 of the first 128 points;
Step4.2, applying a discrete cosine transform to P1 to obtain P2, then taking the data in the interval 1-25 of P2 and applying an inverse discrete cosine transform to obtain the matrix P3;
Step4.3, reversing the order of P3 to obtain P4, and reconstructing a new matrix P5 = [P3, P4] from P3 and P4;
Step4.4, applying an inverse Fourier transform to P5 and taking the real part to obtain the all-pole filter parameter matrix G of the vocal tract model;
Step4.5, taking the 42 points of G where the energy is most concentrated as the vocal tract filter parameter matrix G1. The output is a two-dimensional array whose rows index the frames and whose columns are the vocal tract filter parameters of each frame.
The vocal tract parameters extracted from one frame of the speech signal are shown in FIG. 6, where the abscissa is the sample index and the ordinate is the magnitude of the vocal tract parameters.
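A sketch of Steps 4.1-4.5 might read as follows, assuming that taking "the interval 1-25" means zeroing the higher-order DCT coefficients and that the 42 energy-concentrated points are the leading samples of G; both readings are assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

def vocal_tract_taps(frame, n_ceps=25, n_taps=42):
    """Steps 4.1-4.5 sketch: 42 FIR taps for the frame's vocal tract response."""
    spec = np.fft.fft(frame, 256)                   # Step4.1: 256-point FFT
    P1 = np.log10(np.abs(spec[:128]) + 1e-12)       # log spectrum of the first 128 points
    P2 = dct(P1, norm='ortho')                      # Step4.2: DCT of P1
    P2[n_ceps:] = 0.0                               # keep interval 1-25 (assumed reading)
    P3 = idct(P2, norm='ortho')                     # smoothed half spectrum
    P4 = P3[::-1]                                   # Step4.3: reversed order
    P5 = np.concatenate([P3, P4])                   # symmetric full-length spectrum
    G = np.real(np.fft.ifft(P5))                    # Step4.4: IFFT, real part
    return G[:n_taps]                               # Step4.5: leading points as FIR taps
```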
Step5 specifically comprises the following steps:
Step5.1, according to the extracted speech/silence mark array X: when the mark of the current frame is 0, i.e., silence, silent glottal excitation is generated from random white noise, the vocal tract parameters of the current frame are extracted and placed into the FIR filter, and the voice of the current frame is reconstructed, the parameters being updated once per frame;
Step5.2, when the mark of the current frame is 1, i.e., a voiced segment, the glottal pulse excitation of the current frame is extracted, the vocal tract parameters of the current frame are extracted and placed into the FIR filter, and the voice of the current voiced frame is reconstructed, the parameters being updated once per frame;
Step5.3, storing the voice data reconstructed from each frame in a matrix W, where each column holds the voice signal reconstructed from one frame, N frames in total;
Step5.4, restoring the matrix W to a speech signal by inverse framing. The spectrogram of the synthesized speech is shown in FIG. 7: the first subplot is the spectrogram of the original voice and the second is that of the reconstructed voice. The method restores the original voice well; at low frequencies the relationships between formants and harmonics are preserved, and at high frequencies the information contained in the voice is restored. When the reconstructed voice is fed into speech-to-text software, its textual content can be recognized.
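Finally, Step5 can be sketched as frame-by-frame convolution followed by overlap-add. This is a sketch under stated assumptions: the white-noise gain for silent frames and the overlap-add inverse framing with a 128-sample hop are not specified in the text.

```python
import numpy as np

def reconstruct(X, excitations, taps, frame_len=256, hop=128):
    """Step5 sketch: per-frame FIR filtering of the excitation, then overlap-add."""
    out = np.zeros(len(X) * hop + frame_len)
    for i in range(len(X)):
        if X[i] == 0:                                   # silent frame: white noise source
            src = 0.01 * np.random.randn(frame_len)     # noise gain is an assumption
        else:                                           # voiced frame: glottal pulses
            src = excitations[i]
        y = np.convolve(src, taps[i])[:frame_len]       # pass through the frame's FIR filter
        out[i * hop:i * hop + frame_len] += y           # inverse framing via overlap-add
    return out                                          # write out at the original rate
```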
While the present invention has been described in detail with reference to the drawings, the invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (3)

1. A voice reconstruction method based on vocal tract filtering and glottal excitation, characterized by comprising the following steps:
Step1, marking the start and end points of voice activity in the audio, and labeling the positions of speech and non-speech segments according to the flatness of the short-time energy of the detected audio in the frequency domain, for pitch extraction;
Step2, extracting the pitch frequency of the audio: separating the glottal pulse cepstrum and the vocal tract response cepstrum by the cepstral method, and extracting the pitch frequency of each frame of the audio;
Step3, constructing simulated glottal excitation according to the extracted pitch frequency;
Step4, after the audio undergoes a fast Fourier transform and a log spectrum is taken, recovering the cepstral data by the discrete cosine transform, and finally extracting the energy-concentrated part as the simulated vocal tract response and as the filter data of the reconstructed voice;
Step5, reconstructing the voice by using the extracted glottal excitation and vocal tract response;
Step3 is specifically:
Step3.1, applying cubic spline interpolation to the pitch frequencies to generate a pitch frequency sequence Fc with smoother pitch period transitions between frames, its length being the frame length multiplied by the total number of frames;
Step3.2, feeding the interpolated pitch frequency into a voltage-controlled oscillator (VCO), whose expression is shown in equation (6);
Step3.3, normalizing the data output by the VCO as shown in formula (7), wherein the normalization range is 75 to 300 Hz;
converting z(n) into a rectangular wave r(n), and applying the difference in formula (8) to obtain the glottal pulse excitation;
R(n) = r(n) - r(n-1) (8)
Step4 is specifically:
Step4.1, applying a 256-point FFT to the voice data of each frame and taking the log spectrum P1 of the first 128 points;
Step4.2, applying a discrete cosine transform to P1 to obtain P2, then taking the data in the interval 1-25 of P2 and applying an inverse discrete cosine transform to obtain the matrix P3;
Step4.3, reversing the order of P3 to obtain P4, and reconstructing a new matrix P5 = [P3, P4] from P3 and P4;
Step4.4, applying an inverse Fourier transform to P5 and taking the real part to obtain the all-pole filter parameter matrix G of the vocal tract model;
Step4.5, taking the 42 points of G where the energy is most concentrated as the vocal tract filter parameter matrix G1, wherein the output is expressed as a two-dimensional array whose rows index the frames and whose columns are the vocal tract filter parameters of each frame.
2. The voice reconstruction method based on vocal tract filtering and glottal excitation according to claim 1, wherein in Step1 the start and end points of voice activity are marked by framing and windowing the audio signal, detecting the positions of speech and non-speech segments from the flatness of the short-time energy of the audio in the frequency domain, and distinguishing frames with and without speech; the extraction result represents the endpoint positions of a segment of voice as a two-dimensional array, from which the start and end points of the voice are extracted.
3. The voice reconstruction method based on vocal tract filtering and glottal excitation according to claim 1, wherein Step2 specifically takes as the feature the quotient of the signal's sampling frequency and the position of the maximum of the cepstrum within the pitch search range, and the extraction result represents the pitch frequency of a segment of sample audio as a one-dimensional array.
CN202111650490.6A 2021-12-29 2021-12-29 Voice reconstruction method based on vocal tract filtering and glottal excitation Active CN114974271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111650490.6A CN114974271B (en) 2021-12-29 2021-12-29 Voice reconstruction method based on vocal tract filtering and glottal excitation

Publications (2)

Publication Number Publication Date
CN114974271A CN114974271A (en) 2022-08-30
CN114974271B true CN114974271B (en) 2025-07-08

Family

ID=82975223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111650490.6A Active CN114974271B (en) 2021-12-29 2021-12-29 Voice reconstruction method based on vocal tract filtering and glottal excitation

Country Status (1)

Country Link
CN (1) CN114974271B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007015489A1 (en) * 2005-08-01 2007-02-08 Kyushu Institute Of Technology Voice search device and voice search method
CN108281150B (en) * 2018-01-29 2020-11-17 上海泰亿格康复医疗科技股份有限公司 Voice tone-changing voice-changing method based on differential glottal wave model
CN111147021B (en) * 2020-04-03 2020-07-24 恒玄科技(上海)股份有限公司 Voltage controlled oscillator

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346106A (en) * 2018-09-06 2019-02-15 河海大学 A Cepstral Domain Pitch Period Estimation Method Based on Subband Signal-to-Noise Ratio Weighting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A pitch detection algorithm based on linear prediction residual cepstrum; Qian Bo; Li Yanping; Tang Zhenmin; Xu Limin; Computer Engineering and Applications; 2007-11-11 (No. 32); 210-213 *
A language identification method with pitch frequency normalization; Duan Yun, et al.; Microelectronics & Computer; 2023-05-31; Vol. 40 (No. 5); 20-28 *

Similar Documents

Publication Publication Date Title
US8719030B2 (en) System and method for speech synthesis
US9324330B2 (en) Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm
CN102664003B (en) Residual excitation signal synthesis and voice conversion method based on harmonic plus noise model (HNM)
US8280724B2 (en) Speech synthesis using complex spectral modeling
CN101589430A (en) Voice isolation device, voice synthesis device, and voice quality conversion device
CN108682432B (en) Voice emotion recognition device
CN113436606A (en) Original sound speech translation method
Ai et al. Knowledge-and-data-driven amplitude spectrum prediction for hierarchical neural vocoders
US11183169B1 (en) Enhanced virtual singers generation by incorporating singing dynamics to personalized text-to-speech-to-singing
CN114974271B (en) Voice reconstruction method based on vocal tract filtering and glottal excitation
Chazan et al. Small footprint concatenative text-to-speech synthesis system using complex spectral envelope modeling.
JP4430174B2 (en) Voice conversion device and voice conversion method
TWI302296B (en)
Richards et al. Deriving articulatory representations from speech with various excitation modes
Prasad et al. Backend tools for speech synthesis in speech processing
Wang et al. Beijing opera synthesis based on straight algorithm and deep learning
Hasan et al. An approach to voice conversion using feature statistical mapping
Bonada et al. Spectral approach to the modeling of the singing voice
Jung et al. Pitch alteration technique in speech synthesis system
He et al. Collaborative AI Dysarthric Speech Recognition System with Data Augmentation using Generative Adversarial Neural Network
CN120564735B (en) Audio real-time conversion and analysis management system and method based on artificial intelligence
Lohiya et al. Survey on Computer Aided Language Learning using automatic accent assessment techniques
Bohm et al. Algorithm for formant tracking, modification and synthesis
Liu et al. Research on Voice Humming to Guzheng Audio Transcription and Pitch-Lowering Processing Based on Genetic Algorithm and YIN Algorithm
Luo et al. On End-to-End Chinese Speech Synthesis Based on World-Tacotron

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant