CN111108558A - Voice conversion method and device, computer equipment and computer readable storage medium - Google Patents
- Publication number
- CN111108558A CN111108558A CN201980003120.8A CN201980003120A CN111108558A CN 111108558 A CN111108558 A CN 111108558A CN 201980003120 A CN201980003120 A CN 201980003120A CN 111108558 A CN111108558 A CN 111108558A
- Authority
- CN
- China
- Prior art keywords
- target
- feature
- voice
- converted
- conversion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Abstract
Embodiments of the invention disclose a voice conversion method and apparatus, a computer device, and a computer-readable storage medium. The voice conversion method comprises the following steps: acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format; performing format conversion on the original conversion model to obtain a target conversion model in an offline format; performing feature extraction on the voice to be converted to obtain a feature to be converted; inputting the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model; and obtaining a target voice according to the target feature output by the target conversion model, wherein the speech content of the target voice is the same as that of the voice to be converted while the sound of the target voice differs from that of the voice to be converted. The voice conversion method can perform high-quality voice conversion in an offline state, runs fast, and can realize real-time voice conversion.
Description
Technical Field
The present invention relates to the field of audio processing technologies, and in particular to a voice conversion method and apparatus, a computer device, and a computer-readable storage medium.
Background
Voice conversion is a technology for converting a source voice into a target voice while keeping the semantic content unchanged: the source voice is uttered by a first speaker and the target voice by a second speaker, so voice conversion turns the source voice uttered by the first speaker into a target voice with the same meaning in the second speaker's voice.
With the rapid development of deep neural network technology, voice converted by deep-learning-based methods achieves high similarity, good voice quality, and good fluency. A deep-learning-based voice conversion method mainly comprises two steps: first, a conversion model is trained on a large amount of speech data; then the trained model is used for voice conversion. Because such models have high requirements on computing resources, while offline devices have few resources and low performance, resource exhaustion easily occurs when conversion is performed locally. Therefore, current deep-learning-based voice conversion can be realized only by relying on an online high-performance server and cannot be used in an offline state.
Disclosure of Invention
In view of the above, it is desirable to provide a voice conversion method, apparatus, computer device and storage medium capable of performing high-quality voice conversion in an offline state.
A method of speech conversion, the method comprising:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
carrying out format conversion on the original conversion model to obtain a target conversion model in an offline format;
extracting the characteristics of the voice to be converted to obtain the characteristics to be converted;
inputting the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model;
and obtaining target voice according to the target characteristics output by the target conversion model, wherein the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
An apparatus for speech conversion, the apparatus comprising:
the system comprises an acquisition module, a conversion module and a conversion module, wherein the acquisition module is used for acquiring the voice to be converted and an original conversion model, and the format of the original conversion model is an online format;
the format conversion module is used for carrying out format conversion on the original conversion model to obtain a target conversion model in an offline format;
the feature extraction module is used for extracting features of the voice to be converted to obtain features to be converted;
the feature conversion module is used for inputting the features to be converted into the target conversion model to obtain target features output by the target conversion model;
and the result module is used for obtaining target voice according to the target characteristics output by the target conversion model, the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from the voice to be converted.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
carrying out format conversion on the original conversion model to obtain a target conversion model in an offline format;
extracting the characteristics of the voice to be converted to obtain the characteristics to be converted;
inputting the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model;
and obtaining target voice according to the target characteristics output by the target conversion model, wherein the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
carrying out format conversion on the original conversion model to obtain a target conversion model in an offline format;
extracting the characteristics of the voice to be converted to obtain the characteristics to be converted;
inputting the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model;
and obtaining target voice according to the target characteristics output by the target conversion model, wherein the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
The embodiment of the invention has the following beneficial effects:
according to the voice conversion method, the voice conversion device, the computer equipment and the computer readable storage medium, the voice to be converted and the original conversion model are obtained, the original conversion model cannot work in an off-line state, so that the characteristics of the voice to be converted are extracted to obtain the characteristics to be converted, the format of the original conversion model is converted into an off-line format, then the target characteristics can be obtained according to the characteristics to be converted and the target conversion model of the off-line format, and then the target voice is obtained according to the target characteristics. The voice conversion method can not only perform high-quality voice conversion in an off-line state, but also has high running speed and can realize real-time voice conversion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Wherein:
FIG. 1 is a diagram of an exemplary implementation of a voice conversion method;
FIG. 2 is a flow diagram of a method of speech conversion in one embodiment;
FIG. 3 is a flow diagram of a method of speech conversion in one embodiment;
FIG. 4 is a diagram illustrating segmentation processing of speech to be converted in one embodiment;
FIG. 5 is a block diagram showing the structure of a speech conversion apparatus according to an embodiment;
FIG. 6 is a block diagram of a computer device in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a diagram of an exemplary speech conversion method application environment. As shown in fig. 1, the voice conversion method is applied to a voice conversion system. The voice conversion system comprises a terminal, wherein the terminal can be a desktop terminal or a mobile terminal, and the mobile terminal can be at least one of a mobile phone, a tablet computer, a notebook computer and the like. The terminal comprises a microphone, a conversion unit and a player, wherein the microphone is used for acquiring voice to be converted, the conversion unit is used for converting the voice to be converted into target voice with the same content as the voice to be converted but different sound, and the player is used for playing the target voice.
In one embodiment, as shown in FIG. 2, a method of speech conversion is provided. The method can be applied to the terminal, the server and other voice conversion devices. The present embodiment is exemplified as applied to a voice conversion apparatus. In an off-line state, after the voice conversion device obtains the voice to be converted, a target voice with the same content and different sound as the voice to be converted can be obtained by the voice conversion method. The voice conversion method specifically comprises the following steps:
step 202: and acquiring the voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format.
The voice to be converted refers to the source voice, uttered by the source speaker, that is to be converted into the target voice.
The online format refers to a storage format of a file that can be opened or can work normally only while connected to a network.
The original conversion model is a model that takes the feature to be converted of the voice to be converted as input and outputs the target feature of the target voice; it is used to obtain the target feature of the target voice from the feature to be converted of the voice to be converted while connected to a network.
Step 204: and carrying out format conversion on the original conversion model to obtain a target conversion model in an offline format.
The offline format refers to a storage format of a file which can still be opened or normally work in a state of being disconnected from a network.
The target conversion model is used for obtaining the target characteristics of the target voice according to the characteristics to be converted of the voice to be converted in the state of network disconnection.
Format conversion is carried out on the original conversion model to obtain a target conversion model in an offline format. Illustratively, the original conversion model is a model file trained with the TensorFlow framework (a machine learning library developed by Google, using the Python language); its storage format is the online format CheckPoint (abbreviated ckpt), and this can be converted into the offline format JetSoft shift new (abbreviated jsn) to obtain the target conversion model. A model in ckpt format records extra information, for example parameters and data used when training the original conversion model, which are not needed for voice conversion in an offline state. Redundant data can therefore be removed when converting the storage format to jsn, which simplifies and compresses the model file and increases the running speed in the offline state, thereby speeding up voice conversion and enabling real-time conversion.
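The patent does not disclose the internals of the jsn format, but the pruning step it describes can be sketched as follows. This is a hypothetical illustration, not the actual converter: the checkpoint is modeled as a dictionary of named arrays, and the training-only variable prefixes are assumed names.

```python
import pickle
import numpy as np

# Hypothetical sketch: an "online" checkpoint stores both inference weights
# and training-only state (optimizer slots, step counters). Converting to an
# "offline" format here simply keeps the weights and drops everything else.
TRAINING_ONLY_PREFIXES = ("optimizer/", "global_step", "moving_")  # assumed naming

def to_offline_format(checkpoint: dict) -> dict:
    """Strip training-only variables, keep float32 inference weights."""
    return {
        name: np.asarray(value, dtype=np.float32)
        for name, value in checkpoint.items()
        if not name.startswith(TRAINING_ONLY_PREFIXES)
    }

def save_offline(model: dict, path: str) -> None:
    """Serialize the pruned model to a compact offline file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
```

Dropping optimizer state roughly halves (or better) the file size for Adam-trained models, which is consistent with the speed and size benefits the description attributes to the jsn conversion.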
Step 206: and extracting the characteristics of the voice to be converted to obtain the characteristics to be converted.
And the feature to be converted is used for inputting a target conversion model to obtain a target feature corresponding to the voice to be converted.
Spectral features of the voice to be converted, for example its Mel frequency spectrum, are obtained from the voice to be converted; feature extraction is then performed on the voice to be converted, and the feature to be converted is determined from the extracted features.
Step 208: and inputting the features to be converted into the target conversion model to obtain the target features output by the target conversion model.
The target feature is used to obtain a target voice whose content is the same as that of the voice to be converted but whose sound is different.
In an offline state, with the target conversion model in a running state, the feature to be converted is input into the target conversion model, and the target conversion model directly outputs the target feature corresponding to the feature to be converted.
Step 210: and obtaining target voice according to the target characteristics output by the target conversion model, wherein the voice content of the target voice is the same as the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
The target voice refers to the voice, as uttered by the target speaker, whose speech content is the same as that of the voice to be converted and whose sound is different from that of the voice to be converted.
Features of the target voice such as the fundamental frequency, spectral envelope, and aperiodicity can be obtained from the target feature; the Mel frequency spectrum of the target voice is determined from them, and the target voice can then be obtained from this Mel frequency spectrum. Illustratively, the feature to be converted is binarized 130-dimensional serialized data, and the target feature obtained from the target conversion model is also 130-dimensional serialized data. The lf0, mgc, and bap feature data of the target voice are obtained through inverse normalization and then converted into the f0, sp, and ap features with SPTK; the Mel frequency spectrum of the target voice is determined from its f0, sp, and ap, and the target voice is obtained from it.
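The inverse-normalization and splitting step above can be sketched as follows. The exact layout of the 130 dimensions is an assumption consistent with the description (1 vuv + 3 lf0 + 123 mgc + 3 bap, statics plus two derivative orders); the real model's layout may differ.

```python
import numpy as np

# Hypothetical dimension layout of the model's 130-dim output frames,
# inferred from the description: vuv(1), lf0+derivatives(3),
# mgc+derivatives(123), bap+derivatives(3).
SLICES = {"vuv": slice(0, 1), "lf0": slice(1, 4),
          "mgc": slice(4, 127), "bap": slice(127, 130)}

def split_output(frames: np.ndarray, mean: np.ndarray, std: np.ndarray) -> dict:
    """De-normalize model output frames and split them into feature streams."""
    frames = frames * std + mean          # inverse of (x - mean) / std
    return {name: frames[:, s] for name, s in SLICES.items()}
```

In a full pipeline the static (non-derivative) columns of lf0, mgc, and bap would then be handed to SPTK to recover f0, sp, and ap before waveform synthesis.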
According to the voice conversion method, the voice to be converted and the original conversion model are acquired. Since the original conversion model cannot work in an offline state, feature extraction is performed on the voice to be converted to obtain the feature to be converted, and the format of the original conversion model is converted into an offline format; the target feature can then be obtained from the feature to be converted and the offline-format target conversion model, and the target voice obtained from the target feature. The voice conversion method not only performs high-quality voice conversion in an offline state but also runs fast, enabling real-time voice conversion.
In one embodiment, the step 206 performs feature extraction on the speech to be converted to obtain features to be converted, including: performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise fundamental frequency and spectrum envelope; and obtaining the feature to be converted according to the periodic feature and the non-periodic feature.
When a person speaks, several sound sources in the vocal tract generate acoustic energy. Aperiodic sources include aspiration, frication, and plosion produced at the lips, teeth, throat, and elsewhere in the vocal tract, while the periodic source is produced by vocal-cord vibration at the glottis. The voice to be converted therefore contains periodic and aperiodic components, and its corresponding spectral features include periodic features and aperiodic features. In the present embodiment, the Mel frequency spectrum of the voice to be converted is used as the spectral feature.
The fundamental frequency (f0) is the lowest-frequency sine wave among the group of sine waves that make up the original signal; the others are overtones. The spectral envelope (sp) is the smooth curve connecting the amplitude peaks at different frequencies. The aperiodic sequence (ap) refers to the aperiodic signal parameters of the speech.
The periodic characteristics refer to fundamental frequency and spectrum envelope in the Mel frequency spectrum of the voice to be converted.
The non-periodic feature refers to a non-periodic sequence in the Mel frequency spectrum of the voice to be converted.
The feature data serving as input to the target conversion model can be obtained by processing the periodic feature and the aperiodic feature; this feature data is the feature to be converted. Illustratively, a group of feature data is obtained from the periodic and aperiodic features, and this data is then calculated on and format-converted to obtain the feature to be converted.
In one embodiment, obtaining the feature to be converted according to the periodic feature and the aperiodic feature includes: obtaining a target dimension characteristic according to the periodic characteristic and the aperiodic characteristic, wherein the dimension of the target dimension characteristic is higher than the sum of the dimensions of the periodic characteristic and the aperiodic characteristic; and carrying out format conversion on the target dimension characteristics to obtain the characteristics to be converted.
The target dimension feature is a feature, obtained from the periodic feature and the aperiodic feature, whose dimensionality is higher than the sum of the dimensionalities of the periodic and aperiodic features. Mapping the low-dimensional periodic and aperiodic features to a high-dimensional target dimension feature can improve the quality of the synthesized voice.
Illustratively, the periodic features f0 and sp and the aperiodic feature ap are obtained from the Mel frequency spectrum of the voice to be converted. The three features are processed with the Speech Signal Processing Toolkit (SPTK) to obtain 1-dimensional lf0 (the logarithm of f0), 41-dimensional mgc, and 1-dimensional band aperiodicity (bap). 1-dimensional voiced/unvoiced (vuv) data is computed from lf0, and the first and second derivatives of lf0, mgc, and bap are taken, yielding additional data of 1 × 2, 41 × 2, and 1 × 2 dimensions respectively. Finally, vuv, lf0 and its derivatives, mgc and its derivatives, and bap and its derivatives are normalized to obtain 130-dimensional serialized data, which is used as the target dimension feature.
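The assembly of the 130-dimensional frame described above can be sketched as follows. The simple finite-difference derivative is a stand-in assumption; production systems typically use regression-window deltas, and the normalization step is omitted for brevity.

```python
import numpy as np

def deltas(x: np.ndarray) -> np.ndarray:
    """First-order finite difference along time (a simple stand-in for the
    delta regression windows typically used in speech synthesis)."""
    return np.diff(x, axis=0, prepend=x[:1])

def build_target_dim_feature(lf0, mgc, bap, vuv) -> np.ndarray:
    """Concatenate static features with their first and second derivatives,
    per frame, as the description outlines:
    1 (vuv) + 3*1 (lf0) + 3*41 (mgc) + 3*1 (bap) = 130 dimensions."""
    parts = [vuv]
    for feat in (lf0, mgc, bap):
        parts += [feat, deltas(feat), deltas(deltas(feat))]
    return np.concatenate(parts, axis=1)
```

The dimension arithmetic (1 + 3 + 123 + 3 = 130) matches the "130-dimensional serialized data" in the description.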
Format conversion is performed on the target dimension feature so that it meets the input format requirement of the target conversion model; the feature data obtained through the format conversion is the feature to be converted. Illustratively, when the target conversion model requires binary input data, the target dimension feature is converted to binary, and the resulting binary data is the feature to be converted.
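The binary conversion can be sketched as below. The little-endian float32 layout is an assumption; the actual byte order and precision the target conversion model expects are not specified in the description.

```python
import numpy as np

# Hypothetical sketch: serialize the 130-dimensional frames as little-endian
# float32 binary, a common input convention for C++ inference runtimes.
def to_binary(frames: np.ndarray) -> bytes:
    return np.ascontiguousarray(frames, dtype="<f4").tobytes()

def from_binary(blob: bytes, dim: int = 130) -> np.ndarray:
    """Inverse of to_binary: recover (n_frames, dim) float32 frames."""
    return np.frombuffer(blob, dtype="<f4").reshape(-1, dim)
```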
In one embodiment, the target conversion model runs on the CUDA-Enabled Recurrent Neural Network Toolkit (CURRENNT) framework.
CURRENNT is an open-source parallel implementation of deep recurrent neural networks (RNNs) that supports graphics processing units (GPUs) through NVIDIA CUDA (Compute Unified Device Architecture). CURRENNT supports unidirectional and bidirectional RNNs with Long Short-Term Memory (LSTM) cells, which overcome the vanishing-gradient problem.
The target conversion model is loaded into CURRENNT, where it is in a running state; the feature to be converted is passed into the same CURRENNT instance and input into the target conversion model, and the target conversion model outputs the target feature corresponding to the feature to be converted.
As shown in fig. 3, in one embodiment, the method further comprises:
step 306: and carrying out segmentation processing on the voice to be converted to obtain a plurality of segmented voices.
Because offline devices have limited computing resources, directly converting a voice to be converted that is long in duration is slow and cannot achieve real-time conversion. The voice to be converted is therefore segmented into several segmented voices; since each segment is short, it can be converted quickly, which greatly improves the running speed. Illustratively, when the duration of the voice to be converted exceeds a preset duration, the voice to be converted is segmented according to preset conditions. As shown in fig. 4, the voice to be converted 41 is divided evenly by duration into 3 segments, giving 3 segmented voices 42.
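A minimal sketch of this segmentation, with the overlapping portions described later in the embodiment. The segment count and overlap length are hypothetical parameter values, not values disclosed by the patent.

```python
import numpy as np

def segment(signal: np.ndarray, n_segments: int = 3, overlap: int = 160):
    """Split a 1-D signal into n_segments roughly equal chunks, with
    `overlap` samples shared between temporally adjacent chunks so that
    later feature extraction loses nothing at the boundaries."""
    length = len(signal)
    step = length // n_segments
    segments = []
    for i in range(n_segments):
        start = max(0, i * step - overlap)
        end = length if i == n_segments - 1 else (i + 1) * step + overlap
        segments.append(signal[start:end])
    return segments
```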
Step 308: and extracting the characteristics of the segmented voices to obtain a plurality of segmented characteristics.
The segmented characteristics refer to characteristics to be converted corresponding to each segmented voice.
And respectively extracting the characteristics of each segmented voice, and obtaining the characteristic to be converted corresponding to each segmented voice according to the extracted characteristics, namely obtaining the segmented characteristics of each segmented voice.
Step 310: and inputting each segmented feature into the target conversion model in parallel to obtain a target segmented feature corresponding to each segmented feature.
The target segmentation feature refers to a target feature corresponding to each segmentation feature.
After the multiple segment features are obtained, multiple cores of a central processing unit (CPU) are invoked to convert the segment features simultaneously: multiple processes are started, and each process independently inputs its segment feature into the target conversion model to obtain the corresponding target segment feature. Inputting the segment features into the target conversion model in parallel is much faster than converting them one after another, thus realizing real-time voice conversion.
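The parallel dispatch can be sketched as follows. A real offline engine would use a process pool (one worker per CPU core, as the description says) for CPU-bound inference; a thread pool is used here only to keep the example self-contained, and `convert_one` is a hypothetical stand-in for running the target conversion model on one segment feature.

```python
from concurrent.futures import ThreadPoolExecutor

def convert_one(segment_feature):
    """Placeholder for target-model inference on a single segment feature."""
    return [x * 2 for x in segment_feature]

def convert_parallel(segment_features):
    """Convert all segment features concurrently. map() preserves the input
    order, so the target segment features stay time-aligned for merging."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(convert_one, segment_features))
```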
Step 312: and obtaining the target voice according to the target segmentation characteristics corresponding to each segmentation characteristic.
The target segment features corresponding to the segment features can be combined into a target feature, and the target voice obtained from the target feature; alternatively, a target segmented voice can be obtained from each target segment feature, and the segmented voices synthesized into the target voice. Illustratively, the voice to be converted is segmented into 5 segmented voices; 5 corresponding segment features are obtained from them, input into the target conversion model to obtain 5 corresponding target segment features, from which 5 corresponding target segmented voices are obtained; the 5 target segmented voices are then synthesized into the target voice.
In one embodiment, any two target segmented features adjacent in time in the plurality of target segmented features include an overlap feature, and the step 312 of obtaining the target speech according to the target segmented feature corresponding to each segmented feature includes: and obtaining the target voice according to the target segmentation feature corresponding to each segmentation feature and the overlapping feature of any two target segmentation features adjacent in time in the plurality of target segmentation features.
As shown in fig. 4, in order to prevent errors or loss of features during subsequent feature extraction caused by the segmentation, any two temporally adjacent segmented voices 42 among the plurality of segmented voices 42 may include an overlapping portion 421.
The overlapping feature refers to the target feature obtained by converting the overlapping portion 421 shared by any two temporally adjacent segmented voices 42 among the plurality of segmented voices 42.
The target segment features corresponding to the segment features are merged to obtain a merged feature; the merged feature is adjusted according to the overlapping feature of any two temporally adjacent target segment features to obtain the target feature, and the target voice is obtained from the target feature. Illustratively, the voice to be converted is segmented into 2 segmented voices and converted into 2 target segment features, where target segment feature I is (A + C_A), target segment feature II is (C_B + B), and the overlapping feature of target segment features I and II is C. In obtaining the target feature, the front 1/2 of the overlapping feature C in target segment feature I, namely C_A-front, is kept, and the back 1/2 of the overlapping feature C in target segment feature II, namely C_B-back, is kept, so the target feature is (A + C_A-front + C_B-back + B); the target voice is obtained from this target feature.
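The keep-half merging strategy for two adjacent segments can be sketched as below, treating the target segment features as frame lists. The list representation is an assumption for illustration.

```python
def merge_keep_half(seg_a, seg_b, overlap):
    """Merge two temporally adjacent target segment features that share
    `overlap` trailing/leading frames: keep the front half of the overlap
    from the first segment and the back half from the second
    (A + C_A_front + C_B_back + B)."""
    half = overlap // 2
    # seg_a ends with C_A (overlap frames); seg_b starts with C_B.
    return seg_a[:len(seg_a) - overlap + half] + seg_b[half:]
```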
In one embodiment, obtaining the target speech according to a target segmentation feature corresponding to each of the segmentation features and an overlapping feature of any two temporally adjacent target segmentation features in the plurality of target segmentation features includes: acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are weights corresponding to overlapping features in any two target segmented features adjacent in time; and obtaining the target voice according to a target segmentation feature corresponding to each segmentation feature, overlapping features of any two target segmentation features adjacent in time in the plurality of target segmentation features, and the feature weight set.
The feature weight set is used for determining the weight of the overlapping features of any two target segment features adjacent in time in the two target segment features respectively.
Illustratively, the speech to be converted is segmented into 2 segmented voices, and 2 target segmentation features are obtained through conversion, where target segmentation feature I is (A + C_A), target segmentation feature II is (C_B + B), and the overlapping feature of target segmentation feature I and target segmentation feature II is C. The first feature weight in the feature weight set, m, determines the weight of the overlapping feature C in target segmentation feature I, and the second feature weight, n, determines the weight of the overlapping feature C in target segmentation feature II, so the target feature of the speech to be converted is (A + m × C_A + n × C_B + B); the target voice is obtained according to the target feature.
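The weighted variant blends the two versions of the overlap instead of splitting it. A minimal sketch, with illustrative values for m and n (a typical choice would have m + n = 1, which the patent does not mandate):

```python
import numpy as np

def merge_weighted(seg1, seg2, overlap, m, n):
    """Blend the shared overlap region of two adjacent target segment
    features with weights m and n: target = A + (m*C_A + n*C_B) + B."""
    a = seg1[:-overlap]                           # A
    c = m * seg1[-overlap:] + n * seg2[:overlap]  # weighted overlap m*C_A + n*C_B
    b = seg2[overlap:]                            # B
    return np.concatenate([a, c, b])

seg1 = np.array([1.0, 2.0, 3.0, 4.0])  # (A + C_A)
seg2 = np.array([3.0, 5.0, 5.0, 6.0])  # (C_B + B)
out = merge_weighted(seg1, seg2, overlap=2, m=0.5, n=0.5)
# equal weights average the overlap: [1.0, 2.0, 3.0, 4.5, 5.0, 6.0]
```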
As shown in fig. 5, in one embodiment, there is provided a voice conversion apparatus including:
an obtaining module 502, configured to obtain a voice to be converted and an original conversion model, where a format of the original conversion model is an online format;
a format conversion module 504, configured to perform format conversion on the original conversion model to obtain a target conversion model in an offline format;
a feature extraction module 506, configured to perform feature extraction on the speech to be converted to obtain a feature to be converted;
a feature conversion module 508, configured to input the feature to be converted into the target conversion model, so as to obtain a target feature output by the target conversion model;
a result module 510, configured to obtain a target voice according to the target feature output by the target conversion model, where the voice content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
According to the voice conversion device, the voice to be converted and the original conversion model are obtained. Because the original conversion model cannot work in an offline state, features are extracted from the voice to be converted to obtain the feature to be converted; after the format of the original conversion model is converted into the offline format, the target feature can be obtained according to the feature to be converted and the offline-format target conversion model, and the target voice is then obtained according to the target feature. The device can thus not only perform high-quality voice conversion in an offline state, but also run fast enough to realize real-time voice conversion.
In an embodiment, the feature extraction module 506 is configured to perform periodic feature extraction and aperiodic feature extraction on the speech to be converted, so as to obtain periodic features and aperiodic features corresponding to the speech to be converted, where the periodic features include a fundamental frequency and a spectral envelope; and obtaining the feature to be converted according to the periodic feature and the non-periodic feature.
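The periodic/aperiodic split can be illustrated with a deliberately simple sketch. This is a toy analyzer, not the patent's method; a production system would use a vocoder-style analyzer such as WORLD, and the frame length, hop, and autocorrelation-based F0 estimate here are illustrative choices:

```python
import numpy as np

def extract_features(signal, sr, frame_len=512, hop=256):
    """Toy per-frame feature extraction: fundamental frequency and a
    magnitude spectrum standing in for the spectral envelope (periodic
    features), plus a crude non-periodicity score (aperiodic feature)."""
    f0s, envelopes, aperiodicities = [], [], []
    lag_min, lag_max = sr // 500, sr // 50  # search F0 between 50 and 500 Hz
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Autocorrelation; the strongest lag in range gives the pitch period.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        f0s.append(sr / lag)
        envelopes.append(np.abs(np.fft.rfft(frame)))
        # 1 - normalized autocorrelation at the pitch lag: 0 = fully periodic.
        aperiodicities.append(1.0 - ac[lag] / (ac[0] + 1e-12))
    return np.array(f0s), np.array(envelopes), np.array(aperiodicities)

# A 100 Hz sine at 16 kHz: the F0 estimate should land near 100 Hz.
sr = 16000
t = np.arange(sr) / sr
f0, env, ap = extract_features(np.sin(2 * np.pi * 100.0 * t), sr)
```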
In an embodiment, the feature extraction module 506 is specifically configured to obtain a target dimension feature according to the periodic feature and the aperiodic feature, where a dimension of the target dimension feature is higher than a sum of dimensions of the periodic feature and the aperiodic feature; and carrying out format conversion on the target dimension characteristics to obtain the characteristics to be converted.
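One hypothetical way to obtain a feature whose dimension exceeds the sum of the periodic and aperiodic dimensions, as described above, is to concatenate the per-frame features and then stack neighboring frames as context. The patent does not specify this construction; it is a sketch under that assumption:

```python
import numpy as np

def to_target_dimension(f0, envelope, aperiodicity, context=1):
    """Concatenate per-frame features, then stack +/-`context` neighbor
    frames (edge-padded) so each frame's dimensionality grows from d to
    (2*context + 1) * d, exceeding the plain sum of the input dimensions."""
    base = np.column_stack([f0, envelope, aperiodicity])  # (T, 1 + D_env + D_ap)
    T = base.shape[0]
    padded = np.pad(base, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

f0 = np.ones(5)          # 5 frames, 1-dim fundamental frequency
env = np.zeros((5, 3))   # 3-dim spectral envelope (toy size)
ap = np.zeros((5, 2))    # 2-dim aperiodicity (toy size)
feat = to_target_dimension(f0, env, ap, context=1)
# 1 + 3 + 2 = 6 input dims per frame -> 18 output dims per frame
```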
In one embodiment, the target conversion model operates based on a computer unified device architecture recurrent neural network toolkit framework.
In an embodiment, the feature extraction module 506 is configured to perform segmentation processing on the speech to be converted to obtain a plurality of segmented voices, and perform feature extraction on the plurality of segmented voices to obtain a plurality of segmented features; the feature conversion module 508 is configured to input each of the segmented features into the target conversion model in parallel, so as to obtain a target segmented feature corresponding to each of the segmented features; the result module 510 is configured to obtain a target speech according to a target segmentation feature corresponding to each of the segmentation features.
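The parallel conversion of segment features can be sketched as follows. `convert_segment` is a hypothetical stand-in for the target conversion model (the real module would run the offline-format neural network); only the parallel dispatch and order-preserving collection are the point here:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def convert_segment(seg_feature):
    """Hypothetical stand-in for the target conversion model."""
    return seg_feature * 2.0

def convert_in_parallel(segment_features):
    """Feed each segment feature to the model in parallel; map() returns
    the target segment features in their original temporal order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(convert_segment, segment_features))

segments = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
targets = convert_in_parallel(segments)
```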
In one embodiment, any two target segment features adjacent in time in the plurality of target segment features include an overlap feature, and the result module 510 is configured to obtain the target speech according to a target segment feature corresponding to each of the segment features and the overlap feature of any two target segment features adjacent in time in the plurality of target segment features.
In one embodiment, the result module 510 is configured to obtain a feature weight set, where the feature weight set includes a first feature weight and a second feature weight, where the first feature weight and the second feature weight are weights corresponding to overlapping features of any two temporally adjacent target segment features; and obtaining the target voice according to a target segmentation feature corresponding to each segmentation feature, overlapping features of any two target segmentation features adjacent in time in the plurality of target segmentation features, and the feature weight set.
FIG. 6 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may be a terminal, a server, or a voice conversion device. As shown in fig. 6, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the voice conversion method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the voice conversion method. Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is proposed, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
carrying out format conversion on the original conversion model to obtain a target conversion model in an offline format;
extracting the characteristics of the voice to be converted to obtain the characteristics to be converted;
inputting the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model;
and obtaining target voice according to the target characteristics output by the target conversion model, wherein the voice content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
According to the computer device, the voice to be converted and the original conversion model are obtained. Because the original conversion model cannot work in an offline state, features are extracted from the voice to be converted to obtain the feature to be converted; after the format of the original conversion model is converted into the offline format, the target feature can be obtained according to the feature to be converted and the offline-format target conversion model, and the target voice is then obtained according to the target feature. The device can thus not only perform high-quality voice conversion in an offline state, but also run fast enough to realize real-time voice conversion.
In an embodiment, the extracting the feature of the speech to be converted to obtain the feature to be converted includes: performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise fundamental frequency and spectrum envelope; and obtaining the feature to be converted according to the periodic feature and the non-periodic feature.
In one embodiment, the obtaining the feature to be converted according to the periodic feature and the aperiodic feature includes: obtaining a target dimension characteristic according to the periodic characteristic and the aperiodic characteristic, wherein the dimension of the target dimension characteristic is higher than the sum of the dimensions of the periodic characteristic and the aperiodic characteristic; and carrying out format conversion on the target dimension characteristics to obtain the characteristics to be converted.
In one embodiment, the target conversion model operates based on a computer unified device architecture recurrent neural network toolkit framework.
In an embodiment, the extracting the feature of the speech to be converted to obtain the feature to be converted includes: carrying out segmentation processing on the voice to be converted to obtain a plurality of segmented voices; extracting the features of the segmented voices to obtain a plurality of segmented features; the inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model includes: inputting each segmented feature into the target conversion model in parallel to obtain a target segmented feature corresponding to each segmented feature; the obtaining of the target voice according to the target feature output by the target conversion model comprises: and obtaining the target voice according to the target segmentation characteristics corresponding to each segmentation characteristic.
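The segmentation step can be sketched as follows. This is a minimal illustration with hypothetical segment-length and overlap parameters; it assumes the speech length aligns with the hop (a real implementation would pad or shorten the final segment):

```python
import numpy as np

def segment_with_overlap(speech, seg_len, overlap):
    """Split the speech to be converted into segments of seg_len samples,
    each sharing `overlap` samples with its temporal neighbor."""
    hop = seg_len - overlap
    return [speech[start:start + seg_len]
            for start in range(0, len(speech) - overlap, hop)]

speech = np.arange(10.0)
segs = segment_with_overlap(speech, seg_len=4, overlap=2)
# 4 segments; each adjacent pair shares 2 samples.
```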
In one embodiment, any two temporally adjacent target segmented features of the plurality of target segmented features comprise overlapping features; the obtaining of the target voice according to the target segmentation feature corresponding to each segmentation feature includes: and obtaining the target voice according to the target segmentation feature corresponding to each segmentation feature and the overlapping feature of any two target segmentation features adjacent in time in the plurality of target segmentation features.
In one embodiment, the obtaining the target speech according to a target segmented feature corresponding to each segmented feature and an overlapping feature of any two temporally adjacent target segmented features in the plurality of target segmented features includes: acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are weights corresponding to overlapping features in any two target segmented features adjacent in time; and obtaining the target voice according to a target segmentation feature corresponding to each segmentation feature, overlapping features of any two target segmentation features adjacent in time in the plurality of target segmentation features, and the feature weight set.
In one embodiment, a computer-readable storage medium is proposed, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of:
acquiring a voice to be converted and an original conversion model, wherein the format of the original conversion model is an online format;
carrying out format conversion on the original conversion model to obtain a target conversion model in an offline format;
extracting the characteristics of the voice to be converted to obtain the characteristics to be converted;
inputting the feature to be converted into the target conversion model to obtain a target feature output by the target conversion model;
and obtaining target voice according to the target characteristics output by the target conversion model, wherein the voice content of the target voice is the same as that of the voice to be converted, and the sound of the target voice is different from that of the voice to be converted.
According to the computer-readable storage medium, the voice to be converted and the original conversion model are obtained. Because the original conversion model cannot work in an offline state, features are extracted from the voice to be converted to obtain the feature to be converted; after the format of the original conversion model is converted into the offline format, the target feature is obtained according to the feature to be converted and the offline-format target conversion model, and the target voice is then obtained according to the target feature. The method can thus not only perform high-quality voice conversion in an offline state, but also run fast enough to realize real-time voice conversion.
In an embodiment, the extracting the feature of the speech to be converted to obtain the feature to be converted includes: performing periodic feature extraction and non-periodic feature extraction on the voice to be converted to obtain periodic features and non-periodic features corresponding to the voice to be converted, wherein the periodic features comprise fundamental frequency and spectrum envelope; and obtaining the feature to be converted according to the periodic feature and the non-periodic feature.
In one embodiment, the obtaining the feature to be converted according to the periodic feature and the aperiodic feature includes: obtaining a target dimension characteristic according to the periodic characteristic and the aperiodic characteristic, wherein the dimension of the target dimension characteristic is higher than the sum of the dimensions of the periodic characteristic and the aperiodic characteristic; and carrying out format conversion on the target dimension characteristics to obtain the characteristics to be converted.
In one embodiment, the target conversion model operates based on a computer unified device architecture recurrent neural network toolkit framework.
In an embodiment, the extracting the feature of the speech to be converted to obtain the feature to be converted includes: carrying out segmentation processing on the voice to be converted to obtain a plurality of segmented voices; extracting the features of the segmented voices to obtain a plurality of segmented features; the inputting the feature to be converted into the target conversion model to obtain the target feature output by the target conversion model includes: inputting each segmented feature into the target conversion model in parallel to obtain a target segmented feature corresponding to each segmented feature; the obtaining of the target voice according to the target feature output by the target conversion model comprises: and obtaining the target voice according to the target segmentation characteristics corresponding to each segmentation characteristic.
In one embodiment, any two temporally adjacent target segmented features of the plurality of target segmented features comprise overlapping features; the obtaining of the target voice according to the target segmentation feature corresponding to each segmentation feature includes: and obtaining the target voice according to the target segmentation feature corresponding to each segmentation feature and the overlapping feature of any two target segmentation features adjacent in time in the plurality of target segmentation features.
In one embodiment, the obtaining the target speech according to a target segmented feature corresponding to each segmented feature and an overlapping feature of any two temporally adjacent target segmented features in the plurality of target segmented features includes: acquiring a feature weight set, wherein the feature weight set comprises a first feature weight and a second feature weight, and the first feature weight and the second feature weight are weights corresponding to overlapping features in any two target segmented features adjacent in time; and obtaining the target voice according to a target segmentation feature corresponding to each segmentation feature, overlapping features of any two target segmentation features adjacent in time in the plurality of target segmentation features, and the feature weight set.
It should be noted that the above-mentioned voice conversion method, voice conversion apparatus, computer device, and computer-readable storage medium belong to the same general inventive concept, and the contents of their respective embodiments are mutually applicable.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above-mentioned embodiments express only several implementations of the present application; their description is specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2019/126865 WO2021120145A1 (en) | 2019-12-20 | 2019-12-20 | Voice conversion method and apparatus, computer device and computer-readable storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111108558A true CN111108558A (en) | 2020-05-05 |
| CN111108558B CN111108558B (en) | 2023-08-04 |
Family
ID=70427470
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201980003120.8A Active CN111108558B (en) | 2019-12-20 | 2019-12-20 | Speech conversion method, device, computer equipment, and computer-readable storage medium |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN111108558B (en) |
| WO (1) | WO2021120145A1 (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103430234A (en) * | 2011-03-17 | 2013-12-04 | 国际商业机器公司 | Voice transformation with encoded information |
| US20160005403A1 (en) * | 2014-07-03 | 2016-01-07 | Google Inc. | Methods and Systems for Voice Conversion |
| CN107545903A (en) * | 2017-07-19 | 2018-01-05 | 南京邮电大学 | A kind of phonetics transfer method based on deep learning |
| CN107610717A (en) * | 2016-07-11 | 2018-01-19 | 香港中文大学 | Many-to-One Speech Conversion Method Based on Speech Posterior Probability |
| CN107785030A (en) * | 2017-10-18 | 2018-03-09 | 杭州电子科技大学 | A kind of phonetics transfer method |
| US20180336880A1 (en) * | 2017-05-19 | 2018-11-22 | Baidu Usa Llc | Systems and methods for multi-speaker neural text-to-speech |
| CN110097890A (en) * | 2019-04-16 | 2019-08-06 | 北京搜狗科技发展有限公司 | A kind of method of speech processing, device and the device for speech processes |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR100484666B1 (en) * | 2002-12-31 | 2005-04-22 | (주) 코아보이스 | Voice Color Converter using Transforming Vocal Tract Characteristic and Method |
| CN1534595A (en) * | 2003-03-28 | 2004-10-06 | 中颖电子(上海)有限公司 | Speech sound change over synthesis device and its method |
| CN1645363A (en) * | 2005-01-04 | 2005-07-27 | 华南理工大学 | Portable realtime dialect inter-translationing device and method thereof |
| JP4241736B2 (en) * | 2006-01-19 | 2009-03-18 | 株式会社東芝 | Speech processing apparatus and method |
| CN105023570B (en) * | 2014-04-30 | 2018-11-27 | 科大讯飞股份有限公司 | A kind of method and system for realizing sound conversion |
| US9922138B2 (en) * | 2015-05-27 | 2018-03-20 | Google Llc | Dynamically updatable offline grammar model for resource-constrained offline device |
| CN107767879A (en) * | 2017-10-25 | 2018-03-06 | 北京奇虎科技有限公司 | Audio conversion method and device based on tone color |
| CN109637551A (en) * | 2018-12-26 | 2019-04-16 | 出门问问信息科技有限公司 | Phonetics transfer method, device, equipment and storage medium |
-
2019
- 2019-12-20 CN CN201980003120.8A patent/CN111108558B/en active Active
- 2019-12-20 WO PCT/CN2019/126865 patent/WO2021120145A1/en not_active Ceased
Non-Patent Citations (1)
| Title |
|---|
| Ying Yaopeng et al.: "Design and Development of a Cross-Software Text-to-Speech APP" * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111108558B (en) | 2023-08-04 |
| WO2021120145A1 (en) | 2021-06-24 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| TR01 | Transfer of patent right |
Effective date of registration: 20231211 Address after: Room 601, 6th Floor, Building 13, No. 3 Jinghai Fifth Road, Beijing Economic and Technological Development Zone (Tongzhou), Tongzhou District, Beijing, 100176 Patentee after: Beijing Youbixuan Intelligent Robot Co.,Ltd. Address before: 518000 16th and 22nd Floors, C1 Building, Nanshan Zhiyuan, 1001 Xueyuan Avenue, Nanshan District, Shenzhen City, Guangdong Province Patentee before: Shenzhen UBTECH Technology Co.,Ltd. |