
CN114203154A - Training method and device of voice style migration model and voice style migration method and device - Google Patents


Info

Publication number
CN114203154A
CN114203154A (application CN202111500435.9A; granted as CN114203154B)
Authority
CN
China
Prior art keywords
feature vector
content
speaker
loss
style migration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111500435.9A
Other languages
Chinese (zh)
Other versions
CN114203154B (en)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111500435.9A
Publication of CN114203154A
Application granted
Publication of CN114203154B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The disclosure provides a training method and apparatus for a speech style transfer model, and a speech style transfer method and apparatus, relating to the technical field of artificial intelligence and in particular to speech synthesis. The training method comprises: inputting the spectral features of a sample audio into the content encoding network and the speaker encoding network of the speech style transfer model respectively, to obtain a content feature vector and a speaker feature vector of the sample audio, and extracting the pitch frequency of the sample audio; determining the mutual information among the content feature vector, the speaker feature vector, and the pitch frequency; inputting the content feature vector, the speaker feature vector, and the pitch frequency into the spectrogram decoding network of the speech style transfer model to obtain predicted spectral features; and updating the parameters of the speech style transfer model according to the predicted spectral features and the mutual information. The method improves the effect of speech style transfer.

Description

Training of a Speech Style Transfer Model, and Speech Style Transfer Method and Apparatus

Technical Field

The present disclosure relates to speech synthesis within the technical field of artificial intelligence, and in particular to a training method for a speech style transfer model, a speech style transfer method, and corresponding apparatuses.

Background

Speech style transfer is commonly applied in voice-changing systems, voice chat, online games, and similar scenarios: audio in one voice style is output in another voice style, for example to hide a user's identity or for entertainment.

Speech style transfer is typically implemented with a speech style transfer model. The model extracts the content features of a source audio and the speaker features of a target audio, and outputs spectral features whose content comes from the source audio but whose voice style is that of the target audio's speaker. A vocoder then converts these spectral features into the output audio.

When training a speech style transfer model, the content features and speaker features of a sample audio are extracted, predicted spectral features of the sample audio are output, and the model parameters are updated using the difference between the predicted spectral features and the true spectral features of the sample audio. Because dependencies can exist between the content features and the speaker features of the sample audio, the style transfer effect at inference time is often poor.

Summary of the Invention

The present disclosure provides a training method for a speech style transfer model, a speech style transfer method, and apparatuses that improve the effect of speech style transfer.

According to one aspect of the present disclosure, a training method for a speech style transfer model is provided, comprising:

inputting the spectral features of a sample audio into the content encoding network and the speaker encoding network of the speech style transfer model respectively, to obtain a content feature vector and a speaker feature vector of the sample audio, and extracting the pitch frequency (fundamental frequency) of the sample audio;

determining the mutual information among the content feature vector, the speaker feature vector, and the pitch frequency;

inputting the content feature vector, the speaker feature vector, and the pitch frequency into the spectrogram decoding network of the speech style transfer model, to obtain predicted spectral features; and

updating the parameters of the speech style transfer model according to the predicted spectral features and the mutual information.

According to another aspect of the present disclosure, a speech style transfer method is provided, comprising:

obtaining first spectral features of a source audio and second spectral features of a target audio;

inputting the first spectral features into the content encoding network of a speech style transfer model to obtain the content feature vector of the source audio, and extracting the pitch frequency of the source audio;

inputting the second spectral features into the speaker encoding network of the speech style transfer model to obtain the speaker feature vector of the target audio; and

inputting the content feature vector of the source audio, the speaker feature vector of the target audio, and the pitch frequency of the source audio into the spectrogram decoding network of the speech style transfer model to obtain the spectral features to be output, and generating the output audio from those spectral features;

wherein the speech style transfer model is trained according to the foregoing method.

According to another aspect of the present disclosure, a training apparatus for a speech style transfer model is provided, comprising:

a feature extraction module, configured to input the spectral features of a sample audio into the content encoding network and the speaker encoding network of the speech style transfer model respectively, obtain the content feature vector and the speaker feature vector of the sample audio, and extract the pitch frequency of the sample audio;

a mutual information calculation module, configured to determine the mutual information among the content feature vector, the speaker feature vector, and the pitch frequency;

a spectrogram decoding module, configured to input the content feature vector, the speaker feature vector, and the pitch frequency into the spectrogram decoding network of the speech style transfer model to obtain predicted spectral features; and

an update module, configured to update the parameters of the speech style transfer model according to the predicted spectral features and the mutual information.

According to another aspect of the present disclosure, a speech style transfer apparatus is provided, comprising:

an acquisition module, configured to obtain first spectral features of a source audio and second spectral features of a target audio;

a content encoding module, configured to input the first spectral features into the content encoding network of a speech style transfer model, obtain the content feature vector of the source audio, and extract the pitch frequency of the source audio;

a speaker encoding module, configured to input the second spectral features into the speaker encoding network of the speech style transfer model to obtain the speaker feature vector of the target audio; and

a spectrogram decoding module, configured to input the content feature vector of the source audio, the speaker feature vector of the target audio, and the pitch frequency of the source audio into the spectrogram decoding network of the speech style transfer model to obtain the spectral features to be output, and generate the output audio from those spectral features;

wherein the speech style transfer model is trained according to the foregoing method.

According to yet another aspect of the present disclosure, an electronic device is provided, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method of the first aspect above.

According to yet another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, the computer instructions being configured to cause a computer to perform the method of the first aspect above.

According to yet another aspect of the present disclosure, a computer program product is provided, comprising a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium and execute it, causing the electronic device to perform the method of the first aspect above.

According to the technical solutions of the present disclosure, the effect of speech style transfer is improved.

It should be understood that the content described in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor to limit its scope. Other features of the present disclosure will become easy to understand from the following description.

Brief Description of the Drawings

The accompanying drawings are provided for a better understanding of the solution and do not constitute a limitation of the present disclosure. In the drawings:

FIG. 1 is a schematic flowchart of a training method for a speech style transfer model according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a speech style transfer model according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a content encoding network according to an embodiment of the present disclosure;

FIG. 4 is a schematic flowchart of a speech style transfer method according to an embodiment of the present disclosure;

FIG. 5 is a schematic structural diagram of a training apparatus for a speech style transfer model according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a speech style transfer apparatus according to an embodiment of the present disclosure;

FIG. 7 is a schematic block diagram of an electronic device used to implement the methods of the embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to facilitate understanding; they should be regarded as merely exemplary. Those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.

A speech style transfer model usually comprises at least a content encoding network, a speaker encoding network, and a spectrogram decoding network. The content encoding network extracts the content features of the source audio; the speaker encoding network extracts the speaker features of the target audio; and the spectrogram decoding network outputs spectral features from the content features of the source audio and the speaker features of the target audio. Finally, those spectral features are converted into audio whose content is that of the source audio but whose voice style is that of the target audio's speaker.

When training the speech style transfer model, for a sample audio, the content encoding network extracts the sample audio's content features while the speaker encoding network extracts its speaker features. The spectrogram decoding network then outputs predicted spectral features from these two sets of features; the audio corresponding to the predicted spectral features has the content of the sample audio and the voice style of the sample audio's speaker. The model parameters are updated using the difference between the predicted spectral features and the true spectral features of the sample audio. Once training is complete, style transfer is performed simply by switching the input of the speaker encoding network to the target audio, so that it outputs the speaker features of the target audio.
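The data flow just described can be sketched as follows. All function names are illustrative stand-ins, not the patent's modules; the point is only that the content features and pitch come from the source audio while the speaker features come from the target audio:

```python
def transfer(source_spec, target_spec, content_enc, speaker_enc, decoder, extract_f0):
    """Style transfer: content and pitch from the source, speaker identity from the target."""
    content = content_enc(source_spec)
    speaker = speaker_enc(target_spec)  # at training time this input is the sample audio itself
    f0 = extract_f0(source_spec)
    return decoder(content, speaker, f0)

# Toy stand-ins just to show the data flow (not real networks):
out = transfer("src-spec", "tgt-spec",
               content_enc=lambda s: f"content({s})",
               speaker_enc=lambda s: f"speaker({s})",
               decoder=lambda c, sp, f: (c, sp, f),
               extract_f0=lambda s: f"f0({s})")
print(out)  # ('content(src-spec)', 'speaker(tgt-spec)', 'f0(src-spec)')
```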

However, the content features and speaker features extracted from the sample audio during training may not be decoupled, i.e. dependencies may exist between them. As a result, after training, when the speaker encoding network's input is switched to the target audio, the style transfer effect can be poor.

To address this, the embodiments of the present disclosure propose that, in the speech style transfer model, in addition to extracting content features with the content encoding module and speaker features with the speaker encoding module, the pitch frequency of the audio is also extracted, and all three kinds of features are used to determine the predicted spectral features. When training the model, besides updating the model parameters according to the predicted spectral features, the mutual information (MI) among the content features, the speaker features, and the pitch frequency is also used. The mutual information between two features can be viewed as the amount of information one feature contains about the other, or the reduction in uncertainty about one feature given knowledge of the other; it measures the mutual dependence between the two. By reducing the mutual information during training, the content features, speaker features, and pitch frequency are decoupled as much as possible: the dependencies among them are reduced so that the information each feature carries does not overlap. This ensures that, in later applications, after switching the speaker encoding network's input to the target audio and obtaining the target audio's speaker features, the model can produce spectral features in the target audio's voice style without being affected or interfered with by the other features, yielding a better style transfer effect.
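For background (this is the standard information-theoretic definition, not a formula quoted from the patent), the mutual information between two discrete random variables X and Y is:

```latex
I(X;Y) \;=\; \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \;=\; H(X) - H(X \mid Y)
```

It is zero exactly when X and Y are independent, which is why driving it down during training pushes the three feature streams toward independence.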

The present disclosure provides a training method for a speech style transfer model, a speech style transfer method, and apparatuses, applied to speech synthesis within the field of artificial intelligence; they can be used in voice-changing systems, voice chat, online games, and similar scenarios to improve the effect of speech style transfer.

Below, the training method of the speech style transfer model and the speech style transfer method provided by the present disclosure are described in detail through specific embodiments. It should be understood that the following embodiments can be combined with one another, and identical or similar concepts or processes may not be repeated in every embodiment.

FIG. 1 is a schematic flowchart of a training method for a speech style transfer model according to an embodiment of the present disclosure. The method is executed by a training apparatus for the speech style transfer model, which can be implemented in software and/or hardware. As shown in FIG. 1, the method comprises:

S101. Input the spectral features of a sample audio into the content encoding network and the speaker encoding network of the speech style transfer model respectively, obtain the content feature vector and the speaker feature vector of the sample audio, and extract the pitch frequency of the sample audio.

As an example, the sample audio may come from open-source datasets such as AISHELL, LJSpeech, or VCTK, and the spectral features of the sample audio may be MFCC, PLP, or Fbank features, which are not limited in the embodiments of the present application. The spectral features of the sample audio are input into the content encoding network to obtain the content feature vector of the sample audio, and into the speaker encoding network to obtain the speaker feature vector of the sample audio.

The speaker encoding network models the characteristics of speaker identity and extracts speaker features. Attributes unrelated to speaker identity, such as stress and pauses, are instead characterized by the pitch frequency, which can be extracted from the sample audio using digital signal processing.
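As one illustration of such digital signal processing (a simple autocorrelation method, not necessarily the extractor used by the patent), the pitch of a voiced frame can be estimated by finding the lag at which the frame best correlates with itself:

```python
import math

def estimate_f0(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate pitch as sr / lag, where lag maximizes the autocorrelation
    within the period range corresponding to [fmin, fmax] Hz."""
    lag_min = int(sr / fmax)  # shortest period considered
    lag_max = int(sr / fmin)  # longest period considered
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, len(frame) - 1)):
        corr = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sr / best_lag

# A 200 Hz sine sampled at 16 kHz, one 50 ms frame:
sr = 16000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(800)]
f0 = estimate_f0(frame, sr)  # close to 200.0
```

Production systems typically use more robust estimators (e.g. YIN-style cumulative normalization), but the period-search idea is the same.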

S102. Determine the mutual information among the content feature vector, the speaker feature vector, and the pitch frequency.

To decouple the three features (the content feature vector, the speaker feature vector, and the pitch frequency) through model training and to reduce the dependencies among them, the embodiments of the present disclosure introduce the mutual information among these three features into the training process. Here, the mutual information among the three features can be understood as the pairwise mutual information between each pair of them: the lower the mutual information between two features, the lower their correlation, and the less the information they each carry overlaps.
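As an illustrative sketch (not part of the patent, which would use a learned neural estimator on continuous vectors), the pairwise mutual information between two discretized feature sequences can be estimated from their joint histogram:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X;Y) in nats from paired discrete observations."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p(x) * p(y)) written so the empirical counts cancel cleanly
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

# Identical sequences share all their information; a constant sequence shares none.
a = [0, 1, 0, 1, 0, 1, 0, 1]
print(mutual_information(a, a))        # = H(X) = log 2 ≈ 0.693
print(mutual_information(a, [0] * 8))  # 0.0
```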

S103. Input the content feature vector, the speaker feature vector, and the pitch frequency into the spectrogram decoding network of the speech style transfer model to obtain predicted spectral features.

The spectrogram decoding network takes the content feature vector, the speaker feature vector, and the pitch frequency as input and outputs predicted spectral features. Since all three are features of the sample audio, this step in effect reconstructs the spectral features of the sample audio from these three features.

S104. Update the parameters of the speech style transfer model according to the predicted spectral features and the mutual information.

As explained in the preceding steps, the predicted spectral features are a reconstruction of the sample audio's spectral features, so the model parameters can be adjusted according to the difference between the predicted spectral features and the sample audio's spectral features. At the same time, to ensure the decoupling of the content feature vector, the speaker feature vector, and the pitch frequency, reducing the mutual information is likewise taken as a training objective. Updating the parameters of the speech style transfer model according to both the predicted spectral features and the mutual information decouples the three features while preserving the accuracy of the predicted spectral features.
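A minimal sketch of how such a combined objective might be formed. The weighting factor `lambda_mi` and the concrete loss values below are illustrative assumptions, not taken from the patent:

```python
def total_loss(reconstruction_loss, mi_terms, lambda_mi=0.1):
    """Combine the spectral reconstruction loss with pairwise MI penalties.

    mi_terms: pairwise mutual-information estimates, e.g.
    [I(content, speaker), I(content, f0), I(speaker, f0)].
    Minimizing the sum drives the three features toward independence
    while keeping the reconstruction accurate.
    """
    return reconstruction_loss + lambda_mi * sum(mi_terms)

loss = total_loss(reconstruction_loss=2.5, mi_terms=[0.4, 0.1, 0.2])
print(loss)  # 2.5 + 0.1 * 0.7 ≈ 2.57
```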

In the method of the embodiments of the present disclosure, the content feature vector, the speaker feature vector, and the pitch frequency of the sample audio are extracted, and all three are used to determine the predicted spectral features. When training the model, in addition to updating the model parameters according to the predicted spectral features, the mutual information among the sample audio's content feature vector, speaker feature vector, and pitch frequency is introduced, and reducing it is likewise taken as an optimization objective of model training. By reducing the mutual information during training, the content feature vector, the speaker feature vector, and the pitch frequency are decoupled. This ensures that in subsequent applications, switching the speaker encoding network's input to the target audio, so that the speaker encoding network outputs the target audio's speaker feature vector, lets the spectrogram decoding network accurately output spectral features in the target audio's voice style without influence or interference from the other features, giving a better style transfer effect.

On the basis of the above embodiment, the implementation of each step is further described.

Optionally, in step S101, inputting the spectral features of the sample audio into the content encoding network of the speech style transfer model to obtain the content feature vector of the sample audio comprises:

inputting the spectral features into a first feature extraction network in the content encoding network to obtain a first feature vector, the number of frames of the first feature vector being lower than that of the spectral features; inputting the first feature vector into a quantizer in the content encoding network to determine a second feature vector with the highest similarity to the first feature vector; and updating the second feature vector using contrastive predictive coding to obtain the content feature vector.
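The quantization step can be sketched as a nearest-neighbor lookup against the codebook. The codebook size and vector dimensions below are arbitrary toy values, not the 512x64 configuration described later:

```python
def quantize(frames, codebook):
    """Replace each frame vector with its nearest codebook vector (Euclidean distance)."""
    def nearest(v):
        return min(codebook,
                   key=lambda c: sum((vi - ci) ** 2 for vi, ci in zip(v, c)))
    return [nearest(v) for v in frames]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.9, 0.1), (0.1, 0.8)]
print(quantize(frames, codebook))  # [(1.0, 0.0), (0.0, 1.0)]
```

In the model the codebook entries are themselves learnable, so this lookup is paired with a loss that pulls codebook vectors and encoder outputs toward each other.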

Optionally, in step S104, updating the parameters of the speech style transfer model according to the predicted spectral features and the mutual information comprises: determining a first loss of the content encoding network according to the content feature vector; determining a second loss of the spectrogram decoding network according to the predicted spectral features; and updating the parameters of the speech style transfer model according to the first loss, the second loss, and the mutual information.

Optionally, a first sub-loss is determined according to the second feature vector, and a second sub-loss is determined according to the content feature vector; the sum of the first sub-loss and the second sub-loss is determined as the first loss.

The following is described with reference to FIG. 2 and FIG. 3.

The content encoding network consists of {h-net + quantizer + g-net}. h-net is the first feature extraction network and may optionally consist of {Conv + {LayerNorm + Linear + ReLU}*4 + Linear}. The quantizer is a codebook composed of multiple learnable vectors, for example 512 learnable 64-dimensional vectors. g-net consists of a unidirectional RNN.

The input of the content coding network is the spectral feature of the sample audio. For example, the spectral feature may have 20 dimensions, a frame length of 25 ms, and a frame shift of 10 ms. The content coding network may use a certain amount of context in its computation, for example 2 frames on each side. The output of the content coding network is the content feature vector extracted from the sample audio.

Here, h-net maps the spectral feature X (T frames) of the input sample audio to a first feature vector Z (for example, T/2 frames). By compressing X, it extracts high-level features from the spectral feature, so that the resulting first feature vector Z contains less redundant information. Then, using dictionary-learning vector quantization (VQ), the vector most similar to each column of Z is found in the codebook, forming the second feature vector Ẑ. In this way, Z is compressed into a lower-dimensional compact space, removing unimportant details while preserving basic linguistic information. The compressed mapping of Z is achieved by minimizing the vector quantization loss. The first sub-loss, i.e., the vector quantization loss L_VQ, is determined from the second feature vector Ẑ as follows:

L_VQ = (1/K) Σ_{k=1}^{K} ‖Z_k - sg(Ẑ_k)‖²₂

where K is the number of sample audios and sg(·) denotes the stop-gradient operator.
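A minimal sketch of this vector quantization loss; the commitment-style form and the normalization by K are assumptions, and treating the quantized vectors as constants plays the role of sg(·) here:

```python
import numpy as np

def vq_loss(z, z_hat):
    """Sketch of the vector quantization loss L_VQ.

    z:     (K, T, D) first feature vectors for K sample audios.
    z_hat: (K, T, D) quantized second feature vectors; no gradient
           flows through z_hat here, mimicking sg (stop-gradient).
    """
    k = z.shape[0]
    return float(((z - z_hat) ** 2).sum() / k)
```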

The input of g-net is the second feature vector Ẑ, and g-net updates Ẑ; the updated Ẑ is the content feature vector output by the content coding network. Specifically, g-net multiplies the second feature vector Ẑ by a parameter matrix W to update it. During training, so that Ẑ learns more local structural information while discarding low-level detail and noise, the embodiment of the present disclosure introduces unsupervised contrastive predictive coding (CPC), which uses future or contextual information to extract a compact latent feature representation, combines autoregressive modeling and noise-contrastive estimation with predictive coding, and learns abstract representations in an unsupervised manner. The VQ module (as the encoder) and g-net (as the autoregressive model) are jointly trained by minimizing the information noise-contrastive estimation (InfoNCE) loss L_CPC. That is, the second sub-loss, the InfoNCE loss L_CPC, is determined according to the content feature vector as follows:

L_CPC = -(1/K) Σ_{k=1}^{K} (1/(M·T′)) Σ_{t=1}^{T′} Σ_{m=1}^{M} log [ exp(ẑ_{k,t+m}ᵀ W_m r_{k,t}) / Σ_{z̃∈Ω_{k,t,m}} exp(z̃ᵀ W_m r_{k,t}) ]

where T′ = T/2 - M; W_m (m = 1, 2, ..., M) are the parameter matrices to be trained in g-net; K is the number of sample audios; M is a constant, for example 6; ẑ_{k,t+m} is the positive sample, obtained by moving m steps forward, while the negatives are drawn from the set Ω_{k,t,m}, for example 10 samples per set; and r_{k,t} is the intermediate output of g-net used to predict the second feature vectors after time t.
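The InfoNCE objective can be sketched for a single utterance as follows; drawing the negatives from the other frames of the same utterance, rather than from a separately constructed set Ω, is an assumption of this sketch:

```python
import numpy as np

def cpc_loss(z_hat, r, W):
    """Minimal InfoNCE (CPC) loss for one utterance.

    z_hat: (T2, D) quantized features.
    r:     (T2, D) g-net autoregressive outputs.
    W:     (M, D, D) prediction matrices W_m.
    """
    t2 = z_hat.shape[0]
    m_steps = W.shape[0]
    total, count = 0.0, 0
    for t in range(t2 - m_steps):
        for m in range(m_steps):
            scores = z_hat @ (W[m] @ r[t])   # logits against all frames
            scores -= scores.max()           # numerical stability
            # log-softmax probability assigned to the true future frame
            log_p = scores[t + m + 1] - np.log(np.exp(scores).sum())
            total -= log_p
            count += 1
    return total / count
```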

In this way, the sum of the first sub-loss and the second sub-loss is determined as the first loss of the content coding network, i.e., the total loss of the parts of the content coding network. Computing the first and second sub-losses ensures the accuracy of the information contained in the content feature vector output by the trained content coding network.

Optionally, the speaker coding network consists of {ConvBank*8 + Conv1d*12 + average-pooling + Linear*4}, and its input is the spectral feature X of the sample audio. The ConvBank layers enlarge the receptive field to capture long-term information, and a pooling layer then aggregates the features, finally outputting the speaker feature vector s.

The speaker coding network mainly models the characteristics of the speaker. A speaker's speaking characteristics arise from both conscious and unconscious activity and are generally highly variable; from the perspective of the spectrum they include three aspects. The first is intrinsic, stable, coarse-grained characteristics determined by the speaker's physiology. These differ between speakers, are perceived differently by listeners, and appear in the low-frequency region of the spectrum, for example the average pitch, the spectral envelope reflecting the vocal-tract impulse response, and the relative amplitudes and positions of the formants. The second is unstable short-term acoustic characteristics, such as sharp and rapid changes in speaking rate, frequency, intensity, and the fine structure of the spectrum; these reflect the speaker's psychological or mental state, and in conversation a speaker varies them to express different emotions and intentions. These two aspects are extracted by the speaker coding network, whose multi-layer convolutional networks extract features at different levels and then combine them, thereby ensuring the accuracy of the speaker features.

The third aspect, such as stress and pauses, is unrelated to the speaker and is instead characterized by the pitch frequency p, which can be determined by digital signal processing methods such as the discrete Fourier transform (DFT) or harmonic summation.
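A simplified sketch of pitch estimation by harmonic summation on the DFT magnitude spectrum; the search range, window, and number of harmonics are illustrative assumptions, not values from the disclosure:

```python
import numpy as np

def pitch_harmonic_sum(frame, sr, fmin=60.0, fmax=400.0, n_harm=4):
    """Estimate the pitch frequency of one frame by summing spectral
    magnitude at candidate harmonics and picking the best candidate."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    candidates = freqs[(freqs >= fmin) & (freqs <= fmax)]

    def harmonic_score(f0):
        # Sum the magnitude at the bins nearest each harmonic h*f0.
        return sum(spec[np.argmin(np.abs(freqs - h * f0))]
                   for h in range(1, n_harm + 1))

    return max(candidates, key=harmonic_score)
```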

After the content feature vector, the speaker feature vector, and the pitch frequency of the sample audio are determined, the mutual information among them can be determined. In the embodiment of the present disclosure, the upper bound of the mutual information between each pair of the content feature vector, the speaker feature vector, and the pitch frequency is determined by the contrastive log-ratio upper bound algorithm, and the sum of these pairwise upper bounds is determined as the mutual information of the content feature vector, the speaker feature vector, and the pitch frequency.

As described in the foregoing embodiments, without explicit constraints on the content feature vector, the speaker feature vector, and the pitch frequency, the three features are likely to depend on one another; for example, the content features may contain some speaker characteristics, and the speaker features may contain part of the pitch frequency, leading to an unsatisfactory style transfer effect. Therefore, in the embodiment of the present disclosure, from the standpoint of information theory, mutual information is used to constrain the dependencies among these features. To ensure that the mutual information among them is reduced, the contrastive log-ratio upper bound (CLUB) algorithm is introduced to estimate an upper bound of their mutual information; this upper bound is taken as the mutual information, and reducing it is used as an optimization objective during training, so that the three features extracted by the trained model have the lowest correlation.

In the related art, the mutual information I(u, v) between two features u and v is computed as the relative entropy (Kullback-Leibler divergence) between their joint distribution and the product of their marginal distributions:

I(u, v) = D_KL(P(u, v) ‖ P(u)P(v))

With the CLUB algorithm, the upper bound of the mutual information is obtained as follows:

Î(u, v) = E_{p(u,v)}[log q_θ(u|v)] - E_{p(u)}E_{p(v)}[log q_θ(u|v)]

where q_θ(u|v) denotes an estimate, close to the true distribution, of u given v; it can be computed by a neural network θ_{u,v}. Optionally, the mean of q_θ(u|v) is estimated by a variational estimation network {Linear + ReLU + Linear} and its log-variance by {Linear + ReLU + Linear + Tanh}; the log-likelihood is then computed, and the network θ_{u,v} is updated by maximizing the following loss:

L(θ_{u,v}) = E_{p(u,v)}[log q_θ(u|v)]

Substituting the content feature vector c, the speaker feature vector s, and the pitch frequency p pairwise gives the mutual information between each pair of the three features:

Î(c, s) = E_{p(c,s)}[log q_{θ_{c,s}}(c|s)] - E_{p(c)}E_{p(s)}[log q_{θ_{c,s}}(c|s)]

Î(c, p) = E_{p(c,p)}[log q_{θ_{c,p}}(c|p)] - E_{p(c)}E_{p(p)}[log q_{θ_{c,p}}(c|p)]

Î(s, p) = E_{p(s,p)}[log q_{θ_{s,p}}(s|p)] - E_{p(s)}E_{p(p)}[log q_{θ_{s,p}}(s|p)]

The overall mutual information L_MI of the content feature vector, the speaker feature vector, and the pitch frequency is then the sum of the pairwise upper bounds:

L_MI = Î(c, s) + Î(c, p) + Î(s, p)
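A sampled CLUB estimate can be sketched as follows, with `mean_fn` and `logvar_fn` standing in for the variational estimation networks described above (hypothetical callables; a diagonal Gaussian q_θ(u|v) is assumed):

```python
import numpy as np

def club_upper_bound(u, v, mean_fn, logvar_fn):
    """Sampled CLUB estimate of the mutual information upper bound
    between paired features u and v, each of shape (N, D)."""
    mu, logvar = mean_fn(v), logvar_fn(v)
    var = np.exp(logvar)
    # log q(u_i | v_i) for matched (joint) pairs, up to constants
    positive = (-((u - mu) ** 2) / (2 * var)).sum(axis=-1)
    # log q(u_j | v_i) averaged over all pairings (marginal term)
    diff = u[None, :, :] - mu[:, None, :]                # (N, N, D)
    negative = (-(diff ** 2) / (2 * var[:, None, :])).sum(axis=-1).mean(axis=1)
    return float((positive - negative).mean())
```

For strongly dependent features the estimate is positive, reflecting high mutual information; minimizing it during training pushes the features toward independence.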

Optionally, the spectral decoding network consists of {LSTM + Conv1d + LSTM*2 + Linear}. Its input is the concatenation of the content feature vector, the speaker feature vector, and the pitch frequency of the sample audio, and its output is the predicted spectral feature. To make the predicted spectral feature approach the real spectral feature of the sample audio, an error loss, i.e., the second loss, is computed by the mean square error (MSE) and used to update the spectral decoding network. The error loss L_REC is as follows:

L_REC = (1/K) Σ_{k=1}^{K} ‖X_k - X̂_k‖²₂

When concatenating the content feature vector, the speaker feature vector, and the pitch frequency, the content feature vector may be upsampled, i.e., extended to T frames, for example by linear interpolation. In addition, the speaker feature vector is copied so that it repeats over the T frames.
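The concatenation step can be sketched as follows, with linear-interpolation upsampling of the content features and frame-wise copying of the speaker vector; the exact feature shapes are assumptions:

```python
import numpy as np

def assemble_decoder_input(content, speaker, f0):
    """Build the spectral decoder input described above.

    content: (T2, Dc) content feature vector (fewer frames than T).
    speaker: (Ds,)    speaker feature vector, copied to all T frames.
    f0:      (T,)     pitch frequency per frame.
    Returns a (T, Dc + Ds + 1) per-frame concatenation.
    """
    t_frames = f0.shape[0]
    src = np.linspace(0.0, 1.0, content.shape[0])
    dst = np.linspace(0.0, 1.0, t_frames)
    upsampled = np.stack([np.interp(dst, src, content[:, d])
                          for d in range(content.shape[1])], axis=1)
    speaker_tiled = np.tile(speaker, (t_frames, 1))
    return np.concatenate([upsampled, speaker_tiled, f0[:, None]], axis=1)
```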

Through the above process, the first loss, the second loss, and the mutual information are obtained; their sum is determined as the total loss, and the parameters of the speech style transfer model are updated according to the total loss.

That is, the total loss L_VC of the speech style transfer model is:

L_VC = L_VQ + L_CPC + L_REC + λ_MI · L_MI

where λ_MI is a constant weight (typically 0.1 or 0.01) used to adjust the contribution of L_MI to the total loss.
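The weighted combination is a direct transcription of the total loss; the default weight below follows the 0.1 mentioned in the text:

```python
def total_loss(l_vq, l_cpc, l_rec, l_mi, lambda_mi=0.1):
    """L_VC = L_VQ + L_CPC + L_REC + lambda_MI * L_MI."""
    return l_vq + l_cpc + l_rec + lambda_mi * l_mi
```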

With the method of the embodiment of the present disclosure, during training of the speech style transfer model, vector quantization preserves the important linguistic information in the content feature vector while removing unnecessary redundant information, and the mutual information is used as a loss to update the model. Through unsupervised training, the content feature vector, the speaker feature vector, and the pitch frequency are decoupled, so that in application after training, inputting the target audio into the speaker coding network enables the model to output spectral features in the speech style of the target audio.

FIG. 4 is a schematic flowchart of a speech style transfer method provided according to an embodiment of the present disclosure. The method is executed by a speech style transfer apparatus, which may be implemented in software and/or hardware. As shown in FIG. 4, the method includes:

S401. Acquire a first spectral feature of the source audio and a second spectral feature of the target audio.

S402. Input the first spectral feature into the content coding network of the speech style transfer model to obtain the content feature vector of the source audio, and extract the pitch frequency of the source audio.

S403. Input the second spectral feature into the speaker coding network of the speech style transfer model to obtain the speaker feature vector of the target audio.

S404. Input the content feature vector of the source audio, the speaker feature vector of the target audio, and the pitch frequency of the source audio into the spectral decoding network of the speech style transfer model to obtain the spectral feature to be output, and generate the audio to be output according to that spectral feature.

The speech style transfer model is trained according to the method of the foregoing embodiments.

In the embodiment of the present disclosure, the parts of the speech style transfer model function as in the foregoing embodiments and are not described again here. In application, the content coding network extracts the content feature vector of the source audio, the speaker coding network extracts the speaker feature vector of the target audio, and the spectral decoding network then obtains the spectral feature to be output from the content feature vector of the source audio, the speaker feature vector of the target audio, and the pitch frequency of the source audio; the output audio is finally generated from that spectral feature. The content of the output audio is the content of the source audio, and its speech style is that of the target audio. Training the speech style transfer model by the method of the foregoing embodiments ensures that, in this application process, there is no dependency or interference between features, giving a better style transfer effect.
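The inference flow of steps S401 to S404 can be sketched as follows, with hypothetical callables standing in for the trained networks and the pitch extractor:

```python
def transfer_style(src_spec, tgt_spec, content_encoder, speaker_encoder,
                   spectral_decoder, extract_f0):
    """Sketch of the speech style transfer inference flow."""
    content = content_encoder(src_spec)   # S402: content of the source
    f0 = extract_f0(src_spec)             # S402: pitch of the source
    speaker = speaker_encoder(tgt_spec)   # S403: style of the target
    return spectral_decoder(content, speaker, f0)   # S404
```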

FIG. 5 is a schematic structural diagram of an apparatus for training a speech style transfer model provided according to an embodiment of the present disclosure. As shown in FIG. 5, the training apparatus 500 includes:

a feature extraction module 501, configured to input the spectral feature of the sample audio into the content coding network and the speaker coding network of the speech style transfer model respectively, obtain the content feature vector and the speaker feature vector of the sample audio, and extract the pitch frequency of the sample audio;

a mutual information calculation module 502, configured to determine the mutual information of the content feature vector, the speaker feature vector, and the pitch frequency;

a spectral decoding module 503, configured to input the content feature vector, the speaker feature vector, and the pitch frequency into the spectral decoding network of the speech style transfer model to obtain the predicted spectral feature;

an updating module 504, configured to update the parameters of the speech style transfer model according to the predicted spectral feature and the mutual information.

In an embodiment, the updating module 504 includes:

a first determining unit, configured to determine the first loss of the content coding network according to the content feature vector;

a second determining unit, configured to determine the second loss of the spectral decoding network according to the predicted spectral feature;

an updating unit, configured to update the parameters of the speech style transfer model according to the first loss, the second loss, and the mutual information.

In an embodiment, the updating unit includes:

a first determining subunit, configured to determine the sum of the first loss, the second loss, and the mutual information as the total loss;

an updating subunit, configured to update the parameters of the speech style transfer model according to the total loss.

In an embodiment, the feature extraction module 501 includes:

a first extraction unit, configured to input the spectral feature into the first feature extraction network in the content coding network to obtain the first feature vector, where the number of frames of the first feature vector is less than that of the spectral feature;

a second extraction unit, configured to input the first feature vector into the quantizer in the content coding network to determine the second feature vector having the highest similarity to the first feature vector;

a third extraction unit, configured to update the second feature vector by contrastive predictive coding to obtain the content feature vector.

In an embodiment, the first determining unit includes:

a second determining subunit, configured to determine the first sub-loss according to the second feature vector and determine the second sub-loss according to the content feature vector;

a third determining subunit, configured to determine the sum of the first sub-loss and the second sub-loss as the first loss.

In an embodiment, the mutual information calculation module 502 includes:

a third determining unit, configured to determine the upper bound of the mutual information between each pair of the content feature vector, the speaker feature vector, and the pitch frequency by the contrastive log-ratio upper bound algorithm;

a fourth determining unit, configured to determine the sum of the pairwise mutual information upper bounds as the mutual information of the content feature vector, the speaker feature vector, and the pitch frequency.

The apparatus of the embodiment of the present disclosure may be used to execute the training method of the speech style transfer model in the foregoing method embodiments; its implementation principle and technical effect are similar and are not described again here.

FIG. 6 is a schematic structural diagram of a speech style transfer apparatus provided according to an embodiment of the present disclosure. As shown in FIG. 6, the speech style transfer apparatus 600 includes:

an acquisition module 601, configured to acquire the first spectral feature of the source audio and the second spectral feature of the target audio;

a content coding module 602, configured to input the first spectral feature into the content coding network of the speech style transfer model, obtain the content feature vector of the source audio, and extract the pitch frequency of the source audio;

a speaker coding module 603, configured to input the second spectral feature into the speaker coding network of the speech style transfer model to obtain the speaker feature vector of the target audio;

a spectral decoding module 604, configured to input the content feature vector of the source audio, the speaker feature vector of the target audio, and the pitch frequency of the source audio into the spectral decoding network of the speech style transfer model to obtain the spectral feature to be output, and generate the audio to be output according to that spectral feature;

wherein the speech style transfer model is trained according to the method of the foregoing embodiments.

The apparatus of the embodiment of the present disclosure may be used to execute the speech style transfer method in the foregoing method embodiments; its implementation principle and technical effect are similar and are not described again here.

According to embodiments of the present disclosure, the present disclosure further provides an electronic device and a non-transitory computer-readable storage medium storing computer instructions.

According to an embodiment of the present disclosure, the present disclosure further provides a computer program product including a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium and execute it, causing the electronic device to perform the solution provided by any of the foregoing embodiments.

FIG. 7 is a schematic block diagram of an electronic device used to implement the methods of the embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 7, the electronic device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Multiple components of the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or an optical disc; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.

The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The computing unit 701 performs the methods and processes described above, such as the training method of the speech style transfer model or the speech style transfer method. For example, in some embodiments, the training method of the speech style transfer model or the speech style transfer method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the speech style transfer model or of the speech style transfer method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other appropriate means (for example, by means of firmware) to perform the training method of the speech style transfer model or the speech style transfer method.

Various implementations of the systems and techniques described herein may be realized in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that, when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form (including acoustic input, speech input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes a back-end component (e.g., as a data server), or a computing system that includes a middleware component (e.g., an application server), or a computing system that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LAN), wide area networks (WAN), and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that remedies the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.

It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.

The above specific embodiments do not constitute a limitation on the protection scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (17)

1. A training method of a speech style migration model comprises the following steps:
respectively inputting the voice spectrum characteristics of the sample audio into a content coding network and a speaker coding network of a voice style migration model to obtain a content characteristic vector and a speaker characteristic vector of the sample audio, and extracting the fundamental tone frequency of the sample audio;
determining mutual information of the content feature vector, the speaker feature vector and the fundamental tone frequency;
inputting the content feature vector, the speaker feature vector and the fundamental tone frequency into a sound spectrum decoding network of the voice style migration model to obtain predicted sound spectrum features;
and updating the parameters of the voice style migration model according to the predicted voice spectrum characteristics and the mutual information.
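The four steps of claim 1 can be sketched as one training-step computation. Everything below is a hypothetical stand-in, assuming random projections in place of trained networks; the names `content_encoder`, `speaker_encoder`, `extract_f0`, and `decoder`, and all dimensions, are illustrative and not taken from the patent. The mutual-information term of step 2 is left abstract here; only the reconstruction path and its loss are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N_MEL, D = 64, 80, 16  # frames, mel bins, embedding size (all illustrative)

def content_encoder(spec):
    """Stand-in content coding network: a fixed random projection per frame."""
    W = rng.standard_normal((spec.shape[1], D)) * 0.01
    return spec @ W

def speaker_encoder(spec):
    """Stand-in speaker coding network: one utterance-level vector."""
    return spec.mean(axis=0)[:D]

def extract_f0(spec):
    """Placeholder pitch extractor: one fundamental-frequency value per frame."""
    return np.abs(spec[:, 0])

def decoder(content, speaker, f0):
    """Stand-in sound spectrum decoding network: concatenate the three
    factors per frame and project back to the mel dimension."""
    frames = content.shape[0]
    cond = np.concatenate(
        [content, np.tile(speaker, (frames, 1)), f0[:, None]], axis=1)
    W = rng.standard_normal((cond.shape[1], N_MEL)) * 0.01
    return cond @ W

spec = rng.standard_normal((T, N_MEL))  # sound spectrum feature of the sample audio
content = content_encoder(spec)         # step 1: content feature vector
speaker = speaker_encoder(spec)         # step 1: speaker feature vector
f0 = extract_f0(spec)                   # step 1: pitch frequency
pred = decoder(content, speaker, f0)    # step 3: predicted sound spectrum feature
recon_loss = float(np.mean((pred - spec) ** 2))  # feeds the update of step 4
```

A real model would backpropagate `recon_loss` (together with the mutual-information term) through learned network weights; the random projections here only fix the data flow and shapes.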
2. The method of claim 1, wherein said updating parameters of said speech style migration model based on said predicted sound spectrum features and said mutual information comprises:
determining a first loss of the content encoding network based on the content feature vector;
determining a second loss of the sound spectrum decoding network according to the predicted sound spectrum characteristics;
and updating parameters of the voice style migration model according to the first loss, the second loss and the mutual information.
3. The method of claim 2, wherein said updating parameters of said speech style migration model based on said first loss, said second loss, and said mutual information comprises:
determining the sum of the first loss, the second loss and the mutual information as a total loss;
and updating the parameters of the voice style migration model according to the total loss.
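Claim 3's aggregation is a plain unweighted sum of the three terms. A minimal sketch, with illustrative values (a real system might weight the terms, which the claim does not require):

```python
def total_loss(first_loss, second_loss, mutual_information):
    """Claim 3: total loss = content-encoder loss + decoder loss
    + mutual-information estimate; parameters are updated on this sum."""
    return first_loss + second_loss + mutual_information

# Illustrative values only.
loss = total_loss(first_loss=0.7, second_loss=1.2, mutual_information=0.1)
```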
4. The method according to claim 2 or 3, wherein inputting the sound spectrum feature of the sample audio into the content coding network of the speech style migration model to obtain the content feature vector of the sample audio comprises:
inputting the sound spectrum characteristics into a first characteristic extraction network in the content coding network to obtain a first characteristic vector, wherein the frame number of the first characteristic vector is less than that of the sound spectrum characteristics;
inputting the first feature vector into a quantizer in the content coding network to determine a second feature vector having the highest similarity with the first feature vector;
and updating the second feature vector by adopting a contrast prediction coding method to obtain the content feature vector.
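The first two steps of claim 4 can be sketched as time-downsampling followed by nearest-codebook lookup. The pooling stride, codebook, and dimensions below are hypothetical stand-ins, and the contrastive-predictive-coding refinement that produces the final content feature vector is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(spec, stride=2):
    """Stand-in for the first feature extraction network: average-pool over
    time, so the first feature vector has fewer frames than the spectrum."""
    frames = (spec.shape[0] // stride) * stride
    return spec[:frames].reshape(-1, stride, spec.shape[1]).mean(axis=1)

def quantize(first_vecs, codebook):
    """Quantizer step: replace each frame vector with the codebook entry of
    highest cosine similarity, yielding the second feature vector."""
    a = first_vecs / np.linalg.norm(first_vecs, axis=1, keepdims=True)
    b = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    return codebook[(a @ b.T).argmax(axis=1)]

spec = rng.standard_normal((64, 80))      # 64-frame sound spectrum feature
codebook = rng.standard_normal((32, 80))  # hypothetical learned codebook
first = downsample(spec)                  # 32 frames < 64 frames
second = quantize(first, codebook)        # nearest codebook entries per frame
```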
5. The method of claim 4, wherein said determining a first loss of the content encoding network from the content feature vector comprises:
determining a first sub-loss according to the second feature vector, and determining a second sub-loss according to the content feature vector;
determining a sum of the first sub-loss and the second sub-loss as the first loss.
6. The method of any of claims 1-5, wherein the determining mutual information of the content feature vector, the speaker feature vector, and the pitch frequency comprises:
determining, by a contrastive log-ratio upper bound algorithm, a mutual information upper bound between every two of the content feature vector, the speaker feature vector, and the pitch frequency;
and determining the sum of the pairwise mutual information upper bounds as the mutual information of the content feature vector, the speaker feature vector, and the pitch frequency.
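One reading of claim 6, assuming its upper-bound algorithm is the contrastive log-ratio upper bound (CLUB) estimator (my interpretation, not stated verbatim in the claim): estimate a mutual-information upper bound for each of the three pairs and sum them. The sketch fixes the variational distribution to a unit-variance Gaussian q(y|x) = N(y; x, I), a strong simplification; a real system would learn q and would first project all three representations to a common dimension, as assumed below:

```python
import numpy as np

def club_upper_bound(x, y):
    """Sampled CLUB-style estimate: mean log q(y|x) over matched pairs minus
    the same mean over all (marginal) pairs, with q(y|x) = N(y; x, I)."""
    pos = -0.5 * np.sum((y - x) ** 2, axis=1)                      # matched pairs
    neg = -0.5 * np.sum((y[None, :] - x[:, None]) ** 2, axis=-1)   # all pairs
    return float(pos.mean() - neg.mean())

def three_way_mi(content, speaker, f0):
    """Claim 6: the model's mutual-information term is the sum of the
    three pairwise upper bounds."""
    return (club_upper_bound(content, speaker)
            + club_upper_bound(content, f0)
            + club_upper_bound(speaker, f0))

rng = np.random.default_rng(0)
c = rng.standard_normal((128, 8))  # per-frame content vectors (projected)
s = rng.standard_normal((128, 8))  # per-frame copies of the speaker vector
p = rng.standard_normal((128, 8))  # projected pitch representation
mi = three_way_mi(c, s, p)
```

Minimizing this summed bound during training pushes the three representations toward independence, which is what lets content, speaker identity, and pitch be recombined freely at conversion time.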
7. A method of voice style migration, comprising:
acquiring a first sound spectrum characteristic of a source audio and a second sound spectrum characteristic of a target audio;
inputting the first voice spectrum characteristic into a content coding network of a voice style migration model to obtain a content characteristic vector of the source audio, and extracting a fundamental tone frequency of the source audio;
inputting the second acoustic spectrum characteristic into a speaker coding network of a speech style migration model to obtain a speaker characteristic vector of the target audio;
inputting the content characteristic vector of the source audio, the speaker characteristic vector of the target audio and the fundamental tone frequency of the source audio into a sound spectrum decoding network of the voice style migration model to obtain a sound spectrum characteristic to be output, and generating the audio to be output according to the sound spectrum characteristic to be output;
wherein the speech style migration model is trained according to the method of any one of claims 1-6.
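Claim 7's conversion step recombines factors from two utterances: content and pitch come from the source audio, the speaker embedding from the target audio. The sketch below reuses hypothetical stand-in networks (illustrative names and shapes, random weights instead of the trained model) just to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(1)
D_CONTENT, D_SPK, N_MEL = 16, 8, 80  # illustrative dimensions

def content_encoder(spec):
    """Stand-in: per-frame content vectors (here, a slice of the spectrum)."""
    return spec[:, :D_CONTENT]

def speaker_encoder(spec):
    """Stand-in: one utterance-level speaker vector."""
    return spec.mean(axis=0)[:D_SPK]

def extract_f0(spec):
    """Stand-in: one pitch value per frame."""
    return np.abs(spec[:, 0])

def decoder(content, speaker, f0):
    """Stand-in decoder: combine the three factors frame by frame."""
    frames = content.shape[0]
    cond = np.concatenate(
        [content, np.tile(speaker, (frames, 1)), f0[:, None]], axis=1)
    W = rng.standard_normal((cond.shape[1], N_MEL)) * 0.01
    return cond @ W

source_spec = rng.standard_normal((120, N_MEL))  # first sound spectrum feature
target_spec = rng.standard_normal((90, N_MEL))   # second sound spectrum feature

out_spec = decoder(content_encoder(source_spec),   # content: from source
                   speaker_encoder(target_spec),   # voice:   from target
                   extract_f0(source_spec))        # pitch:   from source
```

Note that the output length follows the source audio (120 frames here), since content and pitch are per-frame while the speaker embedding is a single vector; a vocoder would then turn `out_spec` into the audio to be output.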
8. A training apparatus for a speech style migration model, comprising:
the characteristic extraction module is used for respectively inputting the sound spectrum characteristics of the sample audio into a content coding network and a speaker coding network of the voice style migration model to obtain a content characteristic vector and a speaker characteristic vector of the sample audio and extracting the fundamental tone frequency of the sample audio;
a mutual information calculation module for determining mutual information of the content feature vector, the speaker feature vector and the fundamental tone frequency;
a voice spectrum decoding module, configured to input the content feature vector, the speaker feature vector, and the pitch frequency into a voice spectrum decoding network of the speech style migration model to obtain a predicted voice spectrum feature;
and the updating module is used for updating the parameters of the voice style migration model according to the predicted sound spectrum characteristics and the mutual information.
9. The apparatus of claim 8, wherein the update module comprises:
a first determining unit, configured to determine a first loss of the content encoding network according to the content feature vector;
a second determining unit, configured to determine a second loss of the sound spectrum decoding network according to the predicted sound spectrum feature;
and the updating unit is used for updating the parameters of the voice style migration model according to the first loss, the second loss and the mutual information.
10. The apparatus of claim 9, wherein the updating unit comprises:
a first determining subunit, configured to determine a sum of the first loss, the second loss, and the mutual information as a total loss;
and the updating subunit is used for updating the parameters of the voice style migration model according to the total loss.
11. The apparatus of claim 9 or 10, wherein the feature extraction module comprises:
a first extraction unit, configured to input the audio spectrum feature into a first feature extraction network in the content coding network to obtain a first feature vector, where a frame number of the first feature vector is smaller than a frame number of the audio spectrum feature;
a second extraction unit, configured to input the first feature vector into a quantizer in the content coding network to determine a second feature vector having a highest similarity with the first feature vector;
and the third extraction unit is used for updating the second feature vector by adopting a contrast predictive coding method to obtain the content feature vector.
12. The apparatus of claim 11, wherein the first determining unit comprises:
a second determining subunit, configured to determine a first sub-loss according to the second feature vector, and determine a second sub-loss according to the content feature vector;
a third determining subunit, configured to determine a sum of the first sub-loss and the second sub-loss as the first loss.
13. The apparatus according to any one of claims 8-12, wherein the mutual information calculation module comprises:
a third determining unit, configured to determine, by a contrastive log-ratio upper bound algorithm, a mutual information upper bound between every two of the content feature vector, the speaker feature vector, and the pitch frequency;
a fourth determining unit, configured to determine the sum of the pairwise mutual information upper bounds as the mutual information of the content feature vector, the speaker feature vector, and the pitch frequency.
14. A speech style migration apparatus comprising:
the acquisition module is used for acquiring a first sound spectrum characteristic of source audio and a second sound spectrum characteristic of target audio;
a content coding module, configured to input the first spectrum feature into a content coding network of a speech style migration model, obtain a content feature vector of the source audio, and extract a pitch frequency of the source audio;
the speaker coding module is used for inputting the second acoustic spectrum characteristic into a speaker coding network of a speech style migration model to obtain a speaker characteristic vector of the target audio;
a sound spectrum decoding module, configured to input the content feature vector of the source audio, the speaker feature vector of the target audio, and the pitch frequency of the source audio into a sound spectrum decoding network of the speech style migration model to obtain a sound spectrum feature to be output, and generate the audio to be output according to the sound spectrum feature to be output;
wherein the speech style migration model is trained according to the method of any one of claims 1-6.
15. An electronic device, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
CN202111500435.9A 2021-12-09 2021-12-09 Speech style transfer model training, speech style transfer method and device Active CN114203154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111500435.9A CN114203154B (en) 2021-12-09 2021-12-09 Speech style transfer model training, speech style transfer method and device


Publications (2)

Publication Number Publication Date
CN114203154A true CN114203154A (en) 2022-03-18
CN114203154B CN114203154B (en) 2025-03-14

Family

ID=80651693



Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003173198A (en) * 2001-09-27 2003-06-20 Kenwood Corp Voice dictionary preparation apparatus, voice synthesizing apparatus, voice dictionary preparation method, voice synthesizing apparatus, and program
CN101308652A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 Synthesizing method of personalized singing voice
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
US20150112687A1 (en) * 2012-05-18 2015-04-23 Aleksandr Yurevich Bredikhin Method for rerecording audio materials and device for implementation thereof
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method
CN113763987A (en) * 2021-09-06 2021-12-07 中国科学院声学研究所 A training method and device for a speech conversion model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU, Xin; BAO, Changchun: "Audio bandwidth extension method based on cochlear filter cepstral coefficients", Journal of Tsinghua University (Science and Technology), no. 06, 15 June 2013 (2013-06-15), pages 913-916 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114678020A (en) * 2022-03-23 2022-06-28 北京慧能分享科技有限公司 Unsupervised application of voice command model in power monitoring field
CN114783409A (en) * 2022-03-29 2022-07-22 北京百度网讯科技有限公司 Speech synthesis model training method, speech synthesis method and device
CN114783409B (en) * 2022-03-29 2024-05-28 北京百度网讯科技有限公司 Speech synthesis model training method, speech synthesis method and device
CN114708876A (en) * 2022-05-11 2022-07-05 北京百度网讯科技有限公司 Audio processing method and device, electronic equipment and storage medium
CN114708876B (en) * 2022-05-11 2023-10-03 北京百度网讯科技有限公司 Audio processing method, device, electronic equipment and storage medium
CN114842859A (en) * 2022-05-12 2022-08-02 平安科技(深圳)有限公司 Voice conversion method, system, terminal and storage medium based on IN and MI
CN114822577A (en) * 2022-06-23 2022-07-29 全时云商务服务股份有限公司 Method and device for estimating fundamental frequency of voice signal
CN115578996A (en) * 2022-09-28 2023-01-06 慧言科技(天津)有限公司 Speech synthesis method based on self-supervised learning and mutual information decoupling technology
CN115578996B (en) * 2022-09-28 2025-09-30 慧言科技(天津)有限公司 Speech synthesis method based on self-supervised learning and mutual information decoupling technology
CN115985284A (en) * 2022-12-09 2023-04-18 北京达佳互联信息技术有限公司 Speech style extraction model training method, speech synthesis method, device and medium

Also Published As

Publication number Publication date
CN114203154B (en) 2025-03-14

Similar Documents

Publication Publication Date Title
CN114203154B (en) Speech style transfer model training, speech style transfer method and device
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN113963679B (en) A method, device, electronic device and storage medium for voice style transfer
CN113643693B (en) Acoustic model conditioned on sound characteristics
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN114023342B (en) Voice conversion method, device, storage medium and electronic equipment
CN112259089A (en) Voice recognition method and device
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN113793615B (en) Speaker recognition method, model training method, device, equipment and storage medium
CN114708876B (en) Audio processing method, device, electronic equipment and storage medium
CN114127849A (en) Speech emotion recognition method and device
CN113345460A (en) Audio signal processing method, device, equipment and storage medium
CN114913859A (en) Voiceprint recognition method and device, electronic equipment and storage medium
WO2024008215A2 (en) Speech emotion recognition method and apparatus
CN114783409B (en) Speech synthesis model training method, speech synthesis method and device
KR102663654B1 (en) Adaptive visual speech recognition
CN114937478A (en) Method for training a model, method and apparatus for generating molecules
JP2023169230A (en) Computer program, server device, terminal device, learned model, program generation method, and method
CN112634880A (en) Speaker identification method, device, equipment, storage medium and program product
CN114220415A (en) An audio synthesis method, device, electronic device and storage medium
CN113053356A (en) Voice waveform generation method, device, server and storage medium
CN114882151A (en) Method and device for generating virtual image video, equipment, medium and product
CN114999440B (en) Avatar generation method, apparatus, device, storage medium, and program product
CN113689867B (en) A training method, device, electronic device and medium for a speech conversion model
CN113468857B (en) Training method, device, electronic device and storage medium for style transfer model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant