CN111816148A - A method and system for virtual vocal sight-singing based on a generative adversarial network
- Publication number
- CN111816148A (application CN202010590728.XA)
- Authority
- CN
- China
- Prior art keywords
- file
- audio
- neural network
- input
- layers
- Prior art date
- 2020-06-24
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10H1/0033 — Recording/reproducing or transmission of music for electrophonic musical instruments (under G10H1/00, Details of electrophonic musical instruments)
- G06N3/045 — Combinations of networks (under G06N3/04, Architecture, e.g. interconnection topology)
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs (under G06N3/04)
- G06N3/08 — Learning methods (under G06N3/00, Computing arrangements based on biological models)
- G10H2210/101 — Music composition or musical creation; tools or processes therefor (under G10H2210/00, Aspects or methods of musical processing having intrinsic musical character)
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The invention provides a method and system for virtual vocal sight-singing based on a generative adversarial network. The method comprises: step 1, inputting an abc-notation file and corresponding sung-score vocal audio produced with Vocaloid; step 2, converting the abc file into a text file in a custom format, and using the custom text file and the vocal audio as the input of a Tacotron-2 neural network model; step 3, in the Tacotron-2 network, representing each character of the input text file by a 512-dimensional character embedding; step 4, completing the synthesis of the virtual-vocal waveform file; step 5, obtaining a complete piece of virtual vocal sight-singing music. The invention sings musical scores with a virtual human voice, and the rhythm of the output speech is smooth and natural, so that the listener perceives the device's voice output as natural rather than mechanical and stilted.
Description
Technical Field
The invention belongs to the field of computers and in particular relates to a method and system for virtual vocal sight-singing based on a generative adversarial network.
Background Art
When learning music, reading a score and singing it aloud is fundamental; this skill is called sight-singing (solfeggio). Sight-singing and ear training are particularly important, yet they require a great deal of practice outside class before the training takes effect. For the vast majority of ordinary families, a teacher obviously cannot accompany the student around the clock. Because learning and practice are inseparable, and practice is crucial for both singing and listening, traditional ear-training courses have shortcomings that call for improvement. A real human voice singing the score is lifelike and easy to learn from: virtual vocal sight-singing of a score is equivalent to a real person singing it, compensates for the fact that the timbre of instruments such as the piano or electronic keyboard differs from the human voice, and greatly improves score-reading speed and score-singing ability. Current music applications are developing quickly and have added the timbres of many instruments, with automatic playback from a score, but none of them include a human voice among the timbres.
Traditional methods of simulating a singing voice rely partly or entirely on concatenation, for example simply splicing each solmization syllable according to duration and beat. Although such methods are easy to implement, the result is stiff and differs markedly from real singing, so the effect is far from ideal. How to sing a musical score with a realistic human voice is a problem that remains to be solved.
Summary of the Invention
The invention provides a method and system for virtual vocal sight-singing based on a generative adversarial network. It can sing a musical score with a virtual human voice, and the rhythm of the output speech is smooth and natural, so that the listener perceives the information as natural rather than hearing a mechanical, stilted voice from the device.
To solve the above problems, the present invention provides a virtual vocal sight-singing method based on a generative adversarial network, the method comprising:
Step 1: input an abc-notation file and sung-score vocal audio produced with Vocaloid, the vocal audio corresponding to the abc file;
Step 2: convert the abc file into a text file in a custom format, and use the custom text file and the vocal audio as the input of a Tacotron-2 neural network model;
Step 3: in the Tacotron-2 network, represent each character of the input text file by a 512-dimensional character embedding, then pass it through 3 convolutional layers whose output feeds a bidirectional LSTM layer; location-sensitive attention keeps the model moving steadily forward through the input, and the mel spectrogram generated by the Tacotron-2 network serves as the input of the MelGAN model;
Step 4: feed the model trained by the Tacotron-2 network together with the original vocal audio file into the MelGAN generative adversarial network; through the generator and the discriminators, a feature map and a synthesized sung-score audio file are finally obtained, completing the synthesis of the virtual-vocal waveform file;
Step 5: glue and splice the corresponding audio segments according to the scene, finally obtaining a complete piece of virtual vocal sight-singing music.
In a second aspect, an embodiment of the present application provides a virtual vocal sight-singing system based on a generative adversarial network, the system comprising:
an input module for inputting an abc-notation file and sung-score vocal audio produced with Vocaloid, the vocal audio corresponding to the abc file;
a conversion module for converting the abc file into a text file in a custom format and using the custom text file and the vocal audio as the input of a Tacotron-2 neural network model;
a processing module for representing each character of the input text file by a 512-dimensional character embedding in the Tacotron-2 network, passing it through 3 convolutional layers whose output feeds a bidirectional LSTM layer, with location-sensitive attention keeping the model moving steadily forward through the input; for taking the mel spectrogram generated by the Tacotron-2 network as the input of the MelGAN model; and for feeding the trained Tacotron-2 model together with the original vocal audio file into the MelGAN generative adversarial network, whose generator and discriminators finally yield a feature map and a synthesized sung-score audio file, completing the synthesis of the virtual-vocal waveform file;
a splicing module for gluing and splicing the corresponding audio segments according to the scene, finally obtaining a complete piece of virtual vocal sight-singing music.
In a third aspect, an embodiment of the present application provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in the embodiments of the present application when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the method described in the embodiments of the present application when executed by a processor.
Brief Description of the Drawings
Fig. 1 is a flow chart of virtual vocal sight-singing based on a generative adversarial network according to the present invention;
Fig. 2 is a waveform diagram of real-person sung-score audio according to the present invention;
Fig. 3 is a framework diagram of virtual vocal sight-singing based on a generative adversarial network according to the present invention.
Detailed Description of the Embodiments
To state the purpose, technical process, and technical innovations of the invention more clearly, the invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. In addition, the technical features of the embodiments described below may be combined with one another as long as they do not conflict.
To achieve the above purpose, the invention provides a virtual vocal sight-singing method based on a generative adversarial network. The main flow is shown in Fig. 3, and the method comprises:
Step 1: input an abc-notation file and sung-score vocal audio produced with Vocaloid, the vocal audio corresponding to the abc file;
Step 2: convert the abc file into a text file in a custom format, and use the custom text file and the vocal audio as the input of a Tacotron-2 neural network model;
Step 3: in the Tacotron-2 network, represent each character of the input text file by a 512-dimensional character embedding, then pass it through 3 convolutional layers whose output feeds a bidirectional LSTM layer; location-sensitive attention keeps the model moving steadily forward through the input, and the mel spectrogram generated by the Tacotron-2 network serves as the input of the MelGAN model;
Step 4: feed the model trained by the Tacotron-2 network together with the original vocal audio file into the MelGAN generative adversarial network; through the generator and the discriminators, a feature map and a synthesized sung-score audio file are finally obtained, completing the synthesis of the virtual-vocal waveform file;
Step 5: glue and splice the corresponding audio segments according to the scene, finally obtaining a complete piece of virtual vocal sight-singing music.
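By way of illustration (the patent discloses no code), a minimal Python sketch of the splicing in step 5 might read as follows; the use of numpy and soundfile, the sampling rate, the mono clips, and the short linear crossfade at each joint are assumptions — the text states only that the segments are glued together:

```python
import numpy as np
import soundfile as sf  # assumption: synthesized clips are stored as WAV files

def splice_clips(paths, sr=22050, crossfade_s=0.02):
    """Concatenate synthesized clips in score order, blending each joint with a
    short linear crossfade so the gluing is inaudible (length is an assumption)."""
    n = int(crossfade_s * sr)
    out = None
    for path in paths:
        clip, clip_sr = sf.read(path)
        assert clip_sr == sr, "all clips must share one sampling rate"
        assert clip.ndim == 1, "mono clips assumed"
        if out is None:
            out = clip
        else:
            fade = np.linspace(0.0, 1.0, n)
            out[-n:] = out[-n:] * (1.0 - fade) + clip[:n] * fade  # blend the joint
            out = np.concatenate([out, clip[n:]])
    return out

# sf.write("sight_singing.wav", splice_clips(["bar1.wav", "bar2.wav"]), 22050)
```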
Preferably, step 2 specifically comprises: attaching the score information to every note, so that the score can be read by the neural network like text, and expressing each syllable of the score through three key attributes: pitch, duration, and pronunciation; converting the abc notation into another formalized notation according to custom rules and saving it as a txt file, the "score speech"; the generated file is a score-parsing file whose first item is the note and pitch information, where 'b' denotes a lowered semitone, '#' a raised semitone, and 'r' a rest; digits and special symbols such as "#" are replaced with plain letters: the octave symbols 3, 4, 5 are replaced by n, o, p respectively; the durations 1/8, 1/4, 3/8, 1/2, 3/4, and 1 beat are replaced by q, r, s, t, u, v respectively; and the notes c, c#, d, d#, e, f, f#, g, g#, a, a#, b are replaced by a, b, c, d, e, f, g, h, i, j, k, l respectively.
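To make the substitution concrete, a minimal sketch of this re-notation step follows; the grouping of each note into a pitch letter, an octave letter, and a duration letter, and the helper names, are assumptions — only the three substitution tables come from the text above:

```python
# Substitution tables taken from the rules above; the per-note token layout
# (pitch letter + octave letter + duration letter) is an assumption.
NOTE_MAP = dict(zip(
    ["c", "c#", "d", "d#", "e", "f", "f#", "g", "g#", "a", "a#", "b"],
    ["a", "b",  "c", "d",  "e", "f", "g",  "h", "i",  "j", "k",  "l"],
))
OCTAVE_MAP = {"3": "n", "4": "o", "5": "p"}                 # octave digits -> letters
DURATION_MAP = {0.125: "q", 0.25: "r", 0.375: "s",          # beats -> letters
                0.5: "t", 0.75: "u", 1.0: "v"}

def encode_note(pitch: str, octave: str, beats: float) -> str:
    """Encode one parsed note as a pure-letter token, e.g. ('g', '4', 0.5) -> 'hot'."""
    return NOTE_MAP[pitch] + OCTAVE_MAP[octave] + DURATION_MAP[beats]

def encode_score(notes) -> str:
    """notes: iterable of (pitch, octave, beats) tuples parsed from the abc file."""
    return " ".join(encode_note(p, o, b) for p, o, b in notes)

print(encode_score([("c", "4", 0.25), ("e", "4", 0.25), ("g", "4", 0.5)]))
# -> "aor eor hot"
```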
Preferably, step 3 further comprises:
using location-sensitive attention so that the model always moves forward through the input; passing the prediction from the previous step through a pre-net of 2 fully connected layers; feeding the pre-net output and the attention context vector to 2 unidirectional LSTM layers; linearly projecting the LSTM output together with the attention context vector to generate the mel spectrogram; and finally passing the predicted features through a post-net of 5 convolutional layers to improve the overall reconstruction.
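For illustration, a condensed PyTorch sketch of the Tacotron-2 text encoder described in step 3 is given below; the kernel size, vocabulary size, and dropout-free layout are assumptions — the text fixes only the 512-dimensional character embedding, the 3 convolutional layers, and the bidirectional LSTM:

```python
import torch
import torch.nn as nn

class TacotronEncoder(nn.Module):
    """Character embedding -> 3 convolutional layers -> bidirectional LSTM."""
    def __init__(self, vocab_size: int = 64, dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)  # 512-dim character embedding
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),  # kernel size assumed
                nn.BatchNorm1d(dim),
                nn.ReLU(),
            )
            for _ in range(3)                           # the 3 convolutional layers
        )
        # Each direction carries dim // 2 units so the concatenated
        # bidirectional output stays 512-dimensional.
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        x = self.embedding(chars).transpose(1, 2)       # (batch, 512, time)
        for conv in self.convs:
            x = conv(x)
        out, _ = self.lstm(x.transpose(1, 2))           # (batch, time, 512)
        return out                                      # memory for the attention decoder

enc = TacotronEncoder()
memory = enc(torch.randint(0, 64, (1, 37)))             # 37 score-speech characters
print(memory.shape)                                     # torch.Size([1, 37, 512])
```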
Preferably, the method further comprises improving the adversarial network, specifically comprising:
placing an inductive bias in the generator reflecting the long-range correlation between audio time steps: a residual block with dilated convolutions is added after each upsampling layer, so that temporally distant output activations of each subsequent layer share significantly overlapping inputs; since the receptive field of a stack of dilated convolutional layers grows exponentially with the number of layers, this enlarges the induced receptive field at every output time step, and the greater overlap between the induced receptive fields of distant time steps yields better long-range correlation;
choosing the kernel size as a multiple of the stride and letting the dilation grow with the kernel size, so that the receptive field of the stack forms a fully balanced, symmetric tree with the kernel size as the branching factor (one such generator stage is sketched below);
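For illustration, one generator stage built to these two rules might be sketched as follows; the channel counts, the ×8 upsampling factor, and the dilation schedule 1, 3, 9 are assumptions — the text fixes only the design rules themselves (a dilated residual stack after each upsampling layer, kernel size a multiple of the stride, weight normalization):

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class ResidualBlock(nn.Module):
    """Dilated residual block placed after an upsampling layer."""
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2         # keep the sequence length
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            weight_norm(nn.Conv1d(channels, channels, kernel_size,
                                  dilation=dilation, padding=pad)),
            nn.LeakyReLU(0.2),
            weight_norm(nn.Conv1d(channels, channels, kernel_size=1)),
        )

    def forward(self, x):
        return x + self.block(x)                        # residual connection

def upsample_stage(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    """Kernel size = 2 * stride (a multiple of the stride), then residual
    blocks whose dilation grows as 1, 3, 9 (schedule assumed)."""
    return nn.Sequential(
        nn.LeakyReLU(0.2),
        weight_norm(nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2 * stride,
                                       stride=stride, padding=stride // 2)),
        *(ResidualBlock(out_ch, dilation=3 ** i) for i in range(3)),
    )

stage = upsample_stage(512, 256, stride=8)
mel_like = torch.randn(1, 512, 80)                      # (batch, channels, frames)
print(stage(mel_like).shape)                            # torch.Size([1, 256, 640])
```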
using weight normalization in all layers of the generator and the discriminators;
adopting a multi-scale architecture with 3 discriminators D1, D2, D3 that share the same network structure but operate at different audio scales: D1 runs on the raw audio, while D2 and D3 run on the raw audio downsampled by factors of 2 and 4 respectively, the downsampling being performed with strided average pooling of kernel size 4 (sketched below); window-based discriminators are chosen because they have been shown to capture the essential high-frequency structure, require fewer parameters, run faster, and can be applied to variable-length audio sequences;
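A minimal sketch of this multi-scale arrangement is given below; the body of each window-based discriminator is left as a stand-in, since the text fixes only that the three discriminators are identical and that downsampling uses strided average pooling with kernel size 4 (the stride of 2 per scale is implied by the ×2 and ×4 factors):

```python
import torch
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    """Three identical discriminators on raw, 2x- and 4x-downsampled audio."""
    def __init__(self, make_discriminator):
        super().__init__()
        self.discriminators = nn.ModuleList(make_discriminator() for _ in range(3))
        # Strided average pooling, kernel size 4: halves the sampling rate.
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, audio: torch.Tensor):
        outputs = []
        for i, d in enumerate(self.discriminators):     # D1: raw, D2: /2, D3: /4
            if i > 0:
                audio = self.pool(audio)
            outputs.append(d(audio))
        return outputs

body = lambda: nn.Conv1d(1, 1, kernel_size=15, padding=7)  # stand-in discriminator body
msd = MultiScaleDiscriminator(body)
scores = msd(torch.randn(1, 1, 16000))
print([s.shape[-1] for s in scores])                    # [16000, 8000, 4000]
```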
using a feature-matching objective to train the generator (a minimal sketch follows).
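The sketch below illustrates such a feature-matching objective; the L1 distance and the assumption that each discriminator exposes its per-layer feature maps follow common practice for MelGAN-style training rather than anything fixed in the text above:

```python
import torch.nn.functional as F

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between discriminator feature maps of real and generated audio.

    real_feats / fake_feats: one list of per-layer feature tensors for each
    of the three discriminators D1, D2, D3."""
    loss = 0.0
    for layers_real, layers_fake in zip(real_feats, fake_feats):
        for fr, ff in zip(layers_real, layers_fake):
            loss = loss + F.l1_loss(ff, fr.detach())    # no gradient through the real path
    return loss
```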
As another aspect, the present application further provides a virtual vocal sight-singing system based on a generative adversarial network. As shown in Fig. 1, the system comprises:
an input module for inputting an abc-notation file and sung-score vocal audio produced with Vocaloid, the vocal audio corresponding to the abc file;
a conversion module for converting the abc file into a text file in a custom format and using the custom text file and the vocal audio as the input of a Tacotron-2 neural network model;
a processing module for representing each character of the input text file by a 512-dimensional character embedding in the Tacotron-2 network, passing it through 3 convolutional layers whose output feeds a bidirectional LSTM layer, with location-sensitive attention keeping the model moving steadily forward through the input; for taking the mel spectrogram generated by the Tacotron-2 network as the input of the MelGAN model; and for feeding the trained Tacotron-2 model together with the original vocal audio file into the MelGAN generative adversarial network, whose generator and discriminators finally yield a feature map and a synthesized sung-score audio file, completing the synthesis of the virtual-vocal waveform file;
a splicing module for gluing and splicing the corresponding audio segments according to the scene, finally obtaining a complete piece of virtual vocal sight-singing music.
Preferably, the conversion module is further configured to attach the score information to every note, so that the score can be read by the neural network like text, expressing each syllable of the score through three key attributes: pitch, duration, and pronunciation;
and to convert the abc notation into another formalized notation according to custom rules, saving it as a txt file, the "score speech"; the generated file is a score-parsing file whose first item is the note and pitch information, where 'b' denotes a lowered semitone, '#' a raised semitone, and 'r' a rest; digits and special symbols such as "#" are replaced with plain letters: the octave symbols 3, 4, 5 are replaced by n, o, p respectively; the durations 1/8, 1/4, 3/8, 1/2, 3/4, and 1 beat are replaced by q, r, s, t, u, v respectively; and the notes c, c#, d, d#, e, f, f#, g, g#, a, a#, b are replaced by a, b, c, d, e, f, g, h, i, j, k, l respectively.
Preferably, the processing module is further configured to:
use location-sensitive attention so that the model always moves forward through the input; pass the prediction from the previous step through a pre-net of 2 fully connected layers; feed the pre-net output and the attention context vector to 2 unidirectional LSTM layers; linearly project the LSTM output together with the attention context vector to generate the mel spectrogram; and finally pass the predicted features through a post-net of 5 convolutional layers to improve the overall reconstruction.
Preferably, the processing module is further configured to:
place an inductive bias in the generator reflecting the long-range correlation between audio time steps: a residual block with dilated convolutions is added after each upsampling layer, so that temporally distant output activations of each subsequent layer share significantly overlapping inputs; since the receptive field of a stack of dilated convolutional layers grows exponentially with the number of layers, this enlarges the induced receptive field at every output time step, and the greater overlap between the induced receptive fields of distant time steps yields better long-range correlation;
choose the kernel size as a multiple of the stride and let the dilation grow with the kernel size, so that the receptive field of the stack forms a fully balanced, symmetric tree with the kernel size as the branching factor;
use weight normalization in all layers of the generator and the discriminators;
adopt a multi-scale architecture with 3 discriminators D1, D2, D3 that share the same network structure but operate at different audio scales: D1 runs on the raw audio, while D2 and D3 run on the raw audio downsampled by factors of 2 and 4 respectively, the downsampling being performed with strided average pooling of kernel size 4;
use a feature-matching objective to train the generator.
As another aspect, the present application further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in the embodiments of the present application when executing the computer program.
As another aspect, the present application further provides a computer-readable storage medium, which may be the computer-readable storage medium contained in the apparatus of the foregoing embodiments, or a computer-readable storage medium that exists on its own without being assembled into a device. The computer-readable storage medium stores one or more programs that are executed by one or more processors to perform the methods described in the embodiments of the present application.
It should be understood that the parts of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction-execution system. For example, if implemented in hardware, as in another embodiment, any of the following techniques known in the art, or a combination of them, may be used: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by instructing the relevant hardware through a program, and the program can be stored in a computer-readable storage medium; when executed, the program performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, may exist physically on their own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or of a software functional module. If implemented as a software functional module and sold or used as an independent product, the integrated module may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. Although the embodiments of the present application have been shown and described above, it should be understood that they are exemplary and are not to be construed as limiting the present application; within the scope of the present application, a person of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010590728.XA CN111816148B (en) | 2020-06-24 | 2020-06-24 | Virtual human voice sight-singing method and system based on a generative adversarial network |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010590728.XA CN111816148B (en) | 2020-06-24 | 2020-06-24 | Virtual human voice sight-singing method and system based on a generative adversarial network |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111816148A true CN111816148A (en) | 2020-10-23 |
| CN111816148B CN111816148B (en) | 2023-04-07 |
Family ID: 72855003
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010590728.XA Expired - Fee Related CN111816148B (en) | 2020-06-24 | 2020-06-24 | Virtual human voice sight-singing method and system based on a generative adversarial network |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111816148B (en) |
Patent Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH11184490A (en) * | 1997-12-25 | 1999-07-09 | Nippon Telegr & Teleph Corp <Ntt> | Singing voice synthesis method using regular speech synthesis |
| CN104361884A (en) * | 2014-11-18 | 2015-02-18 | 张正贤 | Electronic device capable of being played to generate voice singing scores and operation method of electronic device |
| CN106652984A (en) * | 2016-10-11 | 2017-05-10 | 张文铂 | Automatic song creation method via computer |
| US20180190249A1 (en) * | 2016-12-30 | 2018-07-05 | Google Inc. | Machine Learning to Generate Music from Text |
| US20200043516A1 (en) * | 2018-08-06 | 2020-02-06 | Spotify Ab | Singing voice separation with deep u-net convolutional networks |
| CN109461431A (en) * | 2018-12-24 | 2019-03-12 | 厦门大学 | Sight-singing error score annotation method applied to music sight-singing education |
| CN109584904A (en) * | 2018-12-24 | 2019-04-05 | 厦门大学 | Sight-singing audio solmization recognition modeling method applied to music sight-singing education |
| CN111179905A (en) * | 2020-01-10 | 2020-05-19 | 北京中科深智科技有限公司 | Rapid dubbing generation method and device |
| CN111292720A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment |
| CN111816157A (en) * | 2020-06-24 | 2020-10-23 | 厦门大学 | Intelligent music-score sight-singing method and system based on speech synthesis |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112397043A (en) * | 2020-11-03 | 2021-02-23 | 北京中科深智科技有限公司 | Method and system for converting voice into song |
| CN112397043B (en) * | 2020-11-03 | 2021-11-16 | 北京中科深智科技有限公司 | Method and system for converting voice into song |
| CN113066475A (en) * | 2021-06-03 | 2021-07-02 | 成都启英泰伦科技有限公司 | Speech synthesis method based on generating type countermeasure network |
| CN113066475B (en) * | 2021-06-03 | 2021-08-06 | 成都启英泰伦科技有限公司 | Speech synthesis method based on generating type countermeasure network |
| CN118658098A (en) * | 2024-06-19 | 2024-09-17 | 北京广益集思智能科技有限公司 | A digital human live broadcast method based on GAN neural network system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111816148B (en) | 2023-04-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Gold et al. | Speech and audio signal processing: processing and perception of speech and music | |
| Dhanjal et al. | An automatic machine translation system for multi-lingual speech to Indian sign language | |
| CN111816148B (en) | 2023-04-07 | Virtual human voice sight-singing method and system based on a generative adversarial network |
| CN101504643A (en) | Speech processing system, speech processing method, and speech processing program | |
| CN115114919A (en) | Method and device for presenting prompt information and storage medium | |
| WO2021174922A1 (en) | Statement sentiment classification method and related device | |
| CN112802446A (en) | Audio synthesis method and device, electronic equipment and computer-readable storage medium | |
| Ahmad et al. | Planning the development of text-to-speech synthesis models and datasets with dynamic deep learning | |
| Atkar et al. | Speech synthesis using generative adversarial network for improving readability of Hindi words to recuperate from dyslexia | |
| US20220157329A1 (en) | Method of converting voice feature of voice | |
| US20240347037A1 (en) | Method and apparatus for synthesizing unified voice wave based on self-supervised learning | |
| Sturm et al. | Folk the algorithms:(Mis) Applying artificial intelligence to folk music | |
| CN113129862A (en) | World-tacontron-based voice synthesis method and system and server | |
| CN112669796A (en) | Method and device for converting music into music book based on artificial intelligence | |
| Wang et al. | Research on correction method of spoken pronunciation accuracy of AI virtual English reading | |
| CN115273806A (en) | Song synthesis model training method and device, song synthesis method and device | |
| Vijay et al. | Pitch extraction and notes generation implementation using tensor flow | |
| Abdalla et al. | An NLP-based system for modulating virtual experiences using speech instructions | |
| CN111816157A (en) | 2020-10-23 | Intelligent music-score sight-singing method and system based on speech synthesis |
| Zhichao | [Retracted] Development of the Music Teaching System Based on Speech Recognition and Artificial Intelligence | |
| Chen et al. | VoxHakka: A Dialectally Diverse Multi-Speaker Text-to-Speech System for Taiwanese Hakka | |
| Tomczak et al. | Drum translation for timbral and rhythmic transformation | |
| CN114758560A (en) | Humming intonation evaluation method based on dynamic time warping | |
| Nittala et al. | Speaker diarization and bert-based model for question set generation from video lectures | |
| Zhang | RETRACTED: Mobile Music Recognition based on Deep Neural Network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20230407 |