CN111816148A - A method and system for virtual vocal sight-singing based on a generative adversarial network
- Publication number
- CN111816148A (application CN202010590728.XA)
- Authority
- CN
- China
- Prior art keywords
- file
- audio
- neural network
- input
- layers
- Prior art date
- 2020-06-24
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10H1/0033 — Recording/reproducing or transmission of music for electrophonic musical instruments (under G10H1/00, Details of electrophonic musical instruments)
- G06N3/045 — Combinations of networks (under G06N3/04, Architecture, e.g. interconnection topology)
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs (under G06N3/04)
- G06N3/08 — Learning methods (under G06N3/00, Computing arrangements based on biological models)
- G10H2210/101 — Music composition or musical creation; tools or processes therefor (under G10H2210/00, Aspects or methods of musical processing having intrinsic musical character)
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The invention provides a method and system for virtual vocal sight-singing based on a generative adversarial network. The method comprises: step 1, inputting an abc-notation file and corresponding sung-score vocal audio produced with Vocaloid; step 2, converting the abc file into a text file in a custom format, and using the custom text file and the vocal audio as the input of a Tacotron-2 neural network model; step 3, in the Tacotron-2 network, representing each character of the input text file by a 512-dimensional character embedding; step 4, completing the synthesis of the virtual-vocal waveform file; step 5, obtaining a complete piece of virtual vocal sight-singing music. The invention sings musical scores with a virtual human voice, and the rhythm of the output speech is smooth and natural, so that the listener perceives the device's voice output as natural rather than mechanical and stilted.
Description
Technical Field
The invention belongs to the field of computers and in particular relates to a method and system for virtual vocal sight-singing based on a generative adversarial network.
Background Art
When learning music, reading a score and singing it aloud is fundamental; this skill is called sight-singing (solfeggio). Sight-singing and ear training are particularly important, yet they require a great deal of practice outside class before the training takes effect. For the vast majority of ordinary families, a teacher obviously cannot accompany the student around the clock. Because learning and practice are inseparable, and practice is crucial for both singing and listening, traditional ear-training courses have shortcomings that call for improvement. A real human voice singing the score is lifelike and easy to learn from: virtual vocal sight-singing of a score is equivalent to a real person singing it, compensates for the fact that the timbre of instruments such as the piano or electronic keyboard differs from the human voice, and greatly improves score-reading speed and score-singing ability. Current music applications are developing quickly and have added the timbres of many instruments, with automatic playback from a score, but none of them include a human voice among the timbres.
Traditional methods of simulating a singing voice rely partly or entirely on concatenation, for example simply splicing each solmization syllable according to duration and beat. Although such methods are easy to implement, the result is stiff and differs markedly from real singing, so the effect is far from ideal. How to sing a musical score with a realistic human voice is a problem that remains to be solved.
Summary of the Invention
The invention provides a method and system for virtual vocal sight-singing based on a generative adversarial network. It can sing a musical score with a virtual human voice, and the rhythm of the output speech is smooth and natural, so that the listener perceives the information as natural rather than hearing a mechanical, stilted voice from the device.
To solve the above problems, the present invention provides a virtual vocal sight-singing method based on a generative adversarial network, the method comprising:
Step 1: input an abc-notation file and sung-score vocal audio produced with Vocaloid, the vocal audio corresponding to the abc file;
Step 2: convert the abc file into a text file in a custom format, and use the custom text file and the vocal audio as the input of a Tacotron-2 neural network model;
Step 3: in the Tacotron-2 network, represent each character of the input text file by a 512-dimensional character embedding, then pass it through 3 convolutional layers whose output feeds a bidirectional LSTM layer; location-sensitive attention keeps the model moving steadily forward through the input, and the mel spectrogram generated by the Tacotron-2 network serves as the input of the MelGAN model;
Step 4: feed the model trained by the Tacotron-2 network together with the original vocal audio file into the MelGAN generative adversarial network; through the generator and the discriminators, a feature map and a synthesized sung-score audio file are finally obtained, completing the synthesis of the virtual-vocal waveform file;
Step 5: glue and splice the corresponding audio segments according to the scene, finally obtaining a complete piece of virtual vocal sight-singing music.
In a second aspect, an embodiment of the present application provides a virtual vocal sight-singing system based on a generative adversarial network, the system comprising:
an input module for inputting an abc-notation file and sung-score vocal audio produced with Vocaloid, the vocal audio corresponding to the abc file;
a conversion module for converting the abc file into a text file in a custom format and using the custom text file and the vocal audio as the input of a Tacotron-2 neural network model;
a processing module for representing each character of the input text file by a 512-dimensional character embedding in the Tacotron-2 network, passing it through 3 convolutional layers whose output feeds a bidirectional LSTM layer, with location-sensitive attention keeping the model moving steadily forward through the input; for taking the mel spectrogram generated by the Tacotron-2 network as the input of the MelGAN model; and for feeding the trained Tacotron-2 model together with the original vocal audio file into the MelGAN generative adversarial network, whose generator and discriminators finally yield a feature map and a synthesized sung-score audio file, completing the synthesis of the virtual-vocal waveform file;
a splicing module for gluing and splicing the corresponding audio segments according to the scene, finally obtaining a complete piece of virtual vocal sight-singing music.
In a third aspect, an embodiment of the present application provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in the embodiments of the present application when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the method described in the embodiments of the present application when executed by a processor.
Brief Description of the Drawings
Fig. 1 is a flow chart of virtual vocal sight-singing based on a generative adversarial network according to the present invention;
Fig. 2 is a waveform diagram of real-person sung-score audio according to the present invention;
Fig. 3 is a framework diagram of virtual vocal sight-singing based on a generative adversarial network according to the present invention.
Detailed Description of the Embodiments
To state the purpose, technical process, and technical innovations of the invention more clearly, the invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. In addition, the technical features of the embodiments described below may be combined with one another as long as they do not conflict.
To achieve the above purpose, the invention provides a virtual vocal sight-singing method based on a generative adversarial network. The main flow is shown in Fig. 3, and the method comprises:
Step 1: input an abc-notation file and sung-score vocal audio produced with Vocaloid, the vocal audio corresponding to the abc file;
Step 2: convert the abc file into a text file in a custom format, and use the custom text file and the vocal audio as the input of a Tacotron-2 neural network model;
Step 3: in the Tacotron-2 network, represent each character of the input text file by a 512-dimensional character embedding, then pass it through 3 convolutional layers whose output feeds a bidirectional LSTM layer; location-sensitive attention keeps the model moving steadily forward through the input, and the mel spectrogram generated by the Tacotron-2 network serves as the input of the MelGAN model;
Step 4: feed the model trained by the Tacotron-2 network together with the original vocal audio file into the MelGAN generative adversarial network; through the generator and the discriminators, a feature map and a synthesized sung-score audio file are finally obtained, completing the synthesis of the virtual-vocal waveform file;
Step 5: glue and splice the corresponding audio segments according to the scene, finally obtaining a complete piece of virtual vocal sight-singing music.
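By way of illustration (the patent discloses no code), a minimal Python sketch of the splicing in step 5 might read as follows; the use of numpy and soundfile, the sampling rate, the mono clips, and the short linear crossfade at each joint are assumptions — the text states only that the segments are glued together:

```python
import numpy as np
import soundfile as sf  # assumption: synthesized clips are stored as WAV files

def splice_clips(paths, sr=22050, crossfade_s=0.02):
    """Concatenate synthesized clips in score order, blending each joint with a
    short linear crossfade so the gluing is inaudible (length is an assumption)."""
    n = int(crossfade_s * sr)
    out = None
    for path in paths:
        clip, clip_sr = sf.read(path)
        assert clip_sr == sr, "all clips must share one sampling rate"
        assert clip.ndim == 1, "mono clips assumed"
        if out is None:
            out = clip
        else:
            fade = np.linspace(0.0, 1.0, n)
            out[-n:] = out[-n:] * (1.0 - fade) + clip[:n] * fade  # blend the joint
            out = np.concatenate([out, clip[n:]])
    return out

# sf.write("sight_singing.wav", splice_clips(["bar1.wav", "bar2.wav"]), 22050)
```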
Preferably, step 2 specifically comprises: attaching the score information to every note, so that the score can be read by the neural network like text, and expressing each syllable of the score through three key attributes: pitch, duration, and pronunciation; converting the abc notation into another formalized notation according to custom rules and saving it as a txt file, the "score speech"; the generated file is a score-parsing file whose first item is the note and pitch information, where 'b' denotes a lowered semitone, '#' a raised semitone, and 'r' a rest; digits and special symbols such as "#" are replaced with plain letters: the octave symbols 3, 4, 5 are replaced by n, o, p respectively; the durations 1/8, 1/4, 3/8, 1/2, 3/4, and 1 beat are replaced by q, r, s, t, u, v respectively; and the notes c, c#, d, d#, e, f, f#, g, g#, a, a#, b are replaced by a, b, c, d, e, f, g, h, i, j, k, l respectively.
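To make the substitution concrete, a minimal sketch of this re-notation step follows; the grouping of each note into a pitch letter, an octave letter, and a duration letter, and the helper names, are assumptions — only the three substitution tables come from the text above:

```python
# Substitution tables taken from the rules above; the per-note token layout
# (pitch letter + octave letter + duration letter) is an assumption.
NOTE_MAP = dict(zip(
    ["c", "c#", "d", "d#", "e", "f", "f#", "g", "g#", "a", "a#", "b"],
    ["a", "b",  "c", "d",  "e", "f", "g",  "h", "i",  "j", "k",  "l"],
))
OCTAVE_MAP = {"3": "n", "4": "o", "5": "p"}                 # octave digits -> letters
DURATION_MAP = {0.125: "q", 0.25: "r", 0.375: "s",          # beats -> letters
                0.5: "t", 0.75: "u", 1.0: "v"}

def encode_note(pitch: str, octave: str, beats: float) -> str:
    """Encode one parsed note as a pure-letter token, e.g. ('g', '4', 0.5) -> 'hot'."""
    return NOTE_MAP[pitch] + OCTAVE_MAP[octave] + DURATION_MAP[beats]

def encode_score(notes) -> str:
    """notes: iterable of (pitch, octave, beats) tuples parsed from the abc file."""
    return " ".join(encode_note(p, o, b) for p, o, b in notes)

print(encode_score([("c", "4", 0.25), ("e", "4", 0.25), ("g", "4", 0.5)]))
# -> "aor eor hot"
```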
Preferably, step 3 further comprises:
using location-sensitive attention so that the model always moves forward through the input; passing the prediction from the previous step through a pre-net of 2 fully connected layers; feeding the pre-net output and the attention context vector to 2 unidirectional LSTM layers; linearly projecting the LSTM output together with the attention context vector to generate the mel spectrogram; and finally passing the predicted features through a post-net of 5 convolutional layers to improve the overall reconstruction.
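For illustration, a condensed PyTorch sketch of the Tacotron-2 text encoder described in step 3 is given below; the kernel size, vocabulary size, and dropout-free layout are assumptions — the text fixes only the 512-dimensional character embedding, the 3 convolutional layers, and the bidirectional LSTM:

```python
import torch
import torch.nn as nn

class TacotronEncoder(nn.Module):
    """Character embedding -> 3 convolutional layers -> bidirectional LSTM."""
    def __init__(self, vocab_size: int = 64, dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)  # 512-dim character embedding
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),  # kernel size assumed
                nn.BatchNorm1d(dim),
                nn.ReLU(),
            )
            for _ in range(3)                           # the 3 convolutional layers
        )
        # Each direction carries dim // 2 units so the concatenated
        # bidirectional output stays 512-dimensional.
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        x = self.embedding(chars).transpose(1, 2)       # (batch, 512, time)
        for conv in self.convs:
            x = conv(x)
        out, _ = self.lstm(x.transpose(1, 2))           # (batch, time, 512)
        return out                                      # memory for the attention decoder

enc = TacotronEncoder()
memory = enc(torch.randint(0, 64, (1, 37)))             # 37 score-speech characters
print(memory.shape)                                     # torch.Size([1, 37, 512])
```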
Preferably, the method further comprises improving the adversarial network, specifically comprising:
placing an inductive bias in the generator reflecting the long-range correlation between audio time steps: a residual block with dilated convolutions is added after each upsampling layer, so that temporally distant output activations of each subsequent layer share significantly overlapping inputs; since the receptive field of a stack of dilated convolutional layers grows exponentially with the number of layers, this enlarges the induced receptive field at every output time step, and the greater overlap between the induced receptive fields of distant time steps yields better long-range correlation;
choosing the kernel size as a multiple of the stride and letting the dilation grow with the kernel size, so that the receptive field of the stack forms a fully balanced, symmetric tree with the kernel size as the branching factor (one such generator stage is sketched below);
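For illustration, one generator stage built to these two rules might be sketched as follows; the channel counts, the ×8 upsampling factor, and the dilation schedule 1, 3, 9 are assumptions — the text fixes only the design rules themselves (a dilated residual stack after each upsampling layer, kernel size a multiple of the stride, weight normalization):

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class ResidualBlock(nn.Module):
    """Dilated residual block placed after an upsampling layer."""
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2         # keep the sequence length
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            weight_norm(nn.Conv1d(channels, channels, kernel_size,
                                  dilation=dilation, padding=pad)),
            nn.LeakyReLU(0.2),
            weight_norm(nn.Conv1d(channels, channels, kernel_size=1)),
        )

    def forward(self, x):
        return x + self.block(x)                        # residual connection

def upsample_stage(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    """Kernel size = 2 * stride (a multiple of the stride), then residual
    blocks whose dilation grows as 1, 3, 9 (schedule assumed)."""
    return nn.Sequential(
        nn.LeakyReLU(0.2),
        weight_norm(nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2 * stride,
                                       stride=stride, padding=stride // 2)),
        *(ResidualBlock(out_ch, dilation=3 ** i) for i in range(3)),
    )

stage = upsample_stage(512, 256, stride=8)
mel_like = torch.randn(1, 512, 80)                      # (batch, channels, frames)
print(stage(mel_like).shape)                            # torch.Size([1, 256, 640])
```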
using weight normalization in all layers of the generator and the discriminators;
adopting a multi-scale architecture with 3 discriminators D1, D2, D3 that share the same network structure but operate at different audio scales: D1 runs on the raw audio, while D2 and D3 run on the raw audio downsampled by factors of 2 and 4 respectively, the downsampling being performed with strided average pooling of kernel size 4 (sketched below); window-based discriminators are chosen because they have been shown to capture the essential high-frequency structure, require fewer parameters, run faster, and can be applied to variable-length audio sequences;
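A minimal sketch of this multi-scale arrangement is given below; the body of each window-based discriminator is left as a stand-in, since the text fixes only that the three discriminators are identical and that downsampling uses strided average pooling with kernel size 4 (the stride of 2 per scale is implied by the ×2 and ×4 factors):

```python
import torch
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    """Three identical discriminators on raw, 2x- and 4x-downsampled audio."""
    def __init__(self, make_discriminator):
        super().__init__()
        self.discriminators = nn.ModuleList(make_discriminator() for _ in range(3))
        # Strided average pooling, kernel size 4: halves the sampling rate.
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, audio: torch.Tensor):
        outputs = []
        for i, d in enumerate(self.discriminators):     # D1: raw, D2: /2, D3: /4
            if i > 0:
                audio = self.pool(audio)
            outputs.append(d(audio))
        return outputs

body = lambda: nn.Conv1d(1, 1, kernel_size=15, padding=7)  # stand-in discriminator body
msd = MultiScaleDiscriminator(body)
scores = msd(torch.randn(1, 1, 16000))
print([s.shape[-1] for s in scores])                    # [16000, 8000, 4000]
```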
using a feature-matching objective to train the generator (a minimal sketch follows).
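The sketch below illustrates such a feature-matching objective; the L1 distance and the assumption that each discriminator exposes its per-layer feature maps follow common practice for MelGAN-style training rather than anything fixed in the text above:

```python
import torch.nn.functional as F

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between discriminator feature maps of real and generated audio.

    real_feats / fake_feats: one list of per-layer feature tensors for each
    of the three discriminators D1, D2, D3."""
    loss = 0.0
    for layers_real, layers_fake in zip(real_feats, fake_feats):
        for fr, ff in zip(layers_real, layers_fake):
            loss = loss + F.l1_loss(ff, fr.detach())    # no gradient through the real path
    return loss
```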
As another aspect, the present application further provides a virtual vocal sight-singing system based on a generative adversarial network. As shown in Fig. 1, the system comprises:
an input module for inputting an abc-notation file and sung-score vocal audio produced with Vocaloid, the vocal audio corresponding to the abc file;
a conversion module for converting the abc file into a text file in a custom format and using the custom text file and the vocal audio as the input of a Tacotron-2 neural network model;
a processing module for representing each character of the input text file by a 512-dimensional character embedding in the Tacotron-2 network, passing it through 3 convolutional layers whose output feeds a bidirectional LSTM layer, with location-sensitive attention keeping the model moving steadily forward through the input; for taking the mel spectrogram generated by the Tacotron-2 network as the input of the MelGAN model; and for feeding the trained Tacotron-2 model together with the original vocal audio file into the MelGAN generative adversarial network, whose generator and discriminators finally yield a feature map and a synthesized sung-score audio file, completing the synthesis of the virtual-vocal waveform file;
a splicing module for gluing and splicing the corresponding audio segments according to the scene, finally obtaining a complete piece of virtual vocal sight-singing music.
Preferably, the conversion module is further configured to attach the score information to every note, so that the score can be read by the neural network like text, expressing each syllable of the score through three key attributes: pitch, duration, and pronunciation;
and to convert the abc notation into another formalized notation according to custom rules, saving it as a txt file, the "score speech"; the generated file is a score-parsing file whose first item is the note and pitch information, where 'b' denotes a lowered semitone, '#' a raised semitone, and 'r' a rest; digits and special symbols such as "#" are replaced with plain letters: the octave symbols 3, 4, 5 are replaced by n, o, p respectively; the durations 1/8, 1/4, 3/8, 1/2, 3/4, and 1 beat are replaced by q, r, s, t, u, v respectively; and the notes c, c#, d, d#, e, f, f#, g, g#, a, a#, b are replaced by a, b, c, d, e, f, g, h, i, j, k, l respectively.
Preferably, the processing module is further configured to:
use location-sensitive attention so that the model always moves forward through the input; pass the prediction from the previous step through a pre-net of 2 fully connected layers; feed the pre-net output and the attention context vector to 2 unidirectional LSTM layers; linearly project the LSTM output together with the attention context vector to generate the mel spectrogram; and finally pass the predicted features through a post-net of 5 convolutional layers to improve the overall reconstruction.
Preferably, the processing module is further configured to:
place an inductive bias in the generator reflecting the long-range correlation between audio time steps: a residual block with dilated convolutions is added after each upsampling layer, so that temporally distant output activations of each subsequent layer share significantly overlapping inputs; since the receptive field of a stack of dilated convolutional layers grows exponentially with the number of layers, this enlarges the induced receptive field at every output time step, and the greater overlap between the induced receptive fields of distant time steps yields better long-range correlation;
choose the kernel size as a multiple of the stride and let the dilation grow with the kernel size, so that the receptive field of the stack forms a fully balanced, symmetric tree with the kernel size as the branching factor;
use weight normalization in all layers of the generator and the discriminators;
adopt a multi-scale architecture with 3 discriminators D1, D2, D3 that share the same network structure but operate at different audio scales: D1 runs on the raw audio, while D2 and D3 run on the raw audio downsampled by factors of 2 and 4 respectively, the downsampling being performed with strided average pooling of kernel size 4;
use a feature-matching objective to train the generator.
As another aspect, the present application further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in the embodiments of the present application when executing the computer program.
As another aspect, the present application further provides a computer-readable storage medium, which may be the computer-readable storage medium contained in the apparatus of the foregoing embodiments, or a computer-readable storage medium that exists on its own without being assembled into a device. The computer-readable storage medium stores one or more programs that are executed by one or more processors to perform the methods described in the embodiments of the present application.
It should be understood that the parts of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction-execution system. For example, if implemented in hardware, as in another embodiment, any of the following techniques known in the art, or a combination of them, may be used: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by instructing the relevant hardware through a program, and the program can be stored in a computer-readable storage medium; when executed, the program performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, may exist physically on their own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or of a software functional module. If implemented as a software functional module and sold or used as an independent product, the integrated module may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. Although the embodiments of the present application have been shown and described above, it should be understood that they are exemplary and are not to be construed as limiting the present application; within the scope of the present application, a person of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010590728.XA CN111816148B (en) | 2020-06-24 | 2020-06-24 | Virtual human voice sight-singing method and system based on a generative adversarial network |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010590728.XA CN111816148B (en) | 2020-06-24 | 2020-06-24 | Virtual human voice sight-singing method and system based on a generative adversarial network |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111816148A true CN111816148A (en) | 2020-10-23 |
| CN111816148B CN111816148B (en) | 2023-04-07 |
Family ID: 72855003
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010590728.XA Expired - Fee Related CN111816148B (en) | 2020-06-24 | 2020-06-24 | Virtual human voice sight-singing method and system based on a generative adversarial network |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111816148B (en) |
Patent Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH11184490A (en) * | 1997-12-25 | 1999-07-09 | Nippon Telegr & Teleph Corp <Ntt> | Singing voice synthesis method using regular speech synthesis |
| CN104361884A (en) * | 2014-11-18 | 2015-02-18 | 张正贤 | Electronic device capable of being played to generate voice singing scores and operation method of electronic device |
| CN106652984A (en) * | 2016-10-11 | 2017-05-10 | 张文铂 | Automatic song creation method via computer |
| US20180190249A1 (en) * | 2016-12-30 | 2018-07-05 | Google Inc. | Machine Learning to Generate Music from Text |
| US20200043516A1 (en) * | 2018-08-06 | 2020-02-06 | Spotify Ab | Singing voice separation with deep u-net convolutional networks |
| CN109461431A (en) * | 2018-12-24 | 2019-03-12 | 厦门大学 | Sight-singing error score annotation method applied to music sight-singing education |
| CN109584904A (en) * | 2018-12-24 | 2019-04-05 | 厦门大学 | Sight-singing audio solmization recognition modeling method applied to music sight-singing education |
| CN111179905A (en) * | 2020-01-10 | 2020-05-19 | 北京中科深智科技有限公司 | Rapid dubbing generation method and device |
| CN111292720A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment |
| CN111816157A (en) * | 2020-06-24 | 2020-10-23 | 厦门大学 | Intelligent music-score sight-singing method and system based on speech synthesis |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112397043A (en) * | 2020-11-03 | 2021-02-23 | 北京中科深智科技有限公司 | Method and system for converting voice into song |
| CN112397043B (en) * | 2020-11-03 | 2021-11-16 | 北京中科深智科技有限公司 | Method and system for converting voice into song |
| CN113066475A (en) * | 2021-06-03 | 2021-07-02 | 成都启英泰伦科技有限公司 | Speech synthesis method based on generating type countermeasure network |
| CN113066475B (en) * | 2021-06-03 | 2021-08-06 | 成都启英泰伦科技有限公司 | Speech synthesis method based on generating type countermeasure network |
| CN118658098A (en) * | 2024-06-19 | 2024-09-17 | 北京广益集思智能科技有限公司 | A digital human live broadcast method based on GAN neural network system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111816148B (en) | 2023-04-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Gold et al. | Speech and audio signal processing: processing and perception of speech and music | |
| Dhanjal et al. | An automatic machine translation system for multi-lingual speech to Indian sign language | |
| CN111816148B (en) | 2023-04-07 | Virtual human voice sight-singing method and system based on a generative adversarial network |
| CN101504643A (en) | Speech processing system, speech processing method, and speech processing program | |
| CN115114919A (en) | Method and device for presenting prompt information and storage medium | |
| WO2021174922A1 (en) | Statement sentiment classification method and related device | |
| CN112802446A (en) | Audio synthesis method and device, electronic equipment and computer-readable storage medium | |
| Ahmad et al. | Planning the development of text-to-speech synthesis models and datasets with dynamic deep learning | |
| Atkar et al. | Speech synthesis using generative adversarial network for improving readability of Hindi words to recuperate from dyslexia | |
| US20220157329A1 (en) | Method of converting voice feature of voice | |
| US20240347037A1 (en) | Method and apparatus for synthesizing unified voice wave based on self-supervised learning | |
| Sturm et al. | Folk the algorithms:(Mis) Applying artificial intelligence to folk music | |
| CN113129862A (en) | World-tacontron-based voice synthesis method and system and server | |
| CN112669796A (en) | Method and device for converting music into music book based on artificial intelligence | |
| Wang et al. | Research on correction method of spoken pronunciation accuracy of AI virtual English reading | |
| CN115273806A (en) | Song synthesis model training method and device, song synthesis method and device | |
| Vijay et al. | Pitch extraction and notes generation implementation using tensor flow | |
| Abdalla et al. | An NLP-based system for modulating virtual experiences using speech instructions | |
| CN111816157A (en) | 2020-10-23 | Intelligent music-score sight-singing method and system based on speech synthesis |
| Zhichao | [Retracted] Development of the Music Teaching System Based on Speech Recognition and Artificial Intelligence | |
| Chen et al. | VoxHakka: A Dialectally Diverse Multi-Speaker Text-to-Speech System for Taiwanese Hakka | |
| Tomczak et al. | Drum translation for timbral and rhythmic transformation | |
| CN114758560A (en) | Humming intonation evaluation method based on dynamic time warping | |
| Nittala et al. | Speaker diarization and bert-based model for question set generation from video lectures | |
| Zhang | RETRACTED: Mobile Music Recognition based on Deep Neural Network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20230407 |